Editor’s note: Today’s post is from Neil Craig at the British Broadcasting Corporation (BBC), the national broadcaster of the United Kingdom. Neil is part of the BBC’s Digital Distribution team, which is responsible for building services such as the public-facing www.bbc.co.uk and .com websites and ensuring they are able to scale and operate reliably.
The BBC’s public-facing websites inform, educate, and entertain over 498 million adults per week across the world. Because breaking news is so unpredictable, we need a core content delivery platform that can easily scale in response to sudden surges in traffic.
To this end, we recently began running our log-processing infrastructure on a Google Cloud serverless platform. We’ve found the new system, based on Cloud Run and BigQuery, to be more reliable, scalable, and cost-effective than our previous infrastructure. And — news flash! — it also freed our team from having to do a lot of manual labor, and opened the door to being more collaborative and data-driven.
A log in time
To operate the site and ensure our services run smoothly, we continually monitor Traffic Manager and CDN access logs. Our websites generate more than 3B log lines per day and handle large data bursts during major news events; on a busy day, our system supports over 26B log lines.
As initially designed, we stored log data in a Cloud Storage bucket. But every time we needed to access that data, we had to download terabytes of logs to a virtual machine (VM) with a large amount of attached storage and use the ‘grep’ tool to search and analyze them. From beginning to end, this took us several hours. On heavy news days, the time lag made it difficult for the engineering team to do their jobs.
We needed a more efficient way to make this log data available, so we designed and deployed a new system that processes logs as they arrive and reacts to spikes more efficiently, significantly improving the timeliness of critical information.
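To put those daily volumes in perspective, the short sketch below converts them into average per-second rates. It uses only the two figures quoted above; real traffic is bursty, so peak rates will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_lines_per_second(lines_per_day: float) -> float:
    """Average log lines per second for a given daily total."""
    return lines_per_day / SECONDS_PER_DAY


typical = average_lines_per_second(3e9)   # "more than 3B log lines per day"
busy = average_lines_per_second(26e9)     # "over 26B log lines" on a busy day

print(f"typical: {typical:,.0f} lines/s")
print(f"busy:    {busy:,.0f} lines/s")
print(f"ratio:   {busy / typical:.1f}x")
```

That works out to roughly 35,000 lines per second on a typical day and roughly 300,000 on a busy one, averaged over 24 hours.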
In this new system, we still leverage Cloud Storage buckets, but on arrival, each log file generates an event using Eventarc. That event triggers Cloud Run to validate, transform, and enrich various pieces of information about the log file, such as filename, prefix, and type, then process the file and output the processed data as a stream into BigQuery. This event-driven design allows us to process files quickly and frequently: processing a single log file typically takes less than a second. Most of the files that we feed into the system are small, less than 100 megabytes, but for larger files, Cloud Run automatically creates additional parallel instances very quickly, helping the system scale almost magically.
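As a rough illustration of that flow, here is a minimal Python sketch of what the per-file handler could look like. It is not our production code. Cloud Storage object events do carry the `bucket` and `name` fields used here, but everything else is an assumption made for the example: the path convention (first path segment names the log type), the three-field log line format, and the `read_lines` and `write_rows` callables, which stand in for the Cloud Storage read and the BigQuery streaming write.

```python
import posixpath
from typing import Callable, Iterable, Iterator, List


def describe_log_object(event_data: dict) -> dict:
    """Derive filename, prefix and type from a Cloud Storage object event."""
    name = event_data["name"]
    prefix, filename = posixpath.split(name)
    # Hypothetical convention: the first path segment names the log type.
    log_type = name.split("/", 1)[0] if "/" in name else "unknown"
    return {
        "bucket": event_data["bucket"],
        "filename": filename,
        "prefix": prefix,
        "type": log_type,
    }


def to_rows(lines: Iterable[str], meta: dict) -> Iterator[dict]:
    """Validate, transform and enrich raw lines into row dictionaries."""
    for line in lines:
        # Hypothetical line format: "<timestamp> <status> <path>"
        parts = line.split()
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # validation: drop malformed lines
        timestamp, status, path = parts
        yield {
            "timestamp": timestamp,
            "status": int(status),          # transform: string to integer
            "path": path,
            "log_type": meta["type"],       # enrich: file-level metadata
            "source_file": meta["filename"],
        }


def handle_event(
    event_data: dict,
    read_lines: Callable[[str, str], Iterable[str]],
    write_rows: Callable[[List[dict]], None],
) -> int:
    """Process one log file end to end and return the number of rows written."""
    meta = describe_log_object(event_data)
    rows = list(to_rows(read_lines(meta["bucket"], event_data["name"]), meta))
    write_rows(rows)
    return len(rows)
```

Because each event maps to exactly one file and the handler keeps no state between calls, any number of instances can run it in parallel, which is what lets the platform scale it out freely.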
And because we’re, erm, lucky, and get frequent distributed denial-of-service attacks (free load tests!), we’re confident in the system’s ability to handle significant traffic. For example, not long before the announcement of the Queen’s passing in September, we had an attack that generated a colossal traffic spike. Within one minute, we went from running 150 to 200 container instances to over 1,000, and the infrastructure just worked. Because we engineered the log processing system to rely on the elasticity of a serverless architecture, we knew from the get-go that it would be able to handle this type of scaling.
Our initial concern about choosing serverless was cost. It turns out that using Cloud Run is significantly more cost-effective than running the number of VMs we would need for a system that could survive reasonable traffic spikes with a similar level of confidence.
It’s also saved us a lot of time. We picked Cloud Run intentionally because we wanted a system that could scale well without manual intervention. As the digital distribution team, our job is not to do ops. We leave that to Google, the experts. The new system is massively more reliable and cost-effective, but it’s also easier for us to build and maintain.
What surprises new team members who aren’t familiar with Google Cloud is how easy it is to fit all the pieces together. Google’s inter-service auth is automatically managed and really simple to configure. When the Cloud Run service writes to BigQuery or reads from Cloud Storage, I tell it to use OIDC auth (which it manages automatically via Service Account permissions), import the client library, and it just works. Another example is pushing events into Cloud Run, where we can configure Cloud Run authorization to only accept events from specific Eventarc triggers, so it is fully private.
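To show how little code the write path needs, here is a minimal Python sketch of streaming rows into BigQuery. The helper name `stream_rows` and the table ID in the usage comment are made up for the example. The one real library call is `insert_rows_json` from the `google-cloud-bigquery` client, which returns a list of per-row errors (empty on success). The client is passed in rather than created inside the helper so the sketch can run without the library installed.

```python
from typing import Any, Sequence


def stream_rows(client: Any, table_id: str, rows: Sequence[dict]) -> int:
    """Stream rows into a BigQuery table and fail loudly on per-row errors."""
    if not rows:
        return 0
    # insert_rows_json returns one entry per row that failed to insert.
    errors = client.insert_rows_json(table_id, list(rows))
    if errors:
        raise RuntimeError(f"BigQuery rejected {len(errors)} row(s): {errors}")
    return len(rows)


# On Cloud Run, usage might look like this (requires google-cloud-bigquery;
# the table ID is a placeholder):
#
#   from google.cloud import bigquery
#   client = bigquery.Client()  # uses the service's own credentials
#   stream_rows(client, "my-project.logs.cdn_access", rows)
```

Note that there is no key file or token handling anywhere in the code: the client picks up the identity of the service account the Cloud Run service runs as, and access is governed by the permissions granted to that account.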
Going forward, the new system has also opened up many opportunities for the BBC organization to make better use of its data. For example, thanks to BigQuery’s per-column permissions, we can easily open up access to our logs to other engineering teams, without having to worry about sharing PII that’s restricted to approved users.
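For readers who haven’t used column-level permissions, the sketch below shows the general shape of the idea in Python. In BigQuery, a column is restricted by attaching a policy tag to it in the table schema, and only users granted fine-grained read access on that tag can query the column. The column names and the policy tag resource name here are placeholders, not our real schema; the `policyTags` field and its `names` list follow the BigQuery table schema format.

```python
from typing import List

# Placeholder resource name; a real one points at a tag in a taxonomy you own.
PII_TAG = "projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY_ID/policyTags/TAG_ID"

# Hypothetical access-log schema in BigQuery's JSON schema format.
SCHEMA = [
    {"name": "timestamp", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "status", "type": "INTEGER"},
    {"name": "path", "type": "STRING"},
    # Only this column carries a policy tag, so only it is restricted.
    {"name": "client_ip", "type": "STRING", "policyTags": {"names": [PII_TAG]}},
]


def restricted_columns(schema: List[dict]) -> List[str]:
    """Return the names of columns that carry at least one policy tag."""
    return [
        field["name"]
        for field in schema
        if field.get("policyTags", {}).get("names")
    ]


print(restricted_columns(SCHEMA))
```

With a schema like this, other teams can be given access to the table as a whole and still only see the untagged columns unless they are explicitly approved for the tagged one.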
The goal of our team is to empower all teams within the BBC to get the content they want on the web when they want it, make it reliable and secure, and make sure it can scale. Google Cloud serverless products helped us to achieve these goals with relatively little effort and require significantly less management than previous generations of technology.