In 2021 and 2022, Google I/O invited remote attendees to meet each other and explore a virtual world in the I/O Adventure online conference experience powered by Google Cloud technologies. (For more details on how this was done, see my previous post.)
When building an online experience like I/O Adventure, it’s important to decide on a provisioning strategy early on. Before we could make that decision, however, we needed to have a reasonably accurate estimate of how many attendees to expect.
Estimating the actual number of attendees in advance is easier for an in-person event than it is for a (free) online event. We recognized that this number could vary wildly from our best guesses, in either direction. The only safe strategy was to design a server architecture that would be able to handle much more traffic than we actually expected. Since the I/O Adventure experience was going to be live for only a few days during the conference, we determined that it would be affordable to overprovision by spinning up many server instances before the event started.
To further ensure that heavy traffic would not degrade the attendees’ experience, we decided to implement a queue. Our system would welcome as many potential simultaneous attendees as it could smoothly support, and steer additional users to a waiting queue. In the unlikely event that the actual traffic exceeded the large allocated capacity, the queue would prevent the system from becoming overly congested.
Designing a scalable cloud architecture for our project was one thing. Making sure that it would actually be able to support a heavy load was quite another! This post describes how we performed load testing on the cloud backend as a whole system, rather than on individual cloud components such as a single VM.
When load testing an entire cloud backend, you need to address several concerns that are not necessarily accounted for in component-level testing. Quotas are a good example. Quota issues are difficult to foresee before you actually hit them, and we needed to consider many quotas! Do we have enough quota to spin up 200 VMs? More specifically, do we have enough for the type of machine (E2, N2, etc.) that we use? And even more specifically, in the cloud region where the project is deployed?
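The quota questions above boil down to simple arithmetic that is worth doing before the first test run. The following is a minimal Python sketch of that check, not tooling from the project: the `QuotaCheck` class, the 8 vCPUs per VM, the one-external-IP-per-VM assumption, and both quota limits are hypothetical values chosen for illustration. Real limits would come from the quota page of the project and region in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaCheck:
    """One quota to verify: how much we need versus how much is allowed."""
    name: str
    required: int
    limit: int

    @property
    def ok(self) -> bool:
        return self.required <= self.limit


def plan_quota_checks(vm_count, vcpus_per_vm, cpu_quota, ip_quota):
    """Derive the required amounts from the planned fleet size."""
    return [
        # CPU quotas are per machine family and per region.
        QuotaCheck("regional CPUs for the machine family",
                   vm_count * vcpus_per_vm, cpu_quota),
        # Assumption: each VM gets its own external IPv4 address.
        QuotaCheck("external IPv4 addresses", vm_count, ip_quota),
    ]


# Hypothetical numbers: 200 VMs with 8 vCPUs each, made-up quota limits.
checks = plan_quota_checks(vm_count=200, vcpus_per_vm=8,
                           cpu_quota=1000, ip_quota=250)
shortfalls = [c for c in checks if not c.ok]
for c in shortfalls:
    print(f"Raise quota '{c.name}': need {c.required}, limit is {c.limit}")
```

A check like this only catches the quotas you already know about, which is why running the whole system under load remains necessary.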
Infrastructure
To design the load test and simulate thousands of attendees, we had to take two key factors into account:
Attendees communicate from their browser with the cloud backend using WebSockets.
A typical attendee session lasts for at least 15 minutes without disconnecting, which is long-lived compared to some common load testing methodologies more focused on individual HTTP requests.
While it is possible to set up a load test suite by provisioning and managing VMs, I have a strong preference for serverless solutions like Cloud Functions and Cloud Run. So, I wanted to know if we could use one of them to simulate user sessions – to essentially play the role of a load injector. Does Google Cloud’s serverless infrastructure support the necessary protocol – WebSockets – for this use case?
Yes! It turns out that Cloud Run supports WebSockets for egress and ingress, and has a configurable per-request timeout of up to 1 hour.
In the load test we mimicked a typical attendee session, in which a WebSocket connection transmits thousands of messages over several minutes, without disconnecting.
Concurrency
On the backend, the I/O Adventure servers handle thousands of simultaneous attendees by:
Accepting 500 attendees in each “shard”, where each shard is a server representing a part of the whole conference world, in which attendees can interact with each other;
Having hundreds of independent, preprovisioned shards;
Running several shards in each GKE Node;
Routing each incoming attendee to a shard with free capacity.
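The routing step in the list above, combined with the waiting queue mentioned earlier, can be sketched roughly as follows. This is an illustrative Python sketch, not the I/O Adventure routing code: the `ShardRouter` class, its method names, and the first-fit placement policy are assumptions. Only the capacity of 500 attendees per shard comes from the text.

```python
from collections import deque
from typing import Optional

SHARD_CAPACITY = 500  # attendees per shard, as stated in the post


class ShardRouter:
    """Places attendees on shards with free capacity, else queues them."""

    def __init__(self, shard_count: int, capacity: int = SHARD_CAPACITY):
        self.capacity = capacity
        self.occupancy = [0] * shard_count  # attendees currently on each shard
        self.waiting = deque()              # attendee ids waiting for a slot

    def admit(self, attendee_id: str) -> Optional[int]:
        """Return the shard index, or None if the attendee was queued."""
        # Assumption: first-fit. The real policy is not described in the post.
        for shard, used in enumerate(self.occupancy):
            if used < self.capacity:
                self.occupancy[shard] += 1
                return shard
        self.waiting.append(attendee_id)
        return None

    def leave(self, shard: int) -> Optional[str]:
        """Free a slot; hand it to the first waiting attendee, if any."""
        self.occupancy[shard] -= 1
        if self.waiting:
            next_id = self.waiting.popleft()
            self.occupancy[shard] += 1
            return next_id
        return None


# Tiny example: 2 shards of 3 slots each, 7 arrivals.
router = ShardRouter(shard_count=2, capacity=3)
placements = [router.admit(f"attendee-{i}") for i in range(7)]
promoted = router.leave(0)
```

In this small run the seventh arrival finds both shards full and waits in the queue until someone leaves.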
On the client side (the load injector), we implemented multiple levels of concurrency:
Each trigger (e.g. an HTTPS request from my workstation initiated with curl) can launch many concurrent sessions of 15 minutes and wait for their completion.
Each Cloud Run instance can handle many concurrent triggering requests (maximum concurrent requests per instance).
Cloud Run automatically starts new instances when the existing instances approach their full capacity. A Cloud Run service can scale to hundreds or thousands of container instances as needed.
We created an additional Cloud Run service specifically for triggering more simultaneous requests to the main Cloud Run injector as a way to amplify the load test.
Simulating a single attendee story
A simulated “user story” is a load test scenario that consists of logging in, being routed to the GKE pod of a shard, making a few hundred random attendee movements for 15 minutes, and disconnecting.
For this project, I ran the simulation in Cloud Run, and I kicked off the test by issuing a curl command from my laptop. In this setup, the scenario (story) initiates a connection as a WebSocket client, and the pods are WebSocket servers.
Initiating many attendee stories with one trigger
We created an injector service handler (implemented as a Node.js script) to start many stories in parallel, and wait for their completion.
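The pattern of starting many stories in parallel and waiting for all of them can be sketched as below. The actual handler was a Node.js script that is not shown in this post, so this Python version is only an illustration of the same idea: `run_story`, `handle_trigger`, and their parameters are hypothetical names, and the WebSocket traffic is replaced by short sleeps so the sketch runs on its own.

```python
import asyncio
import random


async def run_story(story_id: int, moves: int, pause_s: float) -> bool:
    """Simulate one attendee story; returns True when it completes."""
    # A real story would log in, open a WebSocket connection to its shard,
    # and send one message per movement. Here each movement is just a pause.
    for _ in range(moves):
        _dx, _dy = random.choice([(0, 1), (1, 0), (0, -1), (-1, 0)])
        await asyncio.sleep(pause_s)
    return True


async def handle_trigger(stories: int, moves: int, pause_s: float) -> dict:
    """Start all stories concurrently and wait for every one to end."""
    results = await asyncio.gather(
        *(run_story(i, moves, pause_s) for i in range(stories)),
        return_exceptions=True,  # one failed story must not cancel the rest
    )
    finished = sum(1 for r in results if r is True)
    return {"started": stories, "finished": finished}


# Scaled-down run: 40 stories of 5 moves each, 1 ms between moves.
summary = asyncio.run(handle_trigger(stories=40, moves=5, pause_s=0.001))
print(summary)
```

Because each story spends most of its time waiting between messages, a single process can hold many of them open at once, which is what makes one trigger worth dozens of sessions.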
Injecting many attendee stories with several concurrent triggers
I can multiply the load by triggering the injector many times concurrently with curl from a command-line terminal.
Cloud Run automatically scales up by spinning up new injector machines (i.e., Cloud Run instances) when needed.
Injecting more attendee stories through an extra Cloud Run service
In the previous setup, my workstation became the bottleneck as I was launching too many long-lived trigger requests with curl. My workstation limited the number of TCP connections, and struggled to keep up with the CPU load for all the SSL handshakes.
We fixed this by creating a new intermediate service specifically for dealing with a large number of triggers. We tried several parameter values (number of stories per trigger, max requests per Cloud Run instance, etc.) to maximize the injector’s throughput.
The setup also includes a BigQuery component, used for log ingestion on both the injector side and the server side.
Measuring the success rate
Unlike HTTP requests, which have an explicit response code, WebSocket messages are unidirectional and by default don’t expect an acknowledgement.
To keep track of how many stories have run successfully to completion, we wrote a few events (login, start, finish…) to the standard output and activated a logging sink to stream all of the logs to BigQuery. The events were logged from the point of view of the clients (the injector) and from the point of view of the servers (GKE pods).
This made it very convenient, with aggregate SQL queries, to:
make sure that at least 99% of all the stories did finish successfully, and
make sure the stories did not take more time than expected.
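The two checks above were done with aggregate SQL queries in BigQuery, which are not reproduced in this post. To show the logic only, here is a small Python sketch of the same aggregation over in-memory events. The `(story_id, event_name, timestamp)` tuple layout, the `summarize` function, and the sample data are all hypothetical.

```python
def summarize(events, max_duration_s):
    """Compute the success rate and the list of stories that ran too long.

    events: iterable of (story_id, event_name, timestamp_in_seconds).
    A story counts as successful if it has both a start and a finish event.
    """
    started = {}
    finished = {}
    for story_id, name, ts in events:
        if name == "start":
            started[story_id] = ts
        elif name == "finish":
            finished[story_id] = ts

    durations = {
        sid: finished[sid] - started[sid]
        for sid in finished
        if sid in started
    }
    success_rate = len(durations) / len(started) if started else 0.0
    too_slow = sorted(sid for sid, d in durations.items() if d > max_duration_s)
    return success_rate, too_slow


# Made-up sample: three stories, one of which never finishes.
sample_events = [
    ("s1", "start", 0), ("s1", "finish", 900),
    ("s2", "start", 0), ("s2", "finish", 960),
    ("s3", "start", 5),
]
rate, slow = summarize(sample_events, max_duration_s=930)
meets_target = rate >= 0.99  # the 99% threshold from the list above
```

With events from both the injector and the pods in the same place, the same aggregation can be run from either point of view and the two results compared.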
We also kept an eye on several Grafana dashboards to monitor live metrics of our GKE cluster, and make sure that CPU, memory, and bandwidth resources didn’t get overwhelmed.
Visual check
As a bonus, it was very fun to connect as a “real” attendee and watch hundreds of bots running everywhere!
Connecting to the system under stress with a browser also enabled us to assess the subjective experience. We could see, for example, how smooth the animations were when a shard was hosting 500 attendees and the frontend was rendering dozens of moving avatars.
Results
With a total of 4,000 triggers of 40 stories each, and a max concurrency of 40 requests per Cloud Run instance, our tests used just over 100 instances and successfully injected 160,000 simultaneous active attendees. We ran this load test script several times over a few days, for a total cost of about $100. The test took advantage of the full capacity of all of the server CPU cores (used by the GKE cluster) that our quota allowed. Mission accomplished!
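The figures above fit together with a bit of arithmetic, spelled out here as a short Python sketch. The inputs are the numbers quoted in this post; the shard count is a derived lower bound that assumes every shard is filled to its 500-attendee capacity, and is not a figure reported from the test.

```python
import math

triggers = 4000                 # trigger requests sent to the injector
stories_per_trigger = 40        # sessions started by each trigger
max_requests_per_instance = 40  # Cloud Run concurrency setting
shard_capacity = 500            # attendees per shard

# Total simultaneous sessions injected.
sessions = triggers * stories_per_trigger

# Lower bound on injector instances if every instance is fully loaded.
min_instances = math.ceil(triggers / max_requests_per_instance)

# Lower bound on shards needed, assuming every shard is completely full.
min_shards = math.ceil(sessions / shard_capacity)

print(sessions, min_instances, min_shards)
```

The lower bound of 100 injector instances is consistent with the "just over 100" that Cloud Run actually started, since autoscaling adds instances before the existing ones are completely full.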
We learned that:
The cost of the load test was acceptable.
The quotas we needed to raise were the number of specific CPU cores and the number of external IPv4 addresses.
Our platform would successfully sustain a load target of 160K attendees.
During the actual event, the peak traffic turned out to be less than the maximum supported load. (As a result, no attendees had to wait in the queue that we had implemented.) Following our tests, we were confident that the backend would handle the target load without any major issues, and it did.
Of course, Cloud Run and Cloud Run Jobs can handle many types of workloads, not only website backends and load tests. Take some time to explore them further and think about where you can put them to use in your own workflows!