Building large-scale systems can be challenging. Systems that are predictable and reliable on a smaller scale can become chaotic and uncontrollable when scaled up, revealing the limits of your design, performing poorly, or even failing outright.
To mitigate these effects, it’s important to test large-scale systems before deploying them in production. But predicting system performance and reliability when scaling up can be a complex task that requires extra care. Typically, the amount of compute resources predicts a system’s capacity to scale. In other cases, other dimensions are more important, for example the number of concurrent connections, the number of tenants, or the security rules that can be applied. Further, how you perform scalability testing depends on the nature of the application — is it a greenfield product, an application you are migrating from another cloud provider, or a system you are resizing? Based on our experience assessing large Google Kubernetes Engine (GKE) clusters, let’s walk through the best practices and advice we give GKE customers when they want to scale up their workloads.
Benefits of scalability testing
The major benefit of scalability testing your system is to identify bottlenecks and optimization opportunities that may not be obvious at a lower scale, so you can confirm that it will be reliable and perform well at scale. There are other benefits too:
Gain trust in the system. If you successfully managed to scale it to twice its current size, you should feel good about running it at its regular scale.
Identify potential issues early on. This may prompt you to enhance your observability metrics. For example, many customers don’t track GKE API latency. But once they notice how well Kubernetes API latency detects early signs of problems on the control plane, they decide to add it to Cloud Operations, Google Cloud’s observability suite.
Confirm the cost of system operation at large scale. In particular, GKE customers might be surprised to learn that a large system can cost less per unit of work, when you consider the number of sessions, users, or Jobs completed.
Setting your scalability testing goals
Setting proper goals for your scalability testing is particularly important. For one, testing the wrong assumptions can be expensive. Then, on top of the actual cost of testing, having poorly defined goals builds a false sense of trust in the system, which can result in serious incidents when the system is scaled in production.
As a rule of thumb, your high-level goal should be expressed as a business-oriented value and should be measurable: for example, double the number of concurrently processed jobs, triple the number of user queries per second, or survive a region failure gracefully. Take this goal and display it prominently on your testing dashboard.
You should also map your high-level goals to specific resources. For example, if you are rescaling the existing system, the most common way to do that is to assume linear resource utilization growth, and add a safe cushion. If your system works reliably with 100k CPUs, and you want to double the performance, you might use 250k CPUs. These raw assumptions just provide an estimate and help you assess how far you are from hard system limits; you need to confirm the actual numbers during testing.
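The arithmetic behind the 100k-to-250k example can be sketched as follows. This is a minimal illustration, assuming linear growth and a 25% cushion (which is what turns a doubled 100k into 250k); the function name and parameters are hypothetical, and the result is only a starting estimate to confirm during testing.

```python
import math


def estimate_target_capacity(current_units: int, growth_factor: float,
                             cushion_percent: int) -> int:
    """Scale current usage linearly, then add a safety cushion.

    All names here are illustrative; cushion_percent=25 means "+25%".
    """
    if current_units < 0 or growth_factor <= 0 or cushion_percent < 0:
        raise ValueError("inputs must be non-negative, growth_factor positive")
    # Linear growth first, then the cushion, rounded up to whole units.
    return math.ceil(current_units * growth_factor * (100 + cushion_percent) / 100)


# 100k CPUs today, aiming for double the performance with a 25% cushion.
target_cpus = estimate_target_capacity(100_000, 2.0, 25)
print(target_cpus)  # 250000
```

Comparing the result against your quotas and the platform’s documented limits shows how much headroom remains before you hit a hard limit.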
When setting your goal, be sure to check the following resources to ensure they’ll be sufficient once you’ve scaled the system:
Number of nodes/Pods/containers
Number of Services
Number of namespaces
Number of CPU/GPU/memory
Number of secrets/ConfigMaps/CRDs
Additionally, verify the utilization of typical Google Cloud resources, such as the number of VPCs, their instances and aliases, or the total number of clusters or node pools in the project.
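One way to make this check repeatable is to record the planned count and the applicable limit for each dimension, then flag anything that comes too close. The sketch below is an assumption-laden illustration: the helper name, the 80% threshold, and every number in the example are placeholders, not actual GKE or Google Cloud limits. Look up the real values in your quotas page and the product documentation.

```python
def find_limit_risks(planned: dict[str, int], limits: dict[str, int],
                     max_utilization: float = 0.8) -> list[str]:
    """Return a warning for each dimension that is near or missing its limit."""
    risks = []
    for name, count in planned.items():
        limit = limits.get(name)
        if limit is None:
            # An unknown limit is itself a risk worth resolving before the test.
            risks.append(f"{name}: no limit recorded")
        elif count > limit * max_utilization:
            risks.append(f"{name}: planned {count} exceeds "
                         f"{max_utilization:.0%} of limit {limit}")
    return risks


# Placeholder numbers for illustration only.
planned = {"nodes": 9_000, "services": 500, "namespaces": 200}
limits = {"nodes": 10_000, "services": 10_000}
for warning in find_limit_risks(planned, limits):
    print(warning)
```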
Large systems can be very reliable, as long as things are stable. It’s during times of change that things can go wrong. To build a proper test case, be sure to exercise common friction points during the test. These include cluster upgrades, zonal outages, and large-scale preemptions of preemptible VMs (PVMs). Check that all essential metrics and logs are properly collected during the testing procedure.
Determining scalability testing costs
There’s no doubt that scalability testing can be pretty costly. In extreme cases, some tests take hundreds of thousands of dollars to run! We know of what we speak: At Google Cloud, we test GKE releases with up to 20 different test configurations on 15,000 nodes at least twice per week. We also verify the scale of open source Kubernetes releases, performing 5,000 node tests on a daily basis.
Even if you run your scalability tests on an ad hoc basis, you still need to optimize your costs. The most common way to do that is to make a test as short as possible. Our internal tests are a good example: we reduced a 15-hour testing procedure to less than three hours, keeping the scope of the test unchanged.
Here are a few other optimization methods you can use:
Switch to less costly compute resources. If possible, run your tests on Spot VMs, which are up to 91% cheaper than regular instances.
Simplify the storage layer. As long as you’re not testing the storage layer, you can replace default targets (Cloud Storage, LocalSSD) with stubs on standard persistent disks.
If appropriate, optimize the test workloads to ‘stub’ the actual containers with less memory/cpu than you need in production.
And when it comes to estimating the cost of testing, be sure to include all cost factors (compute, network, storage) at their highest expected value. To estimate the length of the test, don’t forget to include the time to scale up and scale down. And based on our experience, you’ll need at least two to three complete tests to reach the goal.
At the end of the day, keep in mind that there is no golden rule here. You need to balance optimizing cost with achieving results that are as close to the production system as possible.
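The estimation advice above can be turned into a rough budget calculation. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the hourly rates are made-up inputs, and the peak rate is charged for the whole run (including scale-up and scale-down), which deliberately overestimates.

```python
from dataclasses import dataclass


@dataclass
class HourlyCosts:
    """Highest expected hourly cost per factor (illustrative values only)."""
    compute: float
    network: float
    storage: float

    def total(self) -> float:
        return self.compute + self.network + self.storage


def estimate_test_budget(peak: HourlyCosts, scale_up_hours: float,
                         steady_hours: float, scale_down_hours: float,
                         runs: int = 3) -> float:
    """Upper-bound budget: peak rate x full run length x number of runs."""
    hours_per_run = scale_up_hours + steady_hours + scale_down_hours
    return peak.total() * hours_per_run * runs


# Made-up rates: a 3-hour test plus 1 hour to scale up and 0.5 to scale down,
# repeated for three complete runs.
peak = HourlyCosts(compute=1000.0, network=100.0, storage=50.0)
print(estimate_test_budget(peak, 1.0, 3.0, 0.5, runs=3))  # 15525.0
```

Shortening the steady-state phase or switching to cheaper compute changes only the inputs, so the same calculation can be reused to compare options.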
Preparing your infrastructure
When running a large scalability test, it’s a good idea to run the test in a separate Google Cloud project with a dedicated billing account. Separation at the project level keeps quota usage independent of your other environments. And because this is a separate project, you’ll need to ask for an increased quota for all the resources you are likely to use. It is helpful to follow the guidelines in the GKE documentation page “Plan for large workloads”. Pay particular attention to network configuration. Define CIDR addressing and load balancers so that they can scale along many dimensions, including the most common ones: number of nodes, Pods, and Services. We also recommend that you follow “GKE address management: Introduction and overview” in the Cloud Architecture Center.
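The CIDR planning mentioned above can be sanity-checked with a quick calculation. The sketch below assumes each node reserves a fixed-size slice of the Pod address range, with a /24 per node as the default (the setting commonly associated with 110 Pods per node); verify the value for your own cluster configuration. The function name is hypothetical.

```python
import ipaddress


def max_nodes_for_pod_range(pod_cidr: str, per_node_prefix: int = 24) -> int:
    """Number of per-node slices that fit in the Pod range.

    Assumes every node gets one block of size /per_node_prefix.
    """
    network = ipaddress.ip_network(pod_cidr)
    if per_node_prefix < network.prefixlen:
        # The whole range is smaller than a single node's slice.
        return 0
    return 2 ** (per_node_prefix - network.prefixlen)


print(max_nodes_for_pod_range("10.0.0.0/14"))  # 1024
print(max_nodes_for_pod_range("10.0.0.0/10"))  # 16384
```

Under these assumptions, a /14 Pod range caps the cluster at 1,024 nodes no matter how much compute quota is available, which is why the address plan needs to be sized for the target scale before the test starts.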
Although it is technically possible to run scalability testing without support from your account team, it’s a good idea to include them early in the process, so they can help you understand platform limits and connect you to subject matter experts. You may even gain access to best practices and known workarounds before they are publicly available.
You should also inform and include Google’s capacity team in the preparation phase. Some times of year are worse than others for running large tests, for example Black Friday/Cyber Monday or New Year’s. The best time for test preparation is the first quarter of the year, and the best time of year to perform the actual testing is the second quarter.
Once you’ve established the timeframe for the testing, schedule a quick dry run a few days before the actual test, where you run a small test to prove that all the wires are properly connected. You are ready to go when you’ve written down all your test cases and proved that they work, stored the configuration in repositories, and developed dashboards to display your testing results.
Running the test
Although each test execution is different, successful tests have a lot in common with one another. One of the most common rules is to run the test with a good team. Your test runners might include DevOps, Architects, or Developers. We have customers that reserve a day in a team’s calendar and hold a war-room. Having the architects and developers in one place can also help you figure out quick workarounds if something goes wrong.
Be prepared for some elements of the system to be unstable during the run. The goal of a test runner is to collect as much data as possible to debug the issue. For the sake of future investigations, we tend to record all the discussions and investigations conducted during the test sessions. Combined with collected logs and metrics, this recorded data can greatly help with later debugging.
The test run’s summary should include not only technical metrics, but also a pricing breakdown. The value of having a cost estimate of running at scale is hard to overestimate.
Having run many tests, our engineers are experts in the field of scalability. Even so, it’s typical for them to detect several issues on the first test run and rarely is a single run enough to meet scalability testing goals. Depending on the amount of additional tweaking, you’ll usually need to execute another run in about a month. Sometimes, you need to perform deeper refactoring or another round of tests. Without testing, you never would have learned that!
Long story short
Moving to the cloud gives you the opportunity to look at the machines you use to run your workloads from a new and interesting angle. Cloud computing is, in fact, driven by software. Software is the primary interface to network and storage, to your workloads, and to the systems that will manage them. This opens up an endless number of ways to integrate, imagine, and use it.
Testing changes to your code is an integral part of modern software development lifecycle culture. By looking at cloud systems as software, it becomes clear that each change to those systems (whether it’s a new Kubernetes version, a new component like an in-memory database or a caching system, or an architectural change like moving from a single-cluster to a multi-cluster scenario) is a change rollout or a new release, and should be treated as such.
Although the cost of running scalability testing can be considerable, it’s often the most efficient and fastest way to learn how to prepare the system to operate at a larger scale. We believe that GKE is the best platform to run large and complex workloads, and following the best practices that we developed over the years testing Kubernetes and GKE can make the process manageable. We are happy to share our experience and support you as you build the most advanced systems with our platform as a foundation. See the GKE documentation for more on scalability best practices, or reach out to your Technical Account Manager if you’d like to talk further.