Large scale hybrid API management: Best practices in configuring clusters, scaling, and operations

By mullaned2002

February 8, 2023

Competing and staying relevant in an economy that became digital-first almost overnight has required organizations to rethink their API strategy and, in many cases, adopt a hybrid cloud approach. As we uncovered in our 2022 research into API usage, 59% of respondents were planning to increase their hybrid cloud adoption in the next 12 months. But operating APIs consistently in a hybrid cloud environment is easier in theory than in practice. We see many organizations struggling with costs, maintenance, and monitoring because of the heterogeneity of their architecture.

In the previous blog, we discussed the common challenges in architecting the right team structures and platform. In this blog, we cover some best practices we have learned from our customers as they moved to hybrid runtime operations. Specifically:

Determining the size and placement of your Kubernetes clusters

Considering how you will handle upgrades, security, and automation

Ensuring that you have appropriate monitoring and dashboards in place to meet your strict service level objectives

Before we get into the best practices, let us cover some commonly used technical terms in this blog:

Node – A worker machine in Kubernetes, which may be either a virtual or a physical machine, depending on the cluster

Cluster – A set of nodes that run containerized applications

Pod – The smallest, most basic deployable object in Kubernetes. A Pod represents a single instance of a running process in your cluster and contains one or more containers

Cassandra – Apache Cassandra is the runtime datastore that provides data persistence for Apigee Hybrid runtime

CI/CD – The combined practices of continuous integration and continuous delivery (or continuous deployment)

GitOps – An operational framework that takes DevOps best practices used for application development such as version control, collaboration, compliance, and CI/CD, and applies them to infrastructure automation

#1 Start with the right cluster size and capacity

There is no single optimal infrastructure size for all clusters running Apigee hybrid, but there are some easy ways to figure out the right size for your organization.

First, you’ll need to determine the right starting size for your clusters. Identifying a starting point for cluster capacity requires estimating the number of processes and the average memory required by each process. In the world of API management, this would require estimating the average transactions per second (TPS) the cluster would process and the number of environments running in a particular cluster.

Although this information provides a foundation to build on, you will need to account for other factors, such as API response times, the policies executed in an API call, payload sizes, and the CPU needs of any custom policies. Using this foundation, you can rely on the help of a trusted partner like your Google Customer Engineer to make a reasonable capacity estimate.
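
To make the starting estimate concrete, here is a minimal sketch of the arithmetic. It assumes you already have a measured or estimated throughput per runtime pod; the function names, the `tps_per_pod` figure, and the `headroom` fraction are hypothetical placeholders for illustration, not Apigee settings.

```python
import math


def estimate_runtime_pods(peak_tps: float, tps_per_pod: float, headroom: float = 0.5) -> int:
    """Rough pod count needed to serve peak_tps.

    tps_per_pod and headroom are assumptions you supply from your own
    measurements; headroom is the fraction of each pod kept spare.
    """
    if tps_per_pod <= 0:
        raise ValueError("tps_per_pod must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_tps = tps_per_pod * (1 - headroom)
    return math.ceil(peak_tps / usable_tps)


def estimate_pods_per_environment(peak_tps_by_env: dict, tps_per_pod: float,
                                  headroom: float = 0.5) -> dict:
    """Apply the same estimate to every environment in a cluster."""
    return {
        env: estimate_runtime_pods(tps, tps_per_pod, headroom)
        for env, tps in peak_tps_by_env.items()
    }
```

Treat the result only as a starting point. Response times, policies, and payload sizes all change how much traffic a single pod can really handle.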

#2 Optimize your cluster capacity through testing

When operating your APIs in hybrid environments, it is vital to right-size the clusters that run your containerized runtime services. Undersizing a cluster hurts performance, and oversizing a cluster increases your costs. As a best practice, never rely only on the estimates you used when you first set the cluster capacity. That starting point is based on a set of averages and will never replicate your actual workloads, which will most likely require a different cluster capacity.

This is where testing comes in handy. Since testing in production is not optimal, the best way to determine the right cluster capacity is to run your own load tests with your own workloads. You can run average and peak TPS tests to find a cluster size that maximizes your cost efficiency and minimizes your carbon footprint, all while delivering a top-notch API experience for your developers.

As a note, Cassandra nodes should have a minimum of 8 vCPUs and 15 Gi of RAM for production workloads. We recommend that your runtime pods have 2 vCPUs and 2 Gi of RAM. You can override the default values using the appropriate configuration properties for Cassandra and runtime in your overrides.yaml file.
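
One way to turn load test results into a sizing decision is to pick the smallest tested cluster that still meets your targets. The sketch below assumes you record a few summary numbers per test run; the `LoadTestResult` fields and the thresholds are hypothetical and should be replaced with whatever your load testing tool reports.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LoadTestResult:
    """Summary of one load test run (hypothetical fields)."""
    nodes: int             # cluster size used for the run
    sustained_tps: float   # throughput the cluster sustained
    p99_latency_ms: float  # 99th percentile latency observed
    error_rate: float      # fraction of failed requests, 0.0 to 1.0


def smallest_passing_size(results: Iterable[LoadTestResult], required_tps: float,
                          max_p99_ms: float, max_error_rate: float) -> Optional[LoadTestResult]:
    """Return the run with the fewest nodes that met every target, or None."""
    passing = [
        r for r in results
        if r.sustained_tps >= required_tps
        and r.p99_latency_ms <= max_p99_ms
        and r.error_rate <= max_error_rate
    ]
    return min(passing, key=lambda r: r.nodes) if passing else None
```

Running this against both average and peak TPS targets shows whether a single size covers both, or whether you should lean on autoscaling for the difference.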

#3 Protect against failures with regional clusters

Your production platform should have at least two Kubernetes clusters, each in a different region. Having two clusters in different regions will protect against issues that can cause cluster failures, such as regional outages. When setting up regional clusters for disaster recovery, we recommend that they are the same size. That way, one can handle the full load if the other cluster fails.

The other main consideration is picking regions close to where your consumers access your services. Since you can have more than two regions, and even different environments running in each region, you can optimize your cluster size for each region based on the traffic it serves. As you size your clusters, though, always keep in mind what happens if a cluster in another region fails and its traffic is suddenly redistributed unevenly among your remaining clusters.
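
As a rough illustration of failure-aware sizing, the sketch below computes a conservative capacity target per region. It assumes the worst case for a single regional outage: any surviving region might absorb all of the traffic from the largest other region. The function name and region labels are hypothetical, and real traffic routing will usually be less pessimistic than this.

```python
def worst_case_tps(peak_tps_by_region: dict) -> dict:
    """Capacity each region needs if it must absorb one failed region's traffic.

    Conservative assumption: the largest other region fails and all of its
    traffic lands on this region.
    """
    if len(peak_tps_by_region) < 2:
        raise ValueError("need at least two regions to plan for a regional failure")
    targets = {}
    for region, tps in peak_tps_by_region.items():
        largest_other = max(v for r, v in peak_tps_by_region.items() if r != region)
        targets[region] = tps + largest_other
    return targets
```

With exactly two regions, this reduces to the recommendation above: each cluster is sized to handle the full load on its own.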

#4 Optimize scaling of your runtime pods and datastore

Since Apigee hybrid runs on Kubernetes, you get all the benefits Kubernetes has to offer — including the ability to autoscale. By default, all the pods that process your API traffic will autoscale using the horizontal pod autoscaler except for the Cassandra pods.

To further optimize your autoscaling, test and fine-tune your settings. Through testing, you can determine the right minimum and maximum number of replicas. As part of this optimization, give environments that need more capacity a higher maximum, and give the others a lower maximum to reduce the resources they consume. The ability to autoscale also means that you don't have to get the scale perfect. Ideally, you will be running on a Kubernetes service such as GKE (Google Kubernetes Engine) with elastic compute capability, where you can add and remove nodes on the fly to ensure that you are using resources as efficiently as possible.
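
To see how the minimum and maximum replica settings shape autoscaling, it helps to look at the core calculation the Kubernetes horizontal pod autoscaler documents: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up. The sketch below reproduces only that calculation plus the min/max clamp; it leaves out the tolerance band, stabilization windows, and other behavior of the real controller, and the function name is our own.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Simplified horizontal pod autoscaler calculation.

    desired = ceil(current_replicas * current_metric / target_metric),
    then clamped to the configured minimum and maximum.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas running at 90% CPU against a 60% target would scale to six, as long as the maximum allows it. If the maximum is set too low, the clamp silently caps the scale-out, which is why the limits are worth validating in load tests.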

The one exception as noted above is Cassandra, since it does not have the ability to autoscale. Cassandra can easily be scaled up and down, but it must be done on-demand and in sets of three. If you are heading into peak season, add three or six Cassandra nodes. After the peak traffic is through, go ahead and scale back down.
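
Because Cassandra is scaled in sets of three, any capacity you add has to be rounded up to the next multiple of three. A minimal sketch of that rounding, with a hypothetical function name and inputs:

```python
import math

CASSANDRA_SET_SIZE = 3  # nodes are added or removed in sets of three


def cassandra_target_nodes(current_nodes: int, extra_nodes_needed: int) -> int:
    """Round the extra capacity up to whole sets of three nodes."""
    if current_nodes % CASSANDRA_SET_SIZE != 0:
        raise ValueError("current node count should already be a multiple of three")
    if extra_nodes_needed < 0:
        raise ValueError("extra_nodes_needed must not be negative")
    sets_to_add = math.ceil(extra_nodes_needed / CASSANDRA_SET_SIZE)
    return current_nodes + sets_to_add * CASSANDRA_SET_SIZE
```

So if your peak-season estimate calls for four more nodes on a three-node ring, you would plan for nine nodes in total, then scale back down once the peak has passed.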

Lastly, note that autoscaling works best when traffic ramps up gradually. While runtime pods are designed to spin up quickly, nodes may take a bit longer to spin up and be added to your cluster. If you are expecting a "wall" of traffic (for example, if you are dropping that highly anticipated limited-release sneaker at midnight), make sure to pre-warm the system by setting the minimum replicas high enough to handle that load. The autoscaler can't react quickly enough to handle peak traffic that hits your application all at once.
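
The pre-warming decision can be sketched as a simple comparison between how fast traffic arrives and how fast new capacity comes online. This is a deliberately simplified model: `scale_up_seconds` (the time to provision a node and start pods) and `tps_per_pod` are hypothetical values you would measure in your own environment.

```python
import math


def prewarm_min_replicas(expected_peak_tps: float, tps_per_pod: float,
                         ramp_seconds: float, scale_up_seconds: float,
                         current_min: int) -> int:
    """Suggest a minimum replica count ahead of a known traffic spike.

    If traffic ramps more slowly than capacity can be added, keep the
    current minimum and let the autoscaler follow the ramp. Otherwise,
    raise the minimum so the peak can be served from the first second.
    """
    if tps_per_pod <= 0:
        raise ValueError("tps_per_pod must be positive")
    if ramp_seconds >= scale_up_seconds:
        return current_min
    needed = math.ceil(expected_peak_tps / tps_per_pod)
    return max(current_min, needed)
```

Remember to lower the minimum again after the event so you are not paying for idle capacity.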

#5 Stay up to date with automated CI/CD pipelines

You probably already know the benefits of using an automated CI/CD pipeline to deploy your proxies, but automation can also help streamline installing and upgrading your hybrid runtimes. Automation is useful because it:

Simplifies recreating a cluster in the event of a disaster

Makes it easy to add additional regions

Makes it easy for teams to install their own instances

Offers a tested, repeatable process to reduce errors

Can be set up as part of the GitOps deployment model

Protects your Apigee hybrid settings from incidents caused by manual cluster changes when tools such as config sync are in place

Considering how often enterprise software is updated, upgrades are perhaps the most important thing to automate. Apigee hybrid was born in the cloud, which means it receives the same frequent updates and patches that we put out for Apigee on Google Cloud. With many patches and updates a year, it can be easy to fall behind and run into issues from running outdated, unsupported versions. By automating the process, you can be sure that you are always getting the features and security patches in the latest version.

#6 Ensure reliable performance with monitoring

No matter how well-automated the system is, the best way to ensure consistent uptime is with strong monitoring tools. Apigee hybrid makes it easy to monitor and set alerts with the power of Google Cloud's operations suite. By default, Apigee sends your proxy health metrics to Cloud Monitoring automatically, and you can choose to send system logs as well. Once these metrics are in Cloud Monitoring, you can use them to define dashboards and service level indicators, and apply general SRE (site reliability engineering) practices to keep your instances humming along without issue.

Apigee can also work with the stacks you already have in place for logging and monitoring. You can use your own tools to scrape metrics, or install your own agents to gather logs and send them to the logging tool of your choice.
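
As an example of the SRE practices mentioned above, the sketch below computes an availability service level indicator and the share of the error budget still remaining for a given objective. The request counts would come from your monitoring data; the function names and the 99.9% target used in the usage note are illustrative assumptions, not Apigee defaults.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, slo_target: float) -> float:
    """Share of the error budget still unspent (negative if overspent).

    slo_target is the objective as a fraction, for example 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures
```

With a 99.9% objective, one million requests allow roughly 1,000 failures, so 500 failed requests leave about half of the budget. Alerting on how quickly that budget is being spent is often more useful than alerting on individual errors.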

Conclusion

As with any large-scale IT project, there are too many variables to define a single "correct" way to operate APIs in a hybrid cloud environment. Finding the right approach requires tailoring Apigee to your use case based on your organizational constraints.

Check out our documentation to learn more or start using Apigee hybrid.
