Scaling Microservices Applications: From Open Source to Redis Enterprise on Google Cloud

By mullaned2002

February 3, 2023

437

Redis excels as a fast caching solution or in-memory database, and is commonly utilized by startups as a self-managed open source technology. But what happens when you eventually outgrow open source limitations and the in-house expertise required to manage OSS deployments? How do you scale without business disruption or data loss? Read on to discover how Google Cloud and Redis make it easy to transition from open source to enterprise-ready as your business grows.

Let’s explore this common scenario for digital-native companies and illustrate it using a hypothetical retailer we will call “Online Boutique”. Online Boutique is a popular online shop that recently launched a digital storefront to connect consumers with its products. This storefront utilizes a microservices architecture running on Google Kubernetes Engine (GKE) in addition to caching with Redis open source.

Online Boutique’s DevOps team is responsible for deploying and scaling this home-grown microservices application in production. Its engineers selected GKE as their Kubernetes engine because it allows their infrastructure to scale to support application and business growth, while guaranteeing a 99.95% uptime SLA.

In order to scale performance and prevent downtime should failure occur, the engineering team set up the microservices to scale out by simply increasing the number of replicas for each instance. The DevOps team decided to implement GKE’s Cluster Autoscaler and Horizontal Pod Autoscaling to allow the cluster and microservices to grow dynamically without human intervention. Online Boutique’s configuration and management approach to Redis open source made it the only service that couldn’t scale automatically once multiple replicas were created. Because Redis stores persistent data, there was a need to ensure that data was consistent between replicas. This required a higher level of orchestration internally and added management complexity.

The team had a conversation about the fact that their approach to Redis OSS potentially introduced a single point of failure, and about their need to deploy Redis as a cluster to improve scalability and reliability. With the deadline to launch their online store looming, leadership decided that a single Redis open source database instance would be an acceptable risk for initial launch. When running Kubernetes on the Google Compute Engine (GCE) a service level agreement (SLA) of 99.5% uptime can be achieved, which means that if the Redis instance went down, another node would be recreated and mapped to the persistent volume. The decision to deploy Redis open source as a single instance was made with the promise that this vulnerability would be resolved shortly after launch.

Let’s say a few months go by and it finally happened, it’s 3am and a phone buzzes off the nightstand. There are dozens of notifications that Online Boutique’s online store is down. As a support engineer starts digging into the logs, metrics, and traces using observability tools, they realize that the Redis instance has been restarted 13 times and won’t stay in a “ready” state. After further investigation, it is determined that the sheer number of transactions to this single Redis instance is well over a million per second, which is beyond the upper limit a single Redis open source instance can handle.

Once the engineer had diagnosed the issue, a site event incident (SEV) was triggered to get all hands on deck and bring the site back online. The entire DevOps team was now on a call along with the head of engineering and a few senior software engineers. It was determined that the amount of traffic to the website had overwhelmed their self-managed Redis open source instance, and without the traffic on the website being reduced, the only way to bring things back online was to scale Redis. The team broke into three groups, the first group put up a static landing page for the storefront to let users know the store was temporarily offline. The second group dug into whether the traffic was legitimate or the result of a malicious attack. The third group rehashed conversations from months back on their approach to scaling Redis.

The team discussed a number of different paths they could take, laying out each approach and its pros and cons.

Redis Operator from Artifact Hub

Pros:

Free

Easy procurement process

Can run in existing GKE cluster

Cloud agnostic

Cons:

No support

Requires high level of expertise to deploy, scale, and upgrade

Requires time to design, test, & deploy

Multi-cluster deployments are not supported with this operator, making multi-region availability impossible

No guarantee of scalability

Redis Enterprise on Kubernetes

Pros:

24/7×365 Support

Can run in existing GKE cluster

Cloud agnostic

Free trial available

Easy-to-follow documentation to: deploy, scale, and upgrade

Supports Multi-Cluster deployments to ensure multi-region availability

200+ Million queries per second (QPS)

Cons:

Costs money (after free trial)

Requires some expertise to: deploy, scale, and upgrade

Requires time to: design, test, and deploy

Google Cloud Memorystore

Pros:

24/7×365 Support

99.9% SLA

Fully managed service

Simplified billing through our Google Cloud Account

Minimal expertise needed to: Deploy, Scale, & Upgrade

Instantly deployable

Single pane of glass to manage infrastructure

Cons:

Costs money

Maximum of 1 million queries per second (QPS)

No multi-region availability

Maximum dataset size: 300GB

Not cloud agnostic

Cannot scale writes (read replicas only)

Redis Enterprise Cloud

Pros:

24/7×365 Support

99.999% SLA

Fully managed service

Instantly deployable

Simplified billing through Google Cloud account when using Google Cloud Marketplace

24TB+ dataset size

Requires Flash Storage for 24TB+

Maximum in RAM is 12TB+

Multi-region & multi-cloud availability

Cloud agnostic

Cons:

Costs money (after free trial)

Two panes of glass to manage infrastructure

Due to time pressure to get the application back online, options one and two were eliminated. There was not enough time for the team to design, test, and deploy an open source-based solution that would prevent an event like this from happening again. This left the team considering both SaaS solutions, Google Cloud Memorystore & Redis Enterprise Cloud. After performing an analysis of the two, the only real point of contention was the price. However, after pricing out similarly-sized clusters as accurately as possible on both Google Cloud Memorystore and Redis Enterprise Cloud, the prices were comparable.

Plus, Redis Enterprise Cloud is highly available and persistent with 99.999% SLA to minimize future downtime and data loss. With sub-millisecond latency, support for significantly larger datasets, and the ability to seamlessly scale clusters vertically and horizontally Redis Enterprise on Google Cloud was the clear choice to future proof their application for optimal performance at any scale.

Now, the team had to deploy it and migrate their previous Redis deployment from their self-managed open source instance to Redis Enterprise Cloud with minimal downtime and without losing any data. Deploying proved quite simple through Google Cloud Marketplace which the team was already familiar with. With just a few clicks (there is a Terraform provider too) the team was able to provision a highly available Redis Enterprise database in a matter of minutes. Once the database was created, the team was able to provide a GCP Project and Network name for the GKE cluster and Redis Enterprise Cloud provided a gcloud command to peer the GKE VPC with the Redis VPC. After the peering was complete the GKE cluster could connect to the Redis Enterprise database through the database’s private endpoint.

At this point the team needed to get all of the data from the existing Redis database instance into the Redis Enterprise Cloud database. This was done by creating a Kubernetes secret with the connection information for both the local Redis instance and the cloud instance that looked a little something like this:

code_block[StructValue([(u’code’, u’apiVersion: v1rnkind: Secretrnmetadata:rn name: redis-credsrntype: OpaquernstringData:rn REDIS_SOURCE: redis://${REDIS_SOURCE}rn REDIS_DEST: redis://${REDIS_DEST}rn REDIS_DEST_PASS: ${REDIS_DEST_PASS}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e20c79ec810>)])]

The team then kicked off a Kubernetes job to migrate the data from the local instance to the cloud instance (which was back online since the static webpage that was put in place took the load off of the failed Redis OSS database). The job looks like this:

code_block[StructValue([(u’code’, u’apiVersion: batch/v1rnkind: Jobrnmetadata:rn name: redis-migratorrnspec:rn template:rn spec:rn containers:rn – name: redis-migratorrn image: jruaux/riot-redis:v2.18.4rn command: [“./riot-redis-2.18.4/bin/riot-redis”]rn args: [“-u”, $(REDIS_SOURCE), “replicate-ds”, “-u”, $(REDIS_DEST), “-a”, $(REDIS_DEST_PASS)]rn envFrom:rn – secretRef:rn name: redis-credsrn restartPolicy: Never’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e20c4118790>)])]

Once the job was complete the team was able to update the deployment to point to the new location of Redis using a Kubernetes patch command:

code_block[StructValue([(u’code’, u’kubectl patch deployment cartservice –patch \rn ‘{rn “spec”: {rn “template”: {rn “spec”: {rn “containers”: [rn {rn “name”: “server”,rn “env”: [rn {rn “name”: “REDIS_ADDR”,rn “value”: “‘$REDIS_ENDPOINT'”rn }rn ]rn }rn ]rn }rn }rn }rn }”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e20c4b31c90>)])]

The team ran checks to verify that everything was working properly, and was pleasantly surprised to find that everything “just worked”. The team removed the static webpage, allowing traffic back to the storefront, and got their application back up and migrated without an issue.

The team reconvened and held a retrospective on what happened, lessons learned and how to prevent the same failure from happening again. One question that hadn’t been answered yet was: “How did we get such a surge in traffic?!” Apparently, a social media influencer mentioned how they loved a pair of shoes that were sold on the site, and their entire following decided to browse and shop at the store at the same time. They discovered this through complaints on their own social media account, with shoppers stating they wanted to buy the same shoes, but couldn’t. The social media influencer shoutout would have been a nice problem to have if they could have handled the surge in traffic. Moving forward, this will no longer be an issue and they will be able to scale to meet the needs of their business at any time.

Thanks for following along as we illustrated how to scale a microservices application from open source to Redis Enterprise on Google Cloud to achieve the scalability and high availability your real-time applications need. Visit Google Marketplace to learn more and take advantage of the latest offers like free marketplace credits to help you get started. And don’t take our word for it, deploy this microservices application and test out how easy it is to migrate from Redis open source to Redis Enterprise on Google Cloud for yourself. All of the code and the instructions are located on GitHub!

Cloud BlogRead More

Previous articleAnalyze and visualize multi-camera events using Amazon SageMaker Studio Lab

Next articleHow multicloud changes devops

Scaling Microservices Applications: From Open Source to Redis Enterprise on Google Cloud

Redis Operator from Artifact Hub

Redis Enterprise on Kubernetes

Google Cloud Memorystore

Redis Enterprise Cloud

The overwhelmed person’s guide to Google Cloud: week of April 18

Announcing PyTorch/XLA 2.3: Distributed training, dev improvements, and GPUs

Transforming How New York Protects and Serves its Community

LEAVE A REPLY Cancel reply

Most Popular

The overwhelmed person’s guide to Google Cloud: week of April 18

Inpainting and Outpainting with Stable Diffusion

Announcing PyTorch/XLA 2.3: Distributed training, dev improvements, and GPUs

GQL: The ISO standard for graphs has arrived

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

Are your SLOs realistic? How to analyze your risks like an SRE

Integrate Amazon Managed Blockchain identities with Amazon Cognito

Does Open-Source Software Hold the Key to Data Security?

POPULAR CATEGORY