Thursday, February 22, 2024

Canary deployments using Kubernetes Gateway API, Flagger and Google Cloud Deploy

Canary deployment is an advanced technique used to test changes in a production environment by gradually rolling out the changes to a small subset of users before fully deploying them to the entire user base. This allows for real-world testing of the changes, and the ability to quickly roll back the changes in the event of any issues. Canary deployments are particularly useful for testing changes to critical parts of an application, such as new features or updates to the database schema. By using canary deployments, you can ensure that any new changes do not negatively impact the user experience, and can fix issues before they affect the entire user base.
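The core mechanic is a weighted traffic split between the stable version and the canary. A minimal, deterministic sketch of that idea (not the actual load balancer logic; `route`, its parameters, and the round-robin split are illustrative):

```shell
# route WEIGHT INDEX: decide where request number INDEX goes when the
# canary receives WEIGHT percent of traffic. Out of every 100 requests,
# the first WEIGHT go to the canary, the rest to the primary.
route() {
  weight=$1   # canary traffic share in percent (0-100)
  i=$2        # running request counter
  if [ $((i % 100)) -lt "$weight" ]; then
    echo canary
  else
    echo primary
  fi
}

route 10 5    # prints "canary"  (5 of every 100 requests falls in the 10% share)
route 10 42   # prints "primary" (request 42 falls outside the 10% share)
```

Real load balancers pick backends probabilistically rather than by counter, but the proportion of traffic reaching the canary works out the same.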

The new Kubernetes Gateway API gives you a great tool for managing traffic to applications running on your Google Kubernetes Engine clusters. Together with Google Cloud Deploy, you can leverage this capability to enable faster releases to production for your applications. By the end of this post, you will have a Continuous Deployment pipeline that uses an iterative traffic-shift pattern to release your application to production, allowing you to do fast, zero-downtime deployments of your applications.

Flagger is an OSS tool that allows you to do canary releases or A/B testing in a declarative fashion on your Kubernetes cluster. It monitors configurable metrics from your application to determine the health of your release and controls the release process based on those metrics. It supports metrics from various sources such as Prometheus or Google Cloud Monitoring. In this post, I’m using Google Managed Prometheus as the metrics source.

Flagger is often used with service meshes like Istio or Anthos Service Mesh, but it now also supports the new Kubernetes Gateway API for traffic management, which is what we are using in this blog post. I updated the implementation of the Gateway API in Flagger to support the latest version, v1beta1, and decided to put together this blog post.

High Level Design

Here is a small architecture diagram of how the components in this blog post connect with each other:

We are going to need several resources in our Google Cloud setup. We are using Artifact Registry to store the container image. Cloud Load Balancing is used for routing traffic to the application. Cloud Deploy is providing us with a managed continuous delivery pipeline that deploys the application to the various environments. Google Managed Prometheus is providing us with observability of the application so that the canary strategy can be data driven.

On the GKE cluster, we are using a two-namespace setup, with a dev namespace for the development environment that is deployed directly from Cloud Deploy and a prod namespace where the Kubernetes deployment is done with a gradual traffic shift using Flagger. For the prod namespace we are also going to deploy a Google Managed Prometheus (GMP) query interface.

Since we are using an internal Cloud Load Balancer, we are going to need a jump host VM on Compute Engine to actually access the application.

Environment setup

Let’s start with setting up our environment. In order to follow this post, you are going to need kubectl, gcloud, jq and skaffold installed on your machine, or you can use Cloud Shell, which has all of them installed. We are also going to set a few variables that will help us in the next steps.

```
export GOOGLE_CLOUD_PROJECT_ID=<your_project_on_google_cloud>
export GOOGLE_CLOUD_REGION=<your_google_cloud_region>
```

We also need to enable a few APIs upfront.

```
gcloud services enable --project $GOOGLE_CLOUD_PROJECT_ID
gcloud services enable --project $GOOGLE_CLOUD_PROJECT_ID
gcloud services enable --project $GOOGLE_CLOUD_PROJECT_ID
gcloud services enable --project $GOOGLE_CLOUD_PROJECT_ID
```

To set up Artifact Registry and configure your environment, run the following commands.

```
gcloud artifacts repositories create canary-repo --repository-format=docker \
  --location=$GOOGLE_CLOUD_REGION --project $GOOGLE_CLOUD_PROJECT_ID \
  --description="Docker repository for canary blog"
gcloud auth configure-docker $GOOGLE_CLOUD_REGION-docker.pkg.dev
export SKAFFOLD_DEFAULT_REPO=$$GOOGLE_CLOUD_PROJECT_ID/canary-repo
```

We are going to need a proxy-only subnet in our VPC for the Load Balancer. If you don’t already have one, create one with the following command. You might need to change the IP range to a free range in your network. This example uses the default VPC, but feel free to choose whichever VPC you prefer.

```
gcloud compute networks subnets create proxy \
  --purpose=REGIONAL_MANAGED_PROXY \
  --role=ACTIVE \
  --region $GOOGLE_CLOUD_REGION --project $GOOGLE_CLOUD_PROJECT_ID \
  --network=default \
  --range=
```

We need a GKE cluster with Gateway API, Horizontal Pod Autoscaling and Workload Identity enabled, running GKE version 1.24 or later, so let’s create that one next:

```
gcloud container clusters create "example-cluster" \
  --region $GOOGLE_CLOUD_REGION \
  --project $GOOGLE_CLOUD_PROJECT_ID \
  --cluster-version "1.24.5-gke.600" \
  --machine-type "e2-medium" \
  --num-nodes "1" \
  --max-pods-per-node "30" \
  --enable-autoscaling \
  --min-nodes "0" \
  --max-nodes "3" \
  --enable-managed-prometheus \
  --workload-pool "$" \
  --enable-shielded-nodes \
  --gateway-api=standard \
  --enable-ip-alias
```

After the creation is complete, we connect our local machine to the cluster:

```
gcloud container clusters get-credentials example-cluster \
  --region $GOOGLE_CLOUD_REGION --project $GOOGLE_CLOUD_PROJECT_ID
```

Lastly, we are going to need an example app. I created a small example app in Golang that you can check out.

```
git clone ./blog-examples/cd-flagger-gateway-api
```

Now everything is ready.

Deploy Google Managed Prometheus (GMP) Query Interface in the cluster

Flagger uses telemetry data to determine the status of the deployment. The example Golang application we are deploying for this demo provides Prometheus metrics. We already enabled managed collection on the cluster, so the metrics from the app should already be available in Cloud Monitoring. For this demo, we use a GMP query interface inside the cluster so that Flagger can check the deployment health. Flagger can also query Google Cloud Operations directly, but we found it easier to calculate success rates with PromQL.

```
kubectl create namespace prod
kubectl create namespace dev

kubectl create serviceaccount gmp -n prod

gcloud iam service-accounts create gmp-sa --project=$GOOGLE_CLOUD_PROJECT_ID

gcloud iam service-accounts add-iam-policy-binding gmp-sa@$ \
  --role roles/iam.workloadIdentityUser --project=$GOOGLE_CLOUD_PROJECT_ID \
  --member "serviceAccount:$[prod/gmp]"

gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT_ID \
  --member=serviceAccount:gmp-sa@$ \
  --role=roles/monitoring.viewer

kubectl annotate serviceaccount gmp \
  --namespace prod \
  $GOOGLE_CLOUD_PROJECT_ID.iam.gserviceaccount.com

sed -i "s/GOOGLE_CLOUD_PROJECT_ID/$GOOGLE_CLOUD_PROJECT_ID/g" gmp-frontend.yaml

kubectl apply -n prod -f gmp-frontend.yaml
```

Install Flagger in the Kubernetes cluster

In order to start with canary deployments, we need to install Flagger with Gateway API enabled in our cluster. You can simply do that by running:

```
kubectl apply -k
```

It will install the Flagger components and CRDs into the flagger-system namespace.

Bootstrap the environment

Next we will bootstrap the environment with:

a K8S gateway for dev and prod (Using an Internal L7 LB)
a Metric Template for querying the Success Rate (we are going to take a look at this later)
and a Canary Release object for Flagger (will be explained further down as well)

```
kubectl apply -f bootstrap.yaml
```

We need to fetch the IP address of the gateway for DNS setup (it might take a few minutes for it to show up):

```
kubectl get app -n dev \
  -o=jsonpath="{.status.addresses[0].value}"
```

If you like, you can now go ahead and deploy directly with Skaffold:

```
skaffold run
```

You can now call the service from a VM inside the same VPC:

```
curl -H "Host:" http://<DEV_IP>
```

The response should show “Hello World!”. It might take a couple of seconds for the state to reconcile and the backend to become healthy:

Create a Cloud Deploy Pipeline

First, set permissions for Cloud Deploy and apply the pipeline. To reduce complexity, this example uses a simplified IAM configuration based on the default Compute Engine service account. To improve security, you should use a custom service account when you set this up for production usage.

```
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT_ID \
  --member=serviceAccount:$(gcloud projects describe $GOOGLE_CLOUD_PROJECT_ID \
  --format="value(projectNumber)")[email protected] \
  --role="roles/clouddeploy.jobRunner"

gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT_ID \
  --member=serviceAccount:$(gcloud projects describe $GOOGLE_CLOUD_PROJECT_ID \
  --format="value(projectNumber)")[email protected] \
  --role="roles/container.developer"

sed -i "s/GOOGLE_CLOUD_PROJECT_ID/$GOOGLE_CLOUD_PROJECT_ID/g" clouddeploy.yaml

sed -i "s/GOOGLE_CLOUD_REGION/$GOOGLE_CLOUD_REGION/g" clouddeploy.yaml

gcloud deploy apply --file clouddeploy.yaml \
  --region=$GOOGLE_CLOUD_REGION --project=$GOOGLE_CLOUD_PROJECT_ID
```

Next, create a new release for deployment to prod with Cloud Deploy:

```
skaffold build -p prod
gcloud deploy releases create release-001 \
  --project=$GOOGLE_CLOUD_PROJECT_ID --region=$GOOGLE_CLOUD_REGION \
  --delivery-pipeline=canary \
  --images=skaffold-kustomize=$(skaffold build -q | jq -r ".builds[].tag")
```
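As a side note, the jq expression in the release command deserves a quick look: `skaffold build -q` prints a JSON description of the built images, and jq pulls out the fully qualified tag. A minimal sketch with a made-up sample of that JSON (the image name and tag below are illustrative, not real output):

```shell
# Simulate the JSON that `skaffold build -q` emits and extract the tag,
# exactly as the --images flag above does.
sample='{"builds": [{"imageName": "skaffold-kustomize", "tag": "repo/app:v1-abc123"}]}'
tag=$(echo "$sample" | jq -r '.builds[].tag')
echo "$tag"   # prints "repo/app:v1-abc123"
```

This is why jq is listed as a prerequisite at the start of the post.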

Next, we promote the release to prod. This step can also take some time to complete:

```
gcloud deploy releases promote --release=release-001 \
  --project=$GOOGLE_CLOUD_PROJECT_ID --region=$GOOGLE_CLOUD_REGION \
  --delivery-pipeline=canary --to-target=prod
```

Let’s fetch the IP for the prod gateway:

```
kubectl get app -n prod \
  -o=jsonpath="{.status.addresses[0].value}"
```

And curl the prod gateway from a VM inside the cluster’s VPC:

```
curl -H 'Host:' http://<PROD_IP>
```

Once the deployment is finished you should see “Hello World!” again. Since there wasn’t a version of the prod deployment already running, it “skipped” the canary step. 

Canary Deployment

So let’s try the canary functionality. Make a small change in the “app/main.go” file, for example by adding your name to the output string in line 27, and then deploy the new version directly to prod, skipping the dev stage (which you shouldn’t do in a real production scenario, of course).

```
skaffold build -p prod
gcloud deploy releases create release-002 \
  --project=$GOOGLE_CLOUD_PROJECT_ID --region=$GOOGLE_CLOUD_REGION \
  --delivery-pipeline=canary --to-target=prod \
  --images=skaffold-kustomize=$(skaffold build -q | jq -r ".builds[].tag")
```

You can observe the canary process using:

```
kubectl -n prod describe canary/app
```

Now when you curl the prod gateway again, you should see a mixture of messages, with the ratio shifting according to how far the release has progressed.

You can also check the traffic split directly on the GCLB URL map:

```
# First fetch the url-map name (since the name is generated), we need the part after the last '/'
kubectl get app -n prod \
  -o=jsonpath="{.metadata.annotations.networking\.gke\.io/url-maps}"
# then
gcloud compute url-maps export <URL_MAP> \
  --region=$GOOGLE_CLOUD_REGION --project=$GOOGLE_CLOUD_PROJECT_ID
```

The result should look something like this:
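The original screenshot of the exported URL map is not reproduced here. As a rough sketch of what to expect (the backend service names, host and weights below are illustrative, not actual output), the interesting part is the weightedBackendServices section that reflects the current canary split:

```yaml
# Illustrative excerpt of `gcloud compute url-maps export` output
pathMatchers:
- name: matcher1
  defaultRouteAction:
    weightedBackendServices:
    - backendService: .../backendServices/app-primary
      weight: 90
    - backendService: .../backendServices/app-canary
      weight: 10
```

The two weights change as Flagger progresses through the canary analysis.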

The application also contains a failing endpoint that returns a 500 Internal Server Error. How about you make another small change to app/main.go to trigger a new deployment and observe how Flagger stops the rollout of the new version due to the lower request success rate?
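The rollback bookkeeping behind that experiment can be sketched in a few lines of shell. This is an assumed simplification of Flagger's behavior, not its actual code: each analysis interval the measured success rate is compared against the minimum, and once `threshold` checks have failed, the release is rolled back. The sampled rates below are made up:

```shell
# Simplified rollback logic: roll back after `threshold` failed metric checks.
threshold=5   # from the Canary manifest's analysis.threshold
min=90        # minimum success rate in percent (thresholdRange min: 0.9)
failed=0
for rate in 95 85 80 70 65 50; do   # hypothetical success rate per interval
  if [ "$rate" -lt "$min" ]; then
    failed=$((failed + 1))          # another failed check
  else
    failed=0                        # healthy interval
  fi
  if [ "$failed" -ge "$threshold" ]; then
    echo "rollback after $failed failed checks"
    break
  fi
done
```

With these sample rates, the fifth consecutive failing interval triggers the rollback, mirroring what you would see in the `kubectl describe canary/app` events.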

How it works

Inside the bootstrap.yaml, we defined a simple PromQL query for the success rate of the app; requests without a 200 status code are taken as failed. In the Flagger Canary object we reference this metric and require a minimum success rate of 90% (thresholdRange min: 0.9):

```
1 - (
  sum(
    rate(
      promhttp_metric_handler_requests_total{
        namespace="{{ namespace }}",
        pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)",
        code!="200"
      }[{{ interval }}]
    )
  )
  /
  sum(
    rate(
      promhttp_metric_handler_requests_total{
        namespace="{{ namespace }}",
        pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
      }[{{ interval }}]
    )
  )
)
```
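For context, Flagger expects such a query to be wrapped in a MetricTemplate object. A minimal sketch of what that wrapper looks like follows; the provider address is an assumption based on the gmp-frontend service deployed earlier, and the query body is elided:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: success-rate
  namespace: prod
spec:
  provider:
    type: prometheus
    # assumed in-cluster address of the GMP query frontend deployed above
    address: http://gmp-frontend.prod.svc.cluster.local:9090
  query: |
    # ... the success-rate PromQL query shown above ...
```

The Canary object then references this template by name from its analysis.metrics list.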

We also defined a Flagger Canary object where this success-rate query is referenced. The Canary object observes new app deployments and intercepts the routing configuration to gradually shift traffic to the new version instead of shifting all traffic at once:

```
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app
  namespace: prod
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-prod
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # service port number
    port: 8080
    # container port number or name (optional)
    targetPort: 8080
    # Gateway API HTTPRoute host names
    hosts:
    -
    # Reference to the Gateway that the generated HTTPRoute would attach to.
    gatewayRefs:
    - name: app
      namespace: prod
  analysis:
    # schedule interval (default 60s)
    interval: 60s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
    - name: success-rate
      templateRef:
        name: success-rate
        namespace: prod
      thresholdRange:
        min: 0.9
      interval: 1m
```
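Taken together, the analysis settings imply a concrete promotion schedule. A quick sketch, assuming every metric check passes (one check per 60s interval):

```shell
# Promotion schedule implied by interval: 60s, stepWeight: 10, maxWeight: 50.
step=10   # stepWeight
max=50    # maxWeight
weight=0
minute=0
while [ "$weight" -lt "$max" ]; do
  weight=$((weight + step))
  minute=$((minute + 1))   # one 60s analysis interval per step
  echo "minute $minute: canary=${weight}% primary=$((100 - weight))%"
done
```

So a healthy release reaches the 50% maximum weight after roughly five minutes, after which Flagger promotes the canary and routes all traffic to the new version.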

So that’s it: now you have a nice little canary Continuous Delivery pipeline up and running on Google Cloud with the new Gateway API and Google Cloud Deploy.

Next Steps

If you would like to learn more about CI/CD on Google Cloud I would recommend the following articles: 

Building a secure CI/CD pipeline using Google Cloud built-in services contains a great end-to-end example for CI/CD on Google Cloud
Introducing Software Delivery Shield for end-to-end software supply chain security shows how Google helps you protect your software delivery
The evolution of Kubernetes networking with the GKE Gateway controller gives an overview of GKE Gateway API and its capabilities.
