Friday, April 19, 2024

Auditing GKE Clusters across the entire organization


Companies moving to the cloud and running containers are often looking for elasticity. The ability to scale up or down as needed means paying only for the resources actually used. Automation lets engineers focus on applications rather than on the infrastructure. These are key features of cloud-native, managed container orchestration platforms like Google Kubernetes Engine (GKE).

GKE clusters leverage Google Cloud to achieve best-in-class security and scalability. They come with two modes of operation and many advanced features. In Autopilot mode, clusters use more automation to reduce operational cost, though with fewer configuration options. For use cases where you need more flexibility, the Standard mode offers greater control and configuration options. Irrespective of the selected operational mode, there are always recommended, GKE-specific features and best practices to adopt. The official product documentation provides comprehensive descriptions and lists these best practices.

But how do you ensure that your clusters are following them? Did you consider configuring the Google Groups for RBAC feature to make Kubernetes user management easier? Or did you remember to enable NodeLocal DNSCache on Standard GKE clusters to improve DNS lookup times?

Lapses in GKE cluster configuration may lead to reduced scalability or security. Over time, this may erode the benefits of using the cloud and a managed Kubernetes platform. Thus, keeping an eye on cluster configuration is an important task! There are many solutions for enforcing policies on resources inside a cluster, but only a few address the clusters themselves. Organizations that have implemented an infrastructure-as-code approach may apply controls there. Yet this requires change validation processes and code coverage for the entire infrastructure. Creating GKE-specific policies also needs time investment and product expertise. And even then, there is often a need to check the configuration of running clusters (i.e. for auditing purposes).

Automating cluster checks

GKE Policy Automation is a tool that checks all clusters in your Google Cloud organization. It comes with a comprehensive library of codified cluster configuration policies. These follow the best practices and recommendations from the Google product and Professional Services teams. Both the tool and the policy library are free and released as an open-source project on GitHub. The solution does not need any modifications on the clusters to operate. It is simple and secure to use, leveraging read-only access to cluster data via Google Cloud APIs.

You can use GKE Policy Automation to run a one-time manual check, or in an automated and serverless way for continuous verification. The second approach discovers your clusters and checks on a regular basis whether they comply with the defined policies.
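For a one-time manual check, the tool's CLI can be run directly from a workstation. A minimal sketch, assuming the flags documented in the project's README (exact flag names and the cluster identifiers may differ between releases):

```shell
# Ad-hoc check of a single cluster; project, location and name
# values are illustrative placeholders.
./gke-policy check \
  --project my-project \
  --location europe-west2 \
  --name my-cluster
```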

After successful cluster identification, the tool pulls information using the Kubernetes Engine API. Future releases will support more data inputs to cover additional cluster validation use cases, like scalability limit checks.

The GKE Policy Automation engine evaluates the gathered data against the set of codified policies, originating from Google's GitHub repository by default; users can also specify their own repositories. This is useful for adding custom policies, or in cases where public repository access is not allowed.

The tool supports a variety of ways to store the policy check results. Besides the console output, it can save the results in JSON format to Cloud Storage or publish them to Pub/Sub. Although those are good cloud integration patterns, they require further JSON data processing. We recommend leveraging the GKE Policy Automation integration with the Security Command Center.

The Security Command Center (SCC) is Google Cloud's centralized vulnerability and threat reporting service. GKE Policy Automation registers itself there as an additional source of findings. Then, for each cluster evaluation, the tool creates new findings or updates existing ones. This brings along all SCC features, like finding visualization and management. The cluster check findings are also subject to the configured SCC notifications.

In the next sections we will show how to run GKE Policy Automation in a serverless way. The solution leverages cluster discovery mechanisms and the Security Command Center integration.

Continuous cluster evaluation

GKE Policy Automation comes with sample Terraform code that creates the infrastructure for serverless operation. The picture below shows the overall architecture of this solution.

Serverless deployment of GKE Policy Automation tool

The serverless GKE Policy Automation solution uses a containerized version of the tool. 

The Cloud Run Jobs service executes the container as a job to perform cluster checks. This happens in configurable intervals, triggered by the Cloud Scheduler.

The solution discovers GKE clusters running in your organization using Cloud Asset Inventory.

On each run, the solution gathers cluster data and evaluates configuration policies against them. The policies originate from the Google Github repository by default or from user specified repositories.

At the end, the tool sends evaluation results to the Security Command Center as findings.

GKE Policy Automation container images are available in the GitHub Container Registry. To run containers with Cloud Run, the images have to be either built in the cloud or copied there. The provided Terraform code provisions an Artifact Registry repository for this purpose. The following sections of this post describe how to copy the GKE Policy Automation image to Google Cloud.
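Cloud Run pulls the image using the standard Artifact Registry naming scheme, REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG. As a quick sanity check before the copy steps, the full image path can be composed from the environment variables used later (the project and region values are illustrative):

```shell
# Compose the Artifact Registry image path used in the later steps;
# adjust project ID and region to your environment.
GKE_PA_PROJECT_ID=gke-policy-123
GKE_PA_REGION=europe-west2
IMAGE="${GKE_PA_REGION}-docker.pkg.dev/${GKE_PA_PROJECT_ID}/gke-policy-automation/gke-policy-automation:latest"
echo "$IMAGE"
```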


To deploy the solution, you will need:

An existing Google Cloud project for GKE Policy Automation resources

Terraform version >= 1.3

A clone of the GKE Policy Automation GitHub repository

The operator will also need sufficient IAM roles to create the necessary resources:

roles/editor role or equivalent on the GKE Policy Automation project

roles/iam.securityAdmin role or equivalent on Google Cloud organization – to set IAM policies for Asset Inventory or Security Command Center

Adjusting variables

The Terraform code needs inputs to provision the desired infrastructure. In our scenario, GKE Policy Automation will check clusters in the entire organization. A dedicated IAM service account will be created for running the tool. The account needs the following IAM roles on the Google Cloud organization level:

roles/cloudasset.viewer to detect running GKE clusters

roles/container.clusterViewer to get GKE clusters configuration

roles/securitycenter.sourcesAdmin to register the tool as SCC source

roles/securitycenter.findingsEditor to create findings in SCC
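If you prefer to grant these roles by hand rather than through the provided Terraform, the bindings can be created with gcloud. The service account email below is an assumption based on the Terraform sample's naming, not a documented constant:

```shell
# Grant the four org-level roles to the tool's service account.
# ORG_ID and the service account email are illustrative placeholders.
ORG_ID=123456789012
SA="gke-policy-automation@gke-policy-123.iam.gserviceaccount.com"
for role in roles/cloudasset.viewer roles/container.clusterViewer \
            roles/securitycenter.sourcesAdmin roles/securitycenter.findingsEditor; do
  gcloud organizations add-iam-policy-binding "${ORG_ID}" \
    --member="serviceAccount:${SA}" --role="${role}"
done
```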

The .tfvars file below provides all of the necessary inputs to create the above role bindings. Remember to adjust the project_id, region and organization values accordingly. 

```hcl
project_id = "gke-policy-123"
region     = "europe-west2"

discovery = {
  organization = "123456789012"
}

output_scc = {
  enabled      = true
  organization = "123456789012"
}
```

The tool can also be used to check clusters in a given folder or project only. This can be useful when granting organization-wide permissions is not a viable option. In such a case, the discovery parameters in the above example need to be adjusted. Please refer to the input variables README documentation for more details.

Besides the infrastructure, we need to configure the tool itself. The repository provides an example config.yaml file, with the following content:

```yaml
silent: true
clusterDiscovery:
  enabled: true
  organization: ${SCC_ORGANIZATION}
outputs:
  - securityCommandCenter:
      provisionSource: true
      organization: ${SCC_ORGANIZATION}
```

Terraform will populate the variable values in the config.yaml file above and copy it to Secret Manager. The secret will then be mounted as a volume in the Cloud Run job container. The full documentation of the tool's configuration file is available in the GKE Policy Automation user guide.

Both configuration files should be saved in a terraform subdirectory in the GKE Policy Automation folder.

Running Terraform

Initialize Terraform by running terraform init

Create and inspect the plan by running terraform plan -out tfplan

Apply the plan by running terraform apply tfplan

Copying container image

The steps below describe how to copy the GKE Policy Automation container image from GitHub to Google Artifact Registry.

1. Set the environment variables. Remember to adjust the values accordingly.

```shell
export GKE_PA_PROJECT_ID=gke-policy-123
export GKE_PA_REGION=europe-west2
```

2. Pull the latest image

```shell
docker pull ghcr.io/google/gke-policy-automation:latest
```

3. Login

```shell
gcloud auth print-access-token | docker login -u oauth2accesstoken \
  --password-stdin https://${GKE_PA_REGION}-docker.pkg.dev
```

4. Tag the container image

```shell
docker tag ghcr.io/google/gke-policy-automation:latest \
  ${GKE_PA_REGION}-docker.pkg.dev/${GKE_PA_PROJECT_ID}/gke-policy-automation/gke-policy-automation:latest
```

5. Push the container image

```shell
docker push ${GKE_PA_REGION}-docker.pkg.dev/${GKE_PA_PROJECT_ID}/gke-policy-automation/gke-policy-automation:latest
```

Creating Cloud Run job

At the time of writing this article, Google's Terraform provider does not yet support Cloud Run Jobs.

Therefore, the Cloud Run job has to be created manually. The command below uses gcloud and the environment variables defined earlier.

```shell
gcloud beta run jobs create gke-policy-automation \
  --image ${GKE_PA_REGION}-docker.pkg.dev/${GKE_PA_PROJECT_ID}/gke-policy-automation/gke-policy-automation:latest \
  --command=/gke-policy,check \
  --args=-c,/etc/secrets/config.yaml \
  --set-secrets /etc/secrets/config.yaml=gke-policy-automation:latest \
  --service-account=gke-policy-automation@${GKE_PA_PROJECT_ID}.iam.gserviceaccount.com \
  --set-env-vars=GKE_POLICY_LOG=INFO \
  --region=${GKE_PA_REGION} \
  --project=${GKE_PA_PROJECT_ID}
```

Observing the results

The configured Cloud Scheduler will run the GKE Policy Automation job once per day. To observe results immediately, we advise running the job manually. Successful Cloud Run job executions can be viewed in the Cloud Console, as in the example below.

GKE Policy Automation Cloud Run jobs
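To trigger a run without waiting for the scheduler, the job can also be executed from the command line (a sketch using the same environment variables as before; the jobs command was in beta at the time of writing):

```shell
# Manually execute the Cloud Run job created above.
gcloud beta run jobs execute gke-policy-automation \
  --region=${GKE_PA_REGION} \
  --project=${GKE_PA_PROJECT_ID}
```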

Security Command Center Integration

Once the GKE Policy Automation job runs successfully, it will produce findings in the Security Command Center for the discovered GKE clusters. It is possible to view them in the SCC Findings view as shown below. Additionally, findings are available via the SCC API and subject to configured SCC Pub/Sub notifications. This makes it possible to leverage any existing SCC integrations, e.g. with the systems used by your Security Operations Center teams.

GKE Policy Automation Findings in Security Command Center
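As a rough sketch of working with the findings via the SCC API through gcloud (the organization ID is illustrative, and SOURCE_ID is a placeholder you would look up with the first command):

```shell
# List SCC sources to find the GKE Policy Automation source ID,
# then list that source's active findings. IDs are placeholders.
gcloud scc sources list 123456789012
gcloud scc findings list 123456789012 \
  --source=SOURCE_ID \
  --filter='state="ACTIVE"'
```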

Selecting a specific finding navigates to the detailed finding view, where all the finding attributes are shown. Additionally, the Source Properties tab contains the GKE Policy Automation specific properties. Those include:

Evaluated policy file name

Recommended actions and external documentation reference

A mapping to the Center for Internet Security (CIS) GKE Benchmark for security-related findings

The below example shows source properties of a GKE Policy Automation finding in SCC.

GKE Policy Automation Finding Source Properties

GKE Policy Automation updates existing SCC findings during each run. A finding is set to inactive when the corresponding policy becomes valid for a given cluster. Likewise, an existing inactive finding is reactivated when the policy is violated again.

For the cases when some policies are not relevant for given clusters, we recommend using the Security Command Center mute feature. For example, if the Binary Authorization policy is not relevant for development clusters, a muting rule with a development project identifier can be created.
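Such a mute rule can be created with gcloud as well. A sketch under stated assumptions: the rule name, finding category and project display name below are illustrative placeholders, and the exact filter fields should be checked against the SCC mute configuration documentation:

```shell
# Create an SCC mute rule for Binary Authorization findings on a
# development project (all values are illustrative).
gcloud scc muteconfigs create mute-binauthz-dev \
  --organization=123456789012 \
  --description="Binary Authorization not required on dev clusters" \
  --filter='category="BINARY_AUTHORIZATION" AND resource.project_display_name="dev-project"'
```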


In this article we have shown how to establish GKE cluster governance for a Google Cloud organization using GKE Policy Automation, an open-source tool created by the Google Professional Services team. Together with Google Cloud serverless products like Cloud Run Jobs, the tool lets you build a fully automated solution, and the Security Command Center integration gives you the possibility to process GKE policy evaluation results in a unified, cloud-native way.



