AWS recently released the new Amazon SageMaker Operators for Kubernetes using the AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources like databases or message queues simply by using the Kubernetes API. The new SageMaker ACK Operators make it easier for machine learning (ML) developers and data scientists who use Kubernetes as their control plane to train, tune, and deploy ML models in Amazon SageMaker without signing in to the SageMaker console.
Kubernetes and SageMaker
Building scalable ML workflows involves many iterative steps, including sourcing and preparing data, building ML models, training and evaluating these models, deploying them to production, and monitoring workloads after deployment.
SageMaker is a fully managed service designed and optimized specifically for managing these ML workflows. It removes the undifferentiated heavy lifting of infrastructure management and eliminates the need to invest in IT and DevOps to manage clusters for ML model building, training, and inference. Compute resources are only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, thereby providing near 100% utilization. SageMaker provides many performance and cost optimizations for distributed training, spot training, automatic model tuning, inference latency, and multi-model endpoints.
Many AWS customers who have portability requirements, or who run hybrid cloud or on-premises environments, use Kubernetes, an open-source, general-purpose container orchestration system, to set up repeatable ML pipelines that run training and inference workloads. However, to support ML workloads, these developers still need to write custom code to optimize the underlying ML infrastructure, provide high availability and reliability, provide data science productivity tools, and comply with appropriate security and regulatory requirements. Kubernetes customers therefore want to use fully managed ML services such as SageMaker for cost-optimized and managed infrastructure, but want platform and infrastructure teams to continue using Kubernetes for orchestration and managing pipelines to retain standardization and portability.
To address this need, AWS allows you to train, tune, and deploy models in SageMaker by using the new SageMaker ACK Operators, which include a set of custom resource definitions for SageMaker resources that extend the Kubernetes API. With the SageMaker ACK Operators, you can take advantage of fully managed SageMaker infrastructure, tools, and optimizations natively from Kubernetes.
How did we get here?
In late 2019, AWS introduced the SageMaker Operators for Kubernetes to enable developers and data scientists to manage the end-to-end SageMaker training and production lifecycle using Kubernetes as the control plane. SageMaker operators were installed from the GitHub repo by downloading a YAML configuration file that configured your Kubernetes cluster with the custom resource definitions and operator controller service.
In 2020, AWS introduced ACK to facilitate a Kubernetes-native way of managing AWS Cloud resources. ACK includes a common controller runtime, a code generator, and a set of AWS service-specific controllers, one of which is the SageMaker controller.
Going forward, new functionality will be added to the SageMaker Operators for Kubernetes through the ACK project.
How does ACK work?
The following diagram illustrates how ACK works.
In this example, Alice is a Kubernetes user. She wants to run model training on SageMaker from within the Kubernetes cluster using the Kubernetes API. Alice issues a call to kubectl apply, passing in a file, called a manifest, that defines a Kubernetes custom resource for her SageMaker training job. kubectl apply sends this manifest to the Kubernetes API server running on the Kubernetes controller node (Step 1 in the workflow diagram).
The Kubernetes API server receives the manifest with the SageMaker training job specification and determines whether Alice has permissions to create a custom resource of kind sagemaker.services.k8s.aws/TrainingJob, and whether the custom resource is properly formatted (Step 2).
If Alice is authorized and the custom resource is valid, the Kubernetes API server writes (Step 3) the custom resource to its etcd data store and then responds back (Step 4) to Alice that the custom resource has been created.
The SageMaker controller, which is running on a Kubernetes worker node within the context of a normal Kubernetes Pod, is notified (Step 5) that a new custom resource of kind sagemaker.services.k8s.aws/TrainingJob has been created.
The SageMaker controller then communicates (Step 6) with the SageMaker API, calling the SageMaker CreateTrainingJob API to create the training job in AWS. After communicating with the SageMaker API, the SageMaker controller calls the Kubernetes API server to update (Step 7) the custom resource’s status with information it received from SageMaker. The SageMaker controller therefore provides developers with the same information that they would have received using the AWS SDK. This results in a better and more consistent developer experience.
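The seven steps above can be traced with a toy, in-memory model. This is only a sketch for building intuition: FakeAPIServer, FakeController, and FakeSageMakerAPI are hypothetical stand-ins invented for this illustration, not part of Kubernetes, ACK, or any AWS SDK, and the real components communicate over the network rather than through Python method calls.

```python
# Toy model of the ACK request flow (Steps 1-7). All classes are hypothetical.

class FakeSageMakerAPI:
    """Stand-in for the SageMaker API; it only records the jobs it is given."""

    def __init__(self):
        self.jobs = {}

    def create_training_job(self, name, spec):
        self.jobs[name] = spec
        return {"TrainingJobStatus": "InProgress"}


class FakeAPIServer:
    """Stand-in for the Kubernetes API server and its etcd store."""

    def __init__(self, authorized_users):
        self.authorized_users = set(authorized_users)
        self.store = {}   # plays the role of etcd
        self.events = []  # names of newly created resources

    def apply(self, user, manifest):
        # Step 2: check authorization and basic formatting.
        if user not in self.authorized_users:
            raise PermissionError(f"{user} may not create this resource")
        if manifest.get("kind") != "TrainingJob" or "spec" not in manifest:
            raise ValueError("malformed custom resource")
        name = manifest["metadata"]["name"]
        # Step 3: persist the custom resource.
        self.store[name] = {**manifest, "status": {}}
        self.events.append(name)
        # Step 4: tell the caller that the resource was created.
        return f"trainingjob/{name} created"


class FakeController:
    """Stand-in for the SageMaker controller Pod."""

    def __init__(self, api_server, sagemaker):
        self.api_server = api_server
        self.sagemaker = sagemaker

    def reconcile_pending(self):
        # Step 5: pick up notifications about new custom resources.
        while self.api_server.events:
            name = self.api_server.events.pop(0)
            resource = self.api_server.store[name]
            # Step 6: call the (fake) SageMaker API.
            result = self.sagemaker.create_training_job(name, resource["spec"])
            # Step 7: write what SageMaker reported back into the status.
            resource["status"]["trainingJobStatus"] = result["TrainingJobStatus"]


api_server = FakeAPIServer(authorized_users={"alice"})
sagemaker = FakeSageMakerAPI()
controller = FakeController(api_server, sagemaker)

manifest = {
    "apiVersion": "sagemaker.services.k8s.aws/v1alpha1",
    "kind": "TrainingJob",
    "metadata": {"name": "demo-job"},
    "spec": {"trainingJobName": "demo-job"},
}
reply = api_server.apply("alice", manifest)  # Step 1: kubectl apply
controller.reconcile_pending()
print(reply)
print(api_server.store["demo-job"]["status"])
```

The point of the split between apply and reconcile_pending is that Alice receives her acknowledgment before the controller has contacted SageMaker; the job's progress only becomes visible later, through the status field of the custom resource.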
Machine learning use case
For this post, we follow the SageMaker example provided in the following notebook. However, you can reuse the components in this example with your preference of SageMaker built-in or custom algorithms and your own datasets.
We use the Abalone dataset originally from the UCI data repository [1]. In the libsvm converted version, the nominal feature (male/female/infant) has been converted into a real-valued feature. The age of the abalone is to be predicted from eight physical measurements. This dataset is already processed and stored in Amazon Simple Storage Service (Amazon S3). We train an XGBoost model on the UCI Abalone dataset to replicate the flow in the example Jupyter notebook.
Prerequisites
For this walkthrough, you should have the following prerequisites:
An AWS account.
An existing Amazon Elastic Kubernetes Service (Amazon EKS) cluster. It should be Kubernetes version 1.16+. For automated cluster creation using eksctl, see Getting started with Amazon EKS – eksctl and create your cluster with Amazon EC2 Linux managed nodes.
Install the following tools on the client machine used to access your Kubernetes cluster (you can use AWS Cloud9, a cloud-based integrated development environment (IDE) for the Kubernetes cluster setup):
kubectl – A command line tool for working with Kubernetes clusters.
Helm version 3.7+ – A tool for installing and managing Kubernetes applications.
AWS Command Line Interface (AWS CLI) – A command line tool for interacting with AWS services.
eksctl – A command line tool for working with Amazon EKS clusters that automates many individual tasks.
yq – A command line YAML processor. (For Linux environments, use the wget plain binary installation.)
Set up IAM role-based authentication for the controller Pod
IAM roles for service accounts (IRSA) allows fine-grained roles at the Kubernetes Pod level by combining an OpenID Connect (OIDC) identity provider with Kubernetes service account annotations. In this section, we associate the Amazon EKS cluster with an OIDC provider and create an AWS Identity and Access Management (IAM) role that is assumed by the ACK controller Pod via its service account to access AWS services.
Create a cluster and OIDC ID provider
Make sure you’re connected to the right cluster. Substitute the values for CLUSTER_NAME and CLUSTER_REGION below:
Set up the OIDC ID provider (IdP) in AWS and associate it with your Amazon EKS cluster:
Get the identity issuer URL by running the following code:
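As a hedged sketch of these three steps, the helper below assembles the corresponding aws and eksctl invocations as argument lists without running them. The subcommands and flags are standard AWS CLI and eksctl options, but the cluster name, Region, and issuer URL used here are placeholders, and the function names are hypothetical. The second helper strips the scheme from the issuer URL, because IAM refers to the OIDC provider without the https:// prefix.

```python
import shlex


def cluster_setup_commands(cluster_name, cluster_region):
    """Return the CLI invocations as argument lists; nothing is executed."""
    return [
        # Point kubectl at the intended cluster.
        ["aws", "eks", "update-kubeconfig",
         "--name", cluster_name, "--region", cluster_region],
        # Associate an IAM OIDC identity provider with the cluster.
        ["eksctl", "utils", "associate-iam-oidc-provider",
         "--cluster", cluster_name, "--region", cluster_region, "--approve"],
        # Read the cluster's OIDC issuer URL.
        ["aws", "eks", "describe-cluster",
         "--name", cluster_name, "--region", cluster_region,
         "--query", "cluster.identity.oidc.issuer", "--output", "text"],
    ]


def oidc_provider_from_issuer(issuer_url):
    """Drop the https:// scheme so the value can be used in IAM ARNs."""
    prefix = "https://"
    if issuer_url.startswith(prefix):
        return issuer_url[len(prefix):]
    return issuer_url


# Placeholder values; substitute your own CLUSTER_NAME and CLUSTER_REGION.
commands = cluster_setup_commands("my-cluster", "us-east-1")
for command in commands:
    print(shlex.join(command))

# Placeholder issuer URL in the shape returned by describe-cluster.
provider = oidc_provider_from_issuer(
    "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789"
)
print(provider)
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run once the placeholder values have been replaced.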
Set up an IAM role
Next, let’s set up the IAM role that defines the access to the SageMaker and Application Auto Scaling services. For this, we also need to have an IAM trust policy in place, allowing the specified Kubernetes service account (for example, ack-sagemaker-controller) to assume the IAM role.
Create a file named trust.json and insert the following trust relationship code block required for IAM role:
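A trust relationship for IAM roles for service accounts generally has the shape sketched below: it federates to the cluster's OIDC provider and restricts role assumption to named service accounts. This is a minimal sketch under stated assumptions, not the exact file from this walkthrough. The account ID and OIDC provider ID are placeholders, the ack-system namespace matches the namespace used later in this post, and the second service account name (ack-applicationautoscaling-controller) is an assumption.

```python
import json


def build_trust_policy(account_id, oidc_provider, namespace, service_accounts):
    """Build an IRSA-style trust policy for the given service accounts."""
    subjects = [
        f"system:serviceaccount:{namespace}:{name}" for name in service_accounts
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": (
                        f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                    )
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                        f"{oidc_provider}:sub": subjects,
                    }
                },
            }
        ],
    }


trust_policy = build_trust_policy(
    account_id="111122223333",  # placeholder account ID
    oidc_provider="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789",
    namespace="ack-system",
    service_accounts=[
        "ack-sagemaker-controller",
        "ack-applicationautoscaling-controller",  # assumed name
    ],
)
print(json.dumps(trust_policy, indent=2))
```

Redirecting the printed JSON to trust.json produces a file that aws iam create-role can consume through its --assume-role-policy-document option.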
Updating an Application Auto Scaling Scalable Target requires additional permissions. First, create a service-linked role for Application Auto Scaling.
Create a file named pass_role_policy.json to create the policy required for the IAM role.
Run the following command to create a role with the trust relationship defined in trust.json. This trust relationship is required so that Amazon EKS (via a webhook) can inject the necessary environment variables and mount volumes into the Pod that are required by the AWS SDK to assume this role.
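The role setup described in this section can be sketched as follows. Everything here is illustrative: the role name and inline policy name are hypothetical, attaching the AmazonSageMakerFullAccess managed policy is an assumption about which permissions the controller needs, and allowing iam:PassRole on every resource is deliberately broad and worth narrowing in a real account. The AWS CLI subcommands themselves (create-service-linked-role, create-role, attach-role-policy, put-role-policy) are standard.

```python
import json
import shlex


def pass_role_policy():
    """Policy letting the controller pass an execution role to SageMaker."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Broad on purpose for the sketch; restrict Resource in practice.
            {"Effect": "Allow", "Action": "iam:PassRole", "Resource": "*"}
        ],
    }


def iam_role_commands(role_name, inline_policy_name):
    """Return the IAM CLI invocations as argument lists; nothing is executed."""
    return [
        # Service-linked role used by Application Auto Scaling for endpoints.
        ["aws", "iam", "create-service-linked-role", "--aws-service-name",
         "sagemaker.application-autoscaling.amazonaws.com"],
        # Role that the controller Pods assume through trust.json.
        ["aws", "iam", "create-role", "--role-name", role_name,
         "--assume-role-policy-document", "file://trust.json"],
        # Assumed managed policy granting access to SageMaker.
        ["aws", "iam", "attach-role-policy", "--role-name", role_name,
         "--policy-arn", "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"],
        # Inline policy read from pass_role_policy.json.
        ["aws", "iam", "put-role-policy", "--role-name", role_name,
         "--policy-name", inline_policy_name,
         "--policy-document", "file://pass_role_policy.json"],
    ]


policy = pass_role_policy()
role_commands = iam_role_commands("ack-controller-role", "ack-pass-role")
print(json.dumps(policy, indent=2))
for command in role_commands:
    print(shlex.join(command))
```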
Install SageMaker and Application Auto Scaling controllers
Choose an AWS Region for the SageMaker and automatic scaling resources we create in this post. For convenience, we recommend using us-east-1:
Now, let’s install the SageMaker and Application Auto Scaling controllers using the following helper script. This script pulls the Helm charts from ACK’s public Amazon Elastic Container Registry (Amazon ECR) repository and configures the AWS account, the default Region in which resources are created, and the IAM role (created in the previous step) on the service account that the controller Pod uses to assume the role. Create a file named install-controllers.sh and insert the following code block:
Run the script:
The output contains the following:
Next, we run the following commands to verify custom resource definitions were applied and controller Pods are running:
The output of the command should contain a number of custom resource definitions related to SageMaker (such as trainingjobs or endpoint) and Application Auto Scaling (such as scalingpolicies and scalabletargets):
We see one controller Pod per service running in the ack-system namespace:
Prepare SageMaker resources
Next, we create an S3 bucket and IAM role for SageMaker.
To train a model with SageMaker, we need an S3 bucket to store the dataset and artifacts from the training process. We simply use the preprocessed dataset at s3://sagemaker-sample-files/datasets/tabular/uci_abalone [1].
Let’s create a variable for the S3 bucket:
Create a file named create-bucket.sh and insert the following code block:
Run the script to create the S3 bucket and copy the dataset:
The SageMaker training job that we run later in the post needs an IAM role to access Amazon S3 and SageMaker. Run the following commands to create a SageMaker execution IAM role that is used by SageMaker to access AWS resources:
Note down the execution role ARN to use in later steps.
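For orientation, a SageMaker execution role differs from the controller role in its trust relationship: it trusts the SageMaker service principal instead of an OIDC provider. The sketch below shows that trust document and how the role ARN is composed. The account ID and role name are placeholders, and the helper names are hypothetical.

```python
import json


def sagemaker_assume_role_policy():
    """Trust policy that lets the SageMaker service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def execution_role_arn(account_id, role_name):
    """Compose the ARN of an IAM role that has no path prefix."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


assume_role_policy = sagemaker_assume_role_policy()
role_arn = execution_role_arn("111122223333", "ack-sagemaker-execution-role")
print(json.dumps(assume_role_policy, indent=2))
print(role_arn)
```

The ARN printed on the last line is the value that the training job and model specifications expect in their role fields.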
Train an XGBoost model
Now, we create a training.yaml file to specify the parameters for a SageMaker training job. SageMaker training jobs enable remote training of ML models. You can customize each training job to run your own ML scripts with custom architectures, data loaders, hyperparameters, and more. To submit a SageMaker training job, we require a job name. Let’s create that variable first:
In the following code, we create a training.yaml file that contains the hyperparameters for the training job as well as the location of the training and validation data. It’s also where we specify the Amazon ECR image used for training.
Note: If your $SERVICE_REGION isn’t us-east-1, change the following image URI. For the XGBoost algorithm version 1.2-1 Region-specific image URI, see Docker Registry Paths and Example Code.
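To give a sense of what such a specification contains, the sketch below builds a TrainingJob custom resource as a Python dictionary and prints it as JSON, which is also valid YAML and is accepted by kubectl apply. The field names follow the ACK SageMaker TrainingJob resource as far as we know it; the hyperparameter values, instance type, S3 prefixes, bucket name, role ARN, and image URI placeholder are illustrative assumptions, not the values from this walkthrough.

```python
import json


def s3_channel(name, s3_uri):
    """Describe one libsvm input channel stored under an S3 prefix."""
    return {
        "channelName": name,
        "contentType": "text/libsvm",
        "compressionType": "None",
        "dataSource": {
            "s3DataSource": {
                "s3DataType": "S3Prefix",
                "s3URI": s3_uri,
                "s3DataDistributionType": "FullyReplicated",
            }
        },
    }


def training_job_manifest(job_name, role_arn, image_uri, bucket):
    """Build a TrainingJob custom resource; all values are illustrative."""
    prefix = f"s3://{bucket}/sagemaker/xgboost"  # hypothetical layout
    return {
        "apiVersion": "sagemaker.services.k8s.aws/v1alpha1",
        "kind": "TrainingJob",
        "metadata": {"name": job_name},
        "spec": {
            "trainingJobName": job_name,
            "roleARN": role_arn,
            # Hyperparameter values are passed as strings.
            "hyperParameters": {
                "max_depth": "5",
                "eta": "0.2",
                "objective": "reg:squarederror",
                "num_round": "50",
            },
            "algorithmSpecification": {
                "trainingImage": image_uri,
                "trainingInputMode": "File",
            },
            "outputDataConfig": {"s3OutputPath": f"{prefix}/models"},
            "resourceConfig": {
                "instanceCount": 1,
                "instanceType": "ml.m4.xlarge",
                "volumeSizeInGB": 5,
            },
            "stoppingCondition": {"maxRuntimeInSeconds": 3600},
            "inputDataConfig": [
                s3_channel("train", f"{prefix}/train"),
                s3_channel("validation", f"{prefix}/validation"),
            ],
        },
    }


training_manifest = training_job_manifest(
    job_name="xgboost-training-job-demo",
    role_arn="arn:aws:iam::111122223333:role/ack-sagemaker-execution-role",
    image_uri="<region-specific XGBoost 1.2-1 image URI>",  # left unfilled
    bucket="my-sagemaker-bucket",
)
print(json.dumps(training_manifest, indent=2))
```

Saving the printed output as training.yaml works because kubectl parses JSON documents as YAML; the image URI placeholder must first be replaced with the Region-specific value.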
Now, we can create the training job:
You should see the following output:
You can watch the status of the training job. It takes a few minutes for STATUS to show as Completed.
Deploy the results of the SageMaker training job
To deploy the model, we need to specify a model name, an endpoint config name, and an endpoint name:
We deploy this model on an ml.c5.large instance type. In the following .yaml file, we define the model, the endpoint config, and the endpoint:
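The three resources depend on each other by name: the endpoint config refers to the model, and the endpoint refers to the endpoint config. A minimal sketch of that chain, assuming the ACK field names for the Model, EndpointConfig, and Endpoint resources, is shown below. The resource names, variant name, role ARN, image URI placeholder, and model artifact path are illustrative assumptions.

```python
import json


def deployment_manifests(model_name, config_name, endpoint_name,
                         role_arn, image_uri, model_data_url):
    """Build Model, EndpointConfig, and Endpoint custom resources."""
    api_version = "sagemaker.services.k8s.aws/v1alpha1"
    model = {
        "apiVersion": api_version,
        "kind": "Model",
        "metadata": {"name": model_name},
        "spec": {
            "modelName": model_name,
            "executionRoleARN": role_arn,
            "primaryContainer": {
                "image": image_uri,
                "modelDataURL": model_data_url,
            },
        },
    }
    endpoint_config = {
        "apiVersion": api_version,
        "kind": "EndpointConfig",
        "metadata": {"name": config_name},
        "spec": {
            "endpointConfigName": config_name,
            "productionVariants": [
                {
                    "variantName": "AllTraffic",  # hypothetical variant name
                    "modelName": model_name,      # links to the Model
                    "initialInstanceCount": 1,
                    "instanceType": "ml.c5.large",
                    "initialVariantWeight": 1,
                }
            ],
        },
    }
    endpoint = {
        "apiVersion": api_version,
        "kind": "Endpoint",
        "metadata": {"name": endpoint_name},
        "spec": {
            "endpointName": endpoint_name,
            "endpointConfigName": config_name,  # links to the EndpointConfig
        },
    }
    return [model, endpoint_config, endpoint]


deploy_manifests = deployment_manifests(
    model_name="xgboost-model-demo",
    config_name="xgboost-endpoint-config-demo",
    endpoint_name="xgboost-endpoint-demo",
    role_arn="arn:aws:iam::111122223333:role/ack-sagemaker-execution-role",
    image_uri="<region-specific XGBoost 1.2-1 image URI>",  # left unfilled
    # SageMaker writes artifacts to <output path>/<job name>/output/model.tar.gz.
    model_data_url=(
        "s3://my-sagemaker-bucket/sagemaker/xgboost/models/"
        "xgboost-training-job-demo/output/model.tar.gz"
    ),
)
for manifest in deploy_manifests:
    print(json.dumps(manifest, indent=2))
```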
Now, the endpoint is ready to be deployed:
You should see the following output:
We can observe that the model and endpoint config were created. Deploying the endpoint may take some time:
We can watch this process using the following command:
After some time, the STATUS changes to InService:
This indicates the deployed endpoint is ready for use.
Verify the inference capabilities of the trained model
We invoke the model endpoint using Python to emulate a typical use case. We reuse the code in the SageMaker example notebook.
We first download the test set from Amazon S3. Then we load a single sample from the test set and use it to invoke the endpoint we deployed in the previous section. Download the test file with the following code:
Use the Python interpreter to test inference. On machines where Python is available, the interpreter is usually installed as /usr/local/bin/python<version>; adding /usr/local/bin to your Unix/Linux shell’s search path lets you start it by entering the python command.
Create a file named predict.py and insert the following code block:
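A minimal sketch of what such a script can look like follows. It assumes the test file is in libsvm format (a label followed by index:value pairs) and that the endpoint accepts the text/x-libsvm content type. To keep the sketch runnable without AWS credentials, it uses a stub in place of the real client; StubRuntimeClient, the endpoint name, and the sample record are all made up for illustration.

```python
import io


def split_libsvm_record(line):
    """Split 'label idx:value ...' into (label, feature payload)."""
    label, _, features = line.strip().partition(" ")
    return float(label), features


def predict(runtime_client, endpoint_name, payload):
    """Send one libsvm record to the endpoint and return the raw prediction."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/x-libsvm",
        Body=payload,
    )
    return float(response["Body"].read().decode("utf-8"))


class StubRuntimeClient:
    """Hypothetical stand-in for the SageMaker runtime client."""

    def __init__(self, canned_prediction):
        self.canned_prediction = canned_prediction
        self.last_request = None

    def invoke_endpoint(self, **kwargs):
        # Record the request and return a canned body instead of calling AWS.
        self.last_request = kwargs
        return {"Body": io.BytesIO(self.canned_prediction.encode("utf-8"))}


# Made-up record in libsvm layout; not a row from the Abalone test set.
sample_line = "12 1:2 2:0.5 3:0.4 4:0.1"
actual_age, payload = split_libsvm_record(sample_line)

client = StubRuntimeClient("12.7")
predicted_age = round(predict(client, "xgboost-endpoint-demo", payload))
print(f"Predicted age: {predicted_age}, actual age: {actual_age:.0f}")
```

To query the deployed endpoint, replace the stub with boto3.client("sagemaker-runtime", region_name=...) and read sample_line from the downloaded test file; invoke_endpoint takes the same keyword arguments.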
Running this sample should give us the following result:
The age of the abalone that is provided in the test example is estimated to be 13 by the ML model. The actual age was 12. This suggests that our ML model has been trained and provides reasonable predictions. However, the experienced ML user may realize that we haven’t performed hyperparameter tuning and other methods of increasing accuracy yet, which is outside the scope of this post.
Dynamically scale the endpoint according to the load
SageMaker ACK Operators support custom resource definitions for automatic scaling (using ScalableTarget and ScalingPolicy) for your hosted models. The following resources adjust the number of instances (minimum 1 to maximum 20) provisioned for a model in response to changes in the SageMakerVariantInvocationsPerInstance metric, which is the average number of times per minute that each instance for a variant is invoked:
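The two resources share a resource ID that names the endpoint variant to scale. As a hedged sketch, assuming the ACK field names for ScalableTarget and ScalingPolicy, they can be built as follows. The capacity bounds and metric come from the description above, while the endpoint name, variant name, target value, and cooldown periods are illustrative assumptions.

```python
import json

API_VERSION = "applicationautoscaling.services.k8s.aws/v1alpha1"
DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def scaling_manifests(endpoint_name, variant_name, target_invocations):
    """Build ScalableTarget and ScalingPolicy resources for one variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    scalable_target = {
        "apiVersion": API_VERSION,
        "kind": "ScalableTarget",
        "metadata": {"name": f"{endpoint_name}-scalable-target"},
        "spec": {
            "minCapacity": 1,
            "maxCapacity": 20,
            "resourceID": resource_id,
            "scalableDimension": DIMENSION,
            "serviceNamespace": "sagemaker",
        },
    }
    scaling_policy = {
        "apiVersion": API_VERSION,
        "kind": "ScalingPolicy",
        "metadata": {"name": f"{endpoint_name}-scaling-policy"},
        "spec": {
            "policyName": f"{endpoint_name}-scaling-policy",
            "policyType": "TargetTrackingScaling",
            "resourceID": resource_id,
            "scalableDimension": DIMENSION,
            "serviceNamespace": "sagemaker",
            "targetTrackingScalingPolicyConfiguration": {
                # Invocations per instance per minute to hold steady.
                "targetValue": target_invocations,
                "scaleInCooldown": 700,   # seconds, illustrative
                "scaleOutCooldown": 300,  # seconds, illustrative
                "predefinedMetricSpecification": {
                    "predefinedMetricType":
                        "SageMakerVariantInvocationsPerInstance"
                },
            },
        },
    }
    return scalable_target, scaling_policy


scalable_target, scaling_policy = scaling_manifests(
    endpoint_name="xgboost-endpoint-demo",
    variant_name="AllTraffic",
    target_invocations=60,
)
print(json.dumps(scalable_target, indent=2))
print(json.dumps(scaling_policy, indent=2))
```

With target tracking, Application Auto Scaling adds or removes instances to keep the metric near the target value, so a lower target scales out sooner under the same load.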
Apply with the following code:
You should see the following output:
We can observe that scalingpolicy was created:
The output of scalingpolicy looks like the following:
Clean up
Run the following commands to delete the resources created in this post:
Create a file named uninstall-controller.sh and insert the following code block required for deleting the controller and custom resource definitions:
Run the following commands to uninstall the controller and custom resource definitions, and delete the namespace, IAM roles, and S3 bucket you created:
Conclusion
SageMaker ACK Operators provide engineering teams with a native Kubernetes experience for creating and interacting with ML jobs on SageMaker, either with the Kubernetes API or with Kubernetes command line utilities such as kubectl. You can build automation, tooling, and custom interfaces for data scientists in Kubernetes by using these controllers, all without building, maintaining, or optimizing ML infrastructure. Data scientists and developers familiar with Kubernetes can compose and interact with fully managed SageMaker training, tuning, and inference jobs, as they would with Kubernetes jobs running locally. Logs from SageMaker jobs stream back to Kubernetes, allowing you to natively view logs for your model training, tuning, and prediction jobs in the command line.
ACK is a community-driven project and will soon include service controllers for other AWS service APIs.
Links
[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
About the Authors
Kanwaljit Khurmi is a Senior Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Suraj Kota is a Software Engineer specialized in Machine Learning infrastructure. He builds tools that make it easy to get started with and scale machine learning workloads on AWS. He worked on the AWS Deep Learning Containers, Deep Learning AMI, SageMaker Operators for Kubernetes, and other open source integrations like Kubeflow.
Archis Joglekar is an AI/ML Partner Solutions Architect in the Emerging Technologies team. He is interested in performant, scalable deep learning and scientific computing using the building blocks at AWS. His past experiences range from computational physics research to machine learning platform development in academia, national labs, and startups. His time away from the computer is spent playing soccer and with friends and family.