Sunday, June 16, 2024
No menu items!
HomeArtificial Intelligence and Machine LearningServe multiple models with Amazon SageMaker and Triton Inference Server

Serve multiple models with Amazon SageMaker and Triton Inference Server

Amazon SageMaker is a fully managed service for data science and machine learning (ML) workflows. It helps data scientists and developers prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.

In 2021, AWS announced the integration of NVIDIA Triton Inference Server in SageMaker. You can use NVIDIA Triton Inference Server to serve models for inference in SageMaker. By using an NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.

In some scenarios, users want to deploy multiple models. For example, an application for revising English composition always includes several models, such as BERT for text classification and GECToR to grammar checking. A typical request may flow across multiple models, like data preprocessing, BERT, GECToR, and postprocessing, and they run serially as inference pipelines. If these models are hosted on different instances, the additional network latency between these instances increases the overall latency. For an application with uncertain traffic, deploying multiple models on different instances will inevitably lead to inefficient utilization of resources.

Consider another scenario, in which users develop multiple models with different versions, and each model uses a different training framework. A common practice is to use multiple containers, each of which deploys a model. But this will cause increased workload and costs for development, operation, and maintenance. In this post, we discuss how SageMaker and NVIDIA Triton Inference Server can solve this problem.

Solution overview

Let’s look at how SageMaker inference works. SageMaker invokes the hosting service by running a Docker container. The Docker container launches a RESTful inference server (such as Flask) to serve HTTP requests for inference. The inference server loads the model and listens to port 8080 providing external service. The client application sends a POST request to the SageMaker endpoint, SageMaker passes the request to the container, and returns the inference result from the container to the client.

In our architecture, we use NVIDIA Triton Inference Server, which provides concurrent runs of multiple models from different frameworks, and we use a Flask server to process client-side requests and dispatch these requests to the backend Triton server. While launching a Docker container, the Triton server and Flask server are started automatically. The Triton server loads multiple models and exposes ports 8000, 8001, and 8002 as gRPC, HTTP, and metrics server. The Flask server listens to 8080 ports and parses the original request and payload, and then invokes the local Triton backend via model name and version information. For the client side, it adds the model name and model version in the request in addition to the original payload, so that Flask is able to route the inference request to the correct model on Triton server.

The following diagram illustrates this process.

A complete API call from the client is as follows:

The client assembles the request and initiates the request to a SageMaker endpoint.
The Flask server receives and parses the request, and gets the model name, version, and payload.
The Flask server assembles the request again and routes to the corresponding endpoint of the Triton server according to the model name and version.
The Triton server runs an inference request and sends responses to the Flask server.
The Flask server receives the response message, assembles the message again, and returns it to the client.
The client receives and parses the response, and continues to subsequent business procedures.

In the following sections, we introduce the steps needed to prepare a model and build the TensorRT engine, prepare a Docker image, create a SageMaker endpoint, and verify the result.

Prepare models and build the engine

We demonstrate hosting three typical ML models in our solution: image classification (ResNet50), object detection (YOLOv5), and a natural language processing (NLP) model (BERT-base). NVIDIA Triton Inference Server supports multiple formats, including TensorFlow 1. x and 2. x, TensorFlow SavedModel, TensorFlow GraphDef, TensorRT, ONNX, OpenVINO, and PyTorch TorchScript.

The following table summarizes our model details.

Model Name
Model Size
Tensor RT
Tensor RT

NVIDIA provides detailed documentation describing how to generate the TensorRT engine. To achieve best performance, the TensorRT engine must be built over the device. This means the build time and runtime require the same computer capacity. For example, a TensorRT engine built on a g4dn instance can’t be deployed on a g5 instance.

You can generate your own TensorRT engines according to your needs. For test purposes, we prepared sample codes and deployable models with the TensorRT engine. The source code is also available on GitHub.

Next, we use an Amazon Elastic Compute Cloud (Amazon EC2) G4dn instance to generate the TensorRT engine with the following steps. We use YOLOv5 as an example.

Launch a G4dn.2xlarge EC2 instance with the Deep Learning AMI (Ubuntu 20.04) in the us-east-1 Region.
Open a terminal window and use the ssh command to connect to the instance.
Run the following commands one by one:

nvidia-docker run –gpus all -it –rm -v `pwd`/workspace:/workspace
git clone -b v7.0.1
pip install seaborn
pip install onnx-simplifier
cd yolov5
python –weights –include onnx –simplify –imgsz 640 640 –device 0
onnxsim yolov5s.onnx yolov5s-sim.onnx

trtexec –onnx=yolov5s-sim.onnx –saveEngine=model.plan –explicitBatch –workspace=1024*12

Create a config.pbtxt file:

name: “yolov5s”
platform: “tensorrt_plan”
input: [
name: “images”
data_type: TYPE_FP32
dims: [1, 3, 640, 640 ]
output: [
name: “output”,
data_type: TYPE_FP32
dims: [1,25200,85 ]

Create the following file structure and put the generated files in the appropriate location:

mkdir yolov5s
mkdir -p yolov5s/1
cp config.pbtxt yolov5s
cp model.plan yolov5s/1

├── 1
│   └── model.plan
└── config.pbtxt

Test the TensorRT engine

Before we deploy to SageMaker, we start a Triton server to verify these three models are configured correctly. Use the following command to start a Triton server and load the models:

docker run –gpus all –rm -p8000:8000 -p8001:8001 -v<MODEL_ROOT_DIR>/model_repository:/models tritonserver –model-repository=/models

If you receive the following prompt message, it means the Triton server is started correctly.

Enter nvidia-smi in the terminal to see GPU memory usage.

Client implementation for inference

The file structure is as follows:

serve – The wrapper that starts the inference server. The Python script starts the NGINX, Flask, and Triton server. – The Flask implementation for /ping and /invocations endpoints, and dispatching requests. – The startup shell for the individual server workers. – The abstract method definition that each client requires to implement their inference method.
client folder – One folder per client:

nginx.conf – The configuration for the NGINX primary server.

We define an abstract method to implement the inference interface, and each client implements this method:

from abc import ABC, abstractmethod
class Base(ABC):
def inference(self,img):

The Triton server exposes an HTTP endpoint on port 8000, a gRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. The following is a sample ResNet client with a gRPC call. You can implement the HTTP interface or gRPC interface according to your use case.

from base import Base
import numpy as np
import tritonclient.grpc as grpcclient
from PIL import Image
import cv2
class Resnet(Base):
def image_transform_onnx(self, image, size: int) -> np.ndarray:
”’Image transform helper for onnx runtime inference.”’
img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) #OpenCV follows BGR convention and PIL follows RGB
image = Image.fromarray(img)
image = image.resize((size,size))

# now our image is represented by 3 layers – Red, Green, Blue
# each layer has a 224 x 224 values representing
image = np.array(image)

# dummy input for the model at export – torch.randn(1, 3, 224, 224)
image = image.transpose(2,0,1).astype(np.float32)

# our image is currently represented by values ranging between 0-255
# we need to convert these values to 0.0-1.0 – those are the values that are expected by our model
image /= 255
image = image[None, …]
return image

def inference(self, img):
INPUT_SHAPE = (224, 224)

TRITON_IP = “localhost”
MODEL_NAME = “resnet”

INPUTS.append(grpcclient.InferInput(INPUT_LAYER_NAME, [1, 3, INPUT_SHAPE[0], INPUT_SHAPE[1]], “FP32”))
OUTPUTS.append(grpcclient.InferRequestedOutput(OUTPUT_LAYER_NAME, class_count=3))
TRITON_CLIENT = grpcclient.InferenceServerClient(url=f”{TRITON_IP}:{TRITON_PORT}”)

INPUTS[0].set_data_from_numpy(self.image_transform_onnx(img, 224))

results = TRITON_CLIENT.infer(model_name=MODEL_NAME, inputs=INPUTS, outputs=OUTPUTS, headers={})
output = np.squeeze(results.as_numpy(OUTPUT_LAYER_NAME))
lista = [x.decode(‘utf-8’) for x in output.tolist()]
return lista

In this architecture, the NGINX, Flask, and Triton servers should be started at the beginning. Edit the serve file and add a line to start the Triton server.

Build a Docker image and push the image to Amazon ECR

The Docker file code looks as follows:


# Add arguments to achieve the version, python and url
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3


RUN apt-key adv –keyserver –recv-keys A4B469963BF863CC
&& apt-get update
&& apt-get install -y nginx
&& apt-get install -y libgl1-mesa-glx
&& apt-get clean
&& rm -rf /var/lib/apt/lists/*

RUN ${PIP} install -U –no-cache-dir

ldconfig &&
apt-get clean &&
apt-get autoremove &&
rm -rf /var/lib/apt/lists/* /tmp/* ~/* &&
mkdir -p /opt/program/models/

COPY sm /opt/program
COPY model /opt/program/models
WORKDIR /opt/program

ENTRYPOINT [“python3”, “serve”]

Install and configure the aws-cli client with the following code:

sudo apt install awscli
sudo apt install git-all
aws configure
# # input AWS Access Key ID, AWS Secret Access Key, Default region name and Default output format

Run the following command to build the Docker image and push the image to Amazon Elastic Container Registry (Amazon ECR). Provide your Region and account ID.

aws ecr get-login-password –region <regionID> | docker login –username AWS –password-stdin <accountID>.dkr.ecr.<regionID>

docker build -t inference/mytriton .

docker tag inference/mytriton:latest <accountID>.dkr.ecr. <regionID>

docker push <accountID>.dkr.ecr.<regionID>

Create a SageMaker endpoint and test the endpoint

Now it’s time to verify the result. Launch a notebook instance with an ml.c5.xlarge instance from the SageMaker console, and create a notebook with the conda_python3 kernel. The following code snippet shows an example deployment of an inference endpoint. The source code is available in the GitHub repo.

role = get_execution_role()
sess = sage.Session()
account = sess.boto_session.client(‘sts’).get_caller_identity()[‘Account’]
region = sess.boto_session.region_name
image = ‘{}.dkr.ecr.{}’.format(account, region)
model = sess.create_model(
name=”mytriton”, role=role, container_defs=image)
endpoint_name=”MyTritonEndpoint”, config_name=”MYTRITONCFG”)

Wait about 3 minutes until the inference server is started to verify the result.

The following code is the ResNet client request:

## resnet client
runtime = boto3.Session().client(‘runtime.sagemaker’)
img = cv2.imread(‘dog.jpg’)
string_img = base64.b64encode(cv2.imencode(‘.jpg’, img)[1]).decode()
payload = json.dumps({“modelname”: “resnet”,”payload”: {“img”:string_img}})

response = runtime.invoke_endpoint(EndpointName=endpoint,ContentType=”application/json”,Body=payload,Accept=’application/json’)


We get the following response:

{‘modelname’: ‘resnet’, ‘result’: [‘11.250000:250:250:malamute, malemute, Alaskan malamute’, ‘9.914062:249:249:Eskimo dog, husky’, ‘9.906250:248:248:Saint Bernard, St Bernard’]}

The following code is the YOLOv5 client request:

# yolov5 client
payload = json.dumps({“modelname”: “yolov5″,”payload”: {“img”:string_img}})

response = runtime.invoke_endpoint(EndpointName=endpoint,ContentType=”application/json”,Body=payload,Accept=’application/json’)


We get the following response:

b'{“modelname”: “yolov5”, “result”: [[16, 0.9168673157691956, 111.92530059814453, 258.53240966796875, 262.0159606933594, 533.407958984375, 768, 576], [2, 0.6941519379615784, 392.20037841796875, 573.6005249023438, 142.55178833007812, 224.56454467773438, 768, 576], [1, 0.5813695788383484, 131.8942413330078, 473.7420654296875, 179.61459350585938, 427.0913391113281, 768, 576], [7, 0.5316226482391357, 392.82275390625, 572.4647216796875, 144.685546875, 223.052734375, 768, 576]]}’

The following code is the BERT client request:

# bert client
text=”The world has [MASK] people.”

payload = json.dumps({“modelname”: “bert_base”,”payload”: {“text”:text}})

response = runtime.invoke_endpoint(EndpointName=endpoint,ContentType=”application/json”,Body=payload,Accept=’application/json’)


We get the following response:

{‘modelname’: ‘bert_base’, ‘result’: [{‘token’: ‘The world has many people.’, ‘score’: 0.16609132289886475}, {‘token’: ‘The world has no people.’, ‘score’: 0.07334889471530914}, {‘token’: ‘The world has few people.’, ‘score’: 0.0617995485663414}, {‘token’: ‘The world has two people.’, ‘score’: 0.03924647718667984}, {‘token’: ‘The world has its people.’, ‘score’: 0.023465465754270554}]}

Here we see our architecture is working as expected.

Note that hosting an endpoint will incur some costs. Therefore, delete the endpoint after you complete the test:


Cost estimation

To estimate cost, assume that you have three models, but not all of them are long-running. You’re using one endpoint for each model, and the online time of each endpoint is different. Using ml.g4dn.xlarge as an example, the total cost is about $971.52/month. The following table lists the details.

Model Name
Endpoint Running /Day
Instance Type
Cost/Month (us-east-1)
24 hours
0.736 * 24 * 30=$529.92
8 hours
0.736 * 8 * 30=$176.64
12 hours
0.736 * 12 * 30=$264.96

The following table shows the cost for sharing one endpoint for three models using the preceding architecture. The total cost is about $676.8/month. From this result, we can conclude that you can save 30% in costs while also having 24/7 service from your endpoint.

Model Name
Endpoint Running /Day
Instance Type
Cost/Month (us-east-1)
ResNet, YOLOv5, BERT
24 hours
0.94 * 24 * 30 = $676.8


In this post, we introduced an improved architecture in which multiple models share one endpoint in SageMaker. Under some conditions, this solution can help you save costs and improve resource utilization. It is suitable for business scenarios with low concurrency and latency-insensitive requirements.

To learn more about SageMaker and AI/ML solutions, refer to Amazon SageMaker.


Use Triton Inference Server with Amazon SageMaker
Achieve low-latency hosting for decision tree-based ML models on NVIDIA Triton Inference Server on Amazon SageMaker
Deploy fast and scalable AI with NVIDIA Triton Inference Server in Amazon SageMaker
NVIDIA Triton Inference Server
Export to ONNX
 amazon-sagemaker-examples/sagemaker-triton GitHub repo
Amazon SageMaker Pricing

About the authors

Zheng Zhang is a Senior Specialist Solutions Architect in AWS, he focuses on helping customers accelerate model training, inference and deployment for machine learning solutions. He also has rich experience in large-scale distributed training, design AI/ML solutions.

Yinuo He is an AI/ML specialist in AWS. She has experiences in designing and developing machine learning based products to provide better user experiences. She now works to help customers succeed in their ML journey.

Read MoreAWS Machine Learning Blog



Please enter your comment!
Please enter your name here

Most Popular

Recent Comments