
Serving open-source large language models efficiently on Vertex AI Model Garden

Google Cloud is dedicated to providing customers with the best technologies, whether they are powered by Google's own advancements or come from the open-source community. We have over 100 open-source models in Vertex AI's Model Garden, including Meta's Llama 2 and Code Llama.

Today, we are excited to announce an updated, more efficient LLM serving solution that improves serving throughput on Vertex AI. Our solution is built on the popular open-source vLLM library and is competitive with the current state-of-the-art LLM serving solutions in the industry.

In this blog, we describe our solution in detail, share benchmark results, and provide a Colab notebook example to help you get started.

Our Solution

Our solution integrates the state-of-the-art open-source LLM serving framework vLLM, which includes features such as the following (a minimal standalone usage sketch follows the list):

- Optimized transformer implementation with PagedAttention
- Continuous batching to improve overall serving throughput
- Tensor parallelism and distributed serving on multiple GPUs
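
To make these features concrete, here is a minimal sketch of calling vLLM's offline inference API directly. It is for illustration only: the managed deployment described below wraps the same engine in a pre-built serving container, and the model name and sampling parameters here are example values.

from vllm import LLM, SamplingParams

# Load the model; tensor_parallel_size > 1 would shard it across multiple GPUs
# on the same node.
llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.8, max_tokens=200)

# The engine batches prompts continuously and manages the KV cache with
# PagedAttention under the hood.
outputs = llm.generate(["What is Vertex AI Model Garden?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)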

We benchmarked our solution on Vertex AI Online Prediction and fully reproduced the benefits of vLLM there. The benchmark results are below.

We used OpenLLaMA 13B, an open-source reproduction of LLaMA, as the model architecture for evaluation. We compared our serving solution (Vertex Model Garden vLLM) with HuggingFace Transformers (HF Transformers), the most popular LLM library, and with HuggingFace Text Generation Inference (HF TGI), another state-of-the-art LLM serving library that is not open source. We sampled 1,000 prompts of different input/output lengths from the ShareGPT dataset and measured the serving throughput.
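
As a point of reference, serving throughput in this kind of benchmark is usually reported as generated tokens (or completed requests) per second of wall-clock time. The sketch below illustrates that calculation only; it is not the harness used to produce our numbers, send_request is a hypothetical callable that submits one prompt and returns the number of generated tokens, and a real benchmark would issue requests concurrently rather than one at a time.

import time

def measure_throughput(prompts, send_request):
    """Estimate serving throughput as output tokens generated per second."""
    start = time.time()
    total_output_tokens = sum(send_request(prompt) for prompt in prompts)
    elapsed = time.time() - start
    return total_output_tokens / elapsed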

From our benchmark results, the Vertex AI serving stack achieved up to 19x higher throughput than HF Transformers and performance on par with the state-of-the-art HF TGI. This means that users can serve more requests at a lower cost. We also showed that it is possible to serve LLMs using multiple GPUs on a single node and included those benchmark results. We further tested the same solution on the latest Llama 2 and Code Llama models and observed similar results.

Get started with Custom Serving from Vertex AI Model Garden

We have provided a Colab notebook example in Vertex AI Model Garden that shows how to deploy open-source foundation models on Vertex AI.

An example of deploying the OpenLLaMA models to Vertex AI with vLLM serving can be accessed here. We provide a pre-built vLLM serving Docker image; deploying Llama 2 and Code Llama follows similar steps. You can call the Vertex AI SDK to deploy a model using this Docker image:

from google.cloud import aiplatform

VLLM_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve"

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

# Create an endpoint to host the model.
endpoint_vllm = aiplatform.Endpoint.create(display_name="open-llama-endpoint")

# Arguments passed to the vLLM API server inside the serving container.
vllm_args = [
    "--host=0.0.0.0",
    "--port=7080",
    "--model=openlm-research/open_llama_13b",
    "--tensor-parallel-size=1",
    "--swap-space=16",
    "--gpu-memory-utilization=0.95",
    "--disable-log-stats",
]

model = aiplatform.Model.upload(
    display_name="open-llama-13b",
    serving_container_image_uri=VLLM_DOCKER_URI,
    serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
    serving_container_args=vllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
)

model.deploy(
    endpoint=endpoint_vllm,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    deploy_request_timeout=1800,
    service_account=SERVICE_ACCOUNT,
)
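
As noted in the benchmark section, the same solution can serve a model with tensor parallelism across multiple GPUs on a single node. The following is a hedged sketch of how the deployment above might change for four A100 GPUs; it reuses the variables defined above, the endpoint and display names are illustrative, and you should size --tensor-parallel-size, accelerator_count, and the machine type to your model and quota.

# Sketch only: serve the same model sharded across 4 GPUs on one node.
# --tensor-parallel-size must match accelerator_count.
endpoint_vllm_tp4 = aiplatform.Endpoint.create(display_name="open-llama-tp4-endpoint")

vllm_args_tp4 = [
    "--host=0.0.0.0",
    "--port=7080",
    "--model=openlm-research/open_llama_13b",
    "--tensor-parallel-size=4",
    "--swap-space=16",
    "--gpu-memory-utilization=0.95",
    "--disable-log-stats",
]

model_tp4 = aiplatform.Model.upload(
    display_name="open-llama-13b-tp4",
    serving_container_image_uri=VLLM_DOCKER_URI,
    serving_container_command=["python", "-m", "vllm.entrypoints.api_server"],
    serving_container_args=vllm_args_tp4,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
)

model_tp4.deploy(
    endpoint=endpoint_vllm_tp4,
    machine_type="a2-highgpu-4g",  # 4 x NVIDIA A100 40GB
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=4,
    deploy_request_timeout=1800,
    service_account=SERVICE_ACCOUNT,
)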

After the model has been deployed to Vertex AI, you can send a request with a prompt to the endpoint and get a response:

instance = {
    "prompt": "hi google",
    "n": 1,
    "max_tokens": 200,
}
response = endpoint_vllm.predict(instances=[instance])
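
The SDK returns a Prediction object whose predictions field holds one entry per instance. The exact payload shape depends on the serving container's /generate response, so the minimal sketch below simply prints whatever comes back rather than assuming a particular structure.

# Inspect the raw predictions returned by the vLLM serving container.
for prediction in response.predictions:
    print(prediction)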

NOTE: These code snippets are for demonstration only; please refer to the notebook for the complete model deployment process.
