Cloud Run: the fastest way to get your AI applications to production

By mullaned2002

May 31, 2024

71

It’s no secret that Cloud Run offers one of the most straightforward ways for delivering AI-powered applications to production, letting developers focus on their application logic without having to worry about the underlying infrastructure or how to scale from zero to millions of users. But did you know that Cloud Run is also many customers’ platform of choice for providing their AI researchers with tools they need to run and productionize their experiments outside of their trusted Python notebooks?

Cloud Run provides a number of services on top of the container runtime that offer a comprehensive platform for building and running AI-powered applications. In this blog post we highlight some of key features of Cloud Run that help speed up the development of AI-powered applications:

Time to market – by easily moving from prototyping in Vertex AI Studio to a deployed containerized application
Observability – by using Cloud Run’s built-in SLO monitoring and Google Cloud observability solutions
Rate of innovation – trying multiple versions of your service in parallel with revisions and traffic splitting
Relevance and factuality – building RAG implementations by directly and securely connecting to Cloud databases.
Multi-regional deployments and HA – by fronting multiple Cloud Run services behind a single external global application load balancer

From prototyping in AI Studio to publishing a Cloud Run service

Many gen AI-based products start in Vertex AI Studio, which allows for rapid prototyping on a range of models without the need for writing code. From there, the “Generate Code” functionality offers a handy segway for turning experiments into code in one of several popular programming languages.

The generated code snippet consists of a script that invokes the Vertex AI APIs that offer the AI service. Depending on the type of application you are trying to build, the process of turning that script into a web application might be as easy as turning the hardcoded prompt into a templated string and wrapping it all in a web framework. For example, in Python, this could be done by wrapping the prompt in a small Flask application and parametrizing the request with a simple Python f-string:

code_block
<ListValue: [StructValue([(‘code’, ‘from flask import Flask, request, jsonifyrnimport osrnimport vertexairnimport loggingrnfrom vertexai.preview.generative_models import GenerativeModelrnimport vertexai.preview.generative_models as generative_modelsrnrnrnapp = Flask(__name__)rnvertexai.init(project=os.getenv(“PROJECT_ID”), location=os.getenv(“GCP_REGION”))rnmodel = GenerativeModel(“gemini-1.0-pro-001″)rnrndef vertex_movie_recommendation(movies, scenario):rn try:rn response = model.generate_content(rn f”””You are an expert on box office movies. rn Provide a recommendation for which of the following movies to watch: {” or “.join(movies)}rn where the scenario is: {scenario}”””)rn return response.textrnrn except Exception as e:rn logging.exception(e)rn return “Unable to generate recommendation.”[email protected](‘/recommendations’, methods=[‘POST’])rndef movie_recommendations():rn “””rn Returns movie recommendation based on the user’s input.rn “””rn if not request.json:rn return jsonify({‘error’: ‘Missing JSON payload’}), 400rn movies = request.json[‘movies’]rn scenario = request.json[‘scenario’]rn recommendation_response = vertex_movie_recommendation(movies, scenario)rn return jsonify({‘recommendation’: recommendation_response})rnrnif __name__ == ‘__main__’:rn port = int(os.environ.get(‘PORT’, 8080))rn app.run(debug=True, host=’0.0.0.0′, port=port)’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e05eb658d60>)])]>

Together with a simple package.txt file that includes the required dependencies, we can already containerize and deploy our application. Thanks to Cloud Run’s support for Buildpacks we don’t even need to provide a Dockerfile to describe how our container should be built.

With a simple gcloud command the application is ready to serve traffic:

code_block
<ListValue: [StructValue([(‘code’, ‘gcloud run deploy movie-recommendation-eu \rn–source . \rn–set-env-vars GCP_REGION=europe-west1’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e05eb658ee0>)])]>

Monitor application performance with SLOs and telemetry

Implementing observability is crucial to both ensuring that the application meets user expectations and capturing the business impact that the application provides. Cloud Run offers both Service Level Objective (SLO) monitoring and observability out of the box.

Monitoring SLOs is an important way to start managing your application on the basis of Error Budgets and then use that metric to balance stability and rate of innovation. With Cloud Run, SLO monitoring can be defined on the basis of availability, latency, and custom metrics.

Traditional observability in the form of logging, monitoring, and tracing is also available out of the box and directly integrates with Google Cloud Observability to collect all the relevant data in a central place. Tracing in particular has proven to be invaluable for investigating the latency decomposition of AI applications. It is often used to better understand complex orchestration scenarios and RAG implementations.

Rapid innovation with concurrent revisions and Cloud Deploy

Many AI use cases fundamentally rethink how we approach and solve problems. Because of the nature of LLMs and the effects of parameters like temperature or nuances in prompting, the overall outcome are oftentimes hard to predict. Having the ability to run experiments in parallel can therefore help innovate and iterate quickly.

Cloud Run’s built-in traffic splitting allows developers to run multiple concurrent versions of different revisions of a service at once, and fine-grained control over how traffic should be distributed amongst them. For AI applications, this could mean that different versions of a prompt could be served to different sets of users and compared on the basis of a common success metric like click-rate or probability of purchase.

Cloud Deploy offers a managed service to orchestrate the rollout of different revisions of a Cloud Run service automatically. It also integrates with your existing development flows such that a deployment pipeline can be triggered through push events in source control.

Connecting to cloud databases to integrate with enterprise data

Sometimes a static pre-trained model can’t generate accurate responses because it lacks the specific context of a domain. In many cases, providing additional data in the prompt through techniques like retrieval-augmented generation (RAG) can help provide the model with enough contextual information to make the model’s responses more relevant for a specific use case.

Cloud Run provides direct and private connectivity from the orchestrating AI application to cloud databases like AlloyDB or Cloud SQL that can then be leveraged as a vector store for RAG implementations. With direct VPC egress capabilities Cloud Run can now connect to private database endpoints without the extra step of a serverless VPC connector.

Custom domains and multi-regional deployments

By default any Cloud Run service receives a URL in the form of <service-name>.<project-region-hash>.a.run.app that can be used to access the service with HTTP requests. This is handy for quick prototyping and internal services but often it presents two problems.

First, that URL is not very memorable, and the domain suffix does not map to the service provider. It’s therefore impossible for consumers of the service to identify whether this is a legitimate service. Even the SSL certificate is issued to Google and does not provide information about the ownership of that service.

The second concern is that if you scale your service to multiple regions to provide HA and lower latency to your distributed user base, the different regions will have different URLs. This means that switching from one service region to another has to be done at the client or DNS level and isn’t transparent to users.

To solve for both of these problems, Cloud Run offers support for custom domain names and the ability to unify Cloud Run deployments in different regions behind a global external load balancer with a single external IP address based on anycast. With the load balancer in place and with the activation of the outlayer traffic detection feature of Cloud Run, you can run your AI service with a custom domain, your own certificate and with automatic failover in a regional outage scenario.

Let Cloud Run power your AI app

We explored five different aspects that make Cloud Run a natural starting point for building AI-powered applications on top of the powerful services provided by Vertex AI.

If you’re interested in building RAG-capable generative AI applications with Cloud Run, then this solution blueprint is a fantastic resource to dive deeper into the different architectural building blocks. If you think generative AI applications in Cloud Run have to be written in Python then this session from Cloud Next will show you how to build AI applications in a familiar Java-based enterprise environment. And lastly make sure you bookmark Google’s solutions catalog that contains a range of developer-focussed resources and pointers to get you started.

Cloud BlogRead More

Previous articleCloud FinOps and GKE: How Generali optimized spend with Google Cloud Consulting

Next articleAnthropic’s Claude 3 Opus and tool use are generally available on Vertex AI

Cloud Run: the fastest way to get your AI applications to production

From prototyping in AI Studio to publishing a Cloud Run service

Monitor application performance with SLOs and telemetry

Rapid innovation with concurrent revisions and Cloud Deploy

Connecting to cloud databases to integrate with enterprise data

Custom domains and multi-regional deployments

Let Cloud Run power your AI app

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

The biggest obstacle to cloud is people

Celebrating the winners of the 2021 Google Cloud Customer Awards

Writing an Essay with ChatGPT

POPULAR CATEGORY

Cloud Run: the fastest way to get your AI applications to production

From prototyping in AI Studio to publishing a Cloud Run service

Monitor application performance with SLOs and telemetry

Rapid innovation with concurrent revisions and Cloud Deploy

Connecting to cloud databases to integrate with enterprise data

Custom domains and multi-regional deployments

Let Cloud Run power your AI app

The container platform for the next decade of AI and beyond

LEAVE A REPLY Cancel reply

Most Popular

Recent Comments

EDITOR PICKS

POPULAR POSTS

POPULAR CATEGORY