
Integrating BigQuery data into your LangChain application

Data is the heart of any AI solution. You’ve got lots of valuable BigQuery data, but how can you integrate it into an LLM application? Large language models excel at using unstructured data. Including structured data can enrich and ground your model’s responses, and capture new relationships in your data. Let’s explore how to use the LangChain BigQuery Data Loader to do just that.

LangChain is an open-source framework that enables you to connect LLMs to data sources and control the flow of information between them. By using LangChain, you can simplify LLM development with a modular architecture and pre-built connectors. Some of the most popular use cases include chatbots and virtual assistants, code generation tools, and custom web applications built on large language models. When you see an LLM workflow with multiple steps and integration points, consider LangChain for the task.

Whether you’re new to LangChain, or just haven’t used a data loader, this post will walk through each step. In our example, our model will analyze multiple e-commerce tables to select the most appropriate customers for a marketing campaign. You’ll learn how to use a data loader to support generative AI use cases from summarization to code generation.

Getting started with LangChain

To find out more about BigQuery and LangChain at the same time, all we need to do is ask the Vertex AI text foundation model. Six lines of code from start to finish, not too bad!

```python
# Install LangChain and the Vertex AI SDK
!pip install --quiet google-cloud-aiplatform langchain

# Initialize the Vertex AI SDK
import vertexai
vertexai.init(project="<your-project-id>", location="us-central1")

# Query the model
from langchain.llms import VertexAI
llm = VertexAI(model_name="text-bison@001", temperature=0)
llm("What's BigQuery?")
```

And what does it tell us?

BigQuery is a fully managed, petabyte-scale analytics data warehouse that enables businesses to analyze all their data very quickly. It is a cloud-based service that offers fast performance, scalability, and flexibility. BigQuery is easy to use and can be integrated with other Google Cloud Platform services.

Pro tip: now you can use BigQuery Studio to run notebooks and try out the SQL in this blog post directly within BigQuery.

Using the data loader

Now that we've climbed the "Hello world!" mountain, let's learn how to use the document loader. We'll use data from a fictional e-commerce clothing site called TheLook, available as a BigQuery public dataset.

Let’s say we’re starting from scratch with a bunch of tables we don’t know well. And our marketing team is about to start a campaign in Japan, oh my! Can we ask the LLM to identify our target customers?

The first step to understanding our data is loading it. Let's query the schema from this dataset to extract the data definition language (DDL). DDL is used to create and modify tables, and can tell us about each column and its type.

As a prerequisite, let’s make sure we have the BigQuery client library installed:

```python
# Install the BigQuery client library
!pip install google-cloud-bigquery
```

Let’s now define the query and load the data:

```python
# Define our query
query = """
SELECT table_name, ddl
FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'BASE TABLE'
ORDER BY table_name;
"""

# Load the data
from langchain.document_loaders import BigQueryLoader

loader = BigQueryLoader(
    query,
    project="<your-project-id>",
    metadata_columns="table_name",
    page_content_columns="ddl",
)
data = loader.load()
```

Our query extracts the table name and DDL for each table. We then create a data loader, specifying that the table name is a metadata column and the DDL is the page content.

The result is an array of documents that looks like this:

```python
[Document(page_content='ddl:
  CREATE TABLE `bigquery-public-data.thelook_ecommerce.distribution_centers`
  (id INT64, name STRING, latitude FLOAT64, longitude FLOAT64)
  OPTIONS(description=
    "The Look fictitious e-commerce dataset: distribution_centers table");',
  metadata={'table_name': 'distribution_centers'}),
 ...]
```
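If you'd like to spot-check what the loader returned, a short loop over the documents works (a minimal sketch; the fields match the output above):

```python
# Inspect the loaded documents: metadata carries the table name,
# page_content carries the DDL statement
for doc in data:
    print(doc.metadata["table_name"])
    print(doc.page_content[:80], "...\n")
```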

Writing our first chain

Now that we've loaded the documents, let's put them to work! We need a query that answers our marketing question: which customers should we target for the campaign? We'll use the code generation model for this task.

We'll create a basic chain that "stuffs" all of the table metadata together into one prompt. For larger datasets with many more tables, a more sophisticated chaining approach is needed, because each prompt has a limited length, known as the context window.

For example, you could compress highlights from each individual table’s content into smaller documents, and then summarize those using a map-reduce method. Or, you could iterate over each table, refining your query as you go.
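As a rough sketch of the map-reduce idea, LangChain's built-in summarize chain can condense each table's DDL document individually and then combine the results (assuming the text model `llm` defined earlier; the prompts and settings here are illustrative defaults):

```python
# Illustrative sketch: summarize each DDL document individually ("map"),
# then combine those summaries into one overview ("reduce")
from langchain.chains.summarize import load_summarize_chain

map_reduce_chain = load_summarize_chain(llm, chain_type="map_reduce")
overview = map_reduce_chain.run(data)
print(overview)
```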

Here's how to do it for our simple case. We'll use the LangChain Expression Language (LCEL) to define the chain in three steps:

1. Combine the page_content from each document (remember, that's the DDL of each table) into a single string called content.
2. Create a prompt to find our most valuable customers, passing in content, the combined set of table metadata.
3. Pass the prompt to the LLM.

```python
# Use the code generation model
llm = VertexAI(model_name="code-bison@latest", max_output_tokens=2048)

# Define the chain
from langchain.prompts import PromptTemplate
from langchain.schema import format_document

chain = (
    {
        "content": lambda docs: "\n\n".join(
            format_document(doc, PromptTemplate.from_template("{page_content}"))
            for doc in docs
        )
    }
    | PromptTemplate.from_template(
        "Suggest a query that will help me identify my most valuable "
        "customers, with an emphasis on recent sales:\n\n{content}"
    )
    | llm
)

# Invoke the chain with the documents, and remove code backticks
result = chain.invoke(data).strip("```")
print(result)
```

Let’s take a look at the query:

```sql
SELECT
  users.id AS user_id,
  users.first_name AS first_name,
  users.last_name AS last_name,
  users.email AS email,
  SUM(order_items.sale_price) AS total_spend,
  users.country AS country
FROM `bigquery-public-data.thelook_ecommerce.users` AS users
JOIN `bigquery-public-data.thelook_ecommerce.orders` AS orders
  ON users.id = orders.user_id
JOIN `bigquery-public-data.thelook_ecommerce.order_items` AS order_items
  ON orders.order_id = order_items.order_id
WHERE users.country = 'Japan'
GROUP BY users.id, users.first_name, users.last_name, users.email, users.country
ORDER BY total_spend DESC
LIMIT 10;
```

Great, we've got the query! Before running it for real, it's a good idea to try it out with a dry run.
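Here's a minimal dry-run sketch using the BigQuery client library (the project ID is a placeholder); it validates the SQL and reports how many bytes the query would scan, without executing it:

```python
# Validate the generated SQL with a dry run before executing it
import google.cloud.bigquery as bq

client = bq.Client(project="<your-project-id>")
dry_run_config = bq.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_run_job = client.query(result, job_config=dry_run_config)
print(f"This query would process {dry_run_job.total_bytes_processed} bytes.")
```

With the query validated, let's get the answer to our question by fetching our top customers in Japan: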

```python
import google.cloud.bigquery as bq

client = bq.Client(project="<your-project-id>")
client.query(result).result().to_dataframe()
```

| index | user_id | first_name | last_name | email | total_spend | country |
|-------|---------|------------|-----------|-------|-------------|---------|
| 0 | 86707 | Robin | Oneill | [email protected] | 1533.96 | Japan |
| 1 | 68275 | Michael | Fisher | [email protected] | 1422.46 | Japan |
| 2 | 1988 | Alexander | Gill | [email protected] | 1061.95 | Japan |
