Tuesday, May 28, 2024
No menu items!
HomeCloud ComputingIntegrating BigQuery data into your LangChain application

Integrating BigQuery data into your LangChain application

Data is the heart of any AI solution. You’ve got lots of valuable BigQuery data, but how can you integrate it into an LLM application? Large language models excel at using unstructured data. Including structured data can enrich and ground your model’s responses, and capture new relationships in your data. Let’s explore how to use the LangChain BigQuery Data Loader to do just that.

LangChain is an open source framework that enables you to connect LLMs to data sources and control the flow of information between them. By using LangChain, you can simplify LLM development with a modular architecture and pre-built connectors. Some of the most popular use cases include chatbots and virtual assistants, code generation tools, and custom web applications that use large language models. When you see a LLM workflow consisting of multiple steps and integration points, you should consider LangChain for the task.

Whether you’re new to LangChain, or just haven’t used a data loader, this post will walk through each step. In our example, our model will analyze multiple e-commerce tables to select the most appropriate customers for a marketing campaign. You’ll learn how to use a data loader to support generative AI use cases from summarization to code generation.

Getting started with LangChain

To find out more about BigQuery and LangChain at the same time, all we need to do is ask the Vertex AI text foundation model. Six lines of code from start-to-finish, not too bad!

code_block<ListValue: [StructValue([(‘code’, ‘# Install LangChain and the Vertex AI SDKrn!pip install –quiet google-cloud-aiplatform langchainrnrnrn# Initialize Vertex AI SDKrnimport vertexairnvertexai.init(project=”<your-project-id>”, location=”us-central1″)rnrnrn# Query the model rnfrom langchain.llms import VertexAIrnllm = VertexAI(model_name=”text-bison@001″, temperature=0)rnllm(“What’s BigQuery?”)’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e37350529a0>)])]>

And what does it tell us?

BigQuery is a fully managed, petabyte-scale analytics data warehouse that enables businesses to analyze all their data very quickly. It is a cloud-based service that offers fast performance, scalability, and flexibility. BigQuery is easy to use and can be integrated with other Google Cloud Platform services.

Pro tip: now you can useBigQuery Studioto run notebooks and try out SQL in this blog post directly within BigQuery.

Using the data loader

Now that we’ve climbed the “Hello world!” mountain, let’s learn how to use the document loader. We’ll use data from a fictional eCommerce clothing site calledTheLook, available as a BigQuery public dataset.

Let’s say we’re starting from scratch with a bunch of tables we don’t know well. And our marketing team is about to start a campaign in Japan, oh my! Can we ask the LLM to identify our target customers?

The first step to understanding our data is loading it. Let’s query the schema from this dataset to extract thedata definition language(DDL). DDL is used to create and modify tables, and can tell us about each column and its type.

As a prerequisite, let’s make sure we have the BigQuery client library installed:

code_block<ListValue: [StructValue([(‘code’, ‘# Install BigQuery libraryrn!pip install google-cloud-bigquery’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e3735052e20>)])]>

Let’s now define the query and load the data:

code_block<ListValue: [StructValue([(‘code’, ‘# Define our queryrnquery = f”””rnSELECT table_name, ddlrnFROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.TABLES`rnWHERE table_type = ‘BASE TABLE’rnORDER BY table_name;rn”””rnrnrn# Load the datarnloader = BigQueryLoader(query, project=”<your-project-id>”, metadata_columns=”table_name”, page_content_columns=”ddl”)rndata = loader.load()’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e37350526a0>)])]>

Our query is extracting the table name and DDL for each of the tables. We then create a data loader, specifying that the table name is a metadata column and the DDL is the content.

The result is an array of documents, that looks like this:

code_block<ListValue: [StructValue([(‘code’, ‘[Document(page_content=’ddl:rn CREATE TABLE `bigquery-public-data.thelook_ecommerce.distribution_centers`rn (id INT64, name STRING, latitude FLOAT64, longitude FLOAT64)rn OPTIONS(description=rn “The Look fictitious e-commerce dataset: distribution_centers table”);’,rn metadata={‘table_name’: ‘distribution_centers’}),…]’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e37350524c0>)])]>

Writing our first chain

Now that we’ve loaded the documents, let’s put them to work! We need a query that gives us the answers we need – understanding for our marketing campaign. We’ll use the code generation model for this task.

We will create a basic chain that “stuffs” together all of the table metadata into one prompt. For larger datasets with many more tables, a more sophisticated chaining approach will be needed. That’s because there’s a limited length to each prompt, i.e. a context window.

For example, you could compress highlights from each individual table’s content into smaller documents, and then summarize those using a map-reduce method. Or, you could iterate over each table, refining your query as you go.

Here’s how to do it. We’ll use the LangChain Expression Language (LCEL) to define the chain with 3 steps:

We’ll combine the page_content from each document (remember, that’s the DDL of each table) into a string called content.Create a prompt to find our most valuable customers, passing in content, the combined set of table metadata .Pass the prompt to the LLM.

code_block<ListValue: [StructValue([(‘code’, ‘# Use code generation modelrnllm = VertexAI(model_name=”code-bison@latest”, max_output_tokens=2048)rnrnrn# Define the chainrnfrom langchain.prompts import PromptTemplaternfrom langchain.schema import format_documentrnchain = (rn {rn “content”: lambda docs: “\n\n”.join(rn format_document(doc, PromptTemplate.from_template(“{page_content}”)) for doc in docsrn )rn }rn | PromptTemplate.from_template(“Suggest a query that will help me identify my most valuable customers, with an emphasis on recent sales:\n\n{content}”)rn | llmrn)rnrnrn# Invoke the chain with the documents, and remove code backticksrnresult = chain.invoke(data).strip(‘“`’)rnprint(result)’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e37350520d0>)])]>

Let’s take a look at the query:

code_block<ListValue: [StructValue([(‘code’, “SELECTrn users.id AS user_id,rn users.first_name AS first_name,rn users.last_name AS last_name,rn users.email AS email,rn SUM(order_items.sale_price) AS total_spend,rn users.country AS countryrnFROM `bigquery-public-data.thelook_ecommerce.users` AS usersrnJOIN `bigquery-public-data.thelook_ecommerce.orders` AS ordersrnON users.id = orders.user_idrnJOIN `bigquery-public-data.thelook_ecommerce.order_items` AS order_itemsrnON orders.order_id = order_items.order_idrnWHERE users.country = ‘Japan’rnGROUP BY users.id, users.first_name, users.last_name, users.email, users.countryrnORDER BY total_spend DESCrnLIMIT 10;”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e3735052e80>)])]>

Great, we’ve got the query! It’s recommended to try it out first with a dry run..Now let’s get the answer to our question of fetching the users from Japan:

code_block<ListValue: [StructValue([(‘code’, ‘import google.cloud.bigquery as bqrnrnrnclient = bq.Client(project=”<your-project-id>”)rnclient.query(result).result().to_dataframe()’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e3735052520>)])]>












[email protected]







[email protected]







[email protected]



Cloud BlogRead More



Please enter your comment!
Please enter your name here

Most Popular

Recent Comments