BigQuery integrates with Doc AI to help build document analytics and generative AI use cases

By mullaned2002

January 4, 2024

As digital transformation accelerates, organizations are generating vast amounts of text and other document data, all of which holds immense potential for insights and powering novel generative AI use cases. To help harness this data, we’re excited to announce an integration between BigQuery and Document AI, letting you easily extract insights from document data and build new large language model (LLM) applications.

BigQuery customers can now create Document AI Custom Extractors, powered by Google’s cutting-edge foundation models, which they can customize based on their own documents and metadata. These customized models can then be invoked from BigQuery to extract structured data from documents in a secure, governed manner, using the simplicity and power of SQL.

Prior to this integration, some customers tried to construct independent Document AI pipelines, which involved manually curating extraction logic and schema. The lack of native integration capabilities left them to develop bespoke infrastructure to synchronize and maintain data consistency. This turned each document analytics project into a substantial undertaking that required significant investment. Now, with this integration, customers can easily create remote models in BigQuery for their custom extractors in Document AI, and use them to perform document analytics and generative AI at scale, unlocking a new era of data-driven insights and innovation.

A unified, governed data to AI experience

You can build a custom extractor in the Document AI Workbench with three steps:

1. Define the data you need to extract from your documents. This is called document schema, stored with each version of the custom extractor, accessible from BigQuery.
2. Optionally, provide extra documents with annotations as samples of the extraction.
3. Train the model for the custom extractor, based on the foundation models provided in Document AI.

In addition to custom extractors that require manual training, Document AI also provides ready-to-use extractors in the processor gallery for expenses, receipts, invoices, tax forms, government IDs, and many other scenarios. You can use them directly without performing the above steps.

Then, once you have the custom extractor ready, you can move to BigQuery Studio to analyze the documents using SQL in the following four steps:

1. Register a BigQuery remote model for the extractor using SQL. The model can understand the document schema (created above), invoke the custom extractor, and parse the results.
2. Create object tables using SQL for the documents stored in Cloud Storage. You can govern the unstructured data in the tables by setting row-level access policies, which limit users’ access to certain documents and thus restrict which documents they can process with AI, for privacy and security.
3. Use the ML.PROCESS_DOCUMENT function on the object table to extract relevant fields by making inference calls to the API endpoint. You can also filter the documents to be extracted with a WHERE clause outside of the function. The function returns a structured table, with each column being an extracted field.
4. Join the extracted data with other BigQuery tables to combine structured and unstructured data and produce business value.
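The row-level governance mentioned in the second step can be sketched as follows. This is a minimal illustration, not the article's own example: the policy name, the group address, and the `finance/` subfolder are hypothetical, and it assumes the object table `my_dataset.receipt_table` used elsewhere in this post.

```sql
# Hypothetical policy: members of the named group can only see (and therefore
# only process) documents stored under the assumed finance/ prefix.
CREATE OR REPLACE ROW ACCESS POLICY finance_documents_only
ON `my_dataset.receipt_table`
GRANT TO ('group:finance-analysts@example.com')
FILTER USING (STARTS_WITH(uri, 'gs://my_bucket/path/finance/'));
```

Because the policy filters rows of the object table, a user covered by it would only be able to pass the matching documents to ML.PROCESS_DOCUMENT.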

The following example illustrates the user experience:

[Screenshot: curating a Document AI custom extractor in Workbench]

```sql
# Create an object table in BigQuery that maps to the document files stored in Cloud Storage.
CREATE OR REPLACE EXTERNAL TABLE `my_dataset.receipt_table`
WITH CONNECTION `my_project.us.example_connection`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://my_bucket/path/*'],
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 1 HOUR
);

# Create a remote model to register your Doc AI processor in BigQuery.
CREATE OR REPLACE MODEL `my_dataset.invoice_parser`
REMOTE WITH CONNECTION `my_project.us.example_connection`
OPTIONS (
  remote_service_type = 'CLOUD_AI_DOCUMENT_V1',
  document_processor = 'projects/…/locations/us/processors/…/processorVersions/pretrained-invoice-v1.3-2022-07-15'
);

# Invoke the registered model over the object table to parse PDF expense receipts
SELECT uri, total_amount, invoice_date
FROM ML.PROCESS_DOCUMENT(
  MODEL `my_dataset.invoice_parser`,
  TABLE `my_dataset.receipt_table`)
WHERE content_type = 'application/pdf';
```

Table of results

Text analytics, summarization and other document analysis use cases

Once you have extracted text from your documents, you can then perform document analytics in a few ways:

- Use BigQuery ML to perform text analytics: BigQuery ML supports training and deploying text models in a variety of ways. For example, you can use BigQuery ML to identify customer sentiment in support calls, or to classify product feedback into different categories. If you are a Python user, you can also use BigQuery DataFrames, with its pandas-like and scikit-learn-like APIs, for text analysis on your data.
- Use the PaLM 2 LLM to summarize the documents: BigQuery has an ML.GENERATE_TEXT function that calls the PaLM 2 model to generate text, which can be used to summarize the documents. For instance, you can use a Document AI extractor to extract customer feedback and then summarize that feedback using PaLM 2, all with BigQuery SQL.
- Join document metadata with other structured data stored in BigQuery tables: This allows you to combine structured and unstructured data for more powerful use cases. For example, you could identify high customer lifetime value (CLTV) customers with feedback captured from online reviews, or shortlist the most requested product features from customer feedback.

```sql
# Example of document summarization using PaLM 2
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text,
  ml_generate_text_result['predictions'][0]['safetyAttributes']
    AS safety_attributes,
  * EXCEPT (ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `my_dataset.llm_model`,
    (
      SELECT
        CONCAT(
          'Summarize the following text: ', customer_feedback) AS prompt,
        *
      FROM ML.PROCESS_DOCUMENT(
        MODEL `my_dataset.customer_feedback_extractor`,
        TABLE `my_dataset.customer_feedback_documents`)
    ),
    STRUCT(
      0.2 AS temperature,
      1024 AS max_output_tokens));
```
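Joining extracted fields with structured data, the third option listed above, might look like the sketch below. It rests on assumptions that are not in the original post: the extractor's schema is assumed to include a `customer_id` field next to `customer_feedback`, and `my_dataset.customers` with its `lifetime_value` column is a hypothetical table. The model and table names are placeholders.

```sql
# Surface feedback from high-CLTV customers by joining extracted fields
# with a hypothetical structured customers table.
SELECT
  c.customer_id,
  c.lifetime_value,
  d.uri,
  d.customer_feedback
FROM ML.PROCESS_DOCUMENT(
  MODEL `my_dataset.customer_feedback_extractor`,
  TABLE `my_dataset.customer_feedback_documents`) AS d
JOIN `my_dataset.customers` AS c
  ON c.customer_id = d.customer_id  # assumes both columns share a type
WHERE c.lifetime_value > 10000      # illustrative threshold
ORDER BY c.lifetime_value DESC;
```

If the extracted field's type differs from the key in the structured table, a CAST on one side of the join condition would be needed.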

Implement search and generative AI use cases

Once you’ve extracted structured text from your documents, you can build indexes optimized for needle-in-the-haystack queries, made possible by BigQuery’s search and indexing capabilities, unlocking powerful search functionality.
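As a rough sketch of that search workflow, the statements below save extracted fields to a table, build a search index on it, and run a keyword lookup. The table name `my_dataset.extracted_feedback`, the index name, and the search term are hypothetical, and the extractor is assumed to return `customer_feedback` as a STRING column.

```sql
# Save the extracted fields to a regular table so they can be indexed.
CREATE OR REPLACE TABLE `my_dataset.extracted_feedback` AS
SELECT uri, customer_feedback
FROM ML.PROCESS_DOCUMENT(
  MODEL `my_dataset.customer_feedback_extractor`,
  TABLE `my_dataset.customer_feedback_documents`);

# Build a search index over the extracted text column.
CREATE SEARCH INDEX feedback_search_index
ON `my_dataset.extracted_feedback` (customer_feedback);

# Needle-in-the-haystack lookup: documents whose feedback mentions "refund".
SELECT uri, customer_feedback
FROM `my_dataset.extracted_feedback`
WHERE SEARCH(customer_feedback, 'refund');
```

The SEARCH function also works without an index; the index is what lets BigQuery avoid scanning every row on large tables.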

This integration also helps unlock new generative LLM applications like executing text-file processing for privacy filtering, content safety checks, and token chunking using SQL and custom Document AI models. The extracted text, combined with other metadata, simplifies the curation of the training corpus required to fine-tune large language models. Moreover, you’re building LLM use cases on governed, enterprise data that’s been grounded through BigQuery’s embedding generation and vector index management capabilities. By synchronizing this index with Vertex AI, you can implement retrieval-augmented generation use cases, for a more governed and streamlined AI experience.
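The embedding and vector index step could be sketched as below. The post does not name the functions involved, so this is one plausible shape using BigQuery's ML.GENERATE_EMBEDDING, CREATE VECTOR INDEX, and VECTOR_SEARCH. It assumes a hypothetical remote model `my_dataset.embedding_model` registered over a Vertex AI text embedding endpoint, and a hypothetical table `my_dataset.extracted_feedback` (columns `uri` and `customer_feedback`) holding ML.PROCESS_DOCUMENT output.

```sql
# Generate one embedding per extracted document.
# ML.GENERATE_EMBEDDING expects the input text in a column named `content`.
CREATE OR REPLACE TABLE `my_dataset.feedback_embeddings` AS
SELECT uri, content, ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `my_dataset.embedding_model`,
  (SELECT uri, customer_feedback AS content
   FROM `my_dataset.extracted_feedback`),
  STRUCT(TRUE AS flatten_json_output));

# Optional: index the embeddings for approximate nearest-neighbor search.
CREATE VECTOR INDEX feedback_vector_index
ON `my_dataset.feedback_embeddings` (embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');

# Retrieve the five documents closest to a natural-language question,
# for example as context for a retrieval-augmented generation prompt.
SELECT base.uri, base.content, distance
FROM VECTOR_SEARCH(
  TABLE `my_dataset.feedback_embeddings`, 'embedding',
  (SELECT ml_generate_embedding_result AS embedding
   FROM ML.GENERATE_EMBEDDING(
     MODEL `my_dataset.embedding_model`,
     (SELECT 'complaints about late delivery' AS content),
     STRUCT(TRUE AS flatten_json_output))),
  query_column_to_search => 'embedding',
  top_k => 5,
  distance_type => 'COSINE');
```

The retrieved rows can then be concatenated into a prompt for ML.GENERATE_TEXT. Synchronizing the index with Vertex AI, as the paragraph above describes, is a separate step that this sketch does not cover.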

Next steps

The above capabilities are now available in preview. To get started, reach out to your Google sales representative, or check out the following tutorials:

Create a Custom Extractor in Document AI Workbench
Process documents with the ML.PROCESS_DOCUMENT function in BigQuery

Source: Cloud Blog
