Unifying data and AI to bring unstructured data analytics to BigQuery

By mullaned2002

October 18, 2022

686

Over one third of organizations believe that data analytics and machine learning have the most potential to significantly alter the way they run business over the next 3 to 5 years. However, only 26% of organizations are data driven. One of the biggest reasons for this gap is that a major portion of the data generated today is unstructured, which includes images, documents, and videos. It is estimated to cover roughly up to 80% of all data, which has so far remained untapped by organizations.

One of the goals of Google’s data cloud is to help customers realize value from data of all types and formats. Earlier this year, we announced BigLake, which unifies data lakes and warehouses under a single management framework, enabling you to analyze, search, secure, govern and share unstructured data using BigQuery.

At Next ‘22, we announced the preview of object tables, a new table type in BigQuery that provides a structured record interface for unstructured data stored in Google Cloud Storage. This enables you to directly run analytics and machine learning on images, audio, documents and other file types using existing frameworks like SQL and remote functions natively in BigQuery itself. Object tables also extend our best practices of securing, sharing and governing structured data to unstructured, without needing to learn or deploy new tools.

Directly process unstructured data using BigQuery ML

Object tables contain metadata such as URI (Uniform Resource Identifier), content type, and size that can be queried just like other BigQuery tables. You can then derive inferences using machine learning models on unstructured data with BigQuery ML. As part of preview, you can import open source TensorFlow Hub image models, or your own custom models to annotate the images. Very soon, we plan to enable this for audio, video, text and many other formats, and pre-trained models to enable out-of-the box analysis. Check out this video to learn more and watch a demo.

code_block[StructValue([(u’code’, u’# Create an object tablernCREATE EXTERNAL TABLE my_dataset.object_tablernWITH CONNECTION us.my_connection rnOPTIONS(uris=[“gs://mybucket/images/*.jpg”],rn object_metadata=”SIMPLE”, metadata_cache_mode=”AUTOMATIC”);rnrn # Generate inferences with BQMLrnSELECT * FROM ML.PREDICT(rn MODEL my_dataset.vision_model, rn (SELECT ML.DECODE_IMAGE(data) AS img FROM my_dataset.object_table)rn);’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ea05d8524d0>)])]

By analyzing unstructured data natively in BigQuery, businesses can

Eliminate manual effort as pre-processing steps such as tuning image sizes to model requirements are automated

Leverage the simple and familiar SQL interface to quickly gain insights

Save costs by utilizing existing BigQuery slots without needing to provision new forms of compute

Adswerve is a leading Google Marketing, Analytics and Cloud partner on a mission to humanize data. Twiddy & Co. is Adswerve’s client – a vacation rental company in North Carolina. By combining structured and unstructured data, Twiddy and Adswerve used BigQuery ML to analyze images of rental listings and predict the click-through rate, enabling data-driven photo editorial decisions.

“Twiddy now has the capability to use advanced image analysis to stay competitive in an ever changing landscape of vacation rental providers – and can do this using their in-house SQL skills.” said Pat Grady, Technology Evangelist, Adswerve

Process unstructured data using remote functions

Customers today use remote functions (UDFs) to process structured data for languages and libraries that are not supported in BigQuery. We are extending this capability to process unstructured data using object tables.

Object tables provide signed URLs to allow remote UDFs running on Cloud Functions or Cloud Run to process the object table content. This is particularly useful for running Google’s pre-trained AI models, including Vision AI, Speech-to-Text, Document AI, open source libraries such as Apache Tika, or deploying your own custom models where performance SLAs are important.

Here’s an example of an object table being created over PDF files that are parsed using an open source library running as a remote UDF.

code_block[StructValue([(u’code’, u’SELECT uri, extract_title(samples.parse_tika(signed_url)) AS titlernFROM EXTERNAL_OBJECT_TRANSFORM(TABLE pdf_files_object_table, rn [“SIGNED_URL”]);’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ea080199ed0>)])]

Extending more BigQuery capabilities to unstructured data

Business intelligence – The results of analyzing unstructured data either directly in BigQuery ML or via UDFs can be combined with your structured data to build unified reports using Looker Studio (at no charge), Looker or any of your preferred BI solutions. This allows you to gain more comprehensive business insights. For example, online retailers can analyze product return rates by correlating them with the images of defective products. Similarly, digital advertisers can correlate ad performance with various attributes of ad creatives to make more informed decisions.

BigQuery search index – Customers are increasingly using the search functionality of BigQuery to power search use cases. These capabilities now extend to unstructured data analytics as well. Whether you use BigQueryML to produce inference on images or use remote UDFs with Doc AI to produce document extraction, the results can now be search indexed and used to support search access patterns.

Here’s an example of search index on data that is parsed from PDF files:

code_block[StructValue([(u’code’, u’CREATE SEARCH INDEX my_index ON pdf_text_extract(ALL COLUMNS);rnrnSELECT * FROM pdf_text_extract WHERE SEARCH(pdf_text, “Google”);’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ea076d8bb90>)])]

Security and governance – We are extending BigQuery’s row-level security capabilities to help you secure objects in Google Cloud Storage. By securing specific rows in an object table, you can restrict the ability of end users to retrieve the signed URLs of corresponding URIs present in the table. This is a shared responsibility security model, for which administrators need to ensure that end users don’t have direct access to Google Cloud Storage, and use signed URLs from object tables as the only access mechanism.

Here’s an example of a policy for PII images that are secured to be first processed through a blur pipeline:

code_block[StructValue([(u’code’, u’CREATE ROW ACCESS POLICY pii_data ON object_table_imagesrnGRANT TO (“group:[email protected]”) rnFILTER USING (ARRAY_LENGTH(metadata)=1 AND rn metadata[OFFSET(0)].name=”face_detected”)’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ea080614a50>)])]

Soon, Dataplex will support object tables, allowing you to automatically create object tables in BigQuery and manage and govern unstructured data at scale.

Data sharing – You can now use Analytics Hub to share unstructured data with partners, customers and suppliers while not compromising on security and governance. Subscribers can consume the rows of object tables that are shared with them, and use signed URLs for unstructured data objects.

Getting Started

Submit this form to try these new capabilities that unlock the power of your unstructured data in BigQuery. Watch this demo to learn more about these new capabilities.

Special thanks to engineering leaders Amir Hormati, Justin Levandoski and Yuri Volobuev for contributing to this post.

Cloud BlogRead More

Previous articleTrain a time series forecasting model faster with Amazon SageMaker Canvas Quick build

Next articleOrchestrating Data/ML Workflows at Scale With Netflix Maestro

Unifying data and AI to bring unstructured data analytics to BigQuery

Directly process unstructured data using BigQuery ML

Process unstructured data using remote functions

Extending more BigQuery capabilities to unstructured data

Getting Started

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

Cloud Spanner powers Kochava’s mobile analytics platform

Easily Build Advanced Similarity Search With The Pinecone Vector Database

PostgreSQL extension turned Cloud microservice

POPULAR CATEGORY

Unifying data and AI to bring unstructured data analytics to BigQuery

Directly process unstructured data using BigQuery ML

Process unstructured data using remote functions

Extending more BigQuery capabilities to unstructured data

Getting Started

Built with BigQuery: BigQuery ML enables Faraday to make predictions for any US consumer brand

LEAVE A REPLY Cancel reply

Most Popular

Recent Comments

EDITOR PICKS

POPULAR POSTS

POPULAR CATEGORY