Thursday, December 5, 2024

Introducing BigQuery text embeddings for NLP tasks

Text embeddings are a key enabler and building block for applications such as semantic search, recommendation, text clustering, sentiment analysis, and named entity extraction. Today, we’re announcing a set of new features in BigQuery to generate text embeddings and apply them to downstream application tasks with familiar SQL commands. Starting today, you can use four types of text embedding generation directly from BigQuery SQL:

textembedding-gecko for embedding with generative AI 

BERT for natural language processing tasks that require context and/or multi-language support

NNLM for simple NLP tasks such as text classification and sentiment analysis

SWIVEL for a large corpus of data that needs to capture complex relationships between words

The newly supported array<numeric> feature type allows these generated embeddings to be used by any ML model supported by BigQuery for data analysis based on proximity and distance within the vector space. 
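To make "proximity and distance within the vector space" concrete, here is a minimal Python sketch, run outside BigQuery, of two distance measures commonly applied to embedding arrays. The three-dimensional vectors and their labels are made up for illustration; real embeddings have many more dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy embeddings (not produced by any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

# Texts with similar meaning should sit closer together than unrelated ones.
print("cat vs kitten:", cosine_distance(cat, kitten))
print("cat vs car:   ", cosine_distance(cat, car))
```

Cosine distance ignores vector length and compares direction only, which is why it is a frequent choice for text embeddings; Euclidean distance also reflects magnitude.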

Generate your first embedding using BigQuery

To set the stage for the BigQuery ML (BQML) applications covered below, we first review the newly added function that uses the textembedding-gecko PaLM API to generate embeddings. Specifically, it is invoked through the new BigQuery ML function ML.GENERATE_TEXT_EMBEDDING in a simple two-step process.

First, we register the textembedding-gecko model as a remote model.

```sql
-- Register textembedding-gecko as a remote model.
CREATE MODEL `my_project.my_company.llm_embedding_model`
REMOTE WITH CONNECTION `test_project.us.bqml_test_connection`
OPTIONS (remote_service_type = 'CLOUD_AI_TEXT_EMBEDDING_MODEL_V1');
```

Second, we use the ML.GENERATE_TEXT_EMBEDDING function to generate embeddings. The example below uses the public IMDB reviews dataset as input.

```sql
-- Generate one embedding per review; the input text column must be named "content".
SELECT *
FROM ML.GENERATE_TEXT_EMBEDDING(
  MODEL `my_project.my_company.llm_embedding_model`,
  (
    SELECT review AS content
    FROM `bigquery-public-data.imdb.reviews`
  )
);
```

More details can be found by visiting the documentation page. You can also choose to generate text embeddings using smaller models, namely BERT, NNLM, and SWIVEL. Their generated embeddings offer reduced capacity for encoding the semantic meaning of text, but are more scalable for handling a very large corpus. Check out this public tutorial for more details about how to use them in BigQuery ML.

Build embedding applications in BigQuery ML

With a text embedding table in place, we showcase two common applications: classification and a basic version of similarity search.

Sentiment analysis via classification
Let’s take a look at how to build a logistic regression model that predicts the sentiment (positive or negative) of an IMDB review using embeddings generated from the NNLM model, combined with the original data column reviewer_rating.

```sql
-- Train a logistic regression classifier on NNLM embeddings plus reviewer_rating.
CREATE OR REPLACE MODEL `my_project.my_company.imdb_sentiment`
OPTIONS (model_type = "logistic_reg", max_iterations = 10) AS (
  SELECT
    embedding,
    reviewer_rating,
    label
  FROM
    ML.PREDICT(
      MODEL `my_project.my_company.nnlm_embedding_model`,
      (
        SELECT
          review AS embedding_input,
          reviewer_rating,
          label
        FROM
          `bigquery-public-data.imdb.reviews`
        WHERE
          label != "Unsupervised"
      )
    )
);
```

Once the model is created, you can call ML.PREDICT to obtain the sentiment of a review and ML.EVALUATE to measure overall model performance. Note that text input must first be transformed into an embedding before it is fed into the model. Below is an example ML.PREDICT query:

```sql
-- The inner ML.PREDICT turns the review text into an embedding;
-- the outer ML.PREDICT classifies that embedding together with reviewer_rating.
SELECT
  *
FROM
  ML.PREDICT(
    MODEL `my_project.my_company.imdb_sentiment`,
    (
      SELECT
        *
      FROM
        ML.PREDICT(
          MODEL `my_project.my_company.nnlm_embedding_model`,
          (
            SELECT
              "Isabelle Huppert must be one of the greatest actresses of her or any other generation. 'The Piano Teacher' truly confirms it." AS embedding_input,
              7 AS reviewer_rating
          )
        )
    )
  );
```
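The two-stage flow (text to embedding, then embedding plus reviewer_rating to classifier) can be sketched in plain Python. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical stand-in that counts a few sentiment words rather than a real embedding model, and the weights and bias are hand-picked, not trained by BigQuery ML.

```python
import math

# Hypothetical word lists used only by the toy embedding below.
POSITIVE_WORDS = {"great", "greatest", "good"}
NEGATIVE_WORDS = {"bad", "worst", "boring"}


def toy_embed(text):
    """Stand-in for an embedding model: returns [positive_count, negative_count]."""
    words = [w.strip(".,!'\"") for w in text.lower().split()]
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    return [float(positive), float(negative)]


def predict_positive_probability(embedding, reviewer_rating, weights, bias):
    """Logistic regression over the embedding values followed by the rating."""
    features = embedding + [float(reviewer_rating)]
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hand-picked parameters (assumption): one weight per embedding value, one for the rating.
WEIGHTS = [1.5, -1.5, 0.3]
BIAS = -1.5

review = ("Isabelle Huppert must be one of the greatest actresses of her or any "
          "other generation. 'The Piano Teacher' truly confirms it.")

# Stage 1: text -> embedding. Stage 2: embedding + rating -> probability.
p_positive = predict_positive_probability(toy_embed(review), 7, WEIGHTS, BIAS)
p_negative = predict_positive_probability(
    toy_embed("The worst, most boring film."), 2, WEIGHTS, BIAS)
print(p_positive, p_negative)
```

The point of the sketch is the ordering of the steps: the classifier never sees raw text, only the numeric vector produced by the embedding step plus any extra columns.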

Check out this tutorial for more details.

Basic similarity search via clustering
Another example we introduce is training a K-means model to partition the search space for a basic approximate search use case.

Figure 1. Workflow demonstrated in this example. (1) The text corpus is fed into the embedding model to generate embeddings. (2-3) A K-means model is trained on the embeddings to generate the search index, which enables fast approximate search. (4) The search query is fed into the same embedding model to generate its embedding. (5-6) The trained K-means model locates the search index (i.e., the cluster number) for the query. (7) A similarity measure is computed between the query embedding and all candidate embeddings in that cluster.

Back in 2020, we published a blog post that presented an approach to document similarity search and clustering, leveraging an open-source embedding model and a workaround to use embeddings for model training. Today, you can accomplish the search task with more streamlined and concise SQL syntax.

To better illustrate, we use the same wind_reports public dataset. Assuming we have used the above textembedding-gecko model to generate embeddings for the “reports” text column, we obtain a new table named semantic_search_tutorial.wind_reports_embedding that has embeddings and original data.

Next, we train a K-means model to partition the search space.

```sql
-- Partition the embedding space into 10 clusters.
CREATE OR REPLACE MODEL `semantic_search_tutorial.clustering`
OPTIONS (model_type = "KMEANS", num_clusters = 10)
AS (
  SELECT text_embedding
  FROM `semantic_search_tutorial.wind_reports_embedding`
);
```

The next step is to use the ML.PREDICT function with the trained K-means model and the query embedding to find the K-means cluster in which the search candidates reside. By computing the cosine distance between the query embedding and the embeddings of the search candidates in the predicted cluster, you can get a set of the items most similar to the query item. An example query is shown below.

```sql
-- Embed the query text, then rank the candidates in the predicted cluster
-- by cosine distance to the query embedding.
WITH query_embedding AS (
  SELECT
    *
  FROM
    ML.GENERATE_TEXT_EMBEDDING(
      MODEL text.embedding_model,
      (
        SELECT
          "TREES DOWN NEAR THE INTERSECTION OF HIGHWAY" AS content
      )
    )
)
SELECT
  q.content AS query_text,
  c.content AS candidate_text,
  ML.DISTANCE(q.text_embedding, c.text_embedding, 'COSINE') AS distance
FROM
  query_embedding AS q,
  semantic_search_tutorial.candidate_embedding_cluster9 AS c
ORDER BY
  distance ASC
LIMIT 20;
```
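The cluster-then-rank idea behind this workflow can also be sketched in plain Python, outside BigQuery. Everything here is an assumption made for illustration: the two-dimensional embeddings, the report texts, and the two centroids are invented, the centroids are hand-picked rather than learned by K-means, and cluster assignment is assumed to use Euclidean distance.

```python
import math


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector (the 'search index')."""
    return min(range(len(centroids)),
               key=lambda i: squared_euclidean(vector, centroids[i]))


def approximate_search(query, candidates, centroids, top_k=3):
    """Rank only the candidates that share the query's cluster."""
    cluster = nearest_centroid(query, centroids)
    in_cluster = [(text, emb) for text, emb in candidates
                  if nearest_centroid(emb, centroids) == cluster]
    scored = [(text, cosine_distance(query, emb)) for text, emb in in_cluster]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


# Hypothetical centroids and candidate embeddings (made up, not real model output).
centroids = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("TREES DOWN ON ROADWAY", [0.9, 0.1]),
    ("LARGE TREE LIMBS DOWN", [0.8, 0.3]),
    ("ROOF DAMAGE TO BARN", [0.1, 0.9]),
    ("POWER LINES DOWN", [0.6, 0.4]),
]
query = [0.95, 0.05]  # pretend embedding of the search text

results = approximate_search(query, candidates, centroids, top_k=2)
for text, distance in results:
    print(text, distance)
```

Because only one cluster is scanned, the search is approximate: a close match that K-means placed in a neighboring cluster would be missed, which is the trade-off for not comparing the query against every row.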

Check out this tutorial for more details.

What’s next?

BigQuery ML text embedding is publicly available in Preview to unlock powerful capabilities for both embedding generation and downstream application tasks. Check out the tutorial for a comprehensive walkthrough of the above examples. For more details, please refer to the documentation.

Cloud BlogRead More
