Sentiment analysis with BigQuery ML

By mullaned2002

March 6, 2023

892

Introduction

We recently announced BigQuery support for sparse features which help users to store and process the sparse features efficiently while working with them. That functionality enables users to represent sparse tensors and train machine learning models directly in the BigQuery environment. Being able to represent sparse tensors is a useful feature because sparse tensors are used extensively in encoding schemes like TF-IDF as part of data pre-processing in NLP applications and for pre-processing images with a lot of dark pixels in computer vision applications.

There are numerous applications of sparse features such as text generation and sentiment analysis. In this blog, we’ll demonstrate how to perform sentiment analysis with the space features in BigQuery ML by training and inferencing machine learning models using a public dataset. This blog also highlights how easy it is to work with unstructured text data on BigQuery, an environment traditionally used for structured data.

Using sample IMDb dataset

Let’s say you want to conduct a sentiment analysis on movie reviews from the IMDb website. For the benefit of readers who want to follow along, we will be using the IMDb reviews dataset from BigQuery public datasets. Let’s look at the top 2 rows of the dataset.

Although the reviews table has 7 columns, we only use reviews and label columns to perform sentiment analysis for this case. Also, we are only considering negative and positive values in the label columns. The following query can be used to select only the required information from the dataset.

code_block[StructValue([(u’code’, u”SELECTrn review,rn label,rnFROM rn `bigquery-public-data.imdb.reviews`rnWHERErn label IN (‘Negative’, ‘Positive’)”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1d6a7af190>)])]

The top 2 rows of the result is as follows:

Methodology

Based on the dataset that we have, the following steps will be carried out:

Build a vocabulary list using the review column

Convert the review column into sparse tensors

Train a classification model using the sparse tensors to predict the label (“positive” or “negative”)

Make predictions on new test data to classify reviews as positive or negative.

Feature engineering

In this section, we will convert the text from the reviews column to numerical features so that we can feed them into a machine learning model. One of the ways is the bag-of-words approach where we build a vocabulary using the words from the reviews and select the most common words to build numerical features for model training. But first, we must extract the words from each review. The following code creates a dataset and a table with row numbers and extracted words from reviews.

code_block[StructValue([(u’code’, u”– Create a dataset named `sparse_features_demo` if doesnu2019t existrnCREATE SCHEMA IF NOT EXISTS sparse_features_demo;rnrnrnrnrn– Select unique reviews with only negative and positive labelsrnCREATE OR REPLACE TABLE sparse_features_demo.processed_reviews AS (rn SELECTrn ROW_NUMBER() OVER () AS review_number,rn review,rn REGEXP_EXTRACT_ALL(LOWER(review), ‘[a-z]{2,}’) AS words,rn label,rn splitrn FROM (rn SELECTrn DISTINCT review,rn label,rn splitrn FROMrn `bigquery-public-data.imdb.reviews`rn WHERErn label IN (‘Negative’, ‘Positive’)rn )rn);”), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1d4fd7af50>)])]

The output table from the query above should look like this:

The next step is to build a vocabulary using the extracted words. The following code creates a vocabulary including word frequency and word index from reviews. For this case, we are going to select only the top 20,000 words to reduce the computation time.

code_block[StructValue([(u’code’, u’– Create a vocabulary using train dataset and select only top 20,000 words based on frequencyrnCREATE OR REPLACE TABLE sparse_features_demo.vocabulary AS (rn SELECTrn word,rn word_frequency,rn word_indexrn FROM (rn SELECTrn word,rn word_frequency,rn ROW_NUMBER() OVER (ORDER BY word_frequency DESC) – 1 AS word_indexrn FROM (rn SELECTrn word,rn COUNT(word) AS word_frequencyrn FROMrn sparse_features_demo.processed_reviews,rn UNNEST(words) AS wordrn WHERErn split = “train”rn GROUP BYrn wordrn )rn )rn WHERErn word_index < 20000 # Select top 20,000 words based on word countrn);’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1d4fd9ddd0>)])]

The following shows the top 10 words based on frequency and their respective index from the resulting table of the query above.

Creating a sparse feature

Now we will use the newly added feature to create a sparse feature in BigQuery. For this case, we aggregate `word_index` and `word_frequency` in each review, which generates a column as ARRAY[STRUCT<int, numerical>] type. Now, each review is represented as ARRAY[(word_index, word_frequency)].

code_block[StructValue([(u’code’, u’– Generate a sparse feature by aggregating word_index and word_frequency in each review.rnCREATE OR REPLACE TABLE sparse_features_demo.sparse_feature AS (rn SELECTrn review_number,rn review,rn ARRAY_AGG(STRUCT(word_index, word_frequency)) AS feature,rn label,rn splitrn FROM (rn SELECTrn DISTINCT review_number,rn review,rn word,rn label,rn splitrn FROMrn sparse_features_demo.processed_reviews,rn UNNEST(words) AS wordrn WHERErn word IN (SELECT word FROM sparse_features_demo.vocabulary)rn ) AS word_listrn LEFT JOINrn sparse_features_demo.vocabulary AS topk_wordsrn ONrn word_list.word = topk_words.wordrn GROUP BYrn review_number,rn review,rn label,rn splitrn);’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1d4eafad10>)])]

Once the query is executed, a sparse feature named `feature` will be created. That `feature` column is an `ARRAY of STRUCT` column which is made of `word_index` and `word_frequency` columns. The picture below displays the resulting table at a glance.

Training a BigQuery ML model

We just created a dataset with a sparse feature in BigQuery. Let’s see how we can use that dataset to train with a machine learning model with BigQuery ML. In the following query, we will train a logistic regression model using the review_number, review, and feature to predict the label:

code_block[StructValue([(u’code’, u’– Train a logistic regression classifier using the data with sparse featurernCREATE OR REPLACE MODEL sparse_features_demo.logistic_reg_classifierrn TRANSFORM (rn * EXCEPT (rn review_number,rn reviewrn )rn )rn OPTIONS(rn MODEL_TYPE=’LOGISTIC_REG’,rn INPUT_LABEL_COLS = [‘label’]rn ) ASrn SELECTrn review_number,rn review,rn feature,rn labelrn FROMrn sparse_features_demo.sparse_featurern WHERErn split = “train”rn;’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1d6a77e190>)])]

Now that we have trained a BigQuery ML Model using a sparse feature, we evaluate the model and tune it as needed.

code_block[StructValue([(u’code’, u’– Evaluate the trained logistic regression classifierrnSELECT * FROM ML.EVALUATE(MODEL sparse_features_demo.logistic_reg_classifier);’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1d4eb1e050>)])]

The score looks like a decent starting point, so let’s go ahead and test the model with the test dataset.

code_block[StructValue([(u’code’, u’– Evaluate the trained logistic regression classifier using test datarnSELECT * FROM ML.EVALUATE(MODEL sparse_features_demo.logistic_reg_classifier,rn (rn SELECTrn review_number,rn review,rn feature,rn labelrn FROMrn sparse_features_demo.sparse_featurern WHERErn split = “test”rn )rn);’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1d9815fc50>)])]

The model performance for the test dataset looks satisfactory and it can now be used for inference. One thing to note here is that since the model is trained on the numerical features, the model will only accept numeral features as input. Hence, the new reviews have to go through the same transformation steps before they can be used for inference. The next step shows how the transformation can be applied to a user-defined dataset.

Sentiment predictions from the BigQuery ML model

All we have left to do now is to create a user-defined dataset, apply the same transformations to the reviews, and use the user-defined sparse features to perform model inference. It can be achieved using a WITH statement as shown below.

code_block[StructValue([(u’code’, u’WITHrn — Create a user defined reviewsrn user_defined_reviews AS (rn SELECTrn ROW_NUMBER() OVER () AS review_number,rn review,rn REGEXP_EXTRACT_ALL(LOWER(review), ‘[a-z]{2,}’) AS wordsrn FROM (rn SELECT “What a boring movie” AS review UNION ALLrn SELECT “I don’t like this movie” AS review UNION ALLrn SELECT “The best movie ever” AS reviewrn )rn ),rnrnrn — Create a sparse feature from user defined reviewsrn user_defined_sparse_feature AS (rn SELECTrn review_number,rn review,rn ARRAY_AGG(STRUCT(word_index, word_frequency)) AS featurern FROM (rn SELECTrn DISTINCT review_number,rn review,rn wordrn FROMrn user_defined_reviews,rn UNNEST(words) as wordrn WHERErn word IN (SELECT word FROM sparse_features_demo.vocabulary)rn ) AS word_listrn LEFT JOINrn sparse_features_demo.vocabulary AS topk_wordsrn ONrn word_list.word = topk_words.wordrn GROUP BYrn review_number,rn reviewrn )rnrnrn– Evaluate the trained model using user defined datarnSELECT review, predicted_label FROM ML.PREDICT(MODEL sparse_features_demo.logistic_reg_classifier,rn (rn SELECTrn *rn FROMrn user_defined_sparse_featurern )rn);’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1d68684210>)])]

Here is what you would get for executing the query above:

And that’s it! We just performed a sentiment analysis on the IMDb dataset from a BigQuery Public Dataset using only SQL statements and BigQuery ML. Now that we have demonstrated how sparse features can be used with BigQuery ML models, we can’t wait to see all the amazing projects that you would create by harnessing this functionality.

If you’re just getting started with BigQuery, check out our interactive tutorial to begin exploring.

Cloud BlogRead More

Previous article5 GKE console features to help you optimize your clusters

Next articleRapidly expand the reach of Spanner databases with read-only replicas and zero-downtime moves

Sentiment analysis with BigQuery ML

Introduction

Using sample IMDb dataset

Methodology

Feature engineering

Creating a sparse feature

Training a BigQuery ML model

Sentiment predictions from the BigQuery ML model

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

What is Istio? The Kubernetes service mesh explained

Google Cloud and Apollo24|7: Building Clinical Decision Support System (CDSS) together

Better application networking and security with CAKES

POPULAR CATEGORY