
Advanced text analyzers and preprocessing functions in BigQuery

BigQuery stores massive amounts of unstructured and structured text data, including customer information and business operations. BigQuery’s Search and Machine Learning (ML) capabilities provide powerful analytical tools for extracting valuable business insights from these text data sets.

Text preprocessing is a critical step in text analysis and information retrieval pipelines, as it transforms raw natural language or unstructured text into machine-friendly formats. It is a prerequisite for many text-based operations, such as full-text search indexing and machine learning pipelines. In many cases, the effectiveness of a text search index is greatly influenced by the quality of the tokenization algorithm chosen. Similarly, the performance of machine learning models relies heavily on the quality of the preprocessed inputs.

We are announcing the public preview of a set of text analysis and preprocessing functions and capabilities in BigQuery. These new features build on top of primitive string functions to become an essential part of text processing, further enhancing the Search and ML experience in BigQuery.

Empowering text searches with analyzers

Fraud investigation scenario

Consider a hypothetical fraud investigation and prevention scenario. During the investigation, it would be beneficial to search business log data to identify suspicious activities associated with the reported transactions. The process involves retrieving entries containing relevant customer information from a logs table generated by day-to-day business activities.

The following information may be of interest:

- customer ID
- access IP address
- email address
- last 4 digits of the credit card number

Using the newly added and highly configurable PATTERN_ANALYZER, we will create a search index to help us search for this specific information.

Constructing a search index configuration

A search index with the PATTERN_ANALYZER allows us to extract and index information that matches a specified RE2 regular expression. In this case, our regular expression will have the following form (note that "?:" denotes a non-capturing group, since we capture the entire pattern as a whole):

(?:<customer_uuid>)|(?:<ip_address>)|(?:<email_address>)|(?:<credit_card_4_digits_number>)

Here is an example of a possible regular expression for each of the above components:

- customer UUID: [a-f0-9]{8}-?[a-f0-9]{4}-?4[a-f0-9]{3}-?[89ab][a-f0-9]{3}-?[a-f0-9]{12}
- IP address: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
- email address: [a-zA-Z][a-zA-Z0-9]+@[a-zA-Z]+[.]com
- last 4 digits of credit card number: (?:X{4}-){3}[0-9]{4}

Putting them all together, we have the following regular expression:

(?:[a-f0-9]{8}-?[a-f0-9]{4}-?4[a-f0-9]{3}-?[89ab][a-f0-9]{3}-?[a-f0-9]{12})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:[a-zA-Z][a-zA-Z0-9]+@[a-zA-Z]+[.]com)|(?:(?:X{4}-){3}[0-9]{4})

We can then apply further processing to the captured tokens to improve the effectiveness of the search index and search usability:

- lowercase the text to enable case-insensitive searches
- remove some known email addresses, such as trusted system robot emails or testing emails
- remove some fixed/known IP addresses, such as localhost

To do this, we can use the token filters option of the analyzer:
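
Drawing from the full analyzer configuration we assemble later in this post, the relevant token filters look like this:

"token_filters": [
  { "normalizer": { "mode": "LOWER" } },
  { "stop_words": [ "[email protected]", "[email protected]" ] },
  { "stop_words": [ "127.0.0.1", "0.0.0.0" ] }
]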

Experimenting with different configurations

Because creating a search index can be a time-consuming and expensive process, we first use the newly added TEXT_ANALYZE function to experiment with various configurations and find the one that works as intended. Below are some example queries to test the configurations.

Since we will use the same analyzer options in our example, we can declare a variable called CUSTOM_OPTIONS so that we can reuse it in multiple queries:

DECLARE CUSTOM_OPTIONS STRING;
SET CUSTOM_OPTIONS = '''{
  "patterns": ["(?:[a-f0-9]{8}-?[a-f0-9]{4}-?4[a-f0-9]{3}-?[89ab][a-f0-9]{3}-?[a-f0-9]{12})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:[a-zA-Z][a-zA-Z0-9]+@[a-zA-Z]+[.]com)|(?:(?:X{4}-){3}[0-9]{4})"],
  "token_filters": [
    { "normalizer": { "mode": "LOWER" } },
    { "stop_words": [ "[email protected]", "[email protected]" ] },
    { "stop_words": [ "127.0.0.1", "0.0.0.0" ] }
  ]
}''';

Example 1: text containing both UUID and IP address

SELECT TEXT_ANALYZE(
  "the uuid is f8161363-6166-44ed-b99d-776a6ebd183c and the ip address is 198.95.180.234",
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => CUSTOM_OPTIONS
) AS tokens;

Result:

Note: in this example, we use two separate lists of stop words for illustration purposes. In practice, combining them into a single list would make search execution faster.
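
For reference, a combined single-list variant of the two stop-word filters above would look like this:

{ "stop_words": [ "[email protected]", "[email protected]", "127.0.0.1", "0.0.0.0" ] }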

Example 2: text containing multiple email addresses and an obfuscated credit card number

SELECT TEXT_ANALYZE(
  "email addresses are [email protected] and [email protected] and the credit card number is XXXX-XXXX-XXXX-7193",
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => CUSTOM_OPTIONS
) AS tokens;

Result:

We can continue iterating until we identify a good configuration for our use cases. Let’s say that in addition to the critical information for our investigation, we also want to add regular words to our index. This way, the searches for normal words can also make use of the search index. We can do this by appending the following regular expression to the end of our pattern:

\b\w+\b

So our regular expression becomes:

(?:[a-f0-9]{8}-?[a-f0-9]{4}-?4[a-f0-9]{3}-?[89ab][a-f0-9]{3}-?[a-f0-9]{12})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:[a-zA-Z][a-zA-Z0-9]+@[a-zA-Z]+[.]com)|(?:(?:X{4}-){3}[0-9]{4})|(?:\b\w+\b)

Since we are indexing normal words now, it is also useful to remove some common English words so they don’t pollute the index with unnecessary entries. We do this by specifying the stop_words list in the analyzer options:

{“stop_words”: [“a”, “an”, “and”, “are”, “is”, “the”]}

The latest configuration, along with verification using the TEXT_ANALYZE function, is shown below (note that we need to escape the backslashes in our SQL statements):

-- We already declared CUSTOM_OPTIONS above, so we don't need to redeclare it.
-- DECLARE CUSTOM_OPTIONS STRING;
SET CUSTOM_OPTIONS = '''{
  "patterns": ["(?:[a-f0-9]{8}-?[a-f0-9]{4}-?4[a-f0-9]{3}-?[89ab][a-f0-9]{3}-?[a-f0-9]{12})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:[a-zA-Z][a-zA-Z0-9]+@[a-zA-Z]+[.]com)|(?:(?:X{4}-){3}[0-9]{4})|(?:\\\\b\\\\w+\\\\b)"],
  "token_filters": [
    { "normalizer": { "mode": "LOWER" } },
    {
      "stop_words": [
        "[email protected]",
        "[email protected]",
        "127.0.0.1",
        "0.0.0.0",
        "a", "an", "and", "are", "is", "the"
      ]
    }
  ]
}''';
SELECT TEXT_ANALYZE(
  "the uuid is f8161363-6166-44ed-b99d-776a6ebd183c and the ip address is 198.95.180.234",
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => CUSTOM_OPTIONS
) AS tokens;

Result:

SELECT TEXT_ANALYZE(
  "email addresses are [email protected] and [email protected] and the credit card number is XXXX-XXXX-XXXX-7193",
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => CUSTOM_OPTIONS
) AS tokens;

Result:

Creating the search index

Once we conclude our testing effort and find the suitable configuration, we can create the search index:

CREATE SEARCH INDEX MyIndex
ON MyTable(ALL COLUMNS)
OPTIONS(
  analyzer = 'PATTERN_ANALYZER',
  analyzer_options = (CUSTOM_OPTIONS)
);

Using the search index

As a reminder of how the SEARCH function works: it applies the analyzer (with the specified configuration) to both the searched data and the input search query, and returns TRUE if the tokens from the search query are a subset of the tokens from the searched data.
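
As a minimal illustration of this subset semantics (our own example over string literals, using the default LOG_ANALYZER and no index):

SELECT
  SEARCH('the quick brown fox jumps', 'fox jumps') AS has_all_query_tokens,  -- TRUE: both query tokens appear in the data
  SEARCH('the quick brown fox jumps', 'fox cat') AS has_missing_token;       -- FALSE: 'cat' is not among the data tokens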

While the SEARCH function works without indexes, it can take advantage of the search index for better performance if the function and the index share the same analyzer configurations. We set the analyzer and analyzer_options arguments in the SEARCH function below:

-- Search for all log entries containing a specific email address.
SELECT log_entry
FROM MyTable
WHERE SEARCH(
  log_entry, "[email protected]",
  analyzer => "PATTERN_ANALYZER",
  analyzer_options => CUSTOM_OPTIONS
);

IP addresses, customer UUIDs, and other regular words (excluding the stop words) are also indexed, so searches for those can be accelerated by the search index in a similar manner.

In this example, we can see that the new PATTERN_ANALYZER is a powerful tool to help build an effective search index for our fraud investigation scenario. The text analyzers and their options are designed to be flexible to accommodate various use cases. With correct configurations for search indexes, query performance can be greatly improved as presented in this blog post.

Text analyzers with BigQuery ML

We are also announcing two new text preprocessing functions: ML.TF_IDF and ML.BAG_OF_WORDS. Like other preprocessing functions, these two new functions can be used in the TRANSFORM clause to create ML models with data preprocessing. The following example shows how to use them together with text analyzers.

The ML use cases for text analyzers focus more on extracting all text tokens and performing Unicode normalization on them before vectorization. This is achieved via combining the above functions with the newly introduced TEXT_ANALYZE function.
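
As a rough sketch of that combination, ML.BAG_OF_WORDS can be applied directly to the token array returned by TEXT_ANALYZE. The inline sample string and the standalone usage with OVER() outside a TRANSFORM clause are our own illustration; the exact output structure follows the function's documented schema:

-- Tokenize a sample string, then vectorize the resulting token array.
SELECT
  ML.BAG_OF_WORDS(TEXT_ANALYZE('the quick brown fox and the lazy dog')) OVER() AS features;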

While BigQuery offers the ability to use pre-trained ML models for text vectorization, the aforementioned statistics-based functions offer simplicity, interpretability, and lower computational requirements. Additionally, when dealing with new domains where extensive domain-specific data for fine-tuning is lacking, statistics-based methods often outperform pre-trained model-based methods.

In this example, we will explore building a machine learning model to classify news into 5 categories: tech, business, politics, sport, and entertainment. We will use the BBC news public dataset, which is hosted in BigQuery at bigquery-public-data.bbc_news, and carry out the following steps:

1. Tokenize news as raw strings and remove English stop words.
2. Train a classification model using the tokenized data and a custom vectorization function in BigQuery ML.
3. Make predictions on new test data to classify news into categories.

Raw text preprocessing with TEXT_ANALYZE

The first step in building the classifier is to tokenize the raw news text and preprocess the tokens. The default LOG_ANALYZER with its default list of delimiters is often good enough, so we can use it without any further configuration.
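
For example, a quick sanity check on a sample string of our own (a minimal sketch; when the analyzer argument is omitted, TEXT_ANALYZE defaults to the LOG_ANALYZER):

SELECT TEXT_ANALYZE('BBC News: UK economy shows signs of growth') AS tokens;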

Additionally, our text data may contain Unicode text, so ICU normalization would be a useful preprocessing step.

{ “normalizer”: { “mode”: “ICU_NORMALIZE” } }

Next, we filter out some common English stop words from the tokens list:

{“stop_words”: [“is”, “are”, “the”, “and”, “of”, “be”, “to”]}

Putting it all together, we use TEXT_ANALYZE to preprocess the raw data and materialize the results into a table as input for model training:

DECLARE CUSTOM_OPTIONS STRING;
SET CUSTOM_OPTIONS = '''{
  "token_filters": [
    { "normalizer": { "mode": "ICU_NORMALIZE" } },
    { "stop_words": ["is", "are", "the", "and", "of", "be", "to"] }
  ]
}''';

CREATE OR REPLACE TABLE `myproject`.`mydataset`.`tokenized` AS (
  SELECT category, TEXT_ANALYZE(body, analyzer_options => CUSTOM_OPTIONS) AS body
  FROM `bigquery-public-data`.`bbc_news`.`fulltext`
);

We can take a look at an example tokenized row to get an idea of the tokenization result:

SELECT ARRAY_SLICE(body, 0, 5)
FROM `myproject`.`mydataset`.`tokenized`
LIMIT 1;

Result:

Model training with TRANSFORM clause

Now we can use the tokenized data to train our classifier. In this example we use the random forest classifier together with the TRANSFORM clause using ML.BAG_OF_WORDS as the vectorization function:

CREATE OR REPLACE MODEL
  `myproject`.`mydataset`.`test_model`
  TRANSFORM(
    ML.BAG_OF_WORDS(body) OVER() AS body,
    label
  )
OPTIONS (
  MODEL_TYPE = 'RANDOM_FOREST_CLASSIFIER',
  NUM_PARALLEL_TREE = 50,
  TREE_METHOD = 'HIST',
  EARLY_STOP = FALSE,
  SUBSAMPLE = 0.85,
  INPUT_LABEL_COLS = ['label']
) AS
SELECT
  body, category AS label
FROM `myproject`.`mydataset`.`tokenized`;

After training, we can see that the created model has good performance.
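
One way to verify this, sketched below, is to run ML.EVALUATE against the newly created model (the model name follows the placeholders used above):

SELECT *
FROM ML.EVALUATE(MODEL `myproject.mydataset.test_model`);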

Using the model for inference

Finally, we can test the model using an excerpt from a sports article outside of the training data. Note that we need to use the exact same TEXT_ANALYZE options for preprocessing.

WITH ExampleData AS (
  SELECT "Baseball, softball, flag football and lacrosse, sports with roots and a high level of popularity in the United States, have been proposed for the 2028 Games, alongside two sports with more international appeal, squash and cricket. The International Olympic Committee will vote on the proposals by the Los Angeles organizing committee later this month." AS body
)
SELECT
  predicted_label, predicted_label_probs, body
FROM
  ML.PREDICT(
    MODEL `myproject.mydataset.test_model`,
    (SELECT
       TEXT_ANALYZE(body, analyzer_options => CUSTOM_OPTIONS) AS body
     FROM ExampleData)
  );

Result:

This example illustrates that TEXT_ANALYZE and the BigQuery ML vectorization functions are great tools for preprocessing data and creating machine learning models. This is a significant improvement over the multiple steps presented in our previous blog post on sentiment analysis with BigQuery ML.

Conclusion

The new features added to our text analysis toolset are valuable additions to the existing functionality. They provide users with more flexibility, power, and insight into their data. The ability to perform custom text analysis in a variety of ways makes our toolset even more comprehensive and user-friendly.

We invite you to try out the new features and see for yourself how they can enhance your text preprocessing and analysis workflow. We hope that these new features will help you to gain even deeper insights into your data. We are committed to continuing to develop and improve our text analysis tools to meet your needs.
