In today’s data-driven world, the ability to leverage real-time data for machine learning applications is a game-changer. Two key players in this field, Striim and Google BigQuery ML, offer a powerful combination to make this possible. Striim serves as a real-time data integration platform that seamlessly and continuously moves data from diverse data sources to destinations such as cloud databases, messaging systems, and data warehouses, making it a vital component in modern data architectures. Meanwhile, Google BigQuery ML is a machine learning service provided by Google Cloud, allowing you to create and deploy machine learning models using SQL-like syntax directly within the BigQuery environment.
Real-time data processing in the world of machine learning allows data scientists and engineers to focus on model development and monitoring. Striim’s strength lies in its capacity to connect to over 150 data sources, enabling real-time data acquisition from virtually any location and simplifying data transformations. This empowers businesses to expedite the creation of machine learning models and make data-driven decisions and predictions swiftly, ultimately enhancing customer experiences and optimizing operations.
Before we embark on the journey of integrating Striim with Google BigQuery ML for real-time data processing in machine learning, there are a few prerequisites that you should ensure are in place.
Striim Instance: To get started, you need to have a Striim instance created and have access to it. Striim is the backbone of this integration, and having a working Striim instance is essential for setting up the data pipelines and connecting to your source databases. For a free trial, please sign up for a Striim Cloud trial at https://go2.striim.com/free-trial
Basic Understanding of Striim: Familiarity with the basic concepts of Striim and the ability to create data pipelines is crucial. You should understand how to navigate the Striim environment, configure data sources, and set up data flows. If you’re new to Striim or need a refresher on its core functionalities, you can review the documentation and resources available at https://github.com/schowStriim/striim-PoC-migration.
Section 1: Connecting to the Source Database
The first step in this integration journey is connecting Striim to a PostgreSQL source database that contains raw machine learning data. In this blog, we will focus on a PostgreSQL database. Inside this database, we have an iris_dataset table with the following column structure.
| Column | Type | Collation | Nullable | Default |
| id | integer | | not null | nextval(‘iris_dataset_id_seq’::regclass) |
| sepal_length | double precision | | | |
| sepal_width | double precision | | | |
| petal_length | double precision | | | |
| petal_width | double precision | | | |
| species | text | | | |
This table contains raw data related to the characteristics of different species of iris flowers. It’s worth noting that this data has been gathered from a public source, and as a result, there are NULL values in some fields, and the labels for the species are represented numerically. Specifically, in this dataset, 1 represents “setosa,” 2 represents “versicolor,” and 3 represents “virginica.”
To read raw data from our PostgreSQL database, we will use Striim’s PostgreSQL Reader adapter, which captures all operations and changes from the PostgreSQL log files.
To get the PostgreSQL Reader adapter created, we will drag and drop it from the Component section, provide the Connection URL, username, and password, and specify the iris_dataset table in the Tables property. The PostgreSQL Reader adapter utilizes the wal2json plugin of the PostgreSQL database to read the log files and capture the changes. Therefore, as a part of the setup, we need to create a replication slot in the source database and then provide its name in the replication slot property.
Section 2: Creating Striim Continuous Query (CQ) Adapters
CQ adapters play a crucial role in this integration, as they help transform and prepare the data for machine learning in BigQuery ML. We will walk you through the steps of creating CQ adapters and how Striim transforms the data in-flight once we read it from the Postgres database.
1. Handling NULL Values:
We will build a CQ adapter that transforms NULL values into a float 0.0, ensuring the consistency and integrity of your data. Here’s the SQL query for this transformation:
SELECT * FROM pg_output_ml
SELECT * FROM pg_output_ml
MODIFY(data = CASE WHEN data IS NULL THEN TO_FLOAT(0.0) ELSE data END,
data = CASE WHEN data IS NULL THEN TO_FLOAT(0.0) ELSE data END, data = CASE
WHEN data IS NULL THEN TO_FLOAT(0.0) ELSE data END, data = CASE WHEN data
IS NULL THEN TO_FLOAT(0.0) ELSE data END);
We will attach the PostgreSQL Reader adapter to this CQ for seamless data processing:
2. Converting Numeric Species Classes to Text:
We will build another CQ adapter to convert numeric species classes to text classes, making the data more human-readable and interpretable for the ML model.
FROM pg_ml_data_output t;
We will attach the Data_ML_Transform CQ adapter to this CQ for label processing:
3. Data Transformation:
Finally, we will create the last CQ adapter to extract the final data and assign it to variable/column names, making it ready for integration with BigQuery ML.
SELECT data as id, data as sepal_length, data as sepal_width, data as petal_length, data as petal_width, data as species
FROM transformed_data t;
We will attach the Label_ML_Transform CQ adapter to this CQ to assign data field to variables:
Section 3: Attaching CQ to BigQuery Writer Adapter
Now that we’ve prepared the data using CQ adapters, we need to connect them to the BigQuery Writer adapter. The BigQuery Writer adapter is the gateway for streaming data into BigQuery. By clicking on the wave icon and attaching the BigQuery Adapter, you establish a connection between the last and previous adapters and BigQuery. In the Tables property, we utilize the ColumnMap to connect the transformed data with the appropriate BigQuery columns:
DMS_SAMPLE.iris_dataset ColumnMap(id=id, sepal_length=sepal_length, petal_length=petal_length, petal_width=petal_width, species=species)
Additionally, we specify the Project ID and supply the Service Account Key JSON file to give Striim permission to connect to Bigquery:
Section 4: Execute the CDC data pipeline to replicate the data to BigQuery
After successfully executing the CDC data pipeline, the Application Progress page indicates that we’ve read 30 ongoing changes from our source database and written these 30 records and changes to my BigQuery database. At the bottom, you can also preview the data flowing from the source to the target by clicking on the ‘Wave’ icon and then the ‘Eye’ icon located between the source and target components. This is one sample of the raw data:
Id | sepal_length | sepal_width | petal_length | petal_width | species
1 5.1 3.5 1.4 NULL “1”
Section 5: Building a BigQuery ML Model
With your data flowing seamlessly into BigQuery, it’s time to harness the power of Google Cloud’s machine learning service. BigQuery ML provides a user-friendly environment for creating machine learning models, without the need for extensive coding or external tools. We will provide you with step-by-step instructions on building a logistic machine learning model within BigQuery. This will include examples of model creation, training, and making predictions, giving you a comprehensive overview of the process.
Verify that the data has been populated correctly in the BigQuery iris_dataset table.
Create a logistic regression model from the iris_dataset table by executing this query:
The query you provided is creating a logistic regression model using the data from the iris_dataset table in BigQuery ML. Here’s a breakdown of what this query is doing:CREATE MODEL IF NOT EXISTS: This part of the query creates a machine learning model if it doesn’t already exist with the specified name, which is `striim_bq_model` in this case.
OPTIONS: This section defines various options and hyperparameters for the model. Here’s what each of these options means:
– model_type=’logistic_reg’: Specifies that you are creating a logistic regression model.
– ls_init_learn_rate=.15: Sets the initial learning rate for the model to 0.15.
– l1_reg=1: Applies L1 regularization with a regularization strength of 1.
– max_iterations=20: Limits the number of training iterations to 20.
– input_label_cols=[‘species’]: Specifies the target variable for the logistic regression, which is ‘species’ in this case.
– data_split_method=’seq’: Uses a sequential data split method for model training and evaluation.
– data_split_eval_fraction=0.3: Allocates 30% of the data for model evaluation.
– data_split_col=’id’: Uses the ‘id’ column to split the data into training and evaluation sets.
AS: This keyword indicates the start of the SELECT statement, where you define the data source for your model.
SELECT: This part of the query selects the features and target variable from the iris_dataset table, which is the data used for training and evaluating the logistic regression model.
– id, sepal_length, sepal_width, petal_length, petal_width are the feature columns used for model training.
– species is the target variable or label column that the model will predict.
In summary, this query creates a logistic regression model named striim_bq_model using the iris_dataset data in BigQuery ML. It specifies various model settings and hyperparameters to train and evaluate the model. The model’s goal is to predict the ‘species’ based on the other specified columns as features.
Evaluate the model by executing this query:
SELECT * FROM ML.EVALUATE(MODEL
`ancient-yeti-175123.DMS_SAMPLE.striim_bq_model`, (SELECT * FROM
This query performs an evaluation of a machine learning model called striim_bq_model that was previously created in BigQuery ML. Here’s a breakdown of what this query does:SELECT * FROM ML.EVALUATE: This part of the query is using the ML.EVALUATE function, which is a BigQuery ML function used to assess the performance of a machine learning model. It evaluates the model’s predictions against the actual values in the test dataset.
(MODEL ancient-yeti-175123.DMS_SAMPLE.striim_bq_model, … ): Here, you specify the model to be evaluated. The model being evaluated is named striim_bq_model, and it resides in the dataset ancient-yeti-175123.DMS_SAMPLE.
(SELECT * FROM `ancient-yeti-175123.DMS_SAMPLE.iris_dataset): This part of the query selects the data from the iris_dataset, which is used as the test dataset. The model’s predictions will be compared to the actual values in this dataset to assess its performance.
In summary, this query evaluates the striim_bq_model using the data from the iris_dataset to assess how well the model makes predictions. The results of this evaluation will provide insights into the model’s accuracy and performance.
Make a prediction from the striim_bq_model:
SELECT * FROM ML.PREDICT(MODEL `ancient-yeti-175123.DMS_SAMPLE.striim_bq_model`,(SELECT 5.1 as sepal_length, 2.5 as petal_length, 3.0 as petal_width, 1.1 as sepal_width))
In the screenshot above, you can see that the `striim_bq_model` provided us with information such as the predicted species, probabilities for the predicted species, and the feature column values used in our ML.PREDICT function.
Integrating Striim with BigQuery ML empowers data scientists and ML engineers by freeing them from the need to repeatedly gather data from the source and execute the same data cleaning processes. Instead, they can focus solely on building and monitoring machine learning models. This powerful combination accelerates decision-making, enhances customer experiences, and streamlines operations. We invite you to explore this integration for your real-time machine learning projects, as it has the potential to revolutionize how you leverage data for business insights and predictions. Embrace the future of real-time data processing and machine learning with Striim and BigQuery ML!
The post Integrating Striim with BigQuery ML: Real-time Data Processing for Machine Learning appeared first on Striim.