Real-time Data Integration from Oracle to Google BigQuery Using Striim

By mullaned2002

November 1, 2022

385

Editor’s notes: In this guest blog, we have the pleasure of inviting Alok Pareek, Founder & EVP Products, at Striim to share latest experimental results from a performance study on real-time data integration from Oracle to Google Cloud BigQuery using Striim.

Relational databases like Oracle are designed to store data, but they aren’t well suited for supporting analytics at scale. Google Cloud BigQuery is a serverless, scalable cloud data warehouse that is ideal for analytics use cases. To ensure timely and accurate analytics, it is essential to be able to continuously move data streams to BigQuery with minimal latency.

The best way to stream data from databases to BigQuery is through log-based Change Data Capture(CDC). Log-based CDC works by directly reading the transaction logs to collect DML operations, such as inserts, updates, and deletes. Unlike other CDC methods, log-based CDC provides a non-intrusive approach to streaming database changes that puts minimal load on the database.

Striim — a unified real-time data integration and streaming platform — comes with out-of-the-box log-based CDC readers that can move data from various databases (including Oracle) to BigQuery in real-time. Striim enables teams to act on data quickly, producing new insights, supporting optimal customer experiences, and driving innovation.

In this blog post, we will outline experimental results cited in Striim’s recent white paper, Real-Time Data Integration from Oracle to Google BigQuery: A Performance Study.

Building a Data Pipeline from Oracle to Google BigQuery with Striim: Components

We used the following components to build a data pipeline to move data between an Oracle database to BigQuery in real time:

Oracle CDC Adapters

A Striim adapter is a process that connects the Striim platform to a specific type of external application or file. Adapters enable various data sources to be connected to target systems with streaming data pipelines for real-time data integration.

Striim comes with two Oracle CDC adapters to help manage different workloads.

LogMiner-based Oracle CDC Reader uses Oracle LogMiner to ingest database changes on the server side and replicate them to the streaming platform. This adapter is ideal for low and medium workloads.

OJet adapter uses a high-performance log mining API to support high volumes of database changes on the source and replicate them in real time. This adaptor is ideal for high volume high throughput CDC workloads.

With two types of Oracle adapters to choose from, when is it advisable to use one over the other?

Our results show that if your DB workload profile is between 20GB and 80GB of CDC data per hour, the LogMiner based Oracle CDC reader is a good choice. If you work with a higher amount of data, then the OJet adapter is better; currently, it’s the fastest Oracle CDC Reader available. Here’s a table and chart that shows the latency (read-lag) for both adapters:

BigQuery Writer

Striim’s BigQuery Writer is designed to save time and storage; it takes advantage of partitioned tables on the target BigQuery system and supports partition pruning in its merge queries.

Database Workload

For our experiment, we used a custom-built, high-scale database workload simulation. This workload, SwingerMultiOps, is based on Swingbench — a popular workload for Oracle databases. It’s a multithreaded JDBC (Java Database Connectivity) application that generates concurrent DB sessions against the source database. We took the Order Entry (OE) schema of the Swingbench workload. In SwingerMultiOps, we continued to add more tables until we reached a total of 50 tables. Each of these tables comprised of varying data types.

Building the Data Pipeline: Steps

We built the data pipeline for our experiment following these steps:

1. Configure the source database and profile the workload

Striim’s Oracle adapters connect to Oracle server instances to mine for redo data. Therefore it’s important to have the source database instance tuned for optimum redo mining performance. Here’s what you need to keep in mind about the configuration:

Profile the DB workload to measure the load it generates on the source database

Redo log sizes to a reasonably large value of 2G per log group

For the OJet adapter, set a large size for the DB streams_pool_size to mine redo as quickly as possible

For an extremely high CDC data rate of around 150 Gb/hour, set streams_pool_size to 4G

2. Configure the Oracle adapter

For both adapters, default settings are enough to get started. The only configuration required is to set the DB endpoints to read data from the source database. Based on your need, you can use Striim to perform any of the following:

Handle large transactions

Read and write data to a downstream database

Mine from a specific SCN or timestamp

Regardless of which Oracle adapter you choose, only one adapter is needed to collect all data streams from the source database. This practice helps to cut the overhead incurred by both adapters.

3. Configure the BigQuery Writer

Use BigQuery Writer to configure how your data moves from source to database. For instance, you can set your writers to work with a specified dataset to move large amounts of data in parallel.

For performance improvement, you can use multiple BigQuery writers to integrate incoming data in parallel. Using a router ensures that events are distributed such that a single event isn’t sent to multiple writers.

Tuning the number of writers and their properties helps to ensure that data is moved from Oracle to BigQuery in real time. Since we’re dealing with large volumes of incoming streams, we configure 20 BigQuery Writers in our experiment. There are many other BigQuery Writer properties that can help you to move and control data. You can learn about them in detail here.

How to Execute the Striim App and Analyze Results

We used a Google BigQuery dataset to run our data integration infrastructure. We performed the following tasks to run our simulation and capture results for analysis:

Start the Striim app on the Striim server

Start monitoring our app components using the Tungsten Console by passing a simple script

Start the Database Workload

Capture all DB events in the Striim app, and let the app commit all incoming data to the BigQuery target

Analyze the app performance

The Striim UI image below shows our app running on the Striim server. From this UI, we can monitor the app throughput and latency in real time.

Results Analysis: Comparing the Performance of two Oracle Readers

At the end of the DB workload run, we looked at our captured performance data and analyzed the performance. Details are tabulated below for each of the source adapter types.

*LEE => Lag End-to-End

The charts below show how the CDC reader lag varies with the input rate as the workload progresses on the DB server.

Lag chart for Oracle Reader:

Lag chart for OJet Reader:

Use Striim to Move Data in Real Time to Google Cloud BigQuery

This experiment showed how to use Striim to move large amounts of data in real time from Oracle to BigQuery. Striim offers two high-performance Oracle CDC readers to support data streaming from Oracle databases. We demonstrated that Striim’s OJet Oracle reader is optimal for larger workloads, as measured by read-lag, end-to-end lag, and CPU and memory utilization. For smaller workloads, Striim’s LogMiner-based Oracle reader offers excellent performance. For more in-depth information, please refer to the white paper or contact Striim directly.

Cloud BlogRead More

Previous articleNavigate more sustainably and optimize for fuel savings with eco-friendly routing

Next articleIntroducing reCAPTCHA Enterprise’s Mobile SDK to help protect iOS, Android apps

Real-time Data Integration from Oracle to Google BigQuery Using Striim

Building a Data Pipeline from Oracle to Google BigQuery with Striim: Components

Building the Data Pipeline: Steps

How to Execute the Striim App and Analyze Results

Results Analysis: Comparing the Performance of two Oracle Readers

Use Striim to Move Data in Real Time to Google Cloud BigQuery

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

Use Amazon SageMaker Studio to build a RAG question answering solution with Llama 2, LangChain, and Pinecone for fast experimentation

Much Ado About AI: Three Days at the Gartner Data and Analytics Summit in London

The evolution of multitenancy for cloud services

POPULAR CATEGORY