Getting started with Terraform and Datastream: Replicating Postgres data to BigQuery

By mullaned2002

February 23, 2023

738

Consider the case of an organization that has accumulated a lot of data, stored in various databases and applications. Producing analytics reports takes a long time because of the complexity of the data storage. The team decides to replicate all the data to BigQuery in order to increase reporting efficiency. Traditionally this would be a large, complex and costly project that can take a long time to complete.

Instead of painstakingly setting up replication for each data source, the team can now use Datastream and Terraform. They compile a list of data sources, create a few configuration files according to the organization’s setup, and voila! Replication begins and data starts to appear in BigQuery within minutes.

Datastream is Google Cloud’s serverless and easy-to-use change data capture and replication service. If you are unfamiliar with Datastream, we recommend this post for a high-level overview, or read our latest Datastream for BigQuery launch announcement.

Terraform is a popular Infrastructure as code (IaC) tool. Terraform enables infrastructure management through configuration files, which makes management safer, more consistent and easily automatable.

Launched in mid February 23’, the Terraform support in Datastream unblocks and simplifies some exciting use cases, such as:

Policy compliance management – Terraform can be used to enforce compliance and governance policies on the resources that teams provision.

Automated replication process – Terraform can be used to automate Datastream operations. This can be useful when you need automated replication, replication from many data sources, or replication of a single data source to multiple destinations.

Using Terraform to set up Datastream replication from PostgreSQL to BigQuery

Let’s look at an example where the data source is a PostgreSQL database, and review the process step by step.

Limitations

Datastream will only replicate data to the BigQuery data warehouse configured in the same Google Cloud project as Datastream, so make sure that you create Datastream resources in the same project you want your data in.

Requirements

Datastream API needs to be enabled on the project before we continue. Go to the API & Services page in the Google Cloud console, and make sure that the Datastream API is enabled.

Make sure you have the Terraform cli installed – Terraform cli installation guide

It’s possible to follow the steps in this blog with either a MySQL or Oracle database with a few slight modifications. Just skip the Postgres configuration section, and use our MySQL or Oracle configuration guides instead.

You will naturally need a Postgres database instance with some initial data. If you want to set up a new Postgres instance, you can follow the steps in the Cloud SQL for Postgres quickstart guide.

We will need to make sure that PostgreSQL is configured for replication with Datastream. This includes enabling logical replication and optionally creating a dedicated user for Datastream. See our PostgreSQL configuration guide. Make sure to note the replication slot and publication names from this step, as we will need them to configure the replication later on.

You will also need to set up connectivity between your database and Datastream. Check the Network connectivity options guide, and find the connectivity type that fits your setup.

Configuring the replication with Terraform

We will start by creating Datastream Connection Profiles, which store the information needed to connect to the source and destination (e.g. hostname, port, user, etc.).

To do this, we will start by creating a new .tf file in an empty directory, and adding the following configurations to it:

code_block[StructValue([(u’code’, u’resource “google_datastream_connection_profile” “source” {rn display_name = “Postgresql Source”rn location = “us-central1″rn connection_profile_id = “source”rnrn postgresql_profile {rn hostname = “HOSTNAME”rn port = 5432rn username = “USERNAME”rn password = “PASSWORD”rn database = “postgres”rn }rn}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ee66c212bd0>)])]

In this example, we create a new Connection Profile that points to a Postgres instance. Edit the configuration with your source information and save it. For other sources and configurations see the Terraform Datastream Connection Profile documentation.

Next, let’s define the Connection Profile for the BigQuery destination:

code_block[StructValue([(u’code’, u’resource “google_datastream_connection_profile” “destination” {rn display_name = “BigQuery Destination”rn location = “us-central1″rn connection_profile_id = “destination”rnrn bigquery_profile {}rn}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ee66cf3b890>)])]

We now have the source and destination configured, and are ready to configure the replication process between them. We will do that by defining a Stream, which is a Datastream resource representing the source and destination replication.

code_block[StructValue([(u’code’, u’resource “google_datastream_stream” “stream” {rn display_name = “Postgres to BigQuery Stream”rn location = “us-central1″rn stream_id = “stream”rn desired_state = “RUNNING”rnrn source_config {rn source_connection_profile = google_datastream_connection_profile.source.idrn postgresql_source_config {rnt publication = “PUBLICATION_NAME”rnt replication_slot = “REPLICATION_SLOT”rnt}rn }rnrn destination_config {rn destination_connection_profile = google_datastream_connection_profile.destination.idrn bigquery_destination_config {rn data_freshness = “900s”rn source_hierarchy_datasets {rn dataset_template {rn location = “us-central1″rn }rn }rn }rn }rnrn backfill_all {}rn}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ee65a342850>)])]

In this configuration, we are creating a new Stream and configuring the source and destination Connection Profiles and properties. Some features to note here:

Backfill ( backfill_all ) – means that Datastream will replicate an initial snapshot of historical data. This can be configured to exclude specific tables.

Replicating a subset of the source – you can specify which data should be included or excluded from the Stream using the include / exclude lists – see more in the API docs

Edit the configuration with the source Postgres publication and replication slot that you created in the initial setup. For other sources and configurations see the Terraform Stream documentation.

Running the Terraform configuration

Now that we have our configuration ready, it’s time to run it with the Terraform CLI. For that, we can use Cloud Shell which has terraform CLI installed and configured with the permissions needed for your project.

You can also prepare a local environment, by adding this to your .tf file.

code_block[StructValue([(u’code’, u’terraform {rn required_providers {rn google = {rn source = “hashicorp/google”rn version = “4.53.0”rn }rn }rn}rnrnprovider “google” {rn credentials = file(“PATH/TO/KEY.json”)rn project = “PRROJECT_ID”rn region = “us-central1″rn}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ee65a342050>)])]

Start by running the terraform init command to initialize terraform in your configuration directory. Then run the terraform plan command to check and validate the configuration.

Now lets run terraform apply to apply the new configuration

If all went well, you should now have a running Datastream Stream! Go to the Datastream console to manage your Streams, and to the BigQuery console and check that the appropriate data sets were created.

When you’re done, you can use terraform destroy to remove the created Datastream resources.

Something went wrong?

You can set: export TF_LOG=DEBUG flag to see debug logs for the Terraform CLI. See Debugging Terraform for more.

Automating multiple replications

To automate terraform configurations, we can make use of input variables, the count argument and the element function. A simple variable that holds a sources list can look like this:

code_block[StructValue([(u’code’, u’variable “sources” {rn type = list(object({rn name = stringrn hostname = stringrn …rn publication = stringrn replication_slot = stringrn }))rn}’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ee65a342790>)])]

Add a count field to the resources you want to automate (source, stream and maybe destination). Replace values from the variable to:

code_block[StructValue([(u’code’, u’count = length(var.sources)rn…rnconnection_profile_id = “source_${element(var.sources, count.index).name}”‘), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3ee65a342e90>)])]

Additional Resources

Datastream documentation

Streamline your real-time data pipeline with Datastream and MongoDB

Using Datastream to unify data for machine learning and analytics

Migrate from Oracle to PostgreSQL with minimal downtime with Datastream

Cloud BlogRead More

Previous articleTraining a PyTorch Model with DataLoader and Dataset

Next articleCareer tips and no-cost training for the next generation of cloud technology

Getting started with Terraform and Datastream: Replicating Postgres data to BigQuery

Using Terraform to set up Datastream replication from PostgreSQL to BigQuery

Configuring the replication with Terraform

Running the Terraform configuration

Something went wrong?

Automating multiple replications

Additional Resources

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

The greenest energy is the energy you don’t use

Using AI to transform traditional approaches to online learning

Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart

POPULAR CATEGORY