Friday, March 29, 2024

Data movement for the masses with Dataflow Templates

It’s 4 PM on a Friday afternoon, and your mind has already checked out for the weekend. Just as you are about to close your laptop, you see an e-mail come in from your engineering manager. You dread what lies ahead.

“Our data science team needs to analyze streaming data from our Kafka cluster. They need the data in BigQuery. Can you deliver this by Monday morning?”

Sounds simple enough.

You might be tempted to write an ETL script that pulls data from the Kafka cluster every 30 minutes. But that suddenly becomes complicated when you have to introduce logic for retries. What if data written to your BigQuery table is not in the right format? And what if users are looking to filter a subset of the inbound data or convert certain fields into a different format?
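To make the retry problem concrete, here is a minimal sketch of the kind of helper a homegrown ETL script ends up needing. It is illustrative only: `with_retries` and its parameters are hypothetical names, not part of Dataflow or any Google library, and the backoff values are arbitrary assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying failures with exponential backoff and jitter.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The upper bound doubles on each attempt and is capped at max_delay;
            # a random wait in [0, bound] avoids synchronized retry storms.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))
    raise AssertionError("unreachable")
```

Even this small helper leaves open questions a real pipeline must answer, such as which errors are safe to retry and what happens to records that never succeed.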

Then the rest of the requirements arrive. What about the non-functional requirements that are table stakes for any production data pipeline, such as monitoring and logging? Not to mention the operational challenge of scaling a homegrown ETL stack to the wider organization.

Not so simple a request anymore. Looks like your weekend is totally shot.

What if there was a cloud native way for this data movement use case?

Enter Dataflow Templates.

Dataflow Templates allow you to set your data in motion in just a handful of clicks. They provide a user interface where you select a source-sink combination from a dropdown menu, enter the values for required parameters, choose optional settings, and deploy a pipeline. Once launched, a pipeline runs on the industry-leading, fully managed Dataflow service, which includes horizontal and vertical autoscaling, dynamic work rebalancing, and limitless backends like Shuffle and Streaming Engine.
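The same launch can also be scripted instead of clicked through. The sketch below only assembles a `gcloud dataflow jobs run` command for the Pub/Sub Topic to BigQuery template; it does not execute anything. The template path and the parameter names `inputTopic` and `outputTableSpec` are assumptions based on the classic version of that template and should be checked against the current template documentation. The project, dataset, and job names are placeholders, and `build_launch_command` is a hypothetical helper.

```python
import shlex
from typing import Dict, List

# Assumed public location of the classic Pub/Sub Topic to BigQuery template.
TEMPLATE_PATH = "gs://dataflow-templates/latest/PubSub_to_BigQuery"


def build_launch_command(
    job_name: str,
    region: str,
    template_path: str,
    parameters: Dict[str, str],
) -> List[str]:
    """Return the argument list for launching a classic template with gcloud."""
    if not parameters:
        raise ValueError("at least one template parameter is required")
    for key, value in parameters.items():
        # gcloud splits --parameters on commas; this sketch rejects such
        # values rather than implementing gcloud's escaping syntax.
        if "," in key or "," in value:
            raise ValueError(f"comma not supported in parameter {key!r}")
    # Sort keys so the generated command is stable across runs.
    joined = ",".join(f"{k}={v}" for k, v in sorted(parameters.items()))
    return [
        "gcloud", "dataflow", "jobs", "run", job_name,
        "--gcs-location", template_path,
        "--region", region,
        "--parameters", joined,
    ]


if __name__ == "__main__":
    command = build_launch_command(
        job_name="pubsub-to-bq-demo",  # placeholder job name
        region="us-central1",          # placeholder region
        template_path=TEMPLATE_PATH,
        parameters={
            "inputTopic": "projects/my-project/topics/events",
            "outputTableSpec": "my-project:my_dataset.events",
        },
    )
    # Print a shell-safe command line for review before running it by hand.
    print(shlex.join(command))
```

Keeping the command as a list rather than a string makes it safe to hand to `subprocess.run` later without shell quoting problems.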

Retry patterns? We’ve got code samples, not to mention support for snapshots, which protect you from data loss.

Need file format conversion? We’ve got a template for that.

Filter data using our built-in UDF support.

Monitoring & logging? Provided out of the box.

What about those pesky duplicates? We have that covered. 

No wonder studies have found that Dataflow boosts data engineering productivity by 55%.

Looks like your weekend might not be over after all.

The Dataflow team is excited to announce the general availability of 24 Google-provided Dataflow templates, listed below:

Streaming

Pub/Sub Subscription to BigQuery

Pub/Sub Topic to BigQuery

Pub/Sub Avro to BigQuery

Pub/Sub Proto to BigQuery

Pub/Sub to Pub/Sub

Pub/Sub Avro to Cloud Storage

Pub/Sub Text to Cloud Storage

Cloud Storage Text to BigQuery

Cloud Storage Text to Pub/Sub

Kafka to BigQuery

CDC from MySQL to BigQuery

Datastream to Spanner

Batch

BigQuery to Cloud Storage (Parquet)

Firestore to Cloud Storage

Spanner to Cloud Storage

Cloud Storage to BigQuery

Cloud Storage to Firestore

Cloud Storage to Pub/Sub

Cassandra to Bigtable

Utility (for use cases that go beyond data transport)

File Format Conversion

Cloud Storage Bulk Compression

Cloud Storage Bulk Decompression

Firestore Bulk Delete

Streaming Data Generator

If you are new to Dataflow, Dataflow Templates is absolutely the right place to begin your Dataflow journey.

If you have been using Dataflow for some time, you might note that Dataflow Templates have been around for as long as you can remember. It’s true that we introduced Dataflow Templates in 2017, and since then, thousands of customers have come to rely on Dataflow Templates to automate many of their data movements between different data stores. What’s new is that we now have the structure and personnel in place to provide technical support for these open-source contributions. We have made the requisite investments with dedicated staffing, and now when you use these Dataflow Templates, you can feel confident that your production workloads will be supported no differently than any other workload you run on Google Cloud.

What’s Next

Dataflow Templates might serve your immediate data processing needs, but as any data engineer knows, requirements evolve and customizations are necessary. Thankfully, Dataflow is well-positioned to serve those use cases too.

Begin your Dataflow journey with our Google-provided templates

Visit our open-source Templates repository so you can modify our templates for your use case (or launch a Cloud Shell instance with the templates preloaded!)

Deploy Flex Templates, which take custom templates to the next level and let you more easily reuse code across your teams

Review how Tyson Foods leveraged Templates to democratize data movement for their end users
