Using Firestore and Apache Beam for data processing

By mullaned2002

November 10, 2021


Large-scale data processing workloads can be challenging to operationalize and orchestrate. Google Cloud announced the release of a Firestore in Native Mode connector for Apache Beam that makes data processing easier for Firestore users. Apache Beam is a popular open source project that supports large-scale data processing with a unified batch and streaming processing model. It is portable, works with many different backend runners, and allows for flexible deployment. The Firestore Beam I/O Connector joins BigQuery, Bigtable, and Datastore as Google databases with Apache Beam connectors, and it is automatically included with the Google Cloud Platform IO module of the Apache Beam Java SDK.

The Firestore connector can be used with a variety of Apache Beam backends, including Google Cloud Dataflow. Dataflow, an Apache Beam backend runner, provides a structure for developers to solve “embarrassingly parallel” problems. Mutating every record of your database is an example of such a problem. Using Beam pipelines removes much of the work of orchestrating the parallelization and allows developers to instead focus on the transforms on the data.
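To make "embarrassingly parallel" concrete, here is a minimal plain-Java sketch, with no Beam or Firestore involved. The record shape (a map from document ID to an email string) and the names `EmbarrassinglyParallelSketch`, `normalizeEmail`, and `normalizeAll` are hypothetical and chosen only for illustration. The point is that each record's new value depends only on that record, so the work can be split across threads (or, in Beam's case, workers) with no coordination.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

public class EmbarrassinglyParallelSketch {

    // Per-record transform: the result depends only on this record's own value.
    static String normalizeEmail(String email) {
        return email.trim().toLowerCase(Locale.ROOT);
    }

    // Applies the transform to every record. Because no record depends on
    // another, the stream may process entries in any order and on any thread.
    static Map<String, String> normalizeAll(Map<String, String> emailsById) {
        return emailsById.entrySet().parallelStream()
            .collect(Collectors.toMap(
                Map.Entry::getKey,
                entry -> normalizeEmail(entry.getValue())));
    }

    public static void main(String[] args) {
        // Hypothetical in-memory stand-in for database records.
        Map<String, String> records = new HashMap<>();
        records.put("users/1", " Ada@Example.com ");
        records.put("users/2", "BOB@example.COM");
        System.out.println(normalizeAll(records));
    }
}
```

A Beam pipeline expresses the same idea at cluster scale: you supply the per-record transform, and the runner decides how to distribute it.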

A practical application of a Firestore Connector for Beam

To better understand the use case for a Beam + Firestore pipeline, let’s look at an example that illustrates the value of using Google Cloud Dataflow to do bulk operations on a Firestore database. Imagine you have a Firestore database with a collection group on which you want to run a large number of operations, for instance deleting every document in that collection group. Doing this on one worker could take a while. What if instead we could use the power of Beam to do it in parallel?

This pipeline starts by creating a request for a partition query on a given collectionGroupId. We specify withNameOnlyQuery because it saves network bandwidth; we only need the name to delete a document. From there, we use a few custom functions: we convert each query response to a document object, read the document’s name, and delete the document by that name.
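The stages described above might be wired together roughly as follows. This is a sketch, not the article's original listing. It assumes the Apache Beam Java SDK with its Google Cloud Platform IO module and the Firestore v1 protobuf classes are on the classpath, and that a `PCollection<PartitionQueryRequest>` is built elsewhere. The names `DeleteCollectionGroupSketch`, `ToDeleteWrite`, `deleteFor`, and `applyDeleteStages` are hypothetical.

```java
import com.google.firestore.v1.PartitionQueryRequest;
import com.google.firestore.v1.RunQueryResponse;
import com.google.firestore.v1.Write;
import org.apache.beam.sdk.io.gcp.firestore.FirestoreIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public final class DeleteCollectionGroupSketch {

  // Builds a Firestore write that deletes the document with the given full name.
  static Write deleteFor(String documentName) {
    return Write.newBuilder().setDelete(documentName).build();
  }

  // Converts each query response that carries a document into a delete write.
  // Responses without a document are skipped.
  static final class ToDeleteWrite extends DoFn<RunQueryResponse, Write> {
    @ProcessElement
    public void processElement(
        @Element RunQueryResponse response, OutputReceiver<Write> out) {
      if (response.hasDocument()) {
        out.output(deleteFor(response.getDocument().getName()));
      }
    }
  }

  // Wires the stages together. `requests` is assumed to be built elsewhere.
  static void applyDeleteStages(PCollection<PartitionQueryRequest> requests) {
    requests
        // Split the collection group into independently readable queries;
        // name-only results are enough because deletes need only the name.
        .apply("PartitionQuery",
            FirestoreIO.v1().read().partitionQuery().withNameOnlyQuery().build())
        // Run each partition's query.
        .apply("RunQuery", FirestoreIO.v1().read().runQuery().build())
        // Turn each returned document into a delete.
        .apply("ToDelete", ParDo.of(new ToDeleteWrite()))
        // Send the deletes to Firestore in batches.
        .apply("BatchWrite", FirestoreIO.v1().write().batchWrite().build());
  }
}
```

Folding the "response to document to name to delete" steps into one `DoFn` is a design choice made here for brevity; separate mapping steps would work equally well.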

Beam utilizes a watermark to ensure exactly-once processing. As a result, the Shuffle operation does not backtrack over work that is already complete, which provides both speed and correctness.

While the code to create a partition query is a bit long, it consists of constructing the protobuf request to be sent to Firestore using the generated protobuf builder.

Creating a Partition Query:
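The original listing is not reproduced here, so what follows is a hedged sketch of what constructing such a request can look like with the generated Firestore v1 protobuf builders. It assumes the Firestore v1 protobuf classes are on the classpath and that the database is the project's `(default)` database. The class name `PartitionQueryRequests` and the method `forCollectionGroup` are hypothetical. The query selects the collection group across all ancestors and orders by document name ascending, which is what partition queries require.

```java
import com.google.firestore.v1.PartitionQueryRequest;
import com.google.firestore.v1.StructuredQuery;
import com.google.firestore.v1.StructuredQuery.CollectionSelector;
import com.google.firestore.v1.StructuredQuery.Direction;
import com.google.firestore.v1.StructuredQuery.FieldReference;
import com.google.firestore.v1.StructuredQuery.Order;

public final class PartitionQueryRequests {

  // Builds a partition query request covering every document in the given
  // collection group. partitionCount is the desired maximum number of
  // partition points; how to choose it is left to the caller.
  static PartitionQueryRequest forCollectionGroup(
      String projectId, String collectionGroupId, long partitionCount) {
    // Assumption: the "(default)" database of the given project.
    String parent =
        String.format("projects/%s/databases/(default)/documents", projectId);

    StructuredQuery query = StructuredQuery.newBuilder()
        // Match the collection ID at any depth, i.e. a collection group.
        .addFrom(CollectionSelector.newBuilder()
            .setCollectionId(collectionGroupId)
            .setAllDescendants(true))
        // Order by document name, ascending.
        .addOrderBy(Order.newBuilder()
            .setField(FieldReference.newBuilder().setFieldPath("__name__"))
            .setDirection(Direction.ASCENDING))
        .build();

    return PartitionQueryRequest.newBuilder()
        .setParent(parent)
        .setStructuredQuery(query)
        .setPartitionCount(partitionCount)
        .build();
  }
}
```

For example, `PartitionQueryRequests.forCollectionGroup("my-project", "orders", 10)` would describe a request over a hypothetical `orders` collection group; the project and collection names are placeholders.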

There are many possible applications for this connector for Google Cloud users: joining disparate data in a Firestore in Native Mode database, relating data across multiple databases, deleting a large number of entities, writing Firestore data to BigQuery, and more. We’re excited to have contributed this connector to the Apache Beam ecosystem and can’t wait to see how you use the Firestore connector to build the next great thing.
