Pub/Sub Lite’s Apache Spark Structured Streaming Connector is now Generally Available

By mullaned2002

February 7, 2023


We are excited to announce that the open source Pub/Sub Lite Apache Spark connector is now compatible with Apache Spark 3.X.X distributions, and the connector is officially GA.

What is the Pub/Sub Lite Apache Spark Connector?

Pub/Sub Lite is a Google Cloud messaging service that allows users to send and receive messages asynchronously between independent applications. Publish applications send messages to Pub/Sub Lite topics, and applications subscribe to Pub/Sub Lite subscriptions to receive those messages.

Pub/Sub Lite offers both zonal and regional topics, which differ only in the way that data is replicated. Zonal topics store data in a single zone, while regional topics replicate data to two zones in a single region.

The Pub/Sub Lite Spark connector supports the use of Pub/Sub Lite as both an input source and an output sink for Apache Spark Structured Streaming. When writing to Pub/Sub Lite, the connector supports the following configuration options:

When reading from Pub/Sub Lite, the connector supports the following configuration options:

The connector works in all Apache Spark distributions, including Databricks and Google Cloud Dataproc. The first GA release of the Pub/Sub Lite Spark connector is v1.0.0, and it is compatible with Apache Spark 3.X.X versions.

Getting Started with Pub/Sub Lite and Spark Structured Streaming on Dataproc

With the Pub/Sub Lite Spark connector, using Pub/Sub Lite as a source or sink for Spark Structured Streaming takes only a few steps.

To get started, first create a Google Cloud Dataproc cluster:

The cluster image version determines the Apache Spark version that is installed on the cluster. The Pub/Sub Lite Spark connector currently supports Spark 3.X.X, which is the Spark line installed by the 2.X.X image versions, so choose a 2.X.X image version.

Enable API access to Google Cloud services by providing the ‘https://www.googleapis.com/auth/cloud-platform’ scope.
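The cluster settings described above can also be expressed as a `gcloud` command line. The sketch below only assembles that command in Python and prints it; the cluster name, region, and the default image version `2.1-debian11` are illustrative assumptions rather than values from this post, so substitute the 2.X.X image you actually want.

```python
import shlex

# Scope named in this post; it grants the cluster access to Google Cloud APIs.
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def cluster_create_command(cluster, region, image_version="2.1-debian11"):
    """Build a `gcloud dataproc clusters create` invocation as an argv list.

    The default image version is an assumption for illustration only.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--image-version={image_version}",
        f"--scopes={CLOUD_PLATFORM_SCOPE}",
    ]


# Hypothetical cluster name and region, shown as a shell-ready string.
print(shlex.join(cluster_create_command("psl-spark-cluster", "us-central1")))
```

Returning a list instead of a single string keeps each argument separate, so the same value can be passed to `subprocess.run` without shell quoting issues.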

Next, create a Spark script. For writing to Pub/Sub Lite, use the writeStream API, as in the following Python script:

```python
import uuid

# `sdf` is an existing streaming DataFrame; `project`, `location`, and `topic`
# are strings defined elsewhere in the script.

# Ensure the DataFrame matches the required data fields and data types for writing to Pub/Sub Lite: https://github.com/googleapis/java-pubsublite-spark#data-schema
# |-- key: binary
# |-- data: binary
# |-- event_timestamp: timestamp
# |-- attributes: map
# |    |-- key: string
# |    |-- value: array
# |    |    |-- element: binary
sdf.printSchema()

# Create the writeStream to send messages to the specified Pub/Sub Lite topic
query = (
    sdf.writeStream.format("pubsublite")
    .option(
        "pubsublite.topic",
        f"projects/{project}/locations/{location}/topics/{topic}",
    )
    .option("checkpointLocation", "/tmp/app" + uuid.uuid4().hex)
    .outputMode("append")
    .start()
)
```
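Because the write path expects specific column types, it can help to check them before starting the stream. The following is a minimal sketch, not part of the connector: `schema_problems` is a hypothetical helper that compares `(column, type)` pairs, in the form Spark returns from `DataFrame.dtypes`, against the write schema listed above. It only flags type mismatches on the known columns; how the connector treats missing or extra columns is not covered here.

```python
# Write schema from this post, using the type strings Spark reports
# in DataFrame.dtypes.
WRITE_SCHEMA = {
    "key": "binary",
    "data": "binary",
    "event_timestamp": "timestamp",
    "attributes": "map<string,array<binary>>",
}


def schema_problems(dtypes):
    """Return a message for each known column whose type does not match.

    `dtypes` is a list of (column_name, type_string) pairs, for example
    the value of `sdf.dtypes`. Columns not in WRITE_SCHEMA are skipped.
    """
    problems = []
    for name, actual in dtypes:
        expected = WRITE_SCHEMA.get(name)
        if expected is not None and actual != expected:
            problems.append(f"{name!r} is {actual}, expected {expected}")
    return problems
```

In a Spark script this would be called as `schema_problems(sdf.dtypes)` just before `writeStream`. A typical mismatch is a `data` column that is still a `string` and needs a cast to `binary`.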

For reading from Pub/Sub Lite, create a script using the readStream API, like so:

```python
from pyspark.sql import SparkSession

# `project`, `location`, and `subscription` are strings defined elsewhere in the script.
spark = SparkSession.builder.appName("psl-read-app").master("yarn").getOrCreate()

sdf = (
    spark.readStream.format("pubsublite")
    .option(
        "pubsublite.subscription",
        f"projects/{project}/locations/{location}/subscriptions/{subscription}",
    )
    .load()
)

# The DataFrame should match the fixed Pub/Sub Lite data schema for reading from Pub/Sub Lite: https://github.com/googleapis/java-pubsublite-spark#data-schema
# |-- subscription: string
# |-- partition: long
# |-- offset: long
# |-- key: binary
# |-- data: binary
# |-- publish_timestamp: timestamp
# |-- event_timestamp: timestamp
# |-- attributes: map
# |    |-- key: string
# |    |-- value: array
# |    |    |-- element: binary
sdf.printSchema()
```
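Both the `pubsublite.topic` and `pubsublite.subscription` options take a full resource path. A small helper keeps that formatting in one place; the sketch below is illustrative, the function names are hypothetical, and the example project, location, and resource IDs are made up.

```python
def _resource_path(project, location, collection, resource_id):
    """Join the pieces of a Pub/Sub Lite resource path, rejecting bad parts."""
    for part in (project, location, resource_id):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"projects/{project}/locations/{location}/{collection}/{resource_id}"


def topic_path(project, location, topic):
    """Value for the `pubsublite.topic` option."""
    return _resource_path(project, location, "topics", topic)


def subscription_path(project, location, subscription):
    """Value for the `pubsublite.subscription` option."""
    return _resource_path(project, location, "subscriptions", subscription)


# Hypothetical identifiers, for illustration only.
print(topic_path("my-project", "us-central1-a", "my-topic"))
print(subscription_path("my-project", "us-central1-a", "my-subscription"))
```

The location is a zone for a zonal topic and a region for a regional topic, so the same helper covers both kinds.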

Finally, submit the job to Dataproc. The Pub/Sub Lite Spark connector must be included in the job’s Jar files. All versions of the connector are publicly available from the Maven Central repository. Choose the latest version (any version from 1.0.0 onward is a GA release) and download the ‘with-dependencies.jar’. Add this jar to the Dataproc job’s Jar files, and submit!
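For command-line submission, the same step can be sketched as a `gcloud dataproc jobs submit pyspark` call. The helper below only builds the argument list; the script name, cluster, region, and the Cloud Storage location of the jar are all hypothetical placeholders, and it assumes the downloaded ‘with-dependencies.jar’ has already been copied somewhere the cluster can read.

```python
import shlex


def job_submit_command(script, cluster, region, connector_jar):
    """Build a `gcloud dataproc jobs submit pyspark` invocation as an argv list.

    `connector_jar` is the path or URI of the connector's with-dependencies jar.
    """
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
        f"--jars={connector_jar}",
    ]


# All values here are placeholders for illustration.
cmd = job_submit_command(
    "read_app.py",
    "psl-spark-cluster",
    "us-central1",
    "gs://my-bucket/connector-with-dependencies.jar",
)
print(shlex.join(cmd))
```

To actually run the command from Python, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where `gcloud` is installed and authenticated.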
