Friday, June 21, 2024
No menu items!
HomeCloud ComputingPub/Sub Lite’s Apache Spark Structured Streaming Connector is now Generally Available

Pub/Sub Lite’s Apache Spark Structured Streaming Connector is now Generally Available

We are excited to announce that the open source Pub/Sub Lite Apache Spark connector is now compatible with Apache Spark 3.X.X distributions, and the connector is officially GA.

What is the Pub/Sub Lite Apache Spark Connector?

Pub/Sub Lite is a Google Cloud messaging service that allows users to send and receive messages asynchronously between independent applications. Publish applications send messages to Pub/Sub Lite topics, and applications subscribe to Pub/Sub Lite subscriptions to receive those messages.

Pub/Sub Lite offers both zonal and regional topics, which differ only in the way that data is replicated. Zonal topics store data in a single zone, while regional topics replicate data to two zones in a single region.

The Pub/Sub Lite Spark connector supports the use of Pub/Sub Lite as both an input and output source for Apache Spark Structured Streaming. When writing to Pub/Sub Lite, the connector supports the following configuration options:

When reading from Pub/Sub Lite, the connector supports the following configuration options:

The connector works in all Apache Spark distributions, including Databricks and Google Cloud Dataproc. The first GA release of the Pub/Sub Lite Spark connector is v1.0.0, and it is compatible with Apache Spark 3.X.X versions.

Getting Started with Pub/Sub Lite and Spark Structured Streaming on Dataproc

Using Pub/Sub Lite as a source with Spark Structured Streaming is simple using the Pub/Sub Lite Spark connector.

To get started, first create a Google Cloud Dataproc cluster:

The cluster image version determines the Apache Spark version that is installed on the cluster. The Pub/Sub Lite Spark connector currently supports Spark 3.X.X, so choose a 2.X.X image version.

Enable API access to Google Cloud services by providing the ‘https://www.googleapis.com/auth/cloud-platform’ scope

Next, create a Spark script. For writing to Pub/Sub Lite, use the writeStream API, like the following python script:

code_block[StructValue([(u’code’, u’# Ensure the DataFrame matches the required data fields and data types for writing to Pub/Sub Lite: https://github.com/googleapis/java-pubsublite-spark#data-schemarn# |– key: binaryrn# |– data: binaryrn# |– event_timestamp: timestamprn# |– attributes: maprn# | |– key: stringrn# | |– value: arrayrn# | | |– element: binaryrnsdf.printSchema()rnrn# Create the writeStream to send messages to the specified Pub/Sub Lite topicrnquery = (rn sdf.writeStream.format(“pubsublite”)rn .option(rn “pubsublite.topic”,rn f”projects/{project}/locations/{location}/topics/{topic}”,rn )rn .option(“checkpointLocation”, “/tmp/app” + uuid.uuid4().hex)rn .outputMode(“append”)rn .start()rn)’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0b60263550>)])]

For reading from Pub/Sub Lite, create a script using the readStream API, like so:

code_block[StructValue([(u’code’, u’spark = SparkSession.builder.appName(“psl-read-app”).master(“yarn”).getOrCreate()rnrnsdf = (rn spark.readStream.format(“pubsublite”)rn .option(rn “pubsublite.subscription”,rn f”projects/{project}/locations/{location}/subscriptions/{subscription}”,rn )rn .load()rn)rnrn# The DataFrame should match the fixed Pub/Sub Lite data schema for reading from Pub/Sub Lite: https://github.com/googleapis/java-pubsublite-spark#data-schemarn# |– subscription: stringrn# |– partition: longrn# |– offset: longrn# |– key: binaryrn# |– data: binaryrn# |– publish_timestamp: timestamprn# |– event_timestamp: timestamprn# |– attributes: maprn# | |– key: stringrn# | |– value: arrayrn# | | |– element: binaryrnsdf.printSchema()’), (u’language’, u”), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e0b60263810>)])]

Finally, submit the job to Dataproc. When submitting the job, the Pub/Sub Lite Spark connector must be included in the job’s Jar files. All versions of the connector are publicly available from the Maven Central repository. Choose the latest version (or >1.0.0 for GA releases), and download the ‘with-dependencies.jar’. Upload this jar to the Dataproc job, and submit!

Further reading

Get started with the Pub/Sub Lite Spark connector Quick Start

Check out our how-to-guides to ‘Write to Pub/Sub Lite from Spark’ and ‘Read from Pub/Sub Lite from Spark’.

Read ‘Pub/Sub Lite as a source with Spark Structured Streaming on Databricks’.

Cloud BlogRead More

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments