
Learn how to run Serverless Spark workloads without provisioning and managing clusters

Serverless Spark is a fully managed product on Google Cloud that lets you run Apache Spark batch workloads, including PySpark, SparkR, and Spark SQL, without provisioning or managing a cluster. With the Apache Spark SQL connector for Google BigQuery, you can process data stored in BigQuery entirely from this serverless environment. As part of the Dataproc product portfolio, Serverless Spark also supports reading from and writing to your Dataproc Metastore and provides access to the Spark History Server when configured with a Dataproc Persistent History Server.

We’re pleased to announce a new interactive tutorial directly in the Google Cloud console that walks you through several ways to start processing your data with Serverless Spark on Google Cloud. 

Below we’ll cover at a high level what you’ll learn in the tutorial, which goes much deeper than this blog.

This tutorial will take you approximately 30 minutes. A basic understanding of Apache Spark will help you follow the concepts; you can learn more in the Apache Spark project documentation.

What is Apache Spark?

Apache Spark is an open-source distributed data processing engine for large-scale datasets, with APIs in Python, Java, Scala, R, and SQL. Its core library includes tools for use cases such as machine learning, graph processing, and structured streaming, as well as a pandas API for pandas-based workloads. In addition, numerous third-party libraries extend Spark’s functionality, including Spark NLP and database connectors such as the Apache Spark SQL connector for Google BigQuery. Apache Spark supports multiple table and file formats, including Apache Iceberg, Apache Hudi, Parquet, and Avro.
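
To make that concrete, here is a minimal PySpark sketch, runnable on any Spark installation, that creates a SparkSession, builds a small in-memory DataFrame, and runs an aggregation; the column names and values are invented for the example and are not part of the tutorial.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a SparkSession, the entry point to the DataFrame API.
spark = SparkSession.builder.appName("spark-quickstart").getOrCreate()

# Build a small in-memory DataFrame; in practice this would come from files or tables.
df = spark.createDataFrame(
    [("books", 12.50), ("books", 7.99), ("toys", 3.25)],
    ["category", "price"],
)

# Aggregate the total price per category and print the result.
df.groupBy("category").agg(F.sum("price").alias("total_price")).show()

spark.stop()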

Run a PySpark job with Serverless Spark on BigQuery data

This tutorial teaches you how to read and write BigQuery data using PySpark on Serverless Spark. The Apache Spark SQL connector for Google BigQuery is now included in the latest Serverless Spark 2.1 runtime. You can also submit jobs from the command line:

gcloud dataproc batches submit pyspark job.py \
    --region=us-central1 \
    --version=2.1
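
For reference, the job.py submitted above could look something like the following sketch, which reads a public BigQuery table with the Spark BigQuery connector, aggregates it, and writes the result back to BigQuery; the output dataset, table, and temporary bucket names are placeholders, not part of the tutorial.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bigquery-wordcount").getOrCreate()

# Read a public BigQuery table with the Spark BigQuery connector
# (bundled in the Serverless Spark 2.1 runtime).
words = spark.read.format("bigquery").load("bigquery-public-data.samples.shakespeare")

# Sum the occurrences of each word across the corpus.
word_counts = words.groupBy("word").agg(F.sum("word_count").alias("total_count"))

# Write the result back to BigQuery; the dataset, table, and temporary
# bucket below are placeholders for your own resources.
(
    word_counts.write.format("bigquery")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save("my_dataset.word_counts")
)

spark.stop()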

View logs, console output, and Spark logs

Service-level logs, such as those generated when Serverless Spark requests extra executors while scaling up, are captured in Cloud Logging and can be viewed in real time or after the job completes.
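
As a rough sketch of how you might pull those logs programmatically, the snippet below uses the Cloud Logging client library for Python; the filter, including the cloud_dataproc_batch resource type and the batch ID, is an assumption you would adapt to your own project and workload.

from google.cloud import logging

# Create a Cloud Logging client for the current project.
client = logging.Client()

# Filter for Dataproc Serverless batch logs; the resource type and the
# batch_id value are assumptions to adjust for your own workload.
log_filter = (
    'resource.type="cloud_dataproc_batch" '
    'AND resource.labels.batch_id="my-batch-id"'
)

# Print the most recent matching log entries.
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING, max_results=20):
    print(entry.timestamp, entry.payload)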

The console output will be visible via the command line as the job is running but is also logged to the Dataproc Batches console.

You can also view Spark logs via a Persistent History Server, set up as a single-node Dataproc cluster. You can create one with the following command:

BUCKET=my-bucket
gcloud dataproc clusters create \
    --region=us-central1 \
    --single-node \
    --enable-component-gateway \
    --properties=spark:spark.history.fs.logDirectory=gs://$BUCKET/*/spark-job-history

You can then reference this Persistent History Server when submitting Serverless Spark jobs to view their Spark logs:

gcloud dataproc batches submit pyspark job.py \
    --region=us-central1 \
    --version=2.1 \
    --history-server-cluster=projects/${GOOGLE_CLOUD_PROJECT}/regions/us-central1/clusters

The Persistent History Server is available in the Batches console by clicking on the Batch ID of the job and then View Spark History Server.

Use Dataproc templates for simple data processing jobs

Dataproc templates provide functionality for simple ETL (extract, transform, load) and ELT (extract, load, transform) jobs. Using this command line-based tool, you can move and process your data for simple and common use cases. These templates utilize Serverless Spark but do not require the user to write any Spark code. Some of these templates include:

GCStoGCS
GCStoBigQuery
GCStoBigtable
GCStoJDBC and JDBCtoGCS
HivetoBigQuery
MongotoGCS and GCStoMongo

Check out the full list of templates.

The following example uses the GCStoGCS template to convert a CSV file in Cloud Storage to Parquet.

BUCKET=your-bucket
./bin/start.sh -- \
    --template=GCSTOGCS \
    --gcs.to.gcs.input.location=gs://$BUCKET/input/file.csv \
    --gcs.to.gcs.input.format=csv \
    --gcs.to.gcs.output.format=parquet \
    --gcs.to.gcs.output.mode=overwrite \
    --gcs.to.gcs.output.location=gs://$BUCKET/output

Get started

Check out the interactive tutorial for a more in-depth and comprehensive view of the information covered here. New customers also get Google Cloud’s $300 credit.

Learn more:

Serverless Spark docs

Dataproc templates

Serverless Spark workshop

Apache Spark documentation
