Unify analytics with Spark procedures in BigQuery, now generally available

By mullaned2002

March 19, 2024

BigQuery is powered by a highly scalable and capable SQL engine that can handle large data volumes with standard SQL, and that offers advanced capabilities such as BigQuery ML, remote functions, vector search, and more. However, there are cases where you may need to leverage open-source Apache Spark expertise or existing Spark-based business logic to expand BigQuery data processing beyond SQL. For example, you may want to use community packages for complex JSON processing or graph data processing, or use legacy code that was written in Spark prior to migration to BigQuery. Historically, this required you to leave BigQuery, enable a separate API, use an alternative user interface (UI), manage disparate permissions, and pay for non-BigQuery SKUs.

To address these challenges, we developed an integrated experience that extends BigQuery’s data processing to Apache Spark, and today we are announcing the general availability (GA) of Apache Spark stored procedures in BigQuery. BigQuery users looking to extend their queries with Spark-based data processing can now use BigQuery APIs to create and execute Spark stored procedures. This brings Spark and BigQuery together in a single experience that covers management, security, and billing. Spark procedures can be written in PySpark, Scala, and Java.

Here’s what DeNA, a provider of internet and AI technologies and a BigQuery customer, had to say:

“BigQuery Spark stored procedures deliver a frictionless experience with unified API, governance and billing across Spark and BigQuery. We can now seamlessly use our Spark expertise and community packages for advanced data processing in BigQuery.” – Yusuke Kamo, Division Director, Data Management Division, DeNA Co., Ltd

Let’s look into some key aspects of this unified experience.

Develop, test, and deploy PySpark code in BigQuery Studio

BigQuery Studio, a single, unified interface for all data practitioners, includes a Python editor to develop, test, and deploy your PySpark code. Procedures can be configured with IN/OUT parameters along with other options. After you create a Spark connection, you can iteratively test the code within the UI. For debugging and troubleshooting, the BigQuery console collects log messages from the underlying Spark jobs and surfaces them in the same context. Spark experts can also tune Spark execution by passing Spark parameters to the procedure.

Author PySpark procedure with a Python editor in BigQuery Studio

Once tested, the procedure is stored within a BigQuery dataset and access to the procedure can be managed similarly to your SQL procedures.
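To make the IN/OUT parameters and Spark tuning parameters described above more concrete, here is a minimal sketch in plain Python that renders the header of a `CREATE PROCEDURE` statement. The helper names (`render_signature`, `render_properties`, `render_header`) and the `edge_count` OUT parameter are hypothetical and exist only for illustration; the clause layout mirrors the example in this post and should be checked against the BigQuery documentation before use.

```python
from typing import Iterable, Mapping, Tuple

VALID_MODES = {"IN", "OUT", "INOUT"}


def render_signature(params: Iterable[Tuple[str, str, str]]) -> str:
    """Render (mode, name, type) triples as a procedure parameter list."""
    parts = []
    for mode, name, sql_type in params:
        if mode not in VALID_MODES:
            raise ValueError(f"unsupported parameter mode: {mode!r}")
        parts.append(f"{mode} {name} {sql_type}")
    return "(" + ", ".join(parts) + ")"


def render_properties(props: Mapping[str, str]) -> str:
    """Render Spark properties as the list of pairs passed to OPTIONS(properties=...)."""
    pairs = ", ".join(f'("{key}", "{value}")' for key, value in props.items())
    return f"[{pairs}]"


def render_header(procedure: str, connection: str, params, props) -> str:
    """Assemble everything up to, but not including, the PySpark body."""
    return (
        f"CREATE OR REPLACE PROCEDURE `{procedure}`{render_signature(params)}\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (engine='SPARK', runtime_version='1.1', "
        f"properties={render_properties(props)})\n"
        f"LANGUAGE python AS"
    )


header = render_header(
    "myproject.mydataset.graphframe",
    "myproject.region.my-spark-connection",
    # edge_count is a hypothetical OUT parameter, not part of the original example
    [("IN", "bucket_name", "STRING"), ("OUT", "edge_count", "INT64")],
    {"spark.executor.instances": "5"},
)
print(header)
```

Keeping the parameter list and the Spark properties as data makes it easier to review tuning changes separately from the PySpark body itself.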

Extend for advanced use cases

One of the great benefits of Apache Spark is being able to take advantage of a wide range of community or third-party packages. You can configure Spark stored procedures in BigQuery to install packages that you need for your code execution.

For advanced use cases you can also import your code stored in Google Cloud Storage buckets or a custom container image that is available in Container Registry or Artifact Registry.
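As a sketch of the Cloud Storage option, the snippet below renders a procedure definition that points at a PySpark file in a bucket instead of embedding the code inline. It assumes the `main_file_uri` option; the helper name `render_gcs_procedure`, the procedure name, and the file path are hypothetical, so treat this as an outline to verify against the BigQuery documentation and not as a definitive statement.

```python
def render_gcs_procedure(procedure: str, connection: str, main_file_uri: str) -> str:
    """Render DDL for a procedure whose PySpark code lives in Cloud Storage."""
    if not main_file_uri.startswith("gs://"):
        raise ValueError("main_file_uri must be a gs:// URI")
    return (
        f"CREATE OR REPLACE PROCEDURE `{procedure}`()\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (engine='SPARK', runtime_version='1.1', "
        f"main_file_uri='{main_file_uri}')\n"
        f"LANGUAGE python;"
    )


ddl = render_gcs_procedure(
    "myproject.mydataset.from_gcs",          # hypothetical procedure name
    "myproject.region.my-spark-connection",
    "gs://my-bucket/jobs/main.py",           # hypothetical file location
)
print(ddl)
```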

```sql
-- Create a Spark procedure with a custom image in Artifact Registry that has the OSS GraphFrames library.
-- Also use custom Spark options to set a specific number of executors.

CREATE OR REPLACE PROCEDURE
  `myproject.mydataset.graphframe`(bucket_name STRING)
WITH CONNECTION `myproject.region.my-spark-connection` OPTIONS (engine='SPARK',
  runtime_version='1.1',
  container_image='us-central1-docker.pkg.dev/myproj/myrepo/graph-db-image',
  properties=[("spark.executor.instances",
    "5")])
  LANGUAGE python AS R"""
from pyspark import *
from graphframes import *
from pyspark.sql import SparkSession
import sys
import pyspark.sql.functions as f
from bigquery.spark.procedure import SparkProcParamContext


spark = SparkSession.builder.appName("graphframes_data").getOrCreate()
sc=spark.sparkContext
spark_proc_param_context = SparkProcParamContext.getOrCreate(spark)
bucket_name=spark_proc_param_context.bucket_name


# Reading Vertex and Edges data from GCS
edges= spark.read.options(header='True', inferSchema='True', delimiter=',').csv("gs://"+bucket_name+"/edges/*.csv")
edges=edges.withColumnRenamed("Source","src").withColumnRenamed("Target","dst")
vertices= spark.read.options(header='True', inferSchema='True', delimiter=',').csv("gs://"+bucket_name+"/nodes/*.csv")
vertices=vertices.withColumnRenamed("Id","id")


g = GraphFrame(vertices, edges)
## Take a look at the DataFrames
g.vertices.show(20)
g.edges.show(20)
## Check the number of edges of each vertex
g.degrees.sort(g.degrees.degree.desc()).show(20)
""";
```
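Once a procedure like the one above exists, it is invoked with a `CALL` statement. The sketch below builds that statement for the `graphframe` procedure and shows one way it might be submitted through the `google-cloud-bigquery` client library. The helper names, the bucket name `my-bucket`, and the bucket-name pattern are assumptions made for this example; `run_procedure` needs the client library and valid credentials, so it is defined but not executed here.

```python
import re

# Conservative pattern for Cloud Storage bucket names (an assumption for this sketch).
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def build_call_sql(procedure: str, bucket_name: str) -> str:
    """Build a CALL statement, rejecting bucket names that could break the SQL literal."""
    if not _BUCKET_RE.match(bucket_name):
        raise ValueError(f"unexpected bucket name: {bucket_name!r}")
    return f"CALL `{procedure}`('{bucket_name}');"


def run_procedure(sql: str) -> None:
    """Submit the CALL through the BigQuery client library (requires credentials)."""
    # Imported lazily so the rest of the sketch loads without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    client.query(sql).result()  # blocks until the job finishes


sql = build_call_sql("myproject.mydataset.graphframe", "my-bucket")
print(sql)
```

Validating the bucket name before interpolating it keeps the sketch simple; in production code, query parameters may be a better fit if they suit your workflow.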

Advanced security and authentication options like customer-managed encryption keys (CMEK) and using a pre-existing service account are also supported.

Serverless execution with BigQuery billing

With this release, you get the benefits of Spark within the BigQuery APIs and see only BigQuery charges. Behind the scenes, this is made possible by our industry-leading Serverless Spark engine, which provides serverless, autoscaling Spark. You don’t need to enable Dataproc APIs, and you are not charged for Dataproc when you use this capability. Spark procedure usage is billed under the Enterprise edition (EE) pay-as-you-go (PAYG) pricing SKU, regardless of which BigQuery edition you use. The feature is available in all BigQuery editions, including the on-demand model. Please see BigQuery pricing for more details.

Next steps

Learn more about Apache Spark stored procedures in the BigQuery documentation.
