Friday, July 19, 2024

BigQuery now supports manifest files for querying open table formats

In this blog, we give an overview of manifest support in BigQuery and explain how it enables querying open table formats such as Apache Hudi and Delta Lake.

Open table formats rely on embedded metadata to provide transactionally consistent DML and time travel features. They keep different versions of the data files and are capable of generating manifests, which are lists of data files that represent a point-in-time snapshot. Many data runtimes like Delta Lake and Apache Hudi can generate manifests, which can be used for load and query use cases. BigQuery now supports manifest files, which will make it easier to query open table formats with BigQuery.

BigQuery supports manifest files in SymlinkTextInputFormat, which is simply a newline-delimited list of URIs. Customers can now set the file_set_spec_type flag to NEW_LINE_DELIMITED_MANIFEST in table options to indicate that the provided URIs are newline-delimited manifest files, with one URI per line. This feature also supports partition pruning for Hive-style partitioned tables, which leads to better performance and lower cost.
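To make the format concrete, the following sketch builds such a manifest in Python. The bucket and file names are hypothetical, purely for illustration:

```python
# Build a SymlinkTextInputFormat-style manifest: one data-file URI per line.
# These gs:// paths are hypothetical, for illustration only.
data_files = [
    "gs://demo/myTable/data/part-00000.parquet",
    "gs://demo/myTable/data/part-00001.parquet",
]

# The manifest body is just the URIs joined by newlines.
manifest_body = "\n".join(data_files) + "\n"

with open("latest-manifest.csv", "w") as f:
    f.write(manifest_body)

# Reading it back yields one URI per line.
with open("latest-manifest.csv") as f:
    lines = f.read().splitlines()
```

A file like this, uploaded to Cloud Storage, is what the uris option points at in the example below.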

Here is an example of creating a BigLake table using a manifest file.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `my-project.mydataset.myTable`
WITH CONNECTION `my-project.us.bl_connection`
OPTIONS (
  uris = ['gs://demo/myTable/manifest/latest-manifest.csv'],
  format = 'PARQUET',
  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST');
```

Querying Apache Hudi using Manifest Support 

Apache Hudi is an open-source data management framework for big data workloads. It’s built on top of Apache Hadoop and provides a mechanism to manage data in a Hadoop Distributed File System (HDFS) or any other cloud storage system.

Hudi tables can be queried from BigQuery as external tables using the Hudi-BigQuery Connector. The previous Hudi-BigQuery integration only works for Hive-style partitioned Copy-On-Write tables, and its implementation precludes the use of some important query processing optimizations, which hurts performance and increases slot consumption.

To overcome these pain points, the Hudi-BigQuery Connector has been upgraded to leverage BigQuery's manifest file support. Here is a step-by-step process to query Apache Hudi workloads using the Connector.

Step 1: Download and build the BigQuery Hudi connector

Download and build the latest hudi-gcp-bundle to run the BigQuerySyncTool.

Step 2: Run the Spark application to generate a BigQuery external table

Here are the steps to use the connector with the manifest approach:

Drop the existing view that represents the Hudi table in BigQuery (if the old implementation is in use)

The Hudi connector looks for the table name, and if one already exists, it just updates the manifest file. Queries will then start failing because of a schema mismatch. Make sure you drop the view before triggering the latest connector.
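For example, assuming the old implementation created a view named `nyc_taxi_hudi` (the project, dataset, and table names here are illustrative), the drop would look like:

```sql
-- Hypothetical names; substitute your own project, dataset, and view.
DROP VIEW IF EXISTS `bq-hudi.demo.nyc_taxi_hudi`;
```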

Run the latest Hudi Connector to trigger the manifest approach

Run the BigQuerySyncTool with the --use-bq-manifest-file flag.

If you are transitioning from the old implementation, append the --use-bq-manifest-file flag to the spark-submit command that runs the existing connector. Using the same table name is recommended, as it allows you to keep the existing downstream pipeline code.

Running the connector with the --use-bq-manifest-file flag will export a manifest file in a format supported by BigQuery and use it to create an external table with the name specified in the --table parameter.

Here is a sample spark-submit command for the manifest approach.

```shell
spark-submit \
  --master yarn \
  --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
  --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
  hudi/packaging/hudi-gcp-bundle/target/hudi-gcp-bundle-0.14.0-SNAPSHOT.jar \
  --project-id bq-hudi \
  --dataset-name demo \
  --dataset-location us \
  --table nyc_taxi_hudi \
  --source-uri gs://demo-bucket/hudi/taxi-trips/EventDate=* \
  --source-uri-prefix gs://demo-bucket/hudi/taxi-trips/ \
  --base-path gs://demo-bucket/hudi/taxi-trips/ \
  --partitioned-by EventDate \
  --use-bq-manifest-file
```

Step 3 (recommended): Upgrade to an accelerated BigLake table

Customers running large-scale analytics can upgrade external tables to BigLake tables to set appropriate fine-grained controls and accelerate these workloads by taking advantage of metadata caching and materialized views.
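As a sketch of what enabling metadata caching can look like (the table name and option values here are illustrative; check the BigLake documentation for the full syntax and current option names):

```sql
-- Illustrative: enable metadata caching on a manifest-based BigLake table.
-- max_staleness bounds how stale cached metadata is allowed to be.
ALTER TABLE `my-project.mydataset.myTable`
SET OPTIONS (
  max_staleness = INTERVAL 4 HOUR,
  metadata_cache_mode = 'AUTOMATIC');
```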

Querying Delta Lake using Manifest Support 

Delta Lake is an open-source storage framework that enables building a lakehouse architecture. It extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. It also provides an option to export a manifest file that contains a list of data files representing a point-in-time snapshot.

With manifest support, users can create a BigLake table to query a Delta Lake table on GCS. It is the user's responsibility to regenerate the manifest whenever the underlying Delta Lake table changes, and this approach only supports querying Delta Lake reader v1 tables.

Here is a step-by-step process to query Delta Lake tables using manifest support.

Step 1: Generate the Delta table’s manifests using Apache Spark

Delta Lake supports exporting manifest files. The generate command writes manifest files at <path-to-delta-table>/_symlink_format_manifest/. The files in this directory contain the names of the data files (that is, Parquet files) that should be read for a snapshot of the Delta table.

Generating manifest files using Python:

```python
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
```
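Rather than re-running generate after every write, Delta Lake can also keep the manifest up to date automatically through a table property. A sketch in Spark SQL (the table path is a placeholder):

```sql
-- Illustrative: regenerate _symlink_format_manifest on every table write.
ALTER TABLE delta.`<path-to-delta-table>`
SET TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true);
```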

Step 2: Create a BigLake table on the generated manifests 

Create a manifest-file-based BigLake table using the manifest files generated in the previous step. If the underlying Delta Lake table is partitioned, you can create a Hive-style partitioned BigLake table.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `my-project.mydataset.myDeltaTable`
WITH PARTITION COLUMNS (EventDate STRING)
WITH CONNECTION `my-project.us.bl_connection`
OPTIONS (
  hive_partition_uri_prefix = "<path-to-delta-table>/",
  uris = ['<path-to-delta-table>/_symlink_format_manifest/*/manifest'],
  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
  format = "PARQUET");
```
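Once created, the table can be queried like any other BigQuery table, and filters on the partition column let BigQuery prune partitions. The column and value here are illustrative:

```sql
-- Only manifest entries under EventDate=2023-05-01 need to be scanned.
SELECT COUNT(*)
FROM `my-project.mydataset.myDeltaTable`
WHERE EventDate = '2023-05-01';
```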

Step 3 (recommended): Upgrade to an accelerated BigLake table

Customers running large-scale analytics on Delta Lake workloads can accelerate performance by taking advantage of metadata caching and materialized views.

What’s Next?

If you are an OSS customer looking to query your Delta Lake or Apache Hudi workloads on GCS, try the manifest support. If you also want to further accelerate performance, take advantage of metadata caching and materialized views.

For more information

Accelerate BigLake performance to run large-scale analytics workloads

Introduction to BigLake tables.

Visit BigLake on Google Cloud.

Acknowledgments: Micah Kornfield, Brian Hulette, Silvian Calman, Mahesh Bogadi, Garrett Casto, Yuri Volobuev, Justin Levandoski, Gaurav Saxena and the rest of the BigQuery Engineering team.
