BigQuery under the hood: Behind the serverless storage and query optimizations that supercharge performance

By mullaned2002

March 17, 2023

544

Customers love the way BigQuery makes it easy for them to do hard things — from BigQuery Machine Learning (BQML) SQL turning data analysts into data scientists, to rich text analytics using the SEARCH function that unlocks ad-hoc text searches on unstructured data. A key reason for BigQuery’s ease of use is its underlying serverless architecture, which supercharges your analytical queries while making them run faster over time, all without changing a single line of SQL.

In this blog, we lift the curtain and share the magic behind BigQuery’s serverless architecture, such as storage and query optimizations as well as ecosystem improvements, and how they enable customers to work without limits in BigQuery to run their data analytics, data engineering and data science workloads.

Storage optimization

Improve query performance with adaptive storage file sizing

BigQuery stores table data in a columnar file store called Capacitor. These Capacitor files initially had a fixed file size, on the order of hundreds of megabytes, to support BigQuery customers’ large data sets. The larger file sizes enabled fast and efficient querying of petabyte-scale data by reducing the number of files a query had to scan. But as customers moving from traditional data warehouses started bringing in smaller data sets — on the order of gigabytes and terabytes — the default “big” file sizes were no longer the optimal form factor for these smaller tables. Recognizing that the solution would need to scale for users with big and smaller query workloads, the BigQuery team came up with the concept of adaptive file sizing for Capacitor files to improve small query performance.

The BigQuery team developed an adaptive algorithm to dynamically assign the appropriate file size, ranging from tens to hundreds of megabytes, to new tables being created in BigQuery storage. For existing tables, the BigQuery team added a background process to gradually migrate existing “fixed” file size tables into adaptive tables, to migrate customers’ existing tables to the performance-efficient adaptive tables. Today, the background Capacitor process continues to scan the growth of all tables and dynamically resizes them to ensure optimal performance.

“We have seen a greater than 90% reduction in the number of analytic queries in production that take more than one minute to run.” – Emily Pearson, Associate Director, Data Access and Visualization Platforms, Wayfair

Big metadata for performance boost

Reading from and writing to BigQuery tables maintained in storage files can become inefficient quickly if workloads had to scan all the files for every table. BigQuery, like most large data processing systems, has developed a rich store of information on the file contents, which is stored in the header of each Capacitor file. This information about data, called metadata, allows query planning, streaming and batch ingest, transaction processing and other read-write processes in BigQuery to quickly identify the relevant files within storage on which to perform the necessary operations, without wasting time reading non-relevant data files.

But while reading metadata for small tables is relatively simple and fast, large (petabyte-scale) fact tables can generate millions of metadata entries. For these queries to generate results quickly the query optimizer needs a highly performant metadata storage system.

Based on the concepts proposed in their 2021 VLDB paper, “Big Metadata: When Metadata is BigData,” the BigQuery team developed a distributed metadata system, called CMETA, that features fine-grained column and block-level metadata that is capable of supporting very large tables and that is organized and accessible as a system table. When the query optimizer receives a query, it rewrites the query to apply a semi-join (WHERE EXISTS or WHERE IN) with the CMETA system tables. By adding the metadata data lookup to the query predicate, the query optimizer dramatically increases the efficiency of the query.

In addition to managing metadata for BigQuery’s Capacitor-based storage, CMETA also extends to external tables through BigLake, improving the performance of lookups of large numbers of Hive partitioned tables.

The results shared in the VLDB paper demonstrate that query runtimes are accelerated by 5× to 10× for queries on tables ranging from 100GB to 10TB using the CMETA metadata system.

The three Cs of optimizing storage data: compact, coalesce, cluster

BigQuery has a built-in storage optimizer that continuously analyzes and optimizes data stored in storage files within Capacitor using various techniques:

Compact and Coalesce: BigQuery supports fast INSERTs using SQL or API interfaces. When data is initially inserted into tables, depending on the size of the inserts, there may be too many small files created. The Storage Optimizer merges many of these individual files into one, allowing efficient reading of table data without increasing the metadata overhead.

The files used to store table data over time may not be optimally sized. The storage optimizer analyzes this data and rewrites the files into the right-sized files so that queries can scan the appropriate number of these files, and retrieve data most efficiently. Why is the right size important? If the files are too big, then there’s overhead in eliminating unwanted rows from the larger files. If the files are too small, there’s overhead in reading and managing the metadata for the larger number of small files being read.

Cluster: Tables with user-defined column sort orders are called clustered columns; when you cluster a table using multiple columns, the column order determines which columns take precedence when BigQuery sorts and groups the data into storage blocks. BigQuery clustering accelerates queries that filter or aggregate by the clustered columns by only scanning the relevant files and blocks based on the clustered columns rather than the entire table or table partition. As data changes within the clustered table, BigQuery storage optimizer automatically performs reclustering to ensure that the cluster definition is continuously updated ensuring consistent query performance.

Query optimization

Join skew processing to reduce delays in analyzing skewed data

When a query begins execution in BigQuery, the query optimizer converts the query into a graph of execution, broken down into stages, each of which have steps. BigQuery uses dynamic query execution, which means the execution plan can evolve dynamically to adapt to different data sizes and key distributions, ensuring fast query response time and efficient resource allocation. When querying large fact tables, there is a strong likelihood that data may be skewed, meaning data is distributed asymmetrically over certain key values, creating unequal distribution of the data. Thus, a query of a skewed fact table is likely to cause more records for the skewed data over normal data. When the query engine distributes the work to workers to query skewed tables, certain workers may take longer to complete their task because there are excess rows for certain key values, i.e., skew, creating uneven wait times across the workers.

More worker capacity is allocated to Left or the Right side of the join depending on where the data skew is detected (Left side has data skew in task 2; Right side has data skew in task 1)

Let’s consider data that can show skew in its distribution. Cricket is an international team sport. However, it is only popular in certain countries around the world. If we were to maintain a list of cricket fans by country, the data will show that it is skewed to fans from full Member countries of the International Cricket Council and is not equally distributed across all countries.

Traditional databases have tried to handle this by maintaining data distribution statistics. However, in modern data warehouses, data distribution can change rapidly and data analysts can drive increasingly complex queries rendering these statistics obsolete, and thus, less useful. Depending on tables being queried on join columns, the skew may be on the table column referenced on the left side of the join or the right side.

The BigQuery team addressed data skew by developing techniques for join skew processing by detecting data skew and allocating work proportionally so that more workers are allocated to process the join over the skewed data. While processing joins, the query engine keeps monitoring join inputs for skewed data. If a skew is detected, the query engine changes the plan to process the joins over the skewed data. The query engine will further split the skewed data, creating equal distribution of processing across skewed and non-skewed data. This ensures that at execution time, the workers processing data from the table with data skew are proportionally allocated according to the detected skew. This allows all workers to complete their tasks simultaneously, thereby accelerating query runtime by eliminating any delays caused by waits due to skewed data.

“The ease to adopt BigQuery in the automation of data processing was an eye-opener. We don’t have to optimize queries ourselves. Instead, we can write programs that generate the queries, load them into BigQuery, and seconds later get the result.” – Peter De Jaeger, Chief Information Officer, AZ Delta

Dynamic concurrency with queuing

BigQuery’s documentation on Quotas and limits for Query jobs states “Your project can run up to 100 concurrent interactive queries.” BigQuery used the default setting of 100 for concurrency because it met requirements for 99.8% of customer workloads. Since it was a soft limit, the administrator could always increase this limit through a request process to increase the maximum concurrency. To support the ever-expanding range of workloads, such as data engineering, complex analysis, Spark and AI/ML processing, the BigQuery team developed dynamic concurrency with query queues to remove all practical limits on concurrency and eliminate the administrative burden. Dynamic concurrency with query queues is achieved with the following features:

Dynamic maximum concurrency setting: Customers start receiving the benefits of dynamic concurrency by default when they set the target concurrency to zero. BigQuery will automatically set and manage the concurrency based on reservation size and usage patterns. Experienced administrators who need the manual override option can specify the target concurrency limit, which replaces the dynamic concurrency setting. Note that the target concurrency limit is a function of available slots in the reservation and the admin-specified limit can’t exceed that. For on-demand workloads, this limit is computed dynamically and is not configurable by administrators.

Queuing for queries over concurrency limits: BigQuery now supports Query Queues to handle overflow scenarios when peak workloads generate a burst of queries that exceed the maximum concurrency limit. With Query Queues enabled, BigQuery can queue up to 1000 interactive queries so that they get scheduled for execution rather than being terminated due to concurrency limits, as they were previously. Now, users no longer have to scan for idle time periods or periods of low usage to optimize when to submit their workload requests. BigQuery automatically runs their requests or schedules them on a queue to run as soon as current running workloads have completed. You can learn about Query Queues here.

“BigQuery outperforms particularly strongly in very short and very complex queries. Half (47%) of the queries tested in BigQuery finished in less than 10 sec compared to only 20% on alternative solutions. Even more starkly, only 5% of the thousands of queries tested took more than 2 minutes to run on BigQuery whereas almost half (43%) of the queries tested on alternative solutions took 2 minutes or more to complete.” – Nikhil Mishra, Sr. Director of Engineering, Yahoo!

Colossus Flash Cache to serve data quickly and efficiently

Most distributed processing systems make a tradeoff between cost (querying data on hard disk) and performance (querying data in memory). The BigQuery team believes that this trade-off is a fallacy and that users can have both low cost and high performance, without having to choose between them. To achieve this, the team developed a disaggregated intermediate cache layer called Colossus Flash Cache which maintains a cache in flash storage for actively queried data. Based on access patterns, the underlying storage infrastructure caches data in Colossus Flash Cache. This way, queries rarely need to go to disk to retrieve data; the data is served up quickly and efficiently from Colossus Flash Cache.

Optimized Shuffle to prevent excess resource usage

BigQuery achieves its highly scalable data processing capabilities through in-memory execution of queries. These in-memory operations bring data from disk and store intermediate results of the various stages of query processing in another in-memory distributed component called Shuffle. Analytical queries containing WITH clauses encompassing common table expressions (CTE) often reference the same table through multiple subqueries. To solve this, the BigQuery team built a duplicate CTE detection mechanism in the query optimizer.This algorithm reduces resource usage substantially allowing more shuffle capacity to be available to be shared across queries.

To further help customers understand their shuffle usage, the team also added PERIOD_SHUFFLE_RAM_USAGE_RATIO metrics to the JOBS INFORMATION_SCHEMA view and to Admin Resource Charts. You should see fewer Resource Exceeded errors as a result of these improvements and now have a tracking metric to take preemptive actions to prevent excess shuffle resource usage.

“Our teams wanted to do more with data to create better products and services, but the technology tools we had weren’t letting us grow and explore. And that data was growing continually. Just one of our data warehouses had grown 300% from 2014 to 2018. Cloud migration choices usually involve either re-engineering or lift-and-shift, but we decided on a different strategy for ours: move and improve. This allowed us to take full advantage of BigQuery’s capabilities, including its capacity and elasticity, to help solve our essential problem of capacity constraints.” – Srinivas Vaddadi, Delivery Head, Data Services Engineering, HSBC

Ecosystem optimization

Faster ingest, faster egress, faster federation

The performance improvements BigQuery users experience are not limited to BigQuery’s query engine. We know that customers use BigQuery with other cloud services to allow data analysts to ingest from or query other data sources with their BigQuery data. To enable better interoperability, the BigQuery team works closely with other cloud services teams on a variety of integrations:

BigQuery JDBC/ODBC drivers: The new versions of the ODBC / JDBC drivers support faster user account authentication using OAuth 2.0 (OAuthType=1) by processing authentication token refreshes in the background.

BigQuery with Bigtable: The GA release of Cloud Bigtable to BigQuery federation supports pushdown of queries for specific row keys to avoid full table scans.

BigQuery with Spanner: Federated queries against Spanner in BigQuery now allow users to specify the execution priority, thereby giving them control over whether federated queries should compete with transaction traffic if executed with high priority or if they can complete at lower-priority settings.

BigQuery with Pub/Sub: BigQuery now supports direct ingest of Pub/Sub events through a purpose-built “BigQuery subscription” that allows events to be directly written to BigQuery tables.

BigQuery with Dataproc: The Spark connector for BigQuery supports the DIRECT write method, using the BigQuery Storage Write API, avoiding the need to write the data to Cloud Storage.

What can BigQuery do for you?

Taken together, these improvements to BigQuery translate into tangible performance results and business gains for customers around the world. For example, Camanchaca drove 6x faster data processing time, Telus drove 20x faster data processing and reduced $5M in cost, Vodafone saw 70% reduction in data ops and engineering costs, and Crux achieved 10x faster load times.

“Being able to very quickly and efficiently load our data into BigQuery allows us to build more product offerings, makes us more efficient, and allows us to offer more value-added services. Having BigQuery as part of our toolkit enables us to think up more products that help solve our customers’ challenges.” – Ryan Haggerty, Head of Infrastructure and Operations, Crux

Want to hear more about how you can use BigQuery to drive similar results for your business? Join us at the Data Cloud and AI Summit ‘23 to learn what’s new in BigQuery and check out our roadmap of performance innovations using the power of serverless.

Cloud BlogRead More

Previous articleHelping secure global collaboration at the first federally regulated crypto bank

Next articleIntelligently search your organization’s Microsoft Teams data source with the Amazon Kendra connector for Microsoft Teams

BigQuery under the hood: Behind the serverless storage and query optimizations that supercharge performance

Storage optimization

Improve query performance with adaptive storage file sizing

Big metadata for performance boost

The three Cs of optimizing storage data: compact, coalesce, cluster

Query optimization

Join skew processing to reduce delays in analyzing skewed data

Dynamic concurrency with queuing

Colossus Flash Cache to serve data quickly and efficiently

Optimized Shuffle to prevent excess resource usage

Ecosystem optimization

Faster ingest, faster egress, faster federation

What can BigQuery do for you?

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

Run computer vision inference on large videos with Amazon SageMaker asynchronous endpoints

The Pros and Cons of AI in Essay Writing: A Complete Understanding

Figuring out microservices running on your GKE cluster with help from Duet AI

POPULAR CATEGORY