Monday, June 17, 2024
No menu items!
HomeCloud ComputingAchieve higher performance and lower query cost for BigQuery integer or timestamp...

Achieve higher performance and lower query cost for BigQuery integer or timestamp lookups

The first incarnation of search indexes in BigQuery focused on fast and efficient lookups on STRING data elements, either in standalone STRING scalar columns, or within an ARRAY, STRUCT, or JSON column. Our previous blog posts showcased the orders-of-magnitude performance gains achievable when utilizing indexes with the SEARCH function and other functions and operators.

Today, we are announcing the public preview of numeric search indexes, which enables optimized lookups on INT64 and TIMESTAMP data types. With this change, the EQUAL(=) and IN operations on these data types can utilize search indexes to reduce byte scans for improved performance. So now your lookups for account IDs or transactions IDs or log timestamps can get faster and cheaper.

In this blog, we demonstrate the gains on real data, showcasing index creation and queries on a 100TB log table called log_table that contains Google Cloud Logging data for an internal Google test project.

The base table details are as follows:

The table has the following columns of interest:

jsonPayload: type JSON

This jsonPayload has a leaf field named threadId of type JSON number.

sourceLocation: type RECORD (or STRUCT) with two sub-fields of interest:

file: type STRING, containing the name of the file producing the log entry

line: type INT64, containing the line number in the file where the log entry was produced. 

Using search indexes

By default, a search index is created for the STRING data only. If you want to index INT64 or TIMESTAMP, you need to provide them in the index option called data_types. In the following example, all data of type STRING and INT64 in the log_table table will be indexed.

code_block
<ListValue: [StructValue([(‘code’, “CREATE SEARCH INDEX index_with_string_int64rnON log_table (ALL COLUMNS)rnOPTIONS (data_types = [‘STRING’, ‘INT64’])”), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e579ab8adf0>)])]>

JSON field search

In this first example, we want to search for log entries that have the thread ID 12104 in the JSON payload.

code_block
<ListValue: [StructValue([(‘code’, ‘SELECT timestamp, jsonPayloadrnFROM log_tablernWHERE INT64(jsonPayload.threadId) = 12104;’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e579a180700>)])]>

We compare between having a search index and having no index. Given that log entries with this thread ID are very rare, the results show dramatic improvements on all three metrics:

 

Metrics 

Without Index

With Index

Improvement

Execution Time (ms)

48,790

4,664

10x

Processed Bytes

2,174,758,158,336

774,897,664

2,806x

Slot Usage (ms)

25,735,222

7,300

3,525x

STRUCT nested field search

In the second example, we count how many log entries are produced from a certain line of code (line 813 in the file borg/borgletlib/borgletlib.cc).

code_block
<ListValue: [StructValue([(‘code’, “SELECT count(*)rnFROM log_tablernWHERE sourceLocation.file = ‘borg/borgletlib/borgletlib.cc’rn AND sourceLocation.line = 813″), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e579a180b20>)])]>

Note that sourceLocation.file is a STRING field. A search index on only STRING data type can already help improve the query performance as shown below. However, with indexing on also the INT64 data type, the performance can be further improved.

 

Metrics 

No Index

With IndexSTRING-only

With IndexSTRING & INT64

Execution Time (ms)

57,169

11,571

4.9x

7,982

7.1x

Processed Bytes

1,703,843,725,312

976,230,547,456 

1.7x

682,560,061,440

2.4x

Slot Usage (ms)

38,947,660

25,595,348 

1.5x

8,256,218

4.7x

Do I need indexing if I use partitioning/clustering?

While partitioning and clustering can optimize filtering and lookups, they have certain limitations. For instance, partitioning can only be done on a single column, and clustering allows up to four columns per table. However, clustering is most effective when filtering on the first clustering column, as subsequent columns often provide minimal pruning power. Furthermore, both partitioning and clustering are limited to top-level columns.

Search indexes on INT64/TIMESTAMP complement these BigQuery features by enabling lookup optimizations on any number of columns. In addition,  as demonstrated above, they cover struct nested fields, array elements, and JSON leaf fields.

This feature is currently in preview. For more information, refer to Optimize with numeric predicates.

Cloud BlogRead More

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments