Thursday, October 3, 2024
No menu items!
HomeCloud ComputingYear in Review for BigQuery User Friendly SQL

Year in Review for BigQuery User Friendly SQL

Early in the year, we shared the new capabilities that made BigQuery SQL more user-friendly than ever in our Valentine’s Day update. In this year-end edition, we are pleased to walk you through new SQL capabilities that we launched in the categories of Enterprise Class, Data Quality and Schema Operations.

Enterprise Class

For our enterprise data analysts, we are pleased to share new capabilities in BigQuery SQL that allow them to manage their workloads using the same familiarity that they may have used in legacy databases and data warehouses.

Sessions (GA)

A database session (or session) represents a connection between a user or a program and a database or data warehouse that stores and queries data. A data analyst through a tool or a program may execute multiple commands which are tracked under this session. BigQuery at its heart is a stateless query engine which means it doesn’t maintain a persistent network connection, as traditional database connections do. This allows BigQuery to scale compute and storage independently without limits.

For those enterprise data administrators looking to manage BigQuery activity through sessions, they can optionally enable session support in BigQuery using the UI, API and CLI or can by setting the EnableSession parameter in the JDBC or ODBC driver.  This allows enterprises to accrue all the benefits of sessions without the overhead of a persistent connection.

For data analysts, sessions provides the benefits of:

Session variables: which allows the creation of a session-specific variable that is persisted for all commands that get executed in the session.

System variables: Pre-build variables for timezone, dataset identifiers, projects, session identifiers, which if specified, allow the data analysts to override the organization or project level default values with session-specific settings.

Session-duration temporary tables and temporary functions: When temporary tables are used in a session to stage results or a session-specific temporary function is used to make query operations efficient, they are persisted for the entire duration of the session and are accessible to all jobs in that session.

For data administrators, the benefits are:

Session labels: By assigning labels for job tracking, data administrators can find all activities associated with that session using that label in the audit log.

Session management: data administrators can manually terminate sessions by calling the BQ.ABORT_SESSION system procedure with the session ID as input.

BigQuery sessions also support multi-statement queries and transactions. Documentation

code_block[StructValue([(u’code’, u’– Create a session as a part of a transactionrnBEGIN TRANSACTIONrnCREATE TEMP TABLE Flights(total INT64) AS SELECT * FROM UNNEST([10,23,3,14,55]) AS a;rnSELECT * FROM Flights;rn– clear the contents of the same query windowrnDECLARE x INT64 DEFAULT 10;rnCOMMIT;rnrn– Lookup sessionID details from information schemarnWITH running_transactions AS (rn SELECT DISTINCT transaction_idrn FROMrn `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECTrn WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)rn EXCEPT DISTINCTrn SELECT transaction_id FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECTrn WHERErn creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)rn AND statement_type = “COMMIT_TRANSACTION”rn OR statement_type = “ROLLBACK_TRANSACTION”rn)rnSELECTrn jobs.transaction_id AS transaction_id,rn project_id,rn user_email,rn session_info.session_id,rn queryrnFROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS jobs, running_transactionsrn WHERErn creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)rn AND destination_table.table_id = “Flights”rn AND jobs.transaction_id = running_transactions.transaction_id;rnrnSELECT total * x AS total_a FROM Flights LIMIT 2;rnrn– Use session specific variablesrnBEGINrn SET x = 100;rn SELECT total * x AS total_b FROM Flights LIMIT 2;rnEND;rnrn– Use and set system variables in a sessionrnSELECT STRING(TIMESTAMP “2008-12-20 15:30:00+00”, @@time_zone) AS default_time_zone;rnrnSET @@time_zone = “America/Los_Angeles”;rnrnSELECT STRING(TIMESTAMP “2008-12-20 15:30:00+00”, @@time_zone) AS new_time_zone;’), (u’language’, u’lang-sql’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1de230c050>)])]

Case-insensitive dataset and table names (Preview)

BigQuery tables names and dataset names are case sensitive, by default, reflecting its origins in big data analysis where dataset file names have maintained case sensitivity. To become more user-friendly to the tools and programs brought over by data analysts from their legacy data warehouses, BigQuery now supports case-insensitive table names and schema (dataset) names so that a dataset.Table ≡ DaTaSeT.TABLE ≡ Dataset.table ≡ DATASET.tAbLe in BigQuery. You can configure this in the definition (DDL) for dataset or table using the is_case_insensitive option. Documentation

code_block[StructValue([(u’code’, u’– Create a case-insensitive datasetrnCREATE SCHEMA mydatasetrnOPTIONS(is_case_insensitive=TRUE)rnrn– Create a table in the case-insensitive dataset.rnCREATE TABLE mydataset.table1(id STRING);rnrn– Query table with different casingrnSELECT * FROM myDATASET.TaBLE1;’), (u’language’, u’lang-sql’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1de237a990>)])]

Data Quality

As data analytics platforms, such as BigQuery, bring data from different sources to help with decision making, data engineers need to maintain the quality of critical data elements. Data sources may be imperfect and have incomplete information. Therefore, data engineers need to configure intelligent defaulting logic when column data is missing to ensure that the right data gets populated. Similarly, data pipelines need to provide flexible logic for numeric data to reduce bias and ensure accurate outcomes. Data analysts need to match text information independent of case to ensure accurate reporting.

Default column value (Preview)

When new rows are inserted into a table, some columns may not have any data. Default value expression allows the specification of a default value using a literal value or a function that computes the default value when the associated column data is missing.  The default value specification can be set when creating a new table or by altering the column properties of an existing table. In addition, DML statements such as INSERT or CREATE TABLE AS SELECT or MERGE allow the specification of the defaulting logic in place of an actual inserted value which uses the defaulting specification of the column being inserted into. Documentation

code_block[StructValue([(u’code’, u”– Create a table with a default value like CURRENT_TIMESTAMP() etc..rnCREATE OR REPLACE TABLE mydataset.orders rn(rn customer STRING DEFAULT ‘foo’,rn order_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),rn user STRING DEFAULT SESSION_USER(),rn details STRUCT<a STRING, b STRING> DEFAULT (‘a1’, ‘b1’)rn);rnrn– Insert to table with omitted columnsrnINSERT INTO mydataset.orders(customer, user) rnVALUES (‘customer1’, DEFAULT),rn (‘customer2’, ‘David’);rnrn– Validates default value are usedrnSELECT * FROM mydataset.orders;”), (u’language’, u’lang-sql’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1de237af50>)])]

Case-insensitive string collation (Preview)

A collation specification when associated with a column operation, e.g. join, comparison or ORDER BY or GROUP BY clause, determines the logic used to compare or order string values. For example, the default collation specific in BigQuery is ‘binary’ which uses the code point order to specify the ordering sequence in Unicode by which all uppercase letters [A-Z] precede lower case letters [a-z]. We are pleased to offer case insensitive collation, specified using ‘und:ci’  by which [A,a] will be treated as equivalent characters and will precede [B. b] for string value operations. This allows data analysts to find matching string values independent of the case by treating Maclean and MacLean as equivalent.  Documentation

code_block[StructValue([(u’code’, u’– Create a table with case-insensitive string columns rnCREATE or replace TABLE mydataset.table1rn( rn id INT64,rn column_ci STRING COLLATE ‘und:ci’rn);rn rninsert into mydataset.table1rnvalues (1,”alpha”),(2,”BETA”), (3,”Alpha”);rn rn– Filter by column_ci case-insensitivelyrnSELECT * FROM `mydataset`.table1 WHERE column_ci = ‘Alpha’;rn rn– Sort by column_ci case-insensitivelyrnSELECT * FROM mydataset.table1 ORDER BY column_ci;rn rn– Group by column_ci case-insensitivelyrnSELECT column_ci, sum(id) FROM mydataset.table1 GROUP BY column_ci;’), (u’language’, u’lang-sql’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1de237a710>)])]

Banker’s rounding (GA)

BigQuery has been natively rounding all values that overflow maximum precision using the most common rounding logic: “round half away from zero”. Using “Round half away from zero” exclusively however, can introduce a rounding  bias in the data by causing the aggregate operation e.g. sum or average to drift away from the actual value. Other rounding techniques, such as  Gaussian rounding, commonly known as banker’s rounding, provide an alternate rounding scheme which does not suffer from negative or positive bias as much as the round half away from zero method over aggregations over most reasonable distributions. With this, BigQuery is pleased to introduce support for multiple rounding modes for columns in table definitions and in the explicit ROUND function: the existing mode of rounding, called “round_half_away_from_zero”, and a new mode, Bankers Rounding, otherwise known as “round_half_even”. The “round_half_even” mode rounds towards the nearest “neighbor” unless both neighbors are equidistant, in which case, round towards the even neighbor, e.g., 3.1 & 3.2 will round to 3 and 3.5 & 4.5 will round to 4. Documentation 

code_block[StructValue([(u’code’, u’– ROUND_HALF_EVEN will round down 2.25 to 2.2rnSELECT ROUND(NUMERIC “2.25”, 1, “ROUND_HALF_EVEN”);rn rn– Create a table with ‘ROUND_HALF_EVEN’ numeric columnrnCREATE or REPLACE TABLE mydataset.table1 (rn x NUMERIC(5,2) OPTIONS (rounding_mode=’ROUND_HALF_EVEN’),rn y NUMERIC(5,2) OPTIONS (rounding_mode=’ROUND_HALF_AWAY_FROM_ZERO’)rn);rn rn– Insert values into the new columns, values that are in halfway between 2.22 and 2.23rnINSERT INTO mydataset.table1 VALUES (2.225, 2.225);rn rn– Select values and view how they were rounded to scale of 2.rnSELECT * FROM mydataset.table1;’), (u’language’, u’lang-sql’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1de237abd0>)])]

Schema operations

Continuing the expansion of SQL syntax for schema operations, BigQuery is pleased to introduce two new SQL capabilities: LOAD DATA to load data into tables and RENAME column to allow existing table columns to be renamed

LOAD DATA (Preview)
This statement provides data engineers and analysts with a SQL interface to load one or more files into tables. It provides the ability to load into a named table, create a new table , or truncate an existing table as a part of the load from all the file types supported by BigQuery including CSV, AVRO, Parquet, JSON, etc. In addition to specifying table options, such as table metadata or table expiration, the LOAD DATA command also allows users to specify partition and clustering schemes as a part of the table specification or from a hive partitioning specification for external files.   

LOAD DATA uses the same resource pool as the BigQuery load api to load data. If flat slots customers have configured a PIPELINE reservation, then the LOAD DATA will use the dedicated compute slots from the PIPELINE reservation assignment. Documentation

code_block[StructValue([(u’code’, u”– Load a CSVfile into a table.rnLOAD DATA INTO mydataset.table6rn FROM FILES(rn format=’CSV’,rn uris = [‘gs://cloud-samples-data/bigquery/us-states/us-states-by-date.csv’]rn );rn rn select * from mydataset.table6;rnrn– Truncate table and load a CSV file into it.rnLOAD DATA OVERWRITE mydataset.table6rn FROM FILES(rn format=’CSV’,rn uris = [‘gs://cloud-samples-data/bigquery/us-states/us-states-by-date.csv’]rn );rn rn select * from mydataset.table6;rn rn– Load a CSV file into a table with schema, partitioning and clusteringrnLOAD DATA INTO mydataset.table7(name STRING, post_abbr STRING, date DATE)rn PARTITION BY datern CLUSTER BY post_abbrrn FROM FILES(rn skip_leading_rows=1,rn format=’CSV’,rn uris = [‘gs://cloud-samples-data/bigquery/us-states/us-states-by-date.csv’]rn );rn rnselect * from mydataset.table7;rn– Load hive partitioned PARQUET files, infer partition layoutrnLOAD DATA INTO mydataset.table8rn FROM FILES(rn format=’PARQUET’,rn uris = [‘gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout/*’],rn hive_partition_uri_prefix=’gs://cloud-samples-data/bigquery/hive-partitioning-samples/autolayout’rn )rn WITH PARTITION COLUMNS;rnselect * from mydataset.table8;”), (u’language’, u’lang-sql’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1de224fcd0>)])]

RENAME column (Preview)

RENAME COLUMN allows users to change a column name in an existing table; this is a metadata only change. Previously, renaming columns required a workaround which required column data to be rewritten. Now, when a column has to be renamed because the current naming is obsolete or was named incorrectly in error, data owners can run a zero cost metadata only command, RENAME column, to fix and correct the column name. Documentation

code_block[StructValue([(u’code’, u’– Rename two columns rnCREATE or replace TABLE mydataset.table9rn(columnA int64, columnB string);rnselect column_name, ordinal_position, data_typernfrom information_schema.columnsrnwhere table_schema = u201cmydatasetu201drnand table_name = “table9”;rnALTER TABLE mydataset.table9rn RENAME COLUMN columnA TO columnA1,rn RENAME COLUMN IF EXISTS columnB TO columnB1;rnselect column_name, ordinal_position, data_typernfrom information_schema.columnsrnwhere table_schema = u201cmydatasetu201drnand table_name = “table9”;rn– Swap two columnsrnALTER TABLE mydataset.table9rn RENAME COLUMN columnA1 TO temp,rn RENAME COLUMN columnB1 TO columnA,rn RENAME COLUMN temp TO columnB;rn select column_name, ordinal_position, data_typernfrom information_schema.columnsrnwhere table_schema = u201cmydatasetu201drnand table_name = “table9”;’), (u’language’, u’lang-sql’), (u’caption’, <wagtail.wagtailcore.rich_text.RichText object at 0x3e1de224ffd0>)])]

As we close out this year, we are excited about bringing new capabilities in the new year to you, our BigQuery data aficionados, to make BigQuery SQL more user-friendly than ever.

Cloud BlogRead More

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments