Wednesday, December 6, 2023
No menu items!
HomeData IntegrationSnowflake Data Types, Subtypes, and Unsupported Formats

Snowflake Data Types, Subtypes, and Unsupported Formats

We talk a lot about the increasing volume, velocity, and variety of data these days. In this piece, we will drill down on the variety — otherwise known as data types. A data type classifies the kind of data that a variable or object can store in a computer program. 

Data types are crucial in programming as they help allocate memory for variables and objects and perform operations on them. Declaring a variable’s data type is crucial for the program to work as expected. Data types also aid in type checking and catching errors during compilation.

In this piece, we’re going to explore Snowflake data types specifically. 

An Overview of Snowflake and Its Data Types

Snowflake is a self-service, cloud-native data warehousing and analytics solution. Its architecture leverages public clouds like AWS, Google Cloud, and Microsoft Azure. Snowflake is ideal for organizations looking for a highly-performant and collaborative data and analytics solution with a flexible, scalable architecture that allows secure and easy data sharing.

Before working with Snowflake, one must understand the different data types present and how these data types affect building your data solutions. For example, Snowflake allocates physical storage space to columns based on its data types, guiding you on the limits of every column. 

Some other advantages of knowing data types in Snowflake include:

Optimized query performance: Knowing the data types guides how you write your queries and choosing the columnar and query acceleration features for optimizing query performance. For example, Snowflake stores date and timestamps more efficiently as DATE and TIMESTAMP types rather than VARCHAR, resulting in better-performing queries.
Efficient storage allocation for better management: Storage usage increases your Snowflake costs. Assigning the correct data types to your data helps you to organize your data better and efficiently, thus minimizing unnecessary expenses.
Data compression: Knowing what data types are in your table is essential when applying compression techniques like run-length and delta-encoding, thus helping you compress data most efficiently to optimize storage consumption and data access.
Better data quality and consistency: Transformation and data validation checks ensure high-quality data for analysis. Defining and creating validation checks at checkpoints requires adequate knowledge of data types to apply the right processors to ensure data is consistent and of high quality.

Snowflake currently supports six data types: Numerical, string & binary, logical, date & time, semi-structured, and geospatial data types. Let’s explore each of these types in detail.

Numeric Data Types

Before working with numerical data types, you must understand precision and scale.

Precision: Precision refers to the maximum number of digits a number contains.
Scale: Scale represents the number of digits after a decimal point.

A NUMERIC type is represented as NUMBER (precision, scale). While precision does not affect storage, as the storage requirements for two numbers with different levels of precision remain the same, scale does affect storage requirements. For example, Number A(25, 6) requires more storage than Number B (25, 2), and processing A takes longer and consumes more memory.

Here are the data types included in the numerical data types:

NUMBER, which stores a whole number, is represented as NUMBER(38, 0), where precision is 38 and scale represents 0.
DECIMAL and NUMERIC, which are synonymous with number.
INT, INTEGER, BIGINT, and SMALLINT are also NUMBER types, but the precision and scale remain fixed at 38 and 0, respectively.
FLOAT (FLOAT, FLOAT4, FLOAT8) are 64-bit floating-point numbers compatible with other systems. It also supports special values like NaN (Not a Number), inf (infinity), and -inf (negative infinity).

String & Binary Data

Snowflake supports the following text/string data types:

VARCHAR, which holds unicode UTF-8 characters. Unless you specify an optional parameter (N) stating the maximum number of characters it can have, the default is 16,777,2116.
CHAR, CHARACTER, NCHAR, synonymous with VARCHAR, except that unless specified, CHAR (1) is the default, i.e., the maximum number of characters it holds becomes 1.

Unlike string data types, binary uses bytes to store its data as it cannot understand Unicode characters. It supports BINARY, whose default length is the maximum length of 8 MB (8,388,608 bytes), and VARBINARY, which is synonymous with BINARY.

String also contains string constants, fixed values enclosed with delimiter characters like single quotes (”) and a pair of dollar signs ($$).

Logical Data

Snowflake supports a single logical data type: BOOLEAN. BOOLEAN values can be TRUE, FALSE, or NULL if the value is unknown. The existence of these possible Boolean values enables support for Tenary logic. In an explicit conversion, STRING and NUMERIC values can be converted to BOOLEAN using the TO_BOOLEAN or CAST functions.

STRING conversion: STRING values like ‘true,’ ‘t,’ ‘yes,’ ‘y’, ‘on,’ ‘1’ convert to TRUE while ‘false,’ ‘f,’ ‘no,’ ‘n,’ ‘off’, ‘0’ convert FALSE.
NUMERIC conversion: NUMERIC values like 0 convert to FALSE, while any non-zero figure converts to TRUE.

Date & Time Data

This datatype helps Snowflake manage dates, times, and timestamps and supports the following data types:

DATE, which stores data in YYYY-MM-DD, DD-MON-YYYY. All TIMESTAMP values are valid DATE values but without the TIME value.
DATETIME, a shorthand for TIMESTAMP_NTZ.
TIME, which stores the time in HH:MI:SS, with a default precision of 9, meaning all time values can only range from 00:00:00 to 23:59:59.999999999. In storing TIME and performing TIME operations, Snowflake uses a wall clock and considers no time zone.
TIMESTAMP: TIMESTAMP supports three variations: TIMESTAMP_LTZ, TIMESTAMP_TZ, and TIMESTAMP_NTZ. The difference between these variations is in the use of time zones. TIMESTAMP_LTZ stores the TIMESTAMP in the current session time zones, TIMESTAMP_NZ uses no time zone when performing operations, and TIMESTAMP_TZ performs operations using a time zone offset. Snowflake uses the TIMESTAMP_TZ variation when no time zone is provided.

Snowflake advises using years between 1582 and 9999 for DATE and TIMESTAMP data. Although it accepts other years outside this range, it suggests against using years before 1582 because of limitations in the Gregorian calendar.

Semi-Structured Data

Semi-structured data types in Snowflake allow you to load and perform operations on data in JSON, Avro, ORC, Parquet, or XML formats. These data types include:

VARIANT is a generic data type that can hold data of any type, including OBJECT and ARRAY. It can have values up to 16MB, making it ideal for archiving large data files.
OBJECT, akin to a JSON OBJECT, and stores key-value pairs. Every OBJECT key is a VARCHAR, and the value is a VARIANT. OBJECTS also contain constants delimited by curly braces({}).
ARRAY contains 0 or more data values, and every element in the array occupies an index position used for accessibility. Unlike OBJECTS, an ARRAY can have NULL values. Furthermore, ARRAYS can grow in length. The index values of an ARRAY can start from 0 and occur sequentially (2,3,4), as seen in dense ARRAYS, or occur non-sequentially (3,7,11), as seen in sparse arrays. Indexes with no corresponding elements are said to have an undefined value. Currently, Snowflake doesn’t support fixed-length ARRAYS.

Geospatial Data

This data type supports geospatial features on the earth’s surface, like polygons, lines, and points. Snowflake supports two data types from this category:

GEOGRAPHY: This data type treats the earth as a round sphere and follows the WGS 84 standard. It supports data of different types like Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon, and others. Every point describing a position of the earth uses longitude (-180 to +180 degrees) and latitude (-90 to +90 degrees).
GEOMETRY: This type represents data as a plane (Euclidean, Cartesian) using a coordinate system of two points. These two points (x,y) represent two real numbers, with their units determined by the Spatial Reference System (SRS). Currently, support is only available for 2D coordinates.

Snowflake advises storing any geography data you have (longitude and latitude data, WKT, WKB, GeoJSON) in the GEOGRAPHY column instead of in their original VARCHAR, VARIANT, and NUMBER formats. You can make these conversions using the Snowflake geospatial functions.

Data Types Not Supported in Snowflake

Snowflake currently offers no support for loading and operating on the following data types:

LOB (Large Objects): This includes BLOB (Binary Large Objects) and CLOB (Character Large Objects). Snowflake suggests using Binary for BLOB and VARCHAR for CLOB instead.
Others like ENUM and user-defined data types.

Simplify Snowflake With StreamSets

StreamSets Transformer for Snowflake lets data engineers process Snowflake data using Snowflake’s Snowpark client libraries. Our user-friendly interface removes the need to write SQL queries manually, thus allowing non-technical and other technical team members to design and perform complex data processing tasks within Snowflake. 

Furthermore, StreamSets provides Snowflake data cloud users access to governance features like permission and access control while ensuring your analytics stay updated on recent, relevant data via automatic data drift detection during streaming operations. There are also reduced security concerns and risks, as all data transformations occur within your Snowflake environment.

Accelerate the power of your Data Cloud through the StreamSets and Snowflake partnership.

The post Snowflake Data Types, Subtypes, and Unsupported Formats appeared first on StreamSets.

Read MoreStreamSets



Please enter your comment!
Please enter your name here

Most Popular

Recent Comments