Transformer for Snowflake: Snowflake Transformation that Meets Cloud-First Expectations

By mullaned2002

February 3, 2022

464

It’s likely you remember (or have at least heard) that databases in the past, beyond SQL, were extremely complicated to stay in a performant state. You typically had to schedule workloads so that your Data Warehousing load jobs did not impact your Reporting and Analytics processes. Way back when, we had not even stepped into the space of having a clustered environment yet.

Then came Snowflake!

It made setting up, scaling, loading, and storing your data as easy as point and click. Gone were the days where you had to purchase hardware that is powerful enough for the one-time-a-year processing necessary and then sits under-utilized the other eight to ten months out of the year.

Now, you can scale your queries on demand to be as performant as you need or create a Virtual Warehouse for your Analytics team while your Warehousing team uses a different Virtual Warehouse, both with different CPU and Memory configurations but processing on the same data. No more insane scheduling configurations, no more administration. Just load, query and go! A product that is what one expects of a cloud-first product!

Yet, we seem to use products that are not cloud-first–where we take the data out of the cloud platform and process it locally and then send it back up. That just seems so counterintuitive. Yet, the reality is most “Cloud” databases today are nothing more than their on-premise version, just hosted in the cloud. We are still stuck with the limitations that we had before. We cannot run our Data Quality or some other custom code on the data where it is without taking a huge hit to performance.

Then, Snowflake did it again and delivered Snowpark. Snowpark has two unique functionalities that can work separately, or collectively, for powerful Snowflake transformation.

Snowpark: Two Unique Functionalities for Snowflake Transformation

First, we have Snowpark UDF’s which can be Java or Python and allow you to embed your application libraries into the actual Snowflake Engine. Once the code and supporting libraries have been pushed into the Snowflake Staging area, you can then create a Snowflake UDF Function that references the function in your application and states how many parameters it requires as well as the response.

These functions must be Scalar functions. In this article, I will create a simple Java UDF that will validate if an email is in proper format. Yes, you could do this with REGEX function in Snowflake, but it is a simple enough example to get you started on what I am sure is a number of libraries you have that are much more complicated.

Finally, we have Snowflake Snowpark, which is a library that is written in Scala that allows you to build a client application which will convert your code into the proper Snowflake SQL and execute across the cluster you have sized.

As stated earlier, either of these can be run without the other, however they become very powerful when working together. Taking advantage of them simultaneously will give you the ability to both create a very rich and powerful application and keep your data 100% in the Snowflake environment during a Snowflake transformation. This greatly increases your application’s security and performance. You can now execute against an infinite scale of data without ever needing to worry about the limitations of your network and/or hardware running the application.

Private Preview: Transformer for Snowflake

StreamSets gives you that very application we just discussed in our new Transformer for Snowflake which is set for Private Preview throughout Q1 of 2022.

Transformer for Snowflake is fully hosted in the StreamSets Cloud which means zero configuration or setup. All that is required is to sign up for the StreamSets DataOps Platform and you can begin working immediately with your Snowflake data.

This initial release of the product will have 17 processors that, when added to a pipeline, utilize the Snowflake Snowpark library which in turn generates the SQL and pushes it down 100%. This allows your pipelines doing ingress and egress to be much faster as they are not having to handle anything other than transportation of the data from Origin to Destination.

Your core business logic of the data processing can now be done in the database where it belongs and can be performed at scale–without needing to pull the data out and put it back in. Just imagine having the ability to process through petabytes of data through an application so much faster. No more waiting for what seems like forever for your Snowflake transformation, because it has to first source the data over the network and then run your business logic on it.

Anyway, enough words. Let’s actually get to working with the products! All the assets that we will be working with you can find at our GitHub repo.

Email Validator: Java UDF

First thing we need to do is create our UDF. For this example, I am going to be using Java. You can either clone the repo linked above or create a maven project in whatever IDE you are familiar with.

EmailValidator.java

package com.streamsets;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* Email Validation UDF
*/
public class EmailValidator {

private final String regex = “^(.+)@(.+)$”;
/**
* @param email Email to validate.
*/
public boolean validate(String email) {
if (email == null)
return false;
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(email);

return matcher.matches();
}
}

This is a very simple class that has a single function called validate that takes a single parameter of type String which we expect to be an email and returns a boolean value in response. We use a regular expression to do a very basic check of the email to make sure that it has some prefix with an ampersand and then some suffix. If the email value being passed in matches the regular expression, then we return true; otherwise if it does not match the regular expression or the input is null, we return false.

You can build this project if you cloned it with the Makefile provided `make build` which will create a target directory with a jar file email-validator-1.0-SNAPSHOT.jar. If your own project is using 3rd party libraries, which it most likely will be, then you will need to generate a single jar with all of the libraries or update the code for the PUT call to Snowflake and the Snowflake Create Function command to associate those libraries.

You will see in the put_command.sql the following lines:

put
   file://./target/email-validator-1.0–SNAPSHOT.jar
  @~/PLG/
  auto_compress = false
  overwrite = true
   ;

This file will be called next via our Makefile. also to take the jar file we just built and upload it into our Snowflake environment to be used eventually in the creation of our Java UDF. You can have multiple PUT statements in this file as needed for your Snowflake transformation.

Here’s what this file is doing:

Take the file (i.e. email-validator-1.0-SNAPSHOT.jar) and upload it into
/PLG/ Staging Directory in our Snowflake environment
Should we compress the file
Should we overwrite the file in the Staging Directory

Again, using our Makefile, we can issue the following command `make put account=<snowflake account> warehouse=<warehouse> database=<database> schema=<schema> user=<username>`

If we look at the Makefile for the put command reference we see the following:

.PHONY: put # Put jar file into Snowflake
put:
@snowsql -a $(account) -w $(warehouse) -d $(database) -s $(schema) -u $(user) -f put_command.sql

We are simply calling the snowsql command and telling it to execute the command(s) in the put_command.sql file. You will be prompted to enter your Snowflake password and, if successful, you should have a similar output.

Congratulations! You’ve just created your Java UDF application and uploaded it to Snowflake.

Now, let’s create the Java UDF Function in Snowflake and test it out.

Snowflake Java UDF

Now that we have the required file(s) loaded into our Snowflake Staging area, we can create the function and test that it is working properly. In the repo you will see a file called `create_udf_function.sql`. It has the following code.

create function EmailValidator(email varchar)
   returns boolean
   language java
imports = (‘@~/PLG/email-validator-1.0-SNAPSHOT.jar’)
   handler = ‘com.streamsets.EmailValidator.validate’
   ;

The first line of code is the declaration of the function where we give it a name that Snowflake will recognize, what input parameters that it accepts, and their data types.

For our example, our UDF:

Line 1: Takes a single parameter of name: email and of data type: varchar.
Line 2: States the return type; in our case it is of type boolean.
Line 3: States the language that the function expects our application to be written in for when it is executed.
Line 4: Represents where we import all of our supporting libraries and/or properties files that our application will need in order to properly execute.
Line 5: Shows where we point to the exact function in our application. It must be a fully qualified class path as well as the function name. Your class can contain as many functions you need for unique Snowflake UDF Functions or support functions.

When we execute this statement in our Snowflake Worksheet (and we have everything correct) it will give us a Success message.

To validate that our function is working properly, we can simply execute the following:

Positive Test

Select EmailValidator(‘[email protected]’)
(This will return a single row result set with a value of TRUE.)

Negative Test

Select EmailValidator(‘name!place.domain’)
(This will return a single row result set with a value of FALSE.)

Snowflake Transformation: Example Pipeline to Validate Emails

This is a very simple yet powerful pipeline as it utilizes both the custom UDF we wrote and processes the entire pipeline internally in the Snowflake Warehouse.

You can build this pipeline yourself in StreamSets using the following configurations.

Stage 1: MOCK_DATA: Snowflake Table Origin
Will issue a ‘SELECT *’ from the table name you enter

Stage 2: EmailValidator: Apply Function Processor
Configuration:
Field Expression: The field name to pass into the UDF
Enter: ‘email’

Function Type:
Select: User-Defined Function

UDF Name:
Enter: EmailValidator

Add Prefix to New Output Fields:
Select this box
We are using this because we want to preserve the column of email so we can still store that in our Destination

Prefix:
Enter: ‘udf_’

Add Suffix to New Output Fields:
Leave blank

Stage 3: Only Valid Emails: Filter Processor
Filter Condition: udf_email = true

Stage 4: VALID_EMAIL_RESULTS: Snowflake Table Destination
Configuration:
Specify Database: Unchecked
Specify Schema: Unchecked
Table: VALID_EMAIL_RESULTS
Write Mode: Append To Table
Create Table: Checked
Data Drift Enabled: Checked

When you run this job only valid emails will make it into the target table. The best part? It doesn’t matter if you have ten thousand or ten billion records because your Virtual Warehouse size will determine the speed at which this job is processed.

Transformer for Snowflake Meets Cloud-First Expectations

I am so proud to see all the hard work from our Engineering Team come to fruition. Transformer for Snowflake is what you expect of a cloud-first product. It requires zero downloads, zero installation, and zero configuration. That’s right… for a low, low, low effort of ZERO you can sign up and start building your Snowflake transformation and executing against your Snowflake Data Warehouse.

The post Transformer for Snowflake: Snowflake Transformation that Meets Cloud-First Expectations appeared first on StreamSets.

Transformer for Snowflake: Snowflake Transformation that Meets Cloud-First Expectations

Snowpark: Two Unique Functionalities for Snowflake Transformation

Private Preview: Transformer for Snowflake

Email Validator: Java UDF

EmailValidator.java

Snowflake Java UDF

Positive Test

Negative Test

Snowflake Transformation: Example Pipeline to Validate Emails

Transformer for Snowflake Meets Cloud-First Expectations

Unveiling the Evaluation Pipeline: GenAI App Builder’s New Tool for Ensuring Excellence in Generative AI Applications

Boost efficiency and accuracy of loading CSV data with Workato

Unlocking GenAI’s Potential: Overcoming Legacy Tech and Data Challenges According to IT Leaders

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

How to leverage the full potential of enterprise AI

Anthos makes multi-cloud easier with new API, support for Azure

Schneider Electric leverages Retrieval Augmented LLMs on SageMaker to ensure real-time updates in their ERP systems

POPULAR CATEGORY