Artificial Intelligence and Machine Learning

Build multi-class classification models with Amazon Redshift ML

By mullaned2002

June 10, 2021

570

Amazon Redshift ML simplifies the use of machine learning (ML) by using simple SQL statements to create and train ML models from data in Amazon Redshift. You can use Amazon Redshift ML to solve binary classification, multi-class classification, and regression problems and can use either AutoML or XGBoost directly.

This post is part of a series that describes the use of Amazon Redshift ML. For more information about building regression using Amazon Redshift ML, see Build regression models with Amazon Redshift ML.

You can use Amazon Redshift ML to automate data preparation, pre-processing, and selection of problem type as depicted in this blog post. We assume that you have a good understanding of your data and what problem type is most applicable for your use case. This post specifically focuses on creating models in Amazon Redshift using the multi-class classification problem type, which consists on classifying instances into one of three or more classes. For example, you can predict whether a transaction is fraudulent, failed or successful, whether a customer will remain active for 3 months, six months, nine months, 12 months, or whether a news is tagged as sports, world news, business.

Prerequisites

As a prerequisite for implementing this solution, you need to set up an Amazon Redshift cluster with ML enabled on it. For the preliminary steps to get started, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

Use case

For our use case, we want to target our most active customers for a special customer loyalty program. We use Amazon Redshift ML and multi-class classification to predict how many months a customer will be active over a 13-month period. This translates into up to 13 possible classes, which makes this a better fit for multi-class classification. Customers with predicted activity of 7 months or greater are targeted for a special customer loyalty program.

Input raw data

To prepare the raw data for this model, we populated the table ecommerce_sales in Amazon Redshift using the public data set E-Commerce Sales Forecast, which includes sales data of an online UK retailer.

Enter the following statements to load the data to Amazon Redshift:

CREATE TABLE IF NOT EXISTS ecommerce_sales
(
invoiceno VARCHAR(30)
,stockcode VARCHAR(30)
,description VARCHAR(60)
,quantity DOUBLE PRECISION
,invoicedate VARCHAR(30)
,unitprice DOUBLE PRECISION
,customerid BIGINT
,country VARCHAR(25)
)
;

Copy ecommerce_sales
From ‘s3://redshift-ml-multiclass/ecommerce_data.txt’
iam_role ‘<<your-amazon-redshift-sagemaker-iam-role-arn>>‘ delimiter ‘t’ IGNOREHEADER 1 region ‘us-east-1’ maxerror 100;

To reproduce this script in your environment, replace <<your-amazon-redshift-sagemaker-iam-role-arn>> with the AWS Identity and Access Management (IAM) ARN for your Amazon Redshift cluster.

Data preparation for the ML model

Now that our data set is loaded, we can optionally split the data into three sets for training (80%), validation (10%), and prediction (10%). Note that Amazon Redshift ML Autopilot will automatically split the data into training and validation, but by splitting it here, you will be able to verify the accuracy of your model. Additionally, we calculate the number of months a customer has been active, as it will be the value we want our model to predict on new data. We use the random function in our SQL statements to split the data. See the following code:

create table ecommerce_sales_data as (
select
t1.stockcode,
t1.description,
t1.invoicedate,
t1.customerid,
t1.country,
t1.sales_amt,
cast(random() * 100 as int) as data_group_id
from
(
select
stockcode,
description,
invoicedate,
customerid,
country,
sum(quantity * unitprice) as sales_amt
from
ecommerce_sales
group by
1,
2,
3,
4,
5
) t1
);

Training Set

create table ecommerce_sales_training as (
select
a.customerid,
a.country,
a.stockcode,
a.description,
a.invoicedate,
a.sales_amt,
(b.nbr_months_active) as nbr_months_active
from
ecommerce_sales_data a
inner join (
select
customerid,
count(
distinct(
DATE_PART(y, cast(invoicedate as date)) || ‘-‘ || LPAD(
DATE_PART(mon, cast(invoicedate as date)),
2,
’00’
)
)
) as nbr_months_active
from
ecommerce_sales_data
group by
1
) b on a.customerid = b.customerid
where
a.data_group_id < 80
);

Validation Set

create table ecommerce_sales_validation as (
select
a.customerid,
a.country,
a.stockcode,
a.description,
a.invoicedate,
a.sales_amt,
(b.nbr_months_active) as nbr_months_active
from
ecommerce_sales_data a
inner join (
select
customerid,
count(
distinct(
DATE_PART(y, cast(invoicedate as date)) || ‘-‘ || LPAD(
DATE_PART(mon, cast(invoicedate as date)),
2,
’00’
)
)
) as nbr_months_active
from
ecommerce_sales_data
group by
1
) b on a.customerid = b.customerid
where
a.data_group_id between 80
and 90
);

Prediction Set

create table ecommerce_sales_prediction as (
select
customerid,
country,
stockcode,
description,
invoicedate,
sales_amt
from
ecommerce_sales_data
where
data_group_id > 90);

Create the model in Amazon Redshift

Now that we created our training and validation data sets, we can use the create model statement in Amazon Redshift to create our ML model using Multiclass_Classification. We specify the problem type but we let AutoML take care of everything else. In this model, the target we want to predict is nbr_months_active. Amazon SageMaker creates the function predict_customer_activity, which we use to do inference in Amazon Redshift. See the following code:

create model ecommerce_customer_activity
from
(
select
customerid,
country,
stockcode,
description,
invoicedate,
sales_amt,
nbr_months_active
from ecommerce_sales_training)
TARGET nbr_months_active FUNCTION predict_customer_activity
IAM_ROLE ‘<<your-amazon-redshift-sagemaker-iam-role-arn>>’
problem_type MULTICLASS_CLASSIFICATION
SETTINGS (
S3_BUCKET ‘<<your-amazon-s3-bucket-name>>’,
S3_GARBAGE_COLLECT OFF
);

To reproduce this script in your environment, replace <<your-amazon-redshift-sagemaker-iam-role-arn>> with your cluster’s IAM role ARN.

Validate predictions

In this step, we evaluate the accuracy of our ML model against our validation data.

While creating the model, Amazon SageMaker Autopilot automatically splits the input data into train and validation sets, and selects the model with the best objective metric, which is deployed in the Amazon Redshift cluster. You can use the show model statement in your cluster to view various metrics, including the accuracy score. If you don’t specify explicitly, SageMaker automatically uses accuracy for the objective type. See the following code:

Show model ecommerce_customer_activity;

As shown in following output, our model has an accuracy score of 0.996580.

Key
Value
Model Name
ecommerce_customer_activity
Schema Name
Public
Owner
Demo
Creation Time
Mon, 22.03.2021 19:48:30
Model State
READY
Training Job Status
MaxAutoMLJobRuntimeReached
validation:accuracy
0.996580
Estimated Cost
4.260437

TRAINING DATA:

Query
SELECT CUSTOMERID, COUNTRY, STOCKCODE, DESCRIPTION, INVOICEDATE, SALES_AMT, NBR_MONTHS_ACTIVE

FROM ECOMMERCE_SALES_CLN_TRAINING
Target Column
NBR_MONTHS_ACTIVE

PARAMETERS:

Model Type
Auto
Problem Type
MulticlassClassification
Problem Type
Accuracy
AutoML Job Name
2021-03-07-19-48-30-960148-auto
Function Name
predict_customer_activity
Function Parameters
customerid country stockcode description invoicedate sales_amt
Function Parameter Types
int8 varchar varchar varchar varchar float8

Let’s run inference queries against our validation data using the following SQL code against the validation data:

select
cast(sum(t1.match)as decimal(7,2)) as predicted_matches
,cast(sum(t1.nonmatch) as decimal(7,2)) as predicted_non_matches
,cast(sum(t1.match + t1.nonmatch) as decimal(7,2)) as total_predictions
,predicted_matches / total_predictions as pct_accuracy
from
(select
customerid,
country,
stockcode,
description,
invoicedate,
sales_amt,
nbr_months_active,
predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active,
case when nbr_months_active = predicted_months_active then 1
else 0 end as match,
case when nbr_months_active <> predicted_months_active then 1
else 0 end as nonmatch
from ecommerce_sales_validation
)t1;

We can see that we predicted correctly on 99.74% on our data set, which matches our accuracy score from the show model.

predicted_matches
predicted_non_matches
total_predictions
pct_accuracy
43489.00
132.00
43621.00
0.99697393

Now let’s run a query to see which customers qualify for our customer loyalty program by being active for at least 7 months:

select
customerid,
predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active
from ecommerce_sales_prediction
where predicted_months_active >=7
group by 1,2
limit 10;

The following table shows our output.

customerid
predicted_months_active
16895
8
18283
10
15044
11
16746
11
14702
11
16607
10
12901
10
15078
13
16393
10
15005
12

Troubleshooting

Although the Create Model statement in Amazon Redshift automatically takes care of initiating the SageMaker Autopilot process to build, train, and tune the best ML model and deploy that model in Amazon Redshift, you can view the intermediate steps performed in this process, which may also help you with troubleshooting if something goes wrong. You can also retrieve the AutoML Job Name from the output of the show model command.

While creating the model, you need to mention an Amazon Simple Storage Service (Amazon S3) bucket name as the value for parameter, s3_bucket. You use this bucket to share training data and artifacts between Amazon Redshift and SageMaker. Amazon Redshift creates a subfolder in this bucket prior to unload of the training data. When training is complete, it deletes the subfolder and its contents unless you set the parameter s3_garbage_collect to off, which you can use for troubleshooting purposes. For more information, see CREATE MODEL.

For information about using the SageMaker console and Amazon SageMaker Studio, see Build regression models with Amazon Redshift ML.

Conclusion

Amazon Redshift ML provides the right platform for database users to create, train, and tune models using a SQL interface. In this post, we walked you through how to create a multi-class classification model. We hope you can take advantage of Amazon Redshift ML to help gain valuable insights.

For more information about building different models with Amazon Redshift ML, see Build regression models with Amazon Redshift ML and read the Amazon Redshift ML documentation.

Acknowledgments

Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend ‘@’ lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

About the Authors

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world.

Nikos Koulouris is a Software Development Engineer at AWS. He received his PhD from University of California, San Diego and he has been working in the areas of databases and analytics.

Enrico Sartorello is a Sr. Software Development Engineer at Amazon Web Services. He helps customers adopt machine learning solutions that fit their needs by developing new functionalities for Amazon SageMaker. In his spare time, he passionately follows his soccer team and likes to improve his cooking skills.

Build multi-class classification models with Amazon Redshift ML

Prerequisites

Use case

Input raw data

Data preparation for the ML model

Create the model in Amazon Redshift

Validate predictions

Troubleshooting

Conclusion

Acknowledgments

About the Authors

Databricks DBRX is now available in Amazon SageMaker JumpStart

Knowledge Bases in Amazon Bedrock now simplifies asking questions on a single document

Inpainting and Outpainting with Stable Diffusion

LEAVE A REPLY Cancel reply

Most Popular

The overwhelmed person’s guide to Google Cloud: week of April 18

Databricks DBRX is now available in Amazon SageMaker JumpStart

Introducing new ML model monitoring capabilities in BigQuery

Knowledge Bases in Amazon Bedrock now simplifies asking questions on a single document

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

Gruppo Maggioli reimagines public administration with Google Cloud

Manage data exfiltration risks in Cloud Run with VPC Service Controls

A Gentle Introduction To Sigmoid Function

POPULAR CATEGORY