Data creates new value for businesses through insights and predictive models. However, although data is plentiful, available data scientists are few and far between. Despite attempts in recent years to train data scientists in academia and elsewhere, we still see a large shortage that is likely to continue into the near future.
To accelerate model building, data scientists and ML practitioners often use AutoML (automated machine learning) tools to augment their work. These tools take over the tedious, iterative work of data preparation, model training, and tuning, which helps data scientists be more productive when developing ML models.
In this post, we discuss how data scientists and other advanced analytics users can use Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot to analyze their data sets and build highly predictive ML models. To demonstrate these capabilities, we use the Pima Indian Diabetes public data set from UCI.
Solution overview
The Pima Indian Diabetes data set contains information about 768 women from a population near Phoenix, Arizona. The outcome tested was diabetes. The data set contains 268 positive and 500 negative observations, with one target and eight attributes: pregnancies, glucose, blood pressure, skin thickness, insulin, BMI (body mass index), age, and diabetes pedigree function. We use this data set to demonstrate how to use Autopilot and Data Wrangler to build highly predictive ML models without having to write any code.
The high-level steps for building an ML model are as follows:
Perform exploratory data analysis.
Perform feature engineering.
Train the model.
Validate the model.
Deploy the model.
Make predictions.
We walk through these steps as we build a binary classification model using the Pima Indian Diabetes data set.
Import your data set with Data Wrangler
Data Wrangler is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding.
On the Studio console, under File, choose New.
Choose Flow.
If this is your first time opening Data Wrangler, you may have to wait a few minutes for it to be ready.
Rename your flow as needed.
For Import data, choose your data source.
Upload the pima-indian-diabates.csv file from Amazon S3.
You can now preview your data set.
In the Details pane, deselect Enable sampling (this is a small data set, so we don’t need it).
Choose Import dataset.
You now have a flow diagram.
Choose the + icon next to Data types and choose Edit data types.
Make sure that Data Wrangler automatically inferred the correct data types for your data columns.
If not, you can easily modify them through the UI. If multiple data sources are present, you can join or concatenate them.
We can now create an analysis and add transformations.
Exploratory data analysis and feature engineering
Exploratory data analysis is an important step when building ML models. In this step, data scientists analyze data to listen to its story. If you have the patience to listen, data is a great storyteller. This step involves statistical analysis, summarization tables, histograms, scatter plots, outlier analysis, finding missing values, and more. We demonstrate some of these in this post.
Choose the + icon next to Data types and choose Add analysis.
On the Configure tab, for Analysis type, choose Table Summary.
For Analysis name, enter a name (optional).
Choose Preview to see a preview of the table.
The count summary shows that all columns have 768 entries. But on closer examination, we find that the minimum value is 0 for columns such as Glucose and BloodPressure. Missing values are stored as 0 in this data set. Let’s fix that.
Choose Create and save this table.
On the flow’s main page, choose the + icon next to Data types and choose Add transform.
Under Search and edit, for Transform, choose Convert regex to missing.
For Input column, choose Glucose.
For Pattern, enter 0.
Choose Preview.
The 0 entries under Glucose are now missing entries.
Choose Add to save this step.
Repeat these steps for the other columns with incorrect 0 entries: BloodPressure, SkinThickness, Insulin, and BMI.
Data Wrangler gives you a couple of options to fix missing values.
Choose the + icon next to Data types and choose Add transform.
Replace missing values with the median values for all five columns (Glucose, BloodPressure, SkinThickness, Insulin, and BMI).
This completes one iteration of analysis and transformation.
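For readers who want to see what these two transforms do to the data, the following is a minimal sketch in plain Python. It is not what Data Wrangler runs internally; the function names and the sample rows are made up for illustration, and only the five column names come from the steps above.

```python
from statistics import median

# Columns in which a 0 is treated as a missing value (from the steps above).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def zeros_to_missing(rows, columns=ZERO_AS_MISSING):
    """Return copies of the rows with 0 replaced by None in the given columns."""
    return [
        {k: (None if k in columns and v == 0 else v) for k, v in row.items()}
        for row in rows
    ]


def impute_median(rows, columns=ZERO_AS_MISSING):
    """Fill None entries with the median of the observed values in each column.

    Assumes every listed column has at least one observed (non-None) value.
    """
    medians = {
        col: median(row[col] for row in rows if row[col] is not None)
        for col in columns
    }
    return [
        {k: (medians[k] if k in columns and v is None else v) for k, v in row.items()}
        for row in rows
    ]


# Made-up rows for illustration only; these are not records from the data set.
sample = [
    {"Pregnancies": 0, "Glucose": 100, "BloodPressure": 70,
     "SkinThickness": 30, "Insulin": 0, "BMI": 30.0},
    {"Pregnancies": 2, "Glucose": 0, "BloodPressure": 60,
     "SkinThickness": 20, "Insulin": 90, "BMI": 25.0},
    {"Pregnancies": 1, "Glucose": 140, "BloodPressure": 0,
     "SkinThickness": 0, "Insulin": 150, "BMI": 0},
]

cleaned = impute_median(zeros_to_missing(sample))
```

Note that a 0 in Pregnancies is a legitimate value, which is why that column is left out of the list.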
Data Wrangler gives you an option to build a quick model to see how predictive your features are.
Choose the + icon next to Data types and choose Add analysis.
On the Configure tab, for Analysis type, choose Quick Model.
For Analysis name, enter a name.
For Label, choose Class.
The following chart shows the F1 score and the importance of the predictive features.
The F1 score is a commonly used metric in classification problems; it is the harmonic mean of precision and recall. If we build a model with this data at this stage, we get an approximate F1 score of 0.735 (1 being the best F1 score) and find that Glucose is the most important explanatory feature.
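To make the definition concrete, here is a small sketch that computes F1 from confusion-matrix counts. The counts in the example are invented for illustration and are unrelated to the 0.735 score reported by the quick model.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0  # precision or recall is undefined; treat F1 as 0
    precision = true_positives / predicted_positive
    recall = true_positives / actual_positive
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts: 60 true positives, 20 false positives, 30 false negatives.
example_f1 = f1_score(60, 20, 30)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1 score by maximizing precision alone or recall alone.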
Another valuable feature of Data Wrangler is checking for target leakage. Target leakage is a phenomenon in which the target that you’re trying to predict has leaked into one or more of your features, and this feature isn’t available at prediction time.
Choose the + icon next to Data types and choose Add analysis.
For Analysis type, choose Target leakage.
For Problem type, choose classification.
For Target, choose Class.
Choose Create.
We don’t have a target leakage situation in this data set, but if we did, we would need to remove the leaking column so that the model doesn’t appear falsely perfect during training.
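One common way to screen for leakage is to measure how well each feature predicts the target on its own. The sketch below computes a single-feature ROC AUC in plain Python; it illustrates the general idea only and is an assumption about how such a check can be done, not a description of Data Wrangler's own computation. The function name and the toy values are hypothetical.

```python
def single_feature_auc(values, labels):
    """ROC AUC obtained by using one feature's raw value as the score.

    Counts, over all (positive, negative) pairs, how often the positive
    example has the larger value; ties count as half.
    """
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: the first feature is a copy of the label, the second is unrelated.
labels = [0, 0, 1, 1]
leaky_feature = [0, 0, 1, 1]
unrelated_feature = [1, 2, 1, 2]

leaky_auc = single_feature_auc(leaky_feature, labels)
unrelated_auc = single_feature_auc(unrelated_feature, labels)
```

A value very close to 1 (or to 0) for a single feature is a reason to check whether that feature would really be available at prediction time; a value near 0.5 means the feature carries little signal on its own.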
Next, we draw some scatter plots for Glucose vs. BloodPressure.
Women with Glucose below 100 and BloodPressure below 80 appear less likely to have diabetes. Let’s create a new feature using that information.
We use the Custom formula feature in the transformation options.
This custom formula creates a new column in the data set.
Next, let’s check if Pregnancies/Age could have some effect on the target.
Create a new column using the Custom formula feature.
Next, we draw a histogram to see its effect.
As we can see, this new feature could have an influence on our target.
A quick model after adding these two features shows an improvement in our model’s F1 score.
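The post does not show the exact formulas entered in Data Wrangler, so the following plain-Python sketch only mirrors the logic described above. The new column names LowGlucoseAndBP and PregnanciesPerAge are hypothetical, as is the function name.

```python
def add_engineered_features(row):
    """Return a copy of the row with the two derived features added."""
    enriched = dict(row)
    # 1 when both readings fall under the thresholds seen in the scatter plot.
    enriched["LowGlucoseAndBP"] = int(
        row["Glucose"] < 100 and row["BloodPressure"] < 80
    )
    # Ratio of pregnancies to age; assumes Age is never 0.
    enriched["PregnanciesPerAge"] = row["Pregnancies"] / row["Age"]
    return enriched


# Made-up rows for illustration only.
low_risk_example = add_engineered_features(
    {"Pregnancies": 2, "Glucose": 90, "BloodPressure": 70, "Age": 40}
)
other_example = add_engineered_features(
    {"Pregnancies": 5, "Glucose": 150, "BloodPressure": 70, "Age": 50}
)
```

Any feature created this way must also be computed for new records at prediction time, using the same thresholds and column order.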
Other features are available that also don’t require any coding, such as finding outliers and scaling features, but we don’t need them for this data set.
The last step is to export data in this new format.
Choose Data Wrangler job to create a Python notebook.
Under Run, choose Run all cells to run the notebook.
The notebook creates output for this flow as a CSV file in Amazon S3. You can see the S3 path for the output file in the notebook. Depending on your input data file, Data Wrangler might split the output into multiple files. If so, you need to combine them into a single CSV file with a single header, which you then feed into Autopilot.
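If the output is split, the parts can be merged with a few lines of Python. This sketch assumes that the part files have already been downloaded locally and that each part starts with the same header row; the function name and file names are hypothetical.

```python
import csv


def combine_csv_parts(part_paths, output_path):
    """Concatenate CSV parts into one file with a single header row.

    Returns the number of data rows written. Raises ValueError if a part's
    header differs from the header of the first part.
    """
    header = None
    rows_written = 0
    with open(output_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for part_path in part_paths:
            with open(part_path, newline="") as in_file:
                reader = csv.reader(in_file)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # empty part file, nothing to copy
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {part_path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Example usage (hypothetical file names):
# combine_csv_parts(["part-00000.csv", "part-00001.csv"], "combined.csv")
```

The combined file can then be uploaded back to Amazon S3 and used as the Autopilot input.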
Build and deploy your model with SageMaker Autopilot
Autopilot allows you to automatically build ML models. It explores your data, selects the algorithms relevant to your problem type, and prepares the data to facilitate model training and tuning. It ranks all of the optimized models it tests by their performance and identifies the best-performing model, which you can deploy in a fraction of the time normally required.
We can either run Autopilot directly on the raw data or feed it with the enhanced data set that we generated with Data Wrangler.
On the Studio console, under File, choose New.
Choose Experiment.
For Experiment name, enter a name.
For Connect your data, enter the S3 bucket of your uploaded input data.
For Target, type Class.
For Output data location, specify the location of the S3 bucket where you want the results saved.
For Select the machine learning problem type, choose Binary classification.
If you’re not sure what problem type to use, you can leave it as Auto and Autopilot will figure it out.
For Objective metric, choose F1.
Choose Create Experiment.
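The same experiment can also be started programmatically. The sketch below builds a request for the SageMaker CreateAutoMLJob API that mirrors the console choices above (target Class, binary classification, F1 objective). The helper name, job name, S3 URIs, and role ARN are placeholders, and the actual boto3 call is left as a comment because it requires AWS credentials.

```python
def build_autopilot_request(job_name, input_s3_uri, output_s3_uri, role_arn,
                            target="Class"):
    """Build keyword arguments for the SageMaker create_auto_ml_job call."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [
            {
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": input_s3_uri,
                    }
                },
                "TargetAttributeName": target,
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ProblemType": "BinaryClassification",
        "AutoMLJobObjective": {"MetricName": "F1"},
        "RoleArn": role_arn,
    }


# All values below are placeholders, not real resources.
request = build_autopilot_request(
    job_name="pima-diabetes-autopilot",
    input_s3_uri="s3://example-bucket/input/",
    output_s3_uri="s3://example-bucket/output/",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)

# With credentials configured, the job could be submitted like this:
# import boto3
# boto3.client("sagemaker").create_auto_ml_job(**request)
```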
Autopilot analyzes the input data, processes it, selects the right ML algorithm, and runs several trials to tune the model for best performance. It then ranks these trials and presents you with the best model.
Choose a model from the list and deploy it to an endpoint.
Specify the endpoint name, instance type, and instance count.
You can also have the endpoint return predicted labels and their probabilities.
You can see the creation of this endpoint on the SageMaker console.
Choose the endpoint to see more details on it.
You see an endpoint URL that you can use to make predictions in real time.
Under Production variants, make a note of the model name.
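Once the endpoint is in service, it can be called from code as well as from the console. The following sketch assumes the endpoint accepts one CSV row of feature values (without the target column) in the same column order as the training data; the helper names and the sample values are hypothetical. boto3 is imported inside the function so that the payload helper can be used without AWS access.

```python
def to_csv_payload(feature_values):
    """Format one record as a single CSV line for a text/csv endpoint."""
    return ",".join(str(value) for value in feature_values)


def predict(endpoint_name, feature_values):
    """Send one record to a SageMaker endpoint and return the raw response."""
    import boto3  # imported lazily; requires AWS credentials at call time

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_payload(feature_values),
    )
    return response["Body"].read().decode("utf-8")


# Made-up feature values for illustration only.
payload = to_csv_payload([1, 110, 70, 25, 100, 28.5, 0.35, 33])

# Example call (placeholder endpoint name):
# print(predict("example-autopilot-endpoint", [1, 110, 70, 25, 100, 28.5, 0.35, 33]))
```

If the model was trained on the enhanced data set, the payload must include the engineered columns as well.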
On the SageMaker console, under Inference in the navigation pane, choose Batch transform job.
Choose Create batch transform job.
For Model name, enter the model name you saved earlier.
For Instance type, choose an instance type.
For Content type, enter text/csv.
For S3 location, enter the path to your input bucket.
For S3 output path, enter the path to your output bucket.
When the batch transformation job is complete, you can see your inference job’s output in the S3 bucket.
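The console steps for the batch transform job map onto the SageMaker CreateTransformJob API. The sketch below only assembles the request; the helper name, job name, model name, S3 paths, and default instance type are placeholders, and the boto3 call is shown as a comment because it needs AWS credentials.

```python
def build_batch_transform_request(job_name, model_name, input_s3_uri,
                                  output_s3_uri, instance_type="ml.m5.large",
                                  instance_count=1):
    """Build keyword arguments for the SageMaker create_transform_job call."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": input_s3_uri,
                }
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # treat each line of the CSV as one record
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


# All values below are placeholders, not real resources.
transform_request = build_batch_transform_request(
    job_name="pima-diabetes-batch",
    model_name="example-autopilot-model",
    input_s3_uri="s3://example-bucket/batch-input/",
    output_s3_uri="s3://example-bucket/batch-output/",
)

# With credentials configured, the job could be submitted like this:
# import boto3
# boto3.client("sagemaker").create_transform_job(**transform_request)
```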
Conclusion
In this post, you learned an easy way to conduct exploratory data analysis, develop and deploy an ML model, and run a batch transform job to make predictions. Anyone who has access to data and wants to quickly build powerful machine learning models can use this technique to increase their productivity. Learn more about Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot by visiting their product pages.
About the Author
Raju Penmatcha is a Senior AI/ML Specialist Solutions Architect at AWS. He works with education, government, and nonprofit customers on machine learning and artificial intelligence related projects, helping them build solutions using AWS. When not helping customers, he likes traveling to new places.