Data is the foundation for capturing the maximum value from AI technology and solving business problems quickly. To unlock the potential of generative AI technologies, however, there’s a key prerequisite: your data needs to be appropriately prepared. In this post, we describe how to use generative AI to update and scale your data pipeline using Amazon SageMaker Canvas for data prep.
Typically, data pipeline work requires specialized skills to prepare and organize data before security analysts can extract value from it, which takes time, increases risk, and delays time to value. With SageMaker Canvas, security analysts can securely access leading foundation models to prepare their data faster and remediate cybersecurity risks.
Data prep involves careful formatting and thoughtful contextualization, working backward from the customer problem. Now with the SageMaker Canvas chat for data prep capability, analysts with domain knowledge can quickly prepare, organize, and extract value from data using a chat-based experience.
Solution overview
Generative AI is revolutionizing the security domain by providing personalized and natural language experiences, enhancing risk identification and remediations, while boosting business productivity. For this use case, we use SageMaker Canvas, Amazon SageMaker Data Wrangler, Amazon Security Lake, and Amazon Simple Storage Service (Amazon S3). Amazon Security Lake allows you to aggregate and normalize security data for analysis to gain a better understanding of security across your organization. Amazon S3 enables you to store and retrieve any amount of data at any time or place. It offers industry-leading scalability, data availability, security, and performance.
SageMaker Canvas now supports comprehensive data preparation capabilities powered by SageMaker Data Wrangler. With this integration, SageMaker Canvas provides an end-to-end no-code workspace to prepare data, build, and use machine learning (ML) and Amazon Bedrock foundation models to accelerate the time from data to business insights. You can now discover and aggregate data from over 50 data sources and explore and prepare data using over 300 built-in analyses and transformations in the SageMaker Canvas visual interface. You’ll also see faster performance for transforms and analyses, and benefit from a natural language interface to explore and transform data for ML.
In this post, we demonstrate three key transformations on the security findings dataset: filtering, column renaming, and text extraction from a column. We also demonstrate using the chat for data prep feature in SageMaker Canvas to analyze the data and visualize your findings.
Prerequisites
Before starting, you need an AWS account. You also need to set up an Amazon SageMaker Studio domain. For instructions on setting up SageMaker Canvas, refer to Generate machine learning predictions without code.
Access the SageMaker Canvas chat interface
Complete the following steps to start using the SageMaker Canvas chat feature:
On the SageMaker Canvas console, choose Data Wrangler.
Under Datasets, choose Amazon S3 as your source and specify the security findings dataset from Amazon Security Lake.
Choose your data flow and choose Chat for data prep, which will display a chat interface experience with guided prompts.
Filter data
For this post, we first want to filter for critical and high severity warnings, so we enter instructions in the chat box to remove findings that are not critical or high severity. SageMaker Canvas removes the rows, displays a preview of the transformed data, and provides the option to use the generated code. We can then add the transformation to the list of steps in the Steps pane.
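The filter step can be pictured with a minimal plain-Python sketch. This is not the code SageMaker Canvas generates; it only illustrates the logic. The `severity` column name, its labels, and the sample rows are assumptions made for illustration.

```python
# Hypothetical sample rows; the "severity" column name and its labels are
# assumptions, not the actual Amazon Security Lake schema.
findings = [
    {"title": "Root account used", "severity": "Critical"},
    {"title": "Security group allows 0.0.0.0/0", "severity": "High"},
    {"title": "Bucket versioning disabled", "severity": "Low"},
    {"title": "Unused access key", "severity": "Medium"},
]

SEVERITIES_TO_KEEP = {"critical", "high"}


def keep_critical_and_high(rows):
    """Return only the rows whose severity is critical or high (case-insensitive)."""
    return [
        row
        for row in rows
        if str(row.get("severity", "")).strip().lower() in SEVERITIES_TO_KEEP
    ]


filtered = keep_critical_and_high(findings)
print([row["severity"] for row in filtered])
```

Normalizing case and whitespace before comparing keeps the filter from silently dropping rows labeled `HIGH` or `critical `.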
Rename columns
Next, we want to rename two columns, so we enter a prompt in the chat box to rename the desc and title columns to Finding and Remediation. SageMaker Canvas generates a preview, and if you’re happy with the results, you can add the transformed data to the data flow steps.
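As a rough illustration of the rename step, the following plain-Python sketch applies a column mapping to each row. It assumes the columns map in the order the prompt lists them (desc to Finding, title to Remediation); the sample row is hypothetical, and the code SageMaker Canvas generates will differ.

```python
# Assumed mapping, following the order given in the prompt.
COLUMN_RENAMES = {"desc": "Finding", "title": "Remediation"}


def rename_columns(rows, renames):
    """Return new rows with keys renamed; columns not in `renames` are kept as-is."""
    return [
        {renames.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


# Hypothetical sample row for illustration only.
rows = [{"desc": "S3 bucket is public", "title": "Block public access", "severity": "High"}]
renamed = rename_columns(rows, COLUMN_RENAMES)
print(list(renamed[0].keys()))
```

The function builds new dictionaries rather than editing the rows in place, which mirrors how a data flow step produces a transformed copy while leaving the source data unchanged.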
Extract text
To determine the source Regions of the findings, you can enter instructions in the chat to extract the Region text from the UID column based on the pattern arn:aws:security:securityhub:region:* and create a new column called Region. SageMaker Canvas then generates code to create the new Region column. The data preview shows the findings originate from one Region: us-west-2. You can add this transformation to the data flow for downstream analysis.
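The extraction can be sketched with a regular expression. This sketch assumes the Region appears in the UID as a colon-delimited field shaped like `us-west-2`, rather than relying on a fixed field position. The sample UIDs and account number are made up, and the code SageMaker Canvas generates may take a different approach.

```python
import re

# Matches a colon-delimited field shaped like an AWS Region code,
# for example "us-west-2" or "us-gov-west-1".
REGION_FIELD = re.compile(r":([a-z]{2}(?:-[a-z]+)+-\d+):")


def add_region_column(rows):
    """Return new rows with a "Region" column extracted from "UID" (None if absent)."""
    result = []
    for row in rows:
        match = REGION_FIELD.search(str(row.get("UID") or ""))
        result.append({**row, "Region": match.group(1) if match else None})
    return result


# Hypothetical UIDs for illustration only.
sample = [
    {"UID": "arn:aws:securityhub:us-west-2:111122223333:finding/example-1"},
    {"UID": "not-an-arn"},
]
with_region = add_region_column(sample)
print([row["Region"] for row in with_region])
```

Rows whose UID does not contain a Region-shaped field get `None` instead of raising an error, so malformed identifiers are easy to spot in the preview.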
Analyze the data
Finally, we want to analyze the data to determine if there is a correlation between time of day and number of critical findings. You can enter a request to summarize critical findings by time of day into the chat, and SageMaker Canvas returns insights that are useful for your investigation and analysis.
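A summary of critical findings by time of day amounts to counting rows per hour. The sketch below assumes a `time` column holding ISO 8601 timestamps and a `severity` column. Both names and all sample values are hypothetical, and the insights SageMaker Canvas returns are produced by its own generated code.

```python
from collections import Counter
from datetime import datetime


def critical_findings_by_hour(rows):
    """Count critical findings per hour of day (0-23), sorted by hour."""
    hours = Counter()
    for row in rows:
        if str(row["severity"]).lower() == "critical":
            hours[datetime.fromisoformat(row["time"]).hour] += 1
    return dict(sorted(hours.items()))


# Hypothetical sample rows for illustration only.
sample = [
    {"time": "2023-11-02T03:15:00", "severity": "Critical"},
    {"time": "2023-11-02T03:40:00", "severity": "Critical"},
    {"time": "2023-11-02T03:50:00", "severity": "High"},
    {"time": "2023-11-02T14:05:00", "severity": "Critical"},
]
by_hour = critical_findings_by_hour(sample)
print(by_hour)
```

A cluster of counts in a few hours, as in this made-up sample, is the kind of pattern that would suggest a time-of-day correlation worth investigating.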
Visualize findings
Next, we visualize the findings by severity over time to include in a leadership report. You can ask SageMaker Canvas to generate a bar chart of severity compared to time of day. In seconds, SageMaker Canvas creates the chart grouped by severity. You can add this visualization to the analysis in the data flow and download it for your report. The data shows the findings originate from one Region and happen at specific times. This gives us confidence about where to focus our security findings investigation to determine root causes and corrective actions.
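The data behind a bar chart grouped by severity is a count per hour and severity pair. The following sketch computes those counts and renders them as text bars to stay dependency-free; it stands in for the chart SageMaker Canvas draws. The `time` and `severity` column names and the sample rows are assumptions.

```python
from collections import Counter
from datetime import datetime


def severity_counts_by_hour(rows):
    """Count findings for each (hour of day, severity) pair."""
    counts = Counter()
    for row in rows:
        hour = datetime.fromisoformat(row["time"]).hour
        counts[(hour, row["severity"])] += 1
    return counts


def render_text_bars(counts):
    """Render one text bar per (hour, severity) pair, ordered by hour then severity."""
    lines = []
    for (hour, severity), n in sorted(counts.items()):
        lines.append(f"{hour:02d}:00 {severity:<8} {'#' * n} ({n})")
    return "\n".join(lines)


# Hypothetical sample rows for illustration only.
sample = [
    {"time": "2023-11-02T03:15:00", "severity": "Critical"},
    {"time": "2023-11-02T03:40:00", "severity": "Critical"},
    {"time": "2023-11-02T03:50:00", "severity": "High"},
    {"time": "2023-11-02T14:05:00", "severity": "High"},
]
chart = render_text_bars(severity_counts_by_hour(sample))
print(chart)
```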
Clean up
To avoid incurring unintended charges, complete the following steps to clean up your resources:
Empty the S3 bucket you used as a source.
Log out of SageMaker Canvas.
Conclusion
In this post, we showed you how to use SageMaker Canvas as an end-to-end no-code workspace to prepare data and to build and use Amazon Bedrock foundation models, accelerating the time from data to business insights.
Note that this approach is not limited to security findings; you can apply this to any generative AI use case that uses data preparation at its core.
The future belongs to businesses that can effectively harness the power of generative AI and large language models. But to do so, we must first develop a solid data strategy and understand the art of data preparation. By using generative AI to structure our data intelligently, and working backward from the customer, we can solve business problems faster. With SageMaker Canvas chat for data preparation, it’s effortless for analysts to get started and capture immediate value from AI.
About the Authors
Sudeesh Sasidharan is a Senior Solutions Architect at AWS, within the Energy team. Sudeesh loves experimenting with new technologies and building innovative solutions that solve complex business challenges. When he is not designing solutions or tinkering with the latest technologies, you can find him on the tennis court working on his backhand.
John Klacynski is a Principal Customer Solution Manager within the AWS Independent Software Vendor (ISV) team. In this role, he programmatically helps ISV customers adopt AWS technologies and services to reach their business goals more quickly. Prior to joining AWS, John led Data Product Teams for large Consumer Packaged Goods companies, helping them leverage data insights to improve their operations and decision making.
Read more: AWS Machine Learning Blog