Generative artificial intelligence (generative AI) models have demonstrated impressive capabilities in generating high-quality text, images, and other content. However, these models require massive amounts of clean, structured training data to reach their full potential. Most real-world data exists in unstructured formats like PDFs, which requires preprocessing before it can be used effectively.
According to IDC, unstructured data accounts for over 80% of all business data today. This includes formats like emails, PDFs, scanned documents, images, audio, video, and more. While this data holds valuable insights, its unstructured nature makes it difficult for AI algorithms to interpret and learn from it. According to a 2019 survey by Deloitte, only 18% of businesses reported being able to take advantage of unstructured data.
As AI adoption continues to accelerate, developing efficient mechanisms for digesting and learning from unstructured data becomes even more critical in the future. This could involve better preprocessing tools, semi-supervised learning techniques, and advances in natural language processing. Companies that use their unstructured data most effectively will gain significant competitive advantages from AI. Clean data is important for good model performance. Extracted texts still have large amounts of gibberish and boilerplate text (e.g., read HTML). Scraped data from the internet often contains a lot of duplications. Data from social media, reviews, or any user generated contents can also contain toxic and biased contents, and you may need to filter them out using some pre-processing steps. There could also be a lot of low-quality contents or bot-generated texts, which can be filtered out using accompanying metadata (e.g., filter out customer service responses that received low customer ratings).
Data preparation is important at multiple stages in Retrieval Augmented Generation (RAG) models. The knowledge source documents need preprocessing, like cleaning text and generating semantic embeddings, so they can be efficiently indexed and retrieved. The user’s natural language query also requires preprocessing, so it can be encoded into a vector and compared to document embeddings. After retrieving relevant contexts, they may need additional preprocessing, like truncation, before being concatenated to the user’s query to create the final prompt for the foundation model.
Solution overview
In this post, we work with a PDF documentation dataset—Amazon Bedrock user guide. Further, we show how to preprocess a dataset for RAG. Specifically, we clean the data and create RAG artifacts to answer the questions about the content of the dataset. Consider the following machine learning (ML) problem: user asks a large language model (LLM) question: “How to filter and search models in Amazon Bedrock?”. LLM has not seen the documentation during the training or fine-tuning stage, thus wouldn’t be able to answer the question and most probably will hallucinate. Our goal with this post, is to find a relevant piece of text from the PDF (i.e., RAG) and attach it to the prompt, thus enabling LLM to answer questions specific to this document.
Below, we show how you can do all these main preprocessing steps from Amazon SageMaker Data Wrangler:
Extracting text from a PDF document (powered by Textract)
Remove sensitive information (powered by Comprehend)
Chunk text into pieces.
Create embeddings for each piece (powered by Bedrock).
Upload embedding to a vector database (powered by OpenSearch)
Prerequisites
For this walkthrough, you should have the following:
An AWS account with permissions to create AWS Identity and Access Management (AWS IAM) policies and roles
Access to Amazon SageMaker, an instance of Amazon SageMaker Studio, and a user for Studio. For more information about prerequisites, see Getting started with using Amazon SageMaker Canvas.
Access to Amazon Bedrock models. Follow the guidelines for model access.
Access to Amazon Comprehend. The Amazon SageMaker Studio execution role must have permission to call the Amazon Comprehend DetectPiiEntities action.
Access to Amazon Textract. The Amazon SageMaker Studio execution role must have permission to call the Amazon Textract.
Read and write access to an Amazon Simple Storage Service (Amazon S3) bucket.
Access to Amazon OpenSearch as a vector database. The choice of vector database is an important architectural decision. There are several good options to consider, each with their own strengths. In this example, we have chosen Amazon OpenSearch as our vector database.
Note: Create OpenSearch Service domains following the instructions here. For simplicity, let’s pick the option with a master username and password for fine-grained access control. Once the domain is created, create a vector index with the following mappings, and vector dimension 1536 aligns with Amazon Titan embeddings:
Walkthrough
Build a data flow
In this section, we cover how we can build a data flow to extract text and metadata from PDFs, clean and process the data, generate embeddings using Amazon Bedrock, and index the data in Amazon OpenSearch.
Launch SageMaker Canvas
To launch SageMaker Canvas, complete the following steps:
On the Amazon SageMaker Console, choose Domains in the navigation pane.
Choose your domain.
On the launch menu, choose Canvas.
Create a dataflow
Complete the following steps to create a data flow in SageMaker Canvas:
On the SageMaker Canvas home page, choose Data preparation.
Choose Create on the right side of page, then give a data flow name and select Create.
This will land on a data flow page.
Choose Import data, select tabular data.
Now let’s import the data from Amazon S3 bucket:
Choose Import data and select Tabular from the drop-down list.
Data Source and select Amazon S3 from the drop-down list.
Navigate to the meta data file with PDF file locations, and choose the file.
Now the metadata file is loaded to the data preparation data flow, and we can proceed to add next steps to transform the data and index into Amazon OpenSearch. In this case the file has following metadata, with the location of each file in Amazon S3 directory.
To add a new transform, complete the following steps:
Choose the plus sign and choose Add Transform.
Choose Add Step and choose Custom Transform.
You can create a custom transform using Pandas, PySpark, Python user-defined functions, and SQL PySpark. Choose Python (PySpark) for this use-case.
Enter a name for the step. From the example code snippets, browse and select extract text from pdf. Make necessary changes to code snippet and select Add.
Let’s add a step to redact Personal Identifiable Information (PII) data from the extracted data by leveraging Amazon Comprehend. Choose Add Step and choose Custom Transform. And select Python (PySpark).
From the example code snippets, browse and select mask PII. Make necessary changes to code snippet and select Add.
The next step is to chunk the text content. Choose Add Step and choose Custom Transform. And select Python (PySpark).
From the example code snippets, browse and select Chunk text. Make necessary changes to code snippet and select Add.
Let’s convert the text content to vector embeddings using the Amazon Bedrock Titan Embeddings model. Choose Add Step and choose Custom Transform. And select Python (PySpark).
From the example code snippets, browse and select Generate text embedding with Bedrock. Make necessary changes to code snippet and select Add.
Now we have vector embeddings available for the PDF file contents. Let’s go ahead and index the data into Amazon OpenSearch. Choose Add Step and choose Custom Transform. And select Python (PySpark). You’re free to rewrite the following code to use your preferred vector database. For simplicity, we are using master username and password to access OpenSearch API’s, for production workloads select option according to your organization policies.
Finally, the dataflow created would be as follows:
With this dataflow, the data from the PDF file has been read and indexed with vector embeddings in Amazon OpenSearch. Now it’s time for us to create a file with queries to query the indexed data and save it to the Amazon S3 location. We’ll point our search data flow to the file and output a file with corresponding results in a new file in an Amazon S3 location.
Preparing a prompt
After we create a knowledge base out of our PDF, we can test it by searching the knowledge base for a few sample queries. We’ll process each query as follows:
Generate embedding for the query (powered by Amazon Bedrock)
Query vector database for the nearest neighbor context (powered by Amazon OpenSearch)
Combine the query and the context into the prompt.
Query LLM with a prompt (powered by Amazon Bedrock)
On the SageMaker Canvas home page, choose Data preparation.
Choose Create on the right side of page, then give a data flow name and select Create.
Now let’s load the user questions and then create a prompt by combining the question and the similar documents. This prompt is provided to the LLM for generating an answer to the user question.
Let’s load a csv file with user questions. Choose Import Data and select Tabular from the drop-down list.
Data Source, and select Amazon S3 from the drop-down list. Alternatively, you can choose to upload a file with user queries.
Let’s add a custom transformation to convert the data into vector embeddings, followed by searching related embeddings from Amazon OpenSearch, before sending a prompt to Amazon Bedrock with the query and context from knowledge base. To generate embeddings for the query, you can use the same example code snippet Generate text embedding with Bedrock mentioned in Step #7 above.
Let’s invoke the Amazon OpenSearch API to search relevant documents for the generated vector embeddings. Add a custom transform with Python (PySpark).
Let’s add a custom transform to call the Amazon Bedrock API for query response, passing the documents from the Amazon OpenSearch knowledge base. From the example code snippets, browse and select Query Bedrock with context. Make necessary changes to code snippet and select Add.
In summary, RAG based question answering dataflow is as follows:
ML practitioners spend a lot of time crafting feature engineering code, applying it to their initial datasets, training models on the engineered datasets, and evaluating model accuracy. Given the experimental nature of this work, even the smallest project leads to multiple iterations. The same feature engineering code is often run again and again, wasting time and compute resources on repeating the same operations. In large organizations, this can cause an even greater loss of productivity because different teams often run identical jobs or even write duplicate feature engineering code because they have no knowledge of prior work. To avoid the reprocessing of features, we’ll export our data flow to an Amazon SageMaker pipeline. Let’s select the + button to the right of the query. Select export data flow and choose Run SageMaker Pipeline (via Jupyter notebook).
Cleaning up
To avoid incurring future charges, delete or shut down the resources you created while following this post. Refer to Logging out of Amazon SageMaker Canvas for more details.
Conclusion
In this post, we showed you how Amazon SageMaker Canvas’s end-to-end capabilities by assuming the role of a data professional preparing data for an LLM. The interactive data preparation enabled quickly cleaning, transforming, and analyzing the data to engineer informative features. By removing coding complexities, SageMaker Canvas allowed rapid iteration to create a high-quality training dataset. This accelerated workflow led directly into building, training, and deploying a performant machine learning model for business impact. With its comprehensive data preparation and unified experience from data to insights, SageMaker Canvas empowers users to improve their ML outcomes.
We encourage you to learn more by exploring Amazon SageMaker Data Wrangler, Amazon SageMaker Canvas, Amazon Titan models, Amazon Bedrock, and Amazon OpenSearch Service to build a solution using the sample implementation provided in this post and a dataset relevant to your business. If you have questions or suggestions, then please leave a comment.
About the Authors
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Nikita Ivkin is a Senior Applied Scientist at Amazon SageMaker Data Wrangler with interests in machine learning and data cleaning algorithms.
Read MoreAWS Machine Learning Blog