Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text. Amazon Comprehend provides customized features, custom entity recognition, custom classification, and pre-trained APIs such as key phrase extraction, sentiment analysis, entity recognition, and more so you can easily integrate NLP into your applications.
We recently added Amazon Comprehend related notebooks in Amazon SageMaker JumpStart notebooks that can help you quickly get started using the Amazon Comprehend custom classifier and custom entity recognizer. You can use custom classification to organize documents into categories (classes) that you define. Custom entity recognition extends the capability of the Amazon Comprehend pre-trained entity detection API by helping you identify entity types that are unique to your domain or business that aren’t in the preset generic entity types.
In this post, we show you how to use JumpStart to build Amazon Comprehend custom classification and custom entity detection models as part of your enterprise NLP needs.
SageMaker JumpStart
The Amazon SageMaker Studio landing page provides the option to use JumpStart. JumpStart provides a quick way to get started by providing pre-trained models for a variety of problem types. You can train and tune these models. JumpStart also provides other resources like notebooks, blogs, and videos.
JumpStart notebooks are essentially sample code that you can use as a starting point to get started quickly. Currently, we provide you with over 40 notebooks that you can use as is or customize as needed. You can find your notebooks by using search or the tabbed view panel. After you find the notebook you want to use, you can import the notebook, customize it for your requirements, and select the infrastructure and environment to run the notebook on.
Get started with JumpStart notebooks
To get started with JumpStart, go to the Amazon SageMaker console and open Studio. Refer to Get Started with SageMaker Studio for instructions on how to get started with Studio. Then complete the following steps:
In Studio, go to the launch page of JumpStart and choose Go to SageMaker JumpStart.
You’re offered multiple ways to search. You may either use tabs on the top to get to what you want, or use the search box as shown in the following screenshot.
To find notebooks, we go to the Notebooks tab.
At the time of writing, JumpStart offers 47 notebooks. You can use filters to find Amazon Comprehend related notebooks.
On the Content Type drop-down menu, choose Notebook.
As you can see in the following screenshot, we currently have two Amazon Comprehend notebooks.
In the following sections, we explore both notebooks.
Amazon Comprehend Custom Classifier
In this notebook, we demonstrate how to use the custom classifier API to create a document classification model.
The custom classifier is a fully managed Amazon Comprehend feature that lets you build custom text classification models that are unique to your business, even if you have little or no ML expertise. The custom classifier builds on the existing capabilities of Amazon Comprehend, which are already trained on tens of millions of documents. It abstracts much of the complexity required to build a NLP classification model. The custom classifier automatically loads and inspects the training data, selects the right ML algorithms, trains your model, finds the optimal hyperparameters, tests the model, and provides model performance metrics. The Amazon Comprehend custom classifier also provides an easy-to-use console for the entire ML workflow, including labeling text using Amazon SageMaker Ground Truth, training and deploying a model, and visualizing the test results. With an Amazon Comprehend custom classifier, you can build the following models:
Multi-class classification model – In multi-class classification, each document can have one and only one class assigned to it. The individual classes are mutually exclusive. For example, a movie can be classed as a documentary or as science fiction, but not both at the same time.
Multi-label classification model – In multi-label classification, individual classes represent different categories, but these categories are somehow related and not mutually exclusive. As a result, each document has at least one class assigned to it, but can have more. For example, a movie can simply be an action movie, or it can be an action movie, a science fiction movie, and a comedy, all at the same time.
This notebook requires no ML expertise to train a model with the example dataset or with your own business specific dataset. You can use the API operations discussed in this notebook in your own applications.
Amazon Custom Entity Recognizer
In this notebook, we demonstrate how to use the custom entity recognition API to create an entity recognition model.
Custom entity recognition extends the capabilities of Amazon Comprehend by helping you identify your specific entity types that aren’t in the preset generic entity types. This means that you can analyze documents and extract entities like product codes or business-specific entities that fit your particular needs.
Building an accurate custom entity recognizer on your own can be a complex process, requiring preparation of large sets of manually annotated training documents and selecting the right algorithms and parameters for model training. Amazon Comprehend helps reduce the complexity by providing automatic annotation and model development to create a custom entity recognition model.
The example notebook takes the training dataset in CSV format and runs inference against text input. Amazon Comprehend also supports an advanced use case that takes Ground Truth annotated data for training and allows you to directly run inference on PDFs and Word documents. For more information, refer to Build a custom entity recognizer for PDF documents using Amazon Comprehend.
Amazon Comprehend has lowered the annotation limits and allowed you to get more stable results, especially for few-shot subsamples. For more information about this improvement, refer to Amazon Comprehend announces lower annotation limits for custom entity recognition.
This notebook requires no ML expertise to train a model with the example dataset or with your own business specific dataset. You can use the API operations discussed in this notebook in your own applications.
Use, customize, and deploy Amazon Comprehend JumpStart notebooks
After you select the Amazon Comprehend notebook you want to use, choose Import notebook. As you do that, you can see the notebook kernel starting.
Importing your notebook triggers selection of the notebook instance, kernel, and image that is used to run the notebook. After the default infrastructure is provisioned, you can change the selections as per your requirements.
Now, go over the outline of the notebook and carefully read the sections for prerequisites setup, data setup, training the model, running inference, and stopping the model. Feel free to customize the generated code per your needs.
Based on your requirements, you may want to customize the following sections:
Permissions – For a production application, we recommend restricting access policies to only those needed to run the application. Permissions can be restricted based on the use case, such as training or inference, and specific resource names, such as a full Amazon Simple Storage Service (Amazon S3) bucket name or an S3 bucket name pattern. You should also restrict access to the custom classifier or SageMaker operations to just those that your application needs.
Data and location – The example notebook provides you sample data and S3 locations. Based on your requirements, you may use your own data for training, validation, and testing, and use different S3 locations as needed. Similarly, when the model is created, you can choose to keep the model at different locations. Just make sure you have provided the right permissions to access S3 buckets.
Preprocessing steps – If you’re using different data for training and testing, you may want to adjust the preprocessing steps per your requirements.
Testing data – You can bring your own inference data for testing.
Clean up – Delete the resources launched by the notebook to avoid recurring charges.
Conclusion
In this post, we showed you how to use JumpStart to learn and fast-track using Amazon Comprehend APIs by making it convenient to find and run Amazon Comprehend related notebooks from Studio while having the option to modify the code as needed. The notebooks use sample datasets with AWS product announcements and sample news articles. You may use this notebook to learn how to use Amazon Comprehend APIs in a Python notebook, or you may use it as a starting point and expand the code further for your unique requirements and production deployments.
You can start using JumpStart and take advantage of over 40 notebooks in various topics in all Regions where Studio is available at no additional cost.
About the Authors
Lana Zhang is a Sr. Solutions Architect at the AWS WWSO AI Services team with expertise in AI and ML for Content Moderation and Rekognition. She is passionate about promoting AWS AI services and helping customers transform their business solutions.
Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI
Rachna Chadha is a Principal Solution Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in the future and bring economic and social prosperity. In her spare time, Rachna likes spending time with her family, hiking, and listening to music.
Read MoreAWS Machine Learning Blog