Launched at AWS re:Invent 2021, Amazon SageMaker Ground Truth Plus helps you create high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow based on these requirements. From there, an expert workforce that is trained on a variety of machine learning (ML) tasks labels your data. You don’t even need deep ML expertise or knowledge of workflow design and quality management to use Ground Truth Plus.
Today, we are excited to announce the launch of new built-in interfaces on Ground Truth Plus. With this new capability, multiple Ground Truth Plus users can now create a new project and batch, share data, and receive data using the same AWS account through self-serve interfaces. This enables you to accelerate the development of high-quality training datasets by reducing project setup time. Additionally, you can control fine-grained access to your data by scoping your AWS Identity and Access Management (IAM) role permissions to match your individual level of Amazon Simple Storage Service (Amazon S3) access, and you always have the option to revoke access to certain buckets.
Until now, you had to reach out to your Ground Truth Plus operations program manager (OPM) to create new data labeling projects and batches. This process had some restrictions because it allowed only one user to request a new project and batch. If multiple users within the organization were using the same AWS account, then only one of them could request a new data labeling project and batch using the Ground Truth Plus console. Additionally, the process delayed the start of labeling because of multiple manual touchpoints and the troubleshooting required when issues came up. Separately, all the projects used the same IAM role for accessing data. Therefore, to run projects and batches that needed access to different data sources, such as different Amazon S3 buckets, you had to rely on your Ground Truth Plus OPM to provide your account-specific S3 policies, which you had to manually apply to your S3 buckets. This entire operation was manually intensive, resulting in operational overhead.
This post walks you through steps to create a new project and batch, share data, and receive data using the new self-serve interfaces to efficiently kickstart the labeling process. This post assumes that you are familiar with Ground Truth Plus. For more information, see Amazon SageMaker Ground Truth Plus – Create Training Datasets Without Code or In-house Resources.
Solution overview
We demonstrate how to do the following:
Update existing projects
Request a new project
Set up a project team
Create a batch
Prerequisites
Before you get started, make sure you have the following prerequisites:
An AWS account
An IAM user with access to create IAM roles
The Amazon S3 URI of the bucket where your labeling objects are stored
Update existing projects
If you created a Ground Truth Plus project before the launch (December 9, 2022) of the new features described in this post, then you need to create and share an IAM role so that you can use these features with your existing Ground Truth Plus project. If you’re a new user of Ground Truth Plus, you can skip this section.
To create an IAM role, complete the following steps:
On the IAM console, choose Create role.
Select Custom trust policy.
Specify the following trust relationship for the role:
Choose Next.
Choose Create policy.
On the JSON tab, specify the following policy. Update the Resource property by specifying two entries for each bucket: one with just the bucket ARN, and another with the bucket ARN followed by /*. For example, replace <your-input-s3-arn> with arn:aws:s3:::my-bucket/myprefix/ and <your-input-s3-arn>/* with arn:aws:s3:::my-bucket/myprefix/*.
Choose Next: Tags and Next: Review.
Enter the name of the policy and an optional description.
Choose Create policy.
Close this tab and go back to the previous tab to create your role.
On the Add permissions tab, you should see the new policy you created (refresh the page if you don’t see it).
Select the newly created policy and choose Next.
Enter a name (for example, GTPlusExecutionRole) and optionally a description of the role.
Choose Create role.
Provide the role ARN to your Ground Truth Plus OPM, who will then update your existing project with this newly created role.
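The trust relationship and the permissions policy from the steps above can be sketched as JSON documents built in Python. This is a minimal sketch under stated assumptions, not the exact policy text for your account: the service principal is the one this post names for Ground Truth Plus, the trust statement uses the standard sts:AssumeRole action, and the S3 action list passed in at the bottom is a placeholder that you should replace with the actions your project requires. The function names are hypothetical.

```python
import json

# Service principal that this post names for Ground Truth Plus.
GT_PLUS_PRINCIPAL = "sagemaker-ground-truth-plus.amazonaws.com"


def build_trust_policy(principal=GT_PLUS_PRINCIPAL):
    """Return a trust policy that lets the given service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def resource_entries(s3_arns):
    """Expand each S3 ARN into two entries: the ARN itself and the ARN followed by /*."""
    entries = []
    for arn in s3_arns:
        entries.append(arn)
        # Drop a trailing slash first so a prefix ARN does not end in "//*".
        entries.append(arn.rstrip("/") + "/*")
    return entries


def build_permissions_policy(s3_arns, actions):
    """Return a single-statement policy; the action list is supplied by the caller."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource_entries(s3_arns),
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder actions (an assumption); use the actions your project needs.
    example_actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    policy = build_permissions_policy(
        ["arn:aws:s3:::my-bucket/myprefix/"], example_actions
    )
    print(json.dumps(build_trust_policy(), indent=2))
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Custom trust policy editor and the JSON tab of the policy editor, respectively, after you review it against your own requirements.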
Request a new project
To request a new project, complete the following steps:
On the Ground Truth Plus console, navigate to the Projects section.
This is where all your projects are listed.
Choose Request project.
The Request project page is your opportunity to provide details that will help us schedule an initial consultation call and set up your project.
In addition to specifying general information like the project name and description, you must specify the project’s task type and whether it contains personally identifiable information (PII).
To label your data, Ground Truth Plus needs temporary access to your raw data in an S3 bucket. When the labeling process is complete, Ground Truth Plus delivers the labeling output back to your S3 bucket. This is done through an IAM role. You can either use the built-in tool on this page to create a new role, or you can navigate to the IAM console and create the role yourself (refer to the previous section for instructions).
If you created the role on the IAM console, choose Enter a custom IAM role ARN and enter your IAM role ARN, which is in the format of arn:aws:iam::<YourAccountNumber>:role/<RoleName>.
To use the built-in tool, on the drop-down menu under IAM Role, choose Create a new role.
Specify the bucket location of your labeling data. If you don’t know the location of your labeling data or if you don’t have any labeling data uploaded, select Any S3 bucket, which will give Ground Truth Plus access to all your account’s buckets.
Choose Create to create the role.
Your IAM role will allow Ground Truth Plus, identified as sagemaker-ground-truth-plus.amazonaws.com in the role’s trust policy, to run the following actions on your S3 buckets:
Choose Request project to complete the request.
A Ground Truth Plus OPM will schedule an initial consultation call with you to discuss your data labeling project requirements and pricing.
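Before pasting a custom IAM role ARN into the request form, it can help to confirm that the ARN matches the arn:aws:iam::<YourAccountNumber>:role/<RoleName> format described above. The following is a minimal sketch with a hypothetical helper name. It assumes the standard aws partition, a 12-digit account ID, and a role name without a path, so ARNs from other partitions or with role paths are rejected by this check even though they can be valid.

```python
import re

# Assumed shape: aws partition, 12-digit account ID, role name without a path.
# Role names may use letters, digits, and the characters _ + = , . @ - (up to 64).
ROLE_ARN_PATTERN = re.compile(
    r"arn:aws:iam::(\d{12}):role/([\w+=,.@-]{1,64})", re.ASCII
)


def parse_role_arn(arn):
    """Return (account_id, role_name) or raise ValueError if the format is off."""
    match = ROLE_ARN_PATTERN.fullmatch(arn.strip())
    if match is None:
        raise ValueError(
            "Expected arn:aws:iam::<YourAccountNumber>:role/<RoleName>, got: " + arn
        )
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # The account number below is a made-up example value.
    account_id, role_name = parse_role_arn(
        "arn:aws:iam::123456789012:role/GTPlusExecutionRole"
    )
    print(account_id, role_name)
```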
Set up a project team
After you request a project, you need to create a project team to log in to your project portal. A project team provides access to the members from your organization or team to track projects, view metrics, and review labels. You can use the option Invite new members by email or Import members from existing Amazon Cognito user groups. In this post, we show how to import members from existing Amazon Cognito user groups to add users to your project team.
On the Ground Truth Plus console, navigate to the Project team section.
Choose Create project team.
Choose Import members from existing Amazon Cognito user groups.
Choose an Amazon Cognito user pool.
User pools require a domain and an existing user group.
Choose an app client.
We recommend using a client generated by Amazon SageMaker.
Choose a user group from your pool to import members.
Choose Create project team.
You can add more team members after creating the project team by choosing Invite new members on the Members page of the Ground Truth Plus console.
Create a batch
After you have successfully submitted the project request and created a project team, you can access the Ground Truth Plus project portal by clicking Open project portal on the Ground Truth Plus console.
You can use the project portal to create batches for a project, but only after the project’s status has changed to Request approved.
View a project’s details and batches by choosing the project name.
A page titled with the project name opens.
In the Batches section, choose Create batch.
Enter a batch name and optional description.
Enter the S3 locations of the input and output datasets.
To ensure the batch is created successfully, you must meet the following requirements:
The S3 bucket and prefix should exist, and the total number of files should be greater than 0
The total number of objects should be less than 10,000
The size of each object should be less than 2 GB
The total size of all objects combined is less than 100 GB
The IAM role provided to create a project has permission to access the input bucket, output bucket, and S3 files that are used to create the batch
The files under the provided S3 location for the input datasets should not be encrypted by AWS Key Management Service (AWS KMS)
Choose Submit.
Your batch status will show as Request submitted. After Ground Truth Plus has temporary access to your data, AWS experts set up data labeling workflows and operate them on your behalf, which changes the batch status to In-progress. When the labeling is complete, the batch status changes from In-progress to Ready for review. If you want to review your labels before receiving them, choose Review batch. From there, you can choose Accept batch to receive your labeled data.
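The input requirements listed for batch creation can be checked locally before you submit. The following is a minimal sketch, not the validation that Ground Truth Plus runs: the S3Object record and check_batch_input function are hypothetical, the object metadata is assumed to have been gathered already (for example, from an S3 listing), and 1 GB is assumed to mean 10^9 bytes, which is the stricter reading. The IAM permission requirement is not covered here.

```python
from dataclasses import dataclass

GB = 10**9  # Assumption: decimal gigabytes, the stricter interpretation.

MAX_OBJECTS = 10_000         # Total number of objects must be less than this.
MAX_OBJECT_BYTES = 2 * GB    # Each object must be smaller than this.
MAX_TOTAL_BYTES = 100 * GB   # All objects combined must be smaller than this.


@dataclass
class S3Object:
    """Hypothetical record describing one object under the input location."""
    key: str
    size_bytes: int
    kms_encrypted: bool = False


def check_batch_input(objects):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if len(objects) == 0:
        problems.append("No files found under the input location.")
    if len(objects) >= MAX_OBJECTS:
        problems.append(f"Found {len(objects)} objects; the count must be less than {MAX_OBJECTS}.")
    for obj in objects:
        if obj.size_bytes >= MAX_OBJECT_BYTES:
            problems.append(f"{obj.key} is {obj.size_bytes} bytes; each object must be less than 2 GB.")
        if obj.kms_encrypted:
            problems.append(f"{obj.key} is encrypted with AWS KMS, which is not supported for input.")
    total = sum(obj.size_bytes for obj in objects)
    if total >= MAX_TOTAL_BYTES:
        problems.append(f"Total size is {total} bytes; it must be less than 100 GB.")
    return problems


if __name__ == "__main__":
    sample = [S3Object("images/cat.jpg", 250_000), S3Object("images/dog.jpg", 310_000)]
    for problem in check_batch_input(sample) or ["No problems found."]:
        print(problem)
```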
Conclusion
This post showed you how multiple Ground Truth Plus users can now create a new project and batch, share data, and receive data using the same AWS account through new self-serve interfaces. This new capability allows you to kickstart your labeling projects faster and reduces operational overhead. We also demonstrated how you can control fine-grained access to data by scoping your IAM role permissions to match your individual level of access.
We encourage you to try out this new functionality, and connect with the Machine Learning & AI community if you have any questions or feedback!
About the authors
Manish Goel is the Product Manager for Amazon SageMaker Ground Truth Plus. He is focused on building products that make it easier for customers to adopt machine learning. In his spare time, he enjoys road trips and reading books.
Karthik Ganduri is a Software Development Engineer at Amazon AWS, where he works on building ML tools for customers and internal solutions. Outside of work, he enjoys clicking pictures.
Zhuling Bai is a Software Development Engineer at Amazon AWS. She works on developing large scale distributed systems to solve machine learning problems.
Aatef Baransy is a Frontend engineer at Amazon AWS. He writes fast, reliable, and thoroughly tested software to nurture and grow the industry’s most cutting-edge AI applications.
Mohammad Adnan is a Senior Engineer for AI and ML at AWS. He has been part of many AWS service launches, notably Amazon Lookout for Metrics and AWS Panorama. Currently, he is focusing on AWS human-in-the-loop offerings (Amazon SageMaker Ground Truth, Ground Truth Plus, and Augmented AI). He is a clean code advocate and a subject-matter expert on serverless and event-driven architecture. You can follow him on LinkedIn, mohammad-adnan-6a99a829.