Regulatory mandates, audit requirements, and security policies often call for data visibility and granular data control while using Amazon Simple Storage Service (Amazon S3) for shared datasets. Because data on Amazon S3 is often accessible by multiple applications and teams, fine-grained access controls should be implemented to restrict privileged information such as personally identifiable information (PII) to only authorized entities. For example, PII data used by a marketing application may need to be masked to meet data privacy requirements. Similarly, an order inventory dataset used by a production ordering application may include customer credit card information that shouldn’t be accessed by a business analytics application, so this data should be suppressed to prevent unintended data leakage.
In this post, we show you how to implement Amazon S3 Object Lambda to process and modify data retrieved from Amazon S3.
Currently, organizations employ some combination of manual processes and rules-based automation to identify and protect PII. Manual processes are slow, expensive, and can’t scale to address large amounts of data with accuracy. Manual processes also exacerbate human risk because sensitive data is in the hands of more human users and applications during the PII management processes. Rules-based automation is often used to augment manual processes, but this automation requires continued investment to keep it relevant and effective. These automation investments also have diminishing returns because they often require human support to sufficiently protect PII due to the context-driven nature of many PII scenarios that can’t be effectively addressed by rules-based automation alone.
From an implementation perspective, organizations typically either create and manage a proxy in front of Amazon S3 to intercept and redact data or create and store additional redacted derivative copies of datasets to provide multiple users and applications with redacted and unredacted versions of the same datasets. In both implementation models, you need to build and operate custom data processing software on additional infrastructure and storage, which adds complexity, data risk, and cost. These circumstances make it challenging for organizations to affordably and accurately protect PII at scale.
AWS customers manage many S3 buckets containing shared datasets that are accessed by multiple applications and users. You can use Amazon S3 Access Points to simplify data access management at scale. S3 Access Points have unique hostnames with dedicated access policies that describe how data can be accessed using the S3 Access Point. Before S3 Access Points, least privilege shared access to data meant managing permissions directly on the bucket using a single bucket policy document and bucket ACLs. These policies could represent hundreds of applications and users with various access needs and permissions. S3 Access Points simplify and streamline data access by creating individualized access permissions that easily scale with your data while providing management transparency and auditability.
Solution overview
With S3 Object Lambda, organizations can transform S3 objects in-flight as they are being retrieved through a standard Amazon S3 GET request by using S3 Object Lambda Access Points. AWS has provided two new pre-built AWS Lambda functions to help you detect, redact, and govern PII. Both functions are now available on the AWS Serverless Application Repository to be deployed at no license cost:
The ComprehendPiiRedactionS3ObjectLambda function provides configurable redaction and masking of sensitive PII data.
The ComprehendPiiAccessControlS3ObjectLambda function checks whether an object contains specified types of PII and, if it does, prevents retrieval to avoid inadvertent leakage of PII.
Unlike a human workforce, these capabilities can scale to large amounts of data without affecting accuracy and can reduce the number of humans and entities in contact with known and unknown PII data.
These S3 Object Lambda functions are powered by Amazon Comprehend, a fully managed service that uses state-of-the-art natural language processing (NLP) techniques to accurately identify PII. This means that the two new functions can capture variations in how PII is represented, regardless of how PII exists in text (such as numerically or as a combination of words and numbers). Amazon Comprehend can even use context in the text to understand if a 4-digit number is a PIN, the last four numbers of a Social Security number, or a year. With S3 Object Lambda, you don’t have to operate custom software or maintain additional infrastructure and storage to deploy this processing around your data. With just a few clicks on the AWS Management Console or through the AWS Command Line Interface (AWS CLI), you can configure and deploy the Amazon Comprehend-powered PII Lambda functions to control and manage your PII information.
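To make the redaction idea concrete, here is a minimal sketch of how offset-based PII entities could be turned into redacted text. The entity shape (Type, BeginOffset, EndOffset, Score) mirrors what the Amazon Comprehend DetectPiiEntities API returns, and the two mask modes match the MaskMode values used later in this post. The function name redact_text, the sample sentence, and the bracketed [TYPE] placeholder format are assumptions for illustration, not the code of the pre-built Lambda functions.

```python
from typing import Dict, List, Set


def redact_text(text: str, entities: List[Dict], pii_types: Set[str],
                mask_mode: str = "MASK", mask_character: str = "*") -> str:
    """Redact every span of `text` whose entity Type is in `pii_types`."""
    # Walk the entities from the end of the string so that replacing one
    # span never shifts the offsets of spans that are still to be handled.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Type"] not in pii_types:
            continue
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        if mask_mode == "REPLACE_WITH_PII_ENTITY_TYPE":
            replacement = "[" + entity["Type"] + "]"  # assumed placeholder format
        else:
            replacement = mask_character * (end - begin)
        text = text[:begin] + replacement + text[end:]
    return text


# Hypothetical sample; in practice the entity list would come from Comprehend.
sample = "Card 4111111111111111, SSN ending 1234."
found = [
    {"Type": "CREDIT_DEBIT_NUMBER", "BeginOffset": 5, "EndOffset": 21, "Score": 0.99},
    {"Type": "SSN", "BeginOffset": 34, "EndOffset": 38, "Score": 0.98},
]

print(redact_text(sample, found, {"SSN"}, "REPLACE_WITH_PII_ENTITY_TYPE"))
# Card 4111111111111111, SSN ending [SSN].
print(redact_text(sample, found, {"CREDIT_DEBIT_NUMBER", "SSN"}))
# Card ****************, SSN ending ****.
```

With boto3, the entity list could be obtained from comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]; the sketch keeps that call out so the masking logic can be read and run on its own.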
The following diagram shows a basic data flow of how accessing an S3 object from an S3 Object Lambda Access Point uses S3 Object Lambda functions to detect and act on data as it’s being retrieved.
The solution contains the following steps:
An authorized and authenticated user or application makes an S3 GetObject API call to the S3 Object Lambda Access Point.
The S3 Object Lambda Access Point invokes the attached Lambda function for either access control or redaction.
If the ComprehendPiiAccessControlS3ObjectLambda is attached to the S3 Object Lambda Access Point and PII is detected in the object, the GetObject API call is denied and a response reads Access Denied Object Contains PII.
Alternatively, if the ComprehendPiiRedactionS3ObjectLambda is attached to the S3 Object Lambda Access Point, the GetObject API call returns the requested object with the selected PII redacted according to the configuration of the redaction Lambda function.
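The access control branch of this flow can be sketched as a small handler. The event keys (getObjectContext, inputS3Url, outputRoute, outputToken) and the response parameters (RequestRoute, RequestToken, Body, StatusCode, ErrorCode, ErrorMessage) follow the S3 Object Lambda event format and the WriteGetObjectResponse API. Everything else is an assumption for illustration: the function name handle_get_object, the injected callables, the default blocked type, and the error message. This is not the implementation of the pre-built functions.

```python
from typing import Callable, Dict, FrozenSet, List


def handle_get_object(event: Dict,
                      fetch_original: Callable[[str], str],
                      detect_pii: Callable[[str], List[Dict]],
                      write_response: Callable[..., None],
                      blocked_types: FrozenSet[str] = frozenset({"SSN"})) -> int:
    """Deny the request if the object contains a blocked PII type, else pass it through."""
    ctx = event["getObjectContext"]
    # The presigned URL lets the function read the original object.
    text = fetch_original(ctx["inputS3Url"])
    hits = [e for e in detect_pii(text) if e["Type"] in blocked_types]
    # Route and token tie the response back to the caller's GetObject request.
    common = {"RequestRoute": ctx["outputRoute"], "RequestToken": ctx["outputToken"]}
    if hits:
        write_response(StatusCode=403, ErrorCode="AccessDenied",
                       ErrorMessage="Object contains PII", **common)
        return 403
    write_response(Body=text.encode("utf-8"), **common)
    return 200
```

In a deployed function, write_response would typically be the write_get_object_response method of a boto3 S3 client, fetch_original an HTTP GET against the presigned URL, and detect_pii a thin wrapper around Amazon Comprehend. Injecting them as parameters keeps the decision logic testable without AWS access.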
In this post, we present two use cases to demonstrate how to configure and use the pre-built Lambda functions to detect and protect sensitive data.
Use case overview
This solution includes two architectures, which include resources that you create with AWS CloudFormation templates and through manual operations using the console. The first use case focuses on access control for PII. The second use case focuses on selective redaction of data for multiple personas.
The CloudFormation templates and Lambda function code are available in the GitHub repo.
Use case #1 (access control)
In the first use case, you create an S3 Object Lambda Access Point and attach a pre-built Lambda function for access control. This pre-built Amazon Comprehend-powered function allows for just-in-time interception and denial of access to unknown PII in a scalable and cost-effective way, which reduces the risk of accidental PII exposure to unauthorized users. You can easily configure and deploy this function through the AWS Serverless Application Repository.
After you deploy the function, you validate that access to an S3 object is blocked if PII is detected during the object retrieval process. This scenario simulates a standard business user who may need to access existing data in S3 buckets but isn’t authorized to access PII data. By enabling S3 Object Lambda to discover, intercept, and block access to objects with unexpected PII, you can effectively discover unknown PII in your environments and protect against unintended PII leakage. If unknown PII is discovered in an object, a data governance user or data owner typically should review the object and decide whether to redact the information or remove it before granting access to the business user.
The following diagram illustrates this architecture.
To implement this architecture, you complete the following high-level steps:
Deploy the ComprehendPiiAccessControlS3ObjectLambda function from an AWS verified author using the AWS Serverless Application Repository.
You attach this function to the S3 Object Lambda Access Point for access control.
Launch the CloudFormation template s3olap-access-control-foundation.
The template creates AWS Identity and Access Management (IAM) resources, standard S3 Access Points, and an S3 bucket.
Upload the example files to the newly created S3 bucket as sample data.
You download two files from the GitHub repo and upload them to the bucket:
survey-results.txt – Contains example data that contains previously unknown PII. This data simulates PII data that was inadvertently recorded as a part of a survey transcript and is unknown to all parties.
innocuous.txt – Control data to prove that the Lambda function is only blocking files that contain PII, and demonstrates the ability to retrieve objects without PII through an S3 Object Lambda Access Point.
Create an S3 Object Lambda Access Point on the console.
You do this by associating a supporting standard S3 Access Point and the previously deployed ComprehendPiiAccessControlS3ObjectLambda function with the S3 Object Lambda Access Point.
Use the IAM role GeneralRole to retrieve survey-results.txt through the S3 Object Lambda Access Point.
This role simulates a business user role in use by many people to access data as a part of their day-to-day responsibilities. The assumption is that there isn’t any business need to view any sensitive PII for these users. This user expects that the customer data they’re accessing doesn’t contain sensitive PII. The GeneralRole IAM role makes a GetObject call to the S3 Object Lambda Access Point to retrieve survey-results.txt. During retrieval the associated Lambda function is invoked and when the function detects PII it blocks retrieval and responds that the object can’t be retrieved. After you’re denied access to the file, you use GeneralRole to retrieve the innocuous.txt file using a GetObject call to validate you can retrieve files without PII.
Use case #2 (redaction)
In the second use case, you create multiple S3 Object Lambda Access Points and enable them with pre-built Lambda functions that are configured to redact specific types of PII depending on the accessors’ business needs. These functions are configured and deployed from the AWS Serverless Application Repository. Next, you validate that an S3 object is being properly redacted for each user based on the S3 Object Lambda Access Point performing the object retrieval.
This redaction example use case has three personas: an administrator, a billing user, and a customer support user. Each persona requires access to the same data with varying levels of redaction to achieve least privilege and still access the information necessary for their role:
The administrator needs access to the unredacted data to properly configure redaction for other users. They may also need to implement redaction for themselves to suppress specific PII data in the future.
The billing user needs access to the financial data in the text file but shouldn’t have access to other sensitive user data such as SSNs.
The customer support user needs access to the personal data of the user such as the email address, physical address, age, and name, but shouldn’t have access to financial data or sensitive information such as SSNs.
Each user only has access to one S3 Object Lambda Access Point (managed through IAM permissions).
This use case demonstrates how S3 Object Lambda enables configurable user-specific redaction for data. The following diagram illustrates our architecture.
We deploy the architecture with the following high-level steps:
Configure and deploy the ComprehendPiiRedactionS3ObjectLambda function from an AWS verified author using the AWS Serverless Application Repository.
You attach this function to each S3 Object Lambda Access Point for the redaction use cases. You deploy three redaction functions, each configured differently to support the specific personas.
Launch the CloudFormation template s3olap-redaction-foundation.
The template creates IAM resources, standard S3 Access Points, and an S3 bucket.
Upload the example file transcript.txt to the newly created S3 bucket as sample data.
This file is an example of a sensitive call transcript containing known PII of various types, including phone numbers, banking info, and SSNs. This data simulates PII data that was recorded as a part of a call center interaction that is known to be sensitive and has been protected accordingly. A variety of personas have a valid business need to access this information, but each persona’s needs differ based on their role, and the business wants to implement the best practice of least privilege. The personas access the file through S3 Object Lambda Access Points to give them the appropriate level of information to do their job.
Create an S3 Object Lambda Access Point for the admin user.
You associate the admin supporting standard S3 Access Point and the previously created admin ComprehendPiiRedactionS3ObjectLambda function with the S3 Object Lambda Access Point. Make sure it’s the function you configured for admin redaction.
Create an S3 Object Lambda Access Point for the billing user.
You associate the billing supporting standard S3 Access Point and the previously created billing ComprehendPiiRedactionS3ObjectLambda function with the S3 Object Lambda Access Point. Make sure it’s the function you configured for billing redaction.
Create an S3 Object Lambda Access Point for the customer support user.
You associate the customer support supporting standard S3 Access Point and the previously created customer support ComprehendPiiRedactionS3ObjectLambda function with the S3 Object Lambda Access Point. Make sure it’s the function you configured for customer support redaction.
Assume the Admin Redaction role and attempt to download the transcript.txt data through the Admin S3 Object Lambda Access Point.
The download should complete successfully without modification to the file.
Assume the Billing Redaction role and attempt to download the transcript.txt data through the Billing S3 Object Lambda Access Point.
The download should complete with redaction of sensitive non-financial PII.
Assume the Customer Support Redaction role and attempt to download the transcript.txt data through the Customer Support S3 Object Lambda Access Point.
The download should complete with redaction of all financial PII while preserving contact information.
Permissions should be established such that roles can only access their corresponding S3 Object Lambda Access Points.
Validate that permissions only allow the download through the S3 Object Lambda Access Points that correspond to the similarly named roles (for example, the Billing Redaction role can download from the Billing S3 Object Lambda Access Point, but not the Admin S3 Object Lambda Access Point or the Customer Support S3 Object Lambda Access Point).
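As an illustration of restricting a role to a single S3 Object Lambda Access Point, the following sketch builds an IAM policy document that allows s3-object-lambda:GetObject on one access point ARN only. The helper name olap_read_policy and the account ID 111122223333 are placeholders, and the policies created by the CloudFormation template may be structured differently. A complete policy would likely also need permissions on the supporting S3 Access Point and the Lambda function; those statements are omitted here.

```python
import json
from typing import Dict


def olap_read_policy(region: str, account_id: str, olap_name: str) -> Dict:
    """Build a policy that permits GetObject through exactly one Object Lambda Access Point."""
    arn = f"arn:aws:s3-object-lambda:{region}:{account_id}:accesspoint/{olap_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowGetObjectThroughOneObjectLambdaAccessPoint",
            "Effect": "Allow",
            "Action": "s3-object-lambda:GetObject",
            "Resource": arn,  # restricted by resource name
        }],
    }


# Placeholder account ID; the access point name is the one used in this post.
billing_policy = olap_read_policy(
    "us-east-1", "111122223333", "billing-s3olap-call-transcripts-known-pii")
print(json.dumps(billing_policy, indent=2))
```

Because the resource is a single named ARN, a role carrying only this statement has no path to the admin or customer support access points, which is also why the access point names must match the policy exactly.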
The following diagram indicates the permissions, features, and access control functionality that you use to manage how the S3 Object Lambda solutions work.
Solution cost
Using S3 Object Lambda for PII access control or redaction incurs costs from Amazon S3, Lambda, and Amazon Comprehend.
For more information about pricing, see Amazon S3 pricing, AWS Lambda pricing, and Amazon Comprehend pricing.
Use case #1: Detection and denial of data retrieval for objects with PII
In this section, we walk you through the steps to implement the first use case, which restricts access to objects containing PII.
Deploy the ComprehendPiiAccessControlS3ObjectLambda function
To deploy the ComprehendPiiAccessControlS3ObjectLambda function, complete the following steps:
Sign in to your AWS account.
Navigate to the AWS Serverless Application Repository page for the ComprehendPiiAccessControlS3ObjectLambda function.
This is the function we attach to the S3 Object Lambda Access Point.
Review the Readme file portion of the ComprehendPiiAccessControlS3ObjectLambda.
For Application Name, enter general-ComprehendPiiAccessControlS3ObjectLambda.
Select the check box to acknowledge the creation of custom IAM roles.
Choose Deploy.
You should now be able to review your deployed function on the Lambda functions page.
Launch the s3olap-access-control-foundation CloudFormation template
You use the console to launch a CloudFormation stack (s3olap-access-control-foundation) that sets up the following resources:
An S3 bucket called survey-results-unknown-pii-[postfix]. The postfix is a user-supplied string that should be added to make the S3 bucket globally unique.
An S3 bucket policy that only allows users to read from the bucket through S3 Access Points.
A standard S3 Access Point called accessctl-s3-ap-survey-results-unknown-pii for use by the general IAM role.
An IAM role named GeneralRole and IAM policy named General Policy, meant to act as a general cloud data user. This role is used to access the S3 bucket.
Choose Launch Stack to deploy the resources, and make sure you’re in the US East (N. Virginia) Region (us-east-1).
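A bucket policy that only allows reads through S3 Access Points is commonly written by delegating access control to access points owned by a given account, using the s3:DataAccessPointAccount condition key. The sketch below shows that pattern as an assumption about what such a policy could look like; the policy the template actually creates may differ. The helper name, the example postfix, and the account ID are placeholders.

```python
import json
from typing import Dict


def delegate_to_access_points_policy(bucket_name: str, account_id: str) -> Dict:
    """Allow object reads only when the request arrives through an access point
    owned by `account_id`; direct bucket requests do not match this statement."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
            "Condition": {
                "StringEquals": {"s3:DataAccessPointAccount": account_id}
            },
        }],
    }


# "example" stands in for the user-supplied postfix.
policy = delegate_to_access_points_policy(
    "survey-results-unknown-pii-example", "111122223333")
print(json.dumps(policy, indent=2))
```

With this pattern, fine-grained permissions live in the individual access point policies and IAM policies rather than in one large bucket policy.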
Upload example PII data to the unknown-pii bucket
Next, we upload our example PII data.
On the Amazon S3 console, select the survey-results-unknown-pii-[postfix] bucket.
Download survey-results.txt and innocuous.txt files from GitHub.
Upload the files to the survey-results-unknown-pii-[postfix] bucket.
Create the Access Control S3 Object Lambda Access Point
In this step, we create an S3 Object Lambda Access Point using the ComprehendPiiAccessControlS3ObjectLambda function to test our access control.
On the Amazon S3 console, choose Object Lambda Access Points in the navigation pane.
Choose Create Object Lambda Access Point.
Name your Access Point (for this post, we use accessctl-s3olap-survey-results-unknown-pii).
Choose Browse S3.
Select the correct supporting Access Point named accessctl-s3-ap-survey-results-unknown-pii.
In the Lambda function section, choose Choose from functions in your account and choose the serverlessrepo-Comprehend-PiiAccessControlFunction-[random string] function.
Choose Create Object Lambda Access Point.
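The console steps above could also be scripted. The following sketch builds the configuration that the S3 Control CreateAccessPointForObjectLambda API expects and passes it to boto3. The helper names, the account ID, and the example ARNs are placeholders, and the function ARN would need to be replaced with the ARN of the function deployed in your account. boto3 is imported lazily so the configuration helper can be used and inspected without it.

```python
from typing import Dict


def object_lambda_configuration(supporting_access_point_arn: str,
                                function_arn: str) -> Dict:
    """Attach one Lambda function to GetObject requests on a supporting access point."""
    return {
        "SupportingAccessPoint": supporting_access_point_arn,
        "TransformationConfigurations": [{
            "Actions": ["GetObject"],
            "ContentTransformation": {"AwsLambda": {"FunctionArn": function_arn}},
        }],
    }


def create_object_lambda_access_point(account_id: str, name: str,
                                      configuration: Dict) -> Dict:
    """Create the Object Lambda Access Point (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3 installed

    s3control = boto3.client("s3control", region_name="us-east-1")
    return s3control.create_access_point_for_object_lambda(
        AccountId=account_id, Name=name, Configuration=configuration)


# Placeholder ARNs for illustration only.
config = object_lambda_configuration(
    "arn:aws:s3:us-east-1:111122223333:accesspoint/accessctl-s3-ap-survey-results-unknown-pii",
    "arn:aws:lambda:us-east-1:111122223333:function:example-access-control-function",
)
```

Calling create_object_lambda_access_point("111122223333", "accessctl-s3olap-survey-results-unknown-pii", config) would then mirror the console flow, assuming the caller has permission to create Object Lambda Access Points.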
Test the solution
We can now test the solution by attempting to retrieve S3 objects with the Access Point.
On the Amazon S3 console, choose Object Lambda Access Points in the navigation pane.
Choose the accessctl-s3olap-survey-results-unknown-pii Access Point.
Select survey-results.txt.
On the Actions menu, choose Download.
You should receive a message that the download is denied.
Return to the Object Lambda Access Points page and choose the accessctl-s3olap-survey-results-unknown-pii Access Point again.
Select innocuous.txt.
On the Actions menu, choose Download.
View the downloaded file to see if it downloaded successfully.
The file shouldn’t contain any PII and should download successfully. If you see any unexpected characters in the file, download the file and open it in a text editor (some browsers experience text encoding issues).
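The same test can be run outside the console. The sketch below requests an object through an S3 Object Lambda Access Point by passing the access point ARN as the Bucket parameter of GetObject, and reports either the object text or the error code returned by the service. The helper names and the account ID are placeholders, and the exact error code the pre-built function returns is not assumed here. The client is passed in as a parameter, so any object with a boto3-style get_object method works.

```python
from typing import Tuple


def object_lambda_access_point_arn(region: str, account_id: str, name: str) -> str:
    """Build the ARN used in place of a bucket name."""
    return f"arn:aws:s3-object-lambda:{region}:{account_id}:accesspoint/{name}"


def try_download(s3_client, olap_arn: str, key: str) -> Tuple[str, str]:
    """Return ("ok", object text) or ("error", service error code)."""
    try:
        response = s3_client.get_object(Bucket=olap_arn, Key=key)
    except Exception as exc:
        # botocore's ClientError carries the service error in `exc.response`.
        error = getattr(exc, "response", {}).get("Error", {})
        if not error:
            raise  # not a service error; let it surface unchanged
        return "error", error.get("Code", "Unknown")
    return "ok", response["Body"].read().decode("utf-8")
```

For example, with s3 = boto3.client("s3") created under GeneralRole credentials and the ARN for accessctl-s3olap-survey-results-unknown-pii, try_download(s3, arn, "survey-results.txt") is expected to return an error tuple, while try_download(s3, arn, "innocuous.txt") is expected to return the file contents.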
Use case #2: Redaction of known PII data for multiple personas
In this section, we walk you through the steps to create multiple S3 Object Lambda Access Points and enable them with pre-built Lambda functions that are configured to redact specific types of PII depending on the accessors’ business needs.
Deploy the ComprehendPiiRedactionS3ObjectLambda function for admin access
Your first step is to deploy the ComprehendPiiRedactionS3ObjectLambda function for use cases with admin access.
Sign in to your AWS account.
Navigate to the AWS Serverless Application Repository page for the ComprehendPiiRedactionS3ObjectLambda function.
For Application name, enter admin-ComprehendPiiRedactionS3ObjectLambda.
Set PiiEntityTypes to none so no information is redacted.
You can change this value and redeploy the stack in the future to test other redaction scenarios.
Set MaskCharacter to an empty space.
Set UnsupportedFileHandling to PASS so that unsupported files are served.
Select the check box to acknowledge the creation of custom IAM roles.
Choose Deploy.
You should now be able to review your deployed function on the Lambda functions page.
Deploy the ComprehendPiiRedactionS3ObjectLambda function for billing access
In this section, you deploy the ComprehendPiiRedactionS3ObjectLambda function for use cases with billing access.
Navigate to the AWS Serverless Application Repository page for the ComprehendPiiRedactionS3ObjectLambda function.
Validate the function is from an AWS verified author.
We attach this function to the S3 Object Lambda Access Points for redaction use cases.
Review the Readme file portion of the ComprehendPiiRedactionS3ObjectLambda function.
For Application Name, enter billing-ComprehendPiiRedactionS3ObjectLambda.
Set MaskMode to REPLACE_WITH_PII_ENTITY_TYPE.
Set PiiEntityTypes to AGE,DRIVER_ID,IP_ADDRESS,MAC_ADDRESS,PASSPORT_NUMBER,PASSWORD,SSN.
This field configures the Lambda function to redact the specified types of information discovered in the object. For more information about supported entity types, see Detect Personally Identifiable Information (PII).
Select the check box to acknowledge the creation of custom IAM roles.
Choose Deploy.
You should now be able to review your deployed function on the Lambda functions page.
Deploy the ComprehendPiiRedactionS3ObjectLambda function for customer support access
Finally, we deploy the ComprehendPiiRedactionS3ObjectLambda function for use cases with customer support access.
Navigate to the AWS Serverless Application Repository page for the ComprehendPiiRedactionS3ObjectLambda function.
Validate the function is from an AWS verified author.
For Application Name, enter customersupport-ComprehendPiiRedactionS3ObjectLambda.
Set MaskMode to REPLACE_WITH_PII_ENTITY_TYPE.
Set PiiEntityTypes to BANK_ACCOUNT_NUMBER,BANK_ROUTING,CREDIT_DEBIT_CVV,CREDIT_DEBIT_EXPIRY,CREDIT_DEBIT_NUMBER,SSN so all this information is redacted.
Select the check box to acknowledge the creation of custom IAM roles.
Choose Deploy.
You should now be able to review your deployed function on the Lambda functions page.
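The billing and customer support deployments differ only in their parameter values, which can be captured as data. The following sketch records the MaskMode and PiiEntityTypes values from the steps above and converts them into the Name/Value list that the AWS Serverless Application Repository CreateCloudFormationChangeSet API accepts as ParameterOverrides. The dictionary and helper names are hypothetical, and the admin deployment is left out because its settings are entered differently in the steps above.

```python
from typing import Dict, List, Set

# Parameter values taken from the billing and customer support steps in this post.
PERSONA_PARAMETERS: Dict[str, Dict[str, str]] = {
    "billing": {
        "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
        "PiiEntityTypes": "AGE,DRIVER_ID,IP_ADDRESS,MAC_ADDRESS,"
                          "PASSPORT_NUMBER,PASSWORD,SSN",
    },
    "customersupport": {
        "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
        "PiiEntityTypes": "BANK_ACCOUNT_NUMBER,BANK_ROUTING,CREDIT_DEBIT_CVV,"
                          "CREDIT_DEBIT_EXPIRY,CREDIT_DEBIT_NUMBER,SSN",
    },
}


def parameter_overrides(persona: str) -> List[Dict[str, str]]:
    """Return the persona's settings as a list of {"Name", "Value"} pairs."""
    settings = PERSONA_PARAMETERS[persona]
    return [{"Name": name, "Value": settings[name]} for name in sorted(settings)]


def redacted_types(persona: str) -> Set[str]:
    """Return the set of PII entity types redacted for the persona."""
    return set(PERSONA_PARAMETERS[persona]["PiiEntityTypes"].split(","))


# SSN is the only entity type redacted for both personas.
print(redacted_types("billing") & redacted_types("customersupport"))
```

Keeping the settings in one place makes it easier to see that the billing configuration leaves financial entity types untouched, while the customer support configuration leaves contact details and age untouched.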
Launch the s3olap-redaction-foundation CloudFormation template
You now launch the s3olap-redaction-foundation CloudFormation stack to set up the following resources:
An S3 bucket called call-transcripts-known-pii-[postfix]. The postfix is a user-supplied string that should be added to make the S3 bucket globally unique.
An S3 bucket policy that only allows users to read from the bucket through S3 Access Points.
A standard S3 Access Point called billing-s3-access-point-call-transcripts-known-pii for use by the billing IAM role.
A standard S3 Access Point called cs-s3-access-point-call-transcripts-known-pii for use by the customer support IAM role.
A standard S3 Access Point called admin-s3-access-point-call-transcripts-known-pii for use by the admin IAM role.
An IAM role named AdminRole and IAM policy named Admin Policy, meant to act as a cloud data administrator. This role is used to access unredacted data, configure S3 Access Points, and deploy the pre-built Lambda functions from the AWS Serverless Application Repository.
An IAM role named BillingRole and IAM policy named Billing Policy, meant to act as a team member on the billing team. This role is used to access the transcript of the call while only redacting non-billing related sensitive information.
An IAM role named CustSupportRole and IAM policy named CustSupport Policy, meant to act as a team member on the customer support team. This role is used to access the transcript of the call while redacting sensitive information such as financial information and sensitive out-of-scope personal information (like the last four numbers of an SSN).
Choose Launch Stack to deploy the resources, and make sure you’re in the US East (N. Virginia) Region (us-east-1).
Upload example PII data to the known-pii bucket
Next, we upload our sample data.
On the Amazon S3 console, select the call-transcripts-known-pii-[postfix] bucket.
Download transcript.txt from GitHub.
Upload the file to the call-transcripts-known-pii-[postfix] bucket.
You should now see transcript.txt listed in the call-transcripts-known-pii-[postfix] S3 bucket.
You now have the necessary IAM and Amazon S3 foundation to set up the redaction use cases. Next, we deploy the S3 Object Lambda Access Points.
Create the Admin S3 Object Lambda Access Point
We create the S3 Object Lambda Access Point for admin access using the ComprehendPiiRedactionS3ObjectLambda function.
On the Amazon S3 console, choose Object Lambda Access Points in the navigation pane.
Choose Create Object Lambda Access Point.
Name your Access Point admin-s3olap-call-transcripts-known-pii.
Make sure to use this exact name. If any Object Lambda Access Points aren’t named properly, the provided IAM policies don’t allow access because they’re restricted by resource name.
Choose Browse S3.
Select the corresponding Access Point admin-s3-access-point-call-transcripts-known-pii.
In the Lambda function section, choose Choose from functions in your account and choose the serverlessrepo-admin-Comprehe-PiiRedactionFunction-[random string] function.
Choose Create Object Lambda Access Point.
Create the Billing S3 Object Lambda Access Point
We now create the S3 Object Lambda Access Point for billing access using the ComprehendPiiRedactionS3ObjectLambda function.
On the Amazon S3 console, choose Object Lambda Access Points in the navigation pane.
Choose Create Object Lambda Access Point.
Name your Access Point billing-s3olap-call-transcripts-known-pii.
Make sure to use this exact name.
Choose Browse S3.
Select the corresponding Access Point billing-s3-access-point-call-transcripts-known-pii.
In the Lambda function section, choose Choose from functions in your account and choose the serverlessrepo-billing-Comprehe-PiiRedactionFunction-[random string] function.
Choose Create Object Lambda Access Point.
Create the Customer Support S3 Object Lambda Access Point
Finally, we create the S3 Object Lambda Access Point for customer support access using the ComprehendPiiRedactionS3ObjectLambda function.
On the Amazon S3 console, choose Object Lambda Access Points in the navigation pane.
Choose Create Object Lambda Access Point.
Name your Access Point custsupport-s3olap-call-transcripts-known-pii.
Make sure to use this exact name.
Choose Browse S3.
Select the corresponding Access Point cs-s3-access-point-call-transcripts-known-pii.
In the Lambda function section, choose Choose from functions in your account and choose the serverlessrepo-customersuppor-Comprehe-PiiRedactionFunction-[random string] function.
Choose Create Object Lambda Access Point.
Test the solution
To test the solution, we retrieve S3 objects using the S3 Object Lambda Access Points we just created.
On the Amazon S3 console, choose Object Lambda Access Points in the navigation pane.
Assume the Admin Redaction role.
Choose the admin-s3olap-call-transcripts-known-pii Access Point.
Select transcript.txt.
On the Actions menu, choose Download.
View the downloaded file to see if the file has any redactions.
No information should be redacted.
Assume the Billing Redaction role.
Return to the Object Lambda Access Points page and choose the billing-s3olap-call-transcripts-known-pii Access Point.
Select transcript.txt.
On the Actions menu, choose Download.
View the downloaded file.
Sensitive data like the last four numbers of the SSN should be redacted.
Assume the Customer Support Redaction role.
Return to the Object Lambda Access Points page and choose the custsupport-s3olap-call-transcripts-known-pii Access Point.
Select transcript.txt.
On the Actions menu, choose Download.
View the downloaded file.
All the financial data and the last four numbers of the SSN should be redacted.
If you see any unexpected characters in the files, download the file and open it in a text editor (some browsers experience text encoding issues).
If you want to implement this solution effectively, give each team or persona access to only one specific role (such as the billing role) and make sure teams only have access to an IAM role that corresponds to the level of data access they should have.
Clean up resources
Finally, delete the resources you created in the earlier steps, in order to avoid additional charges.
Delete all the S3 Object Lambda Access Points you created.
Delete the files you added to the S3 buckets for known-pii and unknown-pii.
Delete all the serverlessrepo-*-Comprehend stacks.
Delete the s3olap-redaction-foundation and s3olap-access-control-foundation CloudFormation stacks.
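Part of this cleanup could be scripted. The sketch below deletes the four S3 Object Lambda Access Points named in this post and then the two foundation stacks, using the S3 Control DeleteAccessPointForObjectLambda and CloudFormation DeleteStack APIs. It does not empty the buckets or delete the serverlessrepo stacks, so those steps would still need to be done separately, and the buckets likely need to be emptied before the foundation stacks can be deleted. The function name and the injected client parameters are assumptions for illustration.

```python
from typing import List, Tuple

# Names used in this post.
OBJECT_LAMBDA_ACCESS_POINTS = [
    "accessctl-s3olap-survey-results-unknown-pii",
    "admin-s3olap-call-transcripts-known-pii",
    "billing-s3olap-call-transcripts-known-pii",
    "custsupport-s3olap-call-transcripts-known-pii",
]
FOUNDATION_STACKS = [
    "s3olap-redaction-foundation",
    "s3olap-access-control-foundation",
]


def cleanup(account_id: str, s3control, cloudformation) -> List[Tuple[str, str]]:
    """Delete the Object Lambda Access Points first, then the foundation stacks.

    `s3control` and `cloudformation` are boto3-style clients, for example
    boto3.client("s3control") and boto3.client("cloudformation").
    """
    done = []
    for name in OBJECT_LAMBDA_ACCESS_POINTS:
        s3control.delete_access_point_for_object_lambda(AccountId=account_id, Name=name)
        done.append(("olap", name))
    for stack in FOUNDATION_STACKS:
        cloudformation.delete_stack(StackName=stack)
        done.append(("stack", stack))
    return done
```

The returned list records what was requested, in order, which can be handy as a simple audit trail of the cleanup run.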
Conclusion
In this post, we demonstrated how you can use S3 Object Lambda with Amazon Comprehend to detect, redact, and protect PII data. You can build your own Lambda functions and customize them further to meet your specific data protection needs and improve data value by using additional Amazon Comprehend features like entity recognition, key phrase recognition, sentiment analysis, and document classification. Also, consider Amazon Comprehend Medical as a HIPAA-eligible NLP service to analyze and extract data in a context-aware manner.
Use S3 Object Lambda throughout your AWS footprint to give you scalable and intelligent protection of data to help you mitigate data risk and manage access.
If you have any feedback about this post, please provide it in the comments section.
About the Authors
Ram Ramani joined Amazon in 2017 and is part of the core security specialist team with a deep focus on data protection and privacy. Ram’s work includes enabling customer adoption of the AWS Cloud by educating and evangelizing security best practices while the customers continue to innovate on their business. Prior to joining AWS, Ram spent 10 years working on various machine learning and security problems in the Telecom space.
Austin Quam is a Security Solutions Architect specializing in solving data security problems. Austin works with a diverse set of customers across North America, and is obsessed with helping customers achieve their business and security objectives on the AWS Cloud. Austin’s work includes security strategy, thought leadership, and detailed security design for cloud environments and workloads. Prior to joining AWS, Austin worked with several leading consulting firms serving clients across the US in many different cloud and security roles.