Data replication across data stores is a complex, multiphase process that includes assessment, schema conversion, data migration, data validation, and implementation of data access and security policies. As data is replicated across data stores, most organizations have compliance needs to protect personally identifiable information (PII) or commercially sensitive data from users that shouldn’t have access to that information.
In this post, we provide a solution to implement data masking while replicating data using AWS Database Migration Service (AWS DMS) from Amazon Aurora PostgreSQL cluster to Amazon Simple Storage Service (Amazon S3). Data masking lets you anonymize the data in a field rather than remove it altogether. You can use this solution to perform a random hash or pattern-based substitution of sensitive information from users while replicating the dataset.
AWS DMS helps you securely migrate data to AWS. AWS DMS can migrate your data to and from most widely used commercial and open-source databases. The source database remains fully operational during the migration, which minimizes application downtime. AWS DMS supports both homogenous and heterogeneous migrations, so it can continuously replicate data with high availability across your data stores.
In May 2020, AWS DMS announced the support for expression-based data transformations. In this post, we talk about the process to use SQLite expression-based data transformations to mask a dataset on the replication engine before loading it into the target data store. We also talk about some other interesting use cases where AWS DMS data transformations can help.
The solution uses SQLite expression-based data transformations in AWS DMS replication tasks to mask data fields related to Social Security numbers (SSN) during data replication. We use AWS DMS to replicate data from an Amazon Aurora PostgreSQL cluster to Amazon Simple Storage Service (Amazon S3).
The AWS DMS team will continue to update the page Using transformation rule expressions to define column content with all the supported SQLite functions for various source and target types.
The following diagram illustrates the solution architecture implemented by an AWS CloudFormation template.
Although AWS DMS masks the data before loading it into the target endpoint, the data is still pulled from the source endpoint and loaded into the AWS DMS replication instance. AWS DMS uses a replication instance to connect to your source data store, read the source data, format the data for consumption by the target data store, and load the data into the target data store. The transformation (masking) happens on the replication instance before loading into the target.
In this post, we perform the following high-level implementation steps:
Create a CloudFormation stack using the template provided in the post.
Connect to the source Aurora PostgreSQL database and populate the source table with a PII and non-PII dataset.
Start the AWS DMS task with the required transformation rules to replicate data from Aurora PostgreSQL to Amazon S3.
Query the replicated dataset in Amazon S3 and verify that records in the same column with SSN data (PII) are masked and the records without SSN data (non-PII) are not masked.
The CloudFormation template requires you to select an Amazon Elastic Compute Cloud (Amazon EC2) key pair. This key is configured on an EC2 instance that lives in the public subnet. We use this EC2 instance to connect to the Aurora cluster that lives in the private subnet. Make sure you have a key in the Region where you deploy the template. If you don’t have one, you can create a new key pair.
Create a CloudFormation stack
Choose Launch Stack to get started and provide the following parameters:
CIDR range for the VPC.
CIDR range for the first private subnet.
CIDR range for the second private subnet.
CIDR range for the public subnet.
Primary user password for the Aurora cluster. Note the password for future use to connect to the cluster.
The EC2 key pair to be configured in the EC2 instance on the public subnet. You use this instance to get to the Aurora cluster in the private subnet. Choose the value from the drop-down menu.
The image ID used to create the EC2 instance in the public subnet to be a jump box to connect to the source Aurora cluster. If you supply your image ID, the template uses it to create the EC2 instance.
AWS DMS batches the inputs from the source and loads the output to the S3 bucket. This parameter defines the frequency in which the data is loaded to the S3 bucket and is specific to Amazon S3 as the AWS DMS target endpoint.
Stack creation can take up to 15 minutes to complete. You can view the resources created on the Resources tab of the CloudFormation stack on the AWS CloudFormation console.
On the Outputs tab for the stack, note the following keys:
AuroraSourceEndpointAddress – The endpoint of the Aurora PostgreSQL database with PII data
TargetS3Bucket – The S3 bucket that stores the masked dataset as output from AWS DMS
Connect to the source database and populate the source table
We now connect the source database and populate the source table. We start by adding an inbound security group rule to allow SSH connection from your laptop to the EC2 instance in the VPC so you can connect to the Aurora database.
Search for DmsSampleSecurityGroup on the Resources tab of the CloudFormation stack and choose the link under the Physical ID.
On the Inbound rules tab, choose Edit Inbound rules.
For Type, choose SSH.
For Source, choose My IP.
Test the source endpoint connection using the AWS DMS replication instance by choosing Test connection.When the source endpoint connection is successful, it confirms that the AWS DMS replication instance can connect to the Aurora cluster to read the source dataset.We use the EC2 instance on the Resources tab of the CloudFormation stack on the AWS CloudFormation console, to connect to the Aurora cluster in the private subnet. Use the key provided in the CloudFormation template to connect to your EC2 instance.
The PostgreSQL client is required to connect to the Aurora PostgreSQL instance from the EC2 instance. Psql is the standard command line client, maintained by the PostgreSQL development group and typically distributed as part of the server installation.
Install the PostgreSQL client with the following code.
Use the Aurora PostgreSQL cluster endpoint provided on the Outputs tab of the CloudFormation template and the AuroraDBMasterUserPassword parameter noted previously to connect to the Postgres database using the psql client.
We’re now ready to insert records into the source database.
Run following commands to create the human_resources schema and address_book table. The address_book table has a column ssn_or_emp_no, which can have both SSN details (PII data) or employee numbers (non-PII data).
The following screenshot shows the dataset creation on the Aurora PostgreSQL database.
Start the replication task
Before we start the AWS DMS replication task, let’s examine the transformation mappings in the task to understand how sensitive data is replaced with hashed values.
On the AWS DMS console, navigate to the details page for dms-sample-aurora-src-to-raw-s3-tgt. You can examine the transformations set on the Mapping rules tab.
The transformation is set for the ssn_or_emp_no column of the address_book table. As mentioned earlier, ssn_or_emp_no can have both PII and non-PII data. The transformation mapping rule ensures that any SSN (PII) record in this field gets masked, while employee numbers (non-PII) are replicated as is to the target S3 bucket.
In the following transformation mapping rule, we specify the condition using the SQLite built-in glob operator to replace values that match the pattern of SSN, and use the built-in hash_sha256 function to replace the original SSN with its SHA256 hash. The transformed data is then appended as a new column, ssn_or_emp_no_transformed, to the target dataset, and the original column is dropped during replication through the remove-column transformation rule action.
We can now start the DMS replication task and wait until the replication is completed.
On the details page for the dms-sample-aurora-src-to-raw-s3-tgt task, on the Actions menu, choose Restart/Resume to start the task and wait until its status shows as Load complete, ongoing replication.
Query the dataset
To verify that the task was successful, we check that the task created files in the target S3 bucket and any SSN values are replaced with a SHA256 hash.
Use the TargetS3Bucket provided on the Outputs tab of the CloudFormation stack on the AWS CloudFormation console. On the Amazon S3 console, navigate to the S3 bucket location corresponding to the target AWS DMS endpoint and use the query with S3 Select feature to query the dataset replicated by the AWS DMS task.
The query allows you to run SQL query directly on the Parquet file replicated by AWS DMS. Select CVS format as the output.
You can validate the hash value generated by the AWS DMS task in the Aurora PostgreSQL database by using the encode function and retrieving SSN details. See the following query:
The following screenshot shows our output.
Additional scenarios of data masking using AWS DMS
Based on your organization’s policies, you could have various requirements to deal with PII datasets during replication across data stores. For instance, you may want to completely remove PII columns from the replication target or substitute PII records with a defined string instead of a hash value. It’s also possible that all records in the field have only PII data, so you don’t need pattern matching because all records in the field must be masked.
In the following section, we provide sample AWS DMS transformation mapping rules for the described scenarios. You can modify the transformation mapping rules of the replication task created through the CloudFormation template with any of the following sample AWS DMS templates, based on your requirements.
1 – Remove the PII field from the target completely
The following transformation mapping rule locates the ssn_or_emp_no column and completely removes it during ongoing data replication to the S3 target:
The following screenshot confirms that the transformation removes the ssn_or_emp_no column from the replicated target S3 dataset.
2 – Substitute PII records with a user-defined string
The following AWS DMS transformation mapping shown locates SSN data in the column ssn_or_emp_no using the glob SQLite operator and substitutes SSN records with the string Redacted-SSN:
The following screenshot shows that the new file has SSN records (PII data) replaced by the string Redacted-SSN, and other records with employee numbers (non-PII data) are not changed.
3 – Mask all records of the field with only PII records (not a combination of PII and non-PII)
For this use case, we don’t need the glob operator to perform any pattern matching, because all records under the column ssn_or_emp_no are expected to be masked. We can use the hash_sha256 operator or add a constant value (like you did in the previous step) to mask all records in the column ssn_or_emp_no. See the following code:
The following screenshot shows that the transformation replaces all records in the ssn_or_emp_no column with their corresponding hash values.
To stop incurring costs from the resources you created during this post, stop your AWS DMS replication task and delete your CloudFormation stack. You must stop the replication task first in order to delete the stack successfully. Delete all the files created in target S3 bucket.
Minimizing proliferation of sensitive PII helps keep data more secure and reduces the risk of unauthorized access to sensitive information during data replication.
In this post, we showed how you can use AWS DMS transformation with SQLite expressions to mask data during replication. We presented various scenarios associated with masking PII only or a combination of PII and non-PII datasets. The use cases featured in the post include removing PII columns, masking all records in PII-only datasets, pattern matching, and substituting PII data from datasets with a combination of PII and non-PII records with either a hash value or a user-defined string.
As always, AWS welcomes feedback. If you have any comments or questions on this post, please share them in the comments.
About the authors
Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with the customers on their use cases, architects a solution to solve their business problems and helps the customers build an scalable prototype. Prior to his journey in AWS, Vishal helped customers implement BI, DW and DataLake projects in US and Australia.
Devika Singh is a Senior Solutions Architect at Amazon Web Services. Devika helps customers architect and build database and data analytics solutions to accelerate their path to production as part of AWS Data Lab. She has expertise in database migrations to AWS and provides technical guidance on database and analytics services, helping customers improve value of their solutions with AWS.
Alex Tarasov is a Startup Solutions Architect at AWS. He works with UK Startups helping them to architect their workloads using AWS services adhering to best practises. Before joining AWS Alex was working in UK Startup building scalable data pipelines and applications.
Read MoreAWS Database Blog