Sunday, April 28, 2024

Filter delete operations with AWS DMS

AWS Database Migration Service (AWS DMS) is a fully managed service that helps you migrate databases to AWS quickly and securely. Every customer’s use cases are unique; you can use AWS DMS not only for a one-time data migration solution but also to replicate the data as per the requirements of your downstream applications.

In this post, we demonstrate two options to configure AWS DMS to filter deletes when replicating data to the target database. The first strategy uses Amazon Simple Storage Service (Amazon S3), and the second strategy uses AWS DMS transformation rules to flag delete operations during replication. This configuration allows you to selectively retain deleted data in the target database for auditing purposes, even when the source undergoes regular cleanup operations.

Use Amazon S3 as a staging environment

One way to filter deletes with AWS DMS is to use Amazon S3 as a staging environment: AWS DMS replicates inserts and updates from the source database to an S3 bucket and ignores deletes. You can use this approach with any supported source and target endpoints in AWS DMS. Review the limitations for the chosen endpoints before starting the migration.

As shown in the following diagram, Amazon S3 acts as an intermediate system that filters out the deletes from applying to the final target by using the CdcInsertsAndUpdates endpoint setting available with Amazon S3 as target.

AWS Direct Connect is an optional but recommended component for improved security. AWS Direct Connect is a networking service that provides an alternative to using the internet to connect to AWS. Using AWS Direct Connect, data that would have previously been transported over the internet is delivered through a private network connection between your facilities and AWS.

When you use the cdcInsertsAndUpdates S3 target endpoint setting, an additional first field is added, which indicates whether the row was inserted (I) or updated (U) at the source database. How AWS DMS creates and sets this field depends on the migration task type and the settings of includeOpForFullLoad, cdcInsertsOnly, and cdcInsertsAndUpdates.

If our AWS DMS task performs both a full load and change data capture (CDC), we must add the includeOpForFullLoad parameter to our target S3 endpoint settings. AWS DMS creates an additional field in our load files, which ensures that the columns in the load and CDC files are consistent.

Without this parameter, the load and CDC files will have different numbers of columns, which can cause the following error if you try to use the data from both files together:

[SOURCE_UNLOAD   ]E:  Failed to write record id: 2, Number of values: 6 is not equal to number of columns: 5. [1020417]  (file_unload.c:477)
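For reference, both settings live on the S3 target endpoint. A sketch of the corresponding S3 endpoint settings might look like the following (the role ARN and bucket name are placeholders for your own values):

```json
{
  "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-access-role",
  "BucketName": "my-dms-staging-bucket",
  "IncludeOpForFullLoad": true,
  "CdcInsertsAndUpdates": true
}
```

With IncludeOpForFullLoad set to true, the full-load files carry the same leading operation field as the CDC files, so the column counts of the two file types match.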

For example, let’s say we have the following data in our source table:

+----+----------------+-----+-------------------+-------------------------------+
| id | name           | age | email             | address                       |
+----+----------------+-----+-------------------+-------------------------------+
|  1 | John Doe       |  30 | [email protected] | 123 Any Street, Any Town, USA |
|  2 | Jane Smith     |  25 | [email protected] | 456 Any Street, Any Town, USA |
|  3 | Carlos Salazar |  40 | [email protected] | 789 Any Street, Any Town, USA |
+----+----------------+-----+-------------------+-------------------------------+

Now, let’s say we perform the following DML operations on this table:

-- Update a row
UPDATE my_table SET age = 34 WHERE id = 1;
Query OK, 1 row affected (0.090 sec)
Rows matched: 1  Changed: 1  Warnings: 0

-- Delete a row
DELETE FROM my_table WHERE id = 2;
Query OK, 1 row affected (0.088 sec)

-- Insert another row
INSERT INTO my_table (id, name, age, email, address) VALUES (5, 'Mary Major', 35, '[email protected]', '231 Main Street, Anytown, USA');

After performing these operations, the data present in our table is as follows:

+----+----------------+-----+-------------------+-------------------------------+
| id | name           | age | email             | address                       |
+----+----------------+-----+-------------------+-------------------------------+
|  1 | John Doe       |  34 | [email protected] | 123 Any Street, Any Town, USA |
|  3 | Carlos Salazar |  40 | [email protected] | 789 Any Street, Any Town, USA |
|  5 | Mary Major     |  35 | [email protected] | 231 Main Street, Anytown, USA |
+----+----------------+-----+-------------------+-------------------------------+

The following is the AWS DMS output in the S3 staging bucket when the cdcInsertsAndUpdates parameter is used:

U 1 John Doe 34 [email protected] 123 Any Street, Any Town, USA

I 5 Mary Major 35 [email protected] 231 Main Street, Anytown, USA

Only the insert and the update were captured; the delete was filtered out. The cdcInsertsAndUpdates parameter tells AWS DMS to capture only inserted and updated rows and to drop deleted rows from the change data.
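The filtering that the cdcInsertsAndUpdates setting performs can be modeled in a few lines of Python (a simplified sketch of the behavior, not the AWS DMS implementation):

```python
# Simplified model of cdcInsertsAndUpdates: each change record carries its
# DML operation; deletes are dropped, and the surviving records keep their
# I/U flag as the first field of the CDC file written to Amazon S3.
def filter_cdc_changes(changes):
    """Keep only inserts (I) and updates (U), dropping deletes (D)."""
    return [c for c in changes if c["op"] in ("I", "U")]

changes = [
    {"op": "U", "id": 1, "name": "John Doe", "age": 34},
    {"op": "D", "id": 2, "name": "Jane Smith", "age": 25},
    {"op": "I", "id": 5, "name": "Mary Major", "age": 35},
]

for c in filter_cdc_changes(changes):
    print(c["op"], c["id"], c["name"], c["age"])
```

Only the rows flagged U and I are printed, mirroring the two lines that appear in the staging bucket above.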

Now we use this intermediate S3 bucket as a source and create another AWS DMS task between Amazon S3 and the target database.

When configuring Amazon S3 as source, we also need to provide a JSON file with the external table definition so that AWS DMS can replicate the data correctly. The following is the example external table definition that we use:

{
  "TableCount": "1",
  "Tables": [
    {
      "TableName": "my_table",
      "TablePath": "test/my_table",
      "TableOwner": "test",
      "TableColumns": [
        {
          "ColumnName": "operation_indicator",
          "ColumnType": "STRING",
          "ColumnNullable": "false",
          "ColumnLength": "10"
        },
        {
          "ColumnName": "id",
          "ColumnType": "INT8",
          "ColumnNullable": "false",
          "ColumnIsPk": "true"
        },
        {
          "ColumnName": "name",
          "ColumnType": "STRING",
          "ColumnLength": "20"
        },
        {
          "ColumnName": "age",
          "ColumnType": "INT8"
        },
        {
          "ColumnName": "email",
          "ColumnType": "STRING",
          "ColumnLength": "100"
        },
        {
          "ColumnName": "address",
          "ColumnType": "STRING",
          "ColumnLength": "50"
        }
      ],
      "TableColumnsTotal": "6"
    }
  ]
}
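Because AWS DMS fails the task when the table definition and the data files disagree, a quick consistency check of the JSON before uploading it can save a debugging cycle. The helper below is our own sketch (not part of any AWS tooling); it verifies that the declared counts match what is actually defined:

```python
import json

def check_table_definition(doc):
    """Verify that the declared counts in a DMS external table definition
    match the actual number of tables and columns."""
    errors = []
    if int(doc["TableCount"]) != len(doc["Tables"]):
        errors.append("TableCount does not match number of Tables")
    for table in doc["Tables"]:
        declared = int(table["TableColumnsTotal"])
        actual = len(table["TableColumns"])
        if declared != actual:
            errors.append(
                f"{table['TableName']}: TableColumnsTotal is {declared}, "
                f"but {actual} columns are defined"
            )
    return errors

# Abbreviated version of the external table definition shown above.
definition = json.loads("""
{
  "TableCount": "1",
  "Tables": [
    {
      "TableName": "my_table",
      "TableColumnsTotal": "6",
      "TableColumns": [
        {"ColumnName": "operation_indicator", "ColumnType": "STRING", "ColumnLength": "10"},
        {"ColumnName": "id", "ColumnType": "INT8", "ColumnIsPk": "true"},
        {"ColumnName": "name", "ColumnType": "STRING", "ColumnLength": "20"},
        {"ColumnName": "age", "ColumnType": "INT8"},
        {"ColumnName": "email", "ColumnType": "STRING", "ColumnLength": "100"},
        {"ColumnName": "address", "ColumnType": "STRING", "ColumnLength": "50"}
      ]
    }
  ]
}
""")

print(check_table_definition(definition))  # [] means the counts are consistent
```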

The target database shows that the deletes were not applied, and the record with ID 2 still exists:

+---------------------+----+----------------+-----+-------------------+-------------------------------+
| operation_indicator | id | name           | age | email             | address                       |
+---------------------+----+----------------+-----+-------------------+-------------------------------+
| U                   |  1 | John Doe       |  34 | [email protected] | 123 Any Street, Any Town, USA |
| I                   |  2 | Jane Smith     |  25 | [email protected] | 456 Any Street, Any Town, USA |
| I                   |  3 | Carlos Salazar |  40 | [email protected] | 789 Any Street, Any Town, USA |
| I                   |  5 | Mary Major     |  35 | [email protected] | 231 Main Street, Anytown, USA |
+---------------------+----+----------------+-----+-------------------+-------------------------------+

By using Amazon S3 as a staging environment to filter DML operations, you can selectively capture only the changes that you want to replicate. For example, to replicate only inserts, use the cdcInsertsOnly setting instead. This approach lets you retain deleted data on the target for auditing purposes even after the records are removed from the source.

Use an AWS DMS transformation rule

As shown in the following diagram, AWS DMS migrates and replicates the data from the on-premises data center, and a built-in transformation rule flags the delete operations instead of applying them.

As in the first option, AWS Direct Connect is an optional but recommended component for improved security.

The operation_indicator function flags each target record with a DML operation identifier in an extra column of the target table. To add this column, include the following transformation rule in the table mappings section of the AWS DMS task. With this rule in place, records in the target table are flagged with the user-provided values specified in the expression parameter. A prerequisite for using this functionality is that both the source and target tables must have a primary key.

{
  "rule-type": "transformation",
  "rule-id": "2",
  "rule-name": "2",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "%",
    "table-name": "%"
  },
  "rule-action": "add-column",
  "value": "Operation",
  "expression": "operation_indicator('D', 'U', 'I')",
  "data-type": {
    "type": "string",
    "length": 50
  }
}
This allows you to keep the record alive in the target even though it has been deleted in the source: a record deleted at the source is retained in the target and flagged with the user-provided value D, as specified in the transformation rule.
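The effect of the operation_indicator expression amounts to a soft delete: instead of removing the row, AWS DMS keeps it and sets the flag column. The following Python sketch is a simplified in-memory model of that behavior (not the AWS DMS implementation), using the column name Operation from the rule's value parameter:

```python
def apply_change(target, change):
    """Apply one CDC change to an in-memory model of the target table
    (keyed by primary key), setting the extra 'Operation' column the way
    operation_indicator('D', 'U', 'I') flags each row."""
    row_id = change["id"]
    op = change["op"]
    if op == "DELETE":
        # Soft delete: the row stays in the target, flagged 'D'.
        target[row_id]["Operation"] = "D"
    elif op == "UPDATE":
        target[row_id].update(change["values"])
        target[row_id]["Operation"] = "U"
    else:  # INSERT
        target[row_id] = dict(change["values"], Operation="I")

# Rows migrated during the full load arrive without a flag.
target = {
    1: {"name": "John Doe", "age": 30, "Operation": ""},
    2: {"name": "Jane Smith", "age": 25, "Operation": ""},
    3: {"name": "Carlos Salazar", "age": 40, "Operation": ""},
}

apply_change(target, {"op": "UPDATE", "id": 1, "values": {"age": 34}})
apply_change(target, {"op": "DELETE", "id": 2})
apply_change(target, {"op": "INSERT", "id": 5, "values": {"name": "Mary Major", "age": 35}})

print(target[2])  # {'name': 'Jane Smith', 'age': 25, 'Operation': 'D'}
```

The deleted row survives in the model with its flag set to D, which is exactly what the target table shows in the test below.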

To test this functionality, we use the same table metadata and the DML transactions mentioned in the beginning of the post. The final output at the target table with the preceding activities is as follows:

+----+----------------+-----+-------------------+-------------------------------+-----------+
| id | name           | age | email             | address                       | Operation |
+----+----------------+-----+-------------------+-------------------------------+-----------+
|  3 | Carlos Salazar |  40 | [email protected] | 789 Any Street, Any Town, USA |           |
|  2 | Jane Smith     |  25 | [email protected] | 456 Any Street, Any Town, USA | D         |
|  1 | John Doe       |  34 | [email protected] | 123 Any Street, Any Town, USA | U         |
|  5 | Mary Major     |  35 | [email protected] | 231 Main Street, Anytown, USA | I         |
+----+----------------+-----+-------------------+-------------------------------+-----------+

The record with ID 2 is still intact in the target, with a value of D in the Operation column, as specified in the transformation rule. We can also view the table contents without the deleted rows by running a query or creating a view with the condition Operation <> 'D'.
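On a relational target, such a view is a one-liner. The sketch below uses SQLite purely for illustration (any SQL target works the same way; the table and column names match the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE my_table (
        id INTEGER PRIMARY KEY,
        name TEXT,
        Operation TEXT
    )
""")
conn.executemany(
    "INSERT INTO my_table (id, name, Operation) VALUES (?, ?, ?)",
    [
        (1, "John Doe", "U"),
        (2, "Jane Smith", "D"),   # deleted at the source, retained here
        (3, "Carlos Salazar", ""),
        (5, "Mary Major", "I"),
    ],
)

# A view that mirrors the source table, i.e. without the deleted rows.
conn.execute("""
    CREATE VIEW my_table_active AS
    SELECT id, name FROM my_table
    WHERE Operation <> 'D'
""")

print(conn.execute("SELECT id FROM my_table_active ORDER BY id").fetchall())
# [(1,), (3,), (5,)]
```

Queries against the view hide the soft-deleted row with ID 2, while the base table keeps it for auditing.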

Conclusion

In this post, we demonstrated two approaches to filtering out delete operations during replication by using the built-in functionality of AWS DMS. Depending on the data requirements at the target database, you can use either approach. Using Amazon S3 as an intermediate target adds the cost of staging storage, but it is useful when multiple consumers need to process different subsets of the data from Amazon S3. The operation_indicator transformation rule is a good option when there is a single source-to-target migration or replication path.

Check out Database Migration—What Do You Need to Know Before You Start? to get started. Also review the recommended best practices associated with AWS DMS and other S3 endpoint settings to address various use cases.

About the authors

Deepthi Saina is a Data Architect at AWS based in the UK. She specializes in Amazon RDS, Amazon Aurora, and AWS DMS, and is a subject matter expert in AWS DMS and Amazon RDS. In her role, she helps customers across EMEA migrate, modernize, and optimize database solutions on AWS.

Vivekananda Mohapatra is a Lead Consultant with the Amazon Web Services ProServe team. He has deep expertise in database development and administration on Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL, and Amazon Redshift databases. He is also a subject matter expert in AWS DMS. He works closely with customers to help migrate and modernize their databases and applications to AWS on a daily basis.

Read more on the AWS Database Blog.