
Insights into AWS DMS resiliency and recovery scenarios with mitigations – Part 1

When you’re considering migrating from on premises to the AWS Cloud, the spotlight is firmly on database transitions. After you migrate the schemas and business logic, data migration strategies are determined by factors such as available downtime, database size, and data complexity. Homogeneous migrations, like those within Oracle, often rely on tools such as Oracle Data Pump or RMAN. However, when migrating between different databases, like from Oracle to Amazon Relational Database Service (Amazon RDS) for PostgreSQL, the go-to solutions typically involve AWS Database Migration Service (AWS DMS), Oracle GoldenGate, or third-party utilities like Shareplex.

Choosing the right cutover strategy is pivotal. Whether it’s the all-at-once “big bang” or the more gradual phased approach, each comes with its nuances. Extended replications in the phased method can span weeks to months, necessitating rigorous resilience planning against potential hitches. It’s imperative to communicate with stakeholders: although replication methods aren’t foolproof, a solid mitigation plan paves the way for a seamless migration. Deciding on the approach also entails weighing the inherent risks of the “big bang” method against the controlled progression of phased migrations, so carefully evaluate the risk criteria of each cutover approach.

In the complex terrain of heterogeneous database migrations, AWS DMS stands out as the preferred choice for many. Its adaptability, integration capabilities, and the backing of AWS make it a trusted tool for bridging database disparities. However, like all technologies, AWS DMS isn’t immune to challenges during the migration process. Despite its robustness, unforeseen issues can emerge.

In this two-part series, we examine common failure causes for components like the source database, replication instance, migration task, and target database. We delve into investigative steps, provide error log samples, and offer mitigation guidance. Although this series primarily focuses on migrating from Oracle to Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition, the discussed scenarios can apply to other database combinations. In this post, we address failures related to the source, network, task, and replication instance. In Part 2, we cover validation-related issues, target database failures, and error handling strategies.

AWS DMS failure use cases

A migration setup using AWS DMS comprises several components: the source database, network infrastructure, replication instance, replication task, and target database. The source and target components are configured as endpoints, and the AWS DMS task is the workflow that migrates data between the source and target. You can create different types of AWS DMS tasks depending on the ongoing replication requirements, such as full load, change data capture (CDC) only, or full load plus CDC. You can split the source database schema and tables across multiple tasks depending on the characteristics of the data, such as tables with a primary key, tables without a primary key, large objects (LOBs), and so on. Each component in this setup is susceptible to failures; we discuss the reasons for these failures later in this post.

Recommended testing approach for failure scenarios

It’s essential to establish a comprehensive testing strategy to identify and address potential failure scenarios during database migrations with AWS DMS. The test environment should match the production environment in terms of both volumetric and representative datasets. You should ideally reproduce each of the discussed failure scenarios in the test environment to develop the solution and mitigations.

For example, the failure of a task might lead to a full data load for a table or several tables, causing an outage. You need to evaluate the following failure scenarios and design the mitigation in alignment with Service Level Agreements (SLAs):

Source database failure
Network failure
AWS DMS failures (replication instances and tasks)
Target database failure

You should build a comprehensive testing strategy with various test cases, including expected and actual outcomes. Compare these with the business SLAs to obtain agreement from business stakeholders when designing your data migration strategy.

Resources involved

The resources used in this post include a source database, AWS DMS replication instance, AWS DMS endpoints, AWS DMS replication jobs, target database, and your network.

The following diagram illustrates the AWS DMS data migration architecture and its various components.

Source database failure

The source database is the most challenging component for the AWS team to investigate when a failure occurs, because the team has limited or no access to the customer’s database environment. The database logs, ASM logs, and AWS DMS logs serve as the primary sources of information for investigation. The scenarios discussed in this section represent the most common failures for the Oracle source database.

The following diagram illustrates the source database failure scenarios, such as source database disruption due to underlying host failure.

Single instance failure or RAC node eviction

AWS DMS connects to a single Oracle instance, even in a RAC configuration, because AWS DMS doesn’t support SCAN IPs. If that instance fails, or if one of the instances in a RAC cluster restarts or gets evicted, AWS DMS stops reading the live data and archive logs necessary for replication.

The following is an example error log:

00017657: 2021-06-11T15:24:45:438334 [SOURCE_CAPTURE ]D: OCI error 'ORA-03135: connection lost contact
Process ID: 15894
Session ID: 668 Serial number: 339' [1022307] (oradcdc_io.c:1050)

To mitigate this issue, restart the Oracle instance and ensure it comes back up cleanly. For a RAC setup, configure the source endpoint to use multiple nodes of the cluster through Amazon Route 53, as in the following sketch.
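The following is a minimal sketch of that Route 53 approach using the AWS SDK for Python; the hosted zone ID, record name, and node IP addresses are hypothetical placeholders. It creates a simple A record in a private hosted zone that resolves to several RAC node IPs, and you would then set the AWS DMS source endpoint server name to that record instead of a single host.

import boto3

route53 = boto3.client('route53')

# Hypothetical private hosted zone, record name, and RAC node IPs.
rac_node_ips = ['10.0.1.11', '10.0.1.12']

route53.change_resource_record_sets(
    HostedZoneId='Z0123456789EXAMPLE',
    ChangeBatch={
        'Comment': 'DNS name for the AWS DMS Oracle source endpoint',
        'Changes': [{
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': 'oracle-rac.migration.internal',
                'Type': 'A',
                'TTL': 60,
                # Route 53 returns all values; the client can retry against
                # another node if the first IP is unreachable.
                'ResourceRecords': [{'Value': ip} for ip in rac_node_ips],
            },
        }],
    },
)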

ASM restart or crash due to known bug

In this scenario, AWS DMS frequently disconnects from the source database, pausing the reading of changes and increasing latency. Due to a known ASM bug, the Oracle ASM instance may restart or crash.

The following code snippet is an error from the AWS DMS logs:

00017657: 2021-06-11T15:24:45:438334 [SOURCE_CAPTURE ]D: OCI error 'ORA-03135: connection lost contact
Process ID: 15894
Session ID: 668 Serial number: 339' [1022307] (oradcdc_io.c:1050)

The following code snippet is an error from the ASM logs:

ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","unknown object","sga heap(1,1)","ASM extent pointer array")
ASMB (ospid: 4913): terminating the instance due to error 4031

SOURCE_CAPTURE ]D: OCI error 'ORA-03135: connection lost contact
Process ID: 17162
Session ID: 1005 Serial number: 99' [1022307] (oradcdc_io.c:1050)
00027957: 2021-06-11T14:21:24:350513 [SOURCE_CAPTURE ]D: Failed to read from ASM file with thread id '1' from block number '1035028',

A temporary workaround is to restart ASM with a memory flushing script.

Unavailability of redo or archive logs

This error may be caused by the following:

The archive logs are missing
The archive logs are available, but AWS DMS is unable to read them from the additional archive destinations

The following code snippet is an error from the AWS DMS logs:

[SOURCE_CAPTURE ]E: Redo archive file with the sequence XXXX does not exist, thread 1 [XXXX] Archived Redo log is not active (oradcdc_thread.c:307)
[TASK_MANAGER ]E: Task error notification received from subtask 0, thread 0 [XXXX] Endpoint initialization failed. (replicationtask.c:XXX)

If the necessary archives are missing because of the retention policy, restore the archives from available backups, whether from disks or tapes. This restore allows AWS DMS to continue the replication.

You can also set up additional archive destinations using a parameter. Configure an additional archive destination to keep a copy of the redo and archive files on a shared file system. This file system should be accessible by all nodes of the RAC cluster. Then set the following extra connection attributes (ECAs) on the source endpoint:

archivedLogDestId=1;
ExtraArchivedLogDestIds=[3,4];
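The following is a minimal sketch of applying these attributes with the AWS SDK for Python; the endpoint ARN is a hypothetical placeholder. ExtraConnectionAttributes is supplied as a single semicolon-separated string, so include any attributes you already rely on.

import boto3

dms = boto3.client('dms')

dms.modify_endpoint(
    EndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLE',  # hypothetical source endpoint
    # Read from the primary archive destination and also from destinations 3 and 4.
    ExtraConnectionAttributes='archivedLogDestId=1;ExtraArchivedLogDestIds=[3,4];',
)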

For information about other issues, such as if AWS DMS stops replicating data changes due to reset logs when the standby database is transitioned to the primary database during a failover of the primary database, refer to How to handle AWS DMS replication when used with Oracle database in fail-over scenarios.

UNDO and TEMP tablespaces

When using Oracle as a source for AWS DMS, sometimes the full load task fails due to a small tablespace size, resulting in the following error:

ORA-1652: unable to extend temp segment by 8 in tablespace

To address this issue, follow Oracle best practices and add more space to the TEMP tablespace. If you see a similar issue with the UNDO tablespace, add more space there as well.

Network failure

Sometimes, AWS DMS tasks or endpoint connectivity fails because of network-related issues. These problems can stem from network bandwidth, network packet loss, or incorrect security group configurations between the source or target and the AWS DMS replication instance.

Troubleshooting network failures becomes complex due to strict security policies and resource governance. Although there isn’t a straightforward troubleshooting method, basic networking tools and utilities can help pinpoint the issue. You can also consider configuring the AWS DMS diagnostic support AMI to perform the network testing at various source and target data stores. Initial investigations at the network component level can identify obvious network problems. However, for more complex issues related to firewall rules, configurations, and network devices, consulting with network SMEs and DBAs is recommended to isolate the problem.

The following diagram illustrates a network failure scenario.

Network disconnections

AWS DMS may experience frequent disconnections from the source database, pausing the reading of changes and increasing latency. The client (AWS DMS) log data lists ORA-03113 and ORA-03135 as the encountered errors.

The following code snippet is an example error message:

"ORA-03113: end-of-file on communication channel"

2021-06-22T07:20:50 [SOURCE_UNLOAD ]E: ORA-03113: end-of-file on communication
channel Process ID: 20346 Session ID: 203 Serial number: 945 [1022305]
(oracle_endpoint_unload.c:235)
2021-06-22T07:23:01 [SOURCE_UNLOAD ]E: ORA-03135: connection lost contact Process ID:
21030 Session ID: 1557 Serial number: 19 [1022307] (oracle_endpoint_unload.c:235)

To address this issue, consider the following:

Adjust the sqlnet.expire_time value in the sqlnet.ora file – Changing this to an appropriate value could reduce or eradicate disconnection errors during longer scans, especially when including LOB values.
Adjust the MTU size – To find the Maximum Transmission Unit (MTU) size, you can use the tracepath command. The MTU size should be the same across the entire network from source to destination host. When the MTU size is different on the gateway device, update the MTU value in the AWS DMS settings. Aligning these values will increase stability. Note that this value is host-specific and, if the AWS DMS instance is changed, this value needs to be set again.

Network packet loss

AWS DMS may experience frequent disconnections from the source database, pausing the reading of changes and increasing latency. The following code snippet is an example error message:

[SOURCE_UNLOAD ]E: ORA-03135: connection lost contact Process ID: 21030 Session ID: 1557 Serial number: 19 [1022307] (oracle_endpoint_unload.c:235)

To optimize network throughput for the tasks, add the extra connection attribute (ECA) asmUsePLSQLArray=TRUE to the source endpoint. This ECA instructs the AWS DMS task to cache 50 reads before transmitting them over the network, resulting in fewer round trips.

AWS DMS replication instance failure

The AWS DMS replication instance performs the actual data movement between the source and target. The replication instance plays a major role because all replication activities originate from it. However, the replication instance can fail in specific situations:

Disk space is full
AWS DMS is set up in a single Availability Zone
There’s contention on CPU or memory

The following diagram illustrates a replication instance failure scenario.

Cases have occurred where old, unused tasks led to increased storage usage. There should be a default log cleanup or maintenance process to manage log storage space. When detailed debugging is enabled on heavy transaction workloads, the logs can fill the available storage. To address this, always set an alarm on storage and take the necessary action when it reaches a threshold. You can enable the low-storage event in an event subscription or, if preferred, set an alarm using Amazon CloudWatch metrics, as in the following sketch.
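The following is a minimal sketch of such an alarm using the AWS SDK for Python; the replication instance identifier and SNS topic ARN are hypothetical placeholders, and the FreeStorageSpace metric is reported in bytes.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when free storage on the replication instance drops below roughly 10 GiB.
cloudwatch.put_metric_alarm(
    AlarmName='dms-replication-instance-low-storage',
    Namespace='AWS/DMS',
    MetricName='FreeStorageSpace',
    Dimensions=[{'Name': 'ReplicationInstanceIdentifier',
                 'Value': 'my-dms-instance'}],   # hypothetical identifier
    Statistic='Average',
    Period=300,
    EvaluationPeriods=1,
    Threshold=10 * 1024 ** 3,                    # bytes
    ComparisonOperator='LessThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:dms-alerts'],  # hypothetical SNS topic
)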

If a failure occurs due to full storage, consider the following strategies:

Examine the task log file size. If possible, delete it to make space available. If not, disable any unnecessary additional logging. Use the AWS DMS console for log management to immediately free up space. Although logs will remain in CloudWatch, they’ll be removed from the replication instance. For more details, refer to Introducing log management in AWS Database Migration Service.
Modify the replication instance to add more storage (see the sketch after this list).
As a last resort, consider creating a new replication instance and moving the existing task to the new instance.
Set a CloudWatch alarm for metrics on CPU and memory. This way, if issues arise, you’ll be alerted. To prevent contention on the replication instance, select the right type of instance for your replication workloads. Also, reviewing replication instance forensics best practices can provide insights into the impact of AWS DMS tasks on replication instance resources.
Always enable Multi-AZ. In the event of an Availability Zone failure, a replication instance from another Availability Zone will step in and continue data replication.
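The following is a minimal sketch of adding storage and enabling Multi-AZ on an existing replication instance with the AWS SDK for Python; the instance ARN and storage size are hypothetical placeholders.

import boto3

dms = boto3.client('dms')

dms.modify_replication_instance(
    ReplicationInstanceArn='arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE',  # hypothetical
    AllocatedStorage=200,   # new storage size in GiB
    MultiAZ=True,           # add a standby replication instance in another Availability Zone
    ApplyImmediately=True,
)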

AWS DMS task failure

A replication task is the component of AWS DMS that reads data from the source database, formats it for consumption by the target data store, and loads it into the target database. Replication tasks can be used for three migration types:

Migration of existing data (full load)
Ongoing replication (CDC)
Migration of existing data and ongoing replication (full load and CDC)

The following diagram illustrates an AWS DMS task failure scenario.

During data replication, you may come across various failure scenarios during different migration types.

The following scenarios are common in full load tasks:

Data duplication
Connection timeout
LOB truncation
UNDO or TEMP tablespaces on the source database

The following are common in CDC tasks:

Table errors
Swap files on the AWS DMS replication instance
Update or delete failed because row doesn’t exist
DDL changes on the source or target database
Auto generated columns

The following are common for full load and CDC tasks:

Tasks running concurrently
Insert failed due to existing data on the target

Full load tasks

This section discusses some of the common issues observed during a full load migration (migrating existing data).

Data duplication

Sometimes, we encounter tables without primary keys in older legacy databases. During migration, we redesign these target tables to include primary keys. Yet when running a full load, the replication task might fail due to duplicate data in the source tables. When the target table preparation mode is set to DO_NOTHING and duplicate data exists in the target, the task could fail.

The following screenshot shows the error in the table statistics on the AWS DMS console.

The following code snippet shows the error in the AWS DMS logs:

2023-06-25T04:57:38 [TARGET_LOAD ]E: Command failed to load data with exit error code 0. Please check target database logs for more information. [1020403] (csv_target.c:1060)

The following code snippet shows the error in the database log files:

2023-06-25 04:57:38 UTC:172.31.27.41(33858):postgres@ttp:[18108]:ERROR: duplicate key value violates unique constraint "test_pkey"

To mitigate this issue, consider the following:

If duplicates exist in the source table, remove them before reloading the data.
If you can’t remove duplicates from the source table, ensure the target table has the same structure. However, you should include these types of tables only in full load tasks.
If no duplicate data exists in the source tables but duplicates exist on the target due to an earlier failed full load job, enable TRUNCATE_BEFORE_LOAD in the target table preparation mode (see the sketch after this list).
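The following is a minimal sketch of switching the target table preparation mode through the task settings with the AWS SDK for Python; the task ARN is a hypothetical placeholder. Only the settings you pass are changed, and the task must be stopped before it can be modified.

import json
import boto3

dms = boto3.client('dms')

# Partial task settings: only the keys you supply are changed.
settings = {
    'FullLoadSettings': {
        'TargetTablePrepMode': 'TRUNCATE_BEFORE_LOAD'
    }
}

# Stop the task before modifying its settings, then resume or restart it.
dms.modify_replication_task(
    ReplicationTaskArn='arn:aws:dms:us-east-1:123456789012:task:EXAMPLE',  # hypothetical ARN
    ReplicationTaskSettings=json.dumps(settings),
)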

Connection timeout

The replication task handles data replication. It connects to the source, reads the data, and writes it to the target table. If a process locks the target table during writing, the replication task will wait. If this wait lasts more than 60 seconds, the task times out.

This could be caused by the following:

The replication task attempts a DML/DDL operation on the target table, and it takes an unusually long time to finish.
If the target table has a lock, the AWS DMS task will cancel the statement. If a full load task encounters this error, the table load fails, and table statistics will display an error.

The following code snippet shows the error in the AWS DMS logs:

[TARGET_LOAD ]E: RetCode: SQL_ERROR SqlState: 57014 NativeError: 1 Message: ERROR: canceling statement due to
statement timeout; Error while executing the query [1022502]

To mitigate this issue, adjust the ExecuteTimeout ECA on the target endpoint to an acceptable value; a larger value holds off cancellation for a longer duration. See the following code:

ExecuteTimeout=<value in seconds>

This attribute sets the client statement timeout for the PostgreSQL instance, in seconds. The default value is 60 seconds.

Additionally, check whether any query is holding a lock on the target table using a locking query. If the locks are legitimate, wait for the query to complete, or cancel the query and restart the AWS DMS replication task, excluding the tables that have already been fully loaded. See the following code:

Locking query:

SELECT
    activity.pid,
    activity.usename,
    activity.query,
    blocking.pid AS blocking_id,
    blocking.query AS blocking_query
FROM pg_stat_activity AS activity
JOIN pg_stat_activity AS blocking
    ON blocking.pid = ANY(pg_blocking_pids(activity.pid));

The connection can also be interrupted between the source and the AWS DMS replication instance by either a firewall or general network issues. For more details, refer to the previous section in this post regarding network failure.

LOB truncation

During a full load, if tables have LOB columns and you select Limited LOB mode, and the LOB size of a column exceeds the maximum LOB size set in the task settings, the data is truncated. By default, AWS DMS logs the LOB truncation and continues. However, if you overlook these warnings in the logs, you can end up with truncated data. For more information about the data types AWS DMS supports for LOB columns, refer to Data types for AWS Database Migration Service.

Let’s look at an example of the LOB truncation issue. In the following code, the value is initially less than 1 KB, and the replication task setting for the LOB size is set to 1 KB:

DBMS_LOB.GETLENGTH(BLOG)/1024
—————————–
.265625

On the AWS DMS console, you can set the maximum LOB size using Limited LOB mode, as shown in the following screenshot. For more information, refer to Setting LOB support for source databases in an AWS DMS task.

The following screenshot shows the AWS DMS task status on the table statistics page.

The following code snippet shows the warning from the AWS DMS logs:

2023-06-25T07:13:21 [SOURCE_UNLOAD ]W: The value of column 'BLOG' in table 'JST.TEST' was truncated to length 1024 (oracle_endpoint_utils.c:2838)

The table load completed successfully, but AWS DMS logged a warning for the LOB column.

After investigating the task log, we recheck the LOB size in the source table and observe a size mismatch caused by the truncation of data:

MAX(DBMS_LOB.GETLENGTH(BLOG))/1024
———————————-
2.32421875

To mitigate this, add the ECA FailTasksOnLobTruncation=true in the source endpoint. This attribute causes a task to fail if the actual size of a LOB column is greater than the specified size and avoids truncating the data.

The following code shows an example error message:

Last Error ; LOB data was truncated.; Error executing source loop; Stream component failed

You can also adopt an alternate approach of running the full load with a maximum LOB size and inline LOB mode. This approach might help speed up migrating LOB tables. The following sketch shows the task settings involved.
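This is a minimal sketch of those LOB-related task settings expressed as a Python dictionary; the sizes are placeholder values, and you would apply them through modify_replication_task as in the earlier sketch. The assumption here is that InlineLobMaxSize only takes effect when FullLobMode is enabled.

# Inline LOB mode: LOBs up to InlineLobMaxSize (in KB) are migrated inline during
# the full load; larger LOBs are fetched from the source afterward.
lob_settings = {
    'TargetMetadata': {
        'FullLobMode': True,
        'InlineLobMaxSize': 64,   # placeholder value, in KB
        'LobChunkSize': 64
    }
}
# Pass json.dumps(lob_settings) as ReplicationTaskSettings in modify_replication_task.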

CDC tasks

During the replication phase, the replication task reads data from transaction log files, accumulates it on the replication instance in serial transaction order, and then replays it on the target in the same sequence. During this phase, a variety of possible failures can occur.

Table errors

When there are data issues, the task will, by default, log the error and continue with the replication. Additionally, an exception will appear in the control table. When this error count reaches a threshold, the table will be suspended from replication. The default count for TableFailureMaxCount is 1000.

Consider the following mitigation strategies:

It’s recommended to have proper monitoring in place on the task log files, which will notify stakeholders of any error. For more information on setting up monitoring at the task level, refer to Automating database migration monitoring with AWS DMS.
As soon as a table error is encountered, check the control table (awsdms_apply_exceptions) in the target database.
For critical business tables, it’s recommended to understand the data and set the error handling settings accordingly, so the task stops if any data issues are encountered (see the sketch after this list). For more information, refer to Error handling task settings.
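The following is a minimal sketch of stricter error handling settings for such tables, expressed as a Python dictionary and applied through modify_replication_task as shown earlier; the values are one possible choice, not the only valid combination.

# Stop the task on data errors instead of logging and continuing, and stop the
# task when a table error escalates, so issues surface immediately.
error_settings = {
    'ErrorBehavior': {
        'DataErrorPolicy': 'STOP_TASK',
        'TableErrorPolicy': 'SUSPEND_TABLE',
        'TableErrorEscalationPolicy': 'STOP_TASK'
    }
}
# Pass json.dumps(error_settings) as ReplicationTaskSettings in modify_replication_task.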

Swap file on the AWS DMS replication instance

When running a CDC task, you may encounter a swap file issue: when the swap files on the replication instance exceed their storage limit, AWS DMS pauses reading from the source database. This leads to latency or a large number of transaction files accumulating.

The following code snippet shows the error in the AWS DMS logs:

]I: Reading from source is paused. Total storage used by swap files exceeded the limit 3045064704 bytes

If this error occurs, consider the following strategies:

Check the health of the target database.
Change the logging parameters in PostgreSQL to log long-running AWS DMS queries in the PostgreSQL logs (see the sketch after this list). Generate an explain plan to drill down into the issue (for missing or improper indexes and full table scans).
If there are no issues on the target database, try to optimize the task by splitting the task into small chunks to run a parallel data load. Also, you can tune the AWS DMS task by enabling batch apply mode in the replication task. For more details, refer to Change processing tuning settings.
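One common way to capture long-running statements is the log_min_duration_statement parameter. The following is a minimal sketch for Amazon RDS for PostgreSQL using the AWS SDK for Python; the parameter group name and threshold are hypothetical placeholders, and for Aurora PostgreSQL you would modify the cluster parameter group instead.

import boto3

rds = boto3.client('rds')

# Log any statement that runs longer than 5 seconds (value is in milliseconds).
rds.modify_db_parameter_group(
    DBParameterGroupName='my-postgres-params',   # hypothetical parameter group
    Parameters=[{
        'ParameterName': 'log_min_duration_statement',
        'ParameterValue': '5000',
        'ApplyMethod': 'immediate',
    }],
)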

Update or delete failed because row doesn’t exist

When running an update or delete, you may encounter the following error:

2023-07-18T10:17:02 [TARGET_APPLY ]W: Source changes that would have had no impact were not applied to the target database. Refer to the ‘awsdms_apply_exceptions’ table for details. (endpointshell.c:6711)

This is caused by missing data in the target database and is often observed with tables that lack a primary key.

To mitigate the issue, consider the following steps:

Check the error message in the awsdms_apply_exceptions table in the target database under the control schema (see the sketch after this list).
Review the task log files to understand more about the error.
Compare the data between the source and target manually for the same primary key value. If it is a genuine data issue, you need to fix it through the AWS DMS replication task.
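The following is a minimal sketch for inspecting the exceptions table from Python; the connection details are hypothetical placeholders, and it assumes the control tables were created in the default schema for the connecting user.

import psycopg2

# Hypothetical connection details for the target database.
conn = psycopg2.connect(host='target-db.example.com', dbname='ttp',
                        user='postgres', password='********')

with conn, conn.cursor() as cur:
    # The control table lives in the control schema configured for the task.
    cur.execute('SELECT * FROM awsdms_apply_exceptions LIMIT 20')
    for row in cur.fetchall():
        print(row)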

DDL changes on the source or target database

If a DDL is run on the source (such as adding a column or changing a data type) and the change isn’t replicated or applied to the target, the table may be suspended from replication. There are limitations with regard to DDL changes; refer to the AWS DMS documentation for your source and target engines.

To address the issue, you can enable the ChangeProcessingDdlHandlingPolicy settings so that AWS DMS replicates the DDL to the target, as in the following sketch. Refer to the individual documentation for your source or target, as appropriate for your use case.
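This is a minimal sketch of those settings expressed as a Python dictionary, applied through modify_replication_task as shown earlier; support for each option depends on your source and target engines.

# Let AWS DMS propagate DDL captured from the source instead of suspending the table.
ddl_settings = {
    'ChangeProcessingDdlHandlingPolicy': {
        'HandleSourceTableDropped': True,
        'HandleSourceTableTruncated': True,
        'HandleSourceTableAltered': True
    }
}
# Pass json.dumps(ddl_settings) as ReplicationTaskSettings in modify_replication_task.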

Make sure to stop the replication task and apply the changes on the target first, then on the source. Then you can resume the replication task.

Auto generated columns

There is a chance of table failure if there are auto generated columns in the source database tables. For example, SQL Server supports computed columns, and Oracle supports virtual columns with the GENERATED ALWAYS clause.

There are two use cases regarding this error:

Virtual column with GENERATED ALWAYS – If the column is not visible to AWS DMS, verify whether the target database supports similar functionality (for example, Amazon Aurora PostgreSQL 12 and later supports generated columns).
Non-virtual column – If a column’s data is generated using GENERATED ALWAYS based on another column, the column is visible to AWS DMS. Verify whether the target database supports this functionality, and exclude the column from replication so the target functionality can handle it (see the sketch after this list). You can also alter the target column to be non-generated and allow AWS DMS to insert values; during cutover, manage this in the target using methods such as triggers or generated columns.
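The following is a minimal sketch of excluding such a column with a table-mapping transformation rule, applied with the AWS SDK for Python; the schema, table, and column names and the task ARN are hypothetical placeholders.

import json
import boto3

dms = boto3.client('dms')

# The remove-column transformation rule drops the generated column from
# replication so the target database can populate it instead.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-source-schema",
            "object-locator": {"schema-name": "JST", "table-name": "%"},
            "rule-action": "include"
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "exclude-generated-column",
            "rule-target": "column",
            "object-locator": {"schema-name": "JST", "table-name": "TEST",
                               "column-name": "TOTAL_PRICE"},
            "rule-action": "remove-column"
        }
    ]
}

# Stop the task before modifying its table mappings, then resume it.
dms.modify_replication_task(
    ReplicationTaskArn='arn:aws:dms:us-east-1:123456789012:task:EXAMPLE',  # hypothetical
    TableMappings=json.dumps(table_mappings),
)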

Full load and CDC tasks

In this section, we discuss issues you may encounter during full load and CDC tasks.

Tasks running concurrently

AWS DMS employs a lightweight database to store metadata for all AWS DMS tasks and related activities. Occasionally, issues linked to this database may surface, though they are intermittent.

While rare, this problem might arise due to contention from launching too many replication tasks simultaneously, which can lead to a locking situation in the metadata.

The following code snippet shows an example error message in the AWS DMS logs:

E: SQLite general error. Code <5>, Message <database is locked>. [1000504] (at_sqlite.c:326)

A straightforward way to approach this issue is to start only a few tasks at a time, wait until those tasks start replicating, and then start the next batch after a gap, as in the following sketch.
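The following is a minimal sketch of staggering task starts with the AWS SDK for Python; the task ARNs, batch size, and delay are hypothetical placeholders.

import time
import boto3

dms = boto3.client('dms')

# Hypothetical task ARNs; start them in small batches instead of all at once.
task_arns = [
    'arn:aws:dms:us-east-1:123456789012:task:TASK1',
    'arn:aws:dms:us-east-1:123456789012:task:TASK2',
    'arn:aws:dms:us-east-1:123456789012:task:TASK3',
    'arn:aws:dms:us-east-1:123456789012:task:TASK4',
]

BATCH_SIZE = 2
DELAY_BETWEEN_BATCHES = 300  # seconds

for i in range(0, len(task_arns), BATCH_SIZE):
    for arn in task_arns[i:i + BATCH_SIZE]:
        dms.start_replication_task(
            ReplicationTaskArn=arn,
            StartReplicationTaskType='start-replication',
        )
    # Give each batch time to initialize and begin replicating before the next one.
    time.sleep(DELAY_BETWEEN_BATCHES)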

Insert failed due to existing data on the target

To avoid downtime, end-users often use the full load and CDC option with replication tasks. With this option, AWS DMS loads existing data from the source to the target and caches all incoming changes to apply after the full load. If data changes occur on a table while it is being migrated, those changes may be included in the full load and also cached for CDC. When applying the cached changes to the target, AWS DMS may begin to throw errors for duplicate rows.

For a workaround for open transactions, refer to How AWS DMS handles open transactions when starting a full load and CDC task. Note that this issue has been fixed in AWS DMS version 3.5.1.

The following code snippet shows an example error message in the AWS DMS logs:

2023-06-25 04:57:38 UTC:172.31.27.41(33858):postgres@ttp:[18108]:ERROR: duplicate key value violates unique constraint "test_pkey"

Consider the following mitigation strategies:

A thorough review is required to investigate duplicate rows. If there is a primary key on both the source and target database, then AWS DMS will manage it by logging an error message in the task log file and continuing.
If the table has no primary key in the source database but has a primary key in the target database, verify whether the duplicate rows contain genuine values. If not, AWS DMS will handle them because there’s a primary key on the target table.
If there’s no primary key on either the source or the target table, AWS DMS will insert duplicate data. If rows exist in the target but not in the source, remove them manually from the target table.

Conclusion

In this post, we explored various failure scenarios and their respective mitigation steps related to each component in the data replication setup. In Part 2 of this series, we will delve into validation errors, target database failure scenarios, and handling data-related errors through error handling settings.

We value your feedback. Share any questions or suggestions in the comments section.

About the Authors

Vivekananda Mohapatra is a Lead Consultant with the Amazon Web Services ProServe team. He has deep expertise in database development and administration on Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL, and Amazon Redshift databases. He is also a subject matter expert in AWS DMS. He works closely with customers to help migrate and modernize their databases and applications to AWS.

Jitender Kumar is a Senior Lead Consultant with the Professional Services team at Amazon Web Services. He has worked on multiple databases as DBA and is skilled in SQL performance tuning and database migrations. He focuses on database migration to AWS and works with customers from assessment to implementation.

HariKrishna Boorgadda is a Senior Consultant with the Professional Services team at Amazon Web Services. He focuses on database migrations to AWS and works with customers to design and implement Amazon RDS and Amazon Aurora architectures.

Rajesh Madiwale is a Lead Consultant with Amazon Web Services. He has deep expertise on database development and administration on Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL, Amazon Redshift, MySQL, and Greenplum databases. He is an ardent member of the PostgreSQL community and has been working on PostgreSQL his entire tenure. He has also delivered several sessions at PostgreSQL conferences.

