
Tune replication performance with AWS DMS for an Amazon Kinesis Data Streams target endpoint – Part 2

In Part 1 of this series, we discussed the architecture of the multi-threaded full load and change data capture (CDC) settings, along with considerations and best practices for configuring various parameters when replicating data with AWS Database Migration Service (AWS DMS) from a relational database system to Amazon Kinesis Data Streams. In this post, we demonstrate the effect of changing those parameters on throughput during the full load and CDC phases. The main parameters we consider are the AWS DMS parallel load and parallel apply settings and the number of shards in Kinesis Data Streams.

Test environment configuration

To demonstrate the behaviors outlined in Part 1, we assembled several different configurations of parallel load and apply settings, custom table mapping rules to define partition keys, and target Kinesis Data Streams shard counts. We chose Amazon Relational Database Service (Amazon RDS) for PostgreSQL as the source database engine, running PostgreSQL 13.7 with the pglogical plugin on an r5.xlarge instance. For data migration, we used AWS DMS engine version 3.4.7 running on a dms.r5.xlarge replication instance.

We created a table named large_test with the following table definition:

CREATE SEQUENCE IF NOT EXISTS public.large_test_pk_seq
    INCREMENT 1
    START 1
    MINVALUE 1
    MAXVALUE 9223372036854775807
    CACHE 1;

CREATE TABLE IF NOT EXISTS public.large_test
(
    pk integer NOT NULL DEFAULT nextval('large_test_pk_seq'::regclass),
    num2 double precision,
    num3 double precision,
    CONSTRAINT large_test_pkey PRIMARY KEY (pk)
);

We then loaded approximately 1 million records into the large_test table for all the full load test scenarios and measured the replication throughput. For the full load phase, throughput is measured by the FullLoadThroughputRowsTarget (count/second) and NetworkTransmitThroughput (bytes/second) metrics. For CDC, we created a procedure that produces a high volume of changes against the large_test table at a given time and tested the replication performance; CDC throughput is measured by the CDCLatencyTarget (seconds) and NetworkTransmitThroughput (bytes/second) metrics. We discuss the scenarios and results in the following sections.
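The load script itself is not shown in this post; a minimal sketch of how a dataset of this shape can be seeded (the row count and value expressions here are illustrative, mirroring the CDC generator shown later) is:

-- Seed roughly 1 million rows of random data into the test table.
INSERT INTO public.large_test (num2, num3)
SELECT random(), random() * 142
FROM generate_series(1, 1000000) AS s(i);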

Scenario 1: Full load with no parallel load settings and no custom table mappings

We created an AWS DMS task for full load with large_test as the source and Kinesis Data Streams with provisioned mode as the target. Within the AWS DMS task, for benchmarking purposes, we did not use parallel load settings or custom table mapping rules.
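For reference, a provisioned-mode stream with a fixed shard count can be created with the AWS CLI; the stream name below is a placeholder:

aws kinesis create-stream \
    --stream-name dms-target-stream \
    --shard-count 4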

We chose four shards in Kinesis Data Streams for this test case and the following settings:

"ParallelLoadQueuesPerThread": 0,
"ParallelLoadThreads": 0,
"ParallelLoadBufferSize": 0
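These parallel load settings belong in the TargetMetadata section of the AWS DMS task settings JSON; a minimal sketch (all other TargetMetadata fields omitted) looks like the following:

{
    "TargetMetadata": {
        "ParallelLoadThreads": 0,
        "ParallelLoadQueuesPerThread": 0,
        "ParallelLoadBufferSize": 0
    }
}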

With these settings, we were able to baseline at around 115 records per second ingested into the Kinesis data stream. The following figure shows the corresponding row throughput.

The following figure shows the corresponding network throughput at around 350 KB/s.

We also baselined this configuration against a Kinesis data stream with eight provisioned shards, with the rest of the parallel load settings and custom table mapping rules the same as in the previous scenario. The test showed a moderate increase to 125 records per second, as shown in the following figures.

The following figure shows the corresponding network throughput at around 360 KB/s.

Scenario 2: Full load with no parallel load settings and with custom table mapping

In this scenario, with the parallel load settings remaining the same as earlier, we adjusted the AWS DMS task to use the custom table mapping as follows:

{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "967609532",
            "rule-name": "967609532",
            "object-locator": {
                "schema-name": "public",
                "table-name": "large_test"
            },
            "rule-action": "include",
            "filters": []
        },
        {
            "rule-type": "object-mapping",
            "rule-id": "23",
            "rule-name": "TransformToKinesis",
            "rule-action": "map-record-to-record",
            "target-table-name": "large_test",
            "object-locator": {
                "schema-name": "public",
                "table-name": "large_test"
            },
            "mapping-parameters": {
                "partition-key-type": "attribute-name",
                "partition-key-name": "pk",
                "attribute-mappings": [
                    {
                        "target-attribute-name": "pk",
                        "attribute-type": "scalar",
                        "attribute-sub-type": "number",
                        "value": "${pk}"
                    },
                    {
                        "target-attribute-name": "num_2",
                        "attribute-type": "scalar",
                        "attribute-sub-type": "number",
                        "value": "${num2}"
                    },
                    {
                        "target-attribute-name": "num_3",
                        "attribute-type": "scalar",
                        "attribute-sub-type": "number",
                        "value": "${num3}"
                    }
                ]
            }
        }
    ]
}

When using the default settings (with no custom table mapping defined), AWS DMS sets the object mapping parameter partition-key-type to the primary key of the table. Alternatively, you can set partition-key-type to schema-table, meaning all records from the same table in the same schema of the source database are loaded to the same shard in Kinesis Data Streams.

By setting partition-key-type to the primary key, you force AWS DMS to use the Kinesis PutRecords API and distribute the records across all of the target shards according to the partition key.
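For comparison, a mapping-parameters fragment that instead pins each table to a single shard (the schema-table option mentioned above) would look like this sketch:

"mapping-parameters": {
    "partition-key-type": "schema-table"
}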

With these custom table mapping rules, we achieved 108 records per second for a four-shard target, as shown in the following figures.

The following figure shows the corresponding network throughput at around 350 KB/s.

Using these custom table mapping rules, we achieved 116 records per second for an eight-shard target, as shown in the following figures. The use of custom table mapping did not change the throughput much, but you can control how the data is distributed among shards and avoid any potential hot shards.

The following figure shows the corresponding network throughput at around 375 KB/s.

Scenario 3: Parallel load settings with no custom table mappings

We altered the AWS DMS task to use the parallel load settings and removed the custom table mappings. The settings and results are as follows for four and eight Kinesis shards.

For a four-shard target, we used the following settings and achieved 700 records per second:

"ParallelLoadQueuesPerThread": 1,
"ParallelLoadThreads": 4,
"ParallelLoadBufferSize": 100

The following figure shows the corresponding network throughput at around 1.5 MB/s.

For a four-shard target, we used the following settings and achieved 1,440 records per second:

"ParallelLoadQueuesPerThread": 2,
"ParallelLoadThreads": 4,
"ParallelLoadBufferSize": 100

The following figure shows the corresponding network throughput at around 1.25 MB/s.

For a four-shard target, we used the following settings and achieved 2,733 records per second:

"ParallelLoadQueuesPerThread": 4,
"ParallelLoadThreads": 8,
"ParallelLoadBufferSize": 100

The following figure shows the corresponding network throughput at around 2.5 MB/s.

For an eight-shard target, we used the following settings and achieved 1,450 records per second:

"ParallelLoadQueuesPerThread": 1,
"ParallelLoadThreads": 4,
"ParallelLoadBufferSize": 100

The following figure shows the corresponding network throughput at around 1.25 MB/s.

For an eight-shard target, we used the following settings and achieved 1,430 records per second:

"ParallelLoadQueuesPerThread": 2,
"ParallelLoadThreads": 4,
"ParallelLoadBufferSize": 100

The following figure shows the corresponding network throughput at around 1.25 MB/s.

For an eight-shard target, we used the following settings and achieved 1,433 records per second:

"ParallelLoadQueuesPerThread": 1,
"ParallelLoadThreads": 8,
"ParallelLoadBufferSize": 100

The following figure shows the corresponding network throughput at around 2.75 MB/s.

For an eight-shard target, we used the following settings and achieved 2,880 records per second:

"ParallelLoadQueuesPerThread": 2,
"ParallelLoadThreads": 8,
"ParallelLoadBufferSize": 100

The following figure shows the corresponding network throughput at around 2.75 MB/s.

As these examples show, throughput increases as we increase the number of shards and the number of threads in the parallel load settings. As discussed in Part 1, as a general rule you can set the number of parallel load threads equal to the number of Kinesis shards for optimal performance. A shard supports up to 1 MB/s of incoming data, so, for example, the roughly 2.5 MB/s we observed at 2,733 records per second already requires at least three shards for write bandwidth alone. For more information, refer to Amazon Kinesis Data Streams FAQs.
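When you resize an existing provisioned stream to match the planned thread count, you can use the AWS CLI; the stream name below is a placeholder:

aws kinesis update-shard-count \
    --stream-name dms-target-stream \
    --target-shard-count 8 \
    --scaling-type UNIFORM_SCALING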

Scenario 4: Parallel load settings with custom table mappings

We did not run a dedicated test for parallel load settings combined with custom table mappings. The results were similar to the previous parallel load scenarios because we chose the partition key to be the primary key of the table.

Summary of full load tests

The following table summarizes the full load tests with four shards.

Table Mapping | Default | Custom | Default | Default | Default
Parallel Load Threads | 0 | 0 | 4 | 4 | 8
Parallel Load Queues per Thread | 0 | 0 | 1 | 2 | 4
Parallel Load Buffer Size | 0 | 0 | 100 | 100 | 100
Records per Second | 115 | 108 | 700 | 1,440 | 2,733
Average Network Transmit Throughput (Bytes/Sec) | 350,000 | 350,000 | 1,500,000 | 1,250,000 | 2,500,000

The following table summarizes the full load tests with eight shards.

Table Mapping | Default | Custom | Default | Default | Default | Default
Parallel Load Threads | 0 | 0 | 4 | 4 | 8 | 8
Parallel Load Queues per Thread | 0 | 0 | 1 | 2 | 1 | 2
Parallel Load Buffer Size | 0 | 0 | 100 | 100 | 100 | 100
Records per Second | 125 | 116 | 1,450 | 1,430 | 1,433 | 2,880
Average Network Transmit Throughput (Bytes/Sec) | 360,000 | 375,000 | 1,250,000 | 1,250,000 | 2,750,000 | 2,750,000

Scenario 5: CDC with no parallel apply

In this scenario, we tested CDC performance by setting up a procedure that issues a high volume of changes against the source. We measured CDCLatencyTarget, defined as the gap, in seconds, between the first event timestamp waiting to commit on the target and the current timestamp of the AWS DMS replication instance; equivalently, it is the difference between the replication instance server time and the oldest unconfirmed event ID forwarded to a target component. For more details on CDC monitoring, refer to Monitoring AWS DMS tasks.
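You can pull CDCLatencyTarget from Amazon CloudWatch with the AWS CLI to track this metric during a test run; the instance and task identifiers and the time range below are placeholders:

aws cloudwatch get-metric-statistics \
    --namespace "AWS/DMS" \
    --metric-name CDCLatencyTarget \
    --dimensions Name=ReplicationInstanceIdentifier,Value=my-dms-instance \
                 Name=ReplicationTaskIdentifier,Value=my-cdc-task \
    --statistics Average \
    --period 60 \
    --start-time 2024-05-18T00:00:00Z \
    --end-time 2024-05-18T01:00:00Z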

The change generator is the following anonymous PL/pgSQL block:

DO
$do$
BEGIN
    FOR i IN 1..1000 LOOP
        -- Each iteration inserts 500 rows of random data.
        INSERT INTO large_test (num2, num3)
        SELECT random(), random() * 142
        FROM generate_series(1, 500) s(i);
    END LOOP;
END;
$do$;

The preceding block, which inserts 500,000 rows in total, generally ran in a few seconds against the source table large_test.

To establish a solid baseline, we configured a CDC-only task with no parallel apply settings or custom table mappings and used the same methodology to test with four and eight shards.

For a four-shard target with no parallel apply settings, we achieved 35 records per second:

"ParallelApplyThreads": 0,
"ParallelApplyBufferSize": 0,
"ParallelApplyQueuesPerThread": 0

The following figure shows the corresponding network throughput at around 400 KB/s.

Next, we applied parallel apply settings and tested with a four-shard target and eight-shard target.
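Like the parallel load settings, the parallel apply settings live in the TargetMetadata section of the task settings; a minimal sketch with the values from the first test follows:

{
    "TargetMetadata": {
        "ParallelApplyThreads": 4,
        "ParallelApplyBufferSize": 100,
        "ParallelApplyQueuesPerThread": 1
    }
}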

For a four-shard target with the following settings, we achieved 240 records per second:

"ParallelApplyThreads": 4,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 1

The following figure shows the corresponding network throughput at around 1.45 MB/s.

For a four-shard target with the following settings, we achieved 280 records per second:

"ParallelApplyThreads": 4,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 2

The following figure shows the corresponding network throughput at around 1.45 MB/s.

For a four-shard target with the following settings, we achieved 1,282 records per second:

"ParallelApplyQueuesPerThread": 4,
"ParallelApplyThreads": 4,
"ParallelApplyBufferSize": 100

The following figure shows the corresponding network throughput at around 1.4 MB/s.

For a four-shard target with the following settings, we achieved 1,388 records per second:

"ParallelApplyThreads": 4,
"ParallelApplyBufferSize": 1000,
"ParallelApplyQueuesPerThread": 4

The following figure shows the corresponding network throughput at around 1.45 MB/s.

For a four-shard target with the following settings, we achieved 2,380 records per second:

"ParallelApplyThreads": 4,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 16

The following figure shows the corresponding network throughput at around 3.0 MB/s.

For a four-shard target with the following settings, we achieved 2,777 records per second:

"ParallelApplyThreads": 4,
"ParallelApplyBufferSize": 1000,
"ParallelApplyQueuesPerThread": 16

The following figure shows the corresponding network throughput at around 3.0 MB/s.

For a four-shard target with the following settings, we achieved 2,777 records per second:

"ParallelApplyThreads": 8,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 4

The following figure shows the corresponding network throughput at around 3.0 MB/s.

In the following scenarios, we use the same testing methodology but with an eight-shard Kinesis target.

For an eight-shard target without any parallel apply settings, we achieved 35 records per second:

"ParallelApplyThreads": 0,
"ParallelApplyBufferSize": 0,
"ParallelApplyQueuesPerThread": 0

The following figure shows the corresponding network throughput at around 400 KB/s.

For an eight-shard target with the following parallel apply settings, we achieved 240 records per second:

"ParallelApplyThreads": 4,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 1

The following figure shows the corresponding network throughput at around 1.45 MB/s.

For an eight-shard target with the following parallel apply settings, we achieved 280 records per second:

"ParallelApplyThreads": 4,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 2

The following figure shows the corresponding network throughput at around 1.45 MB/s.

For an eight-shard target with the following parallel apply settings, we achieved 333 records per second:

"ParallelApplyThreads": 8,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 1

The following figure shows the corresponding network throughput at around 2.6 MB/s.

For an eight-shard target with the following parallel apply settings, we achieved 555 records per second:

"ParallelApplyThreads": 8,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 2

The following figure shows the corresponding network throughput at around 2.75 MB/s.

For an eight-shard target with the following parallel apply settings, we achieved 2,777 records per second:

"ParallelApplyThreads": 8,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 4

The following figure shows the corresponding network throughput at around 3.25 MB/s.

For an eight-shard target with the following parallel apply settings, we achieved 2,777 records per second:

"ParallelApplyThreads": 8,
"ParallelApplyBufferSize": 1000,
"ParallelApplyQueuesPerThread": 4

The following figure shows the corresponding network throughput at around 3.3 MB/s.

For an eight-shard target with the following settings, we achieved 4,166 records per second:

"ParallelApplyThreads": 8,
"ParallelApplyBufferSize": 100,
"ParallelApplyQueuesPerThread": 16

The following figure shows the corresponding network throughput at around 5.25 MB/s.

For an eight-shard target with the following settings, we achieved 4,166 records per second:

"ParallelApplyThreads": 8,
"ParallelApplyBufferSize": 1000,
"ParallelApplyQueuesPerThread": 16

The following figure shows the corresponding network throughput at around 5.25 MB/s.

Summary of CDC tests

The following table summarizes the CDC tests with four shards.

Parallel Apply Threads | 0 | 4 | 4 | 4 | 4 | 4 | 4 | 8
Parallel Apply Queues per Thread | 0 | 1 | 2 | 4 | 4 | 16 | 16 | 4
Parallel Apply Buffer Size | 0 | 100 | 100 | 100 | 1000 | 100 | 1000 | 100
Records per Second | 35 | 240 | 280 | 1,282 | 1,388 | 2,380 | 2,777 | 2,777
Average Network Transmit Throughput (Bytes/Sec) | 400,000 | 1,450,000 | 1,450,000 | 1,400,000 | 1,450,000 | 3,000,000 | 3,000,000 | 3,000,000

The following table summarizes the CDC tests with eight shards.

Parallel Apply Threads | 0 | 4 | 4 | 8 | 8 | 8 | 8 | 8 | 8
Parallel Apply Queues per Thread | 0 | 1 | 2 | 1 | 2 | 4 | 4 | 16 | 16
Parallel Apply Buffer Size | 0 | 100 | 100 | 100 | 100 | 100 | 1000 | 100 | 1000
Records per Second | 35 | 240 | 280 | 333 | 555 | 2,777 | 2,777 | 4,166 | 4,166
Average Network Transmit Throughput (Bytes/Sec) | 400,000 | 1,450,000 | 1,450,000 | 2,600,000 | 2,750,000 | 3,250,000 | 3,300,000 | 5,250,000 | 5,250,000

Conclusion

In this post, we demonstrated the effect of changing various parameters on the throughput for full load and CDC phases when using AWS DMS to replicate data from a relational database system to a Kinesis data stream. In Part 3 of this series, we share some other considerations and best practices for using Kinesis Data Streams as a target.

If you have any questions or suggestions, leave them in the comments section.

About the Authors

Siva Thang is a Senior Solutions Architect, Partners with AWS. His specialty is in databases and analytics, and he also holds a master’s degree in Engineering. Siva is deeply passionate about helping customers build a modern data platform in the cloud that includes migrating to modern relational databases and building data lakes and data pipelines at scale for analytics and machine learning. Siva also likes to present in various conferences and summits on the topic of modern databases and analytics.

Suchindranath Hegde is a Data Migration Specialist Solutions Architect at Amazon Web Services. He works with our customers to provide guidance and technical assistance on data migration into the AWS Cloud using AWS DMS.

Wanchen Zhao is a Senior Database Specialist Solutions Architect at AWS. Wanchen specializes in Amazon RDS and Amazon Aurora, and is a subject matter expert for AWS DMS. Wanchen works with SI and ISV partners to design and implement database migration and modernization strategies and provides assistance to customers for building scalable, secure, performant, and robust database architectures in the AWS Cloud.

Michael Newlin is a Cloud Support DBE with Amazon Web Services and Subject Matter Expert for AWS DMS. At AWS, he works with customers and internal teams to ensure smooth and fast transitions of database workloads to AWS.

Jay Chen is a Software Development Manager at AWS DMS, where he oversees the development of DMS endpoints, including Amazon S3, Kinesis, Kafka, OpenSearch, Timestream, and others. Jay is widely recognized for his significant contributions to the database field. He co-authored the Star Schema Benchmark, a standard benchmark based on the TPC-H benchmark for OLAP databases. He also contributed as a co-author to C-Store: A Column-oriented DBMS.
