Improve performance of your bulk data import to Amazon RDS for MySQL

We use bulk data loading to move data from a source database to a destination database for migration or load testing. Importing bulk data into MySQL databases can take minutes, hours, or sometimes even days, depending on the amount of data to be loaded. Creating a dump file is usually quick, but loading it into the database takes much longer, which in turn delays any further activities planned for that data or database.

There are several techniques that you can use to import data into Amazon Relational Database Service (Amazon RDS) for MySQL. The best approach depends on the source of the data, the amount of data, whether the import is done one time or is ongoing, and the amount of downtime a business can afford. For a list of techniques available to import data into Amazon RDS for MySQL, see Importing data into a MySQL DB instance.

In this post, I discuss how to improve the speed of loading data into an RDS for MySQL instance. The recommendations are divided into three sections:

What to consider while creating the backup of the source database and loading the data into an RDS for MySQL instance
Database parameters that you can modify to improve load performance
Changes you can make on the infrastructure level

All the recommendations given in this post apply to Amazon RDS for MySQL 5.7 and 8.0.

While this post focuses mostly on importing into a new database instance, some of the recommendations also apply to batch jobs, bulk writes, or ETL workloads. The source can be a different database server, the same database server, or a set of flat files.

Considerations when creating a backup and importing data

The following recommendations apply when creating a backup of the source database and when loading the data into the target database.

Physical backup vs. logical backup

You can choose to create a physical or logical backup of your data depending on your use case and requirements. Here I discuss which backup is suitable in which scenario. For more information regarding backup types, refer to Backup and Recovery Types.

Physical backups consist of raw copies of the directories and files that store database contents. This type of backup is often used for larger datasets as it can allow for faster, less intrusive backup and faster recovery. You can take physical backups of a self-managed MySQL database running on premises or on Amazon Elastic Compute Cloud (Amazon EC2) instances using the tool Percona XtraBackup. You can’t create physical backups of your RDS for MySQL database instances using Percona XtraBackup.

Logical backups save information represented as a logical database structure (CREATE DATABASE, CREATE TABLE statements) and content (INSERT statements or delimited-text files). This type of backup is suitable for smaller amounts of data where you might edit the data values or table structure, or recreate the data on a different machine architecture. You can take logical backups using tools like mysqldump or MyDumper.

If you choose to proceed with a physical backup, you can create a backup of your database using Percona XtraBackup and restore it as an RDS for MySQL instance using the Restore from Amazon S3 option. However, with this option, you can’t restore data to an already existing RDS for MySQL instance. Also, the parameter changes suggested in this post aren’t relevant when restoring physical backups, because that method restores data files directly on the server.

For detailed information and steps for doing a physical backup and restore to an RDS for MySQL instance, refer to Restoring a backup into a MySQL DB instance.

If you choose to proceed with logical backups, you can consider the following recommendations:

Use flat files – Importing data from flat files can be significantly faster than importing SQL dump files. You can load flat files into the database using the LOAD DATA LOCAL INFILE statement, which reads rows from a text file into a table at a very high speed. Additionally, flat files are created on a per-table basis, which directly helps with loading data into the database in parallel. You can create flat files from on-premises or self-managed databases using SELECT ... INTO OUTFILE statements or by using mysqldump with the --tab option. However, neither SELECT ... INTO OUTFILE nor mysqldump with the --tab option works for RDS for MySQL databases because, as a managed service, access to the underlying server is restricted. For RDS for MySQL, you can consider alternatives such as the following:

AWS Database Migration Service (AWS DMS) – You can use AWS DMS to migrate data from an RDS for MySQL database instance to Amazon Simple Storage Service (Amazon S3). When using Amazon S3 as a target in an AWS DMS task, both full load and change data capture (CDC) data are written in comma-separated value (.csv) format by default. For more details, refer to Using Amazon S3 as a target for AWS Database Migration Service.
MySQL Command Line Client – You can use the MySQL command line client to create flat files. Sample command:

mysql -h <endpoint> -u <username> -p --silent --batch --quick --raw -e "select * from <database_name>.<table_name> limit 10" > <filename>.csv

Create multiple files that can be loaded in parallel instead of using a single file – Consider dividing the backup into multiple files (one file per table) instead of creating a single large file with all the databases and tables. If the tables are large, you can also split the table data into more than one file. Typically, a high watermark of 500 MB per file is recommended to keep the transaction size manageable. This approach allows you to import files in parallel, which significantly reduces the data import time, ensures faster rollback, and makes it easier to resume in case of any issues. For example, you can create separate dump files for each table using the following command:

mysqldump -h <endpoint> -u <username> -p <database_name> <tablename> > <dumpfilename.sql>

You can import the file into the database using the following command:

mysql -h <endpoint> -u <username> -p <database_name> < <dumpfilename.sql>
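To import several per-table dump files in parallel, you can run multiple client sessions at the same time. The following is a minimal shell sketch with hypothetical file names; it assumes credentials are stored in a MySQL option file (for example, ~/.my.cnf) so that the backgrounded sessions don’t prompt for a password:

# Minimal sketch: load several per-table dump files in parallel
# (hypothetical file names; credentials come from an option file)
for f in table1.sql table2.sql table3.sql; do
  mysql -h <endpoint> <database_name> < "$f" &
done
wait   # wait for all parallel imports to finish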

As mentioned in the previous section, you can use the SELECT ... INTO OUTFILE statement on your on-premises or self-managed databases to create separate flat files for each table. For example:

SELECT * INTO OUTFILE '/path/<tablename>.csv' FROM <databasename>.<tablename>;

You can import this file into the database using the following command:

LOAD DATA LOCAL INFILE '/path/<tablename>.csv' INTO TABLE <databasename>.<tablename>;

You can also use third-party tools like mydumper, which has parameters such as --rows or --chunk-filesize that you can use to split files when dumping them. For more details, refer to Mydumper Usage.
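For example, a mydumper export that splits each table into chunks, followed by a parallel restore with myloader, might look like the following sketch. The connection details, chunk size, and thread counts are hypothetical; verify the options against the mydumper version you use:

# Dump each table in ~500,000-row chunks using 4 parallel threads
mydumper --host <source_endpoint> --user <username> --password <password> --database <database_name> --rows 500000 --threads 4 --outputdir /backup/dumpdir

# Restore the dump into the RDS for MySQL instance using 4 parallel threads
myloader --host <rds_endpoint> --user <username> --password <password> --database <database_name> --threads 4 --directory /backup/dumpdir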

Load data in primary key order – Loading data in primary key order is a faster approach for inserting rows and is particularly important for tables that don’t fit entirely within the buffer pool. To achieve this, you need to take the backup in primary key order. For example:

mysqldump -h <endpoint> -u <username> -p --order-by-primary <database_name> <tablename> > <dumpfilename.sql>

The --order-by-primary option dumps each table’s rows sorted by its primary key, or by its first unique index.

Load data without a secondary and full-text index – For large datasets, it’s much faster to load your data into a table that has no secondary or full-text indexes and then create the indexes afterward, rather than loading data into a table that already has them. In the case of a full-text index, when a document is inserted, it’s tokenized, and the individual words and associated data are inserted into the full-text index. This process can result in numerous small insertions into the auxiliary index tables, making concurrent access to these tables a point of contention and slowing down the loading of data into tables.
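For example, a minimal SQL sketch (using a hypothetical orders table with hypothetical index and column names) is to drop the secondary and full-text indexes, load the data, and then recreate the indexes:

-- Drop the secondary and full-text indexes before the bulk load (hypothetical names)
ALTER TABLE <database_name>.orders DROP INDEX idx_customer_id;
ALTER TABLE <database_name>.orders DROP INDEX ft_notes;

-- Load the data from a per-table flat file
LOAD DATA LOCAL INFILE '/path/orders.csv' INTO TABLE <database_name>.orders;

-- Recreate the indexes after the load completes
ALTER TABLE <database_name>.orders ADD INDEX idx_customer_id (customer_id);
ALTER TABLE <database_name>.orders ADD FULLTEXT INDEX ft_notes (notes);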

Parameter configurations to improve load performance

The following is a list of parameters that you can modify to provide better performance while loading data into an RDS for MySQL instance. You can modify some of these via parameter groups, and several of them can be modified at the session level.

Make sure that after the data import is complete, you revert the following parameters to their default or previously set values. The suggested values might not be suitable for normal operation when no data import is in progress.

Batch commits

All user activity in InnoDB happens within a transaction. With autocommit enabled (autocommit = 1, the default setting in MySQL), each SQL statement is treated as a separate transaction, and MySQL performs a commit after each statement that doesn’t return an error. At every transaction commit, the redo logs are flushed to disk, which adds runtime overhead during the import operation.

To overcome this performance penalty, consider batching your commits. Batching the statements improves throughput and therefore helps the performance of the data load. You can achieve this using bulk commit methods, such as the following:

Use INSERT statements with multiple values – This can be considerably faster compared to using single-row INSERT statements. See the following syntax:

INSERT INTO <tablename> VALUES (value1, value2), (value3, value4), ...;

Batching flat files with LOAD DATA – You can load flat files into the database using the LOAD DATA LOCAL INFILE statement. The flat files are created on a per-table basis, which directly helps in parallel data loading to the database.
Disable autocommit – When you change the value of autocommit in the custom parameter group associated with the RDS instance, the change takes effect at a global level. Instead, consider changing this parameter at the session level by wrapping your SQL import statements as follows:

SET autocommit=0;
… SQL import statements …
SET autocommit=1;

All these cases lead to a reduced number of commits. Although this can be beneficial, it’s also important to avoid making transactions too big. A single transaction should not be larger than a few hundred megabytes. If you have a larger transaction, split it into chunks of roughly 250 MB and load them sequentially. This keeps transactions small and also reduces rollback time, should there be an issue.
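For example, a minimal sketch of loading in explicitly committed chunks (with hypothetical chunk file names) looks like the following:

SET autocommit=0;
-- First chunk (roughly 250 MB of rows)
LOAD DATA LOCAL INFILE '/path/<tablename>_chunk1.csv' INTO TABLE <database_name>.<tablename>;
COMMIT;
-- Next chunk
LOAD DATA LOCAL INFILE '/path/<tablename>_chunk2.csv' INTO TABLE <database_name>.<tablename>;
COMMIT;
SET autocommit=1;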

Increase the size of innodb_log_file_size

All data changes are logged in the redo logs, and these logs are flushed to disk periodically. When the flush happens depends on adaptive flushing, on whether the percentage of dirty pages reaches the innodb_max_dirty_pages_pct_lwm value, and on the innodb_lru_scan_depth value, which indicates how far to look for dirty pages to flush to disk. For more details regarding buffer pool flushing, refer to Configuring Buffer Pool Flushing.

innodb_log_file_size defines the checkpointing age. Small redo log files cause frequent checkpoints and therefore a lot of disk writes; increasing the log file size leads to less disk I/O.

The default value of innodb_log_file_size for Amazon RDS for MySQL is 128 MiB.

Please note that a larger log file can increase the crash recovery time. The combined size of log files (innodb_log_file_size * innodb_log_files_in_group) can’t exceed a maximum value that is slightly less than 512 GB. Generally, the combined size of the log files should be large enough that the server can smooth out peaks and troughs in workload activity, which often means that there is enough redo log space to handle more than an hour of write activity. For more information, refer to How to calculate a good InnoDB log file size.

You can modify the value of this parameter in the custom parameter group associated with the RDS instance. This is a static parameter, so a reboot is required for the change to take effect.
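For example, the following AWS CLI sketch sets innodb_log_file_size to 2 GiB in a hypothetical custom parameter group (the group name and value are examples only); because the parameter is static, the ApplyMethod is pending-reboot:

aws rds modify-db-parameter-group \
    --db-parameter-group-name my-custom-mysql-params \
    --parameters "ParameterName=innodb_log_file_size,ParameterValue=2147483648,ApplyMethod=pending-reboot"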

Modification of this parameter helps keep the Write IOPS from becoming a bottleneck and therefore helps the performance of the data load.

Increase the size of innodb_log_buffer_size

A large log buffer enables large transactions to run without the need to write the log to disk before the transactions commit. When you’re importing data to your database, transactions are inserting many rows. In this case, making the log buffer larger helps reduce disk I/O and therefore aids the import performance.

The default value for innodb_log_buffer_size is 8 MiB. You can modify the value of this parameter in the custom parameter group associated with the RDS instance. This is a static parameter, so a reboot is required.

Modification of this parameter helps keep the write IOPS from becoming a bottleneck and therefore aids the performance of data load.

Disable foreign key and unique key checks

For the duration of importing data into the database, consider disabling foreign key and unique key checks. Make sure that your dataset doesn’t contain duplicate keys and that the data being inserted doesn’t violate any referential constraints, because disabling these parameters disables the associated checks. Disabling foreign and unique key checks can save a lot of disk I/O for big tables. You can’t change the values of these parameters via parameter groups. To change these system variables, wrap your SQL import statements as follows. Note that the statements SET unique_checks=1 and SET foreign_key_checks=1 re-enable the checks after the data load:

SET unique_checks=0;

… SQL import statements …

SET unique_checks=1;

SET foreign_key_checks=0;

… SQL import statements …

SET foreign_key_checks=1;

Modification of these parameters helps keep the write IOPS from becoming a bottleneck and therefore aids the performance of the data load.

Infrastructure configurations to improve load performance

In this section, I discuss considerations regarding Amazon RDS infrastructure that can help improve data import performance.

Source configuration

If you’re working with data that exists on an EC2 instance, we highly recommend placing the source EC2 instance and the target RDS instance in the same Region and the same Availability Zone. This reduces network overhead, and data transfer within the same Availability Zone is free.

If you’re loading data from an external on-premises source, we recommend that, if possible, you upload the dataset to AWS first and then load the data into the RDS instance. This helps reduce latency.

The source should be adequately sized in terms of I/O and network bandwidth so that resource contention on the source doesn’t become a bottleneck and impact load performance.

Make sure that while dumping and loading data, you’re using the latest version of the MySQL client, because newer versions include performance improvements and bug fixes.

Scale the instance

Consider scaling up the RDS database instance for the duration of the import, because importing data can require a higher-than-usual amount of resources like CPU utilization, EBS bandwidth, and memory. After the import completes successfully, you can scale down the instance to the required instance type. For more information, refer to Modifying an Amazon RDS DB instance.

Note that scaling the instance type requires downtime. For a Single-AZ instance, the downtime can be higher because it equals the time required to replace the underlying host. For a Multi-AZ instance, the downtime equals the time required for a Multi-AZ failover to complete.

If modifying the parameters suggested above doesn’t provide the desired results and you observe bottlenecks on resources such as CPU utilization, EBS bandwidth, or memory, consider scaling your instance to a type with more resources to aid the performance of the data load, as shown in the following example.
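The following AWS CLI sketch scales a hypothetical instance to a larger instance class for the duration of the import (the identifier and instance class are examples only); after the import, you can run the same command with the original instance class to scale back down:

# Scale up before the import (example instance class)
aws rds modify-db-instance \
    --db-instance-identifier my-mysql-instance \
    --db-instance-class db.r6g.2xlarge \
    --apply-immediately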

Disable automated backups

Disabling automated backups on an instance removes any existing automated backups for that instance.

On RDS for MySQL database instances, disabling automated backups also disables binary logging, which disables point-in-time recovery. Also, you can’t disable automated backups if you have read replicas associated with your RDS instance.

MySQL writes to the binary logs immediately after a statement or transaction completes. During bulk loading, disabling binary logging reduces this overhead on resources and therefore improves load performance.

If you disable automated backups, we highly recommend taking a manual snapshot of the RDS instance before starting the import operation. This way, if required, you can restore the database to a state before the data import started. You can disable automated backups on RDS instances by setting the backup retention period to 0 days. For more information, see Disabling automated backups. Also, ensure that after the import is complete, you enable backups on your database instance by increasing the backup retention period to a value greater than 0 as per your compliance requirements.
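The following AWS CLI sketch (with hypothetical identifiers and an example 7-day retention period) takes a manual snapshot, disables automated backups for the import, and re-enables them afterward:

# Take a manual snapshot before starting the import
aws rds create-db-snapshot \
    --db-instance-identifier my-mysql-instance \
    --db-snapshot-identifier pre-import-snapshot

# Disable automated backups (and binary logging) for the duration of the import
aws rds modify-db-instance \
    --db-instance-identifier my-mysql-instance \
    --backup-retention-period 0 \
    --apply-immediately

# Re-enable automated backups after the import completes
aws rds modify-db-instance \
    --db-instance-identifier my-mysql-instance \
    --backup-retention-period 7 \
    --apply-immediately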

Binary logging can cause increased writes to the disk. Disabling binary logging can assist with keeping the write IOPS from becoming a bottleneck and therefore aid the performance of the data load.

Disable Multi-AZ

By disabling Multi-AZ, you’re no longer running your DB instance with high availability.

In a Multi-AZ setup, all data written to the primary instance is synchronously replicated to the secondary instance. DB instances using Multi-AZ deployments can have increased write and commit latency compared to a Single-AZ deployment because of this synchronous replication. When you’re working with very large datasets, disabling Multi-AZ improves performance while bulk loading data.

Additionally, converting an instance from Single-AZ to Multi-AZ or vice versa is an online operation and requires no downtime. However, if your system is under a heavy write load, the conversion can take time, and the instance may remain in the modifying state for a long time because the primary and standby need to be in sync before the instance can be set to available.

Disabling Multi-AZ can assist with reducing the write latencies on the database and therefore aid the performance of the data load.
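For example, the following AWS CLI sketch converts a hypothetical instance to Single-AZ before the import and back to Multi-AZ afterward:

# Convert to Single-AZ before the import
aws rds modify-db-instance \
    --db-instance-identifier my-mysql-instance \
    --no-multi-az \
    --apply-immediately

# Convert back to Multi-AZ after the import completes
aws rds modify-db-instance \
    --db-instance-identifier my-mysql-instance \
    --multi-az \
    --apply-immediately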

Read Replica

Any updates made to the primary DB instance are asynchronously replicated to the read replica. Consider creating the read replica after importing your data into the database. This saves the time spent replicating changes to the replica after they are written to the source database. For example, if a change made on the primary instance takes an hour to run, the replication lag is one hour when it completes; because the change might also take an hour to complete on the replica, the total lag is approximately two hours by the time the replica catches up. For more information regarding replica lag and its causes, refer to How can I troubleshoot high replica lag with Amazon RDS for MySQL?
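For example, after the import completes, you can create the read replica with an AWS CLI command like the following (the identifiers are hypothetical):

# Create a read replica from the freshly loaded primary instance
aws rds create-db-instance-read-replica \
    --db-instance-identifier my-mysql-replica \
    --source-db-instance-identifier my-mysql-instance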

Conclusion

Bulk data loading into MySQL can be an expensive operation in terms of the time required to complete the process. In this post, I walked you through different considerations while bulk importing data into an RDS for MySQL instance.

If you have comments about this post, submit them in the comments section.

About the author

Asmita Varma is a Technical Account Manager with Amazon Web Services. She helps customers navigate the challenges and complexities of the AWS Cloud to attain operational excellence. Her focus areas have been MySQL and Amazon Aurora MySQL-Compatible Edition.
