Monday, May 27, 2024
No menu items!
HomeDatabase ManagementMigrate time-series data from Amazon RDS for PostgreSQL to Amazon Timestream using...

Migrate time-series data from Amazon RDS for PostgreSQL to Amazon Timestream using batch load

Amazon Timestream is a fast, scalable, fully managed, purpose-built time-series database that makes it straightforward to store and analyze trillions of time-series data points per day. Timestream saves you time and cost in managing the lifecycle of time-series data by keeping recent data in memory and moving historical data to a cost-optimized storage tier based on user-defined policies.

In this post, we show you how to migrate your time-series data from Amazon Relational Database Service (Amazon RDS) for PostgreSQL to Timestream using the batch load feature. Same solution would also work if your time-series data resides in Amazon Aurora PostgreSQL-Compatible Edition.

Timestream concepts

Before we get started, let’s review some key Timestream concepts:

Time-series – A sequence of one or more data points (or records) recorded over a time interval.
Record – A single data point in a time-series.
Dimension – An attribute that describes the metadata of a time-series. A dimension consists of a dimension name and a dimension value.
Measure – The actual value being measured by the record. Examples include stock price, CPU or memory utilization, and temperature or humidity readings. Measures consist of measure names and measure values.

With batch load for Timestream, you can ingest data from CSV files stored in Amazon Simple Storage Service (Amazon S3). With this new functionality, you can load your data to Timestream without having to rely on other tools or write custom code. You can use batch load for backfilling data with flexible wait times, such as data that isn’t immediately required for querying or analysis.

Solution overview

In this solution, we model our data using multi-measure records to load a sample dataset from an RDS for PostgreSQL database to Timestream. With multi-measure records, you can store your time-series data in a more compact format, which helps lower data storage costs. Also, the compact data storage lends itself to writing simpler queries for data retrieval and improving query performance. For example, you can use multi-measure records when you’re tracking more than one metric emitted from the same source at the same time. Multi-measure records also simplify migrating data from relational databases to Timestream because the database schema needs fewer or no changes.

The architecture uses the Amazon RDS for PostgreSQL aws_s3 export extension to export the time-series data to CSV files stored in an S3 bucket, and also perform timestamp conversion to epoch format (which is a batch load prerequisite). After the data is stored in an S3 bucket, we use a Timestream batch load to perform a one-time migration of time-series data to a Timestream database.

The following diagram illustrates this architecture.

Prerequisites

To implement this solution, create the following resources:

An RDS for PostgreSQL instance to store your time-series data. For setup instructions, refer to Create a database.
The sample dataset downloaded from Amazon S3. Verify the downloaded file and import it to your RDS for PostgreSQL database using the copy command.
An S3 bucket to stage data from Amazon RDS for PostgreSQL. For instructions, refer to Create your first S3 bucket and Setting up access to an Amazon S3 bucket.
AWS Identity and Access Management (IAM) policies and roles.
A Timestream table that will store the time-series data and be the target for the batch load. For instructions, refer to Create a table. To optimize query runs, partition the Timestream table based on the dimension hostname using a customer-defined partitioning key. The following screenshot illustrates this configuration on the Timestream console. We discuss partitioning keys in more detail later in this post.

Additionally, we encourage you to review the prerequisites, quotas, and best practices for batch loads.

Dataset

Tables in Timestream support a special attribute (or column) called measure_name. Timestream uses the values in the measure name attribute to partition and index the data. Timestream has a requirement to have at least one measure_name column; therefore, we have added a measure_name attribute with a region classification as shown in the following screenshot.

The preceding sample dataset is comprised of CPU and memory metrics from a farm of compute instances running on multiple Regions and Availability Zones on AWS.

Prepare the Amazon RDS for PostgreSQL export to Amazon S3

To prepare your export, complete the following steps:

Verify that the Amazon RDS for PostgreSQL engine version supports Amazon S3 exports.
Install the aws_s3 extension on Amazon RDS for PostgreSQL.
Provide Amazon RDS for PostgreSQL permissions to the S3 bucket.

Export time-series data from Amazon RDS for PostgreSQL

In this step, we use the aws_s3.query_export_to_s3 function to export time-series data from Amazon RDS for PostgreSQL to the S3 bucket in CSV format with column file headers:

SELECT * FROM aws_s3.query_export_to_s3(
‘SELECT cast(EXTRACT(epoch from time) * 1000000 as BIGINT) as time, measure_name, region, location, hostname, mem_usage, cpu_usage from perf_metrics’,
aws_commons.create_s3_uri(‘test-aws-dms-target-bucket/batchload’, ‘perf_metrics.csv’, ‘us-west-2′), options :=’format csv, HEADER true’);

Note that in order to meet the Timestream batch load prerequisites, we use the EXTRACT function to convert the timestamp column to epoch (microseconds) format.

Modeling performance metrics data

Let’s walk through the attributes of the dataset that we load to a Timestream table using batch load and how that contributes to performance:

time – The exact time when the performance metrics are collected
measure_name – An aggregated attribute to classify performance metrics
region – The Region where the host is located
location – The Availability Zone where the host is located
hostname – The hostname whose metrics are collected
mem_usage – The memory usage of a host
cpu_usage – The CPU usage of a host

Choosing the right dimensions and measures

When migrating from a traditional database to Timestream, it is often assumed that dumping tables and columns from the existing database to Timestream will work. However, the real challenge lies in knowing the query patterns and selecting the right dimensions, measures, and optionally a partition key.

Dimensions, including the record timestamp, contextualize observations, helping us identify the who, what, when, and where of a record. Dimensions are used to organize and categorize data, and to filter data as part of a query. Therefore, the columns region, location, and hostname are ideal choices for organizing and categorizing the server performance metrics data.

Measures are responsible for quantitative data (values that change over time). Measures provide the basis for performing mathematical calculations (computing totals, averages, differences in rate of change, and so on) and quantitative analysis on your data. Therefore, the columns (measures) mem_usage and cpu_usage capture important metrics related to the host’s performance.

There are limits on the number of dimensions, measures (maximum measures per record, unique measures across table), and maximum size of a record. These factors should be considered when designing your data model. Often, data ingesting in Timestream originates through an event or metric that contains additional attributes than what’s needed for time-series analysis. To prevent hitting limits, target only required attributes. When data doesn’t relate and isn’t queried together, using separate tables is better than one consolidated table.

To learn more about data modeling best practices, refer to Data Modeling Best Practices to Unlock the Value of your Time-series Data.

Choosing the customer-defined partition key

Customer-defined partition keys is a new feature that provides the flexibility needed to speed up queries and derive insights more efficiently based on specific time-series data-related needs. Partitioning is a technique used to distribute data across multiple physical storage units, allowing for faster and more efficient data retrieval. With customer-defined partition keys, you can create a partitioning schema that better aligns with your query patterns and use cases.

When it comes to partitioning in Timestream, you have the option to choose a partition key or use the default partition, which is based on the measure_name column.

We recommend selecting a partition key based on a dimension with a high cardinality column and frequently used as a predicate in queries. This helps evenly distribute data across partitions and avoid performance issues.

In this performance metrics use case, columns with high cardinality like measure_name (default) or hostname could be suitable as a partition key. However, the choice depends on the specific use case and which column is frequently used for filtering when making queries and is a high cardinality column. In our use case, query access patterns frequently use hostname as a predicate, which is also a high cardinality column, and for these reasons, we configured hostname as a customer-defined partition key.

It’s highly recommended to use a customer-defined partition key over default partitioning.

To learn more about optimizing query performance using customer-defined partition keys, refer to Introducing customer-defined partition keys for Amazon Timestream: Optimizing query performance.

Perform a Timestream batch load

Complete the following steps to create a batch load task:

On the Timestream console, in the navigation pane, choose Management Tools, then choose Batch load tasks.
Choose Create batch load task.
For Target database, choose the database you created as a prerequisite.
For Target table, choose the table you created as a prerequisite.
If necessary, you can add a table from this pane by choosing Create new table.
For Data source S3 location in Data source, choose the S3 bucket where the source data is stored. Use the Browse S3 button to view S3 resources the active AWS account has access to, or enter the S3 location URL. The data source must be located in the same Region.
In File format settings, you can use the default settings to parse input data. You can also choose Advanced settings and choose CSV format parameters and select parameters to parse input data. For information about these parameters, see CSV format parameters.
Next, configure data model mapping using the Visual builder.

For Error logs S3 location in Error log report, choose the S3 location that will be used to report errors. For information about how to use this report, see Using batch load error reports.
For Encryption key type, choose one of the following:

Amazon S3-managed key (SSE-S3) – An encryption key that Amazon S3 creates, manages, and uses for you.
AWS KMS key (SSE-KMS) – An encryption key protected by AWS Key Management Service (AWS KMS).

Choose Next.
On the Review and create page, review the settings and edit as necessary.
Choose Create batch load task.
Check the batch load status.

If you encounter any issues, refer to Troubleshooting for common errors.

After the batch load task completes successfully, you can go to the Timestream query editor and query for results.

The data should appear similar to the following screenshot.

Clean up

When you’re done using this solution, delete the following resources to avoid ongoing charges:

RDS for PostgreSQL instance
S3 bucket
Timestream table
IAM policies and roles

Conclusion

In this post, we showed you how you can migrate time-series data from a relational database to Timestream using the batch load functionality. We encourage you to test and use the solution to visualize the data using Amazon QuickSight or Grafana dashboards to suit your business needs.

Share your experience in the comments section. We value your feedback!

About the authors

Sushant Deshmukh is a Database Consultant with AWS Professional Services. He works with AWS customers and partners to build highly available, scalable, and secure database architectures on AWS. He provides technical design and implementation expertise in running database workloads on AWS, and helps customers migrate and modernize their databases to the AWS Cloud. Outside of work, he enjoys traveling and exploring new places, playing volleyball, and spending time with his family and friends.

Aritra Biswas is a Cloud Support DBA with Amazon Web Services and Subject Matter Expert for AWS DMS. He has over a decade of experience in working with relational databases. At AWS, he works with service teams, Technical Account Managers, and Solutions Architects, and assists customers migrate database workloads to AWS. Outside of work, he enjoys playing racquetball and spending time with family and friends.

Ravi Teja Bellamkonda is a Cloud Support Engineer II with Amazon Web Services and Subject Matter Expert for Amazon RDS core systems. He’s passionate about helping customers solve their AWS issues and provides technical assistance to customers, guiding them to migrate, optimize, and navigate their journey in the AWS Cloud. Outside of work, he enjoys reading and spending time with family and friends.

Read MoreAWS Database Blog

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments