Sunday, September 8, 2024
No menu items!
HomeDatabase ManagementJoining historical data between Amazon Athena and Amazon RDS for PostgreSQL

Joining historical data between Amazon Athena and Amazon RDS for PostgreSQL

While databases are used to store and retrieve data, there are situations where applications should archive or purge the data to reduce storage costs or improve performance. However, there are often business requirements where an application must query both active data and archived data simultaneously. Developers need a solution that lets them benefit from using cheaper storage for archived data, but they can still use SQL queries to join data between the active and archival systems.

In this post, we walk through how to move archived data from Amazon Relational Database Service (Amazon RDS) for PostgreSQL to Amazon Simple Storage Service (Amazon S3), fetch historical (archived) data from S3, and use SQL’s to write queries that join data between Amazon RDS for PostgreSQL and Amazon Athena.

Solution overview

In the following diagram, the client is an Amazon Elastic Compute Cloud (Amazon EC2) instance where you can install your application, or install the utilities like psql client, AWS Command Line Interface (AWS CLI), and others.

For this solution, we’ll upload historical data to Amazon S3 in the form of CSV files. However, you can use other formats as well.

A user queries the historical data from Amazon S3 via the AWS Glue Data Catalog, using Amazon S3 as the data source. The diagram also shows querying the current data from Amazon RDS for PostgreSQL using AWS Lambda. This solution helps set up the pipeline to join historical data with current data in the system using Amazon Athena.

To build out this solution, this blog post walks you through the steps listed below:

Create partitioned and standalone (non-partitioned) tables in the RDS for PostgreSQL instance.
Create a directory structure inside the S3 bucket to store the historical data from the partitioned and standalone tables.
Archive the table’s data in CSV files and upload the CSV files to the S3 bucket in the respective table directories.
Specify the location for resultset. This is set for our Athena workgroup, and is required before running any query.
Create a database in the AWS Glue Data Catalog. This database enables you to query historical data stored in Amazon S3.
In the Data Catalog database, create table structures (partitioned and non-partitioned) from the actual database whose historical data you want to put in Amazon S3. Specify the input dataset location as the S3 bucket where the CSV files are stored.
To join historical data with the RDS for PostgreSQL instance current tables, we use Athena federated queries.
Test joining the historical data with RDS for PostgreSQL tables using Athena.

Prerequisites

You should have the following prerequisites in place:

An AWS account with an Amazon RDS instance running. For this post, we use Amazon RDS for PostgreSQL.
An EC2 instance with psql installed to connect to Amazon RDS for PostgreSQL.
The AWS CLI installed on the EC2 instance.
User credentials (access key and secret access key) configured on the EC2 instance by running aws configure.
An S3 bucket where you store the historical data. For this post, our bucket is called historicalbucket. You can use your unique bucket.
A basic understanding of Amazon Athena federated queries.

Create partitioned and standalone tables in the RDS for PostgreSQL instance

Table partitioning is the technique of distributing data across multiple partitions that in turn improves query performance and data manageability.

Table creation and sample data insertion is needed to demonstrate the solution.

For this post, we create tables under the test schema of a database named blog. Complete the following steps to create tables and insert sample data.

To establish connectivity from the application running on an EC2 instance to an RDS for PostgreSQL instance, you can use the 1-click setup from the AWS console.

Use the following connection string to connect to the database:

psql –host=<host endpoint> –port=<port> –username=<username> –password –dbname=<dbname>

Once connected, validate if you are into the desired database as below:

conninfo

You are connected to database blog as user test

dn

List of schemas
Name | Owner
———-+———-
public | postgres
test | test

We’ll create tables on blog database under test schema.

Create a table named wallpaper_partitioned with the following code:

CREATE TABLE wallpaper_partitioned (
w_id integer,
name CHAR,
colour CHAR,
created timestamp(3) without time zone DEFAULT now() NOT NULL
) partition by range(created);

Add the partitions:

create table wallpaper_partitioned_20220601 PARTITION OF wallpaper_partitioned FOR VALUES FROM (‘2022-06-01’) TO (‘2022-07-01’);
create table wallpaper_partitioned_20220701 PARTITION OF wallpaper_partitioned FOR VALUES FROM (‘2022-07-01’) TO (‘2022-08-01’);
create table wallpaper_partitioned_20220801 PARTITION OF wallpaper_partitioned FOR VALUES FROM (‘2022-08-01’) TO (‘2022-09-01’);
create table wallpaper_partitioned_20220901 PARTITION OF wallpaper_partitioned FOR VALUES FROM (‘2022-09-01’) TO (‘2022-10-01’);
create table wallpaper_partitioned_20221001 PARTITION OF wallpaper_partitioned FOR VALUES FROM (‘2022-10-01’) TO (‘2022-11-01’);
create table wallpaper_partitioned_20221101 PARTITION OF wallpaper_partitioned FOR VALUES FROM (‘2022-11-01’) TO (‘2022-12-01’);

Insert some data:

insert into wallpaper_partitioned values (11,’l’,’m’,’2022-06-02 00:00:00′),
(12,’n’,’o’,’2022-07-02 00:00:00′),
(13,’p’,’q’,’2022-08-02 00:00:00′),
(14,’r’,’s’,’2022-09-02 00:00:00′),
(15,’t’,’u’,’2022-10-02 00:00:00′),
(16,’v’,’w’,’2022-11-02 00:00:00′);

Similarly, create a table named bricks_partitioned:

CREATE TABLE bricks_partitioned (
b_id integer,
colour CHAR,
size integer,
created timestamp(3) without time zone DEFAULT now() NOT NULL
) partition by range(created);

create table bricks_partitioned_20220601 PARTITION OF bricks_partitioned FOR VALUES FROM (‘2022-06-01’) TO (‘2022-07-01’);
create table bricks_partitioned_20220701 PARTITION OF bricks_partitioned FOR VALUES FROM (‘2022-07-01’) TO (‘2022-08-01’);
create table bricks_partitioned_20220801 PARTITION OF bricks_partitioned FOR VALUES FROM (‘2022-08-01’) TO (‘2022-09-01’);
create table bricks_partitioned_20220901 PARTITION OF bricks_partitioned FOR VALUES FROM (‘2022-09-01’) TO (‘2022-10-01’);
create table bricks_partitioned_20221001 PARTITION OF bricks_partitioned FOR VALUES FROM (‘2022-10-01’) TO (‘2022-11-01’);
create table bricks_partitioned_20221101 PARTITION OF bricks_partitioned FOR VALUES FROM (‘2022-11-01’) TO (‘2022-12-01’);

Insert data as follows:

insert into bricks_partitioned values (21,’a’,22,’2022-06-02 00:00:00′),
(23,’b’,24,’2022-07-02 00:00:00′),
(25,’c’,26,’2022-08-02 00:00:00′),
(27,’d’,28,’2022-09-02 00:00:00′),
(29,’e’,30,’2022-10-02 00:00:00′),
(31,’f’,32,’2022-11-02 00:00:00′);

Likewise, create the standalone tables bricks and wallpaper:

create table bricks (b_id integer,colour CHAR,size integer);
create table wallpaper (w_id integer,name CHAR,colour CHAR);

Insert data as follows:

insert into wallpaper values ( 1, ‘M’, ‘p’ ),( 2, ‘C’, ‘b’ ),(3, ‘B’, ‘n’ );
insert into bricks values ( 2, ‘b’, 1 ),(3, ‘g’, ‘4’ ),( 4, ‘b’, ‘8’ );

Create a directory structure inside the S3 bucket to store the historical data from the RDS for PostgreSQL instance

Database tables have a data retention period, after which the archived partitions go to Amazon S3. In order to store the archived data in Amazon S3, we need to create the corresponding directory structure in the S3 bucket via an EC2 instance using the AWSCLI.

Complete the following steps:

Create the directory structure for the wallpaper_partitioned table with the following code:

/usr/local/bin/aws s3api put-object –bucket historicalbucket –key wallpaper_partitioned/2022_06/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key wallpaper_partitioned/2022_07/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key wallpaper_partitioned/2022_08/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key wallpaper_partitioned/2022_09/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key wallpaper_partitioned/2022_10/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key wallpaper_partitioned/2022_11/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key wallpaper_partitioned/2022_12/

Validate the corresponding directory structure in the S3 bucket via the Amazon S3 console.

Repeat the same steps for the bricks_partitioned table:

/usr/local/bin/aws s3api put-object –bucket historicalbucket –key bricks_partitioned/2022_06/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key bricks_partitioned/2022_07/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key bricks_partitioned/2022_08/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key bricks_partitioned/2022_09/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key bricks_partitioned/2022_10/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key bricks_partitioned/2022_11/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key bricks_partitioned/2022_12/

Validate the corresponding directory structure in the S3 bucket.

Create a similar S3 directory structure for the standalone tables (bricks and wallpaper):

/usr/local/bin/aws s3api put-object –bucket historicalbucket –key bricks/
/usr/local/bin/aws s3api put-object –bucket historicalbucket –key wallpaper/

Validate the S3 bucket directory structure.

Archive the table data in CSV files and upload the CSV files to the S3 bucket in the respective table directories

For this post, archiving the table data is done using CSV files. You can also use aws_s3 to upload the data directly from the database.

For this post, we want to archive data starting from June 2022 (202206) to August 2022 (202208) from the wallpaper_partitioned table and bricks_partitioned table.

Complete the following steps:

Connect to the database and archive the partitions on the client as follows:

copy wallpaper_partitioned_20220601 to ‘/home/ec2-user/wallpaper_partitioned_20220601.csv’ delimiter ‘,’CSV HEADER;
copy wallpaper_partitioned_20220701 to ‘/home/ec2-user/wallpaper_partitioned_20220701.csv’ delimiter ‘,’CSV HEADER;
copy wallpaper_partitioned_20220801 to ‘/home/ec2-user/wallpaper_partitioned_20220801.csv’ delimiter ‘,’CSV HEADER;

copy bricks_partitioned_20220601 to ‘/home/ec2-user/bricks_partitioned_20220601.csv’ delimiter ‘,’CSV HEADER;
copy bricks_partitioned_20220701 to ‘/home/ec2-user/bricks_partitioned_20220701.csv’ delimiter ‘,’CSV HEADER;
copy bricks_partitioned_20220801 to ‘/home/ec2-user/bricks_partitioned_20220801.csv’ delimiter ‘,’CSV HEADER;

Once archived, copy these partitions to the S3 bucket in the respective directories from the EC2 instance using the AWSCLI. Make sure to remove the CSV file headers before copying to S3:

/usr/local/bin/aws s3 cp wallpaper_partitioned_20220601.csv –region us-east-1 s3://historicalbucket/wallpaper_partitioned/2022_06/

/usr/local/bin/aws s3 cp wallpaper_partitioned_20220701.csv –region us-east-1 s3://historicalbucket/wallpaper_partitioned/2022_07/

/usr/local/bin/aws s3 cp wallpaper_partitioned_20220801.csv –region us-east-1 s3://historicalbucket/wallpaper_partitioned/2022_08/

/usr/local/bin/aws s3 cp bricks_partitioned_20220601.csv –region us-east-1 s3://historicalbucket/bricks_partitioned/2022_06/

/usr/local/bin/aws s3 cp bricks_partitioned_20220701.csv –region us-east-1 s3://historicalbucket/bricks_partitioned/2022_07/

/usr/local/bin/aws s3 cp bricks_partitioned_20220801.csv –region us-east-1 s3://historicalbucket/bricks_partitioned/2022_08/

Likewise, create the CSV files for standalone tables from the database and copy them to the EC2 instance:

copy wallpaper to ‘/home/ec2-user/wallpaper.csv’ delimiter ‘,’CSV HEADER;
copy bricks to ‘/home/ec2-user/bricks.csv’ delimiter ‘,’CSV HEADER;

From the EC2 instance, upload the CSV files to Amazon S3 using the AWS CLI(make sure to remove the file headers before uploading):

/usr/local/bin/aws s3 cp wallpaper.csv s3://historicalbucket/wallpaper/
/usr/local/bin/aws s3 cp bricks.csv s3://historicalbucket/bricks/

Specify the result set location

Result set location is set for Athena workgroup and it is needed before running any query.

Complete the following steps:

On the Athena console, choose Query editor in the navigation pane.
On the Editor tab, choose Edit settings.

Enter the Amazon S3 location where you want to store the query results. For this post, we call the subdirectory resultset.
Choose Save.

Create a database in the AWS Glue Data Catalog

In order to query the historical data from Amazon S3, you need to create a database in the AWS Glue Data Catalog. For this post, we create the database historical_data. Use the following code in Athena query editor:

create database historical_data;

Validate that the new database is showing on the Editor tab under Database.

Create table structures on the historical_data database

Complete the following steps to create table structures on the historical_data database:

On the Editor tab of the query editor, run the following DDL to create the table.

CREATE EXTERNAL TABLE `historical_data`.`wallpaper_partitioned` (`w_id` int, `colour` char(9))
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’
WITH SERDEPROPERTIES (‘field.delim’ = ‘,’)
STORED AS INPUTFORMAT ‘org.apache.hadoop.mapred.TextInputFormat’ OUTPUTFORMAT ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’
LOCATION ‘s3://historicalbucket/wallpaper_partitioned/’
TBLPROPERTIES (‘classification’ = ‘csv’);

After you create the table, add the partitions in the database historical_data:

ALTER TABLE wallpaper_partitioned ADD
PARTITION (year=’2022′,month=’06’) location ‘s3://historicalbucket/wallpaper_partitioned/2022_06/’
PARTITION (year=’2022′,month=’07’) location ‘s3://historicalbucket/wallpaper_partitioned/2022_07/’
PARTITION (year=’2022′,month=’08’) location ‘s3://historicalbucket/wallpaper_partitioned/2022_08/’
PARTITION (year=’2022′,month=’09’) location ‘s3://historicalbucket/wallpaper_partitioned/2022_09/’
PARTITION (year=’2022′,month=’10’) location ‘s3://historicalbucket/wallpaper_partitioned/2022_10/’
PARTITION (year=’2022′,month=’11’) location ‘s3://historicalbucket/wallpaper_partitioned/2022_11/’
PARTITION (year=’2022′,month=’12’) location ‘s3://historicalbucket/wallpaper_partitioned/2022_12/’;

Similarly, create a table structure for bricks_partitioned and add the corresponding partitions.

Now you follow similar steps to create the standalone tables.

On the Editor tab of the query editor, run the following DDL to create a table:

CREATE EXTERNAL TABLE `historical_data`.`wallpaper` (
`w_id` int,
`name` char(9),
`colour` char(9)
)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’
WITH SERDEPROPERTIES (‘field.delim’ = ‘,’)
STORED AS INPUTFORMAT ‘org.apache.hadoop.mapred.TextInputFormat’ OUTPUTFORMAT ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’
LOCATION ‘s3://historicalbucket/wallpaper/’
TBLPROPERTIES (‘classification’ = ‘csv’);

Follow similar steps for the bricks table structure.
Validate that all the table structures are reflected in the database historical_data.

Now we’re ready to query the historical data using Athena.

In the following example, we query the historical data for the wallpaper_partitioned table. You can this same method to query other tables as well.

Athena federated query capability

To join historical data with the RDS for PostgreSQL instance current tables, we will use Athena federated query capability that allows running SQL queries across data stored in relational, non-relational, object, and custom data sources.

Complete the following steps:

Create a data source and AWS Lambda function

The historical data can have SQL joins with the current data in the database. That’s where Athena federated queries come in. Athena runs federated queries using data source connectors that run on a Lambda function.

On the Athena console, choose Data sources in the navigation pane.
Choose Create data source.

For Choose a data source, search for PostgreSQL and choose Next.

For Data source name, provide a name for the data source (for example, rdbcatalog).
Under Connection details, choose Create Lambda function.

Enter your function details:
For Application name, keep the default AthenaPostgreSQLConnector.
For CompositeHandler, enter PostGreSqlMuxCompositeHandler.
For SecretNamePrefix, enter AthenaPostgreSQLFederation.
For SpillBucket, enter your S3 bucket name (for this post, historicalbucket).
For ConnectionString, follow the below format postgres://jdbc:postgresql://<RDSEndpoint>:port/<dbname>?user=<username>&password=<password>
For LambdaFunctionName, enter postgresqlathena.
For LambdaMemory and LambdaTimeout, keep the default values.
For SecurityGroupIds, enter the security group ID that is associated to the VPC ID corresponding to your subnet.
For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for this post, athena-spill).
For Subnetids, enter the corresponding subnet that the Lambda function can use to access your data source. For example: subnet1, subnet2.

Choose Deploy.If the Lambda function is running in a private subnet, setup an S3 Gateway Endpoint for Lambda to connect to Amazon S3. Note that, Lambda talks to Amazon S3 in order to query data.

Confirm the function exists on the Lambda console.

Return to the data source creation window and for Lambda function, choose the function you created.
Choose Next.

Choose Create data source.

Confirm if the data source was created on the Athena console.

On the Lambda console, navigate to the function you created.

Edit the environment variables.
Choose Add environment variable.
Add datasource_connection_string as the key and the JDBC string as the value.
Because the datasource used in this post is rdbcatalog, the corresponding key is rdbcatalog_connection_string
Choose Save.

Return to the Athena console and choose the data source.

You can see that database named test has been populated under the data source created above, test is actually the schema under which tables were created on the RDS for PostgreSQL instance.

Join the historical data with the current data in Amazon RDS

Now you can add the historical data joins with the current tables in the database. Use the following query for the partitioned tables:

SELECT * FROM
historical_data.wallpaper_partitioned
inner join rdbcatalog.test.bricks_partitioned on w_id=b_id;

Use the following query for the non-partitioned tables:

SELECT * FROM
historical_data.wallpaper
Inner join rdbcatalog.test.bricks on w_id=b_id;

We have seen how to query the historical data with current data in the system.

Cleanup

Delete the Amazon RDS instance, historical data in Amazon S3 and athena instance if it’s no longer needed.

Summary

In this post, we learned how to simultaneously query active data in Amazon RDS for PostgreSQL along with archived data from Amazon Athena. With this solution, you don’t need to store all your data on the active database system, it lets you offload historical data to lower cost storage. Archiving data can help improve the performance of queries using Amazon RDS for PostgreSQL instance while still letting you access your historical data using Amazon Athena.

Do you have follow-up questions or feedback? Leave a comment. We’d love to hear your thoughts and suggestions.

About the Author

Deepak Sharma is a Database Specialist with AWS focusing on Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL engines. He works with enterprise customers to provide technical assistance on PostgreSQL operational performance, help them run optimized workloads on AWS, and share database best practices.

Read MoreAWS Database Blog

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments