Friday, March 29, 2024
No menu items!
HomeDatabase ManagementBuild high-performance functions in Rust on Amazon RDS for PostgreSQL

Build high-performance functions in Rust on Amazon RDS for PostgreSQL

Amazon Relational Database Service (Amazon RDS) for PostgreSQL now supports trusted PL/Rust, allowing developers to safely build high-performance database functions in the Rust programming language. PL/Rust is an open-source project that lets you write Rust code that runs directly inside a PostgreSQL database, and provides support for PostgreSQL features such as running queries, writing trigger functions, and logging output. You can also use PL/Rust to build Trusted Language Extensions for PostgreSQL with performance benefits that are comparable to writing code in C without the risk of unsafe memory access. For more information on how PL/Rust works, see the PL/Rust Guide.

Stored procedures in databases help developers build code that can perform calculations on data and reduce latency to the application, apply follow-up actions on data modifications, and simplify data visualization. PostgreSQL gives developers choices in what programming languages you can use through its procedural language system, and the open-source community maintains a wiki page that lists available procedural languages. PostgreSQL designates languages as “trusted” or “untrusted,” where a “trusted” language is safe for an unprivileged user to use without risk of privilege escalation. The property of being a “trusted” programming language in PostgreSQL makes it possible to build Trusted Language Extensions for PostgreSQL, using the open-source pg_tle project, and safely run them in production systems.

In addition to PL/Rust, Amazon RDS for PostgreSQL supports several trusted programming languages, including PL/pgSQL, PL/Perl, PL/v8 (JavaScript), and PL/Tcl. Each programming language has benefits, and choosing a language may come down to what problem you’re solving or your familiarity with the programming language. For example, PL/pgSQL, PostgreSQL’s native procedural language, is convenient for writing trigger functions and multi-step procedures that may interface with built-in PostgreSQL functions. Also, all of these languages are interpreted, which means that there may be performance overhead when executing functions written in these languages.

In this post, we show how to deploy an RDS for PostgreSQL instance with PL/Rust enabled, and review several examples for how to write high-performance Rust code directly in your database. We also look at a use case that involves a lot of computations, and see how writing it in PL/Rust compares to PL/pgSQL and PL/v8 (JavaScript).

Prerequisites

To run the examples in this post, you’ll need to provision an RDS for PostgreSQL instance or Multi-AZ DB cluster running PostgreSQL 15.2 or higher. Additionally, you’ll need to add plrust to the shared_preload_libraries parameter into a DB parameter group and assign the DB parameter group to your database instance.

Using the AWS Command Line Interface (AWS CLI), you can create a DB parameter group that adds plrust to the shared_preload_libraries parameter:

REGION=”us-east-1″

aws rds create-db-parameter-group
–db-parameter-group-name pg15-plrust
–db-parameter-group-family postgres15
–description “Parameter group that contains PL/Rust settings for PostgreSQL 15”
–region “${REGION}”

aws rds modify-db-parameter-group
–db-parameter-group-name pg15-plrust
–parameters “ParameterName=’shared_preload_libraries’,ParameterValue=’plrust’,ApplyMethod=pending-reboot”
–region “${REGION}”

Note that if you modify the shared_preload_libraries parameter on an existing database instance, the changes will not take effect until the instance is rebooted. You can also modify the parameter group directly from the AWS Management Console. For more information, see Working with DB parameter groups.

If you use the psql client to access your database, you can optionally enable timing to see how long each step takes to run using the following command:

timing

Now that we have set up our RDS for PostgreSQL instance to use PL/Rust, let’s build and run a PL/Rust function.

Build a PL/Rust function

Before building a PL/Rust function, we must first check that PL/Rust is installed in the database:

CREATE EXTENSION IF NOT EXISTS plrust;

If the preceding command runs successfully, you will see the following output:

CREATE EXTENSION

Let’s create a function that sums up all of the values in a double precision or float8 array. You can do this with the following command:

CREATE OR REPLACE FUNCTION public.array_sum(x float8[])
RETURNS float8
LANGUAGE plrust
IMMUTABLE PARALLEL SAFE STRICT
AS $$
let sum: f64 = x.iter()
.map(|&xi| (xi.unwrap()))
.sum();

Ok(Some(sum))
$$;

This may take a few seconds to compile. On a db.m6i.xlarge RDS for PostgreSQL instance, this took about 3 seconds to compile. PL/Rust functions only need to compile when they are created or any modifications are made to the code. When the compilation is complete, a PL/Rust function is stored in an executable format and can be called from a SQL statement. For more information on how to write Rust code for PL/Rust, see PL/Rust Functions.

A PL/Rust function is run like any other PostgreSQL function. For example, the following code sums an array that contains the values 1, 2, and 3:

SELECT public.array_sum(ARRAY[1.0,2.0,3.0]);

This produces the following output:

array_sum
———–
6
(1 row)

We’ve now learned how to build and run a basic PL/Rust function. We don’t need this function for the remainder of the post. You can drop it using the following command:

DROP FUNCTION public.array_sum;

In the following sections, we review examples of PL/Rust functions that perform computations on PostgreSQL arrays.

Validate that a PostgreSQL array is a vector

PostgreSQL has built-in support for arrays of data types—in fact, PostgreSQL can store multi-dimensional arrays including matrices. One application of an array is a vector, which is a mathematical construct that has a magnitude (or size) and a direction (or angle). In this section, we show how we can use PL/Rust to write code that validates if a PostgreSQL float8[] array is a vector. If you have an application that requires additional vector processing, you can use the pgvector extension on Amazon RDS for PostgreSQL. See Building AI-powered search in PostgreSQL using Amazon SageMaker and pgvector for more information.

First, create a schema called vectors. This will be our workspace for creating functions and a table:

CREATE SCHEMA vectors;

We need to check the following conditions to ensure a PostgreSQL float8[] is a valid vector:

The array must contain only one column
The array must not contain NULL values
The array must have a magnitude greater than 0
If a table column for an array represents a vector, all arrays must have the same length (dimension)

PostgreSQL already has built-in functions written in C that check how many columns an array has and the length of an array. We can use PL/Rust to build functions that calculate the magnitude of the array and check if it contains any NULL values.

Let’s build a function in PL/Rust that can calculate the magnitude of a float8[] array:

CREATE OR REPLACE FUNCTION vectors.magnitude(x float8[])
RETURNS float8
LANGUAGE plrust
IMMUTABLE PARALLEL SAFE STRICT
AS $$
let mag: f64 = x.iter()
.map(|(&xi)| (xi.unwrap().powi(2)))
.sum();

Ok(Some(mag.sqrt()))
$$;

On a db.r6i.xlarge RDS for PostgreSQL instance, the preceding code took about 3 seconds to compile.

Now let’s build a function in PL/Rust that detects if there are any NULL values in the array:

CREATE OR REPLACE FUNCTION vectors.has_nulls(x float8[])
RETURNS bool
LANGUAGE plrust
IMMUTABLE PARALLEL SAFE STRICT
AS $$
let has_nulls: bool = x.iter()
.any(|&xi| xi.is_none());

Ok(Some(has_nulls))
$$;

On a db.r6i.xlarge RDS for PostgreSQL instance, the preceding code took about 3 seconds to compile.

Finally, we need to build a function that contains the four checks that determine if a PostgreSQL array is a valid vector. Currently, calling other stored functions from a PL/Rust function involves using the Server Programming Interface (SPI) in PostgreSQL, so for simplicity, we build the final vector validation using a SQL function. All the functions in the vector validation function are written either in C or Rust, so the performance overhead should be minimal. You can create a function that performs the vector validation using the following example code:

CREATE OR REPLACE FUNCTION vectors.validate_vector(x float8[], dim int)
RETURNS bool
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE STRICT
BEGIN ATOMIC
SELECT
pg_catalog.array_ndims(x) = 1 AND
pg_catalog.array_length(x, 1) = dim AND
NOT vectors.has_nulls(x) AND
vectors.magnitude(x) > 0;
END;

Let’s test out this function. The following code creates a table that stores embeddings of 768-dimensional vectors that has a check constraint calling the vectors.validate_vector function. For more information on vector embeddings, see Building AI-powered search in PostgreSQL using Amazon SageMaker and pgvector.

CREATE TABLE vectors.embeddings (
id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
embedding float8[768],
CHECK (vectors.validate_vector(embedding, 768))
);

Let’s test if the check constraint works. The following examples demonstrate different violations of the check constraint, including trying to insert arrays that don’t have the correct dimensions, have an incorrect number of columns, have no magnitude, or contain NULL values:

INSERT INTO vectors.embeddings (embedding) VALUES (ARRAY[1.0]);
ERROR: new row for relation “embeddings” violates check constraint “embeddings_embedding_check”
DETAIL: Failing row contains (1, {1}).

INSERT INTO vectors.embeddings (embedding) VALUES (ARRAY[ARRAY[1.0]]);
ERROR: new row for relation “embeddings” violates check constraint “embeddings_embedding_check”
DETAIL: Failing row contains (2, {{1}}).

INSERT INTO vectors.embeddings (embedding) VALUES (ARRAY[0]::float8[]);
ERROR: new row for relation “embeddings” violates check constraint “embeddings_embedding_check”
DETAIL: Failing row contains (3, {0}).

INSERT INTO vectors.embeddings (embedding) VALUES (ARRAY[NULL]::float8[]);
ERROR: new row for relation “embeddings” violates check constraint “embeddings_embedding_check”
DETAIL: Failing row contains (4, {NULL}).

Now let’s create an array that passes the vectors.validate_vector check constraint on the vectors.embeddings table. It’s tedious to create a 768-dimensional vector manually, so we write a function that can generate random vectors. For this example, we write the function in SQL because Amazon RDS for PostgreSQL doesn’t currently have the Rust random number generation crate as it contains unsafe code blocks. You can create a function that generates random vectors using the following code:

CREATE OR REPLACE FUNCTION vectors.generate_random_vector(dim int)
RETURNS float8[]
LANGUAGE SQL
BEGIN ATOMIC
SELECT pg_catalog.array_agg(pg_catalog.random())::float8[]
FROM generate_series(1, $1);
END;

After you create the function, try inserting a valid vector into the vectors.embeddings table:

INSERT INTO vectors.embeddings (embedding)
SELECT vectors.generate_random_vector(768)
RETURNING id, vectors.magnitude(embedding);

The display for a 768-dimensional vector in the PostgreSQL terminal can be challenging to visualize, so the preceding example returns the calculated magnitude of the generated vector. If you run the command successfully, you’ll see something similar to the following output:

id | magnitude
—-+——————-
7 | 15.97756057918017

Before moving on to the example in the next section, we recommend removing all the data from the vectors.embeddings table:

TRUNCATE TABLE vectors.embeddings;

Build normalized vectors from an array in PL/Rust, PL/pgSQL, and PL/v8 (JavaScript)

A common operation when working with vector data is to normalize it so that all vectors have the same magnitude, specifically a magnitude of one. This is done by dividing each value in a vector by its magnitude. Normalizing vectors can simplify operations such as determining the distance between two vectors or building indexes. In this section, we write functions in PL/Rust, PL/pgSQL, and PL/v8 (JavaScript) that normalize vectors and see how they compare in relative performance.

First, we need to populate the vectors.embeddings table with data. The following query generates 100,000 random 768-dimensional vectors and inserts them into the vectors.embeddings table:

INSERT INTO vectors.embeddings (embedding)
SELECT vectors.generate_random_vector(768)
FROM pg_catalog.generate_series(1,100000);

On a db.r6i.xlarge instance, the preceding query took about 30 seconds to complete.

Let’s create functions that normalize vectors. The functions assume that the PostgreSQL array already represents a valid vector. The following is a function written in PL/Rust that normalizes a vector:

CREATE OR REPLACE FUNCTION vectors.normalize_vector_plrust(x float8[])
RETURNS float8[]
LANGUAGE plrust
IMMUTABLE PARALLEL SAFE STRICT
AS $$
let mut mag: f64 = x.iter()
.map(|&xi| (xi.unwrap().powi(2)))
.sum();
mag = mag.sqrt();

let norm: Vec<Option<f64>> = x.iter()
.map(|&xi| Some(xi.unwrap() / mag))
.collect();

Ok(Some(norm))
$$;

The following function is written in PL/pgSQL and also normalizes a vector:

CREATE OR REPLACE FUNCTION vectors.normalize_vector_plpgsql(x float8[])
RETURNS float8[]
LANGUAGE plpgsql
IMMUTABLE PARALLEL SAFE STRICT
AS $$
DECLARE
i int;
n int;
mag float8 := 0.0;
u float8[];
BEGIN
n := pg_catalog.array_length(x, 1);
FOR i IN 1..n LOOP
mag := mag + (x[i] * x[i]);
END LOOP;

mag := pg_catalog.sqrt(mag);

FOR i IN 1..n LOOP
u[i] := x[i] / mag;
END LOOP;

RETURN u;
END
$$;

Finally, the following example is a function written in PL/v8 (JavaScript) that normalizes the vector. The example includes a CREATE EXTENSION call to load plv8 into your database if you haven’t already done so:

CREATE EXTENSION IF NOT EXISTS plv8;
CREATE OR REPLACE FUNCTION vectors.normalize_vector_plv8(x plv8_float8array)
RETURNS float8[]
LANGUAGE plv8
IMMUTABLE PARALLEL SAFE STRICT
AS $$
let n = x.length;
let mag = 0.0;
let u = new Array(n);

for (i = 0; i < n; i++) {
mag += x[i] * x[i];
}

mag = Math.sqrt(mag);

for (i = 0; i < n; i++) {
u[i] = x[i] / mag;
}

return u;
$$;

Let’s see how all of these functions compare in terms of performance. For this example, we run each normalization function over the entire table:

SELECT vectors.normalize_vector_plrust(embedding)
FROM vectors.embeddings;

First, we step through the code to generate the runtimes, then we summarize the results in a table. Note that these results are directional and you should perform your own benchmarking for your workloads.

Let’s try the PL/Rust function. Run the following code to see the runtime of the PL/Rust function:

EXPLAIN ANALYZE
SELECT vectors.normalize_vector_plrust(embedding)
FROM vectors.embeddings;

On a db.r6i.xlarge RDS for PostgreSQL instance, you may see results similar to the following:

Seq Scan on embeddings (cost=0.00..26637.00 rows=100000 width=32) (actual time=0.091..3126.124 ro
ws=100000 loops=1)
Planning Time: 0.033 ms
Execution Time: 3133.119 ms

Next, let’s try the PL/pgSQL function:

EXPLAIN ANALYZE
SELECT vectors.normalize_vector_plpgsql(embedding)
FROM vectors.embeddings;

On a db.r6i.xlarge RDS for PostgreSQL instance, you may see results similar to the following:

Seq Scan on embeddings (cost=0.00..26637.00 rows=100000 width=32) (actual time=0.198..13804.672 r
ows=100000 loops=1)
Planning Time: 0.034 ms
Execution Time: 13811.686 ms

Finally, let’s try the PL/v8 function:

EXPLAIN ANALYZE
SELECT vectors.normalize_vector_plv8(embedding)
FROM vectors.embeddings;

On a db.r6i.xlarge RDS for PostgreSQL instance, you may see results similar to the following:

Seq Scan on embeddings (cost=0.00..26637.00 rows=100000 width=32) (actual time=0.658..53543.052 r
ows=100000 loops=1)
Planning Time: 0.034 ms
Execution Time: 53551.793 ms

The following table shows a comparison of all the directional runtimes and the relative overhead of the interpreted languages vs. PL/Rust.

Language
Runtime (seconds)
Performance reduction vs. PL/Rust (x)
PL/Rust
3.13

PL/pgSQL
13.81
4.41x
PL/v8 (JavaScript)
53.55
17.11x

Although this testing is directional, we can see that PL/Rust has performance benefits for performing many computations over other trusted procedural languages.

Clean up

You can clean up all the code from the examples by dropping the vectors schema:

DROP SCHEMA vectors CASCADE;

You can delete your RDS for PostgreSQL instance if you no longer need to use it.

Conclusion

Trusted PL/Rust gives developers a high-performance option for writing stored procedures on Amazon RDS for PostgreSQL. Although in some use cases it may be simple or convenient to solve a problem with a stored procedure language like PL/pgSQL, builders who require extra performance can use PL/Rust to achieve their goals. PL/Rust also gives developers another language option for writing Trusted Language Extensions for PostgreSQL, allowing Rust-based extensions to be installed directly from a SQL interface.

Amazon RDS for PostgreSQL will continue to add more trusted PL/Rust features, such as crates. If there’s a crate you’d like to use for your application, reach out to [email protected] and let us know your use case and what crates you’d like to build with!

We invite you to leave feedback in the comments section in this post.

About the authors

Jonathan Katz is a Principal Product Manager – Technical on the Amazon RDS team and is based in New York. He is a Core Team member of the open source PostgreSQL project and an active open source contributor.

Read MoreAWS Database Blog

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments