
Manage case-insensitive data in PostgreSQL

It’s important to be aware of the case sensitivity of text data in PostgreSQL when performing queries or comparisons to ensure that the results match your intended expectations. Case also affects how the database performs sorting operations. By default, PostgreSQL is case sensitive when sorting or comparing string values. For example, PostgreSQL considers “amazon” and “Amazon” to be different values, and this impacts how they are sorted.

If you want to search for a specific value regardless of its case, you need to use a case-insensitive comparison function or normalize the data to a consistent case. Case insensitivity in PostgreSQL allows for greater flexibility and ease of use when working with data.

In this post, we discuss different options to deal with case-insensitive data in PostgreSQL, along with their use cases, pros, and cons.

Solution overview

Case insensitivity in PostgreSQL can improve productivity, reduce errors, and make it easier to work with data and identifiers in a flexible and consistent way. However, there are some performance tradeoffs to consider. PostgreSQL provides different options to achieve case insensitivity in data. We discuss the following options in this post:

Using case-insensitive comparison functions
Using ILIKE or ~~* operators for comparison
Using the CITEXT data type
Using a custom = (equal to) operator
Using a case-insensitive ICU collation

Prerequisites

If you want to follow along with this post’s examples, complete the following prerequisite steps:

Create an Amazon Aurora PostgreSQL-Compatible Edition cluster or an Amazon Relational Database Service (Amazon RDS) for PostgreSQL instance if you don’t already have one. For instructions, refer to Create an Aurora PostgreSQL DB cluster or Create a PostgreSQL DB instance, respectively.
Create an Amazon Elastic Compute Cloud (Amazon EC2) instance to install the PostgreSQL client to access the Aurora PostgreSQL or RDS for PostgreSQL instance. For instructions, refer to Create your EC2 resources and launch your EC2 instance. Alternatively, you can set up connectivity between your RDS database and EC2 compute instance in one click.
Install the PostgreSQL client. On Amazon Linux, you can use the following command to install the psql command line tool:

sudo yum install postgresql15*
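
After installation, you can connect to your Aurora or RDS for PostgreSQL instance with psql. The following is a sketch; replace the endpoint, user, and database name with the values for your own instance:

psql -h <cluster-endpoint> -p 5432 -U <master_user> -d postgres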

Option 1: Using case-insensitive comparison functions

In this option, we discuss the PostgreSQL built-in comparison functions LOWER and UPPER. These functions allow you to convert a string to lowercase or uppercase at runtime, respectively.

The lower function takes a string as an argument and returns a new string with all the characters in lowercase. For example:

postgres=# SELECT pg_catalog.lower('Print IN LOWER');
     lower
----------------
 print in lower
(1 row)

The upper function, on the other hand, takes a string as an argument and returns a new string with all the characters in uppercase. For example:

postgres=# SELECT pg_catalog.upper('Print in upper');
     upper
----------------
 PRINT IN UPPER
(1 row)

This is a straightforward way of dealing with case insensitivity in PostgreSQL. When performing a search on a text column, you can use the lower or upper function to convert the search term to lowercase or uppercase and then search for that term in the lowercased or uppercased text column. This ensures that the search is case insensitive, so you can find matches regardless of the case of the text data. Also, when importing data from external sources, it’s common to use the lower and upper functions to standardize the case of text data. This helps ensure that all text data is in the same case format, making it easier to compare and manipulate.
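
For example, the following minimal sketch shows a case-insensitive lookup; the customers table and its email column are hypothetical and only illustrate the pattern:

-- normalize both sides of the comparison to the same case
SELECT *
FROM customers
WHERE pg_catalog.lower(email) = pg_catalog.lower('Foo.Bar@Example.com');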

Benefits

This option offers built-in functions and is simple to use.

Drawbacks

If you want to use an index for case-insensitive searches or comparisons, you need to create a specialized index (such as an expression index) or use a specialized data type such as citext instead of relying on a standard text index. Additionally, PostgreSQL can only use an expression index when the query contains the same expression, so you would need another index for case-sensitive searches. Complete the following steps:

Create a test table:

CREATE TABLE public.case_test_opt1 (
id SERIAL PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
email TEXT NOT NULL
);

Insert some sample data. The following query inserts 1 million records:

INSERT INTO public.case_test_opt1 (first_name, last_name, email)
SELECT 'Foo', 'Bar' || n::text, 'foo.bar' || n::text || '@example.com' FROM generate_series(1, 1000000) n;

Create a regular index on the email column:

CREATE INDEX case_opt1_email_idx
ON public.case_test_opt1 (email);

Run the following query, which doesn’t use the index created in the previous step because it’s a normal (non-expression) index. You can see the full table scan (Seq Scan) in the plan:

postgres=> EXPLAIN ANALYZE
SELECT
    *
FROM
    public.case_test_opt1
WHERE
    pg_catalog.UPPER(email) = '[email protected]';

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16084.00 rows=5000 width=37) (actual time=0.428..171.612 rows=1000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on case_test_opt1  (cost=0.00..14584.00 rows=2083 width=37) (actual time=0.237..163.207 rows=333 loops=3)
         Filter: (upper(email) = '[email protected]'::text)
         Rows Removed by Filter: 333000
 Planning Time: 0.082 ms
 Execution Time: 171.681 ms
(8 rows)

Create a functional index on the uppercased email column:

CREATE INDEX case_opt1_email_upper_idx
ON public.case_test_opt1 (pg_catalog.UPPER(email));

Run the same query to see if the functional index is utilized:

postgres=> EXPLAIN ANALYZE
SELECT
    *
FROM
    public.case_test_opt1
WHERE
    pg_catalog.UPPER(email) = '[email protected]';

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on case_test_opt1  (cost=139.18..7761.06 rows=5000 width=37) (actual time=0.331..1.314 rows=1000 loops=1)
   Recheck Cond: (upper(email) = '[email protected]'::text)
   Heap Blocks: exact=1000
   ->  Bitmap Index Scan on case_opt1_email_upper_idx  (cost=0.00..137.93 rows=5000 width=0) (actual time=0.200..0.200 rows=1000 loops=1)
         Index Cond: (upper(email) = '[email protected]'::text)
 Planning Time: 0.149 ms
 Execution Time: 1.375 ms
(7 rows)

Option 2: Using ILIKE or ~~* operators for comparison

In this option, we discuss PostgreSQL built-in string comparison operators ILIKE and ~~*. They are equivalent in functionality, but ILIKE is the standard SQL-compliant operator whereas ~~* is a PostgreSQL-specific operator.

The ILIKE operator matches two strings, ignoring differences in letter case. It’s equivalent to the LIKE operator but case insensitive. For example, the expression ‘hello’ ILIKE ‘HelLO’ would evaluate to TRUE:

postgres=# SELECT 'hello' ILIKE 'HelLO';
 ?column?
----------
 t
(1 row)

The ~~* operator is the PostgreSQL-specific version of ILIKE, and performs a case-insensitive string comparison similar to ILIKE. For example, the expression ‘hello’ ~~* ‘HelLO’ would evaluate to TRUE:

postgres=# SELECT 'hello' ~~* 'HelLO';
 ?column?
----------
 t
(1 row)

When accepting user input in a search form or filter, you can use the ILIKE or ~~* operator to search for matching records in a case-insensitive manner. This ensures that users can find the records they’re looking for regardless of the case of the input. You can also use these operators when searching for specific text patterns in a large dataset. This is useful for scenarios such as finding all the product names that contain a certain keyword, regardless of the case of the keyword.
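
For example, the following sketch assumes a hypothetical products table with a name column and a user-supplied keyword:

-- find all products whose name contains the keyword, ignoring case
SELECT *
FROM products
WHERE name ILIKE '%echo%';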

Benefits

This option offers built-in operators and is simple to use.

Drawbacks

You cannot use B-tree indexes, even expression indexes, with the ILIKE or ~~* operators. However, you can use PostgreSQL’s GIN or GiST index types to build indexes for these operators. More details on GIN and GiST indexes, including their differences, can be found in the Preferred Index Types for Text Search documentation. Keep the following in mind:

You must use the pg_trgm extension. This extension is available in Amazon Aurora and Amazon RDS, but it is not enabled by default. You can install it in your database using CREATE EXTENSION pg_trgm;
GIN indexes can be significantly larger than B-tree indexes, especially for data types with many distinct values or long text fields. This can impact the speed of index updates and maintenance. Creating a GIN index, as well as inserts and updates, can take longer than when using a B-tree index.
GIN indexes are optimized for certain query patterns, such as searching for multiple values or ranges within the indexed data. However, they may not be as effective for other types of queries, such as equality or range queries on a single column.

Using the same table that we created earlier (case_test_opt1), we can create a GIN index and compare its creation time with the normal and functional (B-tree) indexes from the previous example:

CREATE INDEX case_opt1_email_tgrm_idx
ON public.case_test_opt1 using GIN (email gin_trgm_ops);

For this data set, the GIN index took almost three times longer to create than the B-tree indexes from the previous example.

The following query shows that PostgreSQL scans the GIN index when using the ILIKE operator:

postgres=> EXPLAIN ANALYZE
SELECT
    *
FROM
    public.case_test_opt1
WHERE
    email ILIKE 'FOO.BAR500@EXAMPLE%';

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on case_test_opt1  (cost=344.77..709.64 rows=99 width=37) (actual time=84.869..86.652 rows=1000 loops=1)
   Recheck Cond: (email ~~* 'FOO.BAR500@EXAMPLE%'::text)
   Heap Blocks: exact=1000
   ->  Bitmap Index Scan on case_opt1_email_tgrm_idx  (cost=0.00..344.74 rows=99 width=0) (actual time=84.740..84.740 rows=1000 loops=1)
         Index Cond: (email ~~* 'FOO.BAR500@EXAMPLE%'::text)
 Planning Time: 0.256 ms
 Execution Time: 86.729 ms
(7 rows)

Clean up after you have completed the testing:

DROP TABLE public.case_test_opt1;

Option 3: Using the CITEXT data type

citext is a case-insensitive text data type in PostgreSQL. It’s an extension module that provides a case-insensitive version of the built-in text data type. When citext is used for a column, values are stored exactly as entered, but comparisons and queries on that column are case insensitive (internally, values are compared as if both sides had been lowercased).

To use citext, you must first enable the citext extension in your database using the following command:

CREATE EXTENSION IF NOT EXISTS citext;

After that, you can create a table with a citext column using the following statement:

CREATE TABLE public.case_citext_opt3 (
id INT PRIMARY KEY,
name CITEXT
);

If you’re migrating data from a system that uses case-insensitive text types, using citext can simplify the migration process by allowing you to maintain the same data format. citext can help prevent data entry errors or inconsistencies in your data. With citext, you don’t need to use lower and upper functions in your queries to perform case-insensitive searches or comparisons. This can make your queries simpler and easier to read.
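
For example, the following minimal sketch uses the case_citext_opt3 table created above; any casing of the search term matches the stored value:

INSERT INTO public.case_citext_opt3 (id, name) VALUES (1, 'Amazon');

-- both queries return the row, with no lower or upper needed
SELECT * FROM public.case_citext_opt3 WHERE name = 'amazon';
SELECT * FROM public.case_citext_opt3 WHERE name = 'AMAZON';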

Benefits

This option offers a built-in extension with the citext data type. Additionally, migration from other databases with case-insensitive data types is straightforward.

Drawbacks

A drawback to this option is that case-sensitive searches on a citext column require explicitly casting the column to text. Also, if a query has a citext value on one side of an operator and a text value on the other, it might not behave as expected, so be careful with such comparisons.
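
The following sketch illustrates the cast-based workaround using the same table: the first query matches regardless of case, whereas the second matches only the exact stored casing:

-- case-insensitive comparison (default behavior of the citext column)
SELECT * FROM public.case_citext_opt3 WHERE name = 'AMAZON';

-- casting the column to text forces a case-sensitive comparison
SELECT * FROM public.case_citext_opt3 WHERE name::text = 'AMAZON';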

Option 4: Using a custom = operator

PostgreSQL allows users to define custom operators to extend the built-in set of operators. Custom operators can be defined to perform specific operations that aren’t supported by the built-in operators, or to provide shorthand notations for frequently used operations. In this option, we show how to create a new = operator to deal with the case-insensitive data. Complete the following steps:

Create a new schema to define the operator and grant necessary privileges on the schema to public:

CREATE SCHEMA util;
GRANT USAGE ON SCHEMA util TO PUBLIC;

Create a function to define the behavior of the operator. This function converts both the left and right operands to lowercase before comparing them, so the comparison is case insensitive without the query having to call built-in functions such as lower or upper:

CREATE OR REPLACE FUNCTION util.case_ins_cmp(text, text)
RETURNS BOOLEAN LANGUAGE sql immutable as $$
select pg_catalog.lower($1) operator(pg_catalog.=) pg_catalog.lower($2)
$$;

Grant the necessary privileges on the function to public:

GRANT EXECUTE ON FUNCTION util.case_ins_cmp(text,text) TO PUBLIC;

Create the operator using the new function:

CREATE OPERATOR util.= (
LEFTARG = TEXT,
RIGHTARG = TEXT,
PROCEDURE = util.case_ins_cmp);

Test the new operator, for example:

CREATE TABLE public.case_test_opt4 (
id SERIAL PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
email TEXT NOT NULL
);
INSERT INTO public.case_test_opt4 (first_name, last_name, email) VALUES
('foo', 'bar', '[email protected]'),
('foo', 'BAR', '[email protected]'),
('foo', 'BaR', '[email protected]');

Query the data with last_name and review the results:

postgres=# SELECT * FROM public.case_test_opt4 WHERE last_name='bar';
 id | first_name | last_name |        email
----+------------+-----------+---------------------
  1 | foo        | bar       | [email protected]
(1 row)

Only one row is returned because the comparison is case sensitive.

Set the search_path to use the util schema before pg_catalog and check the results:

postgres=# SET search_path TO util, pg_catalog;
SET
postgres=# SELECT * FROM public.case_test_opt4 WHERE last_name='bar';
 id | first_name | last_name |        email
----+------------+-----------+---------------------
  1 | foo        | bar       | [email protected]
  2 | foo        | BAR       | [email protected]
  3 | foo        | BaR       | [email protected]
(3 rows)

You can see all three rows regardless of case.

Alternatively, you can qualify the operator with its schema instead of setting the search_path:

postgres=# SELECT * FROM public.case_test_opt4 WHERE last_name operator(util.=) 'bar';
 id | first_name | last_name |        email
----+------------+-----------+---------------------
  1 | foo        | bar       | [email protected]
  2 | foo        | BAR       | [email protected]
  3 | foo        | BaR       | [email protected]
(3 rows)

You can use an expression index if required:

postgres=# CREATE INDEX ind_opt4
ON public.case_test_opt4(pg_catalog.lower(last_name));
CREATE INDEX
postgres=# ANALYZE case_test_opt4;
ANALYZE
postgres=# EXPLAIN ANALYZE
SELECT
    *
FROM
    public.case_test_opt4
WHERE
    last_name OPERATOR(util.=) 'bar';

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using ind_opt4 on case_test_opt4  (cost=0.42..8.44 rows=1 width=40) (actual time=0.071..0.072 rows=3 loops=1)
   Index Cond: (lower(last_name) = 'bar'::text)
 Planning Time: 0.322 ms
 Execution Time: 0.084 ms
(4 rows)

Benefits

One benefit of this option is that you might not need to change the application queries to add built-in functions like lower and upper or operators like ~~* or ILIKE, if you add the schema in which the operator is created to the search_path (for example, by setting it at the database level).

Additionally, you can use the search_path as a switch: if you want only part of your application to behave case insensitively, you can enable the operator for just those sessions or roles without changing the application, as shown in the following sketch.
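
For example, the following sketch scopes the behavior to a single session or to a hypothetical application role (app_reporting) instead of the whole database:

-- enable the case-insensitive operator for the current session only
SET search_path TO util, pg_catalog;

-- or persist it for a specific role (app_reporting is a hypothetical role name)
ALTER ROLE app_reporting SET search_path TO util, pg_catalog;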

Drawbacks

You need to add the search_path (with the util schema before pg_catalog) to the application queries or alter the database to set the search_path:

ALTER DATABASE <db_name> SET search_path TO util, pg_catalog;

In general, custom operators can be a powerful tool for developers and DBAs to extend the functionality of PostgreSQL and make it simpler to work with some operations. However, it’s important to use them judiciously and follow best practices for naming, documentation, and testing to ensure that they are consistent and efficient.

Clean up after you have completed the testing:

DROP TABLE public.case_test_opt4;
DROP OPERATOR util.=(text,text);
DROP FUNCTION util.case_ins_cmp;

Option 5: Using a case-insensitive ICU collation

Custom collations are a feature of the ICU collation provider. ICU allows extensive control over collation behavior by defining new collations with collation settings as part of the language tag. These settings can modify the collation order to suit a variety of needs. For more information on collations, see Manage collation changes in PostgreSQL on Amazon Aurora and Amazon RDS.

The following is a test case that uses a custom collation to work with case-insensitive data:

Create a collation using the following command:

CREATE COLLATION public.case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);

Create a table and insert the data:

CREATE TABLE public.case_test_opt5 (
id SERIAL PRIMARY KEY,
first_name TEXT NOT NULL COLLATE case_insensitive,
last_name TEXT NOT NULL COLLATE case_insensitive,
email TEXT NOT NULL COLLATE case_insensitive
);

INSERT INTO public.case_test_opt5 (first_name, last_name, email) VALUES ('foo', 'bar', '[email protected]'), ('foo', 'BAR', '[email protected]'), ('foo', 'BaR', '[email protected]');

Select the data and check the results:

postgres=> SELECT * FROM public.case_test_opt5 WHERE last_name='bar';
 id | first_name | last_name |        email
----+------------+-----------+---------------------
  1 | foo        | bar       | [email protected]
  2 | foo        | BAR       | [email protected]
  3 | foo        | BaR       | [email protected]
(3 rows)

You can see all three rows regardless of case.

Benefits

Like Option 4, you don’t need to change the application queries to add any built-in functions like lower or upper or add any operators like ~~* or ILIKE. Also, you don’t need to set the search_path.

Drawbacks

The major drawback is that pattern matching does not work with nondeterministic collations, which is what case-insensitive ICU collations are (an example follows the sorting discussion below). As of PostgreSQL 15, indexing support for nondeterministic collations is also limited. In addition, the sort order of your data can change based on the collation you use, so you have to account for sorting whenever you change the collation. For example, the following query returns the last_name values in the order bar, BAR, BaR:

postgres=> SELECT * FROM public.case_test_opt5 WHERE last_name='bar';
 id | first_name | last_name |        email
----+------------+-----------+---------------------
  1 | foo        | bar       | [email protected]
  2 | foo        | BAR       | [email protected]
  3 | foo        | BaR       | [email protected]
(3 rows)

If we modify the column’s collation to en_US and order the rows by that column, they are returned in the order bar, BaR, BAR:

postgres=> ALTER TABLE public.case_test_opt5 ALTER COLUMN last_name TYPE text COLLATE "en_US";
ALTER TABLE
Time: 57.316 ms
postgres=> SELECT * FROM public.case_test_opt5 ORDER BY last_name;
 id | first_name | last_name |        email
----+------------+-----------+---------------------
  1 | foo        | bar       | [email protected]
  3 | foo        | BaR       | [email protected]
  2 | foo        | BAR       | [email protected]
(3 rows)

You can see that changing the collation changed the order of the rows.
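
As noted in the drawbacks, pattern matching doesn’t work against columns that use a nondeterministic collation. The following sketch queries the first_name column, which still uses the case_insensitive collation; on PostgreSQL 15, the query fails with an error similar to the following:

postgres=> SELECT * FROM public.case_test_opt5 WHERE first_name LIKE 'fo%';
ERROR:  nondeterministic collations are not supported for LIKE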

Although custom collations in PostgreSQL can provide many benefits, it’s important to carefully consider the potential drawbacks before implementing them in a production environment. Proper planning, testing, and ongoing maintenance can help ensure that custom collations work effectively and don’t negatively impact database performance or compatibility.

Clean up after you have completed the testing:

DROP TABLE public.case_test_opt5;
DROP COLLATION public.case_insensitive;

Conclusion

PostgreSQL provides several options for dealing with case insensitivity, each with its own advantages and disadvantages. The best approach depends on the specific requirements of your application and the data being stored in the database. In this post, we discussed a few options with their pros and cons. Based on your requirements, you can choose the right option for your application.

If you have any questions or suggestions about this post, leave a comment.

About the authors

Baji Shaik is a Sr. Lead Consultant with AWS ProServe, Global Competency Centre. His background spans a wide depth and breadth of expertise and experience in SQL and NoSQL database technologies. He is a Database Migration Expert and has developed many successful database solutions addressing challenging business requirements for moving databases from on premises to Amazon RDS and Amazon Aurora PostgreSQL/MySQL. He is an eminent author, having written several books on PostgreSQL. A few of his recent works include “PostgreSQL Configuration,” “Beginning PostgreSQL on the Cloud,” and “PostgreSQL Development Essentials.” Furthermore, he has delivered several conference and workshop sessions.

Scott Mead is a Database Engineer at AWS.
