Perform fuzzy full-text search and semantic search on Amazon DocumentDB using Amazon OpenSearch Service

By mullaned2002

August 28, 2023

In this post, we show you how to integrate Amazon DocumentDB (with MongoDB compatibility) with Amazon OpenSearch Service using AWS Lambda integration and run full-text search, fuzzy search, and synonym search on an artificially generated reviews dataset.

Amazon DocumentDB is a fast, scalable, highly durable, and fully managed database service for operating mission-critical MongoDB API-compatible JSON-based workloads without having to worry about managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.

As your business evolves, new opportunities arise that require you to delve deeper into your data for better insights. For example, suppose you run a large ecommerce platform that uses Amazon DocumentDB to store product reviews as JSON documents. To enhance the customer experience, you can build functionality that helps customers find relevant product reviews based on their interests, matching not only the exact keywords they enter but also synonyms and semantically related terms.

OpenSearch Service is a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud. With OpenSearch Service, you can perform real-time search, full-text search, semantic search, fuzzy search, and other analyses on your data for use cases like recommendation engines, ecommerce sites, and much more.

Amazon DocumentDB change streams provide a time-ordered sequence of change events that occur within your Amazon DocumentDB cluster’s collections. Lambda recently launched an integration with Amazon DocumentDB change streams. With this launch, you can use Lambda functions to stream your Amazon DocumentDB data changes to the OpenSearch Service index and run fuzzy full-text search and semantic search queries. For more information, see Using Lambda with Amazon DocumentDB.

Solution overview

This solution involves the following high-level steps:

Deploy an AWS CloudFormation template to create the following resources:

A VPC and the required networking components
An Amazon DocumentDB cluster to store the JSON data
An OpenSearch Service domain for running fuzzy, full-text queries
An AWS Cloud9 environment to connect to Amazon DocumentDB and OpenSearch Service
A secret in AWS Secrets Manager to store Amazon DocumentDB credentials
A Lambda function to stream Amazon DocumentDB data to the OpenSearch Service index

Set up the AWS Cloud9 environment.
Enable Amazon DocumentDB change streams.
Configure the Amazon DocumentDB change stream as a source for the Lambda function.
Load the reviews dataset into Amazon DocumentDB.
Run fuzzy full-text queries on Amazon DocumentDB data in OpenSearch Service.

The following architecture diagram illustrates the solution.

The CloudFormation template deploys resources in your AWS account, which incur costs. For more information on pricing for the resources, see AWS Pricing.

Deploy the CloudFormation template

Complete the following tasks to deploy the CloudFormation template:

Download the template or quick launch the CloudFormation stack by choosing Launch stack:

For Stack name, enter the name for your CloudFormation stack.
For DocDBIdentifier, enter the name of your Amazon DocumentDB cluster.
For DocDBPassword, enter the administrator password for your Amazon DocumentDB cluster (minimum 8 characters).
For DocDBUsername, enter the name of your administrator user in the Amazon DocumentDB cluster.
For ExistingCloud9Role, choose True if the AWS Identity and Access Management (IAM) role AWSCloud9SSMAccessRole already exists in your account. If you have used AWS Cloud9 before, you should already have this role. You can verify by going to the IAM console and searching for it on the Roles page. Stack creation will fail if the role exists and you choose False.
Choose Next.
Select the check box in the Capabilities section to allow the stack to create an IAM role, then choose Submit.

Set up an AWS Cloud9 environment

To set up your AWS Cloud9 environment, complete the following steps:

On the AWS Cloud9 console, launch the environment that you created in the previous step (ChangeStreamsCloud9).
From your environment, launch a new terminal window by choosing Window and New Terminal.
Run the following script to install the packages required to connect to Amazon DocumentDB from a terminal and to load the reviews dataset with a Python script:

# Setting up mongo 4.0 repo

echo -e "[mongodb-org-4.0] \nname=MongoDB Repository\nbaseurl=https://repo.mongodb.org/yum/amazon/2013.03/mongodb-org/4.0/x86_64/\ngpgcheck=1 \nenabled=1 \ngpgkey=https://www.mongodb.org/static/pgp/server-4.0.asc" | sudo tee /etc/yum.repos.d/mongodb-org-4.0.repo

# Installing packages
sudo yum -y update
sudo yum -y install mongodb-org-shell
sudo python -m pip install --upgrade pip
sudo python -m pip install pandas pymongo

# Downloading the SSL file and the loader

wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem
wget https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/DBBLOG-3344/loader.py

Enable Amazon DocumentDB change streams

Amazon DocumentDB change stream events comprise a time-ordered sequence of data changes due to inserts, updates, and deletes on your data. We use these change stream events to transmit data changes from the Amazon DocumentDB cluster to the OpenSearch Service domain.

Change streams are disabled by default; you can enable them at an individual collection level, at the database level, or at the cluster level.

To enable change streams on your cluster, complete the following steps:

Navigate to your AWS Cloud9 terminal and run the following code, replacing the values with those of your cluster:

export DOCDB_ENDPOINT=<Amazon DocumentDB Endpoint>
echo "export DOCDB_ENDPOINT=${DOCDB_ENDPOINT}" >> ~/.bash_profile

export USERNAME=<Amazon DocumentDB cluster username>
echo "export USERNAME=${USERNAME}" >> ~/.bash_profile

export PASSWORD=<Amazon DocumentDB cluster password>
echo "export PASSWORD=${PASSWORD}" >> ~/.bash_profile

You can find the Amazon DocumentDB endpoint on your CloudFormation stack’s Outputs tab or on the Amazon DocumentDB console, and the Amazon DocumentDB user name and password are the values you provided during the creation of the CloudFormation stack.

Connect to Amazon DocumentDB:

mongo --ssl --host $DOCDB_ENDPOINT:27017 --sslCAFile global-bundle.pem --username $USERNAME --password $PASSWORD

Enable change streams on all databases and collections:

db.adminCommand({modifyChangeStreams: 1, database: "", collection: "", enable: true})

For more information on change streams, see Using change streams with Amazon DocumentDB.
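If you prefer to enable change streams from a script rather than the mongo shell, the same admin command can be built as a plain dictionary. The following is a minimal sketch, not part of the original walkthrough: the helper name modify_change_streams_command is hypothetical, and the commented pymongo call assumes a client that is already connected to your cluster with TLS.

```python
def modify_change_streams_command(database="", collection="", enable=True):
    """Build the modifyChangeStreams admin command as a plain dict.

    Empty strings for database and collection mean "all databases and
    collections", matching the shell command shown in this section.
    """
    # The command name must be the first key; Python dicts keep insertion order.
    return {
        "modifyChangeStreams": 1,
        "database": database,
        "collection": collection,
        "enable": enable,
    }


# Scope the change stream to the collection used later in this post.
command = modify_change_streams_command("productreviewdb", "productreviews")
print(command)

# With pymongo installed and a connected client (hypothetical variable `client`):
#   client.admin.command(command)
```

Scoping the command to a single collection, as in the usage line, is an alternative to enabling change streams for the whole cluster.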

Configure the Amazon DocumentDB change stream as a source for the Lambda function

To accomplish this task, complete the following steps:

On the Lambda console, navigate to the Lambda function named DocumentDBLambdaESM.
On the Configuration tab, choose Triggers and choose Add trigger.
Select the source as Amazon DocumentDB for the trigger configuration.
For DocumentDB cluster, choose the cluster created by the CloudFormation stack.
For Database name, enter productreviewdb.
For Collection name, enter productreviews.
For Secrets Manager key, choose the Secrets Manager key created by the CloudFormation stack. You can find it in the CloudFormation stack outputs as the value for the key DocDBSecretName.
For Batch window, enter the maximum amount of time, in seconds, to gather records before invoking your function. We set this to a low value (5 seconds) so that invocations happen sooner.
For all other parameters, leave them at their defaults.
Choose Add.

Load the reviews dataset into Amazon DocumentDB

Navigate to AWS Cloud9, and in a new terminal, run the loader script to start inserting the review dataset into Amazon DocumentDB (the script will run for a few minutes; do not close the terminal):

python loader.py

As the loader script loads the data into Amazon DocumentDB, the Lambda function streams the data into OpenSearch Service. You can monitor this process through the Lambda function metrics to make sure the function is invoked successfully, and view the function logs to make sure there are no issues with the indexing process, such as incorrect permissions.
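To make the streaming step more concrete, the following sketch shows one way a function could turn a batch of change events into a request body for the OpenSearch _bulk API. It is an illustration under assumptions, not the code deployed by the CloudFormation template: the helper name to_bulk_body is hypothetical, and the event shape (an events list whose items wrap a change event with operationType, documentKey, and fullDocument) is assumed from the Lambda integration's documented payload format.

```python
import json


def to_bulk_body(event, index="documentdb-reviews"):
    """Translate a batch of change events into an OpenSearch _bulk request body."""
    lines = []
    for record in event.get("events", []):
        change = record["event"]
        doc_id = change["documentKey"]["_id"]
        # Extended JSON wraps ObjectIds as {"$oid": "..."}; unwrap when present.
        if isinstance(doc_id, dict) and "$oid" in doc_id:
            doc_id = doc_id["$oid"]
        operation = change["operationType"]
        if operation == "delete":
            lines.append(json.dumps({"delete": {"_index": index, "_id": doc_id}}))
        elif operation in ("insert", "update", "replace") and "fullDocument" in change:
            # Drop the source _id from the body; it is carried in the action line.
            doc = {k: v for k, v in change["fullDocument"].items() if k != "_id"}
            lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
            lines.append(json.dumps(doc))
        # Events without a fullDocument (or other operation types) are skipped.
    # The _bulk API expects newline-delimited JSON that ends with a newline.
    return ("\n".join(lines) + "\n") if lines else ""


sample_event = {
    "events": [
        {
            "event": {
                "operationType": "insert",
                "documentKey": {"_id": {"$oid": "64d5f57e0f98e701c42c9a55"}},
                "fullDocument": {
                    "_id": {"$oid": "64d5f57e0f98e701c42c9a55"},
                    "star_rating": 5,
                    "review_body": "Reliable, easy to use distance measurement powerhouse!",
                },
            }
        }
    ]
}
print(to_bulk_body(sample_event))
```

Reusing the DocumentDB document key as the OpenSearch document ID keeps repeated deliveries of the same event idempotent, because a retry overwrites the same document instead of creating a duplicate.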

To verify data is streaming to the OpenSearch Service index, open a new terminal in your AWS Cloud9 environment and run the following commands (replace the OpenSearch Service endpoint with the value from your CloudFormation template outputs):

export AOS_HOST=<Amazon OpenSearch Service Endpoint>
echo "export AOS_HOST=${AOS_HOST}" >> ~/.bash_profile
curl "https://$AOS_HOST/_cat/indices/documentdb-reviews?v=true"

The following sample output contains document count (docs.count) information:

health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   documentdb-reviews HSI2-ih-QLa_PTYY-yDHng   5   1     102931         4354     81.9mb         81.9mb
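If you want to check the document count from a script, the whitespace-separated output above is straightforward to parse. The following sketch uses a hypothetical helper named parse_cat_indices and the sample output from this post; it assumes every row has a value in every column, which may not hold for closed indices.

```python
def parse_cat_indices(text):
    """Parse `_cat/indices?v=true` output into a list of dicts keyed by column name."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


sample_output = """
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open documentdb-reviews HSI2-ih-QLa_PTYY-yDHng 5 1 102931 4354 81.9mb 81.9mb
"""

indices = parse_cat_indices(sample_output)
print(indices[0]["index"], int(indices[0]["docs.count"]))
```

Running the check a few times while the loader is active should show docs.count increasing as the Lambda function indexes new batches.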

In addition to the index document count, you can monitor the Amazon CloudWatch metrics that OpenSearch Service publishes, such as IndexingRate, which indicates that indexing operations are running.

If you’re working with an existing Amazon DocumentDB collection, you can perform a one-time full migration using an AWS Database Migration Service (AWS DMS) full load task before enabling the change stream and processing future changes. AWS DMS supports Amazon DocumentDB as a source and OpenSearch Service as a target. For more information, refer to Source endpoints for data migration and Target endpoints for data migration.

Run queries on Amazon DocumentDB data in OpenSearch Service

As the data is being replicated to the OpenSearch Service domain, you can run full-text search, fuzzy search, and synonym search queries in OpenSearch Service. The following are some example queries that you can run in your AWS Cloud9 terminal.

Full-text search queries

The following query finds reviews that have a rating greater than or equal to 4 out of 5 and contain the phrase “easy to use”, and highlights the matched phrase in the response:

curl "https://$AOS_HOST/documentdb-reviews/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "review_body": "easy to use"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "star_rating": {
              "gte": 4
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "review_body": {}
    }
  }
}
'

Sample output, with the matched phrase returned in the highlight section:

{
  "took" : 52,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 6.310626,
    "hits" : [
      {
        "_index" : "documentdb-reviews",
        "_id" : "64d5f57e0f98e701c42c9a55",
        "_score" : 6.310626,
        "_source" : {
          "customer_id" : 545660141,
          "product_id" : "PBNXC579GC",
          "product_category" : "Industrial & Scientific",
          "review_id" : "RVVH5TA8P834B",
          "helpful_votes" : 0,
          "product_title" : "Tech-Pro Laser Distance Meter",
          "review_body" : "Reliable, easy to use distance measurement powerhouse! Highly recommend.",
          "review_date" : "2022-08-01",
          "star_rating" : 5,
          "total_votes" : 0,
          "verified_purchase" : false,
          "helpful_votes#review_id" : "00#RVVH5TA8P834B"
        },
        "highlight" : {
          "review_body" : [
            "Reliable, easy to use distance measurement powerhouse! Highly recommend."
          ]
        }
      }, . . .
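When the phrase and the minimum rating come from user input, it can be easier to build the request body in code than to edit JSON by hand. The following sketch reproduces the query from this section; the function name phrase_with_min_rating is hypothetical, and the default field names are the ones used in the reviews dataset.

```python
import json


def phrase_with_min_rating(phrase, min_rating,
                           text_field="review_body", rating_field="star_rating"):
    """Build a bool query: exact phrase match, filtered by minimum rating, with highlighting."""
    return {
        "query": {
            "bool": {
                # must clauses contribute to the relevance score.
                "must": [{"match_phrase": {text_field: phrase}}],
                # filter clauses only include or exclude documents; they do not affect scoring.
                "filter": [{"range": {rating_field: {"gte": min_rating}}}],
            }
        },
        "highlight": {"fields": {text_field: {}}},
    }


body = phrase_with_min_rating("easy to use", 4)
print(json.dumps(body, indent=2))
```

You can pass the printed JSON to the curl command shown earlier through its -d option. Keeping the rating condition in filter rather than must means the rating narrows the result set without changing how the phrase match is scored.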

Full-text search boost queries

Using the boost feature, you can improve search relevancy by “boosting” certain fields. Boosts are multipliers that weigh matches in one field more heavily than matches in other fields.

In the following example, a match for game in the review_body field influences _score twice as much as a match in the product_title field. You can add the boosting factor to the query through the caret (^) operator.

curl "https://$AOS_HOST/documentdb-reviews/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "multi_match": {
      "query": "game",
      "fields": ["review_body^2", "product_title"]
    }
  }
}
'

Sample output:

{
  "_index": "documentdb-reviews",
  "_id": "dOVf34kBENnbYwkKwg_L",
  "_score": 19.019758,
  "_source": {
    "customer_id": 270663766,
    "product_id": "PCIQ2RNGG8",
    "product_category": "Toys & Games",
    "review_id": "RB13N6LCBZNHN",
    "helpful_votes": 2,
    "product_title": "FunLand Board Game",
    "review_body": "FunLand Board Game is absolutely wonderful! The whole family loved playing this game night after night. Great quality components and super fun gameplay for all ages. I can't recommend this highly enough!",
    "review_date": "2023-01-06",
    "star_rating": 5,
    "total_votes": 3,
    "verified_purchase": false,
    "helpful_votes#review_id": "02#RB13N6LCBZNHN"
  }
}
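The caret syntax is just a string convention on the field name, so boosts are easy to generate from a mapping of fields to weights. The following sketch uses a hypothetical helper, boosted_multi_match, and rebuilds the query from this section.

```python
import json


def boosted_multi_match(query, boosts):
    """Build a multi_match query from a {field: weight} mapping.

    A weight of 1 is the default, so the caret suffix is omitted for it.
    """
    fields = [
        name if weight == 1 else f"{name}^{weight}"
        for name, weight in boosts.items()
    ]
    return {"query": {"multi_match": {"query": query, "fields": fields}}}


body = boosted_multi_match("game", {"review_body": 2, "product_title": 1})
print(json.dumps(body))
```

Keeping the weights in one mapping makes it simple to experiment with different values and compare how the ranking of results changes.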

Fuzzy search

Fuzzy queries return documents that contain terms similar to the search term. For example, if the search term is “easy,” documents with data matching “eays”, “ease”, “easi” and more are matched.
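Similarity in fuzzy matching is measured as an edit distance: the number of single-character insertions, deletions, substitutions, or swaps of adjacent characters needed to turn one term into another. The following sketch computes that distance with the optimal string alignment variant to show why the example terms match; it illustrates the idea only and is not how OpenSearch evaluates fuzzy queries internally. The function name edit_distance is hypothetical.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between two strings.

    Counts insertions, deletions, substitutions, and swaps of adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


for candidate in ["eays", "ease", "easi"]:
    print(candidate, edit_distance("easy", candidate))
```

Each of the three candidates is one edit away from “easy”, which is why they all qualify as fuzzy matches for a four-letter term when a single edit is allowed.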

Here is a query to find all reviews with a review body that has a fuzzy match for “easi”:

curl "https://$AOS_HOST/documentdb-reviews/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "fuzzy": {
      "review_body": {
        "value": "easi"
      }
    }
  }
}
'

Sample output:

{
  "took" : 62,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 3.0857868,
    "hits" : [
      {
        "_index" : "documentdb-reviews",
        "_id" : "64d5f4a20f98e701c42b5bae",
        "_score" : 3.0857868,
        "_source" : {
          "customer_id" : 650983998,
          "product_id" : "PXJB5YRCQD",
          "product_category" : "Tools & Home Improvement",
          "review_id" : "RO59IIC7JZFYR",
          "helpful_votes" : 2,
          "product_title" : "The GrungeMaster 3000",
          "review_body" : "This innovative tool cleans even the dirtiest grime with ease. It makes cleaning with sturdy ease a breeze! Easy to handle, store, and use. A must have tool for any home.",
          "review_date" : "2022-08-29",
          "star_rating" : 5,
          "total_votes" : 2,
          "verified_purchase" : false,
          "helpful_votes#review_id" : "02#RO59IIC7JZFYR"
        }
      },…]
  }
}

Search with synonyms

You can upload custom dictionary files, such as stop word and synonym lists (referred to as packages), to your OpenSearch Service domain. These files tell OpenSearch to ignore certain high-frequency words or to treat terms like “brinjal”, “aubergine”, and “eggplant” as equivalent, which improves search results.

To implement search with synonyms, you need to perform additional configuration on your OpenSearch Service domain. For the steps, see Custom packages for Amazon OpenSearch Service. The following example runs a synonym search that treats “software” and “program” as equivalent in the review_body field.

curl "https://$AOS_HOST/documentdb-reviews/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "review_body": "software"
    }
  }
}
'

Sample output:

{
  "_index": "documentdb-reviews",
  "_id": "nqde34kB-h_19Z5f1Qml",
  "_score": 9.878886,
  "_source": {
    "customer_id": 326008690,
    "product_id": "P5STAPV6HU",
    "product_category": "Office Products",
    "review_id": "R75Y7F18631HQ",
    "helpful_votes": 4,
    "product_title": "RELIABLE TAX PROGRAM",
    "review_body": "RELIABLE TAX PROGRAM software helps you manage your taxes smoothly and efficiently with user-friendly features designed for tax prep novices and pros alike. This program simplifies the tax filing process.",
    "review_date": "2021-04-20",
    "star_rating": 5,
    "total_votes": 4,
    "verified_purchase": true,
    "helpful_votes#review_id": "04#R75Y7F18631HQ"
  }
}
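For a rough idea of what the additional configuration involves, the following sketch builds index settings that apply a synonym filter at search time. It is an assumption-laden illustration rather than the setup used in this post: it declares the synonyms inline instead of referencing an uploaded package, and the function, filter, and analyzer names are all hypothetical.

```python
import json


def synonym_index_body(synonym_groups, field="review_body"):
    """Build index settings and mappings that expand synonyms at search time.

    Each entry in synonym_groups is a comma-separated list of equivalent terms.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Inline synonyms (assumption); a package-based setup would
                    # point the filter at an uploaded synonym file instead.
                    "review_synonyms": {"type": "synonym", "synonyms": synonym_groups}
                },
                "analyzer": {
                    "review_synonym_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Lowercase first so synonyms match regardless of case.
                        "filter": ["lowercase", "review_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "standard",
                    "search_analyzer": "review_synonym_analyzer",
                }
            }
        },
    }


print(json.dumps(synonym_index_body(["software, program"]), indent=2))
```

Applying synonyms only in the search analyzer means the synonym list can change without rewriting the indexed text, although analyzers generally have to be defined when an index is created or while it is closed.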

Clean up

To stop incurring costs, clean up the resources created in this post by deleting the CloudFormation stack you created. For instructions, refer to Deleting a stack on the AWS CloudFormation console.

Summary

In this post, we showed you how to integrate Amazon DocumentDB, OpenSearch Service, and Lambda to perform full-text search and fuzzy search queries over JSON data. Specifically, we used the latest Lambda integration to replicate change events from an Amazon DocumentDB change stream to an OpenSearch Service index.

Visit Get Started with Amazon DocumentDB to begin using Amazon DocumentDB.

About the Authors

Kaarthiik Thota is a Senior Amazon DocumentDB Specialist Solutions Architect at AWS based out of London. He is passionate about database technologies and enjoys helping customers solve problems and modernize applications using NoSQL databases. Before joining AWS, he worked extensively with relational databases, NoSQL databases, and business intelligence technologies for more than 15 years.

Hendy Wijaya is a Senior OpenSearch Specialist Solutions Architect at Amazon Web Services. Hendy enables customers to leverage AWS services to achieve their business objectives and gain competitive advantages. He is passionate in collaborating with customers in getting the best out of OpenSearch and Amazon OpenSearch
