Incrementally update a dataset with a bulk import mechanism in Amazon Personalize

By mullaned2002

August 17, 2022

We are excited to announce that Amazon Personalize now supports incremental bulk dataset imports, a new option for updating your data and improving the quality of your recommendations. Keeping your datasets current is an important part of maintaining the relevance of your recommendations. Before this launch, Amazon Personalize offered two mechanisms for ingesting data:

DatasetImportJob: DatasetImportJob is a bulk data ingestion mechanism designed to import large datasets into Amazon Personalize. A typical journey starts with importing your historical interactions dataset along with your item catalog and user dataset. DatasetImportJob can then be used to keep your datasets current by sending updated records in bulk. Before this launch, each new DatasetImportJob overwrote the data ingested by previous import jobs.
Streaming APIs: The streaming APIs (PutEvents, PutUsers, and PutItems) are designed to incrementally update each respective dataset in real time. For example, after you have trained your model and launched your campaign, your users continue to generate interactions data. This data is then ingested via the PutEvents API, which incrementally updates your interactions dataset. Using the streaming APIs allows you to ingest data as you get it rather than accumulating the data and scheduling ingestion.

With incremental bulk imports, Amazon Personalize simplifies the ingestion of historical records by enabling you to import incremental changes to your datasets with a DatasetImportJob. You can import 100 GB of data per FULL DatasetImportJob or 1 GB of data per INCREMENTAL DatasetImportJob. Data added using INCREMENTAL imports is appended to your existing datasets. If an incremental import contains records that duplicate records already in your dataset, Amazon Personalize updates those records with the current version, further simplifying the data ingestion process. In the following sections, we describe the changes to the existing API to support incremental dataset imports.
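The append-and-update behavior described above can be pictured with a small, purely local sketch. This is a toy model, not how Amazon Personalize is implemented: the function name apply_incremental_import is hypothetical, and matching duplicates on an ITEM_ID column is an assumption that fits an items dataset.

```python
def apply_incremental_import(existing, incoming, key="ITEM_ID"):
    """Toy model of an INCREMENTAL import (hypothetical helper, not an AWS API).

    New records are appended; a record whose key already exists is
    replaced by the incoming version. Matching on `key` is an assumption.
    """
    # Index the existing records by key; dicts keep insertion order.
    merged = {row[key]: row for row in existing}
    for row in incoming:
        # Overwrites a duplicate in place, or appends a new record at the end.
        merged[row[key]] = row
    return list(merged.values())


existing_items = [
    {"ITEM_ID": "1", "GENRE": "drama"},
    {"ITEM_ID": "2", "GENRE": "comedy"},
]
incremental_items = [
    {"ITEM_ID": "2", "GENRE": "romance"},  # duplicate ID: replaces the old record
    {"ITEM_ID": "3", "GENRE": "horror"},   # new ID: appended
]
result = apply_incremental_import(existing_items, incremental_items)
```

By contrast, a FULL import in this model would simply replace existing_items with the incoming records.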

CreateDatasetImportJob

A new parameter called importMode has been added to the CreateDatasetImportJob API. This parameter is an enum type with two values: FULL and INCREMENTAL. The parameter is optional and is FULL by default to preserve backward compatibility. The CreateDatasetImportJob request is as follows:

{
    "datasetArn": "string",
    "dataSource": {
        "dataLocation": "string"
    },
    "jobName": "string",
    "roleArn": "string",
    "importMode": {INCREMENTAL, FULL}
}

The Boto3 API is create_dataset_import_job, and the AWS Command Line Interface (AWS CLI) command is create-dataset-import-job.

DescribeDatasetImportJob

The response to DescribeDatasetImportJob has been extended to include whether the import was a full or incremental import. The type of import is indicated in a new importMode field, which is an enum type with two values: FULL and INCREMENTAL. The DescribeDatasetImportJob response is as follows:

{
    "datasetImportJob": {
        "creationDateTime": number,
        "datasetArn": "string",
        "datasetImportJobArn": "string",
        "dataSource": {
            "dataLocation": "string"
        },
        "failureReason": "string",
        "jobName": "string",
        "lastUpdatedDateTime": number,
        "roleArn": "string",
        "status": "string",
        "importMode": {INCREMENTAL, FULL}
    }
}

The Boto3 API is describe_dataset_import_job, and the AWS CLI command is describe-dataset-import-job.

ListDatasetImportJobs

The response to ListDatasetImportJobs has been extended to include whether each import was a full or incremental import. The type of import is indicated in a new importMode field, which is an enum type with two values: FULL and INCREMENTAL. The ListDatasetImportJobs response is as follows:

{
    "datasetImportJobs": [ {
        "creationDateTime": number,
        "datasetImportJobArn": "string",
        "failureReason": "string",
        "jobName": "string",
        "lastUpdatedDateTime": number,
        "status": "string",
        "importMode": {INCREMENTAL, FULL}
    } ],
    "nextToken": "string"
}

The Boto3 API is list_dataset_import_jobs, and the AWS CLI command is list-dataset-import-jobs.
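As a minimal sketch of how the new importMode field might be used, the following helper pages through list_dataset_import_jobs with nextToken and keeps only the jobs of one mode. The helper name list_import_jobs_by_mode is hypothetical, and the client is passed in so the function can be exercised without AWS credentials; in practice you would pass boto3.client('personalize').

```python
def list_import_jobs_by_mode(client, dataset_arn, mode):
    """Return the import job summaries for a dataset that match `mode`.

    `client` is expected to behave like boto3.client('personalize').
    This helper is illustrative and not part of the AWS SDK.
    """
    if mode not in ("FULL", "INCREMENTAL"):
        raise ValueError("mode must be 'FULL' or 'INCREMENTAL'")

    matching = []
    kwargs = {"datasetArn": dataset_arn}
    while True:
        page = client.list_dataset_import_jobs(**kwargs)
        for job in page.get("datasetImportJobs", []):
            # Summaries without an importMode field are skipped, not guessed.
            if job.get("importMode") == mode:
                matching.append(job)
        token = page.get("nextToken")
        if not token:
            return matching
        # Request the next page of results.
        kwargs["nextToken"] = token
```

For example, list_import_jobs_by_mode(personalize, dataset_arn, 'INCREMENTAL') would return only the incremental imports recorded for that dataset.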

Code example

The following code shows how to create a dataset import job for incremental bulk import using the SDK for Python (Boto3):

import boto3

personalize = boto3.client('personalize')

# Start an incremental bulk import; omit importMode (or pass 'FULL') for a full import.
response = personalize.create_dataset_import_job(
    jobName='YourImportJob',
    datasetArn='arn:aws:personalize:us-east-1:111111111111:dataset/AmazonPersonalizeExample/INTERACTIONS',
    dataSource={'dataLocation': 's3://bucket/file.csv'},
    roleArn='role_arn',
    importMode='INCREMENTAL'
)

dsij_arn = response['datasetImportJobArn']

print('Dataset Import Job arn: ' + dsij_arn)

# Look up the job that was just created and print its details.
description = personalize.describe_dataset_import_job(
    datasetImportJobArn=dsij_arn)['datasetImportJob']

print('Name: ' + description['jobName'])
print('ARN: ' + description['datasetImportJobArn'])
print('Status: ' + description['status'])
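The example above prints the status once, right after the job is created. A dataset import job runs asynchronously, so in practice you may want to poll until it finishes. The following is a minimal sketch of such a loop: the function name wait_for_import_job, the polling interval, and the timeout are assumptions, while the status strings ACTIVE and CREATE FAILED are the terminal values Amazon Personalize reports for import jobs.

```python
import time


def wait_for_import_job(client, job_arn, poll_seconds=60,
                        timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll describe_dataset_import_job until the job succeeds or fails.

    `client` is expected to behave like boto3.client('personalize').
    Hypothetical helper; the interval and timeout are arbitrary defaults.
    """
    waited = 0
    while True:
        job = client.describe_dataset_import_job(
            datasetImportJobArn=job_arn)['datasetImportJob']
        status = job['status']
        if status == 'ACTIVE':
            # The import completed and the data is in the dataset.
            return job
        if status == 'CREATE FAILED':
            reason = job.get('failureReason', 'no failure reason reported')
            raise RuntimeError('Dataset import job failed: ' + reason)
        if waited >= timeout_seconds:
            raise TimeoutError('Import job still ' + status + ' after timeout')
        # Status is still CREATE PENDING or CREATE IN_PROGRESS; wait and retry.
        sleep(poll_seconds)
        waited += poll_seconds
```

Calling wait_for_import_job(personalize, dsij_arn) after the code above would block until the import reaches a terminal status, then return the final job description, including its importMode.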

Summary

In this post, we described how you can use this new feature in Amazon Personalize to perform incremental updates to a dataset with bulk import, keeping the data fresh and improving the relevance of Amazon Personalize recommendations. If you have delayed access to your data, incremental bulk import allows you to import your data more easily by appending it to your existing datasets.

Try out this new feature by accessing Amazon Personalize now.

About the authors

Neelam Koshiya is an enterprise solution architect at AWS. Her current focus is to help enterprise customers with their cloud adoption journey for strategic business outcomes. In her spare time, she enjoys reading and being outdoors.

James Jory is a Principal Solutions Architect in Applied AI with AWS. He has a special interest in personalization and recommender systems and a background in ecommerce, marketing technology, and customer data analytics. In his spare time, he enjoys camping and auto racing simulations.

Daniel Foley is a Senior Product Manager for Amazon Personalize. He is focused on building applications that leverage artificial intelligence to solve our customers’ largest challenges. Outside of work, Dan is an avid skier and hiker.

Alex Berlingeri is a Software Development Engineer with Amazon Personalize working on a machine learning powered recommendations service. In his free time he enjoys reading, working out and watching soccer.
