
How to enrich product data with generative AI using Vertex AI

Product Information Management (PIM) is a critical process in the retail industry to manage product data, such as descriptions, images, and other attributes. In this blog post, we will show how to use Large Language Models (LLMs) with Vertex AI to enrich product data, which can improve the customer experience and the bottom line.

Product Information Management

PIM is the process of collecting, storing, and managing product information across an organization. It includes gathering data from a variety of sources, such as product catalogs, websites, and customer feedback. PIM systems then organize and normalize this data so that it can be used by other systems, such as e-commerce platforms, marketing automation tools, and product recommendations engines. The PIM market is growing rapidly, as businesses increasingly recognize the importance of having accurate and up-to-date product information.

LLMs can support the PIM process in a number of ways, including:

- Generating product descriptions: LLMs can be trained on a large corpus of product descriptions to generate new descriptions for products.
- Translating product descriptions: LLMs can be used to translate product descriptions into multiple languages.
- Extracting product attributes: LLMs can be used to extract product attributes from product descriptions, such as the product name, price, and features.
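
As a small illustration of the attribute-extraction use case above, one way to frame such a prompt is sketched below. The wording, field names, and sample product are illustrative assumptions only; the actual model call is shown later in this post.

# Illustrative attribute-extraction prompt (wording and field names are assumptions,
# and the sample product description is made up for demonstration).
extraction_prompt = """
Extract the product name, price, and key features from the product
description below. Return the result as a JSON object with the keys
"name", "price", and "features".

Description: Acme 500 ml stainless steel water bottle with leak-proof lid, Rs. 499.
"""
# The prompt can then be sent to a text generation model, as shown in the
# "Querying the model" section below.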

Getting started

For this demonstration, we’ll use the Flipkart products dataset on Kaggle. It provides a sample of 20,000 products from the Indian e-commerce retailer Flipkart, with 15 fields including name, description, and price.

Our goal will be to improve the quality of product descriptions in the dataset. In particular, let’s look for short or incomplete descriptions that can be augmented.

You can follow along with the Colab notebook. Here, we will highlight key steps but not include all of the details.
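
As a starting point, a minimal sketch of loading the dataset into a pandas DataFrame might look like this. The CSV filename below is the one the Kaggle download typically uses; treat the exact path as an assumption for your environment.

import pandas as pd

# Load the Flipkart product sample into a DataFrame.
# Adjust the path/filename to wherever you downloaded the Kaggle CSV.
df = pd.read_csv("flipkart_com-ecommerce_sample.csv")

# Inspect the shape and the available fields.
print(df.shape)             # expected: (20000, 15)
print(df.columns.tolist())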

Data analysis

Our first step will be to understand the distribution of product descriptions. We can create a Kernel Density Estimation (KDE) plot, which can help us visualize a smoothed distribution of the data.

# Imports for the analysis (the notebook also uses pandas for the DataFrame df)
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Get the number of characters in the description field
num_chars_in_description = df["description"].str.len()

# Create a kernel density estimate (KDE) object
kde = gaussian_kde(num_chars_in_description)

# Evaluate the PDF at a range of points
x = np.linspace(0, max(num_chars_in_description), 1000)
y = kde(x)

# Plot the smoothed distribution function
plt.xscale("log")
plt.plot(x, y)
plt.xlabel("Number of characters in description")
plt.ylabel("Probability density")
plt.title("Product Description Length KDE Plot")
plt.show()

You may notice that we used a log scale on the X-axis, so we can more carefully inspect the left-tail of the distribution. We see that most descriptions are between 100 and 1000 characters. Let’s augment the shortest 0.05% of our descriptions, setting our threshold to 93 characters.

# Set the threshold at the 0.05th percentile (the shortest 0.05%) of description lengths
threshold = int(num_chars_in_description.quantile(0.0005))

print(threshold)  # 93

Data preparation

We can compute that there are 13 product descriptions that can be augmented.

# Get the number of characters in the description field
num_desc_chars = df["description"].str.len()

# Count the number of rows with descriptions under the threshold
num_rows_with_short_desc = df[num_desc_chars < threshold].shape[0]

print(num_rows_with_short_desc)  # 13

Let’s look at a sample of 3 of these. Clearly, we can make improvements to the copy!

- Specifications of Shilpi NHSCN003 Coin Bank (Brown) In The Box Sales Package Coin Bank
- Key Features of Prime Printed 6 Seater Table Cover Length 78 inch/198 cm Width 54 inch/137 cm
- Specifications of Speedo Men’s Swimsuit General Details Occasion Sports Ideal For Men’s
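
A minimal sketch of pulling that sample, assuming the num_desc_chars and threshold variables from the previous steps, might look like the following; the rows_with_description_under_100 name matches the variable used in the data preparation code below.

# Select the rows whose descriptions fall under the threshold
rows_with_description_under_100 = df[num_desc_chars < threshold]

# Print a sample of 3 of the short descriptions
for desc in rows_with_description_under_100["description"].sample(3, random_state=42):
    print(desc)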

It will be helpful to provide the LLM some extra context — not just the original description, but other attributes such as name, brand, and category. We don’t need to include everything, so let’s filter out columns that don’t help toward our goal. We can then put the results into a JSON string, so that it can be easily parsed by the model.

import json

# Create a JSON object for each row
json_objects = []
for index, row in rows_with_description_under_100.iterrows():
    # Keep only the columns that give the model useful context
    row = row[["product_name", "description", "brand", "product_category_tree", "pid"]]
    json_object = {}
    for column in row.index:
        json_object[column] = row[column]
    json_objects.append(json_object)

# Create a string with the JSON array
product_data = json.dumps(json_objects)

Prompt engineering

Our next step is to create a prompt that will instruct the model what to do. We won’t go into too much depth on prompt design, but let’s explore each part of the prompt we will use.

First, we’ll want to be clear about our requirements: provide compelling copy while still being accurate. We also want the response to be in JSON format, so it can be easily parsed. Finally, we want to include the product’s unique identifier, so we can link each updated description back to its original product. The product data we gathered earlier will be embedded at the end of the prompt.

prompt = f"""
Generate a compelling and accurate product description for each of the
products provided in the JSON data structure below. The output should
be a JSON array consisting of the `uniq_id` and updated `description`
of each product.

===
{product_data}
"""

Querying the model

Now, we’re ready to generate product descriptions. It will take four simple steps.

1. Import the Vertex AI SDK for Python and initialize the client with your project ID and region.
2. Select a pretrained text generation model from the Vertex AI Model Garden.
3. Set the maximum output tokens to 1024, since we don’t mind getting elaborate descriptions back.
4. Query the model and see the result.

import vertexai
from vertexai.language_models import TextGenerationModel

# Initialize the client
vertexai.init(project=project_id, location=location)

# Use the text-bison model from the Vertex Model Garden
model = TextGenerationModel.from_pretrained("text-bison@001")

# Update the default max_output_tokens
parameters = {"max_output_tokens": 1024}

# Query the model
response = model.predict(prompt, **parameters)

# Print the result
response.text

Here’s our result:

[{“uniq_id”: “CNBEJ9EDXWN8HQUU”, “description”: “Shilpi NHSCN003 Coin Bank (Brown) is a coin bank with a capacity of 1000 coins…

We see that it’s a JSON structure that can be easily parsed. We can use JSON parsing functions to extract each ID and updated description, and then update the original product description.
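
A minimal sketch of that step might look like the following. It assumes the model returned well-formed JSON and that the identifier it echoes back corresponds to the identifier column we included in the product data; both are assumptions worth validating against your own results.

import json

# Parse the model's JSON output into a list of objects, each with an
# identifier and an updated description.
updated_products = json.loads(response.text)

# Write the updated descriptions back to the DataFrame.
# The key/column pairing below is an assumption; adjust it to whichever
# unique identifier you included in the product data.
id_key, id_column = "uniq_id", "pid"
for item in updated_products:
    df.loc[df[id_column] == item[id_key], "description"] = item["description"]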

Let’s compare our results to what we had before. Quite an improvement!

Looking again at the distribution of the shorter product descriptions, we can see that we no longer have any descriptions under the threshold.
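
A quick programmatic check, as a minimal sketch assuming the DataFrame was updated in place as sketched above:

# Recompute description lengths after the update
new_desc_lengths = df["description"].str.len()

# Confirm that no descriptions remain under the threshold
print((new_desc_lengths < threshold).sum())  # expected: 0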

Conclusion

Large Language Models can be an effective tool in improving product data quality. Consistent, accurate, and compelling product data is the cornerstone of the retail experience. We’ve seen in this blog post how we can easily integrate with Vertex AI services to make this happen. You can take the ideas further by capturing context from product images using Vertex AI Vision, or even create images based on the product metadata with Imagen on Vertex AI. You can find out more at our Vertex AI Generative AI and Google Cloud for Retail sites.
