
Prompting best practices for BigQuery data canvas

Traditionally, having strong knowledge of SQL has been a critical job requirement for data professionals. Lately, however, generative AI is helping data professionals of all skill levels use natural language prompts to do a variety of data tasks, democratizing access to data and insights and promoting collaboration.

In the BigQuery ecosystem, Gemini is the foundation model that lets users find data, create SQL, generate charts, and create data summaries — what’s commonly referred to as NL2SQL and NL2Chart. It does so through BigQuery data canvas, a versatile tool that helps developers, data scientists and analysts share their query building processes and insights. With data canvas, users can use natural language to locate, join, and query table assets, visualize the results, and collaborate with others throughout the entire process. This ultimately accelerates your analytics and saves time and effort across the entire team. 

Many data professionals perceive that out-of-the-box large language models (LLMs) are not good at generating SQL. In our experience, however, with the right prompting techniques, Gemini in BigQuery via data canvas can generate complex SQL queries grounded in the context of your data corpus. Data canvas uses natural language queries to determine sorting, grouping, ordering, record limits, and the overall SQL structure.

In this blog post, we provide five tips to refine your natural language (NL) prompts to increase the accuracy of your NL2SQL and NL2Chart queries. Soon, you too can use BigQuery data canvas to generate accurate and insightful SQL and charts with intuitive natural language.

1. Clarity is key

When writing prompts, be as clear as possible to avoid confusion, which can lead to incomplete or inaccurate SQL.

Precision over ambiguity: State your request clearly and avoid being vague.  

DON’T: “Tell me about my sales data.”

DO: “Show me a breakdown of total sales revenue by product category for the last quarter.”

By focusing on precision, you can get the results you want quickly and effectively.

[Image: side-by-side examples of prompts and their SQL outputs; the right side shows the proper way to structure prompts in data canvas]
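For illustration, here is a rough sketch of the kind of SQL a precise prompt like the DO example above might produce. The project, dataset, table, and column names (sales, product_category, revenue, order_date) are hypothetical placeholders, not part of the original example:

SELECT
  product_category,
  SUM(revenue) AS total_sales_revenue
FROM
  `my_project.my_dataset.sales`  -- hypothetical table
WHERE
  -- "last quarter": from the start of the previous quarter to the start of the current one
  order_date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 QUARTER), QUARTER)
  AND order_date < DATE_TRUNC(CURRENT_DATE(), QUARTER)
GROUP BY
  product_category
ORDER BY
  total_sales_revenue DESC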

Context matters: Provide any relevant background information to help Gemini understand your request.

DON’T: “What’s the trend?”

DO: “What’s the trend of monthly active users for our mobile app over the past year?”
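As a rough illustration, a contextual prompt like the DO example might generate SQL along these lines; the app_events table and its user_id and event_date columns are hypothetical stand-ins for your own schema:

SELECT
  DATE_TRUNC(event_date, MONTH) AS month,
  COUNT(DISTINCT user_id) AS monthly_active_users
FROM
  `my_project.my_dataset.app_events`  -- hypothetical table
WHERE
  event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
GROUP BY
  month
ORDER BY
  month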

Complete sentences: Use full sentences with more details to avoid misinterpretation.

DON’T: “Show me full month name for committer’s date for the repo like ‘cloud-vision’” (the steps here are not clear and complete)

DO: 

# prompt: Convert months in integers to its complete month name for committer’s date. List commit, repo name, author name, committer name, author date and year, committer date and year for repo like ‘cloud-vision’. Use “%” in like

Here is the SQL generated by the above prompt. Not too shabby…

SELECT
  t1.commit,
  t1.repo_name,
  t1.author_name,
  t1.committer_name,
  t1.author_date,
  EXTRACT(YEAR FROM t1.committer_date) AS committer_year,
  EXTRACT(MONTH FROM t1.committer_date) AS committer_month,
  CASE
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 1 THEN 'January'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 2 THEN 'February'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 3 THEN 'March'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 4 THEN 'April'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 5 THEN 'May'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 6 THEN 'June'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 7 THEN 'July'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 8 THEN 'August'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 9 THEN 'September'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 10 THEN 'October'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 11 THEN 'November'
    WHEN EXTRACT(MONTH FROM t1.committer_date) = 12 THEN 'December'
  END AS committer_month_name
FROM
  `carbonfootprint-test._258496aad0f70ded1533d656e2d91ec72ca85379.anon8b5e7308ef0b4726721f5682b48ec5c51cd6137e7fae309474de23971464dd33` AS t1
WHERE
  t1.repo_name LIKE '%cloud-vision%'

2. Ask direct questions

When prompting an LLM, it’s important to be straightforward and unambiguous. Here are a few things to keep in mind.

Ask one question at a time: For the most precise answer, focus on a single question with detailed instructions. More detailed instructions enhance the model’s contextual comprehension. Be as descriptive as possible about the single transformation you want performed.

Avoid overloading: Keep prompts concise to avoid overwhelming Gemini in BigQuery with too much information. Consider adding a new node for the next transformation if needed.

DON’T: “List all the columns except parent, trailer, and difference. Get only the repo name for the first entry in the repeated column and convert both author and committer date in seconds to timestamp seconds to a date. Include commit, repo, author, and committer names too. ”

DO: Separate the request into two nodes in series:

# prompt: List all the columns except parent, trailer, and difference. Get only the repo name for the first entry in the repeated column

# prompt: convert both author and committer date in seconds to timestamp seconds to a date. Include commit, repo, author and committer names too
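For reference, here is a rough sketch of the kind of SQL these two nodes might produce, assuming the bigquery-public-data.github_repos.commits schema used elsewhere in this post. In data canvas the second node reads from the first node’s result; first_node_result below is a hypothetical placeholder for that intermediate result:

-- Node 1: keep all columns except parent, trailer, and difference,
-- and take only the first entry of the repeated repo_name column
SELECT
  t1.commit,
  t1.tree,
  t1.author,
  t1.committer,
  t1.subject,
  t1.message,
  t1.difference_truncated,
  t1.repo_name[OFFSET(0)] AS repo_name,
  t1.encoding
FROM
  `bigquery-public-data`.github_repos.commits AS t1

-- Node 2: convert the seconds fields to dates and keep the names
-- (first_node_result is a hypothetical placeholder for the previous node's output)
SELECT
  commit,
  repo_name,
  author.name AS author_name,
  committer.name AS committer_name,
  DATE(TIMESTAMP_SECONDS(author.date.seconds)) AS author_date,
  DATE(TIMESTAMP_SECONDS(committer.date.seconds)) AS committer_date
FROM
  first_node_result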

3. Give focused and explicit instructions

LLMs thrive on specificity. Here are some things to keep in mind when asking BigQuery data canvas to generate code for you:

Emphasize key terms: You can highlight important terms to guide Gemini’s understanding. In the following example, Gemini was able to quickly build intricate logic because it recognized the highlighted keywords in the prompt.

# prompt: list all columns by unnesting author and committer by prefixing the record name to all the columns. match repo name for the first entry only in the repeated column that has name like GoogleCloudPlatform. Make sure to use “%” for like

Here’s the generated SQL:

SELECT
  t1.commit,
  t1.tree,
  t1.parent,
  t1.author.name AS author_name,
  t1.author.email AS author_email,
  t1.author.time_sec AS author_time_sec,
  t1.author.tz_offset AS author_tz_offset,
  t1.author.date.seconds AS author_date_seconds,
  t1.author.date.nanos AS author_date_nanos,
  t1.committer.name AS committer_name,
  t1.committer.email AS committer_email,
  t1.committer.time_sec AS committer_time_sec,
  t1.committer.tz_offset AS committer_tz_offset,
  t1.committer.date.seconds AS committer_date_seconds,
  t1.committer.date.nanos AS committer_date_nanos,
  t1.subject,
  t1.message,
  t1.trailer,
  t1.difference,
  t1.difference_truncated,
  t1.repo_name,
  t1.encoding
FROM
  `bigquery-public-data`.github_repos.commits AS t1
WHERE
  t1.repo_name[OFFSET(0)] LIKE '%GoogleCloudPlatform%'

4. Specify order of operations

Providing instructions in a clear and ordered manner is important for Gemini in BigQuery to process and respond effectively. This helps avoid confusion and ensures that the model accurately understands the task at hand.

Here are some tips for providing clear, ordered instructions to Gemini in BigQuery:

Start with the goal: Clearly state the desired outcome before diving into the specific steps.
Use concise language: Avoid ambiguity or unnecessary jargon.
Break it down: Divide complex tasks into smaller, easier-to-follow steps.
Double-check your work: Review your instructions for clarity and logical order before submitting.
Adjust as needed: Depending on the output, rearrange phrases or rephrase your prompt.

You can see an example of how prompt structure affects the resulting SQL query outputs below. The original prompt used to unnest the author and committer fields is modified by rearranging the phrases or adding more explicit phrases in an unintuitive order. 

Original prompt:

# prompt: list all columns by unnesting author and committer by prefixing the record name to all the columns. match repo name for the first entry only in the repeated column that has name like GoogleCloudPlatform.

SELECT
  t1.commit,
  t1.tree,
  t1.parent,
  t1.author.name AS author_name,
  t1.author.email AS author_email,
  t1.author.time_sec AS author_time_sec,
  t1.author.tz_offset AS author_tz_offset,
  t1.author.date.seconds AS author_date_seconds,
  t1.author.date.nanos AS author_date_nanos,
  t1.committer.name AS committer_name,
  t1.committer.email AS committer_email,
  t1.committer.time_sec AS committer_time_sec,
  t1.committer.tz_offset AS committer_tz_offset,
  t1.committer.date.seconds AS committer_date_seconds,
  t1.committer.date.nanos AS committer_date_nanos,
  t1.subject,
  t1.message,
  t1.trailer,
  t1.difference,
  t1.difference_truncated,
  t1.repo_name
FROM
  `bigquery-public-data`.github_repos.commits AS t1
WHERE
  t1.repo_name[OFFSET(0)] LIKE 'GoogleCloudPlatform%'

After reversing the order of the phrases in the prompt, the resulting SQL adds an extra OFFSET expression to the SELECT list. Although the query is still valid, it returns no data.

Here’s the modified prompt:

# prompt: match repo name for the first entry only in the repeated column that has name like GoogleCloudPlatform. list all columns by unnesting author and committer by prefixing the record name to all the columns.

SELECT
  t1.repo_name[OFFSET(0)],
  t1.commit,
  t1.tree,
  t1.parent,
  t1.author.name AS author_name,
  t1.author.email AS author_email,
  t1.author.time_sec AS author_time_sec,
  t1.author.tz_offset AS author_tz_offset,
  t1.author.date.seconds AS author_date_seconds,
  t1.author.date.nanos AS author_date_nanos,
  t1.committer.name AS committer_name,
  t1.committer.email AS committer_email,
  t1.committer.time_sec AS committer_time_sec,
  t1.committer.tz_offset AS committer_tz_offset,
  t1.committer.date.seconds AS committer_date_seconds,
  t1.committer.date.nanos AS committer_date_nanos,
  t1.subject,
  t1.message,
  t1.trailer,
  t1.difference,
  t1.difference_truncated
FROM
  `bigquery-public-data`.github_repos.commits AS t1
WHERE
  t1.repo_name[OFFSET(0)] LIKE 'GoogleCloudPlatform'

In this last example, we explicitly add "use '%' for like," since the wildcard was missing from the previous SQL output. Unfortunately, after this change, the generated SQL throws the error shown below.

2nd modified prompt: 

# prompt: Make sure to use “%” for like and match repo name for the first entry only in the repeated column that has name like GoogleCloudPlatform. list all columns by unnesting author and committer by prefixing the record name to all the columns.

SELECT
  t1.commit,
  t1.tree,
  t1.parent,
  t1.author.name AS author_name,
  t1.author.email AS author_email,
  t1.author.time_sec AS author_time_sec,
  t1.author.tz_offset AS author_tz_offset,
  t1.author.date.seconds AS author_date_seconds,
  t1.author.date.nanos AS author_date_nanos,
  t1.committer.name AS committer_name,
  t1.committer.email AS committer_email,
  t1.committer.time_sec AS committer_time_sec,
  t1.committer.tz_offset AS committer_tz_offset,
  t1.committer.date.seconds AS committer_date_seconds,
  t1.committer.date.nanos AS committer_date_nanos,
  t1.subject,
  t1.message,
  t1.trailer,
  t1.difference,
  t1.difference_truncated,
  t1.repo_name,
  t1.encoding
FROM
  `bigquery-public-data`.github_repos.commits AS t1
WHERE
  t1.repo_name LIKE 'GoogleCloudPlatform%'
  AND t1.repo_name LIKE 'GoogleCloudPlatform'

Error: No matching signature for operator LIKE for argument types: ARRAY<STRING>, STRING. Supported signatures: STRING LIKE STRING; BYTES LIKE BYTES at [31:3]
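One way to resolve this error, consistent with the working queries shown earlier, is to apply LIKE to the first element of the repeated repo_name column rather than to the array itself. A minimal sketch:

SELECT
  t1.commit,
  t1.repo_name[OFFSET(0)] AS repo_name
FROM
  `bigquery-public-data`.github_repos.commits AS t1
WHERE
  -- LIKE now compares STRING to STRING instead of ARRAY<STRING> to STRING
  t1.repo_name[OFFSET(0)] LIKE '%GoogleCloudPlatform%'

In data canvas itself, you would typically get there by rewording the prompt, as the original prompt above does, rather than by editing the generated SQL by hand.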

5. Refine and iterate

As the saying goes, if at first you don’t succeed, try, try again. When it comes to using Gemini in BigQuery data canvas, that comes down to:

Experimenting: Try different phrasings and approaches to see what yields the best results.

Learning from feedback: Use Gemini’s responses to adjust your prompts and improve future interactions.  

Creating charts with NL2Chart 

When it comes to visualizations, NL2Chart is a powerful tool that can transform natural language descriptions into insightful charts and graphs, and all of the preceding advice still applies. To maximize its effectiveness, follow these guidelines when crafting your prompts:

Be specific and concise: Clearly state the type of chart you want (bar, line, pie, etc.) and the specific data you want to visualize. Avoid ambiguity or unnecessary details.

Define the data: Specify the exact data points, variables, or categories to be included in the chart.

Set your timeframe: If your data is time-dependent, clearly define the time period you want to analyze (e.g., “first quarter of 2023,” “last 5 years”).

Units and labels: Mention the units of measurement for your data (e.g., dollars, percentages, units sold) and any desired labels for axes, titles, or data points.

Customization: If you have preferences for colors, styles, or specific formatting options, you can include them in your prompt. However, NL2Chart often has default settings that work well.

Example prompts

Basic: “Create a bar chart showing the sales revenue of each product category in the last quarter.”

Intermediate: “Generate a line chart comparing the stock prices of Apple and Microsoft from January 1, 2020, to December 31, 2023.”

Advanced: “Produce a scatter plot illustrating the relationship between advertising spending and website traffic for the past year, with a trendline and R-squared value.”

Additional tips

Experiment: Don’t be afraid to experiment with different phrasings and levels of detail in your prompts. NL2Chart is quite versatile.

Iterate: If the initial chart isn’t exactly what you envisioned, refine your prompt or provide additional details to guide NL2Chart toward the desired output.

Limitations: Be aware that NL2Chart might not understand complex or highly specific requests. Start with simpler prompts and gradually increase complexity as needed.

Alternative tools: If NL2Chart doesn’t meet your needs, explore other NL-to-chart tools available online.

By following these guidelines, you can effectively leverage NL2Chart to create informative and visually appealing charts that enhance your data analysis and communication. 

Along with this, you can also include specifics on customizing the visualization itself. You can customize chart aspects such as the color schemes of your trendlines, bars, pie areas, or data series. You can also specify chart axes, axis labels, titles, and the legend. Finally, you can sort the data in increasing, decreasing, or alphabetical order, as seen in the second example below.

Create custom visualizations that can be exported to Looker Studio or downloaded as a PNG.

Sorted in reverse alphabetical order via the prompt.

Visualization sorted by descending month order via prompt.

Last but not least, you can use data canvas to generate insightful summaries for your visualizations that interpret and summarize chart findings in ways that might not be apparent at first glance.

Using basic language and trying different phrasings, this detailed visualization was produced effortlessly.

Practice makes perfect prompts

Learning to prompt in data canvas requires some practice, but it becomes more intuitive once you are familiar with the NL2SQL model. To get the most accurate and helpful responses from data canvas, be clear and specific in your prompts. To refine your prompts, use complete sentences, provide relevant context, and ask direct questions. Experiment with different phrasings and learn from Gemini’s feedback. By following these guidelines, you’ll unlock the full potential of Gemini in BigQuery and get the most accurate and helpful SQL outputs in BigQuery data canvas.

Put these tips and tricks into practice by accessing BigQuery data canvas! You can find some sample data canvas prompts on GitHub. Also, be sure to check out Data Canvas – Transforming Data into Insights with AI to follow a real example of using BigQuery data canvas to explore the GitHub repositories public dataset.
