Data Analytics

Understanding BigQuery data canvas: how to easily transform data into insights with AI

July 3, 2024

Layolin Jesudhass

Global Generative AI Solutions Architect, Google

Christine De Sario

Customer Engineer - Data & Analytics, Google

Try Gemini 1.5 models

Google's most advanced multimodal models in Vertex AI

Try it

BigQuery data canvas, a Gemini in BigQuery feature, is a revolutionary data analytics tool that simplifies the entire data analysis journey — from data discovery and preparation to analysis, visualization, and collaboration — all in one place, all within BigQuery. BigQuery data canvas leverages natural language processing to let you ask questions about your data in plain English and various other languages. This intuitive approach eliminates the need to build out complex SQL queries, making data analysis accessible to both technical and non-technical users. With data canvas, you can explore, transform, and visualize your BigQuery data without ever leaving the environment where your data resides.

In this blog, we present an overview of BigQuery data canvas and a technical walkthrough of a real-world example using the public github_repos dataset. This dataset contains over 3TB of activity from 3M+ open-source repositories. We'll explore how to answer questions such as:

How many commits were made to a specific repository in a year?
Who authored the most repositories in a given year?
How many non-authored commits were applied over time?
Which users contributed to a specific file at a specific time?

You'll see how data canvas handles complex SQL tasks like joining tables, unnesting fields, converting timestamps, and extracting specific data elements – all from your natural language prompts. We'll even demonstrate how to generate insightful visualizations and summaries with a single click.

BigQuery data canvas at a glance

At its core, BigQuery data canvas provides three key areas of functionality: You can use it to help you find data, generate SQL, and generate insights.

https://storage.googleapis.com/gweb-cloudblog-publish/images/img01.max-1500x1500.png

The three aspects of BigQuery data canvas

1. Find data

Use data canvas to find data in BigQuery with a quick keyword search or a natural-language text prompt.

https://storage.googleapis.com/gweb-cloudblog-publish/images/img02.max-1600x1600.png

Start your analysis journey in BigQuery's data canvas powered by Gemini

2. Generate SQL

You can also use BigQuery data canvas to write SQL code for you, also with Gemini-powered natural language prompts.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/img03.gif

Create SQL queries for the chosen table using simple English

3. Create insights

Finally, discover insights hidden within your data with a single click! Gemini automatically generates visualizations to help you understand the story your data is telling.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/img04.gif

Gemini automatically generates visualizations to help you understand the story your data is telling.

BigQuery data canvas in action

To give you a better sense of how the impact BigQuery data canvas can have in your organization, let’s take an example. Companies of all sizes, from large enterprises to small startups, can benefit from a deeper understanding of their developer team’s productivity. In this technical deep dive, we’ll show you how to use the github_repos public dataset with data canvas to generate valuable insights in a shareable workspace. Through this example, you’ll see how data canvas makes it easy to perform complex queries — allowing you to create SQL that joins and unnests nested fields, converts timestamps, extracts month/year from date fields, and more. With Gemini's capabilities, you can easily generate these queries and explore your data, with insightful visualizations, all using natural language.

Please note that as with many new AI products and services today, you need strong prompt engineering skills for the successful use of any LLM-enabled application. Many may perceive that out-of-the-box large language models (LLMs) are not good at generating SQL. But in our experience, with the right prompting techniques, Gemini in BigQuery via data canvas can generate complex SQL queries with the context of your data corpus. We see that data canvas determines the sorting, grouping, ordering, limiting the record count and SQL structure based on natural language queries. To learn about engineering prompts for BigQuery data canvas, check out our other blog post, BigQuery BigQuery Data Canvas: Prompting Best Practices.

The github_repos dataset, available in Bigquery Public Datasets, is a 3TB+ dataset that contains activity on 3M+ open-source repositories about commits, watch counts, etc., in multiple tables. For this example, we want to look at Google Cloud Platform repository. As always, ensure you have the proper IAM permissions before getting started. Along with these, ensure you have proper permissions to the data canvas and datasets to run nodes successfully.

Exploring each of the tables within the github_repos dataset is easy to do with data canvas. Here, we compare datasets side-by-side, examine schema, details, and preview data, all in the same panel.

https://storage.googleapis.com/gweb-cloudblog-publish/images/img05.max-800x800.png

Explore tables side-by-side - examine the schema, view details, and preview data

Once you've selected your dataset, you can branch another node to query or join it with another table by hovering at the node's bottom. Arrows indicate the dataset for the next transformation node. When sharing the canvas, you can name each node for clarity. Options in the top right allow you to delete, debug, duplicate, or execute all of the nodes in a series. You can download results, or export data to Sheets or Looker Studio. You can also rate SQL suggestions, restore prior versions, and view the DAG structure in the navigation panel.

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/img06.gif

Overview of entire data canvas from end-to-end

In exploring the github_repos dataset, we look at four different aspects of our data. We’ll look at determining the following:

1) The number of commits made throughout a single year

2) The number of authored repos for a given year

3) How many non-authored commits were applied throughout the years

4) Find the number of commits done by users for a given file at a given time

Below are five major examples of complex SQL queries and transformations generated by the natural language query prompts in this example workflow.

1. Unnest specific columns and find all with Google Cloud Platform in the repo_name

# prompt: list all columns by unnesting author and committer by prefixing the record name to all the columns. match repo name for the first entry only in the repeated column that has name like GoogleCloudPlatform. Make sure to use "%" for like

Generated SQL:

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/img07.gif

Proper un-nesting of fields under author and committer then filtering down to GoogleCloudPlatform repo

To determine which committers amended a specific file in repositories, we perform a table join to achieve the desired results as in the example below. An option is given at the bottom of each node to query or join the existing table or query results.

2: Inner join of previous query results with github_repos.files table on repo_name

# prompt: Join these data sources on repo name and filter only for the file "python/awwvision/redis/spec.yaml" and year 2017

Generated SQL:

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/img08.gif

Table join of query results with github_repos.files table on repo_name

Another prime example of using NL2SQL in data canvas is to convert data types. We need date conversions to convert the epoch time of the author date and committer dates to an actual date. Gemini understands the epoch time unit in seconds and applies the right timestamp and date functions in the right order to convert the dates correctly. We see that activity data in github_repos spans the years 2016 - 2022, which aligns with the results as expected.

3. Date conversion of author date and committer date expressed in seconds

# prompt: convert both author and commiter date in seconds to timestamp seconds to a date. Include commit, repo, author and commiter names too

Generated SQL:

https://storage.googleapis.com/gweb-cloudblog-publish/images/img09.max-1600x1600.png

Author date and committer date correctly converted from epoch time in seconds

Taking it a step further, you can see how well Gemini generates a complex query and figures out the logic based solely on what is asked. In this case, we asked it to extract the month and year from dates and to convert months in integers to the months’ actual names. This is considerably faster than creating the query by hand, and can be easily shared with other members of the team.

4: Extracting month and year from date types and converting months in integers to full month names.

# prompt: Convert months in integers to its complete month name for committer's date. List commit, repo name, author name ,committer name, author date and year , committer date and year for repo like 'cloud-vision’. Use "%" in like

Generated SQL:

SELECT
 t1.
COMMIT
 ,
 t1.repo_name,
 t1.author_name,
 t1.committer_name,
 t1.author_date,
 EXTRACT(YEAR
 FROM
   t1.committer_date) AS committer_year,
 EXTRACT(MONTH
 FROM
   t1.committer_date) AS committer_month,
 CASE
   WHEN EXTRACT(MONTH FROM t1.committer_date) = 1 THEN 'January'
   WHEN EXTRACT(MONTH
 FROM
   t1.committer_date) = 2 THEN 'February'
   WHEN EXTRACT(MONTH FROM t1.committer_date) = 3 THEN 'March'
   WHEN EXTRACT(MONTH
 FROM
   t1.committer_date) = 4 THEN 'April'
   WHEN EXTRACT(MONTH FROM t1.committer_date) = 5 THEN 'May'
   WHEN EXTRACT(MONTH
 FROM
   t1.committer_date) = 6 THEN 'June'
   WHEN EXTRACT(MONTH FROM t1.committer_date) = 7 THEN 'July'
   WHEN EXTRACT(MONTH
 FROM
   t1.committer_date) = 8 THEN 'August'
   WHEN EXTRACT(MONTH FROM t1.committer_date) = 9 THEN 'September'
   WHEN EXTRACT(MONTH
 FROM
   t1.committer_date) = 10 THEN 'October'
   WHEN EXTRACT(MONTH FROM t1.committer_date) = 11 THEN 'November'
   WHEN EXTRACT(MONTH
 FROM
   t1.committer_date) = 12 THEN 'December'
END
 AS committer_month_name
FROM
carbonfootprint-test._258496aad0f70ded1533d656e2d91ec72ca85379.anon8b5e7308ef0b4726721f5682b48ec5c51cd6137e7fae309474de23971464dd33 AS t1
WHERE
 t1.repo_name LIKE '%cloud-vision%'

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/img10.gif

Conversion of months expressed in integers to full month names

We also demonstrate filtering down data by specifying a specific repo_name and examining the number of commit_counts to this specific repository in a given year. From this, we can use NL2Chart to generate a pie chart. You’ll notice that there’s even an option to generate a gen AI summary for your data. This goes a step further and defines key points of the chart that are not apparent at first glance. Hovering over specific parts of the chart also yields a description with the pointer tooltip. You can export this visualization a png or looker studio or share it with others as well. You can edit chart properties as well as access a JSON output editor by clicking the edit option at the bottom of the node.

5: Generate visualizations and summaries on a specific repo_name

# prompt: for repo name like "cloud-vision", get repo name ,counts by month as commit_counts and year. Use “%" in like

Generated SQL:

Resulting visualization:

https://storage.googleapis.com/gweb-cloudblog-publish/images/img11.max-1300x1300.png

Resulting pie chart shows the number committers over the period of a year, 2016

Finally, once you've completed designing your ETL (Extract, Transform, Load) workflow visually in DAG form with BigQuery data canvas, you can export it to BigQuery Colab Enterprise notebooks, transitioning from a visual format to a coded notebook environment for further analysis or customization. Remember, it’s important to have the right networking parameters set up to connect to a runtime environment, so refer to this reference on Notebooks in BQ Studio.

Streamline data analysis with BigQuery data canvas

When dealing with extensive datasets that span various domains, it can be hard to understand the data for a new project or use case. Using data canvas can streamline this process. By simplifying data analysis through natural language-based SQL generation and visualizations, data canvas can enhance your efficiency and expedite workflows, minimizing repetitive queries and allowing you to schedule automatic data refreshes. BigQuery data canvas is generally available – get started today! And for prompt engineering best practices, be sure to read our companion blog, BigQuery Data Canvas: Prompting Best Practices.

References

Posted in

Data Analytics

How to reduce costs with Managed Service for Apache Kafka: CUDs, compression and more

By Qiqi Wu • 5-minute read

Data Analytics

How to use gen AI for better data schema handling, data quality, and data generation

By Deb Lee • 9-minute read

Data Analytics

BigQuery ML is now compatible with open-source gen AI models

By Vaibhav Sethi • 3-minute read

Data Analytics

Introducing BigQuery metastore, a unified metadata service with Apache Iceberg support

By Yuri Volobuev • 4-minute read

Understanding BigQuery data canvas: how to easily transform data into insights with AI

Layolin Jesudhass

Christine De Sario

Try Gemini 1.5 models

BigQuery data canvas at a glance

1. Find data

2. Generate SQL

3. Create insights

BigQuery data canvas in action

Streamline data analysis with BigQuery data canvas

References

Related articles

How to reduce costs with Managed Service for Apache Kafka: CUDs, compression and more

How to use gen AI for better data schema handling, data quality, and data generation

BigQuery ML is now compatible with open-source gen AI models

Introducing BigQuery metastore, a unified metadata service with Apache Iceberg support