Generate data insights in BigQuery
Data insights offers an automated way to explore and understand your data. It uses Gemini to generate queries based on the metadata of a table, and helps you uncover patterns, assess data quality, and perform statistical analysis.
This document describes the key features of data insights and the process to automate query generation for insightful data exploration.
Before you begin
Data insights are generated using Gemini in BigQuery. To start generating insights, you must first set up Gemini in BigQuery.
Required roles
To create, manage, and retrieve data insights, ask your administrator to grant you the following IAM roles:
Dataplex DataScan Editor (
roles/dataplex.dataScanEditor
) or Dataplex DataScan Administrator (roles/dataplex.dataScanAdmin
) on the project where you want to generate insights.BigQuery Data Viewer (
roles/bigquery.dataViewer
) on the BigQuery tables for which you want to generate insights.BigQuery User (
roles/bigquery.user
) or BigQuery Studio User (roles/bigquery.studioUser
) on the project where you want to generate insights.
To get read-only access to the generated insights, ask your administrator to grant you the following IAM role:
- Dataplex DataScan DataViewer (
roles/dataplex.dataScanDataViewer
) on the project containing the BigQuery tables for which you want to view insights.
For more information about granting roles, see Manage access to service accounts.
You might also be able to get the required permissions through custom roles or other predefined roles. To see the exact permissions that are required generate insights, expand the Required permissions section:
Required permissions
bigquery.jobs.create
bigquery.tables.get
bigquery.tables.getData
dataplex.datascans.create
dataplex.datascans.get
dataplex.datascans.getData
dataplex.datascans.run
Enable APIs
To use data insights, enable the following APIs in your project:
For more information about enabling the Gemini for Google Cloud API, see Enable the Gemini for Google Cloud API in a Google Cloud project.
About data insights
When exploring a new, unfamiliar table, data analysts often face the cold-start problem. The problem often involves uncertainties about the data structure, key patterns, and relevant insights in the data, making it hard to get started writing queries. Data insights addresses the cold-start problem by automatically generating queries in natural language and SQL based on the table's metadata. Rather than starting with an empty query editor, you can quickly start data exploration with meaningful queries that offer valuable insights. Queries generated using data insights are grounded using published data profiling results to improve their accuracy and usefulness.
Example of an insights run
Consider a table called telco_churn
with the following metadata:
Field name | Type |
---|---|
CustomerID | STRING |
Gender | STRING |
Tenure | INT64 |
InternetService | STRING |
StreamingTV | STRING |
OnlineBackup | STRING |
Contract | STRING |
TechSupport | STRING |
PaymentMethod | STRING |
MonthlyCharges | FLOAT |
Churn | BOOLEAN |
The following are some of the sample queries that data insights generates for this table:
Identify customers who have subscribed to all premium services and have been a customer for more than 50 months.
SELECT CustomerID, Contract, Tenure FROM agentville_datasets.telco_churn WHERE OnlineBackup = 'Yes' AND TechSupport = 'Yes' AND StreamingTV = 'Yes' AND Tenure > 50;
Identify which internet service has the most churned customers.
SELECT InternetService, COUNT(DISTINCT CustomerID) AS total_customers FROM agentville_datasets.telco_churn WHERE Churn = TRUE GROUP BY InternetService ORDER BY total_customers DESC LIMIT 1;
Identify churn rates by segment among high-value customers.
SELECT Contract, InternetService, Gender, PaymentMethod, COUNT(DISTINCT CustomerID) AS total_customers, SUM(CASE WHEN Churn = TRUE THEN 1 ELSE 0 END) AS churned_customers, (SUM(CASE WHEN Churn = TRUE THEN 1 ELSE 0 END) / COUNT(DISTINCT CustomerID)) * 100 AS churn_rate FROM agentville_datasets.telco_churn WHERE MonthlyCharges > 100 GROUP BY Contract, InternetService, Gender, PaymentMethod;
Best practices
To improve the accuracy of your generated insights, follow these best practices:
Ground insights to data profiling results
In generative AI, grounding is the ability to connect model output to verifiable sources of information. You can ground the generated insights to data profiling results. Data profiling analyzes the columns in your BigQuery tables and identifies common statistical characteristics, such as typical data values and data distribution. When you create a data profiling scan for a table, you can choose to publish the scan results to BigQuery and Dataplex Catalog. BigQuery uses data profiling results to create more accurate, relevant queries by doing the following:
- Analyzes the data profiling results to identify interesting patterns, trends, or outliers in the data.
- Generates queries that focus on these patterns, trends, or outliers to uncover insights.
- Validates the generated queries against the data profiling results to ensure that the queries return meaningful results.
Without data profiling scans, the queries that Gemini generates are more likely to include inaccurate clauses or produce meaningless results.
Ensure that the data profiling scan for your table is up-to-date and that the results are published to BigQuery.
You can adjust your data profiling settings to increase the sampling size and filter out rows and columns. After you run a new data profiling scan, regenerate insights.
Add a table description
Detailed table descriptions that describe what you want to analyze in your table can help Gemini in BigQuery to produce more relevant insights. After you add a table description, regenerate insights.
For example, you might add the following description to the
telco_churn
table: "This table tracks customer churn data,
including subscription details, tenure, and service usage, to predict customer
churn behavior."
Add column descriptions
Column descriptions that explain what each column is, or how one column relates to another, can improve the quality of your insights. After you update the column descriptions in your table, regenerate insights.
For example, you might add the following descriptions to specific columns of
the telco_churn
table:
Tenure
: "The number of months the customer has been with the service."Churn
: "Whether the customer has stopped using the service. TRUE indicates the customer no longer uses the service, FALSE indicates the customer is active."
Generate insights for a BigQuery table
To generate insights for a BigQuery table, you must access the table entry in BigQuery using BigQuery Studio.
In the Google Cloud console, go to BigQuery Studio.
In the Explorer pane, select the table you want to generate insights for.
Click the Insights tab. If the tab is empty, it means that the insights for this table are not generated yet.
To trigger the insights pipeline, click Generate insights.
It takes a few minutes for the insights to be populated.
If published data profiling results for the table are available, they're used to generate insights. Otherwise, insights are generated based on the column names and descriptions.
In the Insights tab, explore the generated queries and their descriptions.
To open a query in BigQuery, click Copy to Query.
To generate a new set of queries, click Generate insights and trigger the pipeline again.
After you have generated insights for a table, anyone with the
dataplex.datascans.getData
permission and access to the table
can view those insights.
Generate insights for a BigQuery external table
BigQuery data insights supports generating insights for
BigQuery external tables with data in Cloud Storage.
You and the Dataplex service account for the
current project must have the
Storage Object Viewer (roles/storage.objectViewer
)
role on the Cloud Storage bucket that contains the data. For more
information, see
Add a principal to a bucket-level policy.
To generate insights for a BigQuery external table, follow the instructions described in the Generate insights for a BigQuery table section of this document.
Generate insights for a BigLake table
To generate insights for a BigLake table, follow these steps:
Enable the BigQuery Connection API in your project.
Create a BigQuery connection. For more information, see Manage connections.
Grant the Storage Object Viewer (
roles/storage.objectViewer
) IAM role to the service account corresponding to the BigQuery connection that you created.You can retrieve the service account ID from the connection details.
To generate insights, follow the instructions described in the Generate insights for a BigQuery table section of this document.
Pricing
For details about pricing for this feature, see Gemini in BigQuery pricing overview.
Quotas and limits
For information about quotas and limits for this feature, see Quotas for Gemini in BigQuery.
Limitations
- Data insights is available for BigQuery tables, BigLake tables, external tables, and views.
- For multicloud customers, data from other clouds isn't available.
- Data insights doesn't support
Geo
orJSON
column types. - Insights runs don't guarantee the presentation of queries every time. To increase the likelihood of generating more engaging queries, re-initiate the insights pipeline.
- For tables with column-level access control (ACLs) and restricted user permissions, you can generate insights if you have read access to all columns of the table. To run the generated queries, you must have sufficient permissions.
What's next
- Learn more about Dataplex data profiling.
- Learn how to write queries with Gemini assistance in BigQuery.