Generate data insights in Dataplex

Data insights offers an automated and intuitive way to explore and understand your data. It uses Gemini large language models to generate queries based on the metadata of a table, and lets you uncover patterns, assess data quality, and perform statistical analysis.

This document describes the key features of data insights and the process to automate query generation for insightful data exploration.

About data insights

Data analysts face the cold-start problem in data exploration, when exploring a new dataset with little or no prior knowledge. The problem often involves uncertainties about the data structure, key patterns, and relevant insights. By using automated query generation based on metadata, data insights addresses the cold-start problem in data exploration. The insights provides valuable cues to help you make informed decisions and gain deeper insights into your data. Rather than starting with a blank slate, you can more quickly start data exploration with meaningful queries that offer valuable insights.

Queries generated using data insights are grounded using published profile scan data. Data insights uses published profile scan data to craft queries that deliver results, providing efficient and reliable information retrieval. This significantly accelerates the initiation of the data analysis process, and lets you delve into the data with a clearer direction and purpose.

Data insights serves as a guiding tool that resolves the common challenge of navigating unfamiliar datasets, empowering you to make informed decisions and uncover patterns more rapidly during data exploration.

Example of an insights run

Consider a table called telco_churn with the following technical metadata:

Field name Type
CustomerID STRING
Gender STRING
Tenure INT64
PhoneService STRING
OnlineBackup STRING
Dependents BOOLEAN
Contract STRING
TechSupport STRING
PaymentMethod STRING
MonthlyCharges FLOAT
Churn BOOLEAN

The following are some of the sample queries that data insights generates for this table:

  • Identify customers who have subscribed to all premium services and have been a customer for more than 50 months.

    SELECT
      CustomerID,
      Contract,
      Tenure
    FROM
      agentville_datasets.telco_churn
    WHERE
      OnlineSecurity = 'Yes'
      AND OnlineBackup = 'Yes'
      AND DeviceProtection = 'Yes'
      AND TechSupport = 'Yes'
      AND StreamingTV = 'Yes'
      AND StreamingMovies = 'Yes'
      AND Tenure > 50;
    
  • Identify which internet service has the most churned customers.

    SELECT
      InternetService,
      COUNT(DISTINCT customerID) AS total_customers
    FROM
      agentville_datasets.telco_churn
    WHERE
      Churn = TRUE
    GROUP BY
      InternetService
    ORDER BY
      total_customers DESC
    LIMIT 1;
    
  • Identify customer segments with high churn rates among high-value customers.

    SELECT
      Contract,
      InternetService,
      Gender,
      PaymentMethod,
      COUNT(DISTINCT customerID) AS total_customers,
      SUM(CASE WHEN Churn = TRUE THEN 1 ELSE 0 END) AS churned_customers,
      (SUM(CASE WHEN Churn = TRUE THEN 1 ELSE 0 END) / COUNT(DISTINCT customerID))
    * 100 AS churn_rate
    FROM
      agentville_datasets.telco_churn
    WHERE
      MonthlyCharges > 100
    GROUP BY
      Contract,
      InternetService,
      Gender,
      PaymentMethod;
    

About grounding insights using profile scans

Data insights use the published profile scan data to generate more accurate and relevant queries. Profile scan data provides valuable information about the data distribution, data types, and statistical summaries of the dataset. Because queries are grounded using profile scan data, data insights ensures that the generated queries are meaningful, and returns results that help in further analysis.

This section describes how data insights uses the published profile scan data to ground queries, and provides tips to fully use the potential of this feature.

About profile scan data

Profile scan data is the metadata that describes the contents of a dataset. It includes the following information:

  • Data types of columns
  • Minimum and maximum values
  • Distribution of values
  • Null or missing values
  • Top values
  • Unique values and their frequencies

Data insights uses this information to generate queries that are tailored to specific dataset and to provide meaningful insights.

How data insights grounds queries using profile scan data

Data insights uses profile scan data to create queries that are based on the actual data distribution and patterns within the dataset. This process involves the following steps:

  • Analyzing the profile scan data to identify interesting patterns, trends, or outliers in the data.
  • Generating queries that focus on these patterns, trends, or outliers to uncover insights.
  • Validating the generated queries against the profile scan data to ensure that the queries return meaningful results.

Tips for maximizing the benefits of data insights

Grounded queries help ensure that the insights you gain are accurate, relevant, and actionable, enabling you to make better data-driven decisions. To make the most out of grounded queries using profile scan data, follow these tips:

  • Ensure that your table has up-to-date published profile scan data. This helps data insights to generate more accurate and relevant queries.
  • Review the generated queries to understand how they are grounded in the profile scan data. This lets you interpret the results and gain deeper insights into your data.
  • Adjust the profile scan settings of your table or provide additional context to data insights if the generated queries aren't relevant or useful.

Pricing

The Dataplex data insights feature is offered free of charge during this Preview.

Limitations

  • Data insights is available for native BigQuery tables, BigLake tables, external tables, and views.
  • For multicloud customers, data from other clouds isn't available.
  • Data insights is available in all Dataplex regions.
  • Data insights doesn't support Geo or JSON column types.
  • Insights runs don't guarantee the presentation of queries every time. To increase the likelihood of generating more engaging queries, re-initiate the insights pipeline.
  • For tables with column-level access control (ACLs) and restricted user permissions, you can generate insights if you have read access to all columns of the table. To run the generated queries, you must have the respective permissions.

Before you begin

Required roles and permissions

To get the permissions that you need to use data insights, ask your administrator to grant you the following IAM roles on the project:

  • Run the insights pipeline:

    • Cloud AI Companion User (roles/cloudaicompanion.user) IAM role on the service account for the project in which you trigger the insights scan.

      The email address of the service account is of the following format:

      service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com.

    • dataplex.datascans.create permission on the project

    • bigquery.tables.getData permission on the BigQuery table

    • Read access to all columns of the table

  • View generated insights:

    • dataplex.datascans.getData on the generated data scan
  • Ground queries using profile scans:

    • Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer) IAM role on the service account (service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com)

    • dataplex.datascans.getData permission on the published data profile scan

For more information about granting roles, see Manage access to service accounts.

You might also be able to get the required permissions through custom roles or other predefined roles.

Enable APIs

To use data insights, enable the following APIs in your project:

For more information about enabling the Gemini API, see Activate Gemini Code Assist in a Google Cloud project.

Generate insights for a BigQuery table

To generate insights for a BigQuery table, you must access the table entry in Dataplex using the Dataplex Search.

  1. In the Google Cloud console, go to the Dataplex Search page.

    Go to Dataplex Search

  2. Search for the table entry in Dataplex.

  3. Click the Insights tab. If the tab is empty, it means that the insights for this table are not generated yet.

  4. To trigger the insights pipeline, click Generate insights.

    It takes 5-10 minutes for the insights to be populated.

    If a published profile scan for the table is accessible, it is used to generate comprehensive insights. Otherwise, insights are formulated based on the column names and their respective descriptions. This approach ensures that you receive insights regardless of the availability of a profile scan.

  5. In the Insights tab, explore the generated queries and their descriptions.

  6. To execute a query, click Open in BigQuery. The query opens in BigQuery.

  7. To generate a new set of queries, click Generate insights and trigger the pipeline again.

Generate insights for a BigQuery external table

Dataplex data insights supports BigQuery external tables that are located in the same Google Cloud project. If the BigQuery table references data that is stored in Cloud Storage in another Google Cloud project, the insights generation fails.

To generate insights for a BigQuery external table, follow the instructions described in the Generate insights for a BigQuery table section of this document.

Generate insights for a BigLake table

To generate insights for a BigLake table, follow these steps:

  1. Enable the BigQuery Connection API in your project.

    Enable the BigQuery Connection API

  2. Create a BigQuery connection. For more information, see Manage connections.

  3. Grant the Storage Object Viewer (roles/storage.objectViewer) IAM role to the service account corresponding to the BigQuery connection that you created.

    You can retrieve the service account ID from the connection information page for the service account.

  4. To generate insights, follow the instructions described in the Generate insights for a BigQuery table section of this document.

What's next