Use auto data quality

This page describes how to create a Dataplex data quality scan.

To learn about data quality scans, see About auto data quality.

Before you begin

  1. Enable the Dataplex API.

    Enable the API

  2. Optional: If you want Dataplex to generate recommendations for data quality rules based on the results of a data profiling scan, create and run the data profiling scan.

Required roles

  • To run a data quality scan on a BigQuery table, you need permission to read the BigQuery table and permission to create a BigQuery job in the project used to scan the table.

  • If the BigQuery table and the data quality scan are in different projects, then you need to give the Dataplex service account of the project containing the data quality scan read permission for the corresponding BigQuery table.

  • If the data quality rules refer to additional tables, then the scan project's service account must have read permissions on the same tables.

  • To get the permissions that you need to export the scan results to a BigQuery table, ask your administrator to grant the Dataplex service account the BigQuery Data Editor (roles/bigquery.dataEditor) IAM role on the results dataset and table. This grants the following permissions:

    • bigquery.datasets.get
    • bigquery.tables.create
    • bigquery.tables.get
    • bigquery.tables.getData
    • bigquery.tables.update
    • bigquery.tables.updateData
  • If the BigQuery data is organized in a Dataplex lake, grant the Dataplex service account the Dataplex Metadata Reader (roles/dataplex.metadataReader) and Dataplex Viewer (roles/dataplex.viewer) IAM roles. Alternatively, you need all of the following permissions:

    • dataplex.lakes.list
    • dataplex.lakes.get
    • dataplex.zones.list
    • dataplex.zones.get
    • dataplex.entities.list
    • dataplex.entities.get
    • dataplex.operations.get
  • If you're scanning a BigQuery external table from Cloud Storage, grant the Dataplex service account the Cloud Storage roles/storage.objectViewer role for the bucket. Alternatively, assign the Dataplex service account the following permissions:

    • storage.buckets.get
    • storage.objects.get
  • If you want to publish the data quality scan results in the BigQuery and Data Catalog pages in the Google Cloud console for the source tables, you must be granted the BigQuery Data Editor (roles/bigquery.dataEditor) IAM role for the table. Alternatively, you need all of the following permissions:

    • bigquery.tables.get
    • bigquery.tables.update
    • bigquery.tables.updateData
    • bigquery.tables.delete
  • If you need to access columns protected by BigQuery column-level access policies, then assign the Dataplex service account permissions for those columns. The user creating or updating a data scan also needs permissions for the columns.

  • If a table has BigQuery row-level access policies enabled, then you can only scan rows visible to the Dataplex service account. Note that the individual user's access privileges are not evaluated for row-level policies.

Required data scan roles

To use auto data quality, you need either the permissions to run data scans, or a role with predefined permissions to run data scans.

The following table lists the DataScan permissions:

Permission name Grants permission to do the following:
dataplex.datascans.create Create a DataScan
dataplex.datascans.delete Delete a DataScan
dataplex.datascans.get View operational metadata such as ID or schedule, but not results and rules
dataplex.datascans.getData View DataScan details including rules and results
dataplex.datascans.list List DataScans
dataplex.datascans.run Execute a DataScan
dataplex.datascans.update Update the description of a DataScan
dataplex.datascans.getIamPolicy View the current IAM permissions on the scan
dataplex.datascans.setIamPolicy Set IAM permissions on the scan

Grant users one or more of the following roles:

  • Full access to DataScan resources: Dataplex DataScan Administrator (roles/dataplex.dataScanAdmin)
  • Write access to DataScan resources: Dataplex DataScan Editor (roles/dataplex.dataScanEditor)
  • Read access to DataScan resources excluding rules and results: Dataplex DataScan Viewer (roles/dataplex.dataScanViewer)
  • Read access to DataScan resources, including rules and results: Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer)

Define data quality rules

You can define data quality rules by using built-in rules or custom SQL checks. If you're using the Google Cloud CLI, you can define these rules in a JSON or YAML file.

The examples in the following sections show how to define a variety of data quality rules. The rules validate a sample table that contains data about customer transactions. Assume the table has the following schema:

Column name Column type Column description
transaction_timestamp Timestamp Timestamp of the transaction. The table is partitioned on this field.
customer_id String A customer ID in the format of 8 letters followed by 16 digits.
transaction_id String The transaction ID needs to be unique across the table.
currency_id String One of the supported currencies.The currency type must match one of the available currencies in the dimension table dim_currency.
amount float Transaction amount.
discount_pct float Discount percentage. This value must be between 0 and 100.

Define data quality rules using built-in rule types

The following example rules are based on built-in rule types. You can create rules based on built-in rule types using the Google Cloud console or the API. Dataplex might recommend some of these rules.

Column name Rule Type Suggested dimension Rule parameters
transaction_id Uniqueness check Uniqueness Threshold: Not Applicable
amount Null check Completeness Threshold: 100%
customer_id Regex (regular expression) check Validity Regular expression: ^[0-9]{8}[a-zA-Z]{16}$
Threshold: 100%
currency_id Value set check Validity Set of: USD,JPY,INR,GBP,CAN
Threshold: 100%

Define data quality rules using custom SQL rules

To build custom SQL rules, use the following framework:

  • When you create a rule that evaluates one row at a time, create an expression that generates the number of successful rows when Dataplex evaluates the query SELECT COUNTIF(CUSTOM_SQL_EXPRESSION) FROM TABLE. Dataplex checks the number of successful rows against the threshold.

  • When you create a rule that evaluates across the rows or uses a table condition, create an expression that returns success or failure when Dataplex evaluates the query SELECT IF(CUSTOM_SQL_EXPRESSION) FROM TABLE.

  • When you create a rule that evaluates the invalid state of a dataset, provide a statement that returns invalid rows. If any rows are returned, the rule fails. Omit the trailing semicolon from the SQL statement.

  • You can refer to a data source table and all of its precondition filters by using the data reference parameter ${data()} in a rule, instead of explicitly mentioning the source table and its filters. Examples of precondition filters include row filters, sampling percents, and incremental filters. The ${data()} parameter is case-sensitive.

The following example rules are based on custom SQL rules.

Rule type Rule description SQL expression
Row condition Checks if the value of the discount_pct is between 0 and 100. 0 <discount_pct AND discount_pct < 100
Row condition Reference check to validate that currency_id is one of the supported currencies. currency_id in (select id from my_project_id.dim_dataset.dim_currency)
Table condition Aggregate SQL expression that checks if the average discount_pct is between 30% and 50%. 30<avg(discount) AND avg(discount) <50
Row condition Checks if a date is not in the future. TIMESTAMP(transaction_timestamp) < CURRENT_TIMESTAMP()
Table condition A BigQuery user-defined function (UDF) to check that the average transaction amount is less than a predefined value per country. Create the (Javascript) UDF by running the following command:
        CREATE OR REPLACE FUNCTION
        myProject.myDataset.average_by_country (
          country STRING, average FLOAT64)
        RETURNS BOOL LANGUAGE js AS R"""
        if (country = "CAN" && average < 5000){
          return 1
        } else if (country = "IND" && average < 1000){
          return 1
        } else { return 0 }
        """;
       
Example rule to check the average transaction amount for country=CAN.
        myProject.myDataset.average_by_country(
        "CAN",
        (SELECT avg(amount) FROM
          myProject.myDataset.transactions_table
            WHERE currency_id = 'CAN'
        ))
      
Table condition A BigQuery ML predict clause to identify anomalies in discount_pct. It checks if a discount should be applied based on customer, currency, and transaction. The rule checks if the prediction matches the actual value, at least 99% of times. Assumption: The ML model is created before using the rule. Create the ML model using the following command:
  CREATE MODEL
  model-project-id.dataset-id.model-name
        OPTIONS(model_type='logistic_reg') AS
  SELECT
  IF(discount_pct IS NULL, 0, 1) AS label,
  IFNULL(customer_id, "") AS customer,
  IFNULL(currency_id, "") AS currency,
  IFNULL(amount, 0.0) AS amount
  FROM
  `data-project-id.dataset-id.table-names`
  WHERE transaction_timestamp < '2022-01-01';
  
The following rule checks if prediction accuracy is greater than 99%.
      SELECT
        accuracy > 0.99
      FROM
       ML.EVALUATE
        (MODEL model-project-id.dataset-id.model-name,
         (
          SELECT
            customer_id,
            currency_id,
            amount,
            discount_pct
          FROM
            data-project-id.dataset-id.table-names
          WHERE transaction_timestamp > '2022-01-01';
         )
        )
    
Row condition A BigQuery ML predict function to identify anomalies in discount_pct. The function checks if a discount should be applied based on customer, currency and transaction. The rule identifies all the occurrences where the prediction didn't match. Assumption: The ML model is created before using the rule. Create the ML model using the following command:
  CREATE MODEL
  model-project-id.dataset-id.model-name
        OPTIONS(model_type='logistic_reg') AS
  SELECT
  IF(discount_pct IS NULL, 0, 1) AS label,
  IFNULL(customer_id, "") AS customer,
  IFNULL(currency_id, "") AS currency,
  IFNULL(amount, 0.0) AS amount
  FROM
  `data-project-id.dataset-id.table-names`
  WHERE transaction_timestamp < '2022-01-01';
  
The following rule checks if the discount prediction matches with the actual for every row.
       IF(discount_pct > 0, 1, 0)
          =(SELECT predicted_label FROM
           ML.PREDICT(
            MODEL model-project-id.dataset-id.model-name,
              (
                SELECT
                  customer_id,
                  currency_id,
                  amount,
                  discount_pct
                FROM
                  data-project-id.dataset-id.table-names AS t
                    WHERE t.transaction_timestamp =
                     transaction_timestamp
                   LIMIT 1
              )
            )
         )
    
SQL assertion Validates if the discount_pct is greater than 30% for today by checking whether any rows exist with a discount percent less than or equal to 30. SELECT * FROM my_project_id.dim_dataset.dim_currency WHERE discount_pct <= 30 AND transaction_timestamp >= current_date()
SQL assertion (with data reference parameter)

Checks if the discount_pct is greater than 30% for all the supported currencies today.

The date filter transaction_timestamp >= current_date() is applied as a row filter on the data source table.

The data reference parameter ${data()} acts as a placeholder for my_project_id.dim_dataset.dim_currency WHERE transaction_timestamp >= current_date() and applies the row filter.

SELECT * FROM ${data()} WHERE discount_pct > 30

Define data quality rules using the gcloud CLI

The following example YAML file uses some of the same rules as the sample rules using built-in types and the sample custom SQL rules. You can use this YAML file as input to the gcloud CLI command.

rules:
- uniquenessExpectation: {}
  column: transaction_id
  dimension: UNIQUENESS
- nonNullExpectation: {}
  column: amount
  dimension: COMPLETENESS
  threshold: 1
- regexExpectation:
    regex: '^[0-9]{8}[a-zA-Z]{16}$'
  column : customer_id
  ignoreNull : true
  dimension : VALIDITY
  threshold : 1
- setExpectation :
    values :
    - 'USD'
    - 'JPY'
    - 'INR'
    - 'GBP'
    - 'CAN'
  column : currency_id
  ignoreNull : true
  dimension : VALIDITY
  threshold : 1
- rangeExpectation:
    minValue : '0'
    maxValue : '100'
  column : discount_pct
  ignoreNull : true
  dimension : VALIDITY
  threshold : 1
- rowConditionExpectation:
    sqlExpression : 0 < `discount_pct` AND `discount_pct` < 100
  column: discount_pct
  dimension: VALIDITY
  threshold: 1
- rowConditionExpectation:
    sqlExpression : currency_id in (select id from `my_project_id.dim_dataset.dim_currency`)
  column: currency_id
  dimension: VALIDITY
  threshold: 1
- tableConditionExpectation:
    sqlExpression : 30 < avg(discount_pct) AND avg(discount_pct) < 50
  dimension: VALIDITY
- rowConditionExpectation:
    sqlExpression : TIMESTAMP(transaction_timestamp) < CURRENT_TIMESTAMP()
  column: transaction_timestamp
  dimension: VALIDITY
  threshold: 1
- sqlAssertion:
    sqlStatement : SELECT * FROM `my_project_id.dim_dataset.dim_currency` WHERE discount_pct > 100
  dimension: VALIDITY

Create a data quality scan

Console

  1. In the Google Cloud console, go to the Data quality page.

    Go to Data quality

  2. Click Create data quality scan.

  3. In the Define scan window, fill in the following fields:

    1. Enter a Display name.

    2. The scan ID is automatically generated if you don't provide your own ID. See the resource naming convention.

    3. Optional: Enter a Description.

    4. In the Table field, click Browse, choose your table, and click Select. Dataplex supports only standard BigQuery tables.

      For tables in multi-region datasets, choose a region where to create the data scan.

      To browse the tables organized within the Dataplex lake, click Browse within Dataplex Lakes.

    5. In the Scope field, choose Incremental or Entire data.

      • If you choose Incremental: In the Timestamp column field, select a column of type DATE or TIMESTAMP from your BigQuery table that increases monotonically and can be used to identify new records. It can be a column that partitions the table.
    6. Optional: Add labels. Labels are key:value pairs that allow you to group related objects together or with other Google Cloud resources.

    7. To filter your data, click Filters. Select the Filter rows checkbox. The input value for row filter must be a valid SQL expression that can be used as a part of a WHERE clause in GoogleSQL syntax. For example, col1 >= 0. The filter can be a combination of multiple column conditions. For example, col1 >= 0 AND col2 < 10.

    8. To sample your data, in the Sampling size list, select a sampling percentage. Choose a percentage value that ranges between 0.0% and 100.0% with up to 3 decimal digits. For larger datasets, choose a lower sampling percentage. For example, for a ~1 PB table, if you enter a value between 0.1% and 1.0%, Dataplex samples between 1-10 TB of data. For incremental data scans, Dataplex applies sampling to the latest increment.

    9. To publish the data quality scan results in the BigQuery and Data Catalog pages in the Google Cloud console for the source table, click the Publish results to the BigQuery and Dataplex Catalog UI checkbox. You can view the latest scan results in the Data Quality tab in the BigQuery and Data Catalog pages for the source table. To enable users to access the published scan results, see Share the published results. The publishing option might not be available in the following cases:

      • You don't have the required permissions on the table.
      • Another data quality scan is set to publish results.

      For more information about the permissions required to view the published results, see Permissions.

    10. Click Continue.

  4. In the Schedule window, choose one of the following options:

    • Repeat: Run your data quality scan job on a schedule: daily, weekly, monthly, or custom. Specify how often the scan runs and at what time. If you choose custom, use cron format to specify the schedule.

    • On-demand: Run your data quality scan job on demand.

    Click Continue.

  5. In the Data quality rules window, define the rules to configure for this data quality scan. Click Add rules, and then choose one of the following options.

    • Profile based recommendations: Build rules from the recommendations based on an existing data profiling scan.

      1. Choose columns: Select the columns to get recommended rules for.

      2. Scan project: Recommendations based on an existing data profiling scan. By default, Dataplex selects profiling scans from the same project in which you are creating the data quality scan. If you created the scan in a different project, you must specify the project to pull profile scans from.

      3. Choose profile results: Based on the columns and project you select, multiple profile results appear.

      4. Select one or more profile results and then click OK. This populates a list of rules to select from.

      5. Select the rules you want to edit by checking the boxes and clicking Select. Once selected, the rules are added to your current rule list. Then, you can edit the rules.

    • Built-in rule types: Build rules from predefined rules. See the list of predefined rules.

      1. Choose columns: Select the columns to select rules for.

      2. Choose rule types: Based on the columns you select, multiple rule types appear for selection.

      3. Select one or more rule types, and then click OK. This populates a list of rules to select from.

      4. Select the rules you want to edit by checking the boxes and clicking Select. Once selected, the rules are added to your current rules list. Then, you can edit the rules.

    • SQL row check rule: Create a custom SQL rule to apply to each row (custom SQL row-check rule).

      1. In Dimension, choose one dimension.

      2. In Passing threshold, choose a percentage of records that must pass the check.

      3. In Column name, choose a column.

      4. In the Provide a SQL expression field, enter a SQL expression that evaluates to a boolean true (pass) or false (fail). For more information, see Supported custom SQL rule types and the examples in the Define data quality rules section of this document.

      5. Click Add.

    • SQL aggregate check rule: Create a custom SQL table condition rule.

      1. In Dimension, choose one dimension.

      2. In Column name, choose a column.

      3. In the Provide a SQL expression field, enter a SQL expression that evaluates to a boolean true (pass) or false (fail). For more information, see Supported custom SQL rule types and the examples in the Define data quality rules section of this document.

      4. Click Add.

    • SQL assertion rule: Create a custom SQL assertion rule to check for an invalid state of the data.

      1. In Dimension, choose one dimension.

      2. Optional: In Column name, choose a column.

      3. In the Provide a SQL statement field, enter a SQL statement that returns rows that match the invalid state. If any rows are returned, this rule fails. Omit the trailing semicolon from the SQL statement. For more information, see Supported custom SQL rule types and the examples in the Define data quality rules section of this document.

      4. Click Add.

    Dataplex allows custom names for data quality rules for monitoring and alerting. For any data quality rule, you can optionally assign a custom rule name and a description. To do this, edit a rule and specify the following details:

    • Rule name: Enter a custom rule name with up to 63 characters. The rule name can include letters (a-z, A-Z), digits (0-9), and hyphens (-) and must start with a letter and end with a number or a letter.
    • Description: Enter a rule description with a maximum length of 1,024 characters.

    Click Continue.

  6. Optional: Export the scan results to a BigQuery standard table. In the Export scan results to BigQuery table section, click Browse to select an existing BigQuery dataset to store the data quality scan results.

    If the specified table doesn't exist, Dataplex creates it for you. If you are using an existing table, make sure that it is compatible with the export table schema.

  7. Optional: Set up email notification reports to alert people about the status and results of a data quality scan job. In the Notification report section, click Add email ID and enter up to five email addresses. Then, select the scenarios that you want to send reports for:

    • Quality score (<=): sends a report when a job succeeds with a data quality score that is lower than the specified target score. Enter a target quality score between 0 and 100.
    • Job failures: sends a report when the job itself fails, regardless of the data quality results.
    • Job completion (success or failure): sends a report when the job ends, regardless of the data quality results.
  8. Click Create.

    After the scan is created, you can run it at any time by clicking Run now.

gcloud

To create a data quality scan, use the gcloud dataplex datascans create data-quality command.

If the source data is organized in a Dataplex lake, include the --data-source-entity flag:

gcloud dataplex datascans create data-quality DATASCAN \
    --location=LOCATION \
    --data-quality-spec-file=DATA_QUALITY_SPEC_FILE \
    --data-source-entity=DATA_SOURCE_ENTITY

If the source data isn't organized in a Dataplex lake, include the --data-source-resource flag:

gcloud dataplex datascans create data-quality DATASCAN \
    --location=LOCATION \
    --data-quality-spec-file=DATA_QUALITY_SPEC_FILE \
    --data-source-resource=DATA_SOURCE_RESOURCE

Replace the following variables:

  • DATASCAN: The name of the data quality scan.
  • LOCATION: The Google Cloud region in which to create the data quality scan.
  • DATA_QUALITY_SPEC_FILE: The path to the JSON or YAML file containing the specifications for the data quality scan. The file can be a local file or a Cloud Storage path with the prefix gs://. Use this file to specify the data quality rules for the scan. You can also specify additional details in this file, such as filters, sampling percent, and post-scan actions like exporting to BigQuery or sending email notification reports. See the documentation for JSON representation.
  • DATA_SOURCE_ENTITY: The Dataplex entity that contains the data for the data quality scan. For example, projects/test-project/locations/test-location/lakes/test-lake/zones/test-zone/entities/test-entity.
  • DATA_SOURCE_RESOURCE: The name of the resource that contains the data for the data quality scan. For example, //bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table.

REST

Use the APIs Explorer to create a data quality scan.

If you want to build rules for the data quality scan by using rule recommendations that are based on the results of a data profiling scan, get the recommendations by calling the dataScans.jobs.generateDataQualityRules method on the data profiling scan.

Export table schema

To export the data quality scan results to an existing BigQuery table, make sure that it is compatible with the following table schema:

Column name Column data type Sub field name
(if applicable)
Sub field data type Mode Example
data_quality_scan struct/record resource_name string nullable //dataplex.googleapis.com/projects/test-project/locations/europe-west2/datascans/test-datascan
project_id string nullable dataplex-back-end-dev-project
location string nullable us-central1
data_scan_id string nullable test-datascan
data_source struct/record resource_name string nullable Entity case:
//dataplex.googleapis.com/projects/dataplex-back-end-dev-project/locations/europe-west2/lakes/a0-datascan-test-lake/zones/a0-datascan-test-zone/entities/table1

Table case: //bigquery.googleapis.com/projects/test-project/datasets/test-dataset/tables/test-table
dataplex_entity_project_id string nullable dataplex-back-end-dev-project
dataplex_entity_project_number integer nullable 123456789
dataplex_lake_id string nullable (Valid only if source is entity)
test-lake
dataplex_zone_id string nullable (Valid only if source is entity)
test-zone
dataplex_entity_id string nullable (Valid only if source is entity)
test-entity
table_project_id string nullable test-project
table_project_number integer nullable 987654321
dataset_id string nullable (Valid only if source is table)
test-dataset
table_id string nullable (Valid only if source is table)
test-table
data_quality_job_id string nullable caeba234-cfde-4fca-9e5b-fe02a9812e38
data_quality_job_configuration json trigger string nullable ondemand/schedule
incremental boolean nullable true/false
sampling_percent float nullable (0-100)
20.0 (indicates 20%)
row_filter string nullable col1 >= 0 AND col2 < 10
job_labels json nullable {"key1":value1}
job_start_time timestamp nullable 2023-01-01 00:00:00 UTC
job_end_time timestamp nullable 2023-01-01 00:00:00 UTC
job_rows_scanned integer nullable 7500
rule_name string nullable test-rule
rule_type string nullable Range Check
rule_evaluation_type string nullable Per row
rule_column string nullable Rule only attached to a certain column
rule_dimension string nullable UNIQUENESS
job_quality_result struct/record passed boolean nullable true/false
score float nullable 90.8
job_dimension_result json nullable {"ACCURACY":{"passed":true,"score":100},"CONSISTENCY":{"passed":false,"score":60}}
rule_threshold_percent float nullable (0.0-100.0)
Rule-threshold-pct in API * 100
rule_parameters json nullable {min: 24, max:5345}
rule_pass boolean nullable True
rule_rows_evaluated integer nullable 7400
rule_rows_passed integer nullable 3
rule_rows_null integer nullable 4
rule_failed_records_query string nullable "SELECT * FROM `test-project.test-dataset.test-table` WHERE (NOT((`cTime` >= '15:31:38.776361' and `cTime` <= '19:23:53.754823') IS TRUE));"

When you configure BigQueryExport for a data quality scan job, follow these guidelines:

  • For the field resultsTable, use the format: //bigquery.googleapis.com/projects/{project-id}/datasets/{dataset-id}/tables/{table-id}.
  • Use a BigQuery standard table.
  • If the table doesn't exist when the scan is created or updated, Dataplex creates the table for you.
  • By default, the table is partitioned on the job_start_time column daily.
  • If you want the table to be partitioned in other configurations or if you don't want the partition, then recreate the table with the required schema and configurations and then provide the pre-created table as the results table.
  • Make sure the results table is in the same location as the source table.
  • If VPC-SC is configured on the project, then the results table must be in the same VPC-SC perimeter as the source table.
  • If the table is modified during the scan execution stage, then the current running job exports to the previous results table and the table change takes effect from the next scan job.
  • Don't modify the table schema. If you need customized columns, create a view upon the table.
  • To reduce costs, set an expiration on the partition based on your use case. For more information, see how to set the partition expiration.

Run a data quality scan

Console

  1. In the Google Cloud console, go to the Data quality page.

    Go to Data quality

  2. Click the data quality scan to run.

  3. Click Run now.

gcloud

To run a data quality scan, use the gcloud dataplex datascans run command:

gcloud dataplex datascans run DATASCAN \
--location=LOCATION \

Replace the following variables:

  • LOCATION: The Google Cloud region in which the data quality scan was created.
  • DATASCAN: The name of the data quality scan.

REST

Use the APIs Explorer to run your data quality scan.

View the data quality scan results

Console

  1. In the Google Cloud console, go to the Data quality page.

    Go to Data quality

  2. To see the detailed results of a scan, click the name of the scan.

    • The Overview section displays information about the last seven jobs, including when the scan was run, the number of records scanned in each job, whether all the data quality checks passed, if there were failures, the number of data quality checks that failed, and which dimensions failed.

    • The Data quality scan configuration section displays details about the scan.

  3. To see data quality scores that indicate the percentage of rules that passed, click the Jobs history tab. Then, click a job ID.

gcloud

To view the results of a data quality scan job, use the gcloud dataplex datascans jobs describe command:

gcloud dataplex datascans jobs describe JOB \
--location=LOCATION \
--datascan=DATASCAN \
--view=FULL

Replace the following variables:

  • JOB: The job ID of the data quality scan job.
  • LOCATION: The Google Cloud region in which the data quality scan was created.
  • DATASCAN: The name of the data quality scan the job belongs to.
  • --view=FULL: To see the scan job result, specify FULL.

REST

Use the APIs Explorer to view the results of a data quality scan.

View historical scan results

Dataplex saves the data quality scan history of the last 300 jobs or for the past year, whichever occurs first.

Console

  1. In the Google Cloud console, go to the Data quality page.

    Go to Data quality

  2. Click the name of a scan.

  3. Click the Jobs history tab.

    The Jobs history tab provides information about past jobs. It lists all of the jobs, the number of records scanned in each job, the job status, the time the job was run, whether each rule passed or failed, and more.

  4. To view detailed information about a job, click any of the jobs in the Job ID column.

gcloud

To view all jobs of a data quality scan, use the gcloud dataplex datascans jobs list command:

gcloud dataplex datascans jobs list \
--location=LOCATION \
--datascan=DATASCAN \

Replace the following variables:

  • LOCATION: The Google Cloud region in which the data quality scan was created.
  • DATASCAN: The name of the data quality scan to view all jobs for.

REST

Use the APIs Explorer to view all scan jobs.

Share the published results

When creating a data quality scan, if you chose to publish the scan results in the BigQuery and Data Catalog pages in the Google Cloud console, then the latest scan results will be available in the Data Quality tab in those pages.

You can enable the users in your organization to access the published scan results. To grant access to the scan results, follow these steps:

  1. In the Google Cloud console, go to the Data quality page.

    Go to Data quality

  2. Click the data quality scan you want to share the results of.

  3. Go to the Permissions tab.

  4. Click Grant access.

  5. In the New principals field, add the principal to which you want to grant access.

  6. In the Select a role field, select Dataplex DataScan DataViewer.

  7. Click Save.

To remove access to the published scan results for a principal, follow these steps:

  1. In the Google Cloud console, go to the Data quality page.

    Go to Data quality

  2. Click the data quality scan you want to share the results of.

  3. Go to the Permissions tab.

  4. Select the principal for which you want to remove the Dataplex DataScan DataViewer role.

  5. Click Remove access.

  6. Click Confirm.

Set alerts in Cloud Logging

To set alerts for data quality failures using the logs in Cloud Logging, follow these steps:

Console

  1. In the Google Cloud console, go to the Cloud Logging Logs Explorer.

    Go to Logs explorer

  2. In the Query window, enter your query. See sample queries.

  3. Click Run Query.

  4. Click Create alert. This opens a side panel.

  5. Enter your alert policy name and click Next.

  6. Review the query.

    1. Click the Preview Logs button to test your query. This shows logs with matching conditions.

    2. Click Next.

  7. Set the time between notifications and click Next.

  8. Define who should be notified for the alert and click Save to create the alert policy.

Alternatively, you can configure and edit your alerts by navigating in the Google Cloud console to Monitoring > Alerting.

gcloud

Not supported.

REST

Use the APIs Explorer to set alerts in Cloud Logging.

Sample queries for setting job level or dimension level alerts

  • A sample query to set alerts on overall data quality failures for a data quality scan:

    resource.type="dataplex.googleapis.com/DataScan"
    AND labels."dataplex.googleapis.com/data_scan_state"="SUCCEEDED"
    AND resource.labels.resource_container="projects/112233445566"
    AND resource.labels.datascan_id="a0-test-dec6-dq-3"
    AND NOT jsonPayload.dataQuality.passed=true
    
  • A sample query to set alerts on data quality failures for a dimension (for example, uniqueness) of a given data quality scan:

    resource.type="dataplex.googleapis.com/DataScan"
    AND labels."dataplex.googleapis.com/data_scan_state"="SUCCEEDED"
    AND resource.labels.resource_container="projects/112233445566"
    AND resource.labels.datascan_id="a0-test-dec6-dq-3"
    AND jsonPayload.dataQuality.dimensionPassed.UNIQUENESS=false
    
  • A sample query to set alerts on data quality failures for a table.

    • Set alerts on data quality failures for a BigQuery table that isn't organized in a Dataplex lake:

      resource.type="dataplex.googleapis.com/DataScan"
      AND jsonPayload.dataSource="//bigquery.googleapis.com/projects/test-project/datasets/testdataset/table/chicago_taxi_trips"
      AND labels."dataplex.googleapis.com/data_scan_state"="SUCCEEDED"
      AND resource.labels.resource_container="projects/112233445566"
      AND NOT jsonPayload.dataQuality.passed=true
      
    • Set alerts on data quality failures for a BigQuery table that's organized in a Dataplex lake:

      resource.type="dataplex.googleapis.com/DataScan"
      AND jsonPayload.dataSource="projects/test-project/datasets/testdataset/table/chicago_taxi_trips"
      AND labels."dataplex.googleapis.com/data_scan_state"="SUCCEEDED"
      AND resource.labels.resource_container="projects/112233445566"
      AND NOT jsonPayload.dataQuality.passed=true
      

Sample queries to set per rule alerts

  • A sample query to set alerts on all failing data quality rules with the specified custom rule name for a data quality scan:

    resource.type="dataplex.googleapis.com/DataScan"
    AND jsonPayload.ruleName="custom-name"
    AND jsonPayload.result="FAILED"
    
  • A sample query to set alerts on all failing data quality rules of a specific evaluation type for a data quality scan:

    resource.type="dataplex.googleapis.com/DataScan"
    AND jsonPayload.evalutionType="PER_ROW"
    AND jsonPayload.result="FAILED"
    
  • A sample query to set alerts on all failing data quality rules for a column in the table used for a data quality scan:

    resource.type="dataplex.googleapis.com/DataScan"
    AND jsonPayload.column="CInteger"
    AND jsonPayload.result="FAILED"
    

Troubleshoot a data quality failure

For each job with row-level rules that fail, Dataplex provides a query to get the failed records. Run this query to see the records that did not match your rule.

Console

  1. In the Google Cloud console, go to the Data quality page.

    Go to Data quality

  2. Click the name of the scan whose records you want to troubleshoot.

  3. Click the Jobs history tab.

  4. Click the job ID of the job that identified data quality failures.

  5. In the job results window that opens, in the Rules section, find the column Query to get failed records. Click Copy query to clipboard for the failed rule.

  6. Run the query in BigQuery to see the records that caused the job to fail.

gcloud

Not supported.

REST

Use the APIs Explorer to see the query to get failed records for jobs that failed.

Update a data quality scan

Console

  1. In the Google Cloud console, go to the Data Quality page.

    Go to Data quality

  2. In the row with the scan to edit, click the vertical three dots > Edit.

  3. Edit the values.

  4. Click Save.

gcloud

To update the description of a data quality scan, use the gcloud dataplex datascans update data-quality command:

gcloud dataplex datascans update data-quality DATASCAN \
--location=LOCATION \
--description=DESCRIPTION

Replace the following:

  • DATASCAN: The name of the data quality scan to update.
  • LOCATION: The Google Cloud region in which the data quality scan was created.
  • DESCRIPTION: The new description for the data quality scan.

REST

Use the APIs Explorer to edit your data quality scan.

Delete a data quality scan

Console

  1. In the Google Cloud console, go to the Data quality page.

    Go to Data quality

  2. Click the scan you want to delete.

  3. Click Delete.

gcloud

To delete a data quality scan, use the gcloud dataplex datascans delete command:

gcloud dataplex datascans delete DATASCAN \
--location=LOCATION \
--async

Replace the following variables:

  • DATASCAN: The name of the data quality scan to delete.
  • LOCATION: The Google Cloud region in which the data quality scan was created.

REST

Use the APIs Explorer to delete your data quality scan.

What's next?