Discover data

This guide explains how to enable and use Dataplex Discovery. Discovery scans and extracts metadata from data in a data lake and registers it to Dataproc Metastore, BigQuery, and Data Catalog for analysis, search, and exploration.

Overview

For each Dataplex asset with Discovery enabled, Dataplex does the following:

  • Scans the data associated with the asset.
  • Groups structured and semi-structured files into tables.
  • Collects technical metadata, such as table name, schema, and partition definition.

For unstructured data, such as images and videos, Dataplex Discovery automatically detects and registers groups of files that share a media type as filesets. For example, if gs://images/group1 contains GIF images and gs://images/group2 contains JPEG images, Dataplex Discovery detects and registers two filesets. For structured data, such as Avro, Discovery detects files only if they are located in folders that contain the same data format and schema.
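The fileset example above can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not how Discovery is implemented: the function name group_filesets is hypothetical, and grouping by (folder, media type) with the standard mimetypes module is an assumption made for the sketch.

```python
import mimetypes
from collections import defaultdict


def group_filesets(uris):
    """Group object URIs into candidate filesets keyed by (folder, media type).

    Assumption: files in the same folder with the same guessed media type
    form one fileset. Discovery's real grouping rules may differ.
    """
    groups = defaultdict(list)
    for uri in uris:
        folder, _, file_name = uri.rpartition("/")
        # Guess the media type from the file extension only.
        media_type, _ = mimetypes.guess_type(file_name)
        groups[(folder, media_type)].append(uri)
    return dict(groups)


filesets = group_filesets([
    "gs://images/group1/a.gif",
    "gs://images/group1/b.gif",
    "gs://images/group2/c.jpeg",
])
for (folder, media_type), members in filesets.items():
    print(folder, media_type, len(members))
```

With the two folders from the example, the sketch yields two groups, one per media type.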

The discovered tables and filesets are registered in Data Catalog for search and discovery. The tables appear in Dataproc Metastore as Hive-style tables, and in BigQuery as external tables, so that data is automatically made available for analysis.

Discovery supports the following structured and semi-structured data formats:

  • Parquet
  • Avro
  • ORC
  • JSON
  • CSV

Discovery supports the following compression formats for structured and semi-structured data:

  • Internal compression for these formats:

    Compression   File extension sample   Supported format
    gzip          .gz.parquet             Parquet
    lz4           .lz4.parquet            Parquet
    Snappy        .snappy.parquet         Parquet, ORC, Avro
    lzo           .lzo.parquet            Parquet, ORC
  • External compression for JSON and CSV files:

    • gzip
    • bzip2

Discovery configuration

Discovery is enabled by default when you create a new zone or asset. You can disable Discovery at the zone or asset level.

When you create a zone or an asset, you can choose to inherit Discovery settings at the zone level, or override Discovery settings at the asset level.

Here are the Discovery configuration options available at zone and asset levels:

  • Discovery on and off.

  • Discovery schedule: This option can be set to a predefined schedule (for example, hourly or daily) or to a custom schedule defined in cron format. New assets are scanned when they are added. For more information, see Configuring cron schedules. Recommended: Schedule Discovery to run every hour or less frequently.

  • Include or exclude pattern: Define which files to include in or exclude from Discovery scans by using glob patterns in the include or exclude path. For example, to exclude gs://test_bucket/foo/.. from Discovery, enter **/foo/* as the exclude path. Quotation marks cause errors, so enter **/foo/* instead of "**/foo/*". This option is only available for Cloud Storage assets. When both include and exclude patterns exist, exclude patterns are applied first.

  • JSON or CSV specifications: Let you provide additional information about semi-structured data, such as CSV and JSON, to enhance the accuracy of Discovery results.

    • For CSV files, you can provide any of the following:

      • Delimiter: This field accepts one character, except for \r and \n. If more than one character is provided, only the first character of the string is used. If not provided, Discovery uses a comma as the delimiter.

      • Number of header rows: This field accepts the value 0 or 1. The default value is 0. When the value is 0, Discovery performs header inference. If a header is detected, Discovery extracts column names from the header and resets the value to 1.

      • Encoding: This field accepts string encoding names, such as UTF-8, US-ASCII, or ISO-8859-1. If nothing is specified, UTF-8 is used as the default.

      • Disable type inference: This field accepts a Boolean value. It's set to false by default. For CSV data, if you disable type inference, all columns are registered as strings.

    • For JSON files, you can provide any of the following:

      • Encoding: This field accepts string encoding names, such as UTF-8, US-ASCII, or ISO-8859-1. If nothing is specified, UTF-8 is used as the default.

      • Disable data type inference: This field accepts a Boolean value. It's set to false by default. For JSON data, if you disable type inference, all columns are registered as their primitive types (string, number, or boolean).
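To make the CSV options above concrete, here is a minimal Python sketch that mimics how they could shape an inferred schema. It is not Discovery's implementation. The function names, the col_N column names, and the coarse integer/double/string inference are all assumptions for illustration, and the sketch does not attempt header inference when the header row count is 0.

```python
import csv
import io


def _infer_type(values):
    """Return a coarse type name for a column of raw strings (hypothetical)."""
    if not values:
        return "string"
    for caster, type_name in ((int, "integer"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


def sketch_csv_schema(raw, delimiter=",", header_rows=0, encoding="UTF-8",
                      disable_type_inference=False):
    """Return a list of (column name, type name) pairs for CSV bytes."""
    if header_rows not in (0, 1):
        raise ValueError("header_rows must be 0 or 1")
    # Only the first character of the delimiter string is used.
    delim = delimiter[:1] or ","
    if delim in ("\r", "\n"):
        raise ValueError("delimiter cannot be \\r or \\n")
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delim)
    rows = [row for row in reader if row]
    if not rows:
        return []
    if header_rows == 1:
        names, data = rows[0], rows[1:]
    else:
        # Assumption: no header inference here, so use placeholder names.
        names = [f"col_{i}" for i in range(len(rows[0]))]
        data = rows
    schema = []
    for i, name in enumerate(names):
        column = [row[i] for row in data if i < len(row)]
        # With type inference disabled, every column is a string.
        type_name = "string" if disable_type_inference else _infer_type(column)
        schema.append((name, type_name))
    return schema


sample = b"id|name\n1|a\n2|b\n"
print(sketch_csv_schema(sample, delimiter="|;", header_rows=1))
print(sketch_csv_schema(sample, delimiter="|", header_rows=1,
                        disable_type_inference=True))
```

The first call passes a two-character delimiter to show that only the first character is used. The second call shows every column falling back to string when type inference is disabled.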

Publish metadata

When you create a data zone in your Dataplex lake, Dataplex creates a BigQuery dataset in the project containing the lake. Dataplex publishes tables into that dataset for tables discovered in the Cloud Storage buckets added to the data zone as assets. The dataset is referred to as a metadata publishing dataset corresponding to the zone.

Each Dataplex data zone maps to a dataset in BigQuery or a database in Dataproc Metastore, where metadata information is automatically made available.

You can edit auto-discovered metadata, such as table name or schema, using the Dataplex metadata API.

View discovered tables and filesets

You can search for discovered tables and filesets in the Dataplex Search view in the Google Cloud console.


For more accurate search results, use Dataplex-specific filters, such as lake and data zone names. The top 50 items per facet are displayed on the filters list. You can find any additional items using the search box.

Each entry contains detailed technical and operational metadata.

From the entry details page, you can query the table in BigQuery and view corresponding Dataproc Metastore registration details.

If a Cloud Storage table can be published into BigQuery as an external table, then you can see the following in its entry details view:

  • BigQuery external table references
  • An Open in BigQuery button that lets you start analyzing the data in BigQuery.

The Dataplex metadata entries are directly visible and searchable in Data Catalog. To learn more, see the Data Catalog Search reference.

All discovered entries can be viewed through the Dataplex metadata API.

Discovery actions

Discovery raises the following administrator actions whenever data-related issues are detected during scans.

Invalid data format

Actions include:

  • Inconsistent data format in a table. For example, files of different formats exist with the same table prefix.

  • Invalid data format in curated zones (data not in Avro, Parquet, or ORC formats).

Incompatible schema

Actions include:

  • A schema detected by Discovery is incompatible with the active table schema in the metadata API in Dataproc Metastore. Schema A and schema B are incompatible if:

    • A and B share fields with the same name but of different and incompatible data types. For example, string and integer.

    • A and B have no overlapping fields.

    • A and B have at least one non-nullable field not found in the other schema.

  • Schema drift against a user-managed schema in the curated zone.
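The three incompatibility rules above can be expressed as a small Python check. This is a sketch under stated assumptions: the Field type and the schemas_incompatible function are hypothetical, schemas are modeled as plain dictionaries, and any difference in type names is treated as incompatible, which is stricter than the "different and incompatible data types" wording.

```python
from typing import Dict, NamedTuple


class Field(NamedTuple):
    """Hypothetical field description used only for this sketch."""
    data_type: str
    nullable: bool = True


Schema = Dict[str, Field]


def schemas_incompatible(a: Schema, b: Schema) -> bool:
    """Apply the three rules listed above to schemas A and B."""
    shared = a.keys() & b.keys()
    # Rule: A and B have no overlapping fields.
    if not shared:
        return True
    # Rule: shared field names with different types (simplified comparison).
    if any(a[name].data_type != b[name].data_type for name in shared):
        return True
    # Rule: a non-nullable field exists in one schema but not in the other.
    only_a = a.keys() - b.keys()
    only_b = b.keys() - a.keys()
    if any(not a[name].nullable for name in only_a):
        return True
    if any(not b[name].nullable for name in only_b):
        return True
    return False


active = {"id": Field("integer", nullable=False), "name": Field("string")}
detected = {"id": Field("string", nullable=False), "name": Field("string")}
print(schemas_incompatible(active, detected))
```

In this example the id field is an integer in one schema and a string in the other, so the check reports the schemas as incompatible.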

Invalid partition definition

Actions include:

  • Inconsistent partition naming. For example, gs://sales_data/year=2020/month=10/day=01 and gs://sales_data/year=2020/region=us.

  • Non-Hive style partition naming in the curated data zone. For example, gs://sales_data/2020/10/01 instead of gs://sales_data/year=2020/month=10/day=01.
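Both partition problems above can be checked before data lands in a zone. The following Python sketch is an illustration only: the function names are hypothetical, and "inconsistent" is assumed to mean that partition directories under one table root use more than one sequence of partition keys.

```python
def partition_keys(root, path):
    """Return the partition keys under root, or None if not Hive-style."""
    if not path.startswith(root):
        raise ValueError(f"{path} is not under {root}")
    relative = path[len(root):].strip("/")
    keys = []
    for segment in relative.split("/"):
        if not segment:
            continue
        key, separator, value = segment.partition("=")
        # Hive-style segments look like key=value.
        if not separator or not key or not value:
            return None
        keys.append(key)
    return tuple(keys)


def check_partitions(root, paths):
    """Return a list of human-readable issues for the given directories."""
    issues = []
    layouts = set()
    for path in paths:
        keys = partition_keys(root, path)
        if keys is None:
            issues.append(f"non-Hive-style partition naming: {path}")
        else:
            layouts.add(keys)
    # Assumption: more than one key sequence means inconsistent naming.
    if len(layouts) > 1:
        issues.append(f"inconsistent partition naming: {sorted(layouts)}")
    return issues


print(check_partitions("gs://sales_data", [
    "gs://sales_data/year=2020/month=10/day=01",
    "gs://sales_data/year=2020/region=us",
]))
print(check_partitions("gs://sales_data", ["gs://sales_data/2020/10/01"]))
```

The first call reproduces the inconsistent naming example, and the second reproduces the non-Hive-style example.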

Missing data

Actions include:

  • In the curated data zone, the underlying data for a registered table or fileset no longer exists. In other words, a curated zone table or fileset was discovered and registered, but its underlying data was later deleted. You can fix this issue by either restoring the data or deleting the metadata entry.

Resolve Discovery actions

Data with actions is checked by subsequent Discovery scans. When the issue triggering the action is fixed, the action is resolved automatically by the next scheduled Discovery scan.

Other actions

In addition to the preceding Discovery actions, there are three other types of actions related to resource status and security policy propagations in Dataplex.

  • Missing resource: The underlying bucket or dataset is not found corresponding to an existing asset.

  • Unauthorized resource: Dataplex doesn't have sufficient permissions to perform Discovery or apply security policies to the bucket or dataset managed by Dataplex.

  • Issues with security policy propagation: Security policies specified for a given lake, zone, or asset couldn't be successfully propagated to the underlying buckets or datasets. While all other actions are at the asset level, this type of action could be raised at lake, zone, and asset level.

These types of actions are auto-resolved when the underlying resource or security configuration issues are corrected.

FAQ

What should I do if the schema inferred by Discovery is incorrect?

If the inferred schema is different from what is expected for a given table, you can override the inferred schema by updating metadata using the metadata API. Make sure to set userManaged to true so that your edit is not overwritten in subsequent Discovery scans.
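As a rough illustration of the override, the following Python sketch prepares an updated entity body with userManaged set to true. It makes no API call. The function name override_schema is hypothetical, and the shape of the entity dictionary (a schema object holding a fields list whose items have name, type, and mode keys) is an assumption that you should verify against the metadata API reference before use.

```python
import copy


def override_schema(entity, fields):
    """Return a copy of an entity dict with a user-managed schema.

    Assumption: the entity was previously fetched from the metadata API
    and is represented here as a plain dict. All other keys are kept.
    """
    updated = copy.deepcopy(entity)
    schema = updated.setdefault("schema", {})
    schema["fields"] = fields
    # Marks the schema as user-managed so that later Discovery scans
    # do not overwrite the edit.
    schema["userManaged"] = True
    return updated


# Hypothetical entity contents, for illustration only.
fetched = {
    "id": "orders",
    "schema": {
        "userManaged": False,
        "fields": [{"name": "order_id", "type": "STRING", "mode": "NULLABLE"}],
    },
}
corrected = override_schema(
    fetched,
    [{"name": "order_id", "type": "INT64", "mode": "REQUIRED"}],
)
print(corrected["schema"]["userManaged"])
```

The original dictionary is left untouched, which makes it easy to compare the inferred and corrected schemas before sending the update.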

How do I exclude files from a Discovery scan?

By default, Discovery excludes certain types of files from scanning, including:

  • _SUCCESS
  • _started
  • _committed
  • _metadata, _METADATA, _Metadata
  • _common_metadata, _COMMON_METADATA
  • Files starting with README or readme
  • Directories starting with base_, delta_, delete_delta_, bucket_, followed by a number
  • Directories starting with .

You can specify additional include or exclude patterns by using the Discovery configuration at the zone or asset level, or by using the metadata API.
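To preview which objects a scan might skip, you can approximate the default exclusions and the exclude-before-include ordering in Python. This is a sketch, not Discovery's matcher. The function names are hypothetical, file names are matched exactly as listed, include patterns are assumed to restrict scanning to matching paths, and fnmatch is only an approximation of glob matching because its * also matches the / character.

```python
import fnmatch
import re
from pathlib import PurePosixPath

DEFAULT_EXCLUDED_FILES = {
    "_SUCCESS", "_started", "_committed",
    "_metadata", "_METADATA", "_Metadata",
    "_common_metadata", "_COMMON_METADATA",
}
# Directories such as base_0001 or delete_delta_7.
_EXCLUDED_DIR = re.compile(r"(base_|delta_|delete_delta_|bucket_)\d")


def excluded_by_default(path):
    """Check a path such as 'bucket/dir/file' against the default list."""
    *directories, file_name = PurePosixPath(path).parts
    if file_name in DEFAULT_EXCLUDED_FILES:
        return True
    if file_name.startswith(("README", "readme")):
        return True
    return any(
        name.startswith(".") or _EXCLUDED_DIR.match(name) is not None
        for name in directories
    )


def is_scanned(path, include=(), exclude=()):
    """Apply default exclusions, then exclude patterns, then include patterns."""
    if excluded_by_default(path):
        return False
    # Exclude patterns are applied first.
    if any(fnmatch.fnmatchcase(path, pattern) for pattern in exclude):
        return False
    if include:
        return any(fnmatch.fnmatchcase(path, pattern) for pattern in include)
    return True


print(is_scanned("test_bucket/foo/a.csv", exclude=["**/foo/*"]))
print(is_scanned("test_bucket/bar/a.csv", exclude=["**/foo/*"]))
```

Paths are written without the gs:// prefix, so the bucket name is the first path component.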

What should I do if the table grouping detected by Discovery is too granular?

If the tables detected by Discovery are more granular than the expected table root path (for example, each individual partition is registered as a table), there could be several reasons:

  • There are format differences, such as a mix of Avro and Parquet files, in the expected table root path, that split the table into smaller groupings.

  • There are different types of schema incompatibilities in the expected table root path, that split the table into smaller groupings.

You can resolve this issue in either of the following ways:

  • Fix format or schema differences so that all files in the same table root path are of consistent format and compatible schema.

  • Exclude heterogeneous files by using the exclude pattern configuration at the zone or asset level, or by using the metadata API.

After you take one of the corrective steps, in the next Discovery scan, the following occurs:

  • The existing lower-level tables are automatically removed from the Dataplex metadata API, BigQuery, Dataproc Metastore, and Data Catalog.
  • A new higher-level table with the expected table root path is created instead.

How do I specify table names?

You can specify table names by using the metadata API.

What happens if I create tables manually in Dataproc Metastore or BigQuery?

When Discovery is enabled for a given asset, you don't need to manually register entries in Dataproc Metastore or BigQuery.

You can manually define the table name, schema, and partition definitions while switching off Dataplex Discovery. Alternatively, you can do the following:

  1. Create a table by only specifying the required information, such as table root path.
  2. Use Dataplex Discovery to populate the rest of the metadata, such as schema and partition definitions.
  3. Keep the metadata up-to-date.

What should I do if my table is not showing up in BigQuery?

While Dataplex metadata is all centrally registered in the metadata API, only Cloud Storage tables that are compatible with BigQuery are published to BigQuery as external tables. As part of table entry details in the metadata API, you can find a BigQuery compatibility marker that indicates which entities are published to BigQuery and why.
