Integrate your data sources with Data Catalog

Data Catalog can import metadata from several Google Cloud data sources, as well as from a number of popular on-premises ones, and keep it up to date.

With metadata ingested, Data Catalog does the following:

  • Makes the existing metadata discoverable through search. For more information, see How to search.
  • Allows the members of your organization to enrich your data with additional business metadata through tags. For more information, see Tags and tag templates.

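Integration with Google Cloud sources is automatic. To integrate custom on-premises sources that your organization uses, you can use community-contributed connectors or create custom entries yourself, as described in the sections on on-premises and unsupported data sources later on this page.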

Before you begin

If you're already using Data Catalog, you already have a project with the Data Catalog API enabled. For more information on the recommended way to use multiple projects with Data Catalog, see Using tag templates in multiple projects.

If this is your first time using Data Catalog, do the following:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Data Catalog API.

    Enable the API

Integrate Google Cloud data sources

Analytics Hub

When you subscribe to a listing in Analytics Hub, a linked dataset is created in your project. Data Catalog automatically generates metadata entries for that linked dataset and all tables contained in it. For more information on linked datasets and other Analytics Hub features, see Introduction to Analytics Hub.

In Data Catalog search, linked datasets are displayed as standard BigQuery datasets, but you can filter them using the type=dataset.linked predicate. For more details, see Search for data assets.
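As an illustration of the filter above, here is a minimal Python sketch that builds a search query restricted to linked datasets and, optionally, runs it with the `google-cloud-datacatalog` client library. The helper names `linked_dataset_query` and `search_linked_datasets` are hypothetical, and the sketch assumes the client library is installed and that you have permission to search the given project.

```python
from typing import List


def linked_dataset_query(extra_terms: str = "") -> str:
    """Return a search query limited to Analytics Hub linked datasets."""
    query = "type=dataset.linked"
    extra_terms = extra_terms.strip()
    # Space-separated terms narrow the search further.
    return f"{query} {extra_terms}" if extra_terms else query


def search_linked_datasets(project_id: str, extra_terms: str = "") -> List[str]:
    """Return the linked resource names of matching entries in one project."""
    # Imported lazily so the query helper works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.SearchCatalogRequest.Scope(
        include_project_ids=[project_id]
    )
    results = client.search_catalog(
        scope=scope, query=linked_dataset_query(extra_terms)
    )
    return [result.linked_resource for result in results]
```

For example, `search_linked_datasets("my-project", "name:sales")` would look for linked datasets whose name contains "sales"; `my-project` is a placeholder project ID.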

BigQuery and Pub/Sub

If your organization already uses BigQuery and Pub/Sub, depending on your permissions, you can search for the metadata from those sources right away. If you can't see the corresponding entries in search results, look for the IAM roles that you and the users of your project might need in Identity and Access Management.
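Besides searching, you can check whether a specific BigQuery table or Pub/Sub topic already has an entry by looking it up by resource name. The following Python sketch assumes the `google-cloud-datacatalog` client library is installed; the helper names are hypothetical, and the project, dataset, table, and topic IDs are placeholders.

```python
def bigquery_table_resource(project_id: str, dataset_id: str, table_id: str) -> str:
    """Build the full resource name of a BigQuery table."""
    return (
        f"//bigquery.googleapis.com/projects/{project_id}"
        f"/datasets/{dataset_id}/tables/{table_id}"
    )


def pubsub_topic_resource(project_id: str, topic_id: str) -> str:
    """Build the full resource name of a Pub/Sub topic."""
    return f"//pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"


def lookup_entry_name(linked_resource: str) -> str:
    """Return the Data Catalog entry name for a Google Cloud resource."""
    # Imported lazily so the name helpers work without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    entry = client.lookup_entry(request={"linked_resource": linked_resource})
    return entry.name
```

If the lookup fails with a permission error, that is a hint that the IAM roles mentioned above are missing for the calling account.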

Sensitive Data Protection

Data Catalog also integrates with Sensitive Data Protection, which lets you scan specific Google Cloud resources for sensitive data and send the results back to Data Catalog in the form of tags.

For more information, see Sending Sensitive Data Protection scan results to Data Catalog.
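To give a rough idea of what such a scan looks like in code, here is a hedged Python sketch that builds an inspection job configuration with an action that publishes findings to Data Catalog. It assumes the scanned resource is a BigQuery table and that the `google-cloud-dlp` client library is installed. The helper names and the choice of infoTypes are illustrative only; follow the linked guide for the supported setup.

```python
from typing import Any, Dict, List


def build_inspect_job(
    project_id: str, dataset_id: str, table_id: str, info_types: List[str]
) -> Dict[str, Any]:
    """Build an inspect job that scans a BigQuery table (assumed source)."""
    return {
        "storage_config": {
            "big_query_options": {
                "table_reference": {
                    "project_id": project_id,
                    "dataset_id": dataset_id,
                    "table_id": table_id,
                }
            }
        },
        "inspect_config": {"info_types": [{"name": name} for name in info_types]},
        # Ask the job to send its findings to Data Catalog as tags.
        "actions": [{"publish_findings_to_cloud_data_catalog": {}}],
    }


def submit_inspect_job(project_id: str, inspect_job: Dict[str, Any]) -> str:
    """Create the job and return its resource name."""
    # Imported lazily so the config builder works without the client library.
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    job = client.create_dlp_job(
        request={
            "parent": f"projects/{project_id}/locations/global",
            "inspect_job": inspect_job,
        }
    )
    return job.name
```

Keeping the configuration in a plain dictionary makes it easy to review or log before the job is submitted.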

Bigtable

When you store data in Bigtable, metadata is automatically synced to Data Catalog for the following Bigtable resources:

  • Instances
  • Tables, including column family details

For guidance on using Data Catalog for data discovery and tagging, see Manage data assets using Data Catalog in the Bigtable documentation.

Spanner (Preview)

When you store data in Spanner, metadata for the following Spanner resources is synced to Data Catalog:

  • Instances
  • Databases
  • Tables and views with column schema

For guidance on using Data Catalog for data discovery and tagging, see Manage data assets using Data Catalog.

Dataproc Metastore

To integrate with Dataproc Metastore, enable the sync to Data Catalog for new or existing services as described in Enabling Data Catalog sync.

Vertex AI

Vertex AI syncs metadata for a set of its resources to Data Catalog.

Integrate on-premises data sources

To integrate on-premises data sources, you can use the corresponding Python connectors contributed by the community:

  1. Find your data source in the table below.
  2. Open its GitHub repository.
  3. Follow the setup instructions in the readme file.

Category  Component               Description                                   Repository
RDBMS     mysql-connector         Sample code for MySQL data source.            google-datacatalog-mysql-connector
          postgresql-connector    Sample code for PostgreSQL data source.       google-datacatalog-postgresql-connector
          sqlserver-connector     Sample code for SQL Server data source.       google-datacatalog-sqlserver-connector
          redshift-connector      Sample code for Redshift data source.         google-datacatalog-redshift-connector
          oracle-connector        Sample code for Oracle data source.           google-datacatalog-oracle-connector
          teradata-connector      Sample code for Teradata data source.         google-datacatalog-teradata-connector
          vertica-connector       Sample code for Vertica data source.          google-datacatalog-vertica-connector
          greenplum-connector     Sample code for Greenplum data source.        google-datacatalog-greenplum-connector
          rdbmscsv-connector      Sample code for generic RDBMS CSV ingestion.  google-datacatalog-rdbmscsv-connector
          saphana-connector       Sample code for SAP HANA data source.         google-datacatalog-saphana-connector
BI        looker-connector        Sample code for Looker data source.           google-datacatalog-looker-connector
          qlik-connector          Sample code for Qlik Sense data source.       google-datacatalog-qlik-connector
          tableau-connector       Sample code for Tableau data source.          google-datacatalog-tableau-connector
Hive      hive-connector          Sample code for Hive data source.             google-datacatalog-hive-connector
          apache-atlas-connector  Sample code for Apache Atlas data source.     google-datacatalog-apache-atlas-connector

Integrate unsupported data sources

If you can't find a connector for your data source, you can still integrate it manually by creating entry groups and custom entries.

To integrate your sources, first learn about Entries and entry groups, then follow the instructions in Create custom Data Catalog entries for your data sources.
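The following Python sketch shows roughly what creating an entry group and a custom entry can look like with the `google-cloud-datacatalog` client library. It assumes that IDs may contain only letters, digits, and underscores, must not start with a digit, and are limited to 64 characters; verify these limits against the API reference. The helper names and the example system and type values are hypothetical.

```python
import re
from typing import Any, Dict

_MAX_ID_LENGTH = 64  # Assumed limit; check the API reference.


def to_catalog_id(source_name: str) -> str:
    """Turn an arbitrary source name into an ID-safe string."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", source_name)
    # Prefix an underscore when the name is empty or starts with a digit.
    if not cleaned or not (cleaned[0].isalpha() or cleaned[0] == "_"):
        cleaned = "_" + cleaned
    return cleaned[:_MAX_ID_LENGTH]


def build_custom_entry(
    system: str, entry_type: str, display_name: str, linked_resource: str
) -> Dict[str, Any]:
    """Describe a custom entry for a source that has no connector."""
    return {
        "user_specified_system": system,
        "user_specified_type": entry_type,
        "display_name": display_name,
        "linked_resource": linked_resource,
    }


def create_custom_entry(
    project_id: str, location: str, group_name: str, entry: Dict[str, Any]
) -> str:
    """Create an entry group and one custom entry; return the entry name."""
    # Imported lazily so the helpers above work without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    group = client.create_entry_group(
        parent=f"projects/{project_id}/locations/{location}",
        entry_group_id=to_catalog_id(group_name),
        entry_group=datacatalog_v1.EntryGroup(display_name=group_name),
    )
    created = client.create_entry(
        parent=group.name,
        entry_id=to_catalog_id(entry["display_name"]),
        entry=datacatalog_v1.Entry(**entry),
    )
    return created.name
```

A call such as `build_custom_entry("onprem_system", "onprem_table", "orders", "//example.internal/db/orders")` yields a dictionary you can inspect before passing it to `create_custom_entry`; all four values are placeholders.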

What's next