Data Catalog can import and keep up-to-date metadata from several Google Cloud data sources as well as a number of popular on-premises ones.
With metadata ingested, Data Catalog does the following:
- Makes the existing metadata discoverable through search. For more information, see How to search.
- Allows the members of your organization to enrich your data with additional business metadata through tags. For more information, see Tags and tag templates.
Integration with Google Cloud sources is automatic. To integrate with custom on-premises sources that your organization uses, you can do either of the following:
- Set up and run corresponding connectors contributed by the community.
- Use the Data Catalog API for custom entries.
Before you begin
If you're already using Data Catalog, you already have a project with the Data Catalog API enabled. For more information on the recommended way to use multiple projects with Data Catalog, see Using tag templates in multiple projects.
If this is your first time working with Data Catalog, do the following:
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Data Catalog API.
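The setup steps above can also be done from the gcloud CLI. The following is a minimal sketch, assuming a placeholder project ID (`my-project`) and a placeholder billing account ID that you would replace with your own:

```shell
# Create a project and make it the active one ("my-project" is a placeholder).
gcloud projects create my-project
gcloud config set project my-project

# Link a billing account (replace the billing account ID with yours).
gcloud billing projects link my-project --billing-account=0X0X0X-0X0X0X-0X0X0X

# Enable the Data Catalog API for the project.
gcloud services enable datacatalog.googleapis.com
```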
Integrate Google Cloud data sources
Analytics Hub
When you subscribe to a listing in Analytics Hub, a linked dataset is created in your project. Data Catalog automatically generates metadata entries for that linked dataset and all tables contained in it. For more information on linked datasets and other Analytics Hub features, see Introduction to Analytics Hub.
In Data Catalog search, linked datasets are displayed as standard BigQuery datasets, but you can filter them using the type=dataset.linked predicate. For more details, see Search for data assets.
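As an illustration, the filter above can be sent to the Data Catalog `catalog:search` REST method (`POST https://datacatalog.googleapis.com/v1/catalog:search`). The following sketch builds the request body using only the standard library; `my-project` is a placeholder project ID:

```python
import json


def linked_dataset_search_body(project_id: str) -> str:
    """Build the JSON body for a catalog:search call scoped to one project."""
    body = {
        # Restrict the search scope to the given project.
        "scope": {"includeProjectIds": [project_id]},
        # The type=dataset.linked predicate matches only linked datasets.
        "query": "type=dataset.linked",
    }
    return json.dumps(body)


print(linked_dataset_search_body("my-project"))
```

You would POST this body with your usual authenticated HTTP client; the response contains one search result per matching linked dataset.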
BigQuery and Pub/Sub
If your organization already uses BigQuery and Pub/Sub, you can search the metadata from those sources right away, depending on your permissions. If the corresponding entries don't appear in search results, see the IAM roles that you and the users of your project might need in Identity and Access Management.
Bigtable
When you store data in Bigtable, metadata is automatically synced to Data Catalog for the following Bigtable resources:
- Instances
- Tables, including column family details
For guidance on using Data Catalog for data discovery and tagging, see Manage data assets using Data Catalog in the Bigtable documentation.
Cloud SQL
Cloud SQL doesn't integrate with Data Catalog, but does integrate with Dataplex Catalog. For more information, see Integrate your data sources with Dataplex Catalog.
Dataproc Metastore
To integrate with Dataproc Metastore, enable the sync to Data Catalog for new or existing services as described in Enabling Data Catalog sync.
Sensitive Data Protection
Data Catalog also integrates with Sensitive Data Protection, which lets you scan specific Google Cloud resources for sensitive data and send the results back to Data Catalog in the form of tags.
For more information, see Sending Sensitive Data Protection scan results to Data Catalog.
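To sketch how this integration is wired up: a Sensitive Data Protection inspect job carries an action that publishes findings to Data Catalog. The following builds such a job configuration as a plain dictionary; the project, dataset, and table names are placeholders, and the job itself would be submitted through the DLP API:

```python
def build_inspect_job(project_id: str, dataset: str, table: str) -> dict:
    """Build an inspect-job config that tags scan results in Data Catalog."""
    return {
        "inspectJob": {
            # Scan a single BigQuery table (placeholder identifiers).
            "storageConfig": {
                "bigQueryOptions": {
                    "tableReference": {
                        "projectId": project_id,
                        "datasetId": dataset,
                        "tableId": table,
                    }
                }
            },
            # Look for one example infoType; add more as needed.
            "inspectConfig": {
                "infoTypes": [{"name": "EMAIL_ADDRESS"}],
            },
            # This action writes the scan results back to Data Catalog as a
            # tag attached to the scanned table.
            "actions": [{"publishFindingsToCloudDataCatalog": {}}],
        }
    }
```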
Spanner
When you store data in Spanner, metadata for the following Spanner resources is synced to Data Catalog:
- Instances
- Databases
- Tables and views with column schema
For guidance on using Data Catalog for data discovery and tagging, see Manage data assets using Data Catalog.
Vertex AI
Vertex AI syncs metadata for the following resources to Data Catalog:
Integrate on-premises data sources
To integrate on-premises data sources, you can use the corresponding Python connectors contributed by the community:
- Find your data source in the following table.
- Open its GitHub repository.
- Follow the setup instructions in the readme file.
| Category | Component | Description | Repository |
|---|---|---|---|
| RDBMS | mysql-connector | Sample code for MySQL data source. | google-datacatalog-mysql-connector |
| | postgresql-connector | Sample code for PostgreSQL data source. | google-datacatalog-postgresql-connector |
| | sqlserver-connector | Sample code for SQL Server data source. | google-datacatalog-sqlserver-connector |
| | redshift-connector | Sample code for Redshift data source. | google-datacatalog-redshift-connector |
| | oracle-connector | Sample code for Oracle data source. | google-datacatalog-oracle-connector |
| | teradata-connector | Sample code for Teradata data source. | google-datacatalog-teradata-connector |
| | vertica-connector | Sample code for Vertica data source. | google-datacatalog-vertica-connector |
| | greenplum-connector | Sample code for Greenplum data source. | google-datacatalog-greenplum-connector |
| | rdbmscsv-connector | Sample code for generic RDBMS CSV ingestion. | google-datacatalog-rdbmscsv-connector |
| | saphana-connector | Sample code for SAP HANA data source. | google-datacatalog-saphana-connector |
| BI | looker-connector | Sample code for Looker data source. | google-datacatalog-looker-connector |
| | qlik-connector | Sample code for Qlik Sense data source. | google-datacatalog-qlik-connector |
| | tableau-connector | Sample code for Tableau data source. | google-datacatalog-tableau-connector |
| Hive | hive-connector | Sample code for Hive data source. | google-datacatalog-hive-connector |
| | apache-atlas-connector | Sample code for Apache Atlas data source. | google-datacatalog-apache-atlas-connector |
Integrate unsupported data sources
If you can't find a connector for your data source, you can still integrate it manually by creating entry groups and custom entries. To do that, do either of the following:
- Use one of the Data Catalog client libraries, available for C#, Go, Java, Node.js, PHP, Python, and Ruby.
- Build directly on the Data Catalog API.
To integrate your sources, first learn about Entries and entry groups, and then follow the instructions in Create custom Data Catalog entries for your data sources.
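To make the custom-entry approach concrete, the following sketch builds the request bodies that the Data Catalog REST API expects when creating an entry group and a custom entry inside it. The type, system, and URI values are illustrative placeholders, not prescribed names:

```python
def entry_group_body(display_name: str, description: str) -> dict:
    """Body for POST .../locations/{location}/entryGroups?entryGroupId=..."""
    return {
        "displayName": display_name,
        "description": description,
    }


def custom_entry_body(display_name: str, linked_resource: str) -> dict:
    """Body for POST {entryGroup}/entries?entryId=... describing one asset."""
    return {
        "displayName": display_name,
        # Custom entries carry user-specified type and system identifiers
        # (placeholder values here) instead of a Google Cloud resource type.
        "userSpecifiedType": "onprem_table",
        "userSpecifiedSystem": "my_onprem_dbms",
        # URI of the asset in the source system (placeholder).
        "linkedResource": linked_resource,
    }


group = entry_group_body("On-prem assets", "Entries for on-premises sources.")
entry = custom_entry_body("sales_table", "//my-dbms.example.com/sales")
```

Once the entries exist, they are searchable and taggable like any other Data Catalog entry.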
What's next
- Learn more about Identity and Access Management.
- Learn How to search.
- Go through the Tagging tables quickstart.