Data Analytics

Dataplex Automatic Discovery makes Cloud Storage data available for Analytics and governance

November 11, 2024

Parth Desai

Product Manager, Google


In today's data- and AI-driven world, organizations are grappling with an ever-growing volume of structured and unstructured data. This growth makes it increasingly challenging to locate the right data at the right time, and a significant portion of enterprise data remains undiscovered or underutilized — what’s often referred to as "dark data." In fact, a staggering 66% of organizations report that at least half of their data falls into this category.

To address this challenge, today we’re announcing automatic discovery and cataloging of Google Cloud Storage data with Dataplex, part of BigQuery’s unified platform for intelligent data-to-AI governance. This capability lets organizations:

Automatically discover valuable data assets residing within Cloud Storage, including structured and unstructured data such as documents, files, PDFs, images, and more.
Harvest and catalog metadata for your discovered assets, keeping schema definitions up to date as data evolves through built-in compatibility checks and partition detection.
Enable analytics for data science and AI use cases at scale with auto-created BigLake, external or object tables, eliminating the need for data duplication or manually creating table definitions.
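To make the idea of a schema compatibility check concrete, here is a minimal Python sketch. It is an illustration only and does not describe Dataplex's actual logic: the function name `is_backward_compatible`, the dict-based schema representation, and the rule that only added columns are allowed are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical schema representation: column name -> type name.
Schema = Dict[str, str]


def is_backward_compatible(old: Schema, new: Schema) -> Tuple[bool, List[str]]:
    """Return (compatible, problems) for a schema change.

    Assumed rule for this sketch: existing columns must keep their name
    and type, while new columns may be added freely.
    """
    problems: List[str] = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"column removed: {column}")
        elif new[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new[column]}")
    return (not problems, problems)


old_schema = {"order_id": "INT64", "amount": "FLOAT64"}
added_column = {"order_id": "INT64", "amount": "FLOAT64", "region": "STRING"}
changed_type = {"order_id": "STRING", "amount": "FLOAT64"}

print(is_backward_compatible(old_schema, added_column))
print(is_backward_compatible(old_schema, changed_type))
```

Under these assumed rules, adding the `region` column passes, while changing the type of `order_id` is reported as a problem rather than silently applied.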

How it works

The automatic discovery and cataloging process in Dataplex is designed to be integrated and efficient, and performs the following steps:

Discovery scan: You configure a discovery scan using the BigQuery Studio UI or the gcloud CLI. The scan then processes your Cloud Storage bucket, which can hold up to millions of files, identifying and classifying data assets.
Metadata extraction: Relevant metadata, including schema definitions and partition information, is extracted from the discovered assets.
Creation of dataset and tables in BigQuery: A new dataset containing BigLake, external, or object tables (object tables are used for unstructured data) is automatically created in BigQuery with accurate, up-to-date table definitions. For scheduled scans, these tables are updated as the data in the Cloud Storage bucket evolves.
Analytics and AI preparation: The published dataset and tables are available for analysis, processing, data science, and AI use cases in BigQuery, as well as open-source engines like Spark, Hive, and Pig.
Catalog integration: All BigLake tables are integrated into the Dataplex catalog, making them easily searchable and accessible.
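The discovery and metadata extraction steps above can be pictured with a small Python sketch. It is a simplified model and not Dataplex's implementation: it assumes Hive-style `key=value` path segments mark partitions, that a fixed set of file extensions identifies structured data, and that every other file maps to an object table. The names `split_object_name`, `discover`, and `STRUCTURED_EXTENSIONS` are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List, Tuple

# Assumed extension set for this sketch; not Dataplex's actual classification.
STRUCTURED_EXTENSIONS = {".avro", ".csv", ".json", ".orc", ".parquet"}


def split_object_name(object_name: str) -> Tuple[str, List[str], str]:
    """Return (table prefix, partition keys, lowercase file extension)."""
    path = PurePosixPath(object_name)
    prefix_parts: List[str] = []
    partition_keys: List[str] = []
    for part in path.parts[:-1]:  # every directory-like segment
        if "=" in part:  # assumed Hive-style partition segment, e.g. year=2024
            partition_keys.append(part.split("=", 1)[0])
        else:
            prefix_parts.append(part)
    return "/".join(prefix_parts), partition_keys, path.suffix.lower()


def discover(object_names: List[str]) -> Dict[str, dict]:
    """Group objects by prefix and summarize each group as a candidate table."""
    tables: Dict[str, dict] = defaultdict(
        lambda: {"partition_keys": [], "files": 0, "kind": None}
    )
    for name in object_names:
        prefix, partition_keys, extension = split_object_name(name)
        kind = (
            "BigLake or external table"
            if extension in STRUCTURED_EXTENSIONS
            else "object table"
        )
        entry = tables[prefix]
        entry["files"] += 1
        for key in partition_keys:
            if key not in entry["partition_keys"]:
                entry["partition_keys"].append(key)
        # Flag prefixes that mix structured and unstructured files.
        entry["kind"] = kind if entry["kind"] in (None, kind) else "mixed"
    return dict(tables)


objects = [
    "sales/year=2024/month=10/part-0.parquet",
    "sales/year=2024/month=11/part-0.parquet",
    "scans/invoice-001.pdf",
]
for prefix, summary in discover(objects).items():
    print(prefix, summary)
```

In this toy model, the two Parquet files collapse into one `sales` table partitioned by `year` and `month`, and the PDF under `scans` becomes a candidate object table.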

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_eTn70Jg.max-800x800.png

https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_wkd44m5.gif

Key benefits

Dataplex’s automatic discovery and cataloging feature offers a multitude of benefits for organizations:

Enhanced data visibility: Gain a clear understanding of your data and AI assets across Google Cloud, eliminating the guesswork and reducing the time spent searching for relevant information.
Reduced manual effort: Cut back on the toil and effort of creating table definitions manually by letting Dataplex scan the bucket and create numerous BigLake tables that correspond to your data in Cloud Storage.
Accelerated analytics and AI: Integrate the data that’s discovered into your analytics and AI workflows, unlocking valuable insights and driving informed decision-making.
Simplified data access: Provide authorized users with easy access to the data they need, while maintaining appropriate security and control measures.
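For a sense of the manual work that automatic discovery replaces, the Python sketch below builds the kind of `CREATE EXTERNAL TABLE` statement you would otherwise write by hand for each table. The helper `external_table_ddl` is hypothetical, and the project, dataset, connection, and bucket names are placeholders. The statements that discovery generates internally are not shown in this post, so treat this only as an example of a hand-written definition.

```python
def external_table_ddl(table: str, connection: str, uri: str, file_format: str) -> str:
    """Build a BigQuery DDL statement for a table over Cloud Storage files.

    All arguments are caller-supplied placeholders; nothing is validated here.
    """
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (\n"
        f"  format = '{file_format}',\n"
        f"  uris = ['{uri}']\n"
        f");"
    )


ddl = external_table_ddl(
    table="my-project.sales_ds.orders",           # placeholder table ID
    connection="my-project.us.gcs-conn",          # placeholder connection ID
    uri="gs://my-bucket/orders/*.parquet",        # placeholder bucket path
    file_format="PARQUET",
)
print(ddl)
```

Writing and maintaining one such statement per table, and revising it whenever files or schemas change, is the toil that the auto-created tables are meant to remove.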

Storage admins who are interested in Cloud Storage management and in gaining insights into their entire storage estate can refer to "Understand your Cloud Storage footprint with AI-powered queries and insights."

Unlock your data’s potential

Automatic discovery and cataloging in Dataplex marks a significant step forward in helping organizations unlock the full potential of their data. By eliminating the challenges associated with dark data and providing a comprehensive, searchable catalog of your Cloud Storage assets, Dataplex empowers you to make data-driven decisions with confidence.

We encourage you to explore this powerful new feature and experience the benefits firsthand. To learn more and get started, please visit the Dataplex documentation or contact our team for assistance.
