Cloud Data Fusion provides a Dataplex Source plugin to read data from Dataplex entities (tables) residing on Cloud Storage or BigQuery assets. The Dataplex Source plugin lets you treat data in Cloud Storage assets as tables and filter the data with SQL queries.
Before you begin
Create a Cloud Data Fusion instance, if you don't have one. This plugin is available in instances that run in Cloud Data Fusion version 6.6 or later.
The source data must already be part of a Dataplex zone and an asset (either a Cloud Storage bucket or a BigQuery dataset).
To use tables from Cloud Storage, you must configure a metastore for your lake: a Dataproc Metastore service must be attached to the lake before data can be read from Cloud Storage entities.
CSV data in Cloud Storage entities isn't supported.
In the Dataplex project, enable Private Google Access on the subnetwork, which is usually set to default, or set internal_ip_only to false.
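As an illustration of the Private Google Access prerequisite, the following Python sketch assembles a gcloud command line that turns the setting on for a subnetwork. The helper name, the region, and the project ID are placeholders chosen for this example, not values from this guide; review the printed command before you run it in your own environment.

```python
import shlex
from typing import List


def private_google_access_command(subnet: str, region: str, project: str) -> List[str]:
    """Build the gcloud argument list that enables Private Google Access.

    The function only assembles the command; it does not execute anything.
    """
    return [
        "gcloud", "compute", "networks", "subnets", "update", subnet,
        f"--region={region}",
        f"--project={project}",
        "--enable-private-ip-google-access",
    ]


if __name__ == "__main__":
    # Placeholder values (assumptions): replace with your subnetwork,
    # region, and Dataplex project ID.
    argv = private_google_access_command("default", "us-central1", "my-dataplex-project")
    print(shlex.join(argv))
```

Printing the command instead of executing it keeps the sketch free of side effects; you can paste the output into a shell once the values are correct.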
Limitations
For Cloud Storage assets: this plugin does not support reading from CSV files. It supports reading from JSON, Avro, Parquet, and ORC formats.
For Cloud Storage assets: Partition Start Date and Partition End Date aren't applicable.
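The format limitation can be checked before you build a pipeline. The following sketch is a hypothetical pre-flight helper, not part of the plugin: it assumes that an object's file extension reflects its format, which may not hold for every bucket, and flags objects that fall outside the JSON, Avro, Parquet, and ORC formats listed above.

```python
from pathlib import PurePosixPath

# Formats the Dataplex Source plugin can read from Cloud Storage assets,
# as listed in the limitations above. Mapping them to file extensions is
# an assumption made for this example.
SUPPORTED_EXTENSIONS = {".json", ".avro", ".parquet", ".orc"}


def is_supported_object(name: str) -> bool:
    """Return True if the object name ends in a supported extension.

    Objects without an extension, or with an extra suffix such as ".gz",
    are reported as unsupported by this simple check.
    """
    return PurePosixPath(name).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    # Hypothetical object names used only for illustration.
    for obj in ["sales/2023/part-0000.parquet", "sales/raw/export.csv"]:
        status = "ok" if is_supported_object(obj) else "not readable by the plugin"
        print(f"{obj}: {status}")
```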
Required roles
To get the permissions that you need to manage roles, ask your administrator to grant you the following IAM roles on the Dataproc service agent and the Cloud Data Fusion service agent (service-CUSTOMER_PROJECT_NUMBER@gcp-sa-datafusion.iam.gserviceaccount.com):
- Dataplex Developer (roles/dataplex.developer)
- Dataplex Data Reader (roles/dataplex.dataReader)
- Dataproc Metastore Metadata User (roles/metastore.metadataUser)
- Cloud Dataplex Service Agent (roles/dataplex.serviceAgent)
- Dataplex Metadata Reader (roles/dataplex.metadataReader)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
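To make the role list concrete, the sketch below generates one gcloud projects add-iam-policy-binding command per role for a given service agent. Binding the roles at the project level is an assumption of this example; your administrator might grant them at a different scope. The function names and the project values are hypothetical, and the code only prints commands without running them.

```python
import shlex
from typing import List

# Roles listed in the "Required roles" section of this guide.
REQUIRED_ROLES = [
    "roles/dataplex.developer",
    "roles/dataplex.dataReader",
    "roles/metastore.metadataUser",
    "roles/dataplex.serviceAgent",
    "roles/dataplex.metadataReader",
]


def data_fusion_service_agent(project_number: str) -> str:
    """Return the Cloud Data Fusion service agent email for a project number."""
    return f"service-{project_number}@gcp-sa-datafusion.iam.gserviceaccount.com"


def grant_commands(project_id: str, service_account: str) -> List[List[str]]:
    """Build one project-level IAM binding command per required role."""
    member = f"serviceAccount:{service_account}"
    return [
        [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ]
        for role in REQUIRED_ROLES
    ]


if __name__ == "__main__":
    # Placeholder project ID and number (assumptions for illustration).
    agent = data_fusion_service_agent("123456789012")
    for argv in grant_commands("my-dataplex-project", agent):
        print(shlex.join(argv))
```

The same helper can be called again with the Dataproc service agent email to produce the bindings for that account.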
Add the plugin to your pipeline
In the Google Cloud console, go to the Cloud Data Fusion Instances page.
This page lets you manage your instances.
Click View instance to open your instance in the Cloud Data Fusion UI.
Go to the Studio page, expand the Source menu, and click Dataplex.
Configure the plugin
After you add this plugin to your pipeline on the Studio page, click the Dataplex source to configure its properties.
For more information about configurations, see the Dataplex Source reference.
Optional: Get started with a sample pipeline
Sample pipelines are available, including an SAP source to Dataplex sink pipeline and a Dataplex source to BigQuery sink pipeline.
To use a sample pipeline, open your instance in the Cloud Data Fusion UI, click Hub > Pipelines, and select one of the Dataplex pipelines. A dialog opens to help you create the pipeline.
What's next
- Ingest data with Cloud Data Fusion using the Dataplex Sink plugin.