Cloud Data Fusion provides a Dataplex Sink plugin for ingesting data into any Dataplex-supported asset.
Before you start
- Create a Cloud Data Fusion instance, if you don't have one. This plugin is available in instances that run in Cloud Data Fusion version 6.6 or later.
- The BigQuery dataset or Cloud Storage bucket where data is ingested must be part of a Dataplex lake.
- For data to be read from Cloud Storage entities, Dataproc Metastore must be attached to the lake.
- CSV data in Cloud Storage entities isn't supported.
- In the Dataplex project, enable Private Google Access on the subnetwork, which is usually set to default, or set internal_ip_only to false.
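The Private Google Access prerequisite can be met from the command line. A minimal sketch, where the subnetwork name (default) and region (us-central1) are illustrative placeholders for your own network:

```shell
# Enable Private Google Access on the subnetwork used by the Dataplex
# project (subnet and region below are placeholders).
gcloud compute networks subnets update default \
    --region=us-central1 \
    --enable-private-ip-google-access

# Verify that the setting took effect (prints True when enabled).
gcloud compute networks subnets describe default \
    --region=us-central1 \
    --format="value(privateIpGoogleAccess)"
```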
To get the permissions that you need, ask your administrator to grant the following IAM roles to the Dataproc service account and to the Google-managed Cloud Data Fusion service account:
- Dataplex Developer (roles/dataplex.developer)
- Dataplex Data Reader (roles/dataplex.dataReader)
- Dataproc Metastore Metadata User (roles/metastore.metadataUser)
- Cloud Dataplex Service Agent (roles/dataplex.serviceAgent)
- Dataplex Metadata Reader (roles/dataplex.metadataReader)
For more information about granting roles, see Manage access.
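Roles like these are typically granted with gcloud. A hedged sketch, where PROJECT_ID and SA_EMAIL are placeholders you must replace with your project ID and the service account's email address:

```shell
# Grant a set of Dataplex roles to one service account
# (PROJECT_ID and SA_EMAIL are placeholders).
for role in roles/dataplex.developer \
            roles/dataplex.dataReader \
            roles/metastore.metadataUser; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:SA_EMAIL" \
      --role="$role"
done
```

Repeat the grant for each service account and role listed above.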
Add the plugin to your pipeline
1. In the Google Cloud console, go to the Cloud Data Fusion Instances page. This page lets you manage your instances.
2. Click View instance to open your instance in the Cloud Data Fusion UI.
3. Go to the Studio page, expand the Sink menu, and click Dataplex.
Configure the plugin
After you add this plugin to your pipeline on the Studio page, click the Dataplex sink to configure and save its properties.
For more information about configurations, see the Dataplex Sink reference.
Optional: Get started with a sample pipeline
Sample pipelines are available, including an SAP source to Dataplex sink pipeline and a Dataplex source to BigQuery sink pipeline.
To use a sample pipeline, open your instance in the Cloud Data Fusion UI, click Hub > Pipelines, and select one of the Dataplex pipelines. A dialog opens to help you create the pipeline.
Run your pipeline
1. After deploying the pipeline, open it on the Cloud Data Fusion Studio page.
2. Click Configure > Resources.
3. Optional: Change the Executor CPU and Memory based on the overall data size and the number of transformations used in your pipeline.
4. To start the data pipeline, click Run.
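A deployed pipeline can also be started programmatically through the CDAP REST API that Cloud Data Fusion exposes. A sketch, assuming a deployed pipeline named my-dataplex-pipeline in the default namespace; the instance name (my-instance) and region are placeholders:

```shell
# Look up the instance's CDAP API endpoint (instance name and region
# are placeholders for your own instance).
ENDPOINT=$(gcloud beta data-fusion instances describe my-instance \
    --location=us-central1 --format="value(apiEndpoint)")

# Start the deployed pipeline's workflow via the CDAP REST API.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "${ENDPOINT}/v3/namespaces/default/apps/my-dataplex-pipeline/workflows/DataPipelineWorkflow/start"
```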
What's next
- Process data with Cloud Data Fusion using the Dataplex Source plugin.