The Cloud Data Fusion web UI supports the same authentication mechanisms as the Google Cloud console, with access controlled through Identity and Access Management (IAM).
Users can create a private Cloud Data Fusion instance, which can be peered with their VPC network. Private Cloud Data Fusion instances have a private IP address and are not exposed to the public internet. For additional security, VPC Service Controls can establish a security perimeter around a private Cloud Data Fusion instance.
Pipeline execution on pre-created private IP Dataproc clusters
You can use a private Cloud Data Fusion instance with the Remote Hadoop provisioner. The Dataproc cluster must be on the VPC network that is peered with Cloud Data Fusion, and the Remote Hadoop provisioner is configured with the private IP address of the cluster's master node.
User access to the Cloud Data Fusion instance: Cloud Data Fusion currently supports user access control only at the instance level. If you have access to an instance, you have access to all pipelines and metadata in that instance.
Pipeline access to the user data: Pipelines access data through a service account, which can be a custom service account that you specify. To give a pipeline access to data, grant access to that service account.
End user access to the Cloud Data Fusion resources
Cloud Data Fusion resources are created in Google-owned tenant projects. Cloud Data Fusion does not provide user access to underlying Cloud Data Fusion VM instances and resources in tenant projects.
For a pipeline execution, ingress and egress can be controlled by setting the appropriate firewall rules on the customer VPC on which the pipeline is being executed.
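As an illustration of the firewall rules mentioned above, the following minimal sketch builds request bodies in the shape of the Compute Engine firewall resource (`name`, `network`, `direction`, `priority`, `sourceRanges` or `destinationRanges`, and `allowed` or `denied`). The helper `firewall_rule`, the rule names, the project and network names, the IP ranges, and the ports are all hypothetical placeholders, not values taken from Cloud Data Fusion. The sketch only constructs dictionaries and does not call any API.

```python
def firewall_rule(name, project, network, direction, ranges, protocol, ports, allow=True):
    """Build a dict shaped like a Compute Engine firewall resource.

    Hypothetical helper for illustration; it does not contact Google Cloud.
    """
    if direction not in ("INGRESS", "EGRESS"):
        raise ValueError("direction must be 'INGRESS' or 'EGRESS'")

    rule = {
        "name": name,
        "network": f"projects/{project}/global/networks/{network}",
        "direction": direction,
        "priority": 1000,
    }
    # Ingress rules match on source ranges, egress rules on destination ranges.
    range_field = "sourceRanges" if direction == "INGRESS" else "destinationRanges"
    rule[range_field] = list(ranges)
    # A rule either allows or denies the listed protocol and ports.
    action_field = "allowed" if allow else "denied"
    rule[action_field] = [{"IPProtocol": protocol, "ports": [str(p) for p in ports]}]
    return rule


# Placeholder values: replace with the ranges and ports your pipelines need.
ingress = firewall_rule(
    "allow-ssh-example", "my-project", "my-vpc", "INGRESS", ["10.0.0.0/22"], "tcp", [22]
)
egress = firewall_rule(
    "deny-egress-example", "my-project", "my-vpc", "EGRESS", ["0.0.0.0/0"], "tcp", [80],
    allow=False,
)
```

Keeping the rule as plain data makes it easy to review before it is applied with whichever tool you use to manage the VPC.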
Users can store passwords, keys, and other data securely in the Cloud Key Management Service. At runtime, Cloud Data Fusion calls Cloud Key Management Service to retrieve the keys.
By default, data is encrypted at rest using Google-managed encryption keys, and in transit using TLS v1.2. Customer-managed encryption keys (CMEK) provide user control over the data written by Cloud Data Fusion pipelines, including Dataproc cluster metadata and Cloud Storage, BigQuery, and Pub/Sub data sources and sinks.
Cloud Data Fusion pipelines execute in Dataproc clusters in the customer project, and can be configured to run using a customer-specified (custom) service account. A custom service account must be granted the Service Account User role.
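The role grant described above can be pictured as an IAM policy binding. The sketch below adds a member to the Service Account User role (`roles/iam.serviceAccountUser`) in a policy represented as a plain dictionary with `bindings`, `role`, and `members` fields. The helper `add_binding` and the member address are hypothetical, and which principal should hold the role depends on your setup. The code works on local data only and makes no IAM API call.

```python
import copy

SERVICE_ACCOUNT_USER = "roles/iam.serviceAccountUser"


def add_binding(policy, role, member):
    """Return a copy of an IAM-style policy dict with member added to role.

    Hypothetical helper for illustration; the input policy is not modified.
    """
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        # Reuse an existing unconditional binding for the same role.
        if binding.get("role") == role and "condition" not in binding:
            members = binding.setdefault("members", [])
            if member not in members:
                members.append(member)
            return updated
    bindings.append({"role": role, "members": [member]})
    return updated


# Placeholder principal: substitute the account that needs the role.
member = "serviceAccount:example-sa@my-project.iam.gserviceaccount.com"
policy = add_binding({"bindings": []}, SERVICE_ACCOUNT_USER, member)
```

Because the helper is idempotent, applying it twice with the same arguments leaves the policy unchanged.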
Cloud Data Fusion services are created in Google-managed tenant projects that users cannot access. Cloud Data Fusion pipelines execute on Dataproc clusters inside customer projects. Customers can access these clusters during their lifetime.
Cloud Data Fusion audit logs are available from Cloud Logging.
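To narrow a log query to these audit entries, one option is a filter on the audit log name and the service name. The sketch below assembles such a filter string. It assumes the standard Cloud Audit Logs naming scheme (`cloudaudit.googleapis.com%2F<type>`) and the service name `datafusion.googleapis.com`. The helper `audit_log_filter` and the project ID are hypothetical, and the code only builds a string without querying any logs.

```python
def audit_log_filter(project_id, log_type="activity"):
    """Build a log filter string for Cloud Data Fusion audit logs.

    Hypothetical helper for illustration; it does not call any logging API.
    """
    allowed_types = {"activity", "data_access", "system_event"}
    if log_type not in allowed_types:
        raise ValueError(f"log_type must be one of {sorted(allowed_types)}")

    # Audit log names URL-encode the slash after the audit log domain.
    log_name = f"projects/{project_id}/logs/cloudaudit.googleapis.com%2F{log_type}"
    return (
        f'logName="{log_name}" AND '
        'protoPayload.serviceName="datafusion.googleapis.com"'
    )


# Placeholder project ID.
activity_filter = audit_log_filter("my-project")
```

The resulting string can be pasted into a log query tool or passed to a client that accepts a filter expression.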