Using customer-managed encryption keys (CMEK)

This page describes how to use a Cloud Key Management Service (Cloud KMS) encryption key with Cloud Data Fusion.

A customer-managed encryption key (CMEK) enables encryption of data at rest with a key that you can control through Cloud KMS. CMEK provides user control over the data written to Google internal resources in tenant projects and data written by Cloud Data Fusion pipelines, including:

  • Pipeline logs and metadata
  • Dataproc cluster metadata
  • Cloud Storage, BigQuery, Pub/Sub, and Cloud Spanner data sinks

Cloud Data Fusion resources

Cloud Data Fusion supports CMEK for the following Cloud Data Fusion plugins:

  • Cloud Data Fusion sinks:

    • Cloud Storage
    • Cloud Storage multi-file
    • BigQuery
    • BigQuery multi-table
    • Pub/Sub
    • Spanner
  • Cloud Data Fusion actions:

    • Cloud Storage create
    • BigQuery execute

Cloud Data Fusion supports CMEK for Dataproc clusters. Cloud Data Fusion creates a temporary Dataproc cluster for use in the pipeline, and then deletes the cluster when the pipeline completes. CMEK protects the cluster metadata written to the following:

  • Persistent disks (PD) attached to cluster VMs
  • Job driver output and other metadata written to the auto-created or user-created Dataproc staging bucket.

Set up CMEK

Create a Cloud KMS key

Create a Cloud KMS key.

You can create the key in the same Google Cloud project as the Cloud Data Fusion instance or in a separate user project. The Cloud KMS key ring location must match the region where you want to create the instance. Multi-region and global keys are not supported.
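As a rough sketch of this step, the following shell helpers reject a non-regional location and then print (without running) the gcloud commands that would create a key ring and key. The function names are hypothetical, and the hyphen check is only a heuristic that assumes regional locations look like us-east1 while global and multi-region names contain no hyphen.

```shell
# Heuristic (an assumption, not an official rule): regional locations such as
# us-east1 contain a hyphen, while global and multi-region names do not.
is_regional_location() {
  case "$1" in
    *-*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Print, without running, the gcloud commands that create a key ring and a key.
# $1 = project ID, $2 = location, $3 = key ring name, $4 = key name
print_key_create_commands() {
  if ! is_regional_location "$2"; then
    echo "error: '$2' does not look like a regional location" >&2
    return 1
  fi
  echo "gcloud kms keyrings create $3 --project=$1 --location=$2"
  echo "gcloud kms keys create $4 --project=$1 --location=$2 --keyring=$3 --purpose=encryption"
}
```

Printing the commands first makes it easy to review the location before anything is created, because a key ring's location cannot be changed afterwards.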

Get the resource name for the key

REST API

The resource name of the key that you created has the following format:

projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name
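The format above can be assembled and taken apart with plain shell string handling. This is a minimal sketch; both helper names are hypothetical, and neither validates that the key actually exists.

```shell
# Assemble a key resource name from its parts.
# $1 = project ID, $2 = region, $3 = key ring name, $4 = key name
kms_key_name() {
  printf 'projects/%s/locations/%s/keyRings/%s/cryptoKeys/%s\n' "$1" "$2" "$3" "$4"
}

# Extract the location segment from a key resource name.
# The input is assumed to follow the format above; it is not validated.
kms_key_location() {
  _rest="${1#*/locations/}"      # drop everything up to and including /locations/
  printf '%s\n' "${_rest%%/*}"   # keep the text before the next slash
}
```

For example, kms_key_location is a quick way to confirm that a copied resource name points at the region where you plan to create the instance.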

Console

  1. Go to the Cryptographic keys page.


  2. Next to your key, click More.

  3. Select Copy Resource Name to copy the full resource name to the clipboard.

Update your project's service accounts to use the key

To set up your project's service accounts to use your key:

  1. Required: Grant the Cloud KMS CryptoKey Encrypter/Decrypter role (roles/cloudkms.cryptoKeyEncrypterDecrypter) to the Cloud Data Fusion service agent (see Granting roles to a service account for specific resources). This account is in the following format:

    service-PROJECT_NUMBER@gcp-sa-datafusion.iam.gserviceaccount.com

    Granting the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Data Fusion service agent enables Cloud Data Fusion to use CMEK to encrypt any customer data stored in tenant projects.

  2. Required: Grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service agent (see Assigning a Cloud KMS key to a Cloud Storage service account). This account, which by default is granted the Compute Engine Service Agent role, is of the form:

    service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com

    Granting the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service agent enables Cloud Data Fusion to use CMEK to encrypt persistent disk (PD) metadata written by the Dataproc cluster running in your pipeline.

  3. Required: Grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Storage service agent (see Assigning a Cloud KMS key to a Cloud Storage service agent). This service agent is of the form:

    service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com

    Granting the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Storage service agent enables Cloud Data Fusion to use CMEK to encrypt data written to the Dataproc cluster staging bucket and any other Cloud Storage resources used by your pipeline.

  4. Optional: If your pipeline uses BigQuery resources, grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the BigQuery service account (see Grant encryption and decryption permission). This account is of the form:

    bq-PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com

  5. Optional: If your pipeline uses Pub/Sub resources, grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Pub/Sub service account (see Using customer-managed encryption keys). This account is of the form:

    service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com

  6. Optional: If your pipeline uses Spanner resources, grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Spanner service account. This account is of the form:

    service-PROJECT_NUMBER@gcp-sa-spanner.iam.gserviceaccount.com
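The three required grants above could be scripted along these lines. This is a sketch only: the helper names are hypothetical, the commands are printed rather than run so that you can review them, and the optional BigQuery, Pub/Sub, and Spanner accounts are left out.

```shell
# List the three required service agents for a project number ($1),
# in the member format that IAM policy bindings expect.
cmek_required_members() {
  printf 'serviceAccount:service-%s@gcp-sa-datafusion.iam.gserviceaccount.com\n' "$1"
  printf 'serviceAccount:service-%s@compute-system.iam.gserviceaccount.com\n' "$1"
  printf 'serviceAccount:service-%s@gs-project-accounts.iam.gserviceaccount.com\n' "$1"
}

# Print, without running, one grant command per required service agent.
# $1 = full key resource name, $2 = project number
print_grant_commands() {
  cmek_required_members "$2" | while IFS= read -r member; do
    printf 'gcloud kms keys add-iam-policy-binding %s --member=%s --role=roles/cloudkms.cryptoKeyEncrypterDecrypter\n' \
      "$1" "$member"
  done
}
```

The project number, as opposed to the project ID, can be read with gcloud projects describe PROJECT_ID --format="value(projectNumber)".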

Create a Cloud Data Fusion instance with CMEK

To create an instance with a customer-managed encryption key, export the following variables or directly substitute these values into the following commands.

export PROJECT=PROJECT_ID # the user project that will host the Data Fusion instance
export LOCATION=REGION
export INSTANCE=INSTANCE_ID
export DATA_FUSION_API_NAME=datafusion.googleapis.com
export KEY=CMEK_KEY # the full resource name of the CMEK key, which is of the form projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name

Run the following command to create a Cloud Data Fusion instance:

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" "https://$DATA_FUSION_API_NAME/v1/projects/$PROJECT/locations/$LOCATION/instances?instance_id=$INSTANCE" -d '{"description": "CMEK-enabled CDF instance created through REST.", "type": "BASIC", "cryptoKeyConfig": {"key_reference": "'"$KEY"'"} }'
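Because the key ring location must match the instance region, it may help to check the two values before sending the request. The following is a minimal sketch with hypothetical helper names; it compares strings only and does not contact Cloud KMS or Cloud Data Fusion.

```shell
# Succeed only when the location segment of the key name equals the region.
# $1 = full key resource name, $2 = instance region
key_matches_region() {
  case "$1" in
    projects/*/locations/"$2"/keyRings/*/cryptoKeys/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Build the JSON request body with the key name ($1) filled in.
instance_request_body() {
  printf '{"description": "CMEK-enabled CDF instance created through REST.", "type": "BASIC", "cryptoKeyConfig": {"key_reference": "%s"}}\n' "$1"
}

# Example use with the exported variables (not run here):
#   key_matches_region "$KEY" "$LOCATION" || echo "key and region differ" >&2
#   instance_request_body "$KEY"   # pass the output to curl with -d
```

Building the body with printf avoids a common pitfall: a variable such as $KEY is not expanded inside a single-quoted JSON string.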

Use CMEK with Dataproc cluster metadata

The pre-created compute profiles use the CMEK key provided during instance creation to encrypt the Persistent Disk (PD) and staging bucket metadata written by the Dataproc cluster running in your pipeline. To use a different key, do one of the following:

  • Recommended: Create a new Dataproc compute profile (Enterprise edition only).
  • Edit an existing Dataproc compute profile (Developer, Basic, or Enterprise editions).

Console

  1. Go to the Cloud Data Fusion Instances page.


  2. In the Actions column for the instance, click View Instance.

  3. In the Cloud Data Fusion web UI, click SYSTEM ADMIN.

  4. Click the Configuration tab.

  5. Click the System Compute Profiles drop-down.

  6. Click Create New Profile.

  7. Select Cloud Dataproc.

  8. Enter a Profile label, Profile name, and Description.

  9. By default, Cloud Data Fusion auto-creates a Cloud Storage bucket to be used as the Dataproc staging bucket. If you prefer to use a Cloud Storage bucket that already exists in your project, follow these steps:

    1. In the General Settings section, enter your existing Cloud Storage bucket in the Cloud Storage Bucket field.

    2. Add your Cloud KMS key to your Cloud Storage bucket.

  10. Get the resource ID of your Cloud KMS key. In the General Settings section, enter your resource ID in the Encryption Key Name field.

  11. Click Create.

  12. If more than one profile is listed in the System Compute Profiles section of the Configuration tab, make the new Dataproc profile the default profile by holding the pointer over the profile name field and clicking the star that appears.

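For the case where you reuse an existing staging bucket, adding the key to the bucket amounts to setting it as the bucket's default encryption key. The sketch below prints a gcloud command for that; the helper names are hypothetical, and the bucket and key values are placeholders you supply.

```shell
# Normalize a bucket name ($1) to a gs:// URL.
bucket_url() {
  case "$1" in
    gs://*) printf '%s\n' "$1" ;;
    *)      printf 'gs://%s\n' "$1" ;;
  esac
}

# Print, without running, the command that sets a bucket's default CMEK key.
# $1 = bucket name or gs:// URL, $2 = full key resource name
print_bucket_cmek_command() {
  echo "gcloud storage buckets update $(bucket_url "$1") --default-encryption-key=$2"
}
```

Objects written before the default key was set keep their previous encryption, so this mainly affects new staging data.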

Use CMEK with other resources

The CMEK key that you provide during Cloud Data Fusion instance creation is set as a system preference. It is used to encrypt data written to newly created resources by pipeline sinks such as Cloud Storage, BigQuery, Pub/Sub, or Spanner sinks.

This key only applies to newly created resources. If the resource already exists before pipeline execution, you should manually apply the CMEK key to those existing resources.

The resource must be in the same region as the CMEK key. Multi-region and global resources are not supported with CMEK. You can change the CMEK key by doing one of the following:

  • Use a runtime argument.
  • Set a Cloud Data Fusion system preference.

Runtime argument

  1. In the Cloud Data Fusion Pipeline Studio page, click the drop-down arrow to the right of the Run button.
  2. In the Name field, enter gcp.cmek.key.name.
  3. In the Value field, enter your key's resource ID.
  4. Click Save.

    The runtime argument you set here applies only to runs of the current pipeline.
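If you start pipelines from a script instead of Pipeline Studio, the same runtime argument can be expressed as a JSON map. This is a sketch under stated assumptions: runtime_args_json is a hypothetical helper, CDAP_ENDPOINT and PIPELINE are placeholder variables, and the commented request assumes the CDAP lifecycle REST endpoint for starting a batch pipeline, which you should confirm for your instance version.

```shell
# Build a runtime-argument map that overrides the CMEK key for one run.
# $1 = full key resource name
runtime_args_json() {
  printf '{"gcp.cmek.key.name": "%s"}\n' "$1"
}

# Example use (not run here; endpoint path is an assumption):
#   curl -X POST \
#     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#     "$CDAP_ENDPOINT/v3/namespaces/default/apps/$PIPELINE/workflows/DataPipelineWorkflow/start" \
#     -d "$(runtime_args_json "$KEY")"
```

As with the UI steps, an argument passed this way applies only to the run that it starts.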

Preference

  1. In the Cloud Data Fusion UI, click SYSTEM ADMIN.
  2. Click the Configuration tab.
  3. Click the System Preferences drop-down.
  4. Click Edit System Preferences.
  5. In the Key field, enter gcp.cmek.key.name.
  6. In the Value field, enter your key's resource ID.
  7. Click Save & Close.