Using customer-managed encryption keys (CMEK)

Customer-managed encryption keys (CMEK) give you control over the encryption of data written by Cloud Data Fusion pipelines, including:

  • Dataproc cluster metadata
  • Cloud Storage, BigQuery, and Pub/Sub data sources and sinks

This page describes how to use a Cloud Key Management Service (Cloud KMS) encryption key with Cloud Data Fusion. A customer-managed encryption key (CMEK) enables encryption of data at rest with a key that you can control through Cloud KMS. You can create a pipeline that is protected with a CMEK or access CMEK-protected data in sources and sinks.

Cloud Data Fusion resources

Cloud Data Fusion supports customer-managed encryption keys (CMEK) for the following Cloud Data Fusion plugins:

  • Cloud Data Fusion sinks:

    • Cloud Storage
    • Cloud Storage multi-file
    • BigQuery
    • BigQuery multi-table
    • Pub/Sub
  • Cloud Data Fusion actions:

    • Cloud Storage create
    • BigQuery execute

Cloud Data Fusion supports customer-managed encryption keys (CMEK) for Dataproc clusters. Cloud Data Fusion creates a temporary Dataproc cluster for use in the pipeline, which is deleted when the pipeline completes. CMEK protects the cluster metadata written to the following:

  • Persistent disks (PDs) attached to cluster VMs.
  • Job driver output and other metadata written to the auto-created or user-created Dataproc staging bucket.

Set up CMEK

  1. Create a Cloud KMS key.

  2. Get the resource ID of the key that you created. You use this ID later in this procedure.

    1. In the Cloud Console, go to the Cryptographic Keys page.
    2. Click the three-dot menu next to your key.
    3. Click Copy Resource ID. This copies your resource ID to the clipboard.

    The resource ID has the following form:

      projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name

  3. Set up your project's service accounts to use your key:

    1. Required: Grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine System service account (see Granting roles to a service account for specific resources). This account, which is granted the Compute Engine Service Agent role by default, is of the form:
      service-[PROJECT_NUMBER]@compute-system.iam.gserviceaccount.com
      
    2. Required: Grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Storage service account (see Assigning a Cloud KMS key to a Cloud Storage service account). This account is of the form:
      service-[PROJECT_NUMBER]@gs-project-accounts.iam.gserviceaccount.com
      
    3. Optional: If your pipeline uses BigQuery resources, grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the BigQuery service account (see Grant encryption and decryption permission). This account is of the form:
      bq-[PROJECT_NUMBER]@bigquery-encryption.iam.gserviceaccount.com
      
    4. Optional: If your pipeline uses Pub/Sub resources, grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Pub/Sub service account (see Using customer-managed encryption keys). This account is of the form:
      service-[PROJECT_NUMBER]@gcp-sa-pubsub.iam.gserviceaccount.com
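
The key resource ID and the service account addresses above follow fixed patterns, so they can be assembled programmatically. The following is a minimal Python sketch, not an official client. The function names (key_resource_id, service_agents, grant_command) and all sample values are hypothetical. The code only builds strings, including the arguments for a gcloud kms keys add-iam-policy-binding invocation per account; it does not call any Google Cloud API, and you should review the generated command before running it.

```python
import re

# Role ID for the "Cloud KMS CryptoKey Encrypter/Decrypter" role.
ROLE = "roles/cloudkms.cryptoKeyEncrypterDecrypter"

# Matches the resource ID format shown in step 2.
KEY_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/keyRings/(?P<key_ring>[^/]+)/cryptoKeys/(?P<key>[^/]+)$"
)


def key_resource_id(project_id, region, key_ring, key_name):
    """Build a Cloud KMS key resource ID from its four components."""
    return (
        f"projects/{project_id}/locations/{region}"
        f"/keyRings/{key_ring}/cryptoKeys/{key_name}"
    )


def service_agents(project_number, uses_bigquery=False, uses_pubsub=False):
    """List the service accounts from step 3 that need the role."""
    agents = [
        # Required: Compute Engine and Cloud Storage service accounts.
        f"service-{project_number}@compute-system.iam.gserviceaccount.com",
        f"service-{project_number}@gs-project-accounts.iam.gserviceaccount.com",
    ]
    if uses_bigquery:  # Optional: only for pipelines that use BigQuery.
        agents.append(
            f"bq-{project_number}@bigquery-encryption.iam.gserviceaccount.com"
        )
    if uses_pubsub:  # Optional: only for pipelines that use Pub/Sub.
        agents.append(
            f"service-{project_number}@gcp-sa-pubsub.iam.gserviceaccount.com"
        )
    return agents


def grant_command(resource_id, member_email):
    """Return gcloud arguments that grant ROLE on the key to one account."""
    match = KEY_PATTERN.match(resource_id)
    if match is None:
        raise ValueError(f"not a Cloud KMS key resource ID: {resource_id!r}")
    return [
        "gcloud", "kms", "keys", "add-iam-policy-binding", match["key"],
        "--project", match["project"],
        "--location", match["location"],
        "--keyring", match["key_ring"],
        "--member", f"serviceAccount:{member_email}",
        "--role", ROLE,
    ]


if __name__ == "__main__":
    # Sample values only; substitute your own project, key, and number.
    key = key_resource_id("my-project", "us-central1", "my-ring", "my-key")
    for agent in service_agents("123456789", uses_bigquery=True):
        print(" ".join(grant_command(key, agent)))
```

Returning the command as a list of arguments rather than a single string keeps each value separate, which avoids quoting problems if the list is later passed to a process runner.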
      

Use CMEK with Dataproc cluster metadata

To use CMEK to encrypt the persistent disk (PD) and staging bucket metadata written by the Dataproc cluster that runs your pipeline, do one of the following:

  • Recommended: Create a Dataproc compute profile (Enterprise edition only).
  • Edit an existing Dataproc compute profile (Basic or Enterprise edition).

To create a Dataproc compute profile, follow these steps:

  1. Open the Cloud Data Fusion Instances page in the Cloud Console.

  2. In the Actions column for the instance, click View Instance.
  3. In the Cloud Data Fusion web UI, click SYSTEM ADMIN.
  4. Click the Configuration tab.
  5. Click the System Compute Profiles drop-down.
  6. Click Create New Profile.
  7. Select Cloud Dataproc.
  8. Enter a Profile label, Profile name, and Description.
  9. By default, Cloud Data Fusion automatically creates a Cloud Storage bucket to use as the Dataproc staging bucket. To use an existing Cloud Storage bucket in your project instead, follow these steps:
    1. In the General Settings section, enter your existing Cloud Storage bucket in the GCS Bucket field.
    2. Add your Cloud KMS key to your Cloud Storage bucket.
  10. Get the resource ID of your Cloud KMS key. In the General Settings section, enter your resource ID in the Encryption Key Name field.
  11. Click Create.
  12. If more than one profile is listed in the System Compute Profiles section of the Configuration tab, make the new Dataproc profile the default profile by holding the pointer over the profile name field and clicking the star that appears.

Use CMEK with other resources

To use CMEK to encrypt data written by other resources, such as Cloud Storage, BigQuery, or Pub/Sub sinks, do one of the following:

  • Use a runtime argument.
  • Set a Cloud Data Fusion system preference.

Runtime argument

  1. In the Cloud Data Fusion Pipeline Studio page, click the drop-down arrow to the right of the Run button.
  2. In the Name field, enter gcp.cmek.key.name.
  3. In the Value field, enter your key's resource ID.
  4. Click Save.
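
A runtime argument is a name/value pair, so the setting above amounts to a one-entry mapping. The Python sketch below checks that a value looks like a Cloud KMS key resource ID before pairing it with gcp.cmek.key.name, which can catch a pasted key name or a truncated path early. The helper name cmek_runtime_args and the sample key are hypothetical; only the argument name and the resource ID format come from this page.

```python
import json
import re

# Runtime argument name from step 2.
CMEK_ARG = "gcp.cmek.key.name"

# Shape of a Cloud KMS key resource ID.
_KEY_ID = re.compile(
    r"projects/[^/]+/locations/[^/]+/keyRings/[^/]+/cryptoKeys/[^/]+"
)


def cmek_runtime_args(key_resource_id):
    """Return the runtime-argument mapping for a CMEK-protected run."""
    if _KEY_ID.fullmatch(key_resource_id) is None:
        raise ValueError(
            f"expected a full key resource ID, got {key_resource_id!r}"
        )
    return {CMEK_ARG: key_resource_id}


if __name__ == "__main__":
    # Sample key only; substitute the resource ID that you copied earlier.
    args = cmek_runtime_args(
        "projects/my-project/locations/us-central1"
        "/keyRings/my-ring/cryptoKeys/my-key"
    )
    print(json.dumps(args, indent=2))
```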

Preference

  1. In the Cloud Data Fusion UI, click SYSTEM ADMIN.
  2. Click the Configuration tab.
  3. Click the System Preferences drop-down.
  4. Click Edit System Preferences.
  5. In the Key field, enter gcp.cmek.key.name.
  6. In the Value field, enter your key's resource ID.
  7. Click Save & Close.
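
Because the same gcp.cmek.key.name setting can come from either a runtime argument or a system preference, it can help to reason about which value a given run would pick up. The Python sketch below models one plausible resolution rule. It assumes that a runtime argument takes precedence over a system preference; this page does not state that rule, so treat it as an assumption and verify the behavior in your own instance. The function name effective_cmek_key and the sample values are hypothetical.

```python
CMEK_KEY = "gcp.cmek.key.name"


def effective_cmek_key(system_preferences, runtime_args):
    """Return the key a run would use, or None if neither source sets one.

    Assumption (not stated on this page): a runtime argument overrides
    a system preference with the same name.
    """
    if runtime_args.get(CMEK_KEY):
        return runtime_args[CMEK_KEY]
    return system_preferences.get(CMEK_KEY)


if __name__ == "__main__":
    # Sample values only.
    prefs = {
        CMEK_KEY: "projects/my-project/locations/us-central1"
                  "/keyRings/my-ring/cryptoKeys/default-key"
    }
    run_args = {
        CMEK_KEY: "projects/my-project/locations/us-central1"
                  "/keyRings/my-ring/cryptoKeys/pipeline-key"
    }
    print(effective_cmek_key(prefs, {}))        # falls back to the preference
    print(effective_cmek_key(prefs, run_args))  # uses the runtime argument
```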