Using customer-managed encryption keys

This page describes how to use a Key Management Service (KMS) encryption key with Dataflow. A customer-managed encryption key (CMEK) enables encryption of data at rest with a key that you can control through KMS. You can create a batch or streaming pipeline that is protected with a CMEK or access CMEK-protected data in sources and sinks.

You can also use Cloud HSM, a cloud-hosted Hardware Security Module (HSM) service that lets you host encryption keys and perform cryptographic operations in a cluster of FIPS 140-2 Level 3 certified HSMs. For information about additional Cloud HSM quotas, see KMS Quotas.

For more information, see encryption options on Google Cloud.

Before you begin

  1. Verify that you have the Apache Beam SDK for Java 2.13.0 or later, or the Apache Beam SDK for Python 2.13.0 or later.

    For more information, see Installing the Apache Beam SDK.

  2. Decide whether you are going to run Dataflow and KMS in the same Google Cloud project or in different projects. This page uses the following convention:

    • PROJECT_ID is the project ID of the project that is running Dataflow.
    • PROJECT_NUMBER is the project number of the project that is running Dataflow.
    • KMS_PROJECT_ID is the project ID of the project that is running KMS.

    For information about Google Cloud project IDs and project numbers, see Identifying projects.

  3. On the Google Cloud project that you want to run KMS:

    1. Enable the KMS API.
    2. Create a key ring and a key as described in Creating symmetric keys. KMS and Dataflow are both regionalized services. Create the key ring in a location that matches the location of your Dataflow worker instances:

      • For a Dataflow job specified with the region parameter, select keys for use in the same region.
      • For a Dataflow job specified with the zone parameter, select keys for use in the region that contains that zone.
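      For example, a minimal gcloud sketch for creating the key ring and key (LOCATION, KEY_RING, and KEY are placeholders; choose the location that matches your worker instances):

      gcloud kms keyrings create KEY_RING \
      --location LOCATION

      gcloud kms keys create KEY \
      --location LOCATION \
      --keyring KEY_RING \
      --purpose encryption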

Granting Encrypter/Decrypter permissions

  1. Assign the KMS CryptoKey Encrypter/Decrypter role to the Dataflow service account. This grants your Dataflow service account the permission to encrypt and decrypt with the CMEK you specify. If you use the Google Cloud Console and the Create job from template page, this permission is granted automatically and you can skip this step.

    Use the gcloud command-line tool to assign the role:

    gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
    --member serviceAccount:service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter
    

    Replace KMS_PROJECT_ID with the ID of your Google Cloud project that is running KMS, and replace PROJECT_NUMBER with the project number (not project ID) of your Google Cloud project that is running the Dataflow resources.

  2. Assign the KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service account. This grants your Compute Engine service account the permission to encrypt and decrypt with the CMEK you specify.

    Use the gcloud command-line tool to assign the role:

    gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
    --member serviceAccount:service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter
    

    Replace KMS_PROJECT_ID with the ID of your Google Cloud project that is running KMS, and replace PROJECT_NUMBER with the project number (not project ID) of your Google Cloud project that is running the Compute Engine resources.
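To confirm that both bindings are in place, you can list the members that hold the role on the KMS project. The following sketch uses standard gcloud list and filter flags:

gcloud projects get-iam-policy KMS_PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.role:roles/cloudkms.cryptoKeyEncrypterDecrypter" \
--format="value(bindings.members)"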

Create a pipeline that is protected by Cloud KMS

When you create a batch or streaming pipeline, you can select a KMS key to encrypt the pipeline state. The pipeline state is the data that is stored by Dataflow in temporary storage.

Command-line interface

To create a new pipeline with pipeline state that is protected by a KMS key, add the KMS key flag (--dataflowKmsKey for Java or --dataflow_kms_key for Python) to the pipeline parameters. The following example demonstrates running a word count pipeline with KMS.

Java

Dataflow does not support creating default Cloud Storage paths for temporary files when using a KMS key. You must specify gcpTempLocation.

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--inputFile=gs://dataflow-samples/shakespeare/kinglear.txt \
               --output=gs://STORAGE_BUCKET/counts \
               --runner=DataflowRunner --project=PROJECT_ID \
               --gcpTempLocation=gs://STORAGE_BUCKET/tmp \
               --dataflowKmsKey=KMS_KEY" \
  -Pdataflow-runner

Python

Dataflow does not support creating default Cloud Storage paths for temporary files when using a KMS key. You must specify temp_location.

python -m apache_beam.examples.wordcount \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://STORAGE_BUCKET/counts \
  --runner DataflowRunner \
  --project PROJECT_ID \
  --temp_location gs://STORAGE_BUCKET/tmp/ \
  --dataflow_kms_key=KMS_KEY
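
In both examples, KMS_KEY is the fully qualified resource name of the key, in the following format:

projects/KMS_PROJECT_ID/locations/LOCATION/keyRings/KEY_RING/cryptoKeys/KEY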

Cloud Console

  1. Open the Dataflow monitoring UI.
  2. Select Create job from template.
  3. In the Encryption section, select Customer-managed key.

The first time you attempt to run a job with a particular KMS key, your Compute Engine service account or Dataflow service account might not have been granted permission to encrypt and decrypt using that key. In that case, a warning message prompts you to grant the permission to your service account.


Encryption of pipeline state artifacts

Data that a Dataflow pipeline reads from user-specified data sources is encrypted, except for the data keys that you specify for key-based transforms.

Data keys used in key-based operations, such as windowing, grouping, and joining, are not protected by CMEK encryption. If these keys contain personally identifiable information (PII), you must hash or otherwise transform the keys before they enter the Dataflow pipeline. The values of key-value pairs are in scope for CMEK encryption.
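
For example, the following Python sketch hashes keys before any key-based transform is applied (the SHA-256 choice and the element shape are illustrative):

import hashlib

import apache_beam as beam

def hash_key(element):
    # Replace a potentially sensitive key with its SHA-256 hex digest so
    # that no PII is used as a data key in grouping or windowing.
    key, value = element
    return (hashlib.sha256(key.encode('utf-8')).hexdigest(), value)

# Usage inside a pipeline, before a key-based transform such as GroupByKey:
#   hashed = records | beam.Map(hash_key) | beam.GroupByKey()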

Job metadata is not encrypted with KMS keys. Job metadata includes the following:

  • User-supplied data, such as job names, job parameter values, and the pipeline graph
  • System-generated data, such as Job IDs and IP addresses of workers

Encryption of pipeline state locations

The following storage locations are protected with KMS keys:

  • Persistent Disks attached to Dataflow workers and used for Persistent Disk-based shuffle and streaming state storage.
  • Dataflow Shuffle state for batch pipelines.
  • Cloud Storage buckets that store temporary export or import data. Dataflow only supports default keys set by the user at the bucket level.
  • Cloud Storage buckets used to store binary files containing pipeline code. Dataflow only supports default keys set by the user at the bucket level.

Currently, Dataflow Streaming Engine state cannot be protected by a CMEK and is encrypted with a Google-managed key. If you want all of your pipeline state to be protected by CMEK, do not use this optional feature.

Verifying Cloud KMS key usage

You can verify whether your pipeline uses a KMS key by using the Cloud Console or the gcloud command-line tool.

Console

  1. Open the Dataflow monitoring UI.
  2. Select your Dataflow job to view job details.
  3. In the Job Summary section, the key type is listed in the Encryption type field.

CLI

Run the describe command using the gcloud tool:

gcloud dataflow jobs describe JOB_ID

Search for the line that contains serviceKmsKeyName. This information shows that a KMS key was used for Dataflow pipeline state encryption.
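
For example, the following sketch pipes the output through grep (REGION is the regional endpoint of your job):

gcloud dataflow jobs describe JOB_ID --region=REGION | grep serviceKmsKeyName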

You can verify KMS key usage for encrypting sources and sinks by using the Cloud Console pages and tools of those sources and sinks, including Pub/Sub, Cloud Storage, and BigQuery. You can also verify KMS key usage by viewing your KMS audit logs.

Audit logging Cloud KMS key usage

Dataflow enables KMS to use Cloud Audit Logs for logging key operations, such as encrypt and decrypt. Dataflow provides the job ID as context to a KMS caller, which lets you track each instance in which a specific KMS key is used for a Dataflow job.

Cloud Audit Logs maintains audit logs for each Google Cloud project, folder, and organization. You have several options for viewing your KMS audit logs.

KMS writes Admin Activity audit logs for your Dataflow jobs with CMEK encryption. These logs record operations that modify the configuration or metadata of a resource. You can't disable Admin Activity audit logs.

If explicitly enabled, KMS writes Data Access audit logs for your Dataflow jobs with CMEK encryption. Data Access audit logs contain API calls that read the configuration or metadata of resources, as well as user-driven API calls that create, modify, or read user-provided resource data. For instructions on enabling some or all of your Data Access audit logs, see Configuring Data Access logs.
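
For example, you can list recent KMS Data Access entries for decrypt operations with gcloud. The following is a sketch; the filter uses standard Cloud Audit Logs fields:

gcloud logging read \
'protoPayload.serviceName="cloudkms.googleapis.com" AND protoPayload.methodName="Decrypt"' \
--project KMS_PROJECT_ID \
--limit 10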

Removing Dataflow's access to the Cloud KMS key

You can remove Dataflow's access to the KMS key by using the following steps:

  1. Revoke the KMS CryptoKey Encrypter/Decrypter role from the Dataflow service account by using the Cloud Console or the gcloud tool.
  2. Revoke the KMS CryptoKey Encrypter/Decrypter role from the Compute Engine service account by using the Cloud Console or the gcloud tool.
  3. Optionally, destroy the key version material to further prevent Dataflow and other services from accessing the pipeline state.
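
For the first two steps, the following gcloud sketch mirrors the grant commands shown earlier:

gcloud projects remove-iam-policy-binding KMS_PROJECT_ID \
--member serviceAccount:service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
--role roles/cloudkms.cryptoKeyEncrypterDecrypter

gcloud projects remove-iam-policy-binding KMS_PROJECT_ID \
--member serviceAccount:service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
--role roles/cloudkms.cryptoKeyEncrypterDecrypter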

Although you can destroy the key version material, you cannot delete keys and key rings. Key rings and keys do not have billable costs or quota limitations, so their continued existence does not impact costs or production limits.

Dataflow jobs periodically validate whether the Dataflow service account can successfully use the given KMS key. If an encrypt or decrypt request fails, the Dataflow service halts all data ingestion and processing as soon as possible and immediately begins cleaning up the Google Cloud resources attached to your job.

Using Google Cloud sources and sinks that are protected with Cloud KMS keys

Dataflow can access sources and sinks that are protected by KMS keys without you having to specify the KMS key of those sources and sinks, as long as you are not creating new objects. If your Dataflow pipeline might create new objects in a sink, you must define pipeline parameters that specify the KMS keys for that sink and pass those keys to the appropriate I/O connector methods.

For Dataflow pipeline sources and sinks that do not support CMEK managed by KMS, such as Confluent Kafka hosted on Google Cloud or Amazon Simple Storage Service (S3), the Dataflow CMEK settings are irrelevant.

Cloud KMS key permissions

When accessing services that are protected with KMS keys, verify that you have assigned the Cloud KMS CryptoKey Encrypter/Decrypter role to the service account of that service. The accounts have the following form:

  • Cloud Storage: service-{project_number}@gs-project-accounts.iam.gserviceaccount.com
  • BigQuery: bq-{project_number}@bigquery-encryption.iam.gserviceaccount.com
  • Pub/Sub: service-{project_number}@gcp-sa-pubsub.iam.gserviceaccount.com

Cloud Storage

If you want to protect the temporary and staging buckets that you specified with the tempLocation/temp_location and stagingLocation/staging_location pipeline parameters, see setting up CMEK-protected Cloud Storage buckets.
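
For example, you can set a default KMS key on an existing bucket with gsutil (a sketch; STORAGE_BUCKET and the key name are placeholders):

gsutil kms encryption \
-k projects/KMS_PROJECT_ID/locations/LOCATION/keyRings/KEY_RING/cryptoKeys/KEY \
gs://STORAGE_BUCKET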

BigQuery

Java

Use the withKmsKey() method on return values from BigQueryIO.readTableRows(), BigQueryIO.read(), BigQueryIO.writeTableRows(), and BigQueryIO.write().

You can find an example in the Apache Beam GitHub repository.

Python

Use the kms_key argument in BigQuerySource and BigQuerySink.

You can find an example in the Apache Beam GitHub repository.
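
For example, the following minimal write sketch passes the key to BigQuerySink (the table, schema, and key names are placeholders); tables that the pipeline creates are then encrypted with the specified key:

import apache_beam as beam

# Fully qualified Cloud KMS key resource name (placeholder values).
KMS_KEY = 'projects/KMS_PROJECT_ID/locations/LOCATION/keyRings/KEY_RING/cryptoKeys/KEY'

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create([{'word': 'lear', 'count': 42}])
     | beam.io.Write(beam.io.BigQuerySink(
         'PROJECT_ID:DATASET.TABLE',
         schema='word:STRING,count:INTEGER',
         kms_key=KMS_KEY)))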

Cloud Pub/Sub

Dataflow handles access to CMEK-protected Pub/Sub topics by using the topic's CMEK configuration.

To read from and write to CMEK-protected Pub/Sub topics, see Pub/Sub instructions for using CMEK.

Pricing

You can use KMS encryption keys with Dataflow in all Dataflow regional endpoints where KMS is available.

This integration does not incur additional costs beyond the key operations, which are billed to your Google Cloud project. Each time the Dataflow service account uses your KMS key, the operation is billed at the rate of KMS key operations.

For more information, see KMS pricing details.
