Google Cloud Dataflow Security and Permissions

Google Cloud Dataflow pipelines can be run either locally (to perform tests on small datasets), or on managed Cloud Platform resources using the Cloud Dataflow managed service. Whether running locally or in the cloud, your pipeline and its workers use a permissions system to maintain secure access to pipeline files and resources. Cloud Dataflow permissions are assigned according to the role used to access pipeline resources. The sections below explain the roles and permissions associated with local and cloud pipelines, the default settings, and how to check your project’s permissions.

Security and Permissions for Local Pipelines

Google Cloud Platform Account

When you run Cloud Dataflow locally, your pipeline runs as the Cloud Platform account that you configured with the gcloud command-line tool. Consequently, locally-run Cloud Dataflow SDK operations have access to the same files and resources that your Cloud Platform account can access.

To list the Google Cloud Platform account you selected as your default, run the gcloud config list command.
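
For example, the following commands list your active gcloud configuration and your credentialed accounts (the exact output fields can vary between gcloud releases):

gcloud config list
gcloud auth list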

Note: Local pipelines can output data to local destinations, such as local files, or to cloud destinations, such as Google Cloud Storage or Google BigQuery. If your locally-run pipeline writes files to cloud-based resources such as Cloud Storage, it uses your Cloud Platform account credentials and the Cloud Platform project that you configured as the gcloud command-line tool default. For instructions about how to authenticate with your Cloud Platform account credentials, see the quickstart for the language you are using.
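
For example, assuming the gcloud command-line tool is installed, authentication and the default project can typically be set up with commands along these lines (<project-id> is a placeholder for your own project ID):

gcloud auth login
gcloud config set project <project-id>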

Security and Permissions for Pipelines on Cloud Platform

Cloudservices Account

When you run your pipeline on the Dataflow service, it runs as a cloudservices account. This account is automatically created when a Cloud Dataflow project is created, and it defaults to having read/write access to the project's resources. The cloudservices account performs “metadata” operations: those that don’t run on your local client or on Google Compute Engine workers, such as determining input sizes, accessing Google Cloud Storage files, and starting Compute Engine workers. For example, if your project is the owner of a Cloud Storage bucket (has read/write access to the bucket), then the cloudservices account associated with your project also has owner (read/write) access to the bucket (see the FAQ for information on how to check if your project owns a bucket).
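
As a quick command-line check, you can inspect a bucket's access control list with gsutil (assuming your account is allowed to read the bucket's ACL); any project-level owner entries appear in the output:

gsutil acl get gs://<bucket>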

Since Cloud Platform services expect to have read/write access to the project and its resources, it is recommended that you do not change the default permissions automatically established for your project.

Note: It’s usually best to create a bucket owned by your project to use as the staging bucket for Cloud Dataflow. Doing so ensures that permissions are automatically set correctly for staging your pipeline’s executable files.
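
For example, a staging bucket can be created in your own project with gsutil along these lines (the bucket name and location below are placeholders):

gsutil mb -p <project-id> -l US gs://<bucket>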

Dataflow Service Account

As part of Dataflow's pipeline execution, the Dataflow service manipulates resources on your behalf (for example, creating additional VMs). When you run your pipeline on the Dataflow service, it uses a service account (service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com). This account is automatically created when a Cloud Dataflow project is created, gets assigned the editor role on the project, and defaults to having read/write access to the project's resources. The account is used exclusively by the Dataflow service and is specific to your project.

You can review the permissions of the Dataflow service account in the Cloud Console on the IAM & Admin > IAM page.
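
You can also inspect the project's IAM policy from the command line, for example:

gcloud projects get-iam-policy <project-id>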

Since Cloud Platform services expect to have read/write access to the project and its resources, it is recommended that you do not change the default permissions automatically established for your project. If you remove the permissions for the service account from the IAM policy, the account will continue to be present, as it is owned by the Dataflow service. However, if the Dataflow service account loses permissions to a project, Dataflow will not be able to launch VMs or perform other management tasks.

Best practice: Create a bucket owned by your project to use as the staging bucket for Dataflow. Doing this will ensure that permissions are automatically set correctly for staging your pipeline’s executable files.

Compute Engine Service Account

Compute Engine instances (or workers) perform the work of executing Dataflow SDK operations in the cloud. These workers use your project’s Compute Engine service account to access your pipeline’s files and other resources. This service account (<project-number>-compute@developer.gserviceaccount.com) is automatically created when you enable the Compute Engine API for your project from the APIs page in the Google Cloud Platform Console.

By default, the Compute Engine service account associated with a project automatically has access to Cloud Storage buckets and resources owned by the project. Since most Compute Engine workers expect to have read/write access to project resources, it is recommended that you do not change the default permissions automatically established for your project.

You can associate multiple Compute Engine service accounts with a project to support more granular permissions by selecting APIs and auth > Credentials > Create Client ID > Service Account in the Google Cloud Platform Console page for the project.
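
With recent versions of the gcloud command-line tool, additional service accounts can also be created from a shell or terminal window; for example (the account name and display name below are placeholders):

gcloud iam service-accounts create my-dataflow-worker --display-name "Dataflow worker"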

You can obtain a list of your project’s service accounts from the Permissions page in the Cloud Platform Console.
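
Alternatively, you can list the service accounts from the command line:

gcloud iam service-accounts list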

Accessing Cloud Platform Resources Across Multiple Cloud Platform Projects

Your Dataflow pipelines can access Google Cloud Platform resources in other Cloud Platform projects. These include:

Java

  • Cloud Storage Buckets
  • BigQuery Datasets
  • Pub/Sub Topics and Subscriptions
  • Datastore Datasets

Python

  • Cloud Storage Buckets
  • BigQuery Datasets

To ensure that your Dataflow pipeline can access these resources across projects, you'll need to use the resources' respective access control mechanisms to explicitly grant access to your Dataflow project's cloudservices and Compute Engine service accounts.

Accessing Cloud Storage Buckets Across Cloud Platform Projects

To give your Dataflow project access to a Cloud Storage bucket owned by a different Cloud Platform project, you'll need to make the bucket accessible to your Dataflow project's cloudservices (<project-number>@cloudservices.gserviceaccount.com) and Compute Engine (<project-number>-compute@developer.gserviceaccount.com) service accounts. You can use Cloud Storage Access Controls to grant the required access.

Note: If you are not using the default service accounts, make sure permissions are consistent with your IAM settings.

To obtain a list of your Dataflow project’s service accounts, check the IAM & Admin page in the Cloud Platform Console. Once you have the account names, you can run gsutil commands to grant the project's service accounts ownership (read/write permission) to both the bucket and its contents.

To grant your Dataflow project's service accounts access to a Cloud Storage bucket in another project, use the following commands in your shell or terminal window:

gsutil acl ch -u <project-number>@cloudservices.gserviceaccount.com:OWNER gs://<bucket>
gsutil acl ch -u <project-number>-compute@developer.gserviceaccount.com:OWNER gs://<bucket>

To grant your Dataflow project's service accounts access to the existing contents of a Cloud Storage bucket in another project, use the following commands in your shell or terminal window:

gsutil -m acl ch -r -u <project-number>@cloudservices.gserviceaccount.com:OWNER gs://<bucket>
gsutil -m acl ch -r -u <project-number>-compute@developer.gserviceaccount.com:OWNER gs://<bucket>

Note: The -m option runs the command in parallel for quicker processing; the -r option runs the command recursively on resources within the bucket.

Note that the preceding commands grant access only to existing resources. To give the Dataflow project's service accounts access to future resources added to the bucket, also grant them default object permissions on the bucket:

gsutil defacl ch -u <project-number>@cloudservices.gserviceaccount.com:OWNER gs://<bucket>
gsutil defacl ch -u <project-number>-compute@developer.gserviceaccount.com:OWNER gs://<bucket>

Accessing BigQuery Datasets Across Cloud Platform Projects

You can use the BigQueryIO API to access BigQuery datasets owned by a different Cloud Platform project (i.e., not the project with which you're using Dataflow). For the BigQuery source and sink to operate properly, the following three accounts must have access to any BigQuery datasets that your Dataflow job reads from or writes to:

  • The Google Cloud Platform account you use to execute the Dataflow job
  • The Google cloudservices account of the Cloud Platform project running the Dataflow job
  • The Google Compute Engine service account of the Cloud Platform project running the Dataflow job

You might need to configure BigQuery to explicitly grant access to these accounts. See BigQuery Access Control for more information on granting access to BigQuery datasets using either the BigQuery Web UI or the BigQuery API.

For example, if your Google Cloud Platform account is abcde@gmail.com and the project number of the project where you execute the Dataflow job is 123456789, the following accounts must all be granted access to the BigQuery Datasets used: abcde@gmail.com, 123456789@cloudservices.gserviceaccount.com, and 123456789-compute@developer.gserviceaccount.com.
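
One way to grant that access from a shell or terminal window is to export the dataset's metadata with the bq tool, add the three accounts to the "access" list in the resulting JSON file (see BigQuery Access Control for the exact entry format), and then apply the update (the dataset name below is a placeholder):

bq show --format=prettyjson <project-id>:<dataset> > dataset.json
bq update --source dataset.json <project-id>:<dataset>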

Accessing Pub/Sub Topics and Subscriptions Across Cloud Platform Projects

Java

To access a Pub/Sub topic or subscription owned by a different Cloud Platform project (i.e. not the project with which you're using Dataflow), you'll need to use Pub/Sub's Identity and Access Management features to set up cross-project permissions. Dataflow uses two service accounts to run your jobs, and you need to grant these service accounts access to the Pub/Sub resources in the other project.

For example, if your Google Cloud Platform account is abcde@gmail.com and the project number of the project where you execute the Dataflow job is 123456789, you'll need to grant access to 123456789@cloudservices.gserviceaccount.com and 123456789-compute@developer.gserviceaccount.com.
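
With a sufficiently recent gcloud release, that access can be granted from a shell or terminal window roughly as follows (the topic name is a placeholder, and you should choose the narrowest Pub/Sub role your pipeline actually needs, such as roles/pubsub.subscriber for reading):

gcloud pubsub topics add-iam-policy-binding <topic> --member serviceAccount:<project-number>@cloudservices.gserviceaccount.com --role roles/pubsub.editor
gcloud pubsub topics add-iam-policy-binding <topic> --member serviceAccount:<project-number>-compute@developer.gserviceaccount.com --role roles/pubsub.editor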

See Sample Use Case: Cross-Project Communication for more information and some code examples that demonstrate how to use Pub/Sub's Identity and Access Management features.

Python

This feature is not yet available in the Dataflow SDK for Python.

Accessing Cloud Datastore Across Cloud Platform Projects

Java

To access a Datastore owned by a different Cloud Platform project, you'll need to add your Dataflow project's cloudservices (<project-number>@cloudservices.gserviceaccount.com) and Compute Engine (<project-number>-compute@developer.gserviceaccount.com) service accounts as editors of the project that owns the Datastore. You will also need to enable the Cloud Datastore API in both projects at https://console.cloud.google.com/project/<project-id>/apiui/apiview/datastore/overview.
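
As a sketch, the editor role can be granted from a shell or terminal window roughly as follows (replace <datastore-project-id> with the project that owns the Datastore and <project-number> with your Dataflow project's number; the editor role is broad, so use a narrower role if one meets your needs):

gcloud projects add-iam-policy-binding <datastore-project-id> --member serviceAccount:<project-number>@cloudservices.gserviceaccount.com --role roles/editor
gcloud projects add-iam-policy-binding <datastore-project-id> --member serviceAccount:<project-number>-compute@developer.gserviceaccount.com --role roles/editor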

Python

This feature is not yet available in the Dataflow SDK for Python.

Data Access and Security

The Dataflow service uses several security mechanisms to keep your data secure and private. These mechanisms apply to the following scenarios:

  • When you submit a pipeline to the service
  • When the service evaluates your pipeline
  • When you request access to telemetry and metrics during and after pipeline execution

Pipeline Submission

Your Cloud Platform project's permissions control access to the Dataflow service. Any member of your project with edit or owner rights can submit pipelines to the service. To submit pipelines, you must authenticate using the gcloud command-line tool. Once authenticated, your pipelines are submitted over the HTTPS protocol. For instructions about how to authenticate with your Cloud Platform account credentials, see the quickstart for the language you are using.

Pipeline Evaluation

Temporary Data

As part of evaluating a pipeline, temporary data might be generated and stored locally in the workers or in Google Cloud Storage. Temporary data is encrypted at rest, and does not persist after a pipeline's evaluation concludes.

Java

By default, Google Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. This means that the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sub-locations of the Cloud Storage path that you provide as your --stagingLocation and/or --tempLocation. If you are writing output to a Cloud Storage file, temporary files might be created in the output location before the Write operation is finalized.

Note: You can control whether VMs are deleted when the job completes using the --teardownPolicy pipeline option. The valid options are: TEARDOWN_ALWAYS, the default, which always deletes all the VMs; TEARDOWN_NEVER, which leaves all VMs running regardless of failure or success; and TEARDOWN_ON_SUCCESS, which leaves all VMs running only when the job fails. TEARDOWN_ON_SUCCESS may be particularly useful for debugging.
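
For example, a Java pipeline launched with the Maven exec plugin might pass the option along these lines (the main class and the other options shown are placeholders for your own pipeline's settings):

mvn compile exec:java -Dexec.mainClass=com.example.MyPipeline -Dexec.args="--runner=BlockingDataflowPipelineRunner --project=<project-id> --stagingLocation=gs://<bucket>/staging --teardownPolicy=TEARDOWN_ON_SUCCESS"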

Python

By default, Google Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. This means that the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sub-locations of the Cloud Storage path that you provide as your --staging_location and/or --temp_location. If you are writing output to a Cloud Storage file, temporary files might be created in the output location before the Write operation is finalized.

Note: You can control whether VMs are deleted when the job completes using the --teardown_policy pipeline option. The valid options are: TEARDOWN_ALWAYS, the default, which always deletes all the VMs; TEARDOWN_NEVER, which leaves all VMs running regardless of failure or success; and TEARDOWN_ON_SUCCESS, which leaves all VMs running only when the job fails. TEARDOWN_ON_SUCCESS may be particularly useful for debugging.

Logged Data

Information stored in Cloud Logging is primarily generated by the code in your Dataflow program. The Dataflow Service may also generate warning and error data in Cloud Logging, but this is the only intermediate data that the service adds to logs.

In-Flight Data

There are two modes in which data is transmitted during pipeline evaluation: when reading/writing from sources and sinks, and between worker instances while data is being processed within the pipeline itself. All communication with Google Cloud sources and sinks is encrypted and is carried over HTTPS. All inter-worker communication occurs over a private network and is subject to your project's permissions and firewall rules.

Data Locality

A pipeline's logic is evaluated on individual Compute Engine instances. You can specify the zone in which those instances, and the private network over which they communicate, are located. Ancillary computations that occur in Google's infrastructure rely on metadata (such as Cloud Storage locations or file sizes). Your data does not ever leave the zone or break your security boundaries.

Telemetry and Metrics

Telemetry data and associated metrics are encrypted at rest, and access to this data is controlled by your Cloud Platform project's read permissions.

Recommended Practice

We recommend that you make use of the security mechanisms available in your pipeline's underlying cloud resources. These mechanisms include the data security capabilities of data sources and sinks such as BigQuery and Cloud Storage. It's also best not to mix different trust levels in a single project.
