Dataflow pipelines can be run locally (to perform tests on small datasets), or on managed Google Cloud resources using the Dataflow managed service. Whether running locally or in the cloud, your pipeline and its workers use a permissions system to maintain secure access to pipeline files and resources. Dataflow permissions are assigned according to the role that's used to access pipeline resources. This document explains the following concepts:
- Upgrading Dataflow VMs.
- Roles and permissions required for running local and Google Cloud pipelines.
- Roles and permissions required for accessing pipeline resources across projects.
- Types of data that the Dataflow service uses, and how data access and security are handled.
Before you begin
Read about Google Cloud project identifiers in the Platform Overview. These identifiers include the project name, project ID, and project number.
Upgrading and patching Dataflow VMs
Dataflow uses Container-Optimized OS. Hence, the security processes of Container-Optimized OS also apply to Dataflow.
Batch pipelines are time-bound and do not require maintenance. When a new batch pipeline starts, the latest Dataflow image is used.
For streaming pipelines, if a security patch is immediately required, Google Cloud notifies you by using security bulletins. We recommend that you use the --update option to restart your job with the latest Dataflow image.
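For example, for a Java pipeline, relaunching the pipeline with options along the following lines replaces the running job with one that uses the latest image; the job name is a placeholder:
--update --jobName=<existing-job-name>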
Dataflow container images are available in the Google Cloud console.
Security and permissions for local pipelines
When you run locally, your Apache Beam pipeline runs as the Google Cloud account that you configured with the Google Cloud CLI executable. Hence, locally run Apache Beam SDK operations and your Google Cloud account have access to the same files and resources.
To list the Google Cloud account that you selected as your default, run the gcloud config list command.
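Running the command produces output similar to the following; the account and project values shown here are placeholders:
gcloud config list
[core]
account = my-account@example.com
project = my-project-id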
Security and permissions for pipelines on Google Cloud
When you run your pipeline, Dataflow uses two service accounts to manage security and permissions:
- The Dataflow service account. The Dataflow service uses the Dataflow service account as part of the job creation request (for example, to check project quota and to create worker instances on your behalf) and during job execution to manage the job.
- The worker service account. Worker instances use the worker service account to access input and output resources after you submit your job.
Dataflow service account
As part of running the Dataflow pipeline, the Dataflow service manipulates resources on your behalf (for example, creating additional VMs). When you run your pipeline on the Dataflow service, it uses the Dataflow service account (service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com).
The Dataflow service account is created when a Dataflow Job resource is first used. This account is granted the Dataflow Service Agent role on the project and has the necessary permissions to run a Dataflow job in the project, including starting Compute Engine workers. The account is used exclusively by the Dataflow service and is specific to your project.
You can review the Dataflow service account's permissions in the Google Cloud console or by using the Google Cloud CLI.
Console
Go to the Roles page.
On the Cloud console toolbar, select your project.
To view the Dataflow service account's permissions, select the Include Google-provided role grants checkbox at the top right, and select the Cloud Dataflow Service Agent checkbox.
gcloud
View the Dataflow service account's permissions:
gcloud iam roles describe roles/dataflow.serviceAgent
Because Google Cloud services expect to have read and write access to the project and its resources, we recommend that you don't change the default permissions that are automatically established for your project. If you remove the permissions for the service account from the Identity and Access Management (IAM) policy, the account continues to be present, because it is owned by the Dataflow service. If the Dataflow service account loses permissions to a project, Dataflow cannot launch VMs or perform other management tasks.
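If the Dataflow Service Agent role has been removed, one way to restore it is to grant the role back to the service account. The following command is a sketch; substitute your project ID and project number:
gcloud projects add-iam-policy-binding <project-id> --member="serviceAccount:service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com" --role="roles/dataflow.serviceAgent"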
Worker service account
Compute Engine instances execute Apache Beam SDK operations in the cloud. These workers use your project's worker service account to access your pipeline's files and other resources. The worker service account is used as the identity for all worker VMs. All requests that originate from the VM use the worker service account. This service account is also used to interact with resources such as Cloud Storage buckets and Pub/Sub topics.
For the worker service account to be able to create, run, and examine a job, ensure that it has the roles/dataflow.admin and roles/dataflow.worker roles. In addition, your user account needs the iam.serviceAccounts.actAs permission in order to impersonate the worker service account.
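As a sketch, granting these roles and the actAs permission might look like the following; the service account name and user email are placeholders, and roles/iam.serviceAccountUser is one role that includes the iam.serviceAccounts.actAs permission:
gcloud projects add-iam-policy-binding <project-id> --member="serviceAccount:my-worker-sa@<project-id>.iam.gserviceaccount.com" --role="roles/dataflow.admin"
gcloud projects add-iam-policy-binding <project-id> --member="serviceAccount:my-worker-sa@<project-id>.iam.gserviceaccount.com" --role="roles/dataflow.worker"
gcloud iam service-accounts add-iam-policy-binding my-worker-sa@<project-id>.iam.gserviceaccount.com --member="user:my-user@example.com" --role="roles/iam.serviceAccountUser"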
Default worker service account
By default, workers use your project's Compute Engine default service account as the worker service account. This service account (<project-number>-compute@developer.gserviceaccount.com) is automatically created when you enable the Compute Engine API for your project from the API Library in the Google Cloud console.
The Compute Engine default service account has broad access to your project's resources, which makes it easy to get started with Dataflow. However, for production workloads, we recommend that you create a new service account with only the roles and permissions that you need.
Specifying a user-managed worker service account
If you want to create and use resources with fine-grained access control, you can create a user-managed service account and use it as the worker service account.
If you do not have a user-managed service account, you must create a service account and grant it the required IAM roles. At a minimum, your service account must have the Dataflow Worker role (roles/dataflow.worker) or a custom IAM role with the required permissions listed for roles/dataflow.worker at Roles. Your service account might also need additional roles to use the Google Cloud resources that your job requires, such as reading from BigQuery or Pub/Sub, or writing to Cloud Storage. For example, if your job reads from BigQuery, your service account must also have at least the roles/bigquery.dataViewer role. Also, ensure that your user-managed service account has read and write access to the staging and temporary locations specified in the Dataflow job.
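The following commands are a minimal sketch of creating a user-managed worker service account and granting it the Dataflow Worker role plus one job-specific role; the account name is a placeholder, and the additional roles depend on what your job reads and writes:
gcloud iam service-accounts create my-worker-sa --display-name="Dataflow worker service account"
gcloud projects add-iam-policy-binding <project-id> --member="serviceAccount:my-worker-sa@<project-id>.iam.gserviceaccount.com" --role="roles/dataflow.worker"
gcloud projects add-iam-policy-binding <project-id> --member="serviceAccount:my-worker-sa@<project-id>.iam.gserviceaccount.com" --role="roles/bigquery.dataViewer"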
The user-managed service account can be in the same project as your job or in a different project. If the service account and the job are in different projects, you must configure the service account before you run the job. You must also grant the Service Account Token Creator role on the user-managed service account to the following Google-managed service accounts, as shown in the example command after this list:
- Compute Engine default service account (<project-number>-compute@developer.gserviceaccount.com)
- Compute Engine Service Agent (service-<project-number>@compute-system.iam.gserviceaccount.com)
- Dataflow Service Agent (service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com)
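As a sketch, the following command grants the role to the Dataflow Service Agent on a user-managed service account; repeat it with each of the other two accounts listed above as the --member value:
gcloud iam service-accounts add-iam-policy-binding my-service-account-name@<project-id>.iam.gserviceaccount.com --member="serviceAccount:service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com" --role="roles/iam.serviceAccountTokenCreator"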
Java
Use the --serviceAccount option and specify your service account when you run your pipeline job:
--serviceAccount=my-service-account-name@<project-id>.iam.gserviceaccount.com
Python
Use the --service_account_email option and specify your service account when you run your pipeline job:
--service_account_email=my-service-account-name@<project-id>.iam.gserviceaccount.com
Go
Use the --service_account_email option and specify your service account when you run your pipeline job:
--service_account_email=my-service-account-name@<project-id>.iam.gserviceaccount.com
You can obtain a list of your project's service accounts from the Permissions page in the Cloud console.
Accessing Google Cloud resources
Your Apache Beam pipelines can access Google Cloud resources, either in the same Google Cloud project or in other projects. These resources include Cloud Storage buckets, BigQuery datasets, Pub/Sub topics and subscriptions, Firestore databases, and Compute Engine images.
To ensure that your Apache Beam pipeline can access these resources, you might need to use the resources' respective access control mechanisms to explicitly grant access to your Dataflow project's worker service account.
If you are using the Compute Engine default service account as the worker service account and only accessing resources within the same project, no further action might be needed, because the default service account has broad access to resources within the same project.
However, if you are using a user-managed worker service account or accessing resources in other projects, additional action might be needed. The following examples assume that the Compute Engine default service account is used, but a user-managed worker service account can be used as well.
Accessing Cloud Storage buckets
To give your Dataflow project access to a Cloud Storage bucket, make the bucket accessible to your Dataflow project's worker service account. You can use Cloud Storage Access Controls to grant the required access.
To obtain a list of your Dataflow project's service accounts, check the IAM & Admin page in the Google Cloud console. After you have the account names, you can run gsutil commands to grant the project's service accounts ownership (read/write permission) of both the bucket and its contents.
To grant your Dataflow project's service accounts access to a Cloud Storage bucket, use the following command in your shell or terminal window:
gsutil acl ch -u <project-number>-compute@developer.gserviceaccount.com:OWNER gs://<bucket>
To grant your Dataflow project's service accounts access to the existing contents of a Cloud Storage bucket, use the following command in your shell or terminal window:
gsutil -m acl ch -r -u <project-number>-compute@developer.gserviceaccount.com:OWNER gs://<bucket>
The previous command grants access only to existing resources. Granting the Dataflow project's service accounts default permission to the bucket allows them to access future resources added to the bucket:
gsutil defacl ch -u <project-number>-compute@developer.gserviceaccount.com:OWNER gs://<bucket>
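The preceding gsutil acl and defacl commands rely on legacy ACLs. If the bucket uses uniform bucket-level access, ACLs are disabled; in that case, a sketch of an equivalent IAM-based grant (using the Storage Object Admin role) is:
gsutil iam ch serviceAccount:<project-number>-compute@developer.gserviceaccount.com:roles/storage.objectAdmin gs://<bucket>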
Accessing BigQuery datasets
You can use the BigQueryIO API to access BigQuery datasets, either in the same project where you're using Dataflow or in a different project. For the BigQuery source and sink to operate properly, the following two accounts must have access to any BigQuery datasets that your Dataflow job reads from or writes to:
- The Google Cloud account that you use to run the Dataflow job.
- The worker service account that runs the Dataflow job.
You might need to configure BigQuery to explicitly grant access to these accounts. See BigQuery Access Control for more information on granting access to BigQuery datasets using either the BigQuery page or the BigQuery API.
Among the required BigQuery permissions, the pipeline needs the bigquery.datasets.get IAM permission to access a BigQuery dataset. Typically, most BigQuery IAM roles include the bigquery.datasets.get permission, but the roles/bigquery.jobUser role is an exception.
For example, if your Google Cloud account is abcde@gmail.com and the project number of the project where you execute the Dataflow job is 123456789, the following accounts must all be granted access to the BigQuery datasets used: abcde@gmail.com and 123456789-compute@developer.gserviceaccount.com.
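Dataset-level access is granted through BigQuery Access Control as described above. As a coarser, project-level sketch, you could instead grant the worker service account read access to all BigQuery data in the project that owns the datasets:
gcloud projects add-iam-policy-binding <project-id> --member="serviceAccount:123456789-compute@developer.gserviceaccount.com" --role="roles/bigquery.dataViewer"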
Accessing Pub/Sub topics and subscriptions
To access a Pub/Sub topic or subscription, use Pub/Sub's Identity and Access Management features to set up permissions for the worker service account.
Permissions from the following Pub/Sub roles are relevant:
- roles/pubsub.subscriber is required to consume data and create subscriptions.
- roles/pubsub.viewer is recommended, so that Dataflow can query the configurations of topics and subscriptions. This has two benefits:
  - Dataflow can check for unsupported settings on subscriptions that might not work as expected.
  - If the subscription does not use the default ack deadline of 10 seconds, performance improves. Dataflow repeatedly extends the ack deadline for a message while it is being processed by the pipeline. Without pubsub.viewer permissions, Dataflow is unable to query the ack deadline and therefore must assume a default deadline, which causes Dataflow to issue more modifyAckDeadline requests than necessary.
If VPC Service Controls is enabled on the project that owns the subscription or topic, IP address-based ingress rules don't allow Dataflow to query the configurations. In this case, an ingress rule based on the worker service account is required.
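As a sketch, the following commands grant these roles on a specific subscription and its topic to the worker service account; the subscription and topic names are placeholders:
gcloud pubsub subscriptions add-iam-policy-binding <subscription-name> --member="serviceAccount:<project-number>-compute@developer.gserviceaccount.com" --role="roles/pubsub.subscriber"
gcloud pubsub subscriptions add-iam-policy-binding <subscription-name> --member="serviceAccount:<project-number>-compute@developer.gserviceaccount.com" --role="roles/pubsub.viewer"
gcloud pubsub topics add-iam-policy-binding <topic-name> --member="serviceAccount:<project-number>-compute@developer.gserviceaccount.com" --role="roles/pubsub.viewer"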
See Sample use case: cross-project communication for more information and some code examples that demonstrate how to use Pub/Sub's Identity and Access Management features.
Accessing Firestore
To access a Firestore database (in Native mode or Datastore mode), add your Dataflow worker service account (for example, <project-number>-compute@developer.gserviceaccount.com) as an editor of the project that owns the database, or use a more restrictive Datastore role such as roles/datastore.viewer.
Also, enable the Firestore API in both projects from the API Library in the Google Cloud console.
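As a sketch, granting the more restrictive role and enabling the API might look like the following; run the enable command in each project that needs the Firestore API:
gcloud projects add-iam-policy-binding <project-id> --member="serviceAccount:<project-number>-compute@developer.gserviceaccount.com" --role="roles/datastore.viewer"
gcloud services enable firestore.googleapis.com --project=<project-id>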
Accessing images for projects with a trusted image policy
If you have a trusted image policy set up for your project and your boot image is located in another project, ensure that the trusted image policy is configured to have access to the image.
For example, if you are running a templated Dataflow job, ensure that the policy file includes access to the dataflow-service-producer-prod project. This Google Cloud project contains the images for template jobs.
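As a sketch, assuming the trusted image constraint is managed with the resource-manager org-policies commands in your organization, adding the template image project to the allowed list might look like the following:
gcloud resource-manager org-policies allow compute.trustedImageProjects projects/dataflow-service-producer-prod --project=<project-id>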
Data access and security
The Dataflow service works with two kinds of data:
- End-user data. This is the data that a Dataflow pipeline processes. A typical pipeline reads data from one or more sources, transforms the data, and writes the results to one or more sinks. All the sources and sinks are storage services that are not directly managed by Dataflow.
- Operational data. This data includes all the metadata that is required to manage a Dataflow pipeline, including both user-provided metadata, such as a job name or pipeline options, and system-generated metadata, such as a job ID.
The Dataflow service uses several security mechanisms to keep your data secure and private. These mechanisms apply to the following scenarios:
- Submitting a pipeline to the service
- Evaluating a pipeline
- Requesting access to telemetry and metrics during and after a pipeline execution
- Using a Dataflow service such as Shuffle or Streaming Engine
Data locality
All the core data processing for the Dataflow service happens in
the region that is specified in the pipeline code. If a region is not specified,
the default region us-central1
is used. A pipeline job can optionally
read and write from sources and sinks in other regions if you specify that option in the
pipeline code. However, the actual data processing occurs only in the region
that is specified to run the Dataflow VMs.
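For example, launching a Python pipeline with an option along the following lines keeps core data processing in the specified region; the region value here is only an illustration:
--region=europe-west1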
Pipeline logic is evaluated on individual worker VM instances. You can specify the zone in which these instances and the private network over which they communicate are located. Ancillary computations for the platform depend on metadata such as Cloud Storage locations or file sizes.
Dataflow is a regional service. For more information about data locality and regional endpoints, see Regional endpoints.
Data in a pipeline submission
The IAM permissions for your Google Cloud project control access to the Dataflow service. Any principals who are given editor or owner rights to your project can submit pipelines to the service. To submit pipelines, you must authenticate using the Google Cloud CLI. After you are authenticated, your pipelines are submitted using the HTTPS protocol. For instructions about how to authenticate with your Google Cloud account credentials, see the quickstart for the language that you are using.
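For example, you typically authenticate the gcloud CLI (and, if your pipeline code uses Application Default Credentials, set those up too) before submitting a pipeline; the exact steps depend on your environment:
gcloud auth login
gcloud auth application-default login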
Data in a pipeline evaluation
As part of evaluating a pipeline, temporary data might be generated and stored locally in the worker VM instances or in Cloud Storage. Temporary data is encrypted at rest and does not persist after a pipeline evaluation concludes. Such data can also be stored in the Shuffle service or Streaming Engine service (if you have opted for the service) in the same region as specified in the Dataflow pipeline.
Java
By default, Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. This means that the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sublocations of the Cloud Storage path that you provide as your --stagingLocation or --tempLocation. If you are writing output to a Cloud Storage file, temporary files might be created in the output location before the Write operation is finalized.
Python
By default, Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. This means that the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sublocations of the Cloud Storage path that you provide as your --staging_location or --temp_location. If you are writing output to a Cloud Storage file, temporary files might be created in the output location before the Write operation is finalized.
Go
By default, Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. This means that the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sublocations of the Cloud Storage path that you provide as your --staging_location or --temp_location. If you are writing output to a Cloud Storage file, temporary files might be created in the output location before the Write operation is finalized.
Data in pipeline logs and telemetry
Information stored in Cloud Logging is primarily generated by the code in your Dataflow program. The Dataflow service might also generate warning and error data in Cloud Logging, but this is the only intermediate data that the service adds to logs. Cloud Logging is a global service.
Telemetry data and associated metrics are encrypted at rest, and access to this data is controlled by your Google Cloud project's read permissions.
Data in Dataflow services
If you use Dataflow Shuffle or Streaming Engine for your pipeline, don't specify the zone pipeline option. Instead, specify the region and set the value to one of the regions where Shuffle or Streaming Engine is currently available. Dataflow automatically selects the zone in the region that you specified. The end-user data in transit stays within the worker VMs and in the same zone. These Dataflow jobs can still read from and write to sources and sinks that are outside the VM zone. The data in transit can also be sent to the Dataflow Shuffle or Streaming Engine services; however, the data always remains in the region specified in the pipeline code.
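For example, a Python streaming pipeline that uses Streaming Engine would typically be launched with options along the following lines, setting a region but no zone; the region value is only an illustration:
--enable_streaming_engine --region=us-central1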
Recommended practice
We recommend that you use the security mechanisms available in your pipeline's underlying cloud resources. These mechanisms include the data security capabilities of data sources and sinks such as BigQuery and Cloud Storage. It's also best not to mix different trust levels in a single project.