Dataflow pipelines can be run either locally (to perform tests on small datasets), or on managed Google Cloud resources using the Dataflow managed service. Whether running locally or in the cloud, your pipeline and its workers use a permissions system to maintain secure access to pipeline files and resources. Dataflow permissions are assigned according to the role used to access pipeline resources. The sections below explain the roles and permissions associated with local and cloud pipelines, the default settings, and how to check your project’s permissions.
Before you begin
Read about Google Cloud project identifiers in the Platform Overview. These identifiers include the project name, project ID, and project number.
Security and permissions for local pipelines
Google Cloud Platform account
When you run locally, your Apache Beam pipeline runs as the Google Cloud account that you configured with the gcloud command-line tool. Hence, locally-run Apache Beam SDK operations have access to the files and resources that your Google Cloud account has access to.
To list the Google Cloud account you selected as your default, run the following command in your shell or terminal:
gcloud config list
Note: Local pipelines can output data to local destinations, such as local files, or to cloud destinations, such as Cloud Storage or BigQuery. If your locally-run pipeline writes files to cloud-based resources such as Cloud Storage, it uses your Google Cloud account credentials and the Google Cloud project that you configured as the gcloud command-line tool default. For instructions about how to authenticate with your Google Cloud account credentials, see the quickstart for the language you are using.
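If you have not yet authenticated locally, a typical setup looks roughly like the following; the project ID is a placeholder, and the exact steps for your SDK are covered in the quickstart:
# Authenticate your user account and set up Application Default Credentials
# so that locally-run pipelines can reach cloud resources such as Cloud Storage.
gcloud auth login
gcloud auth application-default login

# Set the default project used by locally-run pipelines.
gcloud config set project <project-id>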
Security and permissions for pipelines on Google Cloud Platform
When you run your pipeline, Dataflow uses two service accounts to manage security and permissions: the Dataflow service account and the controller service account. The Dataflow service uses the Dataflow service account as part of the job creation request (for example, to check project quota and to create worker instances on your behalf), and during job execution to manage the job. Worker instances use the controller service account to access input and output resources after you submit your job.
Cloud Dataflow service account
As part of Dataflow's pipeline execution, the Dataflow service manipulates resources on your behalf (for example, creating additional VMs). When you run your pipeline on the Dataflow service, it uses a service account (service-<project-number>@dataflow-service-producer-prod.iam.gserviceaccount.com). This account is created automatically when a Dataflow project is created, is assigned the Dataflow Service Agent role on the project, and has the necessary permissions to run a Dataflow job under the project, including starting Compute Engine workers. The account is used exclusively by the Dataflow service and is specific to your project.
You can review the permissions of the Dataflow service account using the gcloud command-line tool by typing the following command into your shell or terminal:
gcloud iam roles describe roles/dataflow.serviceAgent
Since Google Cloud services expect to have read/write access to the project and its resources, it is recommended that you do not change the default permissions automatically established for your project. If you remove the permissions for the service account from the IAM policy, the account continues to be present, because it is owned by the Dataflow service. However, if the Dataflow service account loses permissions to a project, Dataflow cannot launch VMs or perform other management tasks.
Best practice: Create a bucket owned by your project to use as the staging bucket for Dataflow. Doing this will ensure that permissions are automatically set correctly for staging your pipeline’s executable files.
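For example, you might create a dedicated staging bucket in your own project with gsutil; the bucket name and location below are placeholders:
# Create a Cloud Storage bucket owned by your Dataflow project to hold staged pipeline files.
gsutil mb -p <project-id> -l us-central1 gs://<staging-bucket>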
Controller service account
Compute Engine instances execute Apache Beam SDK operations in the cloud. These workers use your project’s controller service account to access your pipeline’s files and other resources. Dataflow also uses the controller service account to perform “metadata” operations, which don’t run on your local client or on Compute Engine workers. These operations perform tasks such as determining input sizes and accessing Cloud Storage files.
Default controller service account
By default, workers use your project's Compute Engine service account as the controller service account. This service account (<project-number>-compute@developer.gserviceaccount.com) is automatically created when you enable the Compute Engine API for your project from the APIs page in the Google Cloud Console.
In addition, the Compute Engine service account associated with a project has access to Cloud Storage buckets and resources owned by the project by default. Since most Compute Engine workers expect to have read/write access to project resources, it is recommended that you do not change the automatically established default permissions for your project.
You can obtain a list of your project’s service accounts from the Permissions page in the Cloud Console.
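You can also list these accounts from the command line; a minimal sketch, assuming the gcloud command-line tool is installed and using a placeholder project ID:
# List the service accounts in your Dataflow project, including the Compute Engine default account.
gcloud iam service-accounts list --project <project-id>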
Specifying a user-managed controller service account
If you want to create and use resources with fine-grained access and control, you can use a service account from your job's project as the user-managed controller service account.
If you do not have a user-managed service account, you must create a service account in the same project as your job and grant it the required IAM roles, as shown in the sketch below. At a minimum, your service account must have the Dataflow Worker role. It might also need additional roles to use the Cloud Platform resources your job requires, such as BigQuery, Pub/Sub, or Cloud Storage. For example, if your job reads from BigQuery, your service account must also have at least the bigquery.dataViewer role.
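A minimal sketch of those two steps with the gcloud command-line tool; the service account name is hypothetical and the project ID is a placeholder:
# Create a user-managed service account in the same project as your Dataflow job.
gcloud iam service-accounts create my-dataflow-worker --display-name "Dataflow controller service account"

# Grant it the minimum role needed to run as a Dataflow worker;
# add further roles (for example, roles/bigquery.dataViewer) as your job requires.
gcloud projects add-iam-policy-binding <project-id> \
    --member serviceAccount:my-dataflow-worker@<project-id>.iam.gserviceaccount.com \
    --role roles/dataflow.worker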
Java
Use the --serviceAccount option and specify your service account when you run your pipeline job:
--serviceAccount=my-service-account-name@<project-id>.iam.gserviceaccount.com
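For example, a complete Maven-based submission of a Java pipeline with a user-managed controller service account might look like the following; the main class, bucket, and account name are placeholders:
# Submit a Java pipeline to Dataflow using a user-managed controller service account.
mvn compile exec:java \
    -Dexec.mainClass=com.example.MyPipeline \
    -Dexec.args="--runner=DataflowRunner \
        --project=<project-id> \
        --tempLocation=gs://<bucket>/temp \
        --serviceAccount=my-service-account-name@<project-id>.iam.gserviceaccount.com"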
Python
Use the --service_account_email option and specify your service account when you run your pipeline job:
--service_account_email=my-service-account-name@<project-id>.iam.gserviceaccount.com
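For example, a complete submission of a Python pipeline with a user-managed controller service account might look like the following; the script name, bucket, and account name are placeholders:
# Submit a Python pipeline to Dataflow using a user-managed controller service account.
python my_pipeline.py \
    --runner DataflowRunner \
    --project <project-id> \
    --temp_location gs://<bucket>/temp \
    --service_account_email my-service-account-name@<project-id>.iam.gserviceaccount.com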
Accessing Google Cloud Platform resources across multiple Google Cloud Platform projects
Your Apache Beam pipelines can access Google Cloud resources in other Google Cloud projects. These include:
Java
- Cloud Storage Buckets
- BigQuery Datasets
- Pub/Sub Topics and Subscriptions
- Datastore Datasets
Python
- Cloud Storage Buckets
- BigQuery Datasets
- Datastore Datasets
To ensure that your Apache Beam pipeline can access these resources across projects, you'll need to use the resources' respective access control mechanisms to explicitly grant access to your Dataflow project's controller service account.
Accessing Cloud Storage buckets across Google Cloud Platform projects
To give your Dataflow project access to a Cloud Storage bucket owned by a different Google Cloud project, you need to make the bucket accessible to your Dataflow project's controller service account. You can use Cloud Storage Access Controls to grant the required access.
To obtain a list of your Dataflow project's service accounts, check the IAM & Admin page in the Cloud Console. Once you have the account names, you can run gsutil commands to grant the project's service accounts ownership (read/write permission) of both the bucket and its contents.
To grant your Dataflow project's service accounts access to a Cloud Storage bucket in another project, use the following command in your shell or terminal window:
gsutil acl ch -u <project-number>-compute@developer.gserviceaccount.com:OWNER gs://<bucket>
To grant your Dataflow project's service accounts access to the existing contents of a Cloud Storage bucket in another project, use the following command in your shell or terminal window:
gsutil -m acl ch -r -u <project-number>-compute@developer.gserviceaccount.com:OWNER gs://<bucket>
Note: The -m option runs the command in parallel for quicker processing; the -r option runs the command recursively on resources within the bucket.
Note that the prior command only grants access to existing resources. Granting the Dataflow project's service accounts default permission to the bucket allows them to access future resources added to the bucket:
gsutil defacl ch -u <project-number>-compute@developer.gserviceaccount.com:OWNER gs://<bucket>
Accessing BigQuery datasets across Google Cloud Platform projects
You can use the BigQueryIO API to access BigQuery datasets owned by a different Google Cloud project (i.e., not the project with which you're using Dataflow). For the BigQuery source and sink to operate properly, the following two accounts must have access to any BigQuery datasets that your Dataflow job reads from or writes to:
- The Google Cloud account you use to execute the Dataflow job
- The controller service account running the Dataflow job
You might need to configure BigQuery to explicitly grant access to these accounts. See BigQuery Access Control for more information on granting access to BigQuery datasets using either the BigQuery Web UI or the BigQuery API.
For example, if your Google Cloud account is abcde@gmail.com and the project number of the project where you execute the Dataflow job is 123456789, both of the following accounts must be granted access to the BigQuery datasets used: abcde@gmail.com and 123456789-compute@developer.gserviceaccount.com.
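One command-line sketch for granting that access uses the bq tool to edit the dataset's access list; the dataset name and file name are placeholders, and the same change can be made in the BigQuery Web UI or through the BigQuery API:
# Export the dataset's metadata, including its current access list.
bq show --format=prettyjson <dataset-project-id>:<dataset> > dataset.json

# Edit dataset.json to add READER (or WRITER) entries for your Google Cloud account
# and your controller service account, then apply the updated access list.
bq update --source dataset.json <dataset-project-id>:<dataset>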
Accessing Cloud Pub/Sub topics and subscriptions across Google Cloud Platform projects
To access a Pub/Sub topic or subscription owned by a different Google Cloud project, use Pub/Sub's Identity and Access Management features to set up cross-project permissions. Dataflow uses the controller service account to run your jobs, and you need to grant this service account access to the Pub/Sub resources in the other project.
Permissions from the following Pub/Sub roles are required:
- roles/pubsub.subscriber
- roles/pubsub.viewer
See Sample Use Case: Cross-Project Communication for more information and some code examples that demonstrate how to use Pub/Sub's Identity and Access Management features.
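As a sketch, you can grant these roles to the controller service account with the gcloud command-line tool; the topic and subscription names are placeholders, and the bindings must be applied in the project that owns the Pub/Sub resources:
# Allow the controller service account to pull from a subscription in the other project.
gcloud pubsub subscriptions add-iam-policy-binding <subscription> \
    --project <pubsub-project-id> \
    --member serviceAccount:<project-number>-compute@developer.gserviceaccount.com \
    --role roles/pubsub.subscriber

# Allow it to view topic and subscription metadata.
gcloud pubsub topics add-iam-policy-binding <topic> \
    --project <pubsub-project-id> \
    --member serviceAccount:<project-number>-compute@developer.gserviceaccount.com \
    --role roles/pubsub.viewer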
Accessing Cloud Datastore across Google Cloud Platform projects
To access a Datastore owned by a different Google Cloud
project, you'll need to add your Dataflow project's
Compute Engine (<project-number>-compute@developer.gserviceaccount.com
)
service account as editor of the project that owns the Datastore.
You will also need to enable the Datastore API in both projects at
https://console.cloud.google.com/project/<project-id>/apiui/apiview/datastore/overview
.
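A minimal sketch of both steps with the gcloud command-line tool; the project IDs are placeholders:
# Add your Dataflow project's controller service account as an editor of the project that owns the Datastore data.
gcloud projects add-iam-policy-binding <datastore-project-id> \
    --member serviceAccount:<project-number>-compute@developer.gserviceaccount.com \
    --role roles/editor

# Enable the Datastore API in both projects.
gcloud services enable datastore.googleapis.com --project <project-id>
gcloud services enable datastore.googleapis.com --project <datastore-project-id>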
Data access and security
The Dataflow service uses several security mechanisms to keep your data secure and private. These mechanisms apply to the following scenarios:
- When you submit a pipeline to the service
- When the service evaluates your pipeline
- When you request access to telemetry and metrics during and after pipeline execution
Pipeline submission
Your Google Cloud project permissions control access to the Dataflow service. Any members of your project given edit or owner rights can submit pipelines to the service. To submit pipelines, you must authenticate using the gcloud command-line tool. Once authenticated, your pipelines are submitted using the HTTPS protocol. For instructions about how to authenticate with your Google Cloud account credentials, see the quickstart for the language you are using.
Pipeline evaluation
Temporary data
As part of evaluating a pipeline, temporary data might be generated and stored locally in the workers or in Cloud Storage. Temporary data is encrypted at rest, and does not persist after a pipeline's evaluation concludes.
Java
By default, Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. This means that the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sub-locations of the Cloud Storage path that you provide as your --stagingLocation and/or --tempLocation. If you are writing output to a Cloud Storage file, temporary files might be created in the output location before the Write operation is finalized.
Python
By default, Compute Engine VMs are deleted when the Dataflow job completes, regardless of whether the job succeeds or fails. This means that the associated Persistent Disk, and any intermediate data that might be stored on it, is deleted. The intermediate data stored in Cloud Storage can be found in sub-locations of the Cloud Storage path that you provide as your --staging_location and/or --temp_location. If you are writing output to a Cloud Storage file, temporary files might be created in the output location before the Write operation is finalized.
Logged data
Information stored in Cloud Logging is primarily generated by the code in your Dataflow program. The Dataflow service may also generate warning and error data in Cloud Logging, but this is the only intermediate data that the service adds to logs.
In-flight data
There are two modes in which data is transmitted during pipeline evaluation:
- When reading from and writing to sources and sinks.
- Between worker instances while data is being processed within the pipeline itself.
All communication with Google Cloud sources and sinks is encrypted and is carried over HTTPS. All inter-worker communication occurs over a private network and is subject to your project's permissions and firewall rules.
Data locality
A pipeline's logic is evaluated on individual Compute Engine instances. You can specify the zone in which those instances, and the private network over which they communicate, are located. Ancillary computations that occur in Google's infrastructure rely on metadata (such as Cloud Storage locations or file sizes). Your data does not ever leave the zone or break your security boundaries.
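For example, a Java pipeline can pin its workers to a particular zone with the --zone pipeline option when you submit the job; the zone value below is only an illustration:
--zone=us-central1-f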
Telemetry and metrics
Telemetry data and associated metrics are encrypted at rest, and access to this data is controlled by your Google Cloud project's read permissions.
Recommended practice
We recommend that you make use of the security mechanisms available in your pipeline's underlying cloud resources. These mechanisms include the data security capabilities of data sources and sinks such as BigQuery and Cloud Storage. It's also best not to mix different trust levels in a single project.