The Dataflow region stores and handles metadata about your Dataflow job and deploys and controls your Dataflow workers.
Region names follow a standard convention based on Compute Engine region names. For example, the name for the Central US region is `us-central1`.
This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations.
Guidelines for choosing a region
Use the following guidelines to choose an appropriate region for your job.
Security and compliance
You might need to constrain Dataflow job processing to a specific geographic region in support of the security and compliance needs of your project.
Data locality
You can minimize network latency and network transport costs by running a Dataflow job from the same region as its sources, sinks, staging file locations, and temporary file locations. If you use sources, sinks, staging file locations, or temporary file locations that are located outside of your job's region, your data might be sent across regions.
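For example, the following is a minimal sketch that uses the Apache Beam Python SDK, assuming a hypothetical project ID and a Cloud Storage bucket that was created in the same region as the job:

```python
# A minimal sketch, assuming the Apache Beam Python SDK. The project ID and
# bucket name are hypothetical; the bucket is assumed to exist in the same
# region as the job (us-central1 in this example).
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",                                   # job region
    staging_location="gs://my-us-central1-bucket/staging",  # same region as the job
    temp_location="gs://my-us-central1-bucket/temp",        # same region as the job
)
```

Keeping the staging and temporary locations in the job's region avoids cross-region movement of those files.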
When you run a pipeline, user data is handled only by the Dataflow worker pool, and the movement of that data is restricted to the network paths that connect the Dataflow workers in the pool.
Although user data is strictly handled by Dataflow workers in their assigned geographic region, pipeline log messages are stored in Cloud Logging, which has a single global presence in Google Cloud.
If you need more control over the location of pipeline log messages, you can do the following:
- Create an exclusion filter for the `_Default` log router sink to prevent Dataflow logs from being exported to the `_Default` log bucket.
- Create a log bucket in the region of your choice.
- Configure a new log router sink that exports your Dataflow logs to your new log bucket, as sketched after this list.
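For the sink step, the following is a minimal sketch that uses the google-cloud-logging Python client library. The project ID, sink name, region, and log bucket ID are hypothetical, and the regional log bucket and the `_Default` exclusion filter are assumed to have been created already:

```python
# A minimal sketch, assuming the google-cloud-logging Python client library.
# The project ID, sink name, and log bucket are hypothetical; the regional
# log bucket is assumed to already exist in your chosen region.
from google.cloud import logging_v2

client = logging_v2.Client(project="my-project")

# Route Dataflow job logs to the regional log bucket.
sink = client.sink(
    "dataflow-regional-sink",
    filter_='resource.type="dataflow_step"',
    destination=(
        "logging.googleapis.com/projects/my-project/"
        "locations/us-central1/buckets/dataflow-logs"
    ),
)
sink.create()
```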
To learn more about configuring logging, see Routing and storage overview and Log routing overview.
Notes about common Dataflow job sources:
- When using a Cloud Storage bucket as a source, we recommend that you perform read operations in the same region as the bucket.
- Messages published to the global Pub/Sub endpoint are stored in the nearest Google Cloud region. However, you can modify the topic's message storage policy to restrict storage to a specific region or set of regions, as sketched after this list. In contrast, Pub/Sub Lite topics support only zonal storage.
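The following is a minimal sketch that uses the google-cloud-pubsub Python client library, assuming hypothetical project and topic names, to restrict a topic's message storage to a single region at creation time:

```python
# A minimal sketch, assuming the google-cloud-pubsub Python client library.
# The project and topic names are hypothetical.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

# Restrict message storage for this topic to a single region so that the
# data stays in the same region as the Dataflow job that reads it.
publisher.create_topic(
    request={
        "name": topic_path,
        "message_storage_policy": {
            "allowed_persistence_regions": ["us-central1"],
        },
    }
)
```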
Resilience and geographic separation
You might want to isolate your normal Dataflow operations from outages that could occur in other geographic regions. Or, you might need to plan alternate sites for business continuity in the event of a region-wide disaster.
In your disaster recovery and business continuity plans, we recommend incorporating details for sources and sinks used with your Dataflow jobs. The Google Cloud sales team can help you work towards meeting your requirements.
Regional placement
By default, the region that you select configures the Dataflow worker pool to utilize all available zones within the region. Zone selection is calculated for each worker at its creation time, optimizing for resource acquisition and utilization of unused reservations.
Regional placement offers benefits such as:
- Improved resource availability: Dataflow jobs are more resilient to zonal resource availability errors, because workers can continue to be created in other zones with remaining availability.
- Improved reliability: In the event of a zonal failure, Dataflow jobs can continue to run, because workers are recreated in other zones.
The following limitations apply:
- Regional placement is supported only for jobs that use Streaming Engine or Dataflow Shuffle. Jobs that have opted out of Streaming Engine or Dataflow Shuffle cannot use regional placement. A sketch of opting in to Streaming Engine follows this list.
- Regional placement applies to VMs only, and doesn't apply to backend resources.
- VMs are not replicated across multiple zones. If a VM becomes unavailable, for example, its work items are considered lost and are reprocessed by another VM.
- If a region-wide stockout occurs, the Dataflow service cannot create any more VMs.
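As noted in the first limitation, regional placement requires Streaming Engine or Dataflow Shuffle. The following is a minimal sketch of opting a streaming job in to Streaming Engine with the Apache Beam Python SDK; the project, bucket, and region values are hypothetical:

```python
# A minimal sketch, assuming the Apache Beam Python SDK. The project ID,
# bucket, and region are hypothetical placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",                 # workers can use any zone in this region
    temp_location="gs://my-bucket/temp",
    streaming=True,                       # streaming pipeline
    enable_streaming_engine=True,         # opt in to Streaming Engine
)
```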
View job resource zones
Dataflow jobs depend on internal resources. Some of these backend job resources are zonal. If a single zone fails and a zonal resource necessary for your Dataflow job is in that zone, the job might fail.
To understand whether a job failed because of a zonal outage, review the service zones that your job's backend resources are using. This feature is only available for Streaming Engine jobs.
To view the service zones in the Google Cloud console, use the Service zones field in the Job info panel.
To review the service zones by using the API, use the `ServiceResources` field.
The values in this field update throughout the duration of the job, because the resources that the job uses change while the job runs.
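The following is a minimal sketch that reads the field with the google-api-python-client library and the Dataflow `v1b3` REST API. The project, region, and job ID are hypothetical, and the field is assumed to be returned as `serviceResources` with a list of zones:

```python
# A minimal sketch, assuming the google-api-python-client library and a
# Streaming Engine job. The project ID, region, and job ID are hypothetical.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
job = (
    dataflow.projects()
    .locations()
    .jobs()
    .get(
        projectId="my-project",
        location="us-central1",
        jobId="my-job-id",
        view="JOB_VIEW_ALL",
    )
    .execute()
)

# Print the zones used by the job's backend service resources, if present.
print(job.get("serviceResources", {}).get("zones", []))
```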
Automatic zone placement
For jobs that aren't supported for regional placement, the best zone within the region is selected automatically, based on the available zone capacity at the time of the job creation request. Automatic zone selection helps ensure that job workers run in the best zone for your job.
Because the job is configured to run in a single zone, the operation might fail with a zonal resource availability error if sufficient Compute Engine resources are not available.
If a stockout occurs in a region, you might see a `ZONE_RESOURCE_POOL_EXHAUSTED` error. You can implement a retry loop to start the job when resources are available, as sketched below.
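The retry loop might look like the following minimal sketch. It assumes a hypothetical `launch_job` callable that submits the Dataflow job and raises an exception whose message contains the error code when zonal resources are unavailable; the attempt count and wait time are arbitrary:

```python
import time


def launch_job_with_retry(launch_job, max_attempts=5, wait_seconds=300):
    """Retries a hypothetical job-launch callable on zonal stockouts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return launch_job()
        except Exception as error:
            # Retry only on zonal stockouts, and give up after the last attempt.
            if "ZONE_RESOURCE_POOL_EXHAUSTED" not in str(error) or attempt == max_attempts:
                raise
            time.sleep(wait_seconds)  # wait for capacity to free up before retrying
```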
Also, when a zone is unavailable, the streaming backend can become unavailable, which might result in data loss.
Specify a region
To specify a region for your job, set the `--region` option to one of the supported regions. The `--region` option overrides the default region that is set in the metadata server, your local client, or the environment variables.
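For example, the following is a minimal sketch with the Apache Beam Python SDK, where `--region` is passed as a flag at launch time; the file, project, and bucket names are hypothetical:

```python
# A minimal sketch, assuming the Apache Beam Python SDK. Launch it with, for
# example:
#   python my_pipeline.py --runner=DataflowRunner --project=my-project \
#       --region=us-central1 --temp_location=gs://my-bucket/temp
import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(sys.argv[1:])  # picks up --region and the other flags

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["hello", "dataflow"])
        | "Print" >> beam.Map(print)
    )
```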
The Dataflow command-line interface also supports the `--region` option to specify regions.
Override the worker region or zone
By default, when you submit a job with the `--region` option, workers are automatically assigned to either zones across the region or the single best zone within the region, depending on the job type.
In cases where you want to ensure that the workers for your Dataflow job run strictly in a specific zone, you can specify the zone using the following pipeline option. This usage pattern is uncommon for Dataflow jobs.
This option only controls the zone used for the Dataflow workers. It doesn't apply to backend resources. Backend resources might be created in any zone within the job region.
- Java: `--workerZone`
- Python: `--worker_zone`
- Go: `--worker_zone`
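For example, the following is a minimal sketch with the Python option; the project, bucket, and zone values are hypothetical:

```python
# A minimal sketch, assuming the Apache Beam Python SDK. The values are
# hypothetical placeholders; worker_zone pins all workers to a single zone.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",                 # job region
    worker_zone="us-central1-b",          # workers run only in this zone
    temp_location="gs://my-bucket/temp",
)
```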
For all other cases, we don't recommend overriding the worker location. The common scenarios table contains usage recommendations for these situations.
Because the job is configured to run in a single zone, the operation might fail with a zonal resource availability error if sufficient Compute Engine resources are not available.
You can run the `gcloud compute regions list` command to see a listing of regions and zones that are available for worker deployment.
Common scenarios
The following table contains usage recommendations for common scenarios.
| Scenario | Recommendation |
|---|---|
| I want to use a supported region and have no zone preference within the region. In this case, the best zone is automatically selected based on available capacity. | Use `--region` to specify a job region. This ensures that Dataflow manages your job and processes data within the specified region. |
| I need worker processing to occur in a specific zone of a region. | Specify both `--region` and either `--workerZone` or `--worker_zone`. Use `--region` to specify the job region, and use `--workerZone` or `--worker_zone` to specify the zone within that region. |