A Dataflow regional endpoint stores and handles metadata about your Dataflow job and deploys and controls your Dataflow workers.
Regional endpoint names follow a standard convention based on Compute Engine
region names.
For example, the name for the Central US region is `us-central1`.
This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations.
Guidelines for choosing a regional endpoint
Specifying a regional endpoint for a Dataflow job is mandatory.
Security and compliance
You might need to constrain Dataflow job processing to a specific geographic region in support of your project’s security and compliance needs.
Data locality
You can minimize network latency and network transport costs by running a Dataflow job from the same region as its sources, sinks, staging file locations, and temporary file locations. If you use sources, sinks, staging file locations, or temporary file locations that are located outside of your job's region, your data might be sent across regions.
When you run a pipeline, user data is handled only by the Dataflow worker pool, and the movement of that data is restricted to the network paths that connect the workers in the pool.
If you need more control over the location of pipeline log messages, you can do the following:
- Create an exclusion filter for the `_Default` log router sink to prevent Dataflow logs from being exported to the `_Default` log bucket.
- Create a log bucket in the region of your choice.
- Configure a new log router sink that exports your Dataflow logs to your new log bucket.
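As a minimal sketch with the gcloud CLI, assuming Dataflow job logs are matched by the `resource.type="dataflow_step"` filter, these steps might look like the following (the bucket name, sink name, region, and `PROJECT_ID` are placeholder assumptions):

```sh
# 1. Stop Dataflow logs from landing in the _Default log bucket.
gcloud logging sinks update _Default \
    --add-exclusion=name=exclude-dataflow,filter='resource.type="dataflow_step"'

# 2. Create a log bucket in the region of your choice
#    (europe-west1 is only an example).
gcloud logging buckets create dataflow-logs \
    --location=europe-west1

# 3. Route Dataflow logs to the new regional bucket.
gcloud logging sinks create dataflow-logs-sink \
    logging.googleapis.com/projects/PROJECT_ID/locations/europe-west1/buckets/dataflow-logs \
    --log-filter='resource.type="dataflow_step"'
```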
To learn more about configuring logging, see Routing and storage overview and Log routing overview.
Notes about common Dataflow job sources:
- When using a Cloud Storage bucket as a source, we recommend that you perform read operations in the same [region as the bucket](/storage/docs/bucket-locations).
- Pub/Sub topics, when published to the global Pub/Sub endpoint, are stored in the nearest Google Cloud region. However, you can modify the topic's message storage policy to restrict storage to a specific region or set of regions. Pub/Sub Lite topics, in contrast, support only zonal storage.
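For example, here is a hedged sketch of verifying a source bucket's region and pinning a topic's message storage to the job's region (the bucket and topic names are placeholders):

```sh
# Check which location a Cloud Storage source bucket uses.
gcloud storage buckets describe gs://MY_BUCKET --format="value(location)"

# Restrict a Pub/Sub topic's message storage to us-central1,
# matching the region of the Dataflow job that reads from it.
gcloud pubsub topics create my-topic \
    --message-storage-policy-allowed-regions=us-central1
```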
Resilience and geographic separation
You might want to isolate your normal Dataflow operations from outages that could occur in other geographic regions. Or, you may need to plan alternate sites for business continuity in the event of a region-wide disaster.
Auto zone placement
By default, a regional endpoint automatically selects the best zone within the region based on the available zone capacity at the time of the job creation request. Automatic zone selection helps ensure that job workers run in the best zone for your job.
Specifying a regional endpoint
To specify a regional endpoint for your job, set the `--region` option to one of the supported regional endpoints. The `--region` option overrides the default region that is set in the metadata server, your local client, or environment variables.

The Dataflow command-line interface also supports the `--region` option to specify regional endpoints.
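For example, a minimal launch of the Apache Beam wordcount example with an explicit regional endpoint might look like the following; the project ID and bucket are placeholder assumptions, and keeping `--temp_location` in the same region also helps with data locality:

```sh
# --region selects the regional endpoint that manages the job.
python -m apache_beam.examples.wordcount \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-central1 \
    --temp_location=gs://MY_BUCKET/temp \
    --input=gs://dataflow-samples/shakespeare/kinglear.txt \
    --output=gs://MY_BUCKET/results/output
```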
Overriding the worker region or zone
By default, when you submit a job with the `--region` option, the regional endpoint automatically assigns workers to the best zone within the region. However, you may want to specify either a region or a specific zone (using `--worker_region` or `--worker_zone`, respectively) for your worker instances.
You might want to override the worker location in the following cases:
- Your workers are in a region or zone that does not have a regional endpoint, and you want to use a regional endpoint that is closer to that region or zone.
- You want to ensure that data processing for your Dataflow job occurs strictly in a specific region or zone.
For all other cases, we do not recommend overriding the worker location. The common scenarios table contains usage recommendations for these situations.
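As a sketch, the following launch keeps the `us-central1` regional endpoint in charge of the job while pinning workers to a specific zone (placeholders as before):

```sh
# Workers run in us-central1-f; the us-central1 regional
# endpoint still manages the job. Use --worker_region instead
# of --worker_zone to pin a region rather than a zone.
python -m apache_beam.examples.wordcount \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-central1 \
    --worker_zone=us-central1-f \
    --temp_location=gs://MY_BUCKET/temp \
    --input=gs://dataflow-samples/shakespeare/kinglear.txt \
    --output=gs://MY_BUCKET/results/output
```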
You can run the `gcloud compute regions list` command to see a list of regions and zones that are available for worker deployment.
Common scenarios
The following table contains usage recommendations for common scenarios.
| Scenario | Recommendation |
|---|---|
| I want to use a supported regional endpoint and have no zone preference within the region. In this case, the regional endpoint automatically selects the best zone based on available capacity. | Use `--region` to specify a regional endpoint. This ensures that Dataflow manages your job and processes data within the specified region. |
| I need worker processing to occur in a specific zone of a region that has a regional endpoint. | Specify both `--region` and `--worker_zone`. Use `--region` to specify the regional endpoint and `--worker_zone` to specify the zone. |
| I need worker processing to occur in a specific region that does not have a regional endpoint. | Specify both `--region` and `--worker_region`. Use `--region` to specify the regional endpoint and `--worker_region` to specify the worker region. |
| I need to use Dataflow Shuffle. | Use `--region` to specify a regional endpoint that supports Dataflow Shuffle. Some regional endpoints may not support this feature; see the feature documentation for a list of supported regions. |
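As an illustration of the Dataflow Shuffle row, batch jobs have historically opted into the Shuffle service with the `shuffle_mode=service` experiment. Whether this flag is still required depends on your SDK version and region (Shuffle is the default in some regions), so treat this as a sketch with the same placeholder project and bucket as above:

```sh
# Opt a batch job into the Dataflow Shuffle service; --region
# must name a regional endpoint that supports Shuffle.
python -m apache_beam.examples.wordcount \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-central1 \
    --experiments=shuffle_mode=service \
    --temp_location=gs://MY_BUCKET/temp \
    --input=gs://dataflow-samples/shakespeare/kinglear.txt \
    --output=gs://MY_BUCKET/results/output
```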