Error disabling Dataflow Shuffle or Streaming Engine

Problem

When you run a job with Dataflow Shuffle or Streaming Engine disabled, you receive one of the following error messages:

Rpc to <worker-harness>:12345 completed with error UNAVAILABLE: failed to connect to all addresses
java.util.concurrent.ExecutionException: java.io.IOException: DEADLINE_EXCEEDED: (g)RPC timed out when <source-worker-harness> talking to <destination-worker-harness>:12346. Server unresponsive (ping error: Deadline Exceeded

Environment

  • Dataflow Shuffle or Streaming Engine is disabled
  • Pipeline running with more than one worker

Solution

  1. Add an ingress firewall rule that allows TCP traffic on ports 12345-12346 between Dataflow workers, with the following details:
    • INGRESS_FIREWALL_RULE_NAME: any unique name. For example: allow-ingress-dataflow
    • NETWORK: <network containing subnetwork for the Dataflow job>
    • DIRECTION: ingress
      $ gcloud compute firewall-rules create INGRESS_FIREWALL_RULE_NAME \
          --network NETWORK \
          --action allow \
          --direction DIRECTION \
          --target-tags dataflow \
          --source-tags dataflow \
          --priority 0 \
          --rules tcp:12345-12346
      
  2. If egress traffic is blocked (that is, the default allow-egress rule is overridden), add an egress firewall rule that allows TCP traffic on ports 12345-12346, with the following details:
    • EGRESS_FIREWALL_RULE_NAME: any unique name. For example: allow-egress-dataflow
    • NETWORK: <network containing subnetwork for the Dataflow job>
    • DIRECTION: egress
    • CIDR_RANGE: <IP range of the subnetwork used by the Dataflow job>
      $ gcloud compute firewall-rules create EGRESS_FIREWALL_RULE_NAME \
          --network NETWORK \
          --action allow \
          --direction DIRECTION \
          --target-tags dataflow \
          --destination-ranges CIDR_RANGE \
          --priority 0 \
          --rules tcp:12345-12346
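
Once the rules are created, it can help to confirm that they exist with the expected direction, tags, and ports. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated for the job's project and that the example rule names from the steps above were used; describe_rule is a hypothetical helper name, not part of gcloud:

```shell
# Hypothetical helper: print the key fields of one firewall rule.
# Assumes the gcloud CLI is installed and authenticated for the job's project.
describe_rule() {
  gcloud compute firewall-rules describe "$1" \
      --format="yaml(name,network,direction,priority,allowed,sourceTags,targetTags,destinationRanges)"
}

# Example calls (rule names are the illustrative ones from the steps above):
#   describe_rule allow-ingress-dataflow
#   describe_rule allow-egress-dataflow
```

For each rule, the output should show tcp with ports 12345-12346 under allowed and dataflow under targetTags.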

Cause

Dataflow workers store intermediate data locally when Dataflow Shuffle or Streaming Engine is disabled. Some operations, such as GroupByKey, need to shuffle this intermediate data between workers, which happens over TCP ports 12345-12346. If firewall rules allowing this traffic are not present, the job gets stuck or fails.
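
To check whether this worker-to-worker traffic is getting through, one option is to probe the ports from one worker VM against another while the job is running. The sketch below assumes that you can SSH into a worker, that bash and the coreutils timeout command are available there, and that WORKER_IP (a placeholder) holds the internal IP address of a different worker; check_port is a hypothetical helper name:

```shell
# Hypothetical helper: report whether a TCP port on a host accepts connections.
check_port() {
  local status=0
  # Open a TCP connection through bash's /dev/tcp pseudo-device,
  # waiting at most 3 seconds before giving up.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null || status=$?
  case "$status" in
    0)   echo "open" ;;
    124) echo "timed-out" ;;        # timeout's exit code when the limit is hit
    *)   echo "refused-or-error" ;;
  esac
}

# Probe both shuffle ports only when WORKER_IP (placeholder) has been set.
if [ -n "${WORKER_IP:-}" ]; then
  for port in 12345 12346; do
    echo "${WORKER_IP}:${port} $(check_port "${WORKER_IP}" "${port}")"
  done
fi
```

A timed-out result typically points to packets being dropped on the way, which is consistent with a missing firewall rule, whereas a refused connection usually means the packet arrived but nothing was listening on that port.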