Egress Firewall denying Health Check traffic

Problem

You are seeing traffic dropped by a deny-all egress firewall rule, even though you expected that traffic to be allowed by a higher-priority ingress firewall rule.

Environment

  • A Google Cloud service that relies on health checks to ensure availability.
  • An ingress allow firewall rule for the health check source ranges has been created so that health checks can pass.
  • In the logs of the deny-all egress firewall rule, you consistently see DENY entries for traffic destined for the health check ranges (35.191.0.0/16, 130.211.0.0/22).
  • The dropped traffic is related to health check probes.
  • Terms used in this article:
    • allow-hc: the ingress firewall rule that allows health check traffic
    • deny-all: the egress firewall rule that denies all traffic

Solution

How to identify:

Step 1 Find the DENY log entries for the traffic dropped by the deny-all firewall rule, and confirm whether the dest_ip falls within the health check probe ranges.
  • This traffic is a health check response, sent from your service to our Google Front Ends (GFEs).
  • Note the timestamps of the DENY log entries and record their 5-tuples (source IP address, source port, destination IP address, destination port, transport protocol).
Step 2 View the allow-hc firewall logs and check whether there is an ALLOW entry that matches the DENY entry above.
  • This traffic is a health check request, sent from our Google Front Ends (GFEs) to your service.
  • Using the 5-tuple noted above, verify that this request matches the denied response. The request's 5-tuple is the response's 5-tuple with source and destination swapped.
  • Expect the two log entries to be at least 10 minutes apart.
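
The two checks above can be sketched in Python. This is a minimal, illustrative sketch that assumes you have already exported the relevant log entries and mapped them onto the hypothetical FirewallLogEntry record below. The field names are modeled on the connection fields (src_ip, src_port, dest_ip, dest_port, protocol) and the disposition field of Firewall Rules Logging records, but verify them against your own logs. The function names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime, timedelta

# Health check source ranges named in this article.
HEALTH_CHECK_RANGES = [
    ipaddress.ip_network("35.191.0.0/16"),
    ipaddress.ip_network("130.211.0.0/22"),
]

# Idle window after which a tracked connection expires (10 minutes).
CONNTRACK_IDLE = timedelta(minutes=10)


@dataclass(frozen=True)
class FirewallLogEntry:
    """Hypothetical, simplified firewall log record."""
    timestamp: datetime
    disposition: str  # "ALLOWED" or "DENIED"
    src_ip: str
    src_port: int
    dest_ip: str
    dest_port: int
    protocol: int  # IANA protocol number, e.g. 6 for TCP


def in_health_check_ranges(ip: str) -> bool:
    """Step 1: does this address fall within a health check source range?"""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in HEALTH_CHECK_RANGES)


def is_delayed_response(deny: FirewallLogEntry, allow: FirewallLogEntry) -> bool:
    """Step 2: is `deny` the late response to the probe logged in `allow`?"""
    # A response's 5-tuple is the request's 5-tuple with source and destination swapped.
    reversed_match = (
        deny.src_ip == allow.dest_ip
        and deny.src_port == allow.dest_port
        and deny.dest_ip == allow.src_ip
        and deny.dest_port == allow.src_port
        and deny.protocol == allow.protocol
    )
    return (
        deny.disposition == "DENIED"
        and allow.disposition == "ALLOWED"
        and in_health_check_ranges(deny.dest_ip)
        and reversed_match
        and deny.timestamp - allow.timestamp >= CONNTRACK_IDLE
    )


if __name__ == "__main__":
    # Made-up example entries: a probe, then its response 11 minutes later.
    probe = FirewallLogEntry(datetime(2024, 1, 1, 12, 0), "ALLOWED",
                             "35.191.10.20", 54321, "10.128.0.5", 80, 6)
    late = FirewallLogEntry(datetime(2024, 1, 1, 12, 11), "DENIED",
                            "10.128.0.5", 80, "35.191.10.20", 54321, 6)
    print(is_delayed_response(late, probe))  # True
```

A response that arrives within the 10-minute window would not be denied in the first place, which is why the sketch only flags pairs that are at least that far apart.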

While there is no direct impact at the moment, failed health check probes can still be a concern. The following mitigations can reduce the pattern of rejected probes:

Adjusting Health Check Frequency. 

Step 1 Identify the health check whose responses are being rejected by your deny-all firewall rule by finding the resource that is being probed.

Step 2 Adjust your health check intervals and thresholds. Depending on how often returning health check responses are blocked by the deny-all egress firewall rule, you can increase or decrease the health check interval and thresholds to get a more accurate sample of successful probes.
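
As an illustrative sketch only, the Python below builds (but does not run) a gcloud compute health-checks update command for the adjustment described in Step 2. It also shows the trade-off of a higher unhealthy threshold: isolated failed probes are tolerated, but a real failure takes longer to detect. The health check name "my-hc", the HTTP protocol, the function names, and the example values are hypothetical; check the gcloud reference for the flags that apply to your health check type.

```python
import shlex


def build_health_check_update(name: str, protocol: str, check_interval_s: int,
                              timeout_s: int, healthy_threshold: int,
                              unhealthy_threshold: int) -> list[str]:
    """Build (but do not run) a `gcloud compute health-checks update` argv list."""
    if timeout_s > check_interval_s:
        # The health check timeout may not be longer than the check interval.
        raise ValueError("timeout must not exceed the check interval")
    if min(healthy_threshold, unhealthy_threshold) < 1:
        raise ValueError("thresholds must be at least 1")
    return [
        "gcloud", "compute", "health-checks", "update", protocol, name,
        f"--check-interval={check_interval_s}s",
        f"--timeout={timeout_s}s",
        f"--healthy-threshold={healthy_threshold}",
        f"--unhealthy-threshold={unhealthy_threshold}",
    ]


def approx_seconds_to_mark_unhealthy(check_interval_s: int, unhealthy_threshold: int) -> int:
    """Roughly how long consecutive failures must last before a backend is marked unhealthy."""
    return check_interval_s * unhealthy_threshold


if __name__ == "__main__":
    # Hypothetical values: probe every 10 s, tolerate two failures in a row.
    cmd = build_health_check_update("my-hc", "http", 10, 5, 2, 3)
    print(shlex.join(cmd))
    print(approx_seconds_to_mark_unhealthy(10, 3))  # 30
```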

Creating an Allow Egress Firewall Rule to the Health Check Ranges.

Create a firewall rule that allows egress traffic to the health check source ranges, with the following settings:
  • Direction of Traffic: Egress
  • Action on match: Allow
  • Targets: Specified target tags
    • 'allow-health-check' (apply this network tag to the instances being health checked)
  • Destination Filter: IP ranges
    • 130.211.0.0/22 and 35.191.0.0/16
  • Protocols and ports: Allow all
Note: This only prevents the returning health check responses from being denied by the deny-all egress rule. It does not make the probes succeed: by the time a delayed response is sent, the health check will already have been recorded as failed because of the long response time.
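
The same rule can be expressed as a gcloud compute firewall-rules create command. The Python sketch below only assembles the argument list; the function name, the rule name "allow-hc-egress", the network "my-network", and the priority values are hypothetical. The one constraint it enforces is that the new rule needs a lower priority number (that is, a higher priority) than your deny-all rule, because a deny rule wins over an allow rule at the same priority.

```python
import shlex

HEALTH_CHECK_RANGES = ("130.211.0.0/22", "35.191.0.0/16")


def build_allow_hc_egress_rule(rule_name: str, network: str, priority: int,
                               deny_all_priority: int,
                               target_tag: str = "allow-health-check") -> list[str]:
    """Build (but do not run) a `gcloud compute firewall-rules create` argv list."""
    if not 0 <= priority <= 65535:
        raise ValueError("firewall rule priority must be between 0 and 65535")
    if priority >= deny_all_priority:
        # Lower number = higher priority; at equal priority the deny rule wins.
        raise ValueError("allow rule needs a lower priority number than deny-all")
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        f"--network={network}",
        "--direction=EGRESS",
        "--action=ALLOW",
        "--rules=all",
        "--destination-ranges=" + ",".join(HEALTH_CHECK_RANGES),
        f"--target-tags={target_tag}",
        f"--priority={priority}",
    ]


if __name__ == "__main__":
    # Hypothetical names and priorities; substitute your own.
    cmd = build_allow_hc_egress_rule("allow-hc-egress", "my-network",
                                     priority=1000, deny_all_priority=65534)
    print(shlex.join(cmd))
```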

Cause

The DENY logs seen for traffic to the destinations 130.211.0.0/22 and 35.191.0.0/16 are health check responses that are being delayed or stalled by your application or service.

Health checks are a way for our Google Front Ends (GFEs) to determine the operating state of an instance or service and validate whether it can accept incoming traffic. If an instance is considered unhealthy, traffic to it is rejected, and other services related to that instance react accordingly (for example, by sending traffic to another location or by recreating the instance).
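
Health checks usually change an instance's state only after several consecutive probe results: the unhealthy threshold sets how many consecutive failures mark it unhealthy, and the healthy threshold sets how many consecutive successes mark it healthy again. The Python class below is a simplified, hypothetical model of that logic (not Google's implementation), useful for reasoning about how occasional failed probes, such as the delayed responses described in this article, affect health state.

```python
from dataclasses import dataclass, field


@dataclass
class BackendHealth:
    """Simplified, hypothetical model of threshold-based health state."""
    healthy_threshold: int
    unhealthy_threshold: int
    healthy: bool = True
    # Consecutive probe results that disagree with the current state.
    streak: int = field(default=0, init=False)

    def record(self, probe_succeeded: bool) -> bool:
        """Record one probe result and return the (possibly updated) state."""
        if probe_succeeded == self.healthy:
            self.streak = 0  # result agrees with the current state
            return self.healthy
        self.streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy


if __name__ == "__main__":
    hc = BackendHealth(healthy_threshold=2, unhealthy_threshold=3)
    results = [True, False, True, False, False, False]
    print([hc.record(r) for r in results])
    # [True, True, True, True, True, False]
```

In this run, the isolated failure in the second probe does not change the state; only the third consecutive failure marks the backend unhealthy.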

Our Google Front Ends (GFEs) send health check requests at user-configured intervals to probe each instance and check whether it is able to accept traffic. These requests are allowed in by the 'allow-hc' firewall rule, which permits traffic from the health check source ranges 130.211.0.0/22 and 35.191.0.0/16. Because our firewalls are stateful, when a connection is allowed in the ingress direction, return traffic for that connection is also allowed out. This connection tracking entry remains active as long as at least one packet is sent every 10 minutes. Any return traffic sent after the entry expires is evaluated against the other existing firewall rules, because the stateful session is no longer active.
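
The connection tracking behavior described above can be illustrated with a toy model. The Python sketch below is not how Google's firewall is implemented; it assumes that every probe is allowed by 'allow-hc', that any egress packet not matching a tracked connection hits 'deny-all', and that an entry expires once 10 minutes (600 seconds) pass without a packet. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field

IDLE_TIMEOUT_S = 600  # 10 minutes without a packet expires a tracked connection

# (src_ip, src_port, dest_ip, dest_port, protocol)
FiveTuple = tuple[str, int, str, int, str]


def reverse(t: FiveTuple) -> FiveTuple:
    """Return the 5-tuple of traffic flowing in the opposite direction."""
    src_ip, src_port, dest_ip, dest_port, proto = t
    return (dest_ip, dest_port, src_ip, src_port, proto)


@dataclass
class ToyStatefulFirewall:
    """Toy model: allowed ingress is tracked; untracked egress hits deny-all."""
    # Tracked ingress 5-tuple -> time (seconds) of the last packet seen.
    last_seen: dict[FiveTuple, float] = field(default_factory=dict)

    def ingress(self, t: FiveTuple, now: float) -> str:
        # Assumption: 'allow-hc' matches every health check probe.
        self.last_seen[t] = now
        return "ALLOW (allow-hc)"

    def egress(self, t: FiveTuple, now: float) -> str:
        key = reverse(t)
        seen = self.last_seen.get(key)
        if seen is not None and now - seen < IDLE_TIMEOUT_S:
            self.last_seen[key] = now  # any packet keeps the entry alive
            return "ALLOW (return traffic)"
        self.last_seen.pop(key, None)  # expired or never tracked
        return "DENY (deny-all)"


if __name__ == "__main__":
    fw = ToyStatefulFirewall()
    probe = ("35.191.10.20", 54321, "10.128.0.5", 80, "tcp")
    fw.ingress(probe, now=0)
    print(fw.egress(reverse(probe), now=2))    # ALLOW (return traffic)
    print(fw.egress(reverse(probe), now=662))  # DENY (deny-all): 660 s idle
```

The second reply in the example is denied only because more than 10 minutes passed since the previous packet on that connection, which mirrors the delayed health check responses described in this article.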

In the behavior observed here, the health check responses are denied because no return traffic for the connection allowed by the 'allow-hc' rule was sent within the 10-minute connection tracking window. Once that entry expires, the late response is evaluated by the 'deny-all' egress firewall rule, which denies it.

Note: Further investigation into the service's operations will be needed to fully identify what is happening on your machine. Google has visibility into your machine's configuration and details, but no visibility into the processes and activity running inside it.