Health checking your gRPC servers on GKE

gRPC is a high-performance open source RPC framework under the Cloud Native Computing Foundation. gRPC is a frequently preferred method for creating microservices among GKE customers since it lets you build services that talk to each other efficiently, regardless of the implementation language of the client and the server.

However, Kubernetes does not support gRPC health checks natively. To address this, we developed and released grpc-health-probe, an open source command-line tool for assessing the health of a gRPC server. The tool has been downloaded over 1.8 million times.

In this article, we discuss why you need a custom-built tool to health-check gRPC servers running on Kubernetes clusters such as Google Kubernetes Engine (GKE).

The case for gRPC health checking on Kubernetes

Since Kubernetes runs microservices and gRPC is a popular choice for communication between microservices (and gRPC and Kubernetes are both CNCF projects), you might think Kubernetes natively supports the gRPC protocol for health checking. However, that's not the case.

Kubernetes natively supports some health check methods to assert readiness or liveness of a Pod:

  • TCP socket open
  • HTTP GET request
  • executing a binary inside the container

Although gRPC primarily uses the HTTP/2 stack as its transport layer, it's not possible to craft a gRPC request using a Kubernetes "httpGet" probe. For this reason, adding gRPC health checks to Kubernetes has been proposed. However, the current position is to treat gRPC the same as other RPC frameworks such as Apache Thrift, and to support none of them natively in Kubernetes health checks.

As a result, Kubernetes does not support gRPC health checks natively for the time being and you need to use a custom-built tool (or write your own).
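
To see why an "httpGet" probe falls short, it helps to look at what a gRPC call looks like on the wire. A unary gRPC call is an HTTP/2 POST with a content-type of application/grpc, its body is a length-prefixed protobuf message, and the outcome is reported in the grpc-status trailer rather than in the HTTP status code. A plain GET with no body cannot express any of this. The Go sketch below hand-builds the request body for a health check. The helper names are made up for illustration, and the hand-written encoding assumes the request message carries a single string field named service at field number 1; real clients use generated code instead.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeHealthCheckRequest hand-encodes a protobuf message with one string
// field at field number 1 (assumed to be HealthCheckRequest.service).
// Wire format: tag byte 0x0a (field 1, length-delimited), varint length,
// then the raw bytes. An empty string is the default value and is omitted.
func encodeHealthCheckRequest(service string) []byte {
	if service == "" {
		return nil
	}
	buf := []byte{0x0a}
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(service)))
	buf = append(buf, tmp[:n]...)
	return append(buf, service...)
}

// frameGRPCMessage wraps a serialized message in gRPC's length-prefixed
// framing: a 1-byte compression flag followed by a 4-byte big-endian length.
func frameGRPCMessage(msg []byte) []byte {
	frame := make([]byte, 5, 5+len(msg))
	frame[0] = 0 // 0 means the message is not compressed
	binary.BigEndian.PutUint32(frame[1:5], uint32(len(msg)))
	return append(frame, msg...)
}

func main() {
	body := frameGRPCMessage(encodeHealthCheckRequest("foo"))
	fmt.Println("POST /grpc.health.v1.Health/Check")
	fmt.Println("content-type: application/grpc")
	fmt.Printf("body: % x\n", body)
}
```

Even an empty request (no service name) still needs the 5-byte frame header in a POST body, which is more than an "httpGet" probe can send.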

What does "health" mean for a gRPC server?

Typically at Google, we have a set of well-known endpoints on every microservice called z-pages (such as /healthz) that help us standardize health checking across the fleet, among other things. However, there's no such single well-known health check endpoint that comes with all gRPC servers.

To address this, gRPC core offers a Health Checking Protocol that’s distributed with all gRPC language implementations. To implement this protocol, register the health service with your server and implement the rpc Check(HealthCheckRequest) returns (HealthCheckResponse) method so that it reflects your service’s status.

Providing an implementation of this protocol in your gRPC service adds a /grpc.health.v1.Health/Check path to your service that an external tool (or potentially Kubernetes itself) can query to figure out whether your server is healthy or not.

Meet grpc_health_probe

To address the problems listed above, we have released a small open source command-line utility called grpc_health_probe that uses the gRPC Health Checking Protocol to query the health of a service, print its status and exit with a success or error code indicating the check result.

grpc_health_probe has been downloaded over 1.8 million times and is used in production at many companies that use gRPC as part of their stack, including Google.

Running this command-line probe tool against a healthy server prints the server's status and exits with a zero status code, indicating success:

$ grpc_health_probe -addr localhost:5000

status: SERVING

However, a misbehaving server, such as a frozen one, might return a different response and cause the probe to exit with a non-zero exit code:

$ grpc_health_probe -addr localhost:5000 -connect-timeout 100ms
timeout: failed to connect service "localhost:5000" within 100ms
(exit code 2)

grpc_health_probe is designed primarily for Kubernetes. You integrate it into your health checks by using exec probes, which periodically execute the binary in your container’s Linux namespace. This means that the probe can query the gRPC server over the loopback interface (that is, localhost).

As a result, integrating grpc_health_probe into your Kubernetes manifests requires bundling the binary in your gRPC server’s container image and configuring an "exec" probe in your manifests, such as:

  [...]
spec:
  containers:
  - name: grpc-server
    ports:
    - containerPort: 5000
    livenessProbe:
      exec:
        command: ["/bin/grpc_health_probe", "-addr=:5000", "-connect-timeout=100ms", "-rpc-timeout=150ms"]
      initialDelaySeconds: 5

Conclusion

Although Kubernetes does not offer a native way to health-check gRPC servers, you can do it simply by bundling a standard probe tool in your container image and invoking it via exec probes.

Not surprisingly, this approach works for health checking on other Kubernetes-based compute environments such as Knative (and therefore on Cloud Run for Anthos) as well.

As an alternative to executing a probe binary, you can implement your own /healthz endpoint using vanilla HTTP within the same server process and use a Kubernetes httpGet probe. Some languages like Go make this easier, while others might not. In the latter case, you can develop a sidecar container running a vanilla HTTP server that queries the gRPC server in the same pod, and use an httpGet probe with that.

If you are interested in reading more about the reasoning that led to the creation of the grpc-health-probe project, make sure to check out the reading links below. I'm always happy to chat about this topic, so find me on Twitter.

Further reading: