Check cluster connectivity with nettest

GKE on Bare Metal nettest identifies connectivity issues between the Kubernetes objects in your clusters, such as Pods, Nodes, and Services, and some external targets. nettest doesn't check connections from external targets to Pods, Nodes, or Services. This document describes how to deploy and run nettest with one of the manifests, nettest.yaml or nettest_rhel.yaml, in the anthos-samples GitHub repository. Use nettest_rhel.yaml if you run GKE on Bare Metal on Red Hat Enterprise Linux (RHEL) or CentOS. Use nettest.yaml if you run GKE on Bare Metal on Ubuntu.

This document also describes how to interpret the logs that nettest generates to identify connectivity problems in your clusters.

About nettest

The nettest diagnostic tool consists of the following Kubernetes objects. Each object is specified in the nettest YAML manifest files.

  • cloudprober: a DaemonSet and a Service responsible for collecting network connection status, such as error rate and latency.
  • echoserver: a DaemonSet and a Service responsible for responding to cloudprober, providing it the metrics for network connectivity.
  • nettest: a Pod containing the prometheus and nettest containers.
    • prometheus collects metrics from cloudprober.
    • nettest queries prometheus and displays the network test results in the log.
  • nettest-engine: a ConfigMap to configure the nettest container in the nettest Pod.

The manifest also specifies the nettest namespace and a dedicated ServiceAccount (along with ClusterRole and ClusterRoleBinding) to isolate nettest from other cluster resources.

Run nettest

Deploy nettest by running the following command for your operating system. When the nettest Pod starts, the test runs automatically. The test takes about five minutes to complete.

For Ubuntu OS:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

For RHEL or CentOS OS:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest_rhel.yaml

Get the test results

After the test completes, which takes around five minutes after the nettest manifest is deployed, run the following command to see the nettest results:

kubectl -n nettest logs nettest -c nettest

While nettest is running, it sends messages like the following to stdout:

I0413 03:33:04.879141       1 collectorui.go:130] Listening on ":8999"
I0413 03:33:04.879258       1 prometheus.go:172] Running prometheus controller
E0413 03:33:04.879628       1 prometheus.go:178] Prometheus controller: failed to
retries probers: Get "http://127.0.0.1:9090/api/v1/targets": dial tcp 127.0.0.1:9090:
connect: connection refused

If nettest runs successfully without identifying any connectivity failures, you see the following log entry:

I0211 21:58:34.689290       1 validate_metrics.go:78] Metric validation passed!
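Because the test takes about five minutes, you might want to poll the log until a final result appears rather than checking it manually. The following sketch matches either the success message or the "validateMetrics failed" marker that nettest writes when it finds connection issues; the helper name nettest_finished is hypothetical, not part of nettest:

```shell
# Hypothetical helper: succeeds once the log contains a final
# validation result, either pass or fail.
nettest_finished() {
  grep -qE 'Metric validation passed|validateMetrics failed'
}

# Example polling loop (requires a cluster with nettest deployed):
#   until kubectl -n nettest logs nettest -c nettest | nettest_finished; do
#     sleep 30
#   done
```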

If nettest found connection issues, it writes log entries like the following:

E0211 06:40:11.948634       1 collector.go:65] Engine error: step validateMetrics failed:
"Error rate in percentage": probe from "10.200.0.3" to "172.26.115.210:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "10.200.0.3" to "172.26.27.229:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "192.168.3.248" to "echoserver-hostnetwork_10.200.0.2_8080"
has value 2.007046, threshold is 1.000000

Although the default threshold is one percent (1.000000), error rates up to five percent can be ignored safely. For example, the error rate for connectivity from IP address 192.168.3.248 to echoserver-hostnetwork_10.200.0.2_8080 in the preceding example is approximately two percent (2.007046). This is an example of a reported connectivity issue that you can ignore.
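Because error rates of five percent or less can be ignored, you may want to filter the validation output down to the probes that actually need attention. The following sketch extracts the reported rate from each "Error rate in percentage" line and prints only the entries above five percent; the function name flag_high_error_rates is hypothetical:

```shell
# Hypothetical helper: print only the probe entries whose error rate
# exceeds the five percent level that can't be safely ignored.
flag_high_error_rates() {
  awk '/Error rate in percentage/ {
    for (i = 1; i <= NF; i++) {
      # The rate follows the word "value"; numeric coercion (+ 0)
      # drops the trailing comma before comparing.
      if ($i == "value" && $(i + 1) + 0 > 5) { print; break }
    }
  }'
}

# Usage (requires a cluster with nettest deployed):
#   kubectl -n nettest logs nettest -c nettest | flag_high_error_rates
```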

Interpret the test results

When nettest finishes and finds a connectivity issue, you see the following entry in the nettest Pod logs:

"Error rate in percentage": probe from {src} to {dst} has value 100.000000, threshold is 1.000000

Here, {src} and {dst} can be any of the following:

  • echoserver Pod IP: the connection to or from a Pod on the node.
  • Node IP: the connection to or from the node.
  • Service IP: see the following text for details.

In addition, {dst} can also be:

  • google.com: an external connection.
  • dns: the connection to a non-hostNetwork Service through DNS, that is, echoserver-non-hostnetwork.nettest.svc.cluster.local.

    The details for Service IP are in JSON-formatted probe entries in the log, like the following example. The example shows that 172.26.27.229:80 is the address for service-clusterip. There are two probes with this targets value, one for the Pod (pod-service-clusterip) and one for the Node (node-service-clusterip).

    probe {
      name: "node-service-clusterip"
      …
      targets {
        host_names: "172.26.27.229:80"
      }
    }
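To map a Service address from an error entry back to its probe names, you can pull the name and host_names fields out of the saved log. The following sketch assumes the probe entries appear in the log as in the preceding example; the function name list_probe_targets is hypothetical:

```shell
# Hypothetical helper: print each probe's name and its targets
# host_names value from the log's JSON-formatted probe entries.
list_probe_targets() {
  grep -oE 'name: "[^"]*"|host_names: "[^"]*"'
}

# Usage (requires a cluster with nettest deployed):
#   kubectl -n nettest logs nettest -c nettest | list_probe_targets
```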

Validate your fixes

When you have addressed all reported connectivity issues, delete the nettest Pod and reapply the nettest manifest to rerun the connectivity tests.

For example, to rerun nettest for Ubuntu, run the following commands:

kubectl -n nettest delete pod nettest
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

Clean up nettest

When you're done testing, run the following commands to remove all nettest resources:

kubectl delete namespace nettest
kubectl delete clusterroles nettest:nettest
kubectl delete clusterrolebindings nettest:nettest