Verify cluster connectivity with nettest

Anthos clusters on bare metal nettest identifies connectivity issues among the Kubernetes objects in your clusters, such as Pods, nodes, Services, and some external targets. nettest does not check connections from external targets to Pods, nodes, or Services. This document describes how to deploy and run nettest using one of two manifests from the anthos-samples GitHub repository: nettest.yaml or nettest_rhel.yaml. If you run Anthos clusters on bare metal on Red Hat Enterprise Linux (RHEL) or CentOS, use nettest_rhel.yaml. If you run Anthos clusters on bare metal on Ubuntu, use nettest.yaml.

This document also describes how to interpret the logs that nettest generates so that you can identify connectivity problems in your cluster.

`nettest` overview

The nettest diagnostic tool consists of the following Kubernetes objects. Each object is specified in the nettest YAML manifest file.

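cloudprober: A DaemonSet and Service that collect network connectivity status, such as error rate and latency.
echoserver: A DaemonSet and Service that respond to cloudprober, which provides the metrics for network connectivity.
nettest: A Pod that contains the prometheus and nettest containers.
- prometheus collects metrics from cloudprober.
- nettest queries prometheus and displays the network test results in the log.
nettest-engine: A ConfigMap that configures the nettest container in the nettest Pod.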

To isolate nettest from other cluster resources, the manifest also specifies the nettest namespace and a dedicated ServiceAccount, together with its ClusterRole and ClusterRoleBinding.

Run nettest

Run the command for your operating system to deploy nettest. The tests run automatically as soon as the nettest Pod starts, and they take about five minutes to complete.

Ubuntu:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

RHEL or CentOS:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest_rhel.yaml
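The choice between the two manifests can be sketched as a small helper. This is only an illustration: `parse_os_id` and `manifest_url` are hypothetical names, and the sketch assumes the node's `/etc/os-release` file reports `ID=ubuntu`, `ID=rhel`, or `ID=centos`.

```python
# Base path of the nettest manifests in the anthos-samples repository.
BASE_URL = (
    "https://raw.githubusercontent.com/GoogleCloudPlatform/"
    "anthos-samples/main/anthos-bm-utils/abm-nettest/"
)


def parse_os_id(os_release_text: str) -> str:
    """Return the ID field from /etc/os-release content, lowercased."""
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "ID":
            # Values may be quoted, for example ID="rhel".
            return value.strip().strip("\"'").lower()
    raise ValueError("no ID field found in os-release content")


def manifest_url(os_id: str) -> str:
    """Pick the manifest for the OS: nettest_rhel.yaml for RHEL/CentOS, nettest.yaml for Ubuntu."""
    if os_id in ("rhel", "centos"):
        return BASE_URL + "nettest_rhel.yaml"
    if os_id == "ubuntu":
        return BASE_URL + "nettest.yaml"
    raise ValueError(f"unsupported OS ID: {os_id!r}")
```

For example, `manifest_url(parse_os_id(open("/etc/os-release").read()))` would return the URL to pass to `kubectl apply -f` on the machine where it runs.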

Get the test results

The tests finish about five minutes after you deploy the nettest manifest. When they have finished, run the following command to view the nettest results:

kubectl -n nettest logs nettest -c nettest

While it runs, nettest sends messages like the following to stdout:

I0413 03:33:04.879141       1 collectorui.go:130] Listening on ":8999"
I0413 03:33:04.879258       1 prometheus.go:172] Running prometheus controller
E0413 03:33:04.879628       1 prometheus.go:178] Prometheus controller: failed to
retries probers: Get "http://127.0.0.1:9090/api/v1/targets": dial tcp 127.0.0.1:9090:
connect: connection refused

If nettest runs successfully and finds no connectivity failures, the following log entry appears:

I0211 21:58:34.689290       1 validate_metrics.go:78] Metric validation passed!

If nettest finds connectivity problems, it writes log entries like the following:

E0211 06:40:11.948634       1 collector.go:65] Engine error: step validateMetrics failed:
"Error rate in percentage": probe from "10.200.0.3" to "172.26.115.210:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "10.200.0.3" to "172.26.27.229:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "192.168.3.248" to "echoserver-hostnetwork_10.200.0.2_8080"
has value 2.007046, threshold is 1.000000
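The log samples in this section suggest a simple way for a script to tell the outcomes apart. The following is a minimal sketch, assuming the two marker strings appear in the log exactly as in the samples; `nettest_status` is a hypothetical helper, and treating the absence of both markers as "pending" is an assumption.

```python
# Marker strings copied from the sample log entries in this document.
PASS_MARKER = "Metric validation passed!"
FAIL_MARKER = "step validateMetrics failed"


def nettest_status(log_text: str) -> str:
    """Classify nettest log output as 'failed', 'passed', or 'pending'."""
    # A reported failure takes priority if both markers are present.
    if FAIL_MARKER in log_text:
        return "failed"
    if PASS_MARKER in log_text:
        return "passed"
    # Neither marker yet: assume the test has not finished.
    return "pending"
```

The input would be the text returned by `kubectl -n nettest logs nettest -c nettest`.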

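The default threshold is 1% (1.000000), but you can ignore error rates of up to 5%. For example, in the preceding log sample, the error rate for connections from IP address 192.168.3.248 to echoserver-hostnetwork_10.200.0.2_8080 is about 2% (2.007046). This is an example of a reported connectivity problem that you can ignore.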
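To apply the 5% rule to a long log, the entries can be parsed and filtered. The sketch below assumes entries have the same shape as the sample above, with quoted endpoints and possibly wrapped across lines; `ProbeFailure`, `parse_failures`, and `actionable` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class ProbeFailure(NamedTuple):
    src: str
    dst: str
    error_rate: float  # percent
    threshold: float   # percent


# \s+ between tokens lets an entry match even when it wraps onto a new line.
_ENTRY = re.compile(
    r'"Error rate in percentage":\s+probe\s+from\s+"([^"]+)"\s+to\s+"([^"]+)"'
    r"\s+has\s+value\s+(\d+(?:\.\d+)?),\s+threshold\s+is\s+(\d+(?:\.\d+)?)"
)


def parse_failures(log_text: str) -> List[ProbeFailure]:
    """Extract every error-rate entry from nettest log text."""
    return [
        ProbeFailure(m[1], m[2], float(m[3]), float(m[4]))
        for m in _ENTRY.finditer(log_text)
    ]


def actionable(failures: List[ProbeFailure], tolerance: float = 5.0) -> List[ProbeFailure]:
    """Keep only entries whose error rate exceeds the tolerated percentage."""
    return [f for f in failures if f.error_rate > tolerance]
```

With the sample log above, the two 100% entries would remain after filtering and the entry at about 2% would be dropped.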

Interpret the test results

When nettest finishes and has found connectivity problems, entries like the following appear in the nettest Pod log:

"Error rate in percentage": probe from {src} to {dst} has value 100.000000, threshold is 1.000000

Here, {src} and {dst} can each be one of the following:

echoserver Pod IP: connectivity to or from a Pod on a node
Node IP: connectivity to or from a node
Service IP (see the text that follows for details)

In addition, {dst} can be one of the following:

google.com: external connectivity
dns: connectivity to the non-hostNetwork Service through DNS (echoserver-non-hostnetwork.nettest.svc.cluster.local)
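When a log contains many entries, it can help to label each endpoint by kind. The following sketch is an illustration only: `classify_endpoint` is a hypothetical helper, the three CIDR ranges must come from your own cluster configuration, and only IPv4 endpoints with an optional `:port` suffix are handled. Anything else, such as the `echoserver-hostnetwork_…` form seen earlier, is reported as "unknown".

```python
import ipaddress


def classify_endpoint(endpoint: str, pod_cidr: str, node_cidr: str, service_cidr: str) -> str:
    """Label a {src} or {dst} value as pod, node, service, external, dns, or unknown."""
    if endpoint == "google.com":
        return "external"
    if endpoint == "dns":
        return "dns"
    # Drop an optional ":port" suffix (IPv4 only; IPv6 is not handled here).
    host = endpoint.rsplit(":", 1)[0] if ":" in endpoint else endpoint
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return "unknown"
    ranges = (("pod", pod_cidr), ("node", node_cidr), ("service", service_cidr))
    for label, cidr in ranges:
        if ip in ipaddress.ip_network(cidr):
            return label
    return "unknown"
```

For instance, with assumed ranges of 192.168.0.0/16 for Pods, 10.200.0.0/24 for nodes, and 172.26.0.0/16 for Services, the address 172.26.27.229:80 would be labeled as a Service.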

Details about Service IPs appear in the JSON-formatted probe entries in the log, as in the following example. The example shows that 172.26.27.229:80 is the address of service-clusterip. Two probes have this targets value: one for Pods (pod-service-clusterip) and one for nodes (node-service-clusterip).
```
probe {
  name: "node-service-clusterip"
  …
  targets {
    host_names: "172.26.27.229:80"
  }
```
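To look up which probes an address belongs to, the probe entries can be scanned for `name` and `host_names` pairs. This is a rough sketch, assuming each field sits on its own line and that `name` precedes `targets` within a probe, as in the excerpt above. `probes_by_target` is a hypothetical helper, and a nested `name:` field elsewhere in a probe would confuse it.

```python
import re
from collections import defaultdict
from typing import Dict, List

_NAME = re.compile(r'^\s*name:\s*"([^"]+)"')
_HOSTS = re.compile(r'host_names:\s*"([^"]+)"')


def probes_by_target(log_text: str) -> Dict[str, List[str]]:
    """Map each host_names value to the names of the probes that list it."""
    mapping: Dict[str, List[str]] = defaultdict(list)
    current = None  # name of the probe currently being read
    for line in log_text.splitlines():
        name = _NAME.search(line)
        if name:
            current = name[1]
            continue
        hosts = _HOSTS.search(line)
        if hosts and current is not None:
            # Split on commas in case several hosts share one field.
            for host in hosts[1].split(","):
                mapping[host.strip()].append(current)
    return dict(mapping)
```

Looking up a failing {dst} value such as 172.26.27.229:80 in the returned mapping would then show the probe names that target it.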

Verify the fix

After you resolve all reported connectivity problems, delete the nettest Pod and reapply the nettest manifest to rerun the connectivity tests.

For example, to rerun nettest on Ubuntu, run the following commands:

kubectl -n nettest delete pod nettest
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

Delete `nettest`

When you have finished testing, run the following commands to delete all nettest resources:

kubectl delete namespace nettest
kubectl delete clusterroles nettest:nettest
kubectl delete clusterrolebindings nettest:nettest