Node Exporter

本文說明如何設定 Google Kubernetes Engine 部署作業,以便使用 Google Cloud Managed Service for Prometheus 從 Node Exporter 收集指標。本文將說明如何執行下列操作:

  • 設定 Node Exporter 來回報指標。
  • 設定 Managed Service for Prometheus 的 PodMonitoring 資源,以收集匯出的指標。
  • 在 Cloud Monitoring 中存取資訊主頁,即可查看指標。
  • 設定快訊規則來監控指標。

只有在使用 Managed Service for Prometheus 的 代管收集作業時,才適用這些操作說明。如果您使用自行部署的收集器,請參閱 Node Exporter 的來源存放區,瞭解安裝資訊。

這些操作說明僅供參考,適用於大多數 Kubernetes 環境。 如果因安全或機構政策限制而無法安裝應用程式或匯出工具,建議您參閱開放原始碼文件尋求支援。

必要條件

如要使用 Managed Service for Prometheus 和代管收集作業,從 Node Exporter 收集指標,部署作業必須符合下列規定:

  • 叢集必須執行 Google Kubernetes Engine 1.21.4-gke.300 以上版本。
  • 您必須執行 Managed Service for Prometheus,並啟用代管收集作業。詳情請參閱「 開始使用代管集合」一文。

  • 如要使用 Cloud Monitoring 提供的整合式資訊主頁,請務必使用 node_exporter 1.3.1 以上版本。

    如要進一步瞭解可用的資訊主頁,請參閱「安裝資訊主頁」。

GKE Autopilot 叢集中的節點已啟用這些指標。

安裝 Node Exporter

您可以使用下列設定安裝 Node Exporter:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/version: 1.8.2
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
        app.kubernetes.io/version: 1.8.2
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
                - amd64
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.8.2
        args:
        - --web.listen-address=:8080
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*)$
        - --collector.netdev.device-exclude=^(veth.*)$
        ports:
        - name: metrics
          containerPort: 8080
        resources:
          limits:
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        volumeMounts:
        - mountPath: /host/sys
          mountPropagation: HostToContainer
          name: sys
          readOnly: true
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
      hostNetwork: true
      hostPID: true
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      volumes:
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: root
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  endpoints:
  - port: metrics
    interval: 30s

如要套用本機檔案的設定變更,請執行下列指令:

kubectl apply -f FILE_NAME

您也可以使用 Terraform 管理設定。

定義規則和快訊

您可以使用下列 Rules 設定,為指標定義快訊:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: gmp-public
  name: node-exporter-rules
  labels:
    app.kubernetes.io/component: rules
    app.kubernetes.io/name: node-exporter-rules
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  groups:
  - name: node-exporter
    interval: 30s
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24 * 60 * 60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 4 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 10
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 5% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 3% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: critical
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 24 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 4 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 5% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 3% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeNetworkReceiveErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} receive errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworkreceiveerrs
        summary: Network interface is reporting many receive errors.
      expr: |
        rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeNetworkTransmitErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworktransmiterrs
        summary: Network interface is reporting many transmit errors.
      expr: |
        rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeHighNumberConntrackEntriesUsed
      annotations:
        description: '{{ $value | humanizePercentage }} of conntrack entries are used.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodehighnumberconntrackentriesused
        summary: Number of conntrack are getting close to the limit.
      expr: |
        (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
      labels:
        severity: warning
    - alert: NodeTextFileCollectorScrapeError
      annotations:
        description: Node Exporter text file collector failed to scrape.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodetextfilecollectorscrapeerror
        summary: Node Exporter text file collector failed to scrape.
      expr: |
        node_textfile_scrape_error{job="node-exporter"} == 1
      labels:
        severity: warning
    - alert: NodeClockSkewDetected
      annotations:
        description: Clock on {{ $labels.instance }} is out of sync by more than 300s.
          Ensure NTP is configured correctly on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclockskewdetected
        summary: Clock skew detected.
      expr: |
        (
          node_timex_offset_seconds{job="node-exporter"} > 0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) >= 0
        )
        or
        (
          node_timex_offset_seconds{job="node-exporter"} < -0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
        )
      for: 10m
      labels:
        severity: warning
    - alert: NodeClockNotSynchronising
      annotations:
        description: Clock on {{ $labels.instance }} is not synchronising. Ensure
          NTP is configured on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclocknotsynchronising
        summary: Clock not synchronising.
      expr: |
        min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
        and
        node_timex_maxerror_seconds{job="node-exporter"} >= 16
      for: 10m
      labels:
        severity: warning
    - alert: NodeRAIDDegraded
      annotations:
        description: RAID array '{{ $labels.device }}' on {{ $labels.instance }} is
          in degraded state due to one or more disks failures. Number of spare drives
          is insufficient to fix issue automatically.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddegraded
        summary: RAID Array is degraded
      expr: |
        node_md_disks_required{job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} - ignoring (state) (node_md_disks{state="active",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}) > 0
      for: 15m
      labels:
        severity: critical
    - alert: NodeRAIDDiskFailure
      annotations:
        description: At least one device in RAID array on {{ $labels.instance }} failed.
          Array '{{ $labels.device }}' needs attention and possibly a disk swap.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddiskfailure
        summary: Failed device in RAID array
      expr: |
        node_md_disks{state="failed",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} > 0
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70
        )
      for: 15m
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90
        )
      for: 15m
      labels:
        severity: critical
  - name: node-exporter.rules
    interval: 30s
    rules:
    - expr: |
        count without (cpu, mode) (
          node_cpu_seconds_total{job="node-exporter",mode="idle"}
        )
      record: instance:node_num_cpu:sum
    - expr: |
        1 - avg without (cpu) (
          sum without (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal"}[5m]))
        )
      record: instance:node_cpu_utilisation:rate5m
    - expr: |
        (
          node_load1{job="node-exporter"}
        /
          instance:node_num_cpu:sum{job="node-exporter"}
        )
      record: instance:node_load1_per_cpu:ratio
    - expr: |
        1 - (
          (
            node_memory_MemAvailable_bytes{job="node-exporter"}
            or
            (
              node_memory_Buffers_bytes{job="node-exporter"}
              +
              node_memory_Cached_bytes{job="node-exporter"}
              +
              node_memory_MemFree_bytes{job="node-exporter"}
              +
              node_memory_Slab_bytes{job="node-exporter"}
            )
          )
        /
          node_memory_MemTotal_bytes{job="node-exporter"}
        )
      record: instance:node_memory_utilisation:ratio
    - expr: |
        rate(node_vmstat_pgmajfault{job="node-exporter"}[5m])
      record: instance:node_vmstat_pgmajfault:rate5m
    - expr: |
        rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_seconds:rate5m
    - expr: |
        rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_weighted_seconds:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_drop_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_drop_excluding_lo:rate5m

如要套用本機檔案的設定變更,請執行下列指令:

kubectl apply -f FILE_NAME

您也可以使用 Terraform 管理設定。

如要進一步瞭解如何將規則套用至叢集,請參閱「受管理規則評估和快訊」。

這項 Rules 設定是根據貢獻給 kube-prometheus 存放區的規則和快訊改編而成。

驗證設定

您可以使用 Metrics Explorer 驗證是否已正確設定 Node Exporter。Cloud Monitoring 可能需要一到兩分鐘才能擷取指標。

如要確認指標已擷取,請按照下列步驟操作:

  1. 前往 Google Cloud 控制台的 「Metrics Explorer」頁面:

    前往 Metrics Explorer

    如果您是使用搜尋列尋找這個頁面,請選取子標題為「Monitoring」的結果

  2. 在查詢建構工具窗格的工具列中,選取名稱為  MQL PromQL 的按鈕。
  3. 確認已在「Language」(語言) 切換按鈕中選取「PromQL」。語言切換按鈕位於同一工具列,可供你設定查詢格式。
  4. 輸入並執行下列查詢:
    up{job="node-exporter", cluster="CLUSTER_NAME", namespace="gmp-public"}
    

安裝資訊主頁

Cloud Monitoring 提供整合的範例資訊主頁程式庫。範例程式庫包含「Prometheus」資訊主頁,您可以安裝這些資訊主頁,在 Google Cloud 控制台中查看資料。

請注意,Kubernetes 叢集 Prometheus 總覽資訊主頁需要安裝 Kube State MetricsKubernetes Pod Prometheus 總覽資訊主頁需要安裝 Kube State MetricsKubelet/cAdvisor

如要從範本庫安裝資訊主頁,請按照下列步驟操作:

  1. 在 Google Cloud 控制台中,前往「Dashboards」(資訊主頁) 頁面:

    前往「Dashboards」(資訊主頁)

    如果您是使用搜尋列尋找這個頁面,請選取子標題為「Monitoring」的結果

  2. 選取「範例庫」分頁標籤。
  3. 選擇「其他」類別。
  4. (選用) 如要查看資訊主頁的靜態預覽畫面,而不需安裝,請按一下「預覽」
  5. 選取要安裝的資訊主頁,然後按一下 「匯入」

如要進一步瞭解如何安裝資訊主頁,請參閱「 安裝範例資訊主頁」。

疑難排解

如要瞭解如何排解指標擷取問題,請參閱「 排解擷取端問題」一文中的「 收集匯出工具資料時發生問題」一節。