Esportatore di nodi

Mantieni tutto organizzato con le raccolte Salva e classifica i contenuti in base alle tue preferenze.

Questo documento descrive come configurare il tuo deployment Google Kubernetes Engine in modo da utilizzare Google Cloud Managed Service per Prometheus per raccogliere metriche da Exporter esportatore. Questo documento illustra come procedere nel seguente modo:

  • Configura l'esportatore di nodi per creare report sulle metriche.
  • Configura una risorsa PodMonitoring per Managed Service per Prometheus per raccogliere le metriche esportate.
  • Accedi a una dashboard in Cloud Monitoring per visualizzare le metriche.
  • Configura le regole di avviso per monitorare le metriche.

Queste istruzioni si applicano solo se utilizzi la raccolta gestita con Managed Service per Prometheus. Se invece utilizzi una raccolta di cui è stato eseguito il deployment autonomo, consulta il repository del codice sorgente per l'esportatore di nodi per le informazioni sull'installazione.

Prerequisiti

Per raccogliere metriche da Exporter esportatore utilizzando Managed Service per Prometheus e la raccolta gestita, il tuo deployment deve soddisfare i seguenti requisiti:

  • Il cluster deve eseguire Google Kubernetes Engine versione 1.21.4-gke.300 o successive.
  • Devi eseguire Managed Service per Prometheus con la raccolta gestita abilitata. Per maggiori informazioni, consulta la Guida introduttiva alla raccolta gestita.

  • Per utilizzare le dashboard disponibili in Cloud Monitoring per l'integrazione, devi utilizzare node_exporter 1.3.1 o versioni successive.

    Per saperne di più sulle dashboard disponibili, consulta Installare le dashboard.

Installa l'esportatore di nodi

Puoi utilizzare la seguente configurazione per installare l'esportatore di nodi:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/version: 1.3.1
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
        app.kubernetes.io/version: 1.3.1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
                - amd64
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.3.1
        args:
        - --web.listen-address=:8080
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*)$
        - --collector.netdev.device-exclude=^(veth.*)$
        ports:
        - name: metrics
          containerPort: 8080
        resources:
          limits:
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        volumeMounts:
        - mountPath: /host/sys
          mountPropagation: HostToContainer
          name: sys
          readOnly: true
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
      hostNetwork: true
      hostPID: true
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      volumes:
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: root
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  endpoints:
  - port: metrics
    interval: 30s

Per applicare le modifiche alla configurazione da un file locale, esegui questo comando:

kubectl apply -f FILE_NAME

Puoi anche utilizzare Terraform per gestire le configurazioni.

Definisci regole e avvisi

Puoi utilizzare la seguente configurazione Rules per definire gli avvisi sulle metriche:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: gmp-public
  name: node-exporter-rules
  labels:
    app.kubernetes.io/component: rules
    app.kubernetes.io/name: node-exporter-rules
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  groups:
  - name: node-exporter
    interval: 30s
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24 * 60 * 60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 4 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 10
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 5% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 3% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: critical
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 24 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 4 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 5% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 3% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeNetworkReceiveErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} receive errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworkreceiveerrs
        summary: Network interface is reporting many receive errors.
      expr: |
        rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeNetworkTransmitErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworktransmiterrs
        summary: Network interface is reporting many transmit errors.
      expr: |
        rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeHighNumberConntrackEntriesUsed
      annotations:
        description: '{{ $value | humanizePercentage }} of conntrack entries are used.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodehighnumberconntrackentriesused
        summary: Number of conntrack are getting close to the limit.
      expr: |
        (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
      labels:
        severity: warning
    - alert: NodeTextFileCollectorScrapeError
      annotations:
        description: Node Exporter text file collector failed to scrape.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodetextfilecollectorscrapeerror
        summary: Node Exporter text file collector failed to scrape.
      expr: |
        node_textfile_scrape_error{job="node-exporter"} == 1
      labels:
        severity: warning
    - alert: NodeClockSkewDetected
      annotations:
        description: Clock on {{ $labels.instance }} is out of sync by more than 300s.
          Ensure NTP is configured correctly on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclockskewdetected
        summary: Clock skew detected.
      expr: |
        (
          node_timex_offset_seconds{job="node-exporter"} > 0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) >= 0
        )
        or
        (
          node_timex_offset_seconds{job="node-exporter"} < -0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
        )
      for: 10m
      labels:
        severity: warning
    - alert: NodeClockNotSynchronising
      annotations:
        description: Clock on {{ $labels.instance }} is not synchronising. Ensure
          NTP is configured on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclocknotsynchronising
        summary: Clock not synchronising.
      expr: |
        min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
        and
        node_timex_maxerror_seconds{job="node-exporter"} >= 16
      for: 10m
      labels:
        severity: warning
    - alert: NodeRAIDDegraded
      annotations:
        description: RAID array '{{ $labels.device }}' on {{ $labels.instance }} is
          in degraded state due to one or more disks failures. Number of spare drives
          is insufficient to fix issue automatically.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddegraded
        summary: RAID Array is degraded
      expr: |
        node_md_disks_required{job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} - ignoring (state) (node_md_disks{state="active",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}) > 0
      for: 15m
      labels:
        severity: critical
    - alert: NodeRAIDDiskFailure
      annotations:
        description: At least one device in RAID array on {{ $labels.instance }} failed.
          Array '{{ $labels.device }}' needs attention and possibly a disk swap.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddiskfailure
        summary: Failed device in RAID array
      expr: |
        node_md_disks{state="failed",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} > 0
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70
        )
      for: 15m
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90
        )
      for: 15m
      labels:
        severity: critical
  - name: node-exporter.rules
    interval: 30s
    rules:
    - expr: |
        count without (cpu, mode) (
          node_cpu_seconds_total{job="node-exporter",mode="idle"}
        )
      record: instance:node_num_cpu:sum
    - expr: |
        1 - avg without (cpu) (
          sum without (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal"}[5m]))
        )
      record: instance:node_cpu_utilisation:rate5m
    - expr: |
        (
          node_load1{job="node-exporter"}
        /
          instance:node_num_cpu:sum{job="node-exporter"}
        )
      record: instance:node_load1_per_cpu:ratio
    - expr: |
        1 - (
          (
            node_memory_MemAvailable_bytes{job="node-exporter"}
            or
            (
              node_memory_Buffers_bytes{job="node-exporter"}
              +
              node_memory_Cached_bytes{job="node-exporter"}
              +
              node_memory_MemFree_bytes{job="node-exporter"}
              +
              node_memory_Slab_bytes{job="node-exporter"}
            )
          )
        /
          node_memory_MemTotal_bytes{job="node-exporter"}
        )
      record: instance:node_memory_utilisation:ratio
    - expr: |
        rate(node_vmstat_pgmajfault{job="node-exporter"}[5m])
      record: instance:node_vmstat_pgmajfault:rate5m
    - expr: |
        rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_seconds:rate5m
    - expr: |
        rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_weighted_seconds:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_drop_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_drop_excluding_lo:rate5m

Per applicare le modifiche alla configurazione da un file locale, esegui questo comando:

kubectl apply -f FILE_NAME

Puoi anche utilizzare Terraform per gestire le configurazioni.

Per maggiori informazioni sull'applicazione di regole al tuo cluster, consulta Avvisi e valutazione delle regole gestite.

Questa configurazione Rules è stata adattata dalle regole e dagli avvisi che hanno contribuito al repository kube-prometheus.

Verificare la configurazione

Puoi utilizzare Metrics Explorer per verificare di aver configurato correttamente l'esportatore. Cloud Monitoring potrebbe impiegare uno o due minuti per importare le metriche.

Per verificare se le metriche sono state importate, procedi nel seguente modo:

  1. Nella console Google Cloud, seleziona Monitoring o fai clic sul pulsante seguente:
    Vai a Monitoring
  2. Nel riquadro di navigazione, seleziona Metrics Explorer.
  3. Seleziona la scheda PromQL ed esegui la seguente query:
    up{job="node-exporter", cluster="CLUSTER_NAME", namespace="gmp-public"}
    

Installa dashboard

Cloud Monitoring offre una libreria di dashboard di esempio per le integrazioni. La libreria di esempio include le dashboard di "Prometheus", che puoi installare per visualizzare i tuoi dati in Google Cloud Console.

Tieni presente che le dashboard Kubernetes Prometheus Cluster Overview e Kubernetes Prometheus Overview di Kubernetes richiedono anche l'installazione di metriche relative allo stato di Kube.

Per installare una dashboard dalla libreria di esempio, procedi nel seguente modo:

  1. Nella console Google Cloud, seleziona Monitoring o fai clic sul pulsante seguente:
    Vai a Monitoring
  2. Nel pannello di navigazione, seleziona Dashboard.
  3. Seleziona la scheda Libreria di esempio.
  4. Scegli la categoria Altro.
  5. (Facoltativo) Per visualizzare un'anteprima statica di una dashboard senza installarla, fai clic su Anteprima.
  6. Seleziona le dashboard da installare, poi fai clic su Importa.

Per saperne di più sull'installazione delle dashboard, consulta Installazione di dashboard di esempio.

Risolvere i problemi

Per informazioni sulla risoluzione dei problemi di importazione delle metriche, consulta la sezione Problemi con la raccolta dagli esportatori in Risolvere i problemi lato importazione.