Node Exporter

This document describes how to configure your Google Kubernetes Engine deployment so that you can use Google Cloud Managed Service for Prometheus to collect metrics from Node Exporter. It shows you how to do the following:

  • Set up Node Exporter to report metrics.
  • Access a Cloud Monitoring dashboard to view the metrics.
  • Configure alerting rules to monitor the metrics.

These instructions apply only if you use managed collection with Managed Service for Prometheus. If you use self-deployed collection, see the Node Exporter source repository for installation information.

These instructions are provided as an example and should work in most Kubernetes environments. If you have trouble installing an application or exporter because of restrictive security or organizational policies, we recommend that you consult the open source documentation for help.

Prerequisites

To collect metrics from Node Exporter by using Managed Service for Prometheus and managed collection, your deployment must meet the following requirements:

  • Your cluster must be running Google Kubernetes Engine version 1.28.15-gke.2475000 or later.
  • You must be running Managed Service for Prometheus with managed collection enabled. For more information, see Get started with managed collection.

  • To use the dashboards available in Cloud Monitoring for the integration, you must use node_exporter version 1.3.1 or later.

    For more information about the available dashboards, see Install dashboards.

These metrics are already enabled for the nodes of GKE Autopilot clusters.

Install Node Exporter

You can use the following configuration to install Node Exporter:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/version: 1.8.2
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
        app.kubernetes.io/version: 1.8.2
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
                - amd64
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.8.2
        args:
        - --web.listen-address=:8080
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*)$
        - --collector.netdev.device-exclude=^(veth.*)$
        ports:
        - name: metrics
          containerPort: 8080
        resources:
          limits:
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        volumeMounts:
        - mountPath: /host/sys
          mountPropagation: HostToContainer
          name: sys
          readOnly: true
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
      hostNetwork: true
      hostPID: true
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      volumes:
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: root
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  endpoints:
  - port: metrics
    interval: 30s

To apply configuration changes from a local file, run the following command:

kubectl apply -f FILE_NAME

You can also use Terraform to manage your configurations.

Define rules and alerts

You can use the following Rules configuration to define alerts on your metrics:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: gmp-public
  name: node-exporter-rules
  labels:
    app.kubernetes.io/component: rules
    app.kubernetes.io/name: node-exporter-rules
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  groups:
  - name: node-exporter
    interval: 30s
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24 * 60 * 60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 4 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 10
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 5% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 3% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: critical
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 24 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 4 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 5% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 3% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeNetworkReceiveErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} receive errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworkreceiveerrs
        summary: Network interface is reporting many receive errors.
      expr: |
        rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeNetworkTransmitErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworktransmiterrs
        summary: Network interface is reporting many transmit errors.
      expr: |
        rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeHighNumberConntrackEntriesUsed
      annotations:
        description: '{{ $value | humanizePercentage }} of conntrack entries are used.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodehighnumberconntrackentriesused
        summary: Number of conntrack entries is getting close to the limit.
      expr: |
        (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
      labels:
        severity: warning
    - alert: NodeTextFileCollectorScrapeError
      annotations:
        description: Node Exporter text file collector failed to scrape.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodetextfilecollectorscrapeerror
        summary: Node Exporter text file collector failed to scrape.
      expr: |
        node_textfile_scrape_error{job="node-exporter"} == 1
      labels:
        severity: warning
    - alert: NodeClockSkewDetected
      annotations:
        description: Clock on {{ $labels.instance }} is out of sync by more than 0.05s.
          Ensure NTP is configured correctly on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclockskewdetected
        summary: Clock skew detected.
      expr: |
        (
          node_timex_offset_seconds{job="node-exporter"} > 0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) >= 0
        )
        or
        (
          node_timex_offset_seconds{job="node-exporter"} < -0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
        )
      for: 10m
      labels:
        severity: warning
    - alert: NodeClockNotSynchronising
      annotations:
        description: Clock on {{ $labels.instance }} is not synchronising. Ensure
          NTP is configured on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclocknotsynchronising
        summary: Clock not synchronising.
      expr: |
        min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
        and
        node_timex_maxerror_seconds{job="node-exporter"} >= 16
      for: 10m
      labels:
        severity: warning
    - alert: NodeRAIDDegraded
      annotations:
        description: RAID array '{{ $labels.device }}' on {{ $labels.instance }} is
          in degraded state due to one or more disks failures. Number of spare drives
          is insufficient to fix issue automatically.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddegraded
        summary: RAID Array is degraded
      expr: |
        node_md_disks_required{job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} - ignoring (state) (node_md_disks{state="active",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}) > 0
      for: 15m
      labels:
        severity: critical
    - alert: NodeRAIDDiskFailure
      annotations:
        description: At least one device in RAID array on {{ $labels.instance }} failed.
          Array '{{ $labels.device }}' needs attention and possibly a disk swap.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddiskfailure
        summary: Failed device in RAID array
      expr: |
        node_md_disks{state="failed",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} > 0
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70
        )
      for: 15m
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90
        )
      for: 15m
      labels:
        severity: critical
  - name: node-exporter.rules
    interval: 30s
    rules:
    - expr: |
        count without (cpu, mode) (
          node_cpu_seconds_total{job="node-exporter",mode="idle"}
        )
      record: instance:node_num_cpu:sum
    - expr: |
        1 - avg without (cpu) (
          sum without (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal"}[5m]))
        )
      record: instance:node_cpu_utilisation:rate5m
    - expr: |
        (
          node_load1{job="node-exporter"}
        /
          instance:node_num_cpu:sum{job="node-exporter"}
        )
      record: instance:node_load1_per_cpu:ratio
    - expr: |
        1 - (
          (
            node_memory_MemAvailable_bytes{job="node-exporter"}
            or
            (
              node_memory_Buffers_bytes{job="node-exporter"}
              +
              node_memory_Cached_bytes{job="node-exporter"}
              +
              node_memory_MemFree_bytes{job="node-exporter"}
              +
              node_memory_Slab_bytes{job="node-exporter"}
            )
          )
        /
          node_memory_MemTotal_bytes{job="node-exporter"}
        )
      record: instance:node_memory_utilisation:ratio
    - expr: |
        rate(node_vmstat_pgmajfault{job="node-exporter"}[5m])
      record: instance:node_vmstat_pgmajfault:rate5m
    - expr: |
        rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_seconds:rate5m
    - expr: |
        rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_weighted_seconds:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_drop_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_drop_excluding_lo:rate5m

To apply configuration changes from a local file, run the following command:

kubectl apply -f FILE_NAME

You can also use Terraform to manage your configurations.

For more information about applying rules to your cluster, see Managed rule evaluation and alerting.

This Rules configuration was adapted from the rules and alerts contributed to the kube-prometheus repository.
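
To make the logic of the NodeFilesystemSpaceFillingUp alerts easier to follow, the following Python sketch approximates what the expression checks: the available percentage is below a threshold, a straight-line fit over recent samples predicts that free space reaches zero within the horizon, and the filesystem is not read-only. The helper names (predict_linear, filling_up), the 100 GiB filesystem, and the sample history are assumptions made for illustration only. Anchoring the extrapolation at the last sample instead of at the rule evaluation time is a simplification of how Prometheus evaluates predict_linear.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (Unix timestamp in seconds, value)


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Fit a least-squares line and extrapolate horizon_s past the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_ref = samples[-1][0]  # simplification: treat the last sample as "now"
    xs = [t - t_ref for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x  # fitted value at t_ref
    return intercept + slope * horizon_s


def filling_up(avail: Sequence[Sample], size_bytes: float,
               threshold_pct: float, horizon_s: float,
               read_only: bool = False) -> bool:
    """Mirror the three conditions joined by 'and' in the alert expression."""
    avail_pct = avail[-1][1] / size_bytes * 100
    return (avail_pct < threshold_pct
            and predict_linear(avail, horizon_s) < 0
            and not read_only)


GIB = 2 ** 30
# Hypothetical history: one sample per hour for 6 hours, losing 1 GiB per hour.
history = [(i * 3600.0, (16 - i) * GIB) for i in range(7)]

# Warning rule: below 15% available and predicted empty within 24 hours.
warning = filling_up(history, 100 * GIB, 15, 24 * 60 * 60)
# Critical rule: below 10% available and predicted empty within 4 hours.
critical = filling_up(history, 100 * GIB, 10, 4 * 60 * 60)
```

With this hypothetical history, the filesystem sits at 10% available and loses 1 GiB per hour, so the warning conditions hold while the critical ones do not. In the actual rules, each condition must also hold continuously for the duration set in the for field before the alert fires.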

Verify the configuration

You can use Metrics Explorer to verify that you configured Node Exporter correctly. Cloud Monitoring might take one or two minutes to ingest your metrics.

To verify that the metrics have been ingested, do the following:

  1. In the Google Cloud console, go to the Metrics explorer page:

    Go to Metrics explorer.

    If you use the search bar to find this page, select the result whose subheading is Monitoring.

  2. In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.
  3. Verify that PromQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.
  4. Enter and run the following query:
    up{job="node-exporter", cluster="CLUSTER_NAME", namespace="gmp-public"}
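
If you prefer to run the same check from a script, the following Python sketch builds the query above and prepares a request against the Prometheus-compatible query API. The endpoint shape, the helper names (build_up_query, build_request, count_targets_up), and the use of an access token obtained separately (for example, with gcloud auth print-access-token) are assumptions for illustration; confirm them against the current Managed Service for Prometheus documentation before relying on them.

```python
import urllib.parse
import urllib.request

# Assumed endpoint shape for the Managed Service for Prometheus query API.
API = ("https://monitoring.googleapis.com/v1/projects/{project}"
       "/location/global/prometheus/api/v1/query")


def build_up_query(cluster: str, namespace: str = "gmp-public") -> str:
    """Return the PromQL selector used in the verification step."""
    return (f'up{{job="node-exporter", cluster="{cluster}", '
            f'namespace="{namespace}"}}')


def build_request(project: str, cluster: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a form-encoded POST carrying the query."""
    body = urllib.parse.urlencode({"query": build_up_query(cluster)}).encode()
    return urllib.request.Request(
        API.format(project=project),
        data=body,
        headers={"Authorization": f"Bearer {token}"},
    )


def count_targets_up(response_json: dict) -> int:
    """Count series whose latest 'up' sample is 1 in an instant-query response."""
    results = response_json["data"]["result"]
    return sum(1 for series in results if series["value"][1] == "1")


# To send the request (requires network access and a valid token):
#   import json
#   req = build_request("PROJECT_ID", "CLUSTER_NAME", token)
#   with urllib.request.urlopen(req) as resp:
#       print(count_targets_up(json.load(resp)))
```

A count equal to the number of Linux nodes in the cluster suggests that every Node Exporter pod is being scraped; a lower count points at nodes where the DaemonSet pod is missing or failing.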
    

Install dashboards

Cloud Monitoring provides a library of sample dashboards for integrations. The sample library includes Prometheus dashboards, which you can install to view your data in the Google Cloud console.

Note that the Kubernetes Cluster Prometheus Overview dashboard requires Kube State Metrics to be installed. The Kubernetes Pod Prometheus Overview dashboard requires both Kube State Metrics and Kubelet/cAdvisor to be installed.

To install a dashboard from the sample library, do the following:

  1. In the Google Cloud console, go to the Dashboards page:

    Go to Dashboards.

    If you use the search bar to find this page, select the result whose subheading is Monitoring.

  2. Select the Sample library tab.
  3. Choose the Other category.
  4. (Optional) To view a static preview of a dashboard without installing it, click Preview.
  5. Select the dashboards that you want to install, and then click Import.

For more information about installing dashboards, see Install sample dashboards.

Troubleshooting

For information about troubleshooting metric ingestion problems, see the Problems with collection from exporters section of Troubleshooting ingestion-side problems.