Exportador de nós

Neste documento, descrevemos como configurar a implantação do Google Kubernetes Engine para usar o Google Cloud Managed Service para Prometheus a fim de coletar métricas do Exportador de nós. Esta página mostra como fazer o seguinte:

  • Configurar o Exportador de nós para gerar relatórios de métricas.
  • Configurar um recurso PodMonitoring para o serviço gerenciado para Prometheus a fim de coletar as métricas exportadas.
  • Instalar um painel no Cloud Monitoring para ver as métricas.
  • Configure regras de alertas para monitorar as métricas.

Estas instruções se aplicam somente ao usar a coleção gerenciada com o serviço gerenciado para Prometheus. Se você estiver usando a coleção autoimplantada, consulte o repositório de origem do Exportador de nós para ver informações da instalação.

Estas instruções são um exemplo e devem funcionar na maioria dos ambientes do Kubernetes. Se você estiver com problemas para instalar um aplicativo ou exportador devido a políticas restritivas de segurança ou da organização, recomendamos consultar a documentação de código aberto para receber suporte.

Pré-requisitos

Para coletar métricas do Exportador de nós usando o Managed Service para Prometheus e a coleta gerenciada, sua implantação precisa atender aos seguintes requisitos:

  • O cluster precisa executar a versão 1.21.4-gke.300 ou posterior do Google Kubernetes Engine.
  • É necessário executar o Managed Service para Prometheus com a coleta gerenciada ativada. Para mais informações, consulte Começar a usar a coleta gerenciada.

  • Para usar os painéis disponíveis no Cloud Monitoring para a integração, use a versão 1.3.1 ou posterior do node_exporter.

    Para mais informações sobre os painéis disponíveis, consulte Instalar painéis.

Instalar o Exportador de nós

É possível usar a seguinte configuração para instalar o Exportador de nós:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/version: 1.8.2
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
        app.kubernetes.io/version: 1.8.2
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
                - amd64
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.8.2
        args:
        - --web.listen-address=:8080
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*)$
        - --collector.netdev.device-exclude=^(veth.*)$
        ports:
        - name: metrics
          containerPort: 8080
        resources:
          limits:
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        volumeMounts:
        - mountPath: /host/sys
          mountPropagation: HostToContainer
          name: sys
          readOnly: true
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
      hostNetwork: true
      hostPID: true
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      volumes:
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: root
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  endpoints:
  - port: metrics
    interval: 30s

Para aplicar as alterações de configuração de um arquivo local, execute o seguinte comando:

kubectl apply -f FILE_NAME

Também é possível usar o Terraform para gerenciar as configurações.

Definir regras e alertas

Use a configuração Rules a seguir para definir alertas nas suas métricas:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: gmp-public
  name: node-exporter-rules
  labels:
    app.kubernetes.io/component: rules
    app.kubernetes.io/name: node-exporter-rules
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  groups:
  - name: node-exporter
    interval: 30s
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24 * 60 * 60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 4 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 10
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 5% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 3% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: critical
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 24 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 4 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 5% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 3% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeNetworkReceiveErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} receive errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworkreceiveerrs
        summary: Network interface is reporting many receive errors.
      expr: |
        rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeNetworkTransmitErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworktransmiterrs
        summary: Network interface is reporting many transmit errors.
      expr: |
        rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeHighNumberConntrackEntriesUsed
      annotations:
        description: '{{ $value | humanizePercentage }} of conntrack entries are used.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodehighnumberconntrackentriesused
        summary: Number of conntrack are getting close to the limit.
      expr: |
        (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
      labels:
        severity: warning
    - alert: NodeTextFileCollectorScrapeError
      annotations:
        description: Node Exporter text file collector failed to scrape.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodetextfilecollectorscrapeerror
        summary: Node Exporter text file collector failed to scrape.
      expr: |
        node_textfile_scrape_error{job="node-exporter"} == 1
      labels:
        severity: warning
    - alert: NodeClockSkewDetected
      annotations:
        description: Clock on {{ $labels.instance }} is out of sync by more than 300s.
          Ensure NTP is configured correctly on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclockskewdetected
        summary: Clock skew detected.
      expr: |
        (
          node_timex_offset_seconds{job="node-exporter"} > 0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) >= 0
        )
        or
        (
          node_timex_offset_seconds{job="node-exporter"} < -0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
        )
      for: 10m
      labels:
        severity: warning
    - alert: NodeClockNotSynchronising
      annotations:
        description: Clock on {{ $labels.instance }} is not synchronising. Ensure
          NTP is configured on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclocknotsynchronising
        summary: Clock not synchronising.
      expr: |
        min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
        and
        node_timex_maxerror_seconds{job="node-exporter"} >= 16
      for: 10m
      labels:
        severity: warning
    - alert: NodeRAIDDegraded
      annotations:
        description: RAID array '{{ $labels.device }}' on {{ $labels.instance }} is
          in degraded state due to one or more disks failures. Number of spare drives
          is insufficient to fix issue automatically.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddegraded
        summary: RAID Array is degraded
      expr: |
        node_md_disks_required{job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} - ignoring (state) (node_md_disks{state="active",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}) > 0
      for: 15m
      labels:
        severity: critical
    - alert: NodeRAIDDiskFailure
      annotations:
        description: At least one device in RAID array on {{ $labels.instance }} failed.
          Array '{{ $labels.device }}' needs attention and possibly a disk swap.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddiskfailure
        summary: Failed device in RAID array
      expr: |
        node_md_disks{state="failed",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} > 0
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70
        )
      for: 15m
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90
        )
      for: 15m
      labels:
        severity: critical
  - name: node-exporter.rules
    interval: 30s
    rules:
    - expr: |
        count without (cpu, mode) (
          node_cpu_seconds_total{job="node-exporter",mode="idle"}
        )
      record: instance:node_num_cpu:sum
    - expr: |
        1 - avg without (cpu) (
          sum without (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal"}[5m]))
        )
      record: instance:node_cpu_utilisation:rate5m
    - expr: |
        (
          node_load1{job="node-exporter"}
        /
          instance:node_num_cpu:sum{job="node-exporter"}
        )
      record: instance:node_load1_per_cpu:ratio
    - expr: |
        1 - (
          (
            node_memory_MemAvailable_bytes{job="node-exporter"}
            or
            (
              node_memory_Buffers_bytes{job="node-exporter"}
              +
              node_memory_Cached_bytes{job="node-exporter"}
              +
              node_memory_MemFree_bytes{job="node-exporter"}
              +
              node_memory_Slab_bytes{job="node-exporter"}
            )
          )
        /
          node_memory_MemTotal_bytes{job="node-exporter"}
        )
      record: instance:node_memory_utilisation:ratio
    - expr: |
        rate(node_vmstat_pgmajfault{job="node-exporter"}[5m])
      record: instance:node_vmstat_pgmajfault:rate5m
    - expr: |
        rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_seconds:rate5m
    - expr: |
        rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_weighted_seconds:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_drop_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_drop_excluding_lo:rate5m

Para aplicar as alterações de configuração de um arquivo local, execute o seguinte comando:

kubectl apply -f FILE_NAME

Também é possível usar o Terraform para gerenciar as configurações.

Para mais informações sobre como aplicar regras ao cluster, consulte Avaliação e alerta de regras gerenciadas.

Essa configuração de Rules foi adaptada das regras e alertas que contribuíram para o repositório kube-prometheus.

Verificar a configuração

Use o Metrics Explorer para verificar se o exportador foi configurado corretamente. Pode levar um ou dois minutos para que o Cloud Monitoring transfira as métricas.

Para verificar se as métricas foram transferidas, faça o seguinte:

  1. No Console do Google Cloud, acesse a página do  Metrics Explorer:

    Acesse o Metrics explorer

    Se você usar a barra de pesquisa para encontrar essa página, selecione o resultado com o subtítulo Monitoring.

  2. Na barra de ferramentas do painel do criador de consultas, selecione o botão  MQL ou  PromQL.
  3. Verifique se PromQL está selecionado na opção de ativar/desativar Idioma. A alternância de idiomas está na mesma barra de ferramentas que permite formatar sua consulta.
  4. Digite e execute a seguinte consulta:
    up{job="node-exporter", cluster="CLUSTER_NAME", namespace="gmp-public"}
    

Instalar painéis

O Cloud Monitoring oferece uma biblioteca de painéis de amostra para integrações. A biblioteca de amostra inclui painéis "Prometheus", que podem ser instalados para visualizar os dados no console do Google Cloud.

Observe que o painel Visão geral do Prometheus do cluster do Kubernetes requer a instalação das métricas de estado do Kube. O painel Visão geral do Prometheus do pod do Kubernetes requer a instalação das métricas de estado do Kube e do Kubelet/cAdvisor.

Para instalar um painel a partir da biblioteca de amostra, faça o seguinte:

  1. No console do Google Cloud, acesse a página  Painéis:

    Ir para Painéis

    Se você usar a barra de pesquisa para encontrar essa página, selecione o resultado com o subtítulo Monitoring.

  2. Selecione a guia Biblioteca de amostra.
  3. Escolha a categoria Outro.
  4. (Opcional) Para conferir uma visualização estática de um painel sem instalá-lo, clique em Visualizar.
  5. Selecione os painéis que você quer instalar e clique em  Importar.

Para mais informações sobre como instalar painéis, consulte Como instalar painéis de amostra.

Solução de problemas

Para resolver problemas de transferências de métricas, consulte Problemas com a coleta de exportadores em Resolver problemas no processamento.