排查 Ops Agent 问题

注意:此视频中 3:18 提到的 localhost:2020/api/v1/metrics 端点不再适用于 Ops Agent。如需了解其他选项,请参阅代理在运行,但无法提取数据

本文档可帮助您诊断 Ops Agent 安装或运行中出现的问题。

适用于 Linux 虚拟机的代理诊断工具

代理诊断工具会从 Linux 虚拟机收集以下所有代理的关键本地调试信息:Ops Agent、旧版 Logging 代理和旧版 Monitoring 代理。调试信息包括项目信息、虚拟机信息、代理配置、代理日志、代理服务状态、通常需要手动收集的信息。该工具还会检查本地虚拟机环境,确保它满足代理正常运行的要求,例如网络连接和所需权限。

在 Linux 虚拟机上提交代理的客户案例时,请运行代理诊断工具并将收集的信息附加到案例。在将信息附加到支持案例之前,请隐去密码等任何敏感信息。提供此信息可以减少排查支持案例问题所需的时间。

代理诊断工具必须从 Linux 虚拟机内部运行,因此,通常需要先通过 SSH 连接到虚拟机。以下命令会检索代理诊断工具并执行它:

curl -sSO https://dl.google.com/cloudagents/diagnose-agents.sh
sudo bash diagnose-agents.sh

跟踪脚本执行的输出,找到包含所收集信息的文件。通常,您可以在 /var/tmp/google-agents 目录中找到它们,除非您在运行脚本时已经自定义了输出目录。

如需了解详情,请检查 diagnose-agents.sh 脚本。此工具没有 Windows 版本。

代理无法安装

您在运行安装脚本时可能会遇到以下错误。

  • 操作系统不受支持。错误消息可能类似于以下内容:

    Linux

    https://packages.cloud.google.com/yum/repos/google-cloud-ops-agent-el6-x86_64-all/repodata/repomd.xml: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
    Trying other mirror.
    To address this issue please refer to the below wiki article
    
    https://wiki.centos.org/yum-errors
    
    If above article doesn't help to resolve this issue please use https://bugs.centos.org/.
    
    Error: Cannot retrieve repository metadata (repomd.xml) for repository: google-cloud-ops-agent. Please verify its path and try again
    
  • 虚拟机已安装 Cloud Logging 代理Cloud Monitoring 代理,它们与新代理冲突。错误消息可能类似于以下内容:

    Linux

    Error:
    Problem: problem with installed package stackdriver-agent-6.0.5-1.el8.x86_64 - package google-cloud-ops-agent-0.1.0-1.el8.x86_64 conflicts with stackdriver-agent provided by stackdriver-agent-6.0.5-1.el8.x86_64
    

    Ops Agent 会使用与旧代理不兼容的新配置文件。如需了解详情,请参阅配置 Ops Agent 指南。

    要消除此错误,请执行以下操作:

    1. 保存 Cloud Monitoring 代理Cloud Logging 代理的自定义配置文件。

    2. 卸载旧的 Cloud Monitoring 代理Cloud Logging 代理

      卸载代理后,Google Cloud Console 最多可能需要一小时才能报告此更改。

代理已安装,但无法运行

代理服务未在运行

当代理服务按预期运行时,您可能会看到以下状态:

适用于 Linux

computer@debian9:~$ sudo systemctl status google-cloud-ops-agent"*"
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: active (exited) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2240 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
  Process: 2214 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
 Main PID: 2240 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent.service

Aug 05 20:33:44 debian9 systemd[1]: Starting Google Cloud Ops Agent...
Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent.

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
  Drop-In: /lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
           └─directories.conf
   Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2234 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
  Process: 2216 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
 Main PID: 2247 (fluent-bit)
    Tasks: 22 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service
           └─2247 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-cloud-ops-agent-fluent-bit/fluent_bit_main.conf --parser /run/google-cloud-ops-agent-fluent-bit/fluent_bit_parser.conf --log_file /var/log/google-cloud-ops-agent/subagents/logging-module.log --storage_path /var/lib/google-cloud-ops-agent/fluent-bit/buffers

Aug 05 20:33:44 debian9 systemd[1]: Starting Google Cloud Ops Agent - Logging Agent...
Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent - Logging Agent.
Aug 05 20:33:44 debian9 fluent-bit[2247]: Fluent Bit v1.7.8
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2015-2018 Treasure Data
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Aug 05 20:33:44 debian9 fluent-bit[2247]: * https://fluentbit.io

● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
  Drop-In: /lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d
           └─directories.conf
   Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2237 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
  Process: 2215 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
 Main PID: 2251 (otelopscol)
    Tasks: 6 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
           └─2251 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --add-instance-id=false --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml

Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z        info        builder/pipelines_builder.go:51        Pipeline is starting...        {"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z        info        builder/pipelines_builder.go:62        Pipeline is started.        {"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z        info        service/service.go:192        Starting receivers...
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z        info        builder/receivers_builder.go:70        Receiver is starting...        {"kind": "receiver", "name": "hostmetrics/hostmetrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z        info        builder/receivers_builder.go:75        Receiver started.        {"kind": "receiver", "name": "hostmetrics/hostmetrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        builder/receivers_builder.go:70        Receiver is starting...        {"kind": "receiver", "name": "prometheus/agent"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        discovery/manager.go:195        Starting provider        {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0", "subs": "[otel-collector]"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        builder/receivers_builder.go:75        Receiver started.        {"kind": "receiver", "name": "prometheus/agent"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        service/collector.go:182        Everything is ready. Begin running and processing data.
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.256Z        info        discovery/manager.go:213        Discoverer channel closed        {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0"}

适用于 Windows

Get-Service google-cloud-ops-agent*

Status   Name               DisplayName
------   ----               -----------
Running  google-cloud-op... Google Cloud Ops Agent
Running  google-cloud-op... Google Cloud Ops Agent - Logging Agent
Running  google-cloud-op... Google Cloud Ops Agent - Metrics Agent

如果代理服务无法运行,您可能会看到以下状态:

Linux

$ sudo service google-cloud-ops-agent status
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-06-30 21:20:43 UTC; 6s ago

Windows

Get-Service google-cloud-ops-agent

Status   Name                    DisplayName
------   ----                    -----------
Stopped  google-cloud-ops-agent  Google Cloud Ops Agent

如需消除此错误,请运行以下命令来启动服务:

Linux

sudo service google-cloud-ops-agent start

Windows

Start-Service google-cloud-ops-agent

如果服务无法启动,则说明配置可能无效。

与当前已安装的代理冲突

  • 虚拟机已安装 Cloud Logging 代理Cloud Monitoring 代理,并且其配置与新代理的配置冲突。错误消息可能类似于以下内容:

    Windows

    We detected an existing Windows service for the StackdriverLogging agent,
    which is not compatible with the Ops Agent when the Ops Agent configuration
    has a non-empty logging section. Please either remove the logging section
    from the Ops Agent configuration, or disable the StackdriverLogging agent,
    and then retry enabling the Ops Agent.
    

    如需修复此错误,您有以下两种选择:

    1. 停用 Ops Agent 配置文件的冲突部分。如需了解详情,请参阅配置 Ops Agent 指南。

    2. 停用有冲突的 Cloud Logging 代理Cloud Monitoring 代理

      1. 保存 Cloud Logging 代理的所有自定义配置文件。
      2. 卸载旧的 Cloud Monitoring 代理Cloud Logging 代理

      卸载代理后,Google Cloud Console 最多可能需要一小时才能报告此更改。

配置无效

如果配置无效,您可能会在尝试重启代理服务时看到以下错误:

Linux

$ sudo service google-cloud-ops-agent restart \
    && sudo service google-cloud-ops-agent status
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
   Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
           └─directories.conf
   Active: failed (Result: exit-code) since Wed 2021-06-30 22:21:08 UTC; 2s ago
  Process: 1141421 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_>
  Process: 1141847 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIR>
 Main PID: 1141421 (code=exited, status=0/SUCCESS)

Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Control process exited, code=exited status=1
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Service RestartSec=100ms expired, scheduling restart.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5.
Jun 30 22:21:08 centos8-2 systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.

使用 journalctl 获取确切的错误消息:

sudo journalctl -xe | grep "google_cloud_ops_agent_engine"

您可能会看到类似如下内容的消息:

Jun 30 22:00:26 centos8-2 google_cloud_ops_agent_engine[1141491]: 2021/06/30 22:00:26 the agent config file is not valid YAML. detailed error: yaml: line 21: did not find expected key

Windows

failed to generate config files: can't parse configuration: yaml: line 20: could not find expected ':'

如需修复该错误,请更正无效配置并重启代理。如需了解参考信息,请参阅配置 Ops Agent 指南。

代理在运行,但无法提取数据

使用 Metrics Explorer 查询代理 uptime 指标,并验证代理组件 google-cloud-ops-agent-metricsgoogle-cloud-ops-agent-logging 是否正在写入指标。

  1. 在控制台中,选择 Monitoring 或点击以下按钮:

    转至 Resources

  2. 在导航窗格中,选择 Metrics Explorer

  3. 选择 MQL 标签页。

  4. 输入以下查询,然后点击运行

    fetch gce_instance
    | metric 'agent.googleapis.com/agent/uptime'
    | align rate(1m)
    | every 1m
    

代理正在向 Cloud Logging 发送日志吗?

检查本地指标

以下步骤要求您通过 SSH 连接到虚拟机。

  • 日志记录模块正在运行吗?使用以下命令进行检查:

Linux

sudo systemctl status google-cloud-ops-agent"*"

Windows

以管理员身份打开 Windows PowerShell 并运行以下命令:

Get-Service google-cloud-ops-agent

您还可以在“服务”应用中检查服务状态,在“任务管理器”应用中检查正在运行的进程。

检查日志记录模块日志

此步骤要求您通过 SSH 连接到虚拟机。

您可以在 /var/log/google-cloud-ops-agent/subagents/*.log(对于 Linux)和 C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log(对于 Windows)中找到日志记录模块日志。如果没有日志,则说明代理服务未正常运行。请先转到“代理已安装,但无法运行”部分,以消除该状况。

  • 写入 Logging API 时,您可能会看到 403 权限错误。例如:

    [2020/10/13 18:55:09] [ warn] [output:stackdriver:stackdriver.0] error
    {
    "error": {
      "code": 403,
      "message": "Cloud Logging API has not been used in project 147627806769 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.",
      "status": "PERMISSION_DENIED",
      "details": [
        {
          "@type": "type.googleapis.com/google.rpc.Help",
          "links": [
            {
              "description": "Google developers console API activation",
              "url": "https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769"
            }
          ]
        }
      ]
    }
    }
    

    如需消除此错误,请启用 Logging API 并设置 Logs Writer 角色。

  • 您可能会看到 Logging API 的配额问题。例如:

    error="8:Insufficient tokens for quota 'logging.googleapis.com/write_requests' and limit 'WriteRequestsPerMinutePerProject' of service 'logging.googleapis.com' for consumer 'project_number:648320274015'." error_code="8"
    

    如需消除此错误,请增加配额或减少日志吞吐量。

  • 您可能会在模块日志中看到以下错误:

    {"error":"invalid_request","error_description":"Service account not enabled on this instance"}
    

    can't fetch token from the metadata server
    

    这些错误可能表示您在部署代理时没有服务帐号或指定的凭据。如需了解如何解决此问题,请参阅授权 Ops Agent

代理正在向 Cloud Monitoring 发送指标吗?

检查指标模块日志

此步骤要求您通过 SSH 连接到虚拟机。

您可以在 syslog 中找到指标模块日志。如果没有日志,则说明代理服务未正常运行。请先转到“代理已安装,但无法运行”部分,以消除该状况。

  • 写入 Monitoring API 时,您可能会看到 PermissionDenied 错误。如果 Ops Agent 的权限未正确配置,则会出现此错误。例如:

    Nov  2 14:51:27 test-ops-agent-error otelopscol[412]: 2021-11-02T14:51:27.343Z#011info#011exporterhelper/queued_retry.go:231#011Exporting failed. Will retry the request after interval.#011{"kind": "exporter", "name": "googlecloud", "error": "[rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist).; rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist).]", "interval": "6.934781228s"}
    

    如需消除此错误,请启用 Monitoring API 并设置 Monitoring Metric Writer 角色。

  • 写入 Monitoring API 时,您可能会看到 ResourceExhausted 错误。如果项目达到任何 Monitoring API 配额上限,则会出现此错误。例如:

    Nov  2 18:48:32 test-ops-agent-error otelopscol[441]: 2021-11-02T18:48:32.175Z#011info#011exporterhelper/queued_retry.go:231#011Exporting failed. Will retry the request after interval.#011{"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Total requests' and limit 'Total requests per minute per user' of service 'monitoring.googleapis.com' for consumer 'project_number:8563942476'.\nerror details: name = ErrorInfo reason = RATE_LIMIT_EXCEEDED domain = googleapis.com metadata = map[consumer:projects/8563942476 quota_limit:DefaultRequestsPerMinutePerUser quota_metric:monitoring.googleapis.com/default_requests service:monitoring.googleapis.com]", "interval": "2.641515416s"}
    

    如需消除此错误,请增加配额或减少指标吞吐量。

  • 您可能会在模块日志中看到以下错误:

    {"error":"invalid_request","error_description":"Service account not enabled on this instance"}
    

    can't fetch token from the metadata server
    

    这些错误可能表示您在部署代理时没有服务帐号或指定的凭据。如需了解如何解决此问题,请参阅授权 Ops Agent

非有害日志

以下日志是可以安全忽略的非有害日志垃圾内容的示例。

  • 从伪进程或受限进程中抓取指标时出错

    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:55.848Z        error        scraperhelper/scrapercontroller.go:205        Error scraping metrics        {"kind"
    : "receiver", "name": "hostmetrics/hostmetrics", "error": "[error reading process name for pid 2: readlink /proc/2/exe: no such file or directory; error reading process name for
    pid 3: readlink /proc/3/exe: no such file or directory; error reading process name for pid 4: readlink /proc/4/exe: no such file or directory; error reading process name for pid
    5: readlink /proc/5/exe: no such file or directory; error reading process name for pid 6: readlink /proc/6/exe: no such file or directory; error reading process name for pid 7: r
    eadlink /proc/7/exe: no such file or directory; error reading process name for pid 8: readlink /proc/8/exe: no such file or directory; error reading process name for pid 9: readl
    ink /proc/9/exe: no such file or directory; error reading process name for pid 10: readlink /proc/10/exe: no such file or directory; error reading process name for pid 11: readli
    nk /proc/11/exe: no such file or directory; error reading process name for pid 12: readlink /proc/12/exe: no such file or directory; error reading process name for pid 13: readli
    nk /proc/13/exe: no such file or directory; error reading process name for pid 14: readlink /proc/14/exe: no such file or directory; error reading process name for pid 15: readli
    nk /proc/15/exe: no such file or directory; error reading process name for pid 16: readlink /proc/16/exe: no such file or directory; error reading process name for pid 17: readli
    nk /proc/17/exe: no such file or directory; error reading process name for pid 18: readlink /proc/18/exe: no such file or directory; error reading process name for pid 19: readli
    nk /proc/19/exe: no such file or directory; error reading process name for pid 20: readlink /proc/20/exe: no such file or directory; error reading process name for pid 21: readli
    nk /proc/21/exe: no such file or directory; error reading process name for pid 22: readlink /proc/22/exe: no such file or directory; error reading process name for pid
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: 23: readlink /proc/23/exe: no such file or directory; error reading process name for pid 24: readlink /proc/24/exe: no such file
    or directory; error reading process name for pid 25: readlink /proc/25/exe: no such file or directory; error reading process name for pid 26: readlink /proc/26/exe: no such file
    or directory; error reading process name for pid 27: readlink /proc/27/exe: no such file or directory; error reading process name for pid 28: readlink /proc/28/exe: no such file
    or directory; error reading process name for pid 30: readlink /proc/30/exe: no such file or directory; error reading process name for pid 31: readlink /proc/31/exe: no such file
    or directory; error reading process name for pid 43: readlink /proc/43/exe: no such file or directory; error reading process name for pid 44: readlink /proc/44/exe: no such file
    or directory; error reading process name for pid 45: readlink /proc/45/exe: no such file or directory; error reading process name for pid 90: readlink /proc/90/exe: no such file
    or directory; error reading process name for pid 92: readlink /proc/92/exe: no such file or directory; error reading process name for pid 106: readlink /proc/106/exe: no such fi
    le or directory; error reading process name for pid 360: readlink /proc/360/exe: no such file or directory; error reading process name for pid 375: readlink /proc/375/exe: no suc
    h file or directory; error reading process name for pid 384: readlink /proc/384/exe: no such file or directory; error reading process name for pid 386: readlink /proc/386/exe: no
    such file or directory; error reading process name for pid 387: readlink /proc/387/exe: no such file or directory; error reading process name for pid 422: readlink /proc/422/exe
    : no such file or directory; error reading process name for pid 491: readlink /proc/491/exe: no such file or directory; error reading process name for pid 500: readlink /proc/500
    /exe: no such file or directory; error reading process name for pid 2121: readlink /proc/2121/exe: no such file or directory; error reading
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: process name for pid 2127: readlink /proc/2127/exe: no such file or directory]"}
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]:         /root/go/pkg/mod/go.opentelemetry.io/collector@v0.29.0/receiver/scraperhelper/scrapercontroller.go:205
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]:         /root/go/pkg/mod/go.opentelemetry.io/collector@v0.29.0/receiver/scraperhelper/scrapercontroller.go:186
    
  • 丢弃累积指标的第一个数据点时出错:

    Jul 13 17:28:03 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:03.092Z        info        exporterhelper/queued_retry.go:316        Exporting failed. Will retry the request a
    fter interval.        {"kind": "exporter", "name": "googlecloud/agent", "error": "rpc error: code = InvalidArgument desc = Field timeSeries[1].points[0].interval.start_time had a
    n invalid value of \"2021-07-13T10:25:18.061-07:00\": The start time must be before the end time (2021-07-13T10:25:18.061-07:00) for the non-gauge metric 'agent.googleapis.com/ag
    ent/uptime'.", "interval": "23.491024535s"}
    Jul 13 17:28:41 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:41.269Z        info        exporterhelper/queued_retry.go:316        Exporting failed. Will retry the request a
    fter interval.        {"kind": "exporter", "name": "googlecloud/agent", "error": "rpc error: code = InvalidArgument desc = Field timeSeries[0].points[0].interval.start_time had a
    n invalid value of \"2021-07-13T10:26:18.061-07:00\": The start time must be before the end time (2021-07-13T10:26:18.061-07:00) for the non-gauge metric 'agent.googleapis.com/ag
    ent/monitoring/point_count'.", "interval": "21.556591578s"}
    

如需了解 Cloud Monitoring 代理的其他已知问题,请参阅 Cloud Monitoring 代理问题排查指南

部分指标缺失或不一致

有少量指标在 Ops Agent 2.0.0 及更高版本上的处理方式与 Ops Agent“预览版”(低于 2.0.0 版)或 Monitoring 代理不同。

下表介绍了 Ops Agent 和 Monitoring 代理提取的数据之间的差异。
指标类型,省略了
agent.googleapis.com
Ops Agent(正式版) Ops Agent(预览版) Monitoring 代理
disk/bytes_used
disk/percent_used
提取时 device 标签中包含完整路径;例如 /dev/sda15

未针对 tmpfsudev 等虚拟设备提取该指标。
提取时 device 标签的路径中不含 /dev;例如 sda15

针对 tmpfsudev 等虚拟设备提取该指标。
提取时 device 标签的路径中不含 /dev;例如 sda15

针对 tmpfsudev 等虚拟设备提取该指标。
正式版列指 Ops Agent 2.0.0 版及更高版本。预览版列是指低于 2.0.0 的 Ops Agent 版本。

移除了由 Google Cloud Console 报告为已安装的代理

卸载代理后,Google Cloud Console 最多可能需要一小时才能报告此更改。

代理日志占用的空间过多

旧版本的 Ops Agent 可能会因 /var/log/google-cloud-ops-agent/subagents/logging-module.log 文件占用大量磁盘空间。查找大量消息,如下所示:

  [2022/04/30 05:23:38] [error] [input chunk] error writing data from tail.2 instance
  [2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
  [2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
  [2022/04/30 05:23:38] [error] [storage] [cio file] file is not mmap()ed: tail.2:2004860-1650614856.691268293.flb
  

如需解决此问题,请将 Ops Agent 升级到 2.17 版或更高版本。