Troubleshoot the Ops Agent

Note: The localhost:2020/api/v1/metrics endpoint mentioned at 3:18 in this video no longer applies to the Ops Agent. For other options, see Agent is running, but data is not ingested.

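This document helps you diagnose problems with installing or running the Ops Agent.

Agent diagnostics tool for VMs

The agent diagnostics tool collects key local debugging information from a VM for all of the following agents: the Ops Agent, the legacy Logging agent, and the legacy Monitoring agent. The debugging information includes project information, VM information, agent configuration, agent logs, and agent service status, all of which you would otherwise have to collect manually. The tool also checks the local VM environment to verify that it meets the requirements for the agents to run properly, such as network connectivity and the required permissions.

When you file a customer case for an agent on a VM, run the agent diagnostics tool and attach the collected information to the case. Before attaching the information to the support case, redact any sensitive information such as passwords. Providing this information reduces the time needed to troubleshoot the support case.

The agent diagnostics tool must be run from inside the VM, so you typically need to connect to the VM over SSH first. The following commands retrieve the agent diagnostics tool and run it: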

Linux

curl -sSO https://dl.google.com/cloudagents/diagnose-agents.sh
sudo bash diagnose-agents.sh

Windows

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/diagnose-agents.ps1", "${env:UserProfile}\diagnose-agents.ps1")
Invoke-Expression "${env:UserProfile}\diagnose-agents.ps1"

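Follow the output of the script execution to locate the files that contain the collected information. Typically, you can find them in the /var/tmp/google-agents directory on Linux or in the $env:LOCALAPPDATA/Temp directory on Windows, unless you customized the output directory when you ran the script.

For details, inspect the diagnose-agents.sh script on Linux or the diagnose-agents.ps1 script on Windows.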
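
Before you attach the collected files to a support case, you can mask obvious secrets with a small filter. The following is a minimal sketch and not part of the diagnostics tool: the function name redact_secrets and the list of key names (password, passwd, secret, token) are assumptions chosen for illustration. The pattern is not exhaustive, so review the output manually.

```shell
# Hypothetical helper: mask the value on lines that look like "key: value"
# or "key=value" for a few assumed key names. Reads stdin, writes stdout.
redact_secrets() (
  sed -E 's/((password|passwd|secret|token)[[:space:]]*[:=][[:space:]]*).*/\1[REDACTED]/'
)
```

For example, redact_secrets < collected.txt > collected.redacted.txt writes a masked copy, where both file names are placeholders.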

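Agent fails to install

You might encounter the following errors when you run the installation script.

  • The operating system is not supported. The error message might look similar to the following: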

    Linux

    https://packages.cloud.google.com/yum/repos/google-cloud-ops-agent-el6-x86_64-all/repodata/repomd.xml: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
    Trying other mirror.
    To address this issue please refer to the below wiki article
    
    https://wiki.centos.org/yum-errors
    
    If above article doesn't help to resolve this issue please use https://bugs.centos.org/.
    
    Error: Cannot retrieve repository metadata (repomd.xml) for repository: google-cloud-ops-agent. Please verify its path and try again
    
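  • The VM already has the Cloud Logging agent or the Cloud Monitoring agent installed, and they conflict with the new agent. The error message might look similar to the following: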

    Linux

    Error:
    Problem: problem with installed package stackdriver-agent-6.0.5-1.el8.x86_64 - package google-cloud-ops-agent-0.1.0-1.el8.x86_64 conflicts with stackdriver-agent provided by stackdriver-agent-6.0.5-1.el8.x86_64
    

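    The Ops Agent uses a new configuration file that is incompatible with the legacy agents. For details, see the Configure the Ops Agent guide.

    To resolve this error, do the following:

    1. Save any custom configuration files for the Cloud Monitoring agent or the Cloud Logging agent.

    2. Uninstall the legacy Cloud Monitoring agent or Cloud Logging agent.

      After you uninstall an agent, it might take up to one hour for the Google Cloud console to report the change.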

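Agent is installed but fails to run

Agent service is not running

When the agent services are running as expected, you might see the following status:

Linux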

computer@debian9:~$ sudo systemctl status google-cloud-ops-agent"*"
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: active (exited) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2240 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
  Process: 2214 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
 Main PID: 2240 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent.service

Aug 05 20:33:44 debian9 systemd[1]: Starting Google Cloud Ops Agent...
Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent.

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
  Drop-In: /lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
           └─directories.conf
   Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2234 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
  Process: 2216 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
 Main PID: 2247 (fluent-bit)
    Tasks: 22 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service
           └─2247 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-cloud-ops-agent-fluent-bit/fluent_bit_main.conf --parser /run/google-cloud-ops-agent-fluent-bit/fluent_bit_parser.conf --log_file /var/log/google-cloud-ops-agent/subagents/logging-module.log --storage_path /var/lib/google-cloud-ops-agent/fluent-bit/buffers

Aug 05 20:33:44 debian9 systemd[1]: Starting Google Cloud Ops Agent - Logging Agent...
Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent - Logging Agent.
Aug 05 20:33:44 debian9 fluent-bit[2247]: Fluent Bit v1.7.8
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2015-2018 Treasure Data
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Aug 05 20:33:44 debian9 fluent-bit[2247]: * https://fluentbit.io

● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
  Drop-In: /lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d
           └─directories.conf
   Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2237 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
  Process: 2215 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
 Main PID: 2251 (otelopscol)
    Tasks: 6 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
           └─2251 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --add-instance-id=false --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml

Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z        info        builder/pipelines_builder.go:51        Pipeline is starting...        {"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z        info        builder/pipelines_builder.go:62        Pipeline is started.        {"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z        info        service/service.go:192        Starting receivers...
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z        info        builder/receivers_builder.go:70        Receiver is starting...        {"kind": "receiver", "name": "hostmetrics/hostmetrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z        info        builder/receivers_builder.go:75        Receiver started.        {"kind": "receiver", "name": "hostmetrics/hostmetrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        builder/receivers_builder.go:70        Receiver is starting...        {"kind": "receiver", "name": "prometheus/agent"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        discovery/manager.go:195        Starting provider        {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0", "subs": "[otel-collector]"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        builder/receivers_builder.go:75        Receiver started.        {"kind": "receiver", "name": "prometheus/agent"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        service/collector.go:182        Everything is ready. Begin running and processing data.
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.256Z        info        discovery/manager.go:213        Discoverer channel closed        {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0"}

Windows

Get-Service google-cloud-ops-agent*

Status   Name               DisplayName
------   ----               -----------
Running  google-cloud-op... Google Cloud Ops Agent
Running  google-cloud-op... Google Cloud Ops Agent - Logging Agent
Running  google-cloud-op... Google Cloud Ops Agent - Metrics Agent

If the agent service fails to run, you might see the following status:

Linux

$ sudo service google-cloud-ops-agent status
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-06-30 21:20:43 UTC; 6s ago

Windows

Get-Service google-cloud-ops-agent

Status   Name                    DisplayName
------   ----                    -----------
Stopped  google-cloud-ops-agent  Google Cloud Ops Agent

To resolve this error, run the following command to start the service:

Linux

sudo service google-cloud-ops-agent start

Windows

Start-Service google-cloud-ops-agent

If the service fails to start, the configuration might be invalid.
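
To check all three services from the status output at once on Linux, a small loop over systemctl is-active can help. This is a sketch that assumes a systemd-based VM; the function name report_agent_units is hypothetical, and the unit names are taken from the status output earlier in this section.

```shell
# Unit names as they appear in the systemctl status output.
OPS_AGENT_UNITS="google-cloud-ops-agent google-cloud-ops-agent-fluent-bit google-cloud-ops-agent-opentelemetry-collector"

# Hypothetical helper: print "<unit> <state>" for each unit and return
# non-zero if any unit is not in the "active" state.
report_agent_units() (
  rc=0
  for unit in $OPS_AGENT_UNITS; do
    # is-active exits non-zero for inactive or failed units, so ignore the
    # exit status and rely on the printed state instead.
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    [ -n "$state" ] || state="unknown"
    echo "$unit $state"
    [ "$state" = "active" ] || rc=1
  done
  exit "$rc"
)
```

If report_agent_units lists a unit that is not active, start the service with sudo service google-cloud-ops-agent start and run the check again.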

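Conflict with currently installed agents

  • The VM already has the Cloud Logging agent or the Cloud Monitoring agent installed, and its configuration conflicts with the configuration of the new agent. The error message might look similar to the following: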

    Windows

    We detected an existing Windows service for the StackdriverLogging agent,
    which is not compatible with the Ops Agent when the Ops Agent configuration
    has a non-empty logging section. Please either remove the logging section
    from the Ops Agent configuration, or disable the StackdriverLogging agent,
    and then retry enabling the Ops Agent.
    

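    To fix this error, you have two options:

    1. Disable the conflicting section of the Ops Agent configuration file. For details, see the Configure the Ops Agent guide.

    2. Disable the conflicting Cloud Logging agent or Cloud Monitoring agent:

      1. Save any custom configuration files for the Cloud Logging agent.
      2. Uninstall the legacy Cloud Monitoring agent or Cloud Logging agent.

      After you uninstall an agent, it might take up to one hour for the Google Cloud console to report the change.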

Invalid configuration

If the configuration is invalid, you might see the following error when you try to restart the agent service:

Linux

$ sudo service google-cloud-ops-agent restart \
    && sudo service google-cloud-ops-agent status
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
   Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
           └─directories.conf
   Active: failed (Result: exit-code) since Wed 2021-06-30 22:21:08 UTC; 2s ago
  Process: 1141421 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_>
  Process: 1141847 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIR>
 Main PID: 1141421 (code=exited, status=0/SUCCESS)

Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Control process exited, code=exited status=1
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Service RestartSec=100ms expired, scheduling restart.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5.
Jun 30 22:21:08 centos8-2 systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.

Use journalctl to get the exact error message:

sudo journalctl -xe | grep "google_cloud_ops_agent_engine"

You might see a message similar to the following:

Jun 30 22:00:26 centos8-2 google_cloud_ops_agent_engine[1141491]: 2021/06/30 22:00:26 the agent config file is not valid YAML. detailed error: yaml: line 21: did not find expected key

Windows

failed to generate config files: can't parse configuration: yaml: line 20: could not find expected ':'

To fix the error, correct the invalid configuration and restart the agent. For reference, see the Configure the Ops Agent guide.
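
The YAML error message includes a line number, which you can use to jump to the problem in the configuration file. The following sketch extracts that number and prints the surrounding lines. The function names yaml_error_line and show_config_context are hypothetical, and the default path is the Linux configuration file that appears in the status output earlier in this document. The actual mistake is often on or just before the reported line.

```shell
# Hypothetical helper: extract the line number from a message such as
# "... yaml: line 21: did not find expected key".
yaml_error_line() (
  printf '%s\n' "$1" | sed -n 's/.*yaml: line \([0-9][0-9]*\):.*/\1/p'
)

# Hypothetical helper: print the reported line with up to two lines of
# context on each side. $1 is the line number, $2 the config file.
show_config_context() (
  line=$1
  file=${2:-/etc/google-cloud-ops-agent/config.yaml}
  start=$((line - 2))
  [ "$start" -ge 1 ] || start=1
  sed -n "${start},$((line + 2))p" "$file"
)
```

A possible use is show_config_context "$(yaml_error_line "$message")", where message holds one line copied from the journalctl output.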

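Agent is running, but data is not ingested

Use Metrics Explorer to query the agent uptime metric, and verify that the agent components google-cloud-ops-agent-metrics and google-cloud-ops-agent-logging are writing the metric.

  1. In the Google Cloud console, select Monitoring.

  2. In the navigation pane, select Metrics Explorer.

  3. Select the MQL tab.

  4. Enter the following query, and then click Run: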

    fetch gce_instance
    | metric 'agent.googleapis.com/agent/uptime'
    | align rate(1m)
    | every 1m
    

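Is the agent sending logs to Cloud Logging?

Check local metrics

The following steps require you to connect to the VM over SSH.

  • Is the logging module running? Use the following commands to check: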

Linux

sudo systemctl status google-cloud-ops-agent"*"

Windows

Open Windows PowerShell as an administrator and run the following command:

Get-Service google-cloud-ops-agent

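You can also check the service status in the Services app and check the running processes in the Task Manager app.

Check the logging module log

This step requires you to connect to the VM over SSH.

You can find the logging module log at /var/log/google-cloud-ops-agent/subagents/*.log on Linux and at C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log on Windows. If there are no logs, the agent service is not running properly. Go to the "Agent is installed but fails to run" section first to fix that condition.

  • You might see a 403 permission error when writing to the Logging API. For example: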

    [2020/10/13 18:55:09] [ warn] [output:stackdriver:stackdriver.0] error
    {
    "error": {
      "code": 403,
      "message": "Cloud Logging API has not been used in project 147627806769 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.",
      "status": "PERMISSION_DENIED",
      "details": [
        {
          "@type": "type.googleapis.com/google.rpc.Help",
          "links": [
            {
              "description": "Google developers console API activation",
              "url": "https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769"
            }
          ]
        }
      ]
    }
    }
    

    To resolve this error, enable the Logging API and set the Logs Writer role.

  • You might see a quota issue for the Logging API. For example:

    error="8:Insufficient tokens for quota 'logging.googleapis.com/write_requests' and limit 'WriteRequestsPerMinutePerProject' of service 'logging.googleapis.com' for consumer 'project_number:648320274015'." error_code="8"
    

    To resolve this error, increase the quota or reduce the log throughput.

  • You might see the following errors in the module log:

    {"error":"invalid_request","error_description":"Service account not enabled on this instance"}
    

    can't fetch token from the metadata server
    

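    These errors might indicate that you deployed the agent without a service account or without specified credentials. For information about how to resolve this issue, see Authorize the Ops Agent.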
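
To get a quick overview of which of these error types appear in the logging module log on Linux, you can count the matching lines. The following sketch is not an official tool: the function name classify_logging_errors is hypothetical, and the patterns are taken from the example messages in this section, so other wordings are not detected.

```shell
# Hypothetical helper: read log lines on stdin and count the lines that
# match the permission, quota, and credential examples from this section.
classify_logging_errors() (
  awk '
    /PERMISSION_DENIED/                    { permission++ }
    /Insufficient tokens for quota/        { quota++ }
    /Service account not enabled/          { credentials++ }
    /fetch token from the metadata server/ { credentials++ }
    END {
      printf "permission=%d quota=%d credentials=%d\n", permission, quota, credentials
    }'
)
```

One way to use it is sudo cat /var/log/google-cloud-ops-agent/subagents/logging-module.log | classify_logging_errors. A nonzero count points to the matching bullet in this section.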

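Is the agent sending metrics to Cloud Monitoring?

Check the metrics module log

This step requires you to connect to the VM over SSH.

You can find the metrics module log in syslog. If there are no logs, the agent service is not running properly. Go to the "Agent is installed but fails to run" section first to fix that condition.

  • You might see a PermissionDenied error when writing to the Monitoring API. This error occurs if the permissions for the Ops Agent are not configured correctly. For example: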

    Nov  2 14:51:27 test-ops-agent-error otelopscol[412]: 2021-11-02T14:51:27.343Z#011info#011exporterhelper/queued_retry.go:231#011Exporting failed. Will retry the request after interval.#011{"kind": "exporter", "name": "googlecloud", "error": "[rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist).; rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist).]", "interval": "6.934781228s"}
    

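    To resolve this error, enable the Monitoring API and set the Monitoring Metric Writer role.

  • You might see a ResourceExhausted error when writing to the Monitoring API. This error occurs if the project reaches the limit of any Monitoring API quota. For example: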

    Nov  2 18:48:32 test-ops-agent-error otelopscol[441]: 2021-11-02T18:48:32.175Z#011info#011exporterhelper/queued_retry.go:231#011Exporting failed. Will retry the request after interval.#011{"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Total requests' and limit 'Total requests per minute per user' of service 'monitoring.googleapis.com' for consumer 'project_number:8563942476'.\nerror details: name = ErrorInfo reason = RATE_LIMIT_EXCEEDED domain = googleapis.com metadata = map[consumer:projects/8563942476 quota_limit:DefaultRequestsPerMinutePerUser quota_metric:monitoring.googleapis.com/default_requests service:monitoring.googleapis.com]", "interval": "2.641515416s"}
    

    To resolve this error, increase the quota or reduce the metrics throughput.

  • You might see the following errors in the module log:

    {"error":"invalid_request","error_description":"Service account not enabled on this instance"}
    

    can't fetch token from the metadata server
    

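    These errors might indicate that you deployed the agent without a service account or without specified credentials. For information about how to resolve this issue, see Authorize the Ops Agent.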
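
Because the metrics module writes to syslog, it can help to summarize the gRPC status codes that the exporter reports. This is a sketch under assumptions: the function name summarize_rpc_errors is hypothetical, and the location of syslog varies by distribution (commonly /var/log/syslog on Debian and Ubuntu, and /var/log/messages on CentOS and RHEL).

```shell
# Hypothetical helper: read syslog lines on stdin, keep the otelopscol
# lines, and print "<gRPC code>=<number of lines>" for each code found.
summarize_rpc_errors() (
  grep 'otelopscol' |
    sed -n 's/.*rpc error: code = \([A-Za-z]*\) desc.*/\1/p' |
    sort | uniq -c | awk '{ print $2 "=" $1 }'
)
```

For example, sudo cat /var/log/syslog | summarize_rpc_errors prints one line per status code, such as PermissionDenied or ResourceExhausted, which map to the bullets in this section.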

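Check the agent self logs

If the agent fails to ingest logs into Cloud Logging, you might need to inspect the logs locally on the VM to troubleshoot.

Linux

To inspect the self logs that are written to Journald, run the following command: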

journalctl -u "google-cloud-ops-agent*"

To inspect the self logs that the logging module writes to disk, run the following command:

vim /var/log/google-cloud-ops-agent/subagents/logging-module.log

Windows

To inspect the self logs that are written to Windows Event Logs, run the following command:

Get-WinEvent -FilterHashtable @{ Logname='Application'; ProviderName='google-cloud-ops-agent*' } | Format-Table -AutoSize -Wrap

To inspect the self logs that the logging module writes to disk, run the following command:

notepad "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log"

To inspect the Ops Agent service logs from the Windows Service Control Manager, run the following command:

Get-WinEvent -FilterHashtable @{ Logname='System'; ProviderName='Service Control Manager' } | Where-Object -Property Message -Match 'Google Cloud Ops Agent' | Format-Table -AutoSize -Wrap

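Set up self log file rotation on Linux VMs

To limit the size of the logging subagent log at /var/log/google-cloud-ops-agent/subagents/logging-module.log, install and configure the logrotate utility.

  1. Install the logrotate utility by running the following command:

    On Debian and Ubuntu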

    sudo apt install logrotate
    

    On CentOS, RHEL, and Fedora

    sudo yum install logrotate
    
  2. Create a logrotate config file at /etc/logrotate.d/google-cloud-ops-agent.conf.

    sudo tee /etc/logrotate.d/google-cloud-ops-agent.conf > /dev/null << EOF
    # logrotate config to rotate Google Cloud Ops Agent self log file.
    # See https://manpages.debian.org/jessie/logrotate/logrotate.8.en.html for
    # the full options.
    /var/log/google-cloud-ops-agent/subagents/logging-module.log
    {
        # Log files are rotated every day.
        daily
        # Log files are rotated this many times before being removed. This
        # effectively limits the disk space used by the Ops Agent self log files.
        rotate 30
        # Log files are rotated when they grow bigger than maxsize even before the
        # additionally specified time interval
        maxsize 256M
        # Skip rotation if the log file is missing.
        missingok
        # Do not rotate the log if it is empty.
        notifempty
        # Old versions of log files are compressed with gzip by default.
        compress
        # Postpone compression of the previous log file to the next rotation
        # cycle.
        delaycompress
    }
    EOF
    
  3. Set up a crontab or systemd timer to trigger the logrotate utility periodically.
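
One way to trigger logrotate periodically is a cron entry. The following sketch prints an hourly entry intended for a file under /etc/cron.d. The function name, the file name google-cloud-ops-agent-logrotate, the hourly schedule, and the /usr/sbin/logrotate path are all assumptions, so adjust them for your distribution. Many distributions already run logrotate once a day for every file in /etc/logrotate.d, in which case an extra entry is only useful if you want the maxsize check to happen more often.

```shell
# Hypothetical helper: print a cron.d entry that runs logrotate every hour
# as root against the Ops Agent logrotate config created in step 2.
logrotate_cron_entry() (
  printf '%s\n' \
    '# minute hour day-of-month month day-of-week user command' \
    '0 * * * * root /usr/sbin/logrotate /etc/logrotate.d/google-cloud-ops-agent.conf'
)
```

To install the entry, you could run logrotate_cron_entry | sudo tee /etc/cron.d/google-cloud-ops-agent-logrotate > /dev/null.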

After log rotation takes effect, you see rotated files in the /var/log/google-cloud-ops-agent/subagents/ directory. The result is similar to the following output:

/var/log/google-cloud-ops-agent/subagents$ ls -lh
total 24K
-rw-r--r-- 1 root root  717 Sep  3 19:54 logging-module.log
-rw-r--r-- 1 root root 6.8K Sep  3 19:51 logging-module.log.1
-rw-r--r-- 1 root root  874 Sep  3 19:50 logging-module.log.2.gz
-rw-r--r-- 1 root root  873 Sep  3 19:50 logging-module.log.3.gz
-rw-r--r-- 1 root root 3.2K Sep  3 19:34 logging-module.log.4.gz

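To test log rotation, do the following:

  1. Temporarily reduce the file size that triggers rotation by setting the maxsize value in the /etc/logrotate.d/google-cloud-ops-agent.conf file to 1k.

  2. Make the agent self log file grow larger than 1K by restarting the agent a few times: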

    sudo service google-cloud-ops-agent restart
    
  3. Wait for the crontab or systemd timer to trigger the logrotate utility, or trigger the logrotate utility manually by running the following command:

    sudo logrotate /etc/logrotate.d/google-cloud-ops-agent.conf
    
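  4. Verify that you see rotated log files in the /var/log/google-cloud-ops-agent/subagents/ directory.

  5. Reset the log rotation configuration by restoring the original maxsize value.

Completely reset the agent state

If the agent enters an unrecoverable state, follow these steps to restore the agent to a fresh state.

Linux

Stop the agent service: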

sudo service google-cloud-ops-agent stop

Remove the agent package:

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --uninstall --remove-repo

Remove the agent self logs on disk:

sudo rm -rf /var/log/google-cloud-ops-agent

Remove the agent local buffers on disk:

sudo rm -rf /var/lib/google-cloud-ops-agent/fluent-bit/buffers/*/

Reinstall and restart the agent:

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
sudo service google-cloud-ops-agent restart

Windows

Stop the agent service:

Stop-Service google-cloud-ops-agent -Force;
Get-Service google-cloud-ops-agent* | %{sc.exe delete $_};
taskkill /f /fi "SERVICES eq google-cloud-ops-agent*";

Remove the agent package:

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1");
$env:REPO_SUFFIX="";
Invoke-Expression "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1 -Uninstall -RemoveRepo"

Remove the agent self logs on disk:

rmdir -R -ErrorAction SilentlyContinue "C:\ProgramData\Google\Cloud Operations\Ops Agent\log";

Remove the agent local buffers on disk:

Get-ChildItem -Path "C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\" -Directory -ErrorAction SilentlyContinue | %{rm -r -Path $_.FullName}

Reinstall and restart the agent:

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1");
$env:REPO_SUFFIX="";
Invoke-Expression "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall"

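Reset but save the buffer files

If the VM has no corrupted buffer chunks (that is, there are no format check failed messages in the Ops Agent self log file), you can skip the preceding commands that remove the local buffers when you reset the agent state.

If the VM has corrupted buffer chunks, you must remove them. The following options describe different ways to handle the buffers. The other steps described in Completely reset the agent state still apply.

  • Option 1: Delete the entire buffers directory. This is the simplest option, but it can cause the loss of uncorrupted buffered logs, or duplicate logs because the position files are lost.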

    Linux

    sudo rm -rf /var/lib/google-cloud-ops-agent/fluent-bit/buffers
    

    Windows

    rmdir -R -ErrorAction SilentlyContinue "C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers";
    
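  • Option 2: Delete the buffer subdirectories from the buffers directory, but keep the position files. This option is described in Completely reset the agent state.

  • Option 3: If you don't want to delete all of the buffer files, you can extract the names of the corrupted buffer files from the agent self logs and delete only the corrupted buffer files.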

    Linux

    grep "format check failed" /var/log/google-cloud-ops-agent/subagents/logging-module.log | sed 's|.*format check failed: |/var/lib/google-cloud-ops-agent/fluent-bit/buffers/|' | xargs sudo rm -f
    

    Windows

    $oalogspath="C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log";
    if (Test-Path $oalogspath) {
      Select-String "format check failed" $oalogspath |
      %{$_ -replace '.*format check failed: (.*)/(.*)', '$1\$2'} |
      %{rm -ErrorAction SilentlyContinue -Path ('C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\' + $_)}
    };
    
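  • Option 4: If there are many corrupted buffers and you want to reprocess all log files, you can use the commands from Option 3 and also delete the position files, which store the Ops Agent's progress for each log file. Deleting the position files can cause duplicates of logs that were already ingested successfully. This option reprocesses only the current log files; it does not reprocess files that have already been rotated out or logs from other sources, such as a TCP port. The position files are stored in the buffers directory as files. The local buffers are stored in the buffers directory as subdirectories.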

    Linux

    grep "format check failed" /var/log/google-cloud-ops-agent/subagents/logging-module.log | sed 's|.*format check failed: |/var/lib/google-cloud-ops-agent/fluent-bit/buffers/|' | xargs sudo rm -f
    sudo find /var/lib/google-cloud-ops-agent/fluent-bit/buffers -maxdepth 1 -type f -delete
    

    Windows

    $oalogspath="C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log";
    if (Test-Path $oalogspath) {
      Select-String "format check failed" $oalogspath |
      %{$_ -replace '.*format check failed: (.*)/(.*)', '$1\$2'} |
      %{rm -ErrorAction SilentlyContinue -Path ('C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\' + $_)}
    };
    Get-ChildItem -Path "C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\" -File -ErrorAction SilentlyContinue | %{$_.Delete()}
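
Before deleting anything with the Linux commands in Option 3 or Option 4, it can be useful to list the affected files first. The following sketch reuses the same grep and sed pattern but only prints each corrupted chunk path once. The function name list_corrupted_chunks and the BUFFERS_DIR variable are hypothetical; the directory is the Linux buffers path used throughout this section.

```shell
# Linux buffers directory used by the commands in this section.
BUFFERS_DIR=/var/lib/google-cloud-ops-agent/fluent-bit/buffers

# Hypothetical helper: read logging-module.log lines on stdin and print the
# full path of every chunk reported as "format check failed", without
# duplicates. Nothing is deleted.
list_corrupted_chunks() (
  grep 'format check failed' |
    sed "s|.*format check failed: |$BUFFERS_DIR/|" |
    sort -u
)
```

For example, sudo cat /var/log/google-cloud-ops-agent/subagents/logging-module.log | list_corrupted_chunks shows the files that the Option 3 command would remove. After reviewing the list, you can pipe the same output to xargs sudo rm -f.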
    

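Known issues

The following sections describe known common issues. For issues that have been fixed or mitigated, follow the specific instructions to get the fix.

Non-harmful logs

  • Errors when scraping metrics from pseudo-processes or restricted processes

    The following logs are harmless and can be safely ignored. To eliminate these logs, upgrade the Ops Agent to version 2.10.0 or later.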

    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:55.848Z        error        scraperhelper/scrapercontroller.go:205        Error scraping metrics        {"kind"
    : "receiver", "name": "hostmetrics/hostmetrics", "error": "[error reading process name for pid 2: readlink /proc/2/exe: no such file or directory; error reading process name for
    pid 3: readlink /proc/3/exe: no such file or directory; error reading process name for pid 4: readlink /proc/4/exe: no such file or directory; error reading process name for pid
    5: readlink /proc/5/exe: no such file or directory; error reading process name for pid 6: readlink /proc/6/exe: no such file or directory; error reading process name for pid 7: r
    eadlink /proc/7/exe: no such file or directory; error reading process name for pid 8: readlink /proc/8/exe: no such file or directory; error reading process name for pid 9: readl
    ink /proc/9/exe: no such file or directory; error reading process name for pid 10: readlink /proc/10/exe: no such file or directory; error reading process name for pid 11: readli
    nk /proc/11/exe: no such file or directory; error reading process name for pid 12: readlink /proc/12/exe: no such file or directory; error reading process name for pid 13: readli
    nk /proc/13/exe: no such file or directory; error reading process name for pid 14: readlink /proc/14/exe: no such file or directory; error reading process name for pid 15: readli
    nk /proc/15/exe: no such file or directory; error reading process name for pid 16: readlink /proc/16/exe: no such file or directory; error reading process name for pid 17: readli
    nk /proc/17/exe: no such file or directory; error reading process name for pid 18: readlink /proc/18/exe: no such file or directory; error reading process name for pid 19: readli
    nk /proc/19/exe: no such file or directory; error reading process name for pid 20: readlink /proc/20/exe: no such file or directory; error reading process name for pid 21: readli
    nk /proc/21/exe: no such file or directory; error reading process name for pid 22: readlink /proc/22/exe: no such file or directory; error reading process name for pid
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: 23: readlink /proc/23/exe: no such file or directory; error reading process name for pid 24: readlink /proc/24/exe: no such file
    or directory; error reading process name for pid 25: readlink /proc/25/exe: no such file or directory; error reading process name for pid 26: readlink /proc/26/exe: no such file
    or directory; error reading process name for pid 27: readlink /proc/27/exe: no such file or directory; error reading process name for pid 28: readlink /proc/28/exe: no such file
    or directory; error reading process name for pid 30: readlink /proc/30/exe: no such file or directory; error reading process name for pid 31: readlink /proc/31/exe: no such file
    or directory; error reading process name for pid 43: readlink /proc/43/exe: no such file or directory; error reading process name for pid 44: readlink /proc/44/exe: no such file
    or directory; error reading process name for pid 45: readlink /proc/45/exe: no such file or directory; error reading process name for pid 90: readlink /proc/90/exe: no such file
    or directory; error reading process name for pid 92: readlink /proc/92/exe: no such file or directory; error reading process name for pid 106: readlink /proc/106/exe: no such fi
    le or directory; error reading process name for pid 360: readlink /proc/360/exe: no such file or directory; error reading process name for pid 375: readlink /proc/375/exe: no suc
    h file or directory; error reading process name for pid 384: readlink /proc/384/exe: no such file or directory; error reading process name for pid 386: readlink /proc/386/exe: no
    such file or directory; error reading process name for pid 387: readlink /proc/387/exe: no such file or directory; error reading process name for pid 422: readlink /proc/422/exe
    : no such file or directory; error reading process name for pid 491: readlink /proc/491/exe: no such file or directory; error reading process name for pid 500: readlink /proc/500
    /exe: no such file or directory; error reading process name for pid 2121: readlink /proc/2121/exe: no such file or directory; error reading
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: process name for pid 2127: readlink /proc/2127/exe: no such file or directory]"}
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]:         /root/go/pkg/mod/go.opentelemetry.io/collector@v0.29.0/receiver/scraperhelper/scrapercontroller.go:205
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]:         /root/go/pkg/mod/go.opentelemetry.io/collector@v0.29.0/receiver/scraperhelper/scrapercontroller.go:186
    
  • Errors when dropping the first data point of cumulative metrics

    The following logs are harmless and can be safely ignored.

    Jul 13 17:28:03 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:03.092Z        info        exporterhelper/queued_retry.go:316        Exporting failed. Will retry the request a
    fter interval.        {"kind": "exporter", "name": "googlecloud/agent", "error": "rpc error: code = InvalidArgument desc = Field timeSeries[1].points[0].interval.start_time had a
    n invalid value of \"2021-07-13T10:25:18.061-07:00\": The start time must be before the end time (2021-07-13T10:25:18.061-07:00) for the non-gauge metric 'agent.googleapis.com/ag
    ent/uptime'.", "interval": "23.491024535s"}
    Jul 13 17:28:41 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:41.269Z        info        exporterhelper/queued_retry.go:316        Exporting failed. Will retry the request a
    fter interval.        {"kind": "exporter", "name": "googlecloud/agent", "error": "rpc error: code = InvalidArgument desc = Field timeSeries[0].points[0].interval.start_time had a
    n invalid value of \"2021-07-13T10:26:18.061-07:00\": The start time must be before the end time (2021-07-13T10:26:18.061-07:00) for the non-gauge metric 'agent.googleapis.com/ag
    ent/monitoring/point_count'.", "interval": "21.556591578s"}
    

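Some metrics are missing or inconsistent

A small number of metrics are handled differently by Ops Agent version 2.0.0 and later than by the Ops Agent "preview" versions (earlier than 2.0.0) or the Monitoring agent.

The following list describes the differences in the data ingested by the Ops Agent and the Monitoring agent for the metric types disk/bytes_used and disk/percent_used (the agent.googleapis.com prefix is omitted).

  • Ops Agent (GA): Ingested with the full path in the device label; for example, /dev/sda15. Not ingested for virtual devices such as tmpfs and udev.

  • Ops Agent (Preview): Ingested without /dev in the path in the device label; for example, sda15. Ingested for virtual devices such as tmpfs and udev.

  • Monitoring agent: Ingested without /dev in the path in the device label; for example, sda15. Ingested for virtual devices such as tmpfs and udev.

GA refers to Ops Agent version 2.0.0 and later. Preview refers to Ops Agent versions earlier than 2.0.0.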

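Removed agents reported by the Google Cloud console as installed

After you uninstall an agent, it might take up to one hour for the Google Cloud console to report the change.

Agent self logs consume too much CPU, memory, and disk space

Because of corrupted buffer chunks, older versions of the Ops Agent might consume a lot of CPU, memory, and disk space through their self logs. The self log is the /var/log/google-cloud-ops-agent/subagents/logging-module.log file on Linux VMs and the C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log file on Windows VMs. When this happens, you see a large number of messages like the following in the logging-module.log file.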

  [2022/04/30 05:23:38] [error] [input chunk] error writing data from tail.2 instance
  [2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
  [2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
  [2022/04/30 05:23:38] [error] [storage] [cio file] file is not mmap()ed: tail.2:2004860-1650614856.691268293.flb
  

To resolve this issue, upgrade the Ops Agent to version 2.17 or later, and completely reset the agent state.
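
To check whether an installed version already meets a minimum such as 2.17, you can compare version strings. This sketch assumes GNU sort with the -V option. The function name version_at_least is hypothetical, and how you obtain the installed version depends on your package manager; the dpkg-query call in the comment is one possibility on Debian and Ubuntu.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to
# version $2, using the version ordering of GNU sort -V.
version_at_least() (
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
)

# Example (not run here): read the installed version on Debian or Ubuntu and
# strip any distribution suffix that follows a "~" before comparing.
#   installed=$(dpkg-query --show --showformat '${Version}' google-cloud-ops-agent)
#   version_at_least "${installed%%~*}" 2.17 || echo "upgrade needed"
```

The comparison follows the textual version ordering of sort -V, so 2.17 sorts before 2.17.0. Pass the minimum version in its shortest form, as in the example.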

Corrupted performance counters on Windows

If the metrics subagent fails to start, you might see one of the following errors in Cloud Logging:

Failed to retrieve perf counter object "LogicalDisk"
Failed to retrieve perf counter object "Memory"
Failed to retrieve perf counter object "System"

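These errors can occur if the system's performance counters are corrupted. You can resolve the errors by rebuilding the performance counters. In PowerShell, run the following commands as an administrator: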

cd C:\Windows\system32
lodctr /R

The preceding command might occasionally fail; in that case, reload PowerShell and retry until it succeeds.

After the command succeeds, restart the Ops Agent:

Restart-Service -Name google-cloud-ops-agent -Force

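Incorrect event log timestamps on Windows

The timestamps associated with Windows Event Logs in Cloud Logging might be incorrect, depending on the system's time zone setting. If you see this happen, you can try one of the following workarounds.

Use the Coordinated Universal Time (UTC) time zone

In PowerShell, run the following commands as an administrator: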

Set-TimeZone -Id "UTC"
Restart-Service -Name "google-cloud-ops-agent-fluent-bit" -Force

Override the time zone setting for the logging subagent service only

In PowerShell, run the following commands as an administrator:

Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\google-cloud-ops-agent-fluent-bit" -Name "Environment" -Type "MultiString" -Value "TZ=UTC0"
Restart-Service -Name "google-cloud-ops-agent-fluent-bit" -Force