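Note: The localhost:2020/api/v1/metrics endpoint mentioned at 3:18 in this video no longer applies to the Ops Agent. For other options, see "Agent is running, but data is not ingested".
This document helps you diagnose problems with installing or running the Ops Agent.
Agent diagnostics tool for Linux VMs
The agent diagnostics tool collects key local debugging information from a Linux VM for all of the following agents: the Ops Agent, the legacy Logging agent, and the legacy Monitoring agent. The debugging information includes project information, VM information, agent configuration, agent logs, and agent service status, all of which usually has to be gathered by hand. The tool also checks the local VM environment to make sure that it meets the requirements for the agents to run properly, such as network connectivity and required permissions.
When you file a customer case for an agent on a Linux VM, run the agent diagnostics tool and attach the collected information to the case. Before you attach the information to the support case, redact any sensitive information such as passwords. Providing this information reduces the time needed to troubleshoot the support case.
The agent diagnostics tool must be run from inside the Linux VM, so you typically need to connect to the VM over SSH first. The following commands retrieve the agent diagnostics tool and run it: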
curl -sSO https://dl.google.com/cloudagents/diagnose-agents.sh
sudo bash diagnose-agents.sh
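Follow the output of the script to find the files that contain the collected information. Typically, they are in the /var/tmp/google-agents directory, unless you customized the output directory when you ran the script.
For details, examine the diagnose-agents.sh script. There is no Windows version of this tool.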
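The redaction step mentioned above can be partly automated. The following is a minimal sketch, not part of the diagnostics tool: the function name redact_secrets and the list of key names are assumptions for illustration, and pattern matching cannot catch every secret, so review the files manually as well.

```shell
# Sketch: mask the value on any line that looks like "password: ..." or
# "token=...". The key names are assumptions; extend them as needed.
redact_secrets() {
  awk '{
    lower = tolower($0)
    if (match(lower, /(password|passwd|secret|token|api_key)[ \t]*[:=]/)) {
      # Keep everything up to and including the separator, drop the value.
      print substr($0, 1, RSTART + RLENGTH - 1) " [REDACTED]"
    } else {
      print
    }
  }'
}

# Usage (the input file name is a placeholder):
#   redact_secrets < collected-file.txt > collected-file.redacted.txt
```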
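Agent fails to install
You might encounter the following errors when you run the installation script.
The operating system is not supported. The error message might look like the following: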
Linux
https://packages.cloud.google.com/yum/repos/google-cloud-ops-agent-el6-x86_64-all/repodata/repomd.xml: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found" Trying other mirror. To address this issue please refer to the below wiki article https://wiki.centos.org/yum-errors If above article doesn't help to resolve this issue please use https://bugs.centos.org/. Error: Cannot retrieve repository metadata (repomd.xml) for repository: google-cloud-ops-agent. Please verify its path and try again
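The VM already has the Cloud Logging agent or the Cloud Monitoring agent installed, and it conflicts with the new agent. The error message might look like the following: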
Linux
Error: Problem: problem with installed package stackdriver-agent-6.0.5-1.el8.x86_64 - package google-cloud-ops-agent-0.1.0-1.el8.x86_64 conflicts with stackdriver-agent provided by stackdriver-agent-6.0.5-1.el8.x86_64
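The Ops Agent uses a new configuration file that is not compatible with the legacy agents. For details, see the Configuring the Ops Agent guide.
To resolve this error, do the following:
Save any custom configuration files for the Cloud Monitoring agent and the Cloud Logging agent.
Uninstall the legacy Cloud Monitoring agent and Cloud Logging agent.
After you uninstall an agent, it might take up to one hour for the Google Cloud Console to report the change.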
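Saving the legacy configuration before uninstalling can be sketched as follows. This is an illustration under assumptions: backup_agent_configs is a hypothetical helper, and /etc/stackdriver and /etc/google-fluentd are the usual configuration directories for the legacy Monitoring and Logging agents, so confirm the paths on your VM before relying on it.

```shell
# Sketch: archive each existing configuration directory into a destination
# directory. Missing directories are skipped with a message on stderr.
backup_agent_configs() (
  dest=$1; shift
  for dir in "$@"; do
    [ -d "$dir" ] || { echo "skip: $dir not found" >&2; continue; }
    name=$(basename "$dir")
    tar -czf "$dest/$name-config.tar.gz" -C "$(dirname "$dir")" "$name"
    echo "saved $dest/$name-config.tar.gz"
  done
)

# Usage (run as root; paths are the assumed legacy locations):
#   backup_agent_configs /var/tmp /etc/stackdriver /etc/google-fluentd
```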
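Agent is installed but does not run
Agent service is not running
When the agent services are running as expected, you might see the following status:
Linux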
computer@debian9:~$ sudo systemctl status google-cloud-ops-agent"*" ● google-cloud-ops-agent.service - Google Cloud Ops Agent Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled) Active: active (exited) since Thu 2021-08-05 20:33:44 UTC; 7s ago Process: 2240 ExecStart=/bin/true (code=exited, status=0/SUCCESS) Process: 2214 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS) Main PID: 2240 (code=exited, status=0/SUCCESS) Tasks: 0 (limit: 4915) CGroup: /system.slice/google-cloud-ops-agent.service Aug 05 20:33:44 debian9 systemd[1]: Starting Google Cloud Ops Agent... Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent. ● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled) Drop-In: /lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d └─directories.conf Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago Process: 2234 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS) Process: 2216 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS) Main PID: 2247 (fluent-bit) Tasks: 22 (limit: 4915) CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service └─2247 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-cloud-ops-agent-fluent-bit/fluent_bit_main.conf --parser /run/google-cloud-ops-agent-fluent-bit/fluent_bit_parser.conf --log_file /var/log/google-cloud-ops-agent/subagents/logging-module.log --storage_path /var/lib/google-cloud-ops-agent/fluent-bit/buffers Aug 05 20:33:44 
debian9 systemd[1]: Starting Google Cloud Ops Agent - Logging Agent... Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent - Logging Agent. Aug 05 20:33:44 debian9 fluent-bit[2247]: Fluent Bit v1.7.8 Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2019-2021 The Fluent Bit Authors Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2015-2018 Treasure Data Aug 05 20:33:44 debian9 fluent-bit[2247]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd Aug 05 20:33:44 debian9 fluent-bit[2247]: * https://fluentbit.io ● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled) Drop-In: /lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d └─directories.conf Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago Process: 2237 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS) Process: 2215 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS) Main PID: 2251 (otelopscol) Tasks: 6 (limit: 4915) CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service └─2251 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --add-instance-id=false --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z info builder/pipelines_builder.go:51 Pipeline is starting... {"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"} Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z info builder/pipelines_builder.go:62 Pipeline is started. 
{"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"} Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z info service/service.go:192 Starting receivers... Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z info builder/receivers_builder.go:70 Receiver is starting... {"kind": "receiver", "name": "hostmetrics/hostmetrics"} Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z info builder/receivers_builder.go:75 Receiver started. {"kind": "receiver", "name": "hostmetrics/hostmetrics"} Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z info builder/receivers_builder.go:70 Receiver is starting... {"kind": "receiver", "name": "prometheus/agent"} Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z info discovery/manager.go:195 Starting provider {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0", "subs": "[otel-collector]"} Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z info builder/receivers_builder.go:75 Receiver started. {"kind": "receiver", "name": "prometheus/agent"} Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z info service/collector.go:182 Everything is ready. Begin running and processing data. Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.256Z info discovery/manager.go:213 Discoverer channel closed {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0"}
Windows
Get-Service google-cloud-ops-agent*
Status   Name               DisplayName
------   ----               -----------
Running  google-cloud-op... Google Cloud Ops Agent
Running  google-cloud-op... Google Cloud Ops Agent - Logging Agent
Running  google-cloud-op... Google Cloud Ops Agent - Metrics Agent
If the agent service fails to run, you might see the following status:
Linux
$ sudo service google-cloud-ops-agent status
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-06-30 21:20:43 UTC; 6s ago
Windows
Get-Service google-cloud-ops-agent
Status   Name                     DisplayName
------   ----                     -----------
Stopped  google-cloud-ops-agent   Google Cloud Ops Agent
To resolve this error, run the following command to start the service:
Linux
sudo service google-cloud-ops-agent start
Windows
Start-Service google-cloud-ops-agent
If the service fails to start, the configuration might be invalid.
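On Linux, the status check can be condensed into one pass over the three units that appear in the status output earlier in this section. The following is a sketch only: check_ops_agent_units and the SYSTEMCTL override variable are hypothetical names introduced here so that the check can be exercised without systemd.

```shell
# Unit names taken from the systemctl status output shown above.
OPS_AGENT_UNITS="google-cloud-ops-agent google-cloud-ops-agent-fluent-bit google-cloud-ops-agent-opentelemetry-collector"

# Sketch: print the state of each unit and exit non-zero if any unit is
# not active. "systemctl is-active" prints "active" for running units and
# for the oneshot parent unit that shows "active (exited)".
check_ops_agent_units() (
  failed=0
  for unit in $OPS_AGENT_UNITS; do
    state=$(${SYSTEMCTL:-systemctl} is-active "$unit" 2>/dev/null)
    echo "$unit: ${state:-unknown}"
    [ "$state" = "active" ] || failed=1
  done
  exit "$failed"
)

# Usage:
#   check_ops_agent_units || sudo service google-cloud-ops-agent start
```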
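Conflict with currently installed agents
The VM already has the Cloud Logging agent or the Cloud Monitoring agent installed, and its configuration conflicts with that of the new agent. The error message might look like the following: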
Windows
We detected an existing Windows service for the StackdriverLogging agent, which is not compatible with the Ops Agent when the Ops Agent configuration has a non-empty logging section. Please either remove the logging section from the Ops Agent configuration, or disable the StackdriverLogging agent, and then retry enabling the Ops Agent.
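To fix this error, you have two options:
Disable the conflicting section of the Ops Agent configuration file. For details, see the Configuring the Ops Agent guide.
Disable the conflicting Cloud Logging agent or Cloud Monitoring agent.
- Save any custom configuration files for the Cloud Logging agent.
- Uninstall the legacy Cloud Monitoring agent and Cloud Logging agent.
After you uninstall an agent, it might take up to one hour for the Google Cloud Console to report the change.
Invalid configuration
If the configuration is invalid, you might see the following error when you try to restart the agent service: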
Linux
$ sudo service google-cloud-ops-agent restart \ && sudo service google-cloud-ops-agent status ● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: disabled) Drop-In: /usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d └─directories.conf Active: failed (Result: exit-code) since Wed 2021-06-30 22:21:08 UTC; 2s ago Process: 1141421 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_> Process: 1141847 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIR> Main PID: 1141421 (code=exited, status=0/SUCCESS) Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Control process exited, code=exited status=1 Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'. Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent. Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Service RestartSec=100ms expired, scheduling restart. Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5. Jun 30 22:21:08 centos8-2 systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent. Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly. Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'. Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
Use journalctl to get the exact error message:
sudo journalctl -xe | grep "google_cloud_ops_agent_engine"
You might see a message like the following:
Jun 30 22:00:26 centos8-2 google_cloud_ops_agent_engine[1141491]: 2021/06/30 22:00:26 the agent config file is not valid YAML. detailed error: yaml: line 21: did not find expected key
Windows
failed to generate config files: can't parse configuration: yaml: line 20: could not find expected ':'
To fix the error, correct the invalid configuration and restart the agent. For reference, see the Configuring the Ops Agent guide.
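To locate the offending part of the configuration on Linux, the line number in the "yaml: line N:" message can be extracted and the surrounding lines of the file printed. The sketch below assumes the message format shown above; yaml_error_line and show_config_context are hypothetical helper names. A YAML parser often reports the line where it noticed the problem rather than the line that caused it, which is why a few lines of context are shown.

```shell
# Sketch: read log text on stdin and print the first "yaml: line N" number.
yaml_error_line() {
  sed -n 's/.*yaml: line \([0-9][0-9]*\):.*/\1/p' | head -n 1
}

# Sketch: print up to three numbered lines before and after the given line.
show_config_context() (
  line=$1; file=$2
  start=$((line > 3 ? line - 3 : 1))
  awk -v s="$start" -v e="$((line + 3))" \
    'NR >= s && NR <= e { printf "%4d  %s\n", NR, $0 }' "$file"
)

# Usage:
#   n=$(sudo journalctl -xe | grep "google_cloud_ops_agent_engine" | yaml_error_line)
#   show_config_context "$n" /etc/google-cloud-ops-agent/config.yaml
```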
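Agent is running, but data is not ingested
Use Metrics Explorer to query the agent uptime metric, and verify that the agent component google-cloud-ops-agent-metrics or google-cloud-ops-agent-logging is writing to the metric.
In the Google Cloud console, select Monitoring.
In the navigation pane, select Metrics Explorer.
Select the MQL tab.
Enter the following query, and then click Run: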
fetch gce_instance | metric 'agent.googleapis.com/agent/uptime' | align rate(1m) | every 1m
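Is the agent sending logs to Cloud Logging?
Check local metrics
The following steps require you to connect to the VM over SSH.
- Is the logging module running? Use the following commands to check: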
Linux
sudo systemctl status google-cloud-ops-agent"*"
Windows
Open Windows PowerShell as administrator and run the following command:
Get-Service google-cloud-ops-agent
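You can also check the service status in the Services app and inspect running processes in the Task Manager app.
Check the logging module log
This step requires you to connect to the VM over SSH.
You can find the logging module logs at /var/log/google-cloud-ops-agent/subagents/*.log (Linux) and C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log (Windows). If there are no logs, the agent service is not running properly. Go to the "Agent is installed but does not run" section first to fix that condition.
You might see a 403 permission error when the agent writes to the Logging API. For example: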
[2020/10/13 18:55:09] [ warn] [output:stackdriver:stackdriver.0] error { "error": { "code": 403, "message": "Cloud Logging API has not been used in project 147627806769 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.", "status": "PERMISSION_DENIED", "details": [ { "@type": "type.googleapis.com/google.rpc.Help", "links": [ { "description": "Google developers console API activation", "url": "https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769" } ] } ] } }
To fix this error, enable the Logging API and set the Logs Writer role.
You might see a quota issue for the Logging API. For example:
error="8:Insufficient tokens for quota 'logging.googleapis.com/write_requests' and limit 'WriteRequestsPerMinutePerProject' of service 'logging.googleapis.com' for consumer 'project_number:648320274015'." error_code="8"
To fix this error, increase the quota or reduce the log throughput.
You might see the following errors in the module log:
{"error":"invalid_request","error_description":"Service account not enabled on this instance"}
or
can't fetch token from the metadata server
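These errors might indicate that you deployed the agent without a service account or specified credentials. For information about how to resolve this issue, see Authorize the Ops Agent.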
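The three error families above can be counted in one pass over the logging module log. The following sketch matches only the message fragments quoted in this section; summarize_logging_errors is a hypothetical helper name, and other failure modes are not detected.

```shell
# Sketch: read log text on stdin and count lines per known error family.
summarize_logging_errors() {
  awk '
    /"code": 403|PERMISSION_DENIED/ { perm++ }
    /Insufficient tokens for quota/ { quota++ }
    /Service account not enabled on this instance|can.t fetch token from the metadata server/ { auth++ }
    END {
      printf "permission_denied=%d quota_exceeded=%d credentials=%d\n", perm, quota, auth
    }'
}

# Usage:
#   sudo cat /var/log/google-cloud-ops-agent/subagents/*.log | summarize_logging_errors
```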
Is the agent sending metrics to Cloud Monitoring?
Check the metrics module log
This step requires you to connect to the VM over SSH.
You can find the metrics module logs in syslog. If there are no logs, the agent service is not running properly. Go to the "Agent is installed but does not run" section first to fix that condition.
You might see PermissionDenied errors when the agent writes to the Monitoring API. This error occurs if the permissions for the Ops Agent are not configured correctly. For example:
Nov 2 14:51:27 test-ops-agent-error otelopscol[412]: 2021-11-02T14:51:27.343Z#011info#011exporterhelper/queued_retry.go:231#011Exporting failed. Will retry the request after interval.#011{"kind": "exporter", "name": "googlecloud", "error": "[rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist).; rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist).]", "interval": "6.934781228s"}
To fix this error, enable the Monitoring API and set the Monitoring Metric Writer role.
You might see ResourceExhausted errors when the agent writes to the Monitoring API. This error occurs if the project reaches any Monitoring API quota limit. For example:
Nov 2 18:48:32 test-ops-agent-error otelopscol[441]: 2021-11-02T18:48:32.175Z#011info#011exporterhelper/queued_retry.go:231#011Exporting failed. Will retry the request after interval.#011{"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Total requests' and limit 'Total requests per minute per user' of service 'monitoring.googleapis.com' for consumer 'project_number:8563942476'.\nerror details: name = ErrorInfo reason = RATE_LIMIT_EXCEEDED domain = googleapis.com metadata = map[consumer:projects/8563942476 quota_limit:DefaultRequestsPerMinutePerUser quota_metric:monitoring.googleapis.com/default_requests service:monitoring.googleapis.com]", "interval": "2.641515416s"}
To fix this error, increase the quota or reduce the metric throughput.
You might see the following errors in the module log:
{"error":"invalid_request","error_description":"Service account not enabled on this instance"}
or
can't fetch token from the metadata server
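These errors might indicate that you deployed the agent without a service account or specified credentials. For information about how to resolve this issue, see Authorize the Ops Agent.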
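To see which export errors dominate, the gRPC status codes that otelopscol reports can be tallied from syslog. This is a sketch under assumptions: count_export_errors is a hypothetical helper, the syslog file is commonly /var/log/syslog on Debian and Ubuntu or /var/log/messages on RHEL-family systems, and the tally counts occurrences rather than log entries because one entry can repeat a code.

```shell
# Sketch: read syslog text on stdin and print "Code=count" for every
# "code = X" fragment found on otelopscol lines.
count_export_errors() {
  grep 'otelopscol' \
    | grep -o 'code = [A-Za-z]*' \
    | sort | uniq -c \
    | awk '{ print $4 "=" $1 }'
}

# Usage (the file path depends on the distribution):
#   sudo cat /var/log/syslog | count_export_errors
```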
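Non-harmful logs
The following logs are examples of harmless log spam that you can safely ignore.
Errors when scraping metrics from pseudo-processes or restricted processes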
Jul 13 17:28:55 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:55.848Z error scraperhelper/scrapercontroller.go:205 Error scraping metrics {"kind" : "receiver", "name": "hostmetrics/hostmetrics", "error": "[error reading process name for pid 2: readlink /proc/2/exe: no such file or directory; error reading process name for pid 3: readlink /proc/3/exe: no such file or directory; error reading process name for pid 4: readlink /proc/4/exe: no such file or directory; error reading process name for pid 5: readlink /proc/5/exe: no such file or directory; error reading process name for pid 6: readlink /proc/6/exe: no such file or directory; error reading process name for pid 7: r eadlink /proc/7/exe: no such file or directory; error reading process name for pid 8: readlink /proc/8/exe: no such file or directory; error reading process name for pid 9: readl ink /proc/9/exe: no such file or directory; error reading process name for pid 10: readlink /proc/10/exe: no such file or directory; error reading process name for pid 11: readli nk /proc/11/exe: no such file or directory; error reading process name for pid 12: readlink /proc/12/exe: no such file or directory; error reading process name for pid 13: readli nk /proc/13/exe: no such file or directory; error reading process name for pid 14: readlink /proc/14/exe: no such file or directory; error reading process name for pid 15: readli nk /proc/15/exe: no such file or directory; error reading process name for pid 16: readlink /proc/16/exe: no such file or directory; error reading process name for pid 17: readli nk /proc/17/exe: no such file or directory; error reading process name for pid 18: readlink /proc/18/exe: no such file or directory; error reading process name for pid 19: readli nk /proc/19/exe: no such file or directory; error reading process name for pid 20: readlink /proc/20/exe: no such file or directory; error reading process name for pid 21: readli nk /proc/21/exe: no such file or directory; error reading 
process name for pid 22: readlink /proc/22/exe: no such file or directory; error reading process name for pid Jul 13 17:28:55 debian9-trouble otelopscol[2134]: 23: readlink /proc/23/exe: no such file or directory; error reading process name for pid 24: readlink /proc/24/exe: no such file or directory; error reading process name for pid 25: readlink /proc/25/exe: no such file or directory; error reading process name for pid 26: readlink /proc/26/exe: no such file or directory; error reading process name for pid 27: readlink /proc/27/exe: no such file or directory; error reading process name for pid 28: readlink /proc/28/exe: no such file or directory; error reading process name for pid 30: readlink /proc/30/exe: no such file or directory; error reading process name for pid 31: readlink /proc/31/exe: no such file or directory; error reading process name for pid 43: readlink /proc/43/exe: no such file or directory; error reading process name for pid 44: readlink /proc/44/exe: no such file or directory; error reading process name for pid 45: readlink /proc/45/exe: no such file or directory; error reading process name for pid 90: readlink /proc/90/exe: no such file or directory; error reading process name for pid 92: readlink /proc/92/exe: no such file or directory; error reading process name for pid 106: readlink /proc/106/exe: no such fi le or directory; error reading process name for pid 360: readlink /proc/360/exe: no such file or directory; error reading process name for pid 375: readlink /proc/375/exe: no suc h file or directory; error reading process name for pid 384: readlink /proc/384/exe: no such file or directory; error reading process name for pid 386: readlink /proc/386/exe: no such file or directory; error reading process name for pid 387: readlink /proc/387/exe: no such file or directory; error reading process name for pid 422: readlink /proc/422/exe : no such file or directory; error reading process name for pid 491: readlink /proc/491/exe: no such file 
or directory; error reading process name for pid 500: readlink /proc/500 /exe: no such file or directory; error reading process name for pid 2121: readlink /proc/2121/exe: no such file or directory; error reading Jul 13 17:28:55 debian9-trouble otelopscol[2134]: process name for pid 2127: readlink /proc/2127/exe: no such file or directory]"} Jul 13 17:28:55 debian9-trouble otelopscol[2134]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport Jul 13 17:28:55 debian9-trouble otelopscol[2134]: /root/go/pkg/mod/go.opentelemetry.io/collector@v0.29.0/receiver/scraperhelper/scrapercontroller.go:205 Jul 13 17:28:55 debian9-trouble otelopscol[2134]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1 Jul 13 17:28:55 debian9-trouble otelopscol[2134]: /root/go/pkg/mod/go.opentelemetry.io/collector@v0.29.0/receiver/scraperhelper/scrapercontroller.go:186
Errors when dropping the first data point of cumulative metrics:
Jul 13 17:28:03 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:03.092Z info exporterhelper/queued_retry.go:316 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "googlecloud/agent", "error": "rpc error: code = InvalidArgument desc = Field timeSeries[1].points[0].interval.start_time had an invalid value of \"2021-07-13T10:25:18.061-07:00\": The start time must be before the end time (2021-07-13T10:25:18.061-07:00) for the non-gauge metric 'agent.googleapis.com/agent/uptime'.", "interval": "23.491024535s"}
Jul 13 17:28:41 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:41.269Z info exporterhelper/queued_retry.go:316 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "googlecloud/agent", "error": "rpc error: code = InvalidArgument desc = Field timeSeries[0].points[0].interval.start_time had an invalid value of \"2021-07-13T10:26:18.061-07:00\": The start time must be before the end time (2021-07-13T10:26:18.061-07:00) for the non-gauge metric 'agent.googleapis.com/agent/monitoring/point_count'.", "interval": "21.556591578s"}
For other known issues with the Cloud Monitoring agent, see the Cloud Monitoring agent troubleshooting guide.
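When reading the metrics module log, it can help to hide the two harmless message families described in this section so that other errors stand out. The sketch below keys on message fragments quoted in the examples; filter_benign_logs is a hypothetical helper name, and the stack-trace lines that follow a scrape error are not removed.

```shell
# Sketch: read log text on stdin and drop lines that belong to the two
# harmless message families.
filter_benign_logs() {
  grep -v \
    -e 'error reading process name for pid' \
    -e 'The start time must be before the end time'
}

# Usage:
#   sudo journalctl -u google-cloud-ops-agent-opentelemetry-collector | filter_benign_logs
```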
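Some metrics are missing or inconsistent
A small number of metrics are handled differently by Ops Agent version 2.0.0 and later than by the Ops Agent "preview" versions (earlier than 2.0.0) or by the Monitoring agent.
The following table describes the differences between the data ingested by the Ops Agent and by the Monitoring agent.
Metric types, omitting agent.googleapis.com | Ops Agent (GA)† | Ops Agent (Preview)† | Monitoring agent |
---|---|---|---|
disk/bytes_used and disk/percent_used | Ingested with the full path in the device label; for example, /dev/sda15. Not ingested for virtual devices such as tmpfs and udev. | Ingested without /dev in the path in the device label; for example, sda15. Ingested for virtual devices such as tmpfs and udev. | Ingested without /dev in the path in the device label; for example, sda15. Ingested for virtual devices such as tmpfs and udev. |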
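If you compare disk metrics across agent versions, for example in a script that joins exported time series, the device labels have to be brought to a common form first. The following is a minimal sketch of that normalization; normalize_device_label is a hypothetical helper and is not part of any agent.

```shell
# Sketch: strip a leading "/dev/" so that "/dev/sda15" (Ops Agent 2.0.0 and
# later) and "sda15" (preview versions and the Monitoring agent) compare equal.
normalize_device_label() {
  printf '%s\n' "${1#/dev/}"
}

# Usage:
#   normalize_device_label /dev/sda15
```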
Removed agent is still reported as installed by the Google Cloud Console
After you uninstall an agent, it might take up to one hour for the Google Cloud Console to report the change.
Agent logs take up too much space
Older versions of the Ops Agent might use a large amount of disk space for the /var/log/google-cloud-ops-agent/subagents/logging-module.log file. Look for a large number of messages like the following:
[2022/04/30 05:23:38] [error] [input chunk] error writing data from tail.2 instance
[2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
[2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
[2022/04/30 05:23:38] [error] [storage] [cio file] file is not mmap()ed: tail.2:2004860-1650614856.691268293.flb
To resolve this issue, upgrade the Ops Agent to version 2.17 or later.
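Whether an installed version already meets the 2.17 threshold can be checked with a version-aware comparison. The sketch below assumes that GNU "sort -V" is available on the VM; version_at_least is a hypothetical helper, and the package queries in the usage comment assume the package name google-cloud-ops-agent and may return a version with a distribution suffix that needs trimming first.

```shell
# Sketch: succeed when version $1 is at least version $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (pick the query that matches the package manager):
#   v=$(dpkg-query -W -f='${Version}' google-cloud-ops-agent)      # Debian, Ubuntu
#   v=$(rpm -q --queryformat '%{VERSION}' google-cloud-ops-agent)  # RHEL family, SLES
#   version_at_least "$v" 2.17.0 || echo "upgrade recommended"
```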