This document provides information to help you diagnose and resolve problems in the installation and start-up of the Ops Agent. If the agent is running but failing to ingest logs or metrics, see Troubleshoot data ingestion.
Agent fails to install
You may encounter the following errors when running the installation script.
The operating system is not supported. The error message might look similar to the following:
Linux
https://packages.cloud.google.com/yum/repos/google-cloud-ops-agent-el6-x86_64-all/repodata/repomd.xml: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
Trying other mirror.
To address this issue please refer to the below wiki article

https://wiki.centos.org/yum-errors

If above article doesn't help to resolve this issue please use https://bugs.centos.org/.

Error: Cannot retrieve repository metadata (repomd.xml) for repository: google-cloud-ops-agent. Please verify its path and try again
The VM already has the Cloud Logging agent or the Cloud Monitoring agent installed, and they conflict with the new agent. The error message might look similar to the following:
Linux
Error:
 Problem: problem with installed package stackdriver-agent-6.0.5-1.el8.x86_64
  - package google-cloud-ops-agent-0.1.0-1.el8.x86_64 conflicts with stackdriver-agent provided by stackdriver-agent-6.0.5-1.el8.x86_64
The Ops Agent uses new configuration files that are not compatible with the old agents. For more information, refer to the Configure the Ops Agent guide.
To fix this error, do the following:
Save the custom configuration files for the Cloud Monitoring agent and the Cloud Logging agent.
Uninstall the old Cloud Monitoring agent and Cloud Logging agent.
After you uninstall the agent, the Google Cloud console might take up to one hour to report this change.
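To confirm which legacy agent packages are present on a Linux VM, a check along the following lines might help. This is a minimal sketch rather than an official procedure. It assumes the legacy packages are named stackdriver-agent (the name in the error message above) and google-fluentd (an assumed name for the Cloud Logging agent package that this document does not state). The function name find_legacy_agents is hypothetical.

```shell
# Sketch: report legacy agent packages found in a list of installed packages.
# Assumption: the legacy packages are named stackdriver-agent (Cloud Monitoring
# agent) and google-fluentd (Cloud Logging agent).
find_legacy_agents() {
  # Reads one package name per line on stdin and prints only the legacy ones.
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -E '^(stackdriver-agent|google-fluentd)$' || true
}

# Example usage (not run here):
#   rpm -qa --qf '%{NAME}\n' | find_legacy_agents          # RPM-based distributions
#   dpkg-query -W -f '${Package}\n' | find_legacy_agents   # Debian and Ubuntu
```

Empty output suggests that neither legacy package is installed under the assumed names.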
Agent is installed but not running
If you have installed the agent but the agent is not running, then the problem might be one of the following:
- One of the primary components, "Metrics Agent" or "Logging Agent", has failed to start; see Agent services not running.
- One of the legacy agents is also installed on the VM; see Conflict with currently installed agents.
- A port that one of the components requires is in use by another process; see Required port is unavailable.
- The configuration of the Ops Agent is invalid; see Invalid configuration.
Agent services not running
When the agent service is running as expected, you might see the following status:
For Linux
computer@debian9:~$ sudo systemctl status google-cloud-ops-agent"*"
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: active (exited) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2240 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
  Process: 2214 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
 Main PID: 2240 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent.service

Aug 05 20:33:44 debian9 systemd[1]: Starting Google Cloud Ops Agent...
Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent.

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
  Drop-In: /lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
           └─directories.conf
   Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2234 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
  Process: 2216 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
 Main PID: 2247 (fluent-bit)
    Tasks: 22 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service
           └─2247 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-cloud-ops-agent-fluent-bit/fluent_bit_main.conf --parser /run/google-cloud-ops-agent-fluent-bit/fluent_bit_parser.conf --log_file /var/log/google-cloud-ops-agent/subagents/logging-module.log --storage_path /var/lib/google-cloud-ops-agent/fluent-bit/buffers

Aug 05 20:33:44 debian9 systemd[1]: Starting Google Cloud Ops Agent - Logging Agent...
Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent - Logging Agent.
Aug 05 20:33:44 debian9 fluent-bit[2247]: Fluent Bit v1.7.8
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2015-2018 Treasure Data
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Aug 05 20:33:44 debian9 fluent-bit[2247]: * https://fluentbit.io

● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
  Drop-In: /lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d
           └─directories.conf
   Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2237 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
  Process: 2215 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
 Main PID: 2251 (otelopscol)
    Tasks: 6 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
           └─2251 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --add-instance-id=false --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml

Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z info builder/pipelines_builder.go:51 Pipeline is starting... {"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z info builder/pipelines_builder.go:62 Pipeline is started. {"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z info service/service.go:192 Starting receivers...
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z info builder/receivers_builder.go:70 Receiver is starting... {"kind": "receiver", "name": "hostmetrics/hostmetrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z info builder/receivers_builder.go:75 Receiver started. {"kind": "receiver", "name": "hostmetrics/hostmetrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z info builder/receivers_builder.go:70 Receiver is starting... {"kind": "receiver", "name": "prometheus/agent"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z info discovery/manager.go:195 Starting provider {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0", "subs": "[otel-collector]"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z info builder/receivers_builder.go:75 Receiver started. {"kind": "receiver", "name": "prometheus/agent"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z info service/collector.go:182 Everything is ready. Begin running and processing data.
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.256Z info discovery/manager.go:213 Discoverer channel closed {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0"}
For Windows
Get-Service google-cloud-ops-agent*

Status   Name               DisplayName
------   ----               -----------
Running  google-cloud-op... Google Cloud Ops Agent
Running  google-cloud-op... Google Cloud Ops Agent - Logging Agent
Running  google-cloud-op... Google Cloud Ops Agent - Metrics Agent
If the agent service is not running, you might see the following status:
Linux
$ sudo service google-cloud-ops-agent status
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-06-30 21:20:43 UTC; 6s ago
Windows
Get-Service google-cloud-ops-agent

Status   Name                   DisplayName
------   ----                   -----------
Stopped  google-cloud-ops-agent Google Cloud Ops Agent
To fix this error, run the following command to start the service:
Linux
sudo service google-cloud-ops-agent start
Windows
Start-Service google-cloud-ops-agent
If the service fails to start, the configuration might be invalid.
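On Linux, the three systemd units that appear in the healthy status output can also be checked in one pass. The following is a minimal sketch, assuming a systemd-based VM. The unit names come from the status output in this section. The function name check_ops_agent_units and the SYSTEMCTL override variable are hypothetical and exist only so that the function can be tried without systemd.

```shell
# Unit names as they appear in the "systemctl status" output in this section.
OPS_AGENT_UNITS="google-cloud-ops-agent google-cloud-ops-agent-fluent-bit google-cloud-ops-agent-opentelemetry-collector"

# Prints "<unit>: <state>" for each unit and returns non-zero if any unit
# is not active. SYSTEMCTL is a hypothetical override for the systemctl binary.
check_ops_agent_units() {
  rc=0
  for unit in $OPS_AGENT_UNITS; do
    # "systemctl is-active" prints the state and exits non-zero unless active.
    state="$("${SYSTEMCTL:-systemctl}" is-active "$unit" 2>/dev/null)" || rc=1
    echo "$unit: ${state:-unknown}"
  done
  return "$rc"
}

# Example usage (not run here):
#   check_ops_agent_units || echo "At least one Ops Agent unit is not active."
```

A unit that reports a state other than active is a reasonable place to start reading the journal for that unit.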
Conflict with currently installed agents
The VM already has the Cloud Logging agent or the Cloud Monitoring agent installed, and their configuration conflicts with the new agent's configuration. The error message might look similar to the following:
Windows
We detected an existing Windows service for the StackdriverLogging agent, which is not compatible with the Ops Agent when the Ops Agent configuration has a non-empty logging section. Please either remove the logging section from the Ops Agent configuration, or disable the StackdriverLogging agent, and then retry enabling the Ops Agent.
To fix this error, you have two options:
Disable the conflicting section of the Ops Agent configuration file. For more information, refer to the Configure the Ops Agent guide.
Disable the conflicting Cloud Logging agent or the Cloud Monitoring agent.
- Save any custom configuration files for the Cloud Logging agent.
- Uninstall the old Cloud Monitoring agent and Cloud Logging agent.
After you uninstall the agent, the Google Cloud console might take up to one hour to report this change.
Required port is unavailable
The Ops Agent or one of its components can fail to start when the port needed by the component is being used by another process. The Ops Agent uses the following ports:
- Port 20201, for the "Metrics Agent" component
- Port 20202, for the "Logging Agent" component
If a process other than an Ops Agent component is using port 20201 or port 20202, then stop that process and restart the Ops Agent. Use the following steps to determine which process is using the ports:
Linux
Metrics Agent component: To see which process is using port 20201, use the following command:
sudo netstat -na -p | grep '20201'
The following output shows the expected result: the Ops Agent metrics collector, otelopscol, is using the port:
tcp   0   0 127.0.0.1:50138   127.0.0.1:20201   ESTABLISHED 16850/otelopscol
tcp6  0   0 :::20201          :::*              LISTEN      16850/otelopscol
tcp6  0   0 127.0.0.1:20201   127.0.0.1:50138   ESTABLISHED 16850/otelopscol
Logging Agent component: To see which process is using port 20202, use the following command:
sudo netstat -na -p | grep '20202'
The following output shows the expected result: the Ops Agent logs collector, fluent-bit, is using the port:
tcp   0   0 0.0.0.0:20202     0.0.0.0:*         LISTEN      16640/fluent-bit
tcp   0   0 127.0.0.1:20202   127.0.0.1:52998   TIME_WAIT   -
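When the listing is long, it can be easier to extract only the process that is listening on a port. The following is a minimal sketch, assuming net-tools netstat output with the column layout shown above (state in the sixth column, PID/program in the seventh). The function name port_listeners is hypothetical.

```shell
# Reads netstat-style socket lines on stdin and prints the PID/program column
# of every socket in the LISTEN state whose local address ends in the given port.
# Assumes the net-tools column layout: proto, recv-q, send-q, local, foreign,
# state, PID/program.
port_listeners() {
  port="$1"
  awk -v port="$port" '$6 == "LISTEN" && $4 ~ (":" port "$") { print $7 }'
}

# Example usage (not run here):
#   sudo netstat -na -p | port_listeners 20201   # otelopscol is the expected owner
#   sudo netstat -na -p | port_listeners 20202   # fluent-bit is the expected owner
```

If the printed program is neither otelopscol nor fluent-bit, that process is the likely cause of the conflict.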
Windows
Metrics Agent component: To see which process is using port 20201, use the following command:
netstat -na -b | Select-String "20201" -Context 0,1
The following output shows the expected result: the Ops Agent metrics collector, google-cloud-metrics-agent_windows_amd64.exe, is using the port:
> TCP    0.0.0.0:20201      0.0.0.0:0          LISTENING
  [google-cloud-metrics-agent_windows_amd64.exe]
> TCP    127.0.0.1:20201    127.0.0.1:50090    ESTABLISHED
  [google-cloud-metrics-agent_windows_amd64.exe]
> TCP    127.0.0.1:50090    127.0.0.1:20201    ESTABLISHED
  [google-cloud-metrics-agent_windows_amd64.exe]
> TCP    [::]:20201         [::]:0             LISTENING
  [google-cloud-metrics-agent_windows_amd64.exe]
Logging Agent component: To see which process is using port 20202, use the following command:
netstat -na -b | Select-String "20202" -Context 0,1
The following output shows the expected result: the Ops Agent logs collector, fluent-bit.exe, is using the port:
> TCP    0.0.0.0:20202      0.0.0.0:0          LISTENING
  [fluent-bit.exe]
> TCP    127.0.0.1:20202    127.0.0.1:57535    TIME_WAIT
> TCP    127.0.0.1:20202    127.0.0.1:57539    TIME_WAIT
  TCP    127.0.0.1:49807    127.0.0.1:49808    ESTABLISHED
Port-availability errors can be detected by the health checks run by the Ops Agent.
Agent lacks API permissions
If the agent fails to start or fails to ingest data, then the problem might be that the "Metrics Agent" or "Logging Agent" component lacks the necessary permission to access the API.
The service account used by the Ops Agent requires the following Identity and Access Management roles:
- For the "Logging Agent" component: Logs Writer (roles/logging.logsWriter)
- For the "Metrics Agent" component: Monitoring Metric Writer (roles/monitoring.metricWriter)
These roles include the permissions needed to write logging or metric data and must be granted to the service account associated with the VM. The service account you are using depends on how you configured the VM and authorized the agent. You might be using one of the following:
- A service account attached to the VM.
- A service account that uses a private key.
To identify the service account associated with a VM, do the following:
In the Google Cloud console, navigate to Compute Engine.
If necessary, click the drop-down list of Google Cloud projects and select the name of your project.
Select VM instances from the navigation menu, and then select the Instances tab if necessary.
In the list of VM instances, click on the name of the VM to view the Details page for the VM.
Locate the API and identity management section of the page. The service account is listed as the value of the Service account field.
For information about setting the roles granted to the service account, see Verify and modify roles of an existing service account.
API-permission errors can be detected by the health checks run by the Ops Agent.
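The role check can also be approximated from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authorized to read the project's IAM policy. PROJECT_ID and SA_EMAIL are placeholders, and the function name missing_ops_agent_roles is hypothetical. The sketch compares role names only, so a service account that receives the same permissions through a broader or custom role is still reported as missing the two roles.

```shell
# The two roles named in this section.
REQUIRED_ROLES="roles/logging.logsWriter roles/monitoring.metricWriter"

# Reads granted role names (one per line) on stdin and prints each required
# role that does not appear in that list.
missing_ops_agent_roles() {
  granted="$(cat)"
  for role in $REQUIRED_ROLES; do
    # -x matches the whole line, -F treats the role name as a fixed string.
    printf '%s\n' "$granted" | grep -qxF "$role" || echo "$role"
  done
}

# Example usage (not run here; PROJECT_ID and SA_EMAIL are placeholders):
#   gcloud projects get-iam-policy "$PROJECT_ID" \
#     --flatten="bindings[].members" \
#     --filter="bindings.members:serviceAccount:$SA_EMAIL" \
#     --format="value(bindings.role)" | missing_ops_agent_roles
```

Empty output means that both roles are bound directly to the service account at the project level.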
Invalid configuration
If the configuration is invalid, you might see the following error when trying to restart the agent service:
Linux
$ sudo service google-cloud-ops-agent restart \
  && sudo service google-cloud-ops-agent status
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
   Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
           └─directories.conf
   Active: failed (Result: exit-code) since Wed 2021-06-30 22:21:08 UTC; 2s ago
  Process: 1141421 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_>
  Process: 1141847 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIR>
 Main PID: 1141421 (code=exited, status=0/SUCCESS)

Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Control process exited, code=exited status=1
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Service RestartSec=100ms expired, scheduling restart.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5.
Jun 30 22:21:08 centos8-2 systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
Use journalctl to get the exact error message:
sudo journalctl -xe | grep "google_cloud_ops_agent_engine"
You might see a message similar to the following:
Jun 30 22:00:26 centos8-2 google_cloud_ops_agent_engine[1141491]: 2021/06/30 22:00:26 the agent config file is not valid YAML. detailed error: yaml: line 21: did not find expected key
Windows
failed to generate config files: can't parse configuration: yaml: line 20: could not find expected ':'
To fix the error, correct the invalid configuration and restart the agent. For reference, refer to the Configure the Ops Agent guide.
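Because the YAML error reports a line number, it can be convenient to pull that number out of the message and display the surrounding lines of the configuration file. The following is a minimal sketch for Linux, assuming the message format shown above ("yaml: line N: ..."). The function names yaml_error_line and show_config_context are hypothetical. The reported line is where the parser gave up, so the actual mistake may sit a line or two earlier.

```shell
# Reads error text on stdin and prints the first line number that follows
# the text "yaml: line".
yaml_error_line() {
  sed -n 's/.*yaml: line \([0-9][0-9]*\):.*/\1/p' | head -n 1
}

# Prints the given line of a file with up to two lines of context on each
# side, prefixed with line numbers.
show_config_context() {
  file="$1"
  line="$2"
  start=$(( line > 2 ? line - 2 : 1 ))
  end=$(( line + 2 ))
  awk -v s="$start" -v e="$end" 'NR >= s && NR <= e { printf "%d: %s\n", NR, $0 }' "$file"
}

# Example usage (not run here):
#   n="$(sudo journalctl -xe | grep "google_cloud_ops_agent_engine" | yaml_error_line)"
#   [ -n "$n" ] && show_config_context /etc/google-cloud-ops-agent/config.yaml "$n"
```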
Status information in the Google Cloud console is wrong
The Google Cloud console reports information about the status of agents on Compute Engine VMs in various dashboards, for example, the VM Instances dashboard in Cloud Monitoring. If this information does not match what you expect, the cause might simply be a delay as configuration changes work their way through the system. But unexpected information might also indicate that the agent isn't running as you expect.
Installed agent reported by Google Cloud console as undetected
The agent must be running and ingesting data for the Google Cloud console to recognize that the agent is present. If you have installed the agent but the console status remains "Not Detected", then either the agent is not running or it is running but not ingesting data. If the agent is not running, see Agent is installed but not running. If the agent is running but not ingesting data, see Troubleshoot data ingestion.
Removed agent reported by Google Cloud console as installed
After you uninstall the agent, the Google Cloud console might take up to one hour to report this change.