Find Ops Agent troubleshooting information

This document describes sources of diagnostic information that you can use to identify problems in the installation or running of the Ops Agent.

Agent health checks

Version 2.25.1 introduced health checks for the Ops Agent. When the Ops Agent starts, it performs a series of checks for conditions that prevent the agent from running correctly. If the agent detects one of the conditions, it logs a message describing the problem. The Ops Agent checks for the following:

  • Connectivity problems
  • Availability of ports used by the agent to report metrics about itself
  • Permission problems
  • Availability of the APIs used by the agent to write logs or metrics
  • Problems in the health-check routine itself

The following table lists each error code in alphabetical order and describes what each code means:

| Error code | Category | Meaning | Suggestion |
| --- | --- | --- | --- |
| DLApiConnErr | Connectivity | Request to the downloads subdomain, dl.google.com, failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| FbMetricsPortErr | Port availability | Port 20202, needed for Ops Agent self metrics, is unavailable. | Verify that port 20202 is open. For more information, see Required port is unavailable. |
| HcFailureErr | Generic | The Ops Agent health-check routine encountered an internal error. | Submit a support case from the Google Cloud console. For more information, see Getting support. |
| LogApiConnErr | Connectivity | Request to the Logging API failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| LogApiDisabledErr | API | The Logging API is disabled in the current Google Cloud project. | Enable the Logging API. |
| LogApiPermissionErr | Permission | Service account is missing the Logs Writer role (roles/logging.logWriter). | Grant the Logs Writer role to the service account. For more information, see Agent lacks API permissions. |
| LogApiScopeErr | Permission | The VM is missing the https://www.googleapis.com/auth/logging.write access scope. | Add the https://www.googleapis.com/auth/logging.write scope to the VM. For more information, see Verify your access scopes. |
| LogApiUnauthenticatedErr | API | The current VM couldn't authenticate to the Logging API. | Verify that your credential files, VM access scopes, and permissions are set up correctly. For more information, see Authorize the Ops Agent. |
| MetaApiConnErr | Connectivity | Request to the GCE Metadata server, for querying VM access scopes, OAuth tokens, and resource labels, failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| MonApiConnErr | Connectivity | A request to the Monitoring API failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
| MonApiDisabledErr | API | The Monitoring API is disabled in the current Google Cloud project. | Enable the Monitoring API. |
| MonApiPermissionErr | Permission | Service account is missing the Monitoring Metric Writer role (roles/monitoring.metricWriter). | Grant the Monitoring Metric Writer role to the service account. For more information, see Agent lacks API permissions. |
| MonApiScopeErr | Permission | The VM is missing the https://www.googleapis.com/auth/monitoring.write access scope. | Add the https://www.googleapis.com/auth/monitoring.write scope to the VM. For more information, see Verify your access scopes. |
| MonApiUnauthenticatedErr | API | The current VM couldn't authenticate to the Monitoring API. | Verify that your credential files, VM access scopes, and permissions are set up correctly. For more information, see Authorize the Ops Agent. |
| OtelMetricsPortErr | Port availability | Port 20201, needed for Ops Agent self metrics, is unavailable. | Verify that port 20201 is open. For more information, see A required port is unavailable. |
| PacApiConnErr | Connectivity | Request to the package repository, packages.cloud.google.com, failed. | Check your internet connection and firewall rules. For more information, see Network-connectivity issues. |
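
If you triage health-check errors in a script, the table can be encoded as a simple lookup. The following is a minimal sketch, not part of the Ops Agent: the function name health_check_category is hypothetical, and the mapping is copied from the table in this document.

```shell
#!/usr/bin/env bash
# Map an Ops Agent health-check error code to its category.
# The mapping mirrors the error-code table in this document;
# the function name is a hypothetical helper for illustration.
health_check_category() {
  case "$1" in
    DLApiConnErr|LogApiConnErr|MetaApiConnErr|MonApiConnErr|PacApiConnErr)
      echo "Connectivity" ;;
    FbMetricsPortErr|OtelMetricsPortErr)
      echo "Port availability" ;;
    LogApiPermissionErr|LogApiScopeErr|MonApiPermissionErr|MonApiScopeErr)
      echo "Permission" ;;
    LogApiDisabledErr|LogApiUnauthenticatedErr|MonApiDisabledErr|MonApiUnauthenticatedErr)
      echo "API" ;;
    HcFailureErr)
      echo "Generic" ;;
    *)
      # Not one of the codes listed in the table.
      echo "Unknown"
      return 1 ;;
  esac
}
```

For example, health_check_category LogApiScopeErr prints Permission, which points you at the access-scope and role suggestions rather than at firewall rules.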

The agent writes information about health-check errors to a health-checks.log file as follows:

  • Linux: /var/log/google-cloud-ops-agent/health-checks.log
  • Windows: C:\ProgramData\Google\Cloud Operations\Ops Agent\log\health-checks.log
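
To get a quick summary of which errors were recorded, you can scan the log for the error codes listed in the table. The sketch below assumes that the codes appear verbatim in the log text, which this document does not guarantee; the function name health_check_codes_in_log is hypothetical, and the default path is the Linux location given above.

```shell
#!/usr/bin/env bash
# Print the distinct health-check error codes found in a log file.
# Assumption: codes from the table appear verbatim in the log text.
health_check_codes_in_log() {
  local log_file="${1:-/var/log/google-cloud-ops-agent/health-checks.log}"
  local codes='DLApiConnErr|PacApiConnErr|MetaApiConnErr'
  codes="$codes|FbMetricsPortErr|OtelMetricsPortErr|HcFailureErr"
  codes="$codes|(Log|Mon)Api(Conn|Disabled|Permission|Scope|Unauthenticated)Err"
  # -o prints only the matched code, -w requires whole-word matches.
  grep -owE "$codes" "$log_file" | sort -u
}
```

Reading the file might require sudo, depending on the permissions on the log directory.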

You can also view any health-check messages by querying the status of the Ops Agent service as follows:

  • On Linux, run the following command:
    sudo systemctl status google-cloud-ops-agent"*"

    Look for messages like "[Ports Check] Result: PASS". Other results include "ERROR" and "FAIL".

  • On Windows, use the Windows Event Viewer. Look for "Information", "Error", or "Failure" messages associated with the google-cloud-ops-agent service.
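
On Linux, the status output can be long, so it can help to filter it down to the checks that did not pass. The following sketch assumes the result lines follow the "[<name> Check] Result: <status>" shape quoted above; the function name non_passing_checks is hypothetical.

```shell
#!/usr/bin/env bash
# Read text on stdin and print only the health-check result lines
# whose result is something other than PASS (for example, FAIL or ERROR).
# Assumption: result lines look like "[Ports Check] Result: PASS".
non_passing_checks() {
  grep -E '\[[^]]+ Check\] Result: ' | grep -v 'Result: PASS' || true
}
```

One way to use it is to pipe in the service status: sudo systemctl status "google-cloud-ops-agent*" --no-pager --full | non_passing_checks. Empty output means that no failing result lines were found in the text that was piped in, not necessarily that every check ran.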

The health checks run only when the agent starts, so after you resolve any problems, you must restart the agent to re-run the checks.
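
On a Linux VM that uses systemd, a restart can be sketched as follows. The unit name google-cloud-ops-agent matches the service shown in the status output in this document; the function name and the RUNNER variable are hypothetical additions that allow a dry run.

```shell
#!/usr/bin/env bash
# Restart the Ops Agent so that the startup health checks run again.
# RUNNER is a hypothetical knob for this sketch: it defaults to sudo,
# and setting RUNNER=echo prints the command instead of running it.
restart_ops_agent() {
  local runner="${RUNNER:-sudo}"
  "$runner" systemctl restart google-cloud-ops-agent
}
```

Running RUNNER=echo restart_ops_agent shows the command without changing anything on the VM. After a real restart, check the service status or the health-checks.log file again to confirm that the checks now pass.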

Agent diagnostics tool for VMs

The agent diagnostics tool gathers critical local debugging information from your VMs for all of the following agents: Ops Agent, legacy Logging agent, and legacy Monitoring agent. The debugging information includes project information, VM information, agent configuration, agent logs, and agent service status, all of which typically require manual work to gather. The tool also checks that the local VM environment meets certain requirements for the agents to function properly, for example, network connectivity and required permissions.

When filing a support case for an agent on a VM, run the agent diagnostics tool and attach the collected information to the case. Providing this information reduces the time needed to troubleshoot your support case. Before you attach the information to the support case, redact any sensitive information like passwords.
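
A first redaction pass over text files can be automated. The following sketch masks values in simple key=value or key: value lines; the list of key names is an illustrative assumption, the function name redact_secrets is hypothetical, and quoted formats such as JSON are not handled, so a manual review is still needed.

```shell
#!/usr/bin/env bash
# Mask the value in lines such as "password=..." or "token: ...".
# Reads stdin and writes the redacted text to stdout.
# The key names below are an illustrative assumption; extend as needed.
redact_secrets() {
  local keys='[Pp]assword|[Pp]asswd|[Ss]ecret|[Tt]oken|[Aa]pi[_-]?[Kk]ey'
  # Group 1 keeps the key and separator; the value is replaced.
  sed -E "s/(($keys)[[:space:]]*[:=][[:space:]]*)[^[:space:]]+/\1[REDACTED]/g"
}
```

For example, redact_secrets < collected.log > collected.redacted.log writes a masked copy, where collected.log stands in for one of the files that the tool produced.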

The agent diagnostics tool must be run from inside the VM, so you typically need to SSH into the VM first. The following commands download the agent diagnostics tool and run it:

Linux

curl -sSO https://dl.google.com/cloudagents/diagnose-agents.sh
sudo bash diagnose-agents.sh

Windows

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/diagnose-agents.ps1", "${env:UserProfile}\diagnose-agents.ps1")
Invoke-Expression "${env:UserProfile}\diagnose-agents.ps1"

Follow the output of the script to locate the files that contain the collected information. Typically, you can find them in the /var/tmp/google-agents directory on Linux and in the $env:LOCALAPPDATA/Temp directory on Windows, unless you customized the output directory when you ran the script.
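
If the script output has scrolled away, you can look for recently written files in the default Linux output directory. This is a sketch under the assumption that the tool ran within the last hour and used the default directory named above; the function name recent_diagnostics is hypothetical.

```shell
#!/usr/bin/env bash
# List files modified in the last 60 minutes under the diagnostics
# output directory. Defaults to the Linux location named in this document.
recent_diagnostics() {
  local dir="${1:-/var/tmp/google-agents}"
  # Errors (for example, a missing directory) are hidden so that
  # the function simply prints nothing in that case.
  find "$dir" -type f -mmin -60 2>/dev/null | sort
}
```

The paths printed by the script itself remain the authoritative source; pass a different directory as the first argument if you customized the output location.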

For detailed information, examine the diagnose-agents.sh script on Linux or diagnose-agents.ps1 script on Windows.

Agent status

You can check the status of the Ops Agent processes on the VM to determine whether the agent is running.

Linux

To check the status of the Ops Agent, use the following command:

sudo systemctl status google-cloud-ops-agent"*"

Verify that the "Metrics Agent" and "Logging Agent" components are listed as "active (running)", as shown in the following sample output:

● google-cloud-ops-agent.service - Google Cloud Ops Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2023-01-25 19:42:58 UTC; 2 weeks 4 days ago
    Process: 614636 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/go>
    Process: 614654 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 614654 (code=exited, status=0/SUCCESS)
        CPU: 306ms
Jan 25 19:42:58 test-vm systemd[1]: Finished Google Cloud Ops Agent.

● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static)
     Active: active (running) since Wed 2023-01-25 19:42:58 UTC; 2 weeks 4 days ago
    Process: 614637 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=ot>
   Main PID: 614655 (otelopscol)
      Tasks: 9 (limit: 2355)
     Memory: 66.5M
        CPU: 27min 51.908s
     CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
             └─614655 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/g>

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static)
     Active: active (running) since Wed 2023-01-25 19:42:59 UTC; 2 weeks 4 days ago
    Process: 614664 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fl>
   Main PID: 614672 (fluent-bit)
      Tasks: 22 (limit: 2355)
     Memory: 31.7M
        CPU: 14min 55.847s
     CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service
             └─614672 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-clo>

Jan 25 19:42:59 test-vm systemd[1]: Started Google Cloud Ops Agent - Logging Agent.
Jan 25 19:42:59 test-vm fluent-bit[614672]: Fluent Bit v1.9.3
Jan 25 19:42:59 test-vm fluent-bit[614672]: * Copyright (C) 2015-2022 The Fluent Bit Authors
Jan 25 19:42:59 test-vm fluent-bit[614672]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Jan 25 19:42:59 test-vm fluent-bit[614672]: * https://fluentbit.io
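
For scripted checks, reading the full status output is less convenient than asking systemd directly. The following sketch uses systemctl is-active on the two subagent units that appear in the sample output above; the function name check_ops_agent_services is hypothetical.

```shell
#!/usr/bin/env bash
# Report whether the Metrics Agent and Logging Agent units are active.
# Unit names are taken from the sample status output in this document.
# Returns nonzero if either unit is not active.
check_ops_agent_services() {
  local unit status=0
  for unit in \
    google-cloud-ops-agent-opentelemetry-collector.service \
    google-cloud-ops-agent-fluent-bit.service; do
    if systemctl is-active --quiet "$unit"; then
      echo "OK: $unit"
    else
      echo "NOT RUNNING: $unit"
      status=1
    fi
  done
  return "$status"
}
```

The parent google-cloud-ops-agent.service unit is left out on purpose: in the sample output it reports "active (exited)", so it does not show whether the subagents are still running.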

Windows

To check the status of the Ops Agent, use the following command:

Get-Service google-cloud-ops-agent*

Verify that the "Metrics Agent" and "Logging Agent" components are listed as "Running", as shown in the following sample output:

Status   Name               DisplayName
------   ----               -----------
Running  google-cloud-op... Google Cloud Ops Agent
Running  google-cloud-op... Google Cloud Ops Agent - Logging Agent
Running  google-cloud-op... Google Cloud Ops Agent - Metrics Agent

Agent self logs

If the agent fails to send logs to Cloud Logging, then you might have to inspect the agent's self logs locally on the VM for troubleshooting. You can also use log rotation to manage the agent's self logs.

Linux

To inspect self logs that are written to Journald, run the following command:

journalctl -u "google-cloud-ops-agent*"

To inspect the self logs that are written to the disk by the logging module, run the following command:

vim /var/log/google-cloud-ops-agent/subagents/logging-module.log
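
When the log file is large, filtering it is often faster than paging through it in an editor. The following sketch prints the most recent lines that mention an error or warning. It assumes that such lines contain the text "error" or "warn", which depends on the log format; the function name recent_log_problems is hypothetical.

```shell
#!/usr/bin/env bash
# Print the last N lines of a log file that mention "error" or "warn"
# (case-insensitive). Defaults to the logging-module log path given
# in this document and to 20 lines.
recent_log_problems() {
  local log_file="${1:-/var/log/google-cloud-ops-agent/subagents/logging-module.log}"
  local count="${2:-20}"
  grep -iE 'error|warn' "$log_file" | tail -n "$count"
}
```

A similar narrowing works for the Journald logs, for example journalctl -u "google-cloud-ops-agent*" --since "1 hour ago" --no-pager.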

Windows

To inspect self logs that are written to Windows Event Logs, run the following command:

Get-WinEvent -FilterHashtable @{ Logname='Application'; ProviderName='google-cloud-ops-agent*' } | Format-Table -AutoSize -Wrap

To inspect the self logs that are written to the disk by the logging module, run the following command:

notepad "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log"

To inspect the logs from the Windows Service Control Manager for Ops Agent services, run the following command:

Get-WinEvent -FilterHashtable @{ Logname='System'; ProviderName='Service Control Manager' } | Where-Object -Property Message -Match 'Google Cloud Ops Agent' | Format-Table -AutoSize -Wrap