Find Ops Agent troubleshooting information

This document describes sources of diagnostic information that you can use to identify problems in the installation or running of the Ops Agent.

Agent health checks

Version 2.25.1 introduced start-time health checks for the Ops Agent. When the Ops Agent starts, it performs a series of checks for conditions that prevent the agent from running correctly. If the agent detects one of the conditions, it logs a message describing the problem. The Ops Agent checks for the following:

  • Connectivity problems
  • Availability of ports used by the agent to report metrics about itself
  • Permission problems
  • Availability of the APIs used by the agent to write logs or metrics
  • A problem in the health-check routine itself

For information about locating start-time errors, see Find start-time errors.

Version 2.37.0 introduced runtime health checks for the Ops Agent. These errors are reported to Cloud Logging and Error Reporting. For information about locating runtime errors, see Find runtime errors.

Version 2.46.0 introduced the informational LogPingOpsAgent code. This code does not represent an error. For more information, see Verify successful log collection.

The following table lists each health-check code in alphabetical order and describes what each code means. Codes that end with the string Err indicate errors; other codes are informational.

Health-check code Category Meaning Suggestion
DLApiConnErr Connectivity Request to the downloads subdomain, dl.google.com, failed. Check your internet connection and firewall rules. For more information, see Network-connectivity issues.
FbMetricsPortErr Port availability Port 20202, needed for Ops Agent self metrics, is unavailable. Verify that port 20202 is open. For more information, see Required port is unavailable.
HcFailureErr Generic The Ops Agent health-check routine encountered an internal error. Submit a support case from the Google Cloud console. For more information, see Getting support.
LogApiConnErr Connectivity Request to the Logging API failed. Check your internet connection and firewall rules. For more information, see Network-connectivity issues.
LogApiDisabledErr API The Logging API is disabled in the current Google Cloud project. Enable the Logging API.
LogApiPermissionErr Permission Service account is missing the Logs Writer role (roles/logging.logWriter). Grant the Logs Writer role to the service account. For more information, see Agent lacks API permissions.
LogApiScopeErr Permission The VM is missing the https://www.googleapis.com/auth/logging.write access scope. Add the https://www.googleapis.com/auth/logging.write scope to the VM. For more information, see Verify your access scopes.
LogApiUnauthenticatedErr API The current VM couldn't authenticate to the Logging API. Verify that your credential files, VM access scopes, and permissions are set up correctly. For more information, see Authorize the Ops Agent.
LogPingOpsAgent   An informational payload message written every 10 minutes to the ops-agent-health log. You can use the resulting log entries to verify that the agent is sending logs. This message is not an error. This message is expected to appear every 10 minutes. If the message does not appear for 20 minutes or longer, then the agent might have encountered a problem. For troubleshooting information, see Troubleshoot the Ops Agent.
LogParseErr Runtime The Ops Agent was unable to parse one or more logs. Check the configuration of any logging processors you've created. For more information, see Log-parsing errors.
LogPipeLineErr Runtime The Ops Agent's logging pipeline failed. Verify that the agent has access to the buffer files, check for a full disk, and verify that the Ops Agent configuration is correct. For more information, see Pipeline errors.
MetaApiConnErr Connectivity Request to the GCE Metadata server, for querying VM access scopes, OAuth tokens, and resource labels, failed. Check your internet connection and firewall rules. For more information, see Network-connectivity issues.
MonApiConnErr Connectivity Request to the Monitoring API failed. Check your internet connection and firewall rules. For more information, see Network-connectivity issues.
MonApiDisabledErr API The Monitoring API is disabled in the current Google Cloud project. Enable the Monitoring API.
MonApiPermissionErr Permission Service account is missing the Monitoring Metric Writer role (roles/monitoring.metricWriter). Grant the Monitoring Metric Writer role to the service account. For more information, see Agent lacks API permissions.
MonApiScopeErr Permission The VM is missing the https://www.googleapis.com/auth/monitoring.write access scope. Add the https://www.googleapis.com/auth/monitoring.write scope to the VM. For more information, see Verify your access scopes.
MonApiUnauthenticatedErr API The current VM couldn't authenticate to the Monitoring API. Verify that your credential files, VM access scopes, and permissions are set up correctly. For more information, see Authorize the Ops Agent.
OtelMetricsPortErr Port availability Port 20201, needed for Ops Agent self metrics, is unavailable. Verify that port 20201 is open. For more information, see A required port is unavailable.
PacApiConnErr Connectivity Request to the package repository, packages.cloud.google.com, failed. Check your internet connection and firewall rules. For more information, see Network-connectivity issues.

Find start-time errors

Starting with version 2.35.0, health-check information is written to the ops-agent-health log by using the Cloud Logging API (versions 2.33.0 and 2.34.0 use ops-agent-health-checks). The same information is also written to a health-checks.log file in the following locations:

  • Linux: /var/log/google-cloud-ops-agent/health-checks.log
  • Windows: C:\ProgramData\Google\Cloud Operations\Ops Agent\log\health-checks.log
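
For example, on Linux you can view the most recent entries in this file with a command like the following; the path assumes a default installation:

sudo tail -n 50 /var/log/google-cloud-ops-agent/health-checks.log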

You can also view any health-check messages by querying the status of the Ops Agent service as follows:

  • On Linux, run the following command:
       sudo systemctl status google-cloud-ops-agent"*"
       

    Look for messages like "[Ports Check] Result: PASS". Other results include "ERROR" and "FAIL".

  • On Windows, use the Windows Event Viewer. Look for "Information", "Error", or "Failure" messages associated with the google-cloud-ops-agent service.

The health checks run only when the agent starts, so after you resolve any problems, you must restart the agent to re-run the checks.
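
For example, on Linux you can restart the agent and then review the health-check results again with commands like the following:

sudo systemctl restart google-cloud-ops-agent
sudo systemctl status google-cloud-ops-agent"*"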

Find runtime errors

The runtime health checks are reported to both Cloud Logging and Error Reporting. If the agent failed to start but was able to report errors before failing, you might also see start-time errors reported.

To view runtime errors from the Ops Agent in Logging, do the following:

  1. In the navigation panel of the Google Cloud console, select Logging, and then select Logs Explorer:

    Go to Logs Explorer

  2. Enter the following query and click Run query:
    log_id("ops-agent-health")

To view runtime errors from the Ops Agent in Error Reporting, do the following:

  1. In the navigation panel of the Google Cloud console, select Error Reporting, and then select your Google Cloud project:

    Go to Error Reporting

  2. To see errors from the Ops Agent, filter the errors for Ops Agent.

Verify successful log collection

Version 2.46.0 of the Ops Agent introduced the informational LogPingOpsAgent health check. This check writes an informational message to the ops-agent-health log every 10 minutes. You can use the presence of these messages to verify that the Ops Agent is writing logs.

If these messages are not being ingested, then the agent might not be collecting logs, and you can check its status on the affected VM.

To check the status of the Ops Agent on a specific VM, you need the instance ID of the VM. To find the instance ID, do the following:

  1. In the navigation panel of the Google Cloud console, select Compute Engine, and then select VM instances:

    Go to VM instances

  2. Click the name of a VM instance.
  3. On the Details tab, locate the Basic information section. The instance ID appears as a numeric string. Use this string for the INSTANCE_ID value in the subsequent sections.
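
For example, with the instance ID in hand, you can check for recent entries in the ops-agent-health log from that VM by using a gcloud query like the following sketch. The resource.labels.instance_id filter assumes that the VM reports logs with the standard gce_instance resource labels; replace INSTANCE_ID with the numeric string you found above:

gcloud logging read 'log_id("ops-agent-health") AND resource.labels.instance_id="INSTANCE_ID"' --freshness=30m --limit=10

Because the LogPingOpsAgent message is written every 10 minutes, a 30-minute window should normally return several entries; an empty result suggests that the agent is not sending logs.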

Agent diagnostics tool for VMs

The agent diagnostics tool gathers critical local debugging information from your VMs for all of the following agents: the Ops Agent, the legacy Logging agent, and the legacy Monitoring agent. The debugging information includes project information, VM information, agent configuration, agent logs, and agent service status; gathering this information typically requires manual work. The tool also checks the local VM environment to ensure that it meets certain requirements for the agents to function properly, for example, network connectivity and required permissions.

When filing a support case for an agent on a VM, run the agent diagnostics tool and attach the collected information to the case. Providing this information reduces the time needed to troubleshoot your support case. Before you attach the information to the support case, redact any sensitive information like passwords.

The agent diagnostics tool must be run from inside the VM, so you will typically need to SSH into the VM first. The following command retrieves the agent diagnostics tool and executes it:

Linux

curl -sSO https://dl.google.com/cloudagents/diagnose-agents.sh
sudo bash diagnose-agents.sh

Windows

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/diagnose-agents.ps1", "${env:UserProfile}\diagnose-agents.ps1")
Invoke-Expression "${env:UserProfile}\diagnose-agents.ps1"

Follow the output of the script execution to locate the files that contain the collected information. Typically, you can find them in the /var/tmp/google-agents directory on Linux and in the $env:LOCALAPPDATA/Temp directory on Windows, unless you customized the output directory when running the script.
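
For example, on Linux you can list the most recently created output under the default directory like this:

ls -lt /var/tmp/google-agents/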

For detailed information, examine the diagnose-agents.sh script on Linux or diagnose-agents.ps1 script on Windows.

Agent diagnostics tool for automatic installation policies

If an attempt to install the Ops Agent by using an Ops Agent OS policy fails, you can use the diagnostics script described in this section for debugging. For example, you might see one of the following cases:

  • The Ops Agent installation fails when you use the Install Ops Agent for Monitoring and Logging checkbox to install the Ops Agent during VM creation.
  • The agent status on the Cloud Monitoring VM instances dashboard or the Observability tab on a Compute Engine VM details page stays in the Pending state for more than 10 minutes. A prolonged Pending status might indicate one of the following:

    • A problem applying the policy.
    • A problem in the actual installation of the Ops Agent.
    • A connectivity problem between the VM and Cloud Monitoring.

    For some of these issues, the general agent-diagnostics script and health checks might also be helpful.

To run the policy-diagnostics script, use the following commands:

curl -sSO https://dl.google.com/cloudagents/diagnose-ui-policies.sh
bash diagnose-ui-policies.sh VM_NAME VM_ZONE

This script shows information about affected VMs and related automatic installation policies.
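
For example, for a hypothetical VM named my-vm in zone us-central1-a (placeholder values, not names from your project), the invocation would look like this:

bash diagnose-ui-policies.sh my-vm us-central1-a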

When filing a support case for an agent on a VM, run the agent diagnostics tools and attach the collected information to the case. Providing this information reduces the time needed to troubleshoot your support case. Before you attach the information to the support case, redact any sensitive information like passwords.

Agent status

You can check the status of the Ops Agent processes on the VM to determine whether the agent is running.

Linux

To check the status of the Ops Agent, use the following command:

sudo systemctl status google-cloud-ops-agent"*"

Verify that the "Metrics Agent" and "Logging Agent" components are listed as "active (running)", as shown in the following sample output (some lines have been removed for brevity):

● google-cloud-ops-agent.service - Google Cloud Ops Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2023-05-03 21:22:28 UTC; 4 weeks 0 days ago
    Process: 3353828 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/go>
    Process: 3353837 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 3353837 (code=exited, status=0/SUCCESS)
        CPU: 195ms

[...]

● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static)
     Active: active (running) since Wed 2023-05-03 21:22:29 UTC; 4 weeks 0 days ago
    Process: 3353840 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=ot>
   Main PID: 3353855 (otelopscol)
      Tasks: 9 (limit: 2355)
     Memory: 65.3M
        CPU: 40min 31.555s
     CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
             └─3353855 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/g>

[...]

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static)
     Active: active (running) since Wed 2023-05-03 21:22:29 UTC; 4 weeks 0 days ago
    Process: 3353838 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fl>
   Main PID: 3353856 (google_cloud_op)
      Tasks: 31 (limit: 2355)
     Memory: 58.3M
        CPU: 29min 6.771s
     CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service
             ├─3353856 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_wrapper -config_path /etc/goo>
             └─3353872 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-clo>

[...]

● google-cloud-ops-agent-diagnostics.service - Google Cloud Ops Agent - Diagnostics
     Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-diagnostics.service; disabled; vendor preset: e>
     Active: active (running) since Wed 2023-05-03 21:22:26 UTC; 4 weeks 0 days ago
   Main PID: 3353819 (google_cloud_op)
      Tasks: 8 (limit: 2355)
     Memory: 36.0M
        CPU: 3min 19.488s
     CGroup: /system.slice/google-cloud-ops-agent-diagnostics.service
             └─3353819 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_diagnostics -config /etc/goog>

[...]

Windows

To check the status of the Ops Agent, use the following command:

Get-Service google-cloud-ops-agent*

Verify that the "Metrics Agent" and "Logging Agent" components are listed as "Running", as shown in the following sample output:

Status   Name               DisplayName
------   ----               -----------
Running  google-cloud-op... Google Cloud Ops Agent
Running  google-cloud-op... Google Cloud Ops Agent - Logging Agent
Running  google-cloud-op... Google Cloud Ops Agent - Metrics Agent
Running  google-cloud-op... Google Cloud Ops Agent - Diagnostics

Agent self logs

If the agent fails to ingest logs to Cloud Logging, then you might have to inspect the agent's logs locally on the VM for troubleshooting. You can also use log rotation to manage the agent's self logs.

Linux

To inspect self logs that are written to Journald, run the following command:

journalctl -u google-cloud-ops-agent*
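
To narrow the output to recent entries, you can add standard journalctl options, for example:

journalctl -u google-cloud-ops-agent* --since "1 hour ago" --no-pager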

To inspect the self logs that are written to the disk by the logging module, run the following command:

vim -M /var/log/google-cloud-ops-agent/subagents/logging-module.log

Windows

To inspect self logs that are written to Windows Event Logs, run the following command:

Get-WinEvent -FilterHashtable @{ Logname='Application'; ProviderName='google-cloud-ops-agent*' } | Format-Table -AutoSize -Wrap

To inspect the self logs that are written to the disk by the logging module, run the following command:

notepad "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log"

To inspect the logs from the Windows Service Control Manager for the Ops Agent services, run the following command:

Get-WinEvent -FilterHashtable @{ Logname='System'; ProviderName='Service Control Manager' } | Where-Object -Property Message -Match 'Google Cloud Ops Agent' | Format-Table -AutoSize -Wrap

View metric usage and diagnostics in Cloud Monitoring

The Cloud Monitoring Metrics Management page provides information that can help you control the amount you spend on chargeable metrics without affecting observability. The Metrics Management page reports the following information:

  • Ingestion volumes for both byte- and sample-based billing, across metric domains and for individual metrics.
  • Data about labels and cardinality of metrics.
  • Use of metrics in alerting policies and custom dashboards.
  • Rate of metric-write errors.

To view the Metrics Management page, do the following:

  1. In the navigation panel of the Google Cloud console, select Monitoring, and then select Metrics management:

    Go to Metrics management

  2. In the toolbar, select your time window. By default, the Metrics Management page displays information about the metrics collected in the previous day.

For more information about the Metrics Management page, see View and manage metric usage.