Troubleshoot the Ops Agent


Note: The localhost:2020/api/v1/metrics endpoint mentioned at 3:18 in this video is no longer available in the Ops Agent. For other options, see Agent is running but data is not ingested.

This document helps you diagnose problems in the installation or running of the Ops Agent.

Agent diagnostics tool for VMs

The agent diagnostics tool gathers critical local debugging information from your VMs for all of the following agents: Ops Agent, legacy Logging agent, and legacy Monitoring agent. The debugging information includes project info, VM info, agent configuration, agent logs, and agent service status, all of which typically require manual work to gather. The tool also checks the local VM environment to ensure that it meets the requirements for the agents to function properly, for example, network connectivity and required permissions.

When filing a customer case for an agent on a VM, run the agent diagnostics tool and attach the collected information to the case. Before you attach the information to the support case, redact any sensitive information like passwords. Providing this information reduces the time needed to troubleshoot your support case.

The agent diagnostics tool must be run from inside the VM, so you typically need to SSH into the VM first. The following commands retrieve the agent diagnostics tool and run it:

Linux

curl -sSO https://dl.google.com/cloudagents/diagnose-agents.sh
sudo bash diagnose-agents.sh

Windows

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/diagnose-agents.ps1", "${env:UserProfile}\diagnose-agents.ps1")
Invoke-Expression "${env:UserProfile}\diagnose-agents.ps1"

Follow the output of the script execution to locate the files that include the collected info. Typically you can find them in the /var/tmp/google-agents directory on Linux and in the $env:LOCALAPPDATA/Temp directory on Windows, unless you have customized the output directory when running the script.
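
Before you attach the collected files to a support case, a quick pass with sed can mask the most obvious secrets. The following is a minimal sketch and not part of the diagnostics tool: the redact_secrets function name and the list of key names (password, secret, token) are assumptions, and it only handles plain key: value or key=value lines. Review the result manually before you share it.

```shell
# Hypothetical helper: mask values that follow common secret-like keys.
# The key list is an assumption; extend it for your environment.
# Reads text on stdin and writes the masked text to stdout.
redact_secrets() {
  sed -E 's/(([Pp]assword|[Ss]ecret|[Tt]oken)[[:space:]]*[:=][[:space:]]*)[^[:space:]]+/\1[REDACTED]/g'
}

# Example usage (file names are placeholders):
#   redact_secrets < collected-file.txt > collected-file.redacted.txt
```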

For detailed information, examine the diagnose-agents.sh script on Linux or diagnose-agents.ps1 script on Windows.

Agent fails to install

You may encounter the following errors when running the installation script.

  • The operating system is not supported. The error message might look similar to the following:

    Linux

    https://packages.cloud.google.com/yum/repos/google-cloud-ops-agent-el6-x86_64-all/repodata/repomd.xml: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
    Trying other mirror.
    To address this issue please refer to the below wiki article
    
    https://wiki.centos.org/yum-errors
    
    If above article doesn't help to resolve this issue please use https://bugs.centos.org/.
    
    Error: Cannot retrieve repository metadata (repomd.xml) for repository: google-cloud-ops-agent. Please verify its path and try again
    
  • The VM already has the Cloud Logging agent or the Cloud Monitoring agent installed, and they conflict with the new agent. The error message might look similar to the following:

    Linux

    Error:
    Problem: problem with installed package stackdriver-agent-6.0.5-1.el8.x86_64 - package google-cloud-ops-agent-0.1.0-1.el8.x86_64 conflicts with stackdriver-agent provided by stackdriver-agent-6.0.5-1.el8.x86_64
    

    The Ops Agent uses new configuration files that are not compatible with the old agents. For more information, refer to the Configure the Ops Agent guide.

    To fix this error, do the following:

    1. Save the custom configuration files for the Cloud Monitoring agent and the Cloud Logging agent.

    2. Uninstall the old Cloud Monitoring agent and Cloud Logging agent.

      After you uninstall the agent, the Google Cloud console might take up to one hour to report this change.
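
To confirm whether a legacy agent package is present before you install the Ops Agent, you can filter the list of package names. The following is a sketch under stated assumptions: find_legacy_agents is a hypothetical helper name, and it assumes that the legacy Monitoring agent is packaged as stackdriver-agent (as in the preceding error message) and the legacy Logging agent as google-fluentd.

```shell
# Hypothetical helper: read package names on stdin (one per line) and print
# the ones that belong to the legacy agents.
#   stackdriver-agent -> legacy Monitoring agent
#   google-fluentd    -> legacy Logging agent
find_legacy_agents() {
  # "|| true" keeps the exit status 0 when nothing matches.
  grep -E '^(stackdriver-agent|google-fluentd)$' || true
}

# Example usage:
#   Debian/Ubuntu:  dpkg-query -W -f='${Package}\n' | find_legacy_agents
#   CentOS/RHEL:    rpm -qa --qf '%{NAME}\n' | find_legacy_agents
# Note: dpkg-query can also list packages that were removed but not purged.
```

If the helper prints nothing, neither legacy package name was found in the list.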

Agent is installed but not running

Agent services not running

When the agent service is running as expected, you might see the following status:

Linux

computer@debian9:~$ sudo systemctl status google-cloud-ops-agent"*"
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: active (exited) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2240 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
  Process: 2214 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
 Main PID: 2240 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent.service

Aug 05 20:33:44 debian9 systemd[1]: Starting Google Cloud Ops Agent...
Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent.

● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
  Drop-In: /lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
           └─directories.conf
   Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2234 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
  Process: 2216 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
 Main PID: 2247 (fluent-bit)
    Tasks: 22 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent-fluent-bit.service
           └─2247 /opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config /run/google-cloud-ops-agent-fluent-bit/fluent_bit_main.conf --parser /run/google-cloud-ops-agent-fluent-bit/fluent_bit_parser.conf --log_file /var/log/google-cloud-ops-agent/subagents/logging-module.log --storage_path /var/lib/google-cloud-ops-agent/fluent-bit/buffers

Aug 05 20:33:44 debian9 systemd[1]: Starting Google Cloud Ops Agent - Logging Agent...
Aug 05 20:33:44 debian9 systemd[1]: Started Google Cloud Ops Agent - Logging Agent.
Aug 05 20:33:44 debian9 fluent-bit[2247]: Fluent Bit v1.7.8
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Copyright (C) 2015-2018 Treasure Data
Aug 05 20:33:44 debian9 fluent-bit[2247]: * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
Aug 05 20:33:44 debian9 fluent-bit[2247]: * https://fluentbit.io

● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
  Drop-In: /lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service.d
           └─directories.conf
   Active: active (running) since Thu 2021-08-05 20:33:44 UTC; 7s ago
  Process: 2237 ExecStartPre=/bin/mkdir -p ${RUNTIME_DIRECTORY} ${STATE_DIRECTORY} ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
  Process: 2215 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
 Main PID: 2251 (otelopscol)
    Tasks: 6 (limit: 4915)
   CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
           └─2251 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --add-instance-id=false --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml

Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z        info        builder/pipelines_builder.go:51        Pipeline is starting...        {"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z        info        builder/pipelines_builder.go:62        Pipeline is started.        {"pipeline_name": "metrics/system", "pipeline_datatype": "metrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.234Z        info        service/service.go:192        Starting receivers...
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z        info        builder/receivers_builder.go:70        Receiver is starting...        {"kind": "receiver", "name": "hostmetrics/hostmetrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.235Z        info        builder/receivers_builder.go:75        Receiver started.        {"kind": "receiver", "name": "hostmetrics/hostmetrics"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        builder/receivers_builder.go:70        Receiver is starting...        {"kind": "receiver", "name": "prometheus/agent"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        discovery/manager.go:195        Starting provider        {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0", "subs": "[otel-collector]"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        builder/receivers_builder.go:75        Receiver started.        {"kind": "receiver", "name": "prometheus/agent"}
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.236Z        info        service/collector.go:182        Everything is ready. Begin running and processing data.
Aug 05 20:33:45 debian9 otelopscol[2251]: 2021-08-05T20:33:45.256Z        info        discovery/manager.go:213        Discoverer channel closed        {"kind": "receiver", "name": "prometheus/agent", "level": "debug", "provider": "static/0"}

Windows

Get-Service google-cloud-ops-agent*

Status   Name               DisplayName
------   ----               -----------
Running  google-cloud-op... Google Cloud Ops Agent
Running  google-cloud-op... Google Cloud Ops Agent - Logging Agent
Running  google-cloud-op... Google Cloud Ops Agent - Metrics Agent

If the agent service is not running, you might see the following status:

Linux

$ sudo service google-cloud-ops-agent status
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2021-06-30 21:20:43 UTC; 6s ago

Windows

Get-Service google-cloud-ops-agent

Status   Name                    DisplayName
------   ----                    -----------
Stopped  google-cloud-ops-agent  Google Cloud Ops Agent

To fix this error, run the following command to start the service:

Linux

sudo service google-cloud-ops-agent start

Windows

Start-Service google-cloud-ops-agent

If the service fails to start, the configuration might be invalid.
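
The full systemctl status output is long, so it can help to reduce it to one line per unit when you check several VMs. The following sketch assumes the output layout shown in the preceding examples (a unit header line followed by an Active: line); unit_states is a hypothetical helper name.

```shell
# Hypothetical helper: summarize `systemctl status` output read from stdin
# as "<unit> <state> <sub-state>", one line per unit.
unit_states() {
  awk '
    # Unit header, for example:
    #   "● google-cloud-ops-agent.service - Google Cloud Ops Agent"
    # The leading marker varies, so match on the unit name and the dash.
    $2 ~ /\.service$/ && $3 == "-" { unit = $2 }
    # State line, for example: "Active: inactive (dead) since ..."
    $1 == "Active:" { print unit, $2, $3 }
  '
}

# Example usage:
#   sudo systemctl status "google-cloud-ops-agent*" --no-pager | unit_states
```

Any unit that does not report active in the summary is a candidate for the start command described previously.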

Conflict with currently installed agents

  • The VM already has the Cloud Logging agent or the Cloud Monitoring agent installed, and their configuration conflicts with the new agent's configuration. The error message might look similar to the following:

    Windows

    We detected an existing Windows service for the StackdriverLogging agent,
    which is not compatible with the Ops Agent when the Ops Agent configuration
    has a non-empty logging section. Please either remove the logging section
    from the Ops Agent configuration, or disable the StackdriverLogging agent,
    and then retry enabling the Ops Agent.
    

    To fix this error, you have two options:

    1. Disable the conflicting section of the Ops Agent configuration file. For more information, refer to the Configure the Ops Agent guide.

    2. Disable the conflicting Cloud Logging agent or the Cloud Monitoring agent.

      1. Save any custom configuration files for the Cloud Logging agent.
      2. Uninstall the old Cloud Monitoring agent and Cloud Logging agent.

      After you uninstall the agent, the Google Cloud console might take up to one hour to report this change.

Invalid configuration

If the configuration is invalid, you might see the following error when trying to restart the agent service:

Linux

$ sudo service google-cloud-ops-agent restart \
    && sudo service google-cloud-ops-agent status
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
   Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service.d
           └─directories.conf
   Active: failed (Result: exit-code) since Wed 2021-06-30 22:21:08 UTC; 2s ago
  Process: 1141421 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_>
  Process: 1141847 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIR>
 Main PID: 1141421 (code=exited, status=0/SUCCESS)

Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Control process exited, code=exited status=1
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Service RestartSec=100ms expired, scheduling restart.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5.
Jun 30 22:21:08 centos8-2 systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Jun 30 22:21:08 centos8-2 systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Jun 30 22:21:08 centos8-2 systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.

Use journalctl to get the exact error message:

sudo journalctl -xe | grep "google_cloud_ops_agent_engine"

You might see a message similar to the following:

Jun 30 22:00:26 centos8-2 google_cloud_ops_agent_engine[1141491]: 2021/06/30 22:00:26 the agent config file is not valid YAML. detailed error: yaml: line 21: did not find expected key

Windows

failed to generate config files: can't parse configuration: yaml: line 20: could not find expected ':'

To fix the error, correct the invalid configuration and restart the agent. For reference, refer to the Configure the Ops Agent guide.
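
The YAML error message includes a line number, which you can use to jump to the problem in the configuration file. The following is a minimal sketch: yaml_error_line is a hypothetical helper name, and the line number that the parser reports can be slightly after the actual mistake, so inspect a few lines around it.

```shell
# Hypothetical helper: read agent error messages on stdin and print the line
# number from the most recent "yaml: line N: ..." message, if there is one.
yaml_error_line() {
  sed -nE 's/.*yaml: line ([0-9]+):.*/\1/p' | tail -n 1
}

# Example usage on Linux: show the configuration around the reported line.
#   n=$(sudo journalctl -xe | grep "google_cloud_ops_agent_engine" | yaml_error_line)
#   [ -n "$n" ] && sed -n "$((n > 3 ? n - 3 : 1)),$((n + 3))p" \
#     /etc/google-cloud-ops-agent/config.yaml
```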

Agent is running, but data is not ingested

Use Metrics Explorer to query the agent uptime metric, and verify that the agent component, google-cloud-ops-agent-metrics or google-cloud-ops-agent-logging, is writing to the metric.

  1. In the Google Cloud console, select Monitoring or click the following button:

    Go to Monitoring

  2. In the navigation pane, select Metrics Explorer.

  3. Select the MQL tab.

  4. Enter the following query, then click Run:

    fetch gce_instance
    | metric 'agent.googleapis.com/agent/uptime'
    | align rate(1m)
    | every 1m
    

Is the agent sending logs to Cloud Logging?

Check the local metrics

These steps require you to SSH into the VM.

  • Is the logging module running? Use the following commands to check:

Linux

sudo systemctl status google-cloud-ops-agent"*"

Windows

Open Windows PowerShell as administrator and run:

Get-Service google-cloud-ops-agent*

You can also check service status in the Services app and inspect running processes in the Task Manager app.

Check the logging module log

This step requires you to SSH into the VM.

You can find the logging module logs at /var/log/google-cloud-ops-agent/subagents/*.log for Linux and C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log for Windows. If there are no logs, this indicates that the agent service is not running properly. Go to the Agent is installed but not running section first to fix that condition.

  • You might see 403 permission errors when writing to the Logging API. For example:

    [2020/10/13 18:55:09] [ warn] [output:stackdriver:stackdriver.0] error
    {
    "error": {
      "code": 403,
      "message": "Cloud Logging API has not been used in project 147627806769 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.",
      "status": "PERMISSION_DENIED",
      "details": [
        {
          "@type": "type.googleapis.com/google.rpc.Help",
          "links": [
            {
              "description": "Google developers console API activation",
              "url": "https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769"
            }
          ]
        }
      ]
    }
    }
    

    To fix this error, enable the Logging API and set the Logs Writer role.

  • You might see a quota issue for the Logging API. For example:

    error="8:Insufficient tokens for quota 'logging.googleapis.com/write_requests' and limit 'WriteRequestsPerMinutePerProject' of service 'logging.googleapis.com' for consumer 'project_number:648320274015'." error_code="8"
    

    To fix this error, raise the quota or reduce the log throughput.

  • You might see the following errors in the module log:

    {"error":"invalid_request","error_description":"Service account not enabled on this instance"}
    

    or

    can't fetch token from the metadata server
    

    These errors might indicate that you deployed the agent with no service account or specified credentials. For information about resolving this issue, see Authorize the Ops Agent.
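
The three error patterns above can be checked in one pass over the logging module log. The following sketch maps each pattern to the fix described in this section; triage_logging_errors is a hypothetical helper name, and the match strings are taken from the example messages, so other variants of these errors might not be detected.

```shell
# Hypothetical helper: scan logging module log text on stdin for the error
# signatures described in this section and print the suggested fix for each.
triage_logging_errors() {
  awk '
    /PERMISSION_DENIED/ { perm = 1 }
    /Insufficient tokens for quota/ { quota = 1 }
    /Service account not enabled on this instance/ { cred = 1 }
    /can.t fetch token from the metadata server/ { cred = 1 }
    END {
      if (perm)  print "permission: enable the Logging API and grant the Logs Writer role"
      if (quota) print "quota: raise the quota or reduce the log throughput"
      if (cred)  print "credentials: see Authorize the Ops Agent"
    }
  '
}

# Example usage on Linux:
#   sudo tail -n 500 /var/log/google-cloud-ops-agent/subagents/logging-module.log \
#     | triage_logging_errors
```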

Is the agent sending metrics to Cloud Monitoring?

Check the metrics module log

This step requires you to SSH into the VM.

You can find the metrics module logs in syslog. If there are no logs, this indicates that the agent service is not running properly. Go to the Agent is installed but not running section first to fix that condition.

  • You might see PermissionDenied errors when writing to the Monitoring API. This error occurs if the permissions for the Ops Agent are not properly configured. For example:

    Nov  2 14:51:27 test-ops-agent-error otelopscol[412]: 2021-11-02T14:51:27.343Z#011info#011exporterhelper/queued_retry.go:231#011Exporting failed. Will retry the request after interval.#011{"kind": "exporter", "name": "googlecloud", "error": "[rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist).; rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist).]", "interval": "6.934781228s"}
    

    To fix this error, enable the Monitoring API and set the Monitoring Metric Writer role.

  • You might see ResourceExhausted errors when writing to the Monitoring API. This error occurs if the project is hitting the limit for any Monitoring API quotas. For example:

    Nov  2 18:48:32 test-ops-agent-error otelopscol[441]: 2021-11-02T18:48:32.175Z#011info#011exporterhelper/queued_retry.go:231#011Exporting failed. Will retry the request after interval.#011{"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Total requests' and limit 'Total requests per minute per user' of service 'monitoring.googleapis.com' for consumer 'project_number:8563942476'.\nerror details: name = ErrorInfo reason = RATE_LIMIT_EXCEEDED domain = googleapis.com metadata = map[consumer:projects/8563942476 quota_limit:DefaultRequestsPerMinutePerUser quota_metric:monitoring.googleapis.com/default_requests service:monitoring.googleapis.com]", "interval": "2.641515416s"}
    

    To fix this error, raise the quota or reduce the metrics throughput.

  • You might see the following errors in the module log:

    {"error":"invalid_request","error_description":"Service account not enabled on this instance"}
    

    or

    can't fetch token from the metadata server
    

    These errors might indicate that you deployed the agent with no service account or specified credentials. For information about resolving this issue, see Authorize the Ops Agent.
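
Because the metrics module retries failed exports, the same error can repeat many times in syslog. Counting the gRPC status codes gives a quick picture of which problem dominates. The following is a sketch: rpc_error_counts is a hypothetical helper name, and the syslog location is an assumption (commonly /var/log/syslog on Debian and Ubuntu, and /var/log/messages on CentOS and RHEL).

```shell
# Hypothetical helper: read syslog text on stdin, keep the metrics module
# (otelopscol) lines, and print "<gRPC status code> <count>" per code.
rpc_error_counts() {
  grep 'otelopscol' |
    sed -nE 's/.*rpc error: code = ([A-Za-z]+).*/\1/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example usage (the path is an assumption, adjust it for your distribution):
#   sudo cat /var/log/syslog | rpc_error_counts
```

A high PermissionDenied count points to the role and API settings, and a high ResourceExhausted count points to quota.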

Inspect agent self logs

If the agent fails to ingest logs to Cloud Logging, then you might have to inspect the logs locally on the VM for troubleshooting.

Linux

To inspect self logs that are written to Journald, run the following command:

journalctl -u "google-cloud-ops-agent*"

To inspect the self logs that are written to the disk by the logging module, run the following command:

vim /var/log/google-cloud-ops-agent/subagents/logging-module.log

Windows

To inspect self logs that are written to Windows Event Logs, run the following command:

Get-WinEvent -FilterHashtable @{ Logname='Application'; ProviderName='google-cloud-ops-agent*' } | Format-Table -AutoSize -Wrap

To inspect the self logs that are written to the disk by the logging module, run the following command:

notepad "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log"

To inspect the logs from the Windows Service Control Manager for Ops Agent services, run the following command:

Get-WinEvent -FilterHashtable @{ Logname='System'; ProviderName='Service Control Manager' } | Where-Object -Property Message -Match 'Google Cloud Ops Agent' | Format-Table -AutoSize -Wrap

Set up self log file rotation on Linux VMs

To limit the size of the logging sub-agent log at /var/log/google-cloud-ops-agent/subagents/logging-module.log, install and configure the logrotate utility.

  1. Install the logrotate utility by running the following command:

    On Debian and Ubuntu

    sudo apt install logrotate
    

    On CentOS, RHEL and Fedora

    sudo yum install logrotate
    
  2. Create a logrotate config file at /etc/logrotate.d/google-cloud-ops-agent.conf.

    sudo tee /etc/logrotate.d/google-cloud-ops-agent.conf > /dev/null << EOF
    # logrotate config to rotate Google Cloud Ops Agent self log file.
    # See https://manpages.debian.org/jessie/logrotate/logrotate.8.en.html for
    # the full options.
    /var/log/google-cloud-ops-agent/subagents/logging-module.log
    {
        # Log files are rotated every day.
        daily
        # Log files are rotated this many times before being removed. This
        # effectively limits the disk space used by the Ops Agent self log files.
        rotate 30
        # Log files are rotated when they grow bigger than maxsize even before the
        # additionally specified time interval
        maxsize 256M
        # Skip rotation if the log file is missing.
        missingok
        # Do not rotate the log if it is empty.
        notifempty
        # Old versions of log files are compressed with gzip by default.
        compress
        # Postpone compression of the previous log file to the next rotation
        # cycle.
        delaycompress
    }
    EOF
    
  3. Set up a crontab entry or a systemd timer to trigger the logrotate utility periodically.
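
One way to do the last step is an hourly entry in a cron.d file. The following is a sketch under several assumptions: logrotate_cron_line is a hypothetical helper name, the logrotate binary is assumed to be at /usr/sbin/logrotate, and the cron file name is only a suggestion. Many distributions already run logrotate daily, in which case an extra entry is needed only if you want the maxsize check to run more often.

```shell
# Hypothetical helper: print a cron.d entry that runs logrotate hourly as root
# against the Ops Agent logrotate config file.
#   $1: path to the logrotate binary (assumed default: /usr/sbin/logrotate)
#   $2: path to the logrotate config file
logrotate_cron_line() {
  printf '0 * * * * root %s %s\n' \
    "${1:-/usr/sbin/logrotate}" \
    "${2:-/etc/logrotate.d/google-cloud-ops-agent.conf}"
}

# Example usage (the cron file name is an assumption):
#   logrotate_cron_line | sudo tee /etc/cron.d/google-cloud-ops-agent-logrotate > /dev/null
```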

After the log rotation takes effect, you see rotated files in the /var/log/google-cloud-ops-agent/subagents/ directory. The results look similar to the following output:

/var/log/google-cloud-ops-agent/subagents$ ls -lh
total 24K
-rw-r--r-- 1 root root  717 Sep  3 19:54 logging-module.log
-rw-r--r-- 1 root root 6.8K Sep  3 19:51 logging-module.log.1
-rw-r--r-- 1 root root  874 Sep  3 19:50 logging-module.log.2.gz
-rw-r--r-- 1 root root  873 Sep  3 19:50 logging-module.log.3.gz
-rw-r--r-- 1 root root 3.2K Sep  3 19:34 logging-module.log.4.gz

To test log rotation, do the following:

  1. Temporarily reduce the file size at which rotation is triggered by setting the maxsize value to 1k in the /etc/logrotate.d/google-cloud-ops-agent.conf file.

  2. Trigger the agent self log file to be larger than 1K by restarting the agent a few times:

    sudo service google-cloud-ops-agent restart
    
  3. Wait for the crontab or systemd timer to take effect to trigger the logrotate utility, or trigger the logrotate utility manually by running this command:

    sudo logrotate /etc/logrotate.d/google-cloud-ops-agent.conf
    
  4. Verify that you see rotated log files in the /var/log/google-cloud-ops-agent/subagents/ directory.

  5. Reset the log-rotation configuration by restoring the original maxsize value.

Completely reset the agent state

If the agent enters a non-recoverable state, follow these steps to restore the agent to a fresh state.

Linux

Stop the agent service:

sudo service google-cloud-ops-agent stop

Remove the agent package:

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --uninstall --remove-repo

Remove the agent's self logs on disk:

sudo rm -rf /var/log/google-cloud-ops-agent

Remove the agent's local buffers on disk:

sudo rm -rf /var/lib/google-cloud-ops-agent/fluent-bit/buffers/*/

Reinstall and restart the agent:

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
sudo service google-cloud-ops-agent restart

Windows

Stop the agent service:

Stop-Service google-cloud-ops-agent -Force;
Get-Service google-cloud-ops-agent* | %{sc.exe delete $_.Name};
taskkill /f /fi "SERVICES eq google-cloud-ops-agent*";

Remove the agent package:

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1");
$env:REPO_SUFFIX="";
Invoke-Expression "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1 -Uninstall -RemoveRepo"

Remove the agent's self logs on disk:

rmdir -R -ErrorAction SilentlyContinue "C:\ProgramData\Google\Cloud Operations\Ops Agent\log";

Remove the agent's local buffers on disk:

Get-ChildItem -Path "C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\" -Directory -ErrorAction SilentlyContinue | %{rm -r -Path $_.FullName}

Reinstall and restart the agent:

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1");
$env:REPO_SUFFIX="";
Invoke-Expression "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall"

Reset but save the buffer files

If the VM does not have corrupted buffer chunks (that is, there are no format check failed messages in the Ops Agent's self log file), then you can skip the previous commands that remove the local buffers when resetting the agent state.
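
To decide whether the buffers need attention on a Linux VM, you can count the format check failed messages in the self log. The following is a minimal sketch; corrupted_chunk_count is a hypothetical helper name, and it reports 0 when the log file is missing or unreadable, which you should not treat as proof that the buffers are healthy.

```shell
# Hypothetical helper: print the number of "format check failed" messages in
# the logging module self log.
#   $1: log file path (defaults to the Linux location used in this document)
corrupted_chunk_count() {
  log="${1:-/var/log/google-cloud-ops-agent/subagents/logging-module.log}"
  # Report 0 if the file is missing or unreadable.
  [ -r "$log" ] || { echo 0; return 0; }
  # grep -c prints 0 but exits non-zero when nothing matches.
  grep -c "format check failed" "$log" || true
}

# Example usage:
#   if [ "$(corrupted_chunk_count)" -gt 0 ]; then echo "corrupted chunks found"; fi
```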

If the VM does have corrupted buffer chunks, then you have to remove them. The following options describe different ways to handle the buffers. The other steps described in Completely reset the agent state are still applicable.

  • Option 1: Delete the entire buffers directory. This is the easiest option, but it can result in loss of the uncorrupted buffered logs or log duplication due to the loss of the position files.

    Linux

    sudo rm -rf /var/lib/google-cloud-ops-agent/fluent-bit/buffers
    

    Windows

    rmdir -R -ErrorAction SilentlyContinue "C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers";
    
  • Option 2: Delete the buffer subdirectories from the buffers directory, but leave the position files. This approach is described in Completely reset the agent state.

  • Option 3: If you don't want to delete all the buffer files, then you can extract the names of the corrupted buffer files from the agent's self logs and delete only corrupted buffer files.

    Linux

    grep "format check failed" /var/log/google-cloud-ops-agent/subagents/logging-module.log | sed 's|.*format check failed: |/var/lib/google-cloud-ops-agent/fluent-bit/buffers/|' | xargs sudo rm -f
    

    Windows

    $oalogspath="C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log";
    if (Test-Path $oalogspath) {
      Select-String "format check failed" $oalogspath |
      %{$_ -replace '.*format check failed: (.*)/(.*)', '$1\$2'} |
      %{rm -ErrorAction SilentlyContinue -Path ('C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\' + $_)}
    };
    
  • Option 4: If there are many corrupted buffers and you want to reprocess all log files, then you can use the commands from Option 3 and also delete the position files (which store Ops Agent progress per log file). Deleting the position files can result in log duplication for any logs that are already successfully ingested. This option only reprocesses current log files; it does not reprocess files that have already been rotated out or logs from other sources like a TCP port. The position files are stored as regular files directly in the buffers directory, whereas the local buffers are stored as subdirectories of the buffers directory.

    Linux

    grep "format check failed" /var/log/google-cloud-ops-agent/subagents/logging-module.log | sed 's|.*format check failed: |/var/lib/google-cloud-ops-agent/fluent-bit/buffers/|' | xargs sudo rm -f
    sudo find /var/lib/google-cloud-ops-agent/fluent-bit/buffers -maxdepth 1 -type f -delete
    

    Windows

    $oalogspath="C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log";
    if (Test-Path $oalogspath) {
      Select-String "format check failed" $oalogspath |
      %{$_ -replace '.*format check failed: (.*)/(.*)', '$1\$2'} |
      %{rm -ErrorAction SilentlyContinue -Path ('C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\' + $_)}
    };
    Get-ChildItem -Path "C:\ProgramData\Google\Cloud Operations\Ops Agent\run\buffers\" -File -ErrorAction SilentlyContinue | %{$_.Delete()}
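
Because Options 3 and 4 delete files based on text extracted from a log, you might want to preview the affected paths on Linux first. The following sketch reuses the sed expression from Option 3 but only prints the paths; list_corrupted_chunks is a hypothetical helper name, and the sample message format in the comment is illustrative rather than copied from a real log.

```shell
# Hypothetical helper: print the buffer files that the Option 3 command would
# delete, without deleting anything.
#   $1: self log path, $2: buffers directory (defaults: Linux paths from this document)
list_corrupted_chunks() {
  log="${1:-/var/log/google-cloud-ops-agent/subagents/logging-module.log}"
  buffers="${2:-/var/lib/google-cloud-ops-agent/fluent-bit/buffers}"
  [ -r "$log" ] || return 0
  # Illustrative message: "... format check failed: <subdirectory>/<chunk file>"
  grep "format check failed" "$log" |
    sed "s|.*format check failed: |$buffers/|" |
    sort -u
}

# Example usage:
#   list_corrupted_chunks             # preview only
#   list_corrupted_chunks | xargs sudo rm -f   # delete after reviewing the list
```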
    

Known issues

The following section contains known common issues. For the ones that are fixed or mitigated already, follow the specific instructions to pick up the fix.

Non-harmful logs

  • Errors scraping metrics from pseudo-processes or restricted processes

    The following logs are not harmful and can be safely ignored. To eliminate them, upgrade the Ops Agent to version 2.10.0 or higher.

    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:55.848Z        error        scraperhelper/scrapercontroller.go:205        Error scraping metrics        {"kind"
    : "receiver", "name": "hostmetrics/hostmetrics", "error": "[error reading process name for pid 2: readlink /proc/2/exe: no such file or directory; error reading process name for
    pid 3: readlink /proc/3/exe: no such file or directory; error reading process name for pid 4: readlink /proc/4/exe: no such file or directory; error reading process name for pid
    5: readlink /proc/5/exe: no such file or directory; error reading process name for pid 6: readlink /proc/6/exe: no such file or directory; error reading process name for pid 7: r
    eadlink /proc/7/exe: no such file or directory; error reading process name for pid 8: readlink /proc/8/exe: no such file or directory; error reading process name for pid 9: readl
    ink /proc/9/exe: no such file or directory; error reading process name for pid 10: readlink /proc/10/exe: no such file or directory; error reading process name for pid 11: readli
    nk /proc/11/exe: no such file or directory; error reading process name for pid 12: readlink /proc/12/exe: no such file or directory; error reading process name for pid 13: readli
    nk /proc/13/exe: no such file or directory; error reading process name for pid 14: readlink /proc/14/exe: no such file or directory; error reading process name for pid 15: readli
    nk /proc/15/exe: no such file or directory; error reading process name for pid 16: readlink /proc/16/exe: no such file or directory; error reading process name for pid 17: readli
    nk /proc/17/exe: no such file or directory; error reading process name for pid 18: readlink /proc/18/exe: no such file or directory; error reading process name for pid 19: readli
    nk /proc/19/exe: no such file or directory; error reading process name for pid 20: readlink /proc/20/exe: no such file or directory; error reading process name for pid 21: readli
    nk /proc/21/exe: no such file or directory; error reading process name for pid 22: readlink /proc/22/exe: no such file or directory; error reading process name for pid
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: 23: readlink /proc/23/exe: no such file or directory; error reading process name for pid 24: readlink /proc/24/exe: no such file
    or directory; error reading process name for pid 25: readlink /proc/25/exe: no such file or directory; error reading process name for pid 26: readlink /proc/26/exe: no such file
    or directory; error reading process name for pid 27: readlink /proc/27/exe: no such file or directory; error reading process name for pid 28: readlink /proc/28/exe: no such file
    or directory; error reading process name for pid 30: readlink /proc/30/exe: no such file or directory; error reading process name for pid 31: readlink /proc/31/exe: no such file
    or directory; error reading process name for pid 43: readlink /proc/43/exe: no such file or directory; error reading process name for pid 44: readlink /proc/44/exe: no such file
    or directory; error reading process name for pid 45: readlink /proc/45/exe: no such file or directory; error reading process name for pid 90: readlink /proc/90/exe: no such file
    or directory; error reading process name for pid 92: readlink /proc/92/exe: no such file or directory; error reading process name for pid 106: readlink /proc/106/exe: no such fi
    le or directory; error reading process name for pid 360: readlink /proc/360/exe: no such file or directory; error reading process name for pid 375: readlink /proc/375/exe: no suc
    h file or directory; error reading process name for pid 384: readlink /proc/384/exe: no such file or directory; error reading process name for pid 386: readlink /proc/386/exe: no
    such file or directory; error reading process name for pid 387: readlink /proc/387/exe: no such file or directory; error reading process name for pid 422: readlink /proc/422/exe
    : no such file or directory; error reading process name for pid 491: readlink /proc/491/exe: no such file or directory; error reading process name for pid 500: readlink /proc/500
    /exe: no such file or directory; error reading process name for pid 2121: readlink /proc/2121/exe: no such file or directory; error reading
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: process name for pid 2127: readlink /proc/2127/exe: no such file or directory]"}
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]:         /root/go/pkg/mod/go.opentelemetry.io/collector@v0.29.0/receiver/scraperhelper/scrapercontroller.go:205
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]: go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
    Jul 13 17:28:55 debian9-trouble otelopscol[2134]:         /root/go/pkg/mod/go.opentelemetry.io/collector@v0.29.0/receiver/scraperhelper/scrapercontroller.go:186
    
  • Errors when the first data point of cumulative metrics gets dropped

    The following logs are not harmful and can be safely ignored.

    Jul 13 17:28:03 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:03.092Z        info        exporterhelper/queued_retry.go:316        Exporting failed. Will retry the request a
    fter interval.        {"kind": "exporter", "name": "googlecloud/agent", "error": "rpc error: code = InvalidArgument desc = Field timeSeries[1].points[0].interval.start_time had a
    n invalid value of \"2021-07-13T10:25:18.061-07:00\": The start time must be before the end time (2021-07-13T10:25:18.061-07:00) for the non-gauge metric 'agent.googleapis.com/ag
    ent/uptime'.", "interval": "23.491024535s"}
    Jul 13 17:28:41 debian9-trouble otelopscol[2134]: 2021-07-13T17:28:41.269Z        info        exporterhelper/queued_retry.go:316        Exporting failed. Will retry the request a
    fter interval.        {"kind": "exporter", "name": "googlecloud/agent", "error": "rpc error: code = InvalidArgument desc = Field timeSeries[0].points[0].interval.start_time had a
    n invalid value of \"2021-07-13T10:26:18.061-07:00\": The start time must be before the end time (2021-07-13T10:26:18.061-07:00) for the non-gauge metric 'agent.googleapis.com/ag
    ent/monitoring/point_count'.", "interval": "21.556591578s"}
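
The first item above recommends upgrading to Ops Agent version 2.10.0 or higher. As a rough sketch of how a script might check a version string against that minimum, the following helper compares two versions. It assumes GNU sort (for the -V option). The function name version_at_least and the sample version strings are hypothetical and only illustrate the comparison.

```shell
# Return success (exit status 0) if version $1 is at least version $2.
# Assumes GNU sort, which provides the -V (version sort) option.
version_at_least() {
  local installed required
  # Drop any trailing suffix (anything after the digits and dots).
  installed="${1%%[!0-9.]*}"
  required="$2"
  # The lower version sorts first; if that is the required one, the check passes.
  [ "$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n 1)" = "$required" ]
}

# Hypothetical version strings, used only to illustrate the comparison.
for v in "2.9.1~debian10" "2.10.0~debian10" "2.17.0"; do
  if version_at_least "$v" "2.10.0"; then
    echo "$v: at least 2.10.0"
  else
    echo "$v: upgrade needed"
  fi
done
```

On Debian or Ubuntu, the installed version could be read with dpkg-query --show --showformat '${Version}' google-cloud-ops-agent, and on RPM-based distributions with rpm --query --queryformat '%{VERSION}' google-cloud-ops-agent. The helper expects both versions to use the same number of components (for example, 2.10.0 rather than 2.10).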
    

Some of the metrics are missing or inconsistent

Ops Agent versions 2.0.0 and higher handle a small number of metrics differently from the "preview" versions of the Ops Agent (versions earlier than 2.0.0) and from the Monitoring agent.

The following table describes differences in the data ingested by the Ops Agent and the Monitoring agent.

| Metric type, omitting agent.googleapis.com | Ops Agent (GA) | Ops Agent (Preview) | Monitoring agent |
|---|---|---|---|
| disk/bytes_used and disk/percent_used | Ingested with the full path in the device label; for example, /dev/sda15. Not ingested for virtual devices like tmpfs and udev. | Ingested without /dev in the path in the device label; for example, sda15. Ingested for virtual devices like tmpfs and udev. | Ingested without /dev in the path in the device label; for example, sda15. Ingested for virtual devices like tmpfs and udev. |

The GA column refers to Ops Agent versions 2.0.0 and higher. The Preview column refers to Ops Agent versions less than 2.0.0.
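
If your own scripts compare disk metrics across agents, the differing device label formats can make the same disk look like two different devices. The following is a minimal sketch of one way to put both formats into a common form before comparing them. The function name normalize_device_label is hypothetical, and the sketch assumes that the only difference is the leading /dev/ described above.

```shell
# Strip a leading "/dev/" so that device labels from different agents compare equal.
normalize_device_label() {
  printf '%s\n' "${1#/dev/}"
}

normalize_device_label "/dev/sda15"   # label format with the full path
normalize_device_label "sda15"        # label format without /dev
```

Both calls print the same value, so the two label formats can be matched. The sketch does not account for virtual devices like tmpfs and udev, which the Ops Agent (GA) does not ingest at all.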

Removed agent reported by Google Cloud console as installed

After you uninstall the agent, the Google Cloud console might take up to one hour to report this change.

Agent self logs consume too much CPU, memory, and disk space

Because of corrupted buffer chunks, older versions of the Ops Agent might consume a lot of CPU, memory, and disk space with /var/log/google-cloud-ops-agent/subagents/logging-module.log files on Linux VMs or C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log files on Windows VMs. When this happens, the logging-module.log file contains a large number of messages like the following:

  [2022/04/30 05:23:38] [error] [input chunk] error writing data from tail.2 instance
  [2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
  [2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
  [2022/04/30 05:23:38] [error] [storage] [cio file] file is not mmap()ed: tail.2:2004860-1650614856.691268293.flb
  

To resolve this problem, upgrade the Ops Agent to version 2.17 or higher and completely reset the agent state.
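
To get a rough idea of whether a Linux VM shows this symptom, you could count the "format check failed" messages in the self-log file. The following is a sketch under that assumption. The function name count_chunk_errors is hypothetical, and the demo runs against a temporary copy of the sample messages shown above rather than a real log file.

```shell
# Print the number of lines in file $1 that report a failed chunk format check.
count_chunk_errors() {
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
  grep -c -F '[storage] format check failed' -- "$1" || true
}

# Demo on a temporary file that holds the sample messages from this section.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
[2022/04/30 05:23:38] [error] [input chunk] error writing data from tail.2 instance
[2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
[2022/04/30 05:23:38] [error] [storage] format check failed: tail.2/2004860-1650614856.691268293.flb
[2022/04/30 05:23:38] [error] [storage] [cio file] file is not mmap()ed: tail.2:2004860-1650614856.691268293.flb
EOF
count_chunk_errors "$sample"   # prints 2 for this sample
rm -f "$sample"
```

On a VM, you would pass the real path instead, for example count_chunk_errors /var/log/google-cloud-ops-agent/subagents/logging-module.log, possibly with elevated privileges. This document does not define a threshold for what counts as a large number, so treat the count only as an indicator.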