Why can't I see the menu option to access the dashboard?
If you don't have the GKE option in the Resource menu, you might not have any GKE clusters using Cloud Operations for GKE.
Why can't I see any Kubernetes resources in my dashboard?
If you don't see any Kubernetes resources in your Cloud Operations for GKE dashboard, then check the following:
Is the correct Google Cloud project selected at the top of the page?
If not, use the drop-down list in the menu bar to select a project. You must select the project whose data you want to see.
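If you work from the command line, the following commands can help you confirm which projects you have access to and which project gcloud currently targets. This is only a quick sanity check: the console's project selector, not the gcloud configuration, determines what the dashboard shows.

# List the projects your account can access.
gcloud projects list

# Show the project that gcloud currently targets.
gcloud config get-value project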
Does your project have any activity?
If you just created your cluster, wait a few minutes for it to populate with data. See Installing monitoring and logging support for details.
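To confirm that a cluster is configured to send data to Cloud Logging and Cloud Monitoring at all, one option (a sketch; CLUSTER_NAME and ZONE are placeholders for your own values) is to check which logging and monitoring services the cluster uses:

# Show the logging and monitoring services configured for the cluster.
# Empty or "none" values mean the cluster isn't sending data to Cloud Operations.
gcloud container clusters describe CLUSTER_NAME --zone=ZONE \
    --format="value(loggingService,monitoringService)"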
Is the time range too narrow?
You can use the Time menu in the dashboard toolbar to select other time ranges or define a Custom range.
Do you have the proper permissions to view the dashboard?
If you see either of the following permission-denied error messages when viewing a service's deployment details or a Google Cloud project's metrics, you need to update your Identity and Access Management role to include roles/monitoring.viewer or roles/viewer:
You do not have sufficient permissions to view this page
You don't have permissions to perform the action on the selected resources
For more details, go to Predefined roles.
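If a project owner needs to grant you one of these roles, one way to do so (a sketch; PROJECT_ID and USER_EMAIL are placeholders) is with the gcloud CLI:

# Grant the Monitoring Viewer role so the user can view the dashboard.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/monitoring.viewer"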
Does the service account for your clusters and nodes have permission to write data into Monitoring and Logging?
If you see high error rates on your API dashboard, then your service account might be missing the following roles:
roles/logging.logWriter: In the Google Cloud console, this role is named Logs Writer. For more information on Logging roles, see the Logging access control guide.
roles/monitoring.metricWriter: In the Google Cloud console, this role is named Monitoring Metric Writer. For more information on Monitoring roles, see the Monitoring access control guide.
roles/stackdriver.resourceMetadata.writer: In the Google Cloud console, this role is named Stackdriver Resource Metadata Writer. This role permits write-only access to resource metadata, and it provides exactly the permissions needed by agents to send metadata. For more information on Monitoring roles, see the Monitoring access control guide.
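To grant these roles to the service account that your clusters and nodes use, one approach (a sketch, assuming you already know the service account's email address; PROJECT_ID and SA_EMAIL are placeholders) is:

# Grant the write roles the agents need to send logs, metrics, and resource metadata.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SA_EMAIL" --role="roles/logging.logWriter"
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SA_EMAIL" --role="roles/monitoring.metricWriter"
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SA_EMAIL" --role="roles/stackdriver.resourceMetadata.writer"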
Why don't I see all of my logs?
Is your agent running and healthy?
GKE versions 1.17 and later use Fluent Bit to capture logs. Fluent Bit is the Logging agent that runs on Kubernetes nodes. To check whether the agent is running correctly, perform the following steps:
Check whether the agent is restarting by running the following command:
kubectl get pods -l k8s-app=fluentbit-gke -n kube-system
If there are no restarts, the output is similar to the following:
NAME                  READY   STATUS    RESTARTS   AGE
fluentbit-gke-6zr6g   2/2     Running   0          44d
fluentbit-gke-dzh9l   2/2     Running   0          44d
Check Pod status conditions by running the following command:
JSONPATH='{range .items[*]};{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status},{end}{end};' \
&& kubectl get pods -l k8s-app=fluentbit-gke -n kube-system -o jsonpath="$JSONPATH" | tr ";" "\n"
If the deployment is healthy, the output is similar to the following:
fluentbit-gke-nj4qs:Initialized=True,Ready=True,ContainersReady=True,PodScheduled=True,
fluentbit-gke-xtcvt:Initialized=True,Ready=True,ContainersReady=True,PodScheduled=True,
Check the DaemonSet status, which can help determine whether the deployment is healthy, by running the following command:
kubectl get daemonset -l k8s-app=fluentbit-gke -n kube-system
If the deployment is healthy, the output is similar to the following:
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
fluentbit-gke   2         2         2       2            2           kubernetes.io/os=linux   5d19h
In this example output, the desired state matches the current state.
If the agent is running and healthy in these scenarios, and you still don't see all of your logs, the agent might be overloaded and dropping logs.
Is your agent overloaded and dropping logs?
One possible reason you're not seeing all of your logs is that the node's log volume is overloading the agent. The default Logging agent configuration in GKE is tuned for the rate of 100 kiB/s per node, and the agent might start dropping logs if the volume exceeds that limit.
To detect if you might be hitting this limit, look for any of the following indicators:
View the kubernetes.io/container/cpu/core_usage_time metric with the filter container_name=fluentbit-gke to see if the CPU usage of the Logging agent is near or at 100%. For a command-line spot check, see the example below.
View the logging.googleapis.com/byte_count metric grouped by metadata.system_labels.node_name to see if any node reaches 100 kiB/s.
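As a command-line spot check of the first indicator, a sketch that assumes the Kubernetes Metrics Server is available in your cluster, you can look at the Logging agent's current CPU usage directly:

# Show per-container CPU and memory usage for the Logging agent Pods.
kubectl top pods -l k8s-app=fluentbit-gke -n kube-system --containers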
If you see any of these conditions, you can reduce the per-node log volume by adding more nodes to the cluster, which spreads Pods and their logs across more agents. If all of the log volume comes from a single Pod, then you need to reduce the volume from that Pod instead.
If you want to change the Logging agent tuning parameters, review the Deploy a custom Fluent Bit daemonset community tutorial for deploying a custom Logging agent configuration.
For more information on investigating and resolving GKE logging-related issues, see Troubleshooting logging in GKE.
Why isn't my incident matched to a GKE resource?
If you have an alerting policy condition that aggregates metrics across distinct GKE resources, you might need to edit the policy's condition to include more GKE hierarchy labels to associate incidents with specific entities.
For example, you might have two GKE clusters, one for production and one for staging, each with their own copy of the service lilbuddy-2. When the alerting policy condition aggregates a metric across containers in both clusters, the GKE Monitoring dashboard isn't able to associate the incident uniquely with the production service or the staging service.

To resolve this situation, target the alerting policy to a specific service by adding namespace, cluster, and location to the policy's Group By field. On the event card for the alert, click the Update alert policy link to open the Edit alerting policy page for the relevant alert policy. From here, you can update the alerting policy with the additional information so that the dashboard can find the associated resource.
After you update the alerting policy, the GKE Monitoring dashboard is able to associate all future incidents with a unique service in a particular cluster, giving you additional information to diagnose the problem.
Depending on your use case, you might want to filter on some of these labels in addition to adding them to the Group By field. For example, if you only want alerts for your production cluster, you can filter on cluster_name.
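If you manage alerting policies outside the console, you can make the same change in the policy definition itself. The sketch below makes several assumptions: it uses the gcloud alpha monitoring policies commands, whose availability depends on your gcloud version; POLICY_ID is a placeholder; and the label names assume a metric on the k8s_container resource type.

# Export the policy, add the GKE hierarchy labels to the condition's aggregation
# "groupByFields" (for example resource.label.location, resource.label.cluster_name,
# and resource.label.namespace_name), then upload the edited policy.
gcloud alpha monitoring policies describe POLICY_ID --format=json > policy.json
gcloud alpha monitoring policies update POLICY_ID --policy-from-file=policy.json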