This topic shows you how to export logs and metrics from a GKE on AWS user cluster to Cloud Logging and Cloud Monitoring.
Overview
There are multiple options for logging and monitoring with GKE on AWS. GKE Enterprise can be integrated with Cloud Logging and Cloud Monitoring. Because GKE Enterprise is based on open-source Kubernetes, many open-source and third-party tools are compatible.
Logging and monitoring options
You have several logging and monitoring options for your GKE Enterprise cluster:
Deploy the Cloud Logging and Cloud Monitoring agents to monitor and view logs from your workloads in the Google Cloud console. This topic explains this solution.
Use open source tools such as Prometheus, Grafana, and Elasticsearch. This topic does not describe this solution.
Use third-party solutions such as Datadog. This topic does not describe this solution.
Cloud Logging and Cloud Monitoring
With GKE Enterprise, Cloud Logging, and Cloud Monitoring, you can create dashboards, send alerts, and monitor and review logs for the workloads running on your cluster. You must configure the Cloud Logging and Cloud Monitoring agents to collect logs and metrics into your Google Cloud project. If you do not configure these agents, GKE on AWS does not collect logging or monitoring data.
What data is collected
When configured, the agents collect logs and metric data from your cluster and the workloads running on your cluster. This data is stored in your Google Cloud project. You configure the project ID in the project_id field of a configuration file when you install the log forwarder.
The data collected includes the following:
- Logs for system services on each of the worker nodes.
- Application logs for all workloads running on the cluster.
- Metrics for the cluster and system services. For more information on specific metrics, see GKE Enterprise metrics.
- Application metrics for Pods, if your applications are configured with Prometheus scrape targets and annotated with prometheus.io/scrape, prometheus.io/path, and prometheus.io/port (see the example after this list).
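For example, a Pod that exposes Prometheus metrics might be annotated as follows. This is a minimal sketch; the Pod name, image, metrics path, and port are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: metrics-example            # hypothetical Pod name
  annotations:
    prometheus.io/scrape: "true"   # collect application metrics from this Pod
    prometheus.io/path: "/metrics" # assumed path where the app serves metrics
    prometheus.io/port: "8080"     # assumed port where the app serves metrics
spec:
  containers:
  - name: app
    image: example.com/metrics-app:1.0   # hypothetical image that exposes Prometheus metrics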
The agents can be disabled at any time. For more information, see Cleaning up. Data collected by the agents can be managed and deleted like any other metric and log data, as described in the Cloud Monitoring and Cloud Logging documentation.
Log data is stored according to your configured retention rules. Metrics data retention varies based on type.
Logging and monitoring components
To export cluster-level telemetry from GKE on AWS into Google Cloud, you deploy the following components into your cluster:
- Stackdriver Log Forwarder (stackdriver-log-forwarder-*). A Fluent Bit DaemonSet that forwards logs from each Kubernetes node to Cloud Logging.
- GKE Metrics Agent (gke-metrics-agent-*). An OpenTelemetry Collector-based DaemonSet that collects metrics data and forwards it to Cloud Monitoring.
Manifests for these components are in the anthos-samples repository on GitHub.
Prerequisites
A Google Cloud project with billing enabled. For more information on costs, see Pricing for Google Cloud Observability.
The project must also have the Cloud Logging and Cloud Monitoring APIs enabled. To enable these APIs, run the following commands:
gcloud services enable logging.googleapis.com
gcloud services enable monitoring.googleapis.com
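To confirm that both APIs are enabled, you can list the enabled services and filter the output, for example:
gcloud services list --enabled | grep -E 'logging|monitoring'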
A GKE on AWS environment, including a user cluster registered with Connect. Run the following command to verify that your cluster is registered.
gcloud container fleet memberships list
If your cluster is registered, the Google Cloud CLI prints the cluster's name and ID.
NAME         EXTERNAL_ID
cluster-0    1abcdef-1234-4266-90ab-123456abcdef
If you do not see your cluster listed, see Connecting to a cluster with Connect.
Install the git command-line tool on your machine.
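The examples in this topic use manifests from the anthos-samples repository mentioned earlier. With git installed, you can clone the repository and change into the logging and monitoring directory, for example:
git clone https://github.com/GoogleCloudPlatform/anthos-samples.git
cd anthos-samples/aws-logging-monitoring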
Setting up permissions for Google Cloud Observability
Logging and monitoring agents use Fleet Workload Identity to communicate with Cloud Logging and Cloud Monitoring. The identity needs permissions to write logs and metrics in your project. To add the permissions, run the following commands:
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:PROJECT_ID.svc.id.goog[kube-system/stackdriver]" \
--role=roles/logging.logWriter
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:PROJECT_ID.svc.id.goog[kube-system/stackdriver]" \
--role=roles/monitoring.metricWriter
Replace PROJECT_ID with your Google Cloud project ID.
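To confirm the bindings, you can inspect the project's IAM policy. The following is a quick check that filters for the stackdriver identity used above:
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:stackdriver" \
  --format="table(bindings.role, bindings.members)"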
Connect to the bastion host
To connect to your GKE on AWS resources, perform the following steps. Choose the instructions that apply, depending on whether you have an existing AWS VPC (or a direct connection to your VPC) or you created a dedicated VPC when creating your management service.
Existing VPC
If you have a direct or VPN connection to an existing VPC, omit the line env HTTPS_PROXY=http://localhost:8118 from the commands in this topic.
Dedicated VPC
When you create a management service in a dedicated VPC, GKE on AWS includes a bastion host in a public subnet.
To connect to your management service, perform the following steps:
Change to the directory with your GKE on AWS configuration. You created this directory when Installing the management service.
cd anthos-aws
To open the tunnel, run the bastion-tunnel.sh script. The tunnel forwards to localhost:8118. To open a tunnel to the bastion host, run the following command:
./bastion-tunnel.sh -N
Messages from the SSH tunnel appear in this window. When you are ready to close the connection, stop the process by using Control+C or closing the window.
Open a new terminal and change into your anthos-aws directory.
cd anthos-aws
Check that you're able to connect to the cluster with kubectl.
env HTTPS_PROXY=http://localhost:8118 \
  kubectl cluster-info
The output includes the URL for the management service API server.
Cloud Logging and Cloud Monitoring on control plane nodes
With GKE on AWS 1.8.0 and higher, Cloud Logging and Cloud Monitoring for control plane nodes can be automatically configured when creating new user clusters. To enable Cloud Logging or Cloud Monitoring, you populate the controlPlane.cloudOperations section of your AWSCluster configuration.
cloudOperations:
projectID: PROJECT_ID
location: GC_REGION
enableLogging: ENABLE_LOGGING
enableMonitoring: ENABLE_MONITORING
Replace the following:
- PROJECT_ID: your project ID.
- GC_REGION: the Google Cloud region where you want to store logs. Choose a region that is near the AWS region. For more information, see Global Locations - Regions & Zones. For example, us-central1.
- ENABLE_LOGGING: true or false, whether Cloud Logging is enabled on control plane nodes.
- ENABLE_MONITORING: true or false, whether Cloud Monitoring is enabled on control plane nodes.
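The following abbreviated sketch shows where this section sits inside an AWSCluster manifest. Only the relevant nesting is shown; other required fields are omitted, and the exact apiVersion depends on your GKE on AWS version.
apiVersion: multicloud.cluster.gke.io/v1
kind: AWSCluster
metadata:
  name: cluster-0                # example cluster name
spec:
  controlPlane:
    # ... other control plane settings ...
    cloudOperations:
      projectID: PROJECT_ID
      location: GC_REGION
      enableLogging: true        # or false
      enableMonitoring: true     # or false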
Next, follow the steps in Creating a custom user cluster.
Cloud Logging and Cloud Monitoring on worker nodes
Removing the previous version
If you have setup an earlier version of the logging and monitoring agents that
includes stackdriver-log-aggregator
(Fluentd) and stackdriver-prometheus-k8s
(Prometheus), you might want to uninstall them first before moving on.
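To check whether the older agents are still present, you can list their workloads. This is a sketch; the kube-system namespace and resource kinds are assumptions.
env HTTPS_PROXY=http://localhost:8118 \
  kubectl get daemonsets,statefulsets,deployments -n kube-system | grep -E 'stackdriver-log-aggregator|stackdriver-prometheus'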
Installing the logging forwarder
In this section, you install the Stackdriver Log Forwarder onto your cluster.
From the anthos-samples/aws-logging-monitoring/ directory, change into the logging/ directory.
cd logging/
Modify the file forwarder.yaml to match your project configuration:
sed -i "s/PROJECT_ID/PROJECT_ID/g" forwarder.yaml
sed -i "s/CLUSTER_NAME/CLUSTER_NAME/g" forwarder.yaml
sed -i "s/CLUSTER_LOCATION/GC_REGION/g" forwarder.yaml
Replace the following:
- PROJECT_ID: your project ID.
- CLUSTER_NAME: the name of your cluster. For example, cluster-0.
- GC_REGION: the Google Cloud region where you want to store logs. Choose a region that is near the AWS region. For more information, see Global Locations - Regions & Zones. For example, us-central1.
(Optional) Based on your workloads, the number of nodes in your cluster, and the number of pods per node, you might have to set memory and CPU resource requests. For more information, see Recommended CPU and memory allocations.
From your anthos-aws directory, use anthos-gke to switch context to your user cluster.
cd anthos-aws
env HTTPS_PROXY=http://localhost:8118 \
  anthos-gke aws clusters get-credentials CLUSTER_NAME
Replace CLUSTER_NAME with your user cluster name.
Create the stackdriver service account if it does not exist and deploy the log forwarder to the cluster.
env HTTPS_PROXY=http://localhost:8118 \
  kubectl create serviceaccount stackdriver -n kube-system
env HTTPS_PROXY=http://localhost:8118 \
  kubectl apply -f forwarder.yaml
Use kubectl to verify that the pods have started up.
env HTTPS_PROXY=http://localhost:8118 \
  kubectl get pods -n kube-system | grep stackdriver-log
You should see one forwarder pod per node in a node pool. For example, in a 6-node cluster, you should see six forwarder pods.
stackdriver-log-forwarder-2vlxb   2/2     Running   0          21s
stackdriver-log-forwarder-dwgb7   2/2     Running   0          21s
stackdriver-log-forwarder-rfrdk   2/2     Running   0          21s
stackdriver-log-forwarder-sqz7b   2/2     Running   0          21s
stackdriver-log-forwarder-w4dhn   2/2     Running   0          21s
stackdriver-log-forwarder-wrfg4   2/2     Running   0          21s
Testing log forwarding
In this section, you deploy a workload containing a basic HTTP web server with a load generator to your cluster. You then test that logs are present in Cloud Logging.
Before installing this workload, you can verify the manifests for the web server and load generator.
Deploy the web server and load generator to your cluster.
env HTTPS_PROXY=http://localhost:8118 \
  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/istio-samples/master/sample-apps/helloserver/server/server.yaml
env HTTPS_PROXY=http://localhost:8118 \
  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/istio-samples/master/sample-apps/helloserver/loadgen/loadgen.yaml
To verify that you can view logs from your cluster in the Cloud Logging dashboard, go to the Logs Explorer in the Google Cloud console:
Copy the sample query below into the Query builder field.
resource.type="k8s_container"
resource.labels.cluster_name="CLUSTER_NAME"
Replace CLUSTER_NAME with your cluster name.
Click Run query. You should see recent cluster logs appear under Query results.
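You can also run an equivalent query from the command line with the Google Cloud CLI, for example:
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.cluster_name="CLUSTER_NAME"' \
  --project=PROJECT_ID --limit=5
Replace CLUSTER_NAME with your cluster name and PROJECT_ID with your project ID.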
After you have confirmed the logs appear in query results, remove the load generator and web server.
env HTTPS_PROXY=http://localhost:8118 \
  kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/istio-samples/master/sample-apps/helloserver/loadgen/loadgen.yaml
env HTTPS_PROXY=http://localhost:8118 \
  kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/istio-samples/master/sample-apps/helloserver/server/server.yaml
Installing the metrics collector
In this section, you install an agent to send data to Cloud Monitoring.
From the anthos-samples/aws-logging-monitoring/logging/ directory, change into the anthos-samples/aws-logging-monitoring/monitoring/ directory.
cd ../monitoring
Modify the file gke-metrics-agent.yaml to match your project configuration:
sed -i "s/PROJECT_ID/PROJECT_ID/g" gke-metrics-agent.yaml
sed -i "s/CLUSTER_NAME/CLUSTER_NAME/g" gke-metrics-agent.yaml
sed -i "s/CLUSTER_LOCATION/GC_REGION/g" gke-metrics-agent.yaml
Replace the following:
- PROJECT_ID: your project ID.
- CLUSTER_NAME: the name of your cluster. For example, cluster-0.
- GC_REGION: the Google Cloud region where you want to store logs. Choose a region that is near the AWS region. For more information, see Global Locations - Regions & Zones. For example, us-central1.
(Optional) Based on your workloads, the number of nodes in your cluster, and the number of pods per node, you might have to set memory and CPU resource requests. For more information, see Recommended CPU and memory allocations.
Create the stackdriver service account if it does not exist and deploy the metrics agent to your cluster.
env HTTPS_PROXY=http://localhost:8118 \
  kubectl create serviceaccount stackdriver -n kube-system
env HTTPS_PROXY=http://localhost:8118 \
  kubectl apply -f gke-metrics-agent.yaml
Use the kubectl tool to verify that the gke-metrics-agent Pods are running.
env HTTPS_PROXY=http://localhost:8118 \
  kubectl get pods -n kube-system | grep gke-metrics-agent
You should see one agent pod per node in a node pool. For example, in a 3-node cluster, you should see three agent pods.
gke-metrics-agent-gjxdj   2/2     Running   0          102s
gke-metrics-agent-lrnzl   2/2     Running   0          102s
gke-metrics-agent-s6p47   2/2     Running   0          102s
To verify that your cluster metrics are being exported to Cloud Monitoring, go to the Metrics Explorer in the Google Cloud console:
In Metrics Explorer, click Query editor and then copy in the following command:
fetch k8s_container
| metric 'kubernetes.io/anthos/otelcol_exporter_sent_metric_points'
| filter resource.project_id == 'PROJECT_ID'
    && (resource.cluster_name == 'CLUSTER_NAME')
| align rate(1m)
| every 1m
Replace the following:
- PROJECT_ID: your project ID.
- CLUSTER_NAME: the cluster name you used when Creating a user cluster. For example, cluster-0.
Click Run query. The rate of metric points sent to Cloud Monitoring from each gke-metrics-agent pod in your cluster appears.
Other metrics worth trying include, but are not limited to, the following (see the query sketch after this list):
- kubernetes.io/anthos/container_memory_working_set_bytes: container memory usage.
- kubernetes.io/anthos/container_cpu_usage_seconds_total: container CPU usage.
- kubernetes.io/anthos/apiserver_aggregated_request_total: kube-apiserver request count; only available if Cloud Monitoring is enabled on the control plane.
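For example, the following query charts the per-container CPU usage rate for your cluster. It is a sketch that mirrors the query above; replace PROJECT_ID and CLUSTER_NAME as before.
fetch k8s_container
| metric 'kubernetes.io/anthos/container_cpu_usage_seconds_total'
| filter resource.project_id == 'PROJECT_ID'
    && (resource.cluster_name == 'CLUSTER_NAME')
| align rate(1m)
| every 1m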
For a complete list of available metrics, see Anthos Metrics. For information on how to use the user interface, see Metrics Explorer.
Creating a Dashboard in Cloud Monitoring
In this section, you create a Cloud Monitoring dashboard that monitors container status in your cluster.
From the anthos-samples/aws-logging-monitoring/monitoring/ directory, change into the anthos-samples/aws-logging-monitoring/monitoring/dashboards directory.
cd dashboards
Replace instances of the CLUSTER_NAME string in pod-status.json with your cluster name.
sed -i "s/CLUSTER_NAME/CLUSTER_NAME/g" pod-status.json
Replace CLUSTER_NAME with your cluster name.
Create a custom dashboard with the configuration file by running the following command:
gcloud monitoring dashboards create --config-from-file=pod-status.json
To verify that your dashboard is created, go to Cloud Monitoring Dashboards in the Google Cloud console.
Open the newly created dashboard with a name in the format of CLUSTER_NAME (Anthos cluster on AWS) pod status.
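You can also confirm from the command line that the dashboard was created by listing your dashboards, for example:
gcloud monitoring dashboards list --format="value(displayName)"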
Cleaning up
In this section, you remove the logging and monitoring components from your cluster.
Delete the monitoring dashboard in the Dashboards list view in the Google Cloud console by clicking the delete button associated with the dashboard name.
Change into the anthos-samples/aws-logging-monitoring/ directory.
cd anthos-samples/aws-logging-monitoring
To remove all the resources created in this guide, run the following commands:
env HTTPS_PROXY=http://localhost:8118 \
  kubectl delete -f logging/
env HTTPS_PROXY=http://localhost:8118 \
  kubectl delete -f monitoring/
Recommended CPU and memory allocations
This section includes recommended CPU and memory allocations for the individual components used in logging and monitoring. Each of the following tables lists CPU and memory requests for a cluster with a range of node sizes. You set resource requests for a component in the file listed in the table.
For more information, see Kubernetes best practices: Resource requests and limits and Managing Resources for Containers.
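For example, to apply the 1-10 node recommendations to the log forwarder, the container's resources block in logging/forwarder.yaml would look similar to the following sketch. The container name is an assumption; check the manifest for the actual name.
containers:
- name: stackdriver-log-forwarder     # container name is an assumption
  resources:
    requests:
      cpu: 50m          # values from the 1-10 node table below
      memory: 100Mi
    limits:
      cpu: 100m
      memory: 600Mi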
1-10 Nodes
File | Resource | CPU Requests | CPU Limits | Memory Requests | Memory Limits
---|---|---|---|---|---
monitoring/gke-metrics-agent.yaml | gke-metrics-agent | 30m | 100m | 50Mi | 500Mi
logging/forwarder.yaml | stackdriver-log-forwarder | 50m | 100m | 100Mi | 600Mi
10-100 Nodes
File | Resource | CPU Requests | CPU Limits | Memory Requests | Memory Limits
---|---|---|---|---|---
monitoring/gke-metrics-agent.yaml | gke-metrics-agent | 50m | 100m | 50Mi | 500Mi
logging/forwarder.yaml | stackdriver-log-forwarder | 60m | 100m | 100Mi | 600Mi
More than 100 nodes
File | Resource | CPU Requests | CPU Limits | Memory Requests | Memory Limits
---|---|---|---|---|---
monitoring/gke-metrics-agent.yaml | gke-metrics-agent | 50m | 100m | 100Mi | N/A
logging/forwarder.yaml | stackdriver-log-forwarder | 60m | 100m | 100Mi | 600Mi
What's next?
Learn about Cloud Logging:
- Cloud Logging overview
- Using the Logs Explorer
- Building queries for Cloud Logging
- Create logs-based metrics