The following sections explain how to check the status of your mesh and review the various logs that contain helpful details to assist your troubleshooting.
Interpret control plane metrics
When installing Anthos Service Mesh using the asm-gcp
profile, Istiod exports
metrics to Google Cloud's operations suite for monitoring, by default. Istiod
prefixes these metrics with istio.io/control
and give insights into the
control plane state, such as number of proxies connected to each control plane
instance, configuration events, pushes and validations.
Observe or troubleshoot the control plane, using the following steps.
Load a sample dashboard:
git clone https://github.com/GoogleCloudPlatform/monitoring-dashboard-samples && cd monitoring-dashboard-samples && git checkout asm
Install the Anthos Service Mesh dashboard:
gcloud monitoring dashboards create --config-from-file=dashboards/servicemesh/anthos-service-mesh-control-plane-monitoring.json
Look for for a dashboard named
Istio Control Plane Dashboard
in the list. For more information, see Viewing the installed dashboard.If you are not using the
asm-gcp
profile, you can still enable control plane metrics by adding the environment variables to the Istiod deployment:- name: ENABLE_STACKDRIVER_MONITORING value: "true"
In addition, you can add this environment variable using the installation option,
components.pilot.k8s.env
.
For the full list of metrics available, see Exported metrics.
Diagnose configuration delays
The following steps explain how to use the pilot_proxy_convergence_time
metric
to diagnose a delay between a configuration change and all proxies converging.
Run a shell command in a pod:
kubectl exec -it $(kubectl get pod -l app=pilot -o jsonpath='{.items[0].metadata.name}' -n istio-system) -n istio-system -c istio-proxy -- curl -s
Access
localhost:15014
andgrep
forconvergence
in metrics:curl http://localhost:15014/metrics | grep convergence
Interpret Google Cloud's operations suite access logs
The following information explains how to use the Google Cloud's operations suite access logs to troubleshoot connection problems.
Anthos Service Mesh exports data into Google Cloud's operations suite access logs that can help you debug the following types of problems:
- Traffic flow and failures
- End-to-end request routing
Google Cloud's operations suite access/traffic logs are enabled by default in the
asm-gcp
profile across the entire mesh. However, if they are not already
enabled, you can use the following command:
istioctl install --set profile=PROFILE_NAME --set revision==ASM_VERSION \ --set values.telemetry.v2.stackdriver.enabled=true \ --set values.telemetry.v2.stackdriver.logging=true
There are two types of access logs:
Server access logs give a server-side view of requests. They are located under
server-accesslog-stackdriver
, attached to thek8s_container
monitored resource. Use the following URL syntax to display server-side access logs:https://console.cloud.google.com/logs/viewer?advancedFilter=logName="projects/PROJECT_ID/logs/server-accesslog-stackdriver"&project=PROJECT_ID
Client access logs give a client-side view of requests. They are located under
client-accesslog-stackdriver
, attached to thek8s_pod
monitored resource. Use the following URL syntax to display client-side access logs:https://console.cloud.google.com/logs/viewer?advancedFilter=logName="projects/PROJECT_ID/logs/client-accesslog-stackdriver"&project=PROJECT_ID
.
Only client error logs are enabled by default to reduce logging costs, however you can enable all client logs (success and errors) using the following command:
istioctl install --set profile=PROFILE_NAME --set revision==ASM_VERSION
--set values.telemetry.v2.stackdriver.enabled=true
--set values.telemetry.v2.stackdriver.outboundAccessLogging=FULL
Access logs contain the following information:
- HTTP request properties, such as ID, URL, size, latency, and common headers.
- Source and destination workload information, such as name, namespace, identity, and common labels.
- Source and destination canonical service and revision information.
- If tracing is enabled, the logs contain trace information, such as sampling, trace ID, and span ID.
The information displayed in the Google Cloud's operations suite access logs originates from Envoy access logs, when you enable them in the Istio configuration. They contain the following headers:
route_name
upstream_cluster
X-Envoy-Original-Path
X-Envoy-Original-Host
This is an example log entry:
{ "insertId": "1j84zg8g68vb62z", "httpRequest": { "requestMethod": "GET", "requestUrl": "http://35.235.89.201:80/productpage", "requestSize": "795", "status": 200, "responseSize": "7005", "remoteIp": "10.168.0.26:0", "serverIp": "10.36.3.153:9080", "latency": "0.229384205s", "protocol": "http" }, "resource": { "type": "k8s_container", "labels": { "cluster_name": "istio-e2e22", "namespace_name": "istio-bookinfo-1-68819", "container_name": "productpage", "project_id": "***", "location": "us-west2-a", "pod_name": "productpage-v1-64794f5db4-8xbtf" } }, "timestamp": "2020-08-13T21:37:42.963881Z", "severity": "INFO", "labels": { "protocol": "http", "upstream_host": "127.0.0.1:9080", "source_canonical_service": "istio-ingressgateway", "source_namespace": "istio-system", "x-envoy-original-path": "", "source_canonical_revision": "latest", "connection_id": "32", "upstream_cluster": "inbound|9080|http|productpage.istio-bookinfo-1-68819.svc.cluster.local", "requested_server_name": "outbound_.9080_._.productpage.istio-bookinfo-1-68819.svc.cluster.local", "destination_version": "v1", "destination_workload": "productpage-v1", "source_workload": "istio-ingressgateway", "destination_canonical_revision": "v1", "mesh_uid": "cluster.local", "source_principal": "spiffe://cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account", "x-envoy-original-dst-host": "", "service_authentication_policy": "MUTUAL_TLS", "destination_principal": "spiffe://cluster.local/ns/istio-bookinfo-1-68819/sa/bookinfo-productpage", "response_flag": "-", "log_sampled": "false", "destination_service_host": "productpage.istio-bookinfo-1-68819.svc.cluster.local", "destination_name": "productpage-v1-64794f5db4-8xbtf", "destination_canonical_service": "productpage", "destination_namespace": "istio-bookinfo-1-68819", "source_name": "istio-ingressgateway-6845f6d664-lnfvp", "source_app": "istio-ingressgateway", "destination_app": "productpage", "request_id": "39013650-4e62-9be2-9d25-78682dd27ea4", "route_name": "default" }, "logName": "projects/***/logs/server-accesslog-stackdriver", "trace": "projects/***t/traces/466d77d15753cb4d7749ba5413b5f70f", "receiveTimestamp": "2020-08-13T21:37:48.758673203Z", "spanId": "633831cb1fda4fd5", "traceSampled": true }
You can use this log in various ways:
- Integrate with Cloud Trace, which is an optional feature in Anthos Service Mesh.
- Export traffic logs to BigQuery, where you can run queries like selecting all requests take more than 5 seconds.
- Create log-based metrics.
- Troubleshoot
404
and503
errors
Troubleshoot 404
and 503
errors
The following example explains how to use this log to troubleshoot when a
request fails with a 404
or 503
response code.
In the client access log, search for an entry like the following:
httpRequest: { requestMethod: "GET" requestUrl: "://IP_ADDRESS/src/Util/PHP/eval-stdin.php" requestSize: "2088" status: 404 responseSize: "75" remoteIp: "10.168.0.26:34165" serverIp: "10.36.3.149:8080" latency: "0.000371440s" protocol: "http" }
Navigate to the labels in the access log entry. Find the
response_flag
field that looks like the following:response_flag: "NR"
The
NR
value is an acronym forNoRoute
, which means no route was found for the destination or there was no matching filter chain for a downstream connection. Similarly, you can use theresponse_flag
label to troubleshoot503
errors also.If you see
503
errors in both the client and server access logs, check that the port names set for each service match the name of the protocol in use between them. For example, if a golang binary client connects to a golang server using HTTP, but the port is namedhttp2
, the protocol will not auto-negotiate correctly.
For more information, see response flags.
Interpret Envoy logs
The following steps explain how to use the Envoy proxy access logs to show traffic between both ends of a connection for troubleshooting purposes.
Envoy access logs are useful for diagnosing issues like:
- Traffic flow and failures
- End-to-end request routing
Access logs are not enabled by default in Anthos Service Mesh and can only be enabled globally across the entire mesh.
You can troubleshoot connection/request failures by generating activity in your application that triggers an HTTP request, then inspecting the associated request in the source or destination logs.
If you trigger a request appears and it appears in the source proxy logs, it
indicates that iptables
traffic redirection is working correctly and the Envoy
proxy is handling traffic. If you see errors in the logs, generate an Envoy
configuration dump and check the envoy cluster configuration to ensure it is
correct. If you see the request but the log has no errors, check the destination
proxy logs instead.
If the request appears in the destination proxy logs, it indicates that the mesh itself is working correctly. If you see an error instead, run an Envoy configuration dump and verify the correct values for the traffic port set in the listener configuration.
If the problem persists after performing the previous steps, Envoy might be
unable to auto-negotiate the protocol between the sidecar and its application
pod. Ensure that the Kubernetes service port name, for example http-80
,
matches the protocol that the application uses.
Use Logs Explorer to query logs
You can use the Logs Explorer interface
to query specific access logs. For example, to query all requests that have
MULTUAL_TLS
enabled and use protocol grpc
, append following to the server
access logs query:
labels.protocol="grpc" labels.service_authentication_policy="MULTUAL_TLS"
Set an access log policy
You can set a proxy logging policy that determines what type of information you
collect. This is useful to control the scope of your logs to troubleshoot
problems. For example, logging automatically captures errors, but you can set
logWindowDuration
to also capture successful events for a specific time
duration or set the window duration to zero to log all accesses. The policy is
set at the mesh or cluster level.
To enable an access log policy, use the following command:
istioctl install --set profile=PROFILE_NAME \ --set values.telemetry.v2.stackdriver.enabled=true \ --set values.telemetry.v2.stackdriver.logging=true \ --set values.telemetry.accessLogPolicy.enabled=true
For more information about the access log configuration options, such as
logWindowDuration
, see
AccessLogPolicy Config.
View service or workload-specific information
If you have an issue with a specific service or workload rather than a mesh-wide
problem, inspect the individual Envoy proxies and gather relevant information
from them. To gather information about a particular workload and its proxies,
you can use pilot-agent
:
kubectl exec POD_NAME -c istio-proxy -- pilot-agent request GET SCOPE
In the example, SCOPE is one of the following:
certs
- Certificates within the Envoy instanceclusters
- Clusters with Envoy configuredconfig_dump
- Dumps the Envoy configurationlisteners
- Listeners with Envoy configuredlogging
- View and change logging settingsstats
- Envoy statisticsstats/prometheus
- Envoy statistics as Prometheus records
View proxy socket states
You can directly examine the state of Envoy proxy sockets by using the following process.
Display a list of established sockets, including sockets in the
TIME_WAIT
state, which can negatively affect scalability if their count is high:kubectl exec POD_NAME -c istio-proxy -- ss -anopim
Display a summary of socket statistics:
kubectl exec POD_NAME -c istio-proxy -- ss -s
For more information, see An Introduction to the ss Command.
istio-proxy
and istio-init
logs
In addition, retrieve the istio-proxy
logs and review its contents for any
errors that might suggest the cause of the problem:
kubectl logs POD_NAME -c istio-proxy
You can do the same for the init
container:
kubectl logs POD_NAME -c istio-init