Troubleshooting the container runtime


This document provides troubleshooting steps for common issues that you might encounter with the container runtime on your Google Kubernetes Engine (GKE) nodes.

Mount paths with simple drive letters fail on Windows node pools with containerd

GKE clusters running Windows Server node pools that use a containerd runtime version earlier than 1.6.6 might see errors like the following when starting containers:

failed to create containerd task : CreateComputeSystem : The parameter is incorrect : unknown

For more details, refer to issue #6589.

Workarounds

Upgrade your node pools to the latest GKE versions, which use containerd runtime version 1.6.6 or later.

Container images with a non-array, pre-escaped CMD or ENTRYPOINT command line fail on Windows node pools with containerd

GKE clusters running Windows Server node pools that use containerd runtime version 1.5.X might see errors like the following when starting containers:

failed to start containerd task : hcs::System::CreateProcess : The system cannot find the file specified.: unknown

For more details, refer to issue #5067 and issue #6300.

Workarounds

Upgrade your node pools to the latest GKE versions, which use containerd runtime version 1.6.X or later.

Container image volumes with nonexistent paths or Linux-style (forward slash) paths fail on Windows node pools with containerd

GKE clusters running Windows Server node pools that use containerd runtime version 1.5.X might see errors like the following when starting containers:

failed to generate spec: failed to stat "<volume_path>": CreateFile : The system cannot find the path specified.

For more details, refer to issue #5671.

Workarounds

Upgrade your node pools to the latest GKE versions, which use containerd runtime version 1.6.X or later.

/etc/mtab: No such file or directory

By default, the Docker container runtime populates the /etc/mtab symlink inside the container, but the containerd runtime does not.

For more details, refer to issue #2419.

Workarounds

To work around this issue, manually create the symlink /etc/mtab during your image build:

ln -sf /proc/mounts /etc/mtab
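
If you build the image from a Dockerfile, a single RUN instruction is enough. The following line is a minimal sketch of where the command above could go in your build:

# Create /etc/mtab, which containerd does not populate, as a symlink to /proc/mounts.
RUN ln -sf /proc/mounts /etc/mtab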

Image pull error: not a directory

Affected GKE versions: all

When you build an image with kaniko, it might fail to be pulled by containerd with the error message "not a directory". This error occurs when an image is built in a particular way: one command removes a directory and the next command recreates the same files in that directory.

The following Dockerfile example with npm illustrates this problem:

RUN npm cache clean --force
RUN npm install

For more details, refer to issue #4659.

Workarounds

To work around this issue, build your image using docker build, which is not affected.

If docker build isn't an option for you, combine the commands into one. The following Dockerfile example shows the workaround applied to the npm example above, combining "RUN npm cache clean --force" and "RUN npm install":

RUN npm cache clean --force && npm install

Some filesystem metrics are missing and the metrics format is different

Affected GKE versions: all

The Kubelet /metrics/cadvisor endpoint provides Prometheus metrics, as documented in Metrics for Kubernetes system components. If you install a metrics collector that depends on that endpoint, you might see the following issues:

  • The metrics format on the Docker node is k8s_<container-name>_<pod-name>_<namespace>_<pod-uid>_<restart-count>, but the format on the containerd node is <container-id>.
  • Some filesystem metrics are missing on the containerd node, as follows:

    container_fs_inodes_free
    container_fs_inodes_total
    container_fs_io_current
    container_fs_io_time_seconds_total
    container_fs_io_time_weighted_seconds_total
    container_fs_limit_bytes
    container_fs_read_seconds_total
    container_fs_reads_merged_total
    container_fs_sector_reads_total
    container_fs_sector_writes_total
    container_fs_usage_bytes
    container_fs_write_seconds_total
    container_fs_writes_merged_total
    

Workarounds

You can mitigate this issue by running cAdvisor as a standalone DaemonSet.

  1. Find the latest cAdvisor release with the name pattern vX.Y.Z-containerd-cri (for example, v0.42.0-containerd-cri).
  2. Follow the steps in cAdvisor Kubernetes Daemonset to create the DaemonSet (see the example commands after this list).
  3. Point the installed metrics collector to use the cAdvisor /metrics endpoint, which provides the full set of Prometheus container metrics.
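
For example, the following commands deploy the upstream manifests. This is a minimal sketch that assumes the kustomize base published in the cAdvisor repository, with v0.42.0-containerd-cri used as an example release tag:

# Deploy the standalone cAdvisor DaemonSet from the upstream kustomize base.
kubectl apply -k "github.com/google/cadvisor/deploy/kubernetes/base?ref=v0.42.0-containerd-cri"

# Verify that the cAdvisor Pods are running (namespace and labels come from the upstream manifests).
kubectl get pods -A | grep cadvisor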

Alternatives

  1. Migrate your monitoring solution to Cloud Monitoring, which provides the full set of container metrics.
  2. Collect metrics from the Kubelet summary API at the /stats/summary endpoint (see the example command after this list).
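
For example, you can read the summary API for a single node through the Kubernetes API server proxy, where NODE_NAME is a placeholder for an actual node name:

# Fetch the Kubelet summary metrics for one node through the API server proxy.
kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/stats/summary"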

Attach-based operations do not function correctly after container-runtime restarts on GKE Windows

Affected GKE versions: 1.21 to 1.21.5-gke.1802, 1.22 to 1.22.3-gke.700

GKE clusters running Windows Server node pools that use containerd runtime versions 1.5.4 and 1.5.7-gke.0 might experience issues when the container runtime is forcibly restarted: attach operations to existing running containers can no longer bind IO. The issue doesn't cause API calls to fail, but no data is sent or received. This includes data for the attach and logs CLIs and APIs through the cluster API server.

A patched container runtime version (1.5.7-gke.1), shipped with newer GKE releases, addresses the issue.

Pods display "failed to allocate for range 0: no IP addresses available in range set" error message

Affected GKE versions: 1.24.6-gke.1500 or earlier, 1.23.14-gke.1800 or earlier, and 1.22.16-gke.2000 or earlier

GKE clusters running node pools that use containerd might experience IP leak issues and exhaust all the Pod IPs on a node. A Pod scheduled on an affected node displays an error message similar to the following:

failed to allocate for range 0: no IP addresses available in range set: 10.48.131.1-10.48.131.62

For more information about the issue, see containerd issue #5438 and issue #5768.

A known issue in GKE Dataplane V2 can trigger this problem. However, it can also be triggered by other causes, such as runc getting stuck.

Workarounds

Follow the workarounds mentioned in the Workarounds for Standard GKE clusters for GKE Dataplane V2.

Exec probe behavior difference when probe exceeds the timeout

Affected GKE versions: all

Exec probe behavior on containerd images differs from the behavior on dockershim images. When an exec probe defined for a Pod exceeds the declared timeoutSeconds threshold, dockershim images treat it as a probe failure. On containerd images, probe results returned after the declared timeoutSeconds threshold are ignored.
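
The following hypothetical Pod makes the difference visible: its probe command runs longer than timeoutSeconds, so on dockershim nodes the probe counts as a failure, while on the containerd versions described above the late result is ignored. The Pod name and image are examples only:

# Pod whose exec probe deliberately exceeds timeoutSeconds.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: exec-probe-timeout-demo
spec:
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
    livenessProbe:
      exec:
        # Takes 10 seconds, which is longer than the 1-second timeout below.
        command: ["sh", "-c", "sleep 10"]
      timeoutSeconds: 1
      periodSeconds: 15
EOF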

Insecure registry option is not configured for local network (10.0.0.0/8)

Affected GKE versions: all

On containerd images, the insecure registry option is not configured for the local network 10.0.0.0/8. If you are migrating from Docker-based images and used a private image registry in that range, ensure that the correct certificate is installed on the registry, or configure the registry to use HTTP.
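
To check whether the registry presents a valid certificate, you can query its API endpoint over HTTPS from a machine that can reach it. The registry address below is an example:

# A TLS error here indicates the registry certificate is missing or not trusted.
curl -v https://10.0.0.5:5000/v2/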

containerd ignores any device mappings for privileged pods

Affected GKE versions: all

For privileged Pods, the container runtime ignores any device mappings that volumeDevices.devicePath passes to it, and instead makes every device on the host available to the container under /dev.
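
As an illustration, in a hypothetical privileged Pod like the following, containerd does not create the device node at the requested devicePath; the host's devices are exposed under /dev instead. The names are examples, and the sketch assumes an existing PersistentVolumeClaim named block-pvc with volumeMode: Block:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: privileged-device-demo
spec:
  containers:
  - name: demo
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true
    volumeDevices:
    # containerd ignores this mapping for privileged containers.
    - name: block-vol
      devicePath: /dev/my-device
  volumes:
  - name: block-vol
    persistentVolumeClaim:
      claimName: block-pvc
EOF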

IPv6 address family is enabled on pods running containerd

Affected GKE versions: 1.18, 1.19, 1.20.0 to 1.20.9

The IPv6 address family is enabled for Pods running with containerd. The dockershim image disables IPv6 on all Pods, while the containerd image does not. For example, localhost resolves to the IPv6 address ::1 first. This is typically not a problem; however, it might result in unexpected behavior in certain cases.

As a workaround, use an IPv4 address such as 127.0.0.1 explicitly, or configure an application running in the Pod to work on both address families.
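
For example, a health check inside the Pod can target the IPv4 loopback address directly instead of localhost. The port and path here are placeholders:

# Avoid localhost resolving to ::1 by using the IPv4 loopback address explicitly.
curl http://127.0.0.1:8080/healthz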

Node auto-provisioning only provisions Container-Optimized OS with Docker node pools

Affected GKE versions: 1.18, 1.19, 1.20.0 to 1.20.6-gke.1800

Node auto-provisioning allows auto-scaling of node pools with any supported image type, but it can only create new node pools with the Container-Optimized OS with Docker image type.

In GKE version 1.20.6-gke.1800 and later, the default image type can be set for the cluster.
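
For example, the default image type for auto-provisioned node pools can be set with gcloud. This is a sketch that assumes the --autoprovisioning-image-type flag available in current gcloud releases; the cluster name is a placeholder:

# Set the default image type that node auto-provisioning uses for new node pools.
gcloud container clusters update example-cluster \
    --enable-autoprovisioning \
    --autoprovisioning-image-type=COS_CONTAINERD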

Conflict with 172.17/16 IP address range

Affected GKE versions: 1.18.0 to 1.18.14

The 172.17/16 IP address range is occupied by the docker0 interface on the node VM with containerd enabled. Traffic sent to or originating from that range might not be routed correctly (for example, a Pod might not be able to connect to a VPN-connected host with an IP address within 172.17/16).

GPU metrics not collected

Affected GKE versions: 1.18.0 to 1.18.18

GPU usage metrics are not collected when using containerd as a runtime on GKE versions before 1.18.18.

Images with config.mediaType set to application/octet-stream cannot be used on containerd

Affected GKE versions: All

Images with config.mediaType set to "application/octet-stream" cannot be used on containerd. See issue #4756. These images are not compatible with the Open Container Initiative specification and are considered incorrect. Docker accepts them for backward compatibility, but containerd does not support them.

Symptom and diagnosis

Example error in node logs:

Error syncing pod <pod-uid> ("<pod-name>_<namespace>(<pod-uid>)"), skipping: failed to "StartContainer" for "<container-name>" with CreateContainerError: "failed to create containerd container: error unpacking image: failed to extract layer sha256:<some id>: failed to get reader from content store: content digest sha256:<some id>: not found"

The image manifest can usually be found in the registry where the image is hosted. Once you have the manifest, check config.mediaType to determine whether you have this issue:

"mediaType": "application/octet-stream",

Fix

Because the containerd community decided not to support such images, all versions of containerd are affected and there is no fix. The container image must be rebuilt with Docker version 1.11 or later, ensuring that the config.mediaType field is not set to "application/octet-stream".

CNI config uninitialized

Affected GKE versions: All

GKE fails to create nodes during an upgrade, resize, or other action.

Symptom and diagnosis

Example error in the Google Cloud console:

Error: "runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized".

This error might occur in the following situations:

  • In log files during node bootstrapping, while GKE installs the CNI config.
  • As a node error status in the Google Cloud console if a custom webhook that intercepts the DaemonSet controller command to create a Pod has errors. This prevents GKE from creating a netd or calico-node Pod. If netd or calico-node Pods started successfully while the error persists, contact support.

Fix

To resolve this issue, try the following solutions:

  • Wait for GKE to finish installing the CNI config.
  • Remove any misconfigured webhooks (the commands after this list show how to find webhooks that intercept Pod creation).
  • Configure webhooks to ignore system Pods.
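
For example, the following commands list admission webhooks and show one configuration so that you can check its rules and namespaceSelector; example-webhook is a placeholder name:

# List all mutating and validating webhook configurations in the cluster.
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations

# Inspect a specific configuration; look for rules that intercept Pod CREATE operations in system namespaces.
kubectl get mutatingwebhookconfigurations example-webhook -o yaml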