Configure writable cgroups for containers

Autopilot Standard

You can let Google Kubernetes Engine (GKE) workloads manage resources, like CPU and memory, for child processes by using the Linux cgroups API. This document shows you how to provide containers with read-write access to the cgroups API without running those containers in privileged mode.

When to use writable cgroups

By default, Kubernetes provides all Linux containers with read-only access to the cgroups API by mounting the /sys/fs/cgroup file system in each container. You can optionally let GKE mount this file system in read-write mode in specific Pods to let root processes manage and constrain resources for child processes.

Writable cgroups help improve reliability in applications such as Ray that run system processes and user code in the same container. By writing to the /sys/fs/cgroup file system, Ray can reserve portions of a container's resources for critical processes. You get this reliability benefit without the security risk of running the containers in privileged mode.
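As a rough illustration of what a root process can do once /sys/fs/cgroup is writable, the following sketch wraps the standard cgroup v2 interface files (cgroup.procs, cgroup.subtree_control, memory.max, cpu.max) in small helper functions. The function names, the CGROUP_ROOT variable, the group names, and the limit values are assumptions made for this example. This is not how Ray or GKE implement resource reservation.

```shell
#!/bin/sh
# Sketch only: helper names, CGROUP_ROOT, and all values are illustrative assumptions.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# Create a child cgroup. $1 = child name
create_cgroup() {
  mkdir -p "$CGROUP_ROOT/$1"
}

# Move a process into a child cgroup. $1 = child name, $2 = PID
move_to_cgroup() {
  echo "$2" > "$CGROUP_ROOT/$1/cgroup.procs"
}

# Let child cgroups use the cpu and memory controllers. cgroup v2 generally
# rejects this write while the parent cgroup still has member processes.
enable_controllers() {
  echo "+cpu +memory" > "$CGROUP_ROOT/cgroup.subtree_control"
}

# Cap a child cgroup.
# $1 = child name, $2 = memory.max in bytes, $3 = cpu.max as "QUOTA PERIOD" (microseconds)
set_limits() {
  echo "$2" > "$CGROUP_ROOT/$1/memory.max" || return 1
  echo "$3" > "$CGROUP_ROOT/$1/cpu.max"
}

# Example order inside a container with writable cgroups (not run here):
#   create_cgroup system; create_cgroup user-code
#   for pid in $(cat "$CGROUP_ROOT/cgroup.procs"); do move_to_cgroup system "$pid"; done
#   enable_controllers
#   set_limits user-code 268435456 "50000 100000"   # 256 MiB, half of one CPU
```

The order matters: existing processes are moved into a child group before controllers are enabled, because the container's own cgroup is not the real root of the hierarchy. The limits set on a child can't exceed what the container itself was granted.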

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you use primarily zonal clusters, set the compute/zone instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.

Ensure that you have an Autopilot or Standard cluster running version 1.34 or later. To create a new cluster, see Create an Autopilot cluster.
Ensure that your cluster uses cgroup v2. For more information, see Migrate nodes to Linux cgroup v2.
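One common way to check which cgroup version is in use is to look at the file system type mounted at /sys/fs/cgroup, either on a node or from inside a running container. The sketch below assumes a stat command that supports the -f and -c options. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Map a file system type name to a cgroup version label.
cgroup_version_for_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # legacy hierarchy mounted under a tmpfs
    *)         echo "unknown" ;;
  esac
}

# Report the cgroup version for a mount point (default: /sys/fs/cgroup).
detect_cgroup_version() {
  cgroup_version_for_fstype "$(stat -fc %T "${1:-/sys/fs/cgroup}")"
}
```

Running detect_cgroup_version with no arguments prints v2 when the unified hierarchy is mounted.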

Enable writable cgroups for your nodes

Enable writable cgroups on your node pools by customizing the containerd configuration. You can apply this configuration to your entire cluster or to specific node pools in Standard clusters.

In your containerd configuration file, add a writableCgroups section and set the enabled field to true. For more information, see Customize containerd configuration in GKE nodes.

writableCgroups:
  enabled: true

Specify the updated configuration file when you create or update a cluster or a node pool.
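The following sketch writes the configuration shown above to a file so that you can pass it to the gcloud CLI. The function name and file path are assumptions for this example. The gcloud command in the comments reflects the --containerd-config-from-file flag as we understand it; confirm the exact syntax in Customize containerd configuration in GKE nodes before you rely on it.

```shell
#!/bin/sh
# Sketch only: the function name and file path are illustrative assumptions.

# Write a containerd configuration file that enables writable cgroups.
# $1 = destination path
write_containerd_config() {
  cat > "$1" <<'EOF'
writableCgroups:
  enabled: true
EOF
}

# Example (not run here; verify the flag against the gcloud reference):
#   write_containerd_config containerd-config.yaml
#   gcloud container node-pools update NODE_POOL_NAME \
#       --cluster=CLUSTER_NAME \
#       --location=LOCATION \
#       --containerd-config-from-file=containerd-config.yaml
```

NODE_POOL_NAME, CLUSTER_NAME, and LOCATION are placeholders for your own values.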

Use writable cgroups in workloads

After you enable writable cgroups for your cluster or node pools, configure your workloads to meet all of the following requirements:

Select a node that has writable cgroups enabled.
Enable writable cgroups for one or more containers in the Pod.
Use the Guaranteed Quality of Service (QoS) class by meeting one of the following conditions:
- For workloads that specify resources at the Pod level, set equal values for resources.requests and resources.limits in the Pod specification.
- For workloads that specify resources for each container, set equal values for resources.requests and resources.limits in the specification of every container in the Pod, including init containers.

To configure these requirements, follow these steps:

To select nodes that have writable cgroups enabled, add the node.gke.io/enable-writable-cgroups: "true" label to the spec.nodeSelector field in your Pod specification:
```
node.gke.io/enable-writable-cgroups: "true"
```
To enable writable cgroups for your workload, add one of the following annotations to the metadata.annotations field in your Pod specification:
- Enable for the entire Pod:
```
node.gke.io/enable-writable-cgroups: "true"
```
- Enable for a specific container in the Pod:
```
node.gke.io/enable-writable-cgroups.CONTAINER_NAME: "true"
```
  Replace CONTAINER_NAME with the name of the container.
To configure the Guaranteed QoS class for your Pod, specify equal CPU and memory requests and limits, either for every container in the Pod or for the entire Pod, as in the following example:
```
resources:
  requests:
    cpu: "100m"
    memory: "100Mi"
  limits:
    cpu: "100m"
    memory: "100Mi"
```
You must specify equal requests and limits for every container, even if you enable writable cgroups only for one of the containers in the Pod.
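After you deploy a Pod, you can confirm which QoS class Kubernetes assigned by reading the status.qosClass field. The following sketch wraps that check in two helper functions whose names are assumptions for this example. It assumes that kubectl is already configured for your cluster.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Print the QoS class that Kubernetes assigned to a Pod. $1 = Pod name
pod_qos_class() {
  kubectl get pod "$1" -o jsonpath='{.status.qosClass}'
}

# Succeed only when the given QoS class is Guaranteed. $1 = QoS class
is_guaranteed_qos() {
  [ "$1" = "Guaranteed" ]
}

# Example (not run here):
#   is_guaranteed_qos "$(pod_qos_class POD_NAME)" || echo "Pod is not Guaranteed"
```

If the check reports Burstable or BestEffort, at least one container in the Pod has requests that differ from its limits or has no resources set.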

Your final Pod specification should be similar to the following examples.

This example enables writable cgroups for all containers in the Pod:

apiVersion: v1
kind: Pod
metadata:
  name: writable-cgroups-pod
  annotations:
    node.gke.io/enable-writable-cgroups: "true"
spec:
  nodeSelector:
    node.gke.io/enable-writable-cgroups: "true"
  containers:
  - name: container
    image: busybox:stable
    command: ["/bin/sh", "-c"]
    args:
    - |
      trap 'echo "Caught SIGTERM, exiting..."; exit 0' TERM
      echo "Waiting for termination signal..."
      while true; do sleep 1; done
  resources:
    requests:
      cpu: "100m"
      memory: "100Mi"
    limits:
      cpu: "100m"
      memory: "100Mi"

This example enables writable cgroups for a specific container in a multi-container Pod:

apiVersion: v1
kind: Pod
metadata:
  name: writable-cgroups-per-container
  annotations:
    node.gke.io/enable-writable-cgroups.busybox-container: "true"
spec:
  nodeSelector:
    node.gke.io/enable-writable-cgroups: "true"
  containers:
  - name: busybox-container
    image: busybox:stable
    command: ["/bin/sh", "-c"]
    args:
    - |
      trap 'echo "Caught SIGTERM, exiting..."; exit 0' TERM
      echo "Waiting for termination signal..."
      while true; do sleep 1; done
    resources:
      requests:
        cpu: "100m"
        memory: "100Mi"
      limits:
        cpu: "100m"
        memory: "100Mi"
  - name: container-disabled
    image: busybox:stable
    command: ["/bin/sh", "-c"]
    args:
    - |
      trap 'echo "Caught SIGTERM, exiting..."; exit 0' TERM
      echo "Waiting for termination signal..."
      while true; do sleep 1; done
    resources:
      requests:
        cpu: "100m"
        memory: "100Mi"
      limits:
        cpu: "100m"
        memory: "100Mi"

Verify that the cgroup file system is writable

To verify the permissions on the /sys/fs/cgroup file system for a Pod or a container, follow these steps:

Identify a Pod that you want to check. You can use one of the sample Pods from the Use writable cgroups in workloads section.
Create a shell session in the Pod:
```
kubectl exec -it POD_NAME -- /bin/sh
```
Replace POD_NAME with the name of the Pod.
Describe the mounted cgroup file system:
```
mount | grep cgroup
```
The output is similar to the following:
```
cgroup on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
```
In this output, rw indicates that the file system is writable. If you see ro in the output, the file system is read-only.
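If you want to script this check, the following sketch reads mount output on standard input and prints rw or ro for the cgroup2 file system mounted at /sys/fs/cgroup. The function name is an assumption for this example, and the parsing assumes the output format shown above.

```shell
#!/bin/sh
# Sketch only: the function name is an illustrative assumption.

# Print "rw" or "ro" for the cgroup2 mount at /sys/fs/cgroup.
# Expects lines like: cgroup on /sys/fs/cgroup type cgroup2 (rw,nosuid,...)
cgroup_mount_mode() {
  awk '$3 == "/sys/fs/cgroup" && $5 == "cgroup2" {
         opts = $6
         gsub(/[()]/, "", opts)            # strip the surrounding parentheses
         n = split(opts, parts, ",")
         for (i = 1; i <= n; i++)
           if (parts[i] == "rw" || parts[i] == "ro") { print parts[i]; exit }
       }'
}

# Example (not run here):
#   kubectl exec POD_NAME -- mount | cgroup_mount_mode
```

The function prints nothing if no cgroup2 file system is mounted at /sys/fs/cgroup.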

What's next

Customize containerd configuration