This page provides information about the Linux secure computing mode (seccomp) in Google Kubernetes Engine (GKE). Use this information to understand which actions your containerized applications can perform on the host virtual machine (VM) that backs your nodes.
What is seccomp?
Secure computing mode, or seccomp, is a security capability in Linux that lets you restrict the system calls (syscalls) that a process can make to the Linux kernel.
By default, GKE nodes use Container-Optimized OS with the containerd container runtime. containerd protects the Linux kernel by limiting the allowed Linux capabilities to a default list, and you can further limit allowed syscalls with a seccomp profile. containerd provides a default seccomp profile. Whether GKE applies the default seccomp profile for you depends on the cluster mode that you use, as follows:
- Autopilot (recommended): GKE applies the containerd default seccomp profile to all workloads automatically.
- Standard: GKE does not apply the containerd default seccomp profile to all workloads automatically. We recommend that you apply either the default seccomp profile or a custom seccomp profile to your workloads.
The default containerd seccomp profile provides baseline hardening while maintaining compatibility with most workloads. The full seccomp profile definition for containerd is available on GitHub.
Linux capabilities and syscalls
Non-root processes running on Linux systems might require specific privileges to perform actions as the root user. Linux uses capabilities to divide the available privileges into groups, so that a non-root process can perform a specific action without being granted all privileges. For a process to successfully make a specific syscall, the process must have the corresponding privileges granted by a capability.
For a list of all Linux capabilities, refer to the capabilities(7) manual page.
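To make the idea of granting a narrow set of privileges concrete, the following is a minimal sketch of a Pod whose container drops every capability and adds back a single one through the container security context. The Pod name, container name, image, and the choice of NET_BIND_SERVICE are illustrative assumptions and are not taken from this page.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: capabilities-demo          # hypothetical name
spec:
  containers:
  - name: app                      # hypothetical container
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        drop: ["ALL"]                # remove every capability in the runtime's default list
        add: ["NET_BIND_SERVICE"]    # add back only the ability to bind ports below 1024
```

Capabilities and seccomp are complementary: a syscall succeeds only if the seccomp profile allows it and the process holds any capability that the kernel requires for it.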
Denied syscalls in the default GKE seccomp profile
The containerd default seccomp profile blocks all syscalls and then selectively allows specific syscalls, some of which depend on the CPU architecture of the node's VM and the kernel version. The syscalls variable in the DefaultProfile function lists the allowed syscalls for all architectures.
The default seccomp profile blocks syscalls that can be used to bypass container isolation boundaries and allow privileged access to the node or to other containers. The following table describes some of the significant syscalls that the default seccomp profile denies:
Denied syscalls | Description |
---|---|
mount, umount, umount2, fsmount, mount_setattr | Restrict processes from accessing or manipulating the node filesystem outside of the container boundaries. Also denied because the [...] |
bpf | Restrict processes from creating eBPF programs in the kernel, which can lead to privilege escalation on the node. For example, CVE-2021-3490 used the [...] Also denied because the [...] |
clone, clone3, unshare | Restrict processes from creating new processes in new namespaces that might be outside the container's restricted namespace. These new processes might have elevated permissions and capabilities. For example, CVE-2022-0185 used the [...] Also denied because the [...] |
reboot | Restrict processes from rebooting the node. Also denied because the [...] |
open_by_handle_at, name_to_handle_at | Restrict access to files outside of the container. These syscalls were used in one of the earliest Docker container escape exploits. Also denied because the [...] |
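One way to observe a denial from the table above is to run a throwaway Pod that attempts one of the listed syscalls under the runtime default profile. The following is only a sketch under stated assumptions: the Pod and container names are hypothetical, the busybox image is assumed to ship an unshare applet that accepts the -U option, and the node kernel is assumed to permit unprivileged user namespaces (otherwise the call fails for reasons unrelated to seccomp).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-denied-demo        # hypothetical name
spec:
  restartPolicy: Never             # run the command once and stop
  securityContext:
    seccompProfile:
      type: RuntimeDefault         # apply the containerd default profile
  containers:
  - name: try-unshare              # hypothetical container
    image: busybox
    # Attempt to create a new user namespace with the unshare syscall.
    # Under the default profile the attempt is expected to be rejected,
    # in which case the fallback message is written to the container log.
    command: ["sh", "-c", "unshare -U true || echo 'unshare was rejected'"]
```

The container log shows whether the call went through, which makes it a quick check that a profile is actually in effect on a given cluster.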
How to use seccomp in GKE
In Autopilot clusters, GKE automatically applies the containerd default seccomp profile to all your workloads. No further action is required. Attempts to make restricted syscalls fail. Autopilot disallows custom seccomp profiles because GKE manages the nodes.
In Standard clusters, you must manually apply a seccomp profile. GKE doesn't apply a profile for you.
Enable seccomp in Standard clusters
Apply a seccomp profile manually by setting the Pod or container Security Context using the spec.securityContext.seccompProfile field in the Pod specification, such as in the following example. We strongly recommend that you use a seccomp profile for your workloads unless your use case requires using any restricted syscalls. The two supported seccompProfile types are as follows:
- RuntimeDefault: the default profile specified by the containerd runtime.
- Localhost: a custom profile definition.
The following example manifest sets the seccomp profile to the runtime default profile:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
  labels:
    app: default-pod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: default-pod
  template:
    metadata:
      labels:
        app: default-pod
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: seccomp-test
        image: nginx
When you deploy this manifest, if a container in the Pod tries to make a syscall that violates the runtime default seccomp profile, the Pod or the workload might experience unexpected behavior. For example, a Pod that makes a restricted syscall during startup would fail to start. If an application tries to make a restricted syscall while the Pod is running, you might notice errors in the container. The severity of a failed syscall depends on how the application handles errors.
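The manifest above sets the profile for the whole Pod. Kubernetes also accepts seccompProfile in a container's own securityContext, and for that container the container-level value takes precedence over the Pod-level value. The following sketch applies the runtime default profile to a single container only; the Pod name, the sidecar container, and its image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: container-level-seccomp    # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    securityContext:
      seccompProfile:
        type: RuntimeDefault       # applies only to this container
  - name: sidecar                  # hypothetical second container
    image: busybox
    command: ["sleep", "3600"]
    # No seccompProfile here, so this container uses the Pod-level
    # setting, if one is defined.
```

Setting the profile per container can help when only one container in a Pod needs a different profile, at the cost of a manifest that is easier to misread than a single Pod-level setting.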
Use a custom seccomp profile in Standard clusters
If the runtime default seccomp profile is too restrictive for your application (or not restrictive enough), you can apply a custom seccomp profile to Pods in Standard clusters. This process requires access to the filesystem on the node. For a tutorial on how to load and use custom seccomp profiles, refer to Restrict a Container's Syscalls with seccomp.
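As a rough illustration of what referencing a custom profile looks like, the following sketch uses the Localhost type together with the Kubernetes localhostProfile field. The Pod name and the profile path are hypothetical. The path is resolved relative to the kubelet's seccomp profile directory on the node (upstream Kubernetes defaults to /var/lib/kubelet/seccomp; treat that location as an assumption for your nodes), and the file must already exist on every node where the Pod can be scheduled.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-seccomp-demo        # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # Hypothetical file, relative to the kubelet's seccomp profile directory
      localhostProfile: profiles/my-profile.json
  containers:
  - name: app                      # hypothetical container
    image: nginx
```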
What's next
- Use PodSecurityAdmission to enforce pre-defined Pod-level policies
- Use Organization Policy Service to set project or organization-level policies