This document describes how to set up single-root input/output virtualization (SR-IOV) networking for Google Distributed Cloud. SR-IOV provides I/O virtualization to make a network interface card (NIC) available as network devices in the Linux kernel. This lets you manage and assign network connections to your pods. Performance improves because packets move directly between the NIC and the pod.
Use this feature if you require fast networking to your pod workloads. SR-IOV for Google Distributed Cloud lets you configure the virtual functions (VFs) on the supported devices of your cluster nodes. You can also specify the particular kernel module to bind to the VFs.
This feature is available for clusters that run workloads, such as hybrid, standalone, and user clusters. The SR-IOV networking feature requires the cluster to have at least two nodes.
The setup process consists of the following high-level steps:
- Configure the cluster to enable SR-IOV networking.
- Configure the SR-IOV operator, a SriovOperatorConfig custom resource.
- Set up SR-IOV policies and configure your VFs.
- Create a NetworkAttachmentDefinition custom resource that references your VFs.
Requirements
The SR-IOV networking feature requires the official drivers for the network adapters to be present on the cluster nodes. Install the drivers before using the SR-IOV operator. Also, to use the vfio-pci module for your VFs, ensure that the module is available on the nodes where it's to be used.
Enable SR-IOV networking for a cluster
To enable SR-IOV networking for Google Distributed Cloud, add the multipleNetworkInterfaces field and the sriovOperator field to the clusterNetwork section of the Cluster object and set both fields to true.
apiVersion: baremetal.cluster.gke.io/v1
kind: Cluster
metadata:
  name: cluster1
spec:
  clusterNetwork:
    multipleNetworkInterfaces: true
    sriovOperator: true
...
The sriovOperator field is mutable and can be changed after cluster creation.
Configure the SR-IOV operator
The SriovOperatorConfig custom resource provides global configuration for the SR-IOV networking feature. This bundled custom resource has the name default and is in the gke-operators namespace. The SriovOperatorConfig custom resource is honored for this name and namespace only.
You can edit this object with the following command:
kubectl -n gke-operators edit sriovoperatorconfigs.sriovnetwork.k8s.cni.cncf.io default
Here's an example of a SriovOperatorConfig custom resource configuration:
apiVersion: sriovnetwork.k8s.cni.cncf.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: gke-operators
spec:
  configDaemonNodeSelector:
    nodePool: "withSriov"
  disableDrain: false
  logLevel: 0
The configDaemonNodeSelector section lets you limit which nodes the SR-IOV operator can handle. In the preceding example, the operator is limited to only nodes that have a nodePool: withSriov label. If the configDaemonNodeSelector field isn't specified, the following default labels are applied:
beta.kubernetes.io/os: linux
node-role.kubernetes.io/worker: ""
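The selector behavior described above can be sketched in a few lines. This is a minimal illustration only, assuming the selector is matched as plain key/value equality against node labels (the usual Kubernetes node selector semantics); `operator_handles` and `DEFAULT_SELECTOR` are hypothetical names, not part of the operator.

```python
# Default labels listed above, used when configDaemonNodeSelector is not set.
DEFAULT_SELECTOR = {
    "beta.kubernetes.io/os": "linux",
    "node-role.kubernetes.io/worker": "",
}


def operator_handles(node_labels, selector=None):
    """Return True if every selector key/value pair is present in node_labels.

    Assumption: matching is exact equality on each label, and an unset or
    empty selector falls back to DEFAULT_SELECTOR.
    """
    effective = selector if selector else DEFAULT_SELECTOR
    return all(node_labels.get(key) == value for key, value in effective.items())


# A node labeled for the example configuration is handled.
print(operator_handles({"nodePool": "withSriov"}, {"nodePool": "withSriov"}))

# Without a custom selector, the node needs both default labels.
print(operator_handles({"beta.kubernetes.io/os": "linux"}))
```

A check like this can help explain why no SriovNetworkNodeState object appears for a node that you expected the operator to manage.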
The disableDrain field specifies whether to skip the Kubernetes node drain operation that otherwise runs before the node is rebooted or before a specific VF configuration is changed. With disableDrain: false, as in the preceding example, the node is drained first.
Create SR-IOV policies
To configure specific VFs in your cluster, you have to create a SriovNetworkNodePolicy custom resource in the gke-operators namespace.
Here's an example manifest for a SriovNetworkNodePolicy custom resource:
apiVersion: sriovnetwork.k8s.cni.cncf.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: gke-operators
spec:
  deviceType: "netdevice"
  mtu: 1600
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
  nicSelector:
    pfNames:
    - enp65s0f0
    deviceID: "1015"
    rootDevices:
    - 0000:01:00.0
    vendor: "15b3"
  numVfs: 4
  priority: 80
  resourceName: "mlnx"
The nodeSelector section lets you further limit the nodes on which the VFs are created. This limitation is on top of the selectors from the SriovOperatorConfig described in the previous section.
The deviceType field specifies the kernel module to use for the VFs. Available options for deviceType are:
- netdevice for the standard, VF-specific kernel module
- vfio-pci for the VFIO-PCI driver
The resourceName field defines the name under which the VFs are represented in the Kubernetes Node.
After the configuration process is done, your selected cluster nodes contain the defined resource, as shown in the following example (notice the gke.io/mlnx entries):
apiVersion: v1
kind: Node
metadata:
  name: worker-01
spec:
  …
status:
  allocatable:
    cpu: 47410m
    ephemeral-storage: "210725550141"
    gke.io/mlnx: "4"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 59884492Ki
    pods: "250"
  capacity:
    cpu: "48"
    ephemeral-storage: 228651856Ki
    gke.io/mlnx: "4"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 65516492Ki
    pods: "250"
The operator always adds the gke.io/ prefix to every resource you define with SriovNetworkNodePolicy.
Specify a NIC selector
For the SriovNetworkNodePolicy to function properly, specify at least one selector in the nicSelector section. This field offers multiple options for identifying specific physical functions (PFs) in your cluster nodes. Most of the information required by this field is discovered for you and saved in the SriovNetworkNodeState custom resource. There is one object for each node that the operator can handle.
Use the following command to view all the available nodes:
kubectl -n gke-operators get sriovnetworknodestates.sriovnetwork.k8s.cni.cncf.io -o yaml
Here's an example of a node:
apiVersion: sriovnetwork.k8s.cni.cncf.io/v1
kind: SriovNetworkNodeState
metadata:
  name: worker-01
  namespace: gke-operators
spec:
  dpConfigVersion: "6368949"
status:
  interfaces:
  - deviceID: "1015"
    driver: mlx5_core
    eSwitchMode: legacy
    linkSpeed: 10000 Mb/s
    linkType: ETH
    mac: 1c:34:da:5c:2b:9c
    mtu: 1500
    name: enp1s0f0
    pciAddress: "0000:01:00.0"
    totalvfs: 4
    vendor: 15b3
  - deviceID: "1015"
    driver: mlx5_core
    linkSpeed: 10000 Mb/s
    linkType: ETH
    mac: 1c:34:da:5c:2b:9d
    mtu: 1500
    name: enp1s0f1
    pciAddress: "0000:01:00.1"
    totalvfs: 2
    vendor: 15b3
  syncStatus: Succeeded
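To turn the discovered data into nicSelector values, it can help to list each PF with the fields a policy refers to. The following sketch assumes the object was fetched with the same kubectl command using -o json and parsed into a Python dictionary; `pf_candidates` is a hypothetical helper, and the sample data is a trimmed copy of the worker-01 object above.

```python
def pf_candidates(node_state):
    """Summarize each PF in status.interfaces using nicSelector-style keys."""
    candidates = []
    for iface in node_state.get("status", {}).get("interfaces", []):
        candidates.append({
            "pfName": iface["name"],               # candidate for nicSelector.pfNames
            "rootDevice": iface["pciAddress"],     # candidate for nicSelector.rootDevices
            "vendor": str(iface["vendor"]),        # candidate for nicSelector.vendor
            "deviceID": str(iface["deviceID"]),    # candidate for nicSelector.deviceID
            "totalVfs": iface.get("totalvfs", 0),  # upper bound reported for the PF
        })
    return candidates


# Trimmed copy of the worker-01 SriovNetworkNodeState shown above.
worker_01 = {
    "metadata": {"name": "worker-01"},
    "status": {
        "interfaces": [
            {"deviceID": "1015", "vendor": "15b3", "name": "enp1s0f0",
             "pciAddress": "0000:01:00.0", "totalvfs": 4},
            {"deviceID": "1015", "vendor": "15b3", "name": "enp1s0f1",
             "pciAddress": "0000:01:00.1", "totalvfs": 2},
        ],
        "syncStatus": "Succeeded",
    },
}

for pf in pf_candidates(worker_01):
    print(pf)
```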
Set Physical Function partitioning
Pay special attention to the pfNames field of the nicSelector section. In addition to defining the exact PF to use, it lets you specify the exact VFs to use for the specified PF and the resource defined in the policy.
Here's an example:
apiVersion: sriovnetwork.k8s.cni.cncf.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: gke-operators
spec:
  deviceType: "netdevice"
  mtu: 1600
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
  nicSelector:
    pfNames:
    - enp65s0f0#3-6
    deviceID: "1015"
    rootDevices:
    - 0000:01:00.0
    vendor: "15b3"
  numVfs: 7
  priority: 80
  resourceName: "mlnx"
In the preceding example, the gke.io/mlnx resource uses only VFs numbered 3-6 and shows just four available VFs. Because VFs are always created starting from index zero, the requested number of VFs, numVfs, must be at least one greater than the range-closing index. This numbering logic is why numVfs is set to 7 in the preceding example. If you set a range from 3 to 4 (enp65s0f0#3-4), your numVfs must be at least 5.
When partitioning isn't specified, numVfs defines the VF range that is used, which always starts from zero. For example, if you set numVfs=3 without specifying partitioning, VFs 0-2 are used.
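The numbering rules above can be checked with a small helper. This is a sketch only: `vf_range` is a hypothetical function, and it assumes the partition suffix always has the form #first-last as in the examples.

```python
def vf_range(pf_entry, num_vfs):
    """Return (pf_name, first_vf, last_vf) used by a pfNames entry.

    Raises ValueError when num_vfs is too small for the requested range,
    because VFs are always created starting from index zero.
    """
    name, separator, partition = pf_entry.partition("#")
    if not separator:
        # No partitioning: VFs 0 .. num_vfs - 1 are used.
        return name, 0, num_vfs - 1

    first_text, dash, last_text = partition.partition("-")
    if not dash:
        raise ValueError(f"expected '#first-last' in {pf_entry!r}")
    first, last = int(first_text), int(last_text)
    if first > last:
        raise ValueError(f"range start exceeds range end in {pf_entry!r}")
    if num_vfs < last + 1:
        raise ValueError(
            f"numVfs must be at least {last + 1} for {pf_entry!r}, got {num_vfs}"
        )
    return name, first, last


print(vf_range("enp65s0f0#3-6", 7))  # ('enp65s0f0', 3, 6)
print(vf_range("enp65s0f0", 3))      # ('enp65s0f0', 0, 2)
```

Calling vf_range("enp65s0f0#3-4", 4) raises ValueError, which matches the rule that this range needs numVfs of at least 5.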
Understand policy priority
You can specify multiple SriovNetworkNodePolicy objects to handle various vendors or different VF configurations. Managing multiple objects and vendors can become troublesome when multiple policies reference the same PF. To handle such situations, the priority field resolves the conflicts on a per-node basis.
Here is the prioritization logic for overlapping PF policies:
- A higher priority policy overwrites one with lower priority only when their PF partitioning overlaps.
- Policies with the same priority are merged:
  - Policies are sorted by name and processed in that order.
  - Policies with overlapping PF partitioning are overwritten.
  - Policies with non-overlapping PF partitioning are merged and all are present.
A high priority policy is one with a lower numerical value in the priority field. For example, a policy with priority: 10 has higher priority than a policy with priority: 20.
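The rules above can be modeled in a few lines. This is a simplified sketch of the documented behavior, not the operator's implementation. It assumes that a policy processed later replaces any earlier policy whose VF range overlaps on the same PF, and that higher priority policies are processed last. The `resolve` and `overlaps` functions and the dictionary layout are hypothetical.

```python
def overlaps(a, b):
    """True when two policies claim at least one common VF on the same PF."""
    return a["pf"] == b["pf"] and a["first"] <= b["last"] and b["first"] <= a["last"]


def resolve(policies):
    """Return the resource names that remain on a node after conflict resolution."""
    # Larger priority numbers mean lower priority, so they are processed first;
    # policies with equal priority are processed in name order.
    ordered = sorted(policies, key=lambda p: (-p["priority"], p["name"]))
    active = []
    for policy in ordered:
        # Assumption: an overlapping earlier policy is dropped entirely.
        active = [kept for kept in active if not overlaps(kept, policy)]
        active.append(policy)
    return sorted(p["resource"] for p in active)


# Same priority, overlapping ranges (0-2 and 2-3): the later name wins.
same_priority = [
    {"name": "policy-1", "priority": 70, "pf": "enp65s0f0",
     "first": 0, "last": 2, "resource": "dev-kernel"},
    {"name": "policy-2", "priority": 70, "pf": "enp65s0f0",
     "first": 2, "last": 3, "resource": "dev-vfio"},
]
print(resolve(same_priority))  # ['dev-vfio']

# Same ranges, but policy-1 now has higher priority (10 beats 70).
different_priority = [dict(same_priority[0], priority=10), same_priority[1]]
print(resolve(different_priority))  # ['dev-kernel']
```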
The following sections provide policy examples for different partitioning configurations.
Partitioned PF
Deploying the following two SriovNetworkNodePolicy manifests results in two available resources: gke.io/dev-kernel and gke.io/dev-vfio. Each resource has two VFs that are non-overlapping.
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
spec:
  deviceType: "netdevice"
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
  nicSelector:
    pfNames:
    - enp65s0f0#0-1
  numVfs: 2
  priority: 70
  resourceName: "dev-kernel"
---
kind: SriovNetworkNodePolicy
metadata:
  name: policy-2
spec:
  deviceType: "vfio-pci"
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
  nicSelector:
    pfNames:
    - enp65s0f0#2-3
  numVfs: 4
  priority: 70
  resourceName: "dev-vfio"
Overlapping PF partitioning
Deploying the following two SriovNetworkNodePolicy manifests results in only the gke.io/dev-vfio resource being available. The policy-1 VF range is 0-2, which overlaps with policy-2. Due to naming, policy-2 is processed after policy-1. Therefore, only the resource specified in policy-2, gke.io/dev-vfio, is available.
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
spec:
  deviceType: "netdevice"
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
  nicSelector:
    pfNames:
    - enp65s0f0
  numVfs: 3
  priority: 70
  resourceName: "dev-kernel"
---
kind: SriovNetworkNodePolicy
metadata:
  name: policy-2
spec:
  deviceType: "vfio-pci"
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
  nicSelector:
    pfNames:
    - enp65s0f0#2-3
  numVfs: 4
  priority: 70
  resourceName: "dev-vfio"
Non-overlapping PF partitioning with different priorities
Deploying the following two SriovNetworkNodePolicy manifests results in two available resources: gke.io/dev-kernel and gke.io/dev-vfio. Each resource has two VFs that are non-overlapping. Even though policy-1 has higher priority than policy-2, the two policies are merged because their PF partitioning doesn't overlap.
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
spec:
  deviceType: "netdevice"
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
  nicSelector:
    pfNames:
    - enp65s0f0
  numVfs: 2
  priority: 10
  resourceName: "dev-kernel"
---
kind: SriovNetworkNodePolicy
metadata:
  name: policy-2
spec:
  deviceType: "vfio-pci"
  nodeSelector:
    baremetal.cluster.gke.io/node-pool: node-pool-1
  nicSelector:
    pfNames:
    - enp65s0f0#2-3
  numVfs: 4
  priority: 70
  resourceName: "dev-vfio"
Check SR-IOV policy setup status
When you apply the SR-IOV policies, you can track and view the final configuration of the nodes in the SriovNetworkNodeState custom resource for the specific node. In the status section, the syncStatus field represents the current stage for the configuration daemon. The Succeeded state indicates that configuration is finished. The spec section of the SriovNetworkNodeState custom resource defines the final state of the VF configuration for that Node, based on the number of policies and their priorities. All the created VFs are listed in the status section for the specified PFs.
Here is an example SriovNetworkNodeState custom resource:
apiVersion: sriovnetwork.k8s.cni.cncf.io/v1
kind: SriovNetworkNodeState
metadata:
  name: worker-02
  namespace: gke-operators
spec:
  dpConfigVersion: "9022068"
  interfaces:
  - linkType: eth
    name: enp1s0f0
    numVfs: 2
    pciAddress: "0000:01:00.0"
    vfGroups:
    - deviceType: netdevice
      policyName: policy-1
      resourceName: mlnx
      vfRange: 0-1
status:
  interfaces:
  - Vfs:
    - deviceID: "1016"
      driver: mlx5_core
      mac: 96:8b:39:d8:89:d2
      mtu: 1500
      name: enp1s0f0np0v0
      pciAddress: "0000:01:00.2"
      vendor: 15b3
      vfID: 0
    - deviceID: "1016"
      driver: mlx5_core
      mac: 82:8e:65:fe:9b:cb
      mtu: 1500
      name: enp1s0f0np0v1
      pciAddress: "0000:01:00.3"
      vendor: 15b3
      vfID: 1
    deviceID: "1015"
    driver: mlx5_core
    eSwitchMode: legacy
    linkSpeed: 10000 Mb/s
    linkType: ETH
    mac: 1c:34:da:5c:2b:9c
    mtu: 1500
    name: enp1s0f0
    numVfs: 2
    pciAddress: "0000:01:00.0"
    totalvfs: 2
    vendor: 15b3
  - deviceID: "1015"
    driver: mlx5_core
    linkSpeed: 10000 Mb/s
    linkType: ETH
    mac: 1c:34:da:5c:2b:9d
    mtu: 1500
    name: enp1s0f1
    pciAddress: "0000:01:00.1"
    totalvfs: 2
    vendor: 15b3
  syncStatus: Succeeded
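When several nodes are involved, scanning the syncStatus field by hand gets tedious. The following sketch assumes the SriovNetworkNodeState objects were listed with kubectl using -o json and parsed into a dictionary with an items list; `unsynced_nodes` is a hypothetical helper, and the InProgress value in the sample data is illustrative only.

```python
def unsynced_nodes(node_state_list):
    """Map node name to syncStatus for every node that is not 'Succeeded'."""
    pending = {}
    for item in node_state_list.get("items", []):
        # "Unknown" is a placeholder used here when the field is absent.
        sync_status = item.get("status", {}).get("syncStatus", "Unknown")
        if sync_status != "Succeeded":
            pending[item["metadata"]["name"]] = sync_status
    return pending


# Illustrative list with one finished node and one still being configured.
sample = {
    "items": [
        {"metadata": {"name": "worker-01"}, "status": {"syncStatus": "Succeeded"}},
        {"metadata": {"name": "worker-02"}, "status": {"syncStatus": "InProgress"}},
    ]
}

print(unsynced_nodes(sample))  # {'worker-02': 'InProgress'}
```

An empty result means every listed node reports Succeeded, so the VF configuration is finished.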
Create a NetworkAttachmentDefinition custom resource
After you successfully configure the VFs on the cluster and they are visible in the Kubernetes Node as a resource, create a NetworkAttachmentDefinition that references the resource. Make the reference with a k8s.v1.cni.cncf.io/resourceName annotation.
Here is an example NetworkAttachmentDefinition manifest that references the gke.io/mlnx resource:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: gke-sriov-1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: gke.io/mlnx
spec:
  config: '{
      "cniVersion": "0.3.0",
      "name": "mynetwork",
      "type": "sriov",
      "ipam": {
        "type": "whereabouts",
        "range": "21.0.108.0/21",
        "range_start": "21.0.111.16",
        "range_end": "21.0.111.18"
      }
    }'
The NetworkAttachmentDefinition must have sriov as the CNI type.
Reference any deployed NetworkAttachmentDefinition custom resources in your pods with a k8s.v1.cni.cncf.io/networks annotation.
Here's an example of how to reference the preceding NetworkAttachmentDefinition custom resource in a pod:
apiVersion: v1
kind: Pod
metadata:
  name: samplepod
  annotations:
    k8s.v1.cni.cncf.io/networks: gke-sriov-1
spec:
  containers:
  ...
When you reference a NetworkAttachmentDefinition custom resource in workloads, you don't have to worry about the Pods' resource definitions or their placement on specific Nodes; both are handled automatically.
The following example shows a NetworkAttachmentDefinition custom resource with a VLAN configuration. In this sample, every VF belongs to VLAN 100:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: gke-sriov-vlan-100
  annotations:
    k8s.v1.cni.cncf.io/resourceName: gke.io/mlnx
spec:
  config: '{
      "cniVersion": "0.3.0",
      "name": "mynetwork",
      "type": "sriov",
      "vlan": 100,
      "ipam": {
        "type": "whereabouts",
        "range": "21.0.100.0/21"
      }
    }'
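Because spec.config is a JSON document embedded in a string, it is easy to break by hand. The following sketch generates manifests with the same shape as the two examples above; `sriov_nad` is a hypothetical helper, and the fixed cniVersion and name values are copied from those examples rather than being requirements.

```python
import json


def sriov_nad(name, resource_name, ipam, vlan=None):
    """Build a NetworkAttachmentDefinition dictionary with an sriov CNI config."""
    cni_config = {
        "cniVersion": "0.3.0",
        "name": "mynetwork",
        "type": "sriov",  # the CNI type must be sriov
        "ipam": ipam,
    }
    if vlan is not None:
        cni_config["vlan"] = vlan
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {
            "name": name,
            "annotations": {"k8s.v1.cni.cncf.io/resourceName": resource_name},
        },
        # The CNI configuration is stored as a JSON string, not a nested object.
        "spec": {"config": json.dumps(cni_config)},
    }


nad = sriov_nad(
    "gke-sriov-vlan-100",
    "gke.io/mlnx",
    {"type": "whereabouts", "range": "21.0.100.0/21"},
    vlan=100,
)
print(json.dumps(nad, indent=2))
```

The printed JSON can be saved to a file and applied like any other manifest, since kubectl accepts JSON as well as YAML.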
Additional information
The following sections contain information to help you configure SR-IOV networking.
Node reboots
When the SR-IOV operator configures the nodes, the nodes may need to be rebooted. Rebooting nodes might be needed during VF or kernel configuration. The kernel configuration involves enabling support of the SR-IOV functionality in the operating system.
Supported network adapters
The following table lists the supported network adapters for version 1.31.x clusters:
Name | Vendor ID | Device ID | VF device ID |
---|---|---|---|
Intel i40e XXV710 | 8086 | 158a | 154c |
Intel i40e 25G SFP28 | 8086 | 158b | 154c |
Intel i40e 10G X710 SFP | 8086 | 1572 | 154c |
Intel i40e XXV710 N3000 | 8086 | 0d58 | 154c |
Intel i40e 40G XL710 QSFP | 8086 | 1583 | 154c |
Intel ice Columbiaville E810-CQDA2 2CQDA2 | 8086 | 1592 | 1889 |
Intel ice Columbiaville E810-XXVDA4 | 8086 | 1593 | 1889 |
Intel ice Columbiaville E810-XXVDA2 | 8086 | 159b | 1889 |
Nvidia mlx5 ConnectX-4 | 15b3 | 1013 | 1014 |
Nvidia mlx5 ConnectX-4LX | 15b3 | 1015 | 1016 |
Nvidia mlx5 ConnectX-5 | 15b3 | 1017 | 1018 |
Nvidia mlx5 ConnectX-5 Ex | 15b3 | 1019 | 101a |
Nvidia mlx5 ConnectX-6 | 15b3 | 101b | 101c |
Nvidia mlx5 ConnectX-6_Dx | 15b3 | 101d | 101e |
Nvidia mlx5 MT42822 BlueField-2 integrated ConnectX-6 Dx | 15b3 | a2d6 | 101e |
Broadcom bnxt BCM57414 2x25G | 14e4 | 16d7 | 16dc |
Broadcom bnxt BCM57508 2x100G | 14e4 | 1750 | 1806 |
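The vendor and device IDs in this table are the same values that appear in the vendor and deviceID fields of the SriovNetworkNodeState status, so a PF can be checked against the table programmatically. The following sketch copies the ID columns from the table above; `SUPPORTED_VF_DEVICE_IDS` and `vf_device_id` are hypothetical names.

```python
# (vendor ID, PF device ID) -> VF device ID, copied from the table above.
SUPPORTED_VF_DEVICE_IDS = {
    ("8086", "158a"): "154c",
    ("8086", "158b"): "154c",
    ("8086", "1572"): "154c",
    ("8086", "0d58"): "154c",
    ("8086", "1583"): "154c",
    ("8086", "1592"): "1889",
    ("8086", "1593"): "1889",
    ("8086", "159b"): "1889",
    ("15b3", "1013"): "1014",
    ("15b3", "1015"): "1016",
    ("15b3", "1017"): "1018",
    ("15b3", "1019"): "101a",
    ("15b3", "101b"): "101c",
    ("15b3", "101d"): "101e",
    ("15b3", "a2d6"): "101e",
    ("14e4", "16d7"): "16dc",
    ("14e4", "1750"): "1806",
}


def vf_device_id(vendor, device_id):
    """Return the expected VF device ID, or None if the adapter is not listed."""
    return SUPPORTED_VF_DEVICE_IDS.get((vendor.lower(), device_id.lower()))


# The ConnectX-4LX PF used in the earlier examples (vendor 15b3, device 1015).
print(vf_device_id("15b3", "1015"))  # 1016
```

The result for the 15b3/1015 pair agrees with the deviceID of "1016" reported for the VFs in the SriovNetworkNodeState example.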