Kubernetes Platform Optimization with Resource Management

Introduction

This document provides an example and recommendations on how to manage which CPUs and memories (NUMA nodes) containers are allowed to use on a Kubernetes node.

Managing CPUs and memories improves AI container performance and helps maintain predictable response times even under heavy load. Reasons for the performance improvements include the following.

  • Better cache hit ratios in all cache levels.

  • Fewer remote memory accesses.

  • Fewer processes and threads per CPU in the whole system.

  • Disabling CPU hyperthreading for containers that run faster when the sibling hyperthread is idle.

More predictable response times are achieved by using dedicated CPUs for containers and sets of containers. This ensures that critical containers always have enough compute resources, and that resource-hungry containers cannot hurt other processes in the system.

NRI Plugins

NRI plugins connect to the container runtime running on a Kubernetes node. Containerd and CRI-O runtimes support NRI plugins.

The NRI plugins project includes two resource policies, balloons and topology-aware. They manage allowed CPUs and memories (cpuset.cpus and cpuset.mems) of all Kubernetes containers created and running on the node.

In this example, we use the balloons policy because it can be tuned for specific applications (such as RAG pipelines), even with node-specific parameters for individual containers in an application. The topology-aware policy, on the other hand, needs no configuration and assigns CPUs automatically based on container resource requests and the underlying hardware topology.

Install

Warning: installing or reconfiguring the balloons policy can change the allowed CPUs and memories of containers already running in the cluster. This may hurt containers that rely on the number of allowed CPUs being static. Furthermore, if there are containers with gigabytes of memory allocated, reconfiguring the policy may cause the kernel to move large amounts of memory between NUMA nodes. This may cause extremely slow response times until the moves have finished. Therefore, it is recommended that nodes are empty or relatively lightly loaded when a new resource policy is applied.

Install the balloons policy with helm:

  1. Add the NRI plugins repository

    helm repo add nri-plugins https://containers.github.io/nri-plugins
    
  2. Install the balloons resource policy and patch the container runtime’s configuration on the individual worker nodes/hosts to enable NRI support.

    helm install balloons nri-plugins/nri-resource-policy-balloons --set patchRuntimeConfig=true
    

Now the balloons policy is managing node resources in the cluster as a DaemonSet that communicates with the container runtime on every node.

Validate policy status

The balloons policy is running on a node once you can find an nri-resource-policy-balloons-... pod on it.

kubectl get pods -A -o wide | grep nri-resource-policy

default   nri-resource-policy-balloons-v6bvq   1/1   Running   0   12s   10.0.0.136   spr-2   <none>   <none>

The status of the policy on each node in the cluster can be read from the balloonspolicy custom resource. For instance, see the Status section in the output of

kubectl describe balloonspolicy default
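
If more detail is needed, the plugin’s logs can be checked as well. A minimal sketch, assuming the DaemonSet keeps the default name from the helm chart and was installed into the default namespace, as in the pod listing above:

kubectl logs -n default daemonset/nri-resource-policy-balloons --tail=20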

Configure

Edit the default balloons policy:

kubectl edit balloonspolicy default

Let us consider isolating the AI inference and reranking containers in the ChatQnA application’s Gaudi-accelerated pipeline.

In the helm charts, the “vllm”, “tgi”, “tei”, and “teirerank” containers belong to the services that need a lot of CPUs. They implement the text-generation-inference and text-embeddings-inference services.

Warning: an issue in the text-embeddings-inference service causes bad performance when its CPUs are managed. As a workaround, prevent CPU management of these containers by adding a pod annotation to both the “chatqna-tei” and “chatqna-teirerank” deployments:

cpu.preserve.resource-policy.nri.io: "true"
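
The annotation goes into the pod template of the deployments. Below is a minimal sketch of adding it with kubectl patch, assuming the deployments are named chatqna-tei and chatqna-teirerank and run in the chatqna namespace used later in this document:

# Add the preserve annotation to the pod template of both deployments.
for deploy in chatqna-tei chatqna-teirerank; do
    kubectl patch deployment "$deploy" -n chatqna --type merge -p \
      '{"spec":{"template":{"metadata":{"annotations":{"cpu.preserve.resource-policy.nri.io":"true"}}}}}'
done

Changing the pod template causes the pods of both deployments to be recreated with the new annotation.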

A note on terminology: we refer to physical CPU cores as “CPU cores” and hyperthreads as vCPUs or just CPUs. When hyperthreading is on, the operating system typically sees every CPU core as two separate vCPUs.
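
Whether hyperthreading is enabled can be checked on the node, for example with lscpu; two threads per core means that hyperthreading is on:

lscpu | grep -i 'thread(s) per core'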

In the example configuration below, we assume that hyperthreading is on. We allocate 16 CPUs (8 CPU cores with two hyperthreads each) to each tgi container, and 32 CPUs (that is, 16 CPU cores) to each tei container. This is done with the following balloons policy configuration.

apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
spec:
  allocatorTopologyBalancing: true
  balloonTypes:
  - name: llm-inference
    allocatorPriority: high
    minCPUs: 16
    minBalloons: 1
    preferNewBalloons: true
    hideHyperthreads: true
    matchExpressions:
    - key: name
      operator: In
      values: ["tgi", "vllm"]
  - name: tei
    allocatorPriority: high
    minCPUs: 32
    minBalloons: 1
    preferNewBalloons: true
    hideHyperthreads: true
    matchExpressions:
    - key: name
      operator: In
      values:
      - tei
      - teirerank
  - name: default
    hideHyperthreads: false
    namespaces:
    - "*"
    shareIdleCPUsInSame: numa
  instrumentation:
    httpEndpoint: :8891
    prometheusExport: true
    reportPeriod: 60s
    samplingRatePerMillion: 0
  log:
    source: true
    debug: ["policy"]
  pinCPU: true
  pinMemory: false
  reservedPoolNamespaces:
  - kube-system
  reservedResources:
    cpu: "2"

The balloons policy creates “balloons” of CPUs, and only the containers assigned to a balloon are allowed to use its CPUs. A CPU belongs to at most one balloon at a time. CPUs that do not belong to any balloon are called idle CPUs.

The most important options in the above configuration example are:

  • allocatorTopologyBalancing: true. This option ensures that balloons (sets of allowed CPUs) are balanced between CPU sockets in the system. Balancing also happens within a CPU socket if the system is running in sub-NUMA clustering (SNC) mode. Without this option, balloons would be packed tightly onto a single socket, allowing the other CPU socket to sleep and save power. Here we have optimized for performance; to optimize for power savings instead, one could set allocatorTopologyBalancing: false. For more information about sub-NUMA clustering, see the Xeon scalable overview.

  • The list of balloonTypes includes two application-specific balloon types: one for tgi and one for tei containers.

  • The matchExpressions of a balloon type select the containers that should run in balloons of this type. We select tei and tgi containers into their own balloon types based on the container name. Matching could also be done based on labels or the pod name.

  • preferNewBalloons: true on both the tei and tgi balloon types means that when a container is assigned to this balloon type and there are enough free CPUs in the system to create a new balloon of this type, a new balloon is created for the container. As a result, both tei and tgi containers get a dedicated set of CPUs, unlike other containers, which run in the default balloon type. Each container is allowed to use only the CPUs of the balloon to which it is assigned.

  • minCPUs: 16 and minCPUs: 32 define the minimum number of CPUs in a balloon. A created balloon will never be smaller, even if the containers assigned to it request fewer CPUs or none at all. Correspondingly, maxCPUs could be used to set an upper limit on the number of CPUs.

  • minBalloons: 1 means that the policy preallocates CPUs for one balloon of this type as soon as the policy starts. This ensures that the CPUs are selected optimally, without restrictions imposed by other CPU allocations. Without preallocation, balloons created for other containers could get their CPUs first, which would force this allocation to be made from scattered left-over CPUs. The number of preallocated CPUs in the balloon is specified by minCPUs.

  • hideHyperthreads: true means that containers in balloons of this type are allowed to use only a single hyperthread from each CPU core in the balloon. By default, containers may use both hyperthreads of every CPU core in the balloon. Note that when this option is true, both hyperthreads are still allocated to the balloon, which prevents them from being allocated to other balloons. This ensures that the whole CPU core is dedicated to the containers in these balloons only.

  • hideHyperthreads: false allows containers in a balloon to use all of the balloon’s CPUs, whether or not they come from the same CPU cores. As an option of the default balloon type, it applies to all containers other than tgi and tei in the example configuration. Note that false cannot unhide hyperthreads if hyperthreading is disabled in the BIOS.

  • shareIdleCPUsInSame: numa means that containers in a balloon of this type are allowed to use not only the balloon’s own CPUs, but also idle CPUs within the same NUMA nodes as the balloon’s own CPUs. This enables bursting CPU usage above what the containers in the balloon have requested, while still using only CPUs with the lowest latency to the data in memory.

For more information about the configuration and the balloons resource policy, refer to the balloons documentation.

Validate CPU affinity and hardware alignment in containers

The CPUs allowed in each container of the ChatQnA RAG pipeline can be listed by running grep in each container. Assuming that the pipeline is running in the “chatqna” namespace, this can be done as follows.

namespace=chatqna
for pod in $(kubectl get pods -n $namespace -o name); do
    echo $(kubectl exec -t -n $namespace $pod -- grep Cpus_allowed_list /proc/self/status) $pod
done | sort

Cpus_allowed_list: 0-30 chatqna-tgi-84c98dd9b7-26dhl
Cpus_allowed_list: 32-39 chatqna-teirerank-7fd4d88d85-swjjv
Cpus_allowed_list: 40-47 chatqna-tei-f5dd58487-vfv45
Cpus_allowed_list: 56-62,120-126 chatqna-85fb984fb9-7rfrk
Cpus_allowed_list: 56-62,120-126 chatqna-data-prep-5489d9b65d-szgth
Cpus_allowed_list: 56-62,120-126 chatqna-embedding-usvc-64566dd669-hdr4k
Cpus_allowed_list: 56-62,120-126 chatqna-llm-uservice-678dc9f98c-tvtqq
Cpus_allowed_list: 56-62,120-126 chatqna-redis-vector-db-676fb75667-trqm6
Cpus_allowed_list: 56-62,120-126 chatqna-reranking-usvc-74b5684cbc-28gdr
Cpus_allowed_list: 56-62,120-126 chatqna-retriever-usvc-64fd64475b-f892k
Cpus_allowed_list: 56-62,120-126 chatqna-ui-dd657bbf6-2wzhr

Alignment of allowed CPU sets with the underlying hardware topology can be validated by comparing above output to CPUs in each NUMA node.

lscpu | grep NUMA

NUMA node(s):                       8
NUMA node0 CPU(s):                  0-7,64-71
NUMA node1 CPU(s):                  8-15,72-79
NUMA node2 CPU(s):                  16-23,80-87
NUMA node3 CPU(s):                  24-31,88-95
NUMA node4 CPU(s):                  32-39,96-103
NUMA node5 CPU(s):                  40-47,104-111
NUMA node6 CPU(s):                  48-55,112-119
NUMA node7 CPU(s):                  56-63,120-127

This shows that chatqna-tgi is executed on CPUs 0-30, that is, on NUMA nodes 0-3. All these NUMA nodes are located in the same CPU socket, as they have the same physical package id:

cat /sys/devices/system/node/node[0-3]/cpu*/topology/physical_package_id | sort -u
0

The output also shows that chatqna-teirerank and chatqna-tei have been given CPUs from two separate NUMA nodes (4 and 5) from the other CPU socket.

cat /sys/devices/system/node/node[4-5]/cpu*/topology/physical_package_id | sort -u
1

Finally, taking a deeper look at the CPUs of chatqna-teirerank (32-39), we can see that each of them has been selected from a separate physical CPU core in NUMA node4. That is, no two vCPUs (hyperthreads) come from the same core.

cat /sys/devices/system/node/node4/cpu3[2-9]/topology/core_id
0
1
2
3
4
5
6
7
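
The same check can be made directly with lscpu, which prints the physical core, socket, and NUMA node of every CPU. A minimal sketch that filters the output to the CPUs of chatqna-teirerank (32-39) on this node:

# Keep the header line and the rows for CPUs 32-39.
lscpu --extended=CPU,CORE,SOCKET,NODE | awk 'NR == 1 || ($1 >= 32 && $1 <= 39)'

Eight different CORE values on the same SOCKET and NODE confirm that no two of these vCPUs share a physical core.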

Remove a policy

The balloons policy is uninstalled from the cluster with helm:

helm uninstall balloons

Note that removing the policy does not modify the CPU affinity (cgroups cpuset.cpus files) of running containers. For that, the containers need to be recreated, or a new policy needs to be installed.
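
One way to recreate the containers is to restart their deployments, which causes new containers to be created with default CPU affinity. A sketch, assuming the ChatQnA pipeline runs in the chatqna namespace:

# Restart all deployments in the chatqna namespace so that their containers are recreated.
kubectl rollout restart deployment -n chatqna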

NRI topology-aware resource policy

NRI plugins include the topology-aware resource policy, too. Unlike balloons, it does not require configuration to start with. Instead, it creates CPU pools for containers purely based on their resource requests and limits, which must be set for effective use of the policy. Containers in the Guaranteed QoS class get dedicated CPUs. While container- and node-type-specific configuration possibilities are more limited, the policy works well for ensuring NUMA alignment and for choosing CPUs with low-latency access to accelerators such as Gaudi cards. See the topology-aware policy documentation for more information.
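
Below is a sketch of switching to the topology-aware policy with helm, assuming the chart is named nri-resource-policy-topology-aware in the same repository. Only one NRI resource policy should manage the nodes at a time, so the balloons policy is removed first:

# Remove the balloons policy before installing another resource policy.
helm uninstall balloons
# Install the topology-aware resource policy from the NRI plugins repository.
helm install topology-aware nri-plugins/nri-resource-policy-topology-aware --set patchRuntimeConfig=true

Remember that a container is in the Guaranteed QoS class only when its CPU and memory requests equal its limits.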

Recommendations

The following recommendations help use CPU and memory resources effectively and serve the maximal number of CPU inference users.

  1. For best throughput, run multiple inference engines on a server with non-overlapping CPUs, instead of using all CPUs for a single inference engine. If CPU cores are hyperthreaded, logical CPUs for two different engines should not be taken from the same physical CPU core.

  2. For best throughput (and potentially lowest latencies) on multi-socket systems, run each inference engine using CPUs from a single socket only. Spread engines evenly across sockets in order to balance memory bandwidth usage in the system. Each engine then uses only memory that is local to its socket.

  3. For lowest variance, enable sub-NUMA clustering and run each inference engine using CPUs from a single sub-NUMA node only. This gives the best predictability, as every engine accesses the lowest-latency memory and access times to all of its data are uniform. Throughput is best when all engines in the system are under heavy load. But there is a trade-off: this limits the peak memory bandwidth of each inference engine, compared to a setup where each engine uses memory from all NUMA nodes local to its socket.

  4. For best throughput and latencies, hide hyperthreads so that each inference engine uses only one logical CPU per physical core. However, on very recent platforms and with a relatively small number of CPUs per engine (for example, fewer than 10), hyperthreads may improve performance. In other words, it is worth measuring which option is better, because the difference can be significant either way.

In the case of the NRI balloons resource policy, the “non-overlapping CPUs” of recommendations 1-3 can be ensured by setting preferNewBalloons: true on all balloon types of inference engine containers. The policy will then assign each container to a balloon that has no other containers.

Balancing memory bandwidth in recommendations 2-3 can be handled by setting allocatorTopologyBalancing: true and minBalloons: NN, where NN is the number of balloons that should be precreated for inference containers. Balancing is guaranteed to succeed with precreated balloons, yet it may also succeed when balloons are created on demand.

Recommendation 4 can be followed by setting hideHyperthreads: true and allocating double the number of (logical) CPUs compared to the number of physical cores that each inference engine should get. The number of CPUs can be chosen with minCPUs for precreated balloons, or by using the inference container’s resources.requests.cpu in Kubernetes.
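
Putting the recommendations together, a balloon type for inference engine containers could look like the following sketch of the BalloonsPolicy spec. The balloon type name, the matched container names, and the numbers (two precreated balloons of 16 physical cores, that is 32 logical CPUs, each on a two-socket server) are illustrative assumptions, not tuned values.

  allocatorTopologyBalancing: true    # spread the precreated balloons across sockets (recommendation 2)
  balloonTypes:
  - name: inference-engine            # illustrative balloon type name
    allocatorPriority: high
    preferNewBalloons: true           # recommendation 1: each engine gets its own, non-overlapping balloon
    hideHyperthreads: true            # recommendation 4: expose only one hyperthread per physical core
    minBalloons: 2                    # recommendations 2-3: precreate one balloon per socket
    minCPUs: 32                       # 32 logical CPUs = 16 physical cores per engine
    matchExpressions:
    - key: name
      operator: In
      values: ["vllm", "tgi"]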