# Kubernetes Platform Optimization with Resource Management

## Introduction

This document provides an example and recommendations on how to manage
which CPUs and memories (NUMA nodes) each container on a Kubernetes
node is allowed to use.

Managing CPUs and memories can improve AI container performance and
keep response times predictable even under heavy load. Reasons for the
performance improvements include the following.

- Better cache hit ratios in all cache levels.
- Fewer remote memory accesses.
- Fewer processes and threads per CPU in the whole system.
- Hiding CPU hyperthreads from containers that run faster when the
  other hyperthread of the same CPU core is idle.

Response times become more predictable when containers, or sets of
containers, get dedicated CPUs. This ensures that critical containers
always have enough compute resources, and that resource-hungry
containers cannot hurt all other processes in the system.

## NRI Plugins

[NRI plugins](https://github.com/containers/nri-plugins) connect to
the container runtime running on a Kubernetes node. Containerd and
CRI-O runtimes support NRI plugins.

The NRI plugins project includes two resource policies, balloons and
topology-aware. They manage allowed CPUs and memories (cpuset.cpus and
cpuset.mems) of all Kubernetes containers created and running on the
node.

In this example, we use the balloons policy because it can be tuned
for specific applications (like RAG pipelines), even with
node-specific parameters for each container of an application. The
topology-aware policy, on the other hand, needs no configuration and
assigns CPUs automatically based on the resource requests of
containers and the underlying hardware topology.

## Install

Warning: installing and reconfiguring the balloons policy can change
allowed CPUs and memories of already running containers in the
cluster. This may hurt containers that rely on the number of allowed
CPUs being static. Furthermore, if there are containers with gigabytes
of memory allocated, reconfiguring the policy may cause the kernel to
move large amounts of memory between NUMA nodes. This may cause
extremely slow response times until the moves have finished.
Therefore, it is recommended that nodes are empty or relatively
lightly loaded when a new resource policy is applied.

Install the balloons policy with helm:

1. Add the NRI plugins repository
   ```bash
   helm repo add nri-plugins https://containers.github.io/nri-plugins
   ```

2. Install the balloons resource policy and patch the container
   runtime's configuration on each worker node to enable NRI support.
   ```bash
   helm install balloons nri-plugins/nri-resource-policy-balloons --set patchRuntimeConfig=true
   ```

Now the balloons policy is managing node resources in the cluster as a
DaemonSet that communicates with the container runtime on every node.

## Validate policy status

The balloons policy is running on a node once a
`nri-resource-policy-balloons-...` pod is running on that node.

```
kubectl get pods -A -o wide | grep nri-resource-policy

default   nri-resource-policy-balloons-v6bvq   1/1   Running   0   12s   10.0.0.136   spr-2   <none>   <none>
```
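
On clusters with many nodes, the same check can be scripted. The
sketch below is a minimal example with a hypothetical helper name. It
assumes the default `kubectl get pods -A` column layout, where the pod
name is the second column and the pod status is the fourth.

```bash
# Sketch: report balloons policy pods that are not in the Running state.
# The function name is hypothetical. Input is the output of
# `kubectl get pods -A -o wide --no-headers`, where column 2 is the pod
# name and column 4 is the pod status.
balloons_pods_not_running() {
    awk '$2 ~ /^nri-resource-policy-balloons-/ && $4 != "Running" { print $2, $4 }'
}
```

For example, `kubectl get pods -A -o wide --no-headers | balloons_pods_not_running`
prints nothing when every balloons policy pod is running.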

The status of the policy on each node in a cluster can be read from
the balloonspolicy custom resource. For instance, see the `Status`
section in the output of:

```
kubectl describe balloonspolicy default
```

## Configure

Edit the default balloons policy:
```bash
kubectl edit balloonspolicy default
```

Let us consider isolating AI inference and reranking containers in the
Gaudi-accelerated pipeline of the
[ChatQnA](/GenAIExamples/ChatQnA/README.md) application.

In the [helm
charts](/GenAIInfra/helm-charts/chatqna/README.md),
the services contain "vllm", "tgi", "tei" and "teirerank" containers
that need a lot of CPUs. They implement text generation and text
embeddings inference services.

Warning: an
[issue](https://github.com/opea-project/GenAIExamples/issues/763) in
text-embeddings-inference causes bad performance when CPUs are
managed. As a workaround, prevent CPU management of these containers
by adding the following pod annotation to both the "chatqna-tei" and
"chatqna-teirerank" deployments:
```
cpu.preserve.resource-policy.nri.io: "true"
```
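
One way to add the annotation without editing manifests by hand is a
merge patch on the pod template of each deployment. The following is a
minimal sketch, not the only way to do it. The helper function names
are hypothetical, and the namespace passed to them is an assumption
that depends on where the application is installed.

```bash
# Sketch: add the cpu.preserve annotation to a deployment's pod template.
# Function names are hypothetical; the namespace is an assumption.

# Print a JSON merge patch that sets the annotation on the pod template.
preserve_cpu_patch() {
    printf '%s' '{"spec":{"template":{"metadata":{"annotations":{"cpu.preserve.resource-policy.nri.io":"true"}}}}}'
}

# Apply the patch to one deployment.
# Usage: annotate_deployment <namespace> <deployment-name>
annotate_deployment() {
    kubectl patch deployment "$2" -n "$1" --type merge -p "$(preserve_cpu_patch)"
}
```

For example, if the application runs in a namespace called `chatqna`,
run `annotate_deployment chatqna chatqna-tei` and
`annotate_deployment chatqna chatqna-teirerank`. Changing the pod
template makes the deployment roll out new pods, which are then
created with the annotation.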

A note on terminology: we refer to physical CPU cores as "CPU cores"
and hyperthreads as vCPUs or just CPUs. When hyperthreading is on, the
operating system typically sees every CPU core as two separate vCPUs.

In the example configuration below, we assume that hyperthreading is
on. We allocate 16 CPUs (8 CPU cores with two hyperthreads per core)
for each tgi container, and 32 CPUs (that is 16 CPU cores) for each
tei container. This happens with the following balloons policy
configuration.

```yaml
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
spec:
  allocatorTopologyBalancing: true
  balloonTypes:
  - name: llm-inference
    allocatorPriority: high
    minCPUs: 16
    minBalloons: 1
    preferNewBalloons: true
    hideHyperthreads: true
    matchExpressions:
    - key: name
      operator: In
      values: ["tgi", "vllm"]
  - name: tei
    allocatorPriority: high
    minCPUs: 32
    minBalloons: 1
    preferNewBalloons: true
    hideHyperthreads: true
    matchExpressions:
    - key: name
      operator: In
      values:
      - tei
      - teirerank
  - name: default
    hideHyperthreads: false
    namespaces:
    - "*"
    shareIdleCPUsInSame: numa
  instrumentation:
    httpEndpoint: :8891
    prometheusExport: true
    reportPeriod: 60s
    samplingRatePerMillion: 0
  log:
    source: true
    debug: ["policy"]
  pinCPU: true
  pinMemory: false
  reservedPoolNamespaces:
  - kube-system
  reservedResources:
    cpu: "2"
```

The balloons policy creates "balloons" of CPUs. Only containers
assigned to a balloon are allowed to use its CPUs. A CPU belongs to at
most one balloon at a time. CPUs that do not belong to any balloon are
called idle CPUs.

The most important options in the above configuration example are:

- `allocatorTopologyBalancing: true`. This option ensures that
  balloons (sets of allowed CPUs) are balanced between CPU sockets in
  the system. Balancing also happens within a CPU socket if the system
  is running in a sub-NUMA clustering (SNC) mode. Without this option,
  balloons would be tightly packed on a single socket, allowing the
  other CPU socket to sleep and save power. Here we have optimized for
  performance, but to optimize for power savings, one could
  alternatively set `allocatorTopologyBalancing: false`. For more
  information about sub-NUMA clustering, see [Xeon scalable
  overview](https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html).
- The list of `balloonTypes` includes two application-specific balloon
  types: `llm-inference` for tgi and vllm containers, and `tei` for
  tei and teirerank containers.
- `matchExpressions` of a balloon type select the containers that
  should run in balloons of this type. We assign tei and tgi
  containers to their special balloon types based on the container
  name. Matching could be done based on labels and pod name, too.
- `preferNewBalloons: true` on both tei and tgi balloon types means
  that when a container is assigned to this balloon type and there
  are enough free CPUs in the system to create a new balloon of this
  type, a new balloon is created for the container. As a result, both
  tei and tgi containers get a dedicated set of CPUs, unlike other
  containers that run in the default balloon type. Each container is
  allowed to use only the CPUs of the balloon to which it is assigned.
- `minCPUs: 16` and `minCPUs: 32` define the minimum number of CPUs in
  a balloon. A created balloon will never be smaller, even if the
  containers assigned to it request fewer CPUs or none at all.
  Correspondingly, `maxCPUs` could be used to set an upper limit for
  the number of CPUs.
- `minBalloons: 1` means that the policy must preallocate CPUs for one
  balloon of this balloon type immediately when the policy
  starts. This ensures that the CPUs are selected optimally without
  any restrictions that could be imposed by other CPU
  allocations. Without preallocation, some other balloons created for
  other containers could get their CPUs first, which would force this
  allocation to be made from scattered left-over CPUs. The number of
  preallocated CPUs for the balloon is specified by `minCPUs`.
- `hideHyperthreads: true` means that containers in balloons of this
  type are allowed to use only a single hyperthread from each CPU
  core in the balloon. By default, containers may use both
  hyperthreads of every CPU core in the balloon. Note that when
  `true`, both hyperthreads are still allocated to the balloon, which
  prevents allocating them to other balloons. This ensures that the
  whole CPU core is dedicated to containers in these balloons only.
- `hideHyperthreads: false` allows containers in a balloon to use all
  of the balloon's CPUs, whether or not they are from the same CPU
  cores. As an option of the default balloon type, it applies to all
  containers other than tgi and tei in the example configuration. Note
  that `false` cannot unhide hyperthreads if hyperthreading is off in
  BIOS.
- `shareIdleCPUsInSame: numa` means that containers in a balloon of
  this type are allowed to use not only the balloon's own CPUs but
  also idle CPUs within the same NUMA nodes as the balloon's own CPUs.
  This enables bursting CPU usage above what is requested by
  containers in the balloon, while still using only the CPUs with the
  lowest latency to the data in memory.

For more information about the configuration and the balloons resource
policy, refer to the balloons
[documentation](https://containers.github.io/nri-plugins/stable/docs/resource-policy/policy/balloons.html).


## Validate CPU affinity and hardware alignment in containers

CPUs allowed in each container of the ChatQnA RAG pipeline can be
listed by running grep in each container. Assuming that the pipeline
is running in the "chatqna" namespace, this can be done as follows.

```
namespace=chatqna
for pod in $(kubectl get pods -n $namespace -o name); do
    echo $(kubectl exec -t -n $namespace $pod -- grep Cpus_allowed_list /proc/self/status) $pod
done | sort

Cpus_allowed_list: 0-30 chatqna-tgi-84c98dd9b7-26dhl
Cpus_allowed_list: 32-39 chatqna-teirerank-7fd4d88d85-swjjv
Cpus_allowed_list: 40-47 chatqna-tei-f5dd58487-vfv45
Cpus_allowed_list: 56-62,120-126 chatqna-85fb984fb9-7rfrk
Cpus_allowed_list: 56-62,120-126 chatqna-data-prep-5489d9b65d-szgth
Cpus_allowed_list: 56-62,120-126 chatqna-embedding-usvc-64566dd669-hdr4k
Cpus_allowed_list: 56-62,120-126 chatqna-llm-uservice-678dc9f98c-tvtqq
Cpus_allowed_list: 56-62,120-126 chatqna-redis-vector-db-676fb75667-trqm6
Cpus_allowed_list: 56-62,120-126 chatqna-reranking-usvc-74b5684cbc-28gdr
Cpus_allowed_list: 56-62,120-126 chatqna-retriever-usvc-64fd64475b-f892k
Cpus_allowed_list: 56-62,120-126 chatqna-ui-dd657bbf6-2wzhr
```

Alignment of allowed CPU sets with the underlying hardware topology
can be validated by comparing the above output to the CPUs in each
NUMA node.

```
lscpu | grep NUMA

NUMA node(s):                       8
NUMA node0 CPU(s):                  0-7,64-71
NUMA node1 CPU(s):                  8-15,72-79
NUMA node2 CPU(s):                  16-23,80-87
NUMA node3 CPU(s):                  24-31,88-95
NUMA node4 CPU(s):                  32-39,96-103
NUMA node5 CPU(s):                  40-47,104-111
NUMA node6 CPU(s):                  48-55,112-119
NUMA node7 CPU(s):                  56-63,120-127
```

This shows that chatqna-tgi is executed on CPUs 0-30, that is, on NUMA
nodes 0-3. All these NUMA nodes are located in the same CPU socket, as
they have the same physical package id:

```
cat /sys/devices/system/node/node[0-3]/cpu*/topology/physical_package_id | sort -u
0
```

The output also shows that chatqna-teirerank and chatqna-tei have been
given CPUs from two separate NUMA nodes (4 and 5) from the other CPU socket.

```
cat /sys/devices/system/node/node[4-5]/cpu*/topology/physical_package_id | sort -u
1
```
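
The manual comparison between a `Cpus_allowed_list` value and the
`lscpu` output can also be scripted. The following is a minimal
sketch with hypothetical function names. It assumes that `lscpu`
prints NUMA lines in the `NUMA nodeN CPU(s): <list>` form shown
earlier in this section.

```bash
# Sketch: list the NUMA nodes that a CPU list overlaps.
# Function names are hypothetical.

# Expand a CPU list such as "56-62,120-126" into one CPU number per line.
expand_cpulist() {
    local part first last
    local IFS=','
    for part in $1; do
        first=${part%-*}
        last=${part#*-}
        seq "$first" "$last"
    done
}

# Read `lscpu` output on stdin and print the number of every NUMA node
# that has at least one CPU in the CPU list given as the first argument.
numa_nodes_of() {
    local wanted first node cpus
    wanted=$(expand_cpulist "$1")
    while read -r first node _ cpus; do
        [ "$first" = "NUMA" ] || continue
        [[ $node =~ ^node([0-9]+)$ ]] || continue
        # grep -x matches whole lines only, so CPU 3 does not match CPU 32.
        if expand_cpulist "$cpus" | grep -qxF -f <(printf '%s\n' "$wanted"); then
            echo "${BASH_REMATCH[1]}"
        fi
    done
}
```

For example, `lscpu | numa_nodes_of 32-39` prints the NUMA nodes that
contain the CPUs of chatqna-teirerank.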

Finally, taking a deeper look into CPUs of chatqna-teirerank (32-39),
we can find out that each of them is selected from a separate physical
CPU core in NUMA node4. That is, there are no two vCPUs (hyperthreads)
from the same core.

```
cat /sys/devices/system/node/node4/cpu3[2-9]/topology/core_id
0
1
2
3
4
5
6
7
```
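
Instead of reading the list of core IDs by eye, duplicates can be
detected mechanically. The sketch below uses a hypothetical helper
name. Because core IDs are typically unique only within one CPU
package, it is meant for CPUs that belong to the same socket, as in
the example above.

```bash
# Sketch: detect hyperthreads that share a physical CPU core.
# The function name is hypothetical.

# Read core IDs (one per line) on stdin and print every ID that occurs
# more than once. Empty output means that no two CPUs share a core.
shared_core_ids() {
    sort -n | uniq -d
}
```

For example,
`cat /sys/devices/system/node/node4/cpu3[2-9]/topology/core_id | shared_core_ids`
prints nothing when every CPU in the set comes from a separate
physical core.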

## Remove a policy

The balloons policy is uninstalled from the cluster with helm:

```
helm uninstall balloons
```

Note that removing the policy does not modify the CPU affinity
(cgroups cpuset.cpus files) of running containers. To change it,
recreate the containers or install a new policy.

## NRI topology-aware resource policy

NRI plugins include the topology-aware resource policy, too. Unlike
balloons, it does not require configuration to start with. Instead, it
creates CPU pools for containers purely based on their resource
requests and limits, which must be set for effective use of the
policy. Containers in the
[Guaranteed](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed)
QoS class get dedicated CPUs. Although container and node
type-specific configuration possibilities are more limited, the policy
works well for ensuring NUMA alignment and choosing CPUs with
low-latency access to accelerators like Gaudi cards. See the
topology-aware policy
[documentation](https://containers.github.io/nri-plugins/stable/docs/resource-policy/policy/topology-aware.html)
for more information.

## Recommendations

The following recommendations help to use CPU and memory resources
effectively and to serve the maximum number of CPU inference users.

1. For best throughput, run multiple inference engines on a server
   with non-overlapping CPUs, instead of using all CPUs for a single
   inference engine. Logical CPUs for two different engines should not
   be taken from the same physical CPU core, if CPU cores are
   hyperthreaded.

2. For best throughput (and potentially lowest latencies) on
   multisocket systems, run each inference engine using CPUs only from
   a single socket. Spread engines evenly across sockets in order to
   balance memory bandwidth usage in the system. Each engine uses only
   the memory that is local to its socket.

3. For lowest variance, enable sub-NUMA clustering, and run each
   inference engine using CPUs only from a single sub-NUMA node. This
   gives the best predictability, as every engine accesses the
   lowest-latency memory and access times to all the data are
   uniform. Throughput is best when all engines in the system are
   under heavy load. But there is a trade-off: this limits the peak
   memory bandwidth for each inference engine, compared to the setup
   where each engine uses memory from all NUMA nodes local to its
   socket.

4. For best throughput and latencies, hide hyperthreads from physical
   CPU cores of the inference engine. However, on very recent
   platforms with a relatively small number of CPUs for each engine
   (for example fewer than 10), hyperthreads may improve performance.
   In other words, it is worth measuring which one is better because
   the difference can be significant in either direction.

With the NRI balloons resource policy, "non-overlapping CPUs" in
recommendations 1-3 can be ensured by setting
`preferNewBalloons: true` on all balloon types of inference engine
containers. The policy will then assign each container to a balloon
that has no other containers.

Balancing memory bandwidth in recommendations 2-3 can be handled by
setting `allocatorTopologyBalancing: true` and `minBalloons: NN` where
NN is the number of balloons that should be precreated for inference
containers. Balancing is guaranteed to succeed with precreated
balloons, although it may also succeed when balloons are created on
demand.

Recommendation 4 can be followed by setting `hideHyperthreads: true`
and allocating double the number of (logical) CPUs compared to the
number of physical cores that each inference engine will get. The
number of CPUs can be chosen with `minCPUs` for precreated balloons,
or by using the inference container's `resources.requests.cpu` in
Kubernetes.
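
The arithmetic behind recommendation 4 can be sketched as follows.
The function names are hypothetical, and the sketch assumes that
`lscpu` reports hyperthreading in a `Thread(s) per core:` line.

```bash
# Sketch: compute minCPUs for a balloon that uses hideHyperthreads: true.
# Function names are hypothetical.

# Read `lscpu` output on stdin and print the number of threads per core.
threads_per_core() {
    awk -F: '/^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Print the number of logical CPUs that cover <cores> whole physical
# cores when each core has <threads_per_core> hyperthreads (default 2).
# Usage: min_cpus_for_cores <cores> [threads_per_core]
min_cpus_for_cores() {
    local cores=$1 threads=${2:-2}
    echo $(( cores * threads ))
}
```

For example, `min_cpus_for_cores 8 "$(lscpu | threads_per_core)"`
prints the `minCPUs` value that dedicates 8 physical cores to an
engine, which is 16 when hyperthreading is on with two threads per
core.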