Setup Test Environment for CI/CD

This document describes how to set up a test environment for OPEA CI/CD from scratch. The environment is used to run tests and check code quality before a PR is merged and before a release.

Install Habana Driver (Gaudi Only)

Driver and software installation: https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html
Firmware upgrade: https://docs.habana.ai/en/latest/Installation_Guide/Firmware_Upgrade.html

Install Docker

    sudo apt update
    sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
    sudo systemctl enable docker.service
    sudo systemctl daemon-reload
    sudo systemctl start docker
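
Before moving on, it can help to confirm that the tools this guide relies on are on the PATH. The following is a minimal sketch; `check_tools` is a hypothetical helper name introduced here for illustration and is not part of Docker or OPEA.

```shell
# check_tools TOOL...: print "ok: TOOL" or "missing: TOOL" for each argument.
# Returns 1 if at least one tool is missing, 0 otherwise.
# (check_tools is a hypothetical helper, not part of Docker or OPEA.)
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, run `check_tools docker containerd` after the install; `docker compose version` then shows whether the compose plugin is available.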

Troubleshooting Docker Installation

Issue: E: Unable to locate package docker-compose-plugin
Solution:

    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    sudo apt-get update
    sudo apt-get install -y docker-compose-plugin

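Issue: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.45/containers/json": dial unix /var/run/docker.sock: connect: permission denied
Solution:

    # Option 1: add your user to the docker group (replace xxx with your username), then log out and back in.
    sudo usermod -a -G docker xxx
    # Option 2: open the socket to all users. This is insecure and typically lasts only until the Docker service restarts.
    sudo chmod 666 /var/run/docker.sock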
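
A group change made with `usermod` only applies to new login sessions, so the permission error can persist right after option 1. The sketch below checks whether the `docker` group is active in the current session; `has_group` is a hypothetical helper name, not a Docker command.

```shell
# has_group GROUP "LIST": succeed if GROUP is one of the space-separated names in LIST.
# (has_group is a hypothetical helper name.)
has_group() {
  local wanted="$1" g
  for g in $2; do          # intentionally unquoted: split LIST on whitespace
    [ "$g" = "$wanted" ] && return 0
  done
  return 1
}

# Check the groups of the current login session.
if has_group docker "$(id -nG)"; then
  echo "docker group is active for this session"
else
  echo "docker group not active yet; log out and back in, or run: newgrp docker"
fi
```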

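Issue: the open file limit (ulimit -n) for containerd is too low. [optional]
Solution:

    # Create the drop-in directory first; tee needs sudo to write under /etc.
    sudo mkdir -p /etc/systemd/system/containerd.service.d
    cat << EOF | sudo tee /etc/systemd/system/containerd.service.d/override.conf
    [Service]
    LimitNOFILE=infinity
    EOF
    # Reload unit files so systemd picks up the new drop-in.
    sudo systemctl daemon-reload
    sudo systemctl restart containerd.service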

Issue: vm.max_map_count, which controls the maximum number of memory-mapped areas a process can have, is too low. [optional]
Solution:

    echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p
    sysctl vm.max_map_count # check 
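
To make the check repeatable (for example in a node preflight script), the comparison can be wrapped in a small function. This is a sketch under stated assumptions: `sysctl_at_least` and its optional third argument are hypothetical, and the value is read directly from `/proc/sys` rather than through the `sysctl` binary.

```shell
# sysctl_at_least KEY MIN [ROOT]: succeed if the kernel setting KEY is >= MIN.
# ROOT defaults to /proc/sys; the optional override exists only to make the
# function easy to try against a scratch directory.
# (sysctl_at_least is a hypothetical helper name.)
sysctl_at_least() {
  local key="$1" min="$2" root="${3:-/proc/sys}" file value
  file="$root/$(printf '%s' "$key" | tr . /)"   # vm.max_map_count -> vm/max_map_count
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
  value="$(cat "$file")"
  if [ "$value" -ge "$min" ]; then
    echo "$key=$value (ok, >= $min)"
  else
    echo "$key=$value (too low, need >= $min)"
    return 1
  fi
}
```

Usage: `sysctl_at_least vm.max_map_count 262144` returns non-zero when the setting still needs to be raised.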

Install Conda

Conda is used to set up the e2e test environment.

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh

Install K8S

Use kubeadm to set up the K8S cluster: https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubeadm.md
Install the Habana plugins (Gaudi only): https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Kubernetes_Installation/Intel_Gaudi_Kubernetes_Device_Plugin.html

Some Test Code after Installation

    kubectl get nodes -o wide
    kubectl get pods -A
    kubectl get cs # componentstatuses; deprecated since Kubernetes v1.19
    kubectl describe node <node_name>
    kubectl describe pod <pod_name>
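
On a larger cluster, scanning the `kubectl get nodes` table by eye gets tedious. The following sketch filters that output down to the nodes that need attention; `not_ready_nodes` is a hypothetical helper name, and it assumes the default column order (NAME, STATUS, ...).

```shell
# not_ready_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print the name of every node whose STATUS column is not exactly "Ready".
# (not_ready_nodes is a hypothetical helper name.)
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}
```

Usage: `kubectl get nodes --no-headers | not_ready_nodes`. Note that cordoned nodes report a status such as `Ready,SchedulingDisabled`, so they are listed as well.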

Test for Gaudi:

cat <<EOF | tee test.yaml
apiVersion: batch/v1
kind: Job
metadata:
   name: habanalabs-gaudi-demo
spec:
   template:
      spec:
         hostIPC: true
         restartPolicy: OnFailure
         containers:
          - name: habana-ai-base-container
            image: vault.habana.ai/gaudi-docker/1.21.1/ubuntu24.04/habanalabs/pytorch-installer-2.6.0:latest
            workingDir: /root
            command: ["hl-smi"]
            securityContext:
               capabilities:
                  add: ["SYS_NICE"]
            resources:
               limits:
                  habana.ai/gaudi: 1
EOF

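kubectl apply -f test.yaml
# Wait for the job to finish and inspect the hl-smi output before cleaning up (the timeout is an example value).
kubectl wait --for=condition=complete job/habanalabs-gaudi-demo --timeout=300s
kubectl logs job/habanalabs-gaudi-demo
kubectl delete -f test.yaml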

Setup Image Registry for K8S Test

Create a docker image registry.

cat << EOF | tee registry.yaml
version: 0.1
log:
  fields:
    service: registry
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
  delete:
    enabled: true
http:
  addr: :5000
  headers:
    X-Content-Type-Options: [nosniff]
health:
  storagedriver:
    enabled: true
    interval: 10s
    threshold: 3
EOF

cd /scratch-1 # place to store the images
mkdir local_image_registry && chmod -R 777 local_image_registry
# The first -v path assumes registry.yaml was created in /home/sdp/workspace; adjust it to your location.
docker run -d -p 5000:5000 --restart=always --name registry \
    -v /home/sdp/workspace/registry.yaml:/etc/docker/registry/config.yml \
    -v /scratch-1/local_image_registry:/var/lib/registry \
    registry:2

Set up a cron job to clean up the docker registry: https://github.com/opea-project/Validation/blob/main/tools/image-registry/cleanup.sh
Set up the connection to the local registry.

For docker:

cat /etc/docker/daemon.json
# gaudi: 
{"runtimes": {"habana": {"path": "/usr/bin/habana-container-runtime", "runtimeArgs": []}}, "default-runtime": "habana", "insecure-registries" : [ "100.83.111.232:5000" ]}
# xeon: 
{"insecure-registries": ["100.83.111.232:5000"]}

# restart docker
sudo systemctl restart docker

# for test
docker pull opea/chatqna:latest
docker tag opea/chatqna:latest 100.83.111.232:5000/opea/chatqna:test
docker push 100.83.111.232:5000/opea/chatqna:test

For K8S:

# set up on the client side
cat /etc/containerd/config.toml
...
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://registry-1.docker.io"]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."100.83.111.232:5000"]
      endpoint = ["http://100.83.111.232:5000"]
...
# restart containerd
sudo systemctl restart containerd.service

# set up on the server side
cd /etc/containerd
sudo mkdir -p certs.d/100.83.111.232:5000 
cd certs.d/100.83.111.232:5000 
cat << EOF | sudo tee hosts.toml
server = "http://100.83.111.232:5000"
[host."http://100.83.111.232:5000"]
  capabilities = ["pull", "resolve", "push"]
EOF

# restart containerd
sudo systemctl restart containerd.service

# for test
docker pull opea/chatqna:latest
docker tag opea/chatqna:latest 100.83.111.232:5000/opea/chatqna:test
docker push 100.83.111.232:5000/opea/chatqna:test
sudo nerdctl -n k8s.io pull 100.83.111.232:5000/opea/chatqna:test
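
When several nodes need the same `hosts.toml`, the steps above can be parameterized by registry address. This is a sketch only: `write_hosts_toml` and its optional directory argument are hypothetical, and the file content mirrors the plain-HTTP example above.

```shell
# write_hosts_toml REGISTRY [CERTS_DIR]: create CERTS_DIR/REGISTRY/hosts.toml
# for a plain-HTTP registry. CERTS_DIR defaults to /etc/containerd/certs.d.
# (write_hosts_toml is a hypothetical helper name.)
write_hosts_toml() {
  local registry="$1" certs_dir="${2:-/etc/containerd/certs.d}"
  mkdir -p "$certs_dir/$registry"
  cat > "$certs_dir/$registry/hosts.toml" <<EOF
server = "http://$registry"
[host."http://$registry"]
  capabilities = ["pull", "resolve", "push"]
EOF
}
```

Writing under `/etc/containerd` needs root, so either call the function from a root shell or generate the file in a scratch directory and copy it into place with `sudo`. Restart containerd afterwards, as above.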

Set up the environment variable for CI/CD.

vi ~/.bashrc
export OPEA_IMAGE_REPO=100.83.111.232:5000/

Build and push images to the new local registry.
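
Because `OPEA_IMAGE_REPO` ends with a slash, building the target image name by hand is easy to get wrong. The sketch below derives the local registry reference from an upstream image name; `local_ref` is a hypothetical helper name, and it assumes tag-style references (digest references with `@sha256:` are not handled).

```shell
# local_ref REPO IMAGE [TAG]: print IMAGE re-homed under REPO, optionally re-tagged.
# REPO may end with a slash (as OPEA_IMAGE_REPO does). IMAGE without a tag
# is treated as ":latest". (local_ref is a hypothetical helper name.)
local_ref() {
  local repo="${1%/}" image="$2" new_tag="${3:-}"
  local last="${image##*/}"      # final path component, e.g. chatqna:latest
  local name="$image" tag="latest"
  case "$last" in
    *:*) tag="${last##*:}"; name="${image%:*}" ;;
  esac
  printf '%s/%s:%s\n' "$repo" "$name" "${new_tag:-$tag}"
}
```

For example, `docker tag opea/chatqna:latest "$(local_ref "$OPEA_IMAGE_REPO" opea/chatqna:latest test)"` produces the same `100.83.111.232:5000/opea/chatqna:test` name used in the test commands above, and the same expression can be passed to `docker push`.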

Setup GHA ENV for CI/CD

Set up a self-hosted runner for GHA by following the official steps.
Set up the environment variable for GHA.

vi ~/action_runner/.env
OPEA_IMAGE_REPO=100.83.111.232:5000/
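
When provisioning more than one runner, editing `.env` by hand can be replaced with an idempotent update. The following is a minimal sketch; `set_env_var` is a hypothetical helper name and is not part of the GitHub Actions runner.

```shell
# set_env_var FILE KEY VALUE: set KEY=VALUE in FILE by dropping any existing
# KEY= line and appending the new definition at the end.
# (set_env_var is a hypothetical helper name.)
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  touch "$file"
  tmp="$(mktemp)"
  # Keep every line that does not define KEY; grep exits 1 if nothing is left.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Overwrite in place so the original file's ownership and mode are kept.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Usage: `set_env_var ~/action_runner/.env OPEA_IMAGE_REPO 100.83.111.232:5000/`. Running it twice leaves a single `OPEA_IMAGE_REPO` line. A runner that is already running typically needs a restart to pick up the change.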

Start the runner as a service with svc.sh.

sudo ./svc.sh install # use svc.sh instead of run.sh
sudo ./svc.sh start
sudo ./svc.sh status # check that the service is running
sudo ./svc.sh stop # only when the runner needs to be stopped

Setup Action Runner Controller (ARC)

https://docs.github.com/en/actions/tutorials/quickstart-for-actions-runner-controller
For now, ARC is supported only on the Xeon K8S cluster.

Install the ARC. Make sure K8S and Helm are installed on your test machine.

NAMESPACE="opea-arc-systems"
helm install arc \
    --namespace "${NAMESPACE}" \
    --create-namespace \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

# uninstall the ARC (only if needed)
helm uninstall arc -n $NAMESPACE
kubectl delete namespace $NAMESPACE --grace-period=0 --force

Install a runner scale set. The runner image used in CI/CD is built from this dockerfile, and the configuration for the runner scale set can be found here.

RUNNER_SET_NAME="xeon"
RUNNERS_NAMESPACE="opea-runner-set-c1"
RUNNER_GROUP="opea-runner-set-1" # make sure this group has already been created in GHA before using this name.
GITHUB_CONFIG_URL="https://github.com/opea-project"
GITHUB_PAT="xxx" # personal access token for GHA with permission to create runners in the repo.
helm install "${RUNNER_SET_NAME}" \
    --namespace "${RUNNERS_NAMESPACE}" \
    --create-namespace \
    -f xeon_large.yaml \
    --set githubConfigUrl="${GITHUB_CONFIG_URL}" \
    --set githubConfigSecret.github_token="${GITHUB_PAT}" \
    --set runnerGroup="${RUNNER_GROUP}" \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set

Notes:
a. Make sure the nodes in the cluster have enough resources to run the runner pods.
b. Create the RUNNER_GROUP in GHA beforehand; it is used to group the runners.
c. Make sure the label used by nodeSelector is set: run kubectl label nodes opea-cicd-spr-0 runner-node=true, then verify with kubectl get nodes --show-labels.
d. Make sure /data2 exists for the model cache.

Clean up the ARC (if needed)

# clean up runner set
(
RUNNER_SET_NAME="xeon"
RUNNERS_NAMESPACE="opea-runner-set-c1"
helm uninstall $RUNNER_SET_NAME -n $RUNNERS_NAMESPACE
kubectl delete namespace $RUNNERS_NAMESPACE --grace-period=0 --force
)
# clean up ARC
(
NAMESPACE="opea-arc-systems"
helm uninstall arc -n $NAMESPACE
kubectl delete namespace $NAMESPACE --grace-period=0 --force
)
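
Since the cleanup force-deletes whole namespaces, it can be worth previewing the commands first. The sketch below wraps the same two commands in a dry-run switch; `run`, `teardown`, and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# teardown RELEASE NAMESPACE: uninstall a Helm release and delete its namespace.
# (run and teardown are hypothetical helper names.)
teardown() {
  local release="$1" ns="$2"
  run helm uninstall "$release" -n "$ns"
  run kubectl delete namespace "$ns" --grace-period=0 --force
}
```

With `export DRY_RUN=1`, calling `teardown xeon opea-runner-set-c1` followed by `teardown arc opea-arc-systems` prints the commands in the same order as the cleanup above (runner set first, then the ARC) without executing them; unset `DRY_RUN` to run them for real.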