Setup Test Environment for CI/CD¶
This document outlines the steps to set up a test environment for OPEA CI/CD from scratch. The environment is used to run tests and ensure code quality before PR merges and releases.
Install Habana Driver (Gaudi Only)¶
Driver and software installation: https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html
Firmware upgrade: https://docs.habana.ai/en/latest/Installation_Guide/Firmware_Upgrade.html
Install Docker¶
sudo apt update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
sudo systemctl enable docker.service
sudo systemctl daemon-reload
sudo systemctl start docker
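To verify the installation, run a quick smoke test (hello-world is Docker's standard public test image):
sudo docker run --rm hello-world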
Troubleshooting Docker Installation¶
Issue: E: Unable to locate package docker-compose-plugin
Solution:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-compose-plugin
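To confirm the plugin is now available:
docker compose version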
Issue: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.45/containers/json": dial unix /var/run/docker.sock: connect: permission denied
Solution:
# option 1: add your user to the docker group (replace xxx with your username)
sudo usermod -a -G docker xxx
# option 2: open the socket to all users (quick but insecure)
sudo chmod 666 /var/run/docker.sock
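Note that the group membership from option 1 only takes effect in a new login session; newgrp can pick it up in the current shell:
newgrp docker
docker ps # should now work without sudo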
Issue: the open-file limit (ulimit -n) for containerd is too low. [optional]
Solution:
sudo mkdir -p /etc/systemd/system/containerd.service.d
cat << EOF | sudo tee /etc/systemd/system/containerd.service.d/override.conf
[Service]
LimitNOFILE=infinity
EOF
sudo systemctl daemon-reload
sudo systemctl restart containerd.service
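To confirm the override took effect:
systemctl show containerd.service -p LimitNOFILE # expect LimitNOFILE=infinity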
Issue: raise the maximum number of memory-mapped areas a process can have (required by some components, e.g. Elasticsearch). [optional]
Solution:
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
sysctl vm.max_map_count # check
Install Conda¶
Conda is used to set up the e2e test environment.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
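As a minimal sketch of creating a dedicated e2e test env (the env name and Python version here are arbitrary examples, not OPEA requirements):
source ~/miniconda3/bin/activate
conda create -y -n opea-e2e python=3.10
conda activate opea-e2e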
Install K8S¶
Use kubeadm to set up the K8S cluster: https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubeadm.md
Install the Habana plugins (Gaudi only): https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Kubernetes_Installation/Intel_Gaudi_Kubernetes_Device_Plugin.html
Some Test Code after Installation¶
kubectl get nodes -o wide
kubectl get pods -A
kubectl get cs # componentstatuses; deprecated in recent K8S versions but still informative
kubectl describe node <node_name>
kubectl describe pod <pod_name>
Test for Gaudi:
cat <<EOF | tee test.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: habanalabs-gaudi-demo
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: habana-ai-base-container
          image: vault.habana.ai/gaudi-docker/1.21.1/ubuntu24.04/habanalabs/pytorch-installer-2.6.0:latest
          workingDir: /root
          command: ["hl-smi"]
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 1
EOF
kubectl apply -f test.yaml
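# check the job output before cleanup; hl-smi should list the Gaudi device(s)
kubectl logs job/habanalabs-gaudi-demo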
kubectl delete -f test.yaml
Setup Image Registry for K8S Test¶
Create a docker image registry.
cat << EOF | tee registry.yaml
version: 0.1
log:
  fields:
    service: registry
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
  delete:
    enabled: true
http:
  addr: :5000
  headers:
    X-Content-Type-Options: [nosniff]
health:
  storagedriver:
    enabled: true
    interval: 10s
    threshold: 3
EOF
cd /scratch-1 # place to store the images
mkdir local_image_registry && chmod -R 777 local_image_registry
# adjust the registry.yaml path below to wherever you saved it
docker run -d -p 5000:5000 --restart=always --name registry -v /home/sdp/workspace/registry.yaml:/etc/docker/registry/config.yml -v /scratch-1/local_image_registry:/var/lib/registry registry:2
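To verify the registry is serving (via the Registry HTTP API v2):
curl http://localhost:5000/v2/_catalog # expect {"repositories":[]} on a fresh registry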
Set up a cron job to clean up the Docker registry periodically: https://github.com/opea-project/Validation/blob/main/tools/image-registry/cleanup.sh
Set up the connection to the local registry.
For docker:
cat /etc/docker/daemon.json
# gaudi (100.83.111.232:5000 is this setup's registry address; replace it with yours):
{
  "runtimes": {"habana": {"path": "/usr/bin/habana-container-runtime", "runtimeArgs": []}},
  "default-runtime": "habana",
  "insecure-registries": ["100.83.111.232:5000"]
}
# xeon:
{"insecure-registries": ["100.83.111.232:5000"]}
# restart docker
sudo systemctl restart docker
# for test
docker pull opea/chatqna:latest
docker tag opea/chatqna:latest 100.83.111.232:5000/opea/chatqna:test
docker push 100.83.111.232:5000/opea/chatqna:test
For K8S:
# setup on the client side
cat /etc/containerd/config.toml
...
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
    endpoint = ["https://registry-1.docker.io"]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."100.83.111.232:5000"]
    endpoint = ["http://100.83.111.232:5000"]
...
# restart containerd
sudo systemctl restart containerd.service
# setup on the server side
cd /etc/containerd
sudo mkdir -p certs.d/100.83.111.232:5000
cd certs.d/100.83.111.232:5000
cat << EOF | sudo tee hosts.toml
server = "http://100.83.111.232:5000"
[host."http://100.83.111.232:5000"]
capabilities = ["pull", "resolve", "push"]
EOF
# restart containerd
sudo systemctl restart containerd.service
# for test
docker pull opea/chatqna:latest
docker tag opea/chatqna:latest 100.83.111.232:5000/opea/chatqna:test
docker push 100.83.111.232:5000/opea/chatqna:test
sudo nerdctl -n k8s.io pull 100.83.111.232:5000/opea/chatqna:test
Set up the ENV for CI/CD.
vi ~/.bashrc
export OPEA_IMAGE_REPO=100.83.111.232:5000/
Build and push images to the new local registry.
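For example, a minimal sketch of the flow (opea/example is a placeholder image name, not an actual CI image):
source ~/.bashrc
docker build -t ${OPEA_IMAGE_REPO}opea/example:latest .
docker push ${OPEA_IMAGE_REPO}opea/example:latest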
Setup GHA ENV for CI/CD¶
Set up a self-hosted runner for GHA, following the official steps.
Set up the ENV for GHA.
vi ~/action_runner/.env
OPEA_IMAGE_REPO=100.83.111.232:5000/
Start the runner as a systemd service with svc.sh.
sudo ./svc.sh install # use svc.sh instead of run.sh
sudo ./svc.sh start
sudo ./svc.sh status
sudo ./svc.sh stop
Setup Actions Runner Controller (ARC)¶
https://docs.github.com/en/actions/tutorials/quickstart-for-actions-runner-controller
For now, using ARC is only supported on the Xeon K8S cluster.
Install the ARC. Make sure K8S and Helm are installed on your test machine.
NAMESPACE="opea-arc-systems"
helm install arc \
--namespace "${NAMESPACE}" \
--create-namespace \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
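To confirm the controller came up:
helm list -n "${NAMESPACE}"
kubectl get pods -n "${NAMESPACE}" # the controller pod should be Running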
# to uninstall the ARC later:
helm uninstall arc -n $NAMESPACE
kubectl delete namespace $NAMESPACE --grace-period=0 --force
Install a runner scale set. The runner image we use in CI/CD is built by this Dockerfile, and the config settings for the runner scale set can be found here.
RUNNER_SET_NAME="xeon"
RUNNERS_NAMESPACE="opea-runner-set-c1"
RUNNER_GROUP="opea-runner-set-1" # before using this name, make sure the group has been created in GHA.
GITHUB_CONFIG_URL="https://github.com/opea-project"
GITHUB_PAT="xxx" # a personal access token with permission to create runners in the organization.
helm install "${RUNNER_SET_NAME}" \
--namespace "${RUNNERS_NAMESPACE}" \
--create-namespace \
-f xeon_large.yaml \
--set githubConfigUrl="${GITHUB_CONFIG_URL}" \
--set githubConfigSecret.github_token="${GITHUB_PAT}" \
--set runnerGroup="${RUNNER_GROUP}" \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
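To confirm the runner scale set installed:
helm list -A
kubectl get pods -n "${RUNNERS_NAMESPACE}" # runner pods appear here once jobs are scheduled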
Notes:
a. Make sure the nodes in the cluster have enough resources to run the runner pods.
b. Create the special RUNNER_GROUP in GHA, which is used to group the runners.
c. Make sure you have set a label for nodeSelector, e.g. with kubectl label nodes opea-cicd-spr-0 runner-node=true, and use kubectl get nodes --show-labels to check the labels.
d. Make sure you have /data2 for the model cache (see the sketch below).
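For reference, a sketch of how the node label and model-cache path from the notes above might appear in the pod template section of xeon_large.yaml (the structure follows the gha-runner-scale-set chart values; the volume name and hostPath are assumptions, not the actual CI config):
template:
  spec:
    nodeSelector:
      runner-node: "true"
    containers:
      - name: runner
        image: <runner-image> # the image built from the Dockerfile above
        command: ["/home/runner/run.sh"]
        volumeMounts:
          - name: model-cache # hypothetical volume name
            mountPath: /data2
    volumes:
      - name: model-cache
        hostPath:
          path: /data2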
Clean up the ARC (if needed):
# clean up runner set
(
RUNNER_SET_NAME="xeon"
RUNNERS_NAMESPACE="opea-runner-set-c1"
helm uninstall $RUNNER_SET_NAME -n $RUNNERS_NAMESPACE
kubectl delete namespace $RUNNERS_NAMESPACE --grace-period=0 --force
)
# clean up ARC
(
NAMESPACE="opea-arc-systems"
helm uninstall arc -n $NAMESPACE
kubectl delete namespace $NAMESPACE --grace-period=0 --force
)