Metrics / visualization add-ons¶

Table of Contents

Pre-conditions
Device metrics for Gaudi HW
Extra metrics for OPEA applications
CPU metrics from PCM
Importing dashboards to Grafana
More dashboards

Pre-conditions¶

Monitoring for Helm-installed OPEA applications must already be working; see the Helm monitoring option.

Device metrics for Gaudi HW¶

To monitor Gaudi hardware metrics, you can use the following steps:

Step 1: Install daemonset¶

kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.16.2/metric-exporter-daemonset.yaml

Step 2: Install metric-exporter service¶

kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.16.2/metric-exporter-service.yaml

Step 3: Install service-monitor¶

kubectl apply -f ./habana/metric-exporter-serviceMonitor.yaml
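Before fetching metrics, it can help to confirm that the objects created in steps 1 to 3 exist. The following is a minimal sketch, assuming they all live in the `monitoring` namespace (the namespace queried in step 4); the function name is illustrative and not part of any existing script.

```shell
# List the daemonsets, services and serviceMonitors in a namespace.
# Assumption: the exporter objects were created in "monitoring".
list_gaudi_exporter_objects() {
  local ns="${1:-monitoring}"
  kubectl -n "$ns" get daemonsets,services,servicemonitors -o wide
}
```

Calling `list_gaudi_exporter_objects` without arguments lists the objects in `monitoring`; pass another namespace as the first argument if your install differs.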

Step 4: Verify the metrics¶

# Get the address (IP:port) of the first metric-exporter endpoint for testing
habana_metric_url=$(kubectl -n monitoring get ep metric-exporter -o jsonpath="{.subsets[0].addresses[0].ip}:{.subsets[0].ports[0].port}")
# Fetch the metrics
curl "${habana_metric_url}/metrics"

# The output should contain metric data like this:
process_resident_memory_bytes 2.9216768e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71394960963e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2.862641152e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 125
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
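The raw output mixes `# HELP` / `# TYPE` comment lines with sample lines, which makes it hard to see at a glance which metrics are exposed. As a rough sketch (the helper name `metric_names` is made up for this example), the sample lines can be reduced to a sorted list of unique metric names:

```shell
# Print the unique metric names found in Prometheus text-format input (stdin).
# Comment lines (starting with '#') and blank lines are skipped; labels and
# values are stripped from each remaining line.
metric_names() {
  awk '!/^#/ && NF { sub(/[{ ].*/, ""); print }' | sort -u
}
```

For example, `curl -s "${habana_metric_url}/metrics" | metric_names` lists what the exporter currently serves, and the result can be piped to `grep` to look for device-specific metrics.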

Step 5: Import the dashboard into Grafana¶

Import the Dashboard-Gaudi-HW.json file into Grafana.

Gaudi HW dashboard

Extra metrics for OPEA applications¶

Here are a few Grafana dashboards for monitoring additional aspects of OPEA applications:

queue_size_embedding_rerank_tgi.json: queue sizes of TGI-gaudi, TEI-Embedding and TEI-reranking
tgi_grafana.json: utilization of the tgi-gaudi text generation inference service

These can be imported to Grafana.

NOTE: Services provide metrics only after they have processed at least one query; before that, dashboards can be empty!
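If a dashboard stays empty, one way to tell "no queries processed yet" apart from a scraping problem is to look at the service's metrics endpoint directly. The sketch below counts sample lines for a given metric name prefix. The helper name `count_samples` is hypothetical, and the `tgi_` prefix used in the usage note is an assumption about how the TGI service names its metrics; check the actual output of your service.

```shell
# Count sample lines (not comments) whose metric name starts with a prefix.
# Reads Prometheus text-format data from stdin and prints the count.
count_samples() {
  local prefix="$1"
  awk -v p="$prefix" '!/^#/ && index($0, p) == 1 { n++ } END { print n + 0 }'
}
```

Usage would look like `curl -s "<service address>/metrics" | count_samples tgi_`, where the service address is whatever your deployment exposes. A result of `0` suggests the service has not emitted those metrics yet.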

TGI dashboard

CPU metrics from PCM¶

To monitor PCM (Intel® Performance Counter Monitor) metrics, you can use the following steps:

Step 1: Install PCM¶

Refer to the Intel® PCM repository for instructions on installing Intel® PCM.

Step 2: Modify & Install pcm-service¶

Modify the pcm/pcm-service.yaml file to set the addresses, then apply it:

kubectl apply -f pcm/pcm-service.yaml

Step 3: Install PCM serviceMonitor¶

kubectl apply -f pcm/pcm-serviceMonitor.yaml

Step 4: Install the PCM dashboard¶

Import the pcm-dashboard.json file into Grafana.

PCM dashboard

Importing dashboards to Grafana¶

You can either:

Import them manually to Grafana,
Use the update-dashboards.sh script to add them to Kubernetes as (more persistent) Grafana dashboard configMaps
- The script uses $USER-<file name> as dashboard configMap names, and overwrites any pre-existing configMap with the same name
Or create your own dashboards based on them

When a dashboard is imported to Grafana, you can save changes to it directly, but such dashboards go away if Grafana is removed or re-installed. When a dashboard is in a configMap, Grafana saves its changes to a (selected) file, and you need to re-apply that file to Kubernetes with the script for your changes to still be there when that Grafana dashboard page is reloaded in the browser.
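To make the configMap route more concrete, here is a rough sketch of what adding one dashboard file as a labeled configMap could look like. It is not the actual update-dashboards.sh script. It assumes the `monitoring` namespace and the `grafana_dashboard=1` label of a default Prometheus / Grafana install, and the exact name mangling (dropping `.json`, lower-casing, replacing characters that are invalid in Kubernetes object names) is a guess at one reasonable way to build a `$USER-<file name>` style name.

```shell
# Derive a configMap name from a dashboard file path.
# Assumption: "<user>-<file name without .json>", lower-cased, with characters
# that are not valid in Kubernetes object names replaced by '-'.
dashboard_cm_name() {
  local file="$1" user="${2:-${USER:-unknown}}"
  local base
  base="$(basename "$file" .json)"
  printf '%s-%s\n' "$user" "$base" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9.-]+/-/g'
}

# Create or overwrite a dashboard configMap from a JSON file and label it so
# that a default Grafana install can pick it up.
apply_dashboard_cm() {
  local file="$1" ns="${2:-monitoring}"
  local name
  name="$(dashboard_cm_name "$file")"
  kubectl -n "$ns" create configmap "$name" --from-file="$file" \
    --dry-run=client -o yaml | kubectl -n "$ns" apply -f -
  kubectl -n "$ns" label --overwrite configmap "$name" grafana_dashboard=1
}
```

With this sketch, `apply_dashboard_cm tgi_grafana.json` would create or update one configMap. Piping the `create --dry-run=client` output into `apply` is what lets the same command overwrite a pre-existing configMap instead of failing.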

Gotchas for dashboard configMap script usage:

If you change a dashboard file name, you also need to change its ‘uid’ field (at the end of the file); otherwise Grafana will see multiple configMaps for the same dashboard ID
If there’s no uid specified for the dashboard, Grafana will generate one on configMap load, meaning that the dashboard ID, and the Grafana URL to it, will change on every reload
The script assumes a default Prometheus / Grafana install (monitoring namespace, grafana_dashboard=1 label identifying dashboard configMaps)
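The first two gotchas can be checked before loading anything into Kubernetes. The sketch below is a text heuristic rather than a real JSON parser: it relies on the dashboard's own uid being the last `"uid"` key in the file, as described above, and both helper names are hypothetical.

```shell
# Print the value of the last "uid" key in a dashboard JSON file.
# Heuristic: earlier "uid" keys (for example data source references inside
# panels) are ignored by taking only the final match.
dashboard_uid() {
  grep -o '"uid"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
    | tail -n 1 \
    | sed -E 's/.*"([^"]*)"$/\1/'
}

# Print every uid that is shared by more than one of the given files.
duplicate_uids() {
  local f
  for f in "$@"; do
    printf '%s\n' "$(dashboard_uid "$f")"
  done | grep -v '^$' | sort | uniq -d
}
```

Running `duplicate_uids *.json` in a dashboard directory prints nothing when all uids are distinct. An empty result from `dashboard_uid` for a single file points to the second gotcha, a dashboard without a uid.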

More dashboards¶

The GenAIEval repository includes additional dashboards.