[LongBench](https://github.com/THUDM/LongBench) is a bilingual, multitask benchmark for comprehensive assessment of the long-context understanding capabilities of large language models. LongBench includes two languages (Chinese and English) to provide a more comprehensive evaluation of large models' multilingual capabilities on long contexts. In addition, LongBench is composed of six major categories and twenty-one different tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.

In this guide, we evaluate the LongBench dataset with OPEA services on Intel hardware.

# 🚀 QuickStart

## Installation

```
pip install -r ../../../requirements.txt
```

## Launch an LLM Service

To set up an LLM, you can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) or [OPEA microservices](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation) to launch a service.

### Example 1: TGI

For example, the following command sets up the [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) model on Gaudi:

```
model=meta-llama/Llama-2-7b-hf
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run -p 8080:80 -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN=$hf_token \
    -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true \
    -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host \
    ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id $model --max-input-tokens 1024 \
    --max-total-tokens 2048
```

### Example 2: OPEA LLM

You can also set up a service with OPEA microservices. For example, you can refer to [native LLM](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation/native/langchain) for deployment on native Gaudi without any serving framework.

## Predict

Please set the environment variables first.

```
export ENDPOINT="http://{host_ip}:8080/generate"  # your LLM serving endpoint
export LLM_MODEL="meta-llama/Llama-2-7b-hf"
export BACKEND="tgi"          # "tgi" or "llm"
export DATASET="narrativeqa"  # refer to https://github.com/THUDM/LongBench/blob/main/task.md for the full task list
export MAX_INPUT_LENGTH=2048  # set the max input length according to your LLM service
```

Then generate the predictions on the dataset.

```
python pred.py \
    --endpoint ${ENDPOINT} \
    --model_name ${LLM_MODEL} \
    --backend ${BACKEND} \
    --dataset ${DATASET} \
    --max_input_length ${MAX_INPUT_LENGTH}
```

The predictions will be saved to `pred/{LLM_MODEL}/{DATASET}.jsonl`.

## Evaluate

Evaluate the predictions with the LongBench metrics.

```
git clone https://github.com/THUDM/LongBench
cd LongBench
pip install -r requirements.txt
python eval.py --model ${LLM_MODEL}
```

The evaluation results will be saved to `pred/{LLM_MODEL}/result.jsonl`.
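If you want to inspect the scores programmatically, the snippet below is a minimal sketch that loads the saved result file and prints the score for each evaluated task. It assumes the result file is a JSON mapping of task names to scores at the path produced in the previous step; adjust the path or parsing if your output layout differs.

```
# inspect_results.py -- minimal sketch, assuming eval.py wrote a JSON
# mapping of task name -> score to pred/{LLM_MODEL}/result.jsonl
import json
import os

llm_model = os.environ.get("LLM_MODEL", "meta-llama/Llama-2-7b-hf")
result_path = f"pred/{llm_model}/result.jsonl"  # path from the step above

with open(result_path) as f:
    scores = json.load(f)  # assumed format: {"narrativeqa": 18.7, ...}

for task, score in sorted(scores.items()):
    print(f"{task}: {score}")
```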