LongBench is a benchmark for bilingual, multitask, and comprehensive assessment of the long-context understanding capabilities of large language models. It covers two languages (Chinese and English) to provide a more comprehensive evaluation of large models' multilingual capabilities on long contexts. In addition, LongBench is composed of six major categories and twenty-one tasks, covering key long-text application scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
In this guide, we evaluate the LongBench dataset with OPEA services on Intel hardware.
🚀 QuickStart¶
Installation¶
pip install -r ../../../requirements.txt
Launch an LLM Service¶
To set up an LLM service, you can use tgi-gaudi or OPEA microservices.
Example 1: TGI¶
For example, the following command launches the meta-llama/Llama-2-7b-hf model on Gaudi:
model=meta-llama/Llama-2-7b-hf
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run -p 8080:80 -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN=$hf_token \
-e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true \
-e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host \
ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id $model --max-input-tokens 1024 \
--max-total-tokens 2048
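Once the container is running, you can optionally send a test request to confirm the TGI service responds (the prompt below is only a placeholder):
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}'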
Example 2: OPEA LLM¶
You can also set up a service with OPEA microservices.
For example, you can refer to the native LLM microservice for deployment on Gaudi without any serving framework.
Predict¶
Please set up the environment variables first.
export ENDPOINT="http://{host_ip}:8080/generate" # your LLM serving endpoint
export LLM_MODEL="meta-llama/Llama-2-7b-hf"
export BACKEND="tgi" # "tgi" or "llm"
export DATASET="narrativeqa" # see https://github.com/THUDM/LongBench/blob/main/task.md for the full task list
export MAX_INPUT_LENGTH=2048 # set the max input length according to your LLM service
Then generate predictions on the dataset:
python pred.py \
--endpoint ${ENDPOINT} \
--model_name ${LLM_MODEL} \
--backend ${BACKEND} \
--dataset ${DATASET} \
--max_input_length ${MAX_INPUT_LENGTH}
The predictions will be saved to “pred/{LLM_MODEL}/{DATASET}.jsonl”.
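To spot-check the output before scoring, you can print the first prediction record; the exact fields depend on what pred.py writes, so treat this as a quick sanity check rather than a format reference:
head -n 1 "pred/${LLM_MODEL}/${DATASET}.jsonl" | python -m json.tool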
Evaluate¶
Evaluate the predictions with the LongBench metrics:
git clone https://github.com/THUDM/LongBench
cd LongBench
pip install -r requirements.txt
python eval.py --model ${LLM_MODEL}
The evaluation results will be saved to “pred/{LLM_MODEL}/result.jsonl”.
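The result file is a small summary of the aggregated scores, so you can simply print it to inspect the numbers (the path follows the pattern above; adjust if eval.py writes a different filename):
cat "pred/${LLM_MODEL}/result.jsonl"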