AdaCLIP-Finetune

This repo is the fine-tuning implementation for the paper “AdaCLIP: Towards Pragmatic Multimodal Video Retrieval”

Incorporating large image-text foundation models such as CLIP has substantially improved the performance of the multimodal video retrieval task. However, how to practically sample frames from a video and aggregate the frame features into a video representation remains an open research question. In particular, real-world deployment scenarios, such as embodiment within consumer electronics or cloud-based inference pipelines, require two key facets of retrieval (representation building and search) to be computationally light and fast. In this paper, we propose AdaCLIP, a computation- and latency-aware system for pragmatic multimodal video retrieval.

AdaCLIP consists of a learning-based frame selection module to select informative frames and a query-independent frame aggregation module to obtain strong video representations from the frame features. Specifically, in the frame selection module, we introduce a differentiable Hard-Top-k algorithm to sample a subset of the frames while optimizing the performance of the video retrieval task in an end-to-end manner. Moreover, to be latency-aware, we also propose a query-independent lightweight approach, MLP-Score, to aggregate the frame features into the video representation, which offers up to 142x speedup on GPU and 822x speedup on CPU in similarity search time compared to query-dependent matching methods. Experimental results on several popular video retrieval datasets confirm the effectiveness of AdaCLIP.
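
To make these two components concrete, below is a minimal, illustrative PyTorch sketch of a straight-through hard top-k selector and an MLP-Score-style aggregator. It is not the repository's implementation: the straight-through trick, the module names, and the shapes are our own assumptions about one way to build such modules.

    import torch
    import torch.nn as nn

    def hard_topk_st(scores: torch.Tensor, k: int) -> torch.Tensor:
        """Straight-through hard top-k: the forward pass uses a hard 0/1 mask,
        the backward pass routes gradients through the soft scores."""
        soft = torch.softmax(scores, dim=-1)                 # differentiable surrogate
        idx = soft.topk(k, dim=-1).indices
        hard = torch.zeros_like(soft).scatter(-1, idx, 1.0)  # hard 0/1 selection mask
        return hard + soft - soft.detach()                   # hard forward, soft backward

    class MLPScoreAggregator(nn.Module):
        """Query-independent aggregation: a small MLP scores each frame feature,
        and the video embedding is the score-weighted sum of the frame features."""
        def __init__(self, dim: int, hidden: int = 512):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            # frame_feats: (batch, num_frames, dim)
            w = torch.softmax(self.mlp(frame_feats).squeeze(-1), dim=-1)  # (batch, num_frames)
            return torch.einsum("bn,bnd->bd", w, frame_feats)             # (batch, dim)

    # Toy usage: select k of 32 frames, then aggregate their features.
    feats = torch.randn(2, 32, 512, requires_grad=True)  # CLIP frame features
    sal = torch.randn(2, 32, requires_grad=True)         # learned saliency scores
    mask = hard_topk_st(sal, k=16)                       # (2, 32) hard 0/1 mask
    # Masking is a simplification; a real implementation would gather the k frames.
    video_emb = MLPScoreAggregator(512)(feats * mask.unsqueeze(-1))
    video_emb.sum().backward()                           # gradients flow through the selection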

Prerequisites

How to Install

Install on NVIDIA

Create a conda environment and install the appropriate packages:

conda create -n adaclip_py310_nv python=3.10 -y
conda activate adaclip_py310_nv
conda install -y pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 cudatoolkit=11.7 -c pytorch -c conda-forge
pip install -r requirements.txt

Install on Arc A770

Install Driver for Arc A770

Please follow Install Dependency to install the driver for the Arc A770.

Install oneAPI

You can refer to https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/dfc4a434-838c-4450-a6fe-2fa903b75aa7/intel-oneapi-base-toolkit-2025.0.1.46_offline.sh
sudo sh ./intel-oneapi-base-toolkit-2025.0.1.46_offline.sh -a --silent --cli --eula accept

Create a conda environment, install IPEX and other libraries

Create a conda environment

conda create -n adaclip_py310 python=3.10 -y
conda activate adaclip_py310

Install IPEX

You can refer to https://github.com/intel/intel-extension-for-pytorch

python -m pip install torch==2.5.1+cxx11.abi torchvision==0.20.1+cxx11.abi torchaudio==2.5.1+cxx11.abi intel-extension-for-pytorch==2.5.10+xpu oneccl_bind_pt==2.5.0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

Check that the XPU is visible:

python
import torch
import intel_extension_for_pytorch
torch.xpu.device_count()
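# If the installation succeeded, this prints the number of detected XPU devices (e.g., 1)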

Install requirements

pip install -r requirements.txt

Prepare Datasets

Datasets

We mainly use ActivityNet for fine-tuning; you can also use other datasets.

The training data information is located in the directories src/llamafactory/adaclip_finetune/annots-finetune and src/llamafactory/adaclip_finetune/annots/. To change the finetuning and validation datasets, you can modify the dataset, train_annot, and val_annot paths in the finetune configurations found under src/llamafactory/adaclip_finetune/cfgs.

We primarily use the src/llamafactory/adaclip_finetune/annots-finetune/activitynet/finetune-5000.json file for fine-tuning.

For validation during the fine-tuning process, we utilize the src/llamafactory/adaclip_finetune/annots/activitynet/val.json file. Validation is performed at the end of each epoch; the validation metric is the best recall (R@1 and R@5).

ActivityNet

Download the videos from the official website. The authors have made the videos available on Google and Baidu drives.

MSRVTT

The videos are shared by Frozen in Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

DiDeMo

The videos can be downloaded from LisaAnne/LocalizingMoments.

Frame Extraction

Run utils/frame_extraction.py after having downloaded the dataset videos and annotations from the website. Make sure that all the videos are in the same directory (no sub-directories allowed).

python utils/frame_extraction.py /path/to/videos /path/to/frames --parallel
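
For reference, the sketch below shows the general idea of frame extraction with OpenCV: decode each video and dump sampled frames into a per-video sub-directory. It is an illustrative stand-in, not the actual code of utils/frame_extraction.py; the sampling rate, naming scheme, and output layout are assumptions.

    import os, sys
    import cv2  # pip install opencv-python

    def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> None:
        """Decode `video_path` and save roughly `fps` frames per second as JPEGs."""
        name = os.path.splitext(os.path.basename(video_path))[0]
        dst = os.path.join(out_dir, name)
        os.makedirs(dst, exist_ok=True)
        cap = cv2.VideoCapture(video_path)
        native_fps = cap.get(cv2.CAP_PROP_FPS) or 30
        step = max(int(round(native_fps / fps)), 1)  # keep every `step`-th frame
        i = saved = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if i % step == 0:
                cv2.imwrite(os.path.join(dst, f"{saved:06d}.jpg"), frame)
                saved += 1
            i += 1
        cap.release()

    if __name__ == "__main__":
        video_dir, frame_dir = sys.argv[1], sys.argv[2]
        for fn in os.listdir(video_dir):
            extract_frames(os.path.join(video_dir, fn), frame_dir)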

Implemented finetune methods

We have implemented the BitFit and IBS fine-tuning methods.

To fine-tune using different methods, you can use the corresponding configuration files located under src/llamafactory/adaclip_finetune/cfgs/peft. For a more detailed guide, please refer to the How to Finetune section.

BitFit & SSF

BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning
Revisiting Batch Normalization For Practical Domain Adaptation

Example

    "peft": {
        "method": "bitfit",
        "config": {
            "keep_module_keywords": [
                "ln_post",
                "visual.proj",
                "ln_final",
                "text_projection",
                "logit_scale"
            ]
        }
    }

Config path: src/llamafactory/adaclip_finetune/cfgs/peft/activitynet-bitfit-5k.json
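
For intuition, BitFit boils down to freezing everything except bias terms, plus whatever is listed in keep_module_keywords. A minimal sketch of how such a filter could be applied to a CLIP-style model is shown below; the keyword list mirrors the config above, but apply_bitfit itself is illustrative, not the repo's implementation.

    import torch.nn as nn

    def apply_bitfit(model: nn.Module, keep_keywords: list) -> None:
        """Freeze all parameters except biases and any parameter whose name
        contains one of `keep_keywords` (projections, final LayerNorms, ...)."""
        for name, param in model.named_parameters():
            is_bias = name.endswith(".bias")
            is_kept = any(kw in name for kw in keep_keywords)
            param.requires_grad = is_bias or is_kept

    # Matches the "keep_module_keywords" list in the config above.
    KEEP = ["ln_post", "visual.proj", "ln_final", "text_projection", "logit_scale"]
    # apply_bitfit(clip_model, KEEP)  # then hand only trainable params to the optimizer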

TODO: check if the naive recursive monkey patch has problems.

Importance Based Selection (IBS)

IBS selects partial layers for fine-tuning based on the parameter updates observed after training for a given number of steps/epochs. The importance metric can be either the L2 norm of the parameter updates or an angle-based measure, which is introduced in the following paper: Angle-based Search Space Shrinking for Neural Architecture Search
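
Conceptually, IBS first runs a short pre-training phase (num_pre_epochs with pre_batch_size), measures how much each parameter moved, and then keeps only the top retain_ratio fraction trainable. A simplified sketch of the l2norm variant, with hypothetical helper names:

    import torch.nn as nn

    def select_params_by_update(model: nn.Module, init_state: dict,
                                retain_ratio: float, normalization: bool = True):
        """Rank parameters by the L2 norm of their update since `init_state`
        and return the names of the top `retain_ratio` fraction."""
        scores = {}
        for name, param in model.named_parameters():
            delta = (param.detach() - init_state[name]).norm(p=2)
            if normalization:
                delta = delta / (init_state[name].norm(p=2) + 1e-12)  # relative change
            scores[name] = delta.item()
        k = max(int(len(scores) * retain_ratio), 1)
        return set(sorted(scores, key=scores.get, reverse=True)[:k])

    # Overall flow (sketch):
    # init_state = {n: p.detach().clone() for n, p in model.named_parameters()}
    # ... train for `num_pre_epochs` epochs with batch size `pre_batch_size` ...
    # kept = select_params_by_update(model, init_state, retain_ratio=0.05)
    # for name, p in model.named_parameters():
    #     p.requires_grad = (name in kept)  # plus keep_module_keywords, as in BitFit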

Example

    "peft": {
        "method": "ibs",
        "config": {
            "pre_batch_size": 8,
            "num_pre_epochs": 2,
            "retain_ratio": 0.05,
            "metric": "l2norm",
            "normalization": true,
            "keep_module_keywords": [
                "ln_post",
                "visual.proj",
                "ln_final",
                "text_projection",
                "logit_scale"
            ]
        }
    }

Config paths:
src/llamafactory/adaclip_finetune/cfgs/peft/activitynet-ibs-r005-5k.json
src/llamafactory/adaclip_finetune/cfgs/peft/activitynet-ibs-r010-5k.json

Performance of different finetune methods

| Finetune method | # frames | Top-k | Epochs | Batch size | LR (Main/CLIP) | % params | # train | # test | T2V: R1/R5 | V2T: R1/R5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full Finetune | 32 | 16 | 30 | 16 | 1e-4/1e-7 | 100 | 5000 | 4917 | 37.9/68.2 | 38.5/69.3 |
| IBS-G Finetune | 32 | 16 | 30 | 16 | 1e-4/1e-7 | 8.314 | 5000 | 4917 | 36.8/67.4 | 38.4/68.3 |
| BitFit Finetune | 32 | 16 | 30 | 16 | 1e-4/2e-5 | 0.516 | 5000 | 4917 | 36.3/66.2 | 37.7/68.4 |

How to Finetune

You can finetune AdaCLIP by using src/llamafactory/adaclip_finetune/train.py and configs under src/llamafactory/adaclip_finetune/cfgs.

The bitfit and ibs configs are under src/llamafactory/adaclip_finetune/cfgs/peft.

The full finetune config is under src/llamafactory/adaclip_finetune/cfgs/finetune.

You can modify the config JSONs to meet your requirements.

Finetune on NVIDIA

cd src/llamafactory/adaclip_finetune

Finetune AdaCLIP with bitfit

python train.py --config src/llamafactory/adaclip_finetune/cfgs/peft/activitynet-bitfit-5k.json --frames_dir /path/to/ActivityNet/frames --top_k 16 --freeze_cnn --frame_agg mlp --resume /path/to/pre-train/model --batch_size 8

Finetune AdaCLIP with ibs

python train.py --config src/llamafactory/adaclip_finetune/cfgs/peft/activitynet-ibs-r005-5k.json (or activitynet-ibs-r010-5k.json) --frames_dir /path/to/ActivityNet/frames --top_k 16 --freeze_cnn --frame_agg mlp --resume /path/to/pre-train/model --batch_size 8

Full finetune

python train.py --config src/llamafactory/adaclip_finetune/cfgs/finetune/activitynet-finetune-5000-c-32.json --frames_dir /path/to/ActivityNet/frames --top_k 16 --freeze_cnn --frame_agg mlp --resume /path/to/pretrain/model --batch_size 8

Finetune on Arc A770

Currently, only single-card fine-tuning is supported. You can specify the XPU card with the following command:

export ZE_AFFINITY_MASK=<card_number>
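
For example, to run on the first card:

export ZE_AFFINITY_MASK=0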

Enter the AdaCLIP folder:

cd src/llamafactory/adaclip_finetune

Finetune AdaCLIP with bitfit

python train.py --config src/llamafactory/adaclip_finetune/cfgs/peft/activitynet-bitfit-5k.json --frames_dir /path/to/ActivityNet/frames --top_k 16 --freeze_cnn --frame_agg mlp --resume /path/to/pretrain/model --xpu --batch_size 8

Finetune AdaCLIP with ibs

python train.py --config src/llamafactory/adaclip_finetune/cfgs/peft/activitynet-ibs-r005-5k.json (or activitynet-ibs-r010-5k.json) --frames_dir /path/to/ActivityNet/frames --top_k 16 --freeze_cnn --frame_agg mlp --resume /path/to/pretrain/model --xpu --batch_size 8

Full finetune

python train.py --config src/llamafactory/adaclip_finetune/cfgs/finetune/activitynet-finetune-5000-c-32.json --frames_dir /path/to/ActivityNet/frames --top_k 16 --freeze_cnn --frame_agg mlp --resume /path/to/pretrain/model --xpu --batch_size 8

The fine-tuning output will be located in src/llamafactory/adaclip_finetune/output.

Use Optuna to automatically find the best parameters

You can enable Optuna to automatically search for the best parameters by adding an optuna_cfg section to the config file, for example:

    "optuna_cfg": {
        "n_trials": 30,
        "n_warmup_steps":10,
        "sampler": {
            "name": "TPESampler"
        },
        "opt_params": {
            "coef_lr": {
                "range": [0.02,0.5],
                "log": false
            },
            "weight_decay": {
                "range": [0.01,0.5],
                "log": false
            }
        }
    }

The config example is: src/llamafactory/adaclip_finetune/cfgs/peft/activitynet-bitfit-5k-optuna.json
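
For orientation, an optuna_cfg like the one above maps directly onto Optuna's Python API. Here is a rough, illustrative sketch of how such a study could be wired up; train_one_epoch is a hypothetical stand-in for the actual training hook in train.py:

    import optuna

    def objective(trial: optuna.Trial) -> float:
        # Each "opt_params" entry becomes a suggest_float with its range/log flag.
        coef_lr = trial.suggest_float("coef_lr", 0.02, 0.5, log=False)
        weight_decay = trial.suggest_float("weight_decay", 0.01, 0.5, log=False)
        best_recall = 0.0
        for epoch in range(30):
            recall = train_one_epoch(coef_lr, weight_decay)  # hypothetical training step
            best_recall = max(best_recall, recall)
            trial.report(recall, epoch)
            if trial.should_prune():                 # pruning kicks in after n_warmup_steps
                raise optuna.TrialPruned()
        return best_recall

    study = optuna.create_study(
        direction="maximize",
        sampler=optuna.samplers.TPESampler(),
        pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
        storage="sqlite:///optuna.db",               # same DB the dashboard reads
    )
    study.optimize(objective, n_trials=30)
    print(study.best_params)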

| Config name | Description |
| --- | --- |
| n_trials | The maximum number of trials. Must be set to an integer. |
| n_warmup_steps | Pruning is disabled until the trial exceeds the given number of steps (epochs). Note that this feature assumes that steps start at zero. |
| sampler | The sampler Optuna uses. TPESampler, CmaEsSampler, and GPSampler are currently supported. |
| opt_params | The parameters you want to optimize. |

| Configs of opt_params | Description |
| --- | --- |
| range | The minimum and maximum value of the parameter. |
| log | A flag that controls whether the value is sampled from the log domain. If log is true, the value is sampled from the range in the log domain; otherwise, it is sampled from the range in the linear domain. |

If you want to continue training models with the best parameters after the Optuna optimization, add --do_training_af_optuna to your command line.

Command example:

cd src/llamafactory/adaclip_finetune
python train.py --config ./cfgs/peft/activitynet-bitfit-5k-c-32_optuna.json --frames_dir /path/to/ActivityNet/frames --top_k 16 --freeze_cnn --frame_agg mlp --resume /path/to/pre-train/model --xpu --batch_size 8

Visualization

You can review the Optuna tuning results with:

sudo ufw allow 8084
optuna-dashboard --host 0.0.0.0 --port 8084 sqlite:///optuna.db

Open in a browser:

http://<serverIP>:8084/dashboard

You can see fine-tuning curves for different parameters and other information on the dashboard.