Dataset

Dataset for CLIP

Caltech101

  • Create a folder named caltech-101/ under $DATA.

  • Download 101_ObjectCategories.tar.gz from http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz and extract the file under $DATA/caltech-101.

  • Download split_zhou_Caltech101.json from this link and put it under $DATA/caltech-101 (a scripted version of these steps follows the directory tree below).

The directory structure should look like

$DATA/
|-- caltech-101/
|   |-- 101_ObjectCategories/
|   |-- split_zhou_Caltech101.json
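
If you prefer to script these steps, a minimal sketch using only the Python standard library (the URL and paths are the ones listed above; the Caltech server may redirect, in which case fetch the archive in a browser and run only the extraction part):

    import os
    import tarfile
    import urllib.request

    # $DATA is your data root, as used throughout this document
    root = os.path.join(os.path.expandvars("$DATA"), "caltech-101")
    os.makedirs(root, exist_ok=True)

    url = "http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz"
    archive = os.path.join(root, "101_ObjectCategories.tar.gz")

    # Download and extract 101_ObjectCategories/ under $DATA/caltech-101
    urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(root)

    # split_zhou_Caltech101.json still needs to be downloaded manually into the same folder.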

mini-imagenet

  • Create a folder named mini-imagenet/ under $DATA.

  • Download the dataset from the mini-imagenet link and extract the training, validation, and test sets to $DATA/mini-imagenet.

  • Download classnames.txt from this link and put it under $DATA/mini-imagenet/. The class names are copied from CLIP. A quick sanity check of the resulting layout is sketched after the directory tree.

The directory structure should look like

$DATA/
|–– mini-imagenet/
|   |–– train/ # contains class folders like n01440764, n01443537, etc.
|   |–– val/
|   |–– test/
|   |-- classnames.txt
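
A quick sanity check of this layout (a sketch; it only assumes the folders and file listed above):

    import os

    root = os.path.join(os.path.expandvars("$DATA"), "mini-imagenet")

    # Count the class folders in each split
    for split in ("train", "val", "test"):
        split_dir = os.path.join(root, split)
        classes = [d for d in os.listdir(split_dir) if os.path.isdir(os.path.join(split_dir, d))]
        print(split, len(classes), "class folders")

    # classnames.txt maps folder names such as n01440764 to human-readable class names
    with open(os.path.join(root, "classnames.txt")) as f:
        print("classnames.txt entries:", sum(1 for line in f if line.strip()))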

MSCOCO2014

  • Create a folder named mscoco2014/ under $DATA.

  • Download the dataset from the MSCOCO website and extract the training and validation sets (train2014/ and val2014/) to $DATA/mscoco2014; a scripted download is sketched after the directory tree.

  • Download the JSON annotation files (data/mscoco2014/*.json) from https://www.kaggle.com/datasets/wangjilong/dataset-json/ and put them under $DATA/mscoco2014.

The directory structure should look like

$DATA/
|–– mscoco2014/
|   |–– train2014/ # COCO 2014 training images (*.jpg)
|   |–– val2014/
|   |-- captions_train.json
|   |-- coco_karpathy_test.json
|   |-- coco_karpathy_val.json
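
If you do not already have the images, the train2014/val2014 archives can be fetched from the official COCO site; a minimal sketch (the cocodataset.org URLs are the standard public mirrors, not part of the Kaggle annotations above):

    import os
    import urllib.request
    import zipfile

    root = os.path.join(os.path.expandvars("$DATA"), "mscoco2014")
    os.makedirs(root, exist_ok=True)

    # Download and unzip the official COCO 2014 image archives
    for name in ("train2014", "val2014"):
        url = f"http://images.cocodataset.org/zips/{name}.zip"
        archive = os.path.join(root, f"{name}.zip")
        urllib.request.urlretrieve(url, archive)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(root)  # creates $DATA/mscoco2014/train2014 and val2014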

Flickr

  • Create a folder named flickr/ under $DATA.

  • Download the dataset from Kaggle and extract the images to $DATA/flickr/flickr30k-images/.

  • Download the JSON annotation files (data/flickr/*.json) from https://www.kaggle.com/datasets/wangjilong/dataset-json/ and put them under $DATA/flickr. A quick way to inspect these files is sketched after the directory tree.

The directory structure should look like

$DATA/
|–– flickr/
|   |–– flickr30k-images/
|   |   |-- *.jpg
|   |-- flickr30k_train.json
|   |-- flickr30k_val.json
|   |-- flickr30k_test.json
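
The exact schema of these annotation files depends on the Kaggle upload, so it is worth peeking at them before training; a generic inspection sketch that assumes nothing about the field names:

    import json
    import os

    root = os.path.join(os.path.expandvars("$DATA"), "flickr")

    for split in ("train", "val", "test"):
        with open(os.path.join(root, f"flickr30k_{split}.json")) as f:
            data = json.load(f)
        # Print the record count and a glimpse of the first record
        first = data[0] if isinstance(data, list) and data else data
        print(split, len(data), "records; first record fields:", list(first))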

FlickrCN

  • Create a folder named flickrcn/ under $DATA.

  • Download the dataset from Kaggle and extract the images to $DATA/flickrcn/flickr30k-images/ (the same Flickr30k images as above).

  • Download Flickr30k-CN.zip from https://huggingface.co/datasets/OFA-Sys/chinese-clip-eval/resolve/main/Flickr30k-CN.zip and copy the Flickr30k-CN/*.jsonl files from the archive to $DATA/flickrcn.

The directory structure should look like

$DATA/
|–– flickrcn/
|   |–– flickr30k-images/
|   |   |-- *.jpg
|   |-- train_texts.jsonl
|   |-- val_texts.jsonl
|   |-- test_texts.jsonl

  • Run generate_flickr30k_cn_json.py --base_dir $DATA/flickrcn/ to generate usable JSON files (a minimal sketch of this conversion follows the final directory tree).

The final directory structure should look like

$DATA/
|–– flickrcn/
|   |–– flickr30k-images/
|   |   |-- *.jpg
|   |-- flickr30k_cn_train.json
|   |-- flickr30k_cn_val.json
|   |-- flickr30k_cn_test.json
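
generate_flickr30k_cn_json.py ships with the repo; as a rough picture of what the conversion amounts to, the sketch below simply turns each *_texts.jsonl file (one JSON object per line) into a single JSON list (the real script may additionally remap fields to match the English Flickr annotations):

    import json
    import os

    base_dir = os.path.join(os.path.expandvars("$DATA"), "flickrcn")

    # Convert each Chinese-CLIP style .jsonl file into a plain JSON list
    for split in ("train", "val", "test"):
        src = os.path.join(base_dir, f"{split}_texts.jsonl")
        dst = os.path.join(base_dir, f"flickr30k_cn_{split}.json")
        with open(src, encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        with open(dst, "w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
        print(dst, len(records), "records")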

Flickr5k

  • Create a folder named flickr5k/ under $DATA.

  • Download the dataset from Kaggle and extract the images to $DATA/flickr5k/flickr5k-images/.

  • Download the JSON annotation files (data/flickr5k/*.json) from https://www.kaggle.com/datasets/wangjilong/dataset-json/ and put them under $DATA/flickr5k.

The directory structure should look like

$DATA/
|–– flickr5k/
|   |–– flickr5k-images/
|   |   |-- *.jpg
|   |-- flickr5k_train.json
|   |-- flickr5k_val.json
|   |-- flickr5k_test.json

Dataset for AdaCLIP

MSRVTT

The videos are shared by Frozen in Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

DiDeMo

The videos can be downloaded from LisaAnne/LocalizingMoments.

ActivityNet

Download the videos from the official website. The authors have made the videos available on Google and Baidu drives.

Preprocessing

Frame Extraction

Run src/adaclip_finetune/utils/frame_extraction.py after having downloaded the dataset videos and annotations from the website. Make sure that all the videos are in the same directory (no sub-directories allowed).
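
If the downloaded videos ended up in nested sub-folders, flatten them first; a small sketch (the paths are placeholders, not repo defaults):

    import os
    import shutil

    src_root = "/path/to/downloaded/videos"  # may contain sub-directories
    dst_root = "/path/to/videos"             # flat directory expected by frame_extraction.py
    os.makedirs(dst_root, exist_ok=True)

    # Move every video file to the top-level directory
    for dirpath, _, files in os.walk(src_root):
        for name in files:
            if name.lower().endswith((".mp4", ".mkv", ".webm", ".avi")):
                shutil.move(os.path.join(dirpath, name), os.path.join(dst_root, name))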

python src/adaclip_finetune/utils/frame_extraction.py /path/to/videos /path/to/frames --parallel

Subsequently, update the frames_dir parameter in the config files configs/[dataset].json.

Annotation Preprocessing

If the videos downloaded differ from the set used in the paper, run annot_preprocess/{dataset}_preprocess.py to generate train/test splits used by the dataloader. Splits used in the paper can be found in annots/.

To obtain the annotation files used to generate the splits, please download them from the following links:

  • MSRVTT annotations are from CLIP4Clip:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

  • ActivityNet annotations are from the project page of ActivityNet Captions:

wget https://cs.stanford.edu/people/ranjaykrishna/densevid/captions.zip

Dataset for Qwen2-VL & Qwen2.5-VL Finetune

ActivityNet-QA

  • Please follow https://github.com/MILVLG/activitynet-qa/tree/master to download and separate the train/val datasets.

  • Then use the generate_llama_json_limit_frames.py script below to generate our train and test datasets:

    python generate_llama_json_limit_frames.py -name val_q -type val -n 500 -seconds 20
    

    generate_llama_json_limit_frames.py:

    import json
    import os
    import argparse
    import ffmpeg
    
    # Define the path to the directory where the video files are stored
    video_directory = "where to find dataset"
    
    
    def get_video_duration(video_path):
        try:
            probe = ffmpeg.probe(video_path)
            video_stream = next(stream for stream in probe["streams"] if stream["codec_type"] == "video")
            return float(video_stream["duration"])
        except Exception as e:
            print(f"Error getting duration for video {video_path}: {e}")
            return 0
    
    
    if __name__ == "__main__":
        # Parse command line arguments
        parser = argparse.ArgumentParser(description="Generate LLaMA JSON")
        parser.add_argument("-name", type=str, default="train_q_3000", help="Number of questions to process")
        parser.add_argument("-type", type=str, default="train", help="data type")
        parser.add_argument("-fps", type=float, default=0.2, help="data type")
        parser.add_argument("-n", type=int, default=250, help="data type")
        parser.add_argument("-seconds", type=int, default=20, help="minimum video duration in seconds")
        args = parser.parse_args()
        fps = args.fps
        basic_seconds = args.seconds
        question_json = "../activitynet-qa/dataset/{}.json".format(args.name)
        answer_json = "../activitynet-qa/dataset/{}_a.json".format(args.type)
        combine_json = "../data/activitynet_qa_{}_{}_limit_{}s.json".format(args.type, args.n, basic_seconds)
        print("combine_json:", combine_json)
    
        # Supported video file extensions
        video_extensions = (".mp4", ".mkv", "webm")
    
        # Load the questions and answers JSON files
        with open(question_json, "r") as question_file:
            questions = json.load(question_file)
    
        with open(answer_json, "r") as answer_file:
            answers = json.load(answer_file)
    
        # Create a dictionary to map question_id to answer for quick lookup
        answer_lookup = {answer["question_id"]: answer for answer in answers}
    
        combined_data = []
        len_pairs = len(questions)
        # Process each question and look for a corresponding answer
        for question in questions:
            question_id = question["question_id"]
            if question_id in answer_lookup:
                answer = answer_lookup[question_id]
    
                # question_id has the form "<video_name>_<index>"; drop the trailing index to recover the video name
                video_name_without_path = ("_").join(question_id.split("_")[:-1])
                # Search for the video file that matches the extracted name
                video_path = None
                find_flag = False
                # Walk through the directory to find matching video files
                for root, dirs, files in os.walk(video_directory):
                    for file in files:
                        if file.startswith(video_name_without_path) and file.endswith(video_extensions):
                            video_path = os.path.join(root, file)
                            find_flag = True
                            break
                    if video_path:
                        break
                if not find_flag:
                    print("!!not find:", video_name_without_path)
                if video_path:
                    video_duration = get_video_duration(video_path)
                    if video_duration > basic_seconds:
                        combined_entry = {
                            "messages": [
                                {"content": f"<video>{question['question']}?", "role": "user"},
                                {"content": answer["answer"], "role": "assistant"},
                            ],
                            "videos": [video_path],
                        }
                        combined_data.append(combined_entry)
                        if len(combined_data) % 100 == 0:
                            print(f"Processed {len(combined_data)} entries")
                        if len(combined_data) >= args.n:
                            break
                    else:
                        print("video_duration < basic_seconds", video_duration, video_path)
        # Write the combined data to the output JSON file
        with open(combine_json, "w") as combine_file:
            json.dump(combined_data, combine_file, indent=4)
    

Update dataset_info.json

Update dataset_info.json in your data dir:

dataset_info.json

{
  "caltech101": {
    "file_name": "caltech101.json"
  },
  "ActivityNet": {
    "file_name": "ActivityNet.json"
  },
  "activitynet_qa_2000_limit_20s": {
    "file_name": "activitynet_qa_2000_limit_20s.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "videos": "videos"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}

caltech101.json

[]

ActivityNet.json

[]

activitynet_qa_2000_limit_20s.json

Generated by generate_llama_json_limit_frames.py above.
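
For reference, each entry written by the script follows the ShareGPT-style layout that the columns and tags in dataset_info.json above refer to (the values here are purely illustrative):

    [
      {
        "messages": [
          {"content": "<video>what is the person doing?", "role": "user"},
          {"content": "playing the guitar", "role": "assistant"}
        ],
        "videos": ["/path/to/activitynet/videos/v_xxxxxxx.mp4"]
      }
    ]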