Dataset

Dataset for CLIP

Caltech101

  • Create a folder named caltech-101/ under $DATA.

  • Download 101_ObjectCategories.tar.gz from http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz and extract the file under $DATA/caltech-101.

  • Download split_zhou_Caltech101.json from this link and put it under $DATA/caltech-101 (a scripted version of these steps follows the directory tree below).

The directory structure should look like

$DATA/
|-- caltech-101/
|   |-- 101_ObjectCategories/
|   |-- split_zhou_Caltech101.json
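
If you prefer to script these steps, a minimal sketch using only the Python standard library (the URL and paths are the ones listed above; the Caltech server may redirect, in which case fetch the archive in a browser and run only the extraction part):

    import os
    import tarfile
    import urllib.request

    # $DATA is your data root, as used throughout this document
    root = os.path.join(os.path.expandvars("$DATA"), "caltech-101")
    os.makedirs(root, exist_ok=True)

    url = "http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz"
    archive = os.path.join(root, "101_ObjectCategories.tar.gz")

    # Download and extract 101_ObjectCategories/ under $DATA/caltech-101
    urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(root)

    # split_zhou_Caltech101.json still needs to be downloaded manually into the same folder.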

mini-imagenet

  • Create a folder named mini-imagenet/ under $DATA.

  • Download the dataset from the mini-imagenet link and extract the training, validation, and test sets to $DATA/mini-imagenet.

  • Download classnames.txt from this link and put it under $DATA/mini-imagenet/. The class names are copied from CLIP. A quick sanity check of the resulting layout is sketched after the directory tree.

The directory structure should look like

$DATA/
|–– mini-imagenet/
|   |–– train/ # contains class folders like n01440764, n01443537, etc.
|   |–– val/
|   |–– test/
|   |-- classnames.txt
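
A quick sanity check of this layout (a sketch; it only assumes the folders and file listed above):

    import os

    root = os.path.join(os.path.expandvars("$DATA"), "mini-imagenet")

    # Count the class folders in each split
    for split in ("train", "val", "test"):
        split_dir = os.path.join(root, split)
        classes = [d for d in os.listdir(split_dir) if os.path.isdir(os.path.join(split_dir, d))]
        print(split, len(classes), "class folders")

    # classnames.txt maps folder names such as n01440764 to human-readable class names
    with open(os.path.join(root, "classnames.txt")) as f:
        print("classnames.txt entries:", sum(1 for line in f if line.strip()))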

MSCOCO2014

  • Create a folder named mscoco2014/ under $DATA.

  • Download the dataset from the MSCOCO website and extract the training and validation sets (train2014/ and val2014/) to $DATA/mscoco2014; a scripted download is sketched after the directory tree.

  • Download the JSON annotation files (data/mscoco2014/*.json) from https://www.kaggle.com/datasets/wangjilong/dataset-json/ and put them under $DATA/mscoco2014.

The directory structure should look like

$DATA/
|–– mscoco2014/
|   |–– train2014/ # COCO 2014 training images (*.jpg)
|   |–– val2014/
|   |-- captions_train.json
|   |-- coco_karpathy_test.json
|   |-- coco_karpathy_val.json
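
If you do not already have the images, the train2014/val2014 archives can be fetched from the official COCO site; a minimal sketch (the cocodataset.org URLs are the standard public mirrors, not part of the Kaggle annotations above):

    import os
    import urllib.request
    import zipfile

    root = os.path.join(os.path.expandvars("$DATA"), "mscoco2014")
    os.makedirs(root, exist_ok=True)

    # Download and unzip the official COCO 2014 image archives
    for name in ("train2014", "val2014"):
        url = f"http://images.cocodataset.org/zips/{name}.zip"
        archive = os.path.join(root, f"{name}.zip")
        urllib.request.urlretrieve(url, archive)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(root)  # creates $DATA/mscoco2014/train2014 and val2014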

Flickr

  • Create a folder named flickr/ under $DATA.

  • Download the dataset from Kaggle and extract the images to $DATA/flickr/flickr30k-images/.

  • Download the JSON annotation files (data/flickr/*.json) from https://www.kaggle.com/datasets/wangjilong/dataset-json/ and put them under $DATA/flickr. A quick way to inspect these files is sketched after the directory tree.

The directory structure should look like

$DATA/
|–– flickr/
|   |–– flickr30k-images/
|   |   |-- *.jpg
|   |-- flickr30k_train.json
|   |-- flickr30k_val.json
|   |-- flickr30k_test.json
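
The exact schema of these annotation files depends on the Kaggle upload, so it is worth peeking at them before training; a generic inspection sketch that assumes nothing about the field names:

    import json
    import os

    root = os.path.join(os.path.expandvars("$DATA"), "flickr")

    for split in ("train", "val", "test"):
        with open(os.path.join(root, f"flickr30k_{split}.json")) as f:
            data = json.load(f)
        # Print the record count and a glimpse of the first record
        first = data[0] if isinstance(data, list) and data else data
        print(split, len(data), "records; first record fields:", list(first))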

FlickrCN

  • Create a folder named flickrcn/ under $DATA.

  • Download the dataset from Kaggle and extract the images to $DATA/flickrcn/flickr30k-images/ (the same Flickr30k images as above).

  • Download Flickr30k-CN.zip from https://huggingface.co/datasets/OFA-Sys/chinese-clip-eval/resolve/main/Flickr30k-CN.zip and copy the Flickr30k-CN/*.jsonl files from the archive to $DATA/flickrcn.

The directory structure should look like

$DATA/
|–– flickrcn/
|   |–– flickr30k-images/
|   |   |-- *.jpg
|   |-- train_texts.jsonl
|   |-- val_texts.jsonl
|   |-- test_texts.jsonl

  • Run generate_flickr30k_cn_json.py --base_dir $DATA/flickrcn/ to generate usable JSON files (a minimal sketch of this conversion follows the final directory tree).

The final directory structure should look like

$DATA/
|–– flickrcn/
|   |–– flickr30k-images/
|   |   |-- *.jpg
|   |-- flickr30k_cn_train.json
|   |-- flickr30k_cn_val.json
|   |-- flickr30k_cn_test.json
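
generate_flickr30k_cn_json.py ships with the repo; as a rough picture of what the conversion amounts to, the sketch below simply turns each *_texts.jsonl file (one JSON object per line) into a single JSON list (the real script may additionally remap fields to match the English Flickr annotations):

    import json
    import os

    base_dir = os.path.join(os.path.expandvars("$DATA"), "flickrcn")

    # Convert each Chinese-CLIP style .jsonl file into a plain JSON list
    for split in ("train", "val", "test"):
        src = os.path.join(base_dir, f"{split}_texts.jsonl")
        dst = os.path.join(base_dir, f"flickr30k_cn_{split}.json")
        with open(src, encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        with open(dst, "w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
        print(dst, len(records), "records")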

Flickr5k

  • Create a folder named flickr5k/ under $DATA.

  • Download the dataset from Kaggle and extract the images to $DATA/flickr5k/flickr5k-images/.

  • Download the JSON annotation files (data/flickr5k/*.json) from https://www.kaggle.com/datasets/wangjilong/dataset-json/ and put them under $DATA/flickr5k.

The directory structure should look like

$DATA/
|–– flickr5k/
|   |–– flickr5k-images/
|   |   |-- *.jpg
|   |-- flickr5k_train.json
|   |-- flickr5k_val.json
|   |-- flickr5k_test.json

Dataset for AdaCLIP

MSRVTT

The videos are shared by Frozen in Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

DiDeMo

The videos can be downloaded from LisaAnne/LocalizingMoments.

ActivityNet

Download the videos from the official website. The authors have made the videos available on Google and Baidu drives.

Preprocessing

Frame Extraction

Run src/adaclip_finetune/utils/frame_extraction.py after having downloaded the dataset videos and annotations from the website. Make sure that all the videos are in the same directory (no sub-directories allowed).
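
If the downloaded videos ended up in nested sub-folders, flatten them first; a small sketch (the paths are placeholders, not repo defaults):

    import os
    import shutil

    src_root = "/path/to/downloaded/videos"  # may contain sub-directories
    dst_root = "/path/to/videos"             # flat directory expected by frame_extraction.py
    os.makedirs(dst_root, exist_ok=True)

    # Move every video file to the top-level directory
    for dirpath, _, files in os.walk(src_root):
        for name in files:
            if name.lower().endswith((".mp4", ".mkv", ".webm", ".avi")):
                shutil.move(os.path.join(dirpath, name), os.path.join(dst_root, name))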

python src/adaclip_finetune/utils/frame_extraction.py /path/to/videos /path/to/frames --parallel

Subsequently, update the frames_dir parameter in the config files configs/[dataset].json.

Annotation Preprocessing

If the videos downloaded differ from the set used in the paper, run annot_preprocess/{dataset}_preprocess.py to generate train/test splits used by the dataloader. Splits used in the paper can be found in annots/.

To obtain the annotation files used to generate the splits, please download them from the following links:

  • MSRVTT annotations are from CLIP4Clip:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

  • ActivityNet annotations are from the project page of ActivityNet Captions:

wget https://cs.stanford.edu/people/ranjaykrishna/densevid/captions.zip

Dataset for Qwen2-VL & Qwen2.5-VL Finetune

ActivityNet-QA

  • Please follow https://github.com/MILVLG/activitynet-qa/tree/master to download and separate the train/val datasets.

  • Then use the generate_llama_json_limit_frames.py script below to generate our train and test datasets:

    python generate_llama_json_limit_frames.py -name val_q -type val -n 500 -seconds 20
    

    generate_llama_json_limit_frames.py:

    import json
    import os
    import argparse
    import ffmpeg
    
    # Define the path to the directory where the video files are stored
    video_directory = "where to find dataset"
    
    
    def get_video_duration(video_path):
        try:
            probe = ffmpeg.probe(video_path)
            video_stream = next(stream for stream in probe["streams"] if stream["codec_type"] == "video")
            return float(video_stream["duration"])
        except Exception as e:
            print(f"Error getting duration for video {video_path}: {e}")
            return 0
    
    
    if __name__ == "__main__":
        # Parse command line arguments
        parser = argparse.ArgumentParser(description="Generate LLaMA JSON")
        parser.add_argument("-name", type=str, default="train_q_3000", help="Number of questions to process")
        parser.add_argument("-type", type=str, default="train", help="data type")
        parser.add_argument("-fps", type=float, default=0.2, help="data type")
        parser.add_argument("-n", type=int, default=250, help="data type")
        parser.add_argument("-seconds", type=int, default=20, help="minimum video duration in seconds")
        args = parser.parse_args()
        fps = args.fps
        basic_seconds = args.seconds
        question_json = "../activitynet-qa/dataset/{}.json".format(args.name)
        answer_json = "../activitynet-qa/dataset/{}_a.json".format(args.type)
        combine_json = "../data/activitynet_qa_{}_{}_limit_{}s.json".format(args.type, args.n, basic_seconds)
        print("combine_json:", combine_json)
    
        # Supported video file extensions
        video_extensions = (".mp4", ".mkv", "webm")
    
        # Load the questions and answers JSON files
        with open(question_json, "r") as question_file:
            questions = json.load(question_file)
    
        with open(answer_json, "r") as answer_file:
            answers = json.load(answer_file)
    
        # Create a dictionary to map question_id to answer for quick lookup
        answer_lookup = {answer["question_id"]: answer for answer in answers}
    
        combined_data = []
        len_pairs = len(questions)
        # Process each question and look for a corresponding answer
        for question in questions:
            question_id = question["question_id"]
            if question_id in answer_lookup:
                answer = answer_lookup[question_id]
    
                # question_id has the form "<video_name>_<index>"; drop the trailing index to recover the video name
                video_name_without_path = ("_").join(question_id.split("_")[:-1])
                # Search for the video file that matches the extracted name
                video_path = None
                find_flag = False
                # Walk through the directory to find matching video files
                for root, dirs, files in os.walk(video_directory):
                    for file in files:
                        if file.startswith(video_name_without_path) and file.endswith(video_extensions):
                            video_path = os.path.join(root, file)
                            find_flag = True
                            break
                    if video_path:
                        break
                if not find_flag:
                    print("!!not find:", video_name_without_path)
                if video_path:
                    video_duration = get_video_duration(video_path)
                    if video_duration > basic_seconds:
                        combined_entry = {
                            "messages": [
                                {"content": f"<video>{question['question']}?", "role": "user"},
                                {"content": answer["answer"], "role": "assistant"},
                            ],
                            "videos": [video_path],
                        }
                        combined_data.append(combined_entry)
                        if len(combined_data) % 100 == 0:
                            print(f"Processed {len(combined_data)} entries")
                        if len(combined_data) >= args.n:
                            break
                    else:
                        print("video_duration < basic_seconds", video_duration, video_path)
        # Write the combined data to the output JSON file
        with open(combine_json, "w") as combine_file:
            json.dump(combined_data, combine_file, indent=4)
    

Update dataset_info.json

Update dataset_info.json in your data dir:

dataset_info.json

{
  "caltech101": {
    "file_name": "caltech101.json"
  },
  "ActivityNet": {
    "file_name": "ActivityNet.json"
  },
  "activitynet_qa_2000_limit_20s": {
    "file_name": "activitynet_qa_2000_limit_20s.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "videos": "videos"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}

caltech101.json

[]

ActivityNet.json

[]

activitynet_qa_2000_limit_20s.json

Generated by generate_llama_json_limit_frames.py above.
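
For reference, each entry written by the script follows the ShareGPT-style layout that the columns and tags in dataset_info.json above refer to (the values here are purely illustrative):

    [
      {
        "messages": [
          {"content": "<video>what is the person doing?", "role": "user"},
          {"content": "playing the guitar", "role": "assistant"}
        ],
        "videos": ["/path/to/activitynet/videos/v_xxxxxxx.mp4"]
      }
    ]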