Core features of the CLIP finetune tool:

The methods below can be run on both classification and image-to-text tasks.

Method: Detail Description

Full Finetune
  1. Updates all parameters by default.
  2. Optionally enables Angle-Based Selection (uses the weight angle to decide which layers to update).

Partial Finetuning - bias
  1. Updates all bias parameters by default.
  2. Lets users customize which layers participate in training and which do not.

Prompt Tuning
  Adds a prompt embedding layer at the head of the model, or at the input of every layer, and trains only these layers.

Adapter Tuning
  Adds an adapter network at the end of the encoder and trains only this network.

Training free - Tip Adapter
  1. Finetunes the CLIP model without any training, or with only a few epochs of learning.
  2. Adds a fixed cache size to reduce memory use and enable experience sharing across different datasets.

How to configure features of the CLIP finetune tool:

Basic YAML config: vit_b16.yaml

The file is located at src/llamafactory/clip_finetune/configs/clip_finetune/vit_b16.yaml

DATALOADER:
  TRAIN_X:
    BATCH_SIZE: 32
  TEST:
    BATCH_SIZE: 100
  NUM_WORKERS: 4

INPUT:
  SIZE: (224, 224)
  INTERPOLATION: "bicubic"
  PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
  PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
  TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]

OPTIM:
  NAME: "sgd"
  LR: 0.02
  MAX_EPOCH: 50
  LR_SCHEDULER: "cosine"
  WARMUP_EPOCH: 1
  WARMUP_TYPE: "constant"
  WARMUP_CONS_LR: 1e-5

TRAIN:
  PRINT_FREQ: 1             # print accuracy every PRINT_FREQ iterations

MODEL:
  BACKBONE:
    NAME: "ViT-B/16"
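
To make the OPTIM values above more concrete, here is a minimal sketch of the learning rate they could produce per epoch. It assumes the constant warmup holds the rate at WARMUP_CONS_LR for WARMUP_EPOCH epochs and that the cosine schedule then decays from LR toward zero over MAX_EPOCH epochs. `lr_at_epoch` is a hypothetical helper written for illustration, not a function of the tool, and the exact epoch offset the tool applies after warmup may differ.

```python
import math


def lr_at_epoch(epoch, base_lr=0.02, max_epoch=50,
                warmup_epoch=1, warmup_cons_lr=1e-5):
    """Illustrative learning rate for a given epoch (0-indexed).

    Assumption: constant warmup, then cosine annealing from base_lr to 0.
    """
    if epoch < warmup_epoch:
        # WARMUP_TYPE: "constant" -> hold a small fixed learning rate
        return warmup_cons_lr
    # LR_SCHEDULER: "cosine" -> half-cosine decay over max_epoch epochs
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / max_epoch))


if __name__ == "__main__":
    for epoch in (0, 1, 25, 49):
        print(epoch, lr_at_epoch(epoch))
```

Under these assumptions the first epoch trains at 1e-5, after which the rate starts just below 0.02 and reaches half of LR at the midpoint of training.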

Angle-Based Selection for full-finetune

Add the following lines to src/llamafactory/clip_finetune/configs/clip_finetune/vit_b16.yaml

MODEL:
  ABS: True
  ABS_TOP: True                     # True: select the top ABS_KEEP layers; False: select the bottom ABS_KEEP layers
  ABS_GROUP: True                   # True: select the top ABS_KEEP layers in each group; False: select the bottom ABS_KEEP layers
  ABS_GROUP_NAME: ["k_proj", "v_proj", "q_proj"]    # How layers are divided into groups. Here layers are divided into 4 groups: layers with k_proj in their name go into group 0, v_proj into group 1, q_proj into group 2, and all others into group 3
  ABS_KEEP: 5                       # number of layers to keep
  BACKBONE:
    NAME: "ViT-B/16"
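
The grouping and top/bottom selection described in the comments above can be sketched as follows. This is only an illustration under stated assumptions: the per-layer angles are taken as already computed (the tool's own angle computation is not shown here), "top" is read as "largest angle", and `group_index` and `select_layers` are hypothetical names, not functions of the tool.

```python
def group_index(layer_name, group_names):
    """Return the group of a layer: index of the first matching key,
    or len(group_names) for layers that match none of the keys."""
    for i, key in enumerate(group_names):
        if key in layer_name:
            return i
    return len(group_names)


def select_layers(angles, keep, top=True, group=True, group_names=()):
    """Pick `keep` layers (per group if `group` is True) ranked by angle.

    angles: dict mapping layer name -> precomputed weight angle.
    top=True keeps the largest angles, top=False the smallest.
    """
    if group:
        buckets = {}
        for name in angles:
            buckets.setdefault(group_index(name, group_names), []).append(name)
    else:
        buckets = {0: list(angles)}

    selected = []
    for _, names in sorted(buckets.items()):
        ranked = sorted(names, key=lambda n: angles[n], reverse=top)
        selected.extend(ranked[:keep])
    return selected


if __name__ == "__main__":
    demo_angles = {  # made-up values, for illustration only
        "l0.k_proj": 0.3, "l1.k_proj": 0.1,
        "l0.v_proj": 0.2,
        "l0.mlp": 0.5, "l1.mlp": 0.4,
    }
    print(select_layers(demo_angles, keep=1, top=True, group=True,
                        group_names=["k_proj", "v_proj", "q_proj"]))
```

With ABS_GROUP_NAME set to three keys, every layer falls into one of four buckets, so the number of selected layers is at most 4 * ABS_KEEP in this reading.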

Customize trained layers for partial-finetune

Add the following lines to src/llamafactory/clip_finetune/configs/clip_finetune/vit_b16.yaml

MODEL:
  BACKBONE:
    NAME: "ViT-B/16"
BIAS:
  BIAS_TERMS: ["layer_norm", "layernorm"]   # layers you want to train
  BIAS_TERMS_EXCLUDE: ["layernorm"]         # layers you do not want to train
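
One way to picture how the two lists above interact is a name filter over the model's parameters. The sketch below assumes substring matching on lower-cased parameter names and that an exclude match overrides an include match; both rules are assumptions, and `select_trainable` is a hypothetical helper rather than the tool's API.

```python
def select_trainable(param_names, bias_terms, bias_terms_exclude):
    """Return the parameter names that would be trained.

    Assumption: a parameter is trained if its name contains any include
    term and none of the exclude terms.
    """
    selected = []
    for name in param_names:
        lowered = name.lower()
        included = any(term in lowered for term in bias_terms)
        excluded = any(term in lowered for term in bias_terms_exclude)
        if included and not excluded:
            selected.append(name)
    return selected


if __name__ == "__main__":
    names = [  # made-up parameter names, for illustration only
        "text.encoder.layer_norm1.bias",
        "vision.pre_layernorm.bias",
        "vision.mlp.fc1.bias",
    ]
    print(select_trainable(names,
                           ["layer_norm", "layernorm"],
                           ["layernorm"]))
```

Under this reading, the example config would train parameters whose names contain "layer_norm" while leaving those containing "layernorm" frozen.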

Fixed cache size for tip-adapter

Add the following lines to src/llamafactory/clip_finetune/configs/clip_finetune/vit_b16.yaml

TRAINER:
  TIP:
    LOAD_CACHE: True                    # whether to load cache data previously trained with tip-adapter
    beta: 1.0                           # hyperparameter from the original paper
    alpha: 3.0                          # hyperparameter from the original paper
    AUGMENT_EPOCH: 10                   # number of epochs for training the cache
    search_best: True                   # whether to search for the best beta and alpha
    NEW: False                          # whether to use a fixed cache size. True: the caches of all datasets are merged into one tensor [100, hidden_size]; False: each dataset has its own cache [num_dataset * 100, hidden_size]
    NEW_DATASET: False                  # whether to train this dataset from scratch
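
For context on what alpha and beta control, here is a minimal pure-Python sketch of the cache-based logits as described in the Tip-Adapter paper: the affinity between a test feature and each cached key is sharpened by beta, used to weight the cached one-hot labels, and blended with the zero-shot CLIP logits by alpha. The function name and the toy data are illustrative assumptions, features are assumed to be L2-normalized, and this is not the tool's own implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tip_adapter_logits(feature, cache_keys, cache_values, clip_logits,
                       alpha=3.0, beta=1.0):
    """Blend zero-shot CLIP logits with cache-based logits.

    feature:      test image feature (assumed L2-normalized)
    cache_keys:   cached training features, one list per cached sample
    cache_values: one-hot labels of the cached samples
    clip_logits:  zero-shot logits for the same test image
    """
    # Cosine affinity to every cached key (dot product of unit vectors)
    affinities = [dot(feature, key) for key in cache_keys]
    # beta sharpens the affinity: exp(-beta * (1 - affinity))
    weights = [math.exp(-beta * (1.0 - a)) for a in affinities]
    num_classes = len(clip_logits)
    cache_logits = [
        sum(w * value[c] for w, value in zip(weights, cache_values))
        for c in range(num_classes)
    ]
    # alpha balances cache knowledge against the zero-shot prediction
    return [z + alpha * c for z, c in zip(clip_logits, cache_logits)]


if __name__ == "__main__":
    logits = tip_adapter_logits(
        feature=[1.0, 0.0],
        cache_keys=[[1.0, 0.0], [0.0, 1.0]],
        cache_values=[[1.0, 0.0], [0.0, 1.0]],
        clip_logits=[0.0, 0.0],
    )
    print(logits)
```

A larger alpha gives the cache more influence over the final prediction, and a larger beta makes only very close cached samples count, which is why searching over both (search_best) can help on a new dataset.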