Finetuned SigLIP for Person Visual Descriptions and Reidentification

This model is part of the family of SigLIP models finetuned for person visual description and retrieval.


Model Details

  • Base model: google/siglip2-base-patch16-224
  • Architecture modifications: none

Intended Uses & Limitations

Example Applications

  • Person retrieval based on textual or visual descriptions of the person
  • Person image re-identification
  • Embedding extraction for retrieval systems

Limitations and Bias

  • May inherit biases from the base SigLIP model and training data
  • Not suitable for tasks requiring detailed fine-grained recognition without further training
  • Trained on surveillance data; suitable for tasks where a substantial portion of the person is visible

Training

Loss function

  • Type: Soft contrastive loss with label smoothing; image-to-image contrastive loss for same-identity image pairs.
  • Description: The model is trained to align text and image embeddings using a modified contrastive objective. Instead of relying on hard one-hot targets, label smoothing allocates a small probability mass to all other samples in the batch. All embeddings are normalized prior to similarity computation, and the loss is applied symmetrically in both the image-to-text and text-to-image directions. To encourage re-identification and emphasize clothing features rather than pose or background, an additional image-to-image contrastive loss is incorporated.
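The text-image part of this objective can be sketched as follows. This is a minimal illustration, not the actual training code: the smoothing mass and logit scale are assumed values, and the analogous image-to-image term for same-identity pairs is omitted.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(img_emb, txt_emb, smoothing=0.1, scale=10.0):
    """Symmetric contrastive loss with label-smoothed targets.

    `smoothing` and `scale` are illustrative hyperparameters,
    not the values used to train this model.
    """
    # Normalize embeddings before computing similarities
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = scale * img_emb @ txt_emb.t()  # (B, B) similarity matrix

    n = logits.size(0)
    # Soft targets: most mass on the matching pair, a small amount
    # spread over all other samples in the batch
    targets = torch.full((n, n), smoothing / (n - 1), device=logits.device)
    targets.fill_diagonal_(1.0 - smoothing)

    # Apply the loss symmetrically in both retrieval directions
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```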

Evaluation Metrics

  • Truncated Cumulative Matching Characteristic (CMC) AUC
    • Measures the fraction of queries where the correct match appears within the top K ranks (e.g., top-10). Unlike MRR or strict top-1 accuracy, this metric rewards consistent retrieval of all relevant matches near the top ranks, rather than a few perfect hits with others ranked very low.
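One plausible reading of this metric, given 1-based ranks of the correct identity per query (a sketch, not necessarily the exact evaluation script):

```python
import numpy as np

def truncated_cmc_auc(ranks, max_rank=20):
    """Mean of CMC(k) over k = 1..max_rank.

    `ranks` holds, for each query, the 1-based rank at which the
    correct identity was retrieved.
    """
    ranks = np.asarray(ranks)
    # CMC(k): fraction of queries whose correct match is within top k
    cmc = np.array([(ranks <= k).mean() for k in range(1, max_rank + 1)])
    return cmc.mean()
```

A query ranked first contributes fully at every k, while a query ranked past `max_rank` contributes nothing, which is why the metric rewards consistently high ranks.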

Datasets

  • Sources: CUHK-PEDES, ICFG-PEDES, IIITD, ITCPR, PRW-TBPS, PETA, (part of SYNTH and RSTPReid for testing)
  • Processing: Text descriptions were processed with the Mistral LLM to remove ambiguous information about pose or context, leaving only clear visual characteristics in a structured format where key features are separated by commas. The original train/val/test splits are respected as much as possible.

The splits are created as follows:

  • CUHK-PEDES: test: 1000 identities following the original split; validation: 150 identities sampled from the original validation split; train: all remaining identities
  • ICFG-PEDES: test: 946 identities following the original split; validation: 150 identities sampled from the original training split; train: all remaining identities
  • IIITD: test: 2500 identities following the original split; validation: 150 identities sampled from the original validation split; train: all remaining identities
  • ITCPR: Originally designed for zero-shot evaluation, so a new split is created: test: 1000 randomly selected identities; validation: 150 randomly selected identities; train: all remaining identities
  • PRW: test: 450 identities following the original split; validation: 150 identities sampled from the original training split; train: all remaining identities
  • PETA: The original random image-level split can cause identity leakage. Therefore, the identity-based split from PETA-ZS is adopted: test: 1706 identities following the original split; validation: 150 identities sampled from the original validation split; train: all remaining identities
  • RSTPReid: Only the 200 test images are used for evaluation due to annotation errors in the rest of the dataset.
  • SYNTH: 1000 random images are used for the test set because the generated captions are noisy and sometimes inaccurate.

Training Setup

  • Initial phase: the backbone was frozen and only the model heads were trained.
  • Fine-tuning phase: the whole model was subsequently unfrozen and trained.
  • Optimizer: AdamW
  • Learning rate: 1e-4 for warm-up, 5e-6 for the rest
  • Epochs: 4 epochs for model head fine-tuning, and 50 epochs for full model fine-tuning
  • Visual Augmentations: random operations including small rotations, scale, hue variations, horizontal flip, and color jitter
  • Text Augmentations: random subsets of key features are selected and removed from the comma-separated description strings to create augmented training samples
  • Training code: available on [GitHub](https://github.com/MarketaJu/ReducedSiglipForPersonDescription)
  • Package Versions: torch==2.9.1, transformers==4.57.3, pillow==12.0.0, torchvision==0.24.1
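The text augmentation described above can be sketched as follows. The drop rate and helper name are illustrative, not the actual training configuration:

```python
import random

def augment_description(desc, drop_prob=0.3, rng=None):
    """Drop a random subset of comma-separated key features.

    `drop_prob` is an assumed rate; the real training setup
    may sample the removed subset differently.
    """
    rng = rng or random.Random()
    features = [f.strip() for f in desc.split(",")]
    kept = [f for f in features if rng.random() > drop_prob]
    if not kept:  # always keep at least one feature
        kept = [rng.choice(features)]
    return ", ".join(kept)
```

For example, "red jacket, blue jeans, white sneakers" might be reduced to "red jacket, white sneakers" in one epoch and "blue jeans" in another, so the model learns to match partial descriptions.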

Results

The model is evaluated on the datasets described above. The following table summarizes the number of identities, images, and queries for each subset; the final test set is the union of all subsets.

| Dataset | #Identities | #Images | #Queries |
|---|---|---|---|
| CUHK | 1000 | 3074 | 6156 |
| ICFG | 946 | 19848 | 19873 |
| IIITD | 2500 | 2500 | 5000 |
| ITCPR | 1000 | 1620 | 1620 |
| PRW | 450 | 2057 | 4114 |
| PETA | 1706 | 3933 | 4614 |
| RSTPReid | 200 | 1000 | 1932 |
| SYNTH | 1000 | 1000 | 1000 |
| Final (All) | 8802 | 35032 | 44309 |

Evaluation of the Model on Text-based Image Retrieval Task

Since the task is focused on retrieving the correct person identity rather than the exact matching image, the evaluation is performed as follows:

  • For each text query, the goal is to retrieve the correct identity, not the exact corresponding image.
  • During evaluation, scores are computed over all images belonging to the same identity, and the maximum score is taken to represent that identity.
  • These identity-level scores are then ranked, and the following retrieval metrics are calculated:
    • Top-k: standard top-1, top-5, top-10 accuracy
    • MRR: Mean Reciprocal Rank
    • CMC AUC: Truncated CMC AUC evaluated up to rank 20
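The identity-level scoring described above can be sketched as follows; the function name and array layout are assumptions for illustration:

```python
import numpy as np

def identity_level_ranks(scores, image_ids, query_gt_ids):
    """Max-pool image scores per identity, rank identities, and
    return the 1-based rank of the correct identity per query.

    scores: (Q, N) query-to-image similarity matrix
    image_ids: length-N identity label for each gallery image
    query_gt_ids: length-Q ground-truth identity for each query
    """
    image_ids = np.asarray(image_ids)
    identities = np.unique(image_ids)
    # Represent each identity by the maximum score over its images
    id_scores = np.stack(
        [scores[:, image_ids == pid].max(axis=1) for pid in identities],
        axis=1,
    )
    ranks = []
    for q, gt in enumerate(query_gt_ids):
        order = np.argsort(-id_scores[q])  # descending by score
        ranks.append(int(np.where(identities[order] == gt)[0][0]) + 1)
    return ranks
```

Top-k accuracy is then the fraction of ranks ≤ k, and MRR is the mean of the reciprocal ranks.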

| Dataset | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| CUHK | 66.2 | 87.5 | 92.7 | 74.1 | 91.5 |
| ICFG | 52.9 | 76.6 | 83.8 | 61.1 | 82.8 |
| IIITD | 68.4 | 89.2 | 93.7 | 77.8 | 92.7 |
| ITCPR | 42.7 | 68.6 | 78.9 | 52.0 | 78.0 |
| PRW | 60.5 | 84.9 | 91.6 | 69.9 | 90.5 |
| PETA | 44.8 | 74.3 | 83.7 | 56.8 | 82.2 |
| RSTPReid | 50.5 | 78.5 | 86.2 | 60.9 | 85.3 |
| SYNTH | 47.2 | 72.5 | 81.9 | 58.9 | 80.9 |
| Final (All) | 51.5 | 74.9 | 82.3 | 60.4 | 81.3 |

Cross-Model Comparison

| Model Variant | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| google/siglip-base-patch16-224 | 16.4 | 33.1 | 41.4 | 23.8 | 41.1 |
| google/siglip2-base-patch16-224 | 12.3 | 26.3 | 34.2 | 18.7 | 34.2 |
| finetuned_siglip2 | 53.8 | 77.0 | 83.8 | 62.6 | 82.8 |
| finetuned_siglip2_reid | 51.5 | 74.9 | 82.3 | 60.4 | 81.3 |
| siglip2-person-description-128 | 51.0 | 75.0 | 82.4 | 60.4 | 81.4 |
| siglip2-person-description-64 | 49.0 | 73.8 | 81.7 | 58.6 | 80.5 |
| siglip2-person-description-32 | 43.1 | 70.0 | 78.2 | 53.5 | 77.3 |

Evaluation of the Model on Image Reidentification Task

Since some Market-1501 images also appear in the datasets used for training, results on that benchmark should be interpreted with caution. Reported metrics are Top-1/5/10 and mAP.

Top-1 / Top-5 / Top-10 (%):

| Model Variant | Market-1501 | MSMT17 | DukeMTMC | EntireID |
|---|---|---|---|---|
| google/siglip-base-patch16-224 | 20.3 / 36.2 / 43.3 | 60.3 / 73.1 / 77.5 | 60.8 / 75.1 / 79.5 | 30.0 / 45.5 / 51.9 |
| google/siglip2-base-patch16-224 | 22.9 / 38.0 / 45.9 | 59.1 / 71.3 / 76.2 | 60.2 / 74.2 / 78.7 | 32.3 / 47.1 / 54.0 |
| finetuned_siglip2 | 86.8 / 94.7 / 96.4 | 87.8 / 93.0 / 94.5 | 91.1 / 95.3 / 96.4 | 73.5 / 86.0 / 89.5 |
| finetuned_siglip2_reid | 90.7 / 97.0 / 98.4 | 92.8 / 96.4 / 97.2 | 92.7 / 96.2 / 97.0 | 79.3 / 89.8 / 92.1 |
| siglip-person-description-64 | 86.0 / 93.9 / 96.2 | 84.6 / 91.0 / 92.8 | 89.8 / 94.8 / 95.7 | 68.3 / 82.4 / 86.3 |
| siglip2-person-description-64 | 87.1 / 95.1 / 96.5 | 84.4 / 90.9 / 92.9 | 89.0 / 94.4 / 95.6 | 68.0 / 83.0 / 86.9 |

mAP (%):

| Model Variant | Market-1501 | MSMT17 | DukeMTMC | EntireID |
|---|---|---|---|---|
| google/siglip-base-patch16-224 | 6.4 | 13.2 | 17.5 | 16.1 |
| google/siglip2-base-patch16-224 | 7.5 | 12.0 | 17.0 | 17.2 |
| finetuned_siglip2 | 73.5 | 48.7 | 60.6 | 54.4 |
| finetuned_siglip2_reid | 81.0 | 66.6 | 66.8 | 60.6 |
| siglip-person-description-64 | 72.3 | 44.0 | 57.1 | 48.9 |
| siglip2-person-description-64 | 73.4 | 44.1 | 57.0 | 49.8 |

Usage

The usage is identical to SigLIP.


```python
# Import custom model code from the repository
from modeling_resipvd import ReSiPVDModel

from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the processor and model from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")
model = AutoModel.from_pretrained("MarketaJu/siglip2-person-description-reid")

# Example: get embeddings
image = Image.open("test.jpg")
text_inputs = processor(
    text=["random person description"],
    return_tensors="pt",
    padding="max_length",
    max_length=64,
    truncation=True,
)
image_inputs = processor(images=image, return_tensors="pt")

text_embeds = model.get_text_features(**text_inputs)
image_embeds = model.get_image_features(**image_inputs)
```
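Given text and image embeddings as above, retrieval scores are cosine similarities between normalized embeddings. A self-contained sketch with random stand-in embeddings (the dimensionality here is only illustrative):

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings standing in for the model outputs above
text_embeds = torch.randn(1, 768)    # one text query
image_embeds = torch.randn(5, 768)   # five gallery images

# Normalize, then compute cosine similarities; the highest score wins
text_embeds = F.normalize(text_embeds, dim=-1)
image_embeds = F.normalize(image_embeds, dim=-1)
scores = text_embeds @ image_embeds.t()  # shape (1, 5)
best = scores.argmax(dim=1)              # index of best-matching image
```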


---

## Citation

If you use this model, please cite:

```bibtex
@misc{reduced-siglip-visualdescription,
  title={Reduced SigLIP for Visual Descriptions},
  author={Marketa Jurankova},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/collections/MarketaJu/reduced-siglip-for-person-visual-description}}
}
```