# Finetuned SigLIP for Person Visual Descriptions and Reidentification
This model belongs to a family of SigLIP models fine-tuned for person visual description and retrieval.
## Model Details
- Base model: google/siglip2-base-patch16-224
- Architecture modifications: none
## Intended Uses & Limitations
### Example Applications
- Person retrieval based on textual or visual descriptions of the person
- Person image re-identification
- Embedding extraction for retrieval systems
### Limitations and Bias
- May inherit biases from the base SigLIP model and training data
- Not suitable for tasks requiring detailed fine-grained recognition without further training
- Trained on surveillance data; suitable for tasks where a substantial portion of the person is visible
## Training
### Loss Function
- Type: Soft contrastive loss with label smoothing; image-to-image contrastive loss for same-identity image pairs.
- Description: The model is trained to align text and image embeddings using a modified contrastive objective. Instead of relying on hard one-hot targets, label smoothing allocates a small probability mass to all other samples in the batch. All embeddings are normalized prior to similarity computation, and the loss is applied symmetrically in both the image-to-text and text-to-image directions.
To encourage re-identification and emphasize clothing features rather than pose or background, an additional image-to-image contrastive loss is incorporated.
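The text-image term of the loss above can be sketched in PyTorch roughly as follows. This is a minimal illustration only; the temperature and smoothing values are placeholders, not the actual training configuration.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(img_emb, txt_emb, temperature=0.07, smoothing=0.1):
    """Symmetric contrastive loss with label smoothing.

    Instead of hard one-hot targets, each row's target distribution puts
    (1 - smoothing) on the matching pair and spreads `smoothing` uniformly
    over the other samples in the batch.
    """
    # Normalize embeddings before computing similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix

    n = logits.size(0)
    targets = torch.full((n, n), smoothing / (n - 1), device=logits.device)
    targets.fill_diagonal_(1.0 - smoothing)

    # Apply the loss symmetrically: image-to-text and text-to-image.
    loss_i2t = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return (loss_i2t + loss_t2i) / 2
```

The image-to-image term works the same way, with the second embedding batch replaced by embeddings of other images of the same identities.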
### Evaluation Metrics
- Truncated Cumulative Matching Characteristic (CMC) AUC: measures the fraction of queries for which the correct match appears within the top-K ranks (e.g., top-10). Unlike MRR or strict top-1 accuracy, this metric rewards consistent retrieval of all relevant matches near the top ranks, rather than a few perfect hits with the others ranked very low.
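A minimal NumPy sketch of this metric (the function name and the normalization by the average curve height are illustrative assumptions):

```python
import numpy as np

def truncated_cmc_auc(match_ranks, max_rank=20):
    """Truncated CMC AUC: area under the CMC curve up to `max_rank`,
    normalized to [0, 1].

    match_ranks: 1-based rank of the correct identity for each query.
    """
    ranks = np.asarray(match_ranks)
    # CMC(k) = fraction of queries whose correct match appears within top-k.
    cmc = np.array([(ranks <= k).mean() for k in range(1, max_rank + 1)])
    # The mean curve height equals the normalized area under the curve.
    return cmc.mean()
```

A query matched at rank 1 contributes fully; a query matched beyond `max_rank` contributes nothing.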
### Datasets
- Sources: CUHK-PEDES, ICFG-PEDES, IIITD, ITCPR, PRW-TBPS, PETA (plus subsets of SYNTH and RSTPReid used for testing only)
- Processing: Text descriptions were processed with the Mistral LLM to remove ambiguous information about pose or context, leaving only clear visual characteristics in a structured format in which key features are separated by commas. The original train/val/test splits are respected as much as possible.
The splits are created as follows:
- CUHK-PEDES: test: 1000 identities following the original split; validation: 150 identities sampled from the original validation split; train: all remaining identities
- ICFG-PEDES: test: 946 identities following the original split; validation: 150 identities sampled from the original training split; train: all remaining identities
- IIITD: test: 2500 identities following the original split; validation: 150 identities sampled from the original validation split; train: all remaining identities
- ITCPR: Originally designed for zero-shot evaluation, so a new split is created: test: 1000 randomly selected identities; validation: 150 randomly selected identities; train: all remaining identities
- PRW: test: 450 identities following the original split; validation: 150 identities sampled from the original training split; train: all remaining identities
- PETA: The original random image-level split can cause identity leakage. Therefore, the identity-based split from PETA-ZS is adopted: test: 1706 identities following the original split; validation: 150 identities sampled from the original validation split; train: all remaining identities
- RSTPReid: Only the 200 test images are used for evaluation due to annotation errors in the rest of the dataset.
- SYNTH: 1000 random images are used for the test set because the generated captions are noisy and sometimes inaccurate.
### Training Setup
- Initial phase: the backbone was frozen and only the projection heads were trained.
- Fine-tuning phase: the whole model was subsequently trained end-to-end.
- Optimizer: AdamW
- Learning rate: 1e-4 for warm-up, 5e-6 for the rest
- Epochs: 4 epochs for model head fine-tuning, and 50 epochs for full model fine-tuning
- Visual Augmentations: random operations including small rotations, scale, hue variations, horizontal flip, and color jitter
- Text Augmentations: random subsets of key features are selected and removed from the comma-separated description strings to create augmented training samples
- Training Code: available on [GitHub](https://github.com/MarketaJu/ReducedSiglipForPersonDescription)
- Package Versions: torch==2.9.1, transformers==4.57.3, pillow==12.0.0, torchvision==0.24.1
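The text augmentation described above can be sketched as follows (the function name and drop probability are illustrative, not the training configuration):

```python
import random

def augment_description(description, drop_prob=0.3, rng=None):
    """Randomly drop key features from a comma-separated description,
    always keeping at least one feature.
    """
    rng = rng or random.Random()
    features = [f.strip() for f in description.split(",") if f.strip()]
    kept = [f for f in features if rng.random() > drop_prob]
    if not kept:  # never return an empty description
        kept = [rng.choice(features)]
    return ", ".join(kept)
```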
## Results
The model is evaluated on the datasets described above. The following table summarizes the number of identities, images, and queries in each subset.
The final test set is a merge of all subsets.
| Dataset | #Identities | #Images | #Queries |
|---|---|---|---|
| CUHK | 1000 | 3074 | 6156 |
| ICFG | 946 | 19848 | 19873 |
| IIITD | 2500 | 2500 | 5000 |
| ITCPR | 1000 | 1620 | 1620 |
| PRW | 450 | 2057 | 4114 |
| PETA | 1706 | 3933 | 4614 |
| RSTPReid | 200 | 1000 | 1932 |
| SYNTH | 1000 | 1000 | 1000 |
| Final (All) | 8802 | 35032 | 44309 |
### Evaluation of the Model on Text-based Image Retrieval Task
Since the task is focused on retrieving the correct person identity rather than the exact matching image, the evaluation is performed as follows:
- For each text query, the goal is to retrieve the correct identity, not the exact corresponding image.
- During evaluation, scores are computed over all images belonging to the same identity, and the maximum score is taken to represent that identity.
- These identity-level scores are then ranked, and the following retrieval metrics are calculated:
  - Top-k: standard top-1, top-5, and top-10 accuracy
  - MRR: Mean Reciprocal Rank
  - CMC AUC: truncated CMC AUC evaluated up to rank 20
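The identity-level evaluation procedure can be sketched as follows (a minimal NumPy illustration; the function and variable names are hypothetical):

```python
import numpy as np

def identity_level_metrics(scores, image_ids, query_ids, ks=(1, 5, 10), max_rank=20):
    """Identity-level retrieval metrics from an image-level score matrix.

    scores:    (num_queries, num_images) similarity matrix
    image_ids: identity label of each gallery image
    query_ids: ground-truth identity of each query
    """
    image_ids = np.asarray(image_ids)
    identities = np.unique(image_ids)
    # Collapse image scores to identity scores: max over each identity's images.
    id_scores = np.stack(
        [scores[:, image_ids == pid].max(axis=1) for pid in identities], axis=1
    )

    # Rank identities per query; find the 1-based rank of the correct identity.
    order = np.argsort(-id_scores, axis=1)
    correct_col = np.array([np.where(identities == q)[0][0] for q in query_ids])
    ranks = np.array(
        [np.where(order[i] == correct_col[i])[0][0] + 1 for i in range(len(query_ids))]
    )

    metrics = {f"top-{k}": (ranks <= k).mean() for k in ks}
    metrics["MRR"] = (1.0 / ranks).mean()
    metrics["CMC-AUC"] = np.mean([(ranks <= k).mean() for k in range(1, max_rank + 1)])
    return metrics
```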
| Dataset | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| CUHK | 66.2 | 87.5 | 92.7 | 74.1 | 91.5 |
| ICFG | 52.9 | 76.6 | 83.8 | 61.1 | 82.8 |
| IIITD | 68.4 | 89.2 | 93.7 | 77.8 | 92.7 |
| ITCPR | 42.7 | 68.6 | 78.9 | 52.0 | 78.0 |
| PRW | 60.5 | 84.9 | 91.6 | 69.9 | 90.5 |
| PETA | 44.8 | 74.3 | 83.7 | 56.8 | 82.2 |
| RSTPReid | 50.5 | 78.5 | 86.2 | 60.9 | 85.3 |
| SYNTH | 47.2 | 72.5 | 81.9 | 58.9 | 80.9 |
| Final (All) | 51.5 | 74.9 | 82.3 | 60.4 | 81.3 |
### Cross-Model Comparison
| Model Variant | Top-1 | Top-5 | Top-10 | MRR | CMC AUC (20) |
|---|---|---|---|---|---|
| google/siglip-base-patch16-224 | 16.4 | 33.1 | 41.4 | 23.8 | 41.1 |
| google/siglip2-base-patch16-224 | 12.3 | 26.3 | 34.2 | 18.7 | 34.2 |
| finetuned_siglip2 | 53.8 | 77.0 | 83.8 | 62.6 | 82.8 |
| finetuned_siglip2_reid | 51.5 | 74.9 | 82.3 | 60.4 | 81.3 |
| siglip2-person-description-128 | 51.0 | 75.0 | 82.4 | 60.4 | 81.4 |
| siglip2-person-description-64 | 49.0 | 73.8 | 81.7 | 58.6 | 80.5 |
| siglip2-person-description-32 | 43.1 | 70.0 | 78.2 | 53.5 | 77.3 |
### Evaluation of the Model on Image Reidentification Task
Since some images from Market-1501 also appear in the training datasets listed above, the reported results on Market-1501 should be interpreted with caution.
Reported metrics are Top-1/Top-5/Top-10 (first table) and mAP (second table).
Top-1 / Top-5 / Top-10:

| Model Variant | Market 1501 | MSMT17 | Duke MTMC | EntireID |
|---|---|---|---|---|
| google/siglip-base-patch16-224 | 20.3 / 36.2 / 43.3 | 60.3 / 73.1 / 77.5 | 60.8 / 75.1 / 79.5 | 30.0 / 45.5 / 51.9 |
| google/siglip2-base-patch16-224 | 22.9 / 38.0 / 45.9 | 59.1 / 71.3 / 76.2 | 60.2 / 74.2 / 78.7 | 32.3 / 47.1 / 54.0 |
| finetuned_siglip2 | 86.8 / 94.7 / 96.4 | 87.8 / 93.0 / 94.5 | 91.1 / 95.3 / 96.4 | 73.5 / 86.0 / 89.5 |
| finetuned_siglip2_reid | 90.7 / 97.0 / 98.4 | 92.8 / 96.4 / 97.2 | 92.7 / 96.2 / 97.0 | 79.3 / 89.8 / 92.1 |
| siglip-person-description-64 | 86.0 / 93.9 / 96.2 | 84.6 / 91.0 / 92.8 | 89.8 / 94.8 / 95.7 | 68.3 / 82.4 / 86.3 |
| siglip2-person-description-64 | 87.1 / 95.1 / 96.5 | 84.4 / 90.9 / 92.9 | 89.0 / 94.4 / 95.6 | 68.0 / 83.0 / 86.9 |
mAP:

| Model Variant | Market 1501 | MSMT17 | Duke MTMC | EntireID |
|---|---|---|---|---|
| google/siglip-base-patch16-224 | 6.4 | 13.2 | 17.5 | 16.1 |
| google/siglip2-base-patch16-224 | 7.5 | 12.0 | 17.0 | 17.2 |
| finetuned_siglip2 | 73.5 | 48.7 | 60.6 | 54.4 |
| finetuned_siglip2_reid | 81.0 | 66.6 | 66.8 | 60.6 |
| siglip-person-description-64 | 72.3 | 44.0 | 57.1 | 48.9 |
| siglip2-person-description-64 | 73.4 | 44.1 | 57.0 | 49.8 |
## Usage
The usage is identical to SigLIP:

```python
from skimage.io import imread
from transformers import AutoModel, AutoProcessor

# The repository ships a custom modeling file (modeling_resipvd.py with
# ReSiPVDModel), so loading via AutoModel may require trust_remote_code=True.
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")
model = AutoModel.from_pretrained(
    "MarketaJu/siglip2-person-description-reid", trust_remote_code=True
)

image = imread("test.jpg")

text_inputs = processor(
    text=["random person description"],
    return_tensors="pt",
    padding="max_length",
    max_length=64,
    truncation=True,
)
image_inputs = processor(images=image, return_tensors="pt")

text_embeds = model.get_text_features(**text_inputs)
image_embeds = model.get_image_features(**image_inputs)
```
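For retrieval, the extracted embeddings can then be compared by cosine similarity. A minimal sketch (the function name and toy data are illustrative):

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_embed, gallery_embeds):
    """Rank gallery embeddings by cosine similarity to a single query."""
    q = F.normalize(query_embed, dim=-1)
    g = F.normalize(gallery_embeds, dim=-1)
    sims = (q @ g.t()).squeeze(0)  # (num_gallery,) cosine similarities
    order = torch.argsort(sims, descending=True)
    return order, sims
```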
---
If you use this model, please cite:
```bibtex
@misc{reduced-siglip-visualdescription,
  title={Reduced SigLIP for Visual Descriptions},
  author={Marketa Jurankova},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/collections/MarketaJu/reduced-siglip-for-person-visual-description}}
}
```