Add model card and metadata
#1
by nielsr HF Staff - opened
README.md
ADDED
|
@@ -0,0 +1,33 @@
| 1 |
+
---
|
| 2 |
+
library_name: transformers
|
| 3 |
+
pipeline_tag: image-text-to-text
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
+
# LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
|
| 7 |
+
|
| 8 |
+
LoMo (Local Modality Substitution) is a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers.
|
| 9 |
+
|
| 10 |
+
- **Paper:** [LoMo: Local Modality Substitution for Deeper Vision-Language Fusion](https://huggingface.co/papers/2605.30265)
|
| 11 |
+
- **Project Page:** [https://maplebb.github.io/LoMo/page/](https://maplebb.github.io/LoMo/page/)
|
| 12 |
+
- **Repository:** [https://github.com/Maplebb/LoMo](https://github.com/Maplebb/LoMo)
|
| 13 |
+
|
| 14 |
+
## Introduction
|
| 15 |
+
|
| 16 |
+
Vision-Language Models (VLMs) often exhibit "carrier sensitivity": replacing a textual question with its rendered-image counterpart causes significant performance degradation. LoMo addresses this by reformulating single-modality prompts into interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, so the same semantics are preserved across "text, visual, text" carriers, which is intended to yield deeper cross-modal fusion.
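
The substitution step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Segment`, `substitute_span`, and `recover_text` are hypothetical, the span is chosen by hand rather than dynamically, and the image segment only records the text that would be rendered instead of producing pixels.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One piece of an interleaved prompt (hypothetical structure)."""
    kind: str     # "text" or "image"
    content: str  # raw text, or the text that would be rendered as an image


def substitute_span(prompt: str, span: Tuple[int, int]) -> List[Segment]:
    """Split `prompt` around `span` and mark the span as an image carrier.

    Assumption: a real pipeline would render the span to pixels; here the
    image segment just keeps the source text so the semantics stay checkable.
    """
    start, end = span
    if not (0 <= start < end <= len(prompt)):
        raise ValueError("span must be a non-empty range inside the prompt")

    segments: List[Segment] = []
    if prompt[:start]:
        segments.append(Segment("text", prompt[:start]))
    segments.append(Segment("image", prompt[start:end]))
    if prompt[end:]:
        segments.append(Segment("text", prompt[end:]))
    return segments


def recover_text(segments: List[Segment]) -> str:
    """Concatenate all carriers; equals the original prompt if nothing was lost."""
    return "".join(segment.content for segment in segments)


if __name__ == "__main__":
    prompt = "What is 12 + 30? Answer with a number."
    target = "12 + 30"
    start = prompt.index(target)
    for segment in substitute_span(prompt, (start, start + len(target))):
        print(segment.kind, repr(segment.content))
```

Because `recover_text` returns the original prompt, the sketch makes the invariant explicit: the interleaved "text, visual, text" sequence carries the same semantics as the single-modality prompt, and only the carrier of the selected span changes.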
|
| 17 |
+
|
| 18 |
+
## Evaluation
|
| 19 |
+
|
| 20 |
+
Experiments across 13 multimodal benchmarks show that LoMo improves overall multimodal reasoning, with consistent gains across base models: it improves over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and by 2.82 points on Qwen3.5-9B.
|
| 21 |
+
|
| 22 |
+
## Citation
|
| 23 |
+
|
| 24 |
+
If you find this work useful, please consider citing:
|
| 25 |
+
|
| 26 |
+
```bibtex
|
| 27 |
+
@article{han2026lomo,
|
| 28 |
+
title={LoMo: Local Modality Substitution for Deeper Vision-Language Fusion},
|
| 29 |
+
author={Feng Han and Zhixiong Zhang and Zheming Liang and Yibin Wang and Jiaqi Wang},
|
| 30 |
+
journal={arXiv preprint arXiv:2605.30265},
|
| 31 |
+
year={2026}
|
| 32 |
+
}
|
| 33 |
+
```
|