| library_name: transformers | |
| tags: | |
| - multimodal | |
| - reasoning | |
| - sft | |
| - rl | |
| datasets: | |
| - multimodal-reasoning-lab/Zebra-CoT | |
| - ModalityDance/Omni-Bench | |
| base_model: | |
| - GAIR/Anole-7b-v0.1 | |
| pipeline_tag: any-to-any | |
| # Omni-R1 | |
| [arXiv paper](https://arxiv.org/abs/2601.09536) | |
| [GitHub repository](https://github.com/ModalityDance/Omni-R1) | |
| [Omni-Bench dataset](https://huggingface.co/datasets/ModalityDance/Omni-Bench) | |
| ## Overview | |
| **Omni-R1** is trained with multimodal interleaved supervision. It first uses **PeSFT** for stable functional image generation, then **PeRPO** for RL refinement on unified tasks, which enables interleaved multimodal reasoning trajectories. | |
| ## Usage | |
| ```python | |
| import torch | |
| from PIL import Image | |
| from transformers import ChameleonProcessor, ChameleonForConditionalGeneration | |
| # 1) Import & load | |
| model_id = "ModalityDance/Omni-R1" # or "ModalityDance/Omni-R1-Zero" | |
| processor = ChameleonProcessor.from_pretrained(model_id) | |
| model = ChameleonForConditionalGeneration.from_pretrained( | |
|     model_id, | |
|     torch_dtype=torch.bfloat16, | |
|     device_map="auto", | |
| ) | |
| model.eval() | |
| # 2) Prepare a single input (prompt contains <image>) | |
| prompt = "What is the smiling man in the image wearing? <image>" | |
| image = Image.open("image.png").convert("RGB") | |
| inputs = processor( | |
|     prompt, | |
|     images=[image], | |
|     padding=False, | |
|     return_for_text_completion=True, | |
|     return_tensors="pt", | |
| ).to(model.device) | |
| # --- minimal image token preprocessing: replace <image> placeholder with image tokens --- | |
| input_ids = inputs["input_ids"].long() | |
| pixel_values = inputs["pixel_values"] | |
| placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0] | |
| image_tokens = model.get_image_tokens(pixel_values) # shape: [1, N] (or compatible) | |
| mask = (input_ids == placeholder_id) | |
| input_ids = input_ids.clone() | |
| input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device) | |
| # 3) Call the model | |
| outputs = model.generate( | |
|     input_ids=input_ids, | |
|     max_length=4096, | |
|     do_sample=True, | |
|     temperature=0.5, | |
|     top_p=0.9, | |
|     pad_token_id=1, | |
|     multimodal_generation_mode="unrestricted", | |
| ) | |
| # 4) Get results | |
| text = processor.batch_decode(outputs, skip_special_tokens=False)[0] | |
| print(text) | |
| ``` | |
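The placeholder replacement step in the snippet above relies on a masked assignment, which only lines up when the number of placeholder positions equals the number of image tokens. The following is a minimal sketch of that logic on plain Python lists, assuming the processor has already expanded `<image>` into one placeholder position per image token. The helper name `splice_image_tokens` and all token ids are hypothetical and chosen only for illustration; this is not code from the Omni-R1 repository.

```python
def splice_image_tokens(input_ids, placeholder_id, image_tokens):
    """Return a copy of input_ids with each placeholder replaced, in order, by one image token."""
    # Positions that currently hold the placeholder id (the "mask" in the tensor version).
    slots = [i for i, tok in enumerate(input_ids) if tok == placeholder_id]
    if len(slots) != len(image_tokens):
        raise ValueError(
            f"{len(slots)} placeholder slots but {len(image_tokens)} image tokens"
        )
    out = list(input_ids)  # copy, so the caller's sequence is left untouched
    for pos, tok in zip(slots, image_tokens):
        out[pos] = tok
    return out


# Toy ids (all hypothetical): 0 = BOS, 9 = placeholder, 101..103 = image tokens.
ids = [0, 42, 43, 9, 9, 9]
spliced = splice_image_tokens(ids, placeholder_id=9, image_tokens=[101, 102, 103])
print(spliced)  # [0, 42, 43, 101, 102, 103]
```

Raising on a count mismatch makes the failure explicit, which can help when a prompt contains more or fewer `<image>` placeholders than images passed in.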
| For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository: | |
| https://github.com/ModalityDance/Omni-R1 | |
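As a rough illustration of what interleaved decoding involves, the sketch below splits a flat sequence of generated token ids into text and image segments using begin-of-image and end-of-image marker ids. It is not the repository's script: the function name `split_interleaved` and the marker ids `900` and `901` are hypothetical, and the real marker ids would have to be taken from the model's own configuration or tokenizer.

```python
def split_interleaved(token_ids, boi_id, eoi_id):
    """Split a flat id sequence into ("text", ids) and ("image", ids) segments."""
    segments = []
    current = []
    in_image = False
    for tok in token_ids:
        if tok == boi_id and not in_image:
            # A begin-of-image marker closes any pending text segment.
            if current:
                segments.append(("text", current))
            current, in_image = [], True
        elif tok == eoi_id and in_image:
            # An end-of-image marker closes the current image segment.
            segments.append(("image", current))
            current, in_image = [], False
        else:
            current.append(tok)
    if current:
        # An unterminated image span is labeled separately so it is not
        # mistaken for a complete image.
        segments.append(("partial_image" if in_image else "text", current))
    return segments


# Hypothetical marker ids: 900 = begin of image, 901 = end of image.
demo = [5, 6, 900, 11, 12, 13, 901, 7]
segments = split_interleaved(demo, boi_id=900, eoi_id=901)
print(segments)
# [('text', [5, 6]), ('image', [11, 12, 13]), ('text', [7])]
```

Text segments could then be passed to the tokenizer's decode method, while image segments would need the model's image decoder; both of those steps are left to the official scripts.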
| ## License | |
| This project is licensed under the **MIT License**. | |
| It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**. | |
| ## Citation | |
| ```bibtex | |
| @misc{cheng2026omnir1unifiedgenerativeparadigm, | |
| title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning}, | |
| author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li}, | |
| year={2026}, | |
| eprint={2601.09536}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.AI}, | |
| url={https://arxiv.org/abs/2601.09536}, | |
| } | |
| ``` | |