Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Paper • arXiv:2603.12554 • Published
This repository contains task-specific checkpoints for EGSPO and EGSPO-SA methods from our paper:
"Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages"
📄 Paper (arXiv:2603.12554) | 💻 Code (GitHub)
We introduce principled RL methods for diffusion language models: entropy-guided step selection (EGSPO) and its stepwise-advantage variant (EGSPO-SA).
Base Model: LLaDA-8B-Instruct (Masked Diffusion Language Model)
We provide checkpoints for two methods across six tasks:
| Task | EGSPO | EGSPO-SA | Best Performance (EGSPO / EGSPO-SA) |
|---|---|---|---|
| GSM8K | ✅ `gsm8k/egspo` | ✅ `gsm8k/egspo-sa` | 85.7% / 85.03% |
| MATH500 | ✅ `math/egspo` | ✅ `math/egspo-sa` | 39.0% / 39.6% |
| Sudoku | ✅ `sudoku/egspo` | ✅ `sudoku/egspo-sa` | 93.6% / 94.3% |
| Countdown | ✅ `countdown/egspo` | ✅ `countdown/egspo-sa` | 75.8% / 78.5% |
| MBPP | ✅ `mbpp/egspo` | ✅ `mbpp/egspo-sa` | 50.6% / 51.1% |
| HumanEval | ✅ `humaneval/egspo` | ✅ `humaneval/egspo-sa` | 40.2% / 44.5% |
EGSPO-SA generally achieves better performance, especially on logical reasoning and coding tasks.
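The subfolder paths in the table follow a simple `<task>/<method>` naming scheme. A small helper (hypothetical, not part of the release) can build the `subfolder` argument used with `PeftModel.from_pretrained` below:

```python
# Hypothetical convenience helper: map (task, method) to the checkpoint
# subfolder names listed in the table above.
TASK_DIRS = {
    "GSM8K": "gsm8k",
    "MATH500": "math",
    "Sudoku": "sudoku",
    "Countdown": "countdown",
    "MBPP": "mbpp",
    "HumanEval": "humaneval",
}

def checkpoint_subfolder(task: str, method: str) -> str:
    """Return e.g. 'gsm8k/egspo-sa' for task='GSM8K', method='EGSPO-SA'."""
    return f"{TASK_DIRS[task]}/{method.lower()}"

print(checkpoint_subfolder("GSM8K", "EGSPO-SA"))  # → gsm8k/egspo-sa
```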
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base masked diffusion LM (trust_remote_code is required for
# LLaDA's custom modeling code)
base_model = AutoModelForCausalLM.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Load the EGSPO-SA checkpoint for GSM8K
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-llada-8b",
    subfolder="gsm8k/egspo-sa",
)

tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True
)

# Example inference
prompt = "Solve: John has 5 apples and gives 2 to Mary. How many does he have left?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,  # cap new tokens rather than total length
    temperature=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
```python
# EGSPO for Sudoku
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-llada-8b",
    subfolder="sudoku/egspo",
)

# EGSPO-SA for HumanEval
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-llada-8b",
    subfolder="humaneval/egspo-sa",
)
```
| Method | Sudoku | Countdown | GSM8K | MATH500 | HumanEval | MBPP |
|---|---|---|---|---|---|---|
| LLaDA-8B-Instruct | 11.7 | 20.7 | 78.2 | 36.2 | 37.8 | 41.2 |
| d1 | 22.1 | 42.2 | 82.1 | 40.2 | 37.8 | 44.7 |
| wd1 | 76.4 | 51.2 | 82.3 | 39.0 | - | - |
| SPG | 94.0† | 71.5 | 86.1 | 41.8 | - | - |
| EGSPO (Ours) | 93.6 | 75.8 | 85.7 | 39.0 | 40.2 | 50.6 |
| EGSPO-SA (Ours) | 94.3 | 78.5 | 85.03 | 39.6 | 44.5 | 51.1 |
† SPG uses 3-shot evaluation; ours is 0-shot.
EGSPO: RL for diffusion LLMs with entropy-guided selection of denoising steps.
EGSPO-SA: EGSPO augmented with stepwise advantages.
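As a rough illustration of the entropy-guided idea (a toy sketch under our own assumptions, not the paper's implementation): at each denoising step one can compute the mean token-level predictive entropy and select the highest-entropy steps for policy updates.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_high_entropy_steps(step_dists, k):
    """Return indices of the k denoising steps with highest mean token entropy.

    step_dists: list of steps, each a list of per-token probability vectors.
    (Toy sketch only; the paper's actual selection rule may differ.)
    """
    scores = [sum(entropy(p) for p in toks) / len(toks) for toks in step_dists]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy example: step 0 is uncertain, step 1 is near-deterministic.
steps = [
    [[0.5, 0.5], [0.4, 0.6]],      # high entropy
    [[0.99, 0.01], [0.98, 0.02]],  # low entropy
]
print(select_high_entropy_steps(steps, 1))  # → [0]
```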
If you use these checkpoints, please cite our paper:
```bibtex
@article{kunde2025egspo,
  title={Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages},
  author={Kunde, Vishnu Teja and Doudi, Fatemeh and Farahbakhsh, Mahdi and Kalathil, Dileep and Narayanan, Krishna and Chamberland, Jean-Francois},
  journal={arXiv preprint arXiv:2603.12554},
  year={2025}
}
```
Apache 2.0
For questions or issues, please open an issue on the GitHub repository.