EGSPO & EGSPO-SA: Entropy-Guided Stepwise Policy Optimization for Diffusion LLMs

This repository contains task-specific checkpoints for EGSPO and EGSPO-SA methods from our paper:

"Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages"

📄 Paper (arXiv:2603.12554) | 💻 Code (GitHub)

Overview

We introduce principled RL methods for diffusion language models that:

  • Formulate diffusion generation as a finite-horizon MDP
  • Derive exact policy gradients decomposed over denoising steps
  • Use entropy-guided step selection for compute-efficient training
  • Estimate stepwise advantages via one-step denoising completions

Base Model: LLaDA-8B-Instruct (Masked Diffusion Language Model)
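As a sketch of the entropy-guided step selection idea, one can score each denoising step by the mean entropy of its predicted token distributions and keep the top K (K = 8 in our training setup). The helper below is illustrative; the paper's exact scoring and selection rule may differ.

```python
import numpy as np

def select_high_entropy_steps(step_logits, k=8):
    """Pick the K denoising steps whose token distributions have the
    highest mean entropy. step_logits: [num_steps, seq_len, vocab_size].
    Illustrative sketch, not the paper's implementation."""
    # Numerically stabilized softmax over the vocabulary
    z = step_logits - step_logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)  # [num_steps, seq_len]
    mean_entropy = entropy.mean(axis=-1)             # [num_steps]
    k = min(k, len(mean_entropy))
    return np.argsort(mean_entropy)[::-1][:k]        # indices of top-K steps
```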

Available Checkpoints

We provide checkpoints for two methods across six tasks:

| Task | EGSPO | EGSPO-SA | Accuracy (EGSPO / EGSPO-SA) |
|---|---|---|---|
| GSM8K | ✅ `gsm8k/egspo` | ✅ `gsm8k/egspo-sa` | 85.7% / 85.03% |
| MATH500 | ✅ `math/egspo` | ✅ `math/egspo-sa` | 39.0% / 39.6% |
| Sudoku | ✅ `sudoku/egspo` | ✅ `sudoku/egspo-sa` | 93.6% / 94.3% |
| Countdown | ✅ `countdown/egspo` | ✅ `countdown/egspo-sa` | 75.8% / 78.5% |
| MBPP | ✅ `mbpp/egspo` | ✅ `mbpp/egspo-sa` | 50.6% / 51.1% |
| HumanEval | ✅ `humaneval/egspo` | ✅ `humaneval/egspo-sa` | 40.2% / 44.5% |

Method Comparison

  • EGSPO: Entropy-guided step selection with sequence-level advantages
  • EGSPO-SA: EGSPO + stepwise advantages for improved credit assignment

EGSPO-SA generally achieves better performance, especially on logical reasoning and coding tasks.
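A minimal sketch of how the two advantage types might be combined, assuming a GRPO-style group-normalized sequence advantage and an interpolation weight λ (λ = 0 recovers EGSPO's sequence-level advantage). Function names and the exact combination are illustrative assumptions; see the paper for the precise estimator.

```python
import numpy as np

def sequence_advantage(rewards):
    """Group-normalized sequence-level advantage, shared across all
    selected denoising steps of a trajectory (GRPO-style sketch)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def mixed_advantage(seq_adv, step_adv, lam):
    """Interpolate sequence-level and stepwise advantages: lam = 0
    recovers EGSPO; lam = 1 fully weights the stepwise estimate from
    one-step denoising completions."""
    return (1.0 - lam) * seq_adv + lam * step_adv
```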

Usage

Loading a Checkpoint

from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

# Load the base model. LLaDA is a masked diffusion model with custom
# modeling code, so trust_remote_code=True is required.
base_model = AutoModel.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

# Load the EGSPO-SA LoRA adapter for GSM8K
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-llada-8b",
    subfolder="gsm8k/egspo-sa"
)

tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True
)

# Example inference. As a masked diffusion model, LLaDA is decoded with
# an iterative denoising sampler (e.g. generate.py in the code release)
# rather than Hugging Face's autoregressive model.generate.
from generate import generate  # diffusion sampler from the code repository

prompt = "Solve: John has 5 apples and gives 2 to Mary. How many does he have left?"
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
).to(model.device)

out = generate(model, input_ids, steps=128, gen_length=256,
               block_length=256, temperature=0.9)
print(tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True))

Loading Different Tasks/Methods

# EGSPO for Sudoku
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-llada-8b",
    subfolder="sudoku/egspo"
)

# EGSPO-SA for HumanEval
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-llada-8b",
    subfolder="humaneval/egspo-sa"
)

Performance Results

Comparison with Baselines

| Method | Sudoku | Countdown | GSM8K | MATH500 | HumanEval | MBPP |
|---|---|---|---|---|---|---|
| LLaDA-8B-Instruct | 11.7 | 20.7 | 78.2 | 36.2 | 37.8 | 41.2 |
| d1 | 22.1 | 42.2 | 82.1 | 40.2 | 37.8 | 44.7 |
| wd1 | 76.4 | 51.2 | 82.3 | 39.0 | - | - |
| SPG | 94.0† | 71.5 | 86.1 | 41.8 | - | - |
| EGSPO (Ours) | 93.6 | 75.8 | 85.7 | 39.0 | 40.2 | 50.6 |
| EGSPO-SA (Ours) | 94.3 | 78.5 | 85.03 | 39.6 | 44.5 | 51.1 |

†SPG uses 3-shot evaluation; ours is 0-shot.

Key Findings

  • ⚡ More compute-efficient than baselines (see paper for detailed analysis)
  • 🎯 Stepwise advantages particularly effective for tasks requiring precise credit assignment

Training Details

Hyperparameters

  • Base Model: LLaDA-8B-Instruct
  • Fine-tuning: LoRA (rank=128, α=64)
  • Optimizer: Adam (β₁=0.9, β₂=0.99, weight_decay=0.1)
  • Learning Rate: 3×10⁻⁵
  • Batch Size: 96 (effective)
  • Denoising Steps: 128 (training)
  • Step Budget (K): 8 high-entropy steps per trajectory
  • Temperature: 0.9 (0.2 for Countdown)
  • Generation Length: 256 tokens (training)
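The LoRA settings above can be expressed as a peft configuration. Note that `target_modules` and `lora_dropout` are assumptions (typical attention projections) not stated in this card.

```python
from peft import LoraConfig, get_peft_model

# LoRA settings matching the hyperparameters above; target_modules and
# lora_dropout are assumptions, not taken from the paper or this card.
lora_config = LoraConfig(
    r=128,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)
```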

Method-Specific Settings

EGSPO:

  • Entropy-guided step selection
  • Sequence-level advantages (λ = 0)

EGSPO-SA:

  • Entropy-guided step selection
  • Stepwise advantages with one-step denoising baseline
  • λ schedule: starts at 1.0, halved every 500 steps (for math/coding tasks)
  • λ = 1.0 (static) for logical reasoning tasks
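The λ schedule above can be sketched as a small helper (the function name and arguments are illustrative):

```python
def lambda_schedule(step, lam0=1.0, half_every=500, static=False):
    """EGSPO-SA's interpolation weight: starts at lam0 and is halved
    every `half_every` training steps on math/coding tasks; held fixed
    at lam0 for logical reasoning tasks (static=True)."""
    if static:
        return lam0
    return lam0 * 0.5 ** (step // half_every)
```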

Citation

If you use these checkpoints, please cite our paper:

@article{kunde2025egspo,
  title={Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages},
  author={Kunde, Vishnu Teja and Doudi, Fatemeh and Farahbakhsh, Mahdi and Kalathil, Dileep and Narayanan, Krishna and Chamberland, Jean-Francois},
  journal={arXiv preprint arXiv:2603.12554},
  year={2025}
}

License

Apache 2.0

Contact

For questions or issues, please use the GitHub repository linked above.
