EGSPO & EGSPO-SA: Entropy-Guided Stepwise Policy Optimization for Diffusion LLMs

This repository contains task-specific checkpoints for EGSPO and EGSPO-SA methods from our paper:

"Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages"

📄 Paper (arXiv:2603.12554) | 💻 Code (GitHub)

Overview

We introduce principled RL methods for diffusion language models that:

  • Formulate diffusion generation as a finite-horizon MDP
  • Derive exact policy gradients decomposed over denoising steps
  • Use entropy-guided step selection for compute-efficient training
  • Estimate stepwise advantages via one-step denoising completions

Base Model: LLaDA-8B-Instruct (Masked Diffusion Language Model)
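As a sketch of the entropy-guided step selection idea, one can score each denoising step by the mean entropy of its predicted token distributions and keep the top K (K = 8 in our training setup). The helper below is illustrative; the paper's exact scoring and selection rule may differ.

```python
import numpy as np

def select_high_entropy_steps(step_logits, k=8):
    """Pick the K denoising steps whose token distributions have the
    highest mean entropy. step_logits: [num_steps, seq_len, vocab_size].
    Illustrative sketch, not the paper's implementation."""
    # Numerically stabilized softmax over the vocabulary
    z = step_logits - step_logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)  # [num_steps, seq_len]
    mean_entropy = entropy.mean(axis=-1)             # [num_steps]
    k = min(k, len(mean_entropy))
    return np.argsort(mean_entropy)[::-1][:k]        # indices of top-K steps
```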

Available Checkpoints

We provide checkpoints for two methods across six tasks:

| Task | EGSPO | EGSPO-SA | Accuracy (EGSPO / EGSPO-SA) |
|---|---|---|---|
| GSM8K | ✅ `gsm8k/egspo` | ✅ `gsm8k/egspo-sa` | 85.7% / 85.03% |
| MATH500 | ✅ `math/egspo` | ✅ `math/egspo-sa` | 39.0% / 39.6% |
| Sudoku | ✅ `sudoku/egspo` | ✅ `sudoku/egspo-sa` | 93.6% / 94.3% |
| Countdown | ✅ `countdown/egspo` | ✅ `countdown/egspo-sa` | 75.8% / 78.5% |
| MBPP | ✅ `mbpp/egspo` | ✅ `mbpp/egspo-sa` | 50.6% / 51.1% |
| HumanEval | ✅ `humaneval/egspo` | ✅ `humaneval/egspo-sa` | 40.2% / 44.5% |

Method Comparison

  • EGSPO: Entropy-guided step selection with sequence-level advantages
  • EGSPO-SA: EGSPO + stepwise advantages for improved credit assignment

EGSPO-SA generally achieves better performance, especially on logical reasoning and coding tasks.
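A minimal sketch of how the two advantage types might be combined, assuming a GRPO-style group-normalized sequence advantage and an interpolation weight λ (λ = 0 recovers EGSPO's sequence-level advantage). Function names and the exact combination are illustrative assumptions; see the paper for the precise estimator.

```python
import numpy as np

def sequence_advantage(rewards):
    """Group-normalized sequence-level advantage, shared across all
    selected denoising steps of a trajectory (GRPO-style sketch)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def mixed_advantage(seq_adv, step_adv, lam):
    """Interpolate sequence-level and stepwise advantages: lam = 0
    recovers EGSPO; lam = 1 fully weights the stepwise estimate from
    one-step denoising completions."""
    return (1.0 - lam) * seq_adv + lam * step_adv
```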

Usage

Loading a Checkpoint

from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

# Load the base model. LLaDA is a masked diffusion model with custom
# modeling code, so trust_remote_code=True is required.
base_model = AutoModel.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

# Load the EGSPO-SA LoRA adapter for GSM8K
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-llada-8b",
    subfolder="gsm8k/egspo-sa"
)

tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True
)

# Example inference. As a masked diffusion model, LLaDA is decoded with
# an iterative denoising sampler (e.g. generate.py in the code release)
# rather than Hugging Face's autoregressive model.generate.
from generate import generate  # diffusion sampler from the code repository

prompt = "Solve: John has 5 apples and gives 2 to Mary. How many does he have left?"
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
).to(model.device)

out = generate(model, input_ids, steps=128, gen_length=256,
               block_length=256, temperature=0.9)
print(tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True))

Loading Different Tasks/Methods

# EGSPO for Sudoku
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-llada-8b",
    subfolder="sudoku/egspo"
)

# EGSPO-SA for HumanEval
model = PeftModel.from_pretrained(
    base_model,
    "fatemehdoudi97/egspo-llada-8b",
    subfolder="humaneval/egspo-sa"
)

Performance Results

Comparison with Baselines

| Method | Sudoku | Countdown | GSM8K | MATH500 | HumanEval | MBPP |
|---|---|---|---|---|---|---|
| LLaDA-8B-Instruct | 11.7 | 20.7 | 78.2 | 36.2 | 37.8 | 41.2 |
| d1 | 22.1 | 42.2 | 82.1 | 40.2 | 37.8 | 44.7 |
| wd1 | 76.4 | 51.2 | 82.3 | 39.0 | - | - |
| SPG | 94.0† | 71.5 | 86.1 | 41.8 | - | - |
| EGSPO (Ours) | 93.6 | 75.8 | 85.7 | 39.0 | 40.2 | 50.6 |
| EGSPO-SA (Ours) | 94.3 | 78.5 | 85.03 | 39.6 | 44.5 | 51.1 |

†SPG uses 3-shot evaluation; ours is 0-shot.

Key Findings

  • ⚡ More compute-efficient than baselines (see paper for detailed analysis)
  • 🎯 Stepwise advantages particularly effective for tasks requiring precise credit assignment

Training Details

Hyperparameters

  • Base Model: LLaDA-8B-Instruct
  • Fine-tuning: LoRA (rank=128, α=64)
  • Optimizer: Adam (β₁=0.9, β₂=0.99, weight_decay=0.1)
  • Learning Rate: 3×10⁻⁵
  • Batch Size: 96 (effective)
  • Denoising Steps: 128 (training)
  • Step Budget (K): 8 high-entropy steps per trajectory
  • Temperature: 0.9 (0.2 for Countdown)
  • Generation Length: 256 tokens (training)
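The LoRA settings above can be expressed as a peft configuration. Note that `target_modules` and `lora_dropout` are assumptions (typical attention projections) not stated in this card.

```python
from peft import LoraConfig, get_peft_model

# LoRA settings matching the hyperparameters above; target_modules and
# lora_dropout are assumptions, not taken from the paper or this card.
lora_config = LoraConfig(
    r=128,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)
```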

Method-Specific Settings

EGSPO:

  • Entropy-guided step selection
  • Sequence-level advantages (λ = 0)

EGSPO-SA:

  • Entropy-guided step selection
  • Stepwise advantages with one-step denoising baseline
  • λ schedule: starts at 1.0, halved every 500 steps (for math/coding tasks)
  • λ = 1.0 (static) for logical reasoning tasks
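The λ schedule above can be sketched as a small helper (the function name and arguments are illustrative):

```python
def lambda_schedule(step, lam0=1.0, half_every=500, static=False):
    """EGSPO-SA's interpolation weight: starts at lam0 and is halved
    every `half_every` training steps on math/coding tasks; held fixed
    at lam0 for logical reasoning tasks (static=True)."""
    if static:
        return lam0
    return lam0 * 0.5 ** (step // half_every)
```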

Citation

If you use these checkpoints, please cite our paper:

@article{kunde2025egspo,
  title={Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages},
  author={Kunde, Vishnu Teja and Doudi, Fatemeh and Farahbakhsh, Mahdi and Kalathil, Dileep and Narayanan, Krishna and Chamberland, Jean-Francois},
  journal={arXiv preprint arXiv:2603.12554},
  year={2025}
}

License

Apache 2.0

Contact

For questions or issues, please use the GitHub repository linked above.
