
PIPer Stage 2 RL - Final Checkpoint

100% pass@5 on EnvBench evaluation set!

This model is the final checkpoint from a 2-stage training pipeline for Python environment setup tasks.

Model Description

  • Base Model: Qwen3-8B-am
  • Training Pipeline:
    • Stage 1: Supervised Fine-Tuning on 2,250 ShareGPT conversations
    • Stage 2: Reinforcement Learning with PPO on 228 EnvBench samples (40 epochs)
  • Hardware: 8x NVIDIA H200 GPUs
  • Training Time: ~3 hours total

Performance

Metric                      Value
pass@5 (20-sample eval)     100% (20/20 problems)
Baseline (paper)            19.4%
Baseline (reproduction)     30%
Improvement                 +70 percentage points over the reproduction baseline

Training Data

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "PIPer-Stage2-RL-Final",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("PIPer-Stage2-RL-Final")

# Format the prompt with the model's chat template
messages = [{
    "role": "user",
    "content": "Your task is to generate a bash script that will set up a Python development environment..."
}]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=4096,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.8,
    top_p=0.95,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Training Details

Stage 1 Configuration

  • Dataset: ShareGPT conversations
  • Batch Size: 256 (8 GPUs × 32 samples per GPU)
  • Learning Rate: 2e-5
  • Epochs: 3
  • Sequence Length: 4096
  • Training Steps: 24
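
The 24 training steps follow directly from the dataset size, global batch size, and epoch count; a quick sanity check (assuming the final partial batch of each epoch is dropped):

```python
# Stage 1 step count: 2,250 samples, global batch 256, 3 epochs
num_samples = 2250
global_batch = 256   # 8 GPUs × 32 samples per GPU
epochs = 3

steps_per_epoch = num_samples // global_batch  # 8 full batches per epoch
total_steps = steps_per_epoch * epochs
print(total_steps)  # 24
```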

Stage 2 Configuration

  • Algorithm: PPO (Proximal Policy Optimization)
  • Dataset: EnvBench environment setup problems
  • Batch Size: 128
  • Reward Function: Strict shellcheck validation
  • Epochs: 40
  • Sequence Length: 8192
  • Training Steps: 40
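
The strict shellcheck reward can be sketched as a binary signal: +1.0 if shellcheck reports no findings, -1.0 otherwise. This is a hypothetical sketch (the helper name and exact severity handling are assumptions; the actual PIPer reward implementation may differ):

```python
import os
import subprocess
import tempfile

def shellcheck_reward(script: str) -> float:
    """Strict binary reward: +1.0 if shellcheck passes cleanly, else -1.0.

    Hypothetical sketch; assumes `shellcheck` is on PATH. shellcheck
    exits 0 only when it has no comments on the script.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        path = f.name
    try:
        result = subprocess.run(["shellcheck", path], capture_output=True)
        return 1.0 if result.returncode == 0 else -1.0
    finally:
        os.unlink(path)
```

A binary reward like this keeps PPO optimization focused on fully valid scripts rather than partially correct ones.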

Evaluation Results

Evaluated on 20 problems from EnvBench test set with pass@5 metric (5 samples per problem):

  • 20/20 problems passed (100% success rate)
  • Most problems achieved 5/5 correct samples
  • Strong consistency across samples

Sample Reward Distributions

  • Problem 1: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
  • Problem 2: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
  • Problem 3: [1.00, 1.00, 1.00, 1.00, -1.00] ✓
  • ...
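
pass@5 can be computed directly from such per-problem reward lists: a problem counts as solved if at least one of its 5 samples earns a passing reward, which is why Problem 3 above still passes. A minimal sketch (the reward lists are illustrative, taken from the examples above):

```python
def pass_at_k(reward_lists):
    """Fraction of problems where at least one sample has reward 1.0."""
    solved = sum(1 for rewards in reward_lists if any(r == 1.0 for r in rewards))
    return solved / len(reward_lists)

rewards = [
    [1.0, 1.0, 1.0, 1.0, 1.0],   # Problem 1
    [1.0, 1.0, 1.0, 1.0, 1.0],   # Problem 2
    [1.0, 1.0, 1.0, 1.0, -1.0],  # Problem 3: passes, one sample suffices
]
print(pass_at_k(rewards))  # 1.0
```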

Architecture

  • Framework: veRL (Volcano Engine Reinforcement Learning for LLMs)
  • Distribution: FSDP (Fully Sharded Data Parallel)
  • Inference: vLLM 0.8.4 (hybrid FSDP+vLLM mode)
  • Attention: Flash Attention 2

Checkpoints

Citation

Based on the PIPer paper:

@article{piper2025,
  title={PIPer: Automated Python Environment Setup with Reinforcement Learning},
  author={...},
  journal={arXiv preprint},
  year={2025}
}

License

Same as the base model (Qwen3-8B-am).

Acknowledgments

  • JetBrains Research for the PIPer codebase and EnvBench dataset
  • Qwen team for the base model
  • veRL team for the training framework