
PIPer Stage 2 RL - Final Checkpoint

100% pass@5 on EnvBench evaluation set!

This model is the final checkpoint from a 2-stage training pipeline for Python environment setup tasks.

Model Description

  • Base Model: Qwen3-8B-am
  • Training Pipeline:
    • Stage 1: Supervised Fine-Tuning on 2,250 ShareGPT conversations
    • Stage 2: Reinforcement Learning with PPO on 228 EnvBench samples (40 epochs)
  • Hardware: 8x NVIDIA H200 GPUs
  • Training Time: ~3 hours total

Performance

Metric                      Value
pass@5 (20-sample eval)     100% (20/20 problems)
Baseline (paper)            19.4%
Baseline (reproduction)     30%
Improvement                 +70 percentage points over the reproduction baseline

Training Data

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "PIPer-Stage2-RL-Final",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("PIPer-Stage2-RL-Final")

# Format the prompt with the model's chat template
messages = [{
    "role": "user",
    "content": "Your task is to generate a bash script that will set up a Python development environment..."
}]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=4096,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.8,
    top_p=0.95,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Training Details

Stage 1 Configuration

  • Dataset: ShareGPT conversations
  • Batch Size: 256 (8 GPUs × 32 samples per GPU)
  • Learning Rate: 2e-5
  • Epochs: 3
  • Sequence Length: 4096
  • Training Steps: 24
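
The 24 training steps follow directly from the dataset size, global batch size, and epoch count; a quick sanity check (assuming the final partial batch of each epoch is dropped):

```python
# Stage 1 step count: 2,250 samples, global batch 256, 3 epochs
num_samples = 2250
global_batch = 256   # 8 GPUs × 32 samples per GPU
epochs = 3

steps_per_epoch = num_samples // global_batch  # 8 full batches per epoch
total_steps = steps_per_epoch * epochs
print(total_steps)  # 24
```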

Stage 2 Configuration

  • Algorithm: PPO (Proximal Policy Optimization)
  • Dataset: EnvBench environment setup problems
  • Batch Size: 128
  • Reward Function: Strict shellcheck validation
  • Epochs: 40
  • Sequence Length: 8192
  • Training Steps: 40
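
The strict shellcheck reward can be sketched as a binary signal: +1.0 if shellcheck reports no findings, -1.0 otherwise. This is a hypothetical sketch (the helper name and exact severity handling are assumptions; the actual PIPer reward implementation may differ):

```python
import os
import subprocess
import tempfile

def shellcheck_reward(script: str) -> float:
    """Strict binary reward: +1.0 if shellcheck passes cleanly, else -1.0.

    Hypothetical sketch; assumes `shellcheck` is on PATH. shellcheck
    exits 0 only when it has no comments on the script.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        path = f.name
    try:
        result = subprocess.run(["shellcheck", path], capture_output=True)
        return 1.0 if result.returncode == 0 else -1.0
    finally:
        os.unlink(path)
```

A binary reward like this keeps PPO optimization focused on fully valid scripts rather than partially correct ones.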

Evaluation Results

Evaluated on 20 problems from EnvBench test set with pass@5 metric (5 samples per problem):

  • 20/20 problems passed (100% success rate)
  • Most problems achieved 5/5 correct samples
  • Strong consistency across samples

Sample Reward Distributions

  • Problem 1: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
  • Problem 2: [1.00, 1.00, 1.00, 1.00, 1.00] ✓
  • Problem 3: [1.00, 1.00, 1.00, 1.00, -1.00] ✓
  • ...
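
pass@5 can be computed directly from such per-problem reward lists: a problem counts as solved if at least one of its 5 samples earns a passing reward, which is why Problem 3 above still passes. A minimal sketch (the reward lists are illustrative, taken from the examples above):

```python
def pass_at_k(reward_lists):
    """Fraction of problems where at least one sample has reward 1.0."""
    solved = sum(1 for rewards in reward_lists if any(r == 1.0 for r in rewards))
    return solved / len(reward_lists)

rewards = [
    [1.0, 1.0, 1.0, 1.0, 1.0],   # Problem 1
    [1.0, 1.0, 1.0, 1.0, 1.0],   # Problem 2
    [1.0, 1.0, 1.0, 1.0, -1.0],  # Problem 3: passes, one sample suffices
]
print(pass_at_k(rewards))  # 1.0
```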

Architecture

  • Framework: veRL (Volcano Engine Reinforcement Learning for LLMs)
  • Distribution: FSDP (Fully Sharded Data Parallel)
  • Inference: vLLM 0.8.4 (hybrid FSDP+vLLM mode)
  • Attention: Flash Attention 2

Checkpoints

Citation

Based on the PIPer paper:

@article{piper2025,
  title={PIPer: Automated Python Environment Setup with Reinforcement Learning},
  author={...},
  journal={arXiv preprint},
  year={2025}
}

License

Same as the base model (Qwen3-8B-am).

Acknowledgments

  • JetBrains Research for the PIPer codebase and EnvBench dataset
  • Qwen team for the base model
  • veRL team for the training framework