Qwen3-4B-SFT
Qwen3-4B-SFT is a reasoning-focused model fine-tuned from Qwen3-4B-Base using the verl framework.
In open-source practice, fully reproducible "pre-RL SFT base" releases are rare. This model fills that gap by providing a practical intermediate checkpoint that is Math-forward, Reasoning-focused, and Format-aligned. It is ideal as a clean, warm-start base for Reinforcement Learning (RL) or standalone reasoning tasks.
Configuration Notes
- Model Info: Full-parameter SFT based on Qwen3-4B-Base, optimized for Chain-of-Thought (CoT) reasoning.
- Template: Trained with the Qwen chat template; the model learns to end responses with `<|im_end|>` (token id 151645).
- Suggested Configuration:

```json
{ "eos_token_id": 151645 }
```
You may adjust settings according to your training or deployment needs.
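The template behavior described above can be sketched as follows. This is a hypothetical minimal renderer of the Qwen chat format, for illustration only; in real deployments you would use `tokenizer.apply_chat_template` from the `transformers` library rather than building the prompt by hand.

```python
# Minimal sketch of the Qwen chat format (illustrative only; in practice
# use tokenizer.apply_chat_template from the transformers library).
IM_START, IM_END = "<|im_start|>", "<|im_end|>"
EOS_TOKEN_ID = 151645  # token id of <|im_end|>, used as eos_token_id above

def render_chat(messages):
    """Render a conversation and open an assistant turn for generation."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    parts.append(f"{IM_START}assistant\n")  # the model generates from here
    return "".join(parts)

prompt = render_chat([{"role": "user", "content": "What is 2 + 2?"}])
# Decoding should stop once the model emits <|im_end|> (id 151645),
# which is why the suggested configuration sets eos_token_id to 151645.
```

Because the SFT checkpoint was trained to close its turn with `<|im_end|>`, stopping on that token keeps generations from running past the answer.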
Benchmark Snapshot
The following results compare the Base (4B) model against the Qwen3-4B-SFT (this model), highlighting the performance gains in reasoning and mathematics.
| Dataset | Base (4B) | Qwen3-4B-SFT (this model) | Improvement (Absolute) |
|---|---|---|---|
| AIME 2024 | 11.25% | 20.8% | +9.55% |
| AIME 2025 | 6.46% | 19.4% | +12.94% |
| AMC 2023 | 31.09% | 58.0% | +26.91% |
| GPQA-Diamond | 7.77% | 29.1% | +21.33% |
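The "Improvement (Absolute)" column is just the difference between the two score columns; a quick sketch to recompute it, with the numbers copied from the table above:

```python
# Recompute the absolute improvements from the benchmark table.
# Scores are (base, sft) accuracy percentages as reported above.
results = {
    "AIME 2024":    (11.25, 20.8),
    "AIME 2025":    (6.46, 19.4),
    "AMC 2023":     (31.09, 58.0),
    "GPQA-Diamond": (7.77, 29.1),
}
improvements = {name: round(sft - base, 2) for name, (base, sft) in results.items()}
```

Note these are absolute percentage-point gains, not relative improvements.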
Training Compute
- Cluster: MeluXina Supercomputer (LuxProvide)
- Node Config: 4× NVIDIA A100 GPUs per node
- Final SFT Run: 12 node-hours (16× A100 for 3 hours)
- Total R&D Investment: ~700 node-hours (includes data ablations, hyperparameter sweeps, and extensive benchmark evaluation)
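As a sanity check on the compute figures, the node-hour accounting works out as follows (at 4 GPUs per node, 16 A100s means 4 nodes):

```python
# Node-hour arithmetic for the final SFT run, per the figures above.
gpus_per_node = 4
total_gpus = 16
hours = 3
nodes = total_gpus // gpus_per_node      # 16 A100s -> 4 nodes
node_hours = nodes * hours               # 4 nodes * 3 h = 12 node-hours
gpu_hours = node_hours * gpus_per_node   # equivalently 48 GPU-hours
```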
Project Links
- Model repository (this page): https://huggingface.co/96kevinli29/Qwen3-4B-SFT
- Dataset card used for SFT: https://huggingface.co/datasets/96kevinli29/SFT-Dataset
- Training code repository: https://github.com/96kevinli29/base-model-sft-verl
Limitations
- Not optimized for factual correctness in all domains
- May still produce hallucinations or unsafe outputs
- Performance is sensitive to prompt style and decoding settings
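Since performance is sensitive to decoding settings, it can help to pin them explicitly rather than rely on defaults. A sketch of a generation-settings dictionary; only `eos_token_id` comes from this card, and the sampling values are illustrative assumptions, not recommendations:

```python
# Illustrative decoding settings. Only eos_token_id is taken from this
# model card; the sampling values below are assumptions for illustration.
generation_kwargs = {
    "eos_token_id": 151645,   # stop at <|im_end|>, per the card
    "max_new_tokens": 4096,   # assumption: headroom for long CoT traces
    "do_sample": True,
    "temperature": 0.6,       # assumption
    "top_p": 0.95,            # assumption
}
```

In a `transformers`-based pipeline these would typically be passed to `model.generate(**inputs, **generation_kwargs)`.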
Citation
If you use this model, please cite this checkpoint. BibTeX for this release:

```bibtex
@misc{qwen3-4b-sft-2026,
  title        = {{Qwen3-4B-SFT}: Supervised Fine-Tuned {Qwen3}-4B for Reasoning},
  author       = {Hongyang Li and Xiao Li and {Sea-Fill Community}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SeaFill2025/Qwen3-4B-SFT}},
  note         = {Checkpoint trained with verl; warm-start for pre-RL alignment research. Maintained by Sea-Fill Community.}
}
```
Also cite as appropriate:
- The base model (Qwen3-4B-Base): use the official Qwen3 / Alibaba citation from its Hugging Face model card.
- The training code repository: https://github.com/96kevinli29/base-model-sft-verl/
- The original source datasets listed in the dataset recipe.