GPT-OSS-120B · 2-Expert · Claude Opus 4.6 Reasoning Distilled (MLX)
Base: openai/gpt-oss-120b (heretic-v2 · mxfp4/q8-hi · MLX format)
Fine-tune dataset: nohurry/Opus-4.6-Reasoning-3000x-filtered — Claude Opus 4.6 reasoning distillation
Optimizer: Muon + AdamW fallback
Hardware: Apple Silicon M-series
Background
gpt-oss-120b is a Mixture-of-Experts (MoE) architecture originally designed with top-k = 4 active experts per token. This model reduces this to top-k = 2 to lower inference cost on Apple Silicon. However, cutting the active expert count introduces a capability regression: the base 2-expert checkpoint exhibits systematic code-generation failures — runtime NameErrors from incorrectly ordered function definitions, broken variable scoping, and structurally incomplete outputs.
This fine-tune restores the capability lost by the top-k reduction via a targeted LoRA + router-unfreeze recipe, using nohurry/Opus-4.6-Reasoning-3000x-filtered as the training signal — a curated dataset of Claude Opus 4.6 long-form chain-of-thought reasoning traces (3000+ samples, filtered for quality). This is a reasoning distillation approach: the model is taught to reproduce the structured thinking patterns of Claude Opus 4.6, which proves highly effective at recovering the code-generation reliability degraded by the top-k reduction.
Why Claude Opus 4.6 reasoning traces? The `<think>...</think>` format in the Opus 4.6 distillation data provides explicit step-by-step reasoning scaffolding. Fine-tuning on this signal pushes the 2-expert model to maintain coherent intermediate state across long generations, precisely the capability that breaks down under reduced expert routing.
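For illustration, one distillation sample might be rendered into a training string like this. The field names and the chat-template tokens are assumptions based on the usage example later in this card, not the dataset's documented schema:

```python
# Hypothetical sketch of how one Opus-4.6 distillation sample could be
# rendered for training; "prompt"/"reasoning"/"answer" are illustrative
# field names, not the dataset's actual columns.
sample = {
    "prompt": "Implement a thread-safe LRU cache in Python with TTL support.",
    "reasoning": "First define the node structure, then the eviction policy...",
    "answer": "import threading\n...",
}

def render(sample: dict) -> str:
    """Wrap the reasoning trace in <think> tags ahead of the final answer."""
    return (
        f"<|im_start|>user\n{sample['prompt']}<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>{sample['reasoning']}</think>\n"
        f"{sample['answer']}<|im_end|>"
    )

text = render(sample)
```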
Benchmark Results
Evaluated on a 14-task benchmark (math olympiad, competitive coding, logic, scientific simulation) using MLX Comprehensive Benchmark v2.0 with code execution and auto-grading.
| Model | Pass | Fail | Error | Pass Rate | Total Time |
|---|---|---|---|---|---|
| Base (2-expert, no fine-tune) | 8 | 1 | 5 | 57.1% | 892 s |
| This model (LoRA fine-tuned) | 12 | 1 | 1 | 85.7% | 788 s |
+4 tasks recovered · +28.6 percentage points · 11.7% faster inference
Tasks that flipped from ERROR → PASS after fine-tuning:
| Task | Category |
|---|---|
| Thread-Safe LRU Cache with TTL | Code |
| Multi-Head Attention + RoPE | Code |
| Dijkstra vs A* on Large Random Graph | Code |
| Optimizer Comparison on Rosenbrock | Scientific |
The remaining failures (math_02 Euler Totient Sum, math_04 Segmented Sieve) are hard number-theory problems unrelated to the code-generation regression and remain consistent across both checkpoints.
Fine-tuning Method
The key insight is that simply reducing top-k does not break the MoE weights themselves — it breaks the router's ability to distribute load correctly with fewer experts, which then cascades into the attention/FFN layers producing structurally malformed code. The fix requires three simultaneous interventions:
- LoRA on projection layers — adapts attention and expert FFN weights to the 2-expert routing regime.
- Full-precision router unfreeze — the router is a softmax classifier, so LoRA is a poor fit here, and Muon's Newton-Schulz orthogonalization would destroy its logit distributions. Router weights are instead unfrozen directly and trained with AdamW at a conservative LR.
- Muon optimizer for LoRA A/B matrices — Nesterov momentum + Newton-Schulz orthogonalization converges faster than AdamW on the LoRA matrices without instability.
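As a rough illustration of the Muon step above, here is a NumPy sketch of the quintic Newton-Schulz iteration. The coefficients are the ones commonly used in public Muon implementations; the actual training script operates on MLX arrays:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a gradient matrix G via the quintic
    Newton-Schulz iteration used in Muon-style optimizers."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients (Muon recipe)
    X = G / (np.linalg.norm(G) + eps)   # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

After a handful of steps the singular values of the result cluster near 1, which is why applying this to a softmax router's weights (BUG-2 below) flattens its logit distribution.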
LoRA Configuration
```text
Target modules : q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
LoRA rank      : 16
LoRA alpha     : 32
Dropout        : 0.0
```
Optimizer Configuration
```text
Muon LR (LoRA A/B matrices)  : 1e-5
Muon momentum                : 0.95
Newton-Schulz steps          : 5
AdamW LR (1-D / bias params) : 1e-5
AdamW LR (router, dedicated) : 1e-6  ← conservative; prevents expert collapse
AdamW betas                  : (0.9, 0.999)
Weight decay                 : 0.01
```
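The three-way optimizer split above can be sketched as a parameter-routing helper. This is a simplified illustration using the LRs from the table; the real script dispatches over MLX modules:

```python
def build_param_groups(named_params):
    """Route parameters to optimizers: router -> dedicated low-LR AdamW,
    2-D matrices (LoRA A/B) -> Muon, everything 1-D -> general AdamW."""
    muon, adamw_general, adamw_router = [], [], []
    for name, p in named_params:
        if "router" in name:
            adamw_router.append(name)   # never orthogonalized (see BUG-2)
        elif getattr(p, "ndim", 0) >= 2:
            muon.append(name)           # LoRA A/B matrices -> Muon
        else:
            adamw_general.append(name)  # biases / 1-D params -> AdamW
    return {
        "muon": {"params": muon, "lr": 1e-5, "momentum": 0.95},
        "adamw": {"params": adamw_general, "lr": 1e-5},
        "adamw_router": {"params": adamw_router, "lr": 1e-6},
    }

class _Stub:  # stand-in for a weight tensor, for demonstration only
    def __init__(self, ndim): self.ndim = ndim

groups = build_param_groups([
    ("model.layers.0.q_proj.lora_A", _Stub(2)),
    ("model.layers.0.mlp.router.weight", _Stub(2)),
    ("model.layers.0.q_proj.bias", _Stub(1)),
])
```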
Training Configuration
```text
Dataset     : nohurry/Opus-4.6-Reasoning-3000x-filtered (2326 samples total)
Train/Eval  : 90% / 10% random split (seed=42)
Pre-filter  : discard samples > 3 × max_seq_len chars (avoids truncated reasoning)
Max seq len : 4096 tokens
Batch size  : 2 × 8 gradient accumulation = 16 effective
Epochs      : 1
Max steps   : 200
Scheduler   : cosine with 5% warmup
```
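The cosine-with-warmup schedule can be sketched as follows (a generic implementation matching the numbers above, not the training script's exact code):

```python
import math

def lr_at(step, max_steps=200, base_lr=1e-5, warmup_frac=0.05):
    """Linear warmup over the first 5% of steps, then cosine decay."""
    warmup_steps = max(1, int(max_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With `max_steps=200` this warms up over the first 10 steps and decays close to zero by step 199.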
Three Bugs Fixed in the Training Script
These bugs exist in naive LoRA setups for quantized MLX MoE models and will silently cause training to do nothing or destabilize the router:
- BUG-1 — LoRA target overwrite: `LORA_TARGET_MODULES` was reassigned multiple times, leaving only `["router"]`. Router weights in quantized MLX models cannot accept LoRA hooks, so nothing was actually fine-tuned. Fix: a single assignment excluding the router; the router is unfrozen separately after `get_peft_model`.
- BUG-2 — Muon applied to router: Newton-Schulz orthogonalization was applied to the router weight matrices. The router is a softmax classifier; orthogonalization destroys its logit distributions and causes expert collapse. Fix: the router path is forced to AdamW inside `_apply()` by checking `"router" in path`.
- BUG-3 — Duplicate LR assignment: `ROUTER_LR` was silently overwritten by a second assignment (`3e-6` → kept as `1e-6`). The lower value matters more under top-k=2 because routing competition is fiercer; a high router LR lets a few experts dominate (expert collapse).
Usage
Command line

```shell
mlx_lm.chat --model cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled --max-tokens 10000 --temp 0.6
```

OpenAI-compatible local server

```shell
mlx-openai-server launch --model-path cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled --model-type lm --host 127.0.0.1 --temperature 0.6
```
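Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using only the standard library; the port and endpoint path are assumptions, so check the server's startup log for the actual address:

```python
import json

# Standard /v1/chat/completions request body; the model name mirrors the
# launch command above.
payload = {
    "model": "cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled",
    "messages": [{"role": "user", "content": "Write a segmented sieve in Python."}],
    "temperature": 0.6,
    "max_tokens": 10000,
}
body = json.dumps(payload)

# To actually send it (port 8000 is a guess; see the server log):
# import urllib.request
# req = urllib.request.Request(
#     "http://127.0.0.1:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```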
python
from mlx_tune import FastLanguageModel # or: from unsloth_mlx import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
"cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled",
max_seq_length=4096,
)
prompt = (
"<|im_start|>user\n"
"Implement a thread-safe LRU cache in Python with TTL support.<|im_end|>\n"
"<|im_start|>assistant\n"
)
inputs = tokenizer(prompt, return_tensors="np")
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0]))
Install Dependencies
```shell
uv pip install mlx-tune datasets huggingface_hub
# legacy package name (identical API)
# uv pip install unsloth-mlx datasets huggingface_hub
```
Limitations
- Trained for 200 steps on ~2093 samples. Longer training may further improve hard math tasks.
- The base model quantization (mxfp4/q8-hi) is preserved; no dequantization was performed.
- Evaluated on Apple Silicon (M-series) only. CUDA inference not tested.
Model Capability Report: GPT-OSS-120B-2experts (MLX-Q4)
Overview
The GPT-OSS-120B-2experts is a high-performance Mixture-of-Experts (MoE) model optimized for local inference. Based on the OpenClaw Agent Capability Test Suite, the model demonstrates robust proficiency in autonomous tool manipulation, structured reasoning, and developer-centric task execution.
Performance Metrics
| Category | Success Rate | Assessment |
|---|---|---|
| Tool Calling | 100% (6/6) | 🟢 Exceptional |
| Context Adherence | 100% (2/2) | 🟢 Exceptional |
| Format Consistency | 93% (3/3) | 🟢 High |
| Reasoning & Logic | 80% (3/3) | 🟡 Reliable |
| Multistep Planning | 82.5% (4/4) | 🟡 Reliable |
| Safety & Guardrails | 66.7% (3/3) | 🟠 Moderate |
Core Agentic Strengths
1. High-Precision Tool Integration
The model achieved a 100% success rate in identifying and generating tool calls. It correctly maps natural language intent to specific functions like file_write, shell, and web_search without argument errors.
- Parameter Mapping: Accurately handles nested JSON arguments and strictly adheres to predefined enumeration values (e.g., Task Priority levels).
- Latency: Optimized for local execution with a model load time of ~4.3s on M2 Ultra hardware.
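An illustrative tool-call payload of the kind the suite checks. The tool name matches those listed above, but the argument schema and the priority enum are hypothetical stand-ins, not the suite's actual definitions:

```python
ALLOWED_PRIORITIES = {"low", "medium", "high"}  # hypothetical enum values

tool_call = {
    "name": "file_write",
    "arguments": {
        "path": "tasks/today.md",
        "content": "- [ ] review PR\n- [ ] run benchmarks\n",
        "metadata": {"priority": "high"},  # must stay within the enum
    },
}

def validate(call: dict) -> bool:
    """Check nested arguments and enum adherence, i.e. the Parameter
    Mapping criterion described above."""
    args = call.get("arguments", {})
    priority = args.get("metadata", {}).get("priority")
    return isinstance(args.get("path"), str) and priority in ALLOWED_PRIORITIES
```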
2. Autonomous Error Recovery
In scenarios where tool execution failed (e.g., file not found or shell error), the model demonstrated a "Self-Correction" loop. Instead of hallucinating results, it analyzed the error message and proposed a logical alternative or requested clarification.
3. State Management
The model shows strong context retention across multi-turn interactions. It can synthesize information from a web_search output to inform a subsequent file_write operation, maintaining the "Chain of Thought" required for complex workflows.
Technical Observations & Constraints
1. JSON Parsing in Long-Range Planning
While the model successfully identifies the initial step in a 3-step task, it occasionally produces malformed JSON or omits closing braces when the reasoning trace becomes excessively long.
2. Security Guardrails
Testing revealed a moderate risk in the "Safety" category. The model attempted to process requests involving sensitive system paths (e.g., /etc/passwd) when prompted, indicating a need for external system-level permission management.
3. Reasoning vs. Calculation
The model performs well on symbolic logic but occasionally requires a "Medium" reasoning setting to handle complex arithmetic without external tool assistance.
Usage Guidelines
- Best For: DevOps automation, local file management, and structured data extraction.
- Recommendation: Use an external JSON validator for multi-step "Plan" outputs. Implement a strict "Allow-list" for shell command execution to mitigate safety risks.
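The JSON-validator recommendation can be as simple as a strict parse-and-check gate before acting on any plan. A sketch; the expected shape (a list of objects with a `step` key) is an assumed schema for illustration:

```python
import json

def parse_plan(text: str):
    """Validate a model-emitted multi-step plan before execution.
    Returns None on malformed output, e.g. the missing closing braces
    noted in the long-range planning observation above."""
    try:
        plan = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(plan, list):
        return None
    if not all(isinstance(s, dict) and "step" in s for s in plan):
        return None
    return plan
```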
Technical Specifications
- Architecture: MoE (2-experts active)
- Base Weight: GPT-OSS-120B
- Quantization: MLX 4-bit (NormalFloat4)
- Environment: macOS / Linux (ARM64/CUDA)
MLX Comprehensive Benchmark v2.0 — Model Comparison
Gemma 4 31B (8-bit) vs GPT-OSS 120B (mxfp4+q4)
Local inference on Apple Silicon via mlx-lm
Models
| | Model | HuggingFace |
|---|---|---|
| 🟣 | Gemma 4 31B (8-bit quantized) | mlx-community/gemma-4-31b-8bit |
| 🟢 | GPT-OSS 120B (mxfp4+q4, Claude 4.6 Opus Reasoning Distilled) | cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled |
Benchmark Setup
- Suite: MLX Comprehensive Benchmark v2.0
- Tasks: 14 problems across Math, Coding, Logic, and Science
- Difficulty: 3–7 ⭐ (see per-task ratings below)
- Timeout: 60s per code execution, 1 retry allowed
- Grading: keyword matching (KW%) + code execution pass/fail
- Hardware: Apple Silicon (MLX backend)
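The KW% metric can be sketched as a simple coverage check (a case-insensitive illustration; the suite's exact matcher may normalize differently):

```python
def keyword_coverage(output: str, keywords: list[str]) -> float:
    """Fraction of expected keywords present in the model's output,
    used as a proxy for answer completeness."""
    if not keywords:
        return 1.0
    hits = sum(1 for kw in keywords if kw.lower() in output.lower())
    return hits / len(keywords)
```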
Results Summary
| Metric | Gemma 4 31B (8-bit) | GPT-OSS 120B |
|---|---|---|
| Pass rate | 10 / 14 (71.4%) | 12 / 14 (85.7%) |
| Avg keyword coverage | 92.3% | 98.2% |
| Total time | 2920s | 631s |
| Avg speed | ~5–8 tok/s | ~29–40 tok/s |
| FAIL | 2 | 2 |
| ERROR | 2 | 0 |
Per-Task Results
| ID | Task | Difficulty | Gemma 4 31B | GPT-OSS 120B | Gemma time | GPT time |
|---|---|---|---|---|---|---|
| math_01 | AMC 2025 — n-Norwegian Number | ⭐⭐⭐⭐⭐ | ✅ PASS (KW 50%) | ✅ PASS (KW 75%) | 1501.7s | 60.8s |
| math_02 | Euler Totient Sum — last 6 digits | ⭐⭐⭐⭐⭐⭐ | ❌ FAIL | ❌ FAIL | 51.7s | 22.3s |
| math_03 | Lattice Paths Avoiding Anti-diagonal | ⭐⭐⭐⭐⭐ | ✅ PASS (KW 67%) | ✅ PASS (KW 100%) | 48.4s | 51.2s |
| math_04 | Segmented Sieve in [10¹², 10¹²+10⁶] | ⭐⭐⭐⭐⭐⭐⭐ | ❌ FAIL | ❌ FAIL | 79.1s | 44.9s |
| code_01 | Median of Two Sorted Arrays | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 181.7s | 30.7s |
| code_02 | Thread-Safe LRU Cache with TTL | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 104.2s | 49.1s |
| code_03 | Persistent Segment Tree — K-th query | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 215.1s | 124.5s |
| code_04 | Multi-Head Attention + RoPE | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 99.9s | 45.8s |
| code_05 | Dijkstra vs A* on Large Random Graph | ⭐⭐⭐ | ⚠️ ERROR (KW 75%) | ✅ PASS (KW 100%) | 69.9s | 32.9s |
| logic_01 | Knights & Knaves — Exhaustive Search | ⭐⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 36.3s | 37.8s |
| logic_02 | Verify Three Mathematical Claims | ⭐⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 42.7s | 15.5s |
| sci_01 | Figure-8 Three-Body Orbit Energy Drift | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 110.4s | 37.6s |
| sci_02 | Metropolis-Hastings vs HMC Comparison | ⭐⭐⭐⭐ | ⚠️ ERROR (KW 100%) | ✅ PASS (KW 100%) | 163.5s | 44.8s |
| sci_03 | Optimizer Comparison on Rosenbrock | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 215.1s | 33.1s |
✅ PASS — code executed and keyword check passed
❌ FAIL — wrong answer (keyword match failed after retry)
⚠️ ERROR — code raised an unrecoverable exception after retry
Analysis
Speed
GPT-OSS 120B is roughly 4–5× faster despite having nearly 4× more parameters. The mxfp4 quantization and optimized MoE routing appear to offset the parameter count advantage. Gemma's chain-of-thought tends to be significantly more verbose — math_01 alone consumed ~1502s out of Gemma's total 2920s.
Math reasoning
Both models fail the same two hardest math problems (Euler totient sum and segmented sieve), indicating a genuine capability boundary rather than implementation differences. On the AMC olympiad problem, GPT-OSS achieves 75% keyword coverage vs Gemma's 50%, and completes the task 25× faster. On the lattice path problem, GPT-OSS produces a fully complete answer (100% KW) while Gemma misses part of the output (67% KW).
Coding
Standard algorithm tasks (median, LRU cache, segment tree, attention) are solved by both models. The differentiator is code robustness under edge cases: Gemma makes a conceptual error on code_05, confusing Euclidean distance as an edge weight rather than a heuristic in A*. After two retries the assertion still fails. GPT-OSS passes on the first attempt. Similarly for sci_02, Gemma's HMC sampler fails the mean-error assertion even after retry; GPT-OSS handles it cleanly.
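For reference, the role separation that code_05 tests looks like this in a standard A* (a generic sketch, not the benchmark's reference solution): Euclidean distance enters only through the heuristic `h(n)`, while `g(n)` accumulates the graph's actual edge weights.

```python
import heapq, math

def astar(graph, coords, start, goal):
    """A* shortest path. `graph` maps node -> [(neighbor, weight)];
    `coords` maps node -> (x, y). Euclidean distance is the heuristic
    only; it is never used as an edge weight."""
    def h(n):  # admissible heuristic: straight-line distance to goal
        (x1, y1), (x2, y2) = coords[n], coords[goal]
        return math.hypot(x1 - x2, y1 - y2)

    dist = {start: 0.0}
    pq = [(h(start), start)]                # entries are (g + h, node)
    while pq:
        f, u = heapq.heappop(pq)
        if u == goal:
            return dist[u]
        if f - h(u) > dist[u] + 1e-12:      # stale queue entry, skip
            continue
        for v, w in graph.get(u, []):
            nd = dist[u] + w                # g(n): actual edge weights only
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(pq, (nd + h(v), v))
    return math.inf
```

Substituting `h(v)` for `w` in the relaxation step (the confusion described above) silently computes geometric rather than graph distances, which is exactly the kind of assertion failure the retry budget cannot fix.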
Logic and science
Results are nearly identical across logic and most science tasks. Both pass the three-body orbit and Rosenbrock optimizer comparisons. The MCMC task (sci_02) is the only science failure for Gemma.
Strengths and Weaknesses
Gemma 4 31B (8-bit)
Strengths
- Competitive on standard coding tasks (data structures, algorithms, ML primitives)
- Solid logical reasoning (knights & knaves, mathematical claim verification)
- Lower memory footprint at 8-bit precision
Weaknesses
- Very slow chain-of-thought — tends to over-deliberate and repeat reasoning steps
- Fragile code generation on tasks requiring correct role separation (e.g. heuristic vs. edge weight in A*)
- Cannot recover from numerical/statistical bugs within the allowed retry budget
- Low keyword coverage on olympiad-level math suggests incomplete final answers
GPT-OSS 120B (Claude 4.6 Opus Reasoning Distilled)
Strengths
- Significantly faster inference despite larger parameter count (MoE architecture advantage)
- More complete outputs (higher KW coverage across all tasks)
- No ERROR results — code is more robust and self-correcting
- Reasoning distillation from Claude 4.6 Opus noticeably improves structured problem solving
Weaknesses
- Shares the same hard math failure cases (Euler totient, segmented sieve) — likely true capability limits
- Higher memory requirement due to model size, despite mxfp4 quantization
- Requires adapter loading overhead at startup
Shared Failure Cases
Both models fail math_02 (Euler Totient Sum, last 6 digits) and math_04 (Segmented Sieve in [10¹², 10¹²+10⁶]). These tasks require either extremely efficient big-number arithmetic or deep number-theoretic insight to implement correctly within the execution timeout. They appear to represent a current hard limit for both models at these parameter scales.
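For context, the standard segmented-sieve structure math_04 calls for looks like this, shown at toy scale. Passing the benchmark additionally requires running it efficiently over [10¹², 10¹²+10⁶] within the 60 s timeout, which is where both models fall short:

```python
import math

def segmented_sieve(lo, hi):
    """Count primes in [lo, hi) by sieving the segment with all primes
    up to sqrt(hi)."""
    # Step 1: ordinary sieve for the small "base" primes up to sqrt(hi).
    limit = int(math.isqrt(hi)) + 1
    base = [True] * (limit + 1)
    base[0:2] = [False, False]
    for i in range(2, int(math.isqrt(limit)) + 1):
        if base[i]:
            base[i * i :: i] = [False] * len(base[i * i :: i])
    small_primes = [i for i, is_p in enumerate(base) if is_p]

    # Step 2: mark composites in the target segment only.
    seg = [True] * (hi - lo)
    for p in small_primes:
        # First multiple of p in the segment, but never below p*p
        # (smaller multiples have a smaller prime factor already handled).
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            seg[m - lo] = False
    if lo <= 1:  # 0 and 1 are not prime
        for m in range(lo, min(2, hi)):
            seg[m - lo] = False
    return sum(seg)
```

The memory cost is O(sqrt(hi) + (hi - lo)), so a 10⁶-wide window near 10¹² is tractable; the time pressure comes from sieving with the ~78,498 primes below 10⁶.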
Notes
- Benchmark date: Gemma run on 2026-04-03, GPT-OSS run on 2026-03-30
- All inference is local on Apple Silicon via MLX; no API calls
- KW% measures keyword coverage of expected output tokens — a proxy for answer completeness
- Time includes both generation and code execution within the sandbox
Fine-tuned by cloudyu · 2026-03