GPT-OSS-120B · 2-Expert · Claude Opus 4.6 Reasoning Distilled (MLX)
Base: openai/gpt-oss-120b (heretic-v2 · mxfp4/q8-hi · MLX format)
Fine-tune dataset: nohurry/Opus-4.6-Reasoning-3000x-filtered — Claude Opus 4.6 reasoning distillation
Optimizer: Muon + AdamW fallback
Hardware: Apple Silicon M-series
Background
gpt-oss-120b is a Mixture-of-Experts (MoE) architecture originally designed with top-k = 4 active experts per token. This model reduces this to top-k = 2 to lower inference cost on Apple Silicon. However, cutting the active expert count introduces a capability regression: the base 2-expert checkpoint exhibits systematic code-generation failures — runtime NameErrors from incorrectly ordered function definitions, broken variable scoping, and structurally incomplete outputs.
This fine-tune restores the capability lost by the top-k reduction via a targeted LoRA + router-unfreeze recipe, using nohurry/Opus-4.6-Reasoning-3000x-filtered as the training signal — a curated dataset of Claude Opus 4.6 long-form chain-of-thought reasoning traces (3000+ samples, filtered for quality). This is a reasoning distillation approach: the model is taught to reproduce the structured thinking patterns of Claude Opus 4.6, which proves highly effective at recovering the code-generation reliability degraded by the top-k reduction.
Why Claude Opus 4.6 reasoning traces? The `<think>...</think>` format in the Opus 4.6 distillation data provides explicit step-by-step reasoning scaffolding. Fine-tuning on this signal pushes the 2-expert model to maintain coherent intermediate state across long generations, precisely the capability that breaks down under reduced expert routing.
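For illustration, one distillation sample might be rendered into a training string like this. The field names and the chat-template tokens are assumptions based on the usage example later in this card, not the dataset's documented schema:

```python
# Hypothetical sketch of how one Opus-4.6 distillation sample could be
# rendered for training; "prompt"/"reasoning"/"answer" are illustrative
# field names, not the dataset's actual columns.
sample = {
    "prompt": "Implement a thread-safe LRU cache in Python with TTL support.",
    "reasoning": "First define the node structure, then the eviction policy...",
    "answer": "import threading\n...",
}

def render(sample: dict) -> str:
    """Wrap the reasoning trace in <think> tags ahead of the final answer."""
    return (
        f"<|im_start|>user\n{sample['prompt']}<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"<think>{sample['reasoning']}</think>\n"
        f"{sample['answer']}<|im_end|>"
    )

text = render(sample)
```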
Benchmark Results
Evaluated on a 14-task benchmark (math olympiad, competitive coding, logic, scientific simulation) using MLX Comprehensive Benchmark v2.0 with code execution and auto-grading.
| Model | Pass | Fail | Error | Pass Rate | Total Time |
|---|---|---|---|---|---|
| Base (2-expert, no fine-tune) | 8 | 1 | 5 | 57.1% | 892 s |
| This model (LoRA fine-tuned) | 12 | 1 | 1 | 85.7% | 788 s |
+4 tasks recovered · +28.6 percentage points · 11.7% faster inference
Tasks that flipped from ERROR → PASS after fine-tuning:
| Task | Category |
|---|---|
| Thread-Safe LRU Cache with TTL | Code |
| Multi-Head Attention + RoPE | Code |
| Dijkstra vs A* on Large Random Graph | Code |
| Optimizer Comparison on Rosenbrock | Scientific |
The remaining failures (math_02 Euler Totient Sum, math_04 Segmented Sieve) are hard number-theory problems unrelated to the code-generation regression and remain consistent across both checkpoints.
Fine-tuning Method
The key insight is that simply reducing top-k does not break the MoE weights themselves — it breaks the router's ability to distribute load correctly with fewer experts, which then cascades into the attention/FFN layers producing structurally malformed code. The fix requires three simultaneous interventions:
- LoRA on projection layers — adapts attention and expert FFN weights to the 2-expert routing regime.
- Full-precision router unfreeze — the router is a softmax classifier, so LoRA is a poor fit here, and Muon's Newton-Schulz orthogonalization would destroy its logit distributions. Router weights are instead unfrozen directly and trained with AdamW at a conservative LR.
- Muon optimizer for LoRA A/B matrices — Nesterov momentum + Newton-Schulz orthogonalization converges faster than AdamW on the LoRA matrices without instability.
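As a rough illustration of the Muon step above, here is a NumPy sketch of the quintic Newton-Schulz iteration. The coefficients are the ones commonly used in public Muon implementations; the actual training script operates on MLX arrays:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a gradient matrix G via the quintic
    Newton-Schulz iteration used in Muon-style optimizers."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients (Muon recipe)
    X = G / (np.linalg.norm(G) + eps)   # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

After a handful of steps the singular values of the result cluster near 1, which is why applying this to a softmax router's weights (BUG-2 below) flattens its logit distribution.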
LoRA Configuration
```text
Target modules : q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
LoRA rank      : 16
LoRA alpha     : 32
Dropout        : 0.0
```
Optimizer Configuration
```text
Muon LR (LoRA A/B matrices)  : 1e-5
Muon momentum                : 0.95
Newton-Schulz steps          : 5
AdamW LR (1-D / bias params) : 1e-5
AdamW LR (router, dedicated) : 1e-6  ← conservative; prevents expert collapse
AdamW betas                  : (0.9, 0.999)
Weight decay                 : 0.01
```
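The three-way optimizer split above can be sketched as a parameter-routing helper. This is a simplified illustration using the LRs from the table; the real script dispatches over MLX modules:

```python
def build_param_groups(named_params):
    """Route parameters to optimizers: router -> dedicated low-LR AdamW,
    2-D matrices (LoRA A/B) -> Muon, everything 1-D -> general AdamW."""
    muon, adamw_general, adamw_router = [], [], []
    for name, p in named_params:
        if "router" in name:
            adamw_router.append(name)   # never orthogonalized (see BUG-2)
        elif getattr(p, "ndim", 0) >= 2:
            muon.append(name)           # LoRA A/B matrices -> Muon
        else:
            adamw_general.append(name)  # biases / 1-D params -> AdamW
    return {
        "muon": {"params": muon, "lr": 1e-5, "momentum": 0.95},
        "adamw": {"params": adamw_general, "lr": 1e-5},
        "adamw_router": {"params": adamw_router, "lr": 1e-6},
    }

class _Stub:  # stand-in for a weight tensor, for demonstration only
    def __init__(self, ndim): self.ndim = ndim

groups = build_param_groups([
    ("model.layers.0.q_proj.lora_A", _Stub(2)),
    ("model.layers.0.mlp.router.weight", _Stub(2)),
    ("model.layers.0.q_proj.bias", _Stub(1)),
])
```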
Training Configuration
```text
Dataset     : nohurry/Opus-4.6-Reasoning-3000x-filtered (2326 samples total)
Train/Eval  : 90% / 10% random split (seed=42)
Pre-filter  : discard samples > 3 × max_seq_len chars (avoids truncated reasoning)
Max seq len : 4096 tokens
Batch size  : 2 × 8 gradient accumulation = 16 effective
Epochs      : 1
Max steps   : 200
Scheduler   : cosine with 5% warmup
```
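The cosine-with-warmup schedule can be sketched as follows (a generic implementation matching the numbers above, not the training script's exact code):

```python
import math

def lr_at(step, max_steps=200, base_lr=1e-5, warmup_frac=0.05):
    """Linear warmup over the first 5% of steps, then cosine decay."""
    warmup_steps = max(1, int(max_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With `max_steps=200` this warms up over the first 10 steps and decays close to zero by step 199.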
Three Bugs Fixed in the Training Script
These bugs exist in naive LoRA setups for quantized MLX MoE models and will silently cause training to do nothing or destabilize the router:
- BUG-1 — LoRA target overwrite: `LORA_TARGET_MODULES` was reassigned multiple times, leaving only `["router"]`. Router weights in quantized MLX models cannot accept LoRA hooks, so nothing was actually fine-tuned. Fix: a single assignment excluding the router; the router is unfrozen separately after `get_peft_model`.
- BUG-2 — Muon applied to router: Newton-Schulz orthogonalization was applied to the router weight matrices. The router is a softmax classifier; orthogonalization destroys its logit distributions and causes expert collapse. Fix: the router path is forced to AdamW inside `_apply()` by checking `"router" in path`.
- BUG-3 — Duplicate LR assignment: `ROUTER_LR` was silently overwritten by a second assignment (`3e-6` → kept as `1e-6`). The lower value matters more under top-k=2 because routing competition is fiercer; a high router LR lets a few experts dominate (expert collapse).
Usage
Command line

```shell
mlx_lm.chat --model cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled --max-tokens 10000 --temp 0.6
```

OpenAI-compatible local server

```shell
mlx-openai-server launch --model-path cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled --model-type lm --host 127.0.0.1 --temperature 0.6
```
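Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using only the standard library; the port and endpoint path are assumptions, so check the server's startup log for the actual address:

```python
import json

# Standard /v1/chat/completions request body; the model name mirrors the
# launch command above.
payload = {
    "model": "cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled",
    "messages": [{"role": "user", "content": "Write a segmented sieve in Python."}],
    "temperature": 0.6,
    "max_tokens": 10000,
}
body = json.dumps(payload)

# To actually send it (port 8000 is a guess; see the server log):
# import urllib.request
# req = urllib.request.Request(
#     "http://127.0.0.1:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```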
python
from mlx_tune import FastLanguageModel # or: from unsloth_mlx import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
"cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled",
max_seq_length=4096,
)
prompt = (
"<|im_start|>user\n"
"Implement a thread-safe LRU cache in Python with TTL support.<|im_end|>\n"
"<|im_start|>assistant\n"
)
inputs = tokenizer(prompt, return_tensors="np")
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0]))
Install Dependencies
```shell
uv pip install mlx-tune datasets huggingface_hub
# legacy package name (identical API)
# uv pip install unsloth-mlx datasets huggingface_hub
```
Limitations
- Trained for 200 steps on ~2093 samples. Longer training may further improve hard math tasks.
- The base model quantization (mxfp4/q8-hi) is preserved; no dequantization was performed.
- Evaluated on Apple Silicon (M-series) only. CUDA inference not tested.
Model Capability Report: GPT-OSS-120B-2experts (MLX-Q4)
Overview
The GPT-OSS-120B-2experts is a high-performance Mixture-of-Experts (MoE) model optimized for local inference. Based on the OpenClaw Agent Capability Test Suite, the model demonstrates robust proficiency in autonomous tool manipulation, structured reasoning, and developer-centric task execution.
Performance Metrics
| Category | Success Rate | Assessment |
|---|---|---|
| Tool Calling | 100% (6/6) | 🟢 Exceptional |
| Context Adherence | 100% (2/2) | 🟢 Exceptional |
| Format Consistency | 93% (3/3) | 🟢 High |
| Reasoning & Logic | 80% (3/3) | 🟡 Reliable |
| Multistep Planning | 82.5% (4/4) | 🟡 Reliable |
| Safety & Guardrails | 66.7% (3/3) | 🟠 Moderate |
Core Agentic Strengths
1. High-Precision Tool Integration
The model achieved a 100% success rate in identifying and generating tool calls. It correctly maps natural language intent to specific functions like file_write, shell, and web_search without argument errors.
- Parameter Mapping: Accurately handles nested JSON arguments and strictly adheres to predefined enumeration values (e.g., Task Priority levels).
- Latency: Optimized for local execution with a model load time of ~4.3s on M2 Ultra hardware.
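An illustrative tool-call payload of the kind the suite checks. The tool name matches those listed above, but the argument schema and the priority enum are hypothetical stand-ins, not the suite's actual definitions:

```python
ALLOWED_PRIORITIES = {"low", "medium", "high"}  # hypothetical enum values

tool_call = {
    "name": "file_write",
    "arguments": {
        "path": "tasks/today.md",
        "content": "- [ ] review PR\n- [ ] run benchmarks\n",
        "metadata": {"priority": "high"},  # must stay within the enum
    },
}

def validate(call: dict) -> bool:
    """Check nested arguments and enum adherence, i.e. the Parameter
    Mapping criterion described above."""
    args = call.get("arguments", {})
    priority = args.get("metadata", {}).get("priority")
    return isinstance(args.get("path"), str) and priority in ALLOWED_PRIORITIES
```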
2. Autonomous Error Recovery
In scenarios where tool execution failed (e.g., file not found or shell error), the model demonstrated a "Self-Correction" loop. Instead of hallucinating results, it analyzed the error message and proposed a logical alternative or requested clarification.
3. State Management
The model shows strong context retention across multi-turn interactions. It can synthesize information from a web_search output to inform a subsequent file_write operation, maintaining the "Chain of Thought" required for complex workflows.
Technical Observations & Constraints
1. JSON Parsing in Long-Range Planning
While the model successfully identifies the initial step in a 3-step task, it occasionally produces malformed JSON or omits closing braces when the reasoning trace becomes excessively long.
2. Security Guardrails
Testing revealed a moderate risk in the "Safety" category. The model attempted to process requests involving sensitive system paths (e.g., /etc/passwd) when prompted, indicating a need for external system-level permission management.
3. Reasoning vs. Calculation
The model performs well on symbolic logic but occasionally requires a "Medium" reasoning setting to handle complex arithmetic without external tool assistance.
Usage Guidelines
- Best For: DevOps automation, local file management, and structured data extraction.
- Recommendation: Use an external JSON validator for multi-step "Plan" outputs. Implement a strict "Allow-list" for shell command execution to mitigate safety risks.
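The JSON-validator recommendation can be as simple as a strict parse-and-check gate before acting on any plan. A sketch; the expected shape (a list of objects with a `step` key) is an assumed schema for illustration:

```python
import json

def parse_plan(text: str):
    """Validate a model-emitted multi-step plan before execution.
    Returns None on malformed output, e.g. the missing closing braces
    noted in the long-range planning observation above."""
    try:
        plan = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(plan, list):
        return None
    if not all(isinstance(s, dict) and "step" in s for s in plan):
        return None
    return plan
```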
Technical Specifications
- Architecture: MoE (2-experts active)
- Base Weight: GPT-OSS-120B
- Quantization: MLX 4-bit (NormalFloat4)
- Environment: macOS / Linux (ARM64/CUDA)
MLX Comprehensive Benchmark v2.0 — Model Comparison
Gemma 4 31B (8-bit) vs GPT-OSS 120B (mxfp4+q4)
Local inference on Apple Silicon via mlx-lm
Models
| | Model | HuggingFace |
|---|---|---|
| 🟣 | Gemma 4 31B (8-bit quantized) | mlx-community/gemma-4-31b-8bit |
| 🟢 | GPT-OSS 120B (mxfp4+q4, Claude 4.6 Opus Reasoning Distilled) | cloudyu/GPT-OSS-120B-2experts-MLX-q4-Claude-4.6-Opus-Reasoning-Distilled |
Benchmark Setup
- Suite: MLX Comprehensive Benchmark v2.0
- Tasks: 14 problems across Math, Coding, Logic, and Science
- Difficulty: 3–7 ⭐ (see per-task ratings below)
- Timeout: 60s per code execution, 1 retry allowed
- Grading: keyword matching (KW%) + code execution pass/fail
- Hardware: Apple Silicon (MLX backend)
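The KW% metric can be sketched as a simple coverage check (a case-insensitive illustration; the suite's exact matcher may normalize differently):

```python
def keyword_coverage(output: str, keywords: list[str]) -> float:
    """Fraction of expected keywords present in the model's output,
    used as a proxy for answer completeness."""
    if not keywords:
        return 1.0
    hits = sum(1 for kw in keywords if kw.lower() in output.lower())
    return hits / len(keywords)
```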
Results Summary
| Metric | Gemma 4 31B (8-bit) | GPT-OSS 120B |
|---|---|---|
| Pass rate | 10 / 14 (71.4%) | 12 / 14 (85.7%) |
| Avg keyword coverage | 92.3% | 98.2% |
| Total time | 2920s | 631s |
| Avg speed | ~5–8 tok/s | ~29–40 tok/s |
| FAIL | 2 | 2 |
| ERROR | 2 | 0 |
Per-Task Results
| ID | Task | Difficulty | Gemma 4 31B | GPT-OSS 120B | Gemma time | GPT time |
|---|---|---|---|---|---|---|
| math_01 | AMC 2025 — n-Norwegian Number | ⭐⭐⭐⭐⭐ | ✅ PASS (KW 50%) | ✅ PASS (KW 75%) | 1501.7s | 60.8s |
| math_02 | Euler Totient Sum — last 6 digits | ⭐⭐⭐⭐⭐⭐ | ❌ FAIL | ❌ FAIL | 51.7s | 22.3s |
| math_03 | Lattice Paths Avoiding Anti-diagonal | ⭐⭐⭐⭐⭐ | ✅ PASS (KW 67%) | ✅ PASS (KW 100%) | 48.4s | 51.2s |
| math_04 | Segmented Sieve in [10¹², 10¹²+10⁶] | ⭐⭐⭐⭐⭐⭐⭐ | ❌ FAIL | ❌ FAIL | 79.1s | 44.9s |
| code_01 | Median of Two Sorted Arrays | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 181.7s | 30.7s |
| code_02 | Thread-Safe LRU Cache with TTL | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 104.2s | 49.1s |
| code_03 | Persistent Segment Tree — K-th query | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 215.1s | 124.5s |
| code_04 | Multi-Head Attention + RoPE | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 99.9s | 45.8s |
| code_05 | Dijkstra vs A* on Large Random Graph | ⭐⭐⭐ | ⚠️ ERROR (KW 75%) | ✅ PASS (KW 100%) | 69.9s | 32.9s |
| logic_01 | Knights & Knaves — Exhaustive Search | ⭐⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 36.3s | 37.8s |
| logic_02 | Verify Three Mathematical Claims | ⭐⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 42.7s | 15.5s |
| sci_01 | Figure-8 Three-Body Orbit Energy Drift | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 110.4s | 37.6s |
| sci_02 | Metropolis-Hastings vs HMC Comparison | ⭐⭐⭐⭐ | ⚠️ ERROR (KW 100%) | ✅ PASS (KW 100%) | 163.5s | 44.8s |
| sci_03 | Optimizer Comparison on Rosenbrock | ⭐⭐⭐ | ✅ PASS (KW 100%) | ✅ PASS (KW 100%) | 215.1s | 33.1s |
✅ PASS — code executed and keyword check passed
❌ FAIL — wrong answer (keyword match failed after retry)
⚠️ ERROR — code raised an unrecoverable exception after retry
Analysis
Speed
GPT-OSS 120B is roughly 4–5× faster despite having nearly 4× more parameters. The mxfp4 quantization and optimized MoE routing appear to offset the parameter count advantage. Gemma's chain-of-thought tends to be significantly more verbose — math_01 alone consumed ~1502s out of Gemma's total 2920s.
Math reasoning
Both models fail the same two hardest math problems (Euler totient sum and segmented sieve), indicating a genuine capability boundary rather than implementation differences. On the AMC olympiad problem, GPT-OSS achieves 75% keyword coverage vs Gemma's 50%, and completes the task 25× faster. On the lattice path problem, GPT-OSS produces a fully complete answer (100% KW) while Gemma misses part of the output (67% KW).
Coding
Standard algorithm tasks (median, LRU cache, segment tree, attention) are solved by both models. The differentiator is code robustness under edge cases: Gemma makes a conceptual error on code_05, confusing Euclidean distance as an edge weight rather than a heuristic in A*. After two retries the assertion still fails. GPT-OSS passes on the first attempt. Similarly for sci_02, Gemma's HMC sampler fails the mean-error assertion even after retry; GPT-OSS handles it cleanly.
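For reference, the role separation that code_05 tests looks like this in a standard A* (a generic sketch, not the benchmark's reference solution): Euclidean distance enters only through the heuristic `h(n)`, while `g(n)` accumulates the graph's actual edge weights.

```python
import heapq, math

def astar(graph, coords, start, goal):
    """A* shortest path. `graph` maps node -> [(neighbor, weight)];
    `coords` maps node -> (x, y). Euclidean distance is the heuristic
    only; it is never used as an edge weight."""
    def h(n):  # admissible heuristic: straight-line distance to goal
        (x1, y1), (x2, y2) = coords[n], coords[goal]
        return math.hypot(x1 - x2, y1 - y2)

    dist = {start: 0.0}
    pq = [(h(start), start)]                # entries are (g + h, node)
    while pq:
        f, u = heapq.heappop(pq)
        if u == goal:
            return dist[u]
        if f - h(u) > dist[u] + 1e-12:      # stale queue entry, skip
            continue
        for v, w in graph.get(u, []):
            nd = dist[u] + w                # g(n): actual edge weights only
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(pq, (nd + h(v), v))
    return math.inf
```

Substituting `h(v)` for `w` in the relaxation step (the confusion described above) silently computes geometric rather than graph distances, which is exactly the kind of assertion failure the retry budget cannot fix.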
Logic and science
Results are nearly identical across logic and most science tasks. Both pass the three-body orbit and Rosenbrock optimizer comparisons. The MCMC task (sci_02) is the only science failure for Gemma.
Strengths and Weaknesses
Gemma 4 31B (8-bit)
Strengths
- Competitive on standard coding tasks (data structures, algorithms, ML primitives)
- Solid logical reasoning (knights & knaves, mathematical claim verification)
- Lower memory footprint at 8-bit precision
Weaknesses
- Very slow chain-of-thought — tends to over-deliberate and repeat reasoning steps
- Fragile code generation on tasks requiring correct role separation (e.g. heuristic vs. edge weight in A*)
- Cannot recover from numerical/statistical bugs within the allowed retry budget
- Low keyword coverage on olympiad-level math suggests incomplete final answers
GPT-OSS 120B (Claude 4.6 Opus Reasoning Distilled)
Strengths
- Significantly faster inference despite larger parameter count (MoE architecture advantage)
- More complete outputs (higher KW coverage across all tasks)
- No ERROR results — code is more robust and self-correcting
- Reasoning distillation from Claude 4.6 Opus noticeably improves structured problem solving
Weaknesses
- Shares the same hard math failure cases (Euler totient, segmented sieve) — likely true capability limits
- Higher memory requirement due to model size, despite mxfp4 quantization
- Requires adapter loading overhead at startup
Shared Failure Cases
Both models fail math_02 (Euler Totient Sum, last 6 digits) and math_04 (Segmented Sieve in [10¹², 10¹²+10⁶]). These tasks require either extremely efficient big-number arithmetic or deep number-theoretic insight to implement correctly within the execution timeout. They appear to represent a current hard limit for both models at these parameter scales.
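For context, the standard segmented-sieve structure math_04 calls for looks like this, shown at toy scale. Passing the benchmark additionally requires running it efficiently over [10¹², 10¹²+10⁶] within the 60 s timeout, which is where both models fall short:

```python
import math

def segmented_sieve(lo, hi):
    """Count primes in [lo, hi) by sieving the segment with all primes
    up to sqrt(hi)."""
    # Step 1: ordinary sieve for the small "base" primes up to sqrt(hi).
    limit = int(math.isqrt(hi)) + 1
    base = [True] * (limit + 1)
    base[0:2] = [False, False]
    for i in range(2, int(math.isqrt(limit)) + 1):
        if base[i]:
            base[i * i :: i] = [False] * len(base[i * i :: i])
    small_primes = [i for i, is_p in enumerate(base) if is_p]

    # Step 2: mark composites in the target segment only.
    seg = [True] * (hi - lo)
    for p in small_primes:
        # First multiple of p in the segment, but never below p*p
        # (smaller multiples have a smaller prime factor already handled).
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            seg[m - lo] = False
    if lo <= 1:  # 0 and 1 are not prime
        for m in range(lo, min(2, hi)):
            seg[m - lo] = False
    return sum(seg)
```

The memory cost is O(sqrt(hi) + (hi - lo)), so a 10⁶-wide window near 10¹² is tractable; the time pressure comes from sieving with the ~78,498 primes below 10⁶.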
Notes
- Benchmark date: Gemma run on 2026-04-03, GPT-OSS run on 2026-03-30
- All inference is local on Apple Silicon via MLX; no API calls
- KW% measures keyword coverage of expected output tokens — a proxy for answer completeness
- Time includes both generation and code execution within the sandbox
Fine-tuned by cloudyu · 2026-03