Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4
Mixed-precision quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled — a Claude 4.6 Opus reasoning-distilled Qwen3.5-27B model.
~25 GB on disk (~23.7 GiB in VRAM). Recommended GPU: NVIDIA RTX PRO 6000 (96 GB), which works out of the box at the full 262K context. The repo also includes preprocessor_config.json, video_preprocessor_config.json, corrected tokenizer metadata, and the Qwen3.5 MTP sidecar (model.mtp.safetensors). On a single RTX 5090 (32 GB), this checkpoint was locally validated on 2026-03-19 with vLLM 0.17.1, transformers 5.3.0, and only the SM120 TMA patch from PR #36325; MTP speculative decoding launched and generated coherently up to the full configured 262,144-token context without --enforce-eager.
RTX 5090 Setup Guide (single GPU, 32 GB)
If you have a GPU with >= 48 GB VRAM (RTX PRO 6000, A100, H100, etc.), skip this section.
Exact setup validated locally on 2026-03-19
- vllm==0.17.1
- transformers==5.3.0
- only the SM120 TMA patch from PR #36325
- no PR branch install
- no --enforce-eager
- validated on a single RTX 5090 with CUDA_DEVICE_ORDER=PCI_BUS_ID and CUDA_VISIBLE_DEVICES=1
Step 1: Install the validated runtime
pip install -U vllm==0.17.1
pip install transformers==5.3.0
Step 2: Apply the SM120 TMA patch
The exact patch used in the validated run restricts Triton's TMA path to Hopper and keeps it off on Blackwell:
UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))")
sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"
echo "Patched: $UTILS_FILE"
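The effect of the one-line sed edit can be expressed as a predicate change. A sketch with hypothetical helper names (the real check lives in vLLM's fla/ops/utils.py):

```python
def tma_allowed_before(major: int, is_nvidia: bool = True) -> bool:
    # Original gate: any NVIDIA GPU with compute capability >= 9.
    return is_nvidia and major >= 9

def tma_allowed_after(major: int, is_nvidia: bool = True) -> bool:
    # Patched gate: capability 9-11 only, excluding SM120 (RTX 5090).
    return is_nvidia and 9 <= major < 12

# Hopper (SM90) keeps the TMA path; the 5090 (SM120) now skips it.
print(tma_allowed_before(12), tma_allowed_after(12))  # True False
```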
Step 3: Run
Without MTP:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 \
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8_e4m3 \
--max-num-seqs 1 \
--skip-mm-profiling \
--reasoning-parser qwen3
With MTP:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 \
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8_e4m3 \
--max-num-seqs 1 \
--skip-mm-profiling \
--reasoning-parser qwen3 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
Notes:
- --gpu-memory-utilization 0.85, --kv-cache-dtype fp8_e4m3, --max-num-seqs 1, and --skip-mm-profiling were all part of the validated single-5090 setup.
- --enforce-eager was not required in this local validation.
- MTP was successfully launched and tested at 8192, 12288, 18432, 27648, 41472, 62208, 93312, 139968, 209952, and 262144.
- The first failed step in the 1.5x sweep was 314928; it failed because it exceeds the model's configured max_position_embeddings=262144, not because of a 5090 VRAM or runtime limit.
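The 1.5x sweep reported above can be reproduced arithmetically (a small sketch using the numbers from this section):

```python
# Steps grow by 1.5x from 8192 until they exceed the model's configured
# max_position_embeddings; the cap itself (262144) was also tested.
MAX_POS = 262144

step, valid = 8192, []
while step <= MAX_POS:
    valid.append(step)
    step = step * 3 // 2  # 1.5x; integer-exact for these values

first_failing = step        # rejected before launch
tested = valid + [MAX_POS]  # the full tested list as reported above
```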
Quantization Strategy
Non-uniform mixed-precision quantization using llm-compressor v0.10.1, with layer-level precision assignment based on architectural sensitivity analysis of the Qwen3.5 hybrid DeltaNet architecture. The quantization uses the compressed-tensors format with two quantization groups plus unquantized layers:
| Precision | Layers | Rationale |
|---|---|---|
| FP8 W8A8 (per-channel weights, per-token dynamic activations) | DeltaNet projections (in_proj_qkv, in_proj_z, out_proj), softmax q_proj/k_proj/v_proj, MLP down_proj | Sensitive layers: QKV projections directly shape attention computation; DeltaNet accumulates error in its recurrent state; down_proj is the MLP accumulation point. Uniform QKV precision is required for fused QKV weight loading. |
| NVFP4 W4A4 (FP4 E2M1 weights + activations, FP8 E4M3 per-group scales, group_size=16) | Softmax o_proj, MLP gate_proj/up_proj | Error-tolerant layers: o_proj is a post-attention reduction; gate/up are pre-activation (errors are dampened by SiLU gating). |
| BF16 (unquantized) | lm_head, embed_tokens, DeltaNet small projections (in_proj_a, in_proj_b), all norms, visual encoder | Industry consensus: lm_head amplifies errors across the 248K vocabulary; embed_tokens is a lookup table; DeltaNet low-rank projections are numerically sensitive; the vision tower is retained at full precision. |
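A hypothetical sketch of the per-layer assignment logic implied by the table above; the real mapping lives in the llm-compressor recipe, and the suffix lists here are transcribed from the table, not read from the repo:

```python
# Module-name suffixes per quantization group, per the table above.
FP8_SUFFIXES = ("in_proj_qkv", "in_proj_z", "out_proj",
                "q_proj", "k_proj", "v_proj", "down_proj")
NVFP4_SUFFIXES = ("o_proj", "gate_proj", "up_proj")

def precision_for(layer_name: str) -> str:
    """Map a module name to its quantization group."""
    suffix = layer_name.rsplit(".", 1)[-1]
    if suffix in FP8_SUFFIXES:
        return "fp8_w8a8"
    if suffix in NVFP4_SUFFIXES:
        return "nvfp4_w4a4"
    # lm_head, embed_tokens, norms, small projections, vision tower
    return "bf16"
```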
Weight Breakdown
| Component | Size | Precision |
|---|---|---|
| MLP | 11.4 GB | FP8 + NVFP4 |
| DeltaNet attention | 5.6 GB | FP8 + BF16 |
| lm_head | 2.5 GB | BF16 |
| embed_tokens | 2.5 GB | BF16 |
| Softmax attention | 1.4 GB | FP8 + NVFP4 |
| Visual encoder | 0.9 GB | BF16 (vision tower retained; text-only calibration — vision capability not separately validated post-quantization) |
| Quantization scales | 0.8 GB | Mixed |
| Total | ~25 GB | |
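As a quick arithmetic check, the component sizes in the table sum to the stated total:

```python
# Component sizes (GB) copied from the breakdown table above.
components_gb = {
    "mlp": 11.4, "deltanet_attention": 5.6, "lm_head": 2.5,
    "embed_tokens": 2.5, "softmax_attention": 1.4,
    "visual_encoder": 0.9, "quant_scales": 0.8,
}
total = sum(components_gb.values())
print(round(total, 1))  # 25.1, i.e. the ~25 GB listed as the total
```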
Self-Calibration
Calibration data was generated by the model itself (512 reasoning traces across 8 categories: math, code, logic, analysis, creative writing, general knowledge, tool calling, Korean). Self-calibration produces traces that match the model's actual activation distributions during reasoning, yielding better quantization accuracy than generic web text for reasoning-distilled models.
- Calibration samples: 512
- Mean sequence length: 2,180 tokens
- Max sequence length: 4,096 tokens
- Generation: internal offline generation for calibration traces. SGLang is not a supported serving runtime for this checkpoint.
Architecture
Qwen3.5-27B uses a hybrid DeltaNet + softmax attention architecture with full_attention_interval=4:
Layer pattern (64 layers):
[DeltaNet, DeltaNet, DeltaNet, Softmax] × 16
= 48 DeltaNet layers + 16 softmax attention layers
Key architectural parameters:
- Hidden size: 5,120
- Attention heads: 24 (query), 4 (KV, GQA)
- Head dimension: 256
- DeltaNet heads: 16 key, 48 value (dim 128 each)
- MLP intermediate: 17,408
- Vocabulary: 248,320
- Max position embeddings: 262,144
Only 16 of 64 layers require KV cache — the 48 DeltaNet layers use a fixed-size recurrent state that doesn't grow with sequence length. This gives ~4x more context capacity than a standard transformer of the same size.
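The layer schedule above can be sketched as:

```python
# Hybrid schedule: full (softmax) attention every 4th layer, DeltaNet
# otherwise (full_attention_interval=4 over 64 layers).
NUM_LAYERS, FULL_ATTN_INTERVAL = 64, 4

pattern = ["softmax" if (i + 1) % FULL_ATTN_INTERVAL == 0 else "deltanet"
           for i in range(NUM_LAYERS)]
print(pattern.count("deltanet"), pattern.count("softmax"))  # 48 16
```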
KV Cache Budget
Per-token KV cache cost (only 16 softmax layers):
- FP16: 4 KV heads x 256 dim x 2 (K+V) x 2 bytes x 16 layers = 64 KB/token
- FP8: 32 KB/token
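The per-token arithmetic above can be checked directly (a small sketch using the numbers from this section):

```python
# Only the 16 softmax layers hold KV cache; DeltaNet layers keep a
# fixed-size recurrent state instead.
KV_HEADS, HEAD_DIM, SOFTMAX_LAYERS = 4, 256, 16

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # K and V each store KV_HEADS * HEAD_DIM elements per layer.
    return KV_HEADS * HEAD_DIM * 2 * bytes_per_elem * SOFTMAX_LAYERS

print(kv_bytes_per_token(2) // 1024, "KB/token (FP16)")  # 64
print(kv_bytes_per_token(1) // 1024, "KB/token (FP8)")   # 32

# With FP8 KV cache, 8 concurrent 262144-token requests need 64 GiB,
# which fits in the ~71 GiB available on an RTX PRO 6000.
budget_gib = 8 * 262144 * kv_bytes_per_token(1) / 2**30
print(round(budget_gib), "GiB")  # 64
```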
| GPU | Available for KV | Max Context (FP8 KV) |
|---|---|---|
| RTX 5090 (32 GB) | ~7 GiB | Locally validated with FP8 KV cache up to the full configured 262K context using vLLM 0.17.1 + the PR #36325 SM120 TMA patch |
| RTX PRO 6000 (96 GB) | ~71 GiB | 8 concurrent requests × 262K tokens each |
Usage
Serving with vLLM (recommended)
RTX PRO 6000 / high-VRAM GPUs (>= 48 GB) — works out of the box with recent vLLM releases:
pip install "vllm>=0.17.0"
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
--max-model-len 262144 \
--reasoning-parser qwen3
RTX 5090 / consumer Blackwell (32 GB) — validated locally with vllm==0.17.1, transformers==5.3.0, and only the PR #36325 SM120 TMA patch:
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8_e4m3 \
--max-num-seqs 1 \
--skip-mm-profiling \
--reasoning-parser qwen3
Note: the exact validated local environment used vllm==0.17.1, transformers==5.3.0, and the one-line SM120 TMA patch above. In that setup, the model launched and generated coherently on a single RTX 5090 without --enforce-eager.
MTP / Speculative Decoding
This repo now publishes the Qwen3.5 MTP sidecar as model.mtp.safetensors.
Without MTP:
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
--reasoning-parser qwen3
With MTP:
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
--reasoning-parser qwen3 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
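The --speculative-config value is a plain JSON object; a launcher script might parse and sanity-check it before handing it to vLLM (a minimal sketch, not vLLM's own parsing):

```python
import json

# The same JSON string passed on the command line above.
spec = json.loads('{"method":"mtp","num_speculative_tokens":1}')
print(spec["method"], spec["num_speculative_tokens"])  # mtp 1
```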
Transformers (direct loading)
from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch
model = Qwen3_5ForConditionalGeneration.from_pretrained(
"mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
"mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4",
trust_remote_code=True,
)
Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | compressed-tensors NVFP4 with Blackwell FP4 acceleration. A single RTX 5090 was locally validated with vllm==0.17.1, transformers==5.3.0, and only the PR #36325 SM120 TMA patch. |
| transformers >= 5.3.0 | Yes | Direct loading with device_map="auto" |
| SGLang | No | compressed-tensors NVFP4 not supported (only ModelOpt format) |
| llama.cpp / GGUF | No | NVFP4 format not supported |
Hardware Requirements
| Configuration | VRAM | Notes |
|---|---|---|
| Minimum | 28 GB | Weights only, minimal context |
| RTX 5090 | 32 GB | Locally validated at full 262144 context on a single card with vllm==0.17.1, transformers==5.3.0, FP8 KV cache, max_num_seqs=1, --skip-mm-profiling, and the PR #36325 SM120 TMA patch |
| RTX PRO 6000 (recommended) | 96 GB | 8 concurrent × 262K context with FP8 KV cache. Works out of the box with vLLM 0.17.0 |
| 2x RTX 5090 | 64 GB | Tensor parallel, full context |
Known Issues
Older discussion posts about missing preprocessor_config.json, video_preprocessor_config.json, or incompatible tokenizer metadata are stale. Those files are now present in the repo.
The single-5090 validation reported above used the latest release available at the time of testing, vllm==0.17.1, plus only the SM120 TMA patch from PR #36325. That was enough to launch and generate on a single RTX 5090 at the full configured 262144 context, including with Qwen3.5 MTP enabled.
Current status (validated on 2026-03-19):
- RTX PRO 6000 works out of the box with recent vLLM releases.
- A single RTX 5090 was locally validated with vllm==0.17.1, transformers==5.3.0, FP8 KV cache, --max-num-seqs 1, --skip-mm-profiling, and the PR #36325 SM120 TMA patch.
- In that setup, MTP speculative decoding launched and generated coherently at 8192, 12288, 18432, 27648, 41472, 62208, 93312, 139968, 209952, and 262144.
- The next 1.5x step, 314928, failed before launch because it exceeds the model's configured max_position_embeddings=262144.
Notes:
- The exact local environment also used transformers==5.3.0; this is recorded here because the test machine hit a local import mismatch with transformers==4.57.6.
- The 5090 validation described above did not use --enforce-eager.
Source Model
This is a quantization of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which is an SFT fine-tune of Qwen/Qwen3.5-27B using Claude 4.6 Opus reasoning distillation data.
Training datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- TeichAI/claude-4.5-opus-high-reasoning-250x
- Jackrong/Qwen3.5-reasoning-700x
Quantization Details
- Tool: llm-compressor v0.10.1
- Format: compressed-tensors (mixed FP8 + NVFP4)
- Scheme: Two quantization groups — FP8 W8A8 (per-channel weights, per-token activations) for sensitive layers; NVFP4 W4A4 (FP4 E2M1, group_size=16) for error-tolerant layers; BF16 for embeddings, head, norms, and small projections
- Calibration: Self-generated reasoning traces (512 samples, 8 categories)
- Hardware: NVIDIA RTX PRO 6000 Blackwell (96 GB)
- Time: ~62 minutes (calibration pass)
License
Apache 2.0, following the base model license.