Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4

Mixed-precision quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled — a Claude 4.6 Opus reasoning-distilled Qwen3.5-27B model.

~25 GB on disk (~23.7 GiB in VRAM). Recommended GPU: NVIDIA RTX PRO 6000 (96 GB), which works out of the box at full 262K context. The repo also includes preprocessor_config.json, video_preprocessor_config.json, corrected tokenizer metadata, and the Qwen3.5 MTP sidecar (model.mtp.safetensors). On a single RTX 5090 (32 GB), this checkpoint was locally validated on 2026-03-19 with vLLM 0.17.1, transformers 5.3.0, and only the SM120 TMA patch from PR #36325; MTP speculative decoding launched and generated coherently up to the full configured 262,144 context without --enforce-eager.

RTX 5090 Setup Guide (single GPU, 32 GB)

If you have a GPU with >= 48 GB VRAM (RTX PRO 6000, A100, H100, etc.), skip this section.

Exact setup validated locally on 2026-03-19

  • vllm==0.17.1
  • transformers==5.3.0
  • only the SM120 TMA patch from PR #36325
  • no PR branch install
  • no --enforce-eager
  • validated on a single RTX 5090 with CUDA_DEVICE_ORDER=PCI_BUS_ID and CUDA_VISIBLE_DEVICES=1

Step 1: Install the validated runtime

pip install -U vllm==0.17.1
pip install transformers==5.3.0

Step 2: Apply the SM120 TMA patch

The exact patch used in the validated run restricts Triton's TMA path to compute capabilities 9-11 and keeps it off on SM120 (consumer Blackwell):

UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))")

sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"

echo "Patched: $UTILS_FILE"
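
The patched condition can be sanity-checked in isolation. A minimal sketch (capability majors only) mirroring the gate the sed edit installs:

```python
# Standalone mirror of the patched gate in vLLM's fla/ops/utils.py:
# originally `capability >= 9`, patched to `9 <= capability < 12`.
def tma_gate(capability_major: int) -> bool:
    """Return True when the patched condition would enable Triton's TMA path."""
    return 9 <= capability_major < 12

assert tma_gate(9) is True    # Hopper (SM90): TMA stays on
assert tma_gate(12) is False  # consumer Blackwell (SM120): TMA disabled
assert tma_gate(8) is False   # pre-Hopper: off, as before the patch
```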

Step 3: Run

Without MTP:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 \
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

With MTP:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 \
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Notes:

  • --gpu-memory-utilization 0.85, --kv-cache-dtype fp8_e4m3, --max-num-seqs 1, and --skip-mm-profiling were all part of the validated single-5090 setup.
  • --enforce-eager was not required in this local validation.
  • MTP was successfully launched and tested at 8192, 12288, 18432, 27648, 41472, 62208, 93312, 139968, 209952, and 262144.
  • The first failing step in the 1.5x sweep, 314928, was rejected because it exceeds the model's configured max_position_embeddings=262144; it was not a 5090 VRAM/runtime failure.
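
The tested context lengths above follow a 1.5x sweep from 8192, capped at the configured maximum. A small sketch that reproduces the schedule:

```python
def context_sweep(start=8192, cap=262144, num=3, den=2):
    """1.5x geometric sweep of context lengths, capped at the model maximum."""
    steps, n = [], start
    while n < cap:
        steps.append(n)
        n = n * num // den  # exact 1.5x in integer arithmetic
    steps.append(cap)       # final step is the configured maximum itself
    return steps

steps = context_sweep()
# The step after 209952 would be 314928, which exceeds max_position_embeddings
assert steps[-2] * 3 // 2 == 314928
```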

Quantization Strategy

Non-uniform mixed-precision quantization using llm-compressor v0.10.1, with layer-level precision assignment based on architectural sensitivity analysis of the Qwen3.5 hybrid DeltaNet architecture. The quantization uses the compressed-tensors format with two quantization groups plus unquantized layers:

| Precision | Layers | Rationale |
|---|---|---|
| FP8 W8A8 (per-channel weights, per-token dynamic activations) | DeltaNet projections (in_proj_qkv, in_proj_z, out_proj), softmax q_proj/k_proj/v_proj, MLP down_proj | Sensitive layers: QKV projections directly shape attention computation, DeltaNet accumulates error in recurrent state, down_proj is the MLP accumulation point. Uniform QKV precision is required for fused QKV weight loading. |
| NVFP4 W4A4 (FP4 E2M1 weights + activations, FP8 E4M3 per-group scales, group_size=16) | Softmax o_proj, MLP gate_proj/up_proj | Error-tolerant layers: o_proj is a post-attention reduction; gate/up are pre-activation (errors dampened by SiLU gating) |
| BF16 (unquantized) | lm_head, embed_tokens, DeltaNet small projections (in_proj_a, in_proj_b), all norms, visual encoder | Industry consensus: lm_head amplifies errors across the 248K vocabulary; embed_tokens is a lookup table; DeltaNet low-rank projections are numerically sensitive; the vision tower is retained at full precision |

Weight Breakdown

| Component | Size | Precision |
|---|---|---|
| MLP | 11.4 GB | FP8 + NVFP4 |
| DeltaNet attention | 5.6 GB | FP8 + BF16 |
| lm_head | 2.5 GB | BF16 |
| embed_tokens | 2.5 GB | BF16 |
| Softmax attention | 1.4 GB | FP8 + NVFP4 |
| Visual encoder | 0.9 GB | BF16 (vision tower retained; text-only calibration, so vision capability is not separately validated post-quantization) |
| Quantization scales | 0.8 GB | Mixed |
| Total | ~25 GB | |

Self-Calibration

Calibration data was generated by the model itself (512 reasoning traces across 8 categories: math, code, logic, analysis, creative writing, general knowledge, tool calling, Korean). Self-calibration produces traces that match the model's actual activation distributions during reasoning, yielding better quantization accuracy than generic web text for reasoning-distilled models.

  • Calibration samples: 512
  • Mean sequence length: 2,180 tokens
  • Max sequence length: 4,096 tokens
  • Generation: internal offline generation for calibration traces. SGLang is not a supported serving runtime for this checkpoint.
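
The exact per-category prompt mix is not published; assuming an even 64-per-category split across the 8 categories for illustration, the sample plan looks like:

```python
# Illustrative only: the 8 calibration categories from this card, with an
# ASSUMED even 64-per-category split totalling 512 samples.
from collections import Counter

CATEGORIES = ["math", "code", "logic", "analysis", "creative writing",
              "general knowledge", "tool calling", "Korean"]

def calibration_plan(total=512, categories=CATEGORIES):
    """Assign each calibration sample a category, round-robin."""
    return [categories[i % len(categories)] for i in range(total)]

plan = calibration_plan()
counts = Counter(plan)
assert len(plan) == 512 and all(v == 64 for v in counts.values())
```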

Architecture

Qwen3.5-27B uses a hybrid DeltaNet + softmax attention architecture with full_attention_interval=4:

Layer pattern (64 layers):
  [DeltaNet, DeltaNet, DeltaNet, Softmax] × 16
  = 48 DeltaNet layers + 16 softmax attention layers

Key architectural parameters:

  • Hidden size: 5,120
  • Attention heads: 24 (query), 4 (KV, GQA)
  • Head dimension: 256
  • DeltaNet heads: 16 key, 48 value (dim 128 each)
  • MLP intermediate: 17,408
  • Vocabulary: 248,320
  • Max position embeddings: 262,144

Only 16 of 64 layers require a KV cache; the 48 DeltaNet layers use a fixed-size recurrent state that does not grow with sequence length. For the same KV-cache memory budget, this gives ~4x more context capacity than a standard transformer of the same size.
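
The 48/16 split falls directly out of full_attention_interval=4; a small sketch that reconstructs the schedule:

```python
# Reconstruct the 64-layer schedule: with full_attention_interval=4,
# every 4th layer is softmax attention and the rest are DeltaNet.
def layer_schedule(num_layers=64, full_attention_interval=4):
    return ["Softmax" if (i + 1) % full_attention_interval == 0 else "DeltaNet"
            for i in range(num_layers)]

layers = layer_schedule()
assert layers[:4] == ["DeltaNet", "DeltaNet", "DeltaNet", "Softmax"]
assert layers.count("DeltaNet") == 48 and layers.count("Softmax") == 16
```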

KV Cache Budget

Per-token KV cache cost (only 16 softmax layers):

  • FP16: 4 KV heads x 256 dim x 2 (K+V) x 2 bytes x 16 layers = 64 KB/token
  • FP8: 32 KB/token

| GPU | Available for KV | Max context (FP8 KV) |
|---|---|---|
| RTX 5090 (32 GB) | ~7 GiB | Locally validated with FP8 KV cache up to the full configured 262K context using vLLM 0.17.1 + the PR #36325 SM120 TMA patch |
| RTX PRO 6000 (96 GB) | ~71 GiB | 8 concurrent requests × 262K tokens each |
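
The per-token figures can be re-derived from the attention shape (4 KV heads, head dim 256, 16 softmax layers); a quick arithmetic check:

```python
# Per-token KV-cache cost for the 16 softmax layers (GQA: 4 KV heads x 256 dim)
kv_heads, head_dim, softmax_layers = 4, 256, 16
fp16_bytes_per_token = kv_heads * head_dim * 2 * 2 * softmax_layers  # K+V, 2 bytes each
fp8_bytes_per_token = fp16_bytes_per_token // 2

assert fp16_bytes_per_token == 64 * 1024   # 64 KB/token
assert fp8_bytes_per_token == 32 * 1024    # 32 KB/token
tokens_per_gib_fp8 = (1 << 30) // fp8_bytes_per_token  # ~32K tokens per GiB of KV
```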

Usage

Serving with vLLM (recommended)

RTX PRO 6000 / high-VRAM GPUs (>= 48 GB) — works out of the box with recent vLLM releases:

pip install "vllm>=0.17.0"

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

RTX 5090 / consumer Blackwell (32 GB) — validated locally with vllm==0.17.1, transformers==5.3.0, and only the PR #36325 SM120 TMA patch:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

Note: The exact validated local environment used vllm==0.17.1, transformers==5.3.0, and the one-line SM120 TMA patch above. In that setup, the model launched and generated coherently on a single RTX 5090 without --enforce-eager.

MTP / Speculative Decoding

This repo now publishes the Qwen3.5 MTP sidecar as model.mtp.safetensors.

Without MTP:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
  --reasoning-parser qwen3

With MTP:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
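
Either server exposes an OpenAI-compatible HTTP API. A minimal client sketch, assuming vLLM's default port 8000 (adjust base_url if you passed --port):

```python
# Minimal OpenAI-compatible chat client for the vLLM server started above.
import json
import urllib.request

MODEL = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4"

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Why is the sky blue? Think step by step."}],
    "max_tokens": 1024,
}

def chat(base_url="http://localhost:8000"):
    """POST one chat request and return the assistant message text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```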

Transformers (direct loading)

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4",
    trust_remote_code=True,
)

Compatibility

| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | compressed-tensors NVFP4 with Blackwell FP4 acceleration. A single RTX 5090 was locally validated with vllm==0.17.1, transformers==5.3.0, and only the PR #36325 SM120 TMA patch. |
| transformers >= 5.3.0 | Yes | Direct loading with device_map="auto" |
| SGLang | No | compressed-tensors NVFP4 not supported (ModelOpt format only) |
| llama.cpp / GGUF | No | NVFP4 format not supported |

Hardware Requirements

| Configuration | VRAM | Notes |
|---|---|---|
| Minimum | 28 GB | Weights only, minimal context |
| RTX 5090 | 32 GB | Locally validated at the full 262144 context on a single card with vllm==0.17.1, transformers==5.3.0, FP8 KV cache, --max-num-seqs 1, --skip-mm-profiling, and the PR #36325 SM120 TMA patch |
| RTX PRO 6000 (recommended) | 96 GB | 8 concurrent × 262K context with FP8 KV cache; works out of the box with vLLM 0.17.0 |
| 2x RTX 5090 | 64 GB | Tensor parallel, full context |

Known Issues

Older discussion posts about missing preprocessor_config.json, video_preprocessor_config.json, or incompatible tokenizer metadata are stale. Those files are now present in the repo.

The single-5090 validation reported above used the latest release available at the time of testing, vllm==0.17.1, plus only the SM120 TMA patch from PR #36325. That was enough to launch and generate on a single RTX 5090 at the full configured 262144 context, including with Qwen3.5 MTP enabled.

Current status (validated on 2026-03-19):

  • RTX PRO 6000 works out of the box with recent vLLM releases.
  • A single RTX 5090 was locally validated with vllm==0.17.1, transformers==5.3.0, FP8 KV cache, --max-num-seqs 1, --skip-mm-profiling, and the PR #36325 SM120 TMA patch.
  • In that setup, MTP speculative decoding launched and generated coherently at 8192, 12288, 18432, 27648, 41472, 62208, 93312, 139968, 209952, and 262144.
  • The next 1.5x step, 314928, failed before launch because it exceeds the model's configured max_position_embeddings=262144.

Notes:

  • The exact local environment also used transformers==5.3.0, pinned here because the test machine hit an import mismatch with transformers==4.57.6.
  • The 5090 validation described above did not use --enforce-eager.

Source Model

This is a quantization of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which is an SFT fine-tune of Qwen/Qwen3.5-27B using Claude 4.6 Opus reasoning distillation data.

Quantization Details

  • Tool: llm-compressor v0.10.1
  • Format: compressed-tensors (mixed FP8 + NVFP4)
  • Scheme: Two quantization groups — FP8 W8A8 (per-channel weights, per-token activations) for sensitive layers; NVFP4 W4A4 (FP4 E2M1, group_size=16) for error-tolerant layers; BF16 for embeddings, head, norms, and small projections
  • Calibration: Self-generated reasoning traces (512 samples, 8 categories)
  • Hardware: NVIDIA RTX PRO 6000 Blackwell (96 GB)
  • Time: ~62 minutes (calibration pass)
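
The three-group scheme can be summarized as a plain mapping. This is illustrative only, mirroring the Quantization Strategy table above; it is not an llm-compressor recipe:

```python
# Illustrative summary of the precision groups (NOT an llm-compressor recipe).
SCHEME = {
    "fp8_w8a8": {   # per-channel weights, per-token dynamic activations
        "layers": ["in_proj_qkv", "in_proj_z", "out_proj",
                   "q_proj", "k_proj", "v_proj", "down_proj"],
    },
    "nvfp4_w4a4": {  # FP4 E2M1 weights+activations, FP8 E4M3 scales, group_size=16
        "group_size": 16,
        "layers": ["o_proj", "gate_proj", "up_proj"],
    },
    "bf16": {        # left unquantized
        "layers": ["lm_head", "embed_tokens", "in_proj_a", "in_proj_b"],
    },
}

# Sanity check: each layer belongs to exactly one precision group.
all_layers = [name for group in SCHEME.values() for name in group["layers"]]
assert len(all_layers) == len(set(all_layers))
```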

License

Apache 2.0, following the base model license.
