Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4

Mixed-precision quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.

This checkpoint keeps the same hybrid DeltaNet + softmax-attention architecture and Qwen3.5 MTP head as the BF16 source, and applies the same NVFP4/FP8/BF16 mixed-precision recipe that worked well on the earlier v1 release.

The published folder includes:

  • model.safetensors
  • model.safetensors.index.json
  • model.mtp.safetensors
  • processor_config.json
  • preprocessor_config.json
  • video_preprocessor_config.json

Verified Inference

Local inference was verified on 2026-03-26 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • vllm==0.17.1
  • transformers==5.3.0

Patch note:

  • The one-line vllm patch from the v1 README for the Blackwell/TMA issue may still be required: apply it if your local vllm build does not already include that fix.

Concrete patch command:

UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))") && sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"

The exact validated command for MTP-enabled serving was:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

The same model also serves without MTP:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

What was verified in that run:

  • the server started cleanly
  • GET /health returned 200
  • GET /v1/models returned the model
  • POST /v1/chat/completions returned 200
  • MTP/speculative decoding was active and reported acceptance metrics in the server logs
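
The endpoint checks above can be reproduced with a short stdlib-only script. The base URL assumes a default local `vllm serve` launch; adjust it if you pass `--host`/`--port`. Note that MTP acceptance metrics only appear in the server logs, so that last item still has to be checked there.

```python
import json
import urllib.request

# Default local vllm serve endpoint (assumption).
BASE = "http://localhost:8000"
MODEL = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4"

def chat_payload(model, prompt, max_tokens=64):
    """Build a minimal /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def verify(base=BASE, model=MODEL):
    """Run the three endpoint checks listed above against a live server."""
    with urllib.request.urlopen(f"{base}/health") as resp:
        assert resp.status == 200
    with urllib.request.urlopen(f"{base}/v1/models") as resp:
        listed = [m["id"] for m in json.load(resp)["data"]]
        assert model in listed
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(chat_payload(model, "What is 2+2?")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        assert resp.status == 200
```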

Quantization Strategy

Non-uniform mixed-precision quantization using llm-compressor with the same recipe that worked on the earlier v1 model:

| Precision | Layers |
| --- | --- |
| FP8 W8A8 | DeltaNet in_proj_qkv, in_proj_z, out_proj; softmax q_proj/k_proj/v_proj; MLP down_proj |
| NVFP4 W4A4 | softmax o_proj; MLP gate_proj/up_proj |
| BF16 | lm_head, embed_tokens, DeltaNet in_proj_a/in_proj_b, norms, visual encoder, MTP sidecar |
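
One way to spot-check this layout without loading any weights is to read the safetensors header directly: the format is an 8-byte little-endian length followed by a JSON header that records each tensor's dtype. (NVFP4 weights typically appear as packed U8 tensors under compressed-tensors; the tensor names follow the checkpoint, not this sketch.)

```python
import json
import struct

def tensor_dtypes(path):
    """Map tensor name -> dtype string by reading only the safetensors header.

    A safetensors file starts with an 8-byte little-endian header length
    followed by a JSON header, so no tensor data needs to be loaded.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return {
        name: info["dtype"]
        for name, info in header.items()
        if name != "__metadata__"  # optional metadata entry, not a tensor
    }
```

Running this over model.safetensors should show F8_E4M3 on the FP8 rows of the table, packed U8 weights (plus scales) for the NVFP4 rows, and BF16 elsewhere.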

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers
  • full_attention_interval=4
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
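
A quick way to confirm a downloaded copy matches this shape is to diff the fields against config.json. The key names below are assumptions based on the list above; in the multimodal config some of them may live under a nested text config.

```python
import json

# Expected architecture fields from the list above; key names are assumed.
EXPECTED = {
    "model_type": "qwen3_5",
    "num_hidden_layers": 64,
    "full_attention_interval": 4,
    "mtp_num_hidden_layers": 1,
    "max_position_embeddings": 262144,
}

def mismatches(config):
    """Return {field: (expected, actual)} for every field that differs."""
    return {
        key: (want, config.get(key))
        for key, want in EXPECTED.items()
        if config.get(key) != want
    }

def check_config(path):
    """Load a config.json and report mismatched architecture fields."""
    with open(path) as f:
        return mismatches(json.load(f))
```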

Local Benchmark Slice

Local quality comparison against the BF16 v2 source model, run on 2026-03-26 with the same offline vLLM evaluation flow on a single RTX PRO 6000.

These are reduced verification slices, not full leaderboard runs.

| Benchmark | BF16 v2 | NVFP4 v2 |
| --- | --- | --- |
| GSM8K Platinum (50) | 50/50 = 100.0% | 50/50 = 100.0% |
| AIME 2025 (30) | 8/30 = 26.7% | 7/30 = 23.3% |

Average generated tokens in the same slice:

| Benchmark | BF16 v2 | NVFP4 v2 |
| --- | --- | --- |
| GSM8K Platinum (50) | 581 | 572 |
| AIME 2025 (30) | 2596 | 2990 |

Usage

vLLM

pip install -U vllm==0.17.1 transformers==5.3.0

With MTP:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Without MTP:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
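
Once either server is up, the OpenAI-compatible endpoint can be queried directly. With --reasoning-parser qwen3, vLLM returns the chain of thought in a separate reasoning_content field next to the usual content; the base URL below assumes a default local launch.

```python
import json
import urllib.request

BASE = "http://localhost:8000/v1"  # default local vllm serve address (assumption)
MODEL = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4"

def split_reply(choice):
    """Separate the parsed reasoning trace from the final answer."""
    message = choice["message"]
    return message.get("reasoning_content"), message.get("content")

def chat(prompt, base=BASE, model=MODEL):
    """Send one chat turn and return (reasoning, answer)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return split_reply(json.load(resp)["choices"][0])
```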

Transformers

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4",
    trust_remote_code=True,
)
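
Continuing from the model and tokenizer loaded above, a minimal generation call looks like this; the prompt and sampling settings are illustrative, not tuned.

```python
# Continues from the model/tokenizer loaded above; settings are illustrative.
messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```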

Compatibility

| Framework | Supported | Notes |
| --- | --- | --- |
| vLLM >= 0.17.0 | Yes | Verified locally with vllm==0.17.1; MTP works when model.mtp.safetensors and processor files are present |
| transformers >= 5.3.0 | Yes | Direct loading works with device_map="auto" |
| SGLang | No | compressed-tensors NVFP4 is not a supported path there |

Notes

  • This v2 export required the Qwen3.5 MTP sidecar and processor metadata to be materialized in the output folder for vLLM compatibility.
  • The old v1 one-line vllm patch may still be required if you encounter the same Blackwell/TMA issue on your own setup.
  • The benchmark numbers above are intended as a sanity-check comparison against the BF16 source, not as exhaustive leaderboard submissions.