Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4

Mixed-precision quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.

This checkpoint keeps the same hybrid DeltaNet + softmax-attention architecture and Qwen3.5 MTP head as the BF16 source, and applies the same NVFP4/FP8/BF16 mixed-precision recipe that worked well on the earlier v1 release.

The published folder includes:

  • model.safetensors
  • model.safetensors.index.json
  • model.mtp.safetensors
  • processor_config.json
  • preprocessor_config.json
  • video_preprocessor_config.json

Verified Inference

Local inference was verified on 2026-03-26 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • vllm==0.17.1
  • transformers==5.3.0

Patch note:

  • The one-line vllm patch from the v1 README for the Blackwell/TMA issue may still be required: apply it if your local vllm build does not already include that fix.

Concrete patch command:

UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))") && sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"

The exact validated command for MTP-enabled serving was:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

The same model also serves without MTP:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

What was verified in that run:

  • the server started cleanly
  • GET /health returned 200
  • GET /v1/models returned the model
  • POST /v1/chat/completions returned 200
  • MTP/speculative decoding was active and reported acceptance metrics in the server logs
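
The endpoint checks above can be reproduced with a short stdlib-only script. The base URL assumes a default local `vllm serve` launch; adjust it if you pass `--host`/`--port`. Note that MTP acceptance metrics only appear in the server logs, so that last item still has to be checked there.

```python
import json
import urllib.request

# Default local vllm serve endpoint (assumption).
BASE = "http://localhost:8000"
MODEL = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4"

def chat_payload(model, prompt, max_tokens=64):
    """Build a minimal /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def verify(base=BASE, model=MODEL):
    """Run the three endpoint checks listed above against a live server."""
    with urllib.request.urlopen(f"{base}/health") as resp:
        assert resp.status == 200
    with urllib.request.urlopen(f"{base}/v1/models") as resp:
        listed = [m["id"] for m in json.load(resp)["data"]]
        assert model in listed
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(chat_payload(model, "What is 2+2?")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        assert resp.status == 200
```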

Quantization Strategy

Non-uniform mixed-precision quantization using llm-compressor with the same recipe that worked on the earlier v1 model:

| Precision | Layers |
| --- | --- |
| FP8 W8A8 | DeltaNet in_proj_qkv, in_proj_z, out_proj; softmax q_proj/k_proj/v_proj; MLP down_proj |
| NVFP4 W4A4 | softmax o_proj; MLP gate_proj/up_proj |
| BF16 | lm_head, embed_tokens, DeltaNet in_proj_a/in_proj_b, norms, visual encoder, MTP sidecar |
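
One way to spot-check this layout without loading any weights is to read the safetensors header directly: the format is an 8-byte little-endian length followed by a JSON header that records each tensor's dtype. (NVFP4 weights typically appear as packed U8 tensors under compressed-tensors; the tensor names follow the checkpoint, not this sketch.)

```python
import json
import struct

def tensor_dtypes(path):
    """Map tensor name -> dtype string by reading only the safetensors header.

    A safetensors file starts with an 8-byte little-endian header length
    followed by a JSON header, so no tensor data needs to be loaded.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return {
        name: info["dtype"]
        for name, info in header.items()
        if name != "__metadata__"  # optional metadata entry, not a tensor
    }
```

Running this over model.safetensors should show F8_E4M3 on the FP8 rows of the table, packed U8 weights (plus scales) for the NVFP4 rows, and BF16 elsewhere.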

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers
  • full_attention_interval=4
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
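
A quick way to confirm a downloaded copy matches this shape is to diff the fields against config.json. The key names below are assumptions based on the list above; in the multimodal config some of them may live under a nested text config.

```python
import json

# Expected architecture fields from the list above; key names are assumed.
EXPECTED = {
    "model_type": "qwen3_5",
    "num_hidden_layers": 64,
    "full_attention_interval": 4,
    "mtp_num_hidden_layers": 1,
    "max_position_embeddings": 262144,
}

def mismatches(config):
    """Return {field: (expected, actual)} for every field that differs."""
    return {
        key: (want, config.get(key))
        for key, want in EXPECTED.items()
        if config.get(key) != want
    }

def check_config(path):
    """Load a config.json and report mismatched architecture fields."""
    with open(path) as f:
        return mismatches(json.load(f))
```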

Local Benchmark Slice

Local quality comparison against the BF16 v2 source model, run on 2026-03-26 with the same offline vLLM evaluation flow on a single RTX PRO 6000.

These are reduced verification slices, not full leaderboard runs.

| Benchmark | BF16 v2 | NVFP4 v2 |
| --- | --- | --- |
| GSM8K Platinum (50) | 50/50 = 100.0% | 50/50 = 100.0% |
| AIME 2025 (30) | 8/30 = 26.7% | 7/30 = 23.3% |

Average generated tokens in the same slice:

| Benchmark | BF16 v2 | NVFP4 v2 |
| --- | --- | --- |
| GSM8K Platinum (50) | 581 | 572 |
| AIME 2025 (30) | 2596 | 2990 |

Usage

vLLM

pip install -U vllm==0.17.1 transformers==5.3.0

With MTP:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Without MTP:

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3
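
Once either server is up, the OpenAI-compatible endpoint can be queried directly. With --reasoning-parser qwen3, vLLM returns the chain of thought in a separate reasoning_content field next to the usual content; the base URL below assumes a default local launch.

```python
import json
import urllib.request

BASE = "http://localhost:8000/v1"  # default local vllm serve address (assumption)
MODEL = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4"

def split_reply(choice):
    """Separate the parsed reasoning trace from the final answer."""
    message = choice["message"]
    return message.get("reasoning_content"), message.get("content")

def chat(prompt, base=BASE, model=MODEL):
    """Send one chat turn and return (reasoning, answer)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return split_reply(json.load(resp)["choices"][0])
```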

Transformers

from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-NVFP4",
    trust_remote_code=True,
)
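
Continuing from the model and tokenizer loaded above, a minimal generation call looks like this; the prompt and sampling settings are illustrative, not tuned.

```python
# Continues from the model/tokenizer loaded above; settings are illustrative.
messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```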

Compatibility

| Framework | Supported | Notes |
| --- | --- | --- |
| vLLM >= 0.17.0 | Yes | Verified locally with vllm==0.17.1; MTP works when model.mtp.safetensors and processor files are present |
| transformers >= 5.3.0 | Yes | Direct loading works with device_map="auto" |
| SGLang | No | compressed-tensors NVFP4 is not a supported path there |

Notes

  • This v2 export required the Qwen3.5 MTP sidecar and processor metadata to be materialized in the output folder for vLLM compatibility.
  • The old v1 one-line vllm patch may still be required if you encounter the same Blackwell/TMA issue on your own setup.
  • The benchmark numbers above are intended as a sanity-check comparison against the BF16 source, not as exhaustive leaderboard submissions.