# Qwen3.5-9B-Sculpt-Throughput

12% FFN compression with live teacher distillation. Drop-in replacement — no custom kernels, no runtime changes.

Dystrio Sculpt structurally compresses transformer FFN layers, producing dense models that load with standard transformers.

This is the Throughput tier of Qwen3.5-9B.

Use case: Local/throughput — speed sweet spot (1.25x prefill)

## Benchmark Results (lm_eval)

| Model | MMLU | HellaSwag | ARC-C | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|
| Qwen3.5-9B (baseline) | 78.7 | 78.1 | 55.6 | 53.7 | 73.0 | 87.3 |
| Sculpt Default (kf=0.95) | 76.2 (↓2.5) | 75.8 (↓2.3) | 56.4 (↑0.8) | 52.6 (↓1.1) | 68.7 (↓4.3) | 81.5 (↓5.8) |
| Sculpt Production (kf=0.9) | 73.9 (↓4.8) | 75.1 (↓3.0) | 56.8 (↑1.2) | 47.3 (↓6.4) | 69.8 (↓3.2) | 74.5 (↓12.8) |
| Sculpt Throughput (kf=0.88) | 70.8 (↓7.9) | 74.0 (↓4.1) | 57.2 (↑1.6) | 52.0 (↓1.7) | 70.7 (↓2.3) | 69.6 (↓17.7) |
| Sculpt Experimental (kf=0.82) | 70.2 (↓8.5) | 70.7 (↓7.4) | 53.6 (↓2.0) | 47.6 (↓6.1) | 66.6 (↓6.4) | 54.7 (↓32.6) |
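
The numbers above come from lm_eval (EleutherAI's lm-evaluation-harness). The exact harness version, task variants, and few-shot settings aren't recorded on this card, so treat the snippet below as a starting point for reproduction rather than the exact recipe; task names are the harness's standard identifiers and are an assumption here.

```python
# Sketch: re-run the headline benchmarks with lm-evaluation-harness (pip install lm-eval).
# Few-shot counts and task versions are assumptions; align them with the
# original evaluation settings before comparing against the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=dystrio/Qwen3.5-9B-Sculpt-Throughput,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge",
           "truthfulqa_mc2", "winogrande", "gsm8k"],
)
for task, metrics in results["results"].items():
    print(task, metrics)
```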

## This Model vs Baseline

| Benchmark | Throughput | Baseline | Delta |
|---|---|---|---|
| arc_challenge | 57.2 | 55.6 | +1.6 |
| gsm8k | 69.6 | 87.3 | -17.7 |
| hellaswag | 74.0 | 78.1 | -4.1 |
| mmlu | 70.8 | 78.7 | -7.9 |
| mmlu_abstract_algebra | 47.0 | 66.0 | -19.0 |
| mmlu_anatomy | 68.9 | 77.8 | -8.9 |
| mmlu_astronomy | 82.9 | 92.8 | -9.9 |
| mmlu_business_ethics | 73.0 | 82.0 | -9.0 |
| mmlu_clinical_knowledge | 77.7 | 86.8 | -9.1 |
| mmlu_college_biology | 83.3 | 93.1 | -9.8 |
| mmlu_college_chemistry | 52.0 | 59.0 | -7.0 |
| mmlu_college_computer_science | 64.0 | 82.0 | -18.0 |
| mmlu_college_mathematics | 51.0 | 64.0 | -13.0 |
| mmlu_college_medicine | 68.2 | 81.5 | -13.3 |
| mmlu_college_physics | 59.8 | 64.7 | -4.9 |
| mmlu_computer_security | 76.0 | 83.0 | -7.0 |
| mmlu_conceptual_physics | 77.9 | 90.2 | -12.3 |
| mmlu_econometrics | 52.6 | 73.7 | -21.1 |
| mmlu_electrical_engineering | 64.8 | 82.1 | -17.3 |
| mmlu_elementary_mathematics | 61.9 | 80.7 | -18.8 |
| mmlu_formal_logic | 64.3 | 65.9 | -1.6 |
| mmlu_global_facts | 37.0 | 50.0 | -13.0 |
| mmlu_high_school_biology | 87.1 | 93.5 | -6.4 |
| mmlu_high_school_chemistry | 67.0 | 77.8 | -10.8 |
| mmlu_high_school_computer_science | 75.0 | 88.0 | -13.0 |
| mmlu_high_school_european_history | 84.8 | 87.3 | -2.5 |
| mmlu_high_school_geography | 82.3 | 92.4 | -10.1 |
| mmlu_high_school_government_and_politics | 89.1 | 96.9 | -7.8 |
| mmlu_high_school_macroeconomics | 75.4 | 85.9 | -10.5 |
| mmlu_high_school_mathematics | 44.1 | 53.3 | -9.2 |
| mmlu_high_school_microeconomics | 85.7 | 93.3 | -7.6 |
| mmlu_high_school_physics | 63.6 | 72.8 | -9.2 |
| mmlu_high_school_psychology | 89.7 | 93.2 | -3.5 |
| mmlu_high_school_statistics | 69.4 | 78.7 | -9.3 |
| mmlu_high_school_us_history | 81.9 | 90.2 | -8.3 |
| mmlu_high_school_world_history | 84.0 | 89.9 | -5.9 |
| mmlu_human_aging | 71.3 | 78.9 | -7.6 |
| mmlu_human_sexuality | 78.6 | 86.3 | -7.7 |
| mmlu_humanities | 65.5 | 70.5 | -5.0 |
| mmlu_international_law | 81.0 | 90.1 | -9.1 |
| mmlu_jurisprudence | 80.6 | 84.3 | -3.7 |
| mmlu_logical_fallacies | 74.2 | 84.7 | -10.5 |
| mmlu_machine_learning | 55.4 | 66.1 | -10.7 |
| mmlu_management | 86.4 | 86.4 | +0.0 |
| mmlu_marketing | 86.8 | 95.7 | -8.9 |
| mmlu_medical_genetics | 82.0 | 91.0 | -9.0 |
| mmlu_miscellaneous | 82.6 | 90.3 | -7.7 |
| mmlu_moral_disputes | 71.1 | 81.2 | -10.1 |
| mmlu_moral_scenarios | 57.4 | 53.3 | +4.1 |
| mmlu_nutrition | 76.1 | 86.3 | -10.2 |
| mmlu_other | 74.2 | 83.1 | -8.9 |
| mmlu_philosophy | 73.3 | 80.4 | -7.1 |
| mmlu_prehistory | 72.5 | 84.3 | -11.8 |
| mmlu_professional_accounting | 54.6 | 65.6 | -11.0 |
| mmlu_professional_law | 53.9 | 60.3 | -6.4 |
| mmlu_professional_medicine | 79.8 | 91.5 | -11.7 |
| mmlu_professional_psychology | 72.1 | 82.8 | -10.7 |
| mmlu_public_relations | 67.3 | 73.6 | -6.3 |
| mmlu_security_studies | 75.1 | 76.7 | -1.6 |
| mmlu_social_sciences | 79.5 | 87.0 | -7.5 |
| mmlu_sociology | 87.6 | 89.1 | -1.5 |
| mmlu_stem | 66.9 | 78.3 | -11.4 |
| mmlu_us_foreign_policy | 86.0 | 90.0 | -4.0 |
| mmlu_virology | 53.0 | 56.6 | -3.6 |
| mmlu_world_religions | 81.9 | 86.5 | -4.6 |
| truthfulqa_mc2 | 52.0 | 53.7 | -1.7 |
| winogrande | 70.7 | 73.0 | -2.3 |

## Performance

| Metric | Sculpt | Baseline | Change |
|---|---|---|---|
| Model size | 15.6 GB | 16.7 GB | -6.2% |
| Parameters | 8,400,155,136 | | |
| Prefill throughput | 5,726 tok/s | 4,566 tok/s | +25% |
| Decode throughput | 36 tok/s | 37 tok/s | -4% |

KV-cache footprint is unchanged — Sculpt only compresses FFN layers, not attention.
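
Because only the FFN width changes, the compression should be visible directly in the model config while the attention geometry matches the baseline. A quick check (a sketch; the field names assume a Qwen-style config):

```python
# Sketch: confirm only the MLP width differs from the baseline config.
# Field names assume a Qwen-style architecture; per-layer pruning may not
# be reflected in a single intermediate_size value.
from transformers import AutoConfig

sculpt = AutoConfig.from_pretrained("dystrio/Qwen3.5-9B-Sculpt-Throughput")
base = AutoConfig.from_pretrained("Qwen/Qwen3.5-9B")

for field in ("hidden_size", "intermediate_size",
              "num_attention_heads", "num_key_value_heads"):
    print(f"{field}: sculpt={getattr(sculpt, field, None)} "
          f"baseline={getattr(base, field, None)}")
```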

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dystrio/Qwen3.5-9B-Sculpt-Throughput",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("dystrio/Qwen3.5-9B-Sculpt-Throughput")

inputs = tokenizer("The future of AI inference is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## All Sculpt Tiers

| Tier | HuggingFace | Config | Use Case |
|---|---|---|---|
| Default | dystrio/Qwen3.5-9B-Sculpt-Default | kf=0.95 | Enterprise: maximum quality preservation |
| Production | dystrio/Qwen3.5-9B-Sculpt-Production | kf=0.9 | Enterprise: balanced quality and efficiency |
| Throughput | dystrio/Qwen3.5-9B-Sculpt-Throughput | kf=0.88 | Local/throughput: speed sweet spot (1.25x prefill) |
| Experimental | dystrio/Qwen3.5-9B-Sculpt-Experimental | kf=0.82 | Local: maximum compression (1.27x prefill) |

## Technical Details

- Method: structural FFN pruning with importance-aware block selection plus live teacher distillation (alpha=0.5); a sketch of the distillation loss follows this list
- Keep fraction: 0.88 (12% of FFN neurons removed)
- Repair: 8-stage cosine-LR fine-tuning with best-checkpoint restore
- Training data: general_v2 mixture (WikiText, OpenHermes 2.5, MMLU, HellaSwag, GSM8K, OpenOrca)
- Hardware: 1x NVIDIA H200 141GB
- Output: standard dense transformer that loads with any HuggingFace-compatible framework
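
Dystrio hasn't published the repair objective itself; the sketch below shows a common way to combine hard-label cross-entropy with a KL term against a live (frozen) teacher at alpha=0.5, matching the setting listed above. Function and argument names here are illustrative, not Dystrio's code.

```python
# Illustrative distillation loss (not Dystrio's actual implementation).
# alpha=0.5 weights the KL-to-teacher term against hard-label cross-entropy.
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    vocab = student_logits.size(-1)
    # Hard-label cross-entropy against the training targets.
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1),
                         ignore_index=-100)
    # Temperature-scaled KL divergence to the frozen teacher's distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab) / temperature, dim=-1),
        F.log_softmax(teacher_logits.view(-1, vocab) / temperature, dim=-1),
        log_target=True, reduction="batchmean",
    ) * temperature ** 2
    return alpha * kl + (1.0 - alpha) * ce
```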

## Compatibility

- HuggingFace Transformers
- vLLM (see the sketch after this list)
- TGI (Text Generation Inference)
- llama.cpp / GGUF conversion
- AWQ / GPTQ quantization
- Any framework that loads standard safetensors
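
Since the checkpoint is a standard dense safetensors model, serving stacks need no Sculpt-specific flags. A minimal vLLM offline-inference sketch:

```python
# Minimal vLLM offline-inference example; the model loads like any dense checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="dystrio/Qwen3.5-9B-Sculpt-Throughput", dtype="bfloat16")
params = SamplingParams(max_tokens=100, temperature=0.7)
outputs = llm.generate(["The future of AI inference is"], params)
print(outputs[0].outputs[0].text)
```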

## Citation

```bibtex
@misc{dystrio_sculpt_2026,
  title={Dystrio Sculpt: Structural Compilation for Transformer LLMs},
  author={Dystrio},
  year={2026},
  url={https://huggingface.co/dystrio}
}
```