Qwen3.5-27B-TQ3_1S

Qwen3.5-27B-TQ3_1S is a GGUF quantization of Qwen/Qwen3.5-27B using TQ3_1S, a 3.5-bit weight format based on:

  • Walsh-Hadamard rotation
  • 8-centroid quantization
  • dual half-block scales
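To make the rotation step concrete, here is a minimal pure-Python sketch of an orthonormal fast Walsh-Hadamard transform. This illustrates the transform itself only; it is not the actual TQ3_1S kernel, and the function name is illustrative.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform.

    len(x) must be a power of two. The transform is its own inverse
    under this scaling, and it preserves the L2 norm, which is why it
    can be used as a cheap rotation before quantization.
    """
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = math.sqrt(n)
    return [v / s for v in x]
```

Because the rotation is norm-preserving and involutive, dequantization can undo it exactly by applying the same transform again.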

This release is aimed at one practical outcome:

  • near-Q4_0 quality
  • about 10% smaller than Q4_0
  • small enough to fit fully on a single 16 GB RTX 5060 Ti in the tested llama.cpp setup

Headline Result

Gold-standard wiki.test.raw perplexity pass, context 512, full 580 chunks:

Format   PPL                Size
Q4_0     7.2431 +/- 0.0482  14.4 GB
TQ3_1S   7.2570 +/- 0.0480  12.9 GB

Measured gap:

  • +0.0139 PPL
  • about 0.19%
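The gap and size figures follow from simple arithmetic on the table values (all numbers taken from this card):

```python
# Values from the headline-result table on this card
ppl_q4_0, ppl_tq3 = 7.2431, 7.2570
size_q4_0, size_tq3 = 14.4, 12.9   # GB

gap = ppl_tq3 - ppl_q4_0           # absolute PPL gap
rel_gap = gap / ppl_q4_0 * 100     # relative gap, percent
saved = size_q4_0 - size_tq3       # GB saved vs Q4_0
shrink = saved / size_q4_0 * 100   # percent smaller than Q4_0

print(round(gap, 4), round(rel_gap, 2), round(saved, 1), round(shrink, 1))
```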

Safe interpretation:

  • TQ3_1S is near-Q4_0 quality
  • TQ3_1S is about 1.5 GB smaller
  • on this 27B model, that size reduction is enough to change deployment on a 16 GB GPU

Important Caveat

This model card does not claim that TQ3_1S is universally faster than native Q4_0 under the same conditions.

The practical speed win in the tested setup comes mainly from:

  • TQ3_1S fitting fully on the GPU
  • Q4_0 not fitting fully on the GPU on the same 16 GB card

So this is primarily a deployment / fit advantage story, not a blanket kernel-speed claim.

Files

  • Qwen3.5-27B-TQ3_1S.gguf

Base Model

  • Qwen/Qwen3.5-27B

Recommended Runtime

This model is intended for the public TQ3 runtime fork:

  • GitHub: https://github.com/turbo-tan/llama.cpp-tq3

It requires TQ3_1S runtime support and will not run on a stock llama.cpp build unless that support is present.

Example

./build/bin/llama-server \
  -m Qwen3.5-27B-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -c 4096

Quantization Notes

TQ3_1S uses a 32-element block layout:

[d0: fp16][d1: fp16][qs: 12 bytes]

That is:

  • 16 bytes per 32 weights
  • 4.0 bits per weight at the block level
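The block arithmetic can be checked directly. The byte layout is taken from this card; the constant names are illustrative, and the 12-byte qs field is consistent with 8 centroids, i.e. 3-bit indices (32 x 3 / 8 = 12):

```python
# TQ3_1S block layout per the card: [d0: fp16][d1: fp16][qs: 12 bytes]
WEIGHTS_PER_BLOCK = 32
SCALE_BYTES = 2 + 2        # two fp16 half-block scales (d0, d1)
QS_BYTES = 12              # packed 3-bit centroid indices
BLOCK_BYTES = SCALE_BYTES + QS_BYTES

bits_per_weight = BLOCK_BYTES * 8 / WEIGHTS_PER_BLOCK
print(BLOCK_BYTES, bits_per_weight)
```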

Credit

This work is inspired by the broader line of transform-based quantization methods, especially RaBitQ-style Walsh-Hadamard rotation ideas, adapted here for LLM weight quantization in GGUF / llama.cpp.

Limitations

  • This 27B result does not imply that plain TQ3_1S is equally strong on smaller dense models.
  • In internal testing, 9B models were much less forgiving at this bitrate.
  • This release is a practical 27B deployment artifact, not a universal claim about all model scales.

License

The same license terms as the base model apply.
