Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
Paper: arXiv:2411.19146
EXPERIMENTAL - REQUIRES CUSTOM BRANCH
These GGUF files will NOT work with mainline llama.cpp. You must use the branch linked below.
GGUF quantisation of nvidia/gpt-oss-puzzle-88B, an 88B parameter MoE model derived from gpt-oss-120B using NVIDIA's Puzzle NAS framework.
This model requires a custom llama.cpp branch with gpt-oss-puzzle architecture support:
https://github.com/smpurkis/llama.cpp/tree/gpt-oss-puzzle-support
Tracking issue: ggml-org/llama.cpp#21028
Upstreaming PR: ggml-org/llama.cpp#21032
This will not work on mainline llama.cpp until the architecture is merged upstream.
```bash
# Clone the required branch
git clone --branch gpt-oss-puzzle-support https://github.com/smpurkis/llama.cpp.git
cd llama.cpp

# Build (example with Vulkan)
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j$(nproc)

# Run
./build/bin/llama-cli -m gpt-oss-puzzle-88B.MXFP4_MOE.gguf -ngl 99 -fa 1 -p "Hello"
```
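The same build also produces llama-server, which exposes an OpenAI-compatible HTTP API. A typical launch might look like the following (flags mirror the llama-cli example above; the port choice is arbitrary):

```shell
# Serve the model over llama-server's OpenAI-compatible HTTP API
./build/bin/llama-server -m gpt-oss-puzzle-88B.MXFP4_MOE.gguf -ngl 99 -fa 1 --port 8080
```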
| File | Quant | Size | Description |
|---|---|---|---|
| gpt-oss-puzzle-88B.f16.gguf | F16 | 47.0 GiB | Full precision (for requantisation) |
| gpt-oss-puzzle-88B.MXFP4_MOE.gguf | MXFP4_MOE | 44.8 GiB | Native MXFP4 expert weights (matches original model precision) |
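As a rough sanity check on the MXFP4 file size: MXFP4 stores 4-bit values in blocks of 32 that share one 8-bit exponent scale, i.e. about 4.25 bits per weight. A back-of-envelope estimate (assuming, for simplicity, that all 88B parameters were quantised) lands near the listed file size:

```python
# Back-of-envelope size estimate for an MXFP4-quantised 88B model.
# MXFP4: blocks of 32 four-bit values sharing one 8-bit (E8M0) scale,
# so effective storage is 4 + 8/32 = 4.25 bits per weight.
N_PARAMS = 88e9
BITS_PER_WEIGHT = 4 + 8 / 32  # 4.25

size_bytes = N_PARAMS * BITS_PER_WEIGHT / 8
size_gib = size_bytes / 2**30
print(f"{size_gib:.1f} GiB")  # ~43.5 GiB
```

The actual file is slightly larger (44.8 GiB) because non-expert tensors such as attention, embeddings, and norms are kept at higher precision.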
The Puzzle-derived model differs from the standard gpt-oss architecture in ways that require dedicated support:
| Property | gpt-oss-120B | gpt-oss-puzzle-88B |
|---|---|---|
| Expert count | 128 per layer (uniform) | 128 or 64 per layer (heterogeneous) |
| Attention pattern | Interleaved global/SWA (single window) | Global + multiple SWA window sizes (128, 8192) |
| Total parameters | ~117B | ~88B |
Base model: nvidia/gpt-oss-puzzle-88B