gpt-oss-puzzle-88B-GGUF

EXPERIMENTAL - REQUIRES CUSTOM BRANCH

These GGUF files will NOT work with mainline llama.cpp. You must use the branch linked below.

GGUF quantisation of nvidia/gpt-oss-puzzle-88B, an 88B-parameter MoE model derived from gpt-oss-120B using NVIDIA's Puzzle NAS framework.

Required Branch

This model requires a custom llama.cpp branch with gpt-oss-puzzle architecture support:

https://github.com/smpurkis/llama.cpp/tree/gpt-oss-puzzle-support

Tracking issue: ggml-org/llama.cpp#21028
PR: ggml-org/llama.cpp#21032

This will not work on mainline llama.cpp until the architecture is merged upstream.

How to Use

# Clone the required branch
git clone --branch gpt-oss-puzzle-support https://github.com/smpurkis/llama.cpp.git
cd llama.cpp

# Build (example with Vulkan)
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j$(nproc)

# Run
./build/bin/llama-cli -m gpt-oss-puzzle-88B.MXFP4_MOE.gguf -ngl 99 -fa 1 -p "Hello"

Available Quantisations

File                               Quant      Size      Description
gpt-oss-puzzle-88B.f16.gguf        F16        47.0 GiB  Full precision (for requantisation)
gpt-oss-puzzle-88B.MXFP4_MOE.gguf  MXFP4_MOE  44.8 GiB  Native MXFP4 expert weights (matches original model precision)
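For intuition on what MXFP4 means: blocks of weights share a single power-of-two scale, and each element is stored as a 4-bit FP4 (E2M1) value. Below is a minimal illustrative sketch of that idea in Python; it is not the actual GGML implementation (block size, scale encoding, and rounding mode are simplified):

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bits)
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """MXFP4-style quantisation of one block:
    a shared power-of-two scale plus per-element FP4 values."""
    amax = max(abs(x) for x in block)
    if amax == 0:
        return 0.0, [0.0] * len(block)
    # Choose a power-of-two scale so the largest magnitude fits FP4's max (6.0)
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    quantized = []
    for x in block:
        # Round |x|/scale to the nearest representable FP4 magnitude, keep the sign
        q = min(FP4_VALUES, key=lambda v: abs(v - abs(x) / scale))
        quantized.append(math.copysign(q, x))
    return scale, quantized

def dequantize_block(scale, quantized):
    """Reconstruct approximate float values from scale + FP4 codes."""
    return [scale * q for q in quantized]
```

The shared scale is why MXFP4 reaches roughly 4.25 bits per weight: the per-element storage is 4 bits, plus one scale amortised over the whole block.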

Architecture Differences from gpt-oss-120B

The puzzle model differs from the standard gpt-oss architecture in ways that require dedicated support:

Property           gpt-oss-120B                            gpt-oss-puzzle-88B
Expert count       128 per layer (uniform)                 128 or 64 per layer (heterogeneous)
Attention pattern  Interleaved global/SWA (single window)  Global + multiple SWA window sizes (128, 8192)
Total parameters   ~117B                                   ~88B
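To make the attention difference concrete: a global layer uses a full causal mask, while a sliding-window layer additionally restricts each query to the most recent `window` positions, and in the puzzle model different layers use different window sizes. A toy boolean-mask sketch (for illustration only; this is not how llama.cpp builds its masks):

```python
def attention_mask(seq_len, window=None):
    """mask[i][j] is True iff query position i may attend to key position j.
    window=None gives a full causal (global) layer; an integer gives
    sliding-window attention with that window size."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            ok = j <= i                      # causal: no attending to the future
            if ok and window is not None:
                ok = i - j < window          # SWA: only the last `window` keys
            row.append(ok)
        mask.append(row)
    return mask
```

A mainline gpt-oss build assumes one SWA width shared by all windowed layers; supporting per-layer windows (128 vs 8192 here) is part of what the custom branch adds.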

Credits
