# Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled

> **Beta release.** A v2 with more training data and an improved training methodology is planned. This model is currently fine-tuned exclusively on the nohurry/Opus-4.6-Reasoning-3000x-filtered dataset (2,326 reasoning traces from Claude Opus 4.6).

GGUF quantizations of the fine-tuned NVIDIA Nemotron-3-Super-120B-A12B, distilled from Claude Opus 4.6 reasoning traces.
## Available Quantizations

| Quantization | File | Size | Description |
|---|---|---|---|
| Q4_K_M | nemotron-120b-q4-k-m.gguf | ~50 GB | Best balance of quality and size; recommended for most users. |
| Q8_0 | nemotron-120b-q8-0.gguf | ~120 GB | Near-lossless quantization; best quality, requires more RAM. |
## Model Details
| Property | Value |
|---|---|
| Base Model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 |
| Architecture | Nemotron-H (Mamba-2 SSM + MoE + Attention hybrid) |
| Parameters | 120B total / 12B active (MoE) |
| Fine-tuning Method | LoRA (r=32, alpha=64) merged into base weights |
| Training Data | nohurry/Opus-4.6-Reasoning-3000x-filtered |
| Epochs | 3 |
| Final Training Loss | 0.42 |
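The "LoRA merged into base weights" row means the adapter update was folded into the frozen weights before quantization, using the standard LoRA merge W' = W + (alpha/r)·BA. A toy NumPy sketch with the r=32, alpha=64 settings from the table (layer shapes and values are illustrative, not the actual merge script):

```python
import numpy as np

# LoRA hyperparameters from the table above (r=32, alpha=64)
r, alpha = 32, 64
scaling = alpha / r  # = 2.0

d_out, d_in = 64, 96  # toy layer dimensions, illustrative only
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trained LoRA down-projection
B = rng.standard_normal((d_out, r)) * 0.01  # trained LoRA up-projection

# Merge: W' = W + (alpha / r) * B @ A. After this, no adapter is
# needed at inference time, which is why the GGUF ships merged weights.
W_merged = W + scaling * (B @ A)
```

Because the merge happens offline, the quantized files behave like any ordinary GGUF checkpoint.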
## What's Different

This model has been fine-tuned on 2,326 high-quality reasoning traces from Claude Opus 4.6. It produces structured reasoning inside `<think>` tags before answering, similar to o1-style reasoning models.
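Because the reasoning is emitted inline before the answer, downstream code usually separates the two. A minimal sketch, assuming a single leading `<think>...</think>` block with a matching closing tag (the helper name is ours, not part of the model):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer).

    Assumes at most one <think>...</think> block at the start of the
    response, as produced by this model's reasoning format.
    """
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", text.strip()

reasoning, answer = split_reasoning("<think>7 * 13 = 91</think>The answer is 91.")
```

Responses without a reasoning block fall through unchanged, with an empty reasoning string.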
## Usage

### With llama.cpp
```bash
# Download the Q4_K_M quantization
huggingface-cli download blobbybob/Nemotron-3-Super-120B-A12B-GGUF-Claude-4.6-Opus-Reasoning-Distilled \
  nemotron-120b-q4-k-m.gguf --local-dir ./models

# Run inference
./llama-cli -m ./models/nemotron-120b-q4-k-m.gguf \
  -p "<|im_start|>system\nYou are a helpful reasoning assistant. Think step by step before answering.<|im_end|>\n<|im_start|>user\nWhat is 7 * 13?<|im_end|>\n<|im_start|>assistant\n" \
  --temp 1.0 --top-p 0.95 -n 512
```
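The prompt string passed to `-p` follows the ChatML layout. If you build prompts programmatically, a small helper keeps the special tokens consistent (a sketch assuming the same template; the authoritative chat template ships inside the GGUF metadata):

```python
def chatml_prompt(system: str, user: str) -> str:
    """Build a ChatML-style prompt matching the llama-cli example above."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = chatml_prompt(
    "You are a helpful reasoning assistant. Think step by step before answering.",
    "What is 7 * 13?",
)
```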
### With Ollama

Create a `Modelfile`:

```
FROM ./nemotron-120b-q4-k-m.gguf
PARAMETER temperature 1.0
PARAMETER top_p 0.95
SYSTEM You are a helpful reasoning assistant. Think step by step before answering.
```

Then:

```bash
ollama create nemotron-reasoning -f Modelfile
ollama run nemotron-reasoning "What is the sum of all prime numbers less than 20?"
```
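Ollama also exposes a local REST API (default port 11434), so the same model can be queried from code. A sketch of the request against the model created above, assuming `ollama serve` is running locally:

```python
import json
from urllib import request

# Request payload for Ollama's /api/generate endpoint
payload = {
    "model": "nemotron-reasoning",
    "prompt": "What is the sum of all prime numbers less than 20?",
    "stream": False,
}

req = request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(request.urlopen(req))  # uncomment with a running server
# print(resp["response"])
```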
## Recommended Sampling Parameters
Per NVIDIA's recommendation, use `temperature=1.0` and `top_p=0.95` for all tasks: reasoning, tool calling, and general chat alike.
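For intuition, top-p (nucleus) sampling keeps only the smallest set of highest-probability tokens whose cumulative mass reaches `top_p`, then renormalizes. A minimal sketch of the idea (not llama.cpp's actual implementation):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Zero out tokens outside the nucleus, then renormalize.

    Keeps the smallest set of highest-probability tokens whose
    cumulative mass reaches top_p (illustrative, not llama.cpp code).
    """
    order = np.argsort(probs)[::-1]        # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # First position where cumulative mass reaches top_p
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# With top_p=0.9, the lowest-probability token falls outside the nucleus
probs = np.array([0.5, 0.3, 0.15, 0.05])
filtered = top_p_filter(probs, 0.9)
```

A higher `top_p` (like the recommended 0.95) keeps more of the tail, trading a little precision for diversity.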
## Other Formats
- BF16 (safetensors): blobbybob/Nemotron-3-Super-120B-A12B-BF16-Claude-4.6-Opus-Reasoning-Distilled
- FP8 (safetensors): blobbybob/Nemotron-3-Super-120B-A12B-FP8-Claude-4.6-Opus-Reasoning-Distilled
## Limitations
- Fine-tuned on only 2,326 examples — may not generalize to all domains
- Reasoning traces are from Claude Opus 4.6; model behavior reflects that style
- Beta release — expect improvements in v2
## License
This model inherits the NVIDIA Nemotron Open Model License from the base model.