gemma-4-19b-a4b-it-REAP GGUF

GGUF quantized versions of 0xSero/gemma-4-19b-a4b-it-REAP.

Model Description

This is a 30% expert-pruned version of Google's Gemma-4 27B model, produced with Cerebras' REAP (Router-weighted Expert Activation Pruning) method.

  • Original model: Google Gemma-4 27B
  • Pruning: 30% of experts removed via REAP
  • Total parameters: ~19B
  • Active parameters per token: ~4B (MoE architecture)
  • Experts per layer: 90 (down from 128)
  • Selected experts per token: 8
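The routing summarized above (8 of 90 experts selected per token) can be sketched as top-k softmax gating. This is an illustrative sketch of the general technique, not the model's actual implementation:

```python
import math
import random

NUM_EXPERTS = 90   # experts per layer after 30% REAP pruning
TOP_K = 8          # experts selected per token

def route(logits):
    """Pick the top-k experts for one token and renormalize their
    softmax gate weights so the kept weights sum to 1."""
    assert len(logits) == NUM_EXPERTS
    # Numerically stabilized softmax over all expert logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k highest-probability experts.
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Example: route one token with random gate logits.
random.seed(0)
choice = route([random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
```

Because only the selected experts run for each token, roughly 4B of the 19B total parameters are active per forward pass.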

Available Quantizations

| Quantization | Size  | Description             |
|--------------|-------|-------------------------|
| Q8_0         | ~19GB | 8-bit, best quality     |
| Q6_K         | ~15GB | 6-bit K-quant           |
| Q5_K_M       | ~13GB | 5-bit K-quant medium    |
| Q4_K_M       | ~11GB | 4-bit K-quant medium    |
| Q3_K_M       | ~9GB  | 3-bit K-quant medium    |
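The file sizes follow roughly from parameter count times average bits per weight. A back-of-the-envelope check, using assumed approximate bits-per-weight figures (real GGUF files mix precisions across tensors and carry metadata, so these are estimates only):

```python
PARAMS = 19e9  # total parameters (~19B)

# Rough average bits per weight per quant type (assumed approximations;
# K-quants actually use different precisions for different tensors).
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}

def est_size_gb(quant):
    """Estimated file size in GB: params * bits-per-weight / 8 / 1e9."""
    return PARAMS * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in BITS_PER_WEIGHT:
    print(f"{q}: ~{est_size_gb(q):.0f} GB")
```

The estimates land within a gigabyte or two of the table above, which is as close as this kind of arithmetic gets.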

Usage

With llama.cpp

./llama-cli -m gemma-4-19b-a4b-it-REAP-Q4_K_M.gguf -p "Hello, how are you?" -n 256

With Ollama

# Create a Modelfile
echo 'FROM ./gemma-4-19b-a4b-it-REAP-Q4_K_M.gguf' > Modelfile
ollama create gemma4-19b-reap -f Modelfile
ollama run gemma4-19b-reap
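For correct multi-turn behavior, the Modelfile can also declare the chat template and a stop token instead of relying on defaults. A sketch (the parameter values are illustrative, not tuned):

```
FROM ./gemma-4-19b-a4b-it-REAP-Q4_K_M.gguf

TEMPLATE """<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"""

PARAMETER stop "<end_of_turn>"
PARAMETER temperature 0.7
```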

Chat Format

This model uses the Gemma-4 chat format:

<bos><start_of_turn>user
Your message here<end_of_turn>
<start_of_turn>model
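When driving the model through a raw completion API rather than a chat endpoint, this template has to be assembled manually. A minimal helper, a sketch using the token strings shown above:

```python
BOS = "<bos>"
START, END = "<start_of_turn>", "<end_of_turn>"

def build_prompt(messages):
    """Format a list of {"role": ..., "content": ...} dicts into a
    Gemma-style prompt, leaving an open model turn for generation."""
    parts = [BOS]
    for msg in messages:
        # Gemma uses the role name "model" rather than "assistant".
        role = "model" if msg["role"] == "assistant" else msg["role"]
        parts.append(f"{START}{role}\n{msg['content']}{END}\n")
    parts.append(f"{START}model\n")
    return "".join(parts)

prompt = build_prompt([{"role": "user", "content": "Hello, how are you?"}])
```

Pass `<end_of_turn>` as a stop string so generation halts at the end of the model's turn.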

License

This model inherits the Gemma license from Google.
