# gemma-4-19b-a4b-it-REAP GGUF
GGUF quantized versions of 0xSero/gemma-4-19b-a4b-it-REAP.
## Model Description
This is a 30% expert-pruned version of Google's Gemma-4 27B model, produced with Cerebras' REAP (Router-weighted Expert Activation Pruning) method.
- Original model: Google Gemma-4 27B
- Pruning: 30% of experts removed via REAP
- Total parameters: ~19B
- Active parameters per token: ~4B (MoE architecture)
- Experts per layer: 90 (down from 128)
- Selected experts per token: 8
## Available Quantizations
| Quantization | Size | Description |
|---|---|---|
| Q8_0 | ~19GB | 8-bit, best quality |
| Q6_K | ~15GB | 6-bit K-quant |
| Q5_K_M | ~13GB | 5-bit K-quant medium |
| Q4_K_M | ~11GB | 4-bit K-quant medium |
| Q3_K_M | ~9GB | 3-bit K-quant medium |
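A single quant file can be fetched with the `huggingface-cli` tool from the `huggingface_hub` package. The repo and file names below follow this card's naming and may need adjusting; a minimal sketch:

```bash
#!/bin/sh
# Sketch: download one quant file from the Hub (assumes huggingface_hub is installed).
REPO="Ayodele01/gemma-4-19b-a4b-it-REAP-GGUF"
QUANT="Q4_K_M"                                   # pick a row from the table above
FILE="gemma-4-19b-a4b-it-REAP-${QUANT}.gguf"
echo "Fetching ${FILE} from ${REPO}"
if command -v huggingface-cli >/dev/null 2>&1; then
  huggingface-cli download "$REPO" "$FILE" --local-dir .
else
  echo "huggingface-cli not found; install with: pip install -U huggingface_hub"
fi
```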
## Usage

### With llama.cpp
```bash
./llama-cli -m gemma-4-19b-a4b-it-REAP-Q4_K_M.gguf -p "Hello, how are you?" -n 256
```
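llama.cpp also ships an OpenAI-compatible HTTP server, `llama-server`. The port and context size below are illustrative, not requirements; a hedged sketch:

```bash
#!/bin/sh
# Sketch: serve the model over HTTP with llama.cpp's llama-server (if built).
MODEL="gemma-4-19b-a4b-it-REAP-Q4_K_M.gguf"
if command -v llama-server >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  llama-server -m "$MODEL" -c 4096 --port 8080 &
  sleep 5
  # Query the OpenAI-compatible chat endpoint.
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, how are you?"}]}'
else
  echo "llama-server or model file not found; skipping"
fi
```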
### With Ollama
```bash
# Create a Modelfile
echo 'FROM ./gemma-4-19b-a4b-it-REAP-Q4_K_M.gguf' > Modelfile
ollama create gemma4-19b-reap -f Modelfile
ollama run gemma4-19b-reap
```
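If Ollama does not pick up the chat template embedded in the GGUF, the Modelfile can declare it explicitly. The template below is a sketch assuming Gemma-style turn tokens, written in Ollama's Go-template `TEMPLATE` syntax:

```
FROM ./gemma-4-19b-a4b-it-REAP-Q4_K_M.gguf
TEMPLATE """<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"""
PARAMETER stop <end_of_turn>
```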
## Chat Format

This model uses the Gemma-4 chat format:
```
<bos><start_of_turn>user
Your message here<end_of_turn>
<start_of_turn>model
```
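When calling `llama-cli` directly with a raw prompt, the turn tokens above can be assembled with `printf` before being passed via `-p`:

```bash
#!/bin/sh
# Sketch: wrap a user message in Gemma-4 turn tokens for a raw-prompt run.
USER_MSG="Hello, how are you?"
PROMPT=$(printf '<bos><start_of_turn>user\n%s<end_of_turn>\n<start_of_turn>model\n' "$USER_MSG")
printf '%s\n' "$PROMPT"
# The formatted prompt can then be passed directly:
#   ./llama-cli -m gemma-4-19b-a4b-it-REAP-Q4_K_M.gguf -p "$PROMPT" -n 256
```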
## License
This model inherits the Gemma license from Google.