Chroma Context-1 (MXFP4)

MXFP4-quantized version of chromadb/context-1, a 20B parameter agentic search model fine-tuned from openai/gpt-oss-20b.

This checkpoint reduces the model size from 39 GB (BF16) to ~14 GB (MXFP4) with minimal quality degradation, enabling inference on a single GPU with more headroom for KV cache.

Model Details

  • Base model: chromadb/context-1 (BF16)
  • Architecture: GptOssForCausalLM (Mixture of Experts, 24 layers, 32 experts, top-4 routing)
  • Parameters: 20B total
  • Quantization: MXFP4 (E2M1 weights + E8M0 scales, group size 32)
  • Quantized layers: MoE expert weights (gate_up_proj, down_proj)
  • Non-quantized layers: attention, router, embeddings, LM head (remain in BF16)
  • File size: ~14 GB (model.safetensors)

Quantization Details

MXFP4 (Microscaling FP4) is a 4-bit floating-point format using the E2M1 representation (2-bit exponent, 1-bit mantissa) with shared E8M0 per-group scaling factors. Each group of 32 weights shares a single 8-bit scale, and each individual weight is stored as a 4-bit FP4 code. Two FP4 codes are packed into one uint8 byte (low nibble = even index, high nibble = odd index).
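The nibble-packing convention described above can be sketched in a few lines of NumPy. This is an illustration of the byte layout only (even index in the low nibble, odd index in the high nibble), not the converter's actual code:

```python
import numpy as np

# 32 FP4 codes for one group (values 0..15; contents are arbitrary here).
codes = np.arange(32, dtype=np.uint8) % 16

# Pack: two 4-bit codes per byte -> 16 bytes per 32-weight group.
packed = codes[0::2] | (codes[1::2] << 4)

# Unpack reverses it: low nibble = even index, high nibble = odd index.
lo = packed & 0x0F
hi = packed >> 4
unpacked = np.empty(32, dtype=np.uint8)
unpacked[0::2] = lo
unpacked[1::2] = hi
assert np.array_equal(unpacked, codes)
```

With group size 32, each group therefore occupies 16 bytes of packed codes plus one E8M0 scale byte, i.e. 4.25 bits per weight.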

This matches the quantization format used by the official openai/gpt-oss-20b MXFP4 checkpoint and is natively supported by vLLM's Marlin MXFP4 kernels.

What is quantized

Component                   Format   Notes
mlp.experts.gate_up_proj    MXFP4    Stored as *_blocks (U8) + *_scales (U8)
mlp.experts.down_proj       MXFP4    Stored as *_blocks (U8) + *_scales (U8)
self_attn.*                 BF16     Kept at full precision
mlp.router.*                BF16     Kept at full precision
embed_tokens, lm_head       BF16     Kept at full precision

Tensor layout

Expert weights are transposed from the BF16 checkpoint layout [num_experts, in_features, out_features] to the vLLM-expected [num_experts, out_features, in_features] before quantization. This is required because vLLM's MXFP4 loader performs a direct copy without transposing (unlike the BF16 loader, which transposes on load).
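The transpose above is a plain axis swap; a minimal sketch with made-up dimensions (the real expert shapes come from the model's config.json):

```python
import numpy as np

# Hypothetical sizes for illustration only.
E, d_in, d_out = 4, 8, 16

# BF16 checkpoint layout: [num_experts, in_features, out_features].
w_bf16 = np.arange(E * d_in * d_out, dtype=np.float32).reshape(E, d_in, d_out)

# vLLM MXFP4 layout: [num_experts, out_features, in_features].
w_vllm = np.ascontiguousarray(w_bf16.transpose(0, 2, 1))

assert w_vllm.shape == (E, d_out, d_in)
assert w_vllm[1, 3, 2] == w_bf16[1, 2, 3]
```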

Usage with vLLM

Requires vLLM >= 0.18.0.

vllm serve evilfreelancer/context-1-mxfp4 \
    --served-model-name evilfreelancer/context-1-mxfp4 \
    --trust-remote-code \
    --dtype auto \
    --gpu-memory-utilization 0.9 \
    --max-model-len 64000 \
    --max-num-batched-tokens 64000 \
    --kv-cache-dtype fp8 \
    --enable-auto-tool-choice \
    --tool-call-parser openai
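As a rough sanity check on the fp8 KV cache flag: even at the full 64 000-token context, the cache is small next to the ~14 GB of weights. The 24 layers are from this card; the KV head count and head dimension below are assumed values for illustration, so check the model's config.json for the real ones:

```python
# Back-of-the-envelope fp8 KV cache size for --max-model-len 64000.
# 24 layers is stated in this card; 8 KV heads and head_dim 64 are
# ASSUMED for illustration -- read config.json for the actual values.
layers, kv_heads, head_dim = 24, 8, 64
tokens = 64_000
bytes_per_elem = 1  # fp8 = 1 byte per element

kv_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem  # K and V
print(f"{kv_bytes / 2**30:.2f} GiB")  # ~1.46 GiB under these assumptions
```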

Docker Compose example:

services:
  gpt-oss-20b:
    image: vllm/vllm-openai:v0.18.0
    restart: always
    entrypoint: vllm
    command: >
      serve evilfreelancer/context-1-mxfp4
      --served-model-name evilfreelancer/context-1-mxfp4
      --trust-remote-code
      --dtype auto
      --gpu-memory-utilization 0.9
      --max-model-len 64000
      --max-num-batched-tokens 64000
      --kv-cache-dtype fp8
      --enable-auto-tool-choice
      --tool-call-parser openai
    ports:
      - 8080:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

Note on generation_config.json

The eos_token_id includes token 200012 (<|call|>) in addition to the standard <|return|> (200002). This is required for tool calling to work correctly: without it, the model does not stop generating after emitting a tool call, and the Harmony parser fails.
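The relevant fragment of generation_config.json therefore looks like the following (other fields omitted; this shows only the eos_token_id list described above):

```json
{
  "eos_token_id": [200002, 200012]
}
```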

Conversion Script

The included convert_mxfp4.py converts the original BF16 chromadb/context-1 weights to MXFP4 format. Its only dependency is NumPy.

pip install numpy
python convert_mxfp4.py

The script expects the BF16 model in a sibling context-1/ directory and writes the quantized model to context-1-mxfp4/.

What the script does

  1. Reads each tensor from the BF16 model.safetensors
  2. For MoE expert weights (gate_up_proj and down_proj):
    • Transposes from [E, in, out] to [E, out, in]
    • Computes per-group E8M0 scales (group size 32)
    • Quantizes to E2M1 FP4 codes using nearest rounding
    • Packs two FP4 codes per byte
    • Saves as *_blocks (packed weights) and *_scales (shared exponents)
  3. Copies all other tensors (attention, router, embeddings) as-is in BF16
  4. Copies tokenizer and chat template files
  5. Writes config.json with quantization_config for vLLM

Key Capabilities

(Inherited from chromadb/context-1)

  • Query decomposition: Breaks complex multi-constraint questions into targeted subqueries.
  • Parallel tool calling: Averages 2.56 tool calls per turn, reducing total turns and end-to-end latency.
  • Self-editing context: Selectively prunes irrelevant documents mid-search to sustain retrieval quality over long horizons within a bounded context window (0.94 prune accuracy).
  • Cross-domain generalization: Trained on web, legal, and finance tasks; generalizes to held-out domains and public benchmarks (BrowseComp-Plus, SealQA, FRAMES, HLE).

Important: Agent Harness Required

Context-1 is trained to operate within a specific agent harness that manages tool execution, token budgets, context pruning, and deduplication. The harness is not yet public. Running the model without it will not reproduce the results reported in the technical report.

See the technical report for details on the harness design.

Citation

@techreport{bashir2026context1,
  title = {Chroma Context-1: Training a Self-Editing Search Agent},
  author = {Bashir, Hammad and Hong, Kelly and Jiang, Patrick and Shi, Zhiyi},
  year = {2026},
  month = {March},
  institution = {Chroma},
  url = {https://trychroma.com/research/context-1},
}

License

Apache 2.0
