Rare Archive: Qwen 4B SFT v1

A fine-tuned language model for rare genetic disease clinical decision support, trained on 63,212 examples from the RareArena dataset.

This model is NOT FDA-cleared, NOT CE-marked, and NOT a medical device. It is intended for research and clinical decision support only. It must not be used for autonomous diagnosis or treatment decisions. See Ethical Considerations below.

Model Description

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Training method | QLoRA SFT via Unsloth |
| LoRA rank | 64 (5.04% trainable parameters) |
| Training data | 63,212 examples (49,760 RDS + 6,891 RDC + 6,561 synthetic) |
| Epochs | 3 (11,853 steps) |
| Quantization | Q8_0 GGUF (4.2 GB) |
| License | Apache-2.0 |

Agentic System Context

This model is one component of a complete agentic diagnostic system. In the full system, the model doesn't just answer clinical questions: it reasons through cases by invoking real clinical tools (ClinVar, Orphanet, HPO, PanelApp, gnomAD, PubMed, DiffDx) and interpreting their results to build a differential diagnosis.

The diagnostic trace pattern follows the workflow of expert rare disease specialists: Reason (analyze clinical presentation) → Lookup (query clinical databases) → Match (cross-reference findings against phenotype-gene mappings) → Search (broaden when initial hypotheses don't explain all findings) → Diagnose (synthesize evidence into a ranked differential).

This v1 model is Stage 1 (SFT): it has learned clinical diagnostic reasoning from 63K+ cases but does not yet invoke tools autonomously. Stage 2 (Tool-Use SFT) trains on gold-standard agentic traces where experts navigate real diagnostic cases with real tool API responses. See the training pipeline for the full 4-stage roadmap.

Training Data

The model was fine-tuned on the RareArena Combined dataset (63,212 examples from the rare-archive monorepo), which includes:

  • RareArena Disease Summaries (RDS): 49,760 structured rare disease summaries with clinical presentations, genetic bases, diagnostic approaches, and management guidelines
  • RareArena Disease Cases (RDC): 6,891 clinical case vignettes with differential diagnoses
  • Synthetic Patient Profiles: 6,561 AI-generated patient profiles enriched with Orphadata disease metadata (zero PHI)

All data is de-identified and contains zero protected health information (PHI). See Data Provenance under Ethical Considerations for sourcing details.

Evaluation Results

Evaluated on a held-out test set of 200 rare disease diagnostic questions:

| Metric | Base (Qwen3.5-4B) | SFT v1 | Improvement |
|---|---|---|---|
| Top-1 Accuracy | 1.0% | 21.5% | 21.5x |
| Top-5 Accuracy | 2.5% | 28.0% | 11.2x |
| Mean Score | 0.041 | 0.620 | 15.1x |

Evaluation methodology: Each question presents a clinical vignette describing a patient with a rare genetic condition. The model generates a differential diagnosis. Top-1 measures whether the correct diagnosis is the model's first choice; Top-5 measures whether it appears in the top 5 candidates. Mean Score is a weighted composite reflecting diagnostic reasoning quality.
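
As a concrete illustration of the Top-1/Top-5 metrics, a minimal scorer might look like the following. Exact string matching is a simplification (the real harness presumably normalizes disease names and synonyms), and the sample cases are invented.

```python
# Illustrative top-k scorer for the evaluation described above.
# `examples` is a list of (ranked_predictions, gold_diagnosis) pairs,
# with the model's best guess first.

def top_k_accuracy(examples, k):
    hits = sum(1 for preds, gold in examples if gold in preds[:k])
    return hits / len(examples)

cases = [
    (["Gaucher disease", "Niemann-Pick disease"], "Gaucher disease"),  # Top-1 hit
    (["Fabry disease", "Pompe disease"], "Gaucher disease"),           # miss
]
print(top_k_accuracy(cases, 1))  # 0.5
print(top_k_accuracy(cases, 5))  # 0.5
```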

Important context: 21.5% Top-1 accuracy represents a 21.5x improvement over the unspecialized base model, but is not sufficient for clinical deployment. This is Stage 1 (supervised fine-tuning) of a planned 4-stage training pipeline. See Bias, Risks, and Limitations.

Arena Validation

After deployment, models enter the ELO Arena, a clinician-operated proving ground where rare disease specialists evaluate model outputs in blind A/B comparisons. Each evaluation produces ratings across 5 quality dimensions:

| Dimension | What it measures |
|---|---|
| Diagnostic Accuracy | Did the system identify the correct diagnosis? |
| Reasoning Quality | Is the clinical reasoning sound and well-structured? |
| Tool Usage | Were clinical tools invoked appropriately and effectively? |
| Safety | Did it avoid harmful or misleading recommendations? |
| Overall | Which system would you trust in clinical practice? |

When a clinician identifies an error, their correction enters the correction-to-retrain loop: stored in the feedback database, exported as training data, and incorporated into the next model version. The Arena is where models earn clinical trust, and the clinicians who use the system are the ones who decide when it's ready. See ARCHITECTURE.md for full details.
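
Blind A/B comparisons like these are typically scored with the standard Elo update. The sketch below shows that update with a conventional K-factor of 32; the Arena's actual implementation (per-dimension ratings, tie handling, K-factor) is not specified here.

```python
# Standard Elo rating update for a pairwise comparison.
# This is the textbook formula, not the Arena's verified implementation.

def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A's expected score
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta  # zero-sum: A gains what B loses

# Two models start at 1500; a clinician prefers model A.
a, b = elo_update(1500, 1500, 1.0)
print(round(a), round(b))  # 1516 1484
```

Because the update is zero-sum and weighted by expectation, an upset win against a higher-rated model moves ratings more than a predictable one.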

Intended Use

This model is designed for rare disease clinical decision support: helping clinicians identify potential diagnoses, understand genetic bases, and plan diagnostic workups for patients with suspected rare genetic conditions.

Use Cases

  • Rare disease differential diagnosis from clinical presentations
  • Genetic test interpretation and next-step recommendations
  • Clinical literature synthesis for rare conditions
  • Medical education and training scenarios
  • Research into AI-assisted rare disease diagnostics

Out-of-Scope Uses

This model should not be used for:

  • Autonomous medical diagnosis: all outputs require review by a qualified clinician
  • Treatment recommendations without clinician oversight and independent verification
  • Emergency or time-critical clinical decisions: the model is not designed for acute care settings
  • Pediatric or neonatal assessment without specialist review: training data may underrepresent age-specific presentations
  • Legal or insurance determination of rare disease status or disability
  • Replacing specialist consultation: the model augments, never replaces, clinical expertise

Bias, Risks, and Limitations

Known Limitations

  • Not a diagnostic tool: outputs require clinical validation by a qualified specialist
  • English only: trained exclusively on English medical text; non-English clinical presentations are unsupported
  • Knowledge cutoff: based on training data available as of March 2026; newly described diseases or updated guidelines are not reflected
  • Rare disease focus: may underperform on common conditions compared to general medical models
  • Stage 1 SFT only: no RLHF, DPO, tool-use fine-tuning, or progressive RL applied yet; this is the first of four planned training stages

Bias Analysis

  • Disease coverage: The model covers approximately 9,100 rare diseases, but training depth varies significantly. Diseases with rich published literature (e.g., Gaucher disease, cystic fibrosis) are better represented than ultra-rare conditions with prevalence below 1:1,000,000.
  • Population representation: Training data derives primarily from published case literature in English-language medical journals, which skews toward Western clinical settings. Genetic variant frequencies referenced during training (from gnomAD) may not fully represent all ethnic and geographic populations.
  • Synthetic data artifacts: 6,561 of 63,212 training examples are synthetic patient profiles generated from Orphadata disease descriptions. These may reflect template-based patterns rather than the heterogeneous clinical presentations seen in real patients.
  • Confidence calibration: No formal calibration analysis has been performed. The model may express high confidence on incorrect diagnoses. Do not interpret output certainty as clinical probability.
  • Evaluation sample size: The evaluation test set contains 200 questions β€” sufficient to demonstrate training effectiveness but too small for robust statistical confidence intervals on accuracy estimates.

Risk Summary

| Risk | Severity | Mitigation |
|---|---|---|
| Misdiagnosis extending diagnostic odyssey | High | Always use with specialist oversight; present as differential, not definitive |
| Overconfidence on rare conditions | Medium | No calibration performed; treat all outputs as hypotheses requiring verification |
| Underrepresentation of non-Western populations | Medium | Cross-reference with population-appropriate genetic databases |
| Synthetic data patterns in output | Low | Synthetic data is ~10% of training set; impact is diluted but present |

Ethical Considerations

Regulatory Status

This model is NOT FDA-cleared, NOT CE-marked, and has NOT been evaluated by any regulatory body. It has not been cleared, approved, or conformity-assessed as a medical device under US FDA regulations or EU MDR 2017/745, and it must not be marketed, distributed, or used as Software as a Medical Device (SaMD).

Rare Disease Population Equity

Rare disease patients are a vulnerable population. The average diagnostic odyssey takes 5-7 years, during which patients may receive multiple incorrect diagnoses and unnecessary treatments. An AI system that provides incorrect guidance can extend this odyssey rather than shorten it. This model should augment, never replace, specialist clinical judgment.

Feedback Loop Accountability

The Rare AI Archive includes a correction-to-retrain feedback loop where clinician corrections flow back into the training data. This creates accountability for:

  • Feedback quality: Corrections from specialists with rare disease expertise should be weighted appropriately
  • Feedback representativeness: If corrections come predominantly from one clinical setting, the model may drift toward that setting's diagnostic patterns
  • Feedback transparency: All corrections are logged and auditable

Data Provenance

  • RareArena RDS/RDC: Derived from published clinical literature and structured disease databases. No individual patient data is used.
  • Synthetic patients: Generated from publicly available Orphanet / Orphadata disease profiles using template-based generation with randomized clinical parameters. Zero protected health information (PHI).
  • All training data is released under Apache 2.0 and available for inspection on HuggingFace.

Why Open Source

This model is released under Apache 2.0 because rare disease diagnostics should not be locked behind proprietary walls. Open weights enable:

  • Scrutiny: Any researcher can inspect the model's behavior and identify failure modes
  • Reproducibility: Training can be verified and replicated independently
  • Equitable access: Clinics in resource-limited settings can deploy locally without API costs
  • Community improvement: Specialists worldwide can contribute corrections and improvements
  • Ecosystem participation: Part of a decentralized post-training ecosystem where patient communities, rare disease centers, and ML engineers collaborate to make the system continuously better

Environmental Impact

| Factor | Value |
|---|---|
| Hardware | 1x NVIDIA A100 80GB |
| Training time | ~23 hours |
| Estimated energy | ~3.5 kWh |
| Estimated CO2 | ~1.4 kg CO2e (US grid average) |
| Quantization | Minimal additional compute |
| Inference | Runs on consumer hardware (Apple M-series, single GPU) |

The model's small size (4B parameters) and GGUF quantization enable efficient inference without dedicated GPU infrastructure, reducing the ongoing environmental cost of deployment.

Related Work

The rare disease AI landscape is emerging rapidly. We acknowledge the following contributions and position this work in context:

  • DeepRare (2025): Achieves 57.18% Recall@1 with 40+ clinical tools and 95.4% expert agreement. The current leading system for rare disease diagnostics, but closed-source and not independently deployable. Our 21.5% Top-1 does not match this performance; we are Stage 1 of 4 in our training pipeline.
  • RareSeek R1 (2025): Demonstrates physician-parity on EHR narratives through staged instruction tuning. An important academic contribution, but not available as a deployable system or open model.
  • RareBench (2024) / RareScale (2025): Provide evaluation frameworks and benchmarks for rare disease AI. These serve the research community but do not produce deployed clinical tools.
  • Zebra-Llama (2024): Shows that focused single-disease fine-tuning (Ehlers-Danlos syndrome) can achieve strong results. However, single-disease models face scalability challenges across the 7,000+ rare disease landscape.

Our position: The Rare AI Archive is the first open-source, federated, composable, deployed rare disease diagnostic AI. We trade peak single-metric performance for transparency, deployability, and extensibility. We welcome comparison and collaboration with all of the above efforts.

How to Use

With llama.cpp (recommended)

```bash
# Download the Q8_0 GGUF
huggingface-cli download Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1 \
  rare-archive-qwen-4b-sft-v1-Q8_0.gguf --local-dir ./models

# Serve with llama-server (Metal for Mac, CUDA for Linux)
llama-server -m ./models/rare-archive-qwen-4b-sft-v1-Q8_0.gguf \
  -ngl 99 --port 8082

# Query via OpenAI-compatible API
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"rare-archive","messages":[{"role":"user","content":"A 3-year-old presents with hepatosplenomegaly, bone pain, and Erlenmeyer flask deformity on X-ray. What is the most likely diagnosis?"}]}'
```
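
The same endpoint can also be queried from Python using only the standard library. The URL and model name below match the llama-server command above; the temperature setting is an illustrative choice, not a documented requirement.

```python
import json
import urllib.request

# Minimal stdlib client for the OpenAI-compatible llama-server endpoint.
# Port 8082 matches the llama-server command in this section.
API_URL = "http://localhost:8082/v1/chat/completions"

def build_payload(question: str) -> dict:
    return {
        "model": "rare-archive",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,  # illustrative: low temperature for stabler differentials
    }

def ask(question: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("A 3-year-old presents with hepatosplenomegaly, bone pain, "
              "and Erlenmeyer flask deformity on X-ray. "
              "What is the most likely diagnosis?"))
```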

With OpenWebUI

```bash
pip install open-webui
OPENAI_API_BASE_URLS="http://localhost:8082/v1" open-webui serve --port 3100
```

Training & Evaluation Data

| Dataset | Records | Purpose | Link |
|---|---|---|---|
| Synthetic Patients | 12,984 | SFT training data: frequency-weighted HPO vignettes across ~4,500 diseases | rare-archive-synthetic-patients |
| RareArena RDS | 8,562 | Evaluation benchmark: clinical vignettes from published case reports | rare-archive-eval-rarearena-rds |
| RareArena RDC | 4,376 | Evaluation benchmark: vignettes with laboratory test results | rare-archive-eval-rarearena-rdc |

All datasets are part of the "Rare AI Archive: Complete Toolkit" collection.

Training Details

  • Framework: Unsloth + Hugging Face Transformers
  • Hardware: NVIDIA A100 80GB (single GPU)
  • Training time: ~23 hours (3 epochs, 11,853 steps)
  • LoRA config: rank=64, alpha=128, dropout=0.05, target_modules=all linear layers
  • Optimizer: AdamW 8-bit
  • Learning rate: 1e-4 with cosine schedule
  • Batch size: 1 per device x 16 gradient accumulation = 16 effective
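
As a sanity check on these numbers, the effective batch size and the standard per-layer LoRA parameter count can be computed directly. The 4096x4096 layer shape in the example is hypothetical, not the actual Qwen 4B geometry.

```python
# Back-of-the-envelope helpers for the hyperparameters listed above.

def effective_batch_size(per_device: int, grad_accum: int, n_devices: int = 1) -> int:
    return per_device * grad_accum * n_devices

def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the weight update as B @ A, with
    # A: (r, d_in) and B: (d_out, r), so r * (d_in + d_out) extra params.
    return r * (d_in + d_out)

print(effective_batch_size(1, 16))            # 16, matching the config above
print(lora_params_per_layer(4096, 4096, 64))  # 524288 for a hypothetical square layer
```

With an effective batch of 16, the reported 11,853 steps is consistent with 3 epochs over 63,212 examples (63,212 / 16 ≈ 3,951 steps per epoch).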

Condition-Specific Models

Beyond this foundation model (trained on the full 9,100-disease corpus), the Archive trains condition-specific adapters (LoRA layers specialized for disease clusters):

| Disease Cluster | Key Diseases | Status |
|---|---|---|
| IEM / Lysosomal Storage | Gaucher, Fabry, Pompe | Adapter complete |
| Neuromuscular | Duchenne, SMA, Myasthenia Gravis | Adapter complete |
| Connective Tissue | Ehlers-Danlos, Marfan | Planned |
| Autoimmune | Sjögren's, Lupus | Planned |
| Mitochondrial | MELAS, Leigh Syndrome | Planned |

The ontology package assigns each disease to a cluster, and the Arena tracks per-cluster ELO ratings, so model quality is measured where it matters, not just in aggregate. See packages/models/ for adapter training configs.

Citation

```bibtex
@misc{rare-archive-qwen-4b-sft-v1,
  title={Rare Archive: Fine-tuned Qwen 4B for Rare Disease Clinical Decision Support},
  author={Wilhelm Foundation and Lattice Protocol},
  year={2026},
  url={https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1}
}
```

Acknowledgments

