Rare Archive: Qwen 4B SFT v1

A fine-tuned language model for rare genetic disease clinical decision support, trained on 63,212 examples from the RareArena dataset.

This model is NOT FDA-cleared, NOT CE-marked, and NOT a medical device. It is intended for research and clinical decision support only. It must not be used for autonomous diagnosis or treatment decisions. See Ethical Considerations below.

Model Description

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Training method | QLoRA SFT via Unsloth |
| LoRA rank | 64 (5.04% trainable parameters) |
| Training data | 63,212 examples (49,760 RDS + 6,891 RDC + 6,561 synthetic) |
| Epochs | 3 (11,853 steps) |
| Quantization | Q8_0 GGUF (4.2 GB) |
| License | Apache-2.0 |

Agentic System Context

This model is one component of a complete agentic diagnostic system. In the full system, the model doesn't just answer clinical questions: it reasons through cases by invoking real clinical tools (ClinVar, Orphanet, HPO, PanelApp, gnomAD, PubMed, DiffDx) and interpreting their results to build a differential diagnosis.

The diagnostic trace pattern follows the workflow of expert rare disease specialists: Reason (analyze clinical presentation) → Lookup (query clinical databases) → Match (cross-reference findings against phenotype-gene mappings) → Search (broaden when initial hypotheses don't explain all findings) → Diagnose (synthesize evidence into a ranked differential).

This v1 model is Stage 1 (SFT): it has learned clinical diagnostic reasoning from 63K+ cases but does not yet invoke tools autonomously. Stage 2 (Tool-Use SFT) trains on gold-standard agentic traces where experts navigate real diagnostic cases with real tool API responses. See the training pipeline for the full 4-stage roadmap.

Training Data

The model was fine-tuned on the RareArena Combined dataset (63,212 examples from the rare-archive monorepo), which includes:

  • RareArena Disease Summaries (RDS): 49,760 structured rare disease summaries with clinical presentations, genetic bases, diagnostic approaches, and management guidelines
  • RareArena Disease Cases (RDC): 6,891 clinical case vignettes with differential diagnoses
  • Synthetic Patient Profiles: 6,561 AI-generated patient profiles enriched with Orphadata disease metadata (zero PHI)

All data is de-identified and contains zero protected health information (PHI). See Data Provenance under Ethical Considerations for sourcing details.

Evaluation Results

Evaluated on a held-out test set of 200 rare disease diagnostic questions:

| Metric | Base (Qwen3.5-4B) | SFT v1 | Improvement |
|---|---|---|---|
| Top-1 Accuracy | 1.0% | 21.5% | 21.5x |
| Top-5 Accuracy | 2.5% | 28.0% | 11.2x |
| Mean Score | 0.041 | 0.620 | 15.1x |

Evaluation methodology: Each question presents a clinical vignette describing a patient with a rare genetic condition. The model generates a differential diagnosis. Top-1 measures whether the correct diagnosis is the model's first choice; Top-5 measures whether it appears in the top 5 candidates. Mean Score is a weighted composite reflecting diagnostic reasoning quality.
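
As a concrete illustration of the Top-1/Top-5 metrics, a minimal scorer might look like the following. Exact string matching is a simplification (the real harness presumably normalizes disease names and synonyms), and the sample cases are invented.

```python
# Illustrative top-k scorer for the evaluation described above.
# `examples` is a list of (ranked_predictions, gold_diagnosis) pairs,
# with the model's best guess first.

def top_k_accuracy(examples, k):
    hits = sum(1 for preds, gold in examples if gold in preds[:k])
    return hits / len(examples)

cases = [
    (["Gaucher disease", "Niemann-Pick disease"], "Gaucher disease"),  # Top-1 hit
    (["Fabry disease", "Pompe disease"], "Gaucher disease"),           # miss
]
print(top_k_accuracy(cases, 1))  # 0.5
print(top_k_accuracy(cases, 5))  # 0.5
```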

Important context: 21.5% Top-1 accuracy represents a 21.5x improvement over the unspecialized base model, but is not sufficient for clinical deployment. This is Stage 1 (supervised fine-tuning) of a planned 4-stage training pipeline. See Bias, Risks, and Limitations.

Arena Validation

After deployment, models enter the ELO Arena, a clinician-operated proving ground where rare disease specialists evaluate model outputs in blind A/B comparisons. Each evaluation produces ratings across 5 quality dimensions:

| Dimension | What it measures |
|---|---|
| Diagnostic Accuracy | Did the system identify the correct diagnosis? |
| Reasoning Quality | Is the clinical reasoning sound and well-structured? |
| Tool Usage | Were clinical tools invoked appropriately and effectively? |
| Safety | Did it avoid harmful or misleading recommendations? |
| Overall | Which system would you trust in clinical practice? |

When a clinician identifies an error, their correction enters the correction-to-retrain loop: stored in the feedback database, exported as training data, and incorporated into the next model version. The Arena is where models earn clinical trust, and the clinicians who use the system are the ones who decide when it's ready. See ARCHITECTURE.md for full details.
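
Blind A/B comparisons like these are typically scored with the standard Elo update. The sketch below shows that update with a conventional K-factor of 32; the Arena's actual implementation (per-dimension ratings, tie handling, K-factor) is not specified here.

```python
# Standard Elo rating update for a pairwise comparison.
# This is the textbook formula, not the Arena's verified implementation.

def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A's expected score
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta  # zero-sum: A gains what B loses

# Two models start at 1500; a clinician prefers model A.
a, b = elo_update(1500, 1500, 1.0)
print(round(a), round(b))  # 1516 1484
```

Because the update is zero-sum and weighted by expectation, an upset win against a higher-rated model moves ratings more than a predictable one.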

Intended Use

This model is designed for rare disease clinical decision support: helping clinicians identify potential diagnoses, understand genetic bases, and plan diagnostic workups for patients with suspected rare genetic conditions.

Use Cases

  • Rare disease differential diagnosis from clinical presentations
  • Genetic test interpretation and next-step recommendations
  • Clinical literature synthesis for rare conditions
  • Medical education and training scenarios
  • Research into AI-assisted rare disease diagnostics

Out-of-Scope Uses

This model should not be used for:

  • Autonomous medical diagnosis: all outputs require review by a qualified clinician
  • Treatment recommendations without clinician oversight and independent verification
  • Emergency or time-critical clinical decisions: the model is not designed for acute care settings
  • Pediatric or neonatal assessment without specialist review: training data may underrepresent age-specific presentations
  • Legal or insurance determination of rare disease status or disability
  • Replacing specialist consultation: the model augments, never replaces, clinical expertise

Bias, Risks, and Limitations

Known Limitations

  • Not a diagnostic tool: outputs require clinical validation by a qualified specialist
  • English only: trained exclusively on English medical text; non-English clinical presentations are unsupported
  • Knowledge cutoff: based on training data available as of March 2026; newly described diseases or updated guidelines are not reflected
  • Rare disease focus: may underperform on common conditions compared to general medical models
  • Stage 1 SFT only: no RLHF, DPO, tool-use fine-tuning, or progressive RL applied yet; this is the first of four planned training stages

Bias Analysis

  • Disease coverage: The model covers approximately 9,100 rare diseases, but training depth varies significantly. Diseases with rich published literature (e.g., Gaucher disease, cystic fibrosis) are better represented than ultra-rare conditions with prevalence below 1:1,000,000.
  • Population representation: Training data derives primarily from published case literature in English-language medical journals, which skews toward Western clinical settings. Genetic variant frequencies referenced during training (from gnomAD) may not fully represent all ethnic and geographic populations.
  • Synthetic data artifacts: 6,561 of 63,212 training examples are synthetic patient profiles generated from Orphadata disease descriptions. These may reflect template-based patterns rather than the heterogeneous clinical presentations seen in real patients.
  • Confidence calibration: No formal calibration analysis has been performed. The model may express high confidence on incorrect diagnoses. Do not interpret output certainty as clinical probability.
  • Evaluation sample size: The evaluation test set contains 200 questions β€” sufficient to demonstrate training effectiveness but too small for robust statistical confidence intervals on accuracy estimates.

Risk Summary

| Risk | Severity | Mitigation |
|---|---|---|
| Misdiagnosis extending diagnostic odyssey | High | Always use with specialist oversight; present as differential, not definitive |
| Overconfidence on rare conditions | Medium | No calibration performed; treat all outputs as hypotheses requiring verification |
| Underrepresentation of non-Western populations | Medium | Cross-reference with population-appropriate genetic databases |
| Synthetic data patterns in output | Low | Synthetic data is ~10% of training set; impact is diluted but present |

Ethical Considerations

Regulatory Status

This model is NOT FDA-cleared, NOT CE-marked, and has NOT been evaluated by any regulatory body. It has not been cleared, approved, or conformity-assessed as a medical device under US FDA regulations or EU MDR 2017/745, and it must not be marketed, distributed, or used as Software as a Medical Device (SaMD).

Rare Disease Population Equity

Rare disease patients are a vulnerable population. The average diagnostic odyssey takes 5-7 years, during which patients may receive multiple incorrect diagnoses and unnecessary treatments. An AI system that provides incorrect guidance can extend this odyssey rather than shorten it. This model should augment, never replace, specialist clinical judgment.

Feedback Loop Accountability

The Rare AI Archive includes a correction-to-retrain feedback loop where clinician corrections flow back into the training data. This creates accountability for:

  • Feedback quality: Corrections from specialists with rare disease expertise should be weighted appropriately
  • Feedback representativeness: If corrections come predominantly from one clinical setting, the model may drift toward that setting's diagnostic patterns
  • Feedback transparency: All corrections are logged and auditable

Data Provenance

  • RareArena RDS/RDC: Derived from published clinical literature and structured disease databases. No individual patient data is used.
  • Synthetic patients: Generated from publicly available Orphanet / Orphadata disease profiles using template-based generation with randomized clinical parameters. Zero protected health information (PHI).
  • All training data is released under Apache 2.0 and available for inspection on HuggingFace.

Why Open Source

This model is released under Apache 2.0 because rare disease diagnostics should not be locked behind proprietary walls. Open weights enable:

  • Scrutiny: Any researcher can inspect the model's behavior and identify failure modes
  • Reproducibility: Training can be verified and replicated independently
  • Equitable access: Clinics in resource-limited settings can deploy locally without API costs
  • Community improvement: Specialists worldwide can contribute corrections and improvements
  • Ecosystem participation: Part of a decentralized post-training ecosystem where patient communities, rare disease centers, and ML engineers collaborate to make the system continuously better

Environmental Impact

| Factor | Value |
|---|---|
| Hardware | 1x NVIDIA A100 80GB |
| Training time | ~23 hours |
| Estimated energy | ~3.5 kWh |
| Estimated CO2 | ~1.4 kg CO2e (US grid average) |
| Quantization | Minimal additional compute |
| Inference | Runs on consumer hardware (Apple M-series, single GPU) |

The model's small size (4B parameters) and GGUF quantization enable efficient inference without dedicated GPU infrastructure, reducing the ongoing environmental cost of deployment.

Related Work

The rare disease AI landscape is emerging rapidly. We acknowledge the following contributions and position this work in context:

  • DeepRare (2025): Achieves 57.18% Recall@1 with 40+ clinical tools and 95.4% expert agreement. The current leading system for rare disease diagnostics, but closed-source and not independently deployable. Our 21.5% Top-1 does not match this performance; we are Stage 1 of 4 in our training pipeline.
  • RareSeek R1 (2025): Demonstrates physician-parity on EHR narratives through staged instruction tuning. An important academic contribution, but not available as a deployable system or open model.
  • RareBench (2024) / RareScale (2025): Provide evaluation frameworks and benchmarks for rare disease AI. These serve the research community but do not produce deployed clinical tools.
  • Zebra-Llama (2024): Shows that focused single-disease fine-tuning (Ehlers-Danlos syndrome) can achieve strong results. However, single-disease models face scalability challenges across the 7,000+ rare disease landscape.

Our position: The Rare AI Archive is the first open-source, federated, composable, deployed rare disease diagnostic AI. We trade peak single-metric performance for transparency, deployability, and extensibility. We welcome comparison and collaboration with all of the above efforts.

How to Use

With llama.cpp (recommended)

```bash
# Download the Q8_0 GGUF
huggingface-cli download Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1 \
  rare-archive-qwen-4b-sft-v1-Q8_0.gguf --local-dir ./models

# Serve with llama-server (Metal for Mac, CUDA for Linux)
llama-server -m ./models/rare-archive-qwen-4b-sft-v1-Q8_0.gguf \
  -ngl 99 --port 8082

# Query via OpenAI-compatible API
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"rare-archive","messages":[{"role":"user","content":"A 3-year-old presents with hepatosplenomegaly, bone pain, and Erlenmeyer flask deformity on X-ray. What is the most likely diagnosis?"}]}'
```
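
The same endpoint can also be queried from Python using only the standard library. The URL and model name below match the llama-server command above; the temperature setting is an illustrative choice, not a documented requirement.

```python
import json
import urllib.request

# Minimal stdlib client for the OpenAI-compatible llama-server endpoint.
# Port 8082 matches the llama-server command in this section.
API_URL = "http://localhost:8082/v1/chat/completions"

def build_payload(question: str) -> dict:
    return {
        "model": "rare-archive",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.2,  # illustrative: low temperature for stabler differentials
    }

def ask(question: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("A 3-year-old presents with hepatosplenomegaly, bone pain, "
              "and Erlenmeyer flask deformity on X-ray. "
              "What is the most likely diagnosis?"))
```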

With OpenWebUI

```bash
pip install open-webui
OPENAI_API_BASE_URLS="http://localhost:8082/v1" open-webui serve --port 3100
```

Training & Evaluation Data

| Dataset | Records | Purpose | Link |
|---|---|---|---|
| Synthetic Patients | 12,984 | SFT training data: frequency-weighted HPO vignettes across ~4,500 diseases | rare-archive-synthetic-patients |
| RareArena RDS | 8,562 | Evaluation benchmark: clinical vignettes from published case reports | rare-archive-eval-rarearena-rds |
| RareArena RDC | 4,376 | Evaluation benchmark: vignettes with laboratory test results | rare-archive-eval-rarearena-rdc |

All datasets are part of the "Rare AI Archive: Complete Toolkit" collection.

Training Details

  • Framework: Unsloth + Hugging Face Transformers
  • Hardware: NVIDIA A100 80GB (single GPU)
  • Training time: ~23 hours (3 epochs, 11,853 steps)
  • LoRA config: rank=64, alpha=128, dropout=0.05, target_modules=all linear layers
  • Optimizer: AdamW 8-bit
  • Learning rate: 1e-4 with cosine schedule
  • Batch size: 1 per device x 16 gradient accumulation = 16 effective
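
As a sanity check on these numbers, the effective batch size and the standard per-layer LoRA parameter count can be computed directly. The 4096x4096 layer shape in the example is hypothetical, not the actual Qwen 4B geometry.

```python
# Back-of-the-envelope helpers for the hyperparameters listed above.

def effective_batch_size(per_device: int, grad_accum: int, n_devices: int = 1) -> int:
    return per_device * grad_accum * n_devices

def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the weight update as B @ A, with
    # A: (r, d_in) and B: (d_out, r), so r * (d_in + d_out) extra params.
    return r * (d_in + d_out)

print(effective_batch_size(1, 16))            # 16, matching the config above
print(lora_params_per_layer(4096, 4096, 64))  # 524288 for a hypothetical square layer
```

With an effective batch of 16, the reported 11,853 steps is consistent with 3 epochs over 63,212 examples (63,212 / 16 ≈ 3,951 steps per epoch).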

Condition-Specific Models

Beyond this foundation model (trained on the full 9,100-disease corpus), the Archive trains condition-specific adapters (LoRA layers specialized for disease clusters):

| Disease Cluster | Key Diseases | Status |
|---|---|---|
| IEM / Lysosomal Storage | Gaucher, Fabry, Pompe | Adapter complete |
| Neuromuscular | Duchenne, SMA, Myasthenia Gravis | Adapter complete |
| Connective Tissue | Ehlers-Danlos, Marfan | Planned |
| Autoimmune | Sjögren's, Lupus | Planned |
| Mitochondrial | MELAS, Leigh Syndrome | Planned |

The ontology package assigns each disease to a cluster, and the Arena tracks per-cluster ELO ratings, so model quality is measured where it matters, not just in aggregate. See packages/models/ for adapter training configs.

Citation

```bibtex
@misc{rare-archive-qwen-4b-sft-v1,
  title={Rare Archive: Fine-tuned Qwen 4B for Rare Disease Clinical Decision Support},
  author={Wilhelm Foundation and Lattice Protocol},
  year={2026},
  url={https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1}
}
```

Acknowledgments

