LLaDA-8B dLLM - Core AI (on-device diffusion LLM)
The model zoo's first diffusion language model (dLLM) for Apple Core AI. Most on-device LLMs
are autoregressive; they write one token at a time, left to right. This one is a masked
diffusion model: it starts from a canvas of [MASK] tokens and fills them in in parallel,
committing the most-confident positions each step until the answer resolves. A different decoding
paradigm, running on-device on Apple Silicon.
- Base: GSAI-ML/LLaDA-8B-Instruct (LLaMA-dense 8B, bidirectional, no causal mask)
- Distillation: d3LLM/d3LLM_LLaDA (hao-ai-lab / NVIDIA): pseudo-trajectory distillation that cuts the number of denoising steps hard (≈8 tokens committed per forward here)
- Quantization: int4 weight-only (per-block-32 body) + int8 head, ≈4.9 GB
- Runtime: Apple Core AI (coreai-core), GPU
What it looks like
The answer doesn't stream left-to-right; the canvas denoises. Mid-generation you see tokens pop in out of order, e.g.:
48 + ░4 =░7░ clips░░░ → 48 + 24 = 72 clips altogether.
(░ marks a position that is still masked.)
That parallel, out-of-order fill is the signature of a masked-diffusion LM.
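The out-of-order fill can be reproduced with a toy loop. This is only a minimal sketch, not the bundle's decoder: `toy_forward` is a hypothetical stand-in that already knows the answer and assigns random confidences, and `per_step` is an illustrative fixed commit count rather than the model's real commit rule.

```python
import random

MASK = None  # stand-in for the [MASK] token


def toy_forward(canvas, target, rng):
    """Hypothetical stand-in for the model: one (token, confidence) pair per position."""
    # A real model would return logits; here each position "predicts" the
    # target token with a random confidence, purely for illustration.
    return [(target[i], rng.random()) for i in range(len(canvas))]


def denoise(target, per_step=3, seed=0):
    """Return a printable snapshot of the canvas after each denoising step."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    snapshots = []
    while MASK in canvas:
        preds = toy_forward(canvas, target, rng)
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        # Commit the most-confident masked positions, wherever they sit.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            canvas[i] = preds[i][0]
        snapshots.append(" ".join("_" if tok is MASK else tok for tok in canvas))
    return snapshots


for line in denoise("48 + 24 = 72 clips altogether .".split()):
    print(line)
```

Each printed line is one step; `_` marks a position that is still masked, and the last line is the fully resolved answer.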
How to run it
Open it in CoreAIChatMac (the zoo's macOS chat app) → Download Models… → LLaDA-8B
(diffusion), or point the app at a folder containing the macos/ bundle. The host drives the
denoising loop over a single static, bidirectional forward (main(input_ids[1,S]) → logits[1,S,vocab],
no KV cache).
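What "the host drives the loop over a static forward" means in practice can be sketched as follows. This is an assumption-laden illustration, not the app's code: `MASK_ID`, `build_canvas`, `host_loop`, `fake_forward` and `commit_per_step` are all hypothetical names, and the forward is simplified to return one (token id, confidence) pair per position instead of raw logits.

```python
from typing import Callable, List, Tuple

MASK_ID = -1  # hypothetical placeholder; the real id comes from the bundled tokenizer

Forward = Callable[[List[int]], List[Tuple[int, float]]]


def build_canvas(prompt_ids: List[int], seq: int) -> List[int]:
    """Prompt followed by masks, always exactly `seq` long (the static input shape)."""
    if len(prompt_ids) >= seq:
        raise ValueError("prompt fills the whole canvas; no room for an answer")
    return list(prompt_ids) + [MASK_ID] * (seq - len(prompt_ids))


def host_loop(forward: Forward, prompt_ids: List[int], seq: int,
              commit_per_step: int = 8) -> Tuple[List[int], int]:
    """Re-run the full-canvas forward until no masks remain; returns (canvas, nfe)."""
    canvas = build_canvas(prompt_ids, seq)
    nfe = 0  # number of forward evaluations
    while MASK_ID in canvas:
        preds = forward(canvas)  # no KV cache: the whole canvas goes in every step
        nfe += 1
        masked = sorted((i for i, t in enumerate(canvas) if t == MASK_ID),
                        key=lambda i: preds[i][1], reverse=True)
        for i in masked[:commit_per_step]:
            canvas[i] = preds[i][0]
    return canvas, nfe


def fake_forward(canvas: List[int]) -> List[Tuple[int, float]]:
    """Toy model: predicts token id 7 everywhere, more confident toward the left."""
    return [(7, 1.0 / (1 + i)) for i in range(len(canvas))]


canvas, nfe = host_loop(fake_forward, prompt_ids=[1, 2, 3], seq=19)
print(nfe)  # 2: sixteen masked slots, eight committed per forward
```

The input length never changes between steps, which is why a single compiled forward is enough; only the contents of the canvas change.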
Bundle
- macos/: Core AI .aimodel (GPU) + tokenizer/ + metadata.json
- metadata.json exposes the diffusion knobs (no recompile to retune):
  - seq: canvas length (256 → answers up to ~210 tokens). No KV cache: the whole prompt+answer lives in S; per-step cost is ~linear in S (a larger canvas = fuller answers but slower steps).
  - block_size 32, threshold 1.0: the entropy threshold trades steps for speed; lower = more gradual, higher = fewer forwards (faster), too high degrades quality
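One plausible reading of the threshold knob is sketched below: at each step, commit every masked position whose predictive entropy falls below the threshold, and always commit at least the most certain one so the loop makes progress. The JSON key names follow the text, but the exact metadata.json schema, the entropy unit (nats here), the fallback rule, and the helper names `entropy` and `select_commits` are assumptions. `block_size` is parsed but not used in this sketch.

```python
import json
import math


def entropy(probs):
    """Shannon entropy (in nats) of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_commits(dists, masked, threshold):
    """Masked positions with entropy below `threshold`, most certain first.

    Falls back to the single lowest-entropy position so every step commits something.
    """
    scored = sorted((entropy(dists[i]), i) for i in masked)
    chosen = [i for h, i in scored if h < threshold]
    return chosen or [scored[0][1]]


# Hypothetical metadata.json contents (schema assumed, values from the text).
knobs = json.loads('{"seq": 256, "block_size": 32, "threshold": 1.0}')

# Toy per-position distributions over a 4-token vocabulary.
dists = {
    0: [0.97, 0.01, 0.01, 0.01],  # near-certain
    1: [0.60, 0.30, 0.05, 0.05],  # fairly sure
    2: [0.25, 0.25, 0.25, 0.25],  # uniform, maximally uncertain
}

print(select_commits(dists, [0, 1, 2], knobs["threshold"]))  # [0, 1]
print(select_commits(dists, [0, 1, 2], 0.5))                 # [0]
```

Raising the threshold lets more positions through per forward (fewer steps), while a very high value would also commit the uniform position, which is the quality risk the knob description warns about.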
Performance (M4 Max, GPU)
| metric | value |
|---|---|
| throughput | ≈40 tok/s (threshold 1.0, S=256) |
| TTFT | ~0.3 s |
| forwards (NFE) | ~22 for a full 210-token answer |
| size | 4.9 GB (int4 + int8 head) |
The per-step cost is a full bidirectional forward over the whole canvas (no KV cache); the distillation keeps the step count low. A delayed-KV-cache decode (processing only the active region) is the next speed lever.
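The table's figures can be cross-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that treats the approximate table values as exact and ignores TTFT; the variable names are illustrative.

```python
answer_tokens = 210  # "full 210-token answer" from the table
forwards = 22        # NFE from the table
throughput = 40.0    # tok/s from the table

tokens_per_forward = answer_tokens / forwards    # average tokens committed per step
decode_seconds = answer_tokens / throughput      # time to produce the full answer
seconds_per_forward = decode_seconds / forwards  # implied cost of one full-canvas forward

print(round(tokens_per_forward, 1))   # 9.5
print(round(decode_seconds, 2))       # 5.25
print(round(seconds_per_forward, 2))  # 0.24
```

Roughly 9.5 tokens per forward is in the same ballpark as the ≈8 tokens per forward quoted for the distillation. An autoregressive decoder, at one token per forward, would need 210 forwards for the same answer.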
Roadmap
- Longer context (larger canvas) build
- Delayed-KV-cache decode (the d3LLM generate_multi_block_kv_cache lever) for higher throughput
- iPhone (h18p) build
Credits & license
A community port. All credit to the upstream authors: LLaDA (GSAI-ML) and d3LLM (hao-ai-lab / NVIDIA). This bundle inherits and is bound by their licenses. Please review them before use. Core AI conversion only; no retraining.
Part of the Core AI model zoo.