LLaDA-8B dLLM — Core AI (on-device diffusion LLM)

The model zoo's first diffusion language model (dLLM) for Apple Core AI. Most on-device LLMs are autoregressive: they write one token at a time, left to right. This one is a masked diffusion model. It starts from a canvas of [MASK] tokens and fills them in parallel, committing the most confident positions at each step until the answer resolves. It brings a different decoding paradigm to on-device inference on Apple Silicon.

Base: GSAI-ML/LLaDA-8B-Instruct (LLaMA-dense 8B, bidirectional, no causal mask)
Distillation: d3LLM/d3LLM_LLaDA (hao-ai-lab / NVIDIA), a pseudo-trajectory distillation that sharply reduces the number of denoising steps (≈8 tokens committed per forward pass here)
Quantization: int4 weight-only (per-block-32 body) + int8 head, ≈4.9 GB
Runtime: Apple Core AI (coreai-core), GPU
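
To make "int4 weight-only (per-block-32)" concrete, here is a minimal sketch of block-wise 4-bit quantization. It assumes a symmetric scheme with one scale per block of 32 weights; the actual scheme used by the conversion (symmetric or asymmetric, scale precision, packing) is not stated here, and the function names are hypothetical.

```python
def quantize_block_int4(weights):
    """Symmetric int4 quantization of one block (assumed scheme).

    Returns (scale, codes) with every code in the int4 range [-8, 7].
    """
    peak = max(abs(w) for w in weights)
    scale = peak / 7.0 if peak > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def quantize_per_block(weights, block=32):
    """Split a flat weight list into blocks and quantize each one separately."""
    return [
        quantize_block_int4(weights[i:i + block])
        for i in range(0, len(weights), block)
    ]


def dequantize(blocks):
    """Rebuild approximate float weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in blocks for code in codes]
```

Because every block carries its own scale, one outlier weight only coarsens the 32 values around it rather than the whole tensor, which is the usual motivation for small block sizes.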

What it looks like

The answer doesn't stream left-to-right — the canvas denoises. Mid-generation you see tokens pop in out of order, e.g.:

48 + ░4 =░7░ clips░░░   →   48 + 24 = 72 clips altogether.

That parallel, out-of-order fill is the signature of a masked-diffusion LM.

How to run it

Open it in CoreAIChatMac (the zoo's macOS chat app) — Download Models… → LLaDA-8B (diffusion), or point the app at a folder containing the macos/ bundle. The host drives the denoising loop over a single static, bidirectional forward (main(input_ids[1,S]) → logits[1,S,vocab], no KV cache).
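
The host-side loop described above can be sketched in a few lines. This is a toy illustration, not the app's implementation: `toy_forward` is a hypothetical stand-in for the model's static forward, the mask id is made up, and the commit rule is simplified to "the k most confident masked positions per forward".

```python
import math
import random

MASK = -1  # hypothetical mask token id for this toy example


def toy_forward(input_ids, vocab=16, seed=0):
    """Stand-in for the static forward: ids[S] -> logits[S][vocab].

    A real bundle would run the bidirectional network here; this toy
    returns pseudo-random logits so the loop below can run.
    """
    rng = random.Random(seed + sum(1 for t in input_ids if t == MASK))
    return [[rng.gauss(0.0, 2.0) for _ in range(vocab)] for _ in input_ids]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def denoise(prompt_ids, seq, commit_per_step=8, forward=toy_forward):
    """Fill a fixed-length canvas by committing the most confident masked positions."""
    canvas = list(prompt_ids) + [MASK] * (seq - len(prompt_ids))
    forwards = 0
    while MASK in canvas:
        logits = forward(canvas)  # one full forward over the whole canvas
        forwards += 1
        candidates = []
        for pos, tok in enumerate(canvas):
            if tok != MASK:
                continue  # committed positions are left untouched
            probs = softmax(logits[pos])
            best = max(range(len(probs)), key=probs.__getitem__)
            candidates.append((probs[best], pos, best))
        candidates.sort(reverse=True)  # highest confidence first
        for _, pos, tok in candidates[:commit_per_step]:
            canvas[pos] = tok
    return canvas, forwards
```

With a 3-token prompt on a 35-token canvas and 8 commits per step, `denoise([1, 2, 3], seq=35)` needs 4 forwards, and the committed positions arrive in confidence order rather than left to right.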

Bundle

macos/ — Core AI .aimodel (GPU) + tokenizer/ + metadata.json
metadata.json exposes the diffusion knobs (no recompile to retune):
- seq: canvas length S (256 gives answers up to ~210 tokens). With no KV cache, the whole prompt plus answer must fit in S, and per-step cost is roughly linear in S: a larger canvas allows longer answers but makes each step slower.
- block_size (32) and threshold (1.0): threshold is the entropy threshold, which trades step count against quality. A lower value commits tokens more gradually (more forwards); a higher value needs fewer forwards and runs faster, but setting it too high degrades quality.
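
As a rough illustration of how these knobs interact, the sketch below reads the three values and applies an entropy-threshold commit rule to one block. Several details are assumptions: the JSON excerpt is hypothetical (the real metadata.json layout may differ), entropy is measured in nats, and the rule "commit every position at or below the threshold, and at least the lowest-entropy one" is a common choice rather than a confirmed description of this runtime.

```python
import json
import math

# Hypothetical excerpt; the real metadata.json layout may differ.
METADATA = json.loads('{"seq": 256, "block_size": 32, "threshold": 1.0}')


def answer_budget(seq, prompt_tokens):
    """Tokens left for the answer once the prompt occupies part of the canvas."""
    return max(0, seq - prompt_tokens)


def entropy(probs):
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def positions_to_commit(block_probs, threshold):
    """Indices in the block whose entropy is at or below the threshold.

    Falls back to the single lowest-entropy index so that every forward
    commits at least one token (an assumed safeguard).
    """
    scored = sorted((entropy(p), i) for i, p in enumerate(block_probs))
    if not scored:
        return []
    chosen = [i for h, i in scored if h <= threshold]
    return sorted(chosen) if chosen else [scored[0][1]]
```

For three positions with distributions `[0.97, 0.01, 0.01, 0.01]`, `[0.25, 0.25, 0.25, 0.25]` and `[0.7, 0.1, 0.1, 0.1]`, a threshold of 1.0 commits positions 0 and 2 in one forward, while a threshold of 0.1 falls back to position 0 alone. That is the "higher = fewer forwards" trade-off in miniature.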

Performance (M4 Max, GPU)

metric	value
throughput	≈40 tok/s (threshold 1.0, S=256)
TTFT	~0.3 s
forwards (NFE)	~22 for a full 210-token answer
size	4.9 GB (int4 + int8 head)

Each step costs a full bidirectional forward over the whole canvas (no KV cache), so throughput depends on the distillation keeping the step count low. A delayed-KV-cache decode, which processes only the active region, is the next speed lever.
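
The figures in the table can be cross-checked with a little arithmetic. This is a back-of-the-envelope sketch: it assumes the throughput figure is simply answer tokens divided by generation time, which the table does not spell out.

```python
# Figures taken from the performance table above.
answer_tokens = 210   # full answer on the S=256 canvas
forwards = 22         # NFE for that answer
throughput = 40.0     # tokens per second

tokens_per_forward = answer_tokens / forwards    # about 9.5 tokens per forward
total_seconds = answer_tokens / throughput       # 5.25 s for the full answer
seconds_per_forward = total_seconds / forwards   # about 0.24 s per forward

print(f"{tokens_per_forward:.1f} tokens/forward")
print(f"{total_seconds:.2f} s total, {seconds_per_forward:.2f} s per forward")
```

Under that assumption, each forward commits around 9 to 10 tokens, in the same range as the ≈8 tokens per forward quoted for the distilled model, and a single forward takes roughly a quarter of a second, in line with the ~0.3 s TTFT.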

Roadmap

Longer context (larger canvas) build
Delayed-KV-cache decode (the d3LLM generate_multi_block_kv_cache lever) for higher throughput
iPhone (h18p) build

Credits & license

A community port. All credit to the upstream authors — LLaDA (GSAI-ML) and d3LLM (hao-ai-lab / NVIDIA). This bundle inherits and is bound by their licenses — please review them before use. Core AI conversion only; no retraining.

Part of the Core AI model zoo.
