MiniCPM5-1B: Core AI (int8, runs on iPhone)
Apple Core AI (.aimodel) conversion of openbmb/MiniCPM5-1B,
OpenBMB's 1.08B on-device LLM with hybrid Think / No-Think reasoning and 128K context, reaching
1B-class open-source SOTA. Runs fully on-device on iPhone and Apple Silicon Macs (GPU, pipelined engine).
Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo
Use it
▶️ Run it (source): the ChatDemo runner (GUI + CLI, one app for every chat model in the catalog):
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
# Run, then pick "MiniCPM5 1B" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/ChatDemo
swift run chat-cli --model minicpm5-1b --prompt "What can you do, offline?"
💻 Build with it: a complete example; the glue is kit API, so it runs when copy-pasted:
import CoreAIKit
let chat = try await ChatSession(catalog: "minicpm5-1b")
let prompt = "What can you do, offline?"
let reply = try await chat.respond(to: prompt)
// reply: the answer, generated fully on-device
The take-home is Examples/ChatDemo/Sources/QuickStart.swift:
this exact code as one typed function, no UI; the CLI is an argument shell over it, and
the GUI drives the same ChatSession across turns for its transcript.
Multi-turn? Hold the ChatSession and call respond(to:) per turn; it keeps the
conversation history. streamResponse(to:) yields tokens as they decode.
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit, product CoreAIKit
- Info.plist: none needed
- Entitlements: none needed
- First run downloads the model (2.0 GB on Mac, 2.0 GB on iPhone), then it loads from the local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
On-device numbers (iPhone 17 Pro, A19 Pro)
Measured with the zoo's PipelinedBench (random 128-token prompt, greedy):
| bundle | decode | prefill | quality | size | engine-ready |
|---|---|---|---|---|---|
| int8/ (ship) | 66.8 tok/s | 68.0 tok/s | lossless (24/24 token-exact vs HF fp32) | 1.0 GB | 2.0 s |
int8 is ~2.2× faster than fp16 on iPhone (decode is memory-bandwidth-bound, so halving the
weight read roughly doubles throughput) at no quality cost: the device greedy output is token-for-token
identical to the fp32 reference on the benchmark prompts. So int8 strictly dominates fp16 here.
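The bandwidth argument can be checked with a back-of-envelope estimate. The sketch below is only a rough model, not part of the zoo's tooling: it assumes every weight is read from memory exactly once per generated token and ignores the KV cache, activations, and quantization scales. The function names are illustrative; the 1.08B parameter count and the 66.8 tok/s figure are taken from this card.

```python
# Back-of-envelope roofline for decode, ASSUMING each generated token
# streams every weight from memory exactly once (KV cache, activations,
# and per-channel scales are ignored).

PARAMS = 1.08e9  # parameter count quoted in this card


def bytes_per_token(bytes_per_weight: float) -> float:
    """Weight bytes read to produce one token under the assumption above."""
    return PARAMS * bytes_per_weight


def implied_bandwidth_gb_s(tok_per_s: float, bytes_per_weight: float) -> float:
    """Effective memory bandwidth (GB/s) implied by a measured decode rate."""
    return tok_per_s * bytes_per_token(bytes_per_weight) / 1e9


def ideal_speedup(from_bytes: float, to_bytes: float) -> float:
    """Upper-bound speedup from smaller weights if bandwidth is the only limit."""
    return from_bytes / to_bytes


if __name__ == "__main__":
    int8_rate = 66.8  # tok/s, measured int8 decode rate from the table
    print(f"int8 weight read per token: {bytes_per_token(1) / 1e9:.2f} GB")
    print(f"implied bandwidth: {implied_bandwidth_gb_s(int8_rate, 1):.1f} GB/s")
    print(f"ideal fp16 -> int8 speedup: {ideal_speedup(2, 1):.1f}x")
```

Under these assumptions the model predicts a 2× ceiling from halving the weight bytes, which is in the same range as the measured ~2.2×; the simple model does not account for the remaining difference.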
Quantization
Weight-only symmetric per-channel int8 (absmax, no clipping: clipping craters the 130k-vocab LM
head, while absmax keeps it lossless), applied as a torch pre-export pass via coreai-opt; SDPA / RoPE /
RMSNorm stay full precision. Same recipe family as the zoo's proven sym8.
uv run coreai.llm.export openbmb/MiniCPM5-1B --experimental --compute-precision float16 \
--compression-config minicpm5_int8sym.yaml
# minicpm5_int8sym.yaml: quantization_config → op_state_spec.weight = {dtype: int8,
# qscheme: symmetric, granularity: {type: per_channel, axis: 0}}
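To make the recipe concrete, here is a minimal sketch of symmetric per-channel absmax quantization. It is not the coreai-opt implementation: plain Python lists stand in for tensors, the function names are made up for illustration, and each row is treated as one output channel (axis 0).

```python
# Illustrative only: symmetric per-channel int8 with absmax scaling.
# NOT the coreai-opt implementation; lists of rows stand in for a weight
# tensor, and each row is one output channel (axis 0).


def quantize_per_channel(weight):
    """Quantize each row with its own absmax-derived scale."""
    q_rows, scales = [], []
    for row in weight:
        absmax = max(abs(w) for w in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0  # guard all-zero rows
        # The largest magnitude maps to +/-127, so nothing is clipped.
        q_rows.append([round(w / scale) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights: w is about q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


if __name__ == "__main__":
    w = [[0.02, -0.5, 0.25], [4.0, -1.0, 0.003]]
    q, s = quantize_per_channel(w)
    w_hat = dequantize(q, s)
    worst = max(abs(a - b) for ra, rb in zip(w, w_hat) for a, b in zip(ra, rb))
    print("int8 codes:", q)
    print("scales:", s)
    print("worst absolute error:", worst)
```

Because the scale comes from each channel's own maximum, the rounding error per weight is at most half of that channel's scale, and a channel with one large value does not force a coarse scale onto the others.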
Conversion notes
- llama → mistral remap. MiniCPM5-1B's model_type is llama; the stock exporter has no llama graph family, but Mistral's builder is architecturally identical for this config (GQA, no qkv bias, no qk-norm, explicit head_dim honored). One-line remap in the model registry.
- Chat EOS. Base eos_token is </s>, but the chat template ends turns with <|im_end|> (id 130073). The bundle's tokenizer eos_token is set to <|im_end|> (as Qwen ships) so generation halts cleanly.
- Dynamic-shape bundle → the Core AI pipelined engine (the iPhone path); a static iOS export routes to the static-shape engine instead, which this FM-format bundle doesn't target.
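The chat EOS note matters because a decode loop only stops on the token id it is told to watch for. The sketch below shows this with a generic greedy loop; it is not the engine's actual code. `generate` and `next_token` are hypothetical names, with `next_token` standing in for one model decode step, and only the id 130073 comes from this card.

```python
# Sketch: a greedy decode loop that halts on the chat end-of-turn token.
# `next_token` is a HYPOTHETICAL stand-in for one model decode step.

from typing import Callable, List

IM_END_ID = 130073  # <|im_end|>, the id given in the conversion notes


def generate(next_token: Callable[[List[int]], int],
             prompt_ids: List[int],
             eos_id: int = IM_END_ID,
             max_new_tokens: int = 256) -> List[int]:
    """Return newly generated ids, stopping at eos_id (which is not included)."""
    ids = list(prompt_ids)
    out: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(ids)
        if tok == eos_id:
            break
        ids.append(tok)
        out.append(tok)
    return out


if __name__ == "__main__":
    # Scripted "model": emits two tokens, then the end-of-turn token.
    scripted = iter([11, 22, IM_END_ID, 33])
    print(generate(lambda ids: next(scripted), [1, 2, 3]))
```

If `eos_id` pointed at a token the chat template never emits, the loop would only stop at `max_new_tokens`, which is the failure mode the tokenizer change avoids.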
Run
// iOS / macOS, via Foundation Models
import FoundationModels
import CoreAILanguageModels
let model = try await CoreAILanguageModel(resourcesAt: modelURL) // int8/ bundle
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "Explain on-device AI in one sentence."))
License
Apache-2.0 (upstream MiniCPM5 license). Model © OpenBMB; see https://huggingface.co/openbmb/MiniCPM5-1B. Conversion: community.