GLM-4.7-Flash — Core AI (`gather_qmm` kernel, 2.6× faster)

Apple Core AI (.aimodel) conversion of zai-org/GLM-4.7-Flash (text decoder): MLA attention + a 64-expert top-4 sparse MoE (+ non-gated shared expert). 30B total / **3B active per token** — a strong local coder.

Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo (full card: zoo/glm-4.7-flash.md).

Use it

▶️ Run it (source) — the ChatDemo runner (GUI + CLI, one app for every chat model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ChatDemo/ChatDemo.xcodeproj
# → Run, then pick "GLM-4.7-Flash (MoE+MLA)" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/ChatDemo
swift run chat-cli --model glm-4.7-flash --prompt "What can you do, offline?"

💻 Build with it — complete; the glue is kit API, copy-paste runs:

import CoreAIKit

let chat = try await ChatSession(catalog: "glm-4.7-flash")
let reply = try await chat.respond(to: prompt)
// reply: the answer, generated fully on-device

The take-home is Examples/ChatDemo/Sources/QuickStart.swift — this exact code as one typed function, no UI; the CLI is an argument shell over it, and the GUI drives the same ChatSession across turns for its transcript. Multi-turn? Hold the ChatSession and call respond(to:) per turn — it keeps the conversation history; streamResponse(to:) yields tokens as they decode.

Integration checklist

SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
Info.plist: none needed
Entitlements: none needed (macOS)
First run downloads the model — 30.0 GB (Mac) — then it loads from the local cache (Application Support; progress via the downloadProgress callback)
Measure in Release — Debug is ~3× slower on per-token host work

The `gather_qmm` kernel — 20.3 → 52.4 tok/s (2.6×)

Apple's GatherMM reads all 64 experts' weights every token; a custom coreai_torch.TorchMetalKernel reads only the 4 routed experts (4/64) → decode runs at active-param bandwidth: 52.4 tok/s, 2.6× (the biggest relative gain of the zoo's three MoE gather ports — a 16× over-read removed).

Quality is clean and unchanged. The kernel reads the sym8 scheme = the same symmetric-linear int8 (per-K-block-32) recipe the standard int8 bundle uses, via a bit-exact gather: 0 introduced flips / 18 vs fp16. Pure speed win at the same quality.

bundle	size	decode tok/s	quality
`gpu-pipelined/glm_4_7_flash_decode_sym8_gather/`	30 GB	52.4	clean (0 flips/18 vs fp16) ✅

Mac-only (30 GB int8). Remaining speed lever = absorbed-MLA (GLM runs full MLA on all 47 layers).

Run

COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/glm_4_7_flash_decode_sym8_gather -p 128 -g 256 -n 3

Convert your own with conversion/export_glm47_moe_metal_decode_pipelined.py.

License

MIT (upstream GLM license). Conversion + gather_qmm kernel: community.

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for mlboydaisuke/GLM-4.7-Flash-CoreAI

Base model

zai-org/GLM-4.7-Flash

Finetuned

(65)

this model

GLM-4.7-Flash — Core AI (gather_qmm kernel, 2.6× faster)

Use it

The gather_qmm kernel — 20.3 → 52.4 tok/s (2.6×)

Run

License

Model tree for mlboydaisuke/GLM-4.7-Flash-CoreAI

GLM-4.7-Flash — Core AI (`gather_qmm` kernel, 2.6× faster)

The `gather_qmm` kernel — 20.3 → 52.4 tok/s (2.6×)