Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!
TL;DR: MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts while the serving setup still has to account for the full resident memory footprint.
🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
🏗️ Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active params isn't the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
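
To make the active-vs-resident gap concrete, here is a rough back-of-the-envelope sketch in Python. It is not hf-mem's code or API: `MoEConfig`, `estimate`, and every number in the usage example are hypothetical. It assumes every layer is an MoE layer, standard multi-head or grouped-query attention for the KV cache (one K and one V tensor per layer), base weights replicated on every EP rank, and it ignores activations and framework overhead.

```python
import math
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical config for illustration only (not an hf-mem type)."""
    base_params: int        # non-expert weights: attention, embeddings, router
    params_per_expert: int  # weights in a single routed expert
    num_experts: int        # routed experts per layer
    experts_per_token: int  # experts activated per token (top-k)
    num_layers: int         # assumes every layer is an MoE layer
    num_kv_heads: int
    head_dim: int
    bytes_per_param: int = 2  # e.g. bf16 weights
    kv_bytes: int = 2         # e.g. bf16 KV cache


def estimate(cfg: MoEConfig, context_len: int, batch_size: int, ep_size: int = 1) -> dict:
    """Rough byte counts, split into base weights, routed experts, and KV cache."""
    expert_params = cfg.num_layers * cfg.num_experts * cfg.params_per_expert
    # Parameters touched by one token: base weights plus its top-k experts per layer.
    active_params = cfg.base_params + (
        cfg.num_layers * cfg.experts_per_token * cfg.params_per_expert
    )

    base_bytes = cfg.base_params * cfg.bytes_per_param
    expert_bytes = expert_params * cfg.bytes_per_param
    # K and V (factor 2) per layer, per KV head, per cached token, per sequence.
    kv_cache_bytes = (
        2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim
        * cfg.kv_bytes * context_len * batch_size
    )

    # With EP, each device holds a slice of the experts; base weights are
    # assumed to be replicated on every device.
    experts_per_device = math.ceil(cfg.num_experts / ep_size)
    per_device_weight_bytes = base_bytes + (
        cfg.num_layers * experts_per_device * cfg.params_per_expert * cfg.bytes_per_param
    )

    return {
        "total_params": cfg.base_params + expert_params,
        "active_params": active_params,
        "base_bytes": base_bytes,
        "expert_bytes": expert_bytes,
        "kv_cache_bytes": kv_cache_bytes,
        "per_device_weight_bytes": per_device_weight_bytes,
    }


if __name__ == "__main__":
    # Made-up numbers, not taken from any real model.
    cfg = MoEConfig(
        base_params=2_000_000_000,
        params_per_expert=50_000_000,
        num_experts=64,
        experts_per_token=4,
        num_layers=32,
        num_kv_heads=8,
        head_dim=128,
    )
    for name, value in estimate(cfg, context_len=32_768, batch_size=16, ep_size=8).items():
        unit = "params" if name.endswith("params") else "bytes"
        print(f"{name}: {value:,} {unit}")
```

Playing with `context_len`, `batch_size`, and `ep_size` shows the same pattern as the bullets above: active params stay small, expert weights set the loading footprint, and KV cache grows linearly with context and concurrency.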