Yonghua Lin
3 followers · 1 following
AI & ML interests: None yet
Recent Activity
New activity 5 days ago in deepseek-ai/DeepSeek-V4-Flash:
"Run DeepSeek-V4-Flash on more hardware: FP8/BF16 adapted versions for 8 AI chips (ready to download)"
Posted an update 5 days ago:
🚀 Run DeepSeek V4 on more AI GPUs with FlagOS

DeepSeek V4 just dropped with huge specs: 1.6T params, 1M context, MIT license. But there's a catch: the official weights use FP4+FP8 mixed precision, which mainly targets NVIDIA Blackwell / B200-class GPUs.

So we built DeepSeek-V4-FlagOS. On Day 0, the FlagOS community completed multi-chip adaptation across 8 AI hardware platforms:

✅ NVIDIA H100/H20 — FP8/BF16
✅ Huawei Ascend — BF16
✅ Hygon DCU — BF16
✅ MetaX GPU — BF16
✅ Moore Threads MTT S5000 — FP8
✅ Kunlunxin XPU — BF16
✅ T-Head/Alibaba Zhenwu — BF16
✅ Iluvatar GPU — BF16

🔧 What makes it work?

1️⃣ FlagGems operator replacement
DeepSeek V4 operators — MoE routing, Attention, RMSNorm and more — are reimplemented in Triton, reducing dependency on CUDA-specific libraries. New V4 operators include: Act Quant, hc_split_sinkhorn, FP8 MatMul, Sparse Attention, Hadamard Transform. (A minimal Triton sketch follows this post.)

2️⃣ Flexible tensor parallelism
DeepSeek V4 uses o_groups=8, which can limit TP. We added an independent communication group for the o-groups, while allowing the rest of the model to scale to higher TP, enabling deployment on 32 GB/64 GB cards. (See the process-group sketch below.)

3️⃣ FP4 → BF16 conversion
For hardware without native FP4, we provide ready-to-use BF16 conversion and pre-converted model releases. (A dequantization sketch follows below.)

📦 Pre-converted models are available on Hugging Face (download snippet below):

V4-Pro:
FlagRelease/DeepSeek-V4-Pro-nvidia-FlagOS
FlagRelease/DeepSeek-V4-Pro-metax-FlagOS
FlagRelease/DeepSeek-V4-Pro-mthreads-FlagOS
FlagRelease/DeepSeek-V4-Pro-hygon-FlagOS
FlagRelease/DeepSeek-V4-Pro-ascend-FlagOS

V4-Flash:
FlagRelease/DeepSeek-V4-Flash-nvidia-FlagOS
FlagRelease/DeepSeek-V4-Flash-zhenwu-FlagOS
FlagRelease/DeepSeek-V4-Flash-kunlunxin-FlagOS
FlagRelease/DeepSeek-V4-Flash-iluvatar-FlagOS

⚡ Performance on NVIDIA H20, V4-Flash FP8:
FlagGems C++ Wrapper + Triton: 70.7 tok/s
DeepSeek TileLang: 62.99 tok/s
That's 12.24% faster.

👉 Try it here: https://github.com/flagos-ai/DeepSeek-V4-FlagOS

Open models should run on open infrastructure.
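To make the operator-replacement idea (1️⃣) concrete, here is a minimal RMSNorm written directly in Triton, in the spirit of the FlagGems approach. This is not the actual FlagGems kernel: the function names, block sizing, and the assumption of a contiguous 2D input are illustrative.

```python
# Minimal sketch of a Triton RMSNorm (one program per row).
# Illustrative only — not the FlagGems implementation.
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_SIZE)
    mask = offs < n_cols
    # Assumes a row-major contiguous input.
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=0.0).to(tl.float32)
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + offs, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out
```

Because the kernel is plain Triton, the same source can target any backend with a Triton compiler — which is exactly how this route reduces dependency on CUDA-specific libraries.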
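The tensor-parallelism point (2️⃣) can be sketched with plain torch.distributed process groups: the o_groups=8 projection gets its own 8-way communication group while the rest of the model all-reduces over a wider TP group. The group sizes and layout below are assumptions for illustration, not FlagOS's actual API.

```python
# Hedged sketch: a separate 8-way group for the o_groups projection,
# independent of the (larger) tensor-parallel group used elsewhere.
import torch.distributed as dist

def build_groups(tp_size: int = 16, o_group_size: int = 8):
    # Launch with torchrun; assumes world_size is a multiple of both sizes.
    dist.init_process_group(backend="nccl")
    world = dist.get_world_size()
    rank = dist.get_rank()

    # Wide TP groups for most layers (new_group must be called on all ranks).
    tp_groups = [dist.new_group(list(range(i, i + tp_size)))
                 for i in range(0, world, tp_size)]
    tp_group = tp_groups[rank // tp_size]

    # Independent 8-way groups so the o_groups=8 split no longer caps TP.
    o_groups = [dist.new_group(list(range(i, i + o_group_size)))
                for i in range(0, world, o_group_size)]
    o_group = o_groups[rank // o_group_size]
    return tp_group, o_group
```

The design point is simply that the operator whose sharding is fixed at 8 communicates within its own group, so the rest of the model is free to scale TP high enough to fit on 32 GB/64 GB cards.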
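For the FP4 → BF16 conversion (3️⃣), here is a minimal dequantization sketch assuming e2m1 FP4 codes packed two per byte with per-block scales. The packing layout, nibble order, and scale granularity are assumptions for illustration, not the actual FlagOS conversion pipeline.

```python
# Hedged sketch: dequantize packed e2m1 FP4 weights to BF16 via a lookup table.
import torch

# All 16 e2m1 code points: 8 magnitudes, sign in the high bit.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_fp4_to_bf16(packed: torch.Tensor, scale: torch.Tensor,
                        block: int = 32) -> torch.Tensor:
    """packed: uint8, two FP4 codes per byte (low nibble first — assumed);
    scale: one float per `block` decoded values."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    codes = torch.stack([lo, hi], dim=-1).flatten().long()
    vals = E2M1[codes]                                # decode code points
    vals = vals.view(-1, block) * scale.view(-1, 1)   # apply per-block scales
    return vals.flatten().to(torch.bfloat16)
```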
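Pulling one of the pre-converted checkpoints (📦) needs nothing FlagOS-specific; the standard huggingface_hub download works, with the repo id taken from the list above:

```python
from huggingface_hub import snapshot_download

# Fetch a pre-converted release listed in the post.
local_path = snapshot_download("FlagRelease/DeepSeek-V4-Flash-nvidia-FlagOS")
print(local_path)
```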
Authored a paper 7 months ago:
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Yonghua's activity
Upvoted an article 10 months ago:
Building AI Applications with Multi-Chip Large Models Adapted via the FlagRelease Platform
Jul 7, 2025 • 3