Spaces:
Running on A10G
Technical details: how is it achieved?
Is it a fine-tuned model, or some deployment optimizations, etc.?
Hi! Yes, that's the project. Thinking about putting out a blog, but mainly it's using transformers' static cache and CUDA graphs. Static cache is needed for CUDA graphs, but it already accelerates everything on its own. Then this model is particularly slow because it has to do a lot of handshakes between GPU and CPU, so using CUDA graphs makes it fly. The benchmarks in the GitHub repo show a bit how different systems react differently. The Jetson with the slow CPU sees a huge speedup, whereas the DGX Spark with its integrated chip sees less, but still a clear speedup.
Streaming was a whole thing on its own. Had to read the paper and think about how to get it to work without artifacts, and do lots and lots of experiments to make sure things were coming out cleanly.
TLDR: Same exact weights as the official implementation. Optimized inference.
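A minimal sketch of the pattern described above (my guess at the general technique, not the author's actual code): a static cache means tensor addresses never change between decode steps, which is exactly what CUDA graph capture requires. You record one step, then replay it by copying new data into the fixed buffers, skipping the per-kernel CPU launch overhead (the "handshakes").

```python
# Sketch, assuming PyTorch; `decode_step` is a stand-in for one transformer
# decode step that writes into a preallocated ("static") cache in-place.
import torch

def decode_step(x, cache):
    cache += x          # in-place update keeps tensor addresses fixed
    return cache.sum()

device = "cuda" if torch.cuda.is_available() else "cpu"
static_x = torch.ones(4, device=device)       # fixed input buffer
static_cache = torch.zeros(4, device=device)  # preallocated static cache

if device == "cuda":
    # warm-up on a side stream before capture, as CUDA graphs require
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        decode_step(static_x, static_cache)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        decode_step(static_x, static_cache)

    # replay: copy fresh data into the fixed buffers, then launch the whole
    # recorded step with a single CPU call instead of one launch per kernel
    static_cache.zero_()
    static_x.copy_(torch.full((4,), 2.0, device=device))
    g.replay()
else:
    # CPU fallback so the sketch runs anywhere: same step, no graph
    static_x.fill_(2.0)
    decode_step(static_x, static_cache)

print(float(static_cache.sum()))
```

The key constraint is that replay reuses the exact memory recorded at capture time, which is why a dynamically growing KV cache breaks this and a static one doesn't.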
I can confirm that the 0.6B model indeed fits in 8GB VRAM -- and probably 6 GB if you're not using any for your desktop environment.
As for speed & quality -- it seems much faster, but I have minor doubts about possible quality degradation. It might have been a fluke, but I noticed some weird output a couple of times.
RTX 3050 GA107 8GB results with low-end supporting hardware:
{
  "0.6B": {
    "model": "0.6B",
    "gpu": "NVIDIA_GeForce_RTX_3050",
    "ttfa_ms": 415.1116179986275,
    "ttfa_std_ms": 13.525802715881103,
    "streaming_baseline_ttfa_ms": 1577.4137812011759,
    "streaming_baseline_ttfa_std_ms": 53.895404091137486,
    "streaming_baseline_rtf": 0.40037805688079825,
    "streaming_baseline_rtf_std": 0.00800231149396026,
    "streaming_fast_ttfa_ms": 406.40410680061905,
    "streaming_fast_ttfa_std_ms": 2.9588236981470484,
    "streaming_fast_rtf": 1.4980374219856765,
    "streaming_fast_rtf_std": 0.011108189280494557,
    "ttfa_by_chunk_size": {
      "4": {
        "mean": 250.57873179903254,
        "std": 18.795786669598712
      },
      "8": {
        "mean": 415.1116179986275,
        "std": 13.525802715881103
      },
      "12": {
        "mean": 578.312341797573,
        "std": 10.528144713370644
      }
    }
  }
}
YEAH!! Write a blog post dude! I would love to know how this works! First time I've heard anyone talk about CUDA Graphs.
Question: If I used the GGUF version of the 0.6B Qwen model would it break anything? I imagine with CUDA Graphs you can't quantise it much more for CPU inference? What about WebGPU?
Thanks for the cool demo and idea!
It totals somewhere around ~6.667 GB.
I tested it, and the 0.6B model fits under 4 GB VRAM; then there is Parakeet, which brings it up to 6-7 GB. So if we remove it and enter the transcription manually, it should easily fit even on 4 GB GPUs.
oh yeah, I added parakeet for the demo but you definitely don't need it
This is... UH-MAZING. Thank you for all the hard work!
Thank you for this!