Spaces:
Running on A10G
Technical details: how is it achieved?
Is it a fine-tuned model, or some deployment optimizations, etc.?
Hi! Yes, that's the project. Thinking about putting out a blog, but mainly it's using transformers' static cache and CUDA graphs. Static cache is needed for CUDA graphs, but it already accelerates everything on its own. Then this model is particularly slow because it has to do a lot of handshakes between GPU and CPU, so using CUDA graphs makes it fly. The benchmarks in the GitHub repo show a bit how different systems react differently. The Jetson with the slow CPU sees a huge speedup, whereas the DGX Spark with its integrated chip sees less, but still a clear speedup.
Streaming was a whole thing on its own. Had to read the paper and think about how to get it to work without artifacts, and do lots and lots of experiments to make sure things were coming out cleanly.
TLDR: Same exact weights as the official implementation. Optimized inference.
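A minimal sketch of the pattern described above (my guess at the general technique, not the author's actual code): a static cache means tensor addresses never change between decode steps, which is exactly what CUDA graph capture requires. You record one step, then replay it by copying new data into the fixed buffers, skipping the per-kernel CPU launch overhead (the "handshakes").

```python
# Sketch, assuming PyTorch; `decode_step` is a stand-in for one transformer
# decode step that writes into a preallocated ("static") cache in-place.
import torch

def decode_step(x, cache):
    cache += x          # in-place update keeps tensor addresses fixed
    return cache.sum()

device = "cuda" if torch.cuda.is_available() else "cpu"
static_x = torch.ones(4, device=device)       # fixed input buffer
static_cache = torch.zeros(4, device=device)  # preallocated static cache

if device == "cuda":
    # warm-up on a side stream before capture, as CUDA graphs require
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        decode_step(static_x, static_cache)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        decode_step(static_x, static_cache)

    # replay: copy fresh data into the fixed buffers, then launch the whole
    # recorded step with a single CPU call instead of one launch per kernel
    static_cache.zero_()
    static_x.copy_(torch.full((4,), 2.0, device=device))
    g.replay()
else:
    # CPU fallback so the sketch runs anywhere: same step, no graph
    static_x.fill_(2.0)
    decode_step(static_x, static_cache)

print(float(static_cache.sum()))
```

The key constraint is that replay reuses the exact memory recorded at capture time, which is why a dynamically growing KV cache breaks this and a static one doesn't.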
I can confirm that the 0.6B model indeed fits in 8GB VRAM -- and probably 6 GB if you're not using any for your desktop environment.
As for speed & quality -- it seems much faster, but I have minor doubts about possible quality degradation. It might have been a fluke, but I noticed some weird output a couple of times.
RTX 3050 GA107 8GB results with low-end supporting hardware:
{
  "0.6B": {
    "model": "0.6B",
    "gpu": "NVIDIA_GeForce_RTX_3050",
    "ttfa_ms": 415.1116179986275,
    "ttfa_std_ms": 13.525802715881103,
    "streaming_baseline_ttfa_ms": 1577.4137812011759,
    "streaming_baseline_ttfa_std_ms": 53.895404091137486,
    "streaming_baseline_rtf": 0.40037805688079825,
    "streaming_baseline_rtf_std": 0.00800231149396026,
    "streaming_fast_ttfa_ms": 406.40410680061905,
    "streaming_fast_ttfa_std_ms": 2.9588236981470484,
    "streaming_fast_rtf": 1.4980374219856765,
    "streaming_fast_rtf_std": 0.011108189280494557,
    "ttfa_by_chunk_size": {
      "4": {
        "mean": 250.57873179903254,
        "std": 18.795786669598712
      },
      "8": {
        "mean": 415.1116179986275,
        "std": 13.525802715881103
      },
      "12": {
        "mean": 578.312341797573,
        "std": 10.528144713370644
      }
    }
  }
}
YEAH!! Write a blog post dude! I would love to know how this works! First time I've heard anyone talk about CUDA Graphs.
Question: If I used the GGUF version of the 0.6B Qwen model would it break anything? I imagine with CUDA Graphs you can't quantise it much more for CPU inference? What about WebGPU?
Thanks for the cool demo and idea!
It totals somewhere around ~6.667 GB.
I tested it, and the 0.6B model fits under 4 GB VRAM; then there is Parakeet, which brings it up to 6-7 GB. So if we remove it and enter the transcription manually, it should easily fit even on 4 GB GPUs.
oh yeah, I added parakeet for the demo but you definitely don't need it
This is... UH-MAZING. Thank you for all the hard work!
Thank you for this!