Mel-spectrogram boundary artifacts degrade quality in real-time mic streaming β€” fix inside

#15
by Twizz - opened

We've been building a real-time mic transcription pipeline with this model and found a significant quality gap between file-mode transcription and live streaming. Posting the root cause and fix in case it helps others.

Environment

  • NeMo Toolkit: 2.7.2
  • Model: nvidia/nemotron-speech-streaming-en-0.6b (March 2026 checkpoint)
  • GPU: NVIDIA RTX (CUDA 12.6, PyTorch 2.11.0+cu126)
  • OS: Windows 11 via WSL2

Problem

Using the approach from the Online_ASR_Microphone_Demo_Cache_Aware_Streaming notebook, where each mic chunk is preprocessed independently before being fed to conformer_stream_step, the output is noticeably worse than running the same audio through model.transcribe() or the file-based streaming script.

Words get fragmented, phrases get mangled, and punctuation breaks down. Running the same audio through file-mode transcription produces clean, accurate output.

Root cause

The mel-spectrogram (STFT) computation uses overlapping windows. When you preprocess each small audio chunk independently, the frames at chunk boundaries lack context from adjacent audio β€” they get zero-padded instead. This produces different mel values than computing the spectrogram over the full contiguous audio.
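The mismatch is easy to reproduce with a bare `torch.stft` (the NeMo preprocessor wraps a similar windowed FFT; this standalone sketch is illustrative, not the actual preprocessor code). A frame centered well inside a chunk comes out identical either way, but a frame centered at the chunk boundary sees zero padding instead of the neighboring audio:

```python
import torch

torch.manual_seed(0)
audio = torch.randn(16000)          # 1 s of noise at 16 kHz
n_fft, hop = 512, 160
win = torch.hann_window(n_fft)

def mag(x):
    # center=True with constant padding zero-pads n_fft//2 samples on
    # each side of the signal before framing
    return torch.stft(x, n_fft, hop_length=hop, window=win,
                      center=True, pad_mode="constant",
                      return_complex=True).abs()

full = mag(audio)            # spectrogram of the whole signal
left = mag(audio[:8000])     # spectrogram of the first chunk alone

# Frame 10 is centered deep inside the chunk: identical either way.
print(torch.allclose(full[:, 10], left[:, 10]))   # True
# Frame 50 is centered at the chunk boundary: the chunked version sees
# zero padding where the full computation sees real audio.
print(torch.allclose(full[:, 50], left[:, 50]))   # False
```

The boundary frames feed the encoder values it never saw in training on contiguous audio, which is what degrades the streamed transcript.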

The file-based CacheAwareStreamingAudioBuffer.append_audio_file() doesn't have this problem because it preprocesses the entire audio in one pass, then chunks the already-computed mel frames for the encoder.

Fix

Recompute the mel-spectrogram on all accumulated audio each time, but only feed new mel frames to the encoder:

import numpy as np
import torch

all_audio = []              # raw float32 mic samples accumulated so far
mel_frames_processed = 0    # mel frames already fed to the encoder

while True:
    # get new audio from mic
    all_audio.append(new_chunk)

    # recompute mel on ALL accumulated audio (cheap; see benchmarks below)
    all_np = np.concatenate(all_audio)
    full_mel, full_mel_len = preprocessor(
        input_signal=torch.from_numpy(all_np).unsqueeze(0).to(device),
        length=torch.tensor([all_np.shape[0]], device=device),
    )

    # only encode NEW mel frames
    while mel_frames_processed + native_chunk_frames <= full_mel.size(-1):
        chunk = full_mel[:, :, mel_frames_processed:mel_frames_processed + native_chunk_frames]
        # ... prepend pre_encode cache from full_mel, call conformer_stream_step ...
        mel_frames_processed += shift_frames

The pre-encode cache is also pulled from the full mel buffer (looking back pre_encode_cache_size frames) rather than maintained separately, so it's always consistent.
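A minimal sketch of that slicing, with hypothetical names (`chunk_with_cache`, `cache_frames`, and the tensor shapes are illustrative, not NeMo API):

```python
import torch

def chunk_with_cache(full_mel, start, chunk_frames, cache_frames):
    """Slice one encoder chunk plus its pre-encode cache out of the full
    mel buffer. Names and signature are illustrative, not NeMo API."""
    chunk = full_mel[:, :, start:start + chunk_frames]
    lo = max(0, start - cache_frames)
    cache = full_mel[:, :, lo:start]          # look back into real history
    pad = cache_frames - cache.size(-1)
    if pad > 0:  # at stream start there is no history yet: zero-pad left
        cache = torch.nn.functional.pad(cache, (pad, 0))
    return torch.cat([cache, chunk], dim=-1)

mel = torch.randn(1, 80, 200)   # (batch, mel bins, frames)
out = chunk_with_cache(mel, start=0, chunk_frames=112, cache_frames=16)
print(out.shape)                 # torch.Size([1, 80, 128])
```

Because the cache is re-sliced from the same recomputed buffer as the chunk itself, it can never drift out of sync with the mel frames the encoder actually receives.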

Performance

The mel recompute sounds expensive but it's negligible in practice. Tested at 1120ms chunk size:

| Audio accumulated | Mel recompute | Encoder step | Total |
|---|---|---|---|
| 2 s | 2.2 ms | 49 ms | 51 ms |
| 23 s | 3.2 ms | 74 ms | 78 ms |
| 6 m 45 s (724 steps) | 8.8 ms | 42 ms | 53 ms |

Mel stays under 10ms even approaching 7 minutes of continuous audio. The encoder only runs on new chunks regardless. For very long sessions you could reset every few minutes with a short overlap to keep it bounded.
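One possible shape for that bounded reset, as a sketch: the function name, the 300 s / 2 s thresholds, and the assumption that `hop` is the preprocessor's window stride in samples are all illustrative. The truncation point is snapped to a frame boundary so the mel-frame cursor stays valid after the drop:

```python
import numpy as np

def maybe_reset(all_audio, mel_frames_processed, sr=16000,
                max_seconds=300, overlap_seconds=2, hop=160):
    """Illustrative buffer bound: once the accumulated audio exceeds
    max_seconds, keep only the trailing overlap and shift the mel-frame
    cursor by the number of whole frames dropped."""
    total = sum(a.shape[0] for a in all_audio)
    if total < max_seconds * sr:
        return all_audio, mel_frames_processed
    audio = np.concatenate(all_audio)
    keep = overlap_seconds * sr
    # snap the cut to a frame boundary (dropped_frames * hop samples)
    dropped_frames = (audio.shape[0] - keep) // hop
    return ([audio[dropped_frames * hop:]],
            mel_frames_processed - dropped_frames)
```

After a reset the first recompute reintroduces a small zero-padding artifact at the new buffer start, but with a couple of seconds of overlap the next unprocessed frame sits far from that edge, so in practice it shouldn't affect output.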

Output quality with this approach matches file-mode transcription exactly.

NVIDIA org

That tutorial is a bit out of date, please try https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_streaming_inference/asr_streaming_infer.py instead.
