Mel-spectrogram boundary artifacts degrade quality in real-time mic streaming β€” fix inside

#15
by Twizz - opened

We've been building a real-time mic transcription pipeline with this model and found a significant quality gap between file-mode transcription and live streaming. Posting the root cause and fix in case it helps others.

Environment

  • NeMo Toolkit: 2.7.2
  • Model: nvidia/nemotron-speech-streaming-en-0.6b (March 2026 checkpoint)
  • GPU: NVIDIA RTX (CUDA 12.6, PyTorch 2.11.0+cu126)
  • OS: Windows 11 via WSL2

Problem

Using the approach from the Online_ASR_Microphone_Demo_Cache_Aware_Streaming notebook, where each mic chunk is preprocessed independently before being fed to conformer_stream_step, the output is noticeably worse than running the same audio through model.transcribe() or the file-based streaming script.

Words get fragmented, phrases get mangled, and punctuation breaks down. Running the same audio through file-mode transcription produces clean, accurate output.

Root cause

The mel-spectrogram (STFT) computation uses overlapping windows. When you preprocess each small audio chunk independently, the frames at chunk boundaries lack context from adjacent audio β€” they get zero-padded instead. This produces different mel values than computing the spectrogram over the full contiguous audio.
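The mismatch is easy to reproduce with a bare `torch.stft` (the NeMo preprocessor wraps a similar windowed FFT; this standalone sketch is illustrative, not the actual preprocessor code). A frame centered well inside a chunk comes out identical either way, but a frame centered at the chunk boundary sees zero padding instead of the neighboring audio:

```python
import torch

torch.manual_seed(0)
audio = torch.randn(16000)          # 1 s of noise at 16 kHz
n_fft, hop = 512, 160
win = torch.hann_window(n_fft)

def mag(x):
    # center=True with constant padding zero-pads n_fft//2 samples on
    # each side of the signal before framing
    return torch.stft(x, n_fft, hop_length=hop, window=win,
                      center=True, pad_mode="constant",
                      return_complex=True).abs()

full = mag(audio)            # spectrogram of the whole signal
left = mag(audio[:8000])     # spectrogram of the first chunk alone

# Frame 10 is centered deep inside the chunk: identical either way.
print(torch.allclose(full[:, 10], left[:, 10]))   # True
# Frame 50 is centered at the chunk boundary: the chunked version sees
# zero padding where the full computation sees real audio.
print(torch.allclose(full[:, 50], left[:, 50]))   # False
```

The boundary frames feed the encoder values it never saw in training on contiguous audio, which is what degrades the streamed transcript.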

The file-based CacheAwareStreamingAudioBuffer.append_audio_file() doesn't have this problem because it preprocesses the entire audio in one pass, then chunks the already-computed mel frames for the encoder.

Fix

Recompute the mel-spectrogram on all accumulated audio each time, but only feed new mel frames to the encoder:

import numpy as np
import torch

all_audio = []              # raw float32 mic samples accumulated so far
mel_frames_processed = 0    # mel frames already fed to the encoder

while True:
    # get new audio from mic
    all_audio.append(new_chunk)

    # recompute mel on ALL accumulated audio (cheap; see benchmarks below)
    all_np = np.concatenate(all_audio)
    full_mel, full_mel_len = preprocessor(
        input_signal=torch.from_numpy(all_np).unsqueeze(0).to(device),
        length=torch.tensor([all_np.shape[0]], device=device),
    )

    # only encode NEW mel frames
    while mel_frames_processed + native_chunk_frames <= full_mel.size(-1):
        chunk = full_mel[:, :, mel_frames_processed:mel_frames_processed + native_chunk_frames]
        # ... prepend pre_encode cache from full_mel, call conformer_stream_step ...
        mel_frames_processed += shift_frames

The pre-encode cache is also pulled from the full mel buffer (looking back pre_encode_cache_size frames) rather than maintained separately, so it's always consistent.
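A minimal sketch of that slicing, with hypothetical names (`chunk_with_cache`, `cache_frames`, and the tensor shapes are illustrative, not NeMo API):

```python
import torch

def chunk_with_cache(full_mel, start, chunk_frames, cache_frames):
    """Slice one encoder chunk plus its pre-encode cache out of the full
    mel buffer. Names and signature are illustrative, not NeMo API."""
    chunk = full_mel[:, :, start:start + chunk_frames]
    lo = max(0, start - cache_frames)
    cache = full_mel[:, :, lo:start]          # look back into real history
    pad = cache_frames - cache.size(-1)
    if pad > 0:  # at stream start there is no history yet: zero-pad left
        cache = torch.nn.functional.pad(cache, (pad, 0))
    return torch.cat([cache, chunk], dim=-1)

mel = torch.randn(1, 80, 200)   # (batch, mel bins, frames)
out = chunk_with_cache(mel, start=0, chunk_frames=112, cache_frames=16)
print(out.shape)                 # torch.Size([1, 80, 128])
```

Because the cache is re-sliced from the same recomputed buffer as the chunk itself, it can never drift out of sync with the mel frames the encoder actually receives.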

Performance

The mel recompute sounds expensive but it's negligible in practice. Tested at 1120ms chunk size:

| Audio accumulated | Mel recompute | Encoder step | Total |
|---|---|---|---|
| 2 s | 2.2 ms | 49 ms | 51 ms |
| 23 s | 3.2 ms | 74 ms | 78 ms |
| 6 m 45 s (724 steps) | 8.8 ms | 42 ms | 53 ms |

Mel stays under 10ms even approaching 7 minutes of continuous audio. The encoder only runs on new chunks regardless. For very long sessions you could reset every few minutes with a short overlap to keep it bounded.
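One possible shape for that bounded reset, as a sketch: the function name, the 300 s / 2 s thresholds, and the assumption that `hop` is the preprocessor's window stride in samples are all illustrative. The truncation point is snapped to a frame boundary so the mel-frame cursor stays valid after the drop:

```python
import numpy as np

def maybe_reset(all_audio, mel_frames_processed, sr=16000,
                max_seconds=300, overlap_seconds=2, hop=160):
    """Illustrative buffer bound: once the accumulated audio exceeds
    max_seconds, keep only the trailing overlap and shift the mel-frame
    cursor by the number of whole frames dropped."""
    total = sum(a.shape[0] for a in all_audio)
    if total < max_seconds * sr:
        return all_audio, mel_frames_processed
    audio = np.concatenate(all_audio)
    keep = overlap_seconds * sr
    # snap the cut to a frame boundary (dropped_frames * hop samples)
    dropped_frames = (audio.shape[0] - keep) // hop
    return ([audio[dropped_frames * hop:]],
            mel_frames_processed - dropped_frames)
```

After a reset the first recompute reintroduces a small zero-padding artifact at the new buffer start, but with a couple of seconds of overlap the next unprocessed frame sits far from that edge, so in practice it shouldn't affect output.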

Output quality with this approach matches file-mode transcription exactly.

NVIDIA org

That tutorial is a bit out of date, please try https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_streaming_inference/asr_streaming_infer.py instead.
