Mel-spectrogram boundary artifacts degrade quality in real-time mic streaming (fix inside)
We've been building a real-time mic transcription pipeline with this model and found a significant quality gap between file-mode transcription and live streaming. Posting the root cause and fix in case it helps others.
Environment
- NeMo Toolkit: 2.7.2
- Model: nvidia/nemotron-speech-streaming-en-0.6b (March 2026 checkpoint)
- GPU: NVIDIA RTX (CUDA 12.6, PyTorch 2.11.0+cu126)
- OS: Windows 11 via WSL2
Problem
Using the approach from the Online_ASR_Microphone_Demo_Cache_Aware_Streaming notebook, where each mic chunk is preprocessed independently before being fed to conformer_stream_step, the output is noticeably worse than running the same audio through model.transcribe() or the file-based streaming script.
Words get fragmented, phrases get mangled, and punctuation breaks down. Running the same audio through file-mode transcription produces clean, accurate output.
Root cause
The mel-spectrogram (STFT) computation uses overlapping windows. When you preprocess each small audio chunk independently, the frames at chunk boundaries lack context from adjacent audio and get zero-padded instead. This produces different mel values than computing the spectrogram over the full contiguous audio.
The file-based CacheAwareStreamingAudioBuffer.append_audio_file() doesn't have this problem because it preprocesses the entire audio in one pass, then chunks the already-computed mel frames for the encoder.
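The boundary effect is easy to demonstrate in isolation. The sketch below (toy noise signal, plain `torch.stft` standing in for the mel frontend) frames the same audio once as a whole and once as two independent halves: interior frames match exactly, while the frames straddling the chunk boundary differ because each half only sees padding there.

```python
# Sketch: STFT of independently framed chunks vs. one full-signal pass.
# torch.stft stands in for the NeMo mel preprocessor; the padding effect
# at chunk boundaries is the same mechanism.
import torch

torch.manual_seed(0)
sr = 16000
signal = torch.randn(sr)              # 1 s of noise at 16 kHz
n_fft, hop = 512, 160
window = torch.hann_window(n_fft)

def stft_mag(x):
    return torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                      center=True, return_complex=True).abs()

full = stft_mag(signal)

# Preprocess two halves independently, as the per-chunk mic demo does
half = sr // 2
left, right = stft_mag(signal[:half]), stft_mag(signal[half:])

# Interior frames see only real samples, so they match the full pass
# exactly; the frame at the chunk boundary sees padding instead of the
# adjacent audio, so its values differ.
boundary_idx = left.size(-1) - 1      # frame centered on the chunk edge
interior_err = (full[:, 5] - left[:, 5]).abs().max()
boundary_err = (full[:, boundary_idx] - left[:, boundary_idx]).abs().max()
print(f"interior diff: {interior_err:.2e}, boundary diff: {boundary_err:.2e}")
```

Every chunk boundary in the live stream contributes a few such corrupted frames, which is why the degradation shows up as fragmented words rather than uniform noise.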
Fix
Recompute the mel-spectrogram on all accumulated audio each time, but only feed new mel frames to the encoder:
```python
import numpy as np
import torch

# assumes preprocessor (model.preprocessor), device, and the model's
# native_chunk_frames / shift_frames are already set up
all_audio = []
mel_frames_processed = 0

while True:
    new_chunk = get_mic_chunk()  # get new float32 audio from the mic
    all_audio.append(new_chunk)

    # recompute mel on ALL accumulated audio
    all_np = np.concatenate(all_audio)
    full_mel, full_mel_len = preprocessor(
        input_signal=torch.from_numpy(all_np).unsqueeze(0).to(device),
        length=torch.tensor([all_np.shape[0]], device=device),
    )

    # only encode NEW mel frames
    while mel_frames_processed + native_chunk_frames <= full_mel.size(-1):
        chunk = full_mel[:, :, mel_frames_processed:mel_frames_processed + native_chunk_frames]
        # ... prepend pre_encode cache from full_mel, call conformer_stream_step ...
        mel_frames_processed += shift_frames
```
The pre-encode cache is also pulled from the full mel buffer (looking back pre_encode_cache_size frames) rather than maintained separately, so it's always consistent.
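Concretely, the cache and the new chunk can come out of one contiguous slice of the recomputed mel buffer. A minimal sketch (variable names follow the loop above; `pre_encode_cache_size` and the frame counts are placeholder values, not the model's real configuration):

```python
# Sketch: slice the pre-encode cache and the new chunk from the same
# full mel buffer, so the cache always holds the corrected
# (full-context) mel values rather than a separately maintained copy.
import torch

full_mel = torch.randn(1, 80, 500)   # (batch, mels, frames), toy values
pre_encode_cache_size = 9            # lookback in frames, assumed value
native_chunk_frames = 112            # assumed value
mel_frames_processed = 224           # frames already sent to the encoder

start = mel_frames_processed
cache_start = max(0, start - pre_encode_cache_size)
# cache + new chunk as one contiguous slice of the recomputed mel
chunk_with_cache = full_mel[:, :, cache_start:start + native_chunk_frames]

# on the first steps there is no lookback yet, so left-pad with zeros
missing = pre_encode_cache_size - (start - cache_start)
if missing > 0:
    pad = torch.zeros(1, full_mel.size(1), missing)
    chunk_with_cache = torch.cat([pad, chunk_with_cache], dim=-1)
```

Because the slice is taken fresh each step, any boundary frames that were corrected by the latest full-buffer recompute automatically flow into the cache.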
Performance
The mel recompute sounds expensive but it's negligible in practice. Tested at 1120ms chunk size:
| Audio accumulated | Mel recompute | Encoder step | Total |
|---|---|---|---|
| 2s | 2.2ms | 49ms | 51ms |
| 23s | 3.2ms | 74ms | 78ms |
| 6m 45s (724 steps) | 8.8ms | 42ms | 53ms |
Mel stays under 10ms even approaching 7 minutes of continuous audio. The encoder only runs on new chunks regardless. For very long sessions you could reset every few minutes with a short overlap to keep it bounded.
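The periodic reset can be sketched as follows (the threshold and overlap values are illustrative, not tuned; `hop` is the preprocessor's hop length in samples):

```python
# Sketch: bound the audio buffer for very long sessions by dropping old
# samples once a threshold is reached, keeping a short overlap so the
# next mel recompute still has real left context at its start.
import numpy as np

SR = 16000
MAX_SECONDS = 300        # reset roughly every 5 minutes, assumed
OVERLAP_SECONDS = 2.0    # context kept across the reset, assumed

def maybe_reset(all_np: np.ndarray, mel_frames_processed: int, hop: int = 160):
    """Trim the accumulated audio; return the new buffer and the
    mel-frame counter re-based onto the trimmed buffer."""
    if all_np.shape[0] < MAX_SECONDS * SR:
        return all_np, mel_frames_processed
    keep = int(OVERLAP_SECONDS * SR)
    dropped = all_np.shape[0] - keep
    # drop a whole number of hops so frame indices stay aligned
    # with sample positions in the trimmed buffer
    dropped -= dropped % hop
    return all_np[dropped:], mel_frames_processed - dropped // hop
```

The frames inside the overlap get recomputed once after the reset with slightly less left context, so in principle a reset reintroduces a tiny boundary effect; a couple of seconds of overlap keeps it far from any frame that still matters.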
Output quality with this approach matches file-mode transcription exactly.
That tutorial is a bit out of date; try https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_streaming_inference/asr_streaming_infer.py instead.