
Denoising with Spectral Subtraction

Michael Lee


If you're doing direct voice ITC, you'll probably want to denoise the signals you're capturing. The goal of denoising is to remove noise from a voice signal or, equivalently, to enhance the non-noise: the speech that may be embedded in a hardware noise source.

Of all the methods for denoising a signal, spectral subtraction is among the oldest and best known. As the term "spectral" implies, it involves converting a time-based audio stream into a frequency-based (spectral) vector using the Fourier transform.

First, we assume that the desired signal we are trying to restore, X, is corrupted by an additive noise source, N, such that the resulting observed signal is Y,

Y = X + N.

Since N is random and unknowable ahead of time, we can't subtract it from the observed signal, Y. However, in frequency space, we can approximately subtract the noise, given knowledge of the noise's average frequency/power spectrum,

|X(f)| ≈ |Y(f)| - |N(f)|.

Simply put: compute the frequency spectrum of the observed signal, subtract a fixed amount from each frequency (equal to the estimated noise at that frequency), then transform the result back to the time domain. Whenever the subtraction gives a value below zero, it is simply set to zero.
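The recipe above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a tuned implementation. The function name `spectral_subtract`, the 512-sample frame, the 50% overlap, and the Hann window are all my own assumptions. So is reusing the phase of the observed signal when going back to the time domain, which is a common choice but not the only one.

```python
import numpy as np

def spectral_subtract(y, noise_mag, frame_len=512, hop=256):
    """Magnitude spectral subtraction with overlap-add resynthesis.

    y         : 1-D array, the observed (noisy) signal
    noise_mag : array of length frame_len // 2 + 1, the estimated noise
                magnitude in each frequency bin
    Frame length, hop and window are illustrative assumptions.
    """
    # Periodic Hann window: overlapping copies sum to 1 at 50% overlap.
    window = np.hanning(frame_len + 1)[:-1]
    out = np.zeros(len(y))
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len] * window
        spec = np.fft.rfft(frame)                 # to frequency space
        mag = np.abs(spec)
        phase = np.angle(spec)                    # assumption: keep the noisy phase
        clean_mag = np.maximum(mag - noise_mag, 0.0)  # subtract, set negatives to zero
        clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len)
        out[start:start + frame_len] += clean     # overlap-add back to the time domain
    return out
```

With `noise_mag` set to all zeros the function hands back the input unchanged (apart from the first and last half frame), which is a handy sanity check before trying real noise estimates.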

One challenge is knowing the noise's frequency spectrum. It can be estimated by taking the frequency spectrum of a part of the observed signal that is known to contain only noise and no voice. This, of course, is not trivial given the hypothesis that spirit speech is present almost continuously in the noise.
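If you do have a stretch of recording you trust to be noise only, one simple way to estimate the noise spectrum is to average the magnitude spectrum over many short frames of that stretch. The sketch below assumes exactly that. The name `estimate_noise_mag`, the frame size, and the Hann window are hypothetical choices of mine, and the window should match whatever is used in the subtraction step.

```python
import numpy as np

def estimate_noise_mag(noise_only, frame_len=512, hop=256):
    """Average magnitude spectrum of a segment assumed to contain only noise.

    Returns an array of length frame_len // 2 + 1 (one value per frequency bin).
    """
    if len(noise_only) < frame_len:
        raise ValueError("noise segment is shorter than one frame")
    window = np.hanning(frame_len + 1)[:-1]  # periodic Hann window (assumption)
    mags = [
        np.abs(np.fft.rfft(noise_only[s:s + frame_len] * window))
        for s in range(0, len(noise_only) - frame_len + 1, hop)
    ]
    # Average over frames so random frame-to-frame fluctuations even out.
    return np.mean(mags, axis=0)
```

The longer the noise-only segment, the smoother the estimate. A segment of only a frame or two will give a very ragged spectrum.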

One simplification is to use a hardware noise source that is white, i.e., one whose frequencies all have the same power on average. The noise term is then the same constant at every frequency, which lets us avoid the difficult job of estimating a separate noise value for each frequency.
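For white noise, the per-frequency noise term collapses to a single number. As a rough sketch, assuming the noise is Gaussian with a known standard deviation `sigma` and that frames are Hann-windowed before the Fourier transform (both are my assumptions, as is the function name `white_noise_mag`), the expected magnitude in each bin away from the two ends of the spectrum works out to 0.5 * sigma * sqrt(pi * sum(window^2)). In practice you would probably still scale this level by ear.

```python
import numpy as np

def white_noise_mag(sigma, frame_len=512):
    """Flat noise-magnitude estimate for Gaussian white noise of std `sigma`.

    Assumes frames are multiplied by a periodic Hann window before the FFT.
    The value is the mean of the Rayleigh-distributed bin magnitude, so it is
    only approximate for the bins at the very bottom and top of the spectrum.
    """
    window = np.hanning(frame_len + 1)[:-1]
    level = 0.5 * sigma * np.sqrt(np.pi * np.sum(window ** 2))
    # Same level in every frequency bin, because the noise is white.
    return np.full(frame_len // 2 + 1, level)
```

The design point is that one number (`sigma`) replaces a whole measured spectrum, so no noise-only segment is needed.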

There are many papers on spectral subtraction in the scientific literature to help you understand the method better. The one caveat is that the method is usually not applied to cases where the noise overwhelms a weak signal. When that happens, the resulting denoised signal usually sounds like discordant musical tones. The discordance is partly because the tones are linearly spaced (like the bins of a Fourier transform) rather than exponentially spaced (like the notes of a musical scale).

Figure 1 shows spectrograms demonstrating spectral subtraction, where I added white noise of the same magnitude as the speech signal (a real physical voice). The left picture is the original clean speech. The middle picture has the added white noise. The right picture is the attempted denoising.
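If you want to set up a similar test with your own recording, here is a small sketch for mixing in white noise at a chosen level. It reads "same magnitude" as equal RMS (a signal-to-noise ratio of 0 dB), which is my interpretation and may not be exactly how the figure was made. The function name `add_white_noise` and its parameters are hypothetical.

```python
import numpy as np

def add_white_noise(x, snr_db=0.0, seed=0):
    """Mix Gaussian white noise into x at the requested SNR (in dB).

    snr_db=0 means the noise has the same RMS as the signal.
    Returns the noisy signal and the noise that was added.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(x))
    x_rms = np.sqrt(np.mean(x ** 2))
    n_rms = np.sqrt(np.mean(noise ** 2))
    target_rms = x_rms / (10.0 ** (snr_db / 20.0))
    noise = noise * (target_rms / n_rms)  # rescale noise to the target RMS
    return x + noise, noise
```

Keeping the added noise around makes it easy to compare the denoised result against the clean original afterwards.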

Notice that only the lower harmonics are still visible in the reconstruction. The higher-frequency formants are missing; this is a perennial problem with direct voice.



