
Michael Lee


Voice compression algorithms exploit the common patterns of human speech to detect (at one end) and synthesize (at the other end) voice communication. Among the common structures of speech is the glottal pulse, the buzzing "ah" sound that forms the basis of all vowels and certain voiced consonants (like z, v, and r). White noise is the other base sound, forming phonemes like "s", "sh", and "t". Shaping these two foundational sounds are formants, the various resonances of the human vocal cavity. Formants can be modelled more or less as a small sum of narrow bandpass filters, with either Gaussian or Lorentzian (1/[1+x^2]) shapes. Although I don't use this approach, formants can also be modelled as a 10th- to 15th-order all-pole filter; as expected, the resonant peaks of such a filter look roughly Gaussian.
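As a rough sketch in NumPy, a formant envelope of this kind can be written as a small sum of Gaussian (or Lorentzian) bandpass shapes. The formant centers below are textbook values for the vowel /a/, chosen purely for illustration:

```python
import numpy as np

def gaussian_formant(f, center, sigma):
    """Gaussian bandpass magnitude response centered on a formant frequency."""
    return np.exp(-0.5 * ((f - center) / sigma) ** 2)

def lorentzian_formant(f, center, sigma):
    """Lorentzian magnitude response: 1 / (1 + x^2), with x = (f - center) / sigma."""
    x = (f - center) / sigma
    return 1.0 / (1.0 + x ** 2)

# A vowel-like spectral envelope as a small sum of narrow bandpass shapes.
# The centers are textbook /a/ formants, used only for illustration.
freqs = np.linspace(0.0, 4000.0, 1024)   # Hz
centers = [730.0, 1090.0, 2440.0]
envelope = sum(gaussian_formant(freqs, fc, 120.0) for fc in centers)
```

Swapping `gaussian_formant` for `lorentzian_formant` changes only how quickly each resonance falls off away from its center.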

If we are trying to recover the vestiges of speech from weak interdimensional signals, the same concepts used in voice compression can be applied to deduce subtle voice patterns. The challenge, of course, is making correct deductions about the various speech components given the fact that the noise dominates the weak spirit signal.

I hypothesize that spirit signals in our devices often carry extremely low-bit information, not unlike voice compression, with one caveat: our compression algorithms can selectively encode the most salient aspects of the transmitter's voice patterns, whereas the signal of a spirit's voice may be the 1-bit on-off "dither" or random back-and-forth shot noise of a semiconducting element.

I estimate that high-quality human voice requires about 4 to 6 formants. Through trial and error, I settled on a Gaussian formant function with a width (standard deviation) of 120 Hz.
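For concreteness, one way a formant detector in this spirit might work is to pick the strongest few spectral peaks and treat each as the center of a 120 Hz-wide Gaussian. The SciPy peak picking below is my stand-in, a sketch under assumptions rather than the exact algorithm:

```python
import numpy as np
from scipy.signal import find_peaks

def detect_formants(magnitude, freqs, n_formants=5):
    """Pick the strongest spectral peaks as formant candidates.

    `magnitude` and `freqs` are same-length arrays (a magnitude spectrum
    and its bin frequencies). The peak-picking details are assumptions.
    """
    peaks, _ = find_peaks(magnitude)
    if len(peaks) == 0:
        return np.array([])
    # Keep the n strongest peaks, returned in ascending frequency order.
    strongest = peaks[np.argsort(magnitude[peaks])[-n_formants:]]
    return np.sort(freqs[strongest])
```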

For many input sources, higher-frequency formants tend to be missing or obscured by the artifacts of 1-bit quantization.

Pitch detection: We can assume that the fundamental frequency of the glottal pulse ranges from 75 Hz (a deep male voice) to 500 Hz (a child's voice).
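A standard way to exploit that assumption is an autocorrelation search restricted to the lags corresponding to 75-500 Hz. The sketch below is one such implementation, not necessarily the exact method used here:

```python
import numpy as np

def detect_pitch(frame, sample_rate, f_min=75.0, f_max=500.0):
    """Autocorrelation pitch estimate restricted to the 75-500 Hz range.

    The frame should span at least two pitch periods at f_min.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f_max)   # shortest glottal period considered
    lag_max = int(sample_rate / f_min)   # longest glottal period considered
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / lag
```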

Voiced (vowel) vs. unvoiced (consonant) detection: One method I use is to count the number of zero crossings in the clip. If the count is above a threshold, the clip is assumed to be unvoiced.
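A minimal sketch of that test, expressing the threshold as an equivalent frequency; the 3 kHz value is illustrative, not the actual threshold used here:

```python
import numpy as np

def is_unvoiced(frame, sample_rate, threshold_hz=3000.0):
    """Flag a frame as unvoiced when its zero-crossing rate is high.

    A sinusoid crosses zero twice per cycle, so the crossing count maps
    to an equivalent frequency. The 3 kHz threshold is an assumed value.
    """
    signs = np.signbit(frame)
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    equivalent_hz = crossings * sample_rate / (2.0 * len(frame))
    return equivalent_hz > threshold_hz
```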

Our glottal pulse has equal-amplitude harmonics, since the formants, not the source, govern the amplitudes of the individual harmonics. The shape of the glottal pulse and the resulting harmonics were obtained more or less by trial and error. The glottal pulse sounds like a digital "ah".
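A sketch of such an equal-harmonic source, band-limited to Nyquist; with enough harmonics this is a band-limited pulse train, the digital "ah":

```python
import numpy as np

def glottal_source(f0, duration, sample_rate=48000):
    """Sum of equal-amplitude harmonics of f0, band-limited to Nyquist.

    A sketch of the equal-harmonic source described above; the formant
    filters are expected to shape the flat harmonic amplitudes afterward.
    """
    t = np.arange(int(duration * sample_rate)) / sample_rate
    n_harmonics = int((sample_rate / 2) // f0)
    pulse = sum(np.cos(2.0 * np.pi * k * f0 * t)
                for k in range(1, n_harmonics + 1))
    return pulse / n_harmonics
```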

The realism of the synthesized speech can be improved by convolving the signal with a short (48 ms) random all-pass filter, which acts much like a reverberation function.
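One plausible way to realize such a filter, offered as an assumption rather than the construction actually used here, is to assign unit magnitude and random phase to every frequency bin and take an inverse FFT:

```python
import numpy as np

def random_allpass(length_ms=48.0, sample_rate=48000, seed=0):
    """FIR with unit magnitude and random phase in every frequency bin.

    A short all-pass "micro-reverb" of the kind described above; the
    IFFT construction, sample rate, and seed are my assumptions.
    """
    n = int(length_ms * 1e-3 * sample_rate)
    rng = np.random.default_rng(seed)
    spectrum = np.ones(n // 2 + 1, dtype=complex)
    # DC and Nyquist bins stay real so the impulse response is real.
    spectrum[1:-1] = np.exp(1j * rng.uniform(-np.pi, np.pi, n // 2 - 1))
    return np.fft.irfft(spectrum, n)

# Usage: y = np.convolve(speech, random_allpass(), mode="same")
```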

Performance on clear speech demonstrates that our algorithm works correctly, in principle.

Let's listen to some audio samples using my voice. 

First, clean voice, spoken three different ways: normal, whisper, and raspy: voice_variations_clean.mp3

Second, processed with the formant detector (FD) algorithm set at normal voicing: voice_variations_fd80.mp3

Third, processed with the FD at enhanced voicing: voice_variations_fd145.mp3

As you can hear, enhanced voicing may be able to make raspy ITC audio more "life-like."

 
