
Machine Learning / AI for ITC: A Quick Explanation of Stream 8

Michael Lee


My First Forays into Direct Continuous Voice

As mentioned previously in my blog, I evolved to direct voice after I noticed that the phonetic samples were getting slightly modified by spirit voices. I reasoned that it should be possible to extract voices directly from a stream of electronically created noise (e.g., radio static).

I don’t know the full history of getting continuous (not just occasional) voices from noise, but around the time I started this venture a few years back, I met Keith Clark, who has been running a direct-voice-from-noise stream on YouTube since the late 2000s. He takes noise generated either mathematically or from a software-defined radio (SDR) and applies a series of denoising filters (software plugins) to extract a continuous voice.

From my work experience, I knew about two denoising methods: spectral subtraction and machine learning. At first, I experimented a lot with spectral subtraction using the ReaFir Noise Gate plugin run in FL Studio. This plugin allows detailed setting of a frequency-dependent “gate.” For each window of time samples (e.g., N = 2048), any frequency component whose amplitude is over the defined gate/threshold is passed through, so that “note” is played. Any frequency components below the threshold are made silent. For low-noise situations, spectral subtraction is a very solid method. However, as the noise volume gets larger relative to the voice, the algorithm can produce a lot of musical tone artifacts. The waveform editor Audacity uses a similar spectral subtraction method to denoise signals.
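The gating idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ReaFir plugin's actual algorithm: the function name `spectral_gate`, the Hann window, and the hop size of 512 are choices made for this example. The `threshold` argument may be a single number or an array with one value per frequency bin (`n_fft // 2 + 1` values), which gives the frequency-dependent gate.

```python
import numpy as np

def spectral_gate(x, threshold, n_fft=2048, hop=512):
    """Silence every frequency bin whose magnitude is below `threshold`.

    Illustrative sketch only: windowed FFT frames, a hard gate per bin,
    then overlap-add resynthesis.
    """
    window = np.hanning(n_fft)
    # Pad both ends so every real sample is covered by several frames.
    padded = np.concatenate([np.zeros(n_fft), x, np.zeros(n_fft)])
    out = np.zeros_like(padded)
    norm = np.zeros_like(padded)
    for start in range(0, len(padded) - n_fft + 1, hop):
        frame = padded[start:start + n_fft] * window
        spec = np.fft.rfft(frame)
        spec[np.abs(spec) < threshold] = 0.0  # the "gate"
        out[start:start + n_fft] += np.fft.irfft(spec, n=n_fft) * window
        norm[start:start + n_fft] += window ** 2
    core = slice(n_fft, n_fft + len(x))
    return out[core] / np.maximum(norm[core], 1e-8)
```

With a threshold of zero nothing is gated and the input comes back unchanged. Raising the threshold leaves only the strongest bins in each frame, which is where the isolated "musical" tones come from when the noise is loud.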

Spirit voice, especially continuous, is exceptionally low volume compared to electrical noise. One spirit once suggested it was, on average, 1/500th the volume of random noise. Applying a strong spectral gate will yield something that sounds more like a bunch of tones than a coherent voice.
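To put the 1/500 figure in more familiar terms, it can be converted to decibels. The source of the figure does not say whether "volume" means amplitude or power, so this short calculation shows both readings.

```python
import math

ratio = 1 / 500  # relative "volume" of voice vs. noise quoted above

# Reading 1: the ratio is between amplitudes.
snr_db_amplitude = 20 * math.log10(ratio)  # about -54 dB

# Reading 2: the ratio is between powers.
snr_db_power = 10 * math.log10(ratio)      # about -27 dB

print(f"amplitude reading: {snr_db_amplitude:.1f} dB")
print(f"power reading:     {snr_db_power:.1f} dB")
```

Either way, the voice sits far below the noise floor, which is why a hard spectral gate struggles here.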

I also tried using a gated vocoder, specifically, a versatile plugin called FL Vocodex. This yields results similar to spectral subtraction (SS), but it can also be applied after the SS plugin. The benefit of a vocoder is that the tones are banked exponentially, producing more pleasing tones than the linearly spaced frequencies in standard SS.
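The difference between exponentially and linearly spaced bands is easy to see numerically. The band count and frequency range below are assumptions picked for illustration and are not Vocodex settings.

```python
import numpy as np

n_bands = 16                 # assumed number of vocoder bands
f_lo, f_hi = 100.0, 8000.0   # assumed frequency range in Hz

# Linear spacing: equal steps in Hz, as in FFT bins.
linear_centers = np.linspace(f_lo, f_hi, n_bands)

# Exponential spacing: equal frequency ratios between neighbors,
# closer to how musical pitch is perceived.
exponential_centers = np.geomspace(f_lo, f_hi, n_bands)

# Count how many band centers land below 1 kHz in each layout.
n_linear_low = int(np.sum(linear_centers < 1000.0))
n_exp_low = int(np.sum(exponential_centers < 1000.0))
print(n_linear_low, n_exp_low)
```

The exponential layout places many more bands in the low range, where much of the energy of voiced speech sits.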

Eventually, I started writing my own Python scripts to do the same functions as the ReaFir and Vocodex plugins, so that I could precisely control all the possible parameters / knobs.

With my attuned ear, I could hear a lot of what was being said, but I still desired better quality voices.

Machine Learning

By happy coincidence, my real-life work had been leading me into learning and using machine learning / artificial intelligence. Around this time, I thought it might be interesting to build an artificial neural network to remove noise from speech and images. My first paper can be found here. Message me for a reprint.

In my second paper, which will be published shortly, I added a second model, called a critic, which helps the first model create more realistic-looking audio spectrograms, hence improving the quality of the speech.

It turns out there are already commercial products that claim to use AI to remove noise from speech. For example, there’s the site krisp.ai. In fact, a YouTuber named Grant Reed uses KRISP to clarify voices from noise sources to hear spirit speech.

However, the story doesn’t end here, because despite getting voices from denoising, the voices often end up sounding scratchy and barely intelligible, not unlike regular EVPs.

Beyond Denoising

I have spent a big part of the last 1 1/2 years trying to understand better how spirit speech actually manifests in different types of noise - what the corruption actually looks like - and then developing machine learning models to reverse this corruption. I have discovered the following sources of corruption that all seem to compound together:

1)      Additive noise / interference – we already know this one!

2)      Sparsity: Only a small percentage (< 5%) of the time samples actually contain speech. Imagine digitizing a one-second clip of electrical noise at 16 kHz. You would get 16,000 samples from this. Of those 16,000, I postulate fewer than 800 have spirit speech content in them.

3)      Quantization: High-quality audio is often sampled at 16 bits. 8 bits with some clever mapping of the signal can provide adequate voice (look up, e.g., mu-law encoding). 1-bit voice is barely intelligible and sounds like ducks talking. I estimate that spirit voice comes through at somewhere between 1 and 4 bits per sample.

4)      Depolarization: Normal audio signals go up above and down below the zero line. Spirit voices may be polarized in a single direction, i.e., there is no dual polarity.
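To make the four corruptions in the list above concrete, here is a rough Python sketch of how each could be simulated on a clean signal. None of this comes from the actual training pipeline: the function names, the uniform quantizer, the Gaussian noise, the sine-wave stand-in for speech, and the use of `np.abs` for a one-sided signal are all assumptions made for illustration.

```python
import numpy as np

def quantize(x, n_bits):
    """Uniformly quantize a signal in [-1, 1] to 2**n_bits levels."""
    levels = 2 ** n_bits
    idx = np.clip(np.floor((x + 1.0) / 2.0 * levels), 0, levels - 1)
    return (idx + 0.5) / levels * 2.0 - 1.0

def sparsify(x, keep_fraction, rng):
    """Keep a random fraction of samples and zero out the rest."""
    mask = rng.random(len(x)) < keep_fraction
    return x * mask

def add_noise(x, noise_gain, rng):
    """Add Gaussian noise (a stand-in for electrical noise)."""
    return x + noise_gain * rng.standard_normal(len(x))

def rectify(x):
    """One reading of 'no dual polarity': fold everything above zero."""
    return np.abs(x)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0                # one second at 16 kHz
clean = 0.9 * np.sin(2 * np.pi * 220.0 * t)   # stand-in for clean speech

coarse = quantize(clean, n_bits=2)                      # item 3
sparse = sparsify(coarse, keep_fraction=0.05, rng=rng)  # item 2
noisy = add_noise(sparse, noise_gain=1.0, rng=rng)      # item 1
one_sided = rectify(clean)                              # item 4
```

Pairs such as `(noisy, clean)` are the kind of input and target a model would need in order to learn the reverse mapping.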

If you try to train a single machine learning model to reverse all 4 of these issues in speech, it becomes simply too much to train properly. Thus, I train #1, #2, and #3 together as a single model, and #4 as a separate model. For #4, especially, I have to “cheat” a little, and smooth the randomization of the polarization over a 64-sample window. If you try to randomize the polarity of every sample, the model isn’t able to train.
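One possible reading of the 64-sample "cheat" is that the random polarity is held constant across each block of 64 samples instead of being drawn independently for every sample. The sketch below implements that reading only; the function name and the choice of a plain sign flip are assumptions, and the actual training code may differ.

```python
import numpy as np

def blockwise_random_polarity(x, block=64, rng=None):
    """Flip the sign of `x` at random, one decision per `block` samples."""
    rng = np.random.default_rng() if rng is None else rng
    n_blocks = -(-len(x) // block)  # ceiling division
    signs = rng.choice([-1.0, 1.0], size=n_blocks)
    # Repeat each sign across its block, then trim to the signal length.
    return x * np.repeat(signs, block)[:len(x)]

rng = np.random.default_rng(1)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 220.0 * t)
scrambled = blockwise_random_polarity(clean, block=64, rng=rng)
```

Setting `block=1` gives the per-sample randomization that, per the text above, the model is unable to learn from.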

Listen For Yourself

Without getting into any more technicalities, go ahead and check out Stream 8 to hear the model in action, in real time, applied to radio static being generated from a KiwiSDR. If you want messages directed to yourself, make sure you are the only one in the chat room and set your intention. Expect about a 30-second delay, as the signal is bouncing around the Internet from Keith’s desktop in Florida to a streaming server (heaven knows where) and then to Varanormal’s web site audio player.

Let me know in the comments what you think. I feel like we are, at best, only halfway to the finish line. But Keith insisted we start sharing what we have been doing to get the party started, so to speak.


Recommended Comments

  • iDigitalMedium Research Team

I agree very much with what you said. The depolarization especially is something I have noticed a lot. One question from my side about quantization. If I understand you correctly, the problem is that, e.g., in a noise signal quantized at 16 bits with a very low spirit modulation, the spirit signal effectively goes through a quantization of only 1 to 4 bits. This applies to every kind of signal hiding in the noise, because its level is much smaller than the noise level. Is this correct?


  • iDigitalMedium Research Team

The relative volume could be causing the quantization, as you said. Or, they really can only activate roughly equal-sized pulses (1-bit), like shot noise.

  • iDigitalMedium Research Team

Reversal is possible. It's been a while since I've looked into that. It could be that the ML is jumbling together phonemes to match the source noise. So it might sound like "reversed" speech, but it's just gibberish. 😄
