emg2speech demo
EMG-to-Speech pipeline overview (ALS subject, small language corpus):
A subject with amyotrophic lateral sclerosis (ALS) silently articulated a set of sentences while orofacial EMG was recorded. We convert these EMG signals into synthetic speech using the following pipeline:
1. Silent EMG Recording
   The subject articulates sentences without producing audible sound. Only orofacial EMG is captured.
2. Reference Audio Generation
   For each transcript, we generate clean reference audio using Google Text-to-Speech (TTS).
3. Discrete Speech Unit Extraction (HuBERT)
   We extract discrete HuBERT units from the reference audio, then train a neural network to predict these units directly from EMG.
4. Neural Vocoder Synthesis
   A pretrained vocoder converts the predicted HuBERT unit sequences into intelligible audio.
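Step 3 above can be sketched in miniature. Discrete HuBERT units are conventionally obtained by assigning each frame-level encoder embedding to its nearest entry in a k-means codebook; the sketch below uses random vectors in place of real HuBERT embeddings and a toy 4-entry codebook (both are stand-ins, not the actual model), then collapses consecutive duplicate units as unit-based vocoders commonly expect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for HuBERT frame embeddings: (num_frames, feat_dim).
# In the real pipeline these come from a pretrained HuBERT encoder.
features = rng.normal(size=(200, 8))

# Toy k-means codebook; in practice this is fit on a large speech corpus.
num_units = 4
codebook = rng.normal(size=(num_units, 8))

# Assign each frame to its nearest codebook entry -> discrete unit IDs.
dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
units = dists.argmin(axis=1)

# Collapse runs of identical units, a common preprocessing step before
# feeding unit sequences to a unit-based vocoder.
dedup = [int(units[0])]
for u in units[1:]:
    if u != dedup[-1]:
        dedup.append(int(u))

print(len(units), len(dedup))
```

The EMG-to-unit network is then trained to predict sequences like `dedup` directly from EMG features, bypassing audio entirely at inference time.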
The model is trained on 40 minutes of EMG data. The language corpus contains roughly 300 unique words and 600 sentences. Sentences in the test and validation sets do not appear in the training set.
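The sentence-disjoint split can be sketched as follows; the placeholder sentence list and split sizes are illustrative, not the actual corpus:

```python
import random

# Hypothetical stand-in for the corpus; the real set has ~600 sentences.
sentences = [f"sentence {i}" for i in range(600)]

random.seed(0)
random.shuffle(sentences)

# Hold out whole sentences, so no test/val sentence ever appears in training.
n_test, n_val = 60, 60
test_set = sentences[:n_test]
val_set = sentences[n_test:n_test + n_val]
train_set = sentences[n_test + n_val:]

print(len(train_set), len(val_set), len(test_set))
```

Splitting at the sentence level (rather than at the recording level) is what guarantees the model is evaluated on text it has never seen during training.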
Note: The audio and video in the examples are not synchronized. This is expected: the model predicts discrete HuBERT units with a CTC loss, and neither the units nor the CTC alignment preserves sample-accurate timing.
Samples
“The cushion feels tight again.”
“The pillow feels tight today.”
“My shoulder feels softer tonight.”
“My wrist is calming now.”
“My chest feels sore again.”