emg2speech demo

EMG-to-Speech pipeline overview (healthy subject, general language corpus):

A healthy subject articulated a set of sentences while orofacial EMG was recorded. We convert these EMG signals into synthetic speech using the following pipeline:

  1. EMG Recording

    The subject articulates sentences naturally. Only orofacial EMG is captured.

  2. Reference Audio Generation

    For each transcript, we generate clean reference audio using Google Text-to-Speech (TTS).

  3. Discrete Speech Unit Extraction (HuBERT)

    We extract discrete HuBERT units from the reference audio.

    A neural network is trained to predict these HuBERT units directly from EMG.

  4. Neural Vocoder Synthesis

    A pretrained vocoder converts the predicted HuBERT unit sequences into intelligible audio.
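As a rough illustration of steps 3 and 4, discrete HuBERT units are codebook indices assigned to frame-level features, and vocoders typically consume the unit sequence with consecutive repeats merged. The sketch below is a minimal stand-in, not the actual pipeline: a random codebook plays the role of HuBERT's k-means clusters, and plain NumPy nearest-centroid assignment plays the role of unit extraction.

```python
import numpy as np

def features_to_units(features, codebook):
    """Assign each feature frame to its nearest codebook centroid (unit ID)."""
    # Pairwise distances between frames and centroids: shape (frames, units).
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def collapse_repeats(units):
    """Merge consecutive duplicate units, the form unit vocoders usually expect."""
    out = [units[0]]
    for u in units[1:]:
        if u != out[-1]:
            out.append(u)
    return out

rng = np.random.default_rng(0)
codebook = rng.normal(size=(100, 768))  # stand-in for 100 HuBERT k-means clusters
feats = rng.normal(size=(50, 768))      # stand-in for frame-level features
units = features_to_units(feats, codebook)
print(collapse_repeats(units.tolist()))
```

In the real system, the EMG network is trained to predict these unit IDs directly, and a pretrained unit-based vocoder maps the collapsed sequence back to a waveform.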

The model is trained on nearly 8 hours of EMG data. The language corpus contains around 6,500 unique words and 10,000 sentences. Sentences in the test and validation sets do not appear in the training set.
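The sentence-disjoint split described above can be sketched as follows. The helper and the fractions are illustrative assumptions, not the project's actual split code; the key property is that no validation or test transcript appears in training.

```python
import random

def split_sentences(sentences, val_frac=0.1, test_frac=0.1, seed=0):
    """Split unique sentences so val/test transcripts never appear in training."""
    uniq = sorted(set(sentences))          # deduplicate, fix order for reproducibility
    random.Random(seed).shuffle(uniq)
    n_val = int(len(uniq) * val_frac)
    n_test = int(len(uniq) * test_frac)
    val = uniq[:n_val]
    test = uniq[n_val:n_val + n_test]
    train = uniq[n_val + n_test:]
    return train, val, test

train, val, test = split_sentences([f"sentence {i}" for i in range(100)])
print(len(train), len(val), len(test))  # 80 10 10
```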

Note: The audio and video in the examples are not synchronized. This is expected: the model predicts discrete HuBERT units with a CTC loss, which does not preserve sample-accurate timing. We never use any audio or visual cues from the subject; we use only EMG and convert it directly to audio.
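The timing caveat follows from CTC's decoding rule: repeated symbols are merged and blanks are dropped, so per-frame durations are discarded. A minimal sketch of greedy CTC collapse (blank ID assumed to be 0 here):

```python
def ctc_collapse(frame_ids, blank=0):
    """Greedy CTC decode: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for u in frame_ids:
        if u != prev and u != blank:
            out.append(u)
        prev = u
    return out

# 10 frames collapse to 3 units; how long each unit was held is lost,
# which is why synthesized audio need not line up with the video.
print(ctc_collapse([0, 5, 5, 5, 0, 0, 7, 7, 9, 0]))  # [5, 7, 9]
```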


Samples

“People find ways around.”

“You knock people down.”

“And then you sprinkle the cheese on top of that.”

“I make that once in a while.”

“I don't have a lot of time.”