emg2speech: synthesizing speech from electromyography using self-supervised speech models

Harshavardhana T. Gowda1
Daniel C. Comstock1
Lee M. Miller1
1University of California, Davis

Abstract

We present a neuromuscular speech interface that translates electromyographic (EMG) signals recorded from orofacial muscles during speech articulation directly into audio. We find that self-supervised speech representations (SS) are strongly linearly related to the electrical power of muscle activity: a simple linear mapping predicts EMG power from SS with a correlation of r = 0.85. In addition, EMG power vectors associated with distinct articulatory gestures form structured, separable clusters. Together, these observations suggest that SS implicitly encode articulatory mechanisms, as reflected in EMG activity. Leveraging this structure, we map EMG signals into the SS space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory modeling or vocoder training. We demonstrate this system with a participant with amyotrophic lateral sclerosis (ALS), converting orofacial EMG, recorded while she silently articulated speech, into audio.
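The linear SS-to-EMG-power analysis described above can be sketched as an ordinary least-squares fit followed by a Pearson correlation on held-out frames. This is a minimal illustration on synthetic data, not the paper's pipeline: the feature dimension, channel count, and data below are placeholder assumptions.

```python
import numpy as np

# Hypothetical sketch: predict per-channel EMG power from self-supervised
# speech (SS) features with a linear map, then measure the Pearson
# correlation between predicted and actual power. All sizes and data are
# synthetic placeholders, not the values used in the paper.

rng = np.random.default_rng(0)
n_frames, ss_dim, n_channels = 2000, 64, 8  # assumed, illustrative sizes

# Synthetic SS features and an underlying linear relation plus noise
ss_feats = rng.standard_normal((n_frames, ss_dim))
true_map = rng.standard_normal((ss_dim, n_channels)) / np.sqrt(ss_dim)
emg_power = ss_feats @ true_map + 0.5 * rng.standard_normal((n_frames, n_channels))

# Fit the linear map (with a bias column) by least squares on a training split
split = n_frames // 2
X = np.hstack([ss_feats, np.ones((n_frames, 1))])
W, *_ = np.linalg.lstsq(X[:split], emg_power[:split], rcond=None)

# Evaluate: Pearson r between predicted and actual EMG power on held-out frames
pred = X[split:] @ W
r = np.corrcoef(pred.ravel(), emg_power[split:].ravel())[0, 1]
print(f"held-out correlation r = {r:.2f}")
```

On this synthetic data the held-out correlation is high because the generating relation is exactly linear; the paper's r = 0.85 is the analogous quantity measured on real SS features and recorded EMG.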

emg2speech demos with an ALS participant

Note: The audio and video in these examples are not synchronized. EMG-to-speech generation operates on discrete HuBERT units trained with CTC loss, which do not preserve sample-accurate timing alignment or the original sample duration, even though the model is causal.

emg2speech demos with a healthy participant

Note: The audio and video in these examples are not synchronized. EMG-to-speech generation operates on discrete HuBERT units trained with CTC loss, which do not preserve sample-accurate timing alignment or the original sample duration, even though the model is causal.

All human transcriptions

Preview in browser  ·  Download XLSX