emg2speech

Abstract

We present a neuromuscular speech interface that translates electromyographic (EMG) signals recorded from orofacial muscles during speech articulation directly into audio. We find that self-supervised speech representations (SS) are strongly linearly related to the electrical power of muscle activity: a simple linear mapping predicts EMG power from SS with a correlation of r = 0.85. In addition, EMG power vectors associated with distinct articulatory gestures form structured, separable clusters. Together, these observations suggest that SS implicitly encode articulatory mechanisms, as reflected in EMG activity. Leveraging this structure, we map EMG signals into the SS space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory modeling or vocoder training. We demonstrate this system with a participant with amyotrophic lateral sclerosis (ALS), converting orofacial EMG recorded while she silently articulated speech into audio.
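The linear relationship described above can be illustrated with a minimal sketch. It fits a ridge-regularized linear map from SS features to per-channel EMG power and scores it with the Pearson correlation on held-out frames. Everything here is an assumption for illustration: the data are synthetic, the array sizes, the ridge penalty, and the helper names `fit_linear_map` and `mean_pearson_r` are hypothetical, and the correlation it yields says nothing about the reported r = 0.85.

```python
import numpy as np


def fit_linear_map(ss, emg_power, ridge=1e-3):
    """Fit W, b so that emg_power ~ ss @ W + b (ridge-regularized least squares)."""
    mu_x = ss.mean(axis=0)
    mu_y = emg_power.mean(axis=0)
    x = ss - mu_x
    y = emg_power - mu_y
    # Normal equations with a small ridge term for numerical stability.
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    w = np.linalg.solve(gram, x.T @ y)
    b = mu_y - mu_x @ w
    return w, b


def mean_pearson_r(pred, target):
    """Pearson correlation per EMG channel, averaged over channels."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    num = (p * t).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return float(np.mean(num / den))


# Synthetic stand-in data (assumption): frames x SS dims, frames x EMG channels.
rng = np.random.default_rng(0)
n_frames, ss_dim, n_channels = 2000, 64, 8
ss = rng.standard_normal((n_frames, ss_dim))
true_w = rng.standard_normal((ss_dim, n_channels))
emg_power = ss @ true_w + 0.5 * rng.standard_normal((n_frames, n_channels))

# Fit on the first 1500 frames, evaluate on the remaining 500.
train, test = slice(0, 1500), slice(1500, None)
w, b = fit_linear_map(ss[train], emg_power[train])
r = mean_pearson_r(ss[test] @ w + b, emg_power[test])
print(f"held-out mean Pearson r = {r:.3f}")
```

With real recordings, `ss` would hold frame-level SS features and `emg_power` the time-aligned power of each EMG channel. The two streams must share a frame rate before fitting.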

emg2speech demos with an ALS participant

The cushion feels tight again.

The pillow feels tight today.

My shoulder feels softer tonight.

My wrist is calming now.

My chest feels sore again.

Note: The audio and video in these examples are not synchronized. EMG-to-speech generation operates on discrete HuBERT units and is trained with CTC loss, which preserves neither sample-accurate timing alignment nor the original duration of each sample, even though the model is causal.
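A short sketch may help show why CTC-style unit sequences lose timing. It applies the standard CTC collapsing rule (merge consecutive repeats, then drop blanks) to two hypothetical frame-level label sequences of different lengths. The unit IDs, the blank index, and the function name `ctc_collapse` are assumptions for illustration, not the system's actual decoder.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    return [label for label, _ in groupby(frame_labels) if label != blank]


# Two hypothetical frame-level outputs for the same utterance:
# one articulated quickly (8 frames), one slowly (17 frames).
fast = [0, 7, 7, 0, 3, 0, 3, 3]
slow = [0, 0, 7, 7, 7, 7, 0, 0, 3, 3, 0, 0, 0, 3, 3, 3, 3]

print(ctc_collapse(fast))  # [7, 3, 3]
print(ctc_collapse(slow))  # [7, 3, 3]
```

Both inputs collapse to the same unit sequence, so the per-frame durations cannot be recovered from the units alone. Speech synthesized from them follows its own timing, not that of the video.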

emg2speech demos with a healthy participant

People find ways around.

You knock people down.

And then you sprinkle the cheese on top of that.

I make that once in a while.

I don't have a lot of time.