mitchell vitez

Mouth Controls

After my recent foray into generating character voice sounds from text automatically, I was interested in learning how to add automatic lip syncing as well. There are many different ways to do this, with wildly varying levels of complexity, all the way up to manually modeling each separate viseme.

The simplest approach I found that still gives satisfying results relies on a few basic “sliders”. Once we have these controls, we can morph between them to create all kinds of mouth shapes. In my case, that meant creating the “extremes” as shape keys in Blender.
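Morphing between shape keys is usually plain linear blending: each key stores the vertex positions for one extreme, and a slider value scales that key's offset from the rest pose. Here is a minimal sketch of that idea in plain Python. The function name `blend_shapes`, the two-vertex "mouth", and the key names are made up for illustration, and this is not Blender's actual API.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def blend_shapes(
    basis: List[Vec3],
    shape_keys: Dict[str, List[Vec3]],
    weights: Dict[str, float],
) -> List[Vec3]:
    """Return basis + sum(weight * (key - basis)) for every vertex."""
    result = []
    for i, (bx, by, bz) in enumerate(basis):
        x, y, z = bx, by, bz
        for name, weight in weights.items():
            kx, ky, kz = shape_keys[name][i]
            # Each key contributes its offset from the rest pose, scaled by its slider.
            x += weight * (kx - bx)
            y += weight * (ky - by)
            z += weight * (kz - bz)
        result.append((x, y, z))
    return result


# Hypothetical two-vertex "mouth": one upper-lip vertex, one lower-lip vertex.
basis = [(0.0, 0.0, 0.1), (0.0, 0.0, -0.1)]
shape_keys = {
    "open": [(0.0, 0.0, 0.5), (0.0, 0.0, -0.5)],
    "wide": [(0.25, 0.0, 0.1), (0.25, 0.0, -0.1)],
}

# Slider at 0.5 puts the lips halfway between rest and fully open.
half_open = blend_shapes(basis, shape_keys, {"open": 0.5})
```

Because every key is stored as an offset from the same rest pose, several sliders can be active at once and their offsets simply add up.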

In total, there are five controls:

If you have a character with a mouth at all, you have some kind of default rest pose. A basic rest pose could be something like this: lips ever-so-slightly open, but fairly neutral all around.

One of the most basic sliders is “open” vs. “closed”. The open pose provides room for big vowel sounds (often shaped by other controls), while the closed one gives us “M”, “B”, and “P”.

Even with just this one control, you can get a decently alive-seeming character (instead of the completely dead one you get with no mouth movement at all). This might be enough for cartoony uses like simple games, characters seen from very far away, or other stylized work.

To add another dimension, we can have a control that varies between wide and narrow:

When narrow, the lips naturally open up a little bit. To me, this looks close to an “ooh” sound.

For more expression, we can have the corners of the mouth move separately from the rest. This adds some emotional affect: corners pulled up read as “happy”, and corners pulled down read as “sad”.

Independent lip control, combined with corner control, gives us a lot of possibilities. Here, we see raising the upper lip and raising the lower lip. The lower-lip-raised pose looks like “F” or “V”, which are hard to convincingly create with only the previous controls.

The last pose needed to fill in the gaps is the tongue-behind-the-teeth of “L”. The “R” viseme could be created separately, but it’s close enough to a partial “L” with a wider/narrower mouth that I felt it was possible to get away with not doing a separate control.

Mixing these controls is pretty fun, and begins to open up a lot more possibilities. A mix of an open mouth with corners pulled up gives us “ah”.

A narrow/open pose (mixed with touches of a few others) gives us a narrow “oh”.

Mixing in corners up/down or making the mouth shape exaggeratedly large can often inject some emotion into otherwise-straightforward poses.
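One way to think about mixing is to treat a mouth pose as a small bundle of slider values and interpolate between bundles. The sketch below assumes each control is a signed slider in [-1, 1] with 0 as the rest pose. The field names, sign conventions, and the specific numbers for “ah” and “oh” are all illustrative guesses, not values taken from the actual rig.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class MouthPose:
    """Five signed sliders in [-1, 1]; all zeros is the rest pose.

    Field names, ranges, and sign conventions are assumptions for this sketch.
    """
    openness: float = 0.0  # +1 open, -1 closed
    width: float = 0.0     # +1 wide, -1 narrow
    corners: float = 0.0   # +1 corners up, -1 corners down
    lips: float = 0.0      # +1 upper lip raised, -1 lower lip raised
    tongue: float = 0.0    # +1 tongue behind the teeth


def clamp(value: float, low: float = -1.0, high: float = 1.0) -> float:
    return max(low, min(high, value))


def lerp_pose(a: MouthPose, b: MouthPose, t: float) -> MouthPose:
    """Interpolate every slider from pose a (t=0) to pose b (t=1)."""
    values = {}
    for f in fields(MouthPose):
        start, end = getattr(a, f.name), getattr(b, f.name)
        values[f.name] = clamp(start + t * (end - start))
    return MouthPose(**values)


def with_emotion(pose: MouthPose, corners_delta: float) -> MouthPose:
    """Nudge the mouth corners up (positive) or down (negative)."""
    return replace(pose, corners=clamp(pose.corners + corners_delta))


REST = MouthPose()
AH = MouthPose(openness=0.8, corners=0.3)  # open mouth, corners pulled up
OH = MouthPose(openness=0.5, width=-0.7)   # narrow and open

halfway_to_ah = lerp_pose(REST, AH, 0.5)
sad_oh = with_emotion(OH, -0.5)
```

Keeping emotion as a separate nudge on top of a pose means the same “oh” can be reused for a happy or a sad line without defining new presets.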

Often, several phonemes share a single viseme (“M”, “B”, and “P” all use the same closed-lips shape, for example). We can simplify most English phonemes into these ten viseme groups:

Note that this simplification doesn’t cover the alphabet (letters like “C” might map onto either “K” or “S” depending on context), or all possible vowel sounds (e.g. “uh”). However, it’s enough to provide pretty good coverage for automated lip syncing, without too much redundant animation work for one person to handle.
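For automation, this kind of grouping boils down to a lookup from phonemes (not letters) to viseme names. The sketch below is deliberately partial: it only includes groups named elsewhere in this post, it is not the full ten-group table, and the ARPAbet-style phoneme symbols (`UW`, `AA`, `OW`, with optional stress digits) are an assumption about what a text-to-phoneme step might produce. Anything not in the table falls back to the rest pose.

```python
from typing import Iterable, List

# Partial, illustrative table (NOT the full ten-group list).
# Keys are assumed ARPAbet-style phoneme symbols; values are made-up group names.
PHONEME_TO_VISEME = {
    "M": "MBP", "B": "MBP", "P": "MBP",  # closed lips
    "F": "FV", "V": "FV",                # lower lip raised
    "L": "L",                            # tongue behind the teeth
    "R": "R",
    "UW": "ooh",
    "AA": "ah",
    "OW": "oh",
}
FALLBACK_VISEME = "rest"  # hypothetical default for uncovered sounds


def visemes_for(phonemes: Iterable[str]) -> List[str]:
    """Map phonemes to visemes, collapsing consecutive repeats."""
    result: List[str] = []
    for phoneme in phonemes:
        key = phoneme.rstrip("012")  # drop a trailing stress digit, e.g. "AA1" -> "AA"
        viseme = PHONEME_TO_VISEME.get(key, FALLBACK_VISEME)
        if not result or result[-1] != viseme:
            result.append(viseme)
    return result


sequence = visemes_for(["M", "AA1", "P"])
```

Collapsing repeats avoids keyframing the same mouth shape twice in a row when neighboring phonemes land in the same group.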

Given the five shape controls, I was able to approximate the ten visemes above, moving only vertices in the mouth area. Mixing in a little more facial animation (especially around the eyes) seems to go a long way toward bringing a character to life.