Automated lip-sync with mouth & face expressions

This was probably discussed before, but I’m still unable to find an answer. I already know how to do simple automated lip-syncing by applying the intensity of an audio file to bones. I’m hoping that in the future I can make animations with realistic lip-sync instead, with correct mouth shapes and face expressions for each sound spoken (e.g. a round, closed mouth for “O”, tongue out for “L”, etc.). I’ve already seen Blender tutorials on how to do this manually, but that doesn’t help when you have a long voice file and no time or patience to animate every single sound at each frame. So I’m hoping there might be a way to map mouth and face expressions automatically.

I know that such an addon would be hard to make, though possible. The modeler would have to create each expression as an armature / shape action, so that there’s an animation for each type of sound. An addon could then analyze the sound file, determine what type of sound is located at each frame, and blend the animations accordingly. In other words, it would bind armature actions to pitch ranges, or match segments of the audio file to letter patterns. And the louder that part of the sound, the more the mouth is opened.
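To make the "louder means more open" part concrete, here is a minimal sketch in plain Python, outside Blender. It assumes the audio is already decoded into a list of float samples; `frame_loudness` and the synthetic test signal are made up for illustration and are not part of any existing addon.

```python
import math

def frame_loudness(samples, sample_rate, fps):
    """Return one RMS loudness value per animation frame, scaled to 0..1."""
    per_frame = max(1, int(sample_rate / fps))  # audio samples per animation frame
    levels = []
    for start in range(0, len(samples), per_frame):
        chunk = samples[start:start + per_frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        levels.append(rms)
    peak = max(levels) if levels else 0.0
    if peak == 0.0:
        return [0.0 for _ in levels]
    # Normalize so the loudest frame opens the mouth fully.
    return [level / peak for level in levels]

# Synthetic signal (assumption): half a second of silence, then a 220 Hz tone.
rate, fps = 8000, 25
silence = [0.0] * (rate // 2)
tone = [0.8 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate // 2)]
mouth_open = frame_loudness(silence + tone, rate, fps)
```

Each value in `mouth_open` could then be keyed onto a jaw bone or a shape key, one per frame.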

Alternatively, this could be done using text. You would write the spoken sentence and specify a length in frames where it takes place. The addon would blend mouth expressions based on the letters at each frame. This would be less accurate than interpreting the voice file, and also wouldn’t have a loudness which influences how much the mouth is open, but should be easier.
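A rough sketch of the text-based idea, in plain Python. The letter table and the shape names ("open", "wide", "round" and so on) are placeholders I made up; a real setup would use the names of its own shape keys or actions, and spreading letters evenly over the frames is a simplifying assumption.

```python
# Hypothetical letter-to-shape table; the names stand in for your own shape keys.
LETTER_TO_SHAPE = {
    "a": "open", "e": "wide", "i": "wide", "o": "round", "u": "round",
    "l": "tongue", "m": "closed", "b": "closed", "p": "closed",
}

def text_to_shapes(text, start_frame, length):
    """Spread the letters of `text` evenly over `length` frames.

    Returns a list of (frame, shape_name) pairs. Letters without a
    table entry fall back to "rest", and spaces become "closed".
    """
    letters = [c for c in text.lower() if c.isalpha() or c == " "]
    if not letters or length <= 0:
        return []
    step = length / len(letters)
    keys = []
    for index, letter in enumerate(letters):
        frame = start_frame + int(index * step)
        shape = "closed" if letter == " " else LETTER_TO_SHAPE.get(letter, "rest")
        # Skip repeats so a held shape is keyed only once.
        if not keys or keys[-1][1] != shape:
            keys.append((frame, shape))
    return keys

keys = text_to_shapes("Hello", start_frame=1, length=50)
# keys == [(1, "rest"), (11, "wide"), (21, "tongue"), (41, "round")]
```

As expected there is no loudness information here, so every shape would be keyed at the same strength.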

Is there any such addon at all, or maybe something that is / will be built into Blender? It doesn’t need to be accurate and perfect, but I’d like to see anything that works in the slightest.

Hmm… I found a page that might be relevant. It seems to be pretty old so I assume it might have gotten done by now, but I’m not sure if it’s exactly what I’m looking for.

I did find a plugin for something called Papagayo, which is a lip-sync creation program. However, I would prefer doing this entirely in Blender rather than relying on a bunch of other programs, so I’m still hoping there might be a dependency-free addon out there.

Spoke about this on IRC and got a bit closer to a solution. It was suggested that I use “bake sound to f-curve” for this purpose. That’s something I already know and have done before, but only for simple lip-syncing (opening the mouth based on sound intensity at each frame).

What I was told is that it should be possible to bake sounds to f-curve based on a pitch range. If so, perhaps I can do multiple bakes from one audio file and map different frequencies to different mouth expressions. Is this possible, and does “bake sound to f-curve” have parameters that can filter the A / E / I / O / U / etc. sounds?

Ok, I played around with this today. I found two solutions (both the same at base) which aren’t 100% accurate but are realistic enough. I shall make a video tutorial at some point to help others with this dilemma. Anyway, here’s how to do it:

1 - Have a face mesh and a speech file ready.

2 - Create a few speech expressions using Shape Keys. They are to be blended at any value, so make sure they look good together.

3 - At frame 0 insert a keyframe for each shape key (at value 0). Go into the graph and "bake sound to f-curve" for each of those shape keys.

4, Way 1 (not preferred) - On each of the curves, add Limits and Envelope modifiers. The Limits modifier must specify a volume range (different per key) and the Envelope modifier must scale and position the resulting curve.

4, Way 2 (preferred) - In the properties panel of "bake sound to f-curve", choose a different frequency range for each key. Play around with the settings and see which expression matches which pitch best.
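For Way 1, the arithmetic behind "limit to a volume range, then scale and position" can be sketched in plain Python as below. This only illustrates the idea and is not how Blender's Limits or Envelope modifiers are implemented; the window values and the key names `mouth_vertical` and `mouth_horizontal` are made-up assumptions.

```python
def remap_volume(value, low, high):
    """Clamp `value` to [low, high], then rescale that window to 0..1."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(value, low), high)
    return (clamped - low) / (high - low)

# Hypothetical volume windows, one per shape key.
WINDOWS = {"mouth_vertical": (0.0, 0.5), "mouth_horizontal": (0.5, 1.0)}

def shape_values(volume):
    """Return the value each shape key would get for one baked volume sample."""
    return {name: remap_volume(volume, lo, hi) for name, (lo, hi) in WINDOWS.items()}

quiet = shape_values(0.25)  # {"mouth_vertical": 0.5, "mouth_horizontal": 0.0}
loud = shape_values(0.75)   # {"mouth_vertical": 1.0, "mouth_horizontal": 0.5}
```

This also shows why Way 1 is limited: every key reacts to the same loudness curve, only over a different slice of it.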

What I found works for me is having a shape key that stretches the mouth horizontally (also puffing the cheeks up), one that stretches the mouth vertically (also raising eyebrows), and one that controls tongue. The one that controls vertical mouth opening should be tied to the lower-half frequency range of the audio file, the one that controls horizontal mouth opening should be tied to the upper-half frequency range, and tongue seems ok at any range you wish.
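To illustrate the lower-half / upper-half split outside Blender, here is a small plain-Python sketch that measures, per animation frame, how much of the sound sits below and above a split frequency. It uses a naive DFT for clarity, which is slow and is not what Blender's bake does internally; `band_energy`, `split_bands`, the 2000 Hz split and the test tones are all assumptions for the example.

```python
import cmath
import math

def band_energy(chunk, sample_rate, low_hz, high_hz):
    """Sum the DFT magnitudes of `chunk` for bins within [low_hz, high_hz)."""
    n = len(chunk)
    total = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n  # center frequency of bin k
        if low_hz <= freq < high_hz:
            coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                        for i, s in enumerate(chunk))
            total += abs(coeff)
    return total

def split_bands(samples, sample_rate, fps, split_hz):
    """Per frame, return (low_share, high_share) of energy around split_hz."""
    per_frame = int(sample_rate / fps)
    nyquist = sample_rate / 2
    result = []
    for start in range(0, len(samples) - per_frame + 1, per_frame):
        chunk = samples[start:start + per_frame]
        low = band_energy(chunk, sample_rate, 0.0, split_hz)
        high = band_energy(chunk, sample_rate, split_hz, nyquist)
        total = low + high
        result.append((low / total, high / total) if total else (0.0, 0.0))
    return result

# Synthetic signal (assumption): two frames of a 200 Hz tone, then two of 3000 Hz.
rate, fps = 8000, 25
n = 2 * rate // fps
low_tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(n)]
high_tone = [math.sin(2 * math.pi * 3000 * i / rate) for i in range(n)]
shares = split_bands(low_tone + high_tone, rate, fps, split_hz=2000)
```

In this picture the first number of each pair would drive the vertical mouth key and the second the horizontal one.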

Here are the two example blends which show this in action. The first one bakes the sound to f-curve with default settings, but the keys act upon different volume ranges (bad). The second one bakes the sound to f-curve using different frequency ranges (good, and it animates the face much more fully). Blend 1, Blend 2.

Let me know what you think, and whether I did this right or you know how to improve it. I’d also like to see someone else use either of the two methods shown in the example blends, with their own meshes and voices, and post the result to see exactly how it turns out (Suzanne isn’t that good for this).

First, let me just say I didn’t look at your files, I’m not here to offer advice on your work, sorry… :frowning:

When I do lip-sync, I first create a pose library of phonemes. Then I scrub through the timeline, inserting the phoneme poses as needed. Then I do another pass through the animation, tweaking the poses as needed to reflect the tone/mood of the voice. If the character is shouting, I open the mouth more; if whispering, I close the mouth, etc… Once the pose library is created, it takes me about 2 hours to do 10 seconds of dialogue: an hour for inserting the phonemes, an hour of tweaking.

Anyhow, why I’m really posting is to point you at this:
Link to the wiki page & download of the script in the first post. Last post in that thread is by the author on June 30th, 2013, so the author is still active with developing it.

I’m surprised that, in all your searching/questioning/probing, no one directed you to that addon. It’s not included with Blender, but since they host the development of it, they deem it a worthwhile addon.


@revolt_randy: The blends I posted aren’t really what can be called a work. I just added a Suzanne head, generated a sentence in espeak, and made a simple test case to demonstrate how I baked the sound. But no need to look at them, the method is easy to replicate.

The way you described is doing it manually, I think. That’s easy to understand, but a lot more work, and sadly I wouldn’t have the patience to do it for long sentences when there’s already a lot to animate. Can’t even imagine what it’s like to figure out the mouth expression at each frame based on the sounds there… probably requires a ton of patience, but it’s awesome if anyone has that :slight_smile:

I did find an import script for Papagayo, but I don’t have that program or wish to learn it now. The one you linked seems to be an import script as well… not sure again what other programs I’d need. I prefer something that does it entirely in Blender. The method I found is like that, even if the expression isn’t very accurate.