That effect can be summarized in 2 words: Texture offset.
I suppose the heads were unwrapped using cylinder projection so the texture is rectangular with the face in the middle and plain yellow on both sides.
Imagine you stack several faces one above the other along the Y axis in a single image: one face with the mouth closed and (most probably) 3 faces for the vowels. Of course, you will set up your UV mapping so that it shows only one face, or it will look strange.
Now you have a big texture of which only 1/4 is visible. If you animate the Y offset to show one face after the other, you have a talking head. Just remember to switch between the faces in a single step, not to let the texture slide from one face to the next.
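To make the "switch, don't slide" idea concrete, here is a minimal sketch in plain Python. It assumes 4 faces of equal height stacked along Y and a UV map that covers face 0 at rest. The names `FACE_COUNT`, `face_offset` and `stepped_offsets` are mine, not from any Blender API, and the sign of the offset may be reversed depending on how your mapping is set up.

```python
FACE_COUNT = 4  # assumption: 1 closed mouth + 3 vowel faces, equal height


def face_offset(face_index, face_count=FACE_COUNT):
    """Return the Y offset (in UV units) that brings one face into view."""
    if not 0 <= face_index < face_count:
        raise ValueError("face_index must be between 0 and face_count - 1")
    return face_index / face_count


def stepped_offsets(faces_by_frame, face_count=FACE_COUNT):
    """Turn (frame, face_index) pairs into (frame, offset) keyframe values.

    Each value is meant to be held until the next keyframe, so the
    texture jumps from face to face instead of sliding.
    """
    return [(frame, face_offset(face, face_count))
            for frame, face in faces_by_frame]


if __name__ == "__main__":
    for frame, offset in stepped_offsets([(1, 0), (5, 2), (9, 0)]):
        print(frame, offset)
```

In Blender, the holding is obtained by setting the keyframes of the offset to constant interpolation; with the default interpolation the texture would scroll between the faces.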
Most probably, with several stereotyped characters like Lego men, the mouth area has a different material from the rest of the face, and each character has its own mouth material. So you can have one texture for all the mouths and animate each one separately… or else all the heads talk together.
If you want to get fancy, you analyze the speech frame by frame and match the different mouth shapes with the speech. Closed mouth for some of the consonants and the silences, open mouth for the rest.
(I ate my own dog food.) Everything I said before holds, except that I replaced “all the consonants” with “some of them”, i.e. the bilabial ones: m, b, and p (and more in languages other than English).
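The rule above fits in a few lines of Python. This is only a sketch of the closed/open decision for English: the names `BILABIALS` and `mouth_shape` are hypothetical, sounds are assumed to be written as single lowercase letters, and an empty string stands for a silence.

```python
BILABIALS = {"m", "b", "p"}  # English only; other languages need more


def mouth_shape(sound):
    """Return "closed" for silences and bilabial consonants, else "open"."""
    normalized = sound.strip().lower()
    if normalized == "" or normalized in BILABIALS:
        return "closed"
    return "open"


if __name__ == "__main__":
    for sound in ["", "m", "a", "s", "p"]:
        print(repr(sound), mouth_shape(sound))
```

Choosing between the different vowel faces would need one more lookup on top of this, which depends on the faces you actually drew.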
As for animating the texture offset, this is an activity best reserved for monomaniacs. If you are not insane yet, it will drive you there. IMHO, there is a bug somewhere. I used Cycles, and it shows the effect of the texture offset when you set it in the Node Editor (with the 3D view set to Material) and when you render… but nowhere else. It really doesn’t help.
The cherry on top is the “lip-syncing”: matching the mouth shapes with the speech. I learned a lot about human speech in the process… but it’s still overly tedious. You need audio software capable of showing the time index in frames or (at worst) in seconds and hundredths of a second, in which case you do the conversion into frames yourself. Blender can load some audio, but its job stops there.
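If your audio software only gives minutes, seconds and hundredths, the conversion into frames can be sketched like this. The 24 fps rate and the start at frame 1 are assumptions to replace with your own scene settings, and `timestamp_to_frame` is a name I made up for the example.

```python
def timestamp_to_frame(minutes, seconds, hundredths, fps=24, start_frame=1):
    """Convert an audio time index to the nearest animation frame.

    Assumes the audio starts playing on start_frame. Integer arithmetic
    is used so that no floating-point rounding creeps in.
    """
    total_hundredths = (minutes * 60 + seconds) * 100 + hundredths
    # Adding 50 before dividing by 100 rounds to the nearest frame.
    return start_frame + (total_hundredths * fps + 50) // 100


if __name__ == "__main__":
    # A sound heard at 0:02.50 in the audio editor.
    print(timestamp_to_frame(0, 2, 50))
```

At 24 fps one frame lasts a bit more than 4 hundredths of a second, so two sounds less than that apart can land on the same frame.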
Anyway… The result is there: http://youtu.be/B3zzufAAWHs
The blend file is here: TalkingCube.blend (1.11 MB)
Blender assured me that everything is packed in the file. My (crappy) texture is CC0. Just contact me if you want to re-use my voice. We’ll discuss contracts and royalties…