As some know, the work on Siberia Complex has for a while been focused on voices and lipsyncs. We have been working on a simple voice/speech generator for minor characters, part of an ongoing project (AMPS, for those who know it).
We have now hit a bit of a snag… we have grown immune to low voice quality! Most speech engines on the market are poor quality, and after listening to dozens of them, including our own, we have started to understand even the worst voice samples.
That means we also understand our own speech engine, which makes it impossible for us to determine whether it is getting better, or if we are just getting numb to the distortion in the voice. Therefore, I need your help!
This is a sample of the current quality. It is not good quality; it sounds like a late-80s computer voice, to be honest. What I need to know is whether it can at least be understood, even if it is jarring on the ears. The locals here in Denmark do not speak English as a native language, and thus the responses are too varied to make a verdict by. I need input from both native English speakers and English-as-a-second-language speakers from other language cultures (German, Spanish, etc.).
Please do not respond on the quality; it is still horrible. The test is:
Can you understand what the artificial voice is saying at all?
Please write your best guess as to what is being said. It means a lot to the project!
Also, speech recognition by the brain relies a lot on visual and historical context, so a random clip on its own gets a low hit rate. But if a thug-looking guy comes up to me with a pistol on a dark street corner and sez “Gimmedatsheeyet, mon”, I’m gonna reach for my wallet.
To me it sounds like “Hey, are you hearing?” or “Hey, how’re you feeling?”.
I’m guessing it’s the second one;) I had to listen to it two or three times before I worked that out though, at first it sounded like “K. Howry Earring” or something;)
The main problem is it’s a bit too quiet and there’s a “kh” sound between each word, which makes it a bit hard to understand.
Thinking about it, I would like to make an amendment to the “quality comment” remark: While it is clear to everyone that this is still a very bad speech engine, detailed descriptions on exactly what makes it hard to hear are very welcome!
I am considering a small prize for the first X correct guesses.
I’m not trying to interpret it here. I’m just getting what it sounds like to me first shot, which–if you’re actually planning to use it as voices in a movie–is all most people will get per line.
EDIT: What makes it hard to hear is that it’s low and fuzzy, with weird distortions in the background and weird, clicking ‘sharp k’ sounds at the beginning. Besides the ‘k’ sounds, there’s no definition to the consonants that I can hear; it’s like a string of vowels.
I have decided that the first correct answer will receive one pick from the Blender shop, up to 35 Euros’ worth, which seems to be the price of the most expensive items available (if there is a link to a E200 bottle of wine, you cheated).
I am very happy with the comments on the flaws in the sounds; they help a lot.
@Fatfinger, could you elaborate on the triangle/sawtooth waves you feel are missing? The reference files (the sentence was also recorded straight from speech for reference) hold no particular triangle or sawtooth waves, but it might be a way to emulate what is missing…
Yeah, I was a bit vague. I think what I was trying to say is that sounds like D, B, and P are quite percussive and have a fairly sharp attack. I’m not too sure how speech synthesis works, but I’m imagining you have a couple or more oscillators corresponding to phonemes/letters, which are played in sequence corresponding to the text. Perhaps trying to alter the oscillator waveforms might do it. Or I could be a million miles off the mark. Hope that helps. Anyway, what do I win?
The ‘Hey’ sounds much more like just ‘A’. In old text-to-speech processors, phonetic typing usually reproduces speech better. Was this phrase based on a sentence, or on phonemes?
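For anyone who wants to play with the idea in the post above, here is a minimal Python sketch of an oscillator with a selectable waveform, shaped by a percussive envelope (fast attack, exponential decay). It is only an illustration of the "sharp attack" suggestion and not how the engine in this thread works. The sample rate, frequency, durations, and all function names are assumptions made up for the example.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate in Hz, chosen only for this sketch


def oscillator(shape, freq, n_samples, sample_rate=SAMPLE_RATE):
    """Return n_samples of a sine, sawtooth, or triangle wave in [-1, 1]."""
    out = []
    for i in range(n_samples):
        phase = (freq * i / sample_rate) % 1.0  # position within one cycle, 0..1
        if shape == "sine":
            out.append(math.sin(2.0 * math.pi * phase))
        elif shape == "sawtooth":
            out.append(2.0 * phase - 1.0)
        elif shape == "triangle":
            out.append(1.0 - 4.0 * abs(phase - 0.5))
        else:
            raise ValueError("unknown waveform: " + shape)
    return out


def percussive_envelope(n_samples, attack_samples, decay_rate):
    """Linear ramp up to 1.0 over attack_samples, then exponential decay."""
    env = []
    for i in range(n_samples):
        if i < attack_samples:
            env.append((i + 1) / attack_samples)
        else:
            env.append(math.exp(-decay_rate * (i - attack_samples)))
    return env


def plosive_burst(shape="sawtooth", freq=150.0, duration=0.05, attack=0.002):
    """Hypothetical 'percussive' burst: an oscillator times a sharp envelope."""
    n = round(duration * SAMPLE_RATE)
    attack_samples = max(1, round(attack * SAMPLE_RATE))
    wave = oscillator(shape, freq, n)
    env = percussive_envelope(n, attack_samples, decay_rate=0.02)
    return [w * e for w, e in zip(wave, env)]
```

Shortening `attack` makes the onset sharper, and swapping `shape` between "sine", "sawtooth", and "triangle" changes how bright the burst is, which is roughly the experiment suggested above.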
Have you played around with the Microsoft Agent?
I guess that means animatinator and MikeCuffe got it right (Animatinator’s “how’re you feeling” gets points for high proximity). Originally, I stated the first one to guess it would get a pressie, but since I at that point knew that Animatinator had a correct answer, and I was just poking to see if his bullseye was a fluke, I repent by awarding both with a pick at the Blender eshop (still at the max of 35 Euro, so if I overlooked that silver-grey Bugatti road racer, sorry, you can’t have that one :p).
Mike, Ani, let me know what item you want from the shop, and where it should be delivered!
As for the rest, and those who should see this thread after this post of mine, do not fret! I like the idea of awarding gifts for effort, and will be lining up a follow-up contest (in which I’ll stay true to first-guess-gets-the-prize!). As some progress has (I believe) been made, next contest will be harder, and will be a full dialog between two people! I hope the speech engine is up to the task by then…
The Sentence II (pun intended) should be held within a week or two. Stay tuned!
PS: I would like to know if people in the meantime would be willing to comment on the progress that I believe has been made on the speech engine, by evaluating smaller samples up until The Sentence II is held. No prizes, but completely external critique is very valuable to this work!
EDIT: Two examples of the current state of this sentence are here (“hey, how you feeling?”) and here (“hey how are you feeling?”). Progress is not tremendous, but the percentage of people actually guessing right on the first try here seems to be increasing…
Oh, and Roger gets honorable mention for "Hey, dam your spewing!" :D. But the Ghost of Elvis will punish you for revealing the hidden reverse message :RocknRoll:
It’s not quite the method we use, but yes, all observations help! We are currently testing what we call “sharp sounds”, like b, t, d, hard g, and the like (“soft sounds” are traditional vowels, “hard sounds” are consonants that can be ‘dragged’, like l or s). I just witnessed the first fairly comprehensible ‘b’ (it said “bar fly”, though the “fly” part was incomprehensible due to continued problems with f), and it was quite thrilling, in a seriously geeky way!
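To make the three categories from the post above concrete, here is a small Python sketch that sorts letters into "sharp", "soft", and "hard". It only covers the examples actually named in the post (b, t, d, hard g; the traditional vowels; l and s) and leaves everything else unclassified rather than guessing. Using letters as stand-ins for phonemes is an assumption of this sketch, as are the function names. It is not the engine's own data structure.

```python
# Categories as described in the post; only the named examples are filled in.
SHARP = set("btdg")   # "sharp sounds": b, t, d, hard g
SOFT = set("aeiou")   # "soft sounds": traditional vowels
HARD = set("ls")      # "hard sounds": consonants that can be dragged


def classify(letter):
    """Return the category name for one letter, or 'unclassified'."""
    ch = letter.lower()
    if ch in SHARP:
        return "sharp"
    if ch in SOFT:
        return "soft"
    if ch in HARD:
        return "hard"
    return "unclassified"  # not named in the post, so no guess is made


def classify_text(text):
    """Classify each alphabetic character of text, skipping everything else."""
    return [(ch, classify(ch)) for ch in text if ch.isalpha()]
```

For example, `classify_text("bar fly")` tags the b as sharp, the a as soft, and the l as hard, while r, f, and y stay unclassified because the post does not say where they belong.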
Anyone with any skilled or unskilled observations on voice reproduction is highly welcome to share.