Speech = music?

Greetings to you!
I am planning on adding voice to the chat in my game! The idea was to have a sound engine receive commands to execute speech. I found the following potential problems:

  1. Platform compatibility;
  2. Dependencies: I don’t want people to have to install extra software to run the game;
  3. Speech engine limitations: the number of voices is small and they can’t be fiddled with much.

The idea I’m contemplating now is to use a setup similar to the one used for music.
The words would work like musical notes. They would be written phonetically, so that a library of sounds could cover different languages.
There are fundamental differences between music and voice, but since we can sing, talking could work more like rapping. Music is continuous in most cases, while voice is fragmented. The sound of certain words depends on what comes before or after them. Unless everyone is forced to write phonetically, it will be hard to make the words sound correct.
To structure the words like in music, maybe a set of codes can be used and a few limitations imposed:

  1. Numbers should be used in front of a word to determine its intensity or speed, so the system would not read numbers aloud unless they have a 0 in front.
  2. Other symbols could be used to show emotions: a ! in front would make the following word(s) come out in a surprised tone, a # would make them come out in an angered tone, + and - could raise or lower the pitch, etc.
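To make the two rules above a bit more concrete, here is a rough Python sketch of how such a line of chat could be split into words with attributes. Everything specific in it is my own assumption rather than a worked-out design: the token layout (number, then symbols, then the word, with no spaces in between), the names `Word`, `parse_token` and `parse_line`, treating the number as a single "level", and applying the symbols to one word only instead of to all the following words.

```python
import re
from dataclasses import dataclass

# Assumed token layout: [level][marks]word, e.g. "2!hello" or "+world".
TOKEN = re.compile(r"(?P<level>[1-9]\d*)?(?P<marks>[!#+\-]*)(?P<text>\D.*)")
# A number is only spoken when it has a 0 in front, e.g. "042" is read as "42".
SPOKEN_NUMBER = re.compile(r"0(?P<digits>\d+)")


@dataclass
class Word:
    text: str               # what should be spoken
    level: int = 1          # intensity or speed (assumed default of 1)
    tone: str = "neutral"   # "surprised" for !, "angry" for #
    pitch: int = 0          # one step up per +, one step down per -


def parse_token(token: str) -> Word:
    """Turn one whitespace-separated token into a Word."""
    number = SPOKEN_NUMBER.fullmatch(token)
    if number:
        # Drop the leading 0 and speak the remaining digits.
        return Word(text=number.group("digits"))

    match = TOKEN.fullmatch(token)
    if not match:
        # Anything else (for example a bare "5") is passed through unchanged.
        return Word(text=token)

    marks = match.group("marks")
    tone = "neutral"
    for mark in marks:
        if mark == "!":
            tone = "surprised"
        elif mark == "#":
            tone = "angry"

    return Word(
        text=match.group("text"),
        level=int(match.group("level") or 1),
        tone=tone,
        pitch=marks.count("+") - marks.count("-"),
    )


def parse_line(line: str) -> list[Word]:
    """Parse a whole chat line, one Word per token."""
    return [parse_token(token) for token in line.split()]


if __name__ == "__main__":
    for word in parse_line("2!hello +world 042 #stop"):
        print(word)
```

The sound engine would then receive these `Word` objects instead of raw text. What to do with a bare number that has no 0 in front is left open here; the sketch simply passes it through.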

To have different voices, one could configure the base voices or somehow add filters to them.

Now, I’m no expert, and I don’t know much about music or sound in general, so this idea is most likely not new or original. Keeping that in mind, please tell me what you think the possibilities are.

Speech is very complicated, much more so than English makes it out to be. There are millions of sounds the human mouth and voice box can make. There are only 500 or so in English, but bear in mind that A doesn’t sound like “A” all the time. I’ve attached a chart that shows this.
So let’s assume you have a complete sound-base (i.e. sounds for all of the possibilities). You then have to take a word and figure out, from the combination of letters, what it should sound like.
This is evident when you take something like “espeak” and get it to say “disciples”: it comes out as diss-a-ples, with the “i” turning into an “a” sound due to it coming after a “c”. No matter how good your coding is, there are exceptions to every rule.
Pitch changes to indicate questions and such, too.

So let’s say you have all that sorted out: you still have to find a way to refer to what sound you want, its pitch and its duration.
You could use a string like:
The sound, raised .4 of a semitone, for 20 milliseconds.
Maybe you could go even further, and have the first three digits represent the sound, the next two the pitch in tenths of a semitone, and another three the length.
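As a rough illustration of that fixed-width idea, here is a small Python sketch that decodes such a code. The 3 + 2 + 3 digit layout is the one described above; the names `SoundEvent`, `decode_sound_code` and `decode_stream`, the sound number 123, and reading the length as milliseconds are assumptions made only for the example.

```python
from typing import NamedTuple

CODE_LENGTH = 8  # 3 digits for the sound + 2 for the pitch + 3 for the length


class SoundEvent(NamedTuple):
    sound_id: int           # index into some sound-base (hypothetical)
    pitch_semitones: float  # how far the sound is raised, in semitones
    duration_ms: int        # length, assumed here to be in milliseconds


def decode_sound_code(code: str) -> SoundEvent:
    """Decode one 8-digit code such as "12304020"."""
    if len(code) != CODE_LENGTH or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected {CODE_LENGTH} digits, got {code!r}")
    return SoundEvent(
        sound_id=int(code[0:3]),
        pitch_semitones=int(code[3:5]) / 10,  # stored in tenths of a semitone
        duration_ms=int(code[5:8]),
    )


def decode_stream(stream: str) -> list[SoundEvent]:
    """Decode several codes written back to back with no separator."""
    if len(stream) % CODE_LENGTH != 0:
        raise ValueError("stream length is not a multiple of the code length")
    return [
        decode_sound_code(stream[i:i + CODE_LENGTH])
        for i in range(0, len(stream), CODE_LENGTH)
    ]


if __name__ == "__main__":
    # Sound 123, raised 0.4 of a semitone, for 20 milliseconds.
    print(decode_sound_code("12304020"))
```

Note that with only two unsigned pitch digits this layout can raise a sound by at most 9.9 semitones and cannot lower it, and three length digits cap the duration at 999, so a real format would probably need a sign or wider fields.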

I think you may want to find an existing library for speech.
There is one called pyspeech that handles both speech input and output, but it is for Python 2.6 and Windows only.
It will be a challenge, but you could…

Thank you for your input sdfgeoff!
You are right! You see, in music there are such variations too, which go around the notes. As for the sounds, I mentioned the use of phonetics, which is more of a universal sound library.
pyspeech is unfortunately limited to Windows and old Python. I couldn’t find a cross-platform speech library. In any case, if it turns out to be too much of a hassle, especially since the game itself is just a few bare bones, I’ll just drop it!