What is Speech Synthesis?

Amruta Bhaskar
Dec 9, 2019

Computers do their jobs in three distinct stages called input (where you feed information in, often with a keyboard or mouse), processing (where the computer responds to your input, say, by adding up some numbers you typed in or enhancing the colours on a photo you scanned), and output (where you get to see how the computer has processed your input, typically on a screen or printed out on paper). Speech synthesis is simply a form of output where a computer or other machine reads words to you out loud in a real or simulated voice played through a loudspeaker; the technology is often called text-to-speech (TTS).

Talking machines are nothing new (somewhat surprisingly, they date back to the 18th century), but computers that routinely speak to their operators are still extremely uncommon. True, we drive our cars with the help of computerized navigators, engage with computerized switchboards when we phone utility companies, and listen to computerized apologies at railroad stations when our trains are running late. But hardly any of us talk to our computers (with voice recognition) or sit around waiting for them to reply. Professor Stephen Hawking was a truly unique individual, in more ways than one: can you think of any other person famous for talking with a computerized voice? All that may change in the future as computer-generated speech becomes less robotic and more human.

How does speech synthesis work?

Let's say you have a paragraph of written text that you want your computer to speak aloud. How does it turn the written words into ones you can actually hear? There are essentially three stages involved, which I'll refer to as text to words, words to phonemes, and phonemes to sound.

1. Text to words

Reading words sounds easy, but if you've ever listened to a young child reading a book that was just too hard for them, you'll know it's not as trivial as it seems. The main problem is that written text is ambiguous: the same written information can often mean more than one thing, and usually you have to understand the meaning or make an educated guess to read it correctly. So the initial stage in speech synthesis, which is generally called preprocessing or normalization, is all about reducing ambiguity: it's about narrowing down the many different ways you could read a piece of text into the one that's the most appropriate.

Preprocessing involves going through the text and cleaning it up so the computer makes fewer mistakes when it actually reads the words aloud. Things like numbers, dates, times, abbreviations, acronyms, and special characters (currency symbols and so on) need to be turned into words, and that's harder than it sounds. The number 1843 might refer to a quantity of items ("one thousand eight hundred and forty-three"), a year or a time ("eighteen forty-three"), or a padlock combination ("one eight four three"), each of which is read out differently. While humans follow the sense of what's written and figure out the pronunciation that way, computers generally don't have the power to do that, so they have to use statistical probability techniques (typically hidden Markov models) or neural networks (computer programs structured like arrays of brain cells that learn to recognize patterns) to arrive at the most likely pronunciation instead. So if the word "year" occurs in the same sentence as "1843," it might be reasonable to guess this is a date and pronounce it "eighteen forty-three." If there were a decimal point before the digits (".843"), they would need to be read one digit at a time, as "point eight four three."
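To make the 1843 example concrete, here is a minimal sketch of context-based number normalization in Python. It is only an illustration: it replaces the statistical models or neural networks a real synthesizer would use with a few hand-written keyword rules, and the cue-word lists and all function names (normalize, as_year, and so on) are assumptions made up for this example. It handles numbers of up to four digits in words and falls back to reading longer ones digit by digit.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical cue words; a real system would learn such hints from data.
YEAR_CUES = {"year", "born", "since"}
DIGIT_CUES = {"combination", "code", "pin"}


def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def as_quantity(n):
    """Spell out 0..9999 as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        words = two_digits(rest)
        parts.append("and " + words if parts else words)
    return " ".join(parts)


def as_year(n):
    """Read a four-digit number as two pairs (a toy rule, not a full one)."""
    hi, lo = divmod(n, 100)
    if lo == 0:
        return two_digits(hi) + " hundred"
    if lo < 10:
        return two_digits(hi) + " oh " + ONES[lo]
    return two_digits(hi) + " " + two_digits(lo)


def as_digits(s):
    """Read a digit string one digit at a time."""
    return " ".join(ONES[int(c)] for c in s)


def number_words(s):
    """Quantity reading for short numbers, digit-by-digit for long ones."""
    return as_quantity(int(s)) if len(s) <= 4 else as_digits(s)


def normalize(sentence):
    """Replace each number in the sentence with a guessed spoken form."""
    cues = set(re.findall(r"[a-z]+", sentence.lower()))

    def spell(match):
        whole, frac, integer = match.groups()
        if frac is not None:  # decimal such as ".843" or "3.14"
            lead = number_words(whole) + " " if whole else ""
            return lead + "point " + as_digits(frac)
        if cues & DIGIT_CUES:
            return as_digits(integer)
        if len(integer) == 4 and cues & YEAR_CUES:
            return as_year(int(integer))
        return number_words(integer)

    return re.sub(r"(\d*)\.(\d+)|(\d+)", spell, sentence)


print(normalize("In the year 1843 the bridge opened."))
print(normalize("The warehouse holds 1843 crates."))
print(normalize("The padlock combination is 1843."))
```

The three calls give the same digits three different readings, depending only on which other words appear in the sentence. Keyword rules like these break down quickly on real text, which is why practical systems rely on probabilities learned from large amounts of data.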

A speech synthesizer needs some understanding of what it's reading.

Preprocessing also has to tackle homographs, words pronounced in different ways according to what they mean. The word "read" can be pronounced either "red" or "reed," so a sentence such as "I read the book" is immediately problematic for a speech synthesizer. But if it can figure out that the preceding text is entirely in the past tense, by recognizing past-tense verbs ("I got up... I took a shower... I had breakfast... I read a book..."), it can make a reasonable guess that "I read [red] a book" is probably correct. Likewise, if the preceding text is "I get up... I take a shower... I have breakfast..." the smart money should be on "I read [reed] a book."
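The tense-guessing idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular synthesizer works: the two verb lists are tiny hand-picked samples, the function name pronounce_read is hypothetical, and a tie (or no evidence at all) arbitrarily falls back to the present-tense reading.

```python
import re

# Tiny hand-picked verb lists, assumed for illustration only.
PAST_VERBS = {"got", "took", "had", "went", "was", "were", "did", "ate"}
PRESENT_VERBS = {"get", "take", "have", "go", "am", "is", "are", "do", "eat"}


def pronounce_read(preceding_text):
    """Guess how to say "read" from the tense of the verbs before it."""
    words = re.findall(r"[a-z']+", preceding_text.lower())
    past = sum(word in PAST_VERBS for word in words)
    present = sum(word in PRESENT_VERBS for word in words)
    # More past-tense evidence suggests "red"; otherwise assume "reed".
    return "red" if past > present else "reed"


print(pronounce_read("I got up. I took a shower. I had breakfast."))
print(pronounce_read("I get up. I take a shower. I have breakfast."))
```

Counting verbs is a crude stand-in for real grammatical analysis, but it shows the principle: the synthesizer picks the pronunciation that best fits the evidence in the surrounding text.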

Author: Ravinder Joshi

SkillRary