Amazon bought text-to-speech company IVONA systems on Wednesday, the online book-floggers have announced. The acquisition fuelled rumours that Amazon, the quietest member of The Gang of Four, is planning a rival to Apple's talking assistant Siri. Amazon announced the deal with Ivona Software yesterday on its website, but didn't …
This is a fun article, about how Siri grew out of a DARPA project, was intended to do much more than she does now, and after Apple bought her (snatching her away from Verizon's Android handsets) they curbed her abilities and potty mouth. She seems to have plenty of cousins, though.
I must admit, their text to speech is pretty damn good, and I've tried it out on my Android handset also.
Ivona the best I've heard.
Over a short span, one of the U.S. female voices is quite indistinguishable from real (although one of the male voices does not quite convince) in that if I did not know beforehand, I'd assume it was real.
However, natural reading of various TYPES of texts (news, non-fiction, novels) requires adjustment. The voice for a Jane Eyre or Mickey Spillane reading would really have to change from that for a science book. Even a dull Librivox speaker will put some life into the characters. You just can't do that with TTS, probably never will. One has to understand each person in the story, and the inflection and modulation also changes from sentence to sentence. Affect is something inferred from the text, it cannot be calculated by an algorithm.
Re: Ivona the best I've heard.
Even a dull Librivox speaker will put some life into the characters. You just can't do that with TTS, probably never will.
"Never will" is rather too strong. While we're still very far away from a decent understanding of affect, much less a formal model of it that would let us implement it algorithmically, a lot could be done in this area with predictive models. NLP researchers have made significant progress in recent years in systems that can model rudiments of narrative; going from there to building decent predictive models of the purported emotional states of characters is not a great leap. They wouldn't be nuanced, and no one would claim this represents "understanding" emotion, but if the model predicts a valid emotional overtone for the subject text most of the time, that's good enough for TTS purposes.
Then it's a matter of varying the TTS prosodic parameters to convey that emotional overtone. That's tricky in itself, because prosody is not terribly well understood; for example, linguists who study prosody still can't agree on what in English pronunciation actually conveys "stress". But again you can build a pretty effective predictive model without fully understanding the domain being modeled.
Affect is something inferred from the text, it cannot be calculated by an algorithm.
That's a fallacious argument, unless you can prove that human readers don't "calculate" affect "by an algorithm".
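To make the two-step idea in this reply concrete (predict an emotional overtone, then vary the prosodic parameters to convey it), here is a minimal Python sketch. Everything in it is an assumption for illustration: the keyword lookup in `predict_emotion` is a toy stand-in for a real predictive model, the `CUES` and `PROSODY_BY_EMOTION` tables are made up, and the output uses the standard SSML `<prosody>` element rather than anything known about IVONA's own engine.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from a predicted overtone to SSML prosody settings.
# The attribute values are standard SSML labels; the pairing with emotions
# is an illustrative guess, not a validated model.
PROSODY_BY_EMOTION = {
    "neutral": {"rate": "medium", "pitch": "medium", "volume": "medium"},
    "sad": {"rate": "slow", "pitch": "low", "volume": "soft"},
    "angry": {"rate": "fast", "pitch": "low", "volume": "x-loud"},
    "excited": {"rate": "fast", "pitch": "high", "volume": "loud"},
}

# Toy cue lists standing in for a trained predictive model (assumption).
CUES = {
    "sad": ("wept", "alone", "grief"),
    "angry": ("shouted", "furious", "damn"),
    "excited": ("!",),
}


def predict_emotion(sentence):
    """Return the first emotion whose cue appears in the sentence, else neutral."""
    lowered = sentence.lower()
    for emotion, cues in CUES.items():
        if any(cue in lowered for cue in cues):
            return emotion
    return "neutral"


def to_ssml(sentences):
    """Wrap each sentence in an SSML <prosody> element chosen by its predicted emotion."""
    parts = []
    for sentence in sentences:
        settings = PROSODY_BY_EMOTION[predict_emotion(sentence)]
        parts.append(
            '<prosody rate="{rate}" pitch="{pitch}" volume="{volume}">{text}</prosody>'.format(
                text=escape(sentence), **settings
            )
        )
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(to_ssml(["The cell divides by mitosis.", "She wept alone in the red room."]))
```

The sketch only shows the plumbing. The hard part the thread is arguing about is the quality of the prediction step, and swapping the keyword lookup for a better model would leave `to_ssml` unchanged.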
The real question is...
Does she lick the microphone?
Re: The real question is...
With what? A motherboard?
Re: The real question is...
In this context it is spelled motherbeard.
"a 12-year-old startup" ?
They must have fixed it.
Christ on a bike..
"American English, Ivy" is the stuff of nightmares, beware..
Go on, who else went to the website and got the voices to swear?
And German sounds so natural...
But it could be due to the fact that they are all androids from the dark side of the moon.
I can't believe you didn't do any IVONA puns :(
Re: IVONA Pun!
'Ivona bevy of babes' not good enough?
Don't you dare ruin my Friday!
Re: IVONA Pun!
I thought it was 'bevy of blabber babes'. That makes my day.
I tried a couple of voices on: "I can say fairly complex, as opposed to complicated, sentences without stumbling, or indeed any other kind of awkwardness." and I'm impressed with how natural that comes out.
Now the Dutch voices on "Ik kan nogal complexe, maar niet gecompliceerde, zinnen zeggen zonder te struikelen, of wat voor foutjes dan ook." (not quite a translation, but the same structure) are clearly robots as soon as you hit the first inter-word pause. Also they make pronunciation errors (wrong vowel in "te"). But, again, the overall inflexion of the sentence sounds right.
These guys are onto something.
I did a load of work on this in 1994 and, clearly, they have some better inflection and frequency models than we did then. Also the pacing for English (and probably Polish) is much better than it was then. However, it's clear that the phoneme splitting and reconstruction is not always being done correctly, which probably reflects on the language skills of the people doing this tedious and exacting work. The corpus of sentences being split may also vary quite a bit in size for each language. That will make quite a difference when doing contextual reconstruction.
I checked their website, and unless I missed something, they've made a strange oversight: the languages are all European. Amazon will miss a huge market in India, Japan, Korea, and of course, China.
The real problem is not realistic phono-syntactic synthesis, but the pragmatics of style. It will practically take full-blown AI, or tagged text, to properly render the ever-changing affect implied by characters' content in a novel.
Amazon books are heavily dependent on fiction readers. After several chapters, only the sight-impaired are going to stick with the voices. People are just going to continue sight-reading.
I checked their website, and unless I missed something, they've made a strange oversight: the languages are all European.
Call me crazy, but I suspect this "strange oversight" is what we sometimes call "developing what you have the expertise and resources to develop".
Amazon will miss a huge market in India, Japan, Korea, and of course, China.
Perhaps - one can only hope - Amazon will be able to supply additional resources and expertise?
"It even has two Welsh-speaking voices - available in either gender."
Either sex, you mean.
There are more Welsh speakers in the US than there are in Wales, plus those living in Patagonia... so what accent does it have?
It was very impressive, but still not a patch on Rutger Hauer...
"I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched c-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain. Time to die."
Also, she says Tannhäuser like a Canadian: Tannhoooooooser.
Standard Canadian would be Tann-how-ser, with slight lip-rounding on the ending diphthong. Of course extreme East-coast Canadian might sound a bit like 'hooo', or Oirish.
Hello and again welcome to the Aperture Science Computer-Aided Enrichment Centre. We hope your brief detention in the relaxation vault has been a pleasant one. Your specimen has been processed and we are now ready to begin the test proper. Before we start, however, keep in mind that although fun and learning are the primary goals of the enrichment centre activities, serious injuries may occur. For your own safety, and the safety of others, please refrain from GFRXTHGGGHHHNAAAK
I wonder if all the big tech companies are going to buy TTS companies now? Maybe iSpeech?