"3,000 agents"
3,000 agents probably means "on average, at any given time". Suppose a typical shift pattern (and I do realise they're unlikely to be "typical") of 8 hours gives three shifts a day and 9,000 agents total.
Weasel words, no doubt!
In a statement issued on Sunday, SpinVox admits it needs call centres staffed by human agents to transcribe voice messages and has begun to back away from its earlier claims that most of the translation is performed by AI-based machine translation software, without human intervention. But thanks to company insiders and company …
I have no more information than anyone else so can't judge specifically what SpinVox is doing, but you seem skeptical about the whole concept of machine recognition and don't think SpinVox could possibly be telling the truth. Why? Do you think that PhoneTag (formerly SimulScribe) is lying as well? What about Nuance's extensive products for the OEM market? It seems that if SpinVox were using humans world-wide they could save a bundle by just licensing software from Nuance...
3000 agents typing 1 message every 3 minutes is 60000 messages an hour. Do they really get that many?
Let's make it 10 minutes per message for an 8 hour shift.
6 messages per hour per worker
48 Messages per working shift
There are three 8 hour shifts in a 24 hour day
So 1000 agents working per shift.
48000 messages per shift
144000 messages per day
Really??
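The back-of-envelope sums above (and the 3-minutes-per-message version earlier in the thread) can be checked with a few lines of Python. This is only a rough sketch: it assumes every agent transcribes non-stop with no breaks, and that 3,000 is the total headcount split evenly across three 8 hour shifts, as in the comment. The function names are made up for illustration.

```python
def messages_per_hour(agents_on_shift, minutes_per_message):
    """Messages transcribed per hour by everyone currently on shift."""
    return agents_on_shift * 60 / minutes_per_message


def messages_per_day(total_agents, minutes_per_message, shift_hours=8):
    """Daily total, assuming agents are split evenly across back-to-back shifts."""
    shifts_per_day = 24 // shift_hours                   # 3 shifts of 8 hours
    agents_per_shift = total_agents // shifts_per_day    # 1,000 agents per shift
    per_agent_per_shift = shift_hours * 60 // minutes_per_message  # 48 at 10 min each
    return agents_per_shift * per_agent_per_shift * shifts_per_day


# All 3,000 agents on at once, 3 minutes per message
print(messages_per_hour(3000, 3))

# 3,000 agents in three shifts, 10 minutes per message
print(messages_per_day(3000, 10))
```

Changing minutes_per_message or shift_hours shows how sensitive the daily figure is to those guesses.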
I'd often been a bit inquisitive when I heard someone with SpinVox enabled voicemail. Was always blown away that voice recognition had come on so far (fair enough when trained to a particular user/accent, but not like this). Apparently it hasn't.
May be forwarding this to acquaintances I know who use this service for their company mobiles (banks, government...)
Interesting article.
Re: "SpinVox declined to give a figure. 'It is our confidential business formula. It is literally the ratio that any competitor or company wanting to start a business in the potential multi-billion dollar marketplace that SpinVox actually created would love to know so that they could come after us.'"
Is that not like bringing a new computer to market but not telling anyone the speed, memory capacity etc., in case the competition found out?
THANK YOU DREW for honoring my request for more of this company. The Pakistan SOS was great. Seems like they took more time composing that message than they did yours from the previous story.
Be F'n classic if they had a call center in Nigeria....
This is your long lost couisn Bob and I need you help. Please txet me back with your personal details so I may get strted on retreiving funds being held by (blah blah blah*)
*Ran out of energy typing that
Keep em coming man keep em coming.
"In the patent description's own words:
"... the operator intelligently transcribes the actual message from the original voice message by entering the corresponding text message"
Hang on - they're trying to get a patent on the concept of listening to a phone message and writing it down? Why didn't I think of that?
> "Hang on - they're trying to get a patent on the concept of listening to a phone message and writing it down? Why didn't I think of that?"
"The scary thing is that they'll probably get the patent."
Even in the US they're not stupid enough (or are they?).
There has to be some kind of prior art. I mean pencils or something at least!
I have been using spinvox for about a year, and regardless of how they do it, it is wonderful as it saves loads of time especially if like me you hate listening to VMs!
I am actually sceptical about how much is done by humans, simply due to the number of mistakes it does make (though generally you get the gist), and also the number of messages that simply come through as 'x has left you a message, we can transcribe it, dial xyz to hear the message'; you listen to the message and it's perfectly understandable.
Ultimately the world would be a worse place without spinvox.
In about 1984, I was shown a hardware speech recognition system attached to a BBC micro that could be trained for about 200 words reliably.
In 1990, I was shown a software system running on an Intel 386 system that would achieve about 80% accuracy when untrained, rising to over 90% when given some training.
In about 1999, I played around with Dragon Naturally Speaking and ViaVoice, both of which were able to do a competent job of turning speech into text, even if they only did basic syntactic analysis.
Each time, I was told that 'context sensitive, natural language recognition' was only a matter of 5 years away.
In the 25 years during which I have seen voice recognition working, commodity computing power has risen by something like 4 orders of magnitude, and DSP hardware that can do the majority of the work has become even faster, and significantly cheaper.
Why is it, then, that it is so impossible for this technology to work? And why do we not have home media centres, fridges and cookers that we can talk to? After all, an iPhone can listen to a song and name it with a high degree of accuracy. It's really just a matter of application.
I guess that it is just one of those unfulfilled technological dreams. Or possibly, the computer and device manufacturers don't want it, because it would start to make the GUI irrelevant, and slow down the pace they could re-sell us ever more pretty and more compute and graphics intensive operating systems and hardware.
God, I've been in this business too long.
I know it's naive, but what the f*** has the fact that SpinVox is run by a woman got to do with anything?
If they've got the tech to do what they say, why aren't they putting up so others shut up? This all sounds faintly suspect to be honest. In which case that makes Julie Meyer one of the worst kind of apologists there is out on the market. A silly girl funding dodgy science because it's fronted by a woman so that the two of them can "stick it to" the blokes. Congratulations Ms Meyer, it looks rather like you helped hand 200 mill to a less than honest firm.
I used to work for a firm who developed a small PBX linked to PCs and that had Voice to Text, Text to Voice and that worked pretty damn well 9 years ago, so I fail to see what the big deal is about SpinVox's supposed tech.
Come on SpinVox, either mount a sensible defence and tell the truth, and provide evidence so we all know you're not still just being extremely economical with the truth, or forever be lumped in with the worst excesses of Web2.0 and (with any luck) die a dismal corporate death.
It's all becoming a bit surreal really isn't it?
Irrespective of how the transcription takes place and how many NDAs and whatnot have been signed, they are still charging for a service that's sold as a new system, using tech (that doesn't exist) instead of people (that do exist).
Surely misrepresentation?
From what I can see from the case families linked to in the article, there are actually only 30 published patent applications. However, that doesn't mean that there aren't 70 patent applications or cases. It's really a question of exactly what they mean and exactly what is and is not counted, and that can be a long and complicated discussion.
For those that were saying that Spinvox have had patent applications refused, they have actually got 8 granted UK patents (first granted a few years ago), with only 1 UK application refused - you need to do an online status check on every individual publication number at the relevant patent office rather than just looking at the list of publications which isn't always complete.
More interestingly, the UK IPO record on GB2435147 shows that TISBURY EUROPEAN MASTER FUND LIMITED (registered office in the Cayman Islands) have acquired rights in a number of SpinVox patent cases. European Patent Office records show that a debenture exists - full copy available through online file inspection here (click on "All documents" link on the left): https://register.epoline.org/espacenet/regviewer?AP=04728841&CY=EP&LG=en&DB=REG
The "predictive text" mentioned in the article is quite key to speech recognition. The usual approach is a Noisy Channel model, and uses the relation:
P(what was heard | what was said).P(what was said) = P(what was said | what was heard).P(what was heard)
For a given candidate sentence, the value of P(what was said) comes from a language model - "the the the the the the" is a v.low probability sentence, "the cat sat on the mat" somewhat higher. With an acoustic model - P(what was heard | what was said) - you can then use Bayes' rule to invert it, and normalise over the possible options, getting a distribution for P(what was said | what was heard). The highest-scoring candidate sentence in that distribution is the most probable option for what was gabbled down the phone, or you can offer the top N choices to an operator and let them pick.
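For anyone who wants to see that scoring step in action, here is a toy sketch in Python. All the probabilities are invented purely for illustration (a real recogniser would get them from a trained language model and acoustic model), and the function names are hypothetical rather than taken from any real product.

```python
# Hypothetical toy numbers, invented for illustration only.
# Language model: P(what was said)
language_model = {
    "the cat sat on the mat": 1e-6,
    "the cat sat on the map": 2e-7,
    "the the the the the the": 1e-15,
}

# Acoustic model: P(what was heard | what was said) for one fixed recording
acoustic_model = {
    "the cat sat on the mat": 0.30,
    "the cat sat on the map": 0.35,
    "the the the the the the": 0.05,
}


def posterior(language_model, acoustic_model):
    """Bayes' rule: multiply the two models, then normalise over the candidates."""
    joint = {s: acoustic_model[s] * language_model[s] for s in language_model}
    total = sum(joint.values())  # plays the role of P(what was heard)
    return {s: p / total for s, p in joint.items()}


def top_n(distribution, n):
    """The n most probable candidates, e.g. to show to a human operator."""
    return sorted(distribution, key=distribution.get, reverse=True)[:n]


dist = posterior(language_model, acoustic_model)
for sentence in top_n(dist, 2):
    print(f"{dist[sentence]:.3f}  {sentence}")
```

With these made-up numbers the acoustic model slightly prefers "map", but the language model's prior is strong enough that "mat" comes out on top, which is the whole point of combining the two.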
The problem with all of this is that humans are *extremely* good at the language model part of this. We have to be, as there's too much uncertainty in the actual sound. We can "hear" sentences in audio that is (practically) indistinguishable from noise - just try playing a Beatles album backwards! We also, to some extent, hear what we expect to hear - being told that a backwards track contains the words "I buried Paul" makes you more likely to hear that than if you'd not been told.
Computers can't hope to match human accuracy until they have, basically, the ability to understand an article on El Reg. Which is why AI has been a complete failure for decades and will remain so for quite a while yet.
Techie Guy 'cuz this stuff is Master's Degree level...