Microsoft: Our AI speech recognition mangles your words the least

Microsoft researchers working on AI computer speech recognition have reached a word error rate of 6.3 per cent, which they claim is the lowest in the industry. Hot on the heels of Google DeepMind announcing a “breakthrough” in AI speech recognition, Microsoft was quick to respond by saying it, too, has reached a “milestone” while …
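
For context on the headline figure: word error rate is conventionally the word-level edit distance (substitutions plus deletions plus insertions) between a reference transcript and the recogniser's output, divided by the number of words in the reference. Below is a minimal Python sketch under that standard definition. It is not Microsoft's evaluation code; the function name `word_error_rate` is hypothetical, and the reference sentence is an assumed reading of the "Sweden sour chicken" example from the comments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with nothing = i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Assumed reference (what was said) versus the garbled output.
said = "where can i find sweet and sour chicken"
heard = "where can i find sweden sour chicken"
print(word_error_rate(said, heard))  # 0.25: one substitution, one deletion, 8 words
```

Because the divisor is the reference length, a recogniser that inserts many spurious words can score above 100 per cent, so "6.3 per cent" is not simply the complement of an accuracy percentage.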

  1. stuartnz
    Thumb Up

    Hic Sunt Dracones

    "Computer speech recognition has come a long way"

    It really, really has. I've been using Dragon for 10+ years now, and each new release gets better and better. As the effects of my CP become more noticeable, it's great that I can finally rely on my computer to understand 99% of what my thick Kiwi accent says. However, that's not my experience with the built-in speech recognition in Windows 10. It's not awful, but it's not a patch on Dragon 15 for sure. If MS's new AI speech recognition is as good as claimed, I hope it does filter down to Win 10 sooner rather than later.

  2. Anonymous Coward

    So effing what.

    Be it Cortana, Siri or Jumping Jack Flash, I won't be using any of them in the foreseeable future.

    Add to that the snooping that they all seem to do: why would I volunteer all that sort of information to the likes of Microsoft, Goole and the rest?

    1. hplasm Silver badge
      Coat

      Re: So effing what.

      "all that sort of information to the likes of Microsoft, Goole and the rest."

      It's not Goole you need to worry about, it's Scunthorpe.

  3. Dwarf Silver badge
    Joke

    I'm sorry Dave,

    I can't do that.

    1. hplasm Silver badge
      Happy

      "I'm sorry Dave," said Cortana, sadly.

      "I haven't a clue what you're on about. And your nose wobbles when you speak, did you know that? It's very distracting. Now what were you saying again? Eh? What?"

  4. inmypjs Silver badge

    6.3% eh?

    More than 40 years of development and that is where we are? A comparison with human performance with the same material would be useful.

    I keep comparing computer speech recognition with computer autonomous car driving and consider the former to be relatively a piece of piss.

    Would you want to be a passenger in an autonomous car which incorrectly interprets what it 'sees' 6 out of 100 times?

    1. MatthewSt

      Re: 6.3% eh?

      Depends on your identification needs. With speech recognition you have a measurable goal, and it is very obvious when you miss it, whereas with the 'seeing' for driving you only have a limited set of results. I don't care whether it's a person, another car or a tree. All I care about (from the car's perspective) is whether I'm going to hit it based on current trajectory. That would appear to be a lot simpler to get right than "Where can I find Sweden sour chicken?"

      1. inmypjs Silver badge

        Re: 6.3% eh?

        "I don't care whether it's a person, another car or a tree."

        I care about not jamming on the brakes and having someone tail end me for the sake of a plastic bag blowing across the road or a tin can that looks much bigger on radar.

        I care about it not even trying to interpret subtle clues about the future actions of pedestrians and other road users. Will it see a football as a threat? What about a kid that is about to run into the road after it?

        Will it see horse shit on the road and take it as an indication there could be a couple of people on horses around the next corner? (an observation and interpretation I have correctly made on more than one occasion). Will it be even surer because it observed the horse shit was still steaming?

        The amount of information a competent driver can collect, and what can be interpreted from it, is simply huge compared to recognising speech. Just look at the data rates: 8kb/s for speech. How much all-round stereo vision do you get in that?

        1. quxinot

          Re: 6.3% eh?

          "The amount of information a competent driver can collect, and what can be interpreted from it, is simply huge compared to recognising speech. Just look at the data rates: 8kb/s for speech. How much all-round stereo vision do you get in that?"

          Absolutely true and I agree with the lot.

          On the other hand, the larger problem is that the typical driver will be able to say with certainty what the last tweet/Facebook post/SMS that just came up was...

    2. Sandtitz Silver badge
      Happy

      Re: 6.3% eh?

      "More than 40 years of development and that is where we are? A comparison with human performance with the same material would be useful."

      Agreed. I'd also like to know what sort of material was used. I could perhaps correctly get 6.3% of the words from a Glaswegian chav. (no offence, Scots!)

      There's also the problem with words that sound alike but have different meanings. It's not just converting the sounds to words, the AI also needs to decipher the sentence - many people don't adhere to the correct syntax (and some don't always even make sense).

      1. inmypjs Silver badge

        Re: 6.3% eh?

        "I'd also like to know what sort of material was used"

        If you try, you can find examples. They are recorded telephone conversations. The one I found appeared to be a woman talking to her mother about an upcoming party or reunion. Fast two-way conversation, and pretty clear apart from some person names, which I would be guessing at.

      2. Pompous Git Silver badge
        Trollface

        Re: 6.3% eh?

        "many people don't adhere to the correct syntax (and some don't always even make sense)"

        Which makes it a bit of a mystery why the latter bother commenting on El Reg...

    3. Herby Silver badge

      Re: 6.3% eh?

      Just a comment. If you were to hire a secretary with a 6.3% error rate, she would be out the door the first day!

      I have memories of some work done in the 60s that could (after training) understand the words "one", "two", "three" and "four". The training took a while, and the program used speaker-specific samples. Pretty crafty for around 50 years ago.

      So, yes we have come a long way, but more needs to be done. In my case, Siri works pretty well though.

      1. Mage Silver badge

        Re: 6.3% eh?

        Also at the end of that you have a string of text. Then a completely different AI problem to parse that and decide what is wanted.

        To have really good speech recognition you need context and meaning, unless it's a very simple list of commands for a car radio or dialling a phone.

        Since most people have difficulty using search well with a keyboard, I don't think adding speech on top is going to result in successful queries or searches except for expert users.

    4. Eric Olson

      Re: 6.3% eh?

      I've had 34 years of development, and I still have a double-digit error rate when dealing with accents outside of my home region, especially those from the Deep South of the US, Newfoundland, and much of the non-London portion of England.

      And I have better luck with people from India, Bangladesh, and Pakistan, talking to me through a cell phone in those locations, than whatever the Florida Man has to say.

    5. Anonymous Coward

      Re: 6.3% eh?

      "More than 40 years of development and that is where we are?"

      Yep. In terms of accuracy per man-hour consumed we would have been better off stuffing little people with keyboards in barrels and installing them in each home.

  5. Crazy Operations Guy Silver badge

    I wish someone would build a closed-captioning type thing for phone calls

    I work with a lot of international teams, and some of them are just plain unintelligible. Having some sort of AI, even at a 5% error rate, to turn their speech into text would help immensely when doing phone calls with them.

    1. cambsukguy

      Re: I wish someone would build a closed-captioning type thing for phone calls

      Well, Skype has a translate button; it presumably works for English to English.

      I doubt it would manage the reliability seen here, but Skype is usually much higher fidelity than an old-fashioned phone call. In fact it is often better than a modern phone call: they are all digital, but Skype seems to throw a lot more bits at the problem.

  6. Pompous Git Silver badge

    Quoting BillG from a few years back...

    "Can it wreck a nice beach yet?"

  7. Baldy50

    Don't....

    Install Windows ten work? Or does the 'Don't' not get recognised?

  8. David Woodhead

    Speech recognition? That's 60s stuff.

    Nobody talks to their computer.

    Really, if you do anything meaningful on your PC, you don't talk to it. That's because if you do you'll seriously piss off the people you're working with or those sitting around you. And it's slow.

    The technology was pretty much sorted in the 1960s. Yes, that's around 50 years ago. So why hasn't it taken off? Because 1) it's inefficient; 2) people don't want to do it; and 3) it pisses off anyone who can hear you.

    That's all.

  9. un

    We don't have Glaswegian chavs. Only Neds, which is a classic example of why one of the most important parts of speech recognition is still context.

  10. Andronnicus Block

    The BBC got there first...

    Well, at least the alternative BBC did - this being the broadcaster that can be seen on the UK fly-on-the-wall documentary series W1A that was broadcast on Monday this week.

    The latest version of their in-house developed Syncopatico automatic live sub-titling software was achieving 93% accuracy - with just the slight problem that the 7% errors were all related to names.

    So, our beloved PM became Tweezer May, the Russian president Vladimir Puking and the actress Maggie Smith was transformed into Dame Baggy Smith.

    Go Microsoft!

Biting the hand that feeds IT © 1998–2019