Audio tweaked just 0.1% to fool speech recognition engines

The development of AI adversaries continues apace: a paper by Nicholas Carlini and David Wagner of the University of California Berkeley has shown off a technique to trick speech recognition by changing the source waveform by 0.1 per cent. The pair wrote at arXiv that their attack achieved a first: not merely an attack …
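For readers wondering what "changing the source waveform by 0.1 per cent" looks like in practice, the sketch below shows the general shape of an iterative, gradient-based perturbation attack. It is an illustration only, not the authors' code: `craft_perturbation` and `loss_fn` are made-up names, and a real attack would plug in a differentiable speech-to-text loss (the paper optimises against DeepSpeech's CTC loss for a chosen target phrase) rather than the toy loss used here.

```python
# Rough sketch of a gradient-based adversarial perturbation, assuming PyTorch
# and some differentiable loss measuring distance to a target transcription.
import torch

def craft_perturbation(waveform, loss_fn, steps=100, lr=1e-3, max_rel=0.001):
    """Find a small additive tweak that drives loss_fn down while keeping the
    change below max_rel (0.1%) of the signal's peak amplitude."""
    delta = torch.zeros_like(waveform, requires_grad=True)
    bound = float(max_rel * waveform.abs().max())   # 0.1% of peak amplitude
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(waveform + delta)            # e.g. CTC loss vs. target phrase
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-bound, bound)             # keep the tweak tiny
    return (waveform + delta).detach()

# Toy stand-in loss so the sketch runs end to end; a real attack would query
# the recogniser's own loss against the chosen target transcription.
audio = torch.randn(16000)                          # one second at 16 kHz
adversarial = craft_perturbation(audio, lambda x: (x.mean() - 0.5) ** 2)
```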

  1. Anonymous Coward
    Anonymous Coward

    How can it get a 100 per cent success rate when speech recognition itself isn't 100 per cent successful?

    1. Anonymous Coward
      Anonymous Coward

      In about ten minutes guv'.

  2. bazza Silver badge

    "Think “Alexa, stream smut to the TV” when your friend only hears you say “What's the weather, Alexa?”"

    Judging by some of the clips on YouTube, Alexa is perfectly capable of doing that already...

    This kind of thing should act as a real warning to anyone planning an automated call centre. It means that fraud is a real risk. If "Tell me my balance" can be tweaked into being interpreted as "transfer the funds", followed by "No" being tweaked into a "Yes", a bank could get into deep trouble. Any playback in a court case would show that the punter had said one thing and that the bank interpreted it all wrongly...

    Generally speaking, at least here in the UK / Europe, it'd be interesting to see if a recording of someone's voice (as made by a voice recognition system) counted as a personal data record. If so then a failure to process it accurately (and to the detriment of the customer) would be a Data Protection Act problem. £5000 fine.

    1. A Non e-mouse Silver badge
      Terminator

      Siri too

      Siri on my iPhone suffers from the same problem. I've heard its bong register that it's heard the magic "Hey Siri" command when nothing like "Hey Siri" was said by anyone or anything in the audible vicinity. Yet it doesn't always acknowledge when I say "Hey Siri".

      1. Anonymous Coward
        Anonymous Coward

        Re: Siri too

        My portable Satnav in the car sometimes reacts to non-vocal sounds from whatever music I'm listening to.

    2. Anonymous Coward
      Anonymous Coward

      "Generally speaking, at least here in the UK / Europe, it'd be interesting to see if a recording of someone's voice (as made by a voice recognition system) counted as a personal data record."

      HMRC are using voice recognition as a taxpayer ID to get through to their help desks. It appears to use one set phrase.

      1. Simon Harris

        "HMRC ... appears to use one set phrase."

        "That money was just resting in my account"?

  3. Anonymous Coward
    Anonymous Coward

    "My Voice is my..."

    Oh, perhaps not.

    1. llaryllama

      Re: "My Voice is my..."

      Shall I phone you or nudge you?

  4. Justin Case

    So, it's crap?

    If a system can be "fooled" in this way, it's probably not doing it right.

    Yet another example of AI proving itself to be Artificial Inanity.

    1. RW

      Re: So, it's crap?

      Some decades ago I briefly crossed paths with an AI researcher. As a result, in my books, the correct term is "artificial stupidity."

    2. Alan Brown Silver badge

      Re: So, it's crap?

      'If a system can be "fooled" in this way, it's probably not doing it right.'

      Humans can be trivially fooled into mishearing things.

      (Excuse me while I kiss this guy, We built this city on sausage rolls, We're calling a trout, Le freak c'est sheep, I'll never leave your pizza burning)

      They can also be relatively easily fooled into misidentifying the speaker.

  5. Nick Kew

    Just like human senses

    Remind me: was that dress blue and black, or gold and white, or ... ? I'm sure someone can remember the story of the dress that hit the headlines when it fooled the human eye.

    And when I worked in speech recognition research, we could easily confuse a speech recogniser for a digger, because the latter could wreck a nice beach (say it a couple of times if you don't get it).

    1. Anonymous Coward
      Anonymous Coward

      Re: Just like human senses

      It was both. I managed to see either dress, depending on the conditions when I looked at the picture. On a few occasions I hit the sweet spot where it was just on the point of changing.

      I had a copy pinned up for a while because the ongoing optical illusion delighted me that much.

      1. handleoclast

        Re: Just like human senses

        @Anonymous Coward

        "because the ongoing optical illusion delighted me that much."

        If you liked that, you'll love this. If you're looking at it on a phone, it's not using the front-facing camera to track your eyes; it works on a desktop setup with no cam. Doesn't work in a printout, though. :)

    2. Nick Kew

      Re: Just like human senses

      Following up to myself (sorry).

      Just heard Rutherford & Fry on t'wireless discussing human vs machine perception. Specifically, facial recognition.

      They made a crucial distinction. Humans (and sheep) are very good at recognising faces we know, but very bad at recognising strangers. The latter has led to criminal convictions on eye-witness evidence that have subsequently been proven entirely wrong. Machines can of course be fooled too, as studies like this article demonstrate.

      I reckon that means the real human/machine distinctions come from secondary influences. Like suggestibility and prejudices in humans, or tampering in machines.

      1. Charles 9

        Re: Just like human senses

        "They made a crucial distinction. Humans (and sheep) are very good at recognising faces we know, but very bad at recognising strangers."

        We also lose the ability to recognize even faces we know if enough cues disappear. A famous case around the early '90s pretty much shot eyewitness testimony all to hell by showing that a sufficiently covered (e.g. beard and glasses) celebrity face was mistaken by nearly everyone for the defendant.

        1. Alan Brown Silver badge

          Re: Just like human senses

          "We also lose the ability to recognize even faces we know if enough cues disappear."

          All it takes for most people is a different haircut.

  6. John Smith 19 Gold badge
    Unhappy

    Good work, and a reminder (like it was needed) this AI BS is not all it's cracked up to be.

    Keep in mind that this once again reminds people AI speech recog ¬= human speech recog.

    After all humans recognize "silence" as silence (or in this case I suspect random low level noise).

  7. foxyshadis

    El Reg is showing a pattern here

    While this is a major step up from the last two "machine learning fail" studies The Register has breathlessly reported on -- at least this time it's not just testing some crap created from scratch by the researchers themselves -- they chose DeepSpeech, of all the speech-to-text algorithms, which is widely considered so bad that this might be the first study to actually bother testing it. It's no surprise that it fails so badly. Even if they have to confine themselves to open source (which makes no sense in this case, since they neither analyze the algorithms nor modify the code), CMU Sphinx and Kaldi are the gold standards.

    No one cares how DeepSpeech fails; it's widely regarded as a failure, so testing it is a waste of time. Wait until it has another year or two to mature before it's worth testing.

  8. Anonymous Coward
    Anonymous Coward

    Potential for the future

    Although there are obvious flaws and issues with the case, it shows that there is potential for people who want to communicate in different languages. I think there are already some earpieces that do a similar thing, and with added AI it could only improve the accuracy and range of things you can say.

  9. Joerg

    This happens because AI algorithms are a joke...

    This happens because AI algorithms are a joke... yep, that is the truth. None of the AI algorithms currently in use are really AI at all. They are just very complex (aka messed up) combinations of conditions, with tweaks and hacks to make them look like an AI taking its own decisions. They really are no different from thousands of very simple nested if-then-else conditions. All the neural network stuff looks shiny and cool in theory, but it is not as advanced as the marketing wants people to believe.

  10. Matthew Taylor

    He who laughs last...

    To everyone who is chuckling that "it just goes to show, AI is useless after all", consider this. One of the problems with training these systems is a lack of good training data, so this "attack" is a boon to AI researchers. They just need to add lots of adversarially hobbled speech samples to their network's training set, and it will learn to classify speech much more robustly.
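    For the sake of illustration, the data-augmentation loop being described might look something like the sketch below. It is only a sketch: `make_adversarial` and `train_step` are hypothetical placeholders for whatever attack routine and recogniser you actually have.

    ```python
    # Sketch of adversarial-augmented training, assuming make_adversarial() is
    # some attack routine (e.g. the perturbation loop sketched above) and
    # train_step() is an ordinary supervised update. Names are illustrative.
    import torch

    def adversarially_augmented_batches(clean_batches, make_adversarial, mix_ratio=0.5):
        """Yield batches in which roughly mix_ratio of the audio is replaced by
        its adversarially perturbed counterpart; the transcripts stay the same."""
        for audio, transcript in clean_batches:
            if torch.rand(1).item() < mix_ratio:
                audio = make_adversarial(audio)   # hobbled sample, same ground truth
            yield audio, transcript

    # Usage sketch (model, loader and train_step are placeholders):
    # for audio, transcript in adversarially_augmented_batches(loader, attack_fn):
    #     train_step(model, audio, transcript)
    ```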

    1. Alistair
      Windows

      Re: He who laughs last...

      Umm:

      "They just need to add lots of adversarially hobbled speech samples to their network's training set,"

      I'd suggest that they start collecting recordings from drive-through restaurant communications systems.

      If they get that working, well, then they can *definitely* claim a victory.

  11. mark l 2 Silver badge

    I wonder if this technique would work with audio other than voice? If so, it could also be used to get around YouTube's annoying song recognition, where your videos have to have parts of the audio replaced or muted, or you have to allow them to be monetised by the song's copyright owner, because there happened to be a bit of music playing in the background when you made your video.

    I can understand them not wanting people to make money off others' work, but you can't always turn down the music if what you're filming is occurring in real time and is not a repeatable event.

    1. Anonymous Coward
      Anonymous Coward

      "I can understand them not wanting people to make money off others work, but sometimes you can't always turn down music if what your filming is occurring in real time and is not a repeatable event"

      As Hollywood would say, "We can always edit in post."

  12. Anonymous Coward
    Anonymous Coward

    Reminds me of the time...

    I took a recording of my friend saying "We have a direct hit!" and changed it to say "We have an erect tit!" I'm slightly less juvenile now. Slightly. ;-)

  13. Packet

    I am not surprised by this.

    What disturbs me is how many banks have shifted to voice print identification for their telephone systems.

    And unlike, say, a consumer product like Siri, the telephone banking system can't be trained to recognize your voice (or maybe it can?). I use Siri as an example of something that learns your voice the more you use it.

    I get the motivation though - every CTO, etc is looking for the next great cure-all for all the security shite of IT, and so they jump onto the newest flavour of the month.

    Telephone banking is definitely insecure - relying on just a PIN of 3-5 digits - so along comes voice recognition, and it sounds very Star Trek-like too.

    But in the meantime, they ignore their web/app banking security... (I know of a bank that, for the longest time, did not differentiate between upper-case and lower-case passwords...)

  14. Ron Luther

    More Fun

    Why punk a friend? Surely it would be a bigger laff to punk the world?

    That politician giving the speech to the international audience? Let's have the auto-translators pick up "Good Morning!" as "Today we will be bombing West Ham!"

  15. Alan Brown Silver badge

    Speech recognition

    Try "Mmm, yes. Special we are" played backwards.

    As far as the steganography in audio is concerned:

    This kind of thing is due to having far too high a dynamic range on the listener as well as too wide a bandwidth. Injecting synthetic masking noise would nobble the hidden speech detection (and also kill off false Google/Siri/Cortana/Alexa hits), and filtering to 300-3000 Hz (actually 300-1500 Hz is all you'd need) would probably improve accuracy.

    You only need 12 dB of dynamic range to handle intelligible speech (that's why 12 dB SINAD used to be the squelch point in land-mobile systems). Old-style telephony LD circuits used to have only about 40 dB.
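    For what it's worth, the sort of pre-filtering being suggested might look like the sketch below. It is an illustration, not a tested defence: the cutoffs, filter order and noise level are assumptions picked to match the comment, and `precondition` is a made-up name.

    ```python
    # Band-limit audio to roughly the telephony band and inject low-level masking
    # noise before handing it to a recogniser. Uses numpy/scipy; all parameters
    # here are illustrative, not values from the article or the paper.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def precondition(audio, sample_rate=16000, low_hz=300.0, high_hz=3000.0,
                     noise_db=-40.0):
        """Band-pass filter the signal and add weak white noise to mask tweaks
        hidden in the unused dynamic range."""
        sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate,
                     output="sos")
        filtered = sosfiltfilt(sos, audio)
        noise_rms = np.max(np.abs(filtered)) * 10 ** (noise_db / 20.0)
        return filtered + np.random.normal(0.0, noise_rms, size=filtered.shape)

    # Example: a 1 kHz tone passes through largely intact, while content outside
    # the 300-3000 Hz band is heavily attenuated.
    t = np.linspace(0, 1, 16000, endpoint=False)
    cleaned = precondition(np.sin(2 * np.pi * 1000 * t))
    ```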

  16. Herby

    Pay no attention...

    ...to the man behind the curtain.

    Speech is noise if you don't understand it. And vice versa.
