Google's video recognition AI is trivially trollable

Early in March, Google let loose a beta previewing an AI to classify videos – and it only took a month for University of Washington boffins to defeat it. The academics' approach is trivial: all the researchers (led by PhD student Hossein Hosseini) did was insert a still photo “periodically and at a very low rate into videos …

  1. Lee D Silver badge

    Because it's not AI.

    It's not "seeing" anything.

    It's just taking a statistic generated from the data and finding the nearest data point to that statistic in its database and then returning it.

    None of this stuff is "AI". This is why web filters are still just huge lists of human-checked websites, rather than something scanning each page to tell the difference between a pair of stockings modelled on the M&S website and something much more nefarious.

    Stop calling it "AI".

    Hell, stop calling it "recognition".

    At best, it's an "algorithm".

    1. Anonymous Coward

      I couldn't agree more; we've literally been discussing this in the office. As far as I'm concerned AI doesn't exist yet, not by a long shot. Come back and tell me it exists when a system is flexible, in that it doesn't just perform a single repetitive task (an algorithm). I'll believe it exists when I can ask a system to do something it wasn't specifically designed to do, like this video recognition malarkey ordering me a pizza.

    2. LionelB Silver badge

      It's just taking a statistic generated from the data and finding the nearest data point to that statistic in its database and then returning it.

      No, it's not doing anything like that. Please find out how deep-learning systems actually work before posting fatuous comments.

      At best, it's an "algorithm".

      Yes, of course it is. Computers run algorithms - that's what they do. Whether you call it an "AI algorithm" (I wouldn't) or a "pattern recognition algorithm" (I might) is a matter of what you think those terms ought to mean.

    3. Charlie Clark Silver badge

      It's just taking a statistic generated from the data and finding the nearest data point to that statistic in its database and then returning it.

      This is patently not the case because it's classifying by an outlier. This tells us a lot about the configuration of the algorithm but not much more. As for testing intelligence: I'm sure it wouldn't be too hard to come up with something similar that could "fool" humans. Indeed there was a Horizon programme devoted to just that several years ago. But in terms of video there is also the classic colour changing card trick.

      Heuristics themselves rely on an underlying statistical model or models.

    4. Anonymous Coward

      "It's just taking a statistic generated from the data and finding the nearest data point to that statistic in its database and then returning it."

      No, it most certainly is not.

    5. John Smith 19 Gold badge
      Unhappy

      "taking a statistic generated from the data...nearest data point to that statistic in its database"

      Except the spoofing suggests that it's not.

      By a very wide margin.

    6. The Man Who Fell To Earth Silver badge
      FAIL

      legendary Gorilla researcher Jane Goodall.

      Jane Goodall worked on Chimpanzees, not Gorillas.

      1. Anonymous Coward

        Re: legendary Gorilla researcher Jane Goodall.

        > Jane Goodall worked on Chimpanzees, not Gorillas.

        Oh, you must be talking about the actual Jane Goodall. The legendary one worked on gorillas. :-)

  2. Syd

    Isn't this more of a tiny "bug" than a fundamental problem? Like the old trick of putting misleading stuff in the HTML meta-tags, the defeat is easily defeated - just ignore frames that are out of place?

    Of course there will potentially be an arms-race, just as there still is in trying to... er... "optimise" Google search results; but that doesn't mean Google search is fundamentally broken, any more than this is?

    1. Mage Silver badge

      Is it a bug?

      No, it's a fundamental flaw in how all so-called AI is hyped, presented and marketed. And in how it actually works.

      If it was a tiny bug it could be fixed.

      I see this on AI junk mail filters.

      1. Lee D Silver badge

        Re: Is it a bug?

        If you have to insert an explicit rule, it's not AI. It's a human-written heuristic.

        If the machine can't learn on its own, it's not AI either. It's a human-controlled heuristic.

        If you have to spend your life telling it "Oh, and look out for this explicit thing that you get wrong", then you may as well just write a list of rules.

        And the exact problem with these "deep learning" machine algorithm things is that you can't just say "Oh, take this into account", because they aren't written that way; they learn from the data.

        No, you have to go back, create test cases for every imaginable scenario, spend years training it on all of them and hope it picks up on what it was doing wrong. And then someone comes up with, say, picture-in-picture which confuses it again. Back to square one retraining on that too.

        So you can't use it for, say... crowd-based facial recognition (as is often advertised as a use case for such things), or self-driving vehicle cameras, because it could flag ANYTHING at any point just by being sufficiently distracted - even with NO knowledge of its underlying training or algorithms. And you can't train it on every possible scenario well enough that someone trying to catch it out can't just make it flip.

        Imagine telling even a 2-year-old that they're going to need to win the toy by getting the video right. And you show them a video with a thousand frames of tigers, and one frame of an Audi. Would you really ever expect them to press "Car" instead of "Animal"? This system is no better trained than a 2-year-old human, in that case, who can do a lot more besides.

        1. LionelB Silver badge

          Re: Is it a bug?

          If you have to insert an explicit rule, it's not AI. It's a human-written heuristic.

          You might well argue, though, that natural (human) intelligence is a massive mish-mash of heuristics, learning algorithms and expedient hacks assembled and honed over evolutionary time scales.

          That's what general (i.e., non-domain-specific) AI is up against - and yes, it's hard, and we're nowhere near yet. And of course it's hyped - what isn't? Get over it, and maybe even appreciate the incremental advances. Or better still, get involved and make it better. Sneering never got anything done.

          1. DropBear
            WTF?

            Re: Is it a bug?

            So your reply to "homoeopathy is not medicine" is "write a new treatise on it and make it better!" Yup, got it.

            1. LionelB Silver badge
              FAIL

              Re: Is it a bug?

              So your reply to "homoeopathy is not medicine" is "write a new treatise on it and make it better!" Yup, got it.

              Sorry, but that's a fantastically lame "analogy".

              1. DropBear
                Stop

                Re: Is it a bug?

                "Sorry, but that's a fantastically lame "analogy"."

                So is the notion that current "AI"-anything has anything to do with intelligence.

                1. LionelB Silver badge

                  Re: Is it a bug?

                  @DropBear

                  LionelB wrote earlier:

                  That's what general (i.e., non-domain-specific) AI is up against - and yes, it's hard, and we're nowhere near yet.

                  IOW, I don't entirely* disagree with you. I just thought your analogy was crap.

                  *OTOH, I don't think "real" AI (whatever that means) is unattainable - always a duff idea to second-guess the future (cf. previous unattainables, like heavier-than-air flight, or putting humans on the moon). Basically, I don't believe that there are magic pixies involved in natural (human) intelligence.

              2. John Brown (no body) Silver badge
                Coat

                Re: Is it a bug?

                "Sorry, but that's a fantastically lame "analogy"."

                Yeah. What this thread needs is a car analogy. Preferably an Audi one.

        2. LionelB Silver badge

          Re: Is it a bug?

          No, you have to go back, create test cases for every imaginable scenario, ...

          Sorry, no. You seem to have a total misconception as to how machine-learning in general, and "deep-learning" (a.k.a. multi-layer, usually feed-forward) networks in particular, function. You seem to have latched onto the bogus idea that a machine learning algorithm needs to have "seen" every conceivable input it might encounter in order to be able to classify it.

          In reality, the entire point of machine-learning algorithms is to be able to generalise to inputs it hasn't encountered in training. The art (and it's not quite a science, although some aspects of the process are well-understood) of good machine-learning design is to tread the line between poor generalisation (a.k.a. "overfitting" the training data) and poor classification ability (a.k.a. "underfitting" the training data).
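
          For illustration only, here's a toy sketch of that trade-off in Python with scikit-learn (nothing to do with Google's system; the data and degrees are made up): a degree-1 polynomial underfits a noisy sine curve, degree 15 overfits it, and the gap between training error and held-out error shows which is which.

```python
# Toy under/overfitting demo (illustrative only): fit polynomials of
# increasing degree to noisy samples of a sine curve and compare the
# error on training data vs held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
x_train, y_train = x[:30, None], y[:30]
x_test, y_test = x[30:, None], y[30:]

for degree in (1, 4, 15):  # underfit, about right, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(f"degree {degree:2d}: "
          f"train MSE {mean_squared_error(y_train, model.predict(x_train)):.3f}, "
          f"test MSE {mean_squared_error(y_test, model.predict(x_test)):.3f}")
```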

          It's a hard problem - and while the more (and more varied) the training data and time/computing resources available, the better performance you can expect, I'd be the last person to claim that deep-learning is going to crack general AI. Far from it. But it can be rather good at domain-specific problems, and as such I suspect will become a useful building-block of more sophisticated and multi-faceted systems of the future.

          After all, a rather striking (if comparatively minor) and highly domain-specific aspect of human intelligence is our astounding facial recognition abilities. But then we have the benefit of millions of years of evolutionary "algorithm design" behind those abilities.

      2. LionelB Silver badge

        Re: Is it a bug?

        If it was a tiny bug it could be fixed.

        What makes you so sure it can't be fixed? FWIW, I suspect it is probably not a "tiny" bug, but may not actually be that hard to fix (off the top of my head I can imagine, for example, a training regime which omits random frames, or perhaps a sub-system which recognises highly uncharacteristic frames, which might mitigate the problem).
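
        To make the second idea concrete, here's a purely hypothetical sketch (Python/NumPy, not Google's system): given per-frame class probabilities from some assumed frame classifier, drop frames whose predictions sit far from the clip's consensus before aggregating.

```python
# Hypothetical "uncharacteristic frame" filter (not Google's system):
# given per-frame class probabilities, ignore frames whose predictions
# sit far from the clip's consensus, then aggregate what's left.
import numpy as np

def robust_video_label(frame_probs, z_cut=3.0):
    """frame_probs: array of shape (n_frames, n_classes), e.g. softmax
    outputs from some assumed per-frame classifier."""
    probs = np.asarray(frame_probs, dtype=float)
    consensus = np.median(probs, axis=0)
    dist = np.linalg.norm(probs - consensus, axis=1)  # distance from consensus
    mad = np.median(np.abs(dist - np.median(dist))) + 1e-9
    keep = dist < np.median(dist) + z_cut * mad       # drop the outlier frames
    return int(np.argmax(probs[keep].mean(axis=0)))   # label from what's left
```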

        This research may well turn out to be rather useful to Google (although I'd also be slightly surprised if they weren't aware of something similar already).

        1. iTheHuman

          Re: Is it a bug?

          I assume you mean "emits"? As long as backpropagation applies the right weight, it will.

          It does remind me of the AI version of a honeypot.

          A paper was making the rounds recently that described a universal method to fool, IIRC, CNNs by applying a distortion field over an image. The images remained recognizable by humans. What seemed to be forgotten was that the algorithm is now part of new training sets, which will make the systems more robust.
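
          That "becomes part of the training set" point is basically adversarial training. A bare-bones FGSM-style sketch in PyTorch (illustrative only; `model`, `optimiser` and the data are assumed to exist, and this isn't what either paper did verbatim):

```python
# Bare-bones FGSM-style adversarial training step (illustrative only).
import torch
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimiser, eps=0.03):
    # Craft perturbed copies of the batch using the current model's gradients.
    x_req = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_req), y).backward()
    x_adv = (x_req + eps * x_req.grad.sign()).clamp(0, 1).detach()

    # Train on clean + adversarial examples so the perturbation stops working.
    optimiser.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimiser.step()
    return loss.item()
```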

    2. Anonymous Coward

      "Isn't this more of a tiny "bug" than a fundamental problem?"

      Probably. It'll likely be some artefact of the use of frames rather than the concept of continuous video, coupled with artificial objects being *much* easier to learn and later classify than natural ones. Particularly natural ones that have spent millions of years evolving to resist being seen.

      It's an interesting attack vector, going after the feature extraction rather than the learning or classification phases. Not a new one, but probably the single most vulnerable one, given that ANNs are black boxes.

  3. Ralph B

    Googlebomb v2

    So, if you replace one frame in 50 of a Trump video with goatse ... ?

    1. Swarthy

      Re: Googlebomb v2

      Why would you do that? I mean, the original source is a flagrant arse-hole, what would adding goatse accomplish?

  4. Ben1892

    Isn't that working as intended?

    Isn't the idea that a human doesn't have to watch hours of footage to find possibly nefarious images inserted into seemingly innocuous videos? The machine learning algorithm just needs to be taught that videos can be about more than one topic, because humans are sneaky like that.

  5. cb7

    The algorithm is clearly flawed

    The real problem is why the algorithm places such a heavy weighting on what effectively amounts to only 2% of the footage.

    1. VinceH

      Re: The algorithm is clearly flawed

      And/or on what it recognises more easily and more quickly. Possibly.

      I.e., if it was flipped - so, using the example described in the article, if it was a video about a car with the odd frame of a tiger inserted - I suspect it would correctly identify it as a video about a car. It needs to be programmed (I won't say taught at this stage) to use the predominant subject matter.
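
      Even something as blunt as a majority vote over per-frame labels would shrug off a 1-in-50 insertion. A minimal sketch in Python (the labels, and the per-frame classifier they'd come from, are hypothetical):

```python
# Majority vote over per-frame labels: 980 tiger frames outvote 20 Audi frames.
from collections import Counter

def predominant_label(frame_labels):
    """frame_labels: per-frame class names from some hypothetical classifier."""
    return Counter(frame_labels).most_common(1)[0][0]

print(predominant_label(["tiger"] * 980 + ["audi"] * 20))  # -> tiger
```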

    2. John Smith 19 Gold badge
      Unhappy

      "why the algorithm..such a heavy weighting on..only 2% of the footage."

      Got it in a nutshell.

      These "deep learning"* algorithms are meant to be "robust" which implies not deflected by minor disturbances. IE "I am 98% confident this video is about cats, but there is a 2% chance it's got something to do with cars." So 95% certain it's about a cars is an Epic Fail.

      *which makes the work sound so much more insightful than "multi-layer neural network", which actually explains what (AFAIK) all of these algorithms actually are, and therefore suggests people could actually work out how they work.

      1. LionelB Silver badge

        Re: "why the algorithm..such a heavy weighting on..only 2% of the footage."

        @John Smith 19

        Yes, deep-learning networks (usually) are just multi-layer networks - but that doesn't imply that "people could actually work out how they work". It's notoriously hard to figure out the logic (in a form comprehensible to human intuition) of how multi-layer networks arrive at an output. I believe the so-called "deep-dreaming" networks were originally devised as an aid to understanding how multi-layer convolutional networks classify images, roughly by "running them in reverse" (yes, I know it's not quite as straightforward as that).
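
        A toy activation-maximisation sketch in PyTorch, which is roughly that "running it in reverse" idea (it assumes the model outputs raw class scores for a batch of images; this is not the actual DeepDream code):

```python
# Toy "run it in reverse" sketch (illustrative): gradient-ascend an input
# image so that it maximally excites one output class of a trained model.
import torch

def dream_class(model, class_idx, steps=200, lr=0.05, shape=(1, 3, 224, 224)):
    model.eval()
    img = torch.rand(shape, requires_grad=True)  # start from noise
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = model(img)[0, class_idx]         # activation we want to boost
        (-score).backward()                      # ascend by minimising the negative
        opt.step()
        img.data.clamp_(0, 1)                    # keep pixels in a valid range
    return img.detach()
```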

        1. Adrian 4

          Re: "why the algorithm..such a heavy weighting on..only 2% of the footage."

          It's possible that the algorithm weights frames according to its confidence in those frames. So if the tiger frames were muddied by surrounding distractions but the car shots were clear and contrasty, it would have more confidence in those images than in the tiger ones.
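
          That weighting scheme is easy to sketch, and it also shows how a single crisp frame can outvote lots of muddy ones once their confidences drop low enough (Python; numbers invented for illustration, not from the paper):

```python
# Confidence-weighted voting over per-frame (label, confidence) pairs.
from collections import defaultdict

def weighted_label(frame_predictions):
    scores = defaultdict(float)
    for label, confidence in frame_predictions:
        scores[label] += confidence
    return max(scores, key=scores.get)

muddy   = [("tiger", 0.30)] * 49 + [("audi", 0.97)]
murkier = [("tiger", 0.015)] * 49 + [("audi", 0.97)]
print(weighted_label(muddy))    # tiger - 49 so-so frames still win
print(weighted_label(murkier))  # audi  - one crisp frame beats 49 very muddy ones
```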

          This isn't really a bug, it's a complication in the tuning. Especially when given input deliberately intended to deceive. Human recognition is also prone to deception - so called 'optical illusions' - and is far more difficult to correct.

    3. Anonymous Coward

      Re: The algorithm is clearly flawed

      > The real problem is why the algorithm places such a heavy weighting on what effectively amounts to only 2% of the footage.

      Same way the rest of us watch a soporific ten minute YouTube video because there was a scantily clad female on the thumbnail preview?

  6. Anonymous Coward

    Car adverts

    Sounds like a typical car advert to me - big cat, prowling around on rocks, with about 1 in 50 frames showing an actual vehicle.

  7. Anonymous Coward

    Reminds me of a story I read of a "learning AI" system designed for the army to recognize tanks on the battlefield - trained with lots of photos of Salisbury Plain on a day when there were tank exercises, against a set of photos of identical locations on a day with no tanks present. The AI system seemed to learn to detect tanks with a high level of accuracy - however, when they tried it in a live demo it was hopeless. After investigation they realized the day they took photos with tanks was sunny and the day without was overcast, so the system had actually learnt to detect the presence of clouds.

    1. Charlie Clark Silver badge

      I remember watching an old Horizon on neural networks about this. However, since then the technology, and our understanding of it, has evolved significantly. Not that it's infallible, as this article demonstrates only too well, but the question is whether it can be used in situations where it performs at least as well as humans at the same task. This is already the case with still image recognition: video is considerably harder, but correctly identifying the brand of the car in the "blipvert" is pretty impressive and, incidentally, tells us a lot about the training data that has been used: Google is going to make more money from brands than tigers…

  8. anthonyhegedus Silver badge

    Artificial stupidity is much harder to fake than artificial intelligence.

    1. Blank Reg

      There's already more than enough real stupidity in the world, no need to go creating synthetic versions.

  9. Anonymous Coward
    Go

    So an advertising agency

    being fooled by subliminal advertising.

    How ironic.

  10. GingerOne

    It's a beta!

    The clue is in the title. Jeez people, chill. Google released a beta and some other independent people found a flaw. This is literally the whole fucking point of a beta!!!

    1. Swarthy

      Re: It's a beta!

      Except that this is Google. "Beta" is their word for "Release".

  11. Robert Carnegie Silver badge

    For advertisements

    There are a lot of advertisement videos where the product that the ad is actually for only appears for a few frames of the video, if at all. Not necessarily with cars - ads tend to show the car driving around, or sometimes unrealistically leaping like an acrobatic gazelle - but Google has probably decided that it is a car advert.

  12. John Smith 19 Gold badge
    Unhappy

    One day

    Law enforcement types will realize Person of Interest is not a documentary.

  13. Anonymous Coward Silver badge
    Boffin

    Alternative viewpoint

    It has actually correctly identified some data that had been concealed in an otherwise innocuous video.

    Imagine if terrists put bomb making tutorials into a film by interspersing 1 frame in 50... 49 frames of, say, Debbie Does Dallas to 1 frame of How To Make A Bomb. A 2 minute video embedded in a 2 hour video.

    This machine learning / AI / pattern recognition / whatever would pick that up, which is surely a good thing.

  14. Draco
    Pint

    Let's think about how Google might have implemented this.

    (1) Google already automatically classifies images, so it is reasonable to assume they would try to leverage / reuse their image classifiers.

    (2) Since video is simply a bunch of still images, it is reasonable to assume Google simply takes stills from the video and passes them to their existing (and trained) image classifiers.

    (3) It is pointless to process every-single-frame in the video because that would be prohibitively expensive and there really isn't much change from frame to frame.

    (4) Google, probably, selects only the key frames (I-frames) for classification. (Depending on computation cost, Google may drop key frames if they are similar to other key frames - why classify the same, or very similar image, over and over again. Of course, this depends on whether image classification is more expensive than comparing two key frames).

    (5) It should be obvious that every inserted image is an I-frame, so it WILL be classified.

    (6) Google has some algorithm (or neural net) that tries to boil down the contents of a video (several thousands or millions of images) into a single classification. Clearly, if you have film of someone walking about in a city, you will have cars, buildings, people, trees, etc. Google's classifier has to come back with a single answer. This is, probably, weighted by the confidence of the original classifications. Cars, buildings, laptops, and food plates probably have a higher level of classification confidence.

    I imagine that Google will, over time, tweak the final classification to give more weight to duration of a single classification rather than confidence of classification (or perhaps some admixture of the two).

    On the other hand, I could be blowing smoke since I have absolutely no idea how Google is doing this, but this is the way I would approach it (rough sketch below).

    (A pint, because we could spend hours arguing over how this is or should be implemented - or even if it should be implemented at all.)
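
    A back-of-the-envelope Python sketch of steps (2)-(6) above, purely speculative and in no way Google's actual pipeline: it assumes ffmpeg is on the path and that `classify_image` is some hypothetical still-image classifier returning a (label, confidence) pair.

```python
# Speculative I-frame pipeline: pull keyframes with ffmpeg, classify each
# still, then reduce to one label by summed confidence. Not Google's system.
import glob
import os
import subprocess
from collections import defaultdict

def classify_video(path, classify_image, workdir="/tmp/iframes"):
    os.makedirs(workdir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", path,
         "-vf", "select=eq(pict_type\\,I)",      # keep only I-frames (steps 4-5)
         "-vsync", "vfr", f"{workdir}/kf_%04d.jpg"],
        check=True)
    scores = defaultdict(float)
    for frame in sorted(glob.glob(f"{workdir}/kf_*.jpg")):
        label, confidence = classify_image(frame)  # hypothetical still classifier (steps 1-2)
        scores[label] += confidence                # confidence-weighted reduction (step 6)
    return max(scores, key=scores.get)
```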

    1. Anonymous Coward

      Re: Let's think about how Google might have implemented this.

      I would also be willing to lay money on this being an I-frame thing. It's a sensible, obvious optimisation to the ML pipeline, leveraging the work done by the video compression to extract the "key" information. Easily accounts for the surprisingly high levels of confidence in the predictions - the NN is going to "see" almost nothing but the inserted image.

    2. I am the liquor

      Re: Let's think about how Google might have implemented this.

      Good point. Although they've only changed 2% of the frames in the video, they'll have changed a lot more than 2% of the information.
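
      Back-of-envelope, taking the I-frame theory above at face value (all numbers are assumptions): if the encoder only drops a keyframe every 250 frames but every inserted still forces one, the inserted image ends up being most of what the classifier actually looks at.

```python
# Back-of-envelope: if only I-frames get classified, how much of what the
# classifier "sees" is the inserted image? All numbers are assumptions.
frames        = 5000   # ~3 minutes at 25 fps
gop_length    = 250    # assumed encoder keyframe interval
inserted_rate = 50     # one inserted still every 50 frames (the 2% discussed above)

natural_iframes  = frames // gop_length      # ~20
inserted_iframes = frames // inserted_rate   # ~100, if each insert forces an I-frame
share = inserted_iframes / (natural_iframes + inserted_iframes)
print(f"{share:.0%} of classified frames would be the inserted image")  # ~83%
```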

  15. Andrew Jones 2

    So.... the takeaway I get from this is that the system is designed to spot computers more accurately than anything else. The "AI" is clearly designed to seek out other possible AI friends...... Not chilling at all...... honest.....

  16. Blank Reg

    Troll Detector

    "AI to spot trolls is easily defeated by typos, deliberate misspellings of offensive words, or bad punctuation"

    So how about marking the posts with typos and bad grammar as trolling. Maybe then people will be forced to learn the difference between their, there and they're.

  17. FelixReg

    Did you assume my gender?

    We can't assume the purpose of the ML these guys used was to find the tiger. What if the purpose was to find the anomalous Audi? This is a general problem philosophers have been working on since forever.

    Think: optical illusions. Is it possible to build a machine that's without such illusions?

  18. JeffyPoooh
    Pint

    "A.I. is hard."

    To be fair, human eyes can be fooled by zillions of different optical illusions.

    Difference is, if we (humans) walked over towards "the Audi", we might realize something was wrong when it growled at us.

  19. JeffyPoooh
    Pint

    The amazing thing is this...

    It's an Audi.

    So it would have been four inches away from whatever it was following. Thus the Audi grill emblem would have been obscured by the vehicle in front, the one suffering 'The Full Audi'. Thus the AI determined it was an Audi just by the tiny gap to the vehicle in front. Therefore the Googly AI has been watching Top Gear reruns.

    Next conclusion is that Google self-driving cars can't see tigers.

  20. ReflectOnLight

    I think it worked correctly.

    If there was a picture of something different every x number of frames, then I would argue that there were two videos here: one of the tiger and one of the car, both being played back to the AI. That it classified the lower-frequency one correctly seems to indicate it worked somewhat correctly, given that it should give one answer. A human may not see the car because of the persistence of vision aspect of the human eye and its low bandwidth, but the AI would see it. And with the picture of the car being identical in all cases, see it better than it saw something that continually moved around and changed shape like the tiger. Effectively, the tiger was variable noise, the car a constant if infrequent feature.
