Your AI pet project is only as smart as its garbage training set

AI isn't immune to one of computing's most basic rules – garbage in, garbage out. Train a neural network on flawed data and you'll have one that makes lots of mistakes. Most neural networks learn to distinguish between things by sampling different groups. This is supervised learning, and it only works if someone labels the …

  1. Michael H.F. Wilkinson Silver badge

    Spot on!

    I have seen quite a few papers on deep learning methods that reach 99+% scores, but in many cases all that means is that the network is much better at faithfully reproducing the errors made by the person or persons drawing up the "ground truth". Getting reliable ground truth data sets is very hard indeed, especially for the hundreds of thousands of examples needed by deep learning in particular. Note that this does not mean that deep learning is the wrong approach per se; it is just that it is much harder to get a reliable ground truth if you need many, many examples. Simulation can certainly help, but it can be hard to simulate all of the deficiencies of your imaging system.

    The kind of feedback suggested could take the form of a curation process by which ground truths themselves can be amended when new data come out. We have sometimes found that methods for blood-vessel detection were penalised for finding faint vessels missed by the doctors drawing up the ground truth. What is needed is a process by which an expert reassesses the ground truth and, after due process, adds the missing features. I haven't seen any agreement in the scientific community on such a process for existing data sets, but it is certainly needed. Part of this reticence may stem from the fact that changing the labelling of an existing ground truth would mean having to re-run old experiments, which should be possible, but is an unwelcome chore.
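
A minimal sketch of the effect described above, using invented toy labels rather than any real vessel data set. The names `mimic`, `careful` and `amend` are hypothetical; `amend` stands in for the expert curation step, which in practice would involve review rather than a one-line fix.

```python
# Toy data, invented for illustration: 1 = vessel pixel, 0 = background.
true_vessels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# The annotator missed the two faintest vessels (indices 3 and 4).
ground_truth = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]


def accuracy(predicted, reference):
    """Fraction of positions where the prediction matches the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def amend(labels, confirmed_vessels):
    """Return a copy of labels with expert-confirmed vessels set to 1."""
    return [1 if i in confirmed_vessels else v for i, v in enumerate(labels)]


mimic = list(ground_truth)    # reproduces the annotator's omissions exactly
careful = list(true_vessels)  # also finds the faint vessels

print(accuracy(mimic, ground_truth))    # 1.0
print(accuracy(careful, ground_truth))  # 0.8

# After an expert confirms the two missed vessels, the ranking flips.
curated = amend(ground_truth, confirmed_vessels={3, 4})
print(accuracy(mimic, curated))    # 0.8
print(accuracy(careful, curated))  # 1.0
```

Against the flawed labels, the method that copies the annotator's mistakes looks perfect and the better detector is penalised; amending the labels reverses the ranking, which is also why old experiments would need re-running.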

    1. I.Geller Bronze badge

      Re: Spot on!

      The process is patented and described in all details.

  2. My Alter Ego

    Not that this is anywhere close to AI

    Back in 2013, when Bitcoin was a mere $200 (and before MtGox "lost" all its bitcoins), a couple of us in the office played around with trying to build a trading box. At first we thought about arbitrage between the various exchanges, but because of how long a transaction might take we nixed that fairly early on. We then had a look at trying to profit from the insane swings.

    We'd write our (insanely simple and non-learning) algorithm, and tune it on past data, and then run it on the live values. When it lost we tuned it again - rinse and repeat. It was an interesting process, a bit of fun, and no real money was involved.

    The main thing I took away from the experience was that "It's really easy to predict the past" (and that the price of Bitcoins is completely illogical and garbage)!
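
A rough sketch of why tuning on past data flatters a strategy. Everything here is an assumption made for illustration: the prices are a seeded random walk (so there is no pattern to find), and the one-parameter rule in `profit` is a made-up stand-in, not the commenter's algorithm.

```python
import random


def simulate_prices(n, seed):
    """Random walk: by construction there is no exploitable pattern."""
    rng = random.Random(seed)
    prices = [200.0]
    for _ in range(n - 1):
        prices.append(prices[-1] + rng.gauss(0, 5))
    return prices


def profit(prices, threshold):
    """Hold one unit for one step whenever the last move exceeded threshold."""
    total = 0.0
    for t in range(1, len(prices) - 1):
        if prices[t] - prices[t - 1] > threshold:
            total += prices[t + 1] - prices[t]
    return total


def tune(prices, candidates):
    """Pick the threshold that would have earned the most on these prices."""
    return max(candidates, key=lambda th: profit(prices, th))


prices = simulate_prices(400, seed=0)
past, future = prices[:200], prices[200:]
candidates = [x / 2 for x in range(-20, 21)]  # -10.0, -9.5, ..., 10.0

best = tune(past, candidates)
print("tuned threshold:", best)
print("profit on the past it was tuned on:", profit(past, best))
print("profit on the unseen future:", profit(future, best))
```

The tuned threshold is by definition the best of the candidates on the past data; nothing guarantees it does well on the second half, which is the only number that matters for live trading.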

  3. Anonymous Coward

    So what we are saying is

    that AI only works if you do all the work for it in advance.

    If that's the case, where is the actual AI? It's just a complex rules engine.

    Not only that, if inputs change (i.e. normal real life) and it suddenly finds out that, for example, cats can be marmalade, it is likely to become chaotic in its outputs.

    It's like dealing with a child that needs guidance and advice for years. Unfortunately the hype cycle does not recognise this need for ongoing checking and management...

    1. Anonymous Coward

      Re: So what we are saying is

      "If that's the case, where is the actual AI? It's just a complex rules engine."

      Pretty much. At the end of the day it is just software after all, albeit with the complexity in the learned weights rather than hard-coded programming. They're really little more than massive interlinked if-then decision trees based on summed weights, and I suspect in theory you could convert one into standard code, but in practice it would be next to impossible for all but the simplest toy examples.

      The other problem neural nets, and other methods, have is overfitting the training data, i.e. they simply learn to recognise the training data and not much else. Give them a training example and they'll be spot on; give them something they haven't seen before and they'll produce some rubbish as output. As I've only dabbled I can't remember how this is remedied, but obviously it is.

      Suffice to say it's an extremely complex field, most of the low-level details of which are beyond me and probably 95% of even good coders. While you often see comedy geeks with wacky haircuts and questionable t-shirts or slick smarmy bros discussing this stuff, one should not judge by appearances: those geeks and bros quite often have hardcore MScs and PhDs and, behind the facade, are borderline geniuses.
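
A toy illustration of overfitting and of the usual way it is detected: scoring on held-out examples the model never saw. The data and both "models" are invented for this sketch; real remedies (held-out validation, early stopping, regularisation, more data) work on the same principle of not trusting the training score.

```python
def label(x):
    """The true rule the models are supposed to learn."""
    return 1 if x >= 50 else 0


train = [(x, label(x)) for x in range(0, 100, 2)]  # even numbers
test = [(x, label(x)) for x in range(1, 100, 2)]   # odd numbers, never seen

# Model 1: pure memorisation of the training pairs, guessing 0 otherwise.
memory = dict(train)


def memoriser(x):
    return memory.get(x, 0)


# Model 2: a single learned threshold, halfway between the two classes.
largest_zero = max(x for x, y in train if y == 0)
smallest_one = min(x for x, y in train if y == 1)
threshold = (largest_zero + smallest_one) / 2


def threshold_model(x):
    return 1 if x > threshold else 0


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


print(accuracy(memoriser, train), accuracy(memoriser, test))              # 1.0 0.5
print(accuracy(threshold_model, train), accuracy(threshold_model, test))  # 1.0 1.0
```

Both models are perfect on the training set; only the held-out score reveals that the memoriser has learned nothing general.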

      1. I.Geller Bronze badge

        Re: So what we are saying is

        First, AI is a database, not software or an algorithm - it's data, where texts explain everything and are the means to concatenate it all into one mass.

        Second, you cannot convert a database, all the texts, into code. No point in that!

        "Natural-language processing technology — A.I. capable of understanding and acting on written or spoken prompts — could be used to turn an Ikea manual into steps for the robot instead of having someone code the instructions."

        https://www.inverse.com/article/43855-robots-build-ikea-chair

        Guys at Ikea understood!

        Language is much better code! May I remind you that all programming languages are formalized natural language?

        Third, neural networks contain relations between language patterns, annotations: we speak by contexts which determine our choice of subtexts. In other words, contexts are explicit while subtexts are implicit.

        For example, each word has its unique definition (subtext) which is dictated by its context. Thus neural networks help to organize subtexts.

        If you don't want rubbish you must get the right subtexts, annotations on each sign, image, number, word - these could be structured dictionary definitions or other texts.

        You shouldn't even try to manually code the complexity of language - just use it! You can because there is my patented way to structure it - see the US PTO on Ilya Geller.

    2. d3vy

      Re: So what we are saying is

      "Not only that, if inputs change (i.e. normal real life) and it suddenly finds out that, for example, cats can be marmalade, it is likely to become chaotic in its outputs."

      Exactly so. The "Hello world" of machine learning is number recognition. It's entirely possible to train a system with massive datasets of numbers to the point where it is very accurate at recognising numbers written in many different ways, but when fed a series of random pixels it will confidently tell you that it's a number 7...
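
A structural sketch of why this happens, assuming the classifier ends in a softmax over the ten digits, which is common but not stated in the comment. The weights below are random stand-ins, not a trained model: the point is only that the output always sums to 1 across 0-9, so there is no way to answer "not a digit" unless one is added. `classify_with_reject` is a hypothetical helper showing the simplest such addition.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def classify_with_reject(probabilities, min_confidence):
    """Return the top digit, or None if the model is not confident enough."""
    top = argmax(probabilities)
    return top if probabilities[top] >= min_confidence else None


rng = random.Random(0)
# Stand-in weights for a 10-class linear layer over an 8x8 image.
weights = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(10)]
# Random pixels: not a digit at all.
noise = [rng.random() for _ in range(64)]

logits = [sum(w * p for w, p in zip(row, noise)) for row in weights]
probabilities = softmax(logits)
predicted_digit = argmax(probabilities)

print("predicted digit:", predicted_digit)
print("claimed probability:", probabilities[predicted_digit])
print("with a 0.99 confidence floor:", classify_with_reject(probabilities, 0.99))
```

A confidence floor is only a partial fix, since a network can also be highly confident on noise; it simply shows that refusing to answer has to be designed in.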

      1. Korev Silver badge

        Re: So what we are saying is

        Unless that 7 is a one written by someone in Continental Europe...

        1. I.Geller Bronze badge

          Re: So what we are saying is

          If '7' is properly annotated, explained by texts - it won't be a problem. Growing up, we learn common sets of annotations for each case. So, if you know the context of '7' you can (more or less) automatically apply certain sets. For example, sets of dictionary definitions.

    3. I.Geller Bronze badge

      Re: So what we are saying is

      Exactly! An AI should be taught as a child, step by step and for many years. You, of course, can use texts of Dickens and upgrade them to an AI specialized in agriculture, but that also would take years.

  4. Pascal Monett Silver badge

    Well duh, what a surprise

    "The techies claim to have experienced a greater rate of accuracy using the game data to train their AI than relying entirely on the real-world stuff from CamVid."

    Game data is obviously great for AI training. You have a virtual world created by a computer used as training grounds for another computer. Advantages? No clutter, no useless noise, and faces are polygons with textures stretched on. No pimples, no puffiness under the eyes, no 5-o'clock beard. The only things shown are the things that have been calculated. No wonder it's easier for a statistical analysis machine (what we currently call AI) to recognize and classify.

    Unfortunately, Real Life (TM) is messier than that. Granted, it may be advantageous to train a not-AI on such data before turning it loose on actual, real images, but there's also a chance that we are just fooling ourselves into thinking that we are making this work. It's the Hall of Mirrors effect for the Mentats of Dune.

    1. I.Geller Bronze badge

      Re: Well duh, what a surprise

      Stupid, just idiotic!

      How much time do you spend finding common ground on a topic with your best friend? The same goes for AI: there is no way you can train an AI on absolutely unambiguous data; some unresolved issues will remain.

  5. I.Geller Bronze badge

    Look at language. How many synonyms does almost every word have? Different angles, twists, subtexts (implicit meanings)? How many twists do grammatical signs bring? Articles? Particles? When programming, you manually determine the same synonyms and subtexts, for tons of money. Why don't use natural language directly instead of programming?

    AI is entirely about natural language, which is a universal medium for all kinds of data.

    1. Anonymous Coward

      "Why don't use natural language directly instead of programming?"

      Because natural language doesn't map particularly well to computer hardware, is often vague and long-winded, and there are thousands of different ones. It was tried with COBOL, so instead of just writing a = b + c you had to write something like ADD B TO C GIVING A. How is that an improvement, especially if you don't speak English? And that is just a simple example; good luck trying to write

      hash[(j+i) % HASH_LEN] ^= (((byte >> (j % 8)) & 1) << (i % 8));

      in any natural language without it taking up half a page and being even less intelligible than what it is trying to replace.

      Natural language isn't always the best way to describe something, otherwise we wouldn't have pictures.
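
For readers who want the quoted line unpacked, here is a step-by-step rendering in Python. The value of `HASH_LEN` and the function name `mix_bit` are assumptions for the sketch; the original gives neither, nor the loop that surrounds the statement.

```python
HASH_LEN = 16  # assumed buffer length; the original does not give one


def mix_bit(hash_buf, byte, i, j):
    """Python equivalent of:
    hash[(j+i) % HASH_LEN] ^= (((byte >> (j % 8)) & 1) << (i % 8));
    """
    bit = (byte >> (j % 8)) & 1       # pick bit number (j mod 8) of byte
    moved = bit << (i % 8)            # move it to bit position (i mod 8)
    slot = (j + i) % HASH_LEN         # choose which hash slot to update
    hash_buf[slot] ^= moved           # XOR the moved bit into that slot


buf = [0] * HASH_LEN
mix_bit(buf, 0b00000100, i=3, j=2)  # bit 2 of the byte is set
print(buf[5])  # 8: the bit landed at position 3 of slot (2 + 3) % 16
```

Applying the same call twice restores the slot to 0, since XOR is its own inverse.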

  6. JWLong

    Here's my interpretation

    AI=Ain't Intelligent

    It's just a big fuzzy guessing game, and I'll be damned if I'll put my life in its hands!

    Mama didn't raise no fools. To date, I wouldn't trust A.I. any farther than I can toss a bus.

    YMMV

    1. I.Geller Bronze badge

      Re: Here's my interpretation

      AI database technology is real: see IBM Watson, and the AI system Craft from the University of Washington, the University of Illinois Urbana-Champaign and the Allen Institute for Artificial Intelligence. Perhaps some day AI as a being can be created...

  7. ecofeco Silver badge

    GIGO

    Same as it ever was.

    1. I.Geller Bronze badge

      Re: GIGO

      I spoke with Dostoevsky, 15 years ago:

      Question: All right, would you like to talk about moral issues?

      Answer: OK ...

      Question: May I ask you to spare some change?

      Answer:

      [2.8% Fyodor Dostoevsk_108]

      Esteeming, and so to say, adoring you, I may at the same time, very well indeed, be able to dislike some member of your family

      This is an absolute novelty. You can verify my claim - the technology is public. It is enough to structure 'Brothers Karamazov' and 'Crime and Punishment', annotate each word by dictionary definitions, expand search queries by synonyms, etc. You are welcome!

  8. Anonymous Coward

    Artificial Learning

    AI uses humongous data sets and we use trial and error.

    1. I.Geller Bronze badge

      Re: Artificial Learning

      Depends what you want: a personality or encyclopedia. The personality answers Definition and Factoid questions (NIST TREC QA), while the encyclopedia (for example IBM Watson) - Factoid mostly.

  9. cemery50

    The more the merrier and goofs can get extreme

    AI improves with age and quantity... adding massive proven sets together improves everything.

    The costs and extreme nature of possible errors require an effective measure of human oversight to check for extremes and correct the data balances.

    1. I.Geller Bronze badge

      Re: The more the merrier and goofs can get extreme

      You can improve an AI by teaching it 24/7/365 - and in, let me guess, 3-5 years, the AI may be more or less ready.

      And don't forget we are social animals and are controlled our whole lives.

  10. I.Geller Bronze badge

    AI is entirely about natural language processing, which is confirmed by Brin:

    "When we started the company, neural networks were a forgotten footnote in computer science; a remnant of the AI winter of the 1980’s. Yet today, this broad brush of technology has found an astounding number of applications. We now use it to:

    understand images in Google Photos;

    enable Waymo cars to recognize and distinguish objects safely;

    significantly improve sound and camera quality in our hardware;

    understand and produce speech for Google Home;

    translate over 100 languages in Google Translate;

    caption over a billion videos in 10 languages on YouTube;

    improve the efficiency of our data centers;

    suggest short replies to emails;

    help doctors diagnose diseases, such as diabetic retinopathy;

    discover new planetary systems;

    create better neural networks (AutoML);

    ... and much more."

    https://abc.xyz/investor/founders-letters/2017/index.html

    And as you all know, I created Differential Linguistics, the only novelty in all of Philosophy of Language ever, the patented results of which are the foundation of AI, Google and the whole Internet.

  11. Anonymous Coward

    Better than better

    How do I know the quality of those off-the-shelf data sets? And if everybody else is using them, do I need something better to usurp their off-the-shelf AI outcome in my own product?

    It's too late if you plugged in a low-quality or crap data set, only to find out about it after the product is released. Those products will appear and should earn a rating of DF, "Dumb Foundry", not AI.

    RIP: Wilf Hey, originator of the saying "Garbage in, garbage out" (journo and all-round smart guy at PCPLUS Magazine).

    1. I.Geller Bronze badge

      Re: Better than better

      Try your luck, keep hope and be persistent.
