Inside the 1TB ImageNet data set used to train the world's AI: Naked kids, drunken frat parties, porno stars, and more

ImageNet – a data set used to train AI systems around the world – contains photos of naked children, families on the beach, college parties, porn actresses, and more, scraped from the web to train computers without those individuals' explicit consent. The library consists of 14 million images, each placed into categories that …

  1. lglethal Silver badge
    Go

    Trust of black box systems is overrated

    By erasing them completely, not only is a significant part of the history of AI lost, but researchers are unable to see how the assumptions, labels, and classificatory approaches have been replicated in new systems, or trace the provenance of skews and biases exhibited in working systems."

    If you can't understand your black box, then maybe you shouldn't be trusting the results quite so completely...

    1. Khaptain Silver badge

      Re: Trust of black box systems is overrated

      "By erasing them completely, not only is a significant part of the history of AI lost"

      False: New images can be made/found which have full prior consent.

      These bastards want to make money but they don't want to take the risk of having to pay anything up front... Greed, the underlying factor behind most crime/deceit.....

      1. juice Silver badge

        Re: Trust of black box systems is overrated

        > False: New images can be made/found which have full prior consent

        That's not the point. The issue being flagged up here is that without the original data sets, it'll be impossible to recreate the AI systems which were trained on them. Which in turn means it's impossible to assess how much of a part the excised images played when it comes to their processing capabilities and/or potential biases thereof.

        Admittedly, there is a rebuttal to this, in that you could just retrain the AI with the new data sets and then compare/contrast. But this costs time and money, and by its very black-box/random-weighting nature (combined with the fact that the new system will probably be trained on different/newer hardware), AI training isn't guaranteed to be reproducible.

        Equally, it's debatable how useful the original material is when it comes to dissecting AI logic paths, given how many iterations and weighting actions occur, and how many layers are in the AI black box.

        But still, there's definitely some merit to the concerns being raised!

        1. jmch Silver badge

          Re: Trust of black box systems is overrated

          "The issue being flagged up here is that without the original data sets, it'll be impossible to recreate the AI systems which were trained on them"

          I don't think that follows at all, in fact quite the opposite. If instead of a million 'bikini' pictures gathered without the individuals' consent, I use a different data set of another million 'bikini' pictures gathered with the individuals' consent, all other things being equal* I would expect AI trained on the second data set to behave substantially the same as one trained on the first data set.

          If the results of the AI are so strongly linked to the exactness of some or all images** in the data set, that AI is frankly not fit for purpose. As you mention, the real issue here is time (which ultimately boils down to money) and money. Big AI-slingers will happily charge top dollar for the final product but are unwilling to pay for proper (and properly sourced) data sets.

          *i.e. the range of differences in skin tones, bikini colours, angle and distance of shot, and lighting is broadly the same across the two data sets

          **could be images, videos, data records etc depending on the AI, principle doesn't change

          1. Nick Ryan Silver badge

            Re: Trust of black box systems is overrated

            It's a good point that the end result of the training should be the same; however, one also needs to take into account selection bias. If the only images are those where the subject has explicitly consented to their image being used in neural net training, this will almost certainly skew the selection of available images. For example, they are likely to be higher-quality images of individuals who are more comfortable having their photo taken while wearing a bikini.

        2. FozzyBear Silver badge
          Alien

          Re: Trust of black box systems is overrated

          it'll be impossible to recreate the AI systems which were trained on them

          If the system created is so highly dependent on the original data set then the model is useless. In fact, any model worth a dime should be able to process a substituted data set without any statistically significant deltas.

          If your results are statistically skewed, then you need to revise and recheck everything (your assumptions, premise, model and datasets), as something is completely wrong.

    2. Muscleguy Silver badge

      Re: Trust of black box systems is overrated

      In science, black boxes (where you can measure input and output but have little or no knowledge of what is going on inside) have yielded important insights, but only because we don't stop there. The classic example is the kidneys, which filter blood and process the filtrate, recovering some things under some circumstances and returning them to the bloodstream, and excreting others.

      The kidney was first probed as a black-box system: you measured blood and urine under different circumstances (over-hydration, under-hydration, salt etc.) and then inferred function from the results.

      Then along came the New Zealand White rabbit strain, which keeps its proximal convoluted kidney tubules just under the surface. This enabled them to be carefully removed and hooked up to tiny tubes so both sides could be anything you wanted.

      The first guy who managed this called his PhD supervisor over to look at the first perfused tubule, and his supervisor promptly racked the objective lens on the scope right through it. But if you can do it once, you can do it again. From that start we have learned a lot.

      The problem with modern AI systems is almost nobody probes the Black Box to ask how it is working. Bad Science.

      1. WallMeerkat

        Re: Trust of black box systems is overrated

        Same in testing, you're given a system and requirements and poke it, but no implementation detail.

        Whereas if you are given access to internal APIs it becomes more grey box testing.

        And unit testing is white box, i.e. it implies knowledge of the code.
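        A toy sketch of that distinction (the `Cache` class here is hypothetical, purely for illustration): the black-box test exercises only the public contract from the requirements, while the white-box test asserts on internals because it knows the implementation.

```python
# Hypothetical system under test: a tiny cache with an eviction limit.
class Cache:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self._store = {}  # internal detail, not part of the contract

    def put(self, key, value):
        # Evict the oldest entry once the capacity is reached
        if len(self._store) >= self.capacity and key not in self._store:
            self._store.pop(next(iter(self._store)))
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

# Black-box test: only the public interface and the requirements.
def test_black_box():
    c = Cache(capacity=2)
    c.put("a", 1)
    assert c.get("a") == 1          # stored values come back

# White-box test: pokes at internals, so it implies knowledge of the code.
def test_white_box():
    c = Cache(capacity=2)
    c.put("a", 1); c.put("b", 2); c.put("c", 3)
    assert len(c._store) == 2       # eviction kept the size bounded
    assert "a" not in c._store      # and dropped the oldest entry
```

        The white-box test is more precise but brittle: change the eviction policy and it breaks even though the contract still holds.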

      2. Michael Wojcik Silver badge

        Re: Trust of black box systems is overrated

        The problem with modern AI systems is almost nobody probes the Black Box to ask how it is working.

        "Almost nobody?" This is a very active area of research.

        Here's one post from The Morning Paper describing half a dozen papers on the topic.

  2. John G Imrie Silver badge
    Trollface

    AI Learning

    Can't they train an AI to filter out appropriate images?

    1. Flocke Kroes Silver badge

      Re: AI Learning

      What will you use for training data?

      1. Wellyboot Silver badge

        Re: AI Learning

        They said MTurk has an >>>automatic quality control system in place to filter out spammers and problematic images<<<

        Can we see that specific training set now and ask how it was created?

        1. Flocke Kroes Silver badge

          Re: AI Learning

          I thought they used MTurk. Give a few images to several people and if one of them consistently gives different answers then fire that turk and delete the data he supplied.
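          That consensus check can be sketched in a few lines (the worker and image names below are invented for illustration): take the majority label per image as "truth", then drop any turk whose agreement rate with the majority falls below a threshold.

```python
from collections import Counter, defaultdict

def keep_workers(annotations, min_agreement=0.7):
    """annotations: iterable of (worker, image, label) triples.
    Returns the set of workers whose labels agree with each
    image's majority vote at least min_agreement of the time."""
    # Majority ("consensus") label per image
    votes = defaultdict(Counter)
    for worker, image, label in annotations:
        votes[image][label] += 1
    majority = {img: c.most_common(1)[0][0] for img, c in votes.items()}

    # Each worker's agreement rate with the consensus
    seen, agreed = Counter(), Counter()
    for worker, image, label in annotations:
        seen[worker] += 1
        agreed[worker] += label == majority[image]
    return {w for w in seen if agreed[w] / seen[w] >= min_agreement}

# Three honest turks label both images "cat"; one labels them "dog".
annotations = [(w, img, "cat") for w in ("t1", "t2", "t3")
               for img in ("img1", "img2")]
annotations += [("spam", "img1", "dog"), ("spam", "img2", "dog")]
# keep_workers(annotations) returns {"t1", "t2", "t3"}
```

          The obvious weakness is that it trusts the majority: if spammers outnumber honest turks, the "consensus" label is itself garbage.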

          1. katrinab Silver badge

            Re: AI Learning

            That would be a problem if spammers are the majority of the turks, which they probably are, or if someone else programmed their own bot to make lots of money from it.

        2. John Smith 19 Gold badge
          FAIL

          Oh look. AI rediscovers Garbage In --> Garbage Out

          Let me guess

          "We had to get the training set from somewhere"

          And boy they weren't picky where.

          I sense something like the issues in the music industry when sampling first started and people said it wasn't violating copyright to take a chunk of some other record to make theirs.

          It's not your copyright. It's someone else's. They should decide (as they do in music now) whether you can or cannot take that copy.

      2. Sir Runcible Spoon Silver badge
        Coat

        Re: AI Learning

        inappropriate images of course :)

      3. Filippo

        Re: AI Learning

        That's a good point. An AI that can detect illegal porn would be useful, but I have no idea how one could possibly train it and test it.

        1. Ken 16 Silver badge
          Trollface

          Re: AI Learning

          If there's a research grant associated, I'm willing to put in time to answer that question.

    2. Anonymous Coward
      Anonymous Coward

      Re: AI Learning

      Quis comitatu ipsos lanistis? (roughly: who will train the trainers themselves?)

  3. cb7

    Pictures of bikini clad women

    “At first I found it amusing, and I decided to look through the data set”

    What he really meant:

    At first I found it titillating, and I decided to look through the data set

    1. This post has been deleted by its author

    2. Muppet Boss

      Re: Pictures of bikini clad women

      “I was trying to generate pictures of bicycles using BigGAN,” ... "Instead, however, his code conjured strange flesh-colored blobs that resembled blurry, disfigured female bodies."

      They call it acid.

      https://www.theguardian.com/artanddesign/gallery/2019/aug/06/graphic-history-first-lsd-trip-brain-blomerth-bicycle-day-in-pictures

      1. Teiwaz Silver badge

        Re: Pictures of bikini clad women

        And he didn't call it a Bicycikini and go straight to marketing?

  4. LDS Silver badge

    The AI feet of clay....

    So they actually train their systems on a flawed dataset they created in the cheapest way? The "old" adage "garbage in, garbage out" is still valid, and this way of creating software is highly unethical and dangerous. If one of the biggest problems in creating a viable and reliable AI is starting from a reliable dataset that will remain available in the centuries to come, that's what you have to address first. If you believe you won't ever be able to build it, well, your whole project is flawed and doomed.

    1. phuzz Silver badge

      Re: The AI feet of clay....

      No, ImageNet supply this dataset for other people/groups/companies to train their AI(s) on.

      You're right that it's flawed and cheap, but they sold it anyway.

    2. Anonymous Coward
      Anonymous Coward

      Re: The AI feet of clay....

      "if you believe you won't ever be able to build it, well, your whole project is flawed and doomed."

      Unfortunately, while something may be flawed and doomed, I don't believe it will be these projects.

      Once AI data is being used for making life-altering decisions (e.g. health-based), then in 20+ years we will look back and realise the effects it has had were frightening. Just hope you fit in the non-flawed parts of the data...

      That's the path most other advances in human knowledge have taken so why should AI be different?

  5. Flocke Kroes Silver badge

    Just put everything in the terms and conditions

    People will literally agree to sacrifice their first born son without reading a terms of service agreement. I would like to believe that dreadful consequences would get a few people to read and consider not agreeing but so far the evidence is against me.

    1. Thoguht Silver badge

      Re: Just put everything in the terms and conditions

      I don't think that's the quote you really want, because that doesn't mention sacrifice at all; it's more like a dedication, after which you bought them back again as specified in Exodus 13:13. Maybe Genesis 22 for Abraham and Isaac? Except he didn't actually sacrifice his son; Jephtha is the real deal in Judges 11 (if daughters are OK instead of sons).

      1. Flocke Kroes Silver badge

        Re: Just put everything in the terms and conditions

        Actually I was looking for EULAs with ridiculous conditions and that is what came up top of the list. I was so surprised that I checked other bible sources to be certain I had not come across a hoax. Thanks for the Exodus 13 reference - you get to buy back a first born donkey with a lamb, but if you are short of lambs you have to break the donkey's neck. Buying back a son is a requirement but if you are short of cash, sons get the same treatment as oxen and sheep (anyone know what happens to them?). The idea of buying back God's stuff from a priest is interesting. Presumably the priest takes a commission and passes the rest on to God. How does that work?

        IIRC Abraham/Isaac was a one-off demand. Jephtha made a promise that was not part of the standard terms and conditions (and there is no evidence that God agreed to the deal), so it does not really fit here. This is the link I was looking for when I got distracted.

        1. Dave314159ggggdffsdds

          Re: Just put everything in the terms and conditions

          Like most apparently rather contentious bits of the Bible, the translation is rubbish. It's actually talking about 'dedicating' rather than 'sacrificing'.

          https://en.m.wikipedia.org/wiki/Pidyon_haben

          The Abraham/Isaac sacrifice story is interesting because there's another interpretation which is unpopular with religions. Rather than it being a parable about Abraham being so dedicated to god that he'd sacrifice his own child, and god rewarding him for it by cancelling the request, the story can be read as a parable about not doing things you know are wrong just because you think god told you to.

    2. daflibble

      Re: Just put everything in the terms and conditions

      Well, looks like some people are trying this with the new Brexit deal ; )

    3. Benson's Cycle

      Re: Just put everything in the terms and conditions

      It was a pretty unequal contract: milk, honey, and a very fought-over part of the Middle East in exchange for having your backsides kicked on a regular basis. God is not dead, but is working in the Oracle contracts department.

    4. Allan George Dyer Silver badge

      Re: Just put everything in the terms and conditions

      Were you looking for F-Secure's experiment?

  6. chivo243 Silver badge
    Big Brother

    Before we start mucking with AI

    The human race needs to get/grow/acquire some intelligence. If we keep training AI the way we are, AI will come to its own conclusions about the human race. Skynet, anyone?

    1. Anonymous Coward
      Anonymous Coward

      Re: Before we start mucking with AI

      And what if Skynet's actually right and has the data to back it up?

      1. 404 Silver badge

        Re: Before we start mucking with AI

        /run hide.exe

    2. Nick Ryan Silver badge

      Re: Before we start mucking with AI

      Not the remotest chance of any of that. All this data is being used for is the training of quite primitive (but still very "clever", just not in an intelligent way) neural networks.

      The media fixation and sales-droids' lies about everything being AI are just that: lies. At most, they use a small measure of carefully curated machine learning; most just have improved algorithms. However, in weasel-speak these are "AI".

      The most extensive neural network setup created so far (i.e. the closest we are to AI) would lose a battle of wits with a fruit fly. While operating somewhat slower. And not being as flexible. And requiring a quite terrifying amount of specialist hardware to run. A fruit fly is somewhat more efficient on the energy front too.

      This doesn't mean that we shouldn't think about ethics now of course.

      1. Michael Wojcik Silver badge

        Re: Before we start mucking with AI

        It would beat that fruit fly, and anyone else, at Go.

        It's also worth noting that AlphaGo Zero was trained entirely through self-play reinforcement learning, with no data from games involving humans: just the rules and games it played against itself. And its style of play has been characterized as "alien"; it doesn't play like any known human master.

        And AGZ is not terribly large in terms of its hardware use, network depth, etc. The AGZ machine itself was expensive (though it'd be cheaper today), but Google's generalized AlphaZero - which also plays chess and shogi at superhuman levels - runs on commodity hardware, as does the open-source implementation Leela Zero.

        Frankly, the fruit-fly bar is too low, except for the efficiency metrics. A fruit fly's reactions can likely be simulated in real time by a modest DNN.

        Of course we're still far away from simulating human capabilities in general, and whether there's a qualitative difference between human cognition and conventional computing machines remains an open question. (John Searle said no, minds are just the effects of biological machines; Roger Penrose says yes, minds are formally more powerful than Turing Machines. Neither had particularly convincing arguments to make, though.) The connectivity of the human brain is enormous and organized into different types of specialized structures at different levels, and other organs and the environment also strongly contribute to cognition. We don't have a mechanical system that comes anywhere near that.

        But a fruit fly? That we can do. Hell, it's been ten years since IBM claimed the simulation of a cat brain (Modha et al). That was strongly criticized by some, such as Markram; but even Markram described the IBM simulation as being closer to par with an ant brain. In the decade since, ANN architectures have only gotten larger and more capable.

  7. Anonymous Coward
    Anonymous Coward

    But Officer, about these pictures on my PC...

    I was only training my AI system...

    1. Korev Silver badge
      Windows

      Re: But Officer, about these pictures on my PC...

      "That money was just resting in my account"

      1. Zippy´s Sausage Factory

        Re: But Officer, about these pictures on my PC...

        Didn't stop you winning a Golden Cleric though, did it?

  8. tiggity Silver badge

    Proper consent

    Proper consent should be obtained for images. I know some of mine were on there (and assume they still are) and I'm happy with that, as I set appropriate licences on Flickr for the images I was happy to make available to anyone (which is definitely NOT all my Flickr photos: anything with people in it is private; just landscapes, plants, animals etc. are available).

    Though I only found out by accident (as used some of the data set ages ago).

    Some entities have asked permission for my images (even though it's implicit in the licence), which is always appreciated, purely because you know it's useful to someone.

    There are enough image storage sites that allow people to set usage terms on their images that there should be no need for rights-abusing scraping.

    No issue with (assuming informed consent given) "offensive" images, as these abound and a training set should include them. More concerned about poor labelling of objects in the images given use as a training data set.

    1. Mark 85 Silver badge

      Re: Proper consent

      If "proper consent" were legally required and had to include specifics, sites like FB, etc. wouldn't be able to exist. Maybe that's not a bad idea?

  9. Anonymous Coward
    Anonymous Coward

    Race labels

    There is a benefit to including labels that identify a person's race, since without them an AI will default to working best for the majority and minorities will get poorer service. Once you have demographic-based labels it's easy to calculate an error rate that shows how badly the system performs for each demographic individually.
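    That per-demographic error-rate calculation can be sketched like this (the labels, predictions and group names below are made up for illustration):

```python
from collections import defaultdict

def error_rates(y_true, y_pred, groups):
    """Overall error rate plus a per-demographic breakdown."""
    totals, errors = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        errors[group] += truth != pred
    overall = sum(errors.values()) / sum(totals.values())
    per_group = {g: errors[g] / totals[g] for g in totals}
    return overall, per_group

# Hypothetical data: a healthy-looking 20% overall error rate hides
# a 50% error rate on the smaller group "B".
y_true = [1] * 10
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 0, 1]
groups = ["A"] * 8 + ["B"] * 2
overall, per_group = error_rates(y_true, y_pred, groups)
# overall == 0.2, per_group == {"A": 0.125, "B": 0.5}
```

    Without the group labels, only the flattering overall number is visible; the disaggregated numbers are what expose the disparity.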

    I'm less convinced on letting people use their choice of labels, but I'm also wary of allowing a small privileged group to dictate which are 'real' races. *CoughUyghursCough*

    1. Mark 85 Silver badge

      Re: Race labels

      Have we really become so politically correct that race can't be mentioned? Or sex? It would appear that we have.

      1. Crazy Operations Guy Silver badge

        Re: Race labels

        The problem wasn't mentioning race or sex, but using slurs to refer to a person's race or sex. There is a massive difference between the two.

      2. Anonymous Coward
        Anonymous Coward

        Re: Race labels

        Political correctness is not the problem, power is, the power to define what is correct.

        You can mention XyZ, but only in exactly the correct way as 'we' dictate today. Tomorrow when the correct way is changed, 'we' will use your past utterances against you. As 'we' did before yesterday. And will again repeatedly. Sounds like some book I read once.

  10. Anonymous Coward
    Anonymous Coward

    445 – bikinis – rather than the bicycles in 444

    Possibilities are endless, e.g. 1501 is nudes, while 1502 is nukes. What could possibly go wrong...

    1. Blane Bramble

      Re: 445 – bikinis – rather than the bicycles in 444

      Prepare for MAD

      Mutually Assured De-clothing

    2. Scott 53

      Re: 445 – bikinis – rather than the bicycles in 444

      Hey Vladimir - send nukes.

    3. Korev Silver badge
      Mushroom

      Re: 445 – bikinis – rather than the bicycles in 444

      Or you could even send nukes to Bikini Atoll

    4. Fruit and Nutcase Silver badge
      Coat

      Re: 445 – bikinis – rather than the bicycles in 444

      An easy mistake to make -

      The classic bicycle frame is effectively three triangular sections (two either side of the rear wheel, joined to the main triangular section).

      A bikini is made of triangular pieces of cloth

      In the interest of further research, you are invited to study this picture carefully...

      https://regmedia.co.uk/2008/05/21/eee_girl_1.jpg

      looks like a bikini "bottom"

    5. Anonymous Coward
      Anonymous Coward

      Re: 445 – bikinis – rather than the bicycles in 444

      Are we believing that he meant to type 444, and somehow his finger slipped and hit 5?

      1. Charles 9 Silver badge

        Re: 445 – bikinis – rather than the bicycles in 444

        Quite easily, especially if you have big fingers and so are prone to double-striking (I speak from experience).

  11. Baldrickk Silver badge

    That, in itself, is a lesson for us all: our data released or shared today may well be used for wildly unexpected purposes tomorrow.

    There is an episode of Dave Gorman's Modern Life is Goodish that covers this really well.

    It starts by getting a volunteer to agree to have a photo taken for any purpose, for a small fee. Then it asks how much people would want for a specific purpose; in this case, I think, advertising for sexually transmitted disease medication or something like that. People were more concerned about that and wanted a higher amount, despite it being something covered by the original terms for the first photo...

    Enlightening.

    1. Anonymous Coward
      Anonymous Coward

      And then there's home DNA testing

      For a chance to find out you're 80% Celt, you've given away full rights to your genome.

      1. Antron Argaiv Silver badge

        Re: And then there's home DNA testing

        In my case, in a few years, I'll be beyond caring who sees my genome.

  12. Muppet Boss

    Looks like this restricted-access scientific research database was not politically correct enough for our politically correct times.

    Hope the PC kids do not learn about medical image datasets.

    Now they are addressing diversity and equality in what was scraped from the public Internet. Good luck!

    http://image-net.org/update-sep-17-2019

  13. RuffianXion

    This won't end well

    So by using an image set that was categorised via MTurk, researchers effectively got a bunch of randos from the internet to train their AIs? That was never going to end well was it?

    1. eldakka Silver badge

      Re: This won't end well

      got a bunch of randos from the internet

      I resemble that remark!

  14. phuzz Silver badge
    WTF?

    "The ImageNet team refused to give The Register access to the data set when we asked. Instead, a spokesperson told us on behalf of the team that the library was unavailable due to “maintenance reasons.”

    Another source within the industry, however, allowed us to pore over the full ImageNet library"

    Hilarious, as if they somehow had a 'maintenance' issue just when the press came calling.

    1. Pascal Monett Silver badge

      Well, they couldn't actually say they were deleting the images, now could they ?

  15. alain williams Silver badge

    How did I learn to recognise things ?

    The way that all humans do (to keep it simple I'm only talking about recognising humans):

    * I saw my parents, relations, friends, ... I remembered these images (in whole/part/...)

    * I went outside and saw many people's faces & body parts; some of those I remember. None of them gave me permission to remember them

    * I saw people and 'classified' them as examples of: men, women, children, ... people who are: white, black, Indian, Chinese, ...

    * I saw plenty of inappropriate images, some in person, some in publications. Some I still remember (here: details I redact)

    The expectation is that AIs should have the understanding of a mature human. How can they gain that unless they are exposed to what I was exposed to, and learn from it?

    Questions:

    * I wish to train my AI to understand recent films and modern books. Most of these have some copyright notice that forbids storing them in a 'mechanical or electronic system' (or similar words). Does this mean that I cannot show them to my AI? If so, can I view them myself? After all, my neurons work on electricity, and the difference between my brain and an AI is getting smaller & smaller.

    * If an AI is only exposed to images/... where explicit permission has been granted by the subject: will that result in a biased data set?

    * Should the 'experience' that goes into training an AI be regarded as private to the AI and thus beyond the reach of copyright, appropriateness considerations, etc ? Surely what is important is not having the AI reproduce these images in a way that identifies/embarrasses the original subject.

    * Even the last point is arguable: political cartoonists frequently draw embarrassing but identifiable pictures of the likes of Boris. No one says that they should not, so should not an AI be allowed to do similar ?

    In any discussion of this can we please ban the word 'obvious' (& synonyms), any assumptions should be made explicit & clear.

    1. Anonymous Coward
      Anonymous Coward

      Re: How did I learn to recognise things ?

      It's all about context.

      When you see people as you grow up, walking around town etc., there's generally some benign context.

      Your parents are with you, you're wearing a school uniform, you're sat in a cafe people-watching.

      But you could be sat in front of a fashion catwalk, or at a nudist beach, or outside a children's school playground.

      The latter sounds potentially ominous, especially if you're not a parent of a child at the school.

      However, if we switch context again, you could be a policeman. Very welcome now, but outside some cafes in London a policeman hovering around would severely dent the number of customers.

      The issue with the AI is that we don't have context.

      It's all very well to be scanned by some fashion shop's "machine learning/AI" for retailing purposes, like categorising your clothing taste, but not when it's going into a terrorist watch database used to secure Parliament or protect the borders of the USofA.

  16. Pascal Monett Silver badge

    "some of the labels used to describe them are biased and racist"

    Now that is a real problem. You can argue that images posted on the Internet are public, but you absolutely cannot argue that the guy labeling images made a slip of the keyboard. If image labels are racist it's only because some asshole in charge of classification was a racist.

    No wonder facial recog systems are acknowledged as being biased against non-white people. If any random AI training project has racists labeling pics then it would seem quite difficult to have any AI project that only has non-racist people handling the data.

    To think we're in the 3rd millennium. Seems like we'll need a few more millennia before Humanity actually becomes intelligent.

    1. Anonymous Coward
      Anonymous Coward

      Re: "some of the labels used to describe them are biased and racist"

      If image labels are racist it's only because some asshole in charge of classification was a racist.

      Might want to take thirty seconds to identify the woman of color you just slandered before saying things like that.

      No wonder facial recog systems are acknowledged as being biased against non-white people. If any random AI training project has racists labeling pics then it would seem quite difficult to have any AI project that only has non-racist people handling the data.

      AI is, by default, biased against minorities. That's maths. If you don't single out groups for special attention then it will try to get the best results possible for the highest number of people. Thus it will prioritise larger groups. Avoiding this is the big challenge of AI research.

  17. Crazy Operations Guy Silver badge

    If only there were companies that had already solved those problems...

    There are dozens upon dozens of companies that are sitting on top of literally billions of photos that have already been cataloged, tagged, and the people pictured have given full consent for their image to be reused in diverse contexts.

    It would have been fairly cheap for ImageNet to just get in touch with the likes of Getty Images or ShutterStock and negotiate an "academic redistribution" license, then feed those images to Mechanical Turk to produce the bounding boxes and normalize the tags.

    As for the CSA images, those should be split into another dataset that is strictly controlled. I figure such a library could be managed by a law enforcement agency like the FBI, which would acquire the images by having victims / parents of victims sign a contract allowing the images to be used in such research, like they do when getting permission to reuse confiscated material in sting operations. Maybe pair it with a library of equivalent images reproduced entirely with adults, so as to remove as many variables as possible when learning to differentiate between legal and illegal images (like avoiding that 'skin cancer' AI that was making decisions based on whether there was a ruler in the image rather than the appearance of the skin blemish).

    1. Snowy Silver badge
      Facepalm

      Re: If only there were companies that had already solved those problems...

      Except both Getty Images and ShutterStock have been alleged to be selling images they have no right to sell.

      1. Crazy Operations Guy Silver badge

        Re: If only there were companies that had already solved those problems...

        Yeah, but that pushes the legal problems to them and away from ImageNet. If Getty provides an illegal or unauthorized photo, it's on Getty, from a legal perspective. At the least, it would give ImageNet the ability to sue Getty for providing a faulty product if ImageNet uses a bad image.

  18. Anonymous Coward
    Anonymous Coward

    I have over a terabyte of photos of my cat if anyone is interested.

    They mostly consist of him asleep on the keyboard of my laptop to keep warm.

    He once even managed to create a folder in ~/ using the terminal emulator.

    (I must have interrupted him before he could encrypt it)

  19. 2Nick3

    ...little incentive...

    “The data set creators allow some of these 'contaminants' to remain simply because there is little incentive to spend the resources eradicating them all and they have minimal overall effect on the training of machine learning models.”

    In other words they didn't think they would have to do it and now that the data set is created it's really really hard to fix so they don't want to. Nice.

    And the "effect on the training of machine learning models" is irrelevant to the privacy concerns.

  20. Fruit and Nutcase Silver badge
  21. BinkyTheMagicPaperclip Silver badge

    Why? 'We agree that inappropriate images should not be in the data set'

    There's nothing wrong with 'inappropriate images' being in a data set. It's not too much of a stretch to say that pr0n could be needed for some AI purposes (yes, yes, might have an interesting time getting funding for that).

    What *is* wrong is stealing the copyrighted material. Pr0n, just like Flickr and the other sites, is there for you to appreciate and usually to push you towards buying a product. It is not placed there for use elsewhere.

  22. Mike 137 Bronze badge

    Think positive

    <sarc>Using this data set will at least ensure we won't find the outputs of AI incomprehensible as some fear. Training on it will ensure AI "thinks" like the majority of the human population.</sarc>

  23. Anonymous Coward
    Anonymous Coward

    Stupid is as Stupid Does

    Two of the most smug, snobby institutions are responsible for this sophomoric blunder.

  24. TeeCee Gold badge
    Facepalm

    In the near future....

    I see an article about an AI programme that's gone comprehensively titsup.

    Why? Well the researchers thought they were training it to recognise ${thing} but it took the easy option, recognising the permission watermarks in the test pics of ${thing}.

    NB: This (or something very like it) has happened before.

  25. hatti

    Pron

    No m'lud, I wasn't watching pron, I was training my AI system.

    It's a dirty job, but someone's got to do it.

  26. Dave Bell

    Shouldn't we be learning too?

    Obvious problem: some of those images sound like they would be illegal to possess in the UK and, very likely, in the USA. The details of the law do differ.

    But some of the legal-but-questionable images might be more useful for testing, rather than training. From the description of the system, I can see how some images could easily fit multiple labels. Is it so hard to imagine a picture of a pretty girl wearing a bikini and riding a bicycle?

    Maybe this set of images should be retired (not lost) and a new, better-organised set made for training.
