Bots half all web traffic

Just under half of all web traffic is bots and crawlers, according to a new report. DeviceAtlas analyzed all the data received from its customers (the company provides analytics on website visitors) and concluded that 48 per cent of the "people" landing on websites were in fact search engine crawlers, content scrapers and …

  1. Captain DaFt

    OK, so if bots are half of all internet traffic, and torrents are half of all internet traffic, and streaming is half of all internet traffic... either someone is bad at percentages, or estimating the size of the internet... or maybe, just maybe, "Half" is the big scare word when you're against something?

    1. Dadmin

      Exactly, those bots are making our searches faster and more relevant! If the site owners don't want all that extra traffic, then "txt the robots" and say: "Do not include us in the future of searches. Yes, we are doofuses. Thanks!"

      Seriously though, the site owners love the extra traffic, it boosts their stats somewhat, at least until a competent traffic report shows unique visitors. :P

      And the spider-people are NOT going to share their crawler data, let alone methods to quickly index and online it. This is the way of things. Hope your Friday is going well! All my co-workers left for the day and I have some 2 more hours to do stuff.
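For anyone wondering what "txt the robots" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot names and the paths are all made up for illustration, and only well-behaved crawlers honour these rules at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot name and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler would ask these questions before fetching each URL.
print(parser.can_fetch("BadBot", "/index.html"))
print(parser.can_fetch("GoodBot", "/index.html"))
print(parser.can_fetch("GoodBot", "/private/data.html"))
```

Nothing here stops a crawler that simply ignores the file.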

      1. Wiltshire

        Re: "Seriously though, the site owners love the extra traffic, it boosts their stats somewhat, at least until a competent traffic report shows unique visitors."

        That reminds me, at a recent all-company meeting, a product manager proudly displayed the record-breaking stats on our new shiny toy website. In the chosen period, 15,000 users! Much applause, happy smiley faces, extra bonuses all round.

        Except they were announcing the raw Google Analytics sessions figures, with huge peaks from known bot IP addresses. The actual number of logged-in users was only 308. Whoops. That went straight into our special file of "not safe for work" stats. Also known as the "no-mates stats" or the "career limiters". The kind of numbers that do not win friends or influence people. Or enhance one's CV: "I was responsible for bursting people's balloons and reducing them to tears".
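The gap between raw sessions and real users can be illustrated with a rough sketch. The `Session` record, the IP addresses and the bot list below are invented for the example; they are not the figures or the tooling from the meeting described above.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Set


class Session(NamedTuple):
    ip: str
    user_id: Optional[str]  # None when the visitor never logged in


def summarize(sessions: Iterable[Session], bot_ips: Set[str]) -> Dict[str, int]:
    """Compare the raw session count with bot-filtered and logged-in counts."""
    sessions = list(sessions)
    non_bot = [s for s in sessions if s.ip not in bot_ips]
    users = {s.user_id for s in non_bot if s.user_id is not None}
    return {
        "raw_sessions": len(sessions),
        "non_bot_sessions": len(non_bot),
        "logged_in_users": len(users),
    }


# Hypothetical data using documentation-only address ranges.
KNOWN_BOT_IPS = {"192.0.2.10"}
SAMPLE = [
    Session("192.0.2.10", None),
    Session("192.0.2.10", None),
    Session("192.0.2.10", None),
    Session("198.51.100.5", "alice"),
    Session("198.51.100.5", "alice"),
    Session("198.51.100.6", "bob"),
    Session("198.51.100.7", None),
]

stats = summarize(SAMPLE, KNOWN_BOT_IPS)
print(stats)
```

Which of the three numbers goes on the slide is, of course, a political decision rather than a technical one.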

    2. Ole Juul

      half?

      "either someone is bad at percentages, or estimating the size of the internet."

      It's the latter. I've spent a lot of time trying to analyse traffic to a couple of my sites and make some kind of meaningful sense of it. My best guess is that less than 10% of hits are from legitimate visitors. Half is not a scare word to me. It just makes it sound like somebody has no idea.

    3. Kevin McMurtrie Silver badge

      Bots streaming videos from torrents to look for pirated content.

    4. Barry Rueger

      I'm sure that I read that porn is the other two-thirds.

    5. Electron Shepherd

      The web is not the internet

      OK, so if bots are half of all internet traffic

      That's not what the article, or the original DeviceAtlas post, says.

      They say that half of all web requests are from bots. That's not the same as saying that half the traffic sent in response to those requests is due to bots, and it's certainly not the same as saying that half of all internet traffic is due to bots.
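A toy calculation shows how far apart the two measures can be. The request sizes below are invented purely to illustrate the distinction and say nothing about the real numbers behind the report.

```python
from typing import List, Tuple

# Hypothetical request log: (client kind, response size in bytes).
REQUESTS: List[Tuple[str, int]] = [
    ("bot", 1_000),     # e.g. a small fetch by a crawler
    ("bot", 1_000),
    ("human", 49_000),  # e.g. a full page with images
    ("human", 49_000),
]


def bot_shares(requests: List[Tuple[str, int]]) -> Tuple[float, float]:
    """Return (share of requests made by bots, share of bytes sent to bots)."""
    total_requests = len(requests)
    total_bytes = sum(size for _, size in requests)
    bot_requests = sum(1 for kind, _ in requests if kind == "bot")
    bot_bytes = sum(size for kind, size in requests if kind == "bot")
    return bot_requests / total_requests, bot_bytes / total_bytes


request_share, byte_share = bot_shares(REQUESTS)
print(f"bots: {request_share:.0%} of requests, {byte_share:.0%} of bytes")
```

With these made-up sizes, bots account for half of the requests but only a small fraction of the bytes.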

    6. Keith Glass

      As they say over here in .us: "Common Core Math" (grin)

    7. Anonymous Coward

      And in TV fiction, they like to say that Tor is ~90% of the internet.

  2. Anonymous Coward

    I monitor the web pages for over 200 music groups - notifying the updates and page links to a niche set of music lovers. It used to be that most of the groups had no general advance publicity when they went on tour. Any videos they put up were often seen by only a handful of people.

    There is no way I could cover the 600+ pages manually. The automaton visits each page once a week, scrapes any new information, and classifies whether it might be of interest.

    Even the manual final appraisal relies on semi-automatic features to handle Google searches, map look-ups, and retrieving video descriptions.

    For pages that have no relevant new information, it actually reduces their traffic load. People know when it is worthwhile visiting a site - and what to look for.

    This is what computers do very well - turning a mountain of data into bite-sized nuggets in a reasonable time.
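One common way to spot "new information" on a page is to compare a hash of its text with the hash from the previous visit. The sketch below assumes that approach; the comment above does not say how its automaton actually works, and the function names and URLs here are hypothetical.

```python
import hashlib
from typing import Dict, List


def fingerprint(page_text: str) -> str:
    """Hash the page text, ignoring differences in whitespace only."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_pages(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return URLs whose content hash differs from the stored one.

    `previous` maps URL -> last seen hash and is updated in place.
    `current` maps URL -> page text fetched on this visit.
    """
    changed = []
    for url, text in current.items():
        digest = fingerprint(text)
        if previous.get(url) != digest:
            changed.append(url)
        previous[url] = digest
    return changed


# Hypothetical weekly runs over two pages.
seen: Dict[str, str] = {}
week1 = {"http://example.com/band-a": "Tour dates: none", "http://example.com/band-b": "New video!"}
week2 = {"http://example.com/band-a": "Tour dates:   none", "http://example.com/band-b": "New video! Plus a gig."}

first_run = changed_pages(seen, week1)
second_run = changed_pages(seen, week2)
print(first_run)
print(second_run)
```

On the first run every page counts as changed because nothing has been seen before; on the second, only the page whose text really differs is flagged.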

    1. Justin Clift

      Maybe we should start using the sci-fi terminology, calling them "Autonomous Agents" (or their precursor), to convey these "bots" are useful?

  3. Aslan

    Is this true, El Reg? If so, I think this topic deserves a more in-depth article.

    1. Anonymous Coward

      It's clearly not.

  4. Brian Miller

    Um, so? And?

    Of course there's a lot of bot traffic. How does anyone think search engines operate? "Hi, but could you just please push your content to us? Thanks!" Um, no.

    I've written bots, from scratch. It's not that difficult to write a bot that "mimics" a person. What is difficult is keeping the bot on the site, instead of zipping off into the rest of the web. My bots could go through thousands of sites in half an hour. For what purpose? Because there are a lot of scam sites on the web, and it's easiest to let a bot analyze and score them.
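Keeping a crawler on one site usually comes down to resolving each link and discarding anything on a different host. The following is a minimal sketch of that idea, not the poster's code; `same_site_links` is a hypothetical helper and the URLs are examples.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def same_site_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against base_url and keep only http(s) links on the same host."""
    base_host = urlparse(base_url).netloc.lower()
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() == base_host:
            kept.append(absolute)
    return kept


links = same_site_links(
    "http://example.com/a/",
    ["b.html", "/c", "http://other.example/x", "mailto:me@example.com", "//example.com/d"],
)
print(links)
```

A real crawler would also need a visited set and some handling of subdomains and redirects, which this sketch leaves out.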

    It sounds like DeviceAtlas is selling something that analyzes traffic like fail2ban blocks repeated bad logins. You don't want your site spidered? Well, that takes a bit of analysis. That's all there is to it. Could you write up something yourself? Sure.

    I've heard that more than one major shopping site has been broken by a bot that "clicked" on the "add to shopping cart" link, and wound up with thousands of items in the cart. Ooops! Bad bot! Bad site! So I can see a market for a product like this, despite the hype.

    1. Suricou Raven

      Re: Um, so? And?

      Been there. I made a database with a web interface - I didn't bother to bot-proof it as only I had the address and never published it. All was well until I upgraded Apache. Somewhere in the process my 'Options -Indexes' on the folder above was lost, exposing the database frontend address to the bots. I got hit by two of them, which between them managed to really mess up the data. I eventually had to get the logs and write a script to identify every URL accessed and undo the operation therein.

      Then I put some http basic auth on it. Enough to keep the bots out.
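A log-trawling script of that sort might look roughly like the sketch below. It assumes Apache's Common Log Format, and the sample lines, IP addresses and paths are invented; the original script and log format are not given in the comment above.

```python
import re
from typing import Iterable, List, Set, Tuple

# Assumes Common Log Format: ip ident user [time] "METHOD path PROTO" status size
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def urls_hit_by(lines: Iterable[str], ips: Set[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, path) for every request made by one of the given IPs."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("ip") in ips:
            hits.append((match.group("time"), match.group("path")))
    return hits


# Hypothetical log excerpt with documentation-only addresses.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2015:13:55:36 +0000] "GET /db/delete?id=42 HTTP/1.1" 200 512',
    '198.51.100.5 - - [10/Oct/2015:13:55:40 +0000] "GET /db/list HTTP/1.1" 200 2048',
    '203.0.113.7 - - [10/Oct/2015:13:55:41 +0000] "GET /db/delete?id=43 HTTP/1.1" 200 512',
]

bot_hits = urls_hit_by(SAMPLE_LOG, {"203.0.113.7"})
for when, path in bot_hits:
    print(when, path)
```

Undoing each operation is application-specific, so the sketch stops at listing what the bots touched.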

  5. Suricou Raven

    Solved it.

    1. (optional) Exclude /bait/ using robots.txt if you only want to block the dishonest ones.

    2. Create a link on your index to /bait/banme.cgi, but give it the appropriate CSS to be entirely hidden from view. Now only bots can see it.

    3. Create a /bait/banme.cgi that adds an iptables rule to drop anything from the originating IP.

    Well, that was easy.
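As a rough sketch of step 3, the CGI script could validate the client address and build the firewall command from it. Everything named here is an assumption for illustration: `build_ban_command` is hypothetical, the rule is appended to the INPUT chain, and the sketch only builds and prints the command rather than running it, since actually applying it needs root and careful thought about shared IPs and proxies.

```python
import ipaddress
import os
from typing import List


def build_ban_command(remote_addr: str) -> List[str]:
    """Build (but do not run) a firewall command that drops traffic from remote_addr.

    Raises ValueError if remote_addr is not a valid IP address, so that
    nothing attacker-controlled other than a parsed address reaches the command.
    """
    ip = ipaddress.ip_address(remote_addr)
    tool = "iptables" if ip.version == 4 else "ip6tables"
    return [tool, "-A", "INPUT", "-s", str(ip), "-j", "DROP"]


if __name__ == "__main__":
    # In a CGI environment the web server sets REMOTE_ADDR to the client address.
    # The fallback value is a documentation-only address for trying this out by hand.
    client = os.environ.get("REMOTE_ADDR", "203.0.113.9")
    print("Content-Type: text/plain\n")
    print(" ".join(build_ban_command(client)))
```

Passing the command as a list rather than a shell string avoids shell injection if it is later handed to something like `subprocess.run`.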
