Scroogle has once again returned from the dead, continuing to serve up its privacy-friendly Google search results after another programming tweak from founder Daniel Brandt. Brandt and the not-for-profit Scroogle have been scraping Google search results since 2002, allowing netizens to use Mountain View's search engine without …

COMMENTS

Friday 9th July 2010 00:04 GMT Anomalous Cowturd

Yay! Scroogle's back

Have another drink on me Daniel.

This is getting expensive...

11 0
Friday 9th July 2010 01:24 GMT Lou Gosselin

I can sympathize with them.

I've written web crawlers for many sites.

The problem is that the parser is so fragile: visual changes to the site mean a potentially broken crawler, especially for structured data.

Once the actual data is parsed, one isn't restricted by the interface hard coded into the html. Data can be stored in records, indexed, aggregated, presented in a new form, etc.

As a developer it makes me wish more sites used self-documenting data exchange (such as XML or JSON) instead of HTML.
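
A minimal sketch of that fragility, assuming invented markup: the `result` and `hit` class names, the example.org URL and the shape of the JSON feed are all hypothetical and are not taken from Google or Scroogle. The scraper is tied to one page layout, while the feed reader only depends on field names.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup for one search result; the class names are invented.
OLD_HTML = '<div class="result"><a href="http://example.org/">Example</a></div>'
# The same data after a purely visual redesign (different tag and class).
NEW_HTML = '<li class="hit"><a href="http://example.org/">Example</a></li>'
# The same record as a self-describing JSON feed (also hypothetical).
FEED = '{"results": [{"url": "http://example.org/", "title": "Example"}]}'


class ResultScraper(HTMLParser):
    """Collects links found inside <div class="result"> and nothing else."""

    def __init__(self):
        super().__init__()
        self.in_result = False
        self.in_link = False
        self.records = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "result":
            self.in_result = True
        elif tag == "a" and self.in_result:
            self.in_link = True
            self.records.append({"url": attrs.get("href"), "title": ""})

    def handle_data(self, data):
        # Text inside the link becomes the record's title.
        if self.in_link:
            self.records[-1]["title"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "div":
            self.in_result = False


def scrape(html):
    """Return the records the layout-specific scraper can find."""
    parser = ResultScraper()
    parser.feed(html)
    parser.close()
    return parser.records


def read_feed(text):
    """Return the records from the JSON feed; no layout knowledge needed."""
    return json.loads(text)["results"]
```

With these inputs, `scrape(OLD_HTML)` and `read_feed(FEED)` yield the same record, while `scrape(NEW_HTML)` finds nothing even though the data itself has not changed.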

Today, XML and JSON are mostly an afterthought to supplement shortcomings inherent to the HTTP/HTML interfaces. Ideally, we would eliminate dynamic HTML entirely from the server and do all rendering on the client using the data feeds.

XSLT is one solution, however its relative complexity killed its chances before it ever got off the ground.

3 0
1. Friday 9th July 2010 10:38 GMT Anonymous Coward
  
  This is a title
  
  Yes scraping pages can be very testing at times, but I think you've missed an obvious point in your lust for SDDX;
  
  Some people don't want their pages to be scraped!
  
  Do you really think Google wants you to scrape their search results, thus depriving them of advertising revenue and user profiles? I don't, so why would anyone in this position even contemplate changing the way they do things?
  
  Let me put it another way. You run a blog/site (I assume you do), which generates an RSS feed with full article content (as opposed to just a summary). How happy are you going to be when some T*sser uses _your_ feed to serve that content on their own page?? What if they serve it with ads? Suddenly, they are making money off your work, and giving nothing back.
  
  That's why a lot of people will never switch away from something similar to the current technologies.
  
  <rant>
  
  Rendering on the client,
  
  This is something of a bugbear of mine. You want me to view your content (otherwise you wouldn't have published), why should all the processing be done at my end? If it can be run serverside, you should take responsibility for the code and run it on your hardware.
  
  People who insist on Javascript for a webform submission that could be processed by a serverside really annoy me.
  
  Granted this is less of an issue with simple rendering, but it's still a risk.
  
  How many businesses would ask their customers to take a risk that could be handled within the business. If Argos asked you to do it, would you not be a bit p*ssed off at their level of customer service? So why do it to those visiting your website(s)???
  
  </rant>
  
  2 2
  1. Friday 9th July 2010 13:02 GMT CD001
    
    @rant
    
    Actually - I agree with that rant - and not just using JavaScript on web forms. On ANY public facing website I'd argue that client-side scripting should be an optional extra and that the site should function perfectly with JavaScript disabled - especially if you're selling something.
    
    Using it in a closed environment, such as a CMS, where you can mandate the users' browser settings is a different matter of course.
    
    There's nothing wrong with using XML/XSLT apart from the fact that, for a web page, it's pointless - that's what HTML is FOR. If you want to parse out your results to WML or whatever, you can do it server-side since you're probably pulling the data from a database anyway.
    
    0 1
    1. Saturday 10th July 2010 18:03 GMT Lou Gosselin
      
      @CD001
      
      "Actually - I agree with that rant - and not just using JavaScript on web forms."
      
      Although I can appreciate the issues with JavaScript, particularly its use in making annoying web interfaces, I don't agree that client-side rendering in general implies annoying interfaces; that's merely a choice by the designer.
      
      Honestly I find it a little ironic that users would object against data centric models on the grounds of poor interfaces. A data centric model makes it far easier for a user to bring up almost any web site in tabular form, possibly applying their own style transformation.
      
      "There's nothing wrong with using XML/XSLT apart from the fact that, for a web page, it's pointless - that's what HTML is FOR."
      
      This is incorrect. XSLT is an example of one technology which enables the separation between contents and presentation. This is the exact opposite of HTML.
      
      1 0
  2. Friday 9th July 2010 13:33 GMT Anonymous Coward
    
    Missing the point about RSS
    
    Though I mostly agree with your first point, you've missed something fundamental about RSS. It stands for Really Simple Syndication; it's a format specifically designed and intended to be used in the way you describe. If you offer an RSS feed then you are explicitly inviting others to serve your content on their pages, aka syndication.
    
    Now you can argue all day long about what your understanding of RSS is or what the intention was when you made an RSS feed available, but that won't change the facts. RSS readers came about long after the standard was devised; they were a side-effect, not the reason for RSS in the first place.
    
    2 0
  3. Saturday 10th July 2010 18:03 GMT Lou Gosselin
    
    Re: This is a title
    
    "Some people don't want their pages to be scraped!"
    
    Yes, I understand that, and it's a legitimate concern.
    
    "This is something of a bugbear of mine. You want me to view your content (otherwise you wouldn't have published), why should all the processing be done at my end? If it can be run serverside, you should take responsibility for the code and run it on your hardware."
    
    This point however is completely off base.
    
    The postback model (having data go back and forth between client and server over and over again) is very expensive for both the client and the server.
    
    If the interface code could be transferred once, and merely repopulated with data on demand, then there would be much less work for both the server and the client.
    
    "People who insist on Javascript for a webform submission that could be processed by a serverside really annoy me....How many businesses would ask their customers to take a risk that could be handled within the business"
    
    This assumption is unwarranted; obviously the server still has to validate any data submitted by the user. There is no additional risk in using data-centric web services over HTML. All the security protections applying to HTML also apply to web services.
    
    0 0
2. Friday 9th July 2010 12:45 GMT Charles 9
  
  There IS a reason.
  
  Google DOESN'T WANT people to get easy access to their data. Their business model relies on exploiting this data to deliver revenue-generating material--call it a "loss leader", if you will. So, of course, Google will take pains to make sure that what the customers want, the search results, come with as many strings attached as they can...and make sure there are no ways to cut those strings.
  
  0 0
Friday 9th July 2010 01:24 GMT Anonymous Coward

I'm getting bored over listening to this guy's dramatics every time

He provides a solution that depends on scraping a commercial website and removing everything that makes them money, and far from actively blocking him they just leave him be but change their product every so often to suit themselves, and yet whenever their changes don't suit him he throws a hissy fit that makes it into the press. Not his fault - more yours for covering it

14 25
1. Friday 9th July 2010 10:38 GMT Fred Flintstone
  
  If only Google offered this itself
  
  I would actually not have a problem with generic ads if Google would forego scraping everything I do. Hell, I may even read some instead of actively blocking them.
  
  Yet that precise option is what Google is actively trying to avoid. The pain of NSA sponsorship, I guess..
  
  4 0
2. Friday 9th July 2010 10:38 GMT Anonymous Coward
  
  Agreed
  
  I love the fact you've got 6 downvotes for simply stating facts. There's obviously some scrooglers here!
  
  Fact is, Google makes money from search queries (whether as advertising, or building a profile for later). Why this guy thinks he can expect them to actively work to allow him to use their service without providing the revenue/profile is beyond me.
  
  Yes Scroogle is a useful service, but does it have an absolute right to exist? No.
  
  It's reliant on someone else's service, and if they want to change how their service works then that's tough shit.
  
  Kudos to him for coding around it though. Pity about all the crying that went with it
  
  1 1
3. Friday 9th July 2010 11:26 GMT Anonymous Coward
  
  Actually his rants are what makes him human
  
  Actually that's kind of what makes Scroogle fun. After all, it adds a more human feel to it that some mornings I wake up and have, instead of search results, a rant about big business from someone who is prepared to stick it to the man.
  
  I get that it's just a search tool, and if you only want a robot to deliver you search results then that's cool also, but I like the more chaotic approach to my day. It kind of reflects how I feel about my life in general.
  
  At any rate it's about choice, just do whatever you want to and worry about the consequences afterwards.
  
  2 0
4. Friday 9th July 2010 11:38 GMT Absolute Cynic
  
  Getting bored?
  
  Google provide a service and deserve to be recompensed. So I don't mind the ads, and I tolerate the ordering of results. But I don't like Google tracking my searches. That should be illegal and since it is not, Scroogle will continue to get my thumbs up.
  
  4 0
5. Friday 9th July 2010 12:00 GMT Anonymous Coward
  
  Well done you.
  
  You're clearly a confident gent who is happy with Google knowing more about you than you know yourself, and making money selling it to other people for their own nefarious ends.
  
  5 0
6. Friday 9th July 2010 12:45 GMT Ian McNee
  
  Try reading the actual article rather than the one in your head
  
  Scroogle uses feeds that are necessarily *not* laden with ads as they have to pay for their bandwidth. And far from throwing a "hissy fit" Daniel Brandt just got on with the job of finding a way of continuing to provide this service to those who want to use it. WTF have you done for the rest of us??
  
  If Google believed Scroogle was hurting its revenues in anything other than an infinitesimal way, you can bet they would have made sure they were closed down forever, one way or another.
  
  2 0
Friday 9th July 2010 08:42 GMT Big-nosed Pengie

Title

Definitely beer time. Here's to Scroogle - long may they scrape!

4 0
1. Friday 9th July 2010 12:00 GMT Anonymous Coward
  
  Porn site?
  
  Read this and thought - Google search anonymously? Great! Went to www.scroogle.com and it's a porn site! Am I missing something here - or is this definitely NSFW?
  
  0 0
  1. Friday 9th July 2010 13:40 GMT Geoffrey W
    
    Try...
    
    http://scroogle.org
    
    0 0
Friday 9th July 2010 10:33 GMT Anonymous Coward

Misleading sub heading...

"Private Google scrapper in SCO-like refusal to die"

One BIG difference. SCO is pretty much universally hated, Scroogle is not.

5 0
Friday 9th July 2010 11:31 GMT James Pickett

No comparison

You're a bit hard comparing him to SCO. I don't think Darl McBride would understand the phrase 'not for profit'...

2 0
Friday 9th July 2010 11:31 GMT PerfectBlue

Stupid question

This might sound stupid, but why don't people just use a different search engine rather than a third party service that is reliant on Google?

Scroogle might give you anonymous access to Google, but it doesn't give you access to the search results that Google won't give you (regardless of what it claims, Google DOES filter its results so as to exclude certain websites. It's well known for complying with DMCA notices, and for removing Inquisition 21st Century from its search results in the UK), and it will give you the results in an order based on Google's search algorithms.

If you don't like Google, then support their competitors. Using Scroogle still gives Google market share and keeps them at the top.

Unless people move away from Google they aren't sending them a very powerful message.

Of course I intend to go on using them as they're a good service, but if you don't like them then vote with your feet.

2 0
Friday 9th July 2010 11:43 GMT Gareth 28

A word to the wise

Just in case anybody's at work and wanted to use the Scroogle service, the search engine is at http://www.scroogle.org. A quick jump to www.scroogle.com brought me to a search engine providing results of a completely different kind of erm... "scroo"...

0 0
Friday 9th July 2010 12:00 GMT Anonymous Coward

Just get on with it man

If you're scraping a site and re-packaging its contents you have to expect things to break occasionally. You just have to fix them. It's part of the game. And three days to re-write a parser doesn't sound like the end of the world to me.

If Google really didn't want you doing it, I'm sure there are lots of much nastier things they could change. But then they'd be evil, wouldn't they?

0 0
Friday 9th July 2010 12:45 GMT Anonymous Coward

Porn site?

Thought this sounded like something I could use - went to scroogle.com and up pops a porn site....not good since I'm at work ! What is the scroogle address?

0 0
1. Friday 9th July 2010 13:40 GMT a reader
  
  not a porn site
  
  try scroogle.org, not .com
  
  0 0
Friday 9th July 2010 12:51 GMT Anonymous Coward

Daniel Brandt doesn't like Google

"This is for your copy-and-paste convenience:

Dear Gmail user: Due to privacy considerations, we cannot respond unless you resend your email from a different account. For more information, please visit www.gmail-is-too-creepy.com" Administrative Contact: Brandt, Daniel namebase BIGATSIGN earthlink.net

creepymail ATSIGN yahoo.co.uk

www.namebase.org, www.microsoft-watch.org, www.yahoo-watch.org, www.amazon-watch.org and www.scroogle.org.

0 0
Friday 9th July 2010 13:30 GMT JDX

re: Actually thats kind of what makes scroogle fun

Yeah, I'd love to use a search engine that doesn't always work. Far as I'm concerned, Google are entitled to show ads if I want to use their (brilliant) product. And good on Google for not trying to kill Scroogle, which they probably legitimately could for excessive usage.

1 0
Saturday 10th July 2010 05:11 GMT Anonymous Coward

Scroogle Crawler

I'm not sure what Scroogle.org is crawling at the moment, but www.google.com/m is a lighter version of Google which could be used.

0 0
Saturday 10th July 2010 23:44 GMT Anonymous Coward

Thanx so much

Thanks again Mr. Daniel! I really appreciate what you've done for Scroogle so far...

Jahangir.

0 0
Wednesday 21st July 2010 09:02 GMT Anonymous Coward

parsing google is easy

to parse google results one could use hand-coded C.

one could also use lex and yacc to auto generate even better filter.

or one could use sed.

or one could use m4.

or one could use snobol.

not difficult at all. just different, and not hyped.

regexp is only so powerful.

that's why we have BNF.
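
a minimal sketch of the regexp-versus-grammar point, under stated assumptions: classical regular expressions cannot match arbitrarily deep nesting, while a grammar written in BNF maps directly onto a small recursive-descent parser. the toy grammar of nested parenthesised lists and the function names below are invented for illustration and have nothing to do with Google's actual markup.

```python
import re

# Toy grammar in BNF, invented for illustration:
#   <list>  ::= "(" <items> ")"
#   <items> ::= <item> <items> | ""
#   <item>  ::= WORD | <list>


def tokenize(text):
    """Split the input into parentheses and words; a regexp is enough here."""
    return re.findall(r"[()]|[^\s()]+", text)


def parse_list(tokens, pos=0):
    """Parse one <list> starting at tokens[pos]; return (tree, next position)."""
    if pos >= len(tokens) or tokens[pos] != "(":
        raise ValueError("expected '('")
    pos += 1
    items = []
    while pos < len(tokens) and tokens[pos] != ")":
        if tokens[pos] == "(":
            # <item> ::= <list>: recurse, which is what a plain regexp cannot do.
            subtree, pos = parse_list(tokens, pos)
            items.append(subtree)
        else:
            # <item> ::= WORD
            items.append(tokens[pos])
            pos += 1
    if pos >= len(tokens):
        raise ValueError("missing ')'")
    return items, pos + 1  # skip the closing parenthesis


def parse(text):
    """Parse a complete input string into nested Python lists."""
    tokens = tokenize(text)
    tree, pos = parse_list(tokens)
    if pos != len(tokens):
        raise ValueError("trailing input")
    return tree
```

the regexp only does the tokenizing; the nesting is handled by the recursion in `parse_list`, so `parse("(a (b (c)) d)")` produces a nested list mirroring the input and unbalanced input raises `ValueError`.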

0 0