Yay! Scroogle's back
Have another drink on me Daniel.
This is getting expensive...
Scroogle has once again returned from the dead, continuing to serve up its privacy-friendly Google search results after another programming tweak from founder Daniel Brandt. Brandt and the not-for-profit Scroogle have been scraping Google search results since 2002, allowing netizens to use Mountain View's search engine without …
I've written web crawlers for many sites.
The problem is that the parser is so fragile: a visual change to the site can mean a broken crawler, especially for structured data.
Once the actual data is parsed, one isn't restricted by the interface hard coded into the html. Data can be stored in records, indexed, aggregated, presented in a new form, etc.
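To make that concrete, here is a minimal sketch in Python, using only the standard library. The markup in SAMPLE_PAGE and the names ResultParser, parse_results and hits_per_host are made up for illustration; this is not Scroogle's code and not Google's real page structure. It pulls links into plain records, aggregates them, and shows how a trivial class rename silently breaks the parser.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical markup; a real results page differs and changes over time.
SAMPLE_PAGE = """
<div class="result"><a href="http://example.com/a">First hit</a></div>
<div class="result"><a href="http://example.org/b">Second hit</a></div>
<div class="result"><a href="http://example.com/c">Third hit</a></div>
"""


class ResultParser(HTMLParser):
    """Collects {url, title} records from links inside div.result blocks."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_result = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "result":
            self._in_result = True
        elif tag == "a" and self._in_result:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a link we are tracking.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.records.append({"url": self._href, "title": title})
            self._href = None
        elif tag == "div":
            self._in_result = False


def parse_results(page):
    """Turn a page of markup into a list of plain records."""
    parser = ResultParser()
    parser.feed(page)
    return parser.records


def hits_per_host(records):
    """Aggregate the records in a way the original page never offered."""
    return Counter(urlparse(r["url"]).netloc for r in records)


records = parse_results(SAMPLE_PAGE)
by_host = hits_per_host(records)

# The fragility: rename one CSS class and the parser finds nothing.
redesigned = SAMPLE_PAGE.replace('class="result"', 'class="r"')
broken = parse_results(redesigned)
```

Once the data is in records, the counting step no longer cares what the page looked like. The last two lines are the whole problem in miniature: nothing errors, the crawler just comes back empty.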
As a developer, it makes me wish more sites used self-documenting data exchange formats (XML or JSON, for example) instead of HTML.
Today, XML and JSON are mostly an afterthought to supplement shortcomings inherent to the HTTP/HTML interfaces. Ideally, we would eliminate dynamic HTML entirely from the server and do all rendering on the client using the data feeds.
XSLT is one solution; however, its relative complexity killed its chances before it ever got off the ground.
Yes scraping pages can be very testing at times, but I think you've missed an obvious point in your lust for SDDX;
Some people don't want their pages to be scraped!
Do you really think Google wants you to scrape their search results, thus depriving them of advertising revenue and user profiles? I don't, so why would anyone in this position even contemplate changing the way they do things?
Let me put it another way. You run a blog/site (I assume you do), which generates an RSS feed with full article content (as opposed to just a summary). How happy are you going to be when some T*sser uses _your_ feed to serve that content on their own page?? What if they serve it with ads? Suddenly, they are making money off your work, and giving nothing back.
That's why a lot of people will never switch away from something similar to the current technologies.
<rant>
Rendering on the client,
This is something of a bugbear of mine. You want me to view your content (otherwise you wouldn't have published), why should all the processing be done at my end? If it can be run serverside, you should take responsibility for the code and run it on your hardware.
People who insist on Javascript for a webform submission that could be processed by a serverside really annoy me.
Granted this is less of an issue with simple rendering, but it's still a risk.
How many businesses would ask their customers to take a risk that could be handled within the business? If Argos asked you to do it, would you not be a bit p*ssed off at their level of customer service? So why do it to those visiting your website(s)???
</rant>
Actually - I agree with that rant - and not just using JavaScript on web forms. On ANY public facing website I'd argue that client-side scripting should be an optional extra and that the site should function perfectly with JavaScript disabled - especially if you're selling something.
Using it in a closed environment, such as a CMS, where you can mandate the users' browser settings is a different matter of course.
There's nothing wrong with using XML/XSLT apart from the fact that, for a web page, it's pointless - that's what HTML is FOR. If you want to parse out your results to WML or whatever, you can do it server-side since you're probably pulling the data from a database anyway.
"Actually - I agree with that rant - and not just using JavaScript on web forms."
Although I can appreciate the issues with JavaScript, particularly its use in making annoying web interfaces, I don't agree that client-side rendering in general implies annoying interfaces; that's merely a choice by the designer.
Honestly I find it a little ironic that users would object to data-centric models on the grounds of poor interfaces. A data-centric model makes it far easier for a user to bring up almost any web site in tabular form, possibly applying their own style transformation.
"There's nothing wrong with using XML/XSLT apart from the fact that, for a web page, it's pointless - that's what HTML is FOR."
This is incorrect. XSLT is an example of one technology which enables the separation between contents and presentation. This is the exact opposite of HTML.
Though I mostly agree with your first point, you've missed something fundamental about RSS. It stands for Really Simple Syndication, it's a format specifically designed and intended to be used in the way you describe. If you offer an RSS feed then you are explicitly inviting others to serve your content on their pages, aka syndication.
Now you can argue all day long about what your understanding of RSS is or what the intention was when you made an RSS feed available but that won't change the facts. RSS readers came about long after the standard was devised, they were a side-effect not the reason for RSS in the first place.
"Some people don't want their pages to be scraped!"
Yes, I understand that, and it's a legitimate concern.
"This is something of a bugbear of mine. You want me to view your content (otherwise you wouldn't have published), why should all the processing be done at my end? If it can be run serverside, you should take responsibility for the code and run it on your hardware."
This point however is completely off base.
The postback model (having data go back and forth between client and server over and over again) is very expensive for both the client and the server.
If the interface code could be transferred once, and merely repopulated with data on demand, then there would be much less work for both the server and the client.
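A rough sketch of that idea in Python, for illustration only. ROW_TEMPLATE, render and the feed's "url" and "title" fields are assumptions made up here, not any real site's API. The template stands in for interface code fetched once; each later request carries only a small JSON payload, which the client drops into it.

```python
import html
import json
from string import Template

# Hypothetical interface template: transferred once, then reused.
ROW_TEMPLATE = Template('<li><a href="$url">$title</a></li>')


def render(feed_json):
    """Repopulate the cached template from a JSON data feed."""
    items = json.loads(feed_json)
    rows = "\n".join(
        ROW_TEMPLATE.substitute(
            # Escape feed values so data cannot inject markup.
            url=html.escape(item["url"], quote=True),
            title=html.escape(item["title"]),
        )
        for item in items
    )
    return "<ul>\n" + rows + "\n</ul>"


# Each refresh only needs a payload like this, not a whole new page.
feed = '[{"url": "http://example.com/", "title": "A & B"}]'
page = render(feed)
```

The same feed could just as easily be rendered as a table, or not rendered at all and simply indexed, which is the point about not being tied to the interface baked into the HTML.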
"People who insist on Javascript for a webform submission that could be processed by a serverside really annoy me....How many businesses would ask their customers to take a risk that could be handled within the business"
This assumption was unwarranted: obviously the server still has to validate any data submitted by the user. There is no additional risk in using data-centric web services over HTML. All the security protections applying to HTML also apply to web services.
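As a minimal sketch of what "the server still has to validate" means in practice (Python, with a made-up validate_order function and hypothetical "email" and "quantity" fields): the server runs the same checks whether the submission arrived from a plain HTML form, a script, or a hand-crafted request, and never trusts that the client checked anything.

```python
import re

# Deliberately loose pattern; an assumption for illustration, not a full
# email address validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_order(form):
    """Return the names of fields that fail server-side validation."""
    errors = []

    email = form.get("email", "")
    if not isinstance(email, str) or not EMAIL_RE.match(email):
        errors.append("email")

    try:
        quantity = int(form.get("quantity", ""))
    except (TypeError, ValueError):
        quantity = 0  # anything unparseable is treated as invalid
    if not 1 <= quantity <= 100:
        errors.append("quantity")

    return errors


ok = validate_order({"email": "a@example.com", "quantity": "3"})
bad = validate_order({"email": "nope", "quantity": "-5"})
empty = validate_order({})
```

Client-side checks, where they exist, are then only a convenience that saves a round trip; removing them changes nothing about what the server accepts.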
Google DOESN'T WANT people to get easy access to their data. Their business model relies on exploiting this data to deliver revenue-generating material--call it a "loss leader", if you will. So, of course, Google will take pains to make sure that what the customers want, the search results, come with as many strings attached as they can...and make sure there are no ways to cut those strings.
He provides a solution that depends on scraping a commercial website and removing everything that makes them money, and far from actively blocking him they just leave him be but change their product every so often to suit themselves, and yet whenever their changes don't suit him he throws a hissy fit that makes it into the press. Not his fault - more yours for covering it.
I would actually not have a problem with generic ads if Google would forego scraping everything I do. Hell, I may even read some instead of actively blocking them.
Yet that precise option is what Google is actively trying to avoid. The pain of NSA sponsorship, I guess..
I love the fact you've got 6 downvotes for simply stating facts. There's obviously some scrooglers here!
Fact is, Google makes money from search queries (whether as advertising, or building a profile for later). Why this guy thinks he can expect them to actively work to allow him to use their service without providing the revenue/profile is beyond me.
Yes Scroogle is a useful service, but does it have an absolute right to exist? No.
It's reliant on someone else's service, and if they want to change how their service works then that's tough shit.
Kudos to him for coding around it though. Pity about all the crying that went with it.
Actually that's kind of what makes Scroogle fun. After all, it adds a more human feel to it that some mornings I wake up and have, instead of search results, a rant about big business from someone who is prepared to stick it to the man.
I get that it's just a search tool, and if you only want a robot to deliver you search results then that's cool also, but I like the more chaotic approach to my day. It kind of reflects how I feel about my life in general.
At any rate it's about choice: just do whatever you want to and worry about the consequences afterwards.
Scroogle uses feeds that are necessarily *not* laden with ads as they have to pay for their bandwidth. And far from throwing a "hissy fit" Daniel Brandt just got on with the job of finding a way of continuing to provide this service to those who want to use it. WTF have you done for the rest of us??
If Google believed Scroogle was hurting its revenues in anything other than an infinitesimal way, you can bet they would have made sure they were closed down forever, one way or another.
This might sound stupid, but why don't people just use a different search engine rather than a third party service that is reliant on Google?
Scroogle might give you anonymous access to Google, but it doesn't give you access to the search results that Google won't give you (regardless of what it claims, Google DOES filter its results so as to exclude certain websites. It's well known for complying with DMCA notices, and for removing Inquisition 21st Century from its search results in the UK), and it will give you the results in an order based on Google's search algorithms.
If you don't like Google, then support their competitors. Using Scroogle still gives Google market share and keeps them at the top.
Unless people move away from Google they aren't sending them a very powerful message.
Of course I intend to go on using them as they're a good service, but if you don't like them then vote with your feet.
If you're scraping a site and re-packaging its contents you have to expect things to break occasionally. You just have to fix them. It's part of the game. And three days to re-write a parser doesn't sound like the end of the world to me.
If Google really didn't want you doing it, I'm sure there are lots of much nastier things they could change. But then they'd be evil, wouldn't they?
"This is for your copy-and-paste convenience:
Dear Gmail user: Due to privacy considerations, we cannot respond unless you resend your email from a different account. For more information, please visit www.gmail-is-too-creepy.com" Administrative Contact: Brandt, Daniel namebase BIGATSIGN earthlink.net
creepymail ATSIGN yahoo.co.uk
www.namebase.org, www.microsoft-watch.org, www.yahoo-watch.org, www.amazon-watch.org and www.scroogle.org.
Yeah, I'd love to use a search engine that doesn't always work. Far as I'm concerned, Google are entitled to show ads if I want to use their (brilliant) product. And good on Google for not trying to kill Scroogle, which they probably legitimately could for excessive usage.
to parse google results one could use hand-coded C.
one could also use lex and yacc to auto-generate an even better filter.
or one could use sed.
or one could use m4.
or one could use snobol.
not difficult at all. just different, and not hyped.
regexp is only so powerful.
that's why we have BNF.
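A small Python sketch of that last point, under stated assumptions: the grammar below is a toy that only knows nested div tags, and the names tokenize, parse_element and parse are made up here. It is a hand-written recursive-descent parser of the kind a BNF grammar describes, shown next to a naive regexp that trips over nesting.

```python
import re

# Toy grammar (an assumption for illustration), in BNF:
#   <element> ::= "<div>" <content> "</div>"
#   <content> ::= "" | <text> <content> | <element> <content>

TOKEN_RE = re.compile(r"<div>|</div>|[^<]+")


def tokenize(source):
    """Split input into tag and text tokens; other tags are not handled."""
    return TOKEN_RE.findall(source)


def parse_element(tokens, pos):
    """Parse one <element> starting at pos; return (children, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != "<div>":
        raise ValueError("expected <div> at token %d" % pos)
    pos += 1
    children = []
    while pos < len(tokens) and tokens[pos] != "</div>":
        if tokens[pos] == "<div>":
            child, pos = parse_element(tokens, pos)  # recurse for nesting
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    if pos >= len(tokens):
        raise ValueError("missing </div>")
    return children, pos + 1


def parse(source):
    """Parse a whole document consisting of a single nested element."""
    tokens = tokenize(source)
    tree, pos = parse_element(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing input")
    return tree


nested = "<div>a<div>b</div>c</div>"

# The naive regexp stops at the first closing tag and loses the structure.
naive = re.search(r"<div>(.*?)</div>", nested).group(1)

# The grammar-driven parser keeps the nesting as a tree.
tree = parse(nested)
```

Here naive comes out as the fragment "a<div>b", while tree is the nested list ["a", ["b"], "c"]. lex and yacc generate this sort of tokenizer and parser from a grammar file instead of having it written by hand.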