The long-term future of proxy server Scroogle is seriously in doubt, according to operator Daniel Brandt. Scroogle has scraped Google pages since 2002 by piping results through an anonymizing server. By using the service, surfers could remain anonymous, but more importantly use Google without the compulsory 40-year cookie. The …
but more importantly use Google without the compulsory 40-year cookie.
Really? I just set it to be deleted on exit. The cookie is no big deal. But the logs Google keeps I can't do anything about from my end. Scroogle makes the logs mostly useless when it comes to connecting search history to a user, which is far more important than a cookie that the user can delete.
Damn, not again :-( I don't see what is so hard about proxying the Google page just like any other page. There are at least jmarshall's CGI proxy or Glype in PHP, and both are capable of doing the job on every page. A slight modification should work for Scroogle just fine; no matter how they change the interface, HTML will still be HTML.
Not a regular proxy
A regular proxy will only change the IP address seen by the server. The client-side code will be the same, with cookie management and user search logging. Scroogle works differently: it "reads" the Google results page and creates a new page without Google's client-side code. You only get the results, not the whole page.
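The difference can be sketched in a few lines of Python. This is only an illustration of the idea, not Scroogle's actual code (which, according to its operator, was written in C). It parses a page, keeps only the link targets and their text, and builds a new page with no scripts at all. The class and function names are made up for the example.

```python
from html import escape
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from <a> tags, ignoring script content."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs
        self._href = None        # href of the <a> currently open, if any
        self._text = []          # text fragments seen inside the open <a>
        self._in_script = False  # True while inside <script>...</script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

    def handle_data(self, data):
        if self._href is not None and not self._in_script:
            self._text.append(data)


def rebuild_page(source_html):
    """Return a new, script-free HTML page listing only the links found."""
    parser = LinkCollector()
    parser.feed(source_html)
    parser.close()
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(escape(href, quote=True), escape(text))
        for href, text in parser.links
    )
    return "<html><body><ol>\n{}\n</ol></body></html>".format(items)
```

Whatever the original page did on the client side never reaches the browser, because the output is built from scratch rather than forwarded.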
"...but more importantly use Google without the compulsory 40-year cookie."
There is not a single Google cookie on any of my machines, even though I have used Google on quite a few occasions in the last five years. That's not to say Google do not know what I search for when I do use them, I have a static IP.
I will be sad to see Scroogle scroowed by Google.
Please expand on "compulsory 40-year cookie"?
Does google use a different type of cookie than the usual browser cookie, or is Firefox lying to me about not having any cookies stored that relate to google (or any other site)?
Alternative google source
That is all.
40 years, in perspective.
40 years ago in July, the Intel 4004 "first CPU on a chip" was about a year from announcement, and about a year and a half from real production. I was still using relays, tubes and wire-wrap to learn about computer circuitry at home (thanks, Pop! :-).
In and around the Summer of 1970 ... Nixon was PotUS. Edward Heath was the newly minted PM. Kent State. Vietnam. Apollo 13 and Soyuz 9. Ford Pinto. Hendrix dies. Joplin dies. The Who's "Tommy" at The Met. Bob Dylan goes electric at The Isle Of Wight. Aswan High Dam ... That's off the top of my head, in no particular order.
40 years is a VERY long time. Especially looking back on it. Somehow I doubt google, as an entity, will last that long ... at least not as anything more than a verb meaning "to use computer resources to search" ...
And what, other than legislation, is going to disappear Google? Remember they've gone from being nobody ten years ago to one of the biggest damned companies in the world. And what do they like to do most? Buy out startups that look like they might make a lot of money, usually before they are actually worth anything. Pretty much the same business model as MS. And how long have MS been around? I think they'll still be around in ten years time.
"what, other than legislation, is going to disappear Google?"
Public realization of the futility of it all.
Remember, google doesn't actually do anything. Think about it.
Re : Oh really
In the 70's and 80's IBM almost ruled the computing world. It was said that no one ever got fired for buying Big Blue. IBM was _the one_ company that you could depend on to still be in existence at the millennium.
Now IBM is big in the (remaining) mainframe market and only a small player elsewhere.
Even the mighty can fall, they just make more noise when they do.
And use what instead?
Name a viable alternative and I'm sure large numbers of people will switch...
Six words and a paragraph.
If you say "Bing" then you're just swapping one bunch of control-freaks for another -- which may be OK to you but isn't really an answer.
I'll admit Wolfram Alpha sounds good, but it's hardly useful for finding hotels or flights, for example.
It's all well and good saying "don't use google" but when it's the best search engine for a lot of things then deciding not to is just cutting off your nose to spite your face.
The point is...
...that many people manage to get by without google just by putting in a little effort. Different search engines for different things. The idea of a single search engine for the whole internet might have worked ten years ago, but it's a nonsense now. Search engines need to specialize and Google is crap at that.
The thing that puzzles me is that people rely on Google to find information. Google's good if you want to spend money, find torrents or look at porn, but if that's not what you're after the results will be skewed against you. Enter a phrase and you'll get a couple of pages of things that contain parts of the phrase; three pages down you find a page with your exact search phrase. What's useful about that? Some applications (Office!) have a tendency to try to be helpful by helping you do something you didn't want to do, but you can turn it off. Tried turning this sort of behaviour off on Google? Google is so heavily skewed by advertising that it's next to useless as a tool for finding hard information. Then there's the idea of "Googling" a person. Crap. Much better to use something like Pipl if you want to find somebody.
If you object to your searches making money for monolithic multinationals like Google or MS then you could try throwing your search pennies at charity. There's always everyclick.com if you're the kind of pinko scum that objects to Schmidt and Ballmer making even more money that they don't need. Now that, Mr Schmidt, is not being evil.
Where's the Schmidt icon?
Both with the advantage of keeping your IP out of _every_ server log
google scrapers - just don't say who you are ...
You can send a plain old HTTP GET to www.google.co.uk and get about 36KB of HTML back - *IF* you don't bother sending a user agent string (that's why you get a 403 Forbidden if you try to use wget).
Lose the ~15KB before the <ol> tag that starts the list and the ~12KB that follows the </ol> which finishes it.
That leaves you with 10KB containing your 10 line items from the original results page. Easy enough to parse, but I'm not sure you even need to - you can just pass it all back as a piece of fairly clean html.
I used a bit of Smalltalk (what else?) to test it. Sorry I don't have time to set up a webserver to demo it, I'm a bit pressed for time this w/e.
query := 'el+reg'.
rStream := (HttpClient get: 'http://www.google.co.uk/search?hl=en&q=',query) value byteSource contents readStream.
rStream upToAll: '<ol>' asByteArray. "discard"
result := rStream upToAll: '</ol>' asByteArray.
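For anyone who doesn't read Smalltalk, the trimming step looks roughly like this in Python. It is a sketch under the same assumption as the snippet above, namely that the results sit between a bare `<ol>` tag and the next `</ol>`. It works on a page you have already fetched, and the function name is made up for the example.

```python
def extract_result_list(page_html):
    """Return the markup between the first <ol> and the next </ol>, or None."""
    start = page_html.find("<ol>")
    if start == -1:
        return None  # no result list on this page
    start += len("<ol>")  # skip past the opening tag itself
    end = page_html.find("</ol>", start)
    if end == -1:
        return None  # list was never closed, so don't guess where it ends
    return page_html[start:end]
```

Returning None instead of an empty string makes it easy to tell "no list found" apart from "list found but empty", which matters if the page layout changes.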
Thanks for the suggestions, but...
I appreciate suggestions from Scroogle users, but there are issues that many of them have missed.
As far as I can tell, the www.google.com/pda, www.google.com/xhtml, and www.google.com/m (mobile) searches only serve up a maximum of 10 results per page. The www.google.com/search can do 100 results with the num=100 parameter. So can the news search and the blog search. This parameter has been a Google staple for ten years.
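As a small illustration of the num parameter, here is one way to build such a search URL in Python. It only constructs the string; it does not fetch anything, and it makes no claim about what Google accepts today. The function name and defaults are assumptions for the example, and the hl and q parameters are borrowed from the Smalltalk snippet earlier in the thread.

```python
from urllib.parse import urlencode


def build_search_url(query, results_per_page=100,
                     base="http://www.google.com/search"):
    """Build a search URL with an explicit results-per-page (num) value."""
    params = {"hl": "en", "q": query, "num": results_per_page}
    # urlencode handles escaping, e.g. spaces in the query become '+'
    return base + "?" + urlencode(params)
```

For example, `build_search_url("obama", 20)` gives a URL ending in `q=obama&num=20`.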
The so-called "simple interfaces" everyone is recommending to me are pathetic in this respect. I like 100 results per page, and there is no way I'd do 10 successive fetches just so I can put them all in one package.
Also, I compared the standard Google output page with the output=ie page that I was using, for the same search term and using 20 links per page. My search was for "obama". The output=ie page came into my server 7,201 bytes and the regular page came in at 63,070 bytes. The regular page is more than eight times the bloat! Scroogle was doing over 300,000 scrapes a day with six or fewer servers, and had performance issues to consider.
Everything at Scroogle was written in compiled 'C' for speed and efficiency. There is no way I would use Perl, or PHP, or Python.
The other issue worth mentioning is that the output=ie interface skipped not only the ads, but all that "Universal Search" stuff that Google added three years ago. I'm talking about news links, book search links, image links, Youtube links, and whatever else Google uses to make you click harder for clean "organic" (I call them "generic") links from non-Google sites. The output=ie dates back to the day when Google wasn't making 97 percent of its revenue from ads. In fact, they were just starting to get interested in ads when it began. It hasn't changed since then, which is probably why it had to come down.
Scroogle will not return unless Google brings back the output=ie interface.
... there doesn't seem to be a way to make these more lightweight interfaces return a whole block of 100 results at once. The size of 100 'obama' results is nearly 130KB and the links are about 94KB - so about 25% froth (see icon).
So you're looking at 40GB/day in and probably at least 30GB/day out at those sorts of volumes. That's a lot of bandwidth to be running pro-bono.
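The back-of-envelope sums, written out in Python so anyone can plug in their own numbers. It assumes every one of the 300,000 daily scrapes is a 100-result page of the sizes quoted above, which is a simplification, and it uses decimal gigabytes.

```python
SCRAPES_PER_DAY = 300_000  # figure quoted earlier in the thread
PAGE_KB = 130              # approx. size of a 100-result page
LINKS_KB = 94              # approx. size of the links alone


def gb_per_day(kb_per_request, requests_per_day=SCRAPES_PER_DAY):
    """Daily volume in decimal gigabytes (1 GB = 1,000,000 KB)."""
    return kb_per_request * requests_per_day / 1_000_000


inbound_gb = gb_per_day(PAGE_KB)               # traffic fetched from Google
outbound_gb = gb_per_day(LINKS_KB)             # traffic sent on to users
froth_share = (PAGE_KB - LINKS_KB) / PAGE_KB   # fraction that is not links
```

That works out to roughly 39GB in and 28GB out per day, with a bit over a quarter of each page being froth, so the same ballpark as the rounded figures above.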
Great to hear back from you here though, I think I speak for quite a few people in thanking you for making Scroogle available (again, see icon) - it was great while it lasted.
PS: I can't speak for perl, python or php - but I would have thought any decent language could cope with the proxy work - Smalltalk certainly could :-) - it's always going to be the bandwidth that's a problem - the HTML isn't as easy to parse as it should be - but it certainly isn't hard.
Kudos to Scroogle.org
With much respect for Mr. Daniel Brandt and the philosophy behind Scroogle, the whole problem of not being able to parse Google's results page because it's too complicated is quite lame.
I'm saying that because my profession is programming, and writing a 'generic' scraper is a piece of cake.
Nevertheless, I really would like to see Google providing the old result scheme to Scroogle, so at least it shows a little bit respect for its users' privacy.
And in the meantime, I'll definitely stick with some other search engine, since in Google I do NOT trust.
Nobody with any common sense trusts Google, but it's funny how many people say they don't like Google and then use them anyway. I know plenty of people who rant about Google and then slip back to using it. Not because they think it's great, but because it's the default search engine in their browser. Laziness is a pretty lame excuse.
Another great example of this is Streetview. So many people rant on about it being an invasion of privacy, even to the extent of having their own home removed from it, and then use Streetview to look at other places. One of my neighbours did just that: they had their home removed (and because of the way Streetview works the house opposite went as well) and then one day I saw them using Streetview. Their justification was that it was a convenient way to find a particular house they would be visiting later. Hardly fits in with their view of it being an invasion of their privacy, does it?
If the default search engine is such an issue for some people, wouldn't it be a good idea if the browser vendors had a search engine election screen, just like MS did for browsers? I know Mozilla wouldn't like it since they receive a lot of funding from Google, but I suspect that plenty of users don't even know there are alternatives to Google.
i have even more respect for brandt knowing he doesn't use perl, python, etc. i can see why he'd be reluctant to make changes to his parser.
but still, i can get pretty close to scroogle's clean design just using sed(1). what makes scroogle great is not merely the clean design. (best parser i've seen is the one written by students at charles uni in prague from scratch using flex or bison, i forget which, for what became the links(1) browser.)
rather, it's the SSL connection, the POST method, the lack of persistent cookies and most importantly the effective "pooling" of client source IP's that make scroogle special.
it's becoming more and more difficult to get generic, worldwide search results with google. and that's really sad.
GLB is fundamentally flawed: DNS servers can be located anywhere. trying to "localise" the internet is counter to the underlying theory of a global packet-switched network. the whole point of it is that there are no "centres" nor fixed paths. but Google, CDN's and the like have other ideas. "localisation" may benefit marketers and advertisers. shortest path is not the only path, nor is it always the best path. go back and read paul baran's original research.
i hope van jacobson's "content-centric" idea sticks. because the current "location-centric" paradigm is really annoying.
all that said, i still like google and all they've done. they are not evil, they're simply running a business. and i appreciate that they allow scroogle to exist. scroogle, despite its ill-chosen name, is a great service. cheers to daniel brandt.