Google Caffeine — the revamped search infrastructure recently rolled out across Google's worldwide network of data centers — is based on a distributed data-processing system known as Percolator. Designed by Google and, until now, jealously guarded by Google, Percolator is a platform for "incremental processing" — a means of …
I knew it!
New York-based engineers. How many engineers just can't cope with life in the smoke-filled room that is Silicon Valley?
Sun (or should I say Oracle) should go all Steve Jobs on their asses and sue them for using coffee-related words.
The rub is the resources!
The rub is the resources. I'd like to see the global performance hit of this new crawl method. As the article says "The rub is that Caffeine uses roughly twice the resources to keep up with the same crawl rate."
More instant, more distributed and redundant crawling...that all relies more on the web sites themselves to serve up the same data over and over again to the distributed multi-headed Caffeine monster.
You're not getting it...
the new method uses the same data input at the same speed (i.e., crawlers), but what is different is what happens to the data once it is on Google's servers. The fact that it is "more distributed" is only on Google's hardware; the fact that it uses double the number of servers to process updates is, again, only on Google's servers.
Think of it this way: if the crawlers can return 10,000 page updates* per hour under the old system, it would have taken 2 to 3 days for those updates to show in Google's index. With Caffeine/Percolator, the crawlers STILL take in 10,000 page updates per hour, but now (using double the number of internal Google servers to power it) Caffeine/Percolator shows those same 10,000 updates in a few minutes or hours.
The speed of the crawlers is governed totally independently of the post-processing, and is probably driven by a strong desire not to piss off web admins by overloading their servers - or risk having robots.txt suddenly become a lot less welcoming to Google's crawlers...
* - a number simply for illustration
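The back-of-the-envelope above can be put in code. All numbers are illustrative, like the footnoted 10,000/hour figure - the point is only that the crawl rate is identical in both columns and only the time-to-index changes:

```python
# Toy comparison of batch vs. incremental indexing latency.
# Every number here is made up for illustration, as in the comment above.

CRAWL_RATE = 10_000               # page updates ingested per hour (same in both systems)

# Old batch (MapReduce-style) pipeline: an update waits for the next full rebuild.
BATCH_REBUILD_HOURS = 60          # a rebuild roughly every 2.5 days
avg_batch_latency = BATCH_REBUILD_HOURS / 2   # on average an update waits half a cycle

# Caffeine/Percolator-style pipeline: each update is processed as it arrives,
# at the cost of roughly double the internal server resources.
INCREMENTAL_LATENCY_HOURS = 0.25  # minutes to a few hours per update

print(f"Batch:       {CRAWL_RATE}/hr crawled, ~{avg_batch_latency:.0f} h to appear in the index")
print(f"Incremental: {CRAWL_RATE}/hr crawled, ~{INCREMENTAL_LATENCY_HOURS} h to appear in the index")
```

Same input rate, wildly different freshness - which is the whole trade the article describes.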
Distributed databases are easy: the difficult bit is doing record locking on a distributed database. So I'm curious what the system is for transactions and locking: how have they avoided that problem?
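For what it's worth, the Percolator paper (OSDI 2010) answers exactly this: they didn't avoid the problem, they built transactions on top of Bigtable's single-row atomicity, storing locks in extra columns next to the data, with one written cell designated the "primary" lock that two-phase commit pivots on. Here is a toy sketch of that protocol - a dict standing in for Bigtable, a counter standing in for the timestamp oracle, and all names/simplifications mine:

```python
# Sketch of Percolator-style snapshot-isolated two-phase commit.
# A plain dict replaces Bigtable; itertools.count replaces the timestamp oracle.
import itertools

_ts = itertools.count(1)  # stand-in for the global timestamp oracle

class Store:
    def __init__(self):
        # per key: 'data' {start_ts: value}, 'lock' {start_ts: primary_key},
        #          'write' {commit_ts: start_ts}
        self.rows = {}

    def row(self, key):
        return self.rows.setdefault(key, {"data": {}, "lock": {}, "write": {}})

class Txn:
    def __init__(self, store):
        self.store, self.start_ts, self.writes = store, next(_ts), {}

    def set(self, key, value):
        self.writes[key] = value

    def get(self, key):
        # Snapshot read: latest version committed at or before our start timestamp.
        row = self.store.row(key)
        visible = [st for ct, st in row["write"].items() if ct <= self.start_ts]
        return row["data"][max(visible)] if visible else None

    def commit(self):
        keys = list(self.writes)
        primary = keys[0]  # the primary lock is the atomic commit point
        # Phase 1: lock every written cell; abort on any conflict.
        for key in keys:
            row = self.store.row(key)
            if row["lock"] or any(ct > self.start_ts for ct in row["write"]):
                return False               # concurrent lock or newer commit: abort
            row["lock"][self.start_ts] = primary  # all locks point at the primary
            row["data"][self.start_ts] = self.writes[key]
        # Phase 2: replace each lock with a write record pointing at our data.
        commit_ts = next(_ts)
        for key in keys:
            row = self.store.row(key)
            del row["lock"][self.start_ts]
            row["write"][commit_ts] = self.start_ts
        return True
```

The real system adds the crash-recovery half: a reader that stumbles on a stale lock checks whether the *primary* lock was replaced by a write record, and either rolls the transaction forward or cleans it up. No central lock manager needed - which is presumably how they dodged the hard part you're pointing at.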
Waste of Resources
What a waste of bandwidth, resources and effort has resulted from Tim Berners-Lee's decision not to make document indexing a function of the original HTTP architecture.
Google's mega PhD-fuelled engineering effort to meet the stated executive megalomania of processing (read: eavesdropping on) "all the world's information" looks unnecessary now that so much data is disappearing behind paywalls or into private academic, institutional or corporate info silos. Google != WWW.
Web search is as good as the content it indexes.
You forgot one thing...
the most notable function that Google provides to the WWW is NOT search - it is advertising revenue. There were other search engines (AltaVista anyone?) before Google. What Google did was index the web in a way that was relevant not just for search users, but also for ADVERTISERS, via their link scores.
This totally transformed the web income model, as now there was an objective metric for a site's popularity besides mere page hits. It also allowed semantic linking of a web site to relevant content, via AdWords.
What this has done is enabled about a gazillion web sites to be advertising funded, rather than paywalled, subscription-based, or simply not funded at all. Which has dramatically increased the move of a whole lot of content online in an astonishingly short amount of time. Which has enormously benefited the human race, as now we can get info on things in lightning-fast time that is up-to-date, comprehensive, and even socially filtered. We are connected in ways that a generation ago would seem incomprehensible (and I know, because I span those generations).
THAT has been the uplift of Google's indexing, and one that TBL's specification of a functional index could not have accomplished...
Profitable and successful websites are online because they have a solid stand-alone business model or independent sources of funding.
The growth of freely available online information is a function of extremely low bandwidth and hosting costs, and of the skills and tools required to author web content being easy to learn and easily accessible.
Neither of these are "uplift" from Adwords, Adsense or any other of Google's advertising offerings.
In my opinion, Google's advertising system (ergo their business model) is way overvalued. AdWords does not contribute any significant added value to the web. Sponsored links are mostly irrelevant to the page in which they are being displayed. They look terrible and provide little of interest to the end user. Also, with free tools like Adblock Plus they are trivially easy to filter and block.
> What this has done is enabled about a gazillion web sites to be advertising funded, rather than paywall, subscription, or simply non funded at all.
Maybe this was the original goal; it is also a central pillar on which Google's business model rests - if web content is not open and free to crawl, the value of Google's search index decreases proportionally. However, as a new business model for the web age it has failed - Google's ad system does not significantly monetise the web.
As content owners and creators wake up to the reality of their loss-making existence, it's going to get tough for Google, which has no original content of its own.
Publishers like O'Reilly have removed their catalogue of technical books from Google's index. Numerous e-readers have emerged to offer pay-per-view monetised download of e-books. News Corp are putting up paywalls and blocking the crawlers and third-party news aggregators. Universities and academic research institutions are making their data private and monetising their research by charging subscriptions or pay-per-download. Facebook users are screaming about their personal data being made public...
The world's information will not be freely available, and the idea that it should be organised by a single for-profit corporation is just plain wrongheaded.
Google's awful search engine has contributed to bugger all, frankly. It may be the best we have, but that doesn't actually mean it's not shit. As a search engine it is a minor (but real) step up from AltaVista.
The advertising attached to it is really none of Google's doing. It does not provide an objective measure of popularity, it provides the highly subjective and moronically simple-minded PageRank number, which they calculate because it was all they could think of. You could calculate a checksum for each webpage and it would be just as objective, and about as useful.
Advertising uses PageRank because Google uses PageRank and Google owns the market for searching. If the market was split among engines, then advertisers would use something else. If Google used something else, they would use it. As long as the only search engine is using some metric to decide who appears on their results, then that metric - no matter how braindead or otherwise - will be the only one that matters.
Now we're locked into Google's lousy search system until some sort of miracle occurs and someone with a better algorithm (not hard) gets the resources to set up in competition to them with a comparably sized database (very, very hard indeed). I'm not holding my breath.
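For reference, the PageRank being argued over here is just a power iteration over the link graph. A minimal sketch - the damping factor 0.85 is the value from the original Brin/Page paper, and the toy three-page graph is made up:

```python
# Minimal PageRank via power iteration over a toy link graph.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start uniform
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}   # random-jump share
        for p, outs in links.items():
            if not outs:                          # dangling page: spread evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:                                 # share rank across outlinks
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
# "c" is linked by both "a" and "b", so it ends up ranked highest
```

Simple-minded or not, that recursive "a link from an important page counts for more" step is the one thing that distinguishes it from the checksum comparison above - a checksum ignores the link graph entirely.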
for this fascinating and instructive view of what's behind Google's new search infrastructure ! Articles of this type are the reason why I subscribe to the Reg....
Eh? Where's the rub?
So after listing several positive aspects of Percolator, it says:
"The rub is that Caffeine uses roughly twice the resources to keep up with the same crawl rate."
So what's the rub here? It sounds like yet another advantage. Am I missing something?