The two database gurus whose blog produced a storm of protest over their criticism of Google's MapReduce technology last month have hit back with a robust defense. In their latest joint posting, Michael Stonebraker and David DeWitt have responded to four specific issues raised by their critics: that MapReduce is not a database and …
Well if no one else is going to comment ....
It always amuses me that people call them databases. Yes, there is a logic behind the name, but it is better to understand that what you want is a datastore.
Now, a flat file is always faster than a database if the data is to be read sequentially and all other things are equal. I did read an amusing post where someone used to believe this and then was amazed at the speed of the database versus the flat file. Put quite simply: shove the flat file onto a RAM disk and then do the test again.
Some databases use a form of compression, and whilst that can introduce an overhead, it can also outperform uncompressed data at times. So yet again, if that trick is being used, then compress your flat file too.
A database really is a suite of data access and storage functions; there is no more magic than that. And yes, of course it becomes very complicated when looking for a generic and general way of storing and presenting all forms of data and processes. A DBMS has to work as an ecommerce site one moment, a picture store the next, an airline booking system, an accounting system, a collection of searchable URLs, etc.
And therein lies the nub of this debate. If you are making a database, or shall we use the term datastore, and that datastore is meant to be huge, accessed by many people all at different times, and you are using many cheap connected nodes to power it, you would be a fool to install a standard DBMS. Oh, the headaches you would give yourself, and your organization.
Conversely, you would be foolish to install anything other than a standard DBMS if you are building an application that will be accessed simultaneously by only a hundred or so people.
The only thing I can think is happening here is that people believe they should follow what Google is doing. Whilst there is a lot to be learned from Google, there is no point in going against the grain and copying all the features they use, because frankly you will be using ideas just not suited to your project.
Google does not have to be accurate in their search results; their primary concern is to shift huge amounts of data about. And by accuracy I am referring to the ordering: they may have a tolerance of, say, 90% in that department, whereas in an accounting system you may be kinda miffed to order by price and have the results spewed all over the place, so the accuracy required there is 100%.
Horses for courses, really. The mapreduce technique is valid today when dealing with huge datastores; as performance increases and costs drop, the traditional DBMS will again become the better approach for databases of that size. Of course, if data storage requirements outpace the performance increases, then mapreduce-style ideas will still have a role to play. I suspect that mapreduce will just be rolled into the DBMS as yet another feature; this is the power of MySQL with its swappable engine model. But what do I use? PostgreSQL. I want to know where I am with the data.
According to the Google bumf, MapReduce users "specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key."
I'm too hungover to unpack that properly, but it sounds like exactly what an RDBMS does (q.v. Codd 1970).
At least, what an RDBMS does to a properly normalised dataset anyway. So for my money, there is a valid case for comparing the two.
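To make the comparison concrete, the quoted definition can be sketched in a few lines of Python. This is a toy single-machine illustration of the map/shuffle/reduce pattern only, nothing to do with Google's actual distributed implementation; the visit data and function names are made up:

```python
from collections import defaultdict

def map_fn(record):
    # map: process an input record, emit intermediate key/value pairs
    url, visitor = record
    yield (visitor, 1)

def reduce_fn(key, values):
    # reduce: merge all intermediate values sharing the same key
    return (key, sum(values))

def map_reduce(records, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for record in records:
        for k, v in map_fn(record):       # map phase
            intermediate[k].append(v)     # shuffle: group by intermediate key
    return [reduce_fn(k, vs) for k, vs in sorted(intermediate.items())]

visits = [("/home", "alice"), ("/news", "bob"), ("/home", "alice")]
print(map_reduce(visits, map_fn, reduce_fn))
# → [('alice', 2), ('bob', 1)]
```

Which is, of course, just `SELECT visitor, COUNT(*) FROM visits GROUP BY visitor` spelt longhand, hence the valid comparison.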
<misty pointless reminiscence>
I recently spent about seven years working with (amongst many and various other things) various "SQL engines" and fairly humongous datasets, and I don't think I ever saw a properly normalised dataset; it seems to be something that people have difficulty with. In fact, if I cast my mind back to the halcyon days of university, ISTR that most of my fellow CS students had difficulty wrapping their heads around the normalisation process.
Mind you, in all fairness, many of them were utter fuckwits in any case.
</misty pointless reminiscence>
Paris, because I'm badly hungover and talking bollocks, and this must be what she feels like most of the time, only richer.
Amanfrommars, is that you?
It's the purple tux-tail jacket with the squirty lapel flower...
So let me get this straight.
They take a MapReduce job over raw page visits and show how expensive it is to calculate the highest-ranking visitor.
Then they take the same data, which this time has apparently already been processed and inserted into a database, with the indexes on that table already built and correct for the task they want to do.
Then they say 'look how easy it is in a database'. And we're supposed to be so stupid we don't notice?
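For what it's worth, the hidden cost is easy to see with SQLite; a hypothetical visits table, not the actual benchmark data, and the index name is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, url TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [("alice", "/home"), ("bob", "/news"), ("alice", "/news")])

# The part the comparison glosses over: loading the data, plus an index
# build that itself must scan and sort every row before any query runs.
conn.execute("CREATE INDEX idx_visitor ON visits(visitor)")

# Only then does the 'easy' query look easy:
top = conn.execute(
    "SELECT visitor, COUNT(*) AS n FROM visits "
    "GROUP BY visitor ORDER BY n DESC LIMIT 1").fetchone()
print(top)  # → ('alice', 2)
```

On three rows the load and index cost is nothing; on raw web logs it is exactly the work the MapReduce figure was charged for.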
What is the problem here? Databases are fine for data that can be indexed,
i.e. X > Y AND Y > Z IMPLIES X > Z.
Standard RDBMSes are only good for data that can be indexed AND is of a known type, e.g. integer, string, etc., because an RDBMS can't index data types it doesn't know about.
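That transitivity condition is precisely what lets an ordered index work at all. A minimal sketch using a sorted list as the index, with `bisect` standing in for a B-tree (the function name is invented for illustration):

```python
import bisect

# A sorted list is the simplest ordered 'index'; it only works because
# the keys obey a total order (x > y and y > z implies x > z).
keys = sorted([42, 7, 99, 7, 13])          # build the index: O(n log n)

def count_equal(keys, k):
    # Binary-search both ends of the run of k: O(log n) per lookup,
    # instead of scanning the whole list.
    lo = bisect.bisect_left(keys, k)
    hi = bisect.bisect_right(keys, k)
    return hi - lo

print(count_equal(keys, 7))   # → 2
```

Hand the index a type whose comparison isn't transitive (or isn't defined at all) and the binary search silently returns rubbish, which is the point: no known ordering, no index.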
So you would not use a database if:
1. The data doesn't lend itself to indexing
2. The indexing is more expensive than the time saved
3. The type isn't known to the database and it's not worth writing your own B-tree indexing.
4. You don't trust the database vendor's claims of bug-free code for their latest 10.2.3.443 version, since it was only last week that 10.2.3.442 was out.
5. You don't want to.
Option 5 is very important, because the individual programmer sitting with the individual set of data is in the best position to decide the approach that should be taken, and should be free to apply a mapreduce, not just a database, if they choose.
Can we kill the whole Database = RDBMS debate? Relational DBs are only one type of database, one that happens to be dominant for the time being and in a certain market space. "Database" starts with flat files iteratively processed, includes the basic optimisation of hash tables, embraces MDDBs, and covers the gamut of information storage and retrieval concepts, both generic and purposed.
In no way do I want to attempt any analysis of MapReduce, but trying to short-circuit criticism with "We aren't a database, don't use your database mumbo-jumbo on us!" is just stupid. Argue on substance and leave the semantics out, or don't argue at all.