Cutting edge Big Data projects might seem the sole preserve of big name multinationals and government organisations but the democratisation of these next gen analytics capabilities is coming soon to an SMB near you, according to Intel. Speaking at the APAC launch of the Intel Distribution for Apache Hadoop, Chipzilla’s global …
So is Intel going to configure it all for the SMBs as well? From what I can tell the barrier to using Big Data techniques is not being able to run the free open source software but rather the complete lack of people with the skills and knowledge to use the software stack. Putting it in the cloud won't help, and neither will bundling it into a distribution.
Hadoop is like Crystal Reports - it's only useful if you have a goal. Telling a CEO that you can do analytics will get you nowhere if you can't back it up with examples.
But most businesses do better if they have analytics. Chicken and egg. So:
Step 1) Collect all the data you can
Step 2) Start interrogating it
Step 3) Alter your business/marketing practices based on what you discover
I have a dozen companies Microsoft could use for case studies. (Were they willing to front some hardware! I don't have hadoop-class anything lying around.) That said...a lot of these companies already do analytics. Using PHP. And MySQL. Dear god, I am about to move the FIRST of these SMBs to an SSD for the MySQL database! Standard SQL databases will hold pretty much all the data these companies actually use.
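For a sense of what "analytics on a standard SQL database" looks like at this scale, here is a minimal sketch. It assumes a hypothetical `sales` table with made-up rows, and it uses Python's built-in `sqlite3` as a stand-in for MySQL so it runs without a server; the query itself is plain SQL that should carry over more or less unchanged.

```python
import sqlite3

def revenue_by_region(rows):
    """Return [(region, total_revenue)], highest revenue first.

    `rows` is a list of (region, product, amount) tuples. The table
    name and columns are hypothetical, chosen only for illustration.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT region, SUM(amount) AS total FROM sales "
        "GROUP BY region ORDER BY total DESC"
    ).fetchall()
    conn.close()
    return result

# Invented sample data, not from any real company.
sample = [
    ("Alberta", "prints", 120.0),
    ("Ontario", "prints", 80.0),
    ("Alberta", "canvas", 200.0),
    ("Ontario", "albums", 50.0),
]
print(revenue_by_region(sample))
```

A single `GROUP BY` like this is the sort of question sales and geographic data usually gets asked, and it needs nothing beyond one ordinary database box.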
You've got a long way to go to sell me on the necessity of that. Sure, the same company we're moving to the SSD for the MySQL database has potentially 100TB of data per year coming in. Most of it, however, is imagery. Can you even imagine what you'd need to do image-based analysis to extract things like "what are most people taking pictures of" etc?
Yeah, so we stick to sales data, geographics....if we get really ambitious we could pull metadata from the images and analyse that. But where's the ROI in pulling apart the images, scanning for "pictures of babies, pictures of landscapes, pictures of cars" etc. Will knowing what people are shooting produce more of a revenue bump than the cost of the nuclear substation and small shopping mall we'd need to crunch the data?
Hadoop for SMB? WHY?
That's the beauty, Trevor: Hadoop-class hardware is exactly what you do have, because it scales in a linear way with hardware, so you can use what you have. You're right on the value of the analytics, but the cost of a man to write the stuff is so prohibitive that it's a joke at the moment for an SMB to even consider. The exception being if they can find a placement student (intern in your speak? Sorry, don't know Canada; over here we often take a year out from University for work experience) who happens to be enthusiastic, very clever, and doesn't realise their true value yet. In fact, people who don't know their value will likely be the catalyst for this whole shebang.
The idea that Hadoop is "cheaper" is a myth. Hadoop solves the "expensive server" problem by spamming a whole bunch of shitty consumer-grade hardware at the problem. If you do the research into the subject and talk to the right people, there is rather a lot of dissension as to whether or not this actually results in an overall price drop.
You see, the expensive databases (Oracle, DB2, etc) are really tightly coded to the hardware for performance. They aren't perfect, but they are a hell of a lot more efficient than Hadoop. Plus, you generally get away with doing what you need to do on a single (or smallish number of) exceptionally powerful boxes. This drives down your power, cooling, space and networking bills by quite a bit.
You can overcome some of the inherent limitations with Hadoop if you have shit-hot programmers, but as you pointed out, SMBs don't. What's more, as the traditional DB folks are being kicked out of the higher end positions thanks to Hadoop actually being useful (and cheaper) when you get to petascale, the cost of the expertise required to do Neat Things with traditional databases is plummeting.
I have on hand a handful of systems that could theoretically be Hadoop nodes. They would be exceptionally shitty Hadoop nodes and they wouldn't come anywhere close to providing the compute, IOPS or network bandwidth required to do the imagery analysis discussed above. Assuming, of course, I could find a dev to program it.
The ability to use consumer hardware doesn't mean it's cheaper. It means it scales out in a more linear fashion. When you have a small scale budget, limited space, limited cooling and big requirements, Hadoop just isn't the thing.