Big Data is all the rage. Now if only someone had a clue what to do with it. According to a new survey of senior executives by Big Data consultancy NewVantage, Big Data is "top of mind for leading industry executives," but these same executives struggle to find the right people to analyse their data. In fact, while 70 per …
"Big Data is all the rage. Now if only someone had a clue what to do with it."
Says it all.
Just ignore the salesmen.
what to do
Wrong way around :)
The question is ... what do you want to do with big data for your business? Maximum profit, kinder to the environment, happy employees, to understand our customers?
The new SQL warehousing features in the upcoming version of Informix (March 26th) are amazing.
Now if only I could find someone wanting to use it :)
Are any of these "data scientists" reportedly sought out by industry actually supposed to end up doing anything that a more traditional scientist might regard as science? E.g., "data R&D", or "data theory", "data experiments", or whatever might make sense ... and what would they be?
I'm not trying to make a point either way, I'm just curious as to what industrial data scientists might actually be expected to do.
If I were to take IBM's word for it ...
Then "data scientist" means "not a scientist of any kind, but is hoping that it makes them sound good and that no one will expect any actual science from them"
I'm not even that impressed by Imperial's MSc Data Science & Management writeup -
Perhaps like traditional scientists combine bits of DNA that would never get together (like putting bacteria DNA into corn to make it Roundup Ready) data scientists could combine data in new and stupid ways.
data scientists could combine data in new and stupid ways
We've already been seeing the work of "data scientists" in databases like shopping carts.
People who bought this router also bought:
Well, I think this is exactly what you get when you do not employ a data scientist, but use "magical" software off-the-shelf.
This would be funny if it were not so sadly true.
I'm just curious as to what industrial data scientists might actually be expected to do.
In theory, typical "data scientist" tasks might include working on better classification algorithms or predictive models in the particular problem domains and with the particular data of the organization they work for. In principle "big data" is a filtering or compression problem: find a signal in a lot of noise.
Much of this involves understanding the features and weaknesses of various algorithms and approaches. That's a common topic on Vincent Granville's blog, for example; see his "The Curse of Big Data", or "What MapReduce Can't Do", or "The 8 worst predictive modeling techniques".
In practice, I suspect a "data scientist" in industry often ends up primarily wrangling some sort of Hadoop or data-warehouse installation and running some fairly trivial analyses on it - naive sentiment analysis on consumer reviews or some such. In academia, a "data scientist" may be someone with a CS or applied maths background developing new algorithms, but often it's someone in the humanities simply applying well-known techniques to large corpora. Of course these people don't always call themselves "data scientists" (in fact, in academia, at least in the US, folks in the humanities are more likely to avoid the "scientist" label, for various reasons).
But the real value in "data science", I think, is in the sort of thing Granville is talking about: being able to say, "look, for what you're hoping to achieve here, this goofy little naive Viterbi-algorithm approach or whatever is not likely to produce very meaningful results, and I know that because I've done some sampling of your data and it doesn't fit the distribution your algorithm assumes". Or "no, this magic Hadoop thing will not help you do the thing that you saw on NCIS last night". It means understanding probability and statistics, and a wide range of the algorithms used for classification and predictive modelling (and paying attention to new work in this area, which is moving quite rapidly), and what's available in terms of hardware and software today to implement that stuff. And, importantly, some understanding of the data and problem domain. So a real data scientist is something of a mathematician and something of a computer scientist and something of an IT guy and something of a domain expert. And had better be a skeptic.
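To make the "I've done some sampling of your data and it doesn't fit the distribution your algorithm assumes" point concrete, here is a minimal Python sketch. It assumes the algorithm in question expects roughly normal data, and it uses sample skewness and excess kurtosis as a crude screen. The function names, the tolerance of 0.5 and the two synthetic data sets are made up for illustration; a real check would use a proper goodness-of-fit test and the organisation's actual data.

```python
import random
import statistics


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return skew, excess_kurt


def looks_roughly_normal(values, sample_size=2000, tolerance=0.5, seed=0):
    """Crude screen: sample the data and check both shape statistics are near 0.

    The tolerance is an arbitrary illustrative threshold, not a standard one.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    skew, excess_kurt = skewness_and_excess_kurtosis(sample)
    return abs(skew) < tolerance and abs(excess_kurt) < tolerance


# Synthetic stand-ins for "your data": one bell-shaped, one heavy-tailed.
rng = random.Random(42)
bell_shaped = [rng.gauss(100.0, 15.0) for _ in range(50_000)]
heavy_tailed = [rng.lognormvariate(0.0, 1.0) for _ in range(50_000)]

print("bell-shaped passes screen:", looks_roughly_normal(bell_shaped))
print("heavy-tailed passes screen:", looks_roughly_normal(heavy_tailed))
```

The point is less the particular statistic than the habit: look at a sample before trusting whatever the off-the-shelf tool assumes.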
(And this, incidentally, is why I think Matt's claim in this piece and similar ones is unpersuasive. Sure, Cloudera and other firms can try to sell "big data applications", and some of them will probably be pretty successful at it, but that sort of generic approach is unlikely to be very useful for most customers.)
The alternatives don't sound good
While you can train a reasonably competent person to perform some basic analysis, and perhaps use a few key techniques, if you look at a proper course on Data Analysis (such as the one running on Coursera at the moment), you'll see that there is a wide range of techniques, complex statistical underpinnings, and many things that you can do wrong.
If you have someone who knows the business, but only has some training on how to use a few tools, they won't know about the rights and wrongs of data cleaning, the various issues that can introduce bias, how to correctly estimate confidence levels, etc. Since you can often make statistics seem to back up a range of conflicting viewpoints just by biasing the selection of data, a lot can go wrong if you assume that what a proper data analyst has studied can easily be learned from a couple of short training courses.
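As a rough illustration of how biased selection can make the numbers say what you want, here is a small Python sketch. Everything in it is hypothetical: the "population" is synthetic noise centred on zero (no real effect), `mean_with_ci` is a helper invented for this example, and the interval is the simple normal-approximation one with z = 1.96.

```python
import math
import random
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, low, high): a normal-approximation 95% interval for the mean."""
    mean = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width


rng = random.Random(7)
# Hypothetical population: change in spend per customer, centred on zero.
population = [rng.gauss(0.0, 10.0) for _ in range(100_000)]

# An honest random sample of customers.
random_sample = rng.sample(population, 1_000)
# A biased selection: keep only the customers whose spend went up.
cherry_picked = [x for x in random_sample if x > 0]

print("random sample:  mean=%.2f  interval=(%.2f, %.2f)" % mean_with_ci(random_sample))
print("cherry-picked:  mean=%.2f  interval=(%.2f, %.2f)" % mean_with_ci(cherry_picked))
```

The random sample should give a mean close to zero, while the cherry-picked subset reports a comfortably positive "effect" with a tight-looking interval, even though both come from the same data. The arithmetic is identical in both cases; only the selection differs, which is exactly the kind of thing a short tools course won't teach you to spot.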
I say this as a programmer with an interest in data analysis, seeing just how much there is to cover in big data technologies, statistical methods, underlying mathematics, statistical programming languages, reporting standards and more. It's a big subject, and I don't think data scientists can be adequately replaced by an existing employee receiving a little training in Hadoop.
I really do hate the term "big data". Partly because of my pedantry (how can data be big?), mainly because it feels like yet another marketing buzz-phrase.
Yeah. I especially used to dislike a particular sub-field of physicists going on interminably about "giant magnetoresistance", and, later "colossal magnetoresistance". So, in my head, I used to rename it "tiny magnetoconductance". :-)
Still, turned out to be fantastically useful for hard drive bit densities, despite the irritating buzzphrase. Buzzphrases tell you little either way.
I agree. This info-deck from Martin Fowler is nice in that it extracts some of the more important concepts about Big Data from the buzz-word itself:-
All too true
"All of which should provide some comfort to those organisations that have been struggling to find data scientists to analyse their data. It may turn out that the 'mythical data scientist' is actually Lily who works one cubicle over."
Really, it is. And so many of those companies trying to find the "mythical data scientist" have so often already let Lily go!
Chasing the something something.
So they have all the underpants and no gnome scientists to turn them into profit.....
Maybe they should question why they collected the data in the first place?
Linux - at least it's got its gnome.
Re: Chasing the something something.
It reminds me of government thinking
"We have all of this data, but it isn't helping us"
"Then we must get more data"
I'm an unemployed data expert
I've been transforming data into business information for 18 years, saving companies billions. Exactly what these guys are probably looking for, but I'm titled as an 'Informix DBA', not SQL Server/Oracle, so I'll always be unemployed and not eligible for benefits, unlike all the made-up CVs from Indians (see this so much) who took some meaningless msoft certification or lied their way into a job.
Funny story in that it's great there are so many jobs and people looking for someone like me, gives me some hope of not starving soon.
Re: I'm an unemployed data expert
Matt, good article. We are seeing an increase in businesses seeking specialized skills to help address challenges that arose with the era of big data. The HPCC Systems platform from LexisNexis helps to fill this gap by allowing data analysts themselves to own the complete data lifecycle. Designed by data scientists, ECL is a declarative programming language used to express data algorithms across the entire HPCC platform. Their built-in analytics libraries for Machine Learning and BI integration provide a complete integrated solution from data ingestion and data processing to data delivery. More at http://hpccsystems.com
Econometricians have been coping with huge data sets for decades, with steadily added new data types, with steadily greater sizes of "huge" as computing power has increased, and with all the attendant problems associated with data collection, data cleaning, interpretation (mapping to data types), statistical techniques, and especially validating and creating models. I started, many decades ago, in statistics and computer science, made a detour into multiple fields of engineering, and in the early Nineties ended up in econometrics, statistical and scientific computing, and experimental design. The people are already out there and I'm quite sure they would absolutely love being lavished with tons of money and perhaps some stock options when they turn in usable, real-world results. The problem is that the job description is, as usual, asking for absolutely the wrong people. There are, as dbbloke pointed out, more than a few people out there with the real-world data-analytic chops to do the job as well. Again, they are writing job requirements (mandatory job titles) that don't match the people with the skills. They SHOULD be writing job descriptions based upon skill-sets. This has always been a problem as a new business specialization comes along.
Long before I studied econometrics, I already had experience in applying "big data" analytic techniques in fields as diverse as the physical and social sciences, logistics (supply chain), financials, and eventually even epidemiology with an eye to (successfully) developing predictive analytics around those problem areas. However, my job "titles" never, ever, included it. I started in 1974 on mainframes and moved along, in government, as the technology progressed. Now if you bother to read my evaluations, especially Accomplishments, it's a different story. All totally unrelated to my stated job, field systems engineer, but something I amused myself with while waiting for a call to arms (literally in my case). So, CIOs need to pull their noses out of the marketers' cracks (you know which ones I'm talking about), list specific job skill-sets required, without reference to marketing hype, and think about exactly what business requirements they want to Meet and Accomplish. [Caps intended, of course.] Sheez, ya'd think they'd larn this by now, but No!