Unless you have found a clever way of avoiding the internet completely, you no doubt have been warned that THERE IS A BIG DATA EXPLOSION! By many accounts, we are currently drowning in information - from log files to stock charts to customer profiles - and face a host of new products cropping up to help us manage the onslaught. …
I've been told by serious Home Office types that they're not doing IMP/CCDP snooping because there would be "too much data"; fair enough, I thought, but looking at some of ACPO Ltd's memos on the subject from 2001, they mentioned 2 standard intelligence apps for this!
Whatever happened to Miss Maxwell and her Bayesian data handling suite?
Chiliad, wasn't it? Can't civil society cross over .mil tech in this case to solve the data deluge?
No such problems for the US it seems
Check out the Wired article on the NSA's new data center in Utah. It seems they're grabbing as much as they can get their hands on with no mention of retention/purge policy.
I know what I want. I know reliable internet vendors that sell it. When I visit a vendor's site, my browser spends ages waiting for replies from various analytics companies. I get really bored, and look for the product elsewhere, but other sites are just as bad.
I used to middle click on a bunch of products to open pages in new tabs, start a new search, then, while that page is loading, make purchase decisions about the other tabs. If I try that now, the first middle click causes my browser to lock up and refuse to scroll until the tab loads. The CPU is almost idle, the network connection is not saturated. I am waiting for dozens of remote servers to guess what I want to buy. Please can someone find the webfarters responsible and educate them with a clue stick until they understand quantum mechanics from the point of view of a cat in a box.
Re: Shopping trip
> When I visit a vendor's site, my browser spends ages waiting for replies from various analytics companies
So use tools like noscript/adblock plus/ghostery so that you don't have to wait.
Re: Shopping trip
All of these block access, which causes the vendors' sites to lock up or all the links to point at the current page. Is there anything that spoofs replies from tracking web sites?
Yeh, I read Taleb's book. He has a talent for taking a half-true idea and pushing it beyond its limits. Pretty much exactly what he's complaining others do, in fact. There are ways of handling noise in data and a huge literature on the subject. I doubt if Taleb has read much of it, though.
Taleb's not a guy I'd quote in anything vaguely serious. He's managed to piss off pretty much every economist on the planet because they didn't predict the crisis. The collapse of the American housing market was an outlier. Duh. He's an idiot who thinks that because his Black Swan idea made some people rich on a one-in-a-million bet his theory had cred. It doesn't.
Outliers are acknowledged but dismissed because they're... um, outliers.
It's called the Dunning-Kruger effect. The less you know about a particular subject, the more you believe you are right. Apparently Taleb is a "victim" of it. There are, unfortunately, whole companies of such people. The problem there is that for every little bit of knowledge you gain, you get a lot of pain as you suddenly see what absolute junk you and your colleagues have produced.
More data IS better...
... provided you're talking performance of machine learning algorithms. Turns out the more data you feed your algorithms the more their performance converges.
Of course, big data is so much hype now because it allows lots of vendors to sell lots of spendy hardware to store and munch through ever more data. That's what big data is about. They're selling means and methods to run algorithms on data. That is all.
But if you're not looking for large-scale trends, then suddenly all that data is a threat to your sanity, to the performance of your systems, to everybody else's privacy and peace of mind.
So as the concept is sold to have the results of black-box munging the data drive our decisions, so has marketeering got it arse-backwards again. The most important choice is to pick the goal you're after, and find suitably matching means to get there. To choose what you want to know, then have that drive your discovery process. If you picked right, your resulting decisions should be "better" in some definable way. If not, well, you'll be no worse off than you'd be buying spendy kit for no clear reason, not so?
Didn't go far enough
Matt, I think you hit the nail on the head, but inexplicably failed to apply any downward force.
In most cases, there is not 0.05% signal, or even 0.00005% signal. There is *no* signal. That's a huge qualitative difference, because in the first two cases you might be tempted to look harder (or "smarter", in bullshit parlance) and in the latter case the only sane approach is to stop looking. That's what the first 90% of the article said, and then you went and spoiled it by asking for tools to filter the ocean of raw data.
Wait - what?
WTF does sampling rate have to do with signal to noise ratio? (Well - technically there is a relationship, but it depends on useful signal bandwidth and measurement error. Only a moron who had never heard of Claude Shannon would think that moar samples automatically means less signal and moar noise.)
It's excusable for Asay, who seems pretty clueless about technical abstractions. It's inexcusable for Taleb who has worked in finance and should know what quants are paid to do.
Which point is orthogonal to the fact that many managerial types are easy suckers for any faddy fantasy that promises total predictability and profitable control.
So more data -> good is certainly just as stupid.
But that's not the level the good data mining outfits operate at.
Asay might want to do some basic research about the usefulness of basic research. Otherwise he's just commenting on his own ignorance.
Of course that won't stop CCDP
Surprise: data <> information.
No matter *how* much of it you collect.
Mises' "Human Action" in action?
It all depends on the type of data.
If you have data that converges nicely to a value when you calculate the mean, you are home free. This applies to physical experiments, which are repeatable, can be studied in isolation and where you have a model to check against.
If you have data that is all over the place and strongly depends on intelligent agents and/or random fluctuations you are looking for trouble. It gets worse if you don't even have a model or a clue what to look for. Statistics-using economists and "traders" (basically, swimming-pool attendants in charge of your money) are dead stinking fish once the party is over (that's you, Krugman) - they never knew what to look for in the first place. Same goes for politicians fantasizing about cakewalks in foreign lands and generals poring over the latest metrics about how the war is going before they are shipped back to D.C. Your social dreckwork cannot be far behind.
Something of a point
To elaborate on his stock market example, if you look at prices on a minute-by-minute basis, there's a tremendous amount of random fluctuation (i.e., lots of noise). If you only look at daily closing prices, you have a few orders of magnitude less data to process, and it's just as good for making medium- and long-term predictions. Of course, it's easy to go too far; monthly stock market updates might not provide enough data to extrapolate from with an acceptable degree of confidence. And there are exceptions: an automated arbitrage trading program might be able to make use of price updates as often as every second.
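A toy simulation makes the medium- and long-term part of this concrete. It is only a sketch: the price model (a random walk with a small constant drift), the session length, the drift and the noise level are all made-up assumptions, not market data. What it shows is that the estimated trend over the whole period depends only on the first and last price, so throwing away everything except the daily closes changes nothing for that estimate.

```python
import random

MINUTES_PER_DAY = 390   # assumed length of one trading session
DAYS = 250              # assumed number of trading days simulated

def simulate_log_prices(drift_per_minute, noise_per_minute, seed=0):
    """Return a toy log-price path: a random walk with a small constant drift."""
    rng = random.Random(seed)
    prices = [0.0]
    for _ in range(MINUTES_PER_DAY * DAYS):
        step = drift_per_minute + rng.gauss(0.0, noise_per_minute)
        prices.append(prices[-1] + step)
    return prices

def total_change(series):
    """Sum of consecutive differences, i.e. the trend estimate over the period."""
    return sum(b - a for a, b in zip(series, series[1:]))

# Hypothetical parameters: tiny drift buried in much larger minute-level noise.
minute_prices = simulate_log_prices(drift_per_minute=1e-6, noise_per_minute=1e-3)
daily_closes = minute_prices[::MINUTES_PER_DAY]   # keep one point per day

trend_from_minutes = total_change(minute_prices)
trend_from_days = total_change(daily_closes)
print(len(minute_prices), "minute points vs", len(daily_closes), "daily points")
print(trend_from_minutes, trend_from_days)
```

The two printed trend figures agree up to floating-point rounding, with 390 times fewer points in the daily series. The sketch says nothing about the exceptions mentioned above: anything that needs the short-term fluctuations themselves (arbitrage, volatility) does lose information when downsampled.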
As for log files, many of them are useless, and most will never be looked at. But when security breaches happen, they're essential in figuring out how someone got into the system and what they accessed.
The point here is that companies need to work on collecting better quality data, not more of it.
Re: Something of a point
Actually log files are very useful since you can track your users. If you have enough of them you can make all kinds of statistics. You can not only find out which parts of your site are popular, but also how influential those parts are, such as how much reading a single article is going to change your browsing behavior.
More data is only a problem if you don't know how to process it. I say this from the perspective of having done a lot of it.
It takes 100,000 or more samples to provide any real confidence that a population is Gaussian with only a single mode. It takes a lot more data to provide convincing evidence that there's not one or more other populations in the data. Anyone that doubts this is encouraged to fire up R and run some experiments.
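For anyone without R to hand, here is one such experiment sketched in Python. It is an illustration under assumptions of my own choosing, not the poster's method: a second population is mixed in at 1% with three times the spread, and excess kurtosis is used as the (hypothetical) detector. The sqrt(24/n) figure is the usual large-sample approximation for how much the kurtosis of genuinely Gaussian data wobbles by chance.

```python
import random

def excess_kurtosis(xs):
    """Sample excess kurtosis: roughly 0 for Gaussian data, positive for fat tails."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0

def draw(n, contamination, rng):
    """n draws from N(0, 1), with a `contamination` fraction taken from N(0, 3) instead."""
    return [rng.gauss(0.0, 3.0 if rng.random() < contamination else 1.0)
            for _ in range(n)]

rng = random.Random(42)   # fixed seed so the run is repeatable
for n in (100, 10_000, 100_000):
    pure = excess_kurtosis(draw(n, 0.0, rng))
    mixed = excess_kurtosis(draw(n, 0.01, rng))   # assumed 1% second population
    noise_floor = (24.0 / n) ** 0.5   # approx. chance wobble for Gaussian data
    print(f"n={n:>7}  pure={pure:+.3f}  mixed={mixed:+.3f}  floor={noise_floor:.3f}")
```

The thing to watch is how the "mixed" figure compares with the noise floor as n grows. How many samples are needed depends heavily on how large and how different the second population is, so the numbers here should not be read as confirming or refuting the 100,000 figure in general.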
The Black Swan, which I'm giving up on half way through, is a series of half clever repetitions of, "The statement in this box is false."
The Gaussian distribution extends to infinity. So a "black swan" might easily be from the extremely low probability part of a normal distribution, not an aberration. You need massive amounts of data to decide the case.
Of course, for the clueless, any data is too much.
That's not what Taleb is saying in the Black Swan at all. He's saying that you can't just assume everything is a normal distribution. The stuff from the low probability parts of the Gaussian occurs with low probability - Black Swans occur too often, because they come from non-Gaussian distributions. You do *not* need massive amounts of data to show this - two events out of a thousand, with supposed probability of one in a billion, is sufficient to show you're almost certainly dealing with a non-Gaussian distribution.
Taleb's book is mostly an exhortation for professional statisticians - especially in finance - to remember their training about other distributions and stop pretending everything is Gaussian (or, your version, that you supposedly need so much data to prove it is non-Gaussian that you may as well assume that it is). ISTR there's also some stuff about statistical independence too (e.g. if one person gets foreclosed on, it's likely others will too - foreclosure is not an event independent of wider circumstances) but I presume you're not arguing against that.
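The two-in-a-thousand arithmetic can be checked directly. A minimal sketch, assuming the thousand observations are independent (which, per the point about foreclosures, real data often is not) and taking the one-in-a-billion figure at face value as the per-observation probability under the Gaussian model:

```python
import math

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)   # far-tail terms underflow harmlessly to 0.0
    return total

# Chance of seeing two or more "one in a billion" events in a thousand observations.
chance = binomial_tail(k=2, n=1000, p=1e-9)
print(f"{chance:.2e}")
```

The result is on the order of 5e-13, so if the Gaussian model were right, seeing two such events in a thousand observations would itself be a far rarer fluke than the events are supposed to be. Rejecting the model is the more sensible reading.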
He's kind of right, but approaching it wrong. The more different ways you look at the same data set, the more likely you are to find something that looks significant. Sadly, it's just random noise.
Every time I hear that daft story about Beer and Nappies at a BI Vendor, I want to scream - if you keep looking at the data, you'll find something, it just doesn't mean anything.
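The "keep looking and you'll find something" effect has a simple back-of-envelope form. A sketch, assuming each look at the data is an independent test at the conventional 5% significance level and that the data really is pure noise (real slices of one data set overlap, so treat this as a rough guide, not an exact figure):

```python
def chance_of_false_hit(looks, alpha=0.05):
    """Probability that at least one of `looks` independent tests on pure noise
    comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** looks

for looks in (1, 5, 20, 100):
    print(looks, round(chance_of_false_hit(looks), 3))
```

With twenty different cuts of the data the chance of at least one spurious "finding" is already better than even, and with a hundred it is close to certain.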
Read Stafford Beer
Taleb is retreading waters Beer walked on in 1973 ("Brain of the Firm"). Beer not only explains why storing every bit is irrelevant and pointless, he also predicted the rise of big data. Nearly 40 years ago. In fact, you can see that "big data" has grown, and always will grow, to consume available capacity, until people finally realize that all the data is a useless waste of space. Once you understand how to analyze it, you analyze it in real time then throw it away.
Beer would disagree with Taleb about the frequency of analysis. In Beer's theory, data that is averaged over a year is a year out of date, and hence almost completely useless. It is better to have 95:1 noise:signal than to have data that is too late to act upon. Plenty of statistical methods can cope with 95:1 noise:signal. Not the human brain, perhaps.
I look forward to reading Taleb's new book. His last one was awesome.
The boat has been missed
Interesting article on big data, and yes, it is all teh rage, and yes, information germane to answering the questions one may have is a valuable tool for decision-making and for evaluating the performance of any initiative.
That being said, seems to me the writer of this piece believes the data stands alone, offers answers AND questions, which is the missed boat. It requires someone to understand the right questions to ask whenever any data is to be used as an "oracle of sorts". Spending over a decade in Marketing, I can certainly lend much insight into this. If you do not know what questions need answers, you do not know what data will be appropriate to answer said questions.
In other words, no matter how much data is collected, it still requires a human brain to understand how to apply the right data to get the right answers/insights. This could be a lot of data, or a little data, or somewhere in between. If an answer to a question requires viewing a short-term issue, focused detail data in a short time window is needed (hourly price fluctuations, as an example). If you are seeking a bigger-picture trend for decision-making, a longer period of less detailed information (monthly or weekly, rather than daily or hourly) will suffice.
It isn't the volume of data that matters, but understanding how to use data and what is truly appropriate. That kills the noise that leads to "analysis paralysis" and leaves you with "signals", or results, which are more readily identified. It is easy to do if you have half a brain and an understanding of the subject you want insight into.
Effectively, the message of this piece is hogwash, because it missed the boat.