You can crunch it all you like, but the answer is NOT always in the data

Evidence-based decision making is so clearly sensible because the alternative — making random decisions based on no evidence — is so clearly ludicrous. The “evidence” that we often use is in the form of information that we extract from raw data, often by data mining. Sadly, there has been an upsurge in the number of people who …


Know thy data

If you have a large enough dataset, you can make it say anything if you look hard enough. Just look at all the hidden messages in the Bible.

The belief that a larger data set will give you a better answer is not a new problem; it was discussed very recently in the latest episode of the BBC's More or Less podcast. Their famous example was the 1936 US Presidential election, where a magazine (the Literary Digest) undertook a huge poll to forecast the election result - and got it wrong - yet a much smaller Gallup poll got the answer right.

(For those of us who have a passing interest in the use & abuse of numbers & statistics, the More Or Less podcast is an excellent weekly listen.)


Re: Know thy data

The Bible letter-counters are a very good analogy. If your goal is to prove your theory, you can always find the data to make that happen.

Rigorous investigators try to find the data that contradicts their theory; frauds only look for the data that confirms it. (e.g. every piece of "scientific" evidence used to bolster every single conspiracy theory)


This post has been deleted by its author

Re: Know thy data

Ah, that's when you start *torturing* the data.


The question?

They say that a question well asked is half answered. And so it appears to be.

> "the answer is therefore always in the data"

Now, while the answer might be in the data, whether it is or not will depend on what the question is. If you collect data regarding the size and distribution of pebbles on a beach, it won't provide answers to questions about the price of gold. You have to have the correct data and know what the right question is.

Knowing what the right question actually is remains the most overlooked part of software design, and that is what makes it so difficult. Ask one person what it (a new project) should do and you'll get one answer; ask another person and you'll get a different answer. Ask a third and they'll tell you: "I can't say, but I'll know it when I see it."

In most cases, the primary goal of a piece of software (no matter what the management team might say) is to meet the expectations of the users. Leaving aside the functional requirements (often the smallest part and the easiest to get right), most users simply want three things: they want the software to be fast, they want it to work intuitively and they want it to be consistent. After that we get down to small matters like producing the correct answer, not requiring an entire datacentre to support a single instance and not taking 20 years to develop.

As for how that meshes with the "business requirements": so long as the users are happy, the auditors are happy and the budget wasn't exceeded, that is pretty much all you need to count a project as a success. The problem is that hardly any company ever asks the users what they want from a new application, and hardly any of the attributes they value are ever measured, or designed in. So we end up with all the correct technical data to design a project, but none of the data necessary to answer the important questions that will determine whether it succeeds or not.


Re: The question?

Quite. First, analyse the question; that should tell you what data are required, or that no amount of information will provide the answer, or that what you need is not available.


Let's hear it for the hypothesis

Start with a hypothesis. Use the data to prove or disprove. All this digging around just gives random rubbish.


Re: Let's hear it for the hypothesis

All this digging around just gives random rubbish.

That is demonstrably, empirically wrong, as any number of applications of unsupervised machine-learning algorithms demonstrate. Take Maximum-Entropy Markov Models, for example; they start with no hypothesis by definition (that's what "maximum entropy" means in this context), but in suitable applications they converge on a model whose probability of getting the correct answer[1] is significantly higher, and indeed often much higher, than random.

Man, look at these sophomores all over my lawn.

[1] As measured by whatever metric is appropriate in the circumstances, such as F1, the harmonic mean of the precision and recall metrics, for Markov-model decoding applications.
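For reference, the F-measure is quick to compute; a minimal sketch (the function name and the example counts are mine):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # fraction of returned answers that were right
    recall = tp / (tp + fn)     # fraction of right answers that were returned
    return 2 * precision * recall / (precision + recall)

# e.g. 80 labels correct, 20 spurious, 40 missed:
# precision = 0.8, recall = 2/3, F1 = 8/11 (about 0.73)
print(f1_score(80, 20, 40))
```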


New Noun

Data Mining, noun: "Torturing data until it confesses ... and if you torture it enough, it will confess to anything."

Noun?


Re: New Noun

Maybe not a noun, but a brilliant saying!


Re: New Noun

Noun?

"Mining" = verb. Compound "data mining" = gerund, functionally a noun.

Excellent phrase either way.


Re: New Noun

Indeed, it's just become my new e-mail signature.


Re: New Noun

So now we're all doing Grammar Overanalyzing? I didn't know that was a thing.

*Ba dum tsh* ... *crickets*


Re: New Noun

So now we're all doing Grammar Overanalyzing? I didn't know that was a thing.

You must be new here.

But yes, in this context "data mining" (and indeed "mining" all by its lonesome) is a verbal noun, or gerund, which in English grammar functions exactly as a noun. Dictionaries conventionally label gerunds and gerundial phrases as "nouns" when they define them, so the original quote is using the standard form for the genre.


Left out the most important data...

...in predicting the World Cup result. The competence, or otherwise, of the referee's optician.

Anonymous Coward

Re: Left out the most important data...

Stop picking on the poor referees. They're just doing what their bosses in FIFA tell them.

You may not be able to buy a result, but you can certainly get some fortunate decisions....


There were quite a few statisticians who "successfully" mined financial data to price derivatives and other securities.

That turned out really well, didn't it?

Anonymous Coward

"The answer"

Worth just mentioning that data mining isn't always about trying to find "the answer" - it's often about trying to improve the accuracy of your best guess.

Most betting systems are not about guaranteeing you wins - just tilting the odds in your favour. Likewise with data mining. If you had a big enough data set for World Cups (I'd argue there haven't been enough World Cups) then mining the data to improve your guess as to which three or four teams were likely to win is a much more feasible proposition than mining the data to tell you which one team will win.

Likewise, supermarkets will mine data to tweak various things (price, promotions, position, etc.) to improve sales - just a minor change leading to a minor percentage increase in sales can have huge effects. That's not about finding "the answer" to any particular question, just using data for some gain.


It's about probability

Brazil facing Germany was close to a 50:50 proposition based on the writer's scoring, which assigned Brazil and Germany 35 and 34 points respectively. For Germany then to go on and win the World Cup lends this data-mining model some credibility.


the answer is Bayesian

now, what's the question?

P.


More data = more noise?

It is well-known, particularly in econometric circles, that throwing more data at a problem doesn't necessarily improve your chances of successfully predicting an outcome, and will frequently have the opposite effect.

In a nutshell, by funnelling additional information into a predictive scheme you may well just be adding useless noise. This is clear, for example, in linear regression modelling. If you throw in a new variable as a regressor you now have more model parameters to estimate, and the estimation accuracy (of all parameters) suffers. This will frequently (but not always) result in larger residual errors and thus poorer predictive performance.
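The effect is easy to reproduce. Below is a sketch of the idea (all names, seeds and sizes are my own choices; pure standard library, with a hand-rolled least-squares solve): fit y = 2x + noise once with the true regressor alone and once padded with junk regressors, then compare out-of-sample error.

```python
import random

def ols_fit(X, y):
    """Least squares via the normal equations (X'X)b = X'y,
    solved by Gaussian elimination with partial pivoting."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

def mse(X, y, beta):
    """Mean squared prediction error."""
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(X, y)) / len(y)

rng = random.Random(42)

def make_data(n, junk):
    """y = 2x + noise, plus `junk` regressors carrying no information."""
    X, y = [], []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        X.append([1.0, x] + [rng.gauss(0, 1) for _ in range(junk)])
        y.append(2 * x + rng.gauss(0, 0.5))
    return X, y

train_X, train_y = make_data(30, junk=20)
test_X, test_y = make_data(1000, junk=20)

lean = ols_fit([row[:2] for row in train_X], train_y)  # intercept + true regressor
bloated = ols_fit(train_X, train_y)                    # plus 20 noise regressors

m_lean = mse([row[:2] for row in test_X], test_y, lean)
m_bloated = mse(test_X, test_y, bloated)
print(m_lean, m_bloated)  # the bloated model predicts worse out of sample
```

With 22 parameters estimated from 30 observations, the junk coefficients soak up training noise and the bloated model's out-of-sample error balloons, exactly the voodoo trap described above.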

In financial prediction, knowing what to throw into the mix is absolutely crucial. There is a large statistical literature on the subject, but in practice there is also a large element of voodoo.


Stats 101

correlation does not imply causation

fch

Re: Stats 101

Last time I said that, some business boffins gave me a dozen downvotes.

Well, maybe because I couldn't hide my belief (which I can't be bothered in the slightest to back up by data) that the main usage of data mining in business is to justify decisions. Make a decision, then look for data to provide a justification.

Not the scientific method. But apparently very practical in business and politics.


Coin flipping

I once demonstrated to a statistics teacher that coin-flipping is not the ideal example of a random process. Moreover, the larger the coin, the longer a run can be maintained before breaking.

If you are consistent, then when the coin is tossed it reaches about the same height each time, and the rise and fall occupy a consistent span for each toss. If your thumb is consistent in starting the spin, the rate of spin is consistent. If you can maintain the height at which you catch the coin, the number of turns during the rise and fall is consistent. With a larger coin, the variation in the spin rate imparted by your thumb is lower relative to the mass and diameter of the coin, meaning the uncertainty in the rate of spin in each toss is less. US Morgan dollars are thus better than US pennies.

Good physical skills can control the result with very little error. I've tossed runs of 50 to 100 at a time and won beer money from unsuspecting physics and stats students many times. On a really good day, you can vary the catch height and control, to a degree, the length of alternate runs. The only uncertainty is in the first toss - you need that first toss to "calibrate". This is why a single coin toss is used to break a stalemate. With three or more, chance may have gone out the window, depending upon the "tosser". ;-)
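The argument can be caricatured in a few lines of code. This is my own toy model, not the poster's: the face shown is simply the parity of the number of half-turns completed in a fixed flight time, so shrinking the relative spread of the spin rate (the larger-coin effect described above) makes the outcome nearly deterministic.

```python
import math
import random

def toss(mean_spin, rel_spread, flight_time=0.5, rng=random):
    """One toss: 0 if the starting face comes up, 1 otherwise.
    mean_spin is the average angular velocity in rad/s."""
    spin = rng.gauss(mean_spin, mean_spin * rel_spread)
    half_turns = int(spin * flight_time / math.pi)  # completed half-turns
    return half_turns % 2

rng = random.Random(1)
mean_spin = 163 * math.pi  # ~81.5 half-turns in the 0.5 s flight

# practised thumb: tiny relative spread, so the same face shows almost every time
skilled = [toss(mean_spin, 0.002, rng=rng) for _ in range(1000)]
# sloppy thumb: the half-turn count wanders over several integers, so parity is a coin flip
sloppy = [toss(mean_spin, 0.05, rng=rng) for _ in range(1000)]
print(sum(skilled) / 1000, sum(sloppy) / 1000)
```

With a 0.2% spread, the half-turn count almost never strays across an integer boundary; at 5% it wanders over several, and the "random" coin toss is back.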


Re: Coin flipping

The article is right that 50:50 is the single most probable outcome, but only just: its probability barely exceeds that of 51:49 or 49:51.

To end at 50:50, after 99 tosses you must be at 50:49 or 49:50 (one more head than tail, or vice versa), and the final flip then evens things up with probability 0.5 (assuming a fair coin &c). But 51:49 can likewise be reached from 50:49 or from 51:48, and since 49:50 is a slightly more probable 99-toss state than 51:48, the totals come out marginally in favour of 50:50: about 7.96%, versus about 7.80% for 51:49.

So 50:50 is the most probable single outcome, yet you are still overwhelmingly likely - about 92% of the time - to see some other split.
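For reference, the exact binomial probabilities are a one-liner with Python's standard library (a quick sketch):

```python
from math import comb

def p_heads(k, n=100):
    """Probability of exactly k heads in n tosses of a fair coin."""
    return comb(n, k) / 2 ** n

print(p_heads(50))      # about 0.0796
print(p_heads(51))      # about 0.0780 (same for 49 heads, by symmetry)
print(1 - p_heads(50))  # about 0.92: most runs miss an exact 50:50 split
```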


Good article

It makes a refreshing change to come across people who actually have a clue about statistics.

P.S. and not a single "It stands to reason" anywhere in the article :)


Data as THE answer, Expert Systems and AI are suspect over this...

Solutions to real-life problems, and verification of real-life answers, will never be 100% correct, as data sets are only really good for the time they are taken (sort of a quantum problem); real things might have occurred before the data set was stored, and by now, much later, the actual data set might not even be there...

RTOS UNIX-type systems used current data for decision making (not any part of this article).

IMHO: good luck with your big-data thingy... try drilling down on current data for a better answer... RS.


There is phrase for this

A perfectly apt phrase. Coined long before all things digital. IIRC, even before all things electric.

"The map is not the territory."

How simple is that?

Anonymous Coward

Laplace 0 Dynamical complexity 3

Excellent article.

N = 1 is actually rather more tractable than N > 1 IF someone acts on the result of a data mine, or if something (a butterfly?) acts at all on something that affects the system. This immediately changes the system, and the probability of the result changes, often radically. Ouch. You'll only ever be able to discern some of the proximal variables, and very few of the distal variables, before analysis, and the analysis cannot tell you the other variables that you should have used. Your result might be useful (make money) briefly if the system has some hysteresis, but it cannot be used as a causal law (if we do this, that happens) or even as a rule of thumb.

Of course, dynamical data mining (continuously updating the data and mining them) might give some small insight into what happens when we change the price of butter in a supermarket chain, but if the opposition has the same model, it will be able to counter your change immediately. In such systems we're never going to have control over all of the variables that affect the system, so I suspect the whole approach is just an expensive, doomed escalation. Certainly more expensive than getting 10 carefully chosen dynamical systems to sit around a table to discuss a product...


You failed to talk about the hidden danger of testing multiple hypotheses against a set of data

Any hypothesis based on the data should be assumed flaky unless tested against independent data. Let's go through an example:

You throw your perfect coin 10 times and have 100 people (subjects) trying to control the result by mind power. There are 2**10 = 1024 possible outcomes. The result will probably be disappointing: even with very hard concentration, your subjects should not be able to control the coin more than 50% of the time.

But once you have that result, you might break down the group of subjects into subgroups - wouldn't it make sense that the power is stronger in some than in others? Let's split the group into male and female, large and small, and sort them by skin colour and eye colour, age and place of birth. Suddenly you have 1000 possible sub-groups, and you will inevitably find a statistically significant control over the coin toss in some sub-group. But every extra hypothesis you test inflates the chance of a false positive, so the apparent significance is worthless. The correlation must be tested against new data to mean anything.
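The subgroup trap is easy to simulate. A sketch (the seed, subgroup size and number of looks are my choices): give 100 subjects 10 fair flips each, then scan many arbitrary subgroups with an exact two-sided binomial test - with enough looks, a "significant" one almost always turns up.

```python
import random
from math import comb

def two_sided_p(heads, n):
    """Exact two-sided binomial test against a fair coin (p = 0.5):
    total probability of all outcomes at least as far from n/2."""
    d = abs(heads - n / 2)
    return sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= d) / 2 ** n

rng = random.Random(7)
subjects = [sum(rng.random() < 0.5 for _ in range(10)) for _ in range(100)]

# the whole group shows nothing unusual...
print(two_sided_p(sum(subjects), 1000))

# ...but scan 1000 arbitrary subgroups of 20 subjects (200 flips each)
p_values = [two_sided_p(sum(rng.sample(subjects, 20)), 200) for _ in range(1000)]
print(min(p_values))  # some subgroup looks "significant" purely by chance
```

With a thousand looks at the same null data, p < 0.05 somewhere is practically guaranteed; that is exactly why the subgroup "discovery" is worthless until replicated on fresh data.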

This problem does occur in serious academic research: a quite senior professor recently reported that some humans have a statistically indisputable ability of foresight (but only the males, only involving sexual pictures, ... and whatever else was required to make his data talk).

J 3

In other words, we begin to measure the physics of the flip. Is this actually possible?

Well, all nice and good, up to a certain extent. Otherwise, we get to a point where it is silly. Or, as a couple of complexity boffins said quite a few years ago now, "Don't model bulldozers with quarks". Now, finding that sweet spot is the real problem...


A coin flip ABSOLUTELY DOES NOT have a 50/50 ratio.

I've seen it myself - a 50 pence piece land on its edge, wobble momentarily, and then come to a halt, vertical.

What's the percentage chance of that, then?

