Thou shalt not question thine computer model
Especially in 'Climb it Seance'
An academic paper published in Nature has been criticized by a data scientist – who found a glaring schoolboy error in the study when he tried to reproduce the machine-learning research. The paper in question, published in August last year, describes how neural networks can be trained to predict the location of aftershocks …
(1) This isn't a dispute related to climate science in any way, not even via computer modelling, as your "Climb it Seance" headline implies. Further, you should know by now that it's *Brexit* that has to be dragged into every discussion nowadays, the climate thing being somewhat passé.
(2) I encourage reading the Nature referee's comments at the GitHub link. That said, I think the comment (and a reply) might well have been a useful addition to the literature; even if a comment is slightly off the mark, the debate can be instructive.
(1) well, actually, it is, in the sense that the fundamental flaw ("the model is heavily overfitting to the training data") has a direct analogue in greenhouse global warming modelling.
(for anyone interested (zzzz...) :
The forcing factor is the most important factor in AGW's core model, but it is only an arbitrary nexus with a number attached (and a rationale for what it "must" mean applied after the fact). That number is quite literally whatever is necessary to bring the model's output in line with its "training" dataset, i.e. "to recover the input numbers".
In econometrics/statistics/modelling, this is called "tautology". Tautology by itself essentially destroys the validity of any model.
One characteristic of a tautology-affected (aka "over-fitted") model is its inability to forecast: poor power/accuracy outside its "training" dataset. And this is demonstrated by the climate models' appalling forecast results.
Amusingly, there was a conference a few years ago specifically to debate changing the forcing factor, by fiat, not measurement! Well, if it were real, you couldn't even propose doing so without laughing. "We should change pi! It would make my results so much better if you changed pi!"
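The failure mode being described (near-perfect in-sample fit, no forecasting power) is easy to demonstrate on synthetic numbers. A minimal sketch, using made-up toy data and nothing to do with any actual climate model:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training" period: 10 noisy points on a gentle linear trend.
x_train = np.arange(10, dtype=float)
y_train = 0.5 * x_train + rng.normal(0.0, 0.5, size=10)

# A high-degree polynomial has enough free parameters to near-recover
# the input numbers ...
coeffs = np.polyfit(x_train, y_train, deg=7)
train_err = np.abs(np.polyval(coeffs, x_train) - y_train).max()

# ... but its "forecast" beyond the training window blows up, because
# it memorised the noise rather than the trend.
x_future = np.arange(10, 15, dtype=float)
y_future = 0.5 * x_future
forecast_err = np.abs(np.polyval(coeffs, x_future) - y_future).max()

# Out-of-sample error is far worse than the in-sample fit.
assert forecast_err > train_err
```

The only antidote is to judge the model on data it never saw while its parameters were being tuned.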
I cannot find anywhere in the article, or the github content, where the phrase "we do not accept criticism of our data handling from a data expert" appears.
Can you locate it for me? I would like to understand better the context or intent of your "does not sit well" phrasing.
I'd like to see a response from Raj about the authors' comments. Can he explain why they are wrong?
"The network is mapping modeled stress changes to aftershocks, and this mapping will be entirely different for the example in the training data set and the example in the testing data sets, although they overlap geographically," the pair said.
"There's no information in the training data set that would help the network perform well on the testing data set; instead, the network is being asked in the testing data set to explain the same aftershocks that it has seen in the training data set, but with different mainshocks. If anything, this would hurt [the] performance on the testing data set," DeVries and Meade wrote back to Shah.
"Can he explain why they are wrong?"
Here's where they are wrong:
"[...]admitted that their model was trained and tested on a subset of the same data[...]"
And then I ignored everything they said after that as special pleading and 'but we know what we're doing, we're scientists'.
He doesn't have to prove them wrong. They are wrong by assumption. They have to prove that they are right, to others' satisfaction. And someone else needs to be able to reproduce their results using the same data, which apparently cannot be done, at least not easily.
The problem is that they fail to address his actual criticism, which is not that the model is learning a specific mainshock's relationship with aftershocks, but that it is learning what that relationship is in specific regions, so it is not generalisable to other places.
In fact his own testing shows that if you run it properly it's no better than existing techniques.
A secondary issue is that they fail to mention that a computationally much lighter ML approach does as well as deep learning.
He highlights other methodological issues too.
They do not seem to get it even after he explains.
Writing on a train, so responding to comments here rather than the paper and letters themselves, but you'd think learning the aftershock pattern for a specific region is still a useful thing to be able to do. Possibly the authors are saying that this is an aim of the paper that is being missed?
It's not that uncommon for training/validation/test sets to be created by partitioning a larger data set. Ideally you also test on a completely independent testing set, but in a lot of contexts it's rare for it to have nothing in common with the original data.
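Partitioning a larger data set is indeed standard; the question is what unit you partition by. If the concern is region-level leakage, the usual fix is to split by group (region or mainshock) rather than by row, so nothing from a test region ever influences training. A toy sketch in plain Python, with made-up region names (scikit-learn's `GroupKFold` does the same job in real pipelines):

```python
# Toy aftershock records: (region, features, label). Region names are
# invented for illustration; they stand in for "which mainshock/area
# this sample came from".
data = [
    ("tohoku",  [0.1, 0.2], 1),
    ("tohoku",  [0.3, 0.1], 0),
    ("landers", [0.2, 0.9], 1),
    ("landers", [0.4, 0.8], 1),
    ("chile",   [0.5, 0.1], 0),
    ("chile",   [0.6, 0.2], 0),
]

def group_split(records, test_groups):
    """Partition so that no group (region) appears on both sides."""
    train = [r for r in records if r[0] not in test_groups]
    test = [r for r in records if r[0] in test_groups]
    return train, test

train, test = group_split(data, {"chile"})

# The held-out region never appears in the training set.
assert {r[0] for r in train}.isdisjoint({r[0] for r in test})
```

A row-level random split of the same records would scatter each region across both sides, which is exactly the geographic overlap being argued about.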
It's a great example of silos in science. The earthquake scientists don't seem to care that they got their data science wrong, and apparently it doesn't affect what they were trying to say. The data scientists apparently don't know enough earthquake science to be able to understand the main aim of the paper, but they do understand the data science and that's good enough for their comments to be valid. The result is a flawed paper that can't stand deep scrutiny by truly independent reviewers. The Nature referee seems to think flawed science is OK as long as none of the likely readers are going to be bothered by it. I wonder what their background is? There are lots of slightly similar example papers in biological fields, that rely on dodgy statistics because the authors never bothered to consult an actual statistician.
My daughter is in Bioinformatics and is trying to get everyone to pay attention to something which matters if you are not working on human data, and might matter even then, but you have to look. She is Bio and Info from the ground up, not a convert, so she understands the biological significance of the issue and how it impacts the data analysis.
The problem is that the computer side don't see the biological significance and the biologists don't rate the bioinfo implications. Oh, and they don't want to have to re-analyse all their data, this time doing it properly.
The sex of the person making the point is also, sadly, an issue. She has however run the maths and has the proof worked out.
She may or may not be relieved to know this is an extremely common sort of problem with research.
To paraphrase Bismarck: Research papers are like Sausages, 99% of them you can't swallow after you've seen how they were made.
A hobby of mine is pulling up the original research any time a loud announcement is made with the implication that people "need to change". E.g. meat vs bowel cancer, destroy your energy infrastructure, plastic bags kill turtles, etc. So far, every single "research" paper has fallen to pieces on close inspection. Disturbingly often, in fact, their own results flatly contradict their Conclusion/Summary! You don't even need to crawl the methodology; merely observe the two radically different messages in the one paper.
One exception: the old ozone-layer stuff. That was solidly performed, including itself pointing out where it was weak.
Funny, second time this month that I read about Nature refusing to publish anything that might indicate a previous article was flawed.
It's a shame because it really puts the trust in peer-review in jeopardy.
> puts the trust in peer-review in jeopardy
I'm afraid that went out the window quite a while back. The revelations over the last decade+ of how badly it's being hijacked procedurally (and often also socially) mean that it is no longer any real indicator of solidity/quality/reasonableness, merely of conformance to (local) paradigm. Google "grievance studies hoax" for an indication of how little it means.
You really do have to crawl each methodology yourself.
And that's just what *I* have seen. I think back to the old-hands rolling their eyes re peer review 3 decades ago, and really the gamesmanship must have been starting a lot earlier.
"Typically this is done with a random sample of your data – the test set – which you never expose to the model," he added. "This ensures your model has not learned from this data and provides a strong measure to ascertain generalizability."
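The procedure Shah describes fits in a few lines. A minimal sketch, splitting at the level of row indices (illustrative only, not the paper's actual pipeline):

```python
import random

def holdout_split(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a random test set that the
    model is never fitted on."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = int(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)

train_idx, test_idx = holdout_split(100)

# The two index sets partition the data with no overlap, so test
# performance is a genuine measure of generalisability.
assert set(train_idx).isdisjoint(test_idx)
assert len(train_idx) + len(test_idx) == 100
```

The key discipline is that the test indices are drawn before any fitting happens and the model's parameters are never updated using them.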
Call me Mr Silly, but why not simply remove the suspect data from the result set and see how the model's prediction rate is affected? Or just re-run with a new data set that doesn't contain the suspect data? The fact that this was not even considered makes me look askance at the original paper. After all, one of the requirements of a scientific experiment is that it should be reproducible (and preferably modifiable).
Biting the hand that feeds IT © 1998–2019