50% success rate?
So it's as accurate as flipping a coin, or am I misunderstanding the numbers?
A surprising number of peer-reviewed premature-birth-predicting machine-learning systems are nowhere near as accurate as first thought, according to a new study. Gilles Vandewiele, a PhD student at Ghent University in Belgium, and his colleagues discovered the shortcomings while investigating how well artificial intelligence …
If you predict premature in cases where it's not, you can drop your accuracy further. Flipping a coin will always get you 50% accuracy in the long run. That is something of a special case: with anything other than a 50:50 prediction, the resulting accuracy depends on the population balance, but a 50:50 guess has the same accuracy rate for positives and negatives independently.
Of course the extreme case is predict the wrong answer each time, for 0% accuracy. That's generally not the baseline because you assume that this means you simply had your outcomes the wrong way around to start with, but it could occur by chance (extremely unlikely in all but the smallest datasets, unless you really have trained to some signal that's magically inverted in the testing data).
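To put rough numbers on the coin-flip point, here is a minimal sketch of the expected accuracy of a guesser that ignores the patient entirely. The function name and the 10% prevalence are made up for illustration and are not taken from the study.

```python
def expected_accuracy(prevalence: float, guess_rate: float) -> float:
    """Expected accuracy of a guesser that never looks at the input.

    prevalence: fraction of cases that really are positive (e.g. premature).
    guess_rate: probability that the guesser says "positive".
    """
    # Correct when it says "positive" on a positive case,
    # or "negative" on a negative case.
    return guess_rate * prevalence + (1 - guess_rate) * (1 - prevalence)


# Hypothetical populations: perfectly balanced, and 10% positives.
for prevalence in (0.5, 0.1):
    for guess_rate in (0.5, 0.1, 0.0):
        acc = expected_accuracy(prevalence, guess_rate)
        print(f"prevalence={prevalence:.1f} guess_rate={guess_rate:.1f} "
              f"accuracy={acc:.2f}")
```

With a fair coin (guess_rate of 0.5) the result is 0.5 whatever the prevalence. With the made-up 10% prevalence, always answering "negative" scores 0.9 without learning anything, which is why a bare accuracy figure says little unless you also know the population balance.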
Should we believe Google's claims to do it better than humans?
Also ALL so-called AI is really pattern matching driven by a HUMAN-curated database.
Even if these systems are as good as experts, how do you train new experts to produce human-curated data in the future, as things change or arise, if experts are replaced by these systems?
That is certainly a limit if you are training computers to do things that people are doing and comparing to the human output (matching a hand-drawn segmentation or a radiologist's classification), but you can also train to predict final outcomes (as in this case, actually), which is sort of what those human experts would be doing to start with in many cases.
Potentially a computer can learn from a much bigger dataset than a human could ever hope to see in their lifetime (or, at a lower threshold, be able to fully absorb and distinguish). There's no philosophical reason they can't do some tasks better, and pattern recognition is a good candidate. On the other hand, they also seem to need a larger quantity of task-specific data; humans are still better at generalizing from small numbers of examples using their existing knowledge.
I think the key issue in so many of these cases is 'potentially'. You can train a computer very easily to repeat a task like pattern recognition - but if you don't have a good conceptual model of what the pattern is and how the computer is spotting it, the chances are you are just painting the bullseye on the side of the barn after you fired the arrow.
Just because something is peer reviewed doesn't mean it works or that the results are reproducible, as Bayer research scientists found....
also covered by Reuters....
"Just because something is peer reviewed doesn't mean it works or that the results are reproducible"
Indeed. However, ideally peer review at least acts as a decent filter to stop obviously shoddy work getting put out in the first place. Scientific publications get a lot of flak for not being reproducible or not giving significant enough results, but that's not actually a problem most of the time, because the whole point is to put the results out there and let other people look into them and try to reproduce them, extend the work, or whatever.
In this case though, it appears to be sheer incompetence on the part of both the reviewers and the editorial staff. This is exactly the sort of thing peer review is there to catch - obviously unbelievable results caused by poor experimental method. Either the papers clearly describe their own failings and shouldn't even make it to review in the first place, or they don't describe the work in enough detail to justify publication. Many papers don't hold up given time to do reproductions or further related work, but it's rare to see such clear sloppiness through the entire experimental and publication process.
Here is the actual paper with that title: https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124
Unsurprisingly, this kind of problem is most prevalent in fields where scientists fight about big pots of money. Say, medically relevant research.
In my field, scientists tend to measure hard numbers: there are no big financial incentives, and wrong results will (eventually) be falsified, ruining your reputation. But in fast-moving, fashionable and well-funded fields (AI, etc.) the incentives are all wrong.
The total of 300 feels like just about enough data for that, although it depends on whether they need to do any hyperparameter searching, for which they would need a third data set. The problem they will run into is that they may keep tweaking their model and running it on the test data set until they find one that works. Or keep repartitioning their data until they get something. Everything is likely to be highly over-fitted, since all they have done is add a manual phase to this.
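The "keep tweaking until the test set looks good" trap can be shown with a toy simulation. This is only a sketch under made-up assumptions: 300 samples of pure noise, 200 random threshold rules standing in for the tweaked models, and a 150/150 split. None of it is taken from the papers in question.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

N_SAMPLES, N_FEATURES, N_CANDIDATES = 300, 5, 200

# Pure-noise data: labels are independent of the features,
# so no rule can genuinely do better than about 50%.
X = [[rng.random() for _ in range(N_FEATURES)] for _ in range(N_SAMPLES)]
y = [rng.random() < 0.5 for _ in range(N_SAMPLES)]

# A "test" set that gets reused for every tweak, and a holdout
# (the third data set) that is only looked at once, at the very end.
X_test, y_test = X[:150], y[:150]
X_hold, y_hold = X[150:], y[150:]


def make_rule():
    """One 'tweak': a random threshold on a random feature."""
    feature = rng.randrange(N_FEATURES)
    threshold = rng.random()
    flip = rng.random() < 0.5
    return lambda row: (row[feature] > threshold) != flip


def accuracy(rule, rows, labels):
    return sum(rule(r) == l for r, l in zip(rows, labels)) / len(labels)


candidates = [make_rule() for _ in range(N_CANDIDATES)]

# Select the winner by its score on the reused test set.
best = max(candidates, key=lambda rule: accuracy(rule, X_test, y_test))
best_test_acc = accuracy(best, X_test, y_test)
holdout_acc = accuracy(best, X_hold, y_hold)

print(f"accuracy on the reused test set: {best_test_acc:.2f}")
print(f"accuracy on the untouched holdout: {holdout_acc:.2f}")
```

The winning rule looks better than chance on the set it was selected against, even though there is nothing to learn, while the holdout score should sit close to 50%. Repartitioning the data until something works has the same effect, only with the split being tweaked instead of the model.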
The sample was not necessarily too small to train some systems, but many would be liable to over-fit the training data and not generalise. The big problem was ensuring the testing was right. Unfortunately this isn't even uncommon. When I was doing my Master's degree I reviewed the available literature on predicting foreign currency movements, and an outright majority of the published papers I found that used Machine Learning for this made elementary mistakes in their testing procedure not dissimilar to this, leading to unbelievably high prediction scores. I really hoped to find one that had a suitable process to reliably predict next-day currency movements, but unsurprisingly that virtually unlimited pot of gold was not real, and my final paper primarily served to debunk a dozen or so other papers by showing that their results were not reproducible and explaining the identifiable flaws in their processes.
The simple lesson is that too many people don't understand the necessary processes for doing machine learning properly - including many academics writing papers about it.
It's worse than just "publish or perish". Under a lot of models used for promotion and tenure, there's no disincentive for producing bad studies at all. A lot of models simply look at the number of papers published and perhaps how prestigious the outlet is. Models rarely take into account any type of paper quality; when they do, it's usually in the form of the number of citations. In that case, having your paper debunked still increases your ratings! There might be one that takes into account the number of papers that you've had retracted, but it's certainly not popular.
Biting the hand that feeds IT © 1998–2020