I have seen quite a few papers on deep learning methods that reach 99+% scores, but in many cases all that means is that the network is very good at faithfully reproducing the errors made by the person or persons drawing up the "ground truth". Getting reliable ground-truth data sets is very hard indeed, especially at the scale of the hundreds of thousands of examples needed by deep learning in particular. Note that this does not mean that deep learning is the wrong approach per se; it is just that it is much harder to get a reliable ground truth when you need many, many examples. Simulation can certainly help, but it can be hard to simulate all the deficiencies of your imaging system.
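A toy sketch of the point above, with made-up numbers (the 5% annotator error rate is purely illustrative): a model that perfectly reproduces a flawed ground truth scores 100% on the benchmark, while a model that is actually right about reality is penalised for the annotators' mistakes.

```python
import random

random.seed(0)

n = 100_000
noise = 0.05  # hypothetical: annotators mislabel 5% of examples

# The underlying reality, which nobody gets to see directly.
true_labels = [random.randint(0, 1) for _ in range(n)]

# The published "ground truth": the true labels with ~5% flipped by annotators.
ground_truth = [1 - y if random.random() < noise else y for y in true_labels]

def accuracy(pred, ref):
    """Fraction of predictions that agree with the reference labels."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)

# An oracle that is always right about reality scores only ~95% on the benchmark,
# because it is marked wrong wherever the annotators erred.
print(accuracy(true_labels, ground_truth))

# A network that memorises the flawed labels scores 100% on the benchmark...
print(accuracy(ground_truth, ground_truth))

# ...but only ~95% against reality.
print(accuracy(ground_truth, true_labels))
```

So a near-perfect benchmark score, on its own, cannot distinguish a model that has learned the task from one that has learned the annotators' mistakes.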
The kind of feedback suggested could take the form of a curation process by which the ground truth itself can be amended as new data come in. We have sometimes found that methods for blood-vessel detection were penalised for finding faint vessels missed by the doctors drawing up the ground truth. What is needed is a process by which an expert reassesses the ground truth and, after due process, adds the missing features. I haven't seen the scientific community agree on any such process for existing data sets, but it is certainly needed. Part of this reluctance may stem from the fact that changing the labelling of an existing ground truth would mean having to re-run old experiments, which should be possible, but is an unwelcome chore.