Just had a look at the paper
What struck me as odd was that, at first glance, they seemed to be training a deep neural network (since most CNNs are employed in deep architectures) on just 9000 images, which is usually far too few. Looking at their architecture, however, it is only about 5 layers, which is hardly deep. Indeed, the authors clearly state that the shallow architecture was chosen because of the small number of example images. As the authors note, more data are needed, but I would also suggest a comparison to other, feature-based approaches, just to see whether CNNs really outperform other ML methods in this case.
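For concreteness, such a feature-based baseline could look like the sketch below: a cross-validated RBF-kernel SVM on precomputed feature vectors. Everything here is a placeholder, not the paper's actual data or features; the synthetic arrays stand in for descriptors (e.g. HOG or SIFT) extracted from the images, and the dimensions are made up for illustration.

```python
# Hedged sketch of a feature-based baseline (hypothetical data, not the paper's).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 1000, 128, 5  # placeholder sizes
X = rng.normal(size=(n_samples, n_features))     # stand-in feature vectors
y = rng.integers(0, n_classes, size=n_samples)   # stand-in labels

# An RBF-kernel SVM on handcrafted features is a common strong baseline
# on small image datasets, which is exactly the regime of this paper.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the same cross-validated accuracy for the CNN and for a baseline like this would make the claim that a CNN is the right tool here much more convincing.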