"Isn't this more of a tiny "bug" than a fundamental problem?"
Probably. It's likely an artefact of working on individual frames rather than continuous video, coupled with artificial objects being *much* easier to learn and later classify than natural ones, particularly natural ones that have spent millions of years evolving to avoid being seen.
It's an interesting attack vector, going after feature extraction rather than the learning or classification phases. Not a new one, but feature extraction is probably the single most vulnerable stage, given that ANNs are black boxes.