"why the algorithm..such a heavy weighting on..only 2% of the footage."
Got it in a nutshell.
These "deep learning"* algorithms are meant to be "robust", which implies they are not deflected by minor disturbances. I.e. "I am 98% confident this video is about cats, but there is a 2% chance it's got something to do with cars." So being 95% certain it's about cars is an Epic Fail.
*which makes the work sound so much more insightful than "multi-layer neural network", which describes what (AFAIK) all of these algorithms actually are, and therefore suggests people could work out how they work.
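
To put rough numbers on the cats/cars point, here is a minimal sketch of what a "robust" video-level score might look like. It assumes the simplest possible scheme, averaging per-frame class probabilities, which is my assumption and not how any particular system is known to work. The function name, the class list and the per-frame numbers are all made up for illustration.

```python
def video_label_scores(frame_probs):
    """Average per-frame class probabilities into one video-level score per class.

    Assumption: plain mean pooling over frames. Real systems may aggregate
    quite differently; this only illustrates the arithmetic.
    """
    n_frames = len(frame_probs)
    n_classes = len(frame_probs[0])
    return [
        sum(frame[c] for frame in frame_probs) / n_frames
        for c in range(n_classes)
    ]


# Hypothetical per-frame outputs over the classes below:
# 98 frames that look like cats, 2 frames that look like cars.
CLASSES = ["cats", "cars"]
frames = [[0.99, 0.01]] * 98 + [[0.05, 0.95]] * 2

scores = video_label_scores(frames)
best = CLASSES[scores.index(max(scores))]

for name, score in zip(CLASSES, scores):
    print(f"{name}: {score:.2%}")
print("video label:", best)
```

Under that assumption the 2% of car frames can only nudge the total by a couple of percentage points, so a verdict of 95% cars would need those few frames to be weighted far more heavily than the rest.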