Re: Let's think about how Google might have implemented this.
I would also be willing to lay money on this being an I-frame thing. Sampling only the I-frames is a sensible, obvious optimisation to the ML pipeline, leveraging the work already done by the video compression to extract the "key" information. It would also easily account for the surprisingly high levels of confidence in the predictions: an inserted still is effectively a scene cut, so the encoder is likely to make it an I-frame, and the NN is going to "see" almost nothing but the inserted image.
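
To put rough numbers on that hunch, here is a toy sketch in Python. It is not a real encoder and certainly not Google's pipeline. The function `frame_types` and its `min_keyint`/`max_keyint` parameters are made up for illustration, loosely modelled on the keyframe-interval and scene-cut settings common encoders expose. It assumes a scene cut produces an I-frame unless the previous I-frame was too recent, and that the classifier looks only at I-frames.

```python
def frame_types(frames, min_keyint=2, max_keyint=250):
    """Toy encoder (hypothetical): label each frame 'I' (keyframe) or 'P'."""
    types = []
    since_key = 0   # distance in frames from the most recent I-frame
    prev = None
    for content in frames:
        scene_cut = prev is not None and content != prev
        if (prev is None
                or since_key >= max_keyint
                or (scene_cut and since_key >= min_keyint)):
            types.append("I")
            since_key = 0
        else:
            types.append("P")
        since_key += 1
        prev = content
    return types


# 250 frames of ordinary video with a still image spliced in as every 25th frame.
frames = ["inserted" if i % 25 == 24 else "video" for i in range(250)]
types = frame_types(frames)

# Assumption: the classifier only ever looks at the I-frames.
sampled = [f for f, t in zip(frames, types) if t == "I"]

print(frames.count("inserted"), "of", len(frames), "frames are the inserted image")
print(sampled.count("inserted"), "of", len(sampled), "I-frames are the inserted image")
```

Under those assumptions the inserted image is 10 of 250 frames (4%) but 10 of the 11 I-frames, which is the sort of skew that would explain very confident wrong labels. A real encoder's keyframe decisions are messier, so treat the exact ratio as illustrative only.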