Let's think about how Google might have implemented this.
(1) Google already automatically classifies images, so it is reasonable to assume they would try to leverage / reuse their image classifiers.
(2) Since video is simply a bunch of still images, it is reasonable to assume Google simply takes stills from the video and passes them to its existing (and already trained) image classifiers.
(3) It is pointless to process every single frame of the video: that would be prohibitively expensive, and there usually isn't much change from one frame to the next.
(4) Google probably selects only the key frames (I-frames) for classification. Depending on computation cost, Google may also drop key frames that are similar to other key frames; there is no point classifying the same, or a very similar, image over and over again. Whether that is worthwhile depends on whether image classification is more expensive than comparing two key frames.
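A toy sketch of point (4), under my assumptions above: classify only key frames, and skip any key frame that is nearly identical to the last one kept. The frames here are stand-ins (flat lists of grayscale pixel values); a real pipeline would pull I-frames out of the compressed stream with a tool like ffmpeg, and the difference metric and threshold are placeholders I made up.

```python
def frame_difference(a, b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(keyframes, threshold=10.0):
    """Keep a key frame only if it differs enough from the last kept one."""
    kept = []
    last = None
    for frame in keyframes:
        if last is None or frame_difference(frame, last) > threshold:
            kept.append(frame)
            last = frame
    return kept

# Three "frames": the second is almost identical to the first, so it is dropped.
f1 = [100] * 16
f2 = [102] * 16   # differs by 2 per pixel -> below threshold, dropped
f3 = [200] * 16   # big change -> kept
print(len(select_keyframes([f1, f2, f3])))  # 2
```

Whether this saves anything depends, as noted, on the comparison being much cheaper than a classifier pass, which a simple pixel difference certainly is.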
(5) It should be obvious that every inserted image is an I-frame, so it WILL be classified.
(6) Google has some algorithm (or neural net) that boils the contents of a video (several thousand or even millions of frames) down to a single classification. Clearly, if you film a walk around a city, you will get cars, buildings, people, trees, etc., yet Google's classifier has to come back with a single answer. That answer is probably weighted by the confidence of the individual frame classifications; cars, buildings, laptops, and plates of food probably come back with a higher classification confidence.
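Point (6) could be as simple as summing confidences per label and taking the winner. This is purely my guess at a minimal version, not anything Google has described:

```python
from collections import defaultdict

def aggregate_by_confidence(frame_labels):
    """frame_labels: list of (label, confidence) pairs, one per classified frame.
    Returns the single label with the highest total confidence."""
    totals = defaultdict(float)
    for label, conf in frame_labels:
        totals[label] += conf
    return max(totals, key=totals.get)

# A city walk: cars dominate, with a stray low-confidence tree.
frames = [("car", 0.9), ("building", 0.8), ("car", 0.85), ("tree", 0.4)]
print(aggregate_by_confidence(frames))  # car
```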
I imagine that, over time, Google will tweak the final classification to give more weight to how long a single classification persists rather than to its confidence (or perhaps some admixture of the two).
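That admixture could be a single blending parameter between duration share and mean confidence. Again, a hypothetical sketch with a made-up weighting, just to make the trade-off concrete:

```python
from collections import defaultdict

def aggregate_mixed(frame_labels, alpha=0.5):
    """Score each label as alpha * duration_share + (1 - alpha) * mean_confidence.
    alpha=1.0 rewards only how long a label persists across frames;
    alpha=0.0 rewards only how confidently it was detected."""
    counts = defaultdict(int)
    conf_sums = defaultdict(float)
    for label, conf in frame_labels:
        counts[label] += 1
        conf_sums[label] += conf
    n = len(frame_labels)

    def score(label):
        duration_share = counts[label] / n
        mean_conf = conf_sums[label] / counts[label]
        return alpha * duration_share + (1 - alpha) * mean_conf

    return max(counts, key=score)

# "tree" appears in most frames but weakly; "car" appears once but strongly.
frames = [("tree", 0.3), ("tree", 0.35), ("tree", 0.3), ("car", 0.95)]
print(aggregate_mixed(frames, alpha=0.9))   # tree (duration wins)
print(aggregate_mixed(frames, alpha=0.1))   # car  (confidence wins)
```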
On the other hand, I could be blowing smoke since I have absolutely no idea how Google is doing this, but this is the way I would approach it.
(A pint, because we could spend hours arguing over how this is or should be implemented - or even if it should be implemented at all.)