Train multiple CNNs, each on a different thing: one on texture, one on silhouettes, one on shape, one on edges. Train them all on a dataset that has multiple views of the same object, so a cat from the front, a cat from the back, from the side, and from above looking down while the cat is looking up, etc.
The same image, processed multiple ways each time.
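A rough sketch of that preprocessing step, assuming grayscale input in the 0..1 range. The edge, silhouette, and texture maps here are crude NumPy stand-ins (finite differences, a mean threshold, local contrast); a real pipeline would use proper filters or learned extractors, but the point is the same: one image, several views, one per specialist network.

```python
import numpy as np

def make_views(img: np.ndarray) -> dict:
    """Produce several crude per-specialist 'views' of one grayscale image."""
    # Edge view: magnitude of horizontal + vertical intensity differences.
    dy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))
    dx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))
    edges = np.clip(dx + dy, 0.0, 1.0)

    # Silhouette view: foreground mask from a simple intensity threshold.
    silhouette = (img > img.mean()).astype(float)

    # Texture view: local contrast, i.e. deviation from a 3x3 box blur.
    pad = np.pad(img, 1, mode="edge")
    blur = sum(pad[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0
    texture = np.abs(img - blur)

    return {"edges": edges, "silhouette": silhouette, "texture": texture}

demo = np.zeros((8, 8))
demo[2:6, 2:6] = 1.0          # a bright square on a dark background
views = make_views(demo)
```

Each entry in `views` would be fed to its own CNN; the same function is applied to every rotation and viewpoint of the object.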
Then train another network, a supervisor, to take the outputs of the others and weight their responses. The supervisor is trained by giving it the set of processed images as one example, with the rotations and viewpoints grouped as related sets: the label says "this set and these other sets are all the same object", not "these images are all of related things". If one network quickly comes up with banana and two come up with toaster, then it's more likely a toaster. You could even feed back from the supervisor, saying "try again, nobody else thinks it's a banana", and see what it says then.
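A minimal sketch of the supervisor's weighting and feedback idea, under some assumptions: each specialist emits a softmax distribution over classes, the supervisor holds a trust weight per specialist, and the "try again" feedback is modeled crudely as down-weighting any specialist whose top answer clashes with the weighted consensus (rather than actually re-running it). The `supervise` function and its threshold are hypothetical names for illustration, not an established API.

```python
import numpy as np

def supervise(probs: np.ndarray, weights: np.ndarray,
              disagree_thresh: float = 0.5):
    """Combine specialist predictions and down-weight lone dissenters.

    probs:   (n_specialists, n_classes) softmax outputs, rows sum to 1.
    weights: (n_specialists,) trust in each specialist.
    Returns (final class index, adjusted weights).
    """
    w = weights / weights.sum()
    consensus = w @ probs                 # weighted average distribution
    top = int(np.argmax(consensus))

    # Feedback pass: specialists putting little mass on the consensus
    # class get the "try again" signal; here that just halves their weight.
    adjusted = weights.copy()
    adjusted[probs[:, top] < disagree_thresh] *= 0.5
    adjusted /= adjusted.sum()
    final = int(np.argmax(adjusted @ probs))
    return final, adjusted

# Specialist 0 says banana (class 0); 1 and 2 say toaster (class 1).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.3, 0.7]])
label, new_w = supervise(probs, np.ones(3))
```

In a trained system the weights would come from the supervisor network itself, conditioned on the inputs, rather than this fixed majority-style rule; the sketch just shows the banana/toaster arbitration in the text.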