The brief outage ... caused problems in the organisation's complex MySQL replication architecture

If you have to describe a system as "Complex", you can guarantee that when it goes wrong, it will go wrong in a way that takes a long time to clean up.

I know you're operating at scale, but Keep It Simple, Stupid. Simple is the only chance you stand of keeping big things like this running.

