Still not understanding
Why it took so long to disable the failing switch once it was identified.
If a complete crash of the switch would have made the backup take over, then why not just turn the damn thing off?
Unless assumptions were made about the maximum size of the backlog/queues which could build up during failover, and the system just wasn't sized to recover from the massive backlog that an undetected partial failure would cause.
This does sound quite likely, as the report talks about clearing out queues before switching to the backup switch. Perhaps the system couldn't recover if transactions were more than a certain age? Although you would expect that old transactions could be assumed to have failed (as was the case here), then be automatically recorded and purged.
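
To be concrete about what I mean by recording then purging old transactions, here is a toy sketch. It is purely illustrative and not based on anything in the report: the Transaction type, the purge_stale function and the 30-second cutoff are all my own made-up assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Transaction:
    txn_id: str          # hypothetical identifier
    submitted_at: float  # seconds on some shared clock (assumed)


def purge_stale(queue, now, max_age_s):
    """Split a backlog into live transactions and ones presumed failed.

    Anything older than max_age_s is recorded as failed and dropped,
    so only recent work is carried across to the backup.
    """
    live = deque()
    failed = []
    for txn in queue:
        if now - txn.submitted_at > max_age_s:
            failed.append(txn.txn_id)  # record as failed, then purge
        else:
            live.append(txn)
    return live, failed


backlog = deque([
    Transaction("t1", 0.0),
    Transaction("t2", 50.0),
    Transaction("t3", 95.0),
])
live, failed = purge_stale(backlog, now=100.0, max_age_s=30.0)
print(failed)                    # ['t1', 't2']
print([t.txn_id for t in live])  # ['t3']
```

With something like this the backlog handed to the backup is bounded by the age cutoff rather than by how long the partial failure went unnoticed, which is why I'd have expected it to exist.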