The failover switch didn't have a fence mechanism to power off the primary completely,
What f*** fence mechanism? Routing protocols and L2 failover do not have any of that.
Partial failure, especially on high-bandwidth optical interfaces, is NOT rare. It has been years since I have dealt with large SP operations, but I have seen tens of those. There is really f*** all you can do in such a case except have a well-trained ops team which can deduce what is going on from the stats (as you may not see this as a normal fault) and go in and KILL the erroneous interface (and later the whole card, proceeding to the switch or router if need be). The team also needs to have the authority to do so, which is what I suspect is the issue here. The ops team did not have the authority to go in with kill orders, and by the time it was authorized it was too late - there was a gigantic backlog.