The Navitaire cock-up, filleted

The Register has found out more about the Navitaire data centre responsible for the 21-hour Virgin Blue outage on Sunday and the second outage on Tuesday. It makes the prolonged first outage even more puzzling. Accenture's Eric Ulmer says (Google Docs) that the Navitaire system has a high transaction volume, and the total …

COMMENTS

This topic is closed for new posts.
  1. frank ly

    Value Engineering

    "..That is a blunder of the first magnitude by whoever designed, implemented and ran the system."

    No, it's value engineering. That decision has probably saved at least two million dollars in capital outlay and running costs over the lifetime of the system. Oh, wait a minute.......

  2. Anonymous Coward
    FAIL

    Take a look at when it happened.... and where....

    This is AUSTRALIA!

    The footy grand final was on at the time the data centre failed. Theys were all down the pub wotchin fuudy, and gettin p1ssed. Outsourced support staff were cleaning toilets in Delhi.

    Poor Virgin Blue is getting slammed in the media here, while only those of us who can read, and understand how the modern world is structured know that Accenture is to blame...

    Weren't they once Arthur Anderson Management Consultants? Why do I get the feeling Virgin will find that the only non-outsourced staff member is the gardener?

    Maybe Arthur Anderson / Accenture could get a new slogan from this: DataFAIL Experts!

    To sum it all up: Australia = FAIL = AustFAILia

    Anon. Our Government watches us, I don't want the Eye of Conroy to see me.

  3. JimC
    Joke

    > Only Virgin Blue

    Obviously one can't imagine that his Bransonness might have chosen the budget SLA...

  4. Destroy All Monsters Silver badge
    Headmaster

    Welcome to the real...

    ...where failover systems don't.

    Now, for some editorial logic. We read:

    "Alternatively, why didn't it run the system off the remote data centre holding the SnapMirror data? The implication is that, because it didn't do this, the Snapmirror process of sending OLTP data to the second data centre was useless."

    The deduction goes the wrong way and should be replaced by an *induction*:

    Instead of saying

    "Navitaire didn't run the system off the remote data centre, therefore the process was useless."

    one should instead say

    "As Navitaire didn't run the system off the remote data centre, that process was presumably useless."

  5. raving angry loony

    pointing fingers

    Until proven otherwise, I'm going to blame an accountant for the decisions that led to this outage. Probably one whose budget wasn't actually affected if anything really did happen. It sounds remarkably like the kind of "cost of everything, value of nothing" decisions I've fought for most of my career.

  6. Anonymous Coward
    Anonymous Coward

    Maybe...

    Could it be that the replication to the DR site isn't real time (or should have been, but wasn't), and the extended outage was caused by not having captured all the transactions? I can see that if you've taken a load of payment details and booked flights associated with them, you really don't want to lose them. If a failure occurs and the remote site doesn't have every single disk transaction, you've not really any idea of what you've lost, and you need to go to great lengths to recover that data.

    Just speculation, mind.
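
    A rough sketch in Python of the sort of gap I mean, assuming asynchronous replication - all names and numbers are invented, not taken from Navitaire's actual setup:

    # Illustrative only: with asynchronous replication there is a window of
    # transactions committed at the primary that never reached the DR site.
    from dataclasses import dataclass, field

    @dataclass
    class Site:
        name: str
        log: list = field(default_factory=list)  # committed transactions, in order

    def replicate(primary: Site, dr: Site, lag: int) -> None:
        """Ship everything except the last `lag` transactions (the in-flight window)."""
        dr.log = primary.log[:len(primary.log) - lag]

    primary = Site("primary")
    dr = Site("dr")

    # Bookings and payments committed at the primary...
    for txn_id in range(1, 11):
        primary.log.append(f"booking-{txn_id}")

    # ...while replication runs a few transactions behind.
    replicate(primary, dr, lag=3)

    # The primary fails. The DR site could take over, but it cannot tell what
    # it is missing without some external record (payment gateway logs,
    # client retries and so on).
    lost = primary.log[len(dr.log):]
    print("DR site holds", len(dr.log), "of", len(primary.log), "transactions")
    print("Potentially lost at failover:", lost)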

  7. Anonymous Coward
    Grenade

    Acc-i-denture

    Well....

    That's Accenture for you.... and their now defunct - Go on be a Tiger slogan/campaign, etc seems to be ifn full force...

    Ahh yes, cheating, denying and ummm... ultimately poor performance. Been there, seen that...

  8. Anonymous Coward
    FAIL

    Chickens / roost

    Accenture in their infinite arrogance decided they knew better than the storage vendors and went with a 'value-engineered' system and their own consultants for implementation, so it was an accident waiting to happen.

    Properly implemented and operated, the technologies would have prevented the outage and saved the day (they are very well proven).

  9. strobeonline
    Happy

    We may never know

    Real-time snapshotting to a remote site and prevention of data loss are all very well, but do not of themselves allow a remote system to take over production processing and data integrity protection as a failover site.

    For a remote site to assume production processing with the same level of protection (and surely that should be the aim if attempted), you need a minimum of three sites, and preferably four, in the loop. This is of course incredibly complex to maintain and monitor - read: expensive and slower to build.

    I would guess that decisions were made for primarily financial reasons whose implications were either UNexplained, UNaccepted or (in a worst case) UNrecognised by any of the decision-makers.

    UNbelievable in this age perhaps? After 30 years in the ICT infrastructure business, I view it as the inevitable result of adoption of new technology at low (no?) cost without reading the small print that says your mission critical services are only as good as the least expensive component. Perhaps instead of avoiding capital investment they need to adopt the old adage: if you need something doing well, do it yourself?


  10. Concerned but optimistic

    Rethinking risk profiles

    A good point is raised here in regard to the transaction integrity and retention capability of the DR site.

    An often forgotten part of DR planning is the decision tree. A good decision tree will help evaluate the type of disaster or outage being faced and weigh the cost of enacting a particular course of action against another.

    It may well be that the known costs and risks of migrating live operations to (and back from) the DR site were estimated to be very high, and so additional time was invested in repairing the primary site (a rough cost comparison is sketched at the end of this comment). This does point to a weakness in all but the most sophisticated (and expensive) DR strategies.

    A lot of DR strategy testing/proving is focussed on re-establishing core capability in the event of critical failure or destruction of the primary site. These tests are predicated on the magnitude of risk in wholesale damage and long-term repair prospects at the primary site. Not enough work is done on coping with the lower-magnitude risk of individual equipment failures and the relative costs of migration vs repair.

    N+1 is fairly fundamental here. Switching to the DR site should not involve accepting subpar data protection or data integrity. Operations like these need two fully operational systems/sites with as close to realtime transaction mirroring as possible.
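
    As a toy illustration of the decision-tree point above - the scenario, hours and costs below are all invented, purely to show that failing over is not automatically the cheaper branch:

    # Toy expected-cost comparison for one branch of a DR decision tree.
    # Every figure here is a made-up, plausible-looking guess.

    def estimated_cost(hours_down: float, cost_per_hour: float, fixed_risk_cost: float) -> float:
        """Outage cost plus any fixed cost attached to the chosen action."""
        return hours_down * cost_per_hour + fixed_risk_cost

    COST_PER_HOUR = 50_000  # hypothetical: lost bookings, compensation, goodwill

    options = {
        # Fix the failed kit in place: slower, but no migration risk.
        "repair primary in place": estimated_cost(6.0, COST_PER_HOUR, 0),
        # Fail over: service back sooner, but reconciling missing transactions
        # and the eventual fail-back carry their own fixed price.
        "fail over to DR site": estimated_cost(2.0, COST_PER_HOUR, 400_000),
    }

    for action, cost in sorted(options.items(), key=lambda kv: kv[1]):
        print(f"{action}: roughly ${cost:,.0f}")

    With guesses like those, six hours of in-place repair comes out cheaper than a fail-over; double the repair estimate and the tree tips the other way - which is exactly why the decision tree has to be worked out before the outage, not during it.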

  11. Anonymous Coward
    FAIL

    The Navitaire cock-up

    I used to do application support for a major UK airline that used a Navitaire booking system. While the basic system was very good, their levels of incompetence were quite staggering when it came to minor details - like assuming there are 24 hours in EVERY day. So when the clocks change...

    Yes, you guessed it, we lost a whole day's financial transactions.

    They managed to recover them when I eventually got it into their thick heads that one day a year was 23 hours and one day a year was 25 hours.
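
    For what it's worth, a minimal sketch in Python of how that sort of hole opens up - the dates and the boundary-by-addition scheme are my own illustration, not their actual code:

    # If day boundaries are computed as "previous boundary + 24 hours of
    # elapsed time", they drift by an hour whenever the clocks change.
    # Dates are the 2010 UK autumn clock change, purely as an example.
    from datetime import datetime, timedelta, timezone
    from zoneinfo import ZoneInfo

    LONDON = ZoneInfo("Europe/London")

    # True local-midnight day boundaries around the clock change.
    true_boundaries = [
        datetime(2010, 10, 30, tzinfo=LONDON),
        datetime(2010, 10, 31, tzinfo=LONDON),
        datetime(2010, 11, 1, tzinfo=LONDON),
    ]

    # Naive boundaries: start from the first one and keep adding exactly
    # 24 hours of elapsed time (the epoch_seconds += 86400 approach).
    naive_boundaries = [true_boundaries[0].astimezone(timezone.utc)]
    for _ in range(2):
        naive_boundaries.append(naive_boundaries[-1] + timedelta(hours=24))

    for true_b, naive_b in zip(true_boundaries, naive_boundaries):
        print("true:", true_b, "| naive:", naive_b.astimezone(LONDON))

    # The last naive boundary lands at 23:00 on 31 October: the final hour of
    # that 25-hour day falls into the wrong "day", which is exactly the sort
    # of hole a whole batch of financial transactions can vanish into.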
