BA's 'global IT system failure' was due to 'power surge'

British Airways CEO Alex Cruz has said the root cause of Saturday's London flight-grounding IT systems ambi-cockup was "a power supply issue*" and that the airline has "no evidence of any cyberattack". The airline has cancelled all flights from London's Heathrow and Gatwick amid what BA has confirmed to The Register is a " …

  1. Tom Paine Silver badge
    Pint

    "Tirelessly"?

    The airline's IT teams are working "tirelessly" to fix the problems, said Cruz.

    I bet they're not, you know. At the time of writing - 19:48 on the Saturday of a Bank Holiday weekend - I'm pretty sure they're tired, fed up, and just want to go to the pub.

    1. allthecoolshortnamesweretaken Silver badge

      Re: "Tirelessly"?

      "Teams"? As in "more than one"? After all the RIFs?

      1. Yet Another Anonymous coward Silver badge

        Re: "Tirelessly"?

        They have one team on rentacoder and another from craigslist

        1. Anonymous Coward
          Anonymous Coward

          Re: "Tirelessly"?

          Nope, they are all on a flight from Mumbai to London. There is no one left here to fix the issues, they all got outsourced some time ago.

          Power Supply? Perhaps these 'support' guys think that a few PC supplies obtained from a market stall on the way to the airport will fix the issue.

          Seriously, don't BA have a remote DR site?

          Probably got axed or is in the process of being moved to India.

          1. handleoclast Silver badge

            Re: "don't BA have a remote DR site?"

            Probably got axed or is in the process of being moved to India.

            Close. The people who kept insisting that BA invest in a remote DR site got axed and their jobs moved to India. Not only are Indians cheaper, they don't keep insisting on stupid ideas like remote DR that costs lots of money, they just do what they're told to do without question.

            1. Tbird

              Re: "don't BA have a remote DR site?"

              Couldn't agree more.

Any decent-sized company would have a 'failover' or DR system ready to kick in from a remote location; it's standard practice even in smaller companies.

Apart from millions of pounds in claims, the government should fine BA for such poor practice. It's all very well saying 'Rebook' or 'Refund' and don't come to the airport, but people have holidays planned at the other end of the journey that have taken a year to plan and save for.

We all also know that a trip to Heathrow involves parking, taxis and motorways, and is not as simple as just 'popping over' when your flight's ready.

              Shame on you Alex Cruz, the shareholders should speak out!

          2. Version 1.0 Silver badge

            Re: "Tirelessly"?

            Seriously, don't BA have a remote DR site?

Of course it does, it's kitted out with WMRN drives (Write Many, Read Never), but they were having a reliability problem with them causing slow writes, so they redirected the backups to /dev/null - it was much faster.

    2. anthonyhegedus Silver badge

      Re: "Tirelessly"?

      Do they have pubs in that part of India?

    3. Trixr Bronze badge

      Re: "Tirelessly"?

      I dunno, it's not a bank holiday in India, and they're probably flogging all the poor bastards to death over there.

    4. Robert E A Harvey

      Re: "Tirelessly"?

      It must be true. He was wearing a yellow high-viz waistcoat when he said it

      1. Duffaboy
        Trollface

        Re: "Tirelessly"?

        And a Clipboard

    5. Voland's right hand Silver badge

      Re: "Tirelessly"?

They are lying too. Their system was half-knackered the day before, so things do not compute. It was definitely not a Saturday failure - it started 24 hours before that.

I did not get my check-in notification until 10 hours late, and the boarding pass emails were 8 hours late on Friday.

So they are massively lying. Someone should check if there are holes in the walls in their office at Waterside from Pinocchio noses punching through them at Mach 1.

      1. Anonymous Coward
        Anonymous Coward

        Re: "Tirelessly"?

"So they are massively lying. Someone should check if there are holes in the walls in their office at Waterside from Pinocchio noses punching through them at Mach 1."

        Ala 9/11?

    6. TheVogon Silver badge

      Re: "Tirelessly"?

      Did TATA outsource it to Capita?

      1. W T Riker

        Re: "Tirelessly"?

        The correct spelling is crapita

    7. Aralivas

      Re: "Tirelessly"?

I have worked in IT operations at different banks for almost 30 years.

Unfortunately I have seen the same trend at every bank I worked for: outsourcing IT offshore.

And each time the result was the same: poor service and a lot of disruption and miscommunication between the offshore teams and local teams.

However, I cannot understand how a major airline like BA does not have a tested and validated disaster recovery plan.

In banks it's common practice to run DR drills each year and validate all critical applications against a major incident (fire, power outage, earthquake, etc).

During those drills, which take place over two weekends, all IT staff are present; they simulate the full outage of a data center and try to bring up the most critical applications on the second data center. Normally the applications should be up in less than two hours, otherwise the DR test is considered a failure.

Failure of a power supply is not a valid reason these days. A UPS (uninterruptible power supply) with strong batteries can keep the most critical systems and servers up for 24 hours or more.

If British Airways' IT director decided not to have a disaster recovery data center and not to perform such disaster recovery drills yearly, then he has to be fired! This is the basics of a Tier 0 (critical applications) IT architecture.

The bad news is that if BA does not improve its IT architecture, the same issue could happen again.

      1. Amorous Cowherder
        Facepalm

        Re: "Tirelessly"?

        Not sure about other banks but it's part of our mandatory requirement to the auditors to prove we have a functioning DR site and appropriate, tested procedures for using it!

      2. Jellied Eel Silver badge

        Re: "Tirelessly"?

This probably isn't a DR issue, but an HA one. BA relies heavily on IT to know where its aircraft, passengers, staff, luggage, spare crews, spare parts and everything else are, in real time. So there's a lot of interdependent data that would need to be synchronously replicated between the DCs at LHR so an accurate state table is maintained, even if X breaks. But then if X does break and there's data loss or corruption, getting back to a working state gets harder. Rolling back to a previous state may tell you where stuff was, but not where it is now. Which can be a fun sizing challenge if you don't have enough transaction capacity to handle an entire resync.
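The resync headache being described can be sketched with a toy model - a rough illustration only, where the class, keys and values are all invented and not anything BA actually runs:

```python
# Toy model of synchronous replication with catch-up resync.
# Illustrative only: real systems (DB log shipping, quorum writes) are
# far more involved, but the shape of the problem is the same -- a dead
# replica accumulates a backlog that must be replayed before it is
# consistent again, and that replay needs spare transaction capacity.

class ReplicatedStore:
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.replica_up = True
        self.backlog = []          # writes the replica missed while down

    def write(self, key, value):
        self.primary[key] = value
        if self.replica_up:
            self.replica[key] = value          # synchronous apply
        else:
            self.backlog.append((key, value))  # diverging: queue for resync

    def fail_replica(self):
        self.replica_up = False

    def resync(self):
        # Replay every missed write before declaring the replica consistent.
        for key, value in self.backlog:
            self.replica[key] = value
        self.backlog.clear()
        self.replica_up = True

store = ReplicatedStore()
store.write("BA117", "gate 23")
store.fail_replica()
store.write("BA117", "cancelled")   # replica is now out of date
store.resync()
assert store.primary == store.replica
```

Note that the backlog only tells you where things were queued to go; anything that moved in the real world while the replica was down still has to be re-observed, which is the "where it is now" problem above.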

        Or maybe power & cooling capacity. Unusually for a UK bank holiday, the weather has been quite nice. So cooling & power demands increased in and around LHR, which includes lots of large datacentres. On the plus side, there'd be plenty of food for IT & power folks working at LHR given meals have probably been arriving, with no passengers to eat them.

      3. davidhall121

        Re: "Tirelessly"?

        24 hours on UPS...

        I think not !

        10 years in the DC business leads me to believe that the warm weather likely played a part !

      4. Anonymous Coward
        Anonymous Coward

        Re: "Tirelessly"?

        I worked with DR systems which have a multitude of recovery point objectives for the apps from 1 to 24 hours. And failing back to the primary system at the end of DR has some serious omissions, so there's a tendency not to want to activate the DR unless absolutely necessary.

As for testing the DR plan periodically? It isn't done... it would result in too much downtime of critical systems and take weeks to work out how to do it. We just pulled the wool over the customer's eyes and did a limited set of testing of apps.

        When I had to activate DR applications, the amount of reconfiguration work required and troubleshooting took 6 hours.

        The customer got what they specified and paid for.

        Welcome to the world of non-banking organisations.

        1. tfb Silver badge

          Re: "Tirelessly"?

I worked for banks. They had DR sites. DR tests typically involved a long, carefully sequenced series of events to migrate one, or at most a few, services between the sites. If they ever had a major event in one of the DCs and had to do an unplanned DR of a large number of services, or all of them, I had no doubt they would have failed, both to do the DR and then, shortly afterwards, as a bank (and, depending on which bank it was, this would likely have triggered a cascade failure of other banks with results which make 2007-2008 look like the narrowly-avoided catastrophe it was).

Banks are not better at DR: it is just convenient to believe they are. We now know what a (partial?) DR looks like for BA; we should live in the pious hope that we never find out what one is like for a bank, although we will.

Of course, when it happens it will be convenient to blame people with brown skins who live far away, whose fault it isn't: racism is the solution to all problems, of course.

        2. CrazyOldCatMan Silver badge

          Re: "Tirelessly"?

          Welcome to the world of non-banking organisations.

Well - BA *used* to (in the early '90s) use mainframes running TPF. As did quite a lot of banks.

          Whether BA still do I don't know.

    8. Tom Paine Silver badge
      Angel

      Re: "Tirelessly"?

      Finally, a bit of actual detail from Mr Cruz. I took the liberty of transcribing relevant bits, hear it at

      (starts about 12m in) http://www.bbc.co.uk/programmes/b08rp2xd

A: "On Sat morning we had a power surge in one of our DCs which affected the networking hardware; that stopped messaging -- millions and millions of messages that come between all the different systems and applications within the BA network. It affected ALL the operations systems - baggage, operations, passenger processing, etc. We will make a full investigation..."

Q: "I'm not an IT expert but I've spoken to a lot of people who are, some of them connected to your company, and they are staggered, frankly - and that's the word I'd use - that there isn't some kind of backup that just kicks in when you have power problems. If there IS a backup system, why didn't it work? Because these are experts - professionals - they cannot /believe/ you've had a problem going over several *days*."

A: "Well, the actual problem only lasted a few minutes. So there WAS a power surge, there WAS a backup system, which did NOT work at that particular point in time. It was restored after a few hours in terms of some hardware changes, but eventually it took a long time for messaging, and for systems, to come up again as the operation was picking up. We will find out exactly WHY the backup systems did not trigger at the right time, and we will make sure it doesn't happen again."

(part 1)

      1. TkH11

        Re: "Tirelessly"?

Doesn't explain how the network switches and equipment lost power. Were the UPSes properly maintained?

  2. Pen-y-gors Silver badge

    Ho hum

    Another business where the phrase 'single point of failure' was possibly just words - or where one failure cascaded down to overload the backup.

    Resilience costs money.

    1. Grimsterise

      Re: Ho hum

      Amen

      1. Danny 14 Silver badge

        Re: Ho hum

Said the same to our bean counter: either give me the money for two identical systems, or the money for one and log my concern. Money for 1.5 won't work.

        1. h4rm0ny
          Joke

          Re: Ho hum

          Yeah, BA IT staff told their CEO they needed greater redundancy... So he fired them.

          They're called Tata because that's what they say once they've got your money.

          Tata have stated they'll be flying hundreds of engineers to the UK to resolve the problem. As soon as they find an airline able to transport them.

          It technically IS a power supply issue. Alex Cruz should never have had any.

          1. Anonymous Coward
            Anonymous Coward

            Re: Ho hum

            Agree, TCS is the cut rate provider among cut rate providers. They always seem to promise the moon to win contracts but the follow through has not been impressive based on the engagements I have seen.

            1. John Smith 19 Gold badge
              Unhappy

              "Agree, TCS is the cut rate provider among cut rate providers. "

              Sounds like they have a bright future joining the "Usual suspects" in HMG IT contracts.

              Bright for them. Not so bright for the British taxpayer.

            2. JimboSmith Silver badge

              Re: Ho hum

              I know a company (coz I used to slave for them) that went with a software/hardware supplier who promised the earth and then didn't deliver. The funny thing is they weren't cheap but they were cheerful when you called them to hear them say:

              "No our systems don't offer that functionality".

              "The old system did and that was a damn sight less expensive than yours"

              "We could develop it for you but that's going to cost dev time"

        2. Aitor 1 Silver badge

          Re: Ho hum

          But two identical systems capable of taking over each other is not 2x the expense, but 4x the expense.

So the intelligent thing here would be to have systems as light as possible (no Java, please, PLEASE), and have them replicated in three places.

          Now, knowing this type of company, I can imagine many fat servers with complicated setups.. the 90s on steroids.

          The solution, of course, is to have critical systems that are LIGHT. It saves a ton of money, and they could be working right now, just a small hiccup.

          Note: you would need 4x identical systems, + 4 smaller ones for being "bomb proof"

          2x identical systems on production. Different locations

2x the above, for preproduction tests, as you can't test with your clients.

          4x for developing and integration. They can be smaller, but have to retain the architecture.

          At best, you can get rid of integration and be the same as preproduction.

          These days, almost nobody does this.. too expensive.

          1. Peter Gathercole Silver badge

            Re: Ho hum

            It does not have to be quite so expensive.

            Most organisations faced with a disaster scenario will pause pretty much all development and next phase testing.

            So it is possible to use some of your DR environment for either development or PreProduction.

            The trick is to have a set of rules that dictate the order of shedding load in PP to allow you to fire up the DR environment.

So, you have your database server in DR running all the time in remote update mode, shadowing all of the write operations while doing none of the queries. This will use a fraction of the resource. You also have the rest of the representative DR environment running at, say, 10% of capacity. This allows you to continue patching the DR environment.

When you call a disaster, you shut down PP and dynamically add the CPU and memory to your DR environment. You then switch the database to full operation, point all the satellite systems at your DR environment, and you should be back in business.

This will not give you a fully fault-tolerant environment, but it will give you one which you can spin up in a matter of minutes rather than hours, and it will stop valuable resources from sitting doing nothing. The only doubling-up is in storage, because you have to have the PP and DR environments built simultaneously.

With today's automation tools, or locally written bespoke tools, it should be possible to pretty much automate the shutdown and reallocation of the resources.

One of the difficult things to decide is when to call DR. Many times it is better to try to fix the main environment rather than switch, because no matter how you set it up, it is quicker to switch to DR than to switch back. Get the decision wrong, and you either have the pain of moving back, or you end up waiting for things to be fixed, which often takes longer than the estimates. The responsibility for that decision is what the managers are paid the big bucks for.
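The shedding rules described above might look roughly like this - a sketch under invented assumptions, where the workload names, priorities and CPU counts are all made up:

```python
# Rough sketch: shed PreProduction load, lowest priority first, until
# enough CPU is freed to grow the DR environment to full size.
# Workload names, priorities and CPU figures are invented for illustration.

def shed_for_dr(workloads, cpu_needed):
    """Pick sheddable workloads in priority order (lowest number = shed
    first) until cpu_needed CPUs are freed.
    Returns (cpus_freed, shutdown_order)."""
    freed, order = 0, []
    for name, prio, cpus in sorted(workloads, key=lambda w: w[1]):
        if freed >= cpu_needed:
            break
        freed += cpus
        order.append(name)
    return freed, order

preprod = [
    ("batch-reports", 1, 16),   # priority 1 = first to go
    ("integration",   2, 24),
    ("preprod-web",   3, 32),
]
freed, order = shed_for_dr(preprod, cpu_needed=40)
assert freed >= 40
assert order == ["batch-reports", "integration"]   # preprod-web survives
```

The ordering is the whole point: a fixed, pre-agreed shed list means the "when to call DR" decision is the only judgment call left on the day.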

            1. Anonymous Coward
              Anonymous Coward

              Re: Ho hum

              You don't even need all that.

Set up your DCs in an active-active configuration and run them both, serving resources from the quickest location. Find some bulletproof filesystem and storage hardware that won't fall over if there are a couple of lost writes (easier said than done!) and make sure you use resource pools efficiently, with proper tiering of your applications.

Then if one DC goes down, all workloads are moved to spare capacity on the other site - non-critical workloads automatically have their resources decreased or removed, and your critical workloads carry on running.

After restoration of the dead DC, you manually or automatically move your workloads back, and your non-critical workloads find themselves with some resources to play with again.

              This is how cloud computing from the big vendors is designed to work. However legacy systems abound in many places, including old data warehouses with 'mainframe' beasts. Sometimes it isn't easy to reengineer it all to be lightweight virtual servers. It's why the banks struggle so much and why new entrants to the banking sector can create far better systems.
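As a rough sketch of that failover behaviour - every name and capacity figure here is invented for illustration, not taken from any real deployment:

```python
# Sketch of active-active failover: when one DC dies, its critical
# workloads land on the survivor, squeezing non-critical ones to fit.
# Workload names, tiers and CPU capacities are invented for illustration.

def fail_over(dead_dc, live_dc, capacity):
    """Move dead_dc's critical workloads onto live_dc, shrinking live_dc's
    non-critical workloads until the total fits within capacity.
    Workloads are dicts of name -> (tier, cpus)."""
    merged = dict(live_dc)
    for name, (tier, cpus) in dead_dc.items():
        if tier == "critical":          # non-critical load is simply dropped
            merged[name] = (tier, cpus)
    used = sum(c for _, c in merged.values())
    for name, (tier, cpus) in sorted(merged.items()):
        if used <= capacity:
            break
        if tier == "non-critical":      # shrink, never touch critical
            give_back = min(cpus, used - capacity)
            merged[name] = (tier, cpus - give_back)
            used -= give_back
    return merged

dc_a = {"booking": ("critical", 40), "analytics": ("non-critical", 30)}
dc_b = {"checkin": ("critical", 40), "reporting": ("non-critical", 40)}
after = fail_over(dc_a, dc_b, capacity=100)
assert after["booking"] == ("critical", 40)          # moved over intact
assert sum(c for _, c in after.values()) <= 100      # survivor not overloaded
```

The hard part, as the comment notes, is not this arithmetic but making the storage layer survive the failover without losing writes.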

              1. Anonymous Coward
                Anonymous Coward

                Re: Ho hum

                Are you a sales guy for one of the large cloud providers by any chance?

                Who in their right mind would gift the entire internal workings of a huge multinational company to a single outfit?

                Why not create your own "cloud" infrastructure with your own "legacy" systems?

That's the real way to do it from a huge company's perspective...

                1. Anonymous Coward
                  Anonymous Coward

                  Re: Ho hum

                  "Who in their right mind would gift the entire internal workings of a huge multinational company to a single outfit?"

Salesforce, Snapchat, AirBnB, Kellogg's, Netflix, Spotify... many companies, with more moving workloads to some cloud service daily. You don't need to give it all to one cloud provider; it could be multiple... almost certainly will be multiple. My point is: how many occasions do you recall when amazon.com or google.com were down or performance-degraded in your life? Probably not once, despite constant code releases (they don't have a 'freeze period'; many daily releases), crazy spikes in traffic, users in all corners of the world, and constant DDoS and similar attempts. How many on-prem environments can keep email, or the well-worn application of your choosing, up with exceptional performance at those levels, counting planned outages? The idea that there is such a thing as a 'planned outage', and that they are considered acceptable or necessary, is my point.

                  "Why not create your own "cloud" infrastructure with your own "legacy" systems?"

You won't have the reliability or performance, especially for users in far-flung corners, of a cloud provider. The financials are much worse on-prem. In a cloud service you can pay for actual utilization, as opposed to having to scale to whatever your one peak day in a three-year period is (and probably beyond that, as no one is exactly sure what next year's peak will be). Shut workloads off or reconfigure on the fly, versus writing architectures in stone. No need to pay a fortune for the likes of Cisco, VMware, EMC, etc. Use open source and non-proprietary gear in the cloud.

                  Also, all applications written post 2000 are natively in the cloud, aaS. There is no on prem option. Most of the legacy providers, e.g. Microsoft and Oracle, are pushing customers to adopt cloud as well. Unless you plan to never use anything new, you're bound to be in the cloud at some point.

                  1. patrickstar

                    Re: Ho hum

                    The full IT setup of a global airline is a lot harder to distribute than anything those companies do.

At least three of them (Netflix, Spotify and Snapchat) are just about pushing bits to end-users with little intelligence, which isn't even remotely comparable to what BA does. And Google search, while being a massive database, has no hard consistency requirements.

                    Kellogg's just uses it for marketing and development - see https://aws.amazon.com/solutions/case-studies/kellogg-company/ . Which seems like a pretty good use case, but again in no way comparable to airline central IT.

Salesforce has had at least one major outage, by the way.

                    Didn't Netflix move to their own infrastructure from the clown... with cost given as the reason?

                    Sigh, kids of today - thinking a streaming service is the most complex, demanding and critical IT setup there is. Next you'll suggest they rewrite everything in nodeJS with framework-of-the-day...

                    1. Anonymous Coward
                      Anonymous Coward

                      Re: Ho hum

                      "Kellogg's just uses it for marketing and development"

                      No they don't. They have SAP and all of the centralized IT running in cloud. That is just one example too. There are all manner of companies using it for core IT. Every Workday ERP customer, by definition, uses cloud.

                      "Salesforce has had atleast one major outage, by the way."

After they started moving to cloud? They had several outages, which is why Benioff decided that maintaining infrastructure doesn't make them a better sales software company ('this isn't our area of expertise') - move it to the cloud. They only started their migration last year.

                      "Didn't Netflix move to their own infrastructure from the clown... with cost given as the reason?"

No, the opposite of that is true. They decided to move everything to the cloud purely for speed and agility... and then found, to their surprise, that it also saved them a lot of money. Link below.

                      https://media.netflix.com/en/company-blog/completing-the-netflix-cloud-migration

                      "thinking a streaming service is the most complex, demanding and critical IT setup there is."

You think the IT that traditional enterprises run is *more* complex, demanding and critical than Google, Amazon, etc's services? The most valuable companies in the world... running billion-active-user applications worth hundreds of billions of dollars? Those are the most complex, highest-value applications on the planet. You could add orgs like Netflix and Snapchat to those rosters. Those applications don't support "the business". Those applications are the business.

The reason that traditional companies haven't yet moved the whole shooting match to cloud has nothing to do with the average insurance company doing something on the AS/400 which is just beyond the technical prowess of AWS or Google. It is because of internal politics... i.e. the IT Ops people dragging their feet because they think cloud makes them obsolete. I don't think this is true. It will work similarly to the on-prem environment for admins. It's not like people even go into their on-prem data centers that often... but the perception is that cloud is a threat, so IT Ops tries to kill any cloud initiative.

I think those IT Ops people are harming their careers, as cloud is going to happen; you can delay it but not stop it. Soon every company will be looking for someone with 5 years of experience running production cloud from the Big 3, and those that dragged their feet the longest won't have it, whereas the early adopters will be in high demand.

                      1. patrickstar

                        Re: Ho hum

                        Regarding Kellogg's, if you read Amazon's own marketing material, it's about SAP Accelerated Trade Promotion Management (analytics) and HANA (in-memory database), not SAP ERP.

They are not related in terms of the tasks they do. HANA isn't even originally a SAP product.

                        As to Netflix - sorry, I confused them with Dropbox. Point remains.

                        That you keep bringing up Google or Netflix or whatever as a reason for why BA could migrate their IT infrastructure to AWS clearly shows you have absolutely no clue what you are talking about.

                        They are completely different applications. The issues and challenges are not remotely comparable in any way.

                        If you have a service that realistically can be distributed over a lot of unreliable hosts, then AWS or similar might be for you. Such as pushing a lot of bits (streaming services, Snapchat), or maintaining huge databases without hard consistency requirements (Google search, analytics). Neither of which is easy at those scales, of course, but they do fundamentally lend themselves to this way of operating.

                        What you need for core IT operations at eg. an airline or a bank is totally different.

                        Plus you are completely glossing over the myriad of very good reasons why someone could need their own infrastructure and/or fully control the staff involved. (Can you even get a current list of everyone with admin or physical access to the hosts from AWS...?)

                        1. Anonymous Coward
                          Anonymous Coward

                          Re: Ho hum

                          "What you need for core IT operations at eg. an airline or a bank is totally different."

You seem to think that because Amazon.com and Google search are different services from an airline's systems (although Amazon.com is pretty similar to an airline's systems), they cannot take on an airline reservation system or flight scheduling system on their cloud services. Google, for instance, has invented a totally revolutionary RDB (with ACID compliance and strong consistency) called Spanner, which is perfect for an airline system or a bank... an infinitely scalable, infinite-performance traditional SQL DB.

                          "sorry, I confused them with Dropbox."

True, Dropbox did move their stuff off of AWS. I think for a service of Dropbox's size, with 600-700 million active users, moving off of AWS is not unreasonable. AWS is largely just IaaS with no network. Even so, it may make sense for a Dropbox, but that assumes you have many hundreds of millions of active users to achieve the sort of scale where it could make sense... and we'll see if it still does in a few years. AWS was charging huge mark-ups on their cloud services - massive margins - largely because, until Google and Azure came on to the scene over the last few years, they had no competitors. Now those three are in a massive price war and prices are falling through the floor. Cloud prices are going to continue to fall, and moving off may not be viable in the future... This is a rare case in any case, though. Dropbox is one of the largest computing users in the world. The average company, even a large company like BA, is not close to their scale.

                          "Plus you are completely glossing over the myriad of very good reasons why someone could need their own infrastructure and/or fully control the staff involved"

I don't think there are a myriad of reasons. The one reason people cite is security... just generally. I think this is unfounded, though. Google, for instance, uses a private protocol within their data centers for security and performance (not IP). Even if you were able to get in, there isn't much you could do, as any computer you have would not understand the Google protocol. Google builds all of its own equipment, so there is no vector for attack - unlike the average company, which uses Cisco or Juniper access points with about a million people out there with knowledge of those technologies. DDoS is another good one. You are not tipping AWS or Google over with a DDoS attack, but you could knock down an average company. As far as internal security goes, it is well locked down, caged, etc, in any major cloud service. Nothing to worry about... AT&T, Orange, Verizon, etc could be intercepting and de-encrypting the packets you send over their networks, but no one is worried about that because you know they have solid safeguards and every incentive not to let that happen. Everyone is using the "network cloud", but, because that is the way it has always worked, people just accept it.

                          1. patrickstar

                            Re: Ho hum

                            I'm sure that there is some possibility that in say 30 or 50 years time, having your entire business rely on Spanner could be a good idea. That's how long some of the existing systems for this have taken to get where they are today - very, very reliable if properly maintained.

                            As to security: I really can't fathom that you're trying to argue that it's somehow secure just because they use custom protocols instead of IP. Or custom networking gear (uh, they design their own forwarding ASICs, or what?).

                            At the very least, that certainly didn't stop the NSA from eavesdropping on the links between the Google DCs...

Pretty much everyone considers their telco links untrusted these days, by the way. Thus AT&T or whatever has no way of "de-encrypting" your data, since they aren't involved in encrypting it in the first place. Have you really missed the net-wide push for end-to-end encryption?

                            I don't know offhand what hypervisor Google uses, but AWS is all Xen. Have you checked the vulnerability history for Xen lately? Do you really want Russia/China/US intelligence being able to run code on the same servers as you keep all your corporate secrets on, separated by nothing more than THAT?

                            Never mind how secure the hosting is against external parties, what if I want to know such a basic thing about my security posture as who actually has access to the servers? That's pretty fundamental if you're going to keep your crown jewels on them.

                            What if I need it for compliance and won't be allowed to operate without it?

                            How do I get a complete list of that from the likes of Google or AWS? Do I get updates before any additions are made? Can I get a list of staff and approve/disapprove of them myself? Can I veto new hires?

                2. Anonymous Coward
                  Anonymous Coward

                  Re: Ho hum

                  "Are you a sales guy for ..."

                  Re-read my post. I never said to use a large cloud computing company, although many people may choose this as a solution.

I said to set up *your* datacentres in an active/active mode rather than an active/passive mode. However, this is really hard with some legacy systems: they just don't have the ability to transfer workloads easily or successfully share storage without data loss. I was acknowledging that active/active can be difficult with systems that weren't designed for it; if you rewrite your system with active/active designed in, it is a lot easier.

                  1. Anonymous Coward
                    Anonymous Coward

                    Re: Ho hum

                    "I said to set up *your* datacentres in an active/active mode rather than an active/passive mode."

You can set up DBs in an active-active mode... and it isn't all that difficult to do. The problem is that it kills performance: you write to the primary DB, the primary synchronously sends that write to the second DB, the second DB acknowledges to the primary that it has, indeed, written that data, and only then can the primary start on the next write. For every single write. It happens at millisecond rates, but it will have a performance impact on any sort of high-performance workload. It is also really expensive, as it involves buying something like Oracle Active Data Guard or comparable. You can also have multiple active DBs in an HA setup with RAC, assuming you are using Oracle. The problems there: 1) Really expensive. 2) The RAC manager evicts nodes and fails about as often as the DB itself, so it's kind of a waste of effort. 3) All RAC nodes write to the same storage; if that storage namespace goes down, it doesn't help you to have the DB servers, with no access to storage, still running.
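To put rough numbers on that write path, here's a back-of-envelope sketch (not any vendor's actual implementation; the millisecond figures are made up for illustration):

```python
def sync_commit_ms(local_write_ms: float, rtt_ms: float) -> float:
    """One synchronous commit: write locally, ship the record to the
    second DB, wait for its write and acknowledgement, then return.
    Modelled as local write + network round trip + remote write."""
    return local_write_ms + rtt_ms + local_write_ms

def max_serial_commits_per_sec(local_write_ms: float, rtt_ms: float) -> float:
    """Ceiling on strictly serialised commits, where each write waits
    for the previous acknowledgement before starting."""
    return 1000.0 / sync_commit_ms(local_write_ms, rtt_ms)

# With a 0.1 ms local write and a 0.5 ms round trip between sites:
# 0.7 ms per commit, so a ceiling of ~1400 serial commits/sec,
# however fast the individual servers are.
```

The point being that the ceiling comes from the network round trip, not the hardware on either end.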

                    The way to do it is to shard and cluster a DB across multiple zones/DCs... or Google just released a DB called Spanner, their internal RDB, which is on a whole new level. Really complicated to explain, but impressive.

                    1. TheVogon Silver badge

                      Re: Ho hum

                      "The RAC manager evicts nodes and fails about as often as the DB itself so kind of a waste of effort"

                      Worked just fine for me in multiple companies on both Windows and Linux. What did Oracle say?

                      "All RAC nodes write to the same storage"

The same storage can be synchronously replicated to other arrays across sites, with a failover time in seconds if it goes down. So clustering, basically. Like it says on the box... Or you can automate failover and recovery to completely separate storage with Data Guard, for a failover time of a few minutes.

                      "The way to do it is to shard and cluster a DB across multiple zones/DCs"

Like, say, Oracle RAC or SQL Server (and presumably many others) already can. But only RAC is true active/active on the same DB instance.

                      "Google just released a DB called Spanner"

                      If I cared about uptime, support and the product actually existing next week, the last thing I would consider is anything from Google.

                    2. jamestaylor17

                      Re: Ho hum

                      ''The problem is that it kills performance...''

Not always; it depends on your distance. I've implemented Oracle DBs replicated synchronously across miles of fibre without any knock-on performance hit - and that's with a high-performance workload. Of course, any major distance and you would be in trouble.

                      Your point about sharding is, of course, well made but the truth is BA's legacy architecture is unlikely to be suitable.

                      1. Anonymous Coward
                        Anonymous Coward

                        Re: Ho hum

                        "Not always, depends on your distance I've implemented Oracle DBs and replicated synchronously across miles of fibre without any knock on performance - and that's with a high performance workload. Of course and major distance and you would be in trouble."

True enough, fair point. If you have a dark fiber network of, say, 10-20 miles or some similar distance, you are going to need a tremendous amount of write I/O before you bottleneck the system; the more you increase the distance, the fewer writes it takes to choke it. For most shops, they are never going to hit the write scale needed to performance-choke the DBs.
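That distance trade-off is just speed-of-light arithmetic. A rough sketch, counting propagation delay only and ignoring disk and protocol overhead entirely:

```python
# Light in fibre travels at roughly two-thirds of c: about 200 km per ms.
FIBRE_KM_PER_MS = 200.0

def rtt_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay over dark fibre."""
    return 2.0 * distance_km / FIBRE_KM_PER_MS

def serial_write_ceiling_per_sec(distance_km: float) -> float:
    """Upper bound on strictly serialised synchronous writes, if each
    one must be acknowledged by the far site before the next starts."""
    return 1000.0 / rtt_ms(distance_km)

# ~30 km (about 20 miles): 0.3 ms RTT, ceiling ~3,300 serial writes/sec.
# ~300 km: 3 ms RTT, ceiling ~330/sec - ten times the distance costs
# you ten times the serial write ceiling.
```

Real links add switching and protocol overhead on top, so the actual numbers only get worse.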

                        "Your point about sharding is, of course, well made but the truth is BA's legacy architecture is unlikely to be suitable."

                        True, it would likely be a substantial effort to modernize whatever BA is using (many airlines are still using that monolithic mainframe architecture).

              2. The First Dave

                Re: Ho hum

                " Find some bullet proof filesystem"

                Find that and you can retire from IT...

                1. Real Ale is Best
                  Boffin

                  Re: Ho hum

                  " Find some bullet proof filesystem"

                  Find that and you can retire from IT...

                  Fast, cheap, reliable. Pick two.

            2. Anonymous Coward
              Anonymous Coward

              Re: Ho hum

              You're right. It's called a DRP and all DRPs ought to have a resumption plan.

          2. Wayland Bronze badge

            Re: Ho hum

            No you need two systems both doing the job like two engines but with the capacity to carry on with one system. Then have enough spares so you can go out and mend the broken one.

    2. P. Lee Silver badge

      Re: Ho hum

      >Resilience costs money.

      And it is grossly inefficient. Right up to the point when you need it.

      IT is often way too focused on efficiency.

      1. John Smith 19 Gold badge
        Unhappy

        "And it is grossly inefficient. Right up to the point when you need it."

        Not necessarily.

If the backup system is under the company's control (not a specific DR company's), it can serve to train new sysadmins, test out OS patches and application upgrades, and various other tasks - provided the procedures (and staff) exist to roll it back to the identical configuration of the live system in the event of the live system going down.

        The real issue BA should be thinking about is this.

        How many people in this situation will be thinking "F**k BA. Can't be trusted. Never using them again."

        1. LittleTyke

          Re: "And it is grossly inefficient. Right up to the point when you need it."

          "Never using them again" -- Well, I'm already considering alternative airlines to Hamburg: EasyJet flies from Luton, and Lufthansa from Heathrow. I'm going to compare fares next time before deciding. Now Ryanair, though. I wouldn't fly with them if you paid me.

          1. Doctor Evil

            Re: "And it is grossly inefficient. Right up to the point when you need it."

            "Now Ryanair, though. I wouldn't fly with them if you paid me."

            Oh? Why would that be?

            (North American here with relatives who do fly on Ryanair, so I really am curious about your reasons.)

            1. Vic

              Re: "And it is grossly inefficient. Right up to the point when you need it."

              "Now Ryanair, though. I wouldn't fly with them if you paid me."

              Oh? Why would that be?

              This might give you a general flavour...

              Vic.

        2. Vic

          Re: "And it is grossly inefficient. Right up to the point when you need it."

          How many people in this situation will be thinking "F**k BA. Can't be trusted. Never using them again."

          I used to work somewhere where all travel was booked according to the ABBA Principle - "Anyone But BA"...

          Vic.

        3. CrazyOldCatMan Silver badge

          Re: "And it is grossly inefficient. Right up to the point when you need it."

          it can serve to train new sysadmins, test out OS patches and application upgrades and various other tasks

          *DING*

          Instead, people do all those things on an underspecced copy of the live system[1] when, for a little bit more, they could get resilience.

          [1] If you have even that. We test stuff on the live environment and hope it doesn't break. Not by choice, by budget.

    3. Anonymous Coward
      Anonymous Coward

      Re: Ho hum

      Another business you say...https://www.theregister.co.uk/2017/05/26/major_incident_at_capita_data_centre/

"According to rumour, there was a power failure in West Malling and the generators failed, shutting the whole data centre down."

      2 major UK organisations in a few days, how strange. Hopefully it is nothing like when all those undersea comms cables kept getting 'cut by fishing boats' over the course of a few weeks (shortly before Snowden became famous).

    4. Anonymous Coward
      Anonymous Coward

      Re: Ho hum

      Delta had the same issue a few months ago. Some hardware failure, SPOF, with a power outage took down the airline for over a day.

There should be redundant power supplies in nearly every piece of commercial hardware on the market, and likewise redundant power distribution by the rack or row. So one power supply failing should not cause any issues. You would need to lose primary power, and then lose both power supplies at the same time - or both the component and the rack/row power supplies - in the few minutes before the generators took over supplying power.

It is rarely a question of cash. Most large companies spend huge amounts of cash, tens of millions a year, on HA/DR and related hardware and software. There is plenty of cash in most cases. It is just that setting up and testing an architecture in which you can go in and pull any component, or turn off power at any time, without issue is difficult. Most companies have an HA/DR design from the 80s or 90s in which they fail over from a primary data center to a DR data center, or use ancient HA solutions like Oracle RAC... which is not a "real application cluster" despite the name. Less a money problem, more an architecture problem - and an organizational silo issue, old-school IT Ops vs development instead of DevOps. The gold standard would be Google's cluster-and-shard up and down the stack, across multiple regions architecture or similar... but, unless they are using Google Cloud or similar, few have that architecture in place.

      1. TheVogon Silver badge

        Re: Ho hum

        "like Oracle RAC... which is not a "real application cluster" despite the name."

        If your application is a database it is. True real time active / active. Oracle are a horrible company that I would never use unless I had to, but there are not many real equivalents if any to RAC server.

        1. Anonymous Coward
          Anonymous Coward

          Re: Ho hum

          "If your application is a database it is. True real time active / active."

What happens in a RAC cluster if the storage array all of the nodes are writing to goes down? They go down - well, technically they are up, but it doesn't really matter if they all lose storage. It's not a real cluster, and certainly not a fail-proof HA solution, if you have a SPOF at the storage level. You can use Active Data Guard to get around it, but now you are talking DR and failover instead of HA, where any component can crash without impact, or with only nominal performance degradation.

          1. TheVogon Silver badge

            Re: Ho hum

            "What happens in a RAC cluster if the storage array all of the nodes are writing to goes down?"

            Then the cluster nodes using that array will fail, and those using the synchronously replicated array on your other site will remain up.

          2. Anonymous Coward
            Anonymous Coward

            Re: Ho hum

            "It's not a real cluster, certainly not a fail proof HA solution, if you have SPOF at the storage level. You can use Active Data Guard to get around it, but now you are talking DR and fail over instead of HA where any component can crash without impact, or with only nominal performance degradation."

            Someone doesn't know much about RAC. See http://www.oracle.com/us/products/database/300460-132393.pdf

            1. Anonymous Coward
              Anonymous Coward

              Re: Ho hum

              "Someone doesn't know much about RAC."

              For RAC... which nearly all RAC customers use, all data writes to a single storage name space and losing the name space will take down the cluster... SPOF like I wrote.

Oracle does have Extended RAC which, you are right, does fix the storage SPOF. There is a reason why very few use Extended RAC, though. First, and probably foremost, it is extremely expensive (even by Oracle standards). Second, as you are basically creating one DB across two sites, you would need some third environment for DR - not impossible, but it falls under the extremely-expensive problem. Third, it is not going to work at any sort of scale: because it is the same DB, the two sites must maintain a perfect mirror at all times, which means writes have to happen in a serial manner so the same data always exists on both sides. If it can't scale, there isn't much point, in most cases, in paying millions upon millions for a fairly lightly used application or DB. It avoids the storage SPOF, but creates a scaling issue in exchange.

              1. This post has been deleted by its author

              2. TheVogon Silver badge

                Re: Ho hum

                "For RAC... which nearly all RAC customers use, "

                Well I would hope RAC customers actually use it seeing as I seem to recall it's about $100K / CPU.

                "all data writes to a single storage name space and losing the name space will take down the cluster... SPOF like I wrote."

It doesn't if you don't want it to. There are multiple ways of designing RAC with no single point of failure, depending on your budget, RPO and RTO.

                "Oracle does have Extended RAC which, you are right, does fix the storage SPOF"

                It's still RAC server.

                "There is a reason why very few use Extended RAC though. First, and probably foremost, it is extremely expensive (even by Oracle standards). "

                Plenty of places use it. If you can afford Oracle RAC then array replication licensing probably isn't a cost issue..

                "Second, as you are basically creating one DB across two sites, you would need some third environment for DR"

No you don't. Please leave this stuff to people who understand how to do it right! If you also want fast recovery from replicated storage corruption (never seen it with Oracle, but technically it could happen) then you can, say, use Data Guard and ship logs to two target servers - one in each DC - so you still have a completely separate failover system even if you lose either DC. (However, you do need a simple quorum system in a third location to prevent split-brain on the cluster.)

                1. Anonymous Coward
                  Anonymous Coward

                  Re: Ho hum

                  "Plenty of places use it. If you can afford Oracle RAC then array replication licensing probably isn't a cost issue.."

                  Very few Oracle DBs use RAC in the first place. Of those that use RAC, the vast majority have a cluster (usually three servers) writing to a storage array at site one, data guard shipping logs to site two for failover. Generally async as it is a free license and people are concerned about passing corruptions. Exceptionally few use Extended RAC across two data centers for HA with a third environment (could be in the same second data center or not) for DR.

                  "It doesn't if you don't want it to. There are multiple ways of designing RAC with no single point of failure. Depending on your budget RPO and RTO."

                  You would synchronously replicate it, if that is what you mean. Avoids the SPOF, but creates the performance bottleneck described below.

                  "No you don't. Please leave this stuff to people who understand how to do it right! If you also want to have fast recovery from replicated storage corruption (never seen it with Oracle, but technically it could happen) then you say use Dataguard and ship logs to 2 target servers - one in each DC"

                  That's probably why I wrote "some third environment for DR" and not a third site, necessarily... although you might want a third site as the two synchronous sites have to be within maybe 20-30 miles of each other and a DR site should probably not be in the same area as the primary. Adding a third set of hardware still adds cost, even if there are three sets (storage devices) in two data centers.

The real issue though, as I wrote and you did not address, is performance if you need to keep two mirrors across some sync distance in perfect harmony. Cluster A writes to primary storage A; primary storage A synchronously replicates that write across a network to the storage at site B; only then can the next write start. So any time you alter the DB tables, you write to primary, replicate that write across a 20-mile network or whatever to storage B, have storage B send the acknowledgement to storage A that it did, indeed, complete the write... and then the DBs can start on the next write. That is a massive bottleneck if you are trying to run a massive-throughput operation.

                  1. Anonymous Coward
                    Anonymous Coward

                    Re: massive bottleneck

                    "[synchronous replication] is a massive bottleneck if you are trying to run a massive throughput operation."

                    Suppose the data in question is related to e.g. the automated manufacture of high value relatively low volume goods (vehicles, chips, whatever). The impact of downtime (lost production) is high, but the throughput and latency requirements may be relatively low.

                    In that kind of circumstance it's entirely plausible for the impact of downtime to be yuuuuge (?) even though the performance requirement isn't particularly challenging.

                  2. Anonymous Coward
                    Anonymous Coward

                    Re: Ho hum

                    You are right. However, it is not impossible to design a system that only loses 15 or 30 minutes of data. Some flights will be affected but not a total shut down.

                    1. CrazyOldCatMan Silver badge

                      Re: Ho hum

                      Some flights will be affected but not a total shut down.

                      Except that you would then have planes in the wrong places. So, the plane that is supposed to be doing the SFO-LHR flight is currently in Bangkok, and the TPE-PEK plane is currently in LHR.

                      Which can be sorted out, but it will involve juggling the whole fleet and that ain't easy or cheap.

      2. paulc

        Re: Ho hum

        I was on the commissioning team for the GCHQ doughnut... redundancy was taken very seriously there...

        the building does not run directly off the mains supply, instead, the mains was used to drive motor-generators to provide a very stable supply and in the event of a power cut, diesel motors would start up to drive the motor-generators instead...

        1. TheVogon Silver badge

          Re: Ho hum

          "the building does not run directly off the mains supply, instead, the mains was used to drive motor-generators to provide a very stable supply and in the event of a power cut, diesel motors would start up to drive the motor-generators instead..."

I think you are talking about Piller rotary UPS systems. Just a different type of UPS.

        2. Anonymous Coward
          Anonymous Coward

          Re: Ho hum

          That is probably also a way to deal with TEMPEST and the unwanted emissions being sent back down the power cables: there is no electrical link from the sensitive equipment to the mains power cables coming into the site.

      3. John Smith 19 Gold badge
        Unhappy

        "Delta had the same issue a few months ago. "

        The question is not how long the DC took to recover.

        The question is how long the business took to recover the lost passengers who decided it can't be trusted with their bookings.

        That's harder to measure but I'll bet it's still not happened.

      4. Anonymous Coward
        Anonymous Coward

        Re: Ho hum

I haven't been in a data centre where the servers and network gear are not fed from dual redundant power supplies and PDUs; I would expect them all to be so configured.

I have experienced high-voltage overhead line faults on the feed to the site, and it's difficult to increase the redundancy here unless the data centre just happens to be located at the intersection of two different HV overhead circuits.

        If a high voltage fault occurs, and if you only have a single feed from the national grid power, then you're entirely reliant on the UPS and diesel generator working for long enough. If the batteries in the UPS are not maintained, then you could be in trouble. In which case, it doesn't matter about how many redundant PSUs each server has, or how the power distribution is configured within the data centre.

        So they should not have lost power to the equipment, but they evidently did. My money is on a failure of the UPS or the generator and not on the down stream power distribution architecture.

DR architecture is probably, as you suggest, old style and not using virtualization, and I bet the Indian support team couldn't work out how to bring it up. Troubleshoot something? Not a chance.

        1. jamestaylor17

          Re: Ho hum

          ''So they should not have lost power to the equipment, but they evidently did. My money is on a failure of the UPS or the generator and not on the down stream power distribution architecture.''

At last, someone who knows what they are talking about. I would add it could also be an HV switchgear fault.

    5. Anonymous Coward
      Anonymous Coward

      Re: Ho hum

Supposedly the two centres in the Heathrow area are fed by the usual dual power supply from 'two different suppliers', if a poster on the Daily Mail website who worked for BA is to be believed. Now, have they got a supply for one of those that is different from the National Grid one the rest of us use? Or is it just two different cables coming into the buildings? I think Mr Cruz may not have realised that "remote" doesn't mean "in the next-door building".

  3. Potemkine Silver badge

    In your face, dumbass

And it has nothing to do with the lay-off of hundreds of IT staff, of course... BA gets what it deserves.

    For Mr Cruz, 'redundancy' means firing people, not doubling critical systems to avoid two major failures in a couple of months.

    1. PhilipN Silver badge

      Redundancy

Long ago, in the second-generation post-punch-card mainframe era (before the arrival of desktops and broadband), peripherally to a legal matter I got to learn that BA had a 6 million quid computer system - and a second 6 million quid backup system, 100% redundant, designed only to cut in the instant the first one failed.

      Transposing those key features to today, when what you might call the cost per bit is so much cheaper - why no absolute failsafe backup system?

      1. Tom Paine Silver badge

        Re: Redundancy

        One of many things that have changed, apart from the technology, since then: way more bits.

        1. CrazyOldCatMan Silver badge

          Re: Redundancy

          One of many things that have changed, apart from the technology, since then: way more bits.

          And way more capacity to move said bits..

          (a 128K X25 link was fast when I was doing airline CRS programming in the early 1990's)

      2. yoganmahew

        Re: Redundancy

        @PhilipN

The second system ran VM, test systems etc., so it didn't sit idle. It did, as you say, perform the 100% redundant hardware role (likewise the DASD for the VM/test systems could be used in a pinch to recover the system). There were still some choke points that might have been tricky - DASD controllers, tape drives, speed of disks (the test system ones were older and slower) - but there were also features that could be turned off, so 80% capacity was enough!

      3. Anonymous Coward
        Anonymous Coward

        Re: Redundancy

Whilst I was working for BT, the system running most of the NHS's digital systems at the time had a 'power issue'.

We knew the local electricity company were undertaking power cable replacement work: we had 2x generators running with 2x generators on standby and 7x UPS in line, we'd done a full DR test of losing mains and generator power beforehand, and the UPSs and secondary generators had done their job, so we thought we'd tested every eventuality. On the day, instead of cutting off one mains power line and then installing the new one, the power company installed the second line into live alongside the still-live original power line.

It turned out the German UPS company had installed the breakers incorrectly in 6 out of the 7, so that when they got 2x 240V the breakers fused open and all our data centre kit got the full 480V. The local fuses on the kit blew, but some of the PSUs still got fried along with some disks and RAM.

We had a secondary warm-standby DR site available and switched over to it within 1 hour, and all services resumed apart from one - which was, unfortunately, our internet gateway server, which rebooted itself and then crashed with disk errors. Again unfortunately, its primary server was one whose disks got fried, and it took us 2 days to restore from backup.

        So in summary, even the best laid plans don't always mean IT systems cope when the proverbial hits the spinning blades.

        1. patrickstar

          Re: Redundancy

I would typically think you misunderstood something about the electrics (wiring up two feeds in parallel would either cause nothing, if they were in phase, or a short, if they were out of phase), but I'm not familiar with the UK system.

          Is this 240V single (phase+neutral), two (two phases with 240V average difference) or three phase (where the voltage between phases would be higher but give 240V to neutral) ?

          In all scenarios except the two phase one, overvoltage could also be caused by losing the neutral wire.

          In the two-phase scenario I guess you could somehow end up with two phases with more than 240V difference.

          Otherwise I'd think it was a short voltage spike or similar.
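For what it's worth, the phase arithmetic is easy to check with phasors (a sketch assuming ideal sources and RMS values):

```python
import cmath
import math

def feed(v_rms: float, phase_deg: float) -> complex:
    """RMS phasor for one AC feed at the given phase angle."""
    return cmath.rect(v_rms, math.radians(phase_deg))

def v_between(a: complex, b: complex) -> float:
    """RMS voltage measured between two feeds."""
    return abs(a - b)

in_phase = v_between(feed(240, 0), feed(240, 0))     # 0 V: parallel in-phase feeds, nothing happens
opposed  = v_between(feed(240, 0), feed(240, 180))   # 480 V: the "2x 240V" overvoltage scenario
two_legs = v_between(feed(240, 0), feed(240, 120))   # ~416 V between two legs of a three-phase supply
```

So the full 480V in the story only falls out if the two feeds were effectively in anti-phase; two legs of an ordinary three-phase supply would give ~416V instead.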

  4. Fred Dibnah

    Power

    If power is the issue, what are the IT team doing looking at it? A job for the electricians, shurely?

    1. Anonymous Coward
      Anonymous Coward

      Re: Power

      One does not simply flip the switch to bring a datacentre back online.

      1. Version 1.0 Silver badge

        Re: Power

        One does not simply flip the switch to bring a datacentre back online.

        True - you have to book a flight from India to visit the datacenter and flip the switch.

        1. SkippyBing Silver badge

          Re: Power

          'True - you have to book a flight from India to visit the datacenter and flip the switch.'

          That's okay, they can book it online. oh...

      2. Anonymous Coward
        Anonymous Coward

        Re: Power

Damn right! You've got to start the generators too, or the velociraptors will escape. You should know this - it's a UNIX system.

      3. Anonymous Coward
        Anonymous Coward

        Re: Power

        Correct; you have to have a tested and proven plan for it. Clearly lacking in this case.

      4. Griffo

        Re: Power

It certainly is a big job bringing back a whole DC - and I mean a real dedicated DC, not a simple computer room. I've been there, done that once before. One of the HV feeds into one of our DCs literally exploded, causing that dratted thing called a fire. While we had redundant power entry points into the facility, the on-site team took the step of cutting all power for safety reasons.

In addition to dead hard drives, there were also dead power supplies, and some servers that just decided they didn't want to turn back on. That was the simple stuff. Then there were switches and routers that for some reason didn't have the latest configs committed... or started up with no config whatsoever. Having to work out all the network configs was a total nightmare. Then the storage systems that came up dirty, the replicated databases that needed to be re-seeded... you name it, just about everything that could possibly fail, we had at least one of. Across thousands of servers, with hundreds of routers, dozens and dozens of SANs, and all kinds of ancillary devices such as robotic tape drives, it took us a good 36 hours straight just to get operational, and probably weeks before we could claim it was 100% as good as before the accident. It's no trivial matter.

    2. Yet Another Anonymous coward Silver badge

      Re: Power

      >If power is the issue, what are the IT team doing looking at it? A job for the electricians, shurely?

      The electricians are busy doing the database upgrade.

      1. Marketing Hack Silver badge
        Devil

        Re: Power

        The electricians wanted a purchase order before starting work, but BA's procurement systems are down :)

    3. Anonymous Coward
      Anonymous Coward

      Re: Power

      Supply chain management.

    4. JimRoyal

      Re: Power

From experience, I can tell you that bringing a data centre back online involves more than flicking the switch. For one thing, there will be drive failures. When drives that have been running for a couple of years or more cool down, some of them fail to spin up when the power comes back. The techs will all be praying that each failure is in a separate RAID array.
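And the praying is justified. A back-of-envelope sketch, assuming independent spin-up failures and made-up but plausible numbers:

```python
def p_array_loses_two(drives: int, p_fail: float) -> float:
    """Probability a single RAID 5 array loses two or more drives on
    spin-up, if each drive independently fails with probability p_fail.
    Two losses in one array means the array is gone."""
    p_none = (1 - p_fail) ** drives
    p_one = drives * p_fail * (1 - p_fail) ** (drives - 1)
    return 1 - p_none - p_one

def p_any_array_lost(arrays: int, drives: int, p_fail: float) -> float:
    """Probability at least one array in the room fails outright."""
    return 1 - (1 - p_array_loses_two(drives, p_fail)) ** arrays

# 50 arrays of 8 drives, 2% spin-up failure per drive: each array is
# ~99% safe on its own, but across the whole room the odds of losing
# at least one array are roughly 40%.
```

Which is why a cold restart of a big DC almost always means at least one restore from backup somewhere.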

      1. Anonymous Coward
        Anonymous Coward

        Re: Power

        From experience, I can tell you that bringing a data centre back online involves more than flicking the switch.

        Cruz should know that, given that his CV on the IAG website claims he worked for Sabre for five years. But maybe he cleaned the toilets, or was in marketing? Looking at that career history, quite how somebody so under-experienced got to be CEO of one of the world's largest airline groups I cannot guess.

        1. JimboSmith Silver badge

          Re: Power

          He's only CEO of BA not the parent company IAG which is the group that has all the airlines in it. How he got to be running BA though is a mystery, might see if a bookie will give me odds on how long he stays CEO.

      2. Madonnait

        Re: Power

WTF? No generators? What kind of DC doesn't have generators? No failsafe automatic transition to the generators? Can't be......

    5. Charlie Clark Silver badge

      Re: Power

      My guess is that "power supply" was the term agreed with the insurers (and possibly the Home Office given the scale) for a SNAFU that can be presented as a freak one-off event with no hint of incompetence or possibly even compromise.

  5. Dave Harvey

    Really a power failure?

    I know that one is not supposed to attribute to malice anything which could equally be attributed to mere cock-up, BUT... the scale of the problem, presumably affecting systems which are supposed to be redundant, combined with the particularly bad timing and the "off-shoring", does make me wonder whether a "time-bomb" set by a disgruntled ex-employee might be the real root cause.

    1. Electron Shepherd

      Re: Really a power failure?

      Possibly not left by a disgruntled employee.

      The original WannaCrypt worm ran around encrypting files, but a more stealthy variant could have installed itself and simply waited until a later point in time, or for instructions from a command and control centre somewhere.

      Never mind $300 in BitCoin to ransom a few Excel spreadsheets and a couple of PowerPoint presentations. How about $ <really big number> or no BA flights take off?

    2. Tom Paine Silver badge

      Re: Really a power failure?

      You can come up with any number of scenarios to explain the gap between the claim of "power failure" and the apparent impact. I've no idea what the story is, but "power failure" is too glib.

      I can't believe BA don't have DC level redundancy. The proverbial jumbo jet crashing on LD4 (say, I have no idea if they're in that site, though it's dead handy for Heathrow) shouldn't mean more than, say, 30 mins service outage, tops. And that's assuming lots of legacy gear that can't easily be moved to a modern, realtime replication, hot failover mirrored servers / data / sites set up; if you were starting from scratch today with greenfield sites, a DC outage wouldn't be noticed outside the IT dept.

      1. Danny 14 Silver badge

        Re: Really a power failure?

        Non-English-speaking cleaner mixed up the 'cleaner socket' with the 'top of rack' socket. Thar she blows!

        1. RW

          Re: Really a power failure?

          There are authenticated cases where, for example, a hospital cleaner would unplug clinical equipment keeping patients alive so he could plug in his floor polisher.

          1. Anonymous Coward
            Anonymous Coward

            Re: Really a power failure?

            Authenticated you say?

            http://www.snopes.com/horrors/freakish/cleaner.asp

            1. Anonymous Coward
              Anonymous Coward

              Re: Really a power failure?

              "Authenticated you say?"

              The English Electric Deuce computer had mercury delay lines. Their cabinets looked like large mushrooms. Their precision heater jackets had a dangling cable plugged into a mains socket in a false floor tile. They experienced the problem of cleaners using the socket for the vacuum cleaner.

              In the UK in the 1980s standard power sockets appeared in computer rooms with the earth pin rotated from square. There were several different angles available. The idea was to ensure people couldn't casually plug uncertified equipment into those sockets.

              1. fajensen Silver badge
                Facepalm

                Re: Really a power failure?

                The idea was to ensure people couldn't casually plug uncertified equipment into those sockets.

                The reality was that people would hacksaw that damn ground pin off so that they could get work done!

              2. Trigonoceps occipitalis

                Re: Really a power failure?

                "The idea was to ensure people couldn't casually plug uncertified equipment into those sockets."

                But they would have to unplug my heart-lung machine to find this out.

                I took over as project manager for a national radio project. New broom and all that, I queried the cost of a double 13A socket when we only needed one. It was for the tech's kettle, apparently.

              3. Stoneshop Silver badge
                Holmes

                Re: Really a power failure?

                The idea was to ensure people couldn't casually plug uncertified equipment into those sockets.

                Cleaners will only notice they can't plug their vacuum in after unplugging that which must never be unplugged.

                1. JimC Silver badge
                  Facepalm

                  Re: Really a power failure?

                  And the code monkeys could never understand why I got upset about them plugging things into the oh-so-very-handy waist-height sockets labelled 'reserved for vacuum cleaners'.

                  Funny how intelligent people had so much trouble grasping that the best way to avoid screw-ups from minimum-cost contract cleaners is to make it about ten times easier to do it right than to do it wrong...

              4. PNGuinn
                Facepalm

                Re: Really a power failure?

                "In the UK in the 1980s standard power sockets appeared in computer rooms with the earth pin rotated from square. There were several different angles available. The idea was to ensure people couldn't casually plug uncertified equipment into those sockets."

                Does that really prevent a cleaner pulling the plug, trying every which way to plug in the vac, including kicking the plug, then wandering off to pull another plug somewhere, leaving the first one on the floor somewhere, possibly trodden on?

                Perhaps I overestimate the capacity of the average cleaner ...

                1. JimC Silver badge

                  Re: Does that really prevent a cleaner

                  > pulling the plug, trying every which way to plug in the vac,

                  No, but it did preclude the supposedly intelligent IT staff from plugging the kettle, refrigerator, cooling fan etc into the smoothed clean power supply, which, in those days when a lot of IT equipment was rather less resilient than it is now, was in itself worthwhile. The most valued sockets in the office were the tiny handful, identities as far as possible kept secret, which weren't on the clean power supply but were UPS and generator backed :-)

            2. Kevin Johnston

              Re: Really a power failure?

              It may not have been a hospital and dead patients, but I was working in a flight simulator company and overnight the cleaners unplugged an external disc unit providing customised scenarios for a customer. Losing the connectivity froze the 'in use' simulator mid-session, leaving the instructor to show the students how to egress from a platform stuck 5m up and at a 30-degree angle.

              These things are not all urban myths.

              1. Alan Brown Silver badge

                Re: Really a power failure?

                I've also run into cases of people unplugging stuff they shouldn't, to plug in stuff they shouldn't.(*)

                And denied doing it - until confronted with CCTV evidence, then still tried to deny it.

                (*) In one case, a toasted sandwich maker. That kind of shit belongs in the kitchen.

            3. tinman

              Re: Really a power failure?

              Authenticated you say?

              http://www.snopes.com/horrors/freakish/cleaner.asp

              but... in Trials of an Expert Witness Harold Klawans, a neurologist, talks about being called in to consult on a case that was somewhat similar.

              A hospital porter had been tasked with taking a patient to theatre for surgery. He didn't notice that she had a respiratory arrest and stopped breathing for a time. It was spotted by someone else and she was resuscitated, but the respiratory centre of her brain was damaged, so she was conscious and cognitively unaffected but couldn't breathe without a ventilator. This meant she ended up staying in ICU for some months. Being aware, she got fed up with crosswords and knitting, so she asked for a TV, which the hospital were only too happy to supply as they were facing a lawsuit from her for their staff member's negligence.

              The TV was brought by a porter who had recently been taken off patient transport duties due to his lack of clinical awareness and unwillingness to be retrained in CPR. He brought it to her room and, seeing she was asleep, thought it would be a nice surprise if he plugged it in so it'd be on for her when she awoke. So guess what medical equipment he unplugged to plug in the TV? Yes, indeed, it was the ventilator. And this was back in 1974, so the machine was not designed to alarm if unplugged.

          2. djberriman

            Re: Really a power failure?

            I remember, way back when, ooo perhaps the late 80's, a customer had a machine that kept crashing. Tech support were there (real tech support, who could build operating systems), engineers (real ones, certified by the manufacturer to repair boards at component level) and of course sales (dealing with the irate customer). It was only when one of the office girls ran in late (she clearly made the brew) and flicked the kettle on that all was revealed. Somehow, someone had at some time managed to connect that socket up to the clean supply.

            Similarly, around the same era, a large customer (still around) rang up saying all their external comms were down (this was done via automated analogue modems - state of the art then). It was only when I visited the site that I noticed the computer room (yes, they were rooms then) was laid out differently. They had 'moved' everything above the floor but not below the floor, and in doing so managed to wrap the comms cables round the 3-phase power supply. It wasn't possible to sort it there and then, but luckily (as I was a scout) I had some spare cables in the car and managed to get them going.

            Simpler and fun times.

      2. John Brown (no body) Silver badge

        Re: Really a power failure?

        "I've no idea what the story is, but "power failure" is too glib."

        Maybe both BA and Capita data centres have early version 'net connected smart meters with poor to no security?

      3. Nicko

        Re: Really a power failure?

        I too was wondering about LD4/LD5/LD6... been in those loads, but they are serious DCs... can't believe that Equinix would get that wrong. It's one of their key sales points...

        NY4 is under the flight path into Newark - Do Equinix just happen to like building DCs near airports?

        1. Anonymous Coward
          Anonymous Coward

          Re: Really a power failure?

          Quite safe actually. Aircraft usually crash on the runway or far from it.

          1. Vic

            Re: Really a power failure?

            Aircraft usually crash on the runway or far from it.

            There was a crashed aircraft right on the threshold at Sandown the other day.

            It's really distracting having to overfly a crashed aircraft of the same type as you're flying in order to land...

            Vic.

      4. Anonymous Coward
        Anonymous Coward

        Re: Really a power failure?

        There are two DCs: one at Cranebank and one at Boadicea House.

    3. Anonymous Coward
      Anonymous Coward

      Re: Really a power failure?

      Weren't the 7/7 terrorist attacks in London initially attributed to a "power surge"?

    4. Nick Kew Silver badge

      Re: Really a power failure?

      After last night's big thunderstorms, it seems plausible that power to the datacentre was indeed hit. Or that a power surge damaged something, which might be reported as "power failure" if the details were considered far too confusing for the readers.

      Perhaps they had redundancy against one system being knocked out, but ended up instead with two systems each apparently still working but irreconcilably at odds with each other?
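
      That last scenario is the classic split-brain, and the standard guard against it is a majority quorum with an odd member count (often a third-site witness). A minimal sketch, with hypothetical node names:

```python
def may_serve(node: str, reachable: set, cluster: set) -> bool:
    """Split-brain guard: a node keeps serving only if it can see a
    strict majority of the cluster (counting itself). With exactly two
    sites and no witness, a partition leaves BOTH halves passing a
    naive 'am I alive?' check -- two systems apparently still working
    but irreconcilably at odds."""
    return len(reachable | {node}) > len(cluster) / 2

cluster = {"dc1", "dc2", "witness"}
# Network partition: dc1 still sees the witness, dc2 sees nobody.
print(may_serve("dc1", {"witness"}, cluster))  # True  - keeps serving
print(may_serve("dc2", set(), cluster))        # False - stands down
```

      Without that third vote, both halves keep accepting bookings, and reconciling the two divergent databases afterwards is exactly the multi-day mess being speculated about.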

      1. Jove Bronze badge

        Re: Really a power failure?

        Rubbish. This is a global business with multiple DCs and no doubt redundancy built-in at all levels of such critical business systems. Any narrative trying to suggest otherwise must be originating from the BA PR department.

        1. fajensen Silver badge
          Boffin

          Re: Really a power failure?

          This is a global business with multiple DCs and no doubt redundancy built-in at all levels of such critical business systems.

          The iron law of stupidity: Stupidity scales in proportion to size while "smarts" scale only at about the square root of size!

          Thus, the more "global", the more "multi-billion", more "multi-market" and whatnot, the more Retarded the business (or Empire) really is.

          1. CrazyOldCatMan Silver badge

            Re: Really a power failure?

            The iron law of stupidity: Stupidity scales in proportion to size while "smarts" scale only at about the square root of size!

            And the addon to that law: To work out the IQ of a crowd, take the lowest IQ of the individual members and divide by the number in the crowd.

        2. TkH11

          Re: Really a power failure?

          Clearly not as much redundancy as you think, otherwise it wouldn't have happened!

          I have spent the last 10 years working on IT solutions in data centres; you would be surprised at how many major companies and pieces of critical infrastructure do not have an adequate disaster recovery solution.

        3. Captain Badmouth
          Holmes

          Re: Really a power failure?

          Rubbish. This is a global business with multiple DCs and no doubt redundancy built-in at all levels of such critical business systems. Any narrative trying to suggest otherwise must be originating from the BA BS department.

          Fixed.

    5. Phil O'Sophical Silver badge

      Re: Really a power failure?

      Let me guess, muppet manager got tweet saying there was a huge DC (meaning Data Centre) problem, and never having heard of anything other than "laptop" and "cloud" decided that it must mean a power supply problem? He's probably wondering why it takes so long to replace a fuse, and whether they've tried B&Q yet.

    6. Doctor Syntax Silver badge

      Re: Really a power failure?

      "I know that one is not supposed to attribute to malice anything which could equally be attributed to mere cock-up"

      OTOH any senior manglement cost-cutting exercise is indistinguishable from malice.

  6. Anonymous Coward
    Anonymous Coward

    Curious, why did it only hit BA and not Iberia?

    1. Tom Paine Silver badge

      Presumably they still maintain their legacy core systems. They're not the sort of systems you can mash together with six months of late working and a few free pizzas to cover a long weekend of cutting over.

      1. Dan 55 Silver badge
        Holmes

        This might have been an attempt to mash some of them together.

        1. yoganmahew

          Iberia still have their Unisys mainframe - Resiber, though direct traffic for Iberia routes through Amadeus.

          Aer Lingus also still have their ALCS system - ASTRAL on IBM z/series.

          I don't believe the power issue thing either, at least not as hinted at. Sounds like a split DC, but without 100% capacity in each, and at least one of them without the electrical power to ramp up, even at 50%, to the busiest day of the year for the new layer of systems on top of the Amadeus service bus. BA may need a bus replacement service... sorry, I'll get me coat...

          A long time ago, in a Cranford far, far away, I worked at a BA company. Waterside was part of the rot. Brands don't need to concern themselves with operations; they have a dashboard and a formula that says x cuts = y profits with z risk. Well, z risk just showed up and shat all over the operation. Brand that MoFo.

    2. Anonymous Coward
      Anonymous Coward

      Re: curious

      (Ex-BA IT)

      BA, Iberia, Aer Lingus and Vueling all have their own DCs.

    3. Anonymous Coward
      Anonymous Coward

      Amadeus

      https://uk.reuters.com/article/uk-british-airways-amadeus-it-group-idUKKBN18M1XG

    4. This post has been deleted by its author

  7. Rusty 1
    WTF?

    Power failure: so the diverse sourced mains supplies failed, and the UPSs, and the generators? That must have been some failure!

    Surely they would just failover to their alternative data centre(s)?

    They haven't gone cheap, have they?

    1. HWwiz

      Smells funny.

      Just does not sound right.

      I work in a bank DC in the UK, and we have 5 mirrored DCs throughout the country.

      Ours has 3 different incoming power feeds from different counties.

      Plus out the back, we have 8 massive V12 Diesel Generators that would power a small town.

      Now I would imagine that BA would at least have something on the same scale, if not bigger.

      So why the outage?

      We do power tests every month. An incoming power feed is killed, to check redundancy.

      We also shut off 1 of the 2 power rails going into each cab, again to test redundancy.

      Perhaps BA are running their backbone on some beefed-up PC in someone's bedroom?

      1. Pascal Monett Silver badge
        Trollface

        @HWwiz

        It seems you are doing things properly. Your post has been noted and an officially-credentialed MBA will be dispatched on site to correct that situation forthwith.

        1. Destroy All Monsters Silver badge

          Re: @HWwiz

          Your post has been noted and an officially-credentialed MBA will be dispatched on site

          I'm happy to hear that. I'm ready to provide you with names and complete CVs for a few of ours.

        2. gregthecanuck
          Pint

          Re: @HWwiz

          Congratulations - you win LOL of the day. :-)

      2. TDog

        Re: Smells funny.

        Well, I worked for a tier II insurance company (began with A and ended with lyn). When they had a power outage, I was very impressed with their ability to hire and install a diesel container-based pod to get it all working again within 2 days.

        I'm sure that the post hoc cost-benefit analysis, which I was not invited to contribute to, was saying just how successful they had been. After all, senior management has no real idea of what's acceptable in middle-order companies, and could well be most impressed by how quickly the fuckup was contained, if not controlled.

        1. John Brown (no body) Silver badge

          Re: Smells funny.

          "I was very impressed with their ability to hire and install a diesel container-based pod to get it all working again within 2 days."

          To be honest, unless the DC is miles from anywhere, I'd expect a mobile genny set to be on-site the same day if that's the required solution. Many power-critical places have this sort of emergency call-out on contract stand-by. It doesn't actually cost all that much, because the suppliers can use the sets elsewhere on lower-grade contracts from which they can be pulled if needed for priority contracts.

          Our housing estate was partially taken out by a power cut a few months ago at about 10pm. The sub-station took 3 days to fix, but a genny set was installed and running by 4am (judging by the time on my clock's flashing display).

          Obviously bringing everything back up can take a lot longer, but that's why DCs have massive battery UPSs and on-site generators in the first place. It seems the "power issue" wasn't just a loss of power from external sources, since any proper DC should be able to manage for a day or two without it at least - indefinitely if they book in fuel deliveries.

      3. Dazed and Confused Silver badge

        Re: Smells funny.

        Plus out the back, we have 8 massive V12 Diesel Generators that would power a small town.

        I remember a story a customer told me once about their backup generators. They were on the roof of the building which housed their DC, and the diesel tanks were buried under the car park. Every couple of months the generators were tested: they'd be fired up for a few minutes to make sure everything was in working order.

        Then came the day of the actual power cut.

        Everything went fine to start with, the batteries kicked in, then the generators all started, then after a couple of minutes each of the generators coughed and died one by one.

        It turned out the pumps that get the fuel up to the roof were on the wrong side of the crossover switches.

        1. Anonymous Coward
          Anonymous Coward

          Re: Smells funny.

          Ouch, been there, got the T-Shirt.

          Although in our case it was the 25p microswitch attached to the ball-float in the genny header tank (the one that would turn on the pumps to replenish the header tank (~24 hours' fuel) from the external tanks (a week's fuel)) that failed, turning a "phew, everything's OK" into an "oh fuck."

          On the bright side, I learned how to prime a dry diesel engine that night.

        2. Anonymous Coward
          Anonymous Coward

          Re: Smells funny.

          It happened.

        3. Anonymous Coward
          Anonymous Coward

          Re: Smells funny.

          An oil company, and I can't spill the name (it wasn't BP though), had a power outage at one of their DCs. In the post mortem it was discovered that the UPS had worked fine, but the first generator had developed a fault or already had one. The second generator conked out just after starting up, and the third lasted about 15 minutes before ceasing to work. The reason the second and third generators failed was that no one had checked the fuel tanks.

        4. Charles Smith

          Re: Smells funny.

          Fuel in the basement, generators on the roof? Fuel leak during testing? I was working in a large white riverside building at the north end of London Bridge when precisely that type of event happened. The London Fire Brigade were not impressed when traders refused to leave their diesel soaked trading desks because there were outstanding trades to be closed.

      4. TkH11

        Re: Smells funny.

        You work in a bank, where they do it right. They have the money to do it right and they will have calculated the cost of down time and decided to do it right.

        I bet BA's data centres only have a single power feed in from the National Grid, plus a UPS containing lead-acid or gel batteries and a diesel generator, with the UPS providing temporary power for a few minutes until the generator kicks in. They probably weren't maintaining the batteries. Mains power to the site fails, the batteries kick in and immediately fail, causing all servers to lose power and crash. Two minutes later the generator starts up and brings power back to the servers. The databases are in a hell of a state because data files were being written to at the time, so you need to start bringing in DBAs to recover them.

        The remote support team in India is quite useless; they probably resolve issues by restarting apps and lack the expertise to debug all the problems and recover the system.

        I bet that's what has happened.
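
        That scenario boils down to one number: the gap between how long the (neglected) batteries actually hold up and how long the generator takes to come on line. A toy model, with illustrative figures:

```python
def power_hole_seconds(ups_holdup_s: float, genset_start_s: float) -> float:
    """Seconds during which the servers have no power at all: zero when
    the UPS bridges the generator start, positive (a hard crash of
    everything downstream) when the batteries die first."""
    return max(0.0, genset_start_s - ups_holdup_s)

# Healthy batteries: five minutes of holdup easily covers a 2-minute start.
print(power_hole_seconds(300, 120))  # 0.0
# Unmaintained batteries that fail almost immediately: a 118-second hole,
# and every database comes back dirty.
print(power_hole_seconds(2, 120))    # 118.0
```

        The restoration of power after the hole is the easy part; it's that non-zero gap that turns a "power issue" into a day of database recovery.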

    2. Doctor_Wibble
      Boffin

      We had one joyful time (late 90s) with a power cut where the magic power switchy thing didn't switch over to the generator, which IIRC started up fine... the building UPS lasted for a while, anything with its own UPS lived a few hours longer, and bit by bit it all died.

      We were a major site within the company but our temporary outage didn't stop the rest of the company from functioning relatively normally though they did soup up the systems after that anyway.

      On the other hand, if your entire operation depends on one specific thing working then you have as many of those magic power switchy things as you can fit in the box.

  8. Ken Moorhouse Silver badge

    BA's problems putting planes into the Cloud[s]

    Alternative [clickbait] headline?

    1. Danny 14 Silver badge

      Re: BA's problems putting planes into the Cloud[s]

      The cloud will save us all!

  9. Anonymous Coward
    Anonymous Coward

    Whoever saved a few £million a year with that outsource...

    .. just cost BA a £billion +

    1. Tom Paine Silver badge

      Re: Whoever saved a few £million a year with that outsource...

      Whoever saved a few £million a year with that outsource...

      .. just cost BA a £billion +

      Maybe; maybe not. We've no way of knowing whether the outsourcing deal had anything to do with this.

      GDPR is going to bring in mandatory breach reporting. I'd like to see mandatory RCA reports for system failures of anything that causes this level of disruption and inconvenience to that many people. Not sure how you'd mask out security-sensitive info (OS or server packages and versions, say) but dammit! we are geeks and we MUST KNOW THE GORY DETAILS! Feed us! Feed us!!

      Sorry about that, spent a bit too long gardening out in the sun this afternoon I think :>

  10. Anonymous Coward
    Anonymous Coward

    Back-up, folks?

    It never fails to amaze me how companies' systems seem to go down at the drop of a hat, and how far-reaching these happenings are, geographically speaking. How come ALL BA business seems to have ground to a halt over such a wide area? Where's the back-up for all this? Surely there should be some sort of contingency plan for major break-downs or, at the very least, some type of basic fall-back to keep things running, if only somewhat slower than normal? If I, with my domestic set-up of six computers, can arrange auto-back-up and redundancy over the whole of my network and the associated external hard drives, this sort of thing should be child's-play to a big company (that's if they still have the qualified staff to execute such an action, of course!).

    To say that it's a power problem sounds like a classic generalisation to try and keep people......er......."happy" - that's apart from all the doubtless ratty passengers whose holidays have been wrecked. There are (I would guess) a lot of extremely unhappy folk out there who will probably never grace BA with their custom ever again.

    1. This post has been deleted by its author

      1. monty75

        Re: Back-up, folks?

        No, but I don't suppose he's running an international airline either.

        1. koswix

          iPlayer?

          I dunno, it would certainly explain why Ryanair's website is so awful!

          1. a_yank_lurker Silver badge

            Re: iPlayer?

            Isn't Ryanair's still up?

        2. LDS Silver badge

          Re: Back-up, folks?

          I'm sure there's a lot of people who can barely manage six machines working in much larger installations...

      2. Grunt #1

        Re: Back-up, folks?

        You are wrong.

        Resilience is easier to apply in a large organisation with multiple DCs and sufficient resources. It takes planning and money and the same principles apply to all organisations large and small. All it takes is to employ specialists with knowledge and experience gained on the DC floor.

        In case anyone is feeling smug: when was the last time you tested your failover? Do you have a plan? Was that risk acceptance based on cold, hard facts, or just on saving money?

        1. Anonymous Coward
          Anonymous Coward

          Re: Six machines

          "If I, with my domestic set-up of six computers, can arrange auto-back-up and redundancy over the whole of my network and the associated external hard drives, this sort of thing should be child's-play to a big company"

          Let me guess - you're a senior manager in BA IT who's just discovered that looking after 500 racks of equipment IS harder than looking after backups and redundancy across six computers?

          The single biggest difference is that BA needed to provide 24x7 services across both internal and external faults. My guess is that your six machines wouldn't last more than a day without power, and that a fault leaving half of your estate unusable, combined with a failure in the "working" half, would quickly cause major issues...

          1. TkH11

            Re: Six machines

            It is not child's play. Disaster recovery is not the same as backup.

            You need replication mechanisms that can transfer data between the primary and DR sites. A key question is how much manual work is required from technical staff to activate the DR applications.

            Some systems can fail over automatically within seconds; others require manual intervention and restoration of databases, which can take hours.

            It all comes down to cost. If you want a lower recovery time objective, it's going to cost more.
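
            That trade-off can be made concrete with a toy tier table; the tiers, RTOs and relative costs below are illustrative, not anyone's real figures:

```python
# Illustrative DR tiers: the lower the recovery time objective (RTO),
# the more the solution costs. All figures are made up for the example.
DR_TIERS = {
    "tape restore at a cold site":       {"rto_hours": 48.0, "relative_cost": 1},
    "nightly replication, warm standby": {"rto_hours": 4.0,  "relative_cost": 4},
    "synchronous replication, hot site": {"rto_hours": 0.05, "relative_cost": 15},
}

def cheapest_tier_meeting(target_rto_hours: float) -> str:
    """Pick the least expensive tier whose RTO meets the target."""
    candidates = [(tier["relative_cost"], name)
                  for name, tier in DR_TIERS.items()
                  if tier["rto_hours"] <= target_rto_hours]
    if not candidates:
        raise ValueError("no tier meets that RTO")
    return min(candidates)[1]

# An 8-hour RTO doesn't need the hot site, but tape is far too slow.
print(cheapest_tier_meeting(8.0))  # nightly replication, warm standby
```

            The business picks the RTO; the RTO picks the price. Problems start when someone signs off a 48-hour tier and then expects hot-site behaviour on a bank holiday.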

    2. Doctor Syntax Silver badge

      Re: Back-up, folks?

      "There are (I would guess) a lot of extremely unhappy folk out there who will probably never grace BA with their custom ever again."

      Yes but we didn't all make that decision this weekend.

    3. John Brown (no body) Silver badge

      Re: Back-up, folks?

      "To say that it's a power problem sounds a classic generalisation to try and keep people......er......."happy"

      I suspect it's part of the blame game. They say "power failure" and the implication to most people is that the local power company is at fault and poor old BA have no control over it. By the time the truth comes out, most people will have forgotten all about it and, if it's reported at all, it'll be a little non-story, not a BIG HEADLINE.

      Those of us who understand resilience and DR can speculate all we like, but the general public, and the politicians with little to no clue about either, will happily go on blaming the 'leccy supplier and feeling sorry for BA.

      1. illiad

        Re: Back-up, folks?

        Yes - don't bother with the TV news (RT, Sky, BBC): the first 20 mins was all about terror, then 2 or 3 mins on BA!

      2. TkH11

        Re: Back-up, folks?

        That's exactly what it is: a reputation damage limitation exercise. BA never goes into detail about what's going on when it has these repeated widespread IT failures.

        Say it's a power failure and everybody thinks it's not BA's fault.

    4. Paul Crawford Silver badge

      Re: should be child's-play?

      Well our out-sourced staff can't do it.

      Say, can you find us a child with some computing aptitude?

    5. Anonymous Coward
      Anonymous Coward

      Re: Back-up, folks?

      It never fails to amaze me

      Things are only "amazing" the first few times. I worked as admin for data centres for mobile services of the five-nines kind (that's what the brochure says, anyway; reality is different, but that's why lawyers write the SLAs, so they're never enforceable). Some months in that job I would make more than double the quite decent pay package over stupid crap failing for trivial reasons, like useless logs eating all the disk.

      Because - apart from the shiny-shiny front end - ALL of it is shit, built on top of aggregated layers of shit. And for every new bunch of con-sluttant Java programmers hired in to develop yet more features, a few more frameworks get glommed onto the mess, because the con-sluttants need to pad their CVs and get out of this shit.

      "Data archaeologist" will be a thing in the future: someone, an expert, to carefully scrape all the sediment off and re-discover what the core of a critical system actually does, and revive it!

      It's more amazing that civilisation has not yet collapsed to third-world unreliability. We just haven't digitalised and clouded quite enough, I suppose.

      1. TkH11

        Re: Back-up, folks?

        Agreed. I have worked on a number of large-ish systems over the years. All claim to have DR and high availability. The reality is different. The customers are BS'ed and believe the hype.

        I recall there being a big mobile phone network failure a few years back; you'd have thought that DR systems would have been entirely automatic and instantiated within minutes. So why did it take most of the day to instantiate the DR?

        There's DR and there's DR. There are all manner of options to be chosen at specification and design time.

    6. Peter Gathercole Silver badge

      Re: Back-up, folks?

      We hear the failures. We very rarely hear where site resilience and DR worked as designed. It's just not news worthy.

      "Stop Press: Full site power outage hits Company X. Service not affected as DR worked flawlessly. Spokesperson says they were a little nervous, but had full confidence in their systems. Nobody fired".

      Not much of a headline, is it, although "DR architect praised, company thanks all staff involved and Accountants agree that the money to have DR environment was well spent" would be one I would like, but never expect to see.

      I know that some organisations get it right, because I've worked through a number of real events and full exercises that show that things can work, and none of the real events ever appeared in the press.

      1. Anonymous Coward
        Anonymous Coward

        Re: Back-up, folks?

        "We very rarely hear where site resilience and DR worked as designed. It's just not news worthy."

        Which is part of the reason that outlets like Availability Digest exist - to research and document what does work, so that people who want to learn from what works (rather than just read what's generating advertising revenue) have places to look:

        http://www.availabilitydigest.com/

      2. Anonymous Coward
        Anonymous Coward

        Re: Back-up, folks?

        I worked for a firm that had a major event. My boss made a point of thanking me and everyone else for their hard work and a plan that worked as expected. My bonus that year was rather good too.

        He progressed and was eventually replaced. IT was outsourced by his replacement. The outsourcer offshored everything and is now in trouble themselves. Whoops !

  11. Ian Emery Silver badge

    Heads will roll

    But somehow, I doubt the person/s actually responsible will be one of them.

    BA has been going downhill for over a decade; this just shows how shit they really are now.

    One question though, their tech support was outsourced to India, but was the actual work outsourced to Capita by any chance??

    It's strange BOTH are claiming power issues.

    1. Commswonk Silver badge

      Re: Heads will roll

      Nearly right. The usual starting point is Deputy Heads will roll...

    2. Doctor Syntax Silver badge

      Re: Heads will roll

      "One question though, their tech support was outsourced to India, but was the actual work outsourced to Capita by any chance??

      Its strange BOTH are claiming power issues."

      Also the Pension Regulator's site has been offline for some time: http://www.bbc.co.uk/news/business-40057025 Was that a Crapita business?

    3. Grunt #1

      Re: Heads will roll

      If the executive class don't buy DR then guess what they get.

      1. Anonymous Coward
        Anonymous Coward

        Re: Heads will roll

        "If the executive class don't buy DR then guess what they get."

        Big bonuses and later on - a golden handshake.

  12. Anonymous Coward
    Anonymous Coward

    Is it because BA waived Tata

    To many of its IT staff?

    1. TkH11

      Re: Is it because BA waived Tata

      It comes down to who has responsibility for the power services of the data centre. Generally that is not IT staff. Sometimes the data centre is operated by a third party, and they should be testing the power distribution systems and UPSes periodically.

      The power failure may not even be BA's fault, but

      i) their failure to instantiate DR sufficiently quickly

      and

      ii) their failure to recover the primary site's applications quickly enough

      will be down to BA's IT, which looks as if it's now based in India.

      1. illiad

        Re: Is it because BA waived Tata

        I think the *whole* of BA's system failing (at the same time??) cannot be just a power supply fault - more likely bad reporting of what a 'DC' is... it must mean a main DATA Centre... :)

      2. Alan Brown Silver badge

        Re: Is it because BA waived Tata

        You say "the" data centre, like you expect it's normal to only have one of them.

  13. Brett Weaver

    Heathrow and Gatwick?

    The article says that flights from Heathrow and Gatwick are affected. BA flies from a lot of other locations so presumably the systems are not down but the local delivery of GUI...

    The CEO should be fired just after he fires the CIO for allowing this to happen. No excuses. Airlines are computer-system-reliant companies. Unless a superhero or other unworldly event is involved, management should be frog-marched out today.

    1. Tom Paine Silver badge
      FAIL

      Re: Heathrow and Gatwick?

      The article says that flights from Heathrow and Gatwick are affected. BA flies from a lot of other locations so presumably the systems are not down but the local delivery of GUI...

      The BBC and other outlets' reports say it was (is?) global, with all aircraft movements stopped everywhere in the world. Except the ones in the air, presumably.

      1. Marketing Hack Silver badge

        Re: Heathrow and Gatwick?

        Yes, BA flights out of San Francisco and San Jose are impacted too, so that's 3-4 nonstops between the SF Bay Area and Heathrow.

      2. Anonymous Coward
        Anonymous Coward

        Re: Heathrow and Gatwick?

        Luckily flying aircraft have fail safe systems on board.

    2. Commswonk Silver badge

      Re: Heathrow and Gatwick?

      CEO should be fired just after he fires the CIO for allowing this to happen.

      Unless, of course, the CIO has kept a copy of the email / memo / minutes in which the CFO refused the money to replace the batteries in a big UPS, or to replace some other mission-critical bit of hardware that was approaching end of life, or the like.

      1. Doctor Syntax Silver badge

        Re: Heathrow and Gatwick?

        "Unless, of course, the CIO has kept a copy of the email / memo / minutes in which the CFO refused the money to replace the batteries in a big UPS"

        Just sack the lot of them and start over again.

    3. anthonyhegedus Silver badge

      Re: Heathrow and Gatwick?

      If it's just Heathrow and Gatwick then why couldn't staff phone another site like Paris or Glasgow or anywhere really? Probably because it was down everywhere but even more systems were down at Heathrow and Gatwick.

      1. Danny 14 Silver badge

        Re: Heathrow and Gatwick?

        The ones in the air stopped after landing, as they couldn't disembark. Some were sat on the tarmac for 3 hours.

      2. Anonymous Coward
        Anonymous Coward

        Re: Heathrow and Gatwick?

        Brexit ?

    4. Dodgy Geezer Silver badge

      Re: Heathrow and Gatwick?

      Seeing as how the Chairman and the CEO are the same person, I suspect that the CUP is no different...

  14. Tony S

    Cynical Me

    Something about this does not add up.

    As others have said, the DC would have multiple power supplies, UPS and backup generators to provide continuity in the event of power supply failure. If they don't, then it has to be a major cock-up in the design, or evidence that BA management are being cheap.

    Equally, if the design of the systems is appropriate, then there would be a fail over to alternative systems. It simply does not make sense that a company of that size and turnover would not be able to do this. If they cannot, then again, there has been a failure of management to adequately plan for the appropriate scenarios.

    The BA spokesperson was adamant that outsourcing was not behind the problem, but I suspect it is highly unlikely that this outage is unrelated to that decision in some way. I really hope that the BA shareholders demand a full explanation, and that they make their displeasure known to those people that have been failing to plan appropriately.

    1. AbsolutelyBarking

      Re: Cynical Me

      Agree totally. Can't believe that there wouldn't be several layers of power redundancy (tested) on critical systems. The 'power supply' line has a strong whiff of horse manure in my view.

      Incident report should make interesting reading...

    2. N000dles

      Re: Cynical Me

      I've experienced a toilet taking down a whole DC fitted with all the bells and whistles for failing over to generators and UPS systems. It was leaking from a downpipe within the lagging, which caused water to build up in the basement, in the cabinet where the system would switch from the mains feed to the backup UPS and generators. We tested the generators and UPS regularly, but management were always scared to pull the plug on the mains for testing. One day, when the mains did go out, the switch-over to the UPS and generator did not happen cleanly, and the few seconds' delay from on to off to back on again, with 240 cabinets turned on, blew all the breakers in the PDUs. Try buying 120 x 63 A breakers at 2am, even in London, to bring the DC back from the dead.

      The only way to ensure systems like this stay up is geographically dispersed sites and ruthless operations management that ensures everything is duplicated religiously. You should be able to have a physical chaos monkey choose any cable they like and be confident that failover occurs without disruption like that seen today. Otherwise, you just don't have a backup until it's been tested...
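      The "physical chaos monkey" test above can be sketched in software. A toy simulation (the component names and the redundant-pair model are invented for illustration, not BA's actual architecture): kill one random component and check the service survives, on the principle that no single failure should ever take it down.

      ```python
      import random

      # Toy model: the service stays up as long as every redundant
      # pair still has at least one live member.
      class Service:
          def __init__(self, redundant_pairs):
              # e.g. {"power": ["mains", "ups"], "site": ["dc1", "dc2"]}
              self.components = {name: set(members)
                                 for name, members in redundant_pairs.items()}

          def fail(self, component):
              # The chaos monkey pulls this component's cable.
              for members in self.components.values():
                  members.discard(component)

          def is_up(self):
              # Up only if every redundancy group has a survivor.
              return all(members for members in self.components.values())

      def chaos_round(service):
          """Kill one random live component, report whether we survived."""
          live = [m for members in service.components.values() for m in members]
          victim = random.choice(live)
          service.fail(victim)
          return victim, service.is_up()

      svc = Service({"power": ["mains", "ups"], "site": ["dc1", "dc2"]})
      victim, up = chaos_round(svc)
      print(f"killed {victim}, service still up: {up}")
      ```

      With proper duplication, any single kill leaves the service up; a second kill in the same group is what an untested "backup" never sees coming.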

      1. SkippyBing Silver badge

        Re: Cynical Me

        'You should be able to have a physical chaos monkey'

        Where do I send my CV?

        1. Anonymous Coward
          Anonymous Coward

          Re: Cynical Me

          'You should be able to have a physical chaos monkey'....

          Where do I send my CV?

          Appears they already have a well-qualified candidate filling that role, along with a complete menagerie of senior managerial chaos monkeys running round in a panic, parroting that it's all the fault of the power supply.

          Like many large businesses, big airlines are now just IT companies with a few (usually leased) planes, maintained and sometimes crewed by third party companies. For the senior management to be so evidently clueless in the design, testing, and operation of the business critical systems that ARE their business is utterly unacceptable.

        2. Anonymous Coward
          Anonymous Coward

          Re: Cynical Me

          I already work at BA.

          My job is done, I'll get my coat.

      2. Trixr Bronze badge

        Re: Cynical Me

        Yup, I know of an instance where a certain country's largest airport's ATC systems were literally two minutes away from a complete power failure. Mains power didn't come back after some issue with switching between that and the genny (I did hear the gory details, but my understanding of what's what was limited). The contract electrician (no more on-site sparkies after "efficiency" cuts) had to be called out from the other side of town (a town with awful road congestion at the best of times).

        The only reason the whole lot didn't go down was due to the site manager and staff literally running around the ops building and tower powering down every single piece of electrical equipment that did not concern the tower cab's ATC display systems and nav aids. Was there any review in terms of obtaining another genny and/or onsite sparkie during operational hours? No.

        1. Alan Brown Silver badge

          Re: Cynical Me

          "Was there any review in terms of obtaining another genny and/or onsite sparkie during operational hours? No."

          In other words, next time it happens the site manager won't bother.

          It's amazing how much money there is to fix things AFTER the stable door has been smashed to smithereens.

      3. Dal90

        Re: Cynical Me

        Can not +1 this enough.

        Unless you have a non-stop infrastructure designed to, and that you are willing to, chaos monkey any component at any time... you do not have a non-stop infrastructure.

        You have a very expensive wing and a prayer.

        Reliability metrics probably shouldn't be expressed in terms like 99.999% uptime, but instead something like 99.9999% of transactions complete successfully without delay due to failover, and 99.99999% transactions complete without returning an explicit failure to the user because data integrity could not be guaranteed due to the failover.
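        The transaction-based metric above is easy to make concrete. A quick illustrative calculation (the daily transaction volume is an assumed round number, not BA's real figure):

        ```python
        def allowed_failures(transactions: int, nines: int) -> int:
            """Failure budget for an 'N nines' transaction success-rate
            target: volume * 10^-nines transactions may fail."""
            return round(transactions * 10 ** -nines)

        daily_txns = 10_000_000  # assumed volume, for illustration only
        for nines in (3, 6, 7):
            budget = allowed_failures(daily_txns, nines)
            print(f"{nines} nines -> at most {budget} failed txns/day")
        ```

        At 99.9999% (six nines) that budget is just 10 failed transactions in 10 million per day, which is a far stricter and more user-visible target than "the server was pingable 99.999% of the time".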

      4. TechnicalBen Silver badge

        Re: Back in the day...

        I worked in a call centre; the office space contingency was to rent a spare location if our offices ever burnt down (or similar).

        While obviously "renting" spare capacity for bespoke systems is not really viable unless everything is "cloud" (read: universally deployable apps, which of course takes a lot of the specialist options away from your software/hardware), can there not be a joint effort by some companies for some spare capacity?

        While keeping a second complete duplicate somewhere is very costly, could there not be some similar/VM/smaller systems for at least basic use, with costs shared across companies, as it's very unlikely they'd all go down the same day?

        1. Anonymous Coward
          Anonymous Coward

          Re: Back in the day...

          "While obviously "renting" spare capacity of bespoke systems is not really viable unless everything is "cloud""

          Back in the day when people knew how to do these things right, when cloud was still called "bureau timesharing", there were companies that had their DR kit in a freight container (or maybe several, across several sites) that could be readily deployed to anywhere with the required connectivity. It might be a container+kit they owned or it might be a container+kit rented from a DR specialist or whatever. And they'd know how it worked and how to deploy when it was needed (not "if").

          That's probably a couple of decades ago, so it's all largely been forgotten by now, and "containers" means something entirely different.

          E.g. here's one that Revlon prepared earlier (article from 2009), but my recollection is that this had already been going on elsewhere for years by that time:

          http://www.computerworld.com/article/2524681/business-intelligence/revlon-creates-a-global-it-network-using--mini-me--datacenters.html

        2. Anonymous Coward
          Anonymous Coward

          Re: Back in the day...

          "shared costs across companies as it's very unlikely all of them go down the same day?"

          If this were something on offer and BA had signed up for it, how would all the other corporate clients feel knowing that BA are now using the DR centre, so if anything goes pear-shaped, they're on their own? It's like paying for an insurance policy that won't pay out if somebody else has made a similar claim before you on a given day. Would you buy cheap car insurance with a clause that reads:

          "RBS Bastardo Ledswinger Insurance plc pay only one claim each working day across all policies written and in force; If your claim is our second or subsequent of that day it will not be paid, although we will deem the claim fully honoured. No refunds, the noo!"?

          1. Grunt #1

            Re: Back in the day...

            When you return to work on Tuesday, ask about your DRP and sharing resources; you will be amazed how many firms are in this position. It is a risk that does reduce costs and is unlikely to materialise. In most cases the major DR suppliers allow you to be locked out for 90 days. It is a first-come, first-served business model.

            The most likely scenario for multiple organisations to be hit simultaneously is through cyber attack; you have been warned.

        3. Alan Brown Silver badge

          Re: Back in the day...

          "the office space contingency was to rent a spare location if our offices ever burnt down"

          Oh, so you need that much space, that urgently? Sure, I can do that for a 2500% premium.

          1. Anonymous Coward
            Anonymous Coward

            Re: Back in the day...

            Sungard will happily arrange for this in advance for a small fee.

            ( I don't work for them, but have used them)

  15. Anonymous Coward
    Anonymous Coward

    Penny wise pound foolish

    BA: Penny wise pound foolish.

    Officially BA tells the world that it was due to a power failure but that "IT is looking into it", which makes no sense. Power failure ==> a sparky is the go-to, surely.

    What may have happened

    1. BA offshores IT

    2. UK has bank holiday weekend (3 days)

    3. Upgrade / new release cockup ==> systems down

    4. No plan B.

  16. DVDfever

    Top up your meter?

    I'm reminded of that ad for a company which allows you to top up from your phone, so even if the office BA use is closed for the weekend, they can still get it sorted.

    1. Tom Paine Silver badge

      Re: Penny wise pound foolish

      Any number of things may have happened.

      1. Anonymous Coward
        Anonymous Coward

        Re: Penny wise pound foolish

        "Any number of things may have happened."

        There are a finite number of things that you can think of that may go wrong ...and an infinite number that can go wrong.

        1. TkH11

          Re: Penny wise pound foolish

          There isn't an infinite number of things that can go wrong, it's finite, but just very large in number.

          1. TechnicalBen Silver badge
            Holmes

            Re: TkH11 "Some infinities are bigger than others"

            No, it is infinite, though it depends on your mission objectives and your definition of "success". When we frame success, it has a defined, boxed-in end result (so even where it is infinite, it is bounded in what types of infinity apply). Failure is open-ended (we can add both any finite number and any other type of infinity).

            There is one way it can go right. But the ways it can go wrong are infinite, as you can always add one more spanner into the works... forever. ;)

    2. TkH11

      Re: Penny wise pound foolish

      What you are suggesting is that BA is lying to the public about the cause of the failure. Too much chance of being found out; I don't honestly think they would so blatantly lie to hide what happened.

      I believe what is likely is that there actually was a power failure, which may or may not have been BA's fault (if it's BA's responsibility to maintain the UPS, and they failed to do it, then it is their fault).

      With a genuine power failure to the site, and some kind of failure in the UPS and generator system, I bet BA's remote off-shore India IT support team struggled to bring the systems and applications back.

      So they're only telling 1/10th of the real story, enough to make everyone think they are not to blame, when in fact they probably are.

      1. Anonymous Coward
        Anonymous Coward

        Re: Penny wise pound foolish

        No matter what happens, the responsibility lies with the BA executives. It may not be their fault it went wrong, it is their fault for allowing it to destroy the business.

      2. Anonymous Coward
        Anonymous Coward

        Re: Penny wise pound foolish

        BA is to blame regardless of the cause. The CEO should be given the boot; clearly not fit for the job.

        Outsourcing IT when IT is the lifeblood of the core business? Oh well, serves them right.

        1. Anonymous Coward
          Anonymous Coward

          Re: Penny wise pound foolish

          Would they buy aircraft just because they are cheaper without some serious due diligence?

      3. Alan Brown Silver badge

        Re: Penny wise pound foolish

        "So they're only telling 1/10th of the real story, enough to make everyone think they are not to blame, when in fact they probably are."

        Which is probably why the utility companies are now all stepping forward to categorically deny any kind of power surge anywhere in the country.

  17. BlokeOnMotorway

    Still, ElReg finally got the story about 7 hours after the Beeb and everyone else, and have managed to add exactly what insight?

    1. Anonymous Coward
      Anonymous Coward

      I have to agree

      It was disheartening to not see anything on this for hours on El Reg.

      Perhaps they were all on a jolly to Spain for the weekend and got stuck at the airport?

      1. Will Godfrey Silver badge
        Facepalm

        Re: I have to agree

        I don't.

        The Beeb is a well-funded, 24hr, 365-day service. El Reg is not.

        1. Anonymous Coward
          Anonymous Coward

          Re: I have to agree

          "The Beeb is a well funded 24hr 365 day service. "

          Wasn't there a proposal a while back that the BBC should cut back on 24/7 news resources - as the media moguls didn't like the competition?

          1. theblackhand

            Re: I have to agree

            My take on ElReg's story is that it's a placeholder for comments and a link to contact someone directly for people that do know more.

            Looking around the various news sites and places that might know, I don't see much more than what is known about the effects of the outage and the official statements.

            The interesting stuff will come in during the week when people are back in the office and go for after work drinks with ex-colleagues :)

            1. Martin an gof Silver badge

              Re: I have to agree

              My take on ElReg's story is that it's a placeholder for comments and a link to contact someone directly for people that do know more.

              El Reg is always quiet on the editorial front at the weekend, ever since they stopped their "weekend edition" experiment. The surprise to me was that they managed to get a story out at all and I think you are correct, put something - anything - out there, and hope that the commentards will fill the gap. However few of BA's IT staff are left in the UK, I bet a couple of them read El Reg...

              For analysis I'll come back on Tuesday.

              For the record, I have to agree with everyone saying "it stinks", because a huge business such as BA (or Capita - yup, it's a bit of a coincidence) shouldn't fall completely over for lack of a few amps at some data centre or other.

              What's next? London City Airport's recently fanfared "remote tower"?

              M.

      2. Doctor Syntax Silver badge

        Re: I have to agree

        "Perhaps they where all on a jolly to Spain for the weekend and got stuck at the airport?"

        And the BA employee with the only key to the genny shed in his pocket was on the same flight.

    2. a_a

      You must be new here. El Reg is a light and frothy take on IT news; it's not 24/7 breaking news.

    3. Anonymous Coward
      Anonymous Coward

      Don't worry your time will come.

      Because we are one of the following.

      - The smug ones who have been lucky to avoid a total failure.

      - The smug ones who have planned and tested for this.

      - The smug ones that don't work for BA/Capita/TCS etc..

      - The smug ones who are not the ones in BA DCs right now.

      1. Anonymous Coward
        Anonymous Coward

        Re: Don't worry your time will come.

        In 1999 I was part of a team at a broadcaster, dealing with a planned power shutdown to test the UPS and generators, just in case the power went out on New Year's Eve. My role was simple in that I had to shut down anything that had been left on where there wasn't a user at their desk to do it themselves. The non-core systems (i.e. anything but the studios/IT/Engineering kit) would not have power, and it was essential that everything else was off with work saved. My floor was cleared bang on time, and after receiving clearance from all the other floors, the fire alarm tannoy was used to announce to anyone left in the building not involved that the power was about to be cut.

        Power was cut and then restored 2 minutes later, which was odd given I thought we were going to give the generators a bit of a run. Then I found the Head of Technology and the Facilities Management Manager looking concerned; they said that the generators had not kicked in. It turned out to be something that was easily fixed, and that in part involved sacking the company contracted to provide maintenance. Questions about their 'regular' testing schedule were raised soon afterwards. We had apparently made assurances that should the power go out at 00:00 on 01/01/2000, we'd still be broadcasting with no interruptions. That was why we were running the test in September, so we still had time to fix anything that failed.

  18. Anne Hunny Mouse

    Doesn't add up

    One of the things reported on PM on Radio 4 was that, early on, people scanning boarding passes were getting incorrect destinations on the screen. They reported that someone flying to Sweden got 3 different incorrect destinations when the card was scanned.

    It was also reported that, at least initially, BA's phones weren't working at Heathrow, but I would have thought they would have had some local survivability in place if the phones couldn't register to the systems at the data centre, with a backup local breakout to the PSTN.

    My best (albeit poor) guess that it is more likely to be network related. Faulty router(s)?

    1. Nick Kew Silver badge

      Re: Doesn't add up

      Does potentially add up if the root cause was last night's thunderstorms corrupting something. In a manner that wasn't anticipated by whatever monitoring they have in place.

    2. theblackhand

      Re: Doesn't add up

      Faulty network equipment rarely results in faulty destinations when scanning boarding passes - they result in either slow or no connectivity.

      The boarding pass issue sounds more like storage, with either a fault (i.e. the power issues or a resulting hardware failure) causing a failover to another site with either stale data or the failover process not working smoothly (i.e. automated scripts firing in the wrong order or manual steps not being run correctly).

      i.e. I suspect this is more of an RBS type issue rather than a Kings College type failure.

      1. Anonymous Coward
        Anonymous Coward

        Re: Doesn't add up

        Makes it sound like an application messaging-tier failure. I've seen this before, where the WebLogic tier was misconfigured: responses to transactions were routed back to the wrong IP address. An interesting side effect was that a customer would call in and register their details with the IVR system, and their case details would pop up on a call centre agent's PC in one call centre while the phone call would route to a different one. Unfortunately, the poor agent who received the phone call then could not access the case. If this has just started happening, I suspect a software upgrade is to blame.

        1. TechnicalBen Silver badge
          Thumb Up

          Re: Doesn't add up

          Thanks, Anon - IP routing of responses/caching problems does sound familiar... Steam and some other consumer stores/websites had such a problem when a reconfig sent the wrong cached pages all around, so customer data got spewed out to the wrong people.

          So a power outage, recovered from, did not "recover" as expected?

          1. TheVogon Silver badge

            Re: Doesn't add up

            "So a power outage, recovered from, did not "recover" as expected?"

            An educated guess tells me that they have lost a hall or a datacentre, and probably only then found out that vital systems are not fully replicated / key stuff doesn't work without it. Most probably the systems that were DR tested were tested in isolation, without a proper full DR shutdown, and someone overlooked critical dependencies.

            Once you are in such a situation and find you would need to redesign your infrastructure to fix gaping design holes, it's usually faster and safer to fix and turn the broken stuff back on.

            1. Anonymous Coward
              Anonymous Coward

              Re: Doesn't add up

              If so, they are not the only ones.

              AC for obvious reasons.

  19. SJG

    Operational Failover is incredibly complex

    Let's assume that BA have lost a data centre. The process of switching hundreds, maybe thousands, of systems to a secondary site is extraordinarily complex. Assuming that everything has been replicated accurately (probably not), you've also got a variety of RPOs (recovery point objectives) depending on the type and criticality of each system. BA have mainframes, various types of RDBMS and storage systems that may be extremely difficult to get back to a consistent transaction point.

    I know of no companies who routinely switch off an entire data centre to see whether their systems run after failover. Thus BA and most big companies who find themselves in this position will likely be running never fully tested recovery procedures and recovery code.

    The weak point of any true DR capability is the difficulty of synchronising multiple, independent transactional systems which may have failed at subtly different times.
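    That synchronisation problem can be illustrated with a toy sketch (the system names and timestamps below are entirely made up): with asynchronous replication, the only mutually consistent restore point is bounded by the most-lagged system, and everything newer on the other systems has to be discarded or manually reconciled.

    ```python
    # For each system, the newest transaction timestamp known to be
    # safely replicated at the DR site (epoch seconds, invented values).
    last_replicated = {
        "bookings_rdbms": 1_495_900_000,
        "checkin_mainframe": 1_495_899_940,  # most lagged
        "baggage_mq": 1_495_899_985,
    }

    # A consistent cross-system recovery point can be no newer than
    # the most lagged system's replication position.
    safe_point = min(last_replicated.values())

    # Work that must be discarded or reconciled per system, in seconds.
    to_reconcile = {name: ts - safe_point for name, ts in last_replicated.items()}

    print(f"consistent recovery point: {safe_point}")
    print(f"seconds of work to reconcile: {to_reconcile}")
    ```

    Even in this three-system toy, a minute of perfectly healthy bookings data has to be unwound because one feed lagged; with hundreds of interdependent systems that reconciliation is where DR timelines blow out.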

    1. Tony W

      Re: Operational Failover is incredibly complex

      You've pointed out what might be the problem. I once worked for a (public) organisation in which no-one dared take responsibility for pulling the big switch in case the backup system didn't take over properly. With the almost inevitable result that when a real failure occurred, the backup system didn't take over properly. At least in a planned test they could have had the relevant people warned and a team standing by to get it working with the minimum disruption.

      1. thondwe

        Re: Operational Failover is incredibly complex

        Agreed. Plus, N+1 systems are based on educated (sic!) guesses of what loads will be - which they won't be, as systems just grow to fill the capacity provided. So, when N+1 becomes N (e.g. you lose a high-density compute rack), everything fails over - typical VMs then restart with more resources than they had when they were running normally, when the hypervisor had nicked all the empty RAM/unused CPU cycles. So now nothing fits, and everything starts thrashing and crashing, domino style...

        At some point you'll be looking to turn the whole thing off and on again - assuming you've documented that process properly, and nothing been corrupted, ...
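        The N+1 trap above amounts to a simple capacity check that often never gets run: after losing one host, do the failed-over VMs still fit at their full reservations, with no overcommit to hide behind? A toy sketch with invented numbers:

        ```python
        def survives_host_loss(host_capacity_gb: int, hosts: int,
                               vm_reservations_gb: list[int]) -> bool:
            """True if all VM RAM reservations fit on (hosts - 1) machines,
            i.e. the cluster really is N+1 for a single host failure."""
            needed = sum(vm_reservations_gb)
            remaining = host_capacity_gb * (hosts - 1)
            return needed <= remaining

        # Twenty VMs reserving 48 GB each = 960 GB total.
        vms = [48] * 20

        # 4 hosts x 256 GB: lose one and only 768 GB remains -> thrash city.
        print(survives_host_loss(256, 4, vms))  # False

        # 5 hosts x 256 GB: lose one and 1024 GB remains -> workload fits.
        print(survives_host_loss(256, 5, vms))  # True
        ```

        This ignores CPU and assumes RAM reservations are honest; in practice the "educated guesses" drift as VMs grow, which is exactly why the check fails on the day it matters.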

        1. Doctor Syntax Silver badge

          Re: Operational Failover is incredibly complex

          "assuming you've documented that process properly"

          And you didn't go paperless so the whole documentation is on one of the servers that's not working.

      2. Anonymous Coward
        Anonymous Coward

        Re: Operational Failover is incredibly complex

        At least in a planned test you can decide when the 'failure' occurs i.e. not during peak processing time!

        1. nematoad Silver badge
          Happy

          Re: Operational Failover is incredibly complex

          "...not during peak processing time!"

          Ah, yes! The happy memories of faffing about switching things on and off at 2 o'clock in the morning to make sure that the DR was properly set up.

          Unsociable but richly rewarding!

    2. mmussett

      Re: Operational Failover is incredibly complex

      I know plenty of companies that routinely switch DCs every night to avoid this sort of monumental cock-up.

      1. Tom Paine Silver badge

        Re: Operational Failover is incredibly complex

        Those companies probably don't have a mishmash of legacy systems, some decades old, and complicated links to other service providers and their networks. That said, I intuit - possibly wrongly - that a mishmash of legacy systems would be less likely to fail completely, because different chunks of it would have been originally designed as standalone, or at least much less interdependent. (Anyone care to wield the cluestick with actual data or proper research on whether that's the case?)

        It's interesting too that quite a few of these sorts of mega-outages hit industries that were some of the first to computerise in the 60s and 70s -- air travel and retail banking. What other sectors would fit that category and are also high volume / mass market infrastructural systems, I wonder?

        * (looks uneasily at all those ageing nuclear stations built on coastlines before they'd discovered the Storegga Slide... )

        1. Anonymous Coward
          Anonymous Coward

          Re: Operational Failover is incredibly complex

          It's worse than a mishmash. It'll be set up like a bank. Sure you've got your all-singing, all-dancing modern apps-and-web front end for the masses, but under the hood there will be the same core monolithic mainframe actually doing the real work, substantially unchanged for 30 years. This is translated into modern representations with layers upon layers upon layers of interfaces. Take out any one of those layers and it's anyone's guess what the outcome will be. That's why DR for these systems is particularly tricky.

    3. Blotto Bronze badge

      Re: Operational Failover is incredibly complex

      I've worked in numerous places (public and private sector) where the DCs have had to be powered down for five-yearly electrical testing. It's a complete power down: all systems off, AC off, UPS off, gennies isolated etc. It's a pain to manage, and eerie walking through silent data halls slowly warming to ambient temperature, with the constant worry of what won't come back up.

      Full DC power downs are not a rare event.

      1. TechnicalBen Silver badge

        Re: Operational Failover is incredibly complex

        I forget the company (and my Google-fu is off today), but it may have been Google or Valve's Steam: a fire took out a DC, but the system just carried on normally.

        So it can be done well and right.

    4. jamesb2147

      Re: Operational Failover is incredibly complex

      Having assisted in a terrifically minor way in helping develop and test such a system for a client, I can vouch for this. It took a team of 15 more than six months of work to get that system up and tested for failover, and they were relatively small (think AS400 plus 200-ish VMs) and already had the DR environment built out when we became involved.

      Also, we have no evidence with which to judge BA beyond their own words that this is related to a power outage.

    5. Rob D.

      Re: Operational Failover is incredibly complex

      Plenty of companies do test DR or equivalent in production every six months or so - at least those who take it seriously. Any company not doing this is accepting the risk of not being able to run after a serious problem. A CIO without that live testing on the list of required operating costs is immediately culpable.

    6. Anonymous Coward
      Anonymous Coward

      Re: Operational Failover is incredibly complex

      I know of no companies who routinely switch off an entire data centre to see whether their systems run after failover.

      I do.

    7. TheVogon Silver badge

      Re: Operational Failover is incredibly complex

      " The process of switching hundreds, maybe thousands of sustenance to a secondary site is extraordinarily complex. "

      It shouldn't be. You would normally have it prioritised, documented and/or scripted, and tested. Or ideally, for a company the size of BA, running active-active - or active-passive and clustered with automatic failover.

  20. a_yank_lurker Silver badge
    Facepalm

    Outsourced with Delta?

    Is BA using the same inept outsourcing team Delta uses? Mission critical tasks need to be kept internal end of story. BA is an airline thus passenger booking, boarding, and flight activities should be kept internal. They are getting what they deserve.

    1. jamesb2147

      Re: Outsourced with Delta?

      Ironically, amongst the most apt comments here.

      All those proclaiming from their high horses about the importance of backups and redundancy and failover and IT outsourcing... you've all jumped the gun. Delta blamed a power outage, and do you know who here believed them? Basically no one. James Hamilton from AWS believed them, though. He helps design resilient systems and has twice encountered failover power systems (basically, the big switches) that the manufacturer refuses to properly configure (they disagree on what a proper configuration is). AWS had to source new hardware and ended up writing their own firmware for the controller, as the manufacturer refused to reconfigure it, IIRC.

      You can read about that here: http://perspectives.mvdirona.com/2017/04/at-scale-rare-events-arent-rare/

      Now, BA has some real IT issues, but the outrage vented here really has nothing to do with BA, when we don't even know the source of the problem beyond that there is a power issue.

      EDITED: Added the bit about writing their own custom firmware for an electric supplier's hardware.

      1. nematoad Silver badge

        Re: Outsourced with Delta?

        What's the betting on BA stonewalling request for compensation on the grounds of "exceptional circumstances"?

        1. TheVogon Silver badge

          Re: Outsourced with Delta?

          "What's the betting on BA stonewalling request for compensation on the grounds of "exceptional circumstances"?"

          Zero. The Supreme Court has already decided that technical failures are not outside of airlines' control...

  21. This post has been deleted by its author

  22. Anonymous Coward
    Anonymous Coward

    Remember when

    BA outsourced their catering to some random outfit, which then went on strike?

  23. Blotto Bronze badge

    What else has BA poorly maintained?

    Their IT should be redundant and resilient, a bit like the critical systems on an aircraft.

    If BA have got their IT wrong this badly, what's to say they've got their aircraft maintenance correct?

    1. kmac499

      Re: What else has BA poorly maintained?

      Considering the likely compensation cost that will balloon from this, maybe anonymous incident reporting should be set up for IT jockeys just like the plane jockeys.

      1. Anonymous Coward
        Anonymous Coward

        Re: What else has BA poorly maintained?

        http://www.pprune.org/rumours-news-13/

        1. Anonymous Coward
          Anonymous Coward

          Re: What else has BA poorly maintained?

          Thanks for that link.

          Here's a quote from a commenter there, "The beancounters who propose and/or approve such 'cost savings' should be told that their continued employment and retirement benefits rest on the continued faultless operations of the IT systems they wish to save money on. That might concentrate their minds a little."

    2. Martin an gof Silver badge

      Re: What else has BA poorly maintained?

      If BA have got their IT wrong this badly, what's to say they've got their aircraft maintenance correct?

      Engineering Giants. It's a repeat from a few years ago, but very interesting nonetheless.

      I think the main difference is that there are manufacturer-mandated and internationally agreed standards to the maintenance of the actual aircraft and its systems. Problems, at least among the major airlines, seem these days to be confined to genuine mistakes, rather than pure incompetence or deliberate flouting of the rules. Be proved to have got it wrong and you risk everything from enforced grounding and checking of aircraft types, to loss of licence to operate.

      There are no such systems in place for IT. You are very much on your own, and there's no comeback other than a few disgruntled customers or suppliers - I bet the airports will get compensation from BA for blocked parking stands etc.

      M.

    3. Anonymous Coward
      Anonymous Coward

      Think of the savings!

      - Are you sure you can fly trans-ocean with only one engine?

      - What do you mean we don't need a co-pilot?

      - Are you sure you need that much fuel?

    4. Anonymous Coward
      Anonymous Coward

      Re: What else has BA poorly maintained?

      They lost that when the system crashed.

    5. Nifty

      Re: What else has BA poorly maintained?

      Re: "Their IT should be redundant and resilient, a bit like the critical systems on an aircraft"

      Their staff are resilient and redundant.

    6. Vic

      Re: What else has BA poorly maintained?

      If BA have got their IT wrong this badly, what's to say they've got their aircraft maintenance correct?

      Unlikely. Aircraft maintenance is fully licensed. Those doing the work face loss of licence (and therefore job) if they screw up at all, and a prison sentence if anyone is injured as a result of any shoddy work. They generally do stuff properly.

      Vic.

  24. Charles Smith

    I can imagine the phone call...

    "Look, I know we fired you just before Christmas, but it wasn't personal."

    "Eff Off"

    "Please come back, just for the weekend and help us get running again."

    "Eff off"

    "We'd pay you loads of money if you'd help. Just this once."

    "Send me a Christmas card, I might help then. Meanwhile why don't you phone India."

  25. Nick Kew Silver badge
    FAIL

    WannaFly?

    Given that I'm late to this commentard party, can I really be the first to coin a word for it?

    Glad to have avoided BA consistently since they messed me about inexcusably in about '95 or '96. BA = delay, expense, rudeness, hassle, and above all, lack of information.

  26. pleb

    Redundancy?

    Someone at HQ misunderstood when told that IT needed more redundancy...

  27. Anonymous Coward
    Anonymous Coward

    The single points of failure?

    To add to speculation about possible causes...

    1) Not paying the datacentre bills... hard to think BA might not cough up to a red letter before the switch is pulled.

    2) Shared storage failure/corruption... something like a catastrophic SAN failure would potentially take out dozens of systems in an instant, and require a massive co-ordinated recovery effort

    3) Networking... corruption of core router configuration making everything everywhere inaccessible?

    But power? C'mon. That's just nonsense.

    1. Anonymous Coward
      Anonymous Coward

      Re: The single points of failure?

      Power failure could cause any of those things, and complicate recovery if the resulting state is ill defined. Throw in the fact that it's one of the hottest days of the year and their busiest day of the year, and that they just shipped off three-quarters of their in-house IT to TCS, and this is going to be seriously bloody messy.

    2. James Anderson

      Re The single points of failure?

      Probable scenario:

      About 5 years ago they set up a load balanced system where each data centre handled 50% of peak load with a few spare boxes.

      About 3 years ago they upgraded some servers to handle extra load but did not bother upgrading the spare capacity.

      About 2 years ago IT pointed out that with increased traffic each data centre could still carry its 50% share of normal load but only 70% of peak load; management decided that was "good enough" as peak load only occurred 5 or 6 days a year.

      About a year ago everything was outsourced to India, and nobody noticed (or everyone was too embarrassed to point out) that each data centre now coped with only 80% of peak load.

      Yesterday one data centre crashed. The remaining data centre tried to cope with 160% of peak load; chaos theory was invoked, and the systems crashed one by one, then crashed again on restart.

      When power was restored at the other data centre, the databases, queues etc. were in a state that required experienced engineers with deep knowledge of the systems involved -- like the ones that were handed their P45s last year.

      Not sure how much this will cost BA -- but just the €250 to €600 due to each and every passenger under EU261 should be enough to have the shareholders demanding cruel and unusual punishment.
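The degradation timeline in that scenario is just division. A minimal sketch (the percentages are the commenter's hypothetical figures, not real BA numbers; the 62.5% capacity value is an assumption chosen because it reproduces the quoted "160% of peak"):

```python
def load_on_survivor(total_demand_pct: float, dc_capacity_pct: float) -> float:
    """Offered load on the surviving DC, as a percentage of its capacity."""
    return 100.0 * total_demand_pct / dc_capacity_pct

# Originally: each DC sized for 100% of peak, so failover is survivable.
print(load_on_survivor(100, 100))   # 100.0
# After growth and skipped upgrades, each DC copes with only ~62.5% of peak:
print(load_on_survivor(100, 62.5))  # 160.0 -- the cascade begins
```

Anything over 100.0 means the survivor is thrashing rather than serving, which is how a single-site failure becomes a total outage.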

      1. Anonymous Coward
        Anonymous Coward

        Re: Re The single points of failure?

        "[...] management decided that was "good enough" as peak load only occurred 5 or 6 days a year."

        The Universal Law of DR:

        If anything can go wrong - it will go wrong - at the worst possible moment.

    3. Rhino Djanghardt

      Re: The single points of failure?

      I agree, the power excuse is likely a red herring / BS. It's also an easy target for software guys who investigate a failure and are tasked with providing an RCA. I was once asked by a Sysad to check the power to a particular server that had gone offline; the server PSUs were fine, as were the rack PDUs and all other upstream power systems. However, in the absence of any other evidence (such as syslogs etc.), the guy still reported the cause as a power failure at the data centre. The most likely cause IMO was that the server had either shut itself down, or been shut down accidentally by a Sysad (or a DBA 'trusted' with the root password).

      Back to BA: your #2 scenario sounds feasible and may tie in with the early reports of incorrect data (on boarding passes), due to corruption. The same outcome could also be due to database corruption caused by a hosed disk partition. The bottom line though is probably: 1. Poor HA design, 2. Inadequate / absent testing, or 3. Deficient routine maintenance.

      Note also that when large, complex systems are outsourced, knowledge transfer to the new company / teams is often skimped or even overlooked in the rush to get the business and operations transferred. As others have pointed out, you will never get back the depth & breadth of experience you had in the original system designers and custodians. Even if you re-hire some of them at consultancy rates, they still have to liaise / battle with the new guys, whose work culture may be quite different.

    4. TheVogon Silver badge

      Re: The single points of failure?

      "catastrophic SAN failure would potentially take out dozens of systems in an instant, and require a massive co-ordinated recovery effort"

      Both of the independent SANs fail at the same time? Seems unlikely. Especially as we already know it's some sort of power failure...

    5. Anonymous Coward
      Anonymous Coward

      Re: The single points of failure?

      Power failure isn't nonsense. It takes a short while (perhaps a minute or two) for diesel generators to start up when there's been a mains power failure to the site on the national grid. What powers the servers in that time? Batteries, in the UPS. If the batteries are knackered, they can't deliver enough juice for long enough until the genny kicks in, result? Loss of power to the servers, corrupted database data files and lord knows what. Seen it happen.
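That bridging problem is simple arithmetic: the batteries must carry the full load for longer than the generators take to start. A hedged sketch with illustrative figures (none of these are real BA or data-centre numbers):

```python
def ups_bridges_gap(battery_wh: float, load_w: float, genset_start_s: float) -> bool:
    """True if stored battery energy covers the load until the generator is up."""
    runtime_s = 3600.0 * battery_wh / load_w  # Wh -> seconds at this load
    return runtime_s >= genset_start_s

# Healthy batteries: 5 kWh at a 120 kW load = 150 s of runtime, genny up in 90 s.
print(ups_bridges_gap(battery_wh=5000, load_w=120_000, genset_start_s=90))  # True
# Knackered batteries down to 40% of rated capacity: only 60 s of runtime.
print(ups_bridges_gap(battery_wh=2000, load_w=120_000, genset_start_s=90))  # False
```

Which is why UPS battery testing under real load, not just a float-voltage check, is part of any serious DC maintenance schedule.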

  28. Custard Fridge

    Boots on the ground

    Once the dust settles I would see this as a victory for UK-based IT workers: however much you outsource abroad, when it hits the fan (and blimey, it looks like it has here) you need those boots on the ground in the data centre(s) as fast as possible.

    A small army of knowledgeable people who between them designed / built / installed / maintain it will be able to get it started quicker than equally talented people who did none of those things and are based thousands of miles away.

    Also - if someone cut through a mains supply (or three) that powerful, why aren't we reading about it as a news event in itself?

    I can't quite believe this is a cover-up for something more sinister - unless there are a lot of police cars / black transit vans around the data centres and nobody has spotted them yet...

    1. Doctor Syntax Silver badge

      Re: Boots on the ground

      "Once the dust settles I would see this as a victory for UK based IT workers"

      And more than just UK-based, but in-house.

      Whether you're running a bank or an airline or anything else where the IT operation is essential to being able to function you have to regard IT as one of your core competences. Outsourcing it simply doesn't make sense.

      1. Boris the Cockroach Silver badge

        Re: Boots on the ground

        Quote: Outsourcing it simply doesn't make sense.

        It makes perfect sense

        Outsource the IT support saving the company X millions.

        Get a fat bonus as a reward.

        Add that to your CV

        Leave and join a.n.other company with the line "I saved x millions at my last job"

        New company gives you x millions for being good

        Previous company falls over due to IT support being about as effective as an infinite number of monkeys

        New company sweeps up all the previous companies customers.

        You get another fat bonus

        And decide to outsource your current employer's IT op...... (and repeat until you can take early retirement at 45)

  29. Doctor Syntax Silver badge

    It seems as if "power failure" has simply become the latest PR spokesnumpty's boilerplate, to be used when even they realise "only a few customers were affected" isn't going to wash.

    Or ... could it be ... all these "all the UK ran on renewables" and "solar accounted for more than nuclear" stunts were achieved by a bit of load-shedding?

  30. rwbthatisme

    Split brain

    Reading between the lines, my hunch is that they got into a split-brain scenario of some sort. The power failure broke the coordination between the data centres, and as they came back online both centres started to operate independently, so that requests and updates to data centre A would not replicate to data centre B. The reports of boarding passes swapping from one flight to another point to this. The upshot is that both centres are now running with incomplete and irreconcilable data, and the only remedy is to shut down and roll back to the last known good data point.
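A toy illustration of that split-brain scenario (purely illustrative; the flight number and destinations are made up, and this claims nothing about BA's actual replication design):

```python
def diverged(site_a: dict, site_b: dict) -> set:
    """Keys whose values conflict between the two sites after a partition."""
    return {k for k in site_a.keys() & site_b.keys() if site_a[k] != site_b[k]}

booking = {"BA117": "JFK"}          # replicated state before the partition
site_a, site_b = dict(booking), dict(booking)

# Partition: each site keeps accepting updates on its own.
site_a["BA117"] = "JFK"             # site A re-confirms the original destination
site_b["BA117"] = "VIE"             # site B reassigns it: wrong boarding pass

print(diverged(site_a, site_b))     # {'BA117'}
```

Once the conflict set is non-empty there is no automatic winner; something (or someone) has to decide which history to keep, which is exactly why the roll-back-to-last-known-good remedy is so slow and painful.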

    1. Mike Pellatt

      Re: Split brain

      That tallies nicely with "people getting random destinations on scanning boarding cards" mentioned earlier

    2. Merlinski
      Black Helicopters

      Re: Split brain

      Ahhh - so you're saying that their system became self aware!

      What we learnt from Microsoft was that within minutes of "going live" a system quickly learns basic Anglo-Saxon, responds to all requests with outrage, and tells management to f#ck off.

      http://www.bbc.co.uk/news/technology-35890188

      It's going to take them a while to undo that split brain thingee.

    3. Anonymous Coward
      Anonymous Coward

      Re: Split brain

      You mean they didn't plan to avoid this situation? Why?

      1. illiad

        Re: Split brain

        no money, they sacked half the IT staff, and the rest went on holiday, or worse... :)

    4. Anonymous Coward
      Anonymous Coward

      Re: Split brain

      Thanks for that, I think I'll have the system back up real soon now

  31. Duncan Macdonald Silver badge

    Access from India

    Could it be that the support team that is now in India cannot access the systems to restart them ?

    Restart often needs hands on staff to push buttons and enter initial commands.

    1. Wensleydale Cheese Silver badge

      Re: Access from India

      "Could it be that the support team that is now in India cannot access the systems to restart them ?"

      They'll be on the next plane.

      Oh wait...

    2. TkH11

      Re: Access from India

      That will be the case if they haven't got ILOM connectivity to the servers sorted out.

      ILOM lets remote users come in over IP and power the servers up. Once a server is booted, remote users can switch to ssh over the main interface.
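For the curious, that kind of out-of-band restart is typically done over IPMI (which ILOM speaks) with the standard `ipmitool` CLI. A hedged sketch that only *builds* the command; the hostname and username are placeholders, and the `-E` flag assumes the password is supplied via the `IPMI_PASSWORD` environment variable rather than on the command line:

```python
def power_cmd(host: str, user: str, action: str) -> list[str]:
    """Construct an ipmitool chassis-power command for a remote BMC/ILOM."""
    assert action in {"status", "on", "off", "cycle"}
    return ["ipmitool", "-I", "lanplus",  # IPMI-over-LAN (RMCP+) interface
            "-H", host, "-U", user, "-E", # host, user, password from env
            "chassis", "power", action]

print(power_cmd("ilom.example.internal", "operator", "on"))
```

Of course, this only helps if the out-of-band management network was actually cabled, routed and reachable from wherever the support team now sits.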

  32. Anonymous Coward
    Anonymous Coward

    Some information about BA's data centres here : https://www.ait-pg.co.uk/our-work/british-airways/

    1. Wensleydale Cheese Silver badge

      Early adopters of data centre infrastructure management (DCIM) software

      "Some information about BA's data centres here : www.ait-pg.co.uk/our-work/british-airways/"

      BA was an early adopter of data centre infrastructure management (DCIM) software and wanted to extend its use with an easy to use flexible solution that could make server allocation faster and provide instant reporting and real time dashboards of power and cooling capacity.

      Clients testimonial: Keith Bott, Service Manager, British Airway [sic]*

      The new DCIM software allows us to quickly allocate space for new servers, manage power and network connectivity, issue work orders and provide capacity planning across all British Airways data centres. AIT provided the expertise and man power to audit and upload data, incorporate our modifications, and support the team during initial role out, to give us a leading edge DCIM solution that meets our exact needs.

      What could possibly go wrong?

      * an obviously un-proofread testimonial: they missed the apostrophe out of "Client's" and the final 's' from "British Airways".

      1. Ken Moorhouse Silver badge

        Re: Early adopters of data centre infrastructure management (DCIM) software

        Ah... Could this be the problem?

        Storing all the documentation in a folder called DCIM that they found on drive D of their pc, then wondering where it all went when they'd finished charging up their iPhone.

      2. Destroy All Monsters Silver badge
        Paris Hilton

        Re: Early adopters of data centre infrastructure management (DCIM) software

        This seems to be a very good tool to have.

        I don't see why anything should go more spectacularly wrong if you have it.

  33. aaaa

    Paying for criticism

    The biggest problem I see with outsourcing is lack of criticism.

    Ie: your own IT team will freely criticise management decisions and choices of technology etc.

    Once you outsource it's a lot of "yes sir, no sir".

    RBS also had a huge IT failure about 12 months after outsourcing to India (from memory), and despite a government enquiry there was very little said at the end of the day. I agree strongly with other suggestions here that there needs to be some legislative penalty for companies who outsource and then fail. They shouldn't be allowed to 'justify' it as unrelated. The tests are just too black/white, whereas this is a complex phycological issue. It's not about IT skills - I'm sure TATA are great at that - it's about human ego and reward. Just as the GFC taught governments that they need to control how financial salespeople are rewarded, the global IT crisis will eventually teach politicians the same thing about their CIOs.

    1. Doctor Syntax Silver badge

      Re: Paying for criticism

      "this is a complex phycological issue"

      They're using seaweed to predict the weather?

  34. RW

    A good, nay excellent, example of a single point of failure.

    Unless by "power supply" they mean the entire power grid for Great Britain.

    1. katrinab Silver badge

      And I haven't seen any reports of anything similar to the Black Friday power cut in Central London.

  35. Marketing Hack Silver badge
    FAIL

    "We would never compromise the integrity and security of our IT systems."

    Doesn't sound like you have to, BA, they're pretty F'd up as it is.

    Wow, just wow. Britain's flag carrier and the largest carrier operating at Heathrow is grounded. Plus Heathrow is their largest hub and certainly still among the top 3 or 5 busiest airports globally. Plus they screwed the pooch at Gatwick too. And the telephone service centres and website are hosed as well! And at the start of a major British holiday weekend!

    That's not even a hat trick. That's hitting for the cycle! (For those of you flummoxed by the onset of American sports terms, "the cycle" is getting at least one of each of the four types of base hit (single, double, triple and home run) in one game, and it is among the rarest single-game achievements in team sports.) Yay team!!

    So, BA =?

    -Barely Airborne

    -Badly Automated

    -Backups Afflicted

    -Buyers Abandoned

  36. Anonymous Coward
    Anonymous Coward

    BA problem?

    Sounds more like a BS problem.

  37. Karlis 1

    > BA has a very large IT infrastructure;

    Expected.

    > it has over 500 data cabinets spread across six halls in two different sites near its Heathrow Waterside HQ.

    That's all? That's barely a medium infrastructure. Ok, 500 racks is not small, but it is a looooong way off from _very large_.

    1. Tony S

      That's all? That's barely a medium infrastructure. Ok, 500 racks is not small, but it is a looooong way off from _very large_.

      Totally agree. I worked for an SME that had an annual turnover less than BA's daily turnover. We were managing 8 cabinets in 5 DCs in 2 countries. (not including small wall cabinets for network distribution)

      We had built that up from scratch ourselves, and it had been done fairly cost effectively. We tested on a regular basis, but the biggest test was when the local power transformer for the main DC had to be upgraded; and the whole area was disconnected whilst the engineers made the change.

      I was in that morning to keep an eye on things, and the CEO was hovering nervously. But everything worked as it should, although it did give me a few things that needed to be discussed afterwards for improvement.

      Fast forward 3 years and the parent company had decided that outsourcing the IT services was the way to go, and I was made redundant. In the following 4 years they had three major outages, one of which lasted for over 2 weeks. I'm told that the cost of their losses for the least of those incidents was about €20,000 (loss of business, cost of extra staff to resolve, etc.), which at the time was a little more than six months' worth of my salary.

      I have no problem with people blaming IT when things go wrong, if it is actually the IT at fault. When it's much more down to poor decisions being made for purely financial reasons (i.e. to line someone's pockets) then I tend to get a bit warm under the collar.

      1. h4rm0ny
        Thumb Up

        >>"But everything worked as it should, although it did give me a few things that needed to be discussed afterwards for improvement."

        It never gets better than this!

      2. aaaa
        Unhappy

        Insurance and lack of competiton

        Fast forward 3 years and the parent company had decided that outsourcing the IT services was the way to go, and I was made redundant. In the following 4 years, they had three major outages, 1 of which lasted for over 2 weeks. I'm told that the cost of their losses for the least of those incidents was about €20,000,

        And they probably had insurance to recover that €20K, cheaper than maintaining a reliable system, and when the 'competition' doesn't offer a service which is demonstrably more reliable - there is no competitive pressure to do any better.

        1. LyingMan

          Re: Insurance and lack of competiton

          I have heard that used as a big supporting argument in many large IT outsourcing agreements.

          But in this instance, tangible losses might amount to 100m to 150m (compensation etc.), but how could the impact on brand / loyalty be costed? And how much will the insurance cover?

          For a PLC it may be possible to absorb a loss of that proportion and use the insurance to recover. But if an SME suffered a significant loss like that, the insurance might not cover the loss of customers, or the folding of the business at the extreme.

  38. Charles Smith

    Convenient excuse?

    "OMG we've really Fff'd this upgrade and the Database is dying...what can we do???"

    BOFH: "You see that big red button? Press it to summon help."

  39. We're all in it together

    Where's Leslie Nielsen when you need him?

    "Mr Hammond ate fish and Randy says there are five more cases and they all had fish too"

    "And the co-pilot had fish"

    "What did the navigator have?"

    "He had fish"

    "Alllrrrright. Now we know what we're up against"

  40. Grunt #1

    The new BA business model ?

    Send jobs abroad instead of passengers.

  41. Grunt #1

    This is poor DR planning and management.

    The power failure is not the fault of Tata (TCS). It's the fault of the BA management who decided to risk-accept a total power outage without a hot-standby recovery option or a proven and tested resumption plan.

  42. Anonymous Coward
    Anonymous Coward

    How much was the DR budget?

    Whatever they saved was lost in the first hour of the failure.

  43. Anonymous Coward
    Anonymous Coward

    You can't outsource risk

    ... or impact.

  44. Anonymous Coward
    Anonymous Coward

    Apparently BA has been trying to save money by switching to wind power. The wind dropped this morning and so did the power and hence all the servers went down. When asked why there wasn't any back-up power as the wind often dropped in the UK, the manager in charge of infrastructure said - we forgot.

  45. BeeBee

    Lame excuse

    Where are the dual-corded redundant PSUs and the gold stock for swap-outs by the data centre production operations staff? Did someone ignore the phone-home "oops, I have failed - please swap me out"?

    Smacks of a cover up to me.

  46. Systems Analyst

    Apparently the power supplies themselves can be servers - yes, they are on a network themselves. The power-supply remote-management software licences are very expensive. There is also a system-wide kill switch. Plenty of scope for trouble.

    A lot of older companies are run by muppets; this dates from when the computer department was small beer, with no board member. Today, data is increasingly central to all large companies; it is not a side-issue for techies. Throw out your chumps, BA.

    1. Anonymous Coward
      Anonymous Coward

      Solar power

      On the sunniest day of the year when 30% of UK power was provided by solar power.

      Did someone shut the nuclear or gas generated power stations down as a result?

    2. LyingMan

      Loss of 'wind'

      I thought all of them are full of 'hot air' and capable of discharging said air appropriately and effectively!

  47. Anonymous Coward
    Anonymous Coward

    Indian coding and support...

    Is my theory.

  48. Anonymous Coward
    Anonymous Coward

    Power outage does not explain wrong destination on your eTicket

    Someone got hacked or fucked up an upgrade. Blaming the power is a public-relations boilerplate response.

    Data centres with UPSes and generators would have this covered. And even if those failed, it would not show up as the systems giving out the wrong destination details.

  49. Doctor Syntax Silver badge

    "We would never compromise the integrity and security of our IT systems."

    If this is uncompromised integrity what would it look like if it had really been compromised?

    1. Tom Paine Silver badge

      Loss of integrity is when the system tries to issue boarding passes for 650 passengers in Schipol for a 737 in Buenos Aires, and routes their luggage to Vienna.

      Reminds me of the old joke: "Breakfast in London - Lunch in New York - luggage in Tokyo"

  50. Anonymous Coward
    Anonymous Coward

    6Ps every time

    The only power failure here is that held by the executive. They have failed to plan and control BA properly, they have failed in their role as guardians of the business.

  51. Anonymous Coward
    Anonymous Coward

    Schadenfreude

    I always make the point that you need a DRP that works. It's entertaining reading this, but I have a plane to catch and it's not BA.

  52. AbortRetryFail

    The dog ate my homework

    Power supply failure? Seriously?

  53. EastFinchleyite

    Speculation

    So much speculation. It is fun but it will be a while until we get the facts and maybe not ever.

    One thing I'd like to add: if the whole of the BA fleet were grounded because of a failure in a major plane component such as the engines (reliability, safety or suchlike), there would be a huge shitstorm and the supplier would be severely damaged (financially and reputationally). We seem to accept these IT failures with a shrug, albeit an angry shrug. I think this is because we expect IT systems to fail, and experience shows we are right.

  54. Anonymous Coward
    Anonymous Coward

    Where is the communications director?

    Power failure or Communications failure? Both!

  55. anthonyhegedus Silver badge

    The CEO and CIO should be forced out. It happened on their watch, it's just not acceptable. There weren't multiple earthquakes in multiple locations, nuclear wars, meteor strikes or tidal waves. It was incompetence pure and simple.

    Why did it take them several hours to update their website? Come on, it's not hard - I could do it in half an hour. Just change the name servers FFS and do a quick website with SeaMonkey! But no, they have to keep their home page running so that they can sell a bit more.

    Why couldn't they communicate with the poor passengers left wandering around a fucking airport for 8 hours??

    And that's apart from the fact that their entire global systems went down.

    You can't afford to have a half-baked system running where hundreds of thousands of people's physical locations are concerned. It's not like it's Sky TV, or the BBC website, or even a train company. It's inexcusable, and companies like BA should be required by law to have proper resilience built in. It's about time directors of organisations like this were made more liable. It would soon focus their energy on doing this properly if they knew they'd be punished when things cock up like this.

    But that won't happen, they'll learn that they can get away with it, so they will continue to make the same mistakes.

    1. Grunt #1

      Ships

      On a ship it is always the captain who gets court-martialled if something major happens, even if he/she was asleep when it happened. The same applies on an aircraft.

      1. Destroy All Monsters Silver badge

        Re: Ships

        > court martialled

        We are not at war though.

        Or are we?

  56. InNY

    No one's told managlement

    what ten bob for the meter actually means?

  57. Fustbariclation
    Mushroom

    Odd that nobody notices that it happens at the busiest time.

    Things that go bang at one of the busiest weekends of the year are almost certainly capacity problems - which are difficult to fix.

    When you read of staff having been retrenched recently, it makes it even more likely it's a capacity problem.

    Experienced, technical staff, with a sound knowledge of the systems, and their history, are expensive and rare. Their value is not obvious as they tend not to be top-flight communicators and are not that keen on blowing their own trumpets either.

    The first you notice of their absence is usually just this, a very big bang, that appears to come from nowhere and with nobody knowing how to fix it.

    Capacity errors are also notorious for being immune to defences like removing SPOFs by having multiple data centres.

    1. Captain Badmouth

      Re: Odd that nobody notices that it happens at the busiest time.

      "The first you notice of their absence is usually just this, a very big bang, that appears to come from nowhere"

      as in :

      "Nobody knows what I do until I don't do it..."

  58. Nifty

    I heard that all BA phone lines were down for the duration and website affected too.

    The public were unable to talk to any BA staff or even get basic advice.

    If their phone lines are so tightly integrated, what does this say about BA's resilience to other kinds of catastrophe?

  59. JeffyPoooh Silver badge
    Pint

    My buddy works in IT

    He built a system decades ago that is still in use.

    He claims that one could vaporize the primary system and the offsite backup system (in his basement) would complete the in-progress transactions. Nobody would even notice.

    1. Destroy All Monsters Silver badge

      Re: My buddy works in IT

      > decades ago

      > vaporize the primary system nobody would even notice.

      Choose exactly one.

      (Then subtract one)

      1. JeffyPoooh Silver badge

        Re: My buddy works in IT

        He hand-coded it himself, starting mid-1980s. Had 32 terminals running off a pre-PC, PC-like server at one point. Later evolutions - yes, still plural decades ago - included parallel servers mirroring each other.

        He and I took some CS classes together circa 1980. So I can pretty much imagine exactly how he did it. It's not really all that complicated, provided you're not limiting yourself by relying on pre-packaged databases etc. You have to be able to build things from near scratch.

        So, as far as I know, yes he did. And yes, plural decades (20+ years) ago. He had been building it up for about a decade by then.

  60. Anonymous Coward
    Anonymous Coward

    Michael O'Leary is going to have fun with this one.

  61. This post has been deleted by its author

    1. joe bloggs 6

      You do an upgrade on one of the busiest days? Really?

    2. ginko74

      Old tweet

      The tweet you are quoting is dated 11 April.

  62. TheBorg

    Same old story these days, it seems: large organisations that can't design a highly available system to keep their critical applications operating, and then suffer a very public loss of reputation.

    Comes down to senior management not having a clue, spotty grads pretending to be architects and from there it all goes tits up !

    In my mainframe days (yes, another dinosaur) we didn't have these issues; data centres were designed properly, and though very expensive we had backup generators and alternate systems for failover.

    Added to this the TCS factor = clueless bodies who can't make a decision and have likely never even been into a data centre.

    I don't believe the power supply story either :)

  63. DaveStrudwick

    BA Board Composition Not Fit For Purpose....

    The board own shares valued at £24.5M, so they have a huge personal vested interest in getting the company's costs as low as possible. So of course they'll outsource IT support. Some 50% of the firm is owned by Qatar and the institutional shareholders. None of these investors will even ask about the actual risk to the company from IT-related (cyber) risks and their mitigation… because they wouldn't have invested so heavily in something so apparently failure-prone.

    The Annual Corporate Governance Report for 2016 has pages of methodology in relation to financial risk, but the risk from critical IT failure gets a sentence. Why? Well, it might be that the firm's Audit and Compliance Committee doesn't seem to have a single trained, experienced or fully qualified IT specialist, yet according to the latest corporate governance report their function includes (pg. 32):

    g. To evaluate all aspects of the non-financial risks the Company is exposed to, including operational, technological, legal, social, environmental, political and reputational risks.

    Hmmm… it would seem this committee composition is not fit for purpose! (Perhaps the Head of Group Audit and Risk Management needs to find a sword to fall on!) The risk control and management systems include Enterprise Risk Management under class B (pg. 40 of the report). The mention of the risk from 'Failure of Critical IT Systems' is clearly not mitigated, as the recovery approach clearly just doesn't work! (Pg. 41)

    I conclude that this company’s boards are far too focussed on finance and therefore the firm is not properly run.

  64. rwill2

    May's way to reduce immigration ... outsourcing skilled IT jobs to India

    May's way to reduce immigration ... let's replace UK/EU workers with jobs in India or low-paid Indians in the UK. What can go wrong:

    https://www.theregister.co.uk/2016/06/24/ba_job_offshoring_gmb_union_hand_delivered_letters/

    Generally I think it's time we all leave big companies and their fat-cat managers that outsource us and cut IT security/infrastructure budgets, and join much more rewarding startups with modern 100% cloud solutions, where we get some respect for our skills, can innovate and even get company equity/shares. I did that a few years back; so much happier, and no way am I going back.

  65. Louiscyphre

    Flooding and power outages in Bangalore this weekend

    http://timesofindia.indiatimes.com/city/bengaluru/two-days-of-consecutive-rain-leave-bengaluru-in-complete-mess/articleshow/58876810.cms

    http://bangaloremirror.indiatimes.com/bangalore/civic/the-dark-waders/articleshow/58874816.cms

    I'm sure it's only coincidence

    1. Anonymous Coward
      Anonymous Coward

      "Comes down to senior management not having a clue, spotty grads pretending to be architects and from there it all goes tits up !"

      Don't tell me about it.

      Around here (not BA) the high-schoolers have been given the keys to the pantry and are running the show (into the ground), because the old hands aren't into the latest fads, and are considered deplorable/unagile and unpresentable to customers that shall be fed Yoof Power.

      The old ones have other priorities than CV stuffing though.

      Oh well.

      1. Anonymous Coward
        Anonymous Coward

        Agile ?

        It seems the only thing agile here is knowing how to avoid the blame.

  66. pauleverett

    This is the fourth major outage I have heard of in the last several days directly blaming 'the power supply' for a complete tits-up. Maybe the power supplies are being hacked, cos this is not normal, and way too big and regular to pass off lightly.

  67. Anonymous South African Coward Silver badge

    Couple of years ago management decided that I should host email for four sites. As well as the transmittal of financial files to their respective destinations.

    I then insisted that the company procure a proper genset.

    Which was done.

    Today I can congratulate myself on my foresight and insistence, as we have had a couple of total power losses from Eskom which would have turned out more expensive for the company had we not purchased the generator.

    And a good backup structure which was tweaked over the years. One incident of cryptolocker tested the resilience of the backup system, and no data was lost (except the affected user's personal data files, boohoo).

    Even today I am looking at protecting online backups from nasty stuff like WCry and such. Not fun, but hey, a sysadmin's gotta do what a sysadmin needs to do.

    Next project will be cloud backup, to backup critical and core company documents without which the company will have a very hard time. Cloudy backup will be evaluated thoroughly, and will also be tested. I will not move everything to the cloud, as it is a single point of failure.

    I may be old-fashioned, but I prefer physical servers onsite instead of cloudy servers, as you cannot poke them should they barf and decide to be sluggish.

    This BA IT incident is just one more reason to be very, very careful when outsourcing your IT department; you never know what sort of people you will get.

    With your own IT department in-house you have full control over them, and you can meet each team member one to one. Outsourcer? Forget about meeting their team members. And, yup, you don't have full control over the outsourced IT team members.

    And the most important rule of outsourcing is that the company offering outsourcing services will most probably also be servicing more than two or three other companies, and will not always be giving you 100% of their time.

  68. Jaded Geek

    State of IT

    It's interesting reading the various comments on here, including the links to the issues that Capita had; also, let's not forget the major RBS/NatWest batch failure of several years ago.

    I've worked in IT for nearly 30 years and have seen the gradual dumbing down of IT, some in part due to technology changes, others due to offshoring.

    When I first started you had to understand the systems. I was lucky enough to work on the early Unix servers, where there was no such thing as Google; you either sat down with a manual and learnt something or you figured it out. It was also a time where, if you upgraded the hardware (new cards/drives etc.), you had to know what you were doing, as more often than not the cards had DIP switches that needed setting, and in some cases you had to mess around with kernels to get stuff to work - and don't even get me started on RS232 communications and hardware handshaking.

    As computing became more intuitive, things became more plug-and-play and programming became easier, so people needed to understand less about how computer systems work.

    I've worked for several large companies that have huge offshore offices and teams, both directly employed by them and through various consultancies. The reason for this is obvious: it's cheaper, considerably cheaper. However, the average skill level is not what I would personally expect. This isn't a criticism as such, more of an observation, and the issue boils down to experience. I kept being told by my employers that the people were the brightest graduates etc., but how does that help with a banking system that was developed in the 70s? Did all those graduates have the experience when RBS's batch failed?

    I'm not saying that some of the onshore people would be any better, but you also have to factor in the culture thing; as someone pointed out, the offshore people tend to just do as they are told and not question something, even if it's wrong. There are plenty of examples where I have seen entire test systems trashed because someone was told to push data in, which they did, but there were errors in the data or system problems; when this was pointed out to them they said they knew, but they had just been told to push data in.

    There have been a couple of comments in the thread about using simple Java-type apps, which would possibly have made things easier to recover, but if BA is anything like the large banks, they will have legacy systems bolted into all sorts and a real rat's nest of systems as they evolved over the decades; it's just the way it is. If you then offshore this (or even onshore it via a consultancy), all the knowledge of how that stuff connects gets lost, as do the simple things like what order a system should be started in.

    I worked through a major power failure of a data centre and we lost almost every system, due to diverse power not actually being that diverse when push came to shove. However, the real issue was the recovery process: hundreds of people on a call trying to get their system started first, but not understanding the dependencies of what they need. Someone actually wanted the Outlook servers starting up first, so we could communicate via email - forget the fact that all the remote gateways we needed to access said email were down!
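    The restart-ordering problem described above is essentially a topological sort of the system dependency graph. A minimal sketch, with entirely hypothetical system names, using Python 3.9+'s standard-library graphlib:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists the systems that must be
# running before it can start (names invented for illustration).
deps = {
    "network": [],
    "storage": ["network"],
    "database": ["storage"],
    "auth": ["database"],
    "mail_gateway": ["network", "auth"],
    "outlook": ["mail_gateway"],  # email comes up near the end, not first
}

# static_order() yields every system only after all of its predecessors.
startup_order = list(TopologicalSorter(deps).static_order())
print(startup_order)
# → ['network', 'storage', 'database', 'auth', 'mail_gateway', 'outlook']
```

    With the order written down and machine-checkable, nobody on a hundred-person recovery call has to guess whether the email servers come up before the gateways that reach them.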

    We still don't know what the actual cause of the outage was (BA's), but I do wonder if the reason for the extended outage was that people didn't have a document that said "open when sh*t happens", as you can't document everything, and at the end of the day there is no substitute for experience, regardless of the location of that person or team.

    If the extended outage is a result of the offshoring it would be interesting to understand how much the outage will cost BA in compensation versus what they saved with the deal.

    1. Anonymous Coward
      Anonymous Coward

      Re: State of IT

      "it would be interesting to understand how much the outage will cost BA in compensation versus what they saved with the deal."

      Sunday, 8pm: UK rags are suggesting £100M in compensation costs so far, with more chaos still to follow.

      Presumably BA (or parent) just pass the cost on to their outsourcer? There was presumably an SLA, and also liquidated damages etc., in the contract, wasn't there?

      1. Anonymous Coward
        Anonymous Coward

        Re: State of IT

        Last night I sent British Airways a Freedom of Information request asking for the very same information.

        Wonder what they will decide to respond with!

        1. Anonymous Coward
          Anonymous Coward

          Re: State of IT

          Doubt it; as a private company they're not covered by FOI.

    2. Fake_McFakename

      Re: State of IT

      The RBS Group scheduling system incident was caused by a botched upgrade performed by internal IT staff in Scotland, nothing to do with outsourcing.

      It's incredibly frustrating to see outsourcing being touted as the Great Evil of IT whenever there's a major incident like this, especially when people haven't the first idea about the actual root cause of the incident.

      1. Anonymous Coward
        Anonymous Coward

        Re: State of IT

        "The RBS Group scheduling system incident was caused by a botched upgrade performed by internal IT staff in Scotland, nothing to do with outsourcing. [...]"

        You'll have a definitive source for that, presumably?

        Btw, thanks for creating a brand new account today for posting that highly valuable item.

        1. Fake_McFakename

          Re: State of IT

          "You'll have a definitive source for that, presumably?

          Btw, thanks for creating a brand new account today for posting that highly valuable item."

          This is RBS Group's incident report containing the technical details:

          http://www.rbs.com/content/dam/rbs/Documents/News/2014/11/RBS_IT_Systems_Failure_Final.pdf

          These are the letters between the Treasury Committee and RBS about the incident:

          http://www.parliament.uk/business/committees/committees-a-z/commons-select/treasury-committee/news/treasury-committee-publishes-letters-on-rbs-it-failures/

          This is the specific letter in which RBS specify that the team involved were based in Edinburgh:

          http://www.parliament.uk/documents/commons-committees/treasury/120706-Andrew-Tyrie-re-IT-systems.pdf

          Btw, thanks for continuing the Register Comment Section's long tradition of posting anti-outsourcing opinion with little or no facts about root cause or actual real-life experience of major incidents

          1. Anonymous Coward
            Anonymous Coward

            Re: State of IT

            Who allowed them to outsource it to Scotland?

          2. Anonymous Coward
            Anonymous Coward

            Re: State of IT

            The initial RBS problem was caused by UK staff (a botched CA7 upgrade) - but the real damage was done by the recently off-shored ops analysts trying to recover the batch suites: stuff run out of order, stuff run twice, stuff not run at all. Basically, shoving jobs in and hoping for the best led to corrupt databases, lost data, and referential integrity between applications being lost. The 90-odd UK staff who knew these suites in intricate detail had recently been dumped. The off-shoring turned a trivial issue into a major disaster.

    3. Anonymous South African Coward Silver badge

      Re: State of IT

      "...and don't even get me started on RS232 communications and hardware handshaking."

      Heh, been there, done that. It was the time when you still got cards with jumpers on them, and newer NE2000s with software configuration utilities were being released.

      I preferred the jumper style of changing addresses though; it makes it much easier to determine what IRQ and whatnot a card is using when the PC is switched off.

    4. keithpeter
      Coat

      Re: State of IT

      "I've worked in IT for nearly 30 years and have seen the gradual dumbing down of IT, some in part due to technology changes, others due to offshoring."

      @Jaded Geek and all

      Disclaimer: I'm an end user

      Could the increased demand for 'live' and 'real-time' data be a factor as well? It strikes me that it results in the layering of systems to considerable depth and possibly unknown spread. Again, as you say later in your post, the dependency graph for a restart becomes very complex and may not be known/documented. The graph could even be cyclic (so System A version 9.3 expects System B 3.7 to be running, but System B 3.8 has been installed, which depends on System A 9.3).
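      A cyclic dependency graph like the one above is exactly what a topological sorter refuses to order - there is no valid restart sequence at all. A sketch, reusing the hypothetical System A/System B versions from the example (Python 3.9+'s graphlib):

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical cyclic case: the installed System B depends on System A,
# which in turn expects System B to be running first.
deps = {
    "system_a_9.3": ["system_b_3.8"],
    "system_b_3.8": ["system_a_9.3"],
}

try:
    restart_order = list(TopologicalSorter(deps).static_order())
except CycleError as err:
    restart_order = None
    # err.args[1] is the detected cycle, first and last node the same
    print("no valid restart order, cycle:", err.args[1])
```

      The point is that the cycle is detected before anyone starts bouncing systems at 3am, rather than discovered by watching two systems wait on each other forever.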

      Coat: not my mess so out for a walk

      1. one crazy media

        Re: State of IT

        Dumbing down is not the word. Anyone who can send a text is a network engineer. Anyone who can format a Word document is a software engineer.

        Companies don't want to pay decent wages to educated and qualified engineers, so they get dummies who can't think.

  69. Anonymous Coward
    Anonymous Coward

    Alex Cruz, meet Dido Harding, I'm sure you'll get on really well as you have a lot in common.

  70. Camilla Smythe Silver badge

    I do not get it.

    I've been running Linux on my desktop, vintage 2006, for quite a long time and every time someone fucks up the house electrics it comes back up the same as it was before after swearing at who fucked them up and resetting the breakers.

    1. vikingivesterled

      Re: I do not get it.

      Yes, but: does it run a relational database, did it ever lose power in the middle of an upgrade, and did the power ever come back in a set of surges that fried your power supply? That is why even your home computer should be protected by the cheapest of UPSes.

  71. trev101

    This is going to go down in history as "How not to do IT you depend upon".

    Getting rid of all the staff who knew how to design a resilient system, and the staff who had specialist knowledge of the existing systems, indicates extremely poor judgement by the management.

    1. Destroy All Monsters Silver badge

      This is going to go down in history "How not to do IT you depend upon".

      Since the 70s or so, that book has become fat, fat, fat.

      Right next to it is a thin volume about "successful land wars in Asia waged by people from the European landmass", but that's for another discussion.

  72. Tombola

    Very Old IT Person

    Yes, I'm v old! When I joined an IT dept. there was still a rusting DEUCE in its stores, waiting to be sold for its metal value, which included the mercury in its primitive memory. So I've come thru' it all & am now in retirement. The aspect that I can't understand is that of over 200 observations, none come from any folks that really know what has happened & what is going on.

    Is it the outsourcing to Tata, or what? What is Tata running?

    Is it badly designed & implemented resilience vis-a-vis power supply?

    Or whatever?

    Over 200 "guessers" have been at it, but no inside knowledge that can explain. Readers must include some employees who could say what really is happening. The official statement that it's a power outage suggests that somebody hasn't thrown enough logs into the silly upgraded Drax power station.

    It has to be much more complicated & therefore much more worthwhile knowing about. So let's have no more rabbiting about whether Linux would have saved them, & some inside facts, please!

    1. Anonymous Coward
      Anonymous Coward

      Re: Very Old IT Person

      I agree - we're definitely missing some unofficial info as to what actually happened.

      In the absence of any details, I will add that I'm firmly in the "outsourcing your core / mission-critical IT is beyond dumb" camp.

      I had to review the source code of a system we outsourced to TCS to write - and it made you want to cry: horrible, verbose, unreadable spaghetti code. Twenty lines of gibberish to do what you could have written in about 5 lines in a much cleaner fashion - over and over again. Dealing with systems written by the old-school big-five UK consultancies was bad enough (a few good people and dozens of grads with 'learn programming' books in one hand) - but this was on another level.

      1. Anonymous Coward
        Anonymous Coward

        Re: Very Old IT Person

        A flash from the past for me.

        I used to be extremely good with visual basic.

        The problem is (as it is now with Perl, Python, JavaScript, PHP...) that people felt it was easier, therefore the average programmer was CRAP.

        So my customer outsourced this system to a company in Morocco.

        The result? No passing of parameters. Everything global with a1, a2 variables, ignoring strong typing, etc etc.

        To their credit, it DID work and they tested it perfectly: no bugs.

        But... impossible to maintain... and very expensive at it... so in a few years it had to be scrapped.

        These days user interfaces should aim at being easy to use and maintain. Almost forget efficiency... computers are cheap, just don't do anything stupid.

        As for the snafu... well, when you outsource on price, this is what you get. Outsourcing for the same quality is at least 30% more expensive than doing it in-house.

        If you cannot manage your IT department, even less so the outsourcers...

    2. Anonymous Coward
      Anonymous Coward

      Re: Very Old IT Person

      So let's have no more rabbitting about whether Linux would have saved

      ????

      I haven't seen a single post along those lines.

      Most are speculating about management failures, and Bayesian reasoning indeed indicates that this is likely to be the root cause.

    3. Grunt #1

      Re: Very Old IT Person

      I suspect everyone who knows is working, worn out or resting and has better things to do. The fact there are probably too few of them won't help, no matter what the reason.

      It seems the BA communications plan is to tell no one, including passengers. What puzzles me is that they have good plans for flying incidents; why treat their IT differently?

      1. Aitor 1 Silver badge

        Re: Very Old IT Person

        They see IT as they see plumbing: a cost centre with almost no repercussions for their main gig.

        The fact that they should be a SW powerhouse escapes them.

        1. vikingivesterled

          Re: Very Old IT Person

          Until, like in plumbing, the shit starts to fly, or in this case not fly.

        2. Anonymous Coward
          Anonymous Coward

          Re: Very Old IT Person

          Plumbing is probably the only time you don't want a backup.

      2. ajcee

        Re: Very Old IT Person

        "What puzzles me is they have good plans for flying incidents why treat their IT differently?"

        Because IT is a cost, silly! IT doesn't make any money.

        After all: if you're not directly selling to the punters and bringing in the money, what good are you?

        1. Anonymous Coward
          Anonymous Coward

          Re: Very Old IT Person

          "if you're not directly selling to the punters and bringing in the money, what good are you?"

          Good question, and if the IT Director hasn't got a convincing answer when the CEO asks the question, then the IT Director should get what's coming to him.

          Shame all those BA ex-customers and soon-to-be ex-staff had to suffer before the CEO would finally understand what IT's for, and what happens when it's done badly. Maybe the compensation for this weekend's customers can be deducted from the "executive compensation" approved by the Remuneration Committee, after all, £100M or so presumably isn't much to them..

          1. Hans 1 Silver badge

            Re: Very Old IT Person

            Sadly, the CEO and IT director will blame the proles - you know, the understaffed and underfunded in the trenches, trying to keep the machine rolling while adhering to useless, time-wasting "corporate policies" and always taking the blame for the results of board decisions ...

            Crap, I must be a lefty!

  73. Cincinnataroo

    The standout for me is all the guessing: no decent folk working on this problem feel enough responsibility to humanity to tell us the truth.

    That in itself is a serious issue.

    Does anybody here have insight into the people who are reputed to be doing the work? (Or has some smart person designed a nice family of DNNs, and there are no humans involved at all?)

  74. Wzrd1

    Well, I can't speak to BA, but I know of a case

    Where a US military installation, quite important for wartime communications, entirely lost power to its critical communications centre during the bloody war, due to a single transformer and a dodgy building UPS which was supposed to keep everything operational for all of five minutes, in order to let the standby generators come fully up to stable speed.

    It turned out that, the installation being in a friendly nation in the region, it had lower priority (odd, as US CENTCOM was HQ'd there). So when the battery room full of batteries outlasted their lifetimes and failed, lifecycle replacement, due to budgeting, was not funded.

    Until all war communications to the US failed. A month later the batteries arrived by boat, and then had to endure customs.

    That was all after correcting a lack of monthly generator testing, which management claimed was unheard of, but which the technical control facility supervisors admitted was a regular test that they had forgotten about and which had hence been left out of our monthly SOP.

    That was brought up by myself, the installation IASO, in a shocked outburst when told that the generator had failed and was untested.

    The gaffe in SOP was corrected.

    It then failed again, due to a different transformer exploding, the failure caused by a leak of coolant oil in the desert heat and a flood the week before from a ruptured pipe.

    Not a single one of us dreamed of water from the one-inch pipe leaking onto the calcium carbonate layer directly beneath the sand and flooding into the below-ground diesel tank, displacing the fuel, so that upon need the generator drew fuel from the lines, and then a fine drink of fresh water.

    Yes, another change in SOP: whenever there is a flood within X metres of a below-ground generator fuel supply, test the generator again. The generator had been tested the week before the leak, so was two weeks from the next test.

    Boy, was my face red!

    1. Kreton

      Re: Well, I can't speak to BA, but I know of a case

      I worked with backup software on a number of operating systems, and knew that disaster recovery was more than just a backup of the data. From the number of comments showing diverse practical experience here, this thread seems an invaluable source of information; someone could construct a How-To manual from these comments and invite contributions from others. A new website, perhaps? I don't have the time myself, but I'm sure someone has, so that the next time something like this happens the embarrassment of wagging fingers saying "it was on the web" can be well and truly propagated. Oh, and don't have the IT director and team living too close to the facility: if there is a major incident in the area they will want to secure their families before the computer systems.

      1. Anonymous Coward
        Anonymous Coward

        Re: Well, I can't speak to BA, but I know of a case

        For starters try this..... http://www.continuitycentral.com/

  75. 0laf Silver badge
    FAIL

    Not shocked

    Last time I flew BA (2016) the plane broke before the doors even closed (fuel valve problem). BA reacted as if they'd never seen or even heard of a broken plane before, and as if all their staff had just come off a week-long absinthe and amphetamine bender. They lost a bus load of passengers, who then re-entered T5 without going through security. BA staff were wandering round shouting "I don't know what to do" while the tannoy made automated boarding calls the staff knew nothing about. I've rarely seen such a display of shambolic ineptitude.

    Still the compo (when the ombudsman made them stop ignoring me) was more than the costs of the flight.

    So to see a fuck up of this magnitude, really not surprised at all.

    1. Anonymous Coward
      Anonymous Coward

      EU tech: Vive La Merde

      Airbus?

      French+British

      A cooperation between some who don't care about engineering and some who don't care about service..?

      1. Anonymous Coward
        Anonymous Coward

        Re: EU tech: Vive La Merde

        I've worked with the French and they really care about engineering and were very good.

        OK, I get your point.

  76. RantyDave

    The ol' corrupted backup

    My money's on "oh smeg, the backup's broken". Somehow incorrect data has been written to the backup for the past six months - and you can replicate incorrect data as much as you like, it's still wrong. Hence the enormous delay while they pick through the wreckage and fix tables one by one until the damn thing is working again.

    1. Anonymous Coward
      Anonymous Coward

      Re: The ol' corrupted backup

      "Somehow incorrect data has been being written to the backup for the past six months "

      And presumably nobody's tried a restore, because it's too time consuming and expensive.

      That restore could have been tested on the "DR/test/spare" system which various folks have been mentioning - IF management had the foresight to invest in equipment and skills. But outsourcing isn't about value for money, it's about cheap. Until senior management get to understand that management failures will affect them personally, anyway.
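
      That kind of restore test is cheap to automate. A minimal Python sketch, assuming a hypothetical `restore_fn` hook for whatever the real restore procedure is, and a `.sha256` sidecar file recorded at backup time (both invented here for illustration, not any real BA or vendor interface):

```python
import hashlib
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large backups don't sit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def restore_drill(backup: Path, restore_fn) -> bool:
    """Restore the backup into scratch space and compare the result against
    the checksum recorded when the backup was taken. Six months of silently
    corrupted backups fail this on day one, not on disaster day."""
    recorded = Path(str(backup) + ".sha256").read_text().strip()
    with tempfile.TemporaryDirectory() as scratch:
        restored = restore_fn(backup, Path(scratch))
        return checksum(restored) == recorded
```

      With a dumb file copy as `restore_fn` this is trivially green; the point is wiring in the real restore procedure and running the drill on a schedule.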

      1. vikingivesterled

        Re: The ol' corrupted backup

        Backups and DR testing never got any manager a promotion. It is quite soul-sucking to work for years and years on something that nobody up top ever notices, until, if you are (un)lucky, the one day it is needed. And then you'd probably get criticism for why the main system went down at all and why there was a short break in service, instead of a pat on the back for how quickly and easily it was restored thanks to your decade of quiet labouring to ensure it would be.

        Smart talkers steer well clear of it and work instead on sexy development projects that give new income streams and bonuses, leaving DR to those with a conscience, who will be the first out the door when savings are looked for.

    2. Anonymous Coward
      Anonymous Coward

      Re: The ol' corrupted backup

      What makes you think they ever put the right data on the tapes?

      I have worked for quite a few Fortune 500 companies that did not test their backups because "these systems are not mission critical". Only to find that, well, maybe they are after all - we just marked them "non-critical" to pay less to outsource them. Me being the project leader at the outsourcing company, my response was of course "well, it's a 12x5 service, but we are happy to make an effort - we'll send you the bill".

      And this is how you make money with outsourcing: everything not in the contract, you are happy to provide... at a steep price. That is where you make the money.

  77. TerryH

    Couldn't start the standby...?

    https://www.thesun.co.uk/news/3669536/british-airway-it-failure-outsourced-staff/

    The article claims that outsourced staff didn't know how to start the backup or standby system...?

    Needs to be verified of course, but if true it makes you wonder what other knowledge and procedural gaps there are now. Were there gaps before the outsource deal? Or were they just not handed over? Would they have been handed over if the deal hadn't been rushed? Are TCS actually any good on the whole? How safe do you now feel flying on a BA plane?

    My experience with TCS has been mixed. They are all well meaning and they have a few very good, stand-out technical people, but there are a lot who don't have much experience or knowledge and are just thrown in at the deep end at the client.

    Recently had one of them sat with me and a few colleagues (Windows server infrastructure build team). He had clearly been tasked by his head office with turning each of our 20 to 30 years of experience and knowledge into some sort of script they could then follow to fulfil server build projects... It was painful and, in my opinion, an entirely impossible task. Plus he could not even do a basic install of Windows Server without help to begin with!

    Anyway, a very depressing state of affairs, and perhaps if companies want to cut costs so badly then the directors can consider taking pay cuts instead. Once you've sacked or pissed off a loyal worker you will never get them back.

    1. Anonymous Coward
      Anonymous Coward

      Re: Couldnt start the standby....?

      "Recently had one of them sat with me and a few colleagues (windows server infrastructure build team).He had been clearly been tasked by his head office to turn each of our 20 to 30 years experience and knowledge into some sort of script that they could then follow to fulfill server build projects... It was painful and an entirely impossible task in my opinion. Plus he could not even do a basic install of Windows server without help to begin with!"

      This is one of the problems with outsourcing.

      I once spent a few weeks handing over a system to an Indian guy, and it was built in a language he had no experience in. To his credit, he was a fast learner.

      The other problem is that these companies move people around and have a high turnover of staff. I managed an intranet for a company that got pushed out to India; I later bumped into one of the users, who complained that every couple of months they got new people to deal with. They'd then have to explain what things meant, and things took a long time because the staff didn't know the codebase. Eventually, they moved it back to the UK.

      The best setups I've seen are a mixed thing: a team in the UK of half a dozen employed technical staff, and a load of guys somewhere else. That UK team is there for the long term and knows the codebase pretty well. They can turn around live problems in hours rather than days.

    2. Saj73

      Re: Couldnt start the standby....?

      Absolutely correct. I have been through this with various of these Indian-flavoured companies. All of them rubbish. One in a hundred knows what needs to be done, and knowledge transfer is never enough - how do you expect 20 years of expertise to be transferred to an offshore team who are only interested in working their shift?

  78. Anonymous Coward
    Anonymous Coward

    You get the idea

    Ted Striker: My orders came through. My squadron ships out tomorrow. We're bombing the storage depots at Daiquiri at 1800 hours. We're coming in from the north, below their radar.

    Elaine Dickinson: When will you be back?

    Ted Striker: I can't tell you that. It's classified.

  79. rainbowlite

    I worked with TCS many years ago and found the staff to be very bright and often over-qualified; however, they were rarely allowed to use their brains, as they were constrained by a do-as-the-customer-says mantra. This often led to very inefficient/poor designs or code, perfectly replicated again and again.

    Regardless of who the work gets moved to, if it is not in-house then there can be an immediate disconnect between urgency and pain/responsibility. Even within an organisation, especially where people work across multiple sites and remotely, if the users aren't staring at you and you can't hear the senior managers stomping around, there is naturally less of a driver to burn the extra hours etc.

    We still don't know what happened - I prefer the ideas that it was a peak demand issue sprinkled with a reduction in the capacity at maybe one DC.

    1. LyingMan

      Well... that was eons ago. Now there are two significant changes:

      1. The brightness has faded. Recruitment focuses on the cheap and easy but not-so-bright, who won't jump ship after gaining three or four years of experience. This has been going on for the last 10 to 11 years, but has accelerated in the last 6.

      2. The culture within TCS has also changed. Previously, as many have mentioned, it was 'the customer is king'. Now it is: pass the ball back to the customer with more questions. If the customer asks for a change, ask repeatedly about the specification until the customer cannot answer in any clear form, then implement something minimal that the customer cannot complain fails to meet the requirements. In the meantime, grill the customer for test cases until they lose the will to live (remember that in most companies the business team's turnover is such that, if you ask questions for long enough, the person who asked will have left the team before implementation comes around!) and bake the code to make the test cases pass.

  80. Anonymous Coward
    Anonymous Coward

    No surprise

    I used to work for BA in the IT department until last year. Given the management and outsourcing now in place this latest debacle is no surprise at all. I could say a lot more, but it would just turn in to a rant...

    1. stevenotinit
      Thumb Up

      Re: No surprise

      I'd like to hear your rant, actually. I don't think I'm the only one, and I think BA needs to hear it for their own good, as well as for their customers and IT staff!

      1. This post has been deleted by its author

      2. Anonymous Coward
        Anonymous Coward

        Re: No surprise

        They never listened to any of us when we still worked there, so it wouldn't make a jot of difference now. Management there are living in a parallel universe where nothing can be said against the great idea that is outsourcing. BA used to be a great place to work until 3-4 years ago, the sad thing is that it deteriorated to the extent that I was happy to leave...

  81. Anonymous Coward
    Anonymous Coward

    Laughing stock

    Striker: We're going to have to blow up the computer!

    Elaine Dickinson: Blow ROC?

    [a smiling face appears on the computer]

  82. Grunt #1

    At least Sainsbury's have reacted quickly.

    http://www.continuitycentral.com/index.php/jobs/2015-operational-resilience-manager

    If they can do it, why can't BA?

  83. A Mills

    False savings

    Various sources report that BA will very likely face total compensation claims a great deal north of 110 million pounds. I suspect a much smaller sum could have bought them some decent system redundancy.

    Bean counters in charge of IT, you know it makes sense.

    1. Anonymous South African Coward Silver badge

      Re: False savings

      Ouch.

  84. Anonymous Coward
    Anonymous Coward

    Comment from a Times article.

    From the IT rumour mill

    Allegedly, the staff at the Indian data centre were told to apply some security fixes to the computers in the data centre. The BA IT systems have two parallel systems to cope with updates. What was supposed to happen was that they apply the fixes to the computers of the secondary system and, when all is working, apply them to the computers of the primary system. In this way, the programs all keep running without any interruption.
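
    That two-step procedure (patch the secondary, verify it, only then touch the primary) is simple enough to encode so it cannot be skipped by accident. A minimal Python sketch, where `apply_patch` and `health_check` are hypothetical hooks invented for illustration, not real BA or TCS tooling:

```python
def rolling_patch(sides, apply_patch, health_check):
    """Patch one side at a time, in the given order (secondary first).
    Abort and report if a freshly patched side fails its health check,
    leaving the remaining side(s) untouched and still serving traffic."""
    patched = []
    for side in sides:                      # e.g. ["secondary", "primary"]
        apply_patch(side)
        if not health_check(side):
            return {"ok": False, "patched": patched, "failed": side}
        patched.append(side)
    return {"ok": True, "patched": patched, "failed": None}
```

    The point of the structure is that a failed health check on the secondary stops the run before the primary is ever touched.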

    What they actually did was apply the patches to _all_ the computers. Then they shut down and restarted the entire data centre. Unfortunately, computers in these data centres are used to being up and running for lengthy periods of time. That means that when you restart them, components like memory chips and network cards fail. Compounding this, if you start all the systems at once, the power drain is immense and you may end up with not enough power going to the computers - this can also cause components to fail. It takes quite a long time to identify all the failed hardware and replace it.
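
    The all-at-once power drain is exactly why bring-up runbooks stagger start-up. A minimal Python sketch of greedy batch sizing, with purely illustrative wattages (no real data-centre figures implied):

```python
def boot_batches(servers, inrush_w, steady_w, feed_capacity_w):
    """Group servers into start-up batches so each batch's transient inrush,
    on top of the steady draw of servers already online, fits the feed.
    Greedy: each batch is as large as the remaining headroom allows."""
    batches, online, i = [], 0, 0
    while i < len(servers):
        headroom = feed_capacity_w - online * steady_w
        n = int(headroom // inrush_w)
        if n < 1:
            raise RuntimeError("feed cannot support any further start-ups")
        batch = servers[i:i + n]
        batches.append(batch)
        online += len(batch)
        i += n
    return batches
```

    With, say, 1000 W inrush and 300 W steady draw per box on a 4 kW feed, ten servers come up in batches of 4, 2, 2, 1, 1 rather than all at once.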

    So the claim that it was caused by "power supply issues" is not untrue. Bluntly - some idiot shut down the power.

    Would this have happened if outsourcing had not been done? Probably not, because prior to outsourcing you had BA employees who were experienced in maintaining BA computer systems and knew without thinking what the proper procedures are. To the offshore staff there is no context; they've no idea what they're dealing with - it's just a bunch of computers that need to be patched. Job done, get bonus for doing it quickly, move on.

    1. Anonymous Coward
      Anonymous Coward

      Re: Comment from a Times article.

      Aside from the fact that you don't physically power off servers to install software upgrades, that sounds completely plausible.

    2. Stoneshop Silver badge

      Re: Comment from a Times article.

      Unfortunately, computers in these data centres are used to being up and running for lengthy periods of time.

      True.

      That means, when you restart them, components like memory chips and network cards fail.

      Nonsense; only if you power-cycle them. Just rebooting without power-cycling doesn't matter to memory or network cards. Processors and fans may work closer to full load while booting than under average load, and with them the PSUs will be working harder, but your standard data centre gear can cope with that.

      Compounding this, if you start all the systems at once, the power drain is immense and you may end up with not enough power going to the computers

      Switching PSUs have the habit of drawing more current from the mains as the voltage drops. Which will cause the voltage to drop even more, etc., until they blow a fuse or a circuit breaker trips. But as this lightens the load on the entire feed, it's really quite hard to get a DC to go down this way.
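
      That behaviour is the constant-power load model: the PSU regulates its output, so its mains current is roughly I = P / V and rises as the voltage sags. A back-of-envelope Python sketch with illustrative figures:

```python
def psu_current(power_w: float, mains_v: float) -> float:
    """A switching PSU behaves approximately as a constant-power load on
    the mains: I = P / V, so a sagging input voltage means *rising* input
    current, not falling."""
    return power_w / mains_v

# A 500 W load at a nominal 230 V draws about 2.2 A; if the feed sags to
# 180 V that climbs to about 2.8 A, dragging the voltage down further
# until a fuse blows or a breaker trips.
```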

      - this can also cause components to fail. It takes quite a long time to identify all the hardware that failed and replace it.

      Any operational monitoring tool will immediately call out the systems that it can't connect to; the harder part will be getting the hardware jocks to replace/fix all affected gear.

  85. Paul Hovnanian Silver badge

    Power Supply Issue

    Someone tripped over the cord.

  86. N2 Silver badge

    And don't try to re-book...

    No worries there, Señor Cruz - we won't, so we won't have to bother your call centre either.

  87. simonorch

    memories

    It reminds me of a story from a number of years ago involving a previous company I worked for.

    Brand spanking new DC of which they were very proud. One day the power in the area went down but their lovely shiny generator hadn't kicked in. No problems, just go out with this key card for the electronic lock to get in and start the generator manually...oh, wait..

    1. Kreton

      Re: memories

      When I started my first job over 50 years ago I came across a standby generator, great big thing with a massive flywheel/clutch arrangement. If the AC supply failed, the clutch held by an electro magnet released and the flywheel engaged the diesel generator to start it up. I can't remember the exact figures but I'm sure the replacement power was flowing in less than 2 seconds.

  88. Anonymous Coward
    Anonymous Coward

    More interesting reading : http://www.flyertalk.com/forum/british-airways-executive-club/1739886-outsourcing-prediction-8.html (and elsewhere on FT)

    1. 23456ytrfdfghuikoi876tr

      and interesting DC insights here:

      http://www.electron.co.uk/pdf/ba-heathrow.pdf

      https://cdn1.ait-pg.co.uk/wp-content/uploads/2016/09/British_Airways_Case_Study.pdf

      http://www.star-ref.co.uk/media/43734/case-study-no-02-british-airways.pdf

      http://assets.raritan.com/case_studies/downloads/british-airways-raritan-case-study.pdf

  89. Anonymous Coward
    Anonymous Coward

    Now that things are settling down I suspect the BA senior management are busy polishing their sloping shoulders. Given that they can't blame outsourcing - both because they approved it and also it is generally a bad idea to piss off the people who control your IT lifeblood - my guess is that they will go with "unforeseeable combination of circumstances" plus, of course, "lessons will be learned".

    1. Anonymous Coward
      Anonymous Coward

      Good opportunity

      As I say, a good opportunity to "learn lessons and hire a buddy as infrastructure continuity director".

      And then pay money to another buddy to deeply study the problem... and maybe fire a few of the internal IT guys for not being part of the solution and for keeping on pointing out the problems.

  90. Anonymous Coward
    Anonymous Coward

    Happenings ten years time ago

    This six minute video was released ten years ago by a then-major IT vendor. It features resilient dual data centres, multiple system architectures, multiple software architectures, the usual best in class HA/DR stuff from that era. Stuff that has often been referred to more recently as "legacy" systems - forgetting the full definition, which is "legacy == stuff that works."

    https://www.youtube.com/watch?v=qMCHpUtJnEI

    Lose a site, normal service resumes in a few minutes. Not a few days. Design it right and the failures can even be handled transparently.

    Somewhere, much of what was known back then seems to have been forgotten.

    The video was made by/for and focuses on HP, who aren't around any more as such. But most of the featured stuff (or close equivalent) is still available. Even the non-commodity stuff. Now that even Matt Bryant has admitted that IA64 is dead, NonStop and VMS are both on the way to x86 [1].

    Anyone got BA's CEO's and CIO's email address? Is it working at the moment?

    [1] Maybe commodity x86 isn't such a bright idea given the recent (and already forgotten?) AMT/vPro vulnerabilities. Or maybe nobody cares any more.

    1. Anonymous Coward
      Anonymous Coward

      Re: Happenings ten years time ago

      "Anyone got BA's CEO's and CIO's email address? Is it working at the moment?"

      According to a flight attendant I spoke to after I had a particularly bad time with BA pre flight, it's Alex.Cruz@ba.com. She said she was sick of apologising for BA over the last few years.... And I know what she means. The cost cutting is very evident and perhaps BA should now stand for Budget Airways.

      I note from BBC reports that his last airline apparently had similar issues.

      Anonymous? - yes, I have some notoriously difficult to use air miles which I intend to at least try and use before I finish flying with that airline for good!

      1. Alan Brown Silver badge

        Re: Happenings ten years time ago

        "I have some notoriously difficult to use air miles which I intend to at least try and use before I finish flying with that airline for good!"

        They can generally be used elsewhere within the same alliance.

  91. Tom Paine Silver badge

    72h and counting

    It's Monday lunchtime, and it looks like the PR disaster has greatly magnified how memorable this will be in years to come when people book flights. This is what they should have been googling three days ago:

    Grauniad: "Saving your reputation when a PR scandal hits"

    https://www.theguardian.com/small-business-network/2015/oct/23/save-reputation-pr-scandal-media-brand

    Torygraph: "Six tips to help you manage a public relations disaster"

    http://www.telegraph.co.uk/connect/small-business/how-to-manage-a-public-relations-disaster/

    Forbes: "10 Tips For Reputation And Crisis Management In The Digital World"

    https://www.forbes.com/sites/ekaterinawalter/2013/11/12/10-tips-for-reputation-and-crisis-management-in-the-digital-world/#bc0de87c0c68

    Listening to endless voxpops from very pissed off BA pax, those articles make very interesting reading. BA seems to have confused the "Do" and "Don't" lists...

    They're now saying the famous power failure lasted only a few seconds; agree with above commentards saying that's suggestive of some sort of data replication / consistency issue. Still hungry for the gory details though... come on, someone in the know, post here as AC (from a personal device, obvs)

  92. one crazy media

    Is BA powering their global IT systems from a single power source? If you are going to lie, come up with a good lie. Otherwise you're like Trump.

    1. vikingivesterled

      It is a sad fact of electricity: you can't power the same stuff from two different grids at the same time. It has something to do with how AC power alternates and the need for synchronisation, meaning all failure reactions are of a switch-in nature. This is why you sometimes see the lights blink when the local grid to your door switches to a different source. And sometimes these switch-ins fail, or several happen in series, leading to power surges and failed equipment.

      Most airlines have a main data centre with a main database to ensure the seat you book is not already taken by somebody else booking through a different centre. It is not like tomatoes, where any tomato will do the job. You will not be happy if that specific seat on that specific plane on that specific flight you booked is occupied by that somebody else.

      1. chairman_of_the_bored
        Joke

        Except if you are United...

      2. patrickstar

        The entire grid of a nation is usually in phase (with a few exceptions like Japan).

        However, you can't just go hook up random connections since the current flows need to be managed. And if one feed dies due to a short somewhere, you don't want to be feeding current into that. Etc.

      3. Alan Brown Silver badge

        "You can't power the same stuff from 2 different grids at the same time."

        Yes, you can: It's all in the secret sauce.

        https://www.upssystems.co.uk/knowledge-base/understanding-standby-power/

        And when it's really critical, you DON'T connect the DC directly to the external grid. That's what flywheel systems are for (you can't afford power glitches when testing spacecraft parts as one f'instance)

        http://www.power-thru.com/ (our ones run about 300kW continuous load apiece and have gennies backing them up.)

        For the UK, Caterpillar have some quite nice packaged systems ranging from 250kVA to 3500kVA - and they can be stacked if you need more than that, or built up from a small size as your DC grows.

        http://www.cat.com/en_GB/products/new/power-systems/electric-power-generation/continuous-power-module/1000028362.html

        So, yes. This _is_ a solved problem - and it's been a solved problem for at least a couple of decades.

        1. Anonymous Coward
          Anonymous Coward

          Are there any qualified electrical people on this site?

          If so, please educate us.

  93. Tom Paine Silver badge
    Go

    Interview with Cruz has more detail

    Interview on today's WatO has a lot more lines between which technical detail can be read. Starts about 10m in, after the news bulletin;

    http://www.bbc.co.uk/programmes/b08rp2xd

    1. anthonyhegedus Silver badge

      Re: Interview with Cruz has more detail

      Well he said he won't resign, which hopefully will cost them even more. He forgot to mention anything about incidentals that need to be paid for like onward flights, hotels etc.

      And I didn't hear anything other than that there were local power problems at Heathrow. And it definitely wasn't the outsourced staff or the redundancies. And presumably it wasn't a hack either. I don't suppose they need to be hacked, they're quite capable of ruining themselves on their own anyway.

      Lame excuses, no detail and they're going to shaft the customers and probably reward the CEO.

  94. OFSO

    Only one way to do it - and that is the right way.

    1) The European Space Operations Centre has two independent connections to the German national grid - one entered the site at one end, one at the other. In addition a massive genset was hired for weeks surrounding critical ops, with living accommodation for the operator. (Yes it was that large).

    2) A backup copy of all software and data files was made every day at midnight, encompassing the control centre and the tracking, telemetry & command station main computers (STAMACS) spread around the world.

    In my 25 years there I only remember one massive failure, which is when an excavator driver dug up the mains supply cable - in the days when there was only one power feed. In fact that incident is why there are two power connections today. Maybe someone here will correct me but I do not remember any major computing/software outages lasting more than an hour, before the system was reloaded and back on line.

    Of course we were not trying to cut costs and maximise profits.

    1. tfb Silver badge

      Re: Only one way to do it - and that is the right way.

      So, all your systems were physically close to each other then? That is not 'the right way'.

  95. StrapJock

    Twenty years ago...

    I remember designing network infrastructures on behalf of BAA some 20 years ago (1996 - 2000). Three of them were in the UK, the others in far flung places. Back then all network infrastructures had to be designed to survive catastrophic failure e.g. bomb blast, fire, flood, air crash etc., and be capable of "self-healing" - i.e. have the ability to continue processing passengers within 3 minutes. All the major airlines including BA insisted on these requirements. Not an easy task back then but we managed to get the infrastructure restore down to 30 seconds.

    It seems like BA have sacrificed those standards in favour of saving money and that it now has a system which takes 3 days to restore. I wonder how much that will cost them?

    Progress.

  96. thelondoner

    BA boss 'won't resign' over flight chaos

    "BA chief executive Alex Cruz says he will not resign and that flight disruption had nothing to do with cutting costs.

    He told the BBC a power surge had "only lasted a few minutes", but the back-up system had not worked properly.

    He said the IT failure was not due to technical staff being outsourced from the UK to India."

    http://www.bbc.co.uk/news/uk-40083778

  97. Rosie

    Power supply problem

    You'd think they could come up with something more plausible than that old chestnut.

  98. JPSC

    Never say never

    "We would never compromise the integrity and security of our IT systems"...

    .. unless placing a fundamental part of our infrastructure under the control of a bunch of incompetent foreign workers would increase our CEO's quarterly bonus.

    TFTFY.

  99. Anonymous Coward
    Anonymous Coward

    So one power outage can take down an entire system, and it's not his fault. If I were a major shareholder I'd be outraged and he'd be gone.

  100. Duffaboy
    FAIL

    Here is why they have had a system failure

    They have most likely put someone in charge of IT spend who knows Jack S about IT.

  101. Duffaboy
    Trollface

    Could it be that its still down because

    The IT Support staff also have to fly the planes, and can't fly them because the IT systems are down..

  102. john.w

    When IT is your company

    Airline CEOs used to understand the importance of their IT; a quote from Harry Lucas's 1999 book 'Information Technology and the Productivity Paradox - Assessing the value of Investing in IT'

    Robert Crandall, American (Airlines) recently retired CEO, said that “if forced to break up American, I might just sell the airline and keep the reservations system". It was SABRE developed by AA and IBM - https://www.sabre.com/files/Sabre-History.pdf

    BA should keep the IT in house and sub contract the planes to Air India.

  103. Anonymous Coward
    Anonymous Coward

    I'm enjoying BA's pain ...

    If it isn't the CEO'S fault, whose is it?

    Former BA CEO, Alex Cruz .....

    I gave up on BA (and United) years ago for other reasons.

  104. chairman_of_the_bored

    IT datacentre design is NOT rocket science, but it has to be approached holistically - any major datacentre worth its name will have at least dual feeds to the grid plus generators on standby. But if nobody really understands the importance of the applications running (and why would remote staff in India have a clue), then you end up with non-overlapping uptime requirements. The datacentre (probably) had its power back within the agreed contractual requirements, but nobody seems to have considered the implications of countless linked database servers crashing mid-transaction. If the load sheets for each aircraft relied on comms links back to the UK, didn't anyone consider the possibility of a comms breakdown? Why not a local backup in (say) Excel? It is probably a cheap and fair jibe to attack the Indian outsourcers, but all these things should have been considered years ago.

  105. Pserendipity

    Fake news!

    BA flights were leaving without baggage on Friday, with no explanation.

    BA flights from everywhere but LHR and LGW were flying.

    Now, if MI6 got a whisper of a bomb on a BA plane leaving London over the Bank Holiday weekend, how could the authorities avoid panic and ensure that all passengers and baggage got fully screened?

    Hmmm, how about blaming computer systems (everybody does) the weather (climate change is politically neutral) and outsourced IT (well, the Commonwealth is next on the immigrant exclusion agenda once Brexit is sorted) while playing games with the security levels?

    [I await receipt of a personal D-Notice]

  106. Brock Landers

    Who will do the needful and revert?

    Outsourcing is sanctioned at the top and so no heads will roll. If this debacle is a result of outsourcing to a BobCo in India, then you can bet that the Execs at the top will defend the BobCo. To admit they made a mistake in outsourcing a vital function of their business to a 3rd World Bob Shop would be like the Pope saying "there is no God", just will not happen. The answer will be "MOAR OUTSOURCING".

    Sorry to be the bearer of bad news, but those Directors with their MBA in Utter B*ll*cks hate IT, they hate 1st world IT staff and they hate the fact that they themselves do not understand IT. They'll always do what they can to offload IT to India et al with their mistaken belief that IT people are not smart, anyone can do IT and Indians are the best and the cheapest when compared to 1st World people. It's almost as if such people cannot stand people below them who are smarter than them: the IT guys & gals.

    I seem to remember this rot setting in during the late 90s. Anyone remember when non-IT chumps started to show up in IT depts? They were called "Service Management" armed with their sh-ITIL psychobabble. Then we got the "InfoSec" types who knew feck all about Windows, Servers, Networking, Exchange etc. but barked out orders based on sh*t they were reading on InfoSec forums. And then we got the creeping death known as the "Bobification of IT"...Indians with "Bachelors Degrees" and MCSE certifications. And now we know that cheating on exams and certs is an epidemic in India. Why are directors & managers in the 1st world outsourcing 1st world jobs to 3rd world crooks?

    Who will do the needful? Who will revert?

    1. Anonymous Coward
      Anonymous Coward

      Re: Who will do the needful and revert?

      Oh no, on the contrary... they DO understand that IT ppl tend to be smarter than them but with less people skills. But some DO have people skills.

      Therefore it is very dangerous to have your own IT, as they will potentially know more about running the business than you.. so you prevent all of that by outsourcing it.

      1. Brock Landers

        Re: Who will do the needful and revert?

        Thank you for saying the needful.

  107. Anonymous Coward
    Anonymous Coward

    power surge at 9.30am on Saturday that affected the company's messaging system

    From the Independent. Cruz reportedly said:

    'The IT failure was caused by a short but catastrophic power surge at 9.30am on Saturday that affected the company's messaging system, he said, and the backup system failed to work properly.'

    I wonder if they had a single point of failure in their communications network?

    Having spent my early career in Engineering Projects for a very well known British telecommunications company, when building the new-fangled digital network it was absolutely drilled into us to ensure that there were always hot-standby backups, diverse routing, backup power supplies etc. etc. And we tested it too. I spent weeks of my life going through test plans, ensuring that if you made a 999 call in our area it would be bloody difficult for the call not to get through.

    For us, five nines was not good enough. I doubt BA were operating at three.

    1. Anonymous Coward
      Anonymous Coward

      Re: power surge at 9.30am on Saturday that affected the company's messaging system

      Most power spikes and cuts are very short; the real issue is how long it takes to get back to normal.

      1. Anonymous Coward
        Anonymous Coward

        Re: power surge at 9.30am on Saturday that affected the company's messaging system

        However, when it comes to critical systems, power spikes should not affect you. You have built-in protection for that?

        Power cuts should also not affect you (at least for quite a period). You have battery for immediate takeover without loss of service and then generators to keep the batteries going?

        You do test it at regular intervals as well?
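The "five nines" mentioned above translates directly into permitted downtime per year, which makes the gap between availability targets concrete. A quick illustrative sketch:

```python
# Allowed annual downtime implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

for label, a in [("three nines (99.9%)", 0.999), ("five nines (99.999%)", 0.99999)]:
    print(f"{label}: {downtime_minutes(a):.1f} min/year")
```

Three nines allows roughly 526 minutes of downtime a year; five nines allows barely 5, which is why it demands tested hot standby and diverse routing rather than a single site and good intentions.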

  108. Kreton

    Ticked the wrong box

    The customer complained there was no data on the backup tapes. They had ticked the wrong box in the backup software, and presumably ever since they installed the system all they had backed up was the system, not the data. Much anguish and gnashing of teeth.

    1. Korev Silver badge

      Re: Ticked the wrong box

      Where I used to work, most of the IT was outsourced to someone you'd have heard of. For months they'd been taking monthly backups for archiving; no one at said firm thought it odd that a thousand-person site's main filer only required a single tape! Of course, when we needed them to restore, only then did they realise. This was the excuse we needed to bring the service into our shadow IT group...
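The only reliable defence against a "ticked the wrong box" backup is a periodic test restore that compares what comes back with what went in. A minimal sketch of the idea (the file names are hypothetical):

```python
# Verify a backup by actually restoring a file and comparing checksums --
# the only way to catch a misconfigured backup job before it matters.
import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_restore(archive: Path, member: str, original: Path, workdir: Path) -> bool:
    """Extract one member from the archive and check it matches the original."""
    with tarfile.open(archive) as tar:
        tar.extract(member, path=workdir)
    return sha256(workdir / member) == sha256(original)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    data = tmp / "bookings.db"               # hypothetical data file
    data.write_bytes(b"important passenger data")
    backup = tmp / "backup.tar"
    with tarfile.open(backup, "w") as tar:
        tar.add(data, arcname="bookings.db")
    print(verify_restore(backup, "bookings.db", data, tmp / "restore"))
```

The same principle scales up: restore a sample from every tape on a schedule, diff it against production, and alert when the archive turns out to contain the system image but not the data.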

  109. PyroBrit

    Maintain your generators

    Whilst working at a company in 2001 we had a total power failure to the building. Quite correctly, the UPS maintained the integrity of the server room and the backup generator started up as commanded.

    Upon contacting the power company we were told it would be down for at least an hour. Our building services guy said we had fuel for the generator for 48 hours. Two hours later the generator died.

    Turns out the fuel gauge was stuck on full and the tank was close to empty. Dipsticks are your friend.

    1. Florida1920 Silver badge

      Re: Maintain your generators

      Dipsticks are your friend.

      Sounds more like dipsticks were in charge of the data center.

    2. Aitor 1 Silver badge

      Re: Maintain your generators

      Seen that more than once.

      Also, people expect 3 year old fuel not to clog the filters.

      1. Alan Brown Silver badge

        Re: Maintain your generators

        "Also, people expect 3 year old fuel not to clog the filters."

        Which is why you have circulation pumps, regular run tests, redundant filtration systems and duplicated fuel gauges.

        Of course, if it's all put together by speed-e-qwik building contractors then you'd better be absolutely sure you covered every last nut, bolt and washer in your contract, and specified that any departures _they_ make from the plans are not allowed (else they will, and will then try to charge you for putting it right).
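The arithmetic behind the stuck fuel gauge is trivial, which is exactly why it stings: claimed runtime is only as good as the fuel actually in the tank. A sketch with made-up figures (the litres and burn rate are illustrative assumptions, not anyone's real plant):

```python
# Sanity-check generator endurance against the claimed runtime.
# All figures are illustrative, not real data-centre numbers.

def runtime_hours(usable_litres: float, burn_rate_lph: float) -> float:
    """Hours of generator runtime from usable fuel at a given burn rate."""
    return usable_litres / burn_rate_lph

claimed_tank = 2400.0   # litres, per the (stuck) gauge
actual_tank = 100.0     # litres, per the dipstick
burn = 50.0             # litres per hour under load

print(runtime_hours(claimed_tank, burn))  # 48.0 hours on paper
print(runtime_hours(actual_tank, burn))   # 2.0 hours in reality
```

A monitoring check that cross-references gauge reading, dipstick measurement, and delivery records would have caught the discrepancy long before the lights went out.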

  110. Saj73

    Tata Consultancy.... surprised at incompetency.... I am not

    Typical. I am not surprised that Tata's or HCL's name would have cropped up. All the knowledge walked out when they let go of local resources. Tata and its group are just ticket-takers; I work and see this first hand in my line of business with these 'Indian IT specialists', who are complete rubbish: only 1 in 100 will know what needs to be done. It is mass production and shift work; I will bet my bottom dollar on the challenges they would have had just bringing resources on board, because it was past their working hours. This won't be the first or last time, and the lame excuse is that this is standard practice in the industry. Well, go and have a look at GE and other large organisations who are now going back because of the complete third-class incompetence these outsourcing companies bring.

    It is time for the British public to demand that BA bring its core IT competencies in-house.

    1. Alan Brown Silver badge

      Re: Tata Consultancy.... surprised at incompetency.... I am not

      "This is time for the british public to demand that BA "

      The "British public" can't demand anything. BA is owned by a Spanish private consortium.

      So that's a lack of excellence in engineering OR customer service.

  111. SarkyGit

    Have they said it was the IT kit that had a power failure?

    Ever seen what happens, very quickly, in a DC when the air-con fails?

    It will boil your data and traffic; this could cause plenty of anomalies that a DR site won't recognise as issues until thermal cutouts (hopefully correctly configured) kick in or physical parts fail.

    You can be assured you have many days work in front of you.

  112. PeterHP

    As a retired Computer Power & Environment Specialist I can tell you there is no such thing as a "power surge"; there are voltage spikes that can take out power supplies. This sounds like bullshit from someone who does not know what they are talking about, or who is being fed a line. In my experience most CEOs and IT directors did not know how much money the company would lose per hour if the systems went down, or whether it would stay in business, and did not consider the IT system important to the operation of the company. I bet they now know a DR system would have been cheaper. But they are not alone; I have seen many airline data centres that run on a wing and a prayer.

    1. vikingivesterled

      In fairness to Cruz, he didn't specify what the "power surge" (layman's terms; "voltage spike" in engineering terms) took out. It could have been something as simple as non-UPS'd air-con power supplies being destroyed, plus a lack of environmental alarms going to the right people, leading to overheating before manual intervention. AC can also be notoriously difficult to fix quickly. I have myself used emergency blowers and roll-out ducts to cool an airline's overheating data centre, where the windows were sealed and unopenable to pass pressurised fire-control tests.

  113. ricardian

    Power spikes & surges

    This is something on a much, much smaller scale than the BA fiasco!

    I'm the organist at the kirk on Stronsay, a tiny island in Orkney. Our new electronic organ arrived 4 years ago and behaved well until about 12 months ago. Since then it has had frequent "seizures" which made it unplayable for about 24 hours, after which it behaved normally. Speaking to an agent of the manufacturer, he asked if we had many solar panels or wind turbines. I replied that we did, and that the number was growing quite quickly. He said that these gadgets create voltage spikes which can affect delicate electronic kit, and recommended a "spike buster" mains socket. Since fitting one of these the organ has behaved perfectly. I suspect that with the growing number of wind turbines & solar panels this sort of problem will become more and more noticeable.

    1. vikingivesterled

      Re: Power spikes & surges

      That would probably only be an issue when there is more power produced than can be consumed, meaning the island needs something that can instantly divert, consume or absorb overproduced power: a sizeable battery bank, water/pool heater or similar. Alternatively, if it is not connected to the national grid, the base AC-sync generator/device is not sufficiently advanced.

      1. ricardian

        Re: Power spikes & surges

        Our island is connected to the grid. Orkney has a power surplus but is unable to export the power because the cable across the Pentland Firth is already operating at full capacity. We do have a "smart grid" https://www.ssepd.co.uk/OrkneySmartGrid/

    2. anonymous boring coward Silver badge

      Re: Power spikes & surges

      Although on a smaller scale, an organ mishap can be very humiliating.

    3. Alan Brown Silver badge

      Re: Power spikes & surges

      "He said that these gadgets create voltage spikes which can affect delicate electronic kit "

      If they can 'affect delicate electronic equipment' then something's not complying with standards and it isn't the incoming power that's out of spec....

      Seriously. The _allowable_ quality of mains power hasn't changed since the 1920s. Brownouts, minor dropouts, massive spikes and superimposed noise are _all_ acceptable. The only thing that isn't allowed is serious deviation from the notional 50Hz supply frequency (60Hz in certain countries).

      This is _why_ we use the ultimate power conditioning system at $orkplace - a flywheel

      As for your "spike-buster", if things are as bad as you say, you'll probably find the internal filters are dead in 3-6 months with no indication other than the telltale light on it having gone out.

      If your equipment is that touchy (or your power that bad), then use a proper UPS with brownout/spike filtering such as one of these: http://www.apc.com/shop/th/en/products/APC-Smart-UPS-1000VA-LCD-230V/P-SMT1000I

  114. Anonymous Coward
    Anonymous Coward

    Power spikes etc.

    The difference between your situation and DCs is they have the money to invest in a UPS which conditions the power as well.

    Your point is noted though as there are many endpoints that don't have the same facility.

  115. Anonymous Coward
    Anonymous Coward

    Outsourcery

    Now the BAU function has been outsourced the real bills will arrive. All the changes that are now deemed necessary will be chargeable leading to a massive increase in the IT cost base.

  116. Florida1920 Silver badge
    Holmes

    "fixed by local resources"

    Translation: Two people. One to talk to India on the phone, the other to apply the fixes.

  117. Anonymous Coward
    Anonymous Coward

    Something else will crash on Tuesday.

    The share price.

  118. Dodgy Geezer Silver badge

    I see that El Reg is unable....

    ...to get ANY data leaked from the BA IT staff at all.

    One more advantage of outsourcing to a company which does not speak English...

    1. Anonymous Coward
      Anonymous Coward

      Re: I see that El Reg is unable....

      Is there an Indian version of El Reg?

  119. GrapeBunch Bronze badge

    Real-time redundancy is why Nature invented DNA.

  120. Milton Silver badge

    Hands up ... if you believe this for a second

    Sorry, it won't wash. A single point of catastrophic failure, in 2017, for one of the world's biggest airlines, which relies upon a vast real-time IT system? A "power failure"?

    Even BA cannot be that incompetent. Pull the other one.

  121. anonymous boring coward Silver badge

    OK, so something failed. And they didn't have a working automatic failover. I get that. Embarrassing, and the CEO should go just for that reason.

    What I don't get is how it could take so long to fix it? It must have been absolute top priority to fix within the hour, with extra bonuses and pats on backs to the engineers who quickly brought it back up again. How could it take so long?

    1. pleb

      "It must have been absolute top priority to fix within the hour, with extra bonuses and pats on backs to the engineers who quickly brought it back up again. How could it take so long?"

      They had the engineers all lined up ready to execute the timely fix you speak of. Trouble was, they could not get them booked on a flight over from India.

  122. Anonymous South African Coward Silver badge

    As an interesting aside, how does Lufthansa's IT stack up against BA's IT?

  123. Anonymous Coward
    Anonymous Coward

    My 20p/80p 's worth

    It's 80 degrees and only 20% of staff are in over the long weekend

    80% of the legacy systems knowledge went when 20% of experienced staff were decommissioned

    80% of the time systems can handle 20% data over capacity

    120% of Uks power is available from wind and solar so 80 % of coal/nuclear capacity is off-line

    20% cloud cover and wind dropping to 80% cause sudden massive drop in grid capacity…. causing large voltage spikes

    ‘Leccy fall out agreements briefly swing in to action, BA can use UPS + generators to cover this

    DC switches to UPS whilst only 80% of the 20% under-capacity generators spin up successfully - "the power surge"

    80% of current customer accounts lose critical 20% of their data when twin system can't synch.

    An 80% chance this is 20% right or a 20% chance this is 80% right?

  124. Ian P

    The power (with its backups) will never fail so we don't plan for it.

    I guess it is just a blinkered approach. You convince yourself that the power will be fine, and so you ignore the case where there is a failure; hence chaos when the system that will never fail actually fails. Is it the MD's fault? Yes, for hiring an incompetent IT manager. I'd replace the IT manager once the dust has settled. But are those crucial nuggets of information in his head backed up?

  125. Anonymous Coward
    Anonymous Coward

    Brownout to brownpants

    My take: a "power surge" happened in the form of a brownout, probably triggered by IT.

    Simple scenario: support identify a requirement to do a needful update, which is automated via a management tool. The playbook for doing needful updates states the command sequence to execute; this is fat-fingered (or applied incorrectly) and, rather than progressively rolling out to the estate, it applies to all hosts at once.

    Servers all start rebooting near-simultaneously; the inrush startup currents promptly overload the PDU/UPS/genset, and many failures follow, some physical hardware, some data corruption. Local support are asked to please revert; sadly it's Saturday, most are not in work, and many no longer work for the company.

    Fat finger brownout thus becomes a brownpants moment.
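The fat-finger scenario above is exactly what staggered rollouts guard against: restarting in small waves bounds the inrush load and limits the blast radius of a bad change. A minimal dry-run sketch (the host names are hypothetical, not any real orchestration tool's API):

```python
# Roll a change out in small batches with a pause between waves,
# instead of rebooting the whole estate at once.
import time

def batches(hosts, size):
    """Yield hosts in fixed-size waves."""
    for i in range(0, len(hosts), size):
        yield hosts[i:i + size]

def rolling_restart(hosts, batch_size=2, pause_s=0.0, restart=print):
    """Restart hosts wave by wave; `restart` is injected so this can dry-run."""
    for wave in batches(hosts, batch_size):
        for h in wave:
            restart(h)
        time.sleep(pause_s)  # let inrush current and health checks settle

hosts = [f"app{n:02d}" for n in range(1, 7)]  # hypothetical fleet
done = []
rolling_restart(hosts, batch_size=2, restart=done.append)
print(done)
```

In a real playbook the pause would be replaced by a health check on the just-restarted wave, so a bad patch halts after two hosts rather than the whole estate.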

  126. Anonymous Coward
    Anonymous Coward

    IT outsourcing

    I used to work at BA - not in IT, but in an area that worked with IT day in and day out.

    I, and many others left as the airline and capabilities that we loved gradually got ruined.

    Most of the IT guys and gals were offered generous packages to leave. Three "global suppliers" were chosen, and all work had to be given to them. They would pitch for it, and cheapest nearly always won. Unsurprisingly the good IT people took the packages, and the less good stayed (some good stayed also, but not enough).

    SIP / CAP / CAP2 / FICO / FLY - etc. are all complex systems, and when experience leaves then the support level goes down considerably. I think they have probably cut too much, the senior management don't have enough knowledge of IT to know when one cut is too much. The IT guys were resistant to change so the head was cut off the snake, then lots of yes people remained, and this is where we end up (as well as the complete outages of the website recently).

    To say that this is unrelated to the removal of most of the people who knew how these systems worked is disingenuous. Two data centres with backup power, so I fail to understand how one power surge could take out both of them independently; it sounds more like a failed update/upgrade by inexperienced staff, and then a lack of experienced staff around to fix it.

    Such a shame.

  127. Mvdb

    A summary of what went wrong inside the BA datacenter here.

    http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/

    I hope BA will make the root cause public so the world can learn from it. I doubt, however, that BA will do this.

  128. 0laf Silver badge
    Black Helicopters

    From another forum and a friend of a friend that works with BA IT.

    The outsourcer was told to apply security patches, which they did, and then power-cycled the whole datacenter.

    When it came back up, the power spike popped many network cards and memory modules.

    The outsourcers lacked experience in initiating the DR plan and it didn't work. Or maybe DR wasn't in the contract.

    True or not I dunno.

  129. Anonymous Coward
    Anonymous Coward

    Soft target?

    With all the terrorist risks, on top of the natural causes and cock-ups that will happen, I find it surprising that the locations of the BA DCs are known. Even some idiot loser can work out that hitting the data centres would have an impact out of all proportion to the cost. That being the case, why doesn't BA have a plan that works?

  130. Mvdb

    Another update to my reconstruction of what went wrong:

    A UPS in one of BA's London datacenters failed for some reason. As a result, systems went down. Power was restored within minutes, but not gradually; the resulting power surge damaged servers and networking equipment. This took down many systems, including the Enterprise Service Bus.

    http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/

    Big question: why wasn't a failover to the other datacenter initiated?

    1. Anonymous Coward
      Anonymous Coward

      Probably

      .. because they estimated the failover would take longer than a fix on site. Clearly the wrong decision was made. Or perhaps they had no faith in their DR plan.

    2. Snoopy the coward

      Messaging systems failed to sync...

      From what I have read, they did a failover but they just couldn't resync again, meaning they don't have point-in-time recovery capability for their messaging systems. I'm not sure what messaging systems BA are using, but I know MQ can recover quite easily: it will resend what has failed and will not resend a duplicate.

      But anyway, I think the applications are linked in a very complicated manner and a failover needs to be done in a very strict sequence, or else it will ruin everything, requiring a restore from tape to recover. So the initial failure was the power surge which destroyed some hardware; the failover was initiated but it just couldn't continue from where it went down, thus requiring many hours of manual recovery work to get it up again.
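The MQ behaviour described, resending failures without duplicating them, only keeps applications consistent if consumers are idempotent across a failover. A minimal sketch of deduplicating replayed messages by id (illustrative only; this is not IBM MQ's actual API, and the payloads are made up):

```python
# Idempotent message handling: track seen message ids so a replay
# after failover applies each message exactly once.

class IdempotentConsumer:
    def __init__(self):
        self.seen = set()       # ids already applied
        self.processed = []     # payloads applied, in order

    def handle(self, msg_id, payload):
        """Apply a message once; silently drop replayed duplicates."""
        if msg_id in self.seen:
            return False        # duplicate from a replay -- ignore
        self.seen.add(msg_id)
        self.processed.append(payload)
        return True

c = IdempotentConsumer()
# Original delivery, then a replay of message 1 after a failover:
for mid, body in [(1, "book seat 12A"), (2, "load bag 77"), (1, "book seat 12A")]:
    c.handle(mid, body)
print(c.processed)  # duplicates dropped
```

In production the `seen` set would live in durable storage shared with the business update (or the dedup would ride on the broker's transactional delivery), otherwise a crash between "apply" and "record id" reintroduces the duplicate.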

  131. Captain Badmouth

    Willie Walsh

    on BBC now, still blaming a failure of electrical power to their IT systems. "We know where the problem occurred," he says.

    BBC report that industry experts remain sceptical.

    1. Grunt #1

      Re: Willie Walsh

      Mr White Wa(l)sh.

      "We know what happened but we're still investigating why it happened and that investigation will take some time," he said.

      - We're hoping some other sucker is in the headlines when we publish.

      "The team at British Airways did everything they could in the circumstances to recover the operation as quickly as they could."

      - The recovery they performed was no doubt a fantastic job which pulled BA out of a tailspin at the last minute. The real question is what caused the tailspin.

  132. Anonymous Coward
    Anonymous Coward

    There were companies in the WTC on 9/11 with redundant DCs in New Jersey. The backup DC took over, they didn't lose any data, the file system didn't drop buffers on the floor, etc. And it wasn't Windows or UNIX based. The technology is out there but people don't like "old" proprietary systems... except when it saves them money.