BA systems down globally


Henners

12,230 posts

194 months

Tuesday 30th May 2017
rxe said:
Murph7355 said:
Any data centre worth its salt will be able to withstand a shaky mains supply.
From the locations of the data centres upthread, it sounds like they are home made. Most of my experience is with data centre providers - so Tier 3 or 4, loads of rooms, high levels of professionalism - no shonky installations here. Most "internal" data centres I have seen have not been as good. Single power supplies, haven't changed the UPS batteries for a decade, never tested A/B power since the thing was built, all the comms coming through a single duct.

All of these things are really expensive to retrofit, so if no one at board level is listening, then it will never be fixed.

Looking on the positive side, they're probably listening now....
I have a feeling that rather than a T4 centre with N+x DRUPS, power smoothing etc., they're running out of a rickety old box in a shed, attached to a car battery hehe

Murph7355

37,715 posts

256 months

Tuesday 30th May 2017
anonymous said:
[redacted]
Nah.

That's not a "surge", and even IF they were running really close to capacity, coordinating any volume of machines to start up such that their max demand all hit at exactly the same time would suggest the people doing it are far from incompetent smile

At worst they might have had issues with a proportion of racks. Not the whole data centre.

Without knowing exactly what setup they have, it's impossible to say with any certainty. But the excuses thus far smell of BS.

This will have its root in cost cutting one way or another, and/or management incompetence.

MitchT

15,867 posts

209 months

Tuesday 30th May 2017
Never ceases to amaze me how massive organisations can suffer apocalyptic events because of a power surge. Meanwhile, sitting at home doing IT-critical things like browsing PH and watching Harry's Garage on YouTube, I'm oblivious to any such event because my laptop's battery just takes over if there's an issue with the mains supply.

yajeed

4,892 posts

254 months

Tuesday 30th May 2017
MitchT said:
Never ceases to amaze me how massive organisations can suffer apocalyptic events because of a power surge. Meanwhile, sitting at home doing IT-critical things like browsing PH and watching Harry's Garage on YouTube, I'm oblivious to any such event because my laptop's battery just takes over if there's an issue with the mains supply.
You have a battery powered wifi router?

You're right though - I've no idea why they use servers, buy expensive storage etc. when they could buy a Compaq laptop from PC World and run all of their global infrastructure from it.

MitchT

15,867 posts

209 months

Tuesday 30th May 2017
Overlooking sarcasm, is there a reason why every electronic device that was rated for use in such critical applications couldn't have a battery to provide a buffer of emergency power?

smack

9,729 posts

191 months

Tuesday 30th May 2017
BA's primary DC is Boadicea House (which I am sure has been mentioned in this thread), which was apparently built in the late 1960s and is located within the BA Engineering Base, which most people will recognise as the site with all the larger hangars by Hatton Cross. The building has backup generators, but in recent years they expanded it to a six-generator setup in the car park next to the building.



I am guessing that installation doesn't come cheap, and it no doubt required an upgrade of a lot of the old electrical supply systems (from what I have seen in other large installations).

jonny996

2,616 posts

217 months

Tuesday 30th May 2017
MitchT said:
Overlooking sarcasm, is there a reason why every electronic device that was rated for use in such critical applications couldn't have a battery to provide a buffer of emergency power?
Two reasons: the heat batteries produce and the fire risk they pose.

clonmult

10,529 posts

209 months

Tuesday 30th May 2017
yajeed said:
MitchT said:
Never ceases to amaze me how massive organisations can suffer apocalyptic events because of a power surge. Meanwhile, sitting at home doing IT-critical things like browsing PH and watching Harry's Garage on YouTube, I'm oblivious to any such event because my laptop's battery just takes over if there's an issue with the mains supply.
You have a battery powered wifi router?

You're right though - I've no idea why they use servers, buy expensive storage etc. when they could buy a Compaq laptop from PC World and run all of their global infrastructure from it.
I just swap over to tethering from my mobile ....

My experience of DCs is minimal. I've been supporting systems that are geographically fault-tolerant and can handle one DC being completely pulled from the grid (we've demonstrated the functionality), and the system could be back up and running on the secondary DC within a few minutes - but the Service Management people would have to organise several hours of calls discussing the business continuity plans before we would be allowed to swap things over. Even so, worst case it would be a couple of hours, and as the databases are never more than a second or two out of sync, data loss is going to be minimal.

Full resilience isn't that difficult. But it is expensive. Which is where I suspect there has been a problem - cost cutting will have resulted in the failover processes being pared back to the minimum. No live sync between DCs, maybe relying on a tape backup to restore. And then you find out that the tape backups haven't been done for a while (or they were backing up nothing?).

Power problems, whilst possible, seem highly unlikely. No chance of Cruz being honest with his statements, he's covering his arse.
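
To make that failover trade-off concrete, here is a minimal sketch (in Python, purely illustrative and not based on BA's or any real airline's systems) of the kind of gate described above: the standby DC is only promoted if its replicated database is within a few seconds of the primary. All names and thresholds are assumptions.

from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_ACCEPTABLE_LAG = timedelta(seconds=5)  # "never more than a second or two out of sync"

@dataclass
class DataCentre:
    name: str
    last_applied_txn: datetime  # timestamp of the last transaction applied locally
    is_primary: bool = False

def replication_lag(primary: DataCentre, standby: DataCentre) -> timedelta:
    """How far behind the primary the standby's database currently is."""
    return primary.last_applied_txn - standby.last_applied_txn

def fail_over(primary: DataCentre, standby: DataCentre) -> DataCentre:
    """Promote the standby only if it is close enough to the primary to take over safely."""
    lag = replication_lag(primary, standby)
    if lag > MAX_ACCEPTABLE_LAG:
        raise RuntimeError(f"{standby.name} is {lag} behind the primary; needs a manual decision")
    primary.is_primary, standby.is_primary = False, True
    return standby

In practice the technical switch is the quick part; as the post above notes, it's the hours of business-continuity calls around it that stretch "a few minutes" into "a couple of hours".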

Murph7355

37,715 posts

256 months

Tuesday 30th May 2017
MitchT said:
Overlooking sarcasm, is there a reason why every electronic device that was rated for use in such critical applications couldn't have a battery to provide a buffer of emergency power?
In a decent data centre, in essence, they do - it's called a UPS. It's not intended to provide power on its own for very long, just enough to regain a stable supply either via the grid or generators (or to shut down gracefully).

It's centrally provided within the DC.

Of course there's no law that says you must provide one. But I cannot imagine a company like BA not having this functionality in its DCs. If it doesn't, I refer back to my earlier root cause hypothesis smile
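
For a sense of scale, a back-of-envelope sketch of what that central UPS is sized for - carrying the IT load only until the generators have started and the transfer switch has changed over. The figures below are made up for illustration and have nothing to do with BA's actual installation.

def ups_bridge_minutes(battery_kwh: float, it_load_kw: float) -> float:
    """Minutes of runtime the UPS batteries give at a given IT load."""
    return battery_kwh / it_load_kw * 60.0

# Assumed, illustrative figures only:
generator_start_and_transfer_secs = 60                          # gensets start and take the load
bridge = ups_bridge_minutes(battery_kwh=500, it_load_kw=2000)   # = 15 minutes of battery
print(f"UPS bridge: {bridge:.0f} min available vs ~{generator_start_and_transfer_secs / 60:.0f} min needed")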

clonmult

10,529 posts

209 months

Tuesday 30th May 2017
Murph7355 said:
MitchT said:
Overlooking sarcasm, is there a reason why every electronic device that was rated for use in such critical applications couldn't have a battery to provide a buffer of emergency power?
In a decent data centre, in essence, they do - it's called a UPS. It's not intended to provide power on its own for very long, just enough to regain a stable supply either via the grid or generators (or to shut down gracefully).

It's centrally provided within the DC.

Of course there's no law that says you must provide one. But I cannot imagine a company like BA not having this functionality in its DCs. If it doesn't, I refer back to my earlier root cause hypothesis smile
A DC could potentially be rather large - thousands of servers. A UPS across all of them would possibly not be required; it would be applied on a system-by-system basis. So systems deemed critical would be behind a UPS, and even if they weren't, the failover to the secondary DC would typically be handled fairly quickly (a few hours?).

Which is why the claims of it being a power problem are increasingly unbelievable.

57 Chevy

5,410 posts

235 months

Tuesday 30th May 2017
anonymous said:
[redacted]
Comment from a Times article.
From the IT rumour mill
Allegedly, the staff at the Indian data centre were told to apply some security fixes to the computers in the data centre. The BA IT systems have two, parallel systems to cope with updates. What was supposed to happen was that they apply the fixes to the computers of the secondary system, and when all is working, apply to the computers of the primary system. In this way, the programs all keep running without any interruption.
What they actually did was apply the patches to _all_ the computers.
Very similar to what happened to RBS when it outsourced and then they tried to upgrade the CA7 scheduler.

Edited to add link... https://www.theregister.co.uk/2012/06/25/rbs_natwe...

https://www.theregister.co.uk/2012/06/26/rbs_natwe...

https://www.theregister.co.uk/2012/06/28/rbs_job_c...



Edited by 57 Chevy on Tuesday 30th May 14:06
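
If the rumour is accurate, the intended procedure is the standard rolling pattern sketched below (hypothetical Python, not TCS's or BA's actual tooling): patch the secondary system, check it is healthy, and only then touch the primary, so one working side always remains.

from typing import Callable, Dict, List

def rolling_patch(systems: Dict[str, List[str]],
                  install: Callable[[str], None],
                  healthy: Callable[[str], bool]) -> None:
    """Apply a patch to the secondary system first, verify it, then the primary."""
    for role in ("secondary", "primary"):            # order matters: secondary first
        for host in systems[role]:
            install(host)
        if not all(healthy(host) for host in systems[role]):
            # Stop here: the untouched side keeps running while this one is investigated.
            raise RuntimeError(f"{role} system unhealthy after patching; aborting")

# The failure mode described above is the degenerate version of this loop:
# patch every host on both sides in one pass, with no health check in between,
# and there is nothing intact left to fall back on.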

Smiler.

11,752 posts

230 months

Tuesday 30th May 2017
Murph7355 said:
Smiler. said:
SSE wade in: "what power surge?"

hehe


Still, they might be taking supply from the Grundon CHP plant down the road, which is a bit "rubbish."

wink
Any data centre worth its salt will be able to withstand a shaky mains supply.
Of course. Just in case you missed it.

Piersman2

6,598 posts

199 months

Tuesday 30th May 2017
Sign me up for the scepticism about it being due to the 'power failure'.

Based on my experience of TCS it's far more likely they have fecked up the updating/transporting of code into Production, either by putting it there at the wrong time, or putting the wrong stuff in, or putting it in in the wrong order.

The surprise for me is that they've only taken 3 days to sort it out; it used to take them 3 days just to work out who would look into it.


PurpleTurtle

6,987 posts

144 months

Tuesday 30th May 2017
I am an IT contractor working for a blue chip brand who have globally outsourced their IT to India.

I remain contracted to them because, by the time my bit of the operation came up for outsourcing, so many mistakes had been made elsewhere on the planet that my business-critical application was deemed too important for the Outsource provider to be let loose on solo, due to the reputational damage that an outage such as BA's would create. So now I spend 50% of my time doing some of the things I used to do, and 50% policing an Offshore team.

At the start they throw their best people at it. A handful are good - we have had one or two over the years who we would have hired, had outsourcing not occurred. In the main, however, the Indian staff are fresh out of college, have been chucked on a course in whatever technology it is they are to support, then thrown at the Production application to support it. They have very little experience in business in general, and zero in the business they are supposed to be supporting. If a problem can be scripted down to the Nth degree, such that following steps 1-10 to the letter always results in the same outcome, then they are good, hard workers. Those kinds of thing are the exception rather than the rule. I digress, but I should add that when I started in IT nearly 25 years ago the accepted wisdom was that you added no value for two years - you were a trainee until then - and only really became trusted and productive after that. In my background nobody inexperienced was let anywhere near Production (live) systems. You cut your teeth in Development.

The problem comes when things get any more complicated than basic scripted tasks. Anything requiring some analytical thinking, some liaison with users or some deep understanding of the business is beyond the Outsource staff; they just don't have the experience. To compound that risk, there is the Indian cultural norm of not wanting to 'lose face' and not wanting to admit that they do not know something.

Every week I have this kind of conversation:

Me: "Program XYZ has failed with return code ABC123. Do you know what that means and can you fix it?"
Offshore Techie: "Yes"
Me: "OK, what does it mean?"
OT: (long silence, bit of scratching about, more silence): " err, I don't know"

It is mind-numbingly tedious that they cannot just grasp the simple convention of Western business that we would much rather people said "I'm not sure, but I can find out". I have repeated this until I am blue in the face; nothing changes.

Add into the mix that all of these bodies in a Bangalore office want to be the manager, not the techie (it's a status thing), and none of them help each other out - they are competing with each other for promotion. My man who doesn't know what ABC123 means will never go to his colleague to find out - they would rather bluster around on their own, completely in the dark, before eventually admitting defeat than seek help. As long as they do something (even if it's not the right thing) to in some way justify closure of their incident ticket within SLA, they move on to the next one.

I have had five years of dealing with this, and know many friends in different industries who will tell you the exact same tale.

Bullst is this to do with power sources. That may have been the original cause, but the three-day delay will 100% be down to some very inexperienced people lying through their teeth about what they do or do not know about running these applications. Cruz knows and supports this because he's got a background in Outsourcing and will not publicly go on record slating his provider TCS. In the meantime 700 skilled British IT jobs have gone from BA to India.



Edited by PurpleTurtle on Tuesday 30th May 14:47

pushthebutton

1,097 posts

182 months

Tuesday 30th May 2017
Alex Cruz started with BA in April 2016. He spent a few months prior to joining receiving business-specific briefings in order to be up to speed by April.

The 'boss' of BA doesn't have the same implications as it used to, IMO. It's now Willie Walsh, via IAG, who controls the purse strings and hence dictates the size of the box that Alex Cruz can work within. Most of AC's issues since he joined have centred on budgetary overspend; a budget that was set by WW & IAG which, in turn, is controlled by the 10-year business plan that WW presented to BA/IAG's investors in order to secure future funding - mostly relating to aircraft fleet replacement. It was said at the time that the business plan was overly aggressive and potentially unachievable, but that's where BA are now.

Personally, I'm not sure how much of the blame can be apportioned to AC and how many of his decisions over the last 14 months will be shown to have had a direct effect. Even if it is eventually shown to have been caused/exACerbated by AC outsourcing, ultimately his budgeting decisions will have been directly influenced by the limitations imposed by IAG. I'd hazard a guess that he'll have been told by IAG/WW in no uncertain terms that if he can't deliver then they'll find someone who can: IMO Keith Williams had his hands tied in a similar fashion since the inception of IAG.

If the cause turns out to be related to outsourcing or departmental budgets then I can't see how the blame can be apportioned directly to AC when, in reality, it will have been largely influenced by the diktats of IAG and WW.

Edited by pushthebutton on Tuesday 30th May 15:33

Vaud

50,495 posts

155 months

Tuesday 30th May 2017
PurpleTurtle said:
will not publicly go on record slating his provider TCS.
There is normally something contractual restricting what can be said, at least until the deep dive has been completed.

It may be faults outside of TCS's scope. It may be applications that they are supporting on a "best endeavours" basis, etc.

Especially when you are relying on a supplier for goodwill (at least until it is fixed), it's normally best to hold back from allegations in the press.

Puggit

48,439 posts

248 months

Tuesday 30th May 2017
anonymous said:
[redacted]
Many organisations I've spoken to have moved over to India and come back.

Now they're all starting the journey to the public cloud instead, and the IT circle of life continues...

Vaud

50,495 posts

155 months

Tuesday 30th May 2017
anonymous said:
[redacted]
Yes. Plenty. Our global help desk runs from India and it's really good. 24/7, remote fix, etc. But then we have a good standard build, so it is easier to keep things predictable. Lots of our legacy apps are run out of there... and testing (which is mostly automated).

clonmult

10,529 posts

209 months

Tuesday 30th May 2017
57 Chevy said:
anonymous said:
[redacted]
Comment from a Times article.
From the IT rumour mill
Allegedly, the staff at the Indian data centre were told to apply some security fixes to the computers in the data centre. The BA IT systems have two, parallel systems to cope with updates. What was supposed to happen was that they apply the fixes to the computers of the secondary system, and when all is working, apply to the computers of the primary system. In this way, the programs all keep running without any interruption.
What they actually did was apply the patches to _all_ the computers.
Very similar to what happened to RBS when it outsourced and then they tried to upgrade the CA7 scheduler.

Edited to add link... https://www.theregister.co.uk/2012/06/25/rbs_natwe...

https://www.theregister.co.uk/2012/06/26/rbs_natwe...

https://www.theregister.co.uk/2012/06/28/rbs_job_c...



Edited by 57 Chevy on Tuesday 30th May 14:06
The security patches claim is more believable than a power supply problem, but it still strikes me as highly improbable. Or at least, if this is the case, it makes an absolute mockery of their change management processes.

No way could you get approval for an outage to services across both DCs and all systems. And I find it highly unlikely that anyone in an outsourced support role would actually go ahead and start making such changes without being asked to do so as a result of a change request.... but again, it depends on the processes in place at BA.

And the chances of the real truth coming out on this are slim.
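
A crude sketch of the change-control guard being alluded to here (entirely hypothetical, not BA's actual CAB process): any single change request whose scope spans both data centres in the same window is rejected outright, so at least one side always stays untouched.

from dataclasses import dataclass, field
from typing import Set

@dataclass
class ChangeRequest:
    summary: str
    data_centres: Set[str] = field(default_factory=set)  # e.g. {"DC-A"} or {"DC-A", "DC-B"}

def approvable(cr: ChangeRequest) -> bool:
    """Reject any single change window that touches more than one data centre."""
    return len(cr.data_centres) <= 1

assert approvable(ChangeRequest("Apply security fixes to secondary", {"DC-B"}))
assert not approvable(ChangeRequest("Apply security fixes everywhere", {"DC-A", "DC-B"}))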

bga

8,134 posts

251 months

Tuesday 30th May 2017
Vaud said:
anonymous said:
[redacted]
Yes. Plenty. Our global help desk runs from India and it's really good. 24/7, remote fix, etc. But then we have a good standard build, so it is easier to keep things predictable. Lots of our legacy apps are run out of there... and testing (which is mostly automated).
Plenty of my clients have had similar positive experiences.

The common factor is that they all have been very selective about what is done offshore and/or outsourced.

Plenty of my clients have also had very poor experiences with offshoring and outsourcing. These tend to be the ones who see IT as a commodity with little strategic value. That is not to suggest that all IT is strategic, rather that it should be recognised when it is. Those are also the companies who have brought, or are planning on bringing, support back onshore.