BA systems down globally

Author
Discussion

davebem

746 posts

179 months

Saturday 27th May 2017
quotequote all
Where I used to work we had a data centre in Turin that got taken out by mice chewing the power cables! This was a trucking company, and even then DR was regularly tested and there were diesel generators. I'm flying on one of the first BA flights out of Heathrow tomorrow to meet a cruise ship in Barcelona, so a little nervous tonight.

scrwright

2,672 posts

192 months

Saturday 27th May 2017
quotequote all
I've seen airport infrastructure running without UPS as recently as the start of last year, and more recently UPS systems that had never been tested. The bigger the outfit, the worse the IT (in my experience).

djc206

12,499 posts

127 months

Saturday 27th May 2017
quotequote all
Seventy said:
I'm due to fly to Joburg at 7 tonight.
What would you do? I don't live 10 mins from the airport and there is zero information about my flight on the BA website - in fact there is zero information about anything. They say all flights cancelled until six but they may well just be putting out a time to placate people before changing it later.
Rock and hard place!
One of the Jo'burg flights just departed. No idea if anyone was on it or whether it's effectively a positioning flight

Nardiola

1,176 posts

221 months

Saturday 27th May 2017
quotequote all
Vaud said:
Cold said:
BA are citing a "power supply issue" as the reason behind the IT systems failure.
Uber-legacy app that didn't have failover capability?
I reckon this is a real possibility. When you see the state of the back office systems, they look distinctly 80s. A power failure brought it down, and whoever supports it now can't get it back up, just like NatWest a while back.

Vaud

51,002 posts

157 months

Saturday 27th May 2017
quotequote all
djc206 said:
One of the Jo'burg flights just departed. No idea if anyone was on it or whether it's effectively a positioning flight
If it's the manifest system that is down (?) then maybe they can run a more manual system for crew only to allow repositioning?

jamiem555

756 posts

213 months

Saturday 27th May 2017
quotequote all
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.

djc206

12,499 posts

127 months

Saturday 27th May 2017
quotequote all
Vaud said:
If it's the manifest system that is down (?) then maybe they can run a more manual system for crew only to allow repositioning?
Maybe. There are a couple of BA pilots on the forums but I'm guessing their social media policy is stopping them from posting answers to our questions.

smack

9,732 posts

193 months

Saturday 27th May 2017
quotequote all
smack said:
Seventy said:
I'm due to fly to Joburg at 7 tonight.
What would you do? I don't live 10 mins from the airport and there is zero information about my flight on the BA website - in fact there is zero information about anything. They say all flights cancelled until six but they may well just be putting out a time to placate people before changing it later.
Rock and hard place!
When the s**t hits the fan, BA will do everything to get their long haul planes out and sacrifice short haul; that's what happens with snow/fog etc. The cost of cancelling an A380 return service is massive, so I expect it will go, probably very late, else there could be customers stuck in places like JNB for days who would have to be put up in hotels at the airline's cost.
BA55, your 1900 flight to JNB, is taxiing to take off right now.

Actual

802 posts

108 months

Saturday 27th May 2017
quotequote all
Any system complex enough to be useful is capable of catastrophic failure.

More resilience = more complexity = bigger failure.

Also... what can go wrong will go wrong.

rxe

6,700 posts

105 months

Saturday 27th May 2017
quotequote all
Ob. disclosure: I do this stuff for a living. Not for BA.

What is odd about this one is the fact that it seems to have killed them globally. Most companies are really bad at resilience: they don't spend money on it and they don't test it. But most companies don't have a system that stops 'planes taking off in the USA as a result of a fault in the UK.

There are two issues here:

1) They've had a cock-up. Power may be the initial event, but it is probably software-related now. Bringing up a load of spaghetti when it is in an indeterminate state is really, really hard. Nearly all disaster events are software-related; data centres do fail, but it is very rare.

2) They've built a system that is hugely fragile. It fails, and 'planes all over the world are grounded. No ability to fall back to something else (printouts of passenger manifests?).

The combination is the killer. You can take risks when no one will notice for a day or two. If people are going to notice in seconds, you need to design it to be robust - and you need to have a plan for catastrophic failure.

I remember a real discussion about something very critical that I built about 10 years ago. Extreme scenario: what happens if both data centres are nuked in a limited strike? Possible, as they were less than 100 miles apart. The answer was simple: if that happens, we fall back to a set of manual processes that we practise regularly, and we vault the data on a different continent; a week's data loss was deemed acceptable.
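
To put a number on that last point, here's a minimal sketch of the kind of check you'd run against the off-continent vault copy - entirely hypothetical names and figures, nothing to do with BA's actual estate:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical recovery point objective: a week of data loss was deemed acceptable.
RPO = timedelta(days=7)

def vault_within_rpo(last_vault_utc, now=None):
    """True if restoring from the most recent off-continent vault copy
    would stay within the agreed recovery point objective."""
    now = now or datetime.now(timezone.utc)
    return (now - last_vault_utc) <= RPO

# Pretend the last vault run finished three days ago.
last_vault = datetime.now(timezone.utc) - timedelta(days=3)
if vault_within_rpo(last_vault):
    print("Within RPO: restore from the vault, run the practised manual processes meanwhile.")
else:
    print("Vault copy is stale: restoring would lose more data than the business signed off on.")
```

The point isn't the code, it's that the acceptable data loss was agreed up front and the fallback was rehearsed, so the decision in a disaster is mechanical rather than improvised.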



AndStilliRise

2,295 posts

118 months

Saturday 27th May 2017
quotequote all
Makes our production issue seem small in comparison.

Murph7355

37,947 posts

258 months

Saturday 27th May 2017
quotequote all
ruggedscotty said:
Power outages in data centres...

So much can go wrong. Systems may be designed to deal with an outage - uninterruptible power supplies, diesels etc. to give resilience - and this all fits in with how the data centre is designed. Some skimp and only put the IT equipment on UPS, depending on the diesels kicking in to supply all the non-critical equipment: air handling units, chillers and chilled water circuits, things that don't need to be supported for 10 minutes or so. It cuts the cost of the UPS system - you only need to support, say, 2MW of IT infrastructure out of a total site load of, say, 4MW. Trouble is, if your generators fail then you have no cooling, and that impacts the running of the IT side of things.

The resilience of the system could have a weak point and it's just unfortunate that it's been uncovered today. It's a Saturday as well, so maybe a few maintenance activities were being carried out. I can remember having a data centre on raw mains while they upgraded a UPS - yes it's a risk, but a calculated one, though like all best calculations sometimes there is a wrong shout.

Be interesting to find out what has happened here.
For a company so reliant on technology it is still incompetence of the highest order.

I've seen some woeful set-ups before, but for an organisation like them to be taken out by a power issue is unthinkable.

To the point where I'm inclined to agree with a later poster who suggested the power issue is a cover for "something else".
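
The sizing trade-off in the quote above is easy to sketch. A rough illustration (made-up figures matching the 2MW/4MW example in the quote, not BA's actual site) of why putting only the IT kit on UPS saves money but leaves cooling hostage to the generators:

```python
# Illustrative figures only, echoing the 2MW / 4MW example quoted above.
IT_LOAD = 2.0          # MW of servers, storage and network - on the UPS
MECHANICAL_LOAD = 2.0  # MW of chillers, air handlers, pumps - generator-backed only
UPS_CAPACITY = 2.2     # UPS sized just above the IT load, not the full site

FULL_SITE_UPS = IT_LOAD + MECHANICAL_LOAD   # what a 'UPS everything' design would need
print(f"UPS bought: {UPS_CAPACITY} MW instead of {FULL_SITE_UPS} MW - that's the saving.")

def site_state(mains_ok, generators_ok):
    """Which loads stay powered for a given combination of failures."""
    if mains_ok:
        return {"it": True, "cooling": True}
    it_powered = UPS_CAPACITY >= IT_LOAD or generators_ok   # batteries ride through, then diesels
    cooling_powered = generators_ok                         # cooling was never on the UPS
    return {"it": it_powered, "cooling": cooling_powered}

for mains, gens in [(True, True), (False, True), (False, False)]:
    s = site_state(mains, gens)
    print(f"mains={mains}, generators={gens} -> IT powered: {s['it']}, cooling: {s['cooling']}")
# Lose mains and generators together and the IT kit is still on batteries,
# but with no cooling the room overheats and the servers have to come down anyway.
```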

Funk

26,379 posts

211 months

Saturday 27th May 2017
quotequote all
I work for a VAR selling all this sort of stuff and it staggers me how many businesses don't have a robust DR plan in place. Even a minor outage for an SME could be costly.

Sheepshanks

33,225 posts

121 months

Saturday 27th May 2017
quotequote all
djc206 said:
One of the Jo'burg flights just departed. No idea if anyone was on it or whether it's effectively a positioning flight
Be quite annoying if it's got some passengers on board when they told everyone not to turn up.

Tuna

19,930 posts

286 months

Sunday 28th May 2017
quotequote all
Not the best timing that Die Hard 2 has just started on TV...

Fittster

20,120 posts

215 months

Sunday 28th May 2017
quotequote all
rxe said:
Ob. disclosure: I do this stuff for a living. Not for BA.

What is odd about this one is the fact that it seems to have killed them globally. Most companies are really bad at resilience: they don't spend money on it and they don't test it. But most companies don't have a system that stops 'planes taking off in the USA as a result of a fault in the UK.

There are two issues here:

1) They've had a cock-up. Power may be the initial event, but it is probably software-related now. Bringing up a load of spaghetti when it is in an indeterminate state is really, really hard. Nearly all disaster events are software-related; data centres do fail, but it is very rare.

2) They've built a system that is hugely fragile. It fails, and 'planes all over the world are grounded. No ability to fall back to something else (printouts of passenger manifests?).

The combination is the killer. You can take risks when no one will notice for a day or two. If people are going to notice in seconds, you need to design it to be robust - and you need to have a plan for catastrophic failure.

I remember a real discussion about something very critical that I built about 10 years ago. Extreme scenario: what happens if both data centres are nuked in a limited strike? Possible, as they were less than 100 miles apart. The answer was simple: if that happens, we fall back to a set of manual processes that we practise regularly, and we vault the data on a different continent; a week's data loss was deemed acceptable.
Point 1 I'd agree with; I can't believe a simple power outage is the only problem.

Point 2. No, the idea that a global, critical IT system can be replaced with pen and paper in an emergency is fanciful IMHO.

Fittster

20,120 posts

215 months

Sunday 28th May 2017
quotequote all
Funk said:
I work for a VAR selling all this sort of stuff and it staggers me how many businesses don't have a robust DR plan in place. Even a minor outage for an SME could be costly.
And a DR system won't be costly? I work for a global SI and we will quite happily supply all the availability/resilience you could want, but it's not going to be cheap.

Funk

26,379 posts

211 months

Sunday 28th May 2017
quotequote all
Fittster said:
Funk said:
I work for a VAR selling all this sort of stuff and it staggers me how many businesses don't have a robust DR plan in place. Even a minor outage for an SME could be costly.
And a DR system won't be costly? I work for a global SI and we will quite happily supply all the availability/resilience you could want, but it's not going to be cheap.
It depends on how important the systems are I guess. It doesn't even have to be that expensive; protect core systems and accept that there's likely to be a short-term performance hit if running in DR circumstances. It won't be free, but it needn't cost the earth either.

Matt p

1,039 posts

210 months

Sunday 28th May 2017
quotequote all
It would take something pretty catastrophic to take down the entire power system in a DC. So much backup is normally in place. At least in the DCs I've worked in, it's normally N+1+1+1 regarding chilled water.
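
For anyone not familiar with the notation, a quick sketch of what that level of redundancy buys you - the unit counts here are made up for illustration, not from any real plant:

```python
def survives(required_units, spare_units, failed_units):
    """N+M redundancy: the load is still carried as long as enough units remain."""
    return (required_units + spare_units - failed_units) >= required_units

# Say the chilled water plant needs 4 pumps to carry the load (N = 4).
N = 4
for spares in (1, 3):                       # N+1 versus N+1+1+1
    label = "N+" + "+".join(["1"] * spares)
    for failed in range(5):
        status = "still cooling" if survives(N, spares, failed) else "cooling lost"
        print(f"{label}: {failed} unit(s) failed -> {status}")
```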

Byker28i

61,771 posts

219 months

Sunday 28th May 2017
quotequote all
Matt p said:
It would take something pretty catastrophic to take down the entire power system in a DC. So much backup is normally in place. At least in the DCs I've worked in, it's normally N+1+1+1 regarding chilled water.
Modern datacentres use airflow through hot/cold aisles with no need for fans, chillers etc. They only come on very rarely, which keeps power costs down.
Most usually have dual feed supplies to the datacentre, UPS, large generators etc., so this smacks more of an issue internal to the datacentre rooms. There's probably power on site but they can't get it to where it's needed, i.e. the servers. Seems strange they haven't got a DR site.

I'd suggest the cost of a DR site would easily be a fraction of what yesterday cost them.