BA systems down globally
Discussion
Where I used to work we had a data centre in Turin that got taken out by mice chewing the power cables! This was a trucking company, and even then DR was regularly tested and there were diesel generators. I'm flying on one of the first BA flights out of Heathrow tomorrow to meet a cruise ship in Barcelona, so a little nervous tonight.
Seventy said:
I'm due to fly to Joburg at 7 tonight.
What would you do? I don't live 10 mins from the airport and there is zero information about my flight on the BA website - in fact there is zero information about anything. They say all flights cancelled until six but they may well just be putting out a time to placate people before changing it later.
Rock and hard place!
One of the Jo'burg flights just departed. No idea if anyone was on it or whether it's effectively a positioning flight.
Vaud said:
Cold said:
BA are citing a "power supply issue" as the reason behind the IT systems failure.
Uber legacy app that didn't have failover capability?
Vaud said:
If it's the manifest system that is down (?) then maybe they can run a more manual system for crew only to allow repositioning?
Maybe. There are a couple of BA pilots on the forums but I'm guessing their social media policy is stopping them from posting answers to our questions.
smack said:
Seventy said:
I'm due to fly to Joburg at 7 tonight.
What would you do? I don't live 10 mins from the airport and there is zero information about my flight on the BA website - in fact there is zero information about anything. They say all flights cancelled until six but they may well just be putting out a time to placate people before changing it later.
Rock and hard place!
When the s*** hits the fan, BA will do everything to get their long-haul planes out and sacrifice short haul - when there's snow/fog/etc. that is what happens. The cost to cancel a 380 return service is massive, so I expect it will go, probably very late, else there could be customers stuck in places like JNB for days who will have to be put up in hotels at the airline's cost.
Ob. disclosure: I do this stuff for a living. Not for BA.
What is odd about this one is the fact that it seems to have killed them globally. Most companies are really bad at resilience - they don't spend money on it, they don't test it. But most companies don't have a system that stops 'planes taking off in the USA as a result of a fault in the U.K.
There are two issues here:
1) They've had a cock up. Power may be the initial event, but it is probably software related now. Bringing up a load of spaghetti when it is in an indeterminate state is really, really hard. Nearly all disaster events are software related - data centres do fail, but it is very rare.
2) They've built a system that is hugely fragile. It fails, and 'planes all over the world are grounded. No ability to fail back to something else (print outs of passenger manifests?).
The combination is the killer. You can take risks when no one will notice for a day or two. If people are going to notice in seconds, you need to design it to be robust - and you need to have a plan for catastrophic failure.
I remember a real discussion about something very critical that I built about 10 years ago. Extreme scenario: what happens if both data centres are nuked in a limited strike - possible, as they were less than 100 miles apart? The answer was simple: if that happens, we fall back to a set of manual processes that we practise regularly. And we vault the data on a different continent - a week's data loss was deemed acceptable.
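The fallback ordering described above - primary site, then secondary, then a practised manual process with a pre-agreed data-loss window - can be sketched in a few lines. Everything here is illustrative; the names and checks are assumptions, not any real airline's architecture:

```python
def check_site(site_ok: bool) -> bool:
    """Stand-in for a real health check (heartbeat, probe, etc.)."""
    return site_ok

def resolve_mode(primary_ok: bool, secondary_ok: bool) -> str:
    """Pick the operating mode in the order described above."""
    if check_site(primary_ok):
        return "primary"
    if check_site(secondary_ok):
        return "secondary"
    # Both data centres gone: drop to the practised manual process and
    # accept the pre-agreed data-loss window covered by offsite vaulting.
    return "manual"

print(resolve_mode(True, True))    # -> primary
print(resolve_mode(False, False))  # -> manual
```

The point isn't the code; it's that the "manual" branch exists at all, and is rehearsed before it is ever needed.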
ruggedscotty said:
Power outages in data centres...
So much can go wrong. Systems may be designed to deal with an outage - uninterruptible power supplies and diesels etc. to give resilience - and this all fits in with how the data centre is designed. Some skimp and only put the IT equipment on UPS, depending on the diesels kicking in to supply all the non-critical equipment: things like air handling units, chillers and chilled water circuits, things that don't need to be supported for the 10 minutes or so of ride-through. It cuts the cost of the UPS system - you only need to support say 2 MW of IT infrastructure out of a total site load of say 4 MW. Trouble is, if your generators fail you then have no cooling, and that impacts the running of the IT side of things.
The resilience of the system could have a weak point and it's just unfortunate that it's been uncovered today. It's a Saturday as well, so maybe a few maintenance activities were being carried out. I can remember having a data centre on raw mains while they upgraded a UPS - yes it's a risk, but one that was calculated. Like all best calculations, though, sometimes there is a wrong shout.
It'll be interesting to find out what has happened here.
For a company so reliant on technology it is still incompetence of the highest order.
I've seen some woeful setups before, but for an organisation like them to be taken out by a power issue is unthinkable.
To the point where I'm inclined to agree with a later poster who suggested the power issue is a cover for "something else".
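For a rough sense of the trade-off described in the quoted post (battery-backing only the 2 MW IT load rather than the full 4 MW site), here's the back-of-envelope arithmetic. The 2 MW / 4 MW and ~10 minute figures come from the post; the rest is a simplifying sketch:

```python
# Illustrative UPS-sizing arithmetic for a partial-UPS design:
# battery-back only the IT load, rely on generators for cooling.
it_load_kw = 2000      # critical IT load on UPS (2 MW, from the post)
site_load_kw = 4000    # total site load incl. cooling (4 MW, from the post)
autonomy_min = 10      # ride-through target until the diesels pick up

# Battery energy to carry the IT load alone vs. the whole site:
ups_kwh = it_load_kw * autonomy_min / 60
full_kwh = site_load_kw * autonomy_min / 60

print(round(ups_kwh), "vs", round(full_kwh), "kWh")  # 333 vs 667 kWh
# Half the battery cost - but if the generators fail to start, cooling
# is lost immediately even though the servers still have power.
```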
rxe said:
Ob. disclosure: I do this stuff for a living. Not for BA.
What is odd about this one is the fact that it seems to have killed them globally. Most companies are really bad at resilience - they don't spend money on it, they don't test it. But most companies don't have a system that stops 'planes taking off in the USA as a result of a fault in the U.K.
There are two issues here:
1) They've had a cock up. Power may be the initial event, but it is probably software related now. Bringing up a load of spaghetti when it is in an indeterminate state is really, really hard. Nearly all disaster events are software related - data centres do fail, but it is very rare.
2) They've built a system that is hugely fragile. It fails, and 'planes all over the world are grounded. No ability to fail back to something else (print outs of passenger manifests?).
The combination is the killer. You can take risks when no one will notice for a day or two. If people are going to notice in seconds, you need to design it to be robust - and you need to have a plan for catastrophic failure.
I remember a real discussion about something very critical that I built about 10 years ago. Extreme scenario: what happens if both data centres are nuked in a limited strike - possible, as they were less than 100 miles apart? The answer was simple: if that happens, we fall back to a set of manual processes that we practise regularly. And we vault the data on a different continent - a week's data loss was deemed acceptable.
Point 1 I'd agree with - can't believe a simple power outage is the only problem.
Point 2. No, the idea that a global, critical IT system can be replaced with pen and paper in an emergency is fanciful IMHO.
Funk said:
I work for a VAR selling all this sort of stuff and it staggers me how many businesses don't have a robust DR plan in place. Even a minor outage for an SME could be costly.
And a DR system won't be costly? I work for a global SI and we will quite happily supply all the availability/resilience you could want, but it's not going to be cheap.
Fittster said:
Funk said:
I work for a VAR selling all this sort of stuff and it staggers me how many businesses don't have a robust DR plan in place. Even a minor outage for an SME could be costly.
And a DR system won't be costly? I work for a global SI and we will quite happily supply all the availability/resilience you could want, but it's not going to be cheap.
Matt p said:
Would take something pretty catastrophic to take down the entire power system in a DC. So much back up is normally in place. At least in the DC's I've worked in its normally N+1+1+1 regarding chilled water.
Modern datacentres use airflow through hot/cold aisles with no need for fans, chillers etc. They only come on very rarely, which keeps power costs down.
Most usually have dual-feed supplies to the datacentre, UPS, large generators etc., so this smacks more of an internal issue in the datacentre rooms. There's probably power on site but they can't get it to where it's needed, i.e. the servers. Seems strange they haven't got a DR site.
I'd suggest the cost of a DR site would easily be a fraction of what yesterday cost them.
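The N+1 (or N+1+1+1) point above can be made concrete with a little binomial arithmetic. This is a sketch only - the per-unit failure probability and the "3 chillers needed" figure are made-up assumptions, not real plant data:

```python
from math import comb

def p_insufficient(units: int, needed: int, p_fail: float) -> float:
    """Probability that fewer than `needed` of `units` independent
    units survive a given window, each failing with prob. p_fail."""
    p_ok = 1 - p_fail
    # Sum the probability of exactly k survivors for every k < needed.
    return sum(comb(units, k) * p_ok**k * p_fail**(units - k)
               for k in range(needed))

# Say the hall needs 3 chillers and each has a 1% chance of failing.
print(p_insufficient(3, 3, 0.01))  # N:   no spare (~3% chance of trouble)
print(p_insufficient(4, 3, 0.01))  # N+1: one spare (roughly 50x better)
print(p_insufficient(5, 3, 0.01))  # N+2
```

Each extra redundant unit buys orders of magnitude - which is why the chilled-water plant gets stacked spares while less critical loads don't.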
Gassing Station | News, Politics & Economics