BA systems down globally
Discussion
Fittster said:
Point 1 I'd agree with, can't believe a simple power outage is the only problem.
Point 2. No, the idea that a global, critical IT system can be replaced with pen and paper in an emergency is fanciful IMHO.
Pen and paper is simply an example - though the system I referenced did have a fallback to 60-year-old technology, and that included writing stuff down. They practised it regularly, and it worked.
In this case we have a 'plane full of passengers that can't take off because of a random IT failure. The 'plane is fine, the passengers are fine, but it sounds like they can't get the manifest. An example of a fallback would be storing copies of the manifests as PDFs (universally readable) at a different location. Manifests change, but locking away known-good versions at hourly intervals would mean that the 'plane takes off, and the only passengers annoyed would be those who had booked at the last minute.
We should be in a position where the 'planes are flying, and all the effort is being focused on sorting bookings for early next week. That would give them enough time to recover the situation.
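The hourly "locked-away manifest copies" idea above can be sketched in a few lines. To be clear, this is a hypothetical illustration of the fallback being proposed, not anything BA actually runs; the function name, file layout and JSON format are all invented:

```python
# Hypothetical sketch: every hour, write the current passenger manifest to an
# independent archive as a plain, universally readable file, then lock it so
# the known-good copy can't be edited later.
import json
import time
from pathlib import Path

def snapshot_manifest(flight_id: str, passengers: list[dict],
                      archive_dir: Path) -> Path:
    """Write a timestamped, read-only copy of a flight's manifest."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
    out = archive_dir / f"{flight_id}-{stamp}.json"
    out.write_text(json.dumps({"flight": flight_id,
                               "generated_utc": stamp,
                               "passengers": passengers}, indent=2))
    out.chmod(0o444)  # read-only: lock the known-good version
    return out
```

Run from a scheduler every hour against a store at a different location and you get exactly the "annoy only the last-minute bookers" property described above.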
I've worked with a company that had a global IT failure that lasted rather a long time (no names).
The problem with DR is that as your systems get more complex, distributed and layered, the dependencies get harder to understand. You can have all the failover you like on your shiny new system, but there will be bits of legacy kit, network routing, unreplicable systems and other nonsense that mean that when it genuinely does go south, bringing it back up can be a nightmare.
Google and Amazon have found the same - on paper their architecture is redundant, distributed, replicated and backed up. Then one bit of kit fails, or a normal everyday procedure trips up, and you have a serious outage. DR practice usually takes down a datacenter and confirms it comes back up neatly, but doesn't accurately replicate random parts of the overall infrastructure failing in unpredictable ways.
Good luck to the guys having to sort this lot out.
No one has mentioned this happened on one of the hottest days. I will wager the DC provider got greedy and signed up to load shedding - this is where they step off grid power onto generator at times when the grid can't meet demand, and demand would have been high with the air con on max. Thing is, with most shedding schemes you have to give control of your generators over to the grid, and they don't tell you when they are switching you; if you line that up with poor change management, you then have the potential for a switch to happen at the same time as generator maintenance.
Just a guess but a very well educated one
Funk said:
Fittster said:
Funk said:
I work for a VAR selling all this sort of stuff and it staggers me how many businesses don't have a robust DR plan in place. Even a minor outage for an SME could be costly.
And a DR system won't be costly? I work for a global SI and we will quite happily supply all the availability/resilience you could want, but it's not going to be cheap.
Compared to what? Reputational damage on top of the direct costs of fines, compensation etc.
You're also assuming an identical hot standby facility, which isn't necessarily what they'd choose to do.
All that said, DR is tough to get right, especially when things fail in an unpredicted manner.
Tuna said:
I've worked with a company that had a global IT failure that lasted rather a long time (no names).
The problem with DR is that as your systems get more complex, distributed and layered, the dependencies get harder to understand. You can have all the failover you like on your shiny new system, but there will be bits of legacy kit, network routing, unreplicable systems and other nonsense that mean that when it genuinely does go south, bringing it back up can be a nightmare.
Um... no.
I used to work for a university in IT. Universities these days typically have a lot of complex systems, as many courses are run online. We had an outage due to a contractor leaving a box in the datacentre's power room (because the particulate matter that boxes shed is quite flammable, taking boxes into a data hall or power room is a huge no-no). The box caught fire and we lost power in the primary DC.
We were up and running in DR within 2 hours (the business had an SLA of 4). We didn't DR 100% of everything, but enough that the university could continue business whilst we fixed the main problem.
So the solution to complexity is to skip DR for your peripheral systems and only DR your core business systems. There are a lot of technologies out there that make this very simple: every SAN has mirroring capabilities, VMware SRM allows you to replicate entire sites verbatim and automate DR, and so on. Sure, this stuff isn't cheap, but neither is grounding a major international airline.
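The "DR the core, skip the periphery" approach boils down to tagging systems and only failing over the ones that matter within the SLA. A toy sketch, with invented system names and tiers:

```python
# Toy illustration: classify systems into DR tiers and produce a failover
# plan covering only tier-1 (core business) systems within the SLA window.
from dataclasses import dataclass

@dataclass
class System:
    name: str
    tier: int          # 1 = core business, 2 = important, 3 = peripheral
    rto_hours: float   # recovery time objective for this system

def failover_plan(systems: list[System], sla_hours: float) -> list[str]:
    """Return the core systems, most urgent first, that must be up within the SLA."""
    core = [s for s in systems if s.tier == 1 and s.rto_hours <= sla_hours]
    return [s.name for s in sorted(core, key=lambda s: s.rto_hours)]
```

Peripheral systems simply never enter the plan, which is what keeps the DR exercise tractable as the estate grows.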
When you need 100% uptime, you run two live datacentres in different locations that are mirrored or, at the very least, within a few minutes of synchronisation. I've done this for banks where a few minutes' downtime is millions of pounds; it's not a difficult thing, but it's not a cheap thing either.
So I'm not buying the power failure story. If power fails, you switch over to secondaries. If systems fail so catastrophically that you have a site-wide power outage, you switch over to your DR site... and if that doesn't work, power outages shouldn't take that long to resolve, as datacentres have every kind of required trade on retainer. Tata are not that stupid, and neither are BA, that any of this would have been missed. Someone stuffed up, and DR is there to protect you against equipment failure, not human error.
jamiem555 said:
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.
Explains what? That you've had a bad SAP implementation? SAP is only as good as the people implementing it; there's good and bad, as with most things. Those that slag it off usually know nothing about it.
Cupramax said:
jamiem555 said:
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.
Explains what? That you've had a bad SAP implementation? SAP is only as good as the people implementing it; there's good and bad, as with most things. Those that slag it off usually know nothing about it.
More than 248,500 customers in 188 countries
86% of Forbes 500 / 100% of Fortune 100
98% of the 100 most valued brands
SAP customers produce
78% of the world’s food
82% of the world’s medical devices
69% of the world’s toys and games
74% of the world’s transaction revenue touches an SAP system
SAP touches US$16 trillion of retail purchases around the world (Ariba)
They may do passenger manifest on SAP, I don't know.
jamiem555 said:
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.
We implemented SAP a couple of years ago and we are still using Aggresso for some of our business-critical stuff (and then posting transactions manually into SAP). IMHO it's simply too big for most organisations, and the quality of SAP consultants is variable.
Tuna said:
I've worked with a company that had a global IT failure that lasted rather a long time (no names).
The problem with DR is that as your systems get more complex, distributed and layered, the dependencies get harder to understand. You can have all the failover you like on your shiny new system, but there will be bits of legacy kit, network routing, unreplicable systems and other nonsense that mean that when it genuinely does go south, bringing it back up can be a nightmare.
Erm, sorry, but just no; that comment just exposes your lack of knowledge.
Very easy to have replicated systems through virtualisation; it's just the cost of the standby site/equipment and the cost of the lines. How much it costs depends on the period of data loss you can survive. Systems can be brought back online very promptly.
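That trade-off - the period of data loss you can survive (the RPO) versus what you pay for replication - can be put in rough numbers. A hedged sketch, with all figures invented:

```python
# Toy model: the replication interval you pay for bounds the data you can
# lose. Worst case is a failure just before the next sync, plus the time it
# takes to detect the failure and cut over.
def max_data_loss_minutes(replication_interval_min: float,
                          failover_detection_min: float) -> float:
    """Worst-case minutes of lost data for a given replication interval."""
    return replication_interval_min + failover_detection_min

def cheapest_interval(rpo_min: float, detection_min: float,
                      offered_intervals: list[float]) -> float:
    """Pick the longest (i.e. cheapest) replication interval that still meets the RPO."""
    ok = [i for i in offered_intervals
          if max_data_loss_minutes(i, detection_min) <= rpo_min]
    if not ok:
        raise ValueError("no offered interval meets the RPO")
    return max(ok)
```

The shorter the interval you need, the more bandwidth and kit you pay for - which is exactly why the business has to decide how much data loss it can actually survive before anyone prices the lines.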
jonny996 said:
....... Thing is with most shedding schemes you have to give control of your generators over to the grid & they don't tell you when they are switching you, if you line that up with poor change management you then have potential for a switch to happen at same time as generator maintenance.
Just a guess but a very well educated one
Sounds like you are talking about flextricity - whilst they can start your gennys at any time, you can manually take them offline and mark them unavailable.
Would we even have demand control during the day and not in the middle of darkest winter?
Cupramax said:
Countdown said:
jamiem555 said:
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.
We implemented SAP a couple of years ago and we are still using Aggresso for some of our business-critical stuff (and then posting transactions manually into SAP). IMHO it's simply too big for most organisations, and the quality of SAP consultants is variable.
Most issues with SAP are about how it is implemented. The tool works very well, but companies do the following things to screw it up:
- Think it's an IT project, so don't have business engagement at the right level
- Make SAP change to work the way their company works, including all nuances
- Don't do anywhere near enough change management/training
- Focus on too much of the small stuff and lose focus on making the core work
- Try to do too much change at once
The jiffle king said:
Most issues with SAP are about how it is implemented. The tool works very well, but companies do the following things to screw it up:
- Think it's an IT project, so don't have business engagement at the right level
- Make SAP change to work the way their company works, including all nuances
- Don't do anywhere near enough change management/training
- Focus on too much of the small stuff and lose focus on making the core work
- Try to do too much change at once
I agree - it's a business change project with an IT component attached.
Seems a coincidence that there was all the NHS and other hacking; wonder if they implemented some countermeasures that have backfired. Been there....
Or a disgruntled employee of course.
Always about the cost with IT: management going on about synergies and all that bollocks, which means sacking folk and offshoring it, then they wonder why stuff doesn't happen. Still, it can give the shareholders a nice dividend. They rely on folk doing daft hours and getting by by the skin of their teeth. Well, they fked up this time; people remember ruined holidays and cancelled meetings.
Stuff like this can put even big companies out of business. Bet United are laughing, as they are old news now.
I had a terrible experience a few years ago, stranded for days at the hands of Vueling with no help from them whatsoever to get home; everyone on their cancelled flights was left entirely to their own devices. Now I see that the current BA chairman and CEO came from Vueling. I guess he brought more than his hi-viz with him, and suddenly BA's decline over the last 18 months makes sense.
Wonder if IAG will let him stay on, or is he toast by Tuesday? Still 73 cancelled flights at LHR today and a hell of a lot of anger on Twitter.
Gassing Station | News, Politics & Economics