BA systems down globally

rxe

6,700 posts

103 months

Sunday 28th May 2017
Fittster said:
Point 1 I'd agree with, can't believe a simple power outage is the only problem.

Point 2. No, the idea that a global, critical IT system can be replaced with pen and paper in an emergency is fanciful IMHO.
Pen and paper is simply an example - though the system I referenced did have a fallback to 60-year-old technology, and that included writing stuff down. They practised it regularly, and it worked.

In this case we have a 'plane full of passengers that can't take off because of a random IT failure. The 'plane is fine, the passengers are fine, but it sounds like they can't get the manifest. An example of a fallback would be storing copies of the manifests as PDFs (universally readable) at a different location. Manifests change, but locking away known good versions at hourly intervals would mean that the 'plane takes off, and the only passengers annoyed would be those who had booked at the last minute.
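Something like the sketch below would do it. This is purely illustrative - it assumes the manifests are already exported as PDFs into a directory, and every path and name here is invented rather than anything BA actually uses:

# Minimal sketch of an hourly "known good" manifest snapshot.
# All paths are hypothetical; the hourly trigger would come from cron or similar.
import shutil
from datetime import datetime, timezone
from pathlib import Path

PRIMARY = Path("/ops/manifests/current")            # assumed export from the live system
FALLBACK = Path("/mnt/offsite/manifest-snapshots")  # assumed independent location

def snapshot_manifests() -> Path:
    """Copy every current manifest PDF into a timestamped folder at the fallback site."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H00Z")
    dest = FALLBACK / stamp
    dest.mkdir(parents=True, exist_ok=True)
    for pdf in PRIMARY.glob("*.pdf"):
        shutil.copy2(pdf, dest / pdf.name)  # copy2 keeps timestamps for auditing
    return dest

if __name__ == "__main__":
    # Run hourly (e.g. cron "0 * * * *") so a known-good set exists for every hour.
    print(f"Snapshot written to {snapshot_manifests()}")

Anyone at an outstation could then print the latest folder and board against it.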

We should be in a position where the 'planes are flying, and all the effort is being focused on sorting bookings for early next week. That would give them enough time to recover the situation.

Tuna

19,930 posts

284 months

Sunday 28th May 2017
I've worked with a company that had a global IT failure that lasted rather a long time (no names).

The problem with DR is that as your systems get more complex, distributed and layered, the dependencies get harder to understand. You can have all the failover you like on your shiny new system, but there will be bits of legacy kit, network routing, unreplicable systems and other nonsense that mean that when it genuinely does go south, bringing it back up can be a nightmare.

Google and Amazon have found the same - on paper their architecture is redundant, distributed, replicated and backed up. Then one bit of kit fails, or a normal everyday procedure trips up, and you have a serious outage. DR practice usually takes down a datacentre and confirms it comes back up neatly, but it doesn't accurately replicate random parts of the overall infrastructure failing in unpredictable ways.
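The closest I've seen to testing that properly is a game day that fails one random component at a time instead of the whole site - something along the lines of this sketch. It's illustrative only, not anyone's real tooling, and the component names are invented:

# Illustrative game-day helper: pick one random dependency to take offline, rather
# than failing over a whole datacentre, so the awkward hidden dependencies get hit.
# The inventory below is entirely made up.
import random

DEPENDENCIES = {
    "legacy-booking-db": ["check-in", "load-sheet", "crew-rostering"],
    "edge-router-3": ["check-in", "baggage-reconciliation"],
    "licence-server": ["departure-control"],
}

def pick_failure() -> tuple[str, list[str]]:
    """Choose one component to 'fail' and list what should be re-tested afterwards."""
    component = random.choice(list(DEPENDENCIES))
    return component, DEPENDENCIES[component]

if __name__ == "__main__":
    component, affected = pick_failure()
    print(f"Game day: take {component} offline, then verify: {', '.join(affected)}")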

Good luck to the guys having to sort this lot out.

jonny996

2,612 posts

217 months

Sunday 28th May 2017
No one has mentioned that this happened on one of the hottest days of the year. I'll wager the DC provider got greedy and signed up to load shedding: this is where they step off grid power onto generators at times when the grid can't meet demand, and demand would have been high with the air con on max. The thing is, with most shedding schemes you have to hand control of your generators over to the grid, and they don't tell you when they're switching you. If you line that up with poor change management, you have the potential for a switch to happen at the same time as generator maintenance.
Just a guess, but a very well educated one.

Seventy

5,500 posts

138 months

Sunday 28th May 2017
I have managed to get myself onto a Qatar flight to Doha at 4 today and then on to Joburg.
I was ok as I just went home yesterday.
Not sure how it works re EU 261 as we've been rebooked.

Fittster

20,120 posts

213 months

Sunday 28th May 2017
Funk said:
Fittster said:
Funk said:
I work for a VAR selling all this sort of stuff and it staggers me how many businesses don't have a robust DR plan in place. Even a minor outage for an SME could be costly.
And a DR system won't be costly? I work for a global SI and we will quite happily supply all the availability/resilience you could want, but it's not going to be cheap.
It depends on how important the systems are I guess. It doesn't even have to be that expensive; protect core systems and accept that there's likely to be a short-term performance hit if running in DR circumstances. It won't be free, but it needn't cost the earth either.
Double hardware, double licences, significant admin effort. It soon tots up to a big cost.

yajeed

4,891 posts

254 months

Sunday 28th May 2017
Compared to what? Reputational damage on top of the direct costs of fines, compensation etc.

You're also assuming an identical hot standby facility, which isn't necessarily what they'd choose to do.

All that said, DR is tough to get right, especially when things fail in an unpredicted manner.

glasgow mega snake

1,853 posts

84 months

Sunday 28th May 2017
I work for a global acronym provider and this thread has doubled our Q2 turnover

Edited by glasgow mega snake on Sunday 28th May 14:16

captain_cynic

11,972 posts

95 months

Sunday 28th May 2017
Tuna said:
I've worked with a company that had a global IT failure that lasted rather a long time (no names).

The problem with DR is that as your systems get more complex, distributed and layered, the dependencies get harder to understand. You can have all the failover you like on your shiny new system, but there will be bits of legacy kit, network routing, unreplicable systems and other nonsense that mean that when it genuinely does go south, bringing it back up can be a nightmare.
Um... no.

I used to work for a university in IT. Universities these days typically have a lot of complex systems, as many courses are run online. We had an outage due to a contractor leaving a box in the datacentre's power room (the particulate matter that boxes shed is quite flammable, so taking boxes into a data hall or power room is a huge no-no). The box caught fire and we lost power in the primary DC.

We were up and running in DR within 2 hours (the business had an SLA of 4). We didn't DR 100% of everything, but enough that the university could continue business whilst we fixed the main problem.

So the solution to complexity is to not bother with DR for your peripheral systems and only DR your core business systems. There are a lot of technologies out there that make this very simple: every SAN has mirroring capabilities, VMware SRM lets you replicate entire sites verbatim and automate DR, and so on. Sure, this stuff isn't cheap, but neither is grounding a major international airline.
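To make that concrete, here's a minimal sketch of the tiering idea - the inventory, tiers and RTOs below are invented for illustration, not how any real airline classifies its estate:

# Sketch of "DR the core, skip the periphery": only tier-1 systems are in DR scope.
# System names, tiers and RTO figures are invented.
from dataclasses import dataclass

@dataclass
class System:
    name: str
    tier: int          # 1 = core business, 3 = peripheral
    rto_minutes: int   # how quickly it has to be back

INVENTORY = [
    System("departure-control", tier=1, rto_minutes=30),
    System("check-in", tier=1, rto_minutes=60),
    System("staff-intranet", tier=3, rto_minutes=1440),
]

def dr_scope(inventory: list[System]) -> list[System]:
    """Return only the systems that get replicated to the DR site."""
    return [s for s in inventory if s.tier == 1]

if __name__ == "__main__":
    for s in dr_scope(INVENTORY):
        print(f"Replicate {s.name} (RTO {s.rto_minutes} min)")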

When you need 100% uptime, you run two live datacentres in different locations that are mirrored, or at the very least within a few minutes of synchronisation. I've done this for banks where a few minutes of downtime is millions of pounds. It's not a difficult thing, but it's not a cheap thing either.
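What you actually watch day to day is replication lag against the agreed RPO - something like the sketch below, where the lag figures would come from your SAN or replication tooling and every number here is made up:

# Illustrative RPO check for mirrored sites: flag any system whose secondary copy
# is further behind than the agreed data-loss window. All figures are invented.
RPO_SECONDS = 300  # assumed target: no more than five minutes of data loss

def breaches(lag_by_system: dict[str, float], rpo: int = RPO_SECONDS) -> list[str]:
    """Return the systems whose replication lag exceeds the RPO."""
    return [name for name, lag in lag_by_system.items() if lag > rpo]

if __name__ == "__main__":
    observed = {"bookings": 42.0, "departure-control": 15.0, "loyalty": 610.0}
    for name in breaches(observed):
        print(f"ALERT: {name} is outside the {RPO_SECONDS}s RPO")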

So I'm not buying the power failure story. If power fails, you switch over to secondaries. If systems fail so catastrophically that you have a site-wide power outage, you switch over to your DR site... and if that doesn't work, power outages shouldn't take that long to resolve, as datacentres have every kind of required trade on retainer. Tata are not so stupid, and neither are BA, that any of this would have been missed. Someone stuffed up, and DR is there to protect you against equipment failure, not human error.

dudleybloke

19,805 posts

186 months

Sunday 28th May 2017
Would have thought that BA would be in the cloud.
wink

Cupramax

10,478 posts

252 months

Sunday 28th May 2017
jamiem555 said:
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.
Explains what? That you've had a bad SAP implementation? hehe SAP is only as good as the people implementing it; there's good and bad, as with most things. Those that slag it off usually know nothing about it.

Vaud

50,426 posts

155 months

Sunday 28th May 2017
Cupramax said:
jamiem555 said:
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.
Explains what? That you've had a bad SAP implementation? hehe SAP is only as good as the people implementing it; there's good and bad, as with most things. Those that slag it off usually know nothing about it.
Quite. Though it will vary with the scale of the implementation:

- More than 248,500 customers in 188 countries
- 86% of the Forbes 500 / 100% of the Fortune 100
- 98% of the 100 most valued brands

SAP customers produce:
- 78% of the world's food
- 82% of the world's medical devices
- 69% of the world's toys and games

74% of the world's transaction revenue touches an SAP system, and SAP touches US$16 trillion of retail purchases around the world (Ariba).

They may do passenger manifests on SAP, I don't know.

Countdown

39,824 posts

196 months

Sunday 28th May 2017
jamiem555 said:
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.
roflroflroflroflroflrofl

We implemented SAP a couple of years ago and we are still using Aggresso for some of our business-critical stuff (and then posting transactions manually into SAP). IMHO it's literally too big for most organisations and the quality of SAP consultants is variable.

Byker28i

59,569 posts

217 months

Sunday 28th May 2017
Tuna said:
I've worked with a company that had a global IT failure that lasted rather a long time (no names).

The problem with DR is that as your systems get more complex, distributed and layered, the dependencies get harder to understand. You can have all the failover you like on your shiny new system, but there will be bits of legacy kit, network routing, unreplicable systems and other nonsense that mean that when it genuinely does go south, bringing it back up can be a nightmare.
Erm, sorry, but just no - that comment just exposes your lack of knowledge.

It's very easy to have replicated systems through virtualisation; it's just the cost of the standby site/equipment and the cost of the lines, and how much it costs depends on the period of data loss you can survive. Systems can be brought back online very promptly.
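To give a feel for why the data-loss window drives the cost of the lines, here's a back-of-envelope sketch - the change-rate profile is invented purely for illustration:

# Rough illustration: with a long RPO the link only has to keep up with the average
# change rate, but a short RPO means keeping up with the worst burst in any window.
# The change-rate profile below is made up.
def required_mbps(gb_changed_per_minute: list[float], rpo_minutes: int) -> float:
    """Worst-case data written in any RPO-sized window, as a sustained Mbit/s."""
    worst_gb = max(
        sum(gb_changed_per_minute[i:i + rpo_minutes])
        for i in range(len(gb_changed_per_minute) - rpo_minutes + 1)
    )
    return worst_gb * 8 * 1000 / (rpo_minutes * 60)

if __name__ == "__main__":
    profile = [0.2] * 60          # a quiet hour...
    profile[20:25] = [5.0] * 5    # ...with one five-minute burst of writes
    for rpo in (60, 15, 5):
        print(f"RPO {rpo:>2} min -> ~{required_mbps(profile, rpo):.0f} Mbit/s")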


Cupramax

10,478 posts

252 months

Sunday 28th May 2017
Countdown said:
jamiem555 said:
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.
roflroflroflroflroflrofl

We implemented SAP a couple of years ago and we are still using Aggresso for some of our business-critical stuff (and then posting transactions manually into SAP). IMHO it's literally too big for most organisations and the quality of SAP consultants is variable.
That's closer to the truth and the correct answer. Having been on the messy end of implementing it - successfully, I hasten to add.

eliot

11,418 posts

254 months

Sunday 28th May 2017
jonny996 said:
....... Thing is with most shedding schemes you have to give control of your generators over to the grid & they don't tell you when they are switching you, if you line that up with poor change management you then have potential for a switch to happen at same time as generator maintenance.
Just a guess but a very well educated one
Sounds like you are talking about Flexitricity - whilst they can start your gennys at any time, you can manually take them offline and mark them unavailable.
Would there even be demand control during the day at this time of year, rather than in the middle of darkest winter?

Byker28i

59,569 posts

217 months

Sunday 28th May 2017
Cupramax said:
Countdown said:
jamiem555 said:
I heard they use SAP. If so, then that explains it. We've just moved to it and it's just waiting to fall to bits.
roflroflroflroflroflrofl

We implemented SAP a couple of years ago and we are still using Aggresso for some of our business-critical stuff (and then posting transactions manually into SAP). IMHO it's literally too big for most organisations and the quality of SAP consultants is variable.
That's closer to the truth and the correct answer. Having been on the messy end of implementing it - successfully, I hasten to add.
I think he forgot to add the cost of SAP consultants.

The jiffle king

6,910 posts

258 months

Sunday 28th May 2017
Most issues with SAP are about how it is implemented. The tool works very well, but companies do the following things to screw it up:
- Think it's an IT project, so don't have business engagement at the right level
- Make SAP change to work the way their company works, including every nuance
- Don't do anywhere near enough change management/training
- Focus too much on the small stuff and lose focus on making the core work
- Try to do too much change at once

Vaud

50,426 posts

155 months

Sunday 28th May 2017
The jiffle king said:
Most issues with SAP are about how it is implemented. The tool works very well, but companies do the following things to screw it up:
- Think it's an IT project, so don't have business engagement at the right level
- Make SAP change to work the way their company works, including every nuance
- Don't do anywhere near enough change management/training
- Focus too much on the small stuff and lose focus on making the core work
- Try to do too much change at once
I agree - it's a business change project with an IT component attached.

J4CKO

41,499 posts

200 months

Sunday 28th May 2017
Seems a coincidence that there was all the NHS and other hacking recently - I wonder if they implemented some countermeasures that have backfired. Been there....

Or it's a disgruntled employee, of course.

It's always about the cost with IT: management going on about synergies and all that bks, which means sacking folk and offshoring it, and then they wonder why stuff doesn't happen. Still, it can give the shareholders a nice dividend. They rely on folk doing daft hours and getting by by the skin of their teeth. Well, they've fked it up this time, and people remember ruined holidays and cancelled meetings.

Stuff like this can put even big companies out of business. Bet United are laughing, as they're old news now.

kev1974

4,029 posts

129 months

Sunday 28th May 2017
I had a terrible experience a few years ago, stranded for days at the hands of Vueling with no help from them whatsoever to get home - everyone on their cancelled flights was left entirely to their own devices. Now I see that the current BA chairman and CEO came from Vueling; I guess he brought more than his hi-viz with him, and suddenly BA's decline over the last 18 months makes sense.

Wonder if IAG will let him stay on, or whether he'll be toast by Tuesday. Still 73 cancelled flights at LHR today, and a hell of a lot of anger on Twitter.