BA systems down globally

Author
Discussion

Marcellus

7,119 posts

219 months

Sunday 28th May 2017
quotequote all
As someone who got stranded in Sardinia yesterday lunchtime, all I can say is that, apart from possibly causing the whole mess, the BA staff we met tried to sort things out and did as much as they could: starting with the Captain taking charge and briefing us, then him and the 1st officer being prepared to speak to anyone and everyone who wanted to ask questions, vent, etc., and then liaising with the ground crew (not BA staff) to sort out where we were supposed to go and when.

Then this morning, after having done all of the legal briefings (safety etc.), he came into the cabin to update us on how he saw the day/flight panning out.

OK, so we got home 24 hours later than planned after a night in a hotel and a bloody good dinner last night.

Carl_Manchester

12,196 posts

262 months

Sunday 28th May 2017
quotequote all
WestyCarl said:
Does anyone more IT savvy than me seriously think that the "power supply issue" is genuine? Our two-bit, 30-staff company has all its servers on UPS, etc. I can't believe BA don't have any power outage backup (and test it regularly!).

Has "power supply issue" become the new excuse?
In relation to the reason they gave, for me, it's the timing and length of the outage that is the problem.

Highly doubtful any major changes or upgrades were booked for out-of-hours Friday evening on a May bank holiday. A power failure suddenly happens then and not at any other point during the year? It's possible, but it's a 1-in-365 shot.

It is more likely they had a software glitch: either the system failed on its own, or an engineer tried to apply a fix to a problem that had been spotted. That failed, they then tried to fail over from primary to secondary and realised they could not, and have thus spent the last 24 hours or so restoring a very large amount of data, which explains the long delay in getting the system back online.
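To put a rough number on that last point, here's a back-of-envelope sketch; the data volume and restore rate below are my own assumptions, nobody outside BA knows the real figures.

```python
# Back-of-envelope restore-time estimate (illustrative figures only).

def restore_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to restore data_tb terabytes at a sustained throughput_mb_s MB/s."""
    total_mb = data_tb * 1_000_000      # 1 TB = 1,000,000 MB (decimal units)
    return total_mb / throughput_mb_s / 3600

# Assumed: 40 TB of booking/ops data, 500 MB/s sustained restore rate.
print(f"{restore_hours(40, 500):.1f} hours")    # ~22.2 hours
```

At that sort of scale a full restore on its own eats most of a day, before you even start checking consistency.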


kev1974

4,029 posts

129 months

Sunday 28th May 2017
quotequote all
Carl_Manchester said:
WestyCarl said:
Does anyone more IT savvy than me seriously think that the "power supply issue" is genuine? Our two-bit, 30-staff company has all its servers on UPS, etc. I can't believe BA don't have any power outage backup (and test it regularly!).

Has "power supply issue" become the new excuse?
In relation to the reason they gave, for me, it's the timing and length of the outage that is the problem.

Highly doubtful any major changes or upgrades were booked for out-of-hours Friday evening on a May bank holiday. A power failure suddenly happens then and not at any other point during the year? It's possible, but it's a 1-in-365 shot.

It is more likely they had a software glitch: either the system failed on its own, or an engineer tried to apply a fix to a problem that had been spotted. That failed, they then tried to fail over from primary to secondary and realised they could not, and have thus spent the last 24 hours or so restoring a very large amount of data, which explains the long delay in getting the system back online.
Maybe, and that might explain why everything to do with existing passenger bookings or making new bookings/rebookings was down.

But it also took down everything to do with baggage, to the point that people at LHR and LGW couldn't reclaim the bags they'd just handed over in the hours previous, so that they had stuff for a night in a hotel.

BA could not even fly empty planes, so they could get them back in position to have a fresh start today.

It took out their phone system too.

And even more, reports from LHR say that staff were having to stand on chairs and shout to make announcements. So even their generic information exchange mechanisms (as in non-passenger-specific) were offline.

That sounds like one seriously badly architected system, that so much got knocked out. Why does a problem with the booking systems take out the phones and the passenger information displays as well, or vice versa?




LHRFlightman

1,939 posts

170 months

Sunday 28th May 2017
quotequote all
kev1974 said:
Carl_Manchester said:
WestyCarl said:
Does anyone more IT savvy than me seriously think that the "power supply issue" is genuine? Our two-bit, 30-staff company has all its servers on UPS, etc. I can't believe BA don't have any power outage backup (and test it regularly!).

Has "power supply issue" become the new excuse?
In relation to the reason they gave, for me, it's the timing and length of the outage that is the problem.

Highly doubtful any major changes or upgrades were booked for out-of-hours Friday evening on a May bank holiday. A power failure suddenly happens then and not at any other point during the year? It's possible, but it's a 1-in-365 shot.

It is more likely they had a software glitch: either the system failed on its own, or an engineer tried to apply a fix to a problem that had been spotted. That failed, they then tried to fail over from primary to secondary and realised they could not, and have thus spent the last 24 hours or so restoring a very large amount of data, which explains the long delay in getting the system back online.
Maybe, and that might explain why everything to do with existing passenger bookings or making new bookings/rebookings was down.

But it also took down everything to do with baggage, to the point that people at LHR and LGW couldn't reclaim the bags they'd just handed over in the hours previous, so that they had stuff for a night in a hotel.

BA could not even fly empty planes, so they could get them back in position to have a fresh start today.

It took out their phone system too.

And even more, reports from LHR say that staff were having to stand on chairs and shout to make announcements. So even their generic information exchange mechanisms (as in non-passenger-specific) were offline.

That sounds like one seriously badly architected system, that so much got knocked out. Why does a problem with the booking systems take out the phones and the passenger information displays as well, or vice versa?
How do you know BA couldn't fly empty planes?

fatboy b

9,493 posts

216 months

Sunday 28th May 2017
quotequote all
smack said:
BA have outsourced their IT, and you reap what you sow.

I'm sad to hear this, along with the screw-up rolling out the FLY system. I have been a supplier, I have many friends who work for BA, and I'm one of their regular customers. I can only hope they realise they got it wrong and make amends to fix it and do it right, not as cheaply as possible.
Didn't think companies still did this. So short-sighted. I was at JLR when they outsourced their IT to try and save £7m a year. Within 4 weeks, they had a major line stop on the Range Rover line for 5 hours. A stop on that line is £30m a day, so that wiped out a couple of years of savings. Plus they lost all the goodwill in getting 'stuff' done, which the new muppets now charge a fortune for.

Vaud

50,496 posts

155 months

Sunday 28th May 2017
quotequote all
fatboy b said:
smack said:
BA have outsourced their IT, and you reap what you sow.

I'm sad to hear this, along with the screw-up rolling out the FLY system. I have been a supplier, I have many friends who work for BA, and I'm one of their regular customers. I can only hope they realise they got it wrong and make amends to fix it and do it right, not as cheaply as possible.
Didn't think companies still did this. So short-sighted. I was at JLR when they outsourced their IT to try and save £7m a year. Within 4 weeks, they had a major line stop on the Range Rover line for 5 hours. A stop on that line is £30m a day, so that wiped out a couple of years of savings. Plus they lost all the goodwill in getting 'stuff' done, which the new muppets now charge a fortune for.
Outsourcing is fine when done properly. How far do you vertically integrate?

Chim

7,259 posts

177 months

Sunday 28th May 2017
quotequote all
Byker28i said:
Tuna said:
I've worked with a company that had a global IT failure that lasted rather a long time (no names).

The problem with DR is that as your systems get more complex, distributed and layered, the dependencies get harder to understand. You can have all the failover you like on your shiny new system, but there will be bits of legacy kit, network routing, unreplicable systems and other nonsense that mean that when it genuinely does go south, bringing it back up can be a nightmare.
Erm, sorry, but just no, that comment just exposes your lack of knowledge.

Very easy to have replicated systems through virtualisation; it's just the cost of the standby site/equipment and the cost of the lines, and the period of data loss you can survive dictates how much it costs. Systems can be brought back online very promptly.
Erm, but just no, the knowledge you say is lacking is perhaps yours. Legacy has a huge impact on DR capability. Hardware-level failover is a piece of piss; application failover is completely different, and there are multiple issues with it across a host of common apps and legacy, not to mention integrations, which are a complete nightmare.

Vaud

50,496 posts

155 months

Sunday 28th May 2017
quotequote all
Plus not all systems can be virtualized.

Tuna

19,930 posts

284 months

Sunday 28th May 2017
quotequote all
Byker28i said:
Tuna said:
I've worked with a company that had a global IT failure that lasted rather a long time (no names).

The problem with DR is that as your systems get more complex, distributed and layered, the dependencies get harder to understand. You can have all the failover you like on your shiny new system, but there will be bits of legacy kit, network routing, unreplicable systems and other nonsense that mean that when it genuinely does go south, bringing it back up can be a nightmare.
Erm, sorry, but just no, that comment just exposes your lack of knowledge.

Very easy to have replicated systems through virtualisation; it's just the cost of the standby site/equipment and the cost of the lines, and the period of data loss you can survive dictates how much it costs. Systems can be brought back online very promptly.
I'm glad there are so many experts around here. They're probably the same sort of experts that advised BA.

No offence to people who've managed university data centres and mid-scale virtualised systems, but when you have IT systems that run across international borders, manage global payment systems, have petabyte-class storage and have to deal with complex legislative requirements (such as flight passenger management), things get a few orders of magnitude more complex than "I can virtualise a server and fail over a data centre".

I'm not saying that's the cause of BA's problems, it could be something dumb and simple, but genuinely, don't enter a willy-waving competition when you're that far out of your depth. The first person who says "I can solve this problem"... hasn't understood the problem.

Matt p

1,039 posts

208 months

Sunday 28th May 2017
quotequote all
Byker28i said:
Matt p said:
It would take something pretty catastrophic to take down the entire power system in a DC. So much backup is normally in place. At least in the DCs I've worked in, it's normally N+1+1+1 for chilled water.
Modern datacentres use airflow through hot/cold aisles with little need for fans, chillers etc. They only come on very rarely, which keeps power costs down.
Most usually have dual feed supplies to the datacentre, UPS, large generators etc, so this smacks more of an issue internal to the datacentre rooms. There's probably power on site but they can't get it to where it's needed, i.e. the servers. Seems strange they haven't got a DR site.

I'd suggest the cost of a DR site would easily be a fraction of what yesterday cost them.


Most chillers in DCs have a free-cooling function, which reduces the need to run the refrigeration side. As for indoor cooling, I've seen/worked on a varied amount, but there's always some form of cooling, whether it be via CHW, DX or AHUs. However, there is a shift towards adiabatic cooling now.

The best set-up for a data room I've seen/worked on, funnily enough, was the QVC shopping channel.

Agree that with a DR backup it's very, very strange BA couldn't change over.

Du1point8

21,608 posts

192 months

Monday 29th May 2017
quotequote all
anonymous said:
[redacted]
Was working in Waterside (BA HQ) and we told BA management about the flaws in the TCS iFLY system. They wouldn't listen, and this is what you get. As for the power outage, that is absolute rubbish. Two data centres, BoHo and Cranebank, both have dual power feeds from two suppliers, and then there are the diesel generators that automatically kick in too. The last major outage with iFLY couldn't be fixed by TCS after 7hrs. The small group of guys left in BoHo were told to sort it out and it only took them 30mins. TCS have no skills in this area at all; they're nice guys but there it ends. Experienced senior management like Sarah Endersby, Steve Harding etc. were all got rid of. IAG directors are to blame, inc BA CEO Alex Cruz.
TCS I believe are Tata Consultancy Services, the Indian outsourcing company BA use.
Outsourcing seems to be the biggest false economy going; it's never a small error that happens, it's always a disaster when it comes to an outsourcing error.

Things I have noticed since I was part of outsourcing in 2005 onwards (several different companies).

1) No business knowledge, despite a handover lasting several weeks and the outsourcers saying they know everything (maybe they don't want to lose face). The first incident that comes up, the outsource team give up, shrug their shoulders and blame the outgoing team for not doing a complete handover.
2) Blame... blame... blame again!! If something does go wrong, the outsource team spend more time blaming everyone other than themselves (usually while working on equipment only they can access) and hunting for the person responsible, rather than fixing the fking problem.
3) No testing... the outsource team will test their individual section of code/bug fix; it works on their computer so it gets dumped into OAT/Prod, then it fks up... They defend the decision because it worked on their PC... They never tested the fix against the rest of the system and don't know what a regression test is.
4) If in doubt, hard-code the fix into the system to make it work, then claim ignorance and say the tech spec said if the value was X then do Y, so to make sure it was always X they ignored the incoming value and hard-coded X into the system (a toy illustration follows this list)... That's happened way too many times.
5) Friends of friends being hired for jobs... So we lose $1000-a-day contractors whose knowledge is second to none... Their replacement is the outsourcing manager's son/nephew/daughter's BF/etc... When we ask just what the hell they are playing at, letting someone with no experience loose on a real-time trading platform in exchange for an experienced member of staff, their management get all pissy and state that everyone has to start somewhere... When told to remove him from the project and get someone with the experience we are paying for, we get ignored and they hope it goes away.
6) The onshore team effectively become a QA team and end up spending 50% of their time QA'ing the outsourced team's work, as the onshore manager is stting himself that he will be found out for the bad decision to outsource... When queried on why his onshore team produce less work than the offshore team that is supposedly producing work to a very, very high standard, said manager doesn't defend them and tells the onshore team they are lazy... Then he sees 5 of the 7-man team walk out of the meeting and out of the door, and has to beg them to come back as a prod release is due.
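Here's the toy illustration of points 3 and 4, completely made up and nothing to do with BA's actual code: the "fix" hard-codes the value the spec mentioned, and the most basic regression test, the kind that never got run, catches it straight away.

```python
# Toy example of the "hard-code the fix" anti-pattern, plus the regression test that catches it.

def surcharge(route_code: str) -> float:
    """Correct version: the surcharge depends on the incoming route code."""
    return 25.0 if route_code == "LHR-JFK" else 10.0

def surcharge_hardcoded(route_code: str) -> float:
    """The 'fix': the spec said 'if the route is LHR-JFK, charge 25',
    so the incoming value is thrown away and X is forced to always be X."""
    route_code = "LHR-JFK"      # incoming value ignored
    return 25.0

def regression_test():
    # Check the patched function against the reference behaviour for more than one route.
    for route in ("LHR-JFK", "LGW-AMS"):
        assert surcharge_hardcoded(route) == surcharge(route), f"hard-coded fix breaks {route}"

if __name__ == "__main__":
    try:
        regression_test()
    except AssertionError as e:
        print(f"Regression test failed: {e}")   # fails on LGW-AMS: "works on my PC" wasn't enough
```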

So many more examples...

Those that can, leave India ASAP and end up in London/NY/etc... those that can't, end up being the outsource team.

Still 12 years later the same mistakes are being made time and time again.

I suspect some of the ex BA IT staff are about to make a st load of money as contractors shortly.

How much money did they save, and how much of it has now been wiped out by the 150 million compo? Or is the insurance really going to pay?

robm3

4,927 posts

227 months

Monday 29th May 2017
quotequote all
anonymous said:
[redacted]
Was working in Waterside (BA HQ) and we told BA management about the flaws in the TCS iFLY system. They wouldn't listen, and this is what you get. As for the power outage, that is absolute rubbish. Two data centres, BoHo and Cranebank, both have dual power feeds from two suppliers, and then there are the diesel generators that automatically kick in too. The last major outage with iFLY couldn't be fixed by TCS after 7hrs. The small group of guys left in BoHo were told to sort it out and it only took them 30mins. TCS have no skills in this area at all; they're nice guys but there it ends. Experienced senior management like Sarah Endersby, Steve Harding etc. were all got rid of. IAG directors are to blame, inc BA CEO Alex Cruz.
TCS I believe are Tata Consultancy Services, the Indian outsourcing company BA use.
Wow! That's really interesting. Despite this being a whinge about a different issue, it looks like most PHers were correct in their assumption that there's no way it's a power issue.
Ultimately the buck will stop with the poor sod who's BA's IT Director, and possibly the CEO as well, although in my experience CEOs are too wily to get pinned with any serious responsibility: "I contributed to the decision based on the information I was given" kind of rubbish.

mybrainhurts

90,809 posts

255 months

Monday 29th May 2017
quotequote all
My money's on Richard Branson donning a fake beard, sneaking in and pulling the plug..

Du1point8

21,608 posts

192 months

Monday 29th May 2017
quotequote all
mybrainhurts said:
My money's on Richard Branson donning a fake beard, sneaking in and pulling the plug..
At a bank I used to work at, the tech staff would get a tour of the data centre upon joining. Smack bang in the centre of the floor was a big red button (no cover, etc.); we asked what it did, and the answer was that it effectively shuts down the whole data centre.

All tours were cancelled and a glass box was put over the big red button 4 weeks later, when a poor grad walked past it, tripped over his shoelaces and hit the big DO NOT fkING TOUCH button.

He probably had the shortest job with them ever as it was only a few days in.

J4CKO

41,560 posts

200 months

Monday 29th May 2017
quotequote all
Tuna said:
Byker28i said:
Tuna said:
I've worked with a company that had a global IT failure that lasted rather a long time (no names).

The problem with DR is that as your systems get more complex, distributed and layered, the dependencies get harder to understand. You can have all the failover you like on your shiny new system, but there will be bits of legacy kit, network routing, unreplicable systems and other nonsense that mean that when it genuinely does go south, bringing it back up can be a nightmare.
Erm, sorry, but just no, that comment just exposes your lack of knowledge.

Very easy to have replicated systems through virtualisation; it's just the cost of the standby site/equipment and the cost of the lines, and the period of data loss you can survive dictates how much it costs. Systems can be brought back online very promptly.
I'm glad there are so many experts around here. They're probably the same sort of experts that advised BA.

No offence to people who've managed university data centres and mid-scale virtualised systems, but when you have IT systems that run across international borders, manage global payment systems, have petabyte-class storage and have to deal with complex legislative requirements (such as flight passenger management), things get a few orders of magnitude more complex than "I can virtualise a server and fail over a data centre".

I'm not saying that's the cause of BA's problems, it could be something dumb and simple, but genuinely, don't enter a willy-waving competition when you're that far out of your depth. The first person who says "I can solve this problem"... hasn't understood the problem.
It's very simple: it wasn't fit for purpose. If they had a failure, we shouldn't even be aware of it, as the redundancy should have kicked in.

I am surprised it is one global system running everything; I could understand it losing a location, but the whole lot?

I would have imagined they had independent instances of the software replicating back to a central system, so that the whole thing could cope with a node being offline or losing its connection to the main system. I'm not that sure what it takes to run an international airline, but whatever it looks like, you would think that, with the stakes involved, the DR would be robust and rigorously and frequently tested. Maybe it was, but it doesn't seem like it.
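Purely as a sketch of the sort of thing I mean (my own invention, not how BA's systems actually work): each node keeps accepting updates locally and replays them to the centre once the link comes back.

```python
# Sketch: a node that buffers updates locally while the central system is unreachable,
# then replays them in order once the connection returns.
from collections import deque

class CentralLink:
    """Stand-in for the connection to the central system."""
    def __init__(self):
        self.up = True
        self.received = []

    def send(self, update):
        if not self.up:
            raise ConnectionError("central system unreachable")
        self.received.append(update)

class Node:
    def __init__(self, link: CentralLink):
        self.link = link
        self.backlog = deque()          # local store-and-forward buffer

    def record(self, update):
        """Accept an update locally; forward it if possible, queue it if not."""
        self.backlog.append(update)
        self.flush()

    def flush(self):
        # Replay oldest-first; stop (keeping the rest) if the link is still down.
        while self.backlog:
            try:
                self.link.send(self.backlog[0])
            except ConnectionError:
                return
            self.backlog.popleft()

link = CentralLink()
node = Node(link)
node.record("bag 123 loaded")
link.up = False
node.record("bag 456 loaded")           # central system down: buffered locally
link.up = True
node.flush()                            # link restored: backlog replayed in order
print(link.received)                    # ['bag 123 loaded', 'bag 456 loaded']
```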

I think the days of offshoring to India are numbered: salaries are rising for skilled IT staff and the savings are no longer great enough to be worth the downsides.


98elise

26,600 posts

161 months

Monday 29th May 2017
quotequote all
Du1point8 said:
Outsourcing seems to be the biggest false economy going; it's never a small error that happens, it's always a disaster when it comes to an outsourcing error.

Things I have noticed since I was part of outsourcing in 2005 onwards (several different companies).

1) No business knowledge, despite a handover lasting several weeks and the outsourcers saying they know everything (maybe they don't want to lose face). The first incident that comes up, the outsource team give up, shrug their shoulders and blame the outgoing team for not doing a complete handover.
2)......

....(lots of other stuff about outsourcing)...
This is very true. I'm a contractor with decades of experience in my particular business field, and about 20 years on the IT side. A system I've migrated for a client is very difficult to hand over to the support team because they lack that experience.

They have all the right technical skills, and I can teach them how the system works, but I cannot teach experience.

When something goes wrong I will know roughly what area it's in, and what approach to take in isolating the issue as soon as it's explained to me.

It's especially important when it's a user error. If you don't understand the business process then you will have no chance of finding the issue.



4x4Tyke

6,506 posts

132 months

Monday 29th May 2017
quotequote all
WestyCarl said:
Does anyone more IT savvy than me seriously think that the "power supply issue" is genuine? Our two-bit, 30-staff company has all its servers on UPS, etc. I can't believe BA don't have any power outage backup (and test it regularly!).

Has "power supply issue" become the new excuse?
Your final point is likely right. Power may well have been the trigger, but the scale of the outage indicates much bigger problems afoot. I would put good money on the root cause of this turning out to be poor IT governance: the big-picture processes that ensure resilience is designed in, tested and proven.

In the past we had a saying: you never get sacked for buying IBM (or Microsoft et al.). Expensive, but it would provide you with the very best backup available.

Today we see IT treated as a cost centre and everything outsourced on a lowest-cost basis. Those suppliers are further whipped into line with crude metrics by managers who got a leg up by doing things quickly or cheaply, not properly. I work as a consultant in IT QA and I see this kind of lack of concern for proper governance every day.

I'm currently consulting to a very large financial organisation, a globally recognised brand. They are heavily pressuring their supplier to deliver to a crude metric of tests carried out, but those test cases include absolutely no proper logging, no proper proof, no auditing. I know that at least half of these tests are worthless. They are paper tigers wink I keep telling them they are training their supplier to game the metrics, not deliver a quality product.
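To make the "paper tiger" point concrete, a contrived example (not the client's actual test suite): both of these bump the tests-carried-out metric, but only one proves anything or leaves evidence behind.

```python
# Contrived example: two "tests" that both count towards a tests-carried-out metric.
import logging

logging.basicConfig(filename="test_evidence.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def calculate_refund(fare: float, taxes: float) -> float:
    return round(fare + taxes, 2)

def test_refund_paper_tiger():
    # Executes the code, asserts nothing, records nothing: metric +1, value 0.
    calculate_refund(100.0, 23.5)

def test_refund_with_evidence():
    # Checks the expected result and logs inputs/outputs as an audit trail.
    result = calculate_refund(100.0, 23.5)
    logging.info("test_refund_with_evidence: inputs=(100.0, 23.5) result=%s", result)
    assert result == 123.5, f"expected 123.5, got {result}"

if __name__ == "__main__":
    test_refund_paper_tiger()
    test_refund_with_evidence()
    print("Both 'tests' ran; only one left proof in test_evidence.log")
```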

The NHS failure was exactly the same thing: the attack was the trigger, but the root cause of the collapse of IT provision was a governance failure, very senior management failing to ensure resilience was built in and proven.

The following pretty much sums up the problem: everybody is in the top and left segments, but claims to be in the lower right.





Edited by 4x4Tyke on Monday 29th May 10:03

4x4Tyke

6,506 posts

132 months

Monday 29th May 2017
quotequote all
anonymous said:
[redacted]
True, even with good UPS design things can go wrong. Picture a large UK-based e-commerce organisation: a new distribution centre with an integral data centre, tens of millions spent on the new data centre, and a highly skilled team operating to best practice. I was part of the DR team on the software side.

The UPS arrangement had three tiers: rack, data centre and backup generators. The kick-over was tested monthly and proven to be good. The site was on a new commercial park undergoing continued development, and one of the external power supplies was taken out.

The diesel generator automatically kicked in, going from cold standby to warm standby. The generators were known to have 2X hours of supply, where X was the time to get more fuel to the site. The problem was that monitoring of the fuel supply had been overlooked after each of the resilience tests, and refilling the diesel had not happened. The new fuel did arrive in time, but with only tens of minutes of leeway.

Lessons were learnt and improvements made: the fuel supply was increased significantly, fuel-level monitoring was improved, and strategies were formulated to reduce power consumption in DR situations.
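A minimal sketch of the sort of check that ends up being added after an incident like that (my own illustration, not their actual monitoring): raise the refuel order while remaining runtime still comfortably exceeds the time it takes to get a tanker to site.

```python
# Sketch: generator fuel check that alerts before runtime drops below resupply time.

def hours_remaining(fuel_litres: float, burn_litres_per_hour: float) -> float:
    return fuel_litres / burn_litres_per_hour

def needs_refuel_order(fuel_litres: float, burn_litres_per_hour: float,
                       resupply_hours: float, safety_margin_hours: float = 4.0) -> bool:
    """True once remaining runtime is within resupply time plus a safety margin."""
    return hours_remaining(fuel_litres, burn_litres_per_hour) <= resupply_hours + safety_margin_hours

# Assumed figures for illustration: 6,000 L on site, 400 L/h burn, 8 h to get a tanker.
print(hours_remaining(6000, 400))           # 15.0 hours of runtime left
print(needs_refuel_order(6000, 400, 8))     # False: still above the 12-hour threshold
print(needs_refuel_order(4000, 400, 8))     # True: 10 hours left, order fuel now
```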

jonny996

2,616 posts

217 months

Monday 29th May 2017
quotequote all
anonymous said:
[redacted]
I would say a properly run DC has diesel measured in days of run time.

jonny996

2,616 posts

217 months

Monday 29th May 2017
quotequote all

anonymous said:
If you have multiple "hot" DCs, then buying, storing and replacing that much diesel will be a waste of money.
Who really has multiple "hot" DCs? I assume you mean active/active?