TSB Bank IT failure: Déjà vu all over again
TSB bank in the UK has been experiencing severe computer issues, with customers unable to log into their online accounts, use the bank’s app or make online payments. David Rubens looks at the issue from an organisational resilience perspective.
The problems began when the bank started to migrate the data of its five million customers from the system of its former owner, Lloyds, to a new one. The migration was scheduled to take place between 16:00 hrs on April 24 and 18:00 hrs on April 26 but, four weeks later, many customers said they were still experiencing problems and the bank admitted that it had received 40,000 complaints.
For the leaders of TSB Bank, the weeks since the transition from one data management system to another failed miserably, leaving millions of online customers without basic services, must have seemed like a never-ending nightmare from which they are forever about to awaken, only to be dragged back in.
The idea of the catastrophic failure of banking IT systems is nothing new. In fact, in an article I wrote in January 2013, looking at what I thought might be the ‘mega-crisis’ of the near future, this was exactly the situation that I chose to focus on, saying:
“The scenario that I have been using as an example of the truly ‘Wicked Problem’ in my own work for the last couple of years is the breakdown of global banking IT systems, which – at one stroke – would leave vast areas of the population to survive purely on the money that they happened to be carrying at the time.
“Warnings that the underlying support systems behind global internet banking are reaching the functional limits of their operational complexity were seen when problems with RBS and NatWest computer systems left up to 12 million people without access to their money in June last year, and there were similar problems with Lloyds Banking Group (which includes Halifax and Bank of Scotland) in October.
“Similarly to the bank IT failure in South Korea that left 30 million customers affected for over a week in 2011, such stories seem to have a natural trajectory. Initial triggering, followed by a quick response by the company to say that they are working on it; the company then says that it has a solution that will be implemented and the problem will be fixed; it then seems that the problem is more complicated than first thought, and a range of interventions do not work, and it is then reported, either by news sources or by the bank itself in an effort to deflect responsibility, that actually the cause of the meltdown was not within the system, but because of human error by an outsourcing company that had been tasked with managing the IT system (and who had undoubtedly won that contract on the basis of lowest cost….).”
It seems clear that – except for the last part (though the reason for making the change was to save the fee that TSB was paying Lloyds, which managed the legacy system supporting the online functions) – that paragraph could have been written about the TSB failure.
The fact that large-scale IT projects are fraught with risks and dangers, both those that are unexpected and those that flow directly from the incompetence of the managers who are supposed to be responsible for the design and implementation of the systems, should not come as news.
The UK Department of Transport’s attempt to unify its human resources and financial systems in a single national centre? Branded as an exhibition of “stupendous incompetence” by the Public Accounts Committee. The attempt to create a national unified database for 385 magistrates’ courts in the UK? The initial contract was signed for £146 million. After 18 months the price had risen to £319 million, and when the contractor tried to raise it further to £389 million, the contract was rescinded – though Fujitsu still received most of its fee, despite not delivering the service promised.
The project to unify the UK’s National Health Service patient records under one system was proudly announced as the largest non-military IT project in the world. It was perhaps not surprising that when it was finally abandoned in 2013, after seven years of failed efforts to get even the initial stage operational, only 13 trusts out of a projected 169 had received the system, and the cost had grown from £6 billion to over £10 billion. Not only did the system itself not work, but the attempt to transfer some hospitals to the new system led to the loss or degradation of patient information, with significant disruption, additional time and energy spent trying to recover (or even identify) what had been lost, and a direct impact on services to critically ill patients.
So, given that this situation is neither random nor unexpected, but could have been predicted as being a high-likelihood outcome of trying to switch extremely complex data between two hosting systems, what lessons could be learned?
Technology is almost always more complex than you think it is. Whatever you think you are able to do, however many resources you think you require, and however much time you think it is going to take – you are almost certainly going to be wrong. And not just slightly wrong – but by an exponential factor.
That complexity increases when a transition from one system to another is involved. We all know how frustrating the first few weeks with a new smartphone can be, when all of the things that the salesperson promised would happen seamlessly do not, and it eventually becomes clear that the transition process has resulted in the loss of data and functionality that had previously been taken for granted. What is true for a smartphone or laptop is even more so for the type of complex data management systems used by large corporations trying to integrate multiple sites and sources into a single unified architecture.
Organisations that are suffering from IT failures will always try to underplay the significance and the seriousness of the situation, and will claim that recovery time will be a matter of hours. This then gets pushed further and further back as the scale and complexity of the problem starts to emerge. If the problem itself is complex, then it is unlikely that the solution is going to be simple.
At some stage, the spokesman for the organisation will say something along the lines of ‘This is an incredibly complex system, and there are issues that we had not expected’. This is simply not acceptable. Complexity is an integral part of the system. If you cannot understand, model or manage complexity, then you should use someone who can. In fact, it is the failure to understand the implications of complex systems and the challenges that they pose that is at the root of most IT failures – and certainly a significant factor in the failure to plan for and manage the recovery.
There should be a way of reviewing the management of complex IT projects before they are initiated, and specifically there should be rigorous testing of the assumptions behind every aspect of the planning and preparation. If a company claims that it understands how to run a project like this, it should also be able to deliver a project framework that can be delivered on time and on budget. Often, the reason for catastrophic failures is that the initial assumptions were completely unrealistic in terms of time, resources and budget (and were understood to be so by all the significant stakeholders on both sides of the negotiations).
These incidents have real consequences. Millions of people are affected by these systems failures, sometimes catastrophically, and in some cases beyond recovery. The people who are responsible for the management of the systems should also be held responsible for the impacts of the failures that their incompetence causes.
The TSB IT failure is not something that ‘happened’. It was something that was ‘caused’. It was not something that was unexpected, arbitrary or anomalous. It was something that should have been considered as a central issue, both in the planning for the transfer of data, and when the button was pushed to make it happen.
When a major institution that offers services critical to the safety and wellbeing of millions of customers once again acts in such a way as to create genuine suffering and hardship, it is no longer enough to issue a corporate apology and attempt some reputational management by offering perks, such as suspending interest payments for one month.
There needs to be a recognition of the level of responsibility that such organisations hold (as there is in every other aspect of critical national infrastructure), and there needs to be a clear signal that the institutions themselves, and the senior personnel running them, will be held organisationally and personally responsible in the event that such duties are not delivered.
I will finish off with another quote from the 2013 article: “I am certainly not complacent about the threat from global warming, population growth, urbanisation, food shortages, water disputes, global pandemics, species-jumping viruses, volcanoes, earthquakes, tsunamis, drought, rising levels of greenhouse gas emissions, growing political instability, rogue nuclear states, lone-wolf terrorists or unforeseen consequences of innovative bio-technology… but the one that really keeps me awake in the small hours is what will be the pictures on the world news seventy-two hours after the crash of global IT systems means that every bank account in the world reads £00.00. Happy New Year!”
Dr David Rubens DSyRM, CSyP, FSyI is a Member of CRJ’s Editorial Advisory Panel and CEO of Deltar Consultants, a strategic risk and crisis management consultancy. Deltar runs UK Government accredited Level 5 Award in Corporate Risk and Crisis Management programmes around the world.