Delta Down and Cloud Outages…What Happened?

On Aug. 8, 2016, Delta Air Lines experienced an extended six-hour outage, which some analysts and journalists falsely attributed to flaws associated with aging technology – i.e., TPF running on mainframe systems.

This false trope failed to identify the real culprits, which are exposures common to all IT solutions – in public and private clouds and even more so in non-mainframe environments. IT executives should not alter their views about certain technologies, or their decision-making, based upon biases and false reporting of facts. Nor should business or IT executives assume that outages will become history if they move to the cloud.

Normally I do not write about a single system outage that impacts one company, but the Delta downtime became a cause célèbre.

On Aug. 8 at 2:38 AM EDT, a power outage hit the Delta data center, causing a global system failure that lasted six hours before business could begin to return to normal. Tens of thousands of passengers were stranded around the world and all systems – check-in, flight scheduling and departures, airport screens, reservations, websites, etc. – were affected by the meltdown.

Getting all parts of the airline back to normal after the glitch, and all passengers to their ultimate destinations, actually took days, as hundreds of Delta flights and flight crews were out of position after recovery.

So who’s to blame?

The True Story

The initial report that a power outage was the culprit was partially correct. According to Delta’s COO, “a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. When this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.”

What the executive did not mention was that it all started when Delta’s IT staff attempted to perform a routine switch to its backup generator, which resulted in a spike that caused a fire in an Automatic Transfer Switch (ATS).

Thus, in effect, what Delta and users experienced was the result of a two-step failure. First, the ATS fire and subsequent shutdown meant that a server farm of about 500 servers also shut down abnormally.

Second, Delta’s staff then ran its standard failover process to switch over to the backup IT systems. But this process also failed, as critical systems and network equipment did not switch over to backup power. It was determined after the fact that about 300 of the approximately 7,000 data center components (of which the TPF mainframes were a very small part) had not been configured correctly for the available backup power and therefore remained offline.
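The kind of misconfiguration described above is something an inventory audit can, in principle, catch before a failover is ever needed. The following is only a minimal sketch under stated assumptions, not a description of Delta’s tooling: the `Component` record, the feed names "primary" and "backup", and the three sample component names are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical inventory record for one data center component."""
    name: str
    power_feeds: frozenset  # names of the power sources this component is wired to


def find_unprotected(components, backup_feed):
    """Return the names of components that are NOT wired to the backup feed.

    These are the components that would stay dark if the primary feed
    failed and only the backup feed remained available.
    """
    return [c.name for c in components if backup_feed not in c.power_feeds]


# Illustrative inventory only; the names and feeds are made up.
inventory = [
    Component("reservations-db", frozenset({"primary", "backup"})),
    Component("core-switch-01", frozenset({"primary"})),
    Component("checkin-app-07", frozenset({"primary", "backup"})),
]

unprotected = find_unprotected(inventory, "backup")
print(unprotected)
```

Run against a complete inventory, a check like this would list every component that depends on a single feed, which is the condition that reportedly left about 300 of roughly 7,000 components (a little over 4 percent) without power.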

Even before the details of the problem were made public, it was apparent that the power outage affected only Delta. Two separate power grids feed the site, and one provider, the Atlanta utility Georgia Power, said it was not responsible for the failure and had not received notifications of any outages in its territory.

In fact, Delta’s passenger service system (PSS), like all major PSSs, is theoretically configured with no single point of failure – from the power supply through all equipment components and databases. But in Delta’s case there was either a lack of redundancy or the backup ATS failed to kick in as expected.

