Downtime, Outages and Failures - Understanding Their True Costs
- 11 Apr 2019
- Written by: Gad Cohen
About
This content is brought to you by Evolven. Evolven Change Analytics is a unique AIOps solution that tracks and analyzes all actual changes carried out in the enterprise cloud environment. Evolven helps leading enterprises cut the number of incidents, slash troubleshooting time, and eliminate unauthorized changes.
When it comes to mission-critical applications or data-center performance quality, enterprises are willing to make huge investments. Unfortunately, these investments don’t always fully deliver.
Confronting system downtime
Despite the efforts invested in infrastructure robustness, many IT organizations continue to deal with database, hardware, and software downtime incidents that last from just a few minutes to several days, completely incapacitating the business and causing tremendous losses.
Downtime Expected
The world of IT failure can sometimes seem paradoxical.
Despite the variety of advanced solutions and the mounting data collected by major enterprise software vendors and IT departments (from ERP to CRM and more), outages remain a real and frightening threat to the industry.
On the other hand, IT failures have somehow become an accepted, even expected, part of enterprise life.
This is counterintuitive…
IT downtime revisited
While IT professionals confront downtime from time to time, and are then fully focused on getting on top of it, the business as a whole suffers the financial side effects, which tend to be very significant.
In the past, we took an in-depth look at the multiple ways in which IT downtime can impact an enterprise’s bottom line (you can read more about it here: Cost and Scope of Unplanned Outages). We looked at different aspects, from direct loss of revenue and reputation damage to indirect effects such as decreased productivity.
Now, I wish to revisit the issue and examine how organizations should address and assess threats to their IT operations, including systems, applications and data, by analysing solid (and established) benchmarks that represent the potential costs behind downtime and outages.
System outages: Measuring big brand failures
When should the industry start measuring the financial impact of big brand outages, such as the one that recently hit Facebook, the one that hit hundreds of thousands of Lloyds Bank customers, or the Jetstar outage that resulted in hundreds of flight delays?
In other words, at what point is an outage ‘significant enough’ so that a cost analysis becomes valuable to the industry in order to learn from it and predict the impact of future outage incidents?
Well, apparently at some point the outage creates an impact that can’t be ignored, PR-wise. That’s the point of no return, which is followed by estimates of the financial impact.
Downtime costs vary significantly between industries. The affected business size is obviously a critical factor, but it is not the only major one. The role of the IT systems in the business is also key.
Setting a numerical value behind an IT outage means predefining its implications across multiple business and organizational aspects, so that the whole industry can learn and optimize accordingly.
A failure of a critical application can lead to two distinct types of losses:
- Loss of the application service – the impact of downtime varies according to the application and the business;
- Loss of data – the potential loss of data due to a system outage can have significant legal and financial implications.
Now, I am sure that you would agree that today's data centers should never go down; applications must stay available 24/7, and internal (let alone external) end-users worldwide must be able to rely on data centers’ availability (for critical data and application availability) at all times.
Well, reality bites. In the back office (meaning inside the data center) this is not the case. No organization enjoys 100% uptime. Should you aspire to reach 100%? Sure. But you should also develop a deep understanding of downtime implications and ways to minimize it.
The worst outage nightmare ever? Probably the one that happened to you…
Some past outage incidents turned into PR catastrophes, like the infamous Virgin Blue debacle from 2010, or the recent one that affected Facebook.
Why? The mass impact probably had something to do with it.
As a reminder, the Virgin Blue outage prevented passengers from boarding flights for 11 days (!!) resulting in negative press, damaged reputation, and millions of dollars lost.
To be more accurate: Virgin Blue's reservations management company, Navitaire, ended up compensating Virgin Blue for more than $20 million (Navitaire booking glitch earns Virgin $20M in Compo).
There are many other incidents that still manage to capture the attention of the media. Here’s just one recent article by USA Today about the Wells Fargo outage that prevented customers from accessing their accounts for many hours.
I can safely say that anyone in the IT industry would agree that outages or downtimes are VERY bad for business. They are unwanted, very harmful financially, and must be fought against using all available resources.
Misconfigurations are key
The IT Process Institute's Visible Ops Handbook reported in the past that "80% of unplanned outages are due to ill-planned changes made by administrators ("operations staff") or developers" (Visible Ops).
Enterprise Management Associates reported that 60% of availability and performance errors are the result of misconfigurations.
What’s the cost?
Downtime can cost companies $5,600 per minute, which adds up to more than $300,000 per hour of web application downtime (according to a 2014 Gartner analysis).
[Figure: the average hourly cost of enterprise server downtime, worldwide, 2017-2018. Source: Statista]
Application maintenance costs are increasing at an annual rate of 20%, but spending more on maintenance can’t solve all of your problems. A past industry survey revealed that at least one-quarter of the downtime reported by respondents was caused by configuration errors (How much will you spend on application downtime this year?).
How common are downtimes or outages?
Ok, downtime can be a financial nightmare. That part is clear. But if you wish to properly estimate the risk that outages pose to your business, the immediate question should be “how likely is it to happen?”
Source: Data Center Knowledge
Ok, so outages are way too common to be ignored by thinking “I am not likely to experience a major outage”. Now comes the question of how to calculate their specific risk to your business.
Production and application downtime costs made clear
Unplanned outages are up to IT to resolve. Nevertheless, and as I already mentioned, at the end of the day these outages impact the entire organization.
An important part of a thorough outage risk evaluation process is estimating how much money you will lose per hour (or minute, or any other time increment of your choice) in the event of downtime.
For enterprises that depend solely on data centers' ability to deliver IT and networking services to customers – such as telecommunications service providers or e-commerce companies – downtime can be particularly costly, with the highest cost of a single event topping $1 million (more than $11,000 per minute) according to estimations by experts.
In a USA Today survey of 200 data center managers, over 80% reported that their downtime costs exceeded $50,000 per hour. Over 25% reported downtime costs of over $500,000 per hour (!!).
According to another survey, while companies can't achieve zero downtime, one in every 10 companies said that their availability must be greater than 99.999%.
Source: Searchcio Techtarget
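To make an availability target such as 99.999% concrete, it helps to convert it into a yearly downtime budget. The following is a minimal Python sketch, assuming a 365-day year and round-the-clock operation; the function name `allowed_downtime_minutes` and the list of targets are illustrative and do not come from the survey.

```python
# Assumption: a non-leap year of continuous (24/7) operation.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Illustrative targets, from "two nines" up to "five nines".
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, "five nines" leaves only a little over five minutes of downtime per year, which gives a sense of how demanding that requirement is.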
To get a firm understanding of the implications of production and release downtime, let's take a look at how the consequences of downtime are manifested.
Downtime cost - per year or per incident?
A 2017 study revealed that out of 400 IT decision makers, 46% experienced more than four hours of IT-related downtime over 12 months; 23% said that they incurred costs ranging from $12,000 up to more than $1 million per hour.
Over 35% admitted that they are unsure of the cost of an outage to their business.
If you ask Delta Air Lines, which had to cancel 280 flights due to an outage in 2017, the losses from a single outage incident can reach over $150 million.
A couple of years ago, Dun & Bradstreet reported that 59% of Fortune 500 companies experience a minimum of 1.6 downtime hours per week.
If you take the average Fortune 500 company (or any company that employs at least 10,000 people) and assume that it pays its employees an average of $56 per hour, then (assuming the entire workforce is affected by the downtime) just the labor part of 1.6 weekly downtime hours for an organization of this size would reach $896,000 per week (10,000 x $56 x 1.6), translating to more than $46 million per year (Assessing The Financial Impact Of Downtime).
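The labor figure quoted above can be reproduced with a few lines of Python. This is only a sketch: it assumes that all 10,000 employees are counted at $56 per hour for 1.6 downtime hours per week, and that a year has 52 weeks. The function name `weekly_labor_cost` is hypothetical.

```python
def weekly_labor_cost(employees: int, hourly_rate: float, downtime_hours_per_week: float) -> float:
    """Labor cost of downtime per week: headcount x hourly pay x downtime hours."""
    return employees * hourly_rate * downtime_hours_per_week


# Figures taken from the paragraph above.
weekly = weekly_labor_cost(employees=10_000, hourly_rate=56.0, downtime_hours_per_week=1.6)
yearly = weekly * 52  # assumption: 52 weeks per year

print(f"Weekly labor cost of downtime: ${weekly:,.0f}")
print(f"Yearly labor cost of downtime: ${yearly:,.0f}")
```

Replacing the three inputs with your own headcount, average hourly pay and measured downtime gives a first rough estimate of the labor component alone.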
Of course, the reality is more complicated, as you need to take into consideration many parameters, such as the timing of the event (mid-week or weekend? Day or night?) and more. Still, understanding the costs of outages will significantly help you estimate your risk potential and the ROI of tools that can help minimize the effect of downtime incidents.
Has the industry managed to learn from the past and to minimize the collateral damage during an outage?
How have things changed from the past?
So, we already know that downtime and outage incidents still happen today, and that the industry has yet to eliminate them. But how has their cost changed over time? Are these incidents less harmful today?
In 2010, research by Coleman Parkes found that IT downtime incidents collectively cost businesses more than 127 million man-hours per year in employee productivity, an average of 545 man-hours per company.
In 2009, it was reported that the average downtime costs vary considerably across industries, from approximately $90,000 per hour in the media sector to about $6.48 million per hour for large online brokerages (How to quantify downtime).
According to a survey of IT managers conducted during those years, companies are becoming more aware of the direct financial costs of computer downtime. The survey revealed that one in every five businesses loses $12,000 an hour through systems downtime (How to quantify downtime).
As mentioned above, a later analysis performed by Gartner in 2014 reported an average cost of $5,600 per minute and over $300k per hour.
Even as early as 2004, a conservative estimate from Gartner pegged the hourly cost of downtime for computer networks at $42,000. Accordingly, a company that suffers from a worse-than-average downtime of 175 hours per year can lose more than $7 million annually. However, the cost of each outage affects each company differently, so it's important to know how to calculate the precise financial impact (How to quantify downtime).
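The two Gartner figures above are simple multiplications, and checking them takes only a short Python sketch. The helper names `hourly_cost_from_minute` and `annual_downtime_cost` are hypothetical; the inputs are the numbers quoted in the text.

```python
def hourly_cost_from_minute(cost_per_minute: float) -> float:
    """Convert a per-minute downtime cost into a per-hour cost."""
    return cost_per_minute * 60


def annual_downtime_cost(hourly_cost: float, downtime_hours_per_year: float) -> float:
    """Yearly downtime cost: hourly cost x hours of downtime per year."""
    return hourly_cost * downtime_hours_per_year


# 2014 figure: $5,600 per minute, expressed per hour.
print(f"${hourly_cost_from_minute(5_600):,.0f} per hour")

# 2004 figure: $42,000 per hour over 175 downtime hours per year.
print(f"${annual_downtime_cost(42_000, 175):,.0f} per year")
```

The first result is consistent with the "over $300k per hour" figure, and the second with the "more than $7 million annually" figure.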
It makes sense to believe that the cost of outage only gets higher with time (since we all lean more on data systems today). You can therefore understand why past data can be multiplied by a significant number in order to reflect today’s reality…
Every minute counts
Over ten years ago, the average cost of data center downtime across industries was valued at approximately $5,600 per minute (Unplanned IT Outages Cost More than $5,000 per Minute), a figure which, according to Gartner, remained the same until 2014. A past study by the Ponemon Institute calculated the minimum, median, mean and maximum cost per minute of unplanned outages, based on input from 41 data centers. The greatest cost of an unplanned outage was found to exceed $11,000 per minute.
On average, the cost of an unplanned outage is likely to exceed $5,000 per minute.
It only gets more significant
A 2013 study saw an uplift of over 41% from the past averages described above, to an average cost of more than $7,900 per minute.
An ITIC survey from 2015 clearly showed that the hourly cost had increased by 25% to 30% compared to data from 2008.
Downtime impact per year
A past Gartner analysis calculated that downtime can reach 87 hours per year, on average. Obviously, that's the sum of many outages, each lasting anywhere from a few minutes to several hours (Average large corporation experiences 87 hours of network downtime a year).
How have things changed?
Later research from 2011 revealed that although the industry has managed to fight the downtime epidemic and reduce the number of occurrences, we are still seeing significant downtime hours and huge revenue losses.
The impact on reputation and loyalty
How much is your business reputation worth? This may be extremely difficult to assess, as is the long-term effect of a damaged reputation on revenue and profitability.
In this case, downtime costs include lost customers (both short and long term), and other tangible elements that reflect the cost of reputation impairment, such as stock downturns, marketing hours (crisis and brand recovery management) and the media budget required to reboot and polish up an organization's profile.
What parameters should impact your calculation?
When trying to estimate the cost of downtime, there are the obvious direct costs (such as loss of business during downtime). However, many indirect costs, such as employee overhead or the reputation issues discussed above, should be factored in as well.
Workforce overhead is derived from the cost of urgent ‘war-room’ tasks that focus on getting the IT systems back up and running, the cost of delays to all other planned tasks, the cost of employee overtime (if applicable), and more. Then there’s the value of data loss, emergency maintenance fees (particularly if the outage occurs during off hours), and additional repair costs that may continue long after service has been restored.
Needless to say, you must include these costs when you estimate the implications of downtime, as they are usually very significant. Even a rough guesstimate can prove extremely beneficial for understanding the risks and deciding on the level of technology you should rely on to fight them.
There’s also the impact of lost sales. To accurately assess total lost sales, the impact percentage must be increased to reflect the real lifetime value of customers who permanently defect to a competitor (Cost-Unconscious: Denying the True Cost of Network Downtime). For instance, the Facebook (and WhatsApp) outage that I mentioned earlier led to over 3 million users (apparently WhatsApp users) migrating to Telegram. What is the revenue loss derived from the fact that these users will generate fewer billable ad impressions?
Stock dropped by 25%
Although it's hard to put a number on so many parameters, they are still substantial and significant. For instance, when Amazon.com went offline for several hours during its early days, its stock dropped by 25% in a single day (Cost-Unconscious: Denying the True Cost of Network Downtime)!
In another example, an Amazon cloud outage, the company scrambled to get its cloud services back online. As a result, many customers questioned the reliability of its cloud and Amazon’s communication surrounding the outage. Other customers thought they should be compensated for the downtime as part of their SLA.
I know you are curious: As for the SLA, despite the almost-four-day outage, Amazon's EC2 SLA was not breached (Seven lessons to learn from Amazon's outage).
The cost of downtime: Calculating it yourself
How much are you bound to lose from an unexpected downtime of your servers or business applications?
According to multiple sources, the simplest way to calculate potential revenue losses during an outage is by using this equation:
LOST REVENUE = (GR / TH) x I x H
where:
GR = gross yearly revenue
TH = total yearly business hours
I = percentage impact
H = number of hours of outage
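As a rough illustration, the equation above translates directly into a small Python function. The function and parameter names are hypothetical, the impact is passed as a fraction (0.6 for 60%), and every number in the example is invented for illustration only.

```python
def lost_revenue(gross_yearly_revenue: float,
                 total_yearly_business_hours: float,
                 impact_fraction: float,
                 outage_hours: float) -> float:
    """LOST REVENUE = (GR / TH) x I x H, with I expressed as a fraction between 0 and 1."""
    if total_yearly_business_hours <= 0:
        raise ValueError("total_yearly_business_hours must be positive")
    if not 0 <= impact_fraction <= 1:
        raise ValueError("impact_fraction must be between 0 and 1")
    revenue_per_hour = gross_yearly_revenue / total_yearly_business_hours
    return revenue_per_hour * impact_fraction * outage_hours


# Hypothetical company: $50M yearly revenue, 2,080 business hours per year
# (40 hours x 52 weeks), 60% of revenue affected, 4-hour outage.
estimate = lost_revenue(50_000_000, 2_080, 0.6, 4)
print(f"Estimated lost revenue: ${estimate:,.2f}")
```

Note that this covers direct revenue only; the indirect costs discussed earlier (labor, reputation, lost customers) would have to be estimated separately and added on top.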
How to minimize outage and downtime risk?
Downtime and outages are catastrophic, but they don’t have to be that impactful. By utilizing solutions that focus on getting to the root of the problem, outages can be prevented before they even occur.
Evolven Change Analytics developed a unique AIOps solution that focuses on changes - the true root cause of performance incidents. Evolven helps enterprise IT and Cloud Ops teams prevent and troubleshoot incidents before the trouble starts.
Contact us to see how we help leading enterprises slash the number of incidents and MTTR.
FAQs
What is the true cost of system downtime? ›
Quick downtime calculator
To get a quick estimate of your company's probable downtime costs, use the following formula, based on the size of your business and the number of minutes your most recent incident lasted: Downtime cost = minutes of downtime x cost-per-minute. For a small business, use $427 as the cost-per-minute.
Downtime cost is defined as any profit that a company loses when its equipment or network stops functioning. The cost of downtime implies not only direct financial loss but can also impact your company in at least four other ways.
What is the difference between downtime and outage? ›Downtime occurs when a system can't complete its primary function. It can be broken up into two types: IT outages and brownouts. IT brownouts occur when a system is slowed or partially available. This might mean customers can access your site, but pages load slowly or dynamic features like "add to cart" don't function.
What is downtime failure? ›In industrial environments, downtime may refer to failures in production equipment. This type of downtime is often measured as downtime per work shift or downtime per a 12- or 24-hour period. Downtime duration is the period of time when a system fails to perform its primary function.
What is true downtime cost analysis? ›TDC is a methodology of analyzing all cost factors associated with downtime, and using this information for cost justification and day to day management decisions. Most likely, this data is already being collected in your facility, and need only be consolidated and organized according to the TDC guidelines.
What are the three types of downtime? ›Common categories of downtime include excessive tool changeover, excessive job changeover, lack of operator, and unplanned machine maintenance.
How do you explain downtime? ›a time during a regular working period when an employee is not actively productive. an interval during which a machine is not productive, as during repair, malfunction, maintenance.
How do you define an outage? ›an interruption or failure in the supply of power, especially electricity. the period during which power is lost: a two-hour outage on the East Coast.
What are the two types of downtime? ›Downtime falls into two categories: planned and unplanned. Planned downtime is notable because it offers advanced warning and gives users a chance to prepare. Planned downtime is usually done for upgrades or maintenance to the network infrastructure.
What are the main causes of downtime? ›This can be due to several reasons including hardware or software failure, human error, malicious attacks or natural disasters. Since unplanned downtime is unexpected and occurs without a warning, preventing it can be a challenge.
How do you manage downtime? ›
- Know the best windows of time for planned downtime based on your company's production cycle. ...
- Prioritize all your assets and know which should be handled first. ...
- Implement clear guidelines and well-defined standard operating procedures (SOPs) for each repeated operation.
How Much Does Downtime Cost a Company? The average cost of downtime is significant. Each minute costs an average of $9,000, according to the Ponemon Institute, bringing the downtime cost per hour to over $500,000.
What are the two major considerations when calculating the cost of downtime? ›Calculating Downtime Cost
The duration of the downtime and the cost incurred per minute you're offline are the two variables that most affect the financial impact of an outage.
For example, in the auto industry, downtime can cost up to $50,000 per minute. That's $3 million per hour. The true downtime cost includes a variety of wasted business support costs and lost business opportunity costs because resources were needed to resolve a downtime incident that probably didn't need to happen.
What is the industry standard for downtime? ›World Class Standards For Downtime
Aim for unscheduled downtime to be 10% or less.
Database outages can have a significant impact on top line revenue. In fact, according to a survey conducted by ITIC, 98% of organizations say a single hour of downtime costs over $100,000, while 81% report that it costs over $300,000. And that's just for a single hour!
What is the average cost of downtime in a data center? ›According to Gartner, downtime costs $5,600 per minute on average. This results in average costs between $140,000 and $540,000 per hour depending on the organization. Some factors that contribute to the costs associated with downtime include: Lost sales.