Availability and Reliability of a Data Center

How many of you are really familiar with the two terms ‘availability’ and ‘reliability’? How are they related to data center services, and why do they matter? Let us discuss the availability and reliability of a data center.

Availability

Availability, in its simplest terms, is the degree to which a system or component is operational and accessible when required for use. It can be viewed as the likelihood that the system or component is in a state to perform its required function under given conditions at a given instant in time. In our rapidly changing business world, highly available systems and processes are of critical importance and are the foundation upon which successful businesses rely. So much so that, according to the National Archives and Records Administration in Washington, D.C., 93% of businesses that lost availability in their data center for 10 days or more filed for bankruptcy within one year. The cost of one episode of downtime can cripple an organization.

Take, for example, an e-business. In the case of downtime, not only could it lose thousands or even millions of dollars in revenue, but its top competitor is only a mouse-click away. The loss therefore translates not only into lost revenue but also into lost customer loyalty. Because of this, the challenge of maintaining a highly available and reliable network is no longer just the responsibility of the IT department; it extends to management and department heads, as well as the boards that govern company policy. For this reason, a sound understanding of the factors that lead to high availability, the threats to availability, and the ways to measure availability is imperative regardless of your business sector.

Impact on Business Value

In order to understand the actual impact of unavailability and how it can cripple an organization, you first need to understand the business value of an organization. In management, business value is an informal term that includes all forms of value that determine the health and well-being of the firm in the long run. Let us take a deeper dive into the facts of availability and the business impact it can create.

When the business value of an organization is high, the impact of unavailability will also be high. How can we measure the value of a business? What factors does business value depend on? How is business value related to availability? Let us find the answers to these questions; they will open your eyes to the importance of high availability.

Business value for an organization, in general terms, is based on three core objectives:

  1. Increasing revenue
  2. Reducing costs
  3. Better utilizing assets

Regardless of the line of business, these three objectives ultimately lead to improved earnings and cash flow. Measuring business value begins with an understanding of the Physical Infrastructure: the racks, power, cooling, fire prevention/security, management, and services that form the foundation upon which Information Technology (IT) and telecommunication networks reside. Investments in Physical Infrastructure are made because they directly and indirectly impact these three business objectives. Managers purchase items such as generators, air conditioners, physical security systems, and Uninterruptible Power Supplies to serve as “insurance policies.” For any network or data center, there are risks of downtime from power and thermal problems, and investing in Physical Infrastructure mitigates these and other risks. So how does this impact the three core business objectives above (revenue, cost, and assets)? When systems are down, revenue streams are slowed or stopped, business costs and expenses are incurred, and assets are underutilized or underproductive. Therefore, the more effective the strategy is in reducing downtime from any cause, the more value it has to the business in meeting all three objectives.

Historically, the assessment of Physical Infrastructure business value was based on two core criteria: availability and upfront costs. Increasing the availability (uptime) of the Physical Infrastructure system and ultimately of the business processes allows a business to continue to bring in revenues and better optimize the use (or productivity) of assets. Imagine a credit card processing company whose systems are unavailable – credit card purchases cannot be processed, halting the revenue stream for the duration of the downtime. In addition, employees are not able to be productive without their systems online. And minimizing the upfront cost of the Physical Infrastructure results in a greater return on that investment. If the Physical Infrastructure cost is low and the risk/cost of downtime is high, the business case becomes easier to justify.

While these arguments still hold true, today’s rapidly changing IT environments dictate an additional criterion for assessing Physical Infrastructure business value: agility. Business plans must be agile to deal with changing market conditions, opportunities, and environmental factors. Investments that lock up resources limit the ability to respond in a flexible manner, and when this flexibility or agility is not present, a lost opportunity is the predictable result.

Reliability

Reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time. Availability is determined by a system’s reliability as well as by its recovery time when a failure does occur. When systems have long continuous operating times, failures are inevitable, so when a failure does occur, the critical variable becomes how quickly the system can be recovered. In the data center, a reliable system design is the most critical variable, but when a failure occurs, the most important consideration is getting the IT equipment and business processes up and running as fast as possible to keep downtime to a minimum.

Now we have seen what availability and reliability are. Before considering any availability or reliability figure, there is a key term you need to understand: ‘failure’. When a particular service is not available as expected, what should we call it? A failure, right? Moving forward without a clear definition of failure is like advertising the fuel efficiency of an automobile as “miles per tank” without defining the capacity of the tank in liters or gallons.

According to the IEC (International Electrotechnical Commission), there are two basic definitions of a failure:

  1. The termination of the ability of the product as a whole to perform its required function.
  2. The termination of the ability of any individual component to perform its required function but not the termination of the ability of the product as a whole to perform.

Five 9’s of Availability (99.999%)

A term that is commonly used when discussing availability is the ‘five nines’. Although widely used, the term is often misunderstood. Five nines refers to a network that is accessible 99.999% of the time. It can nevertheless be misleading, because the use of the term has become diluted.

Let’s take, for example, two data centers that are both considered 99.999% available. In one year, Data Center A lost power once, but the outage lasted a full 5 minutes. Data Center B lost power 10 times, but for only 30 seconds each time. Both data centers were without power for a total of 5 minutes. The missing detail is the recovery time. Any time systems lose power, there is a recovery period in which servers must be rebooted, data must be recovered, and corrupted systems must be repaired. This recovery process could take minutes, hours, days, or even weeks. Now, if you consider the two data centers again, you will see that Data Center B, with its 10 power outages, will actually accumulate a much longer total duration of downtime than the data center that experienced downtime only once, because it incurs the recovery time 10 times over. It is because of this dynamic that reliability is just as important as availability in this discussion. The reliability of a data center speaks to the frequency of failures in a given time frame; there is an inverse relationship in that as the operating period increases, reliability decreases. Availability, by contrast, is only the percentage of time the system is up in a given duration.
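
To make this arithmetic concrete, here is a minimal Python sketch that computes the downtime budget implied by five nines and compares the two data centers once a recovery period is added to each outage. The 10-minute recovery time per incident is an assumed figure used purely for illustration, not a value from the article.

```python
# Rough illustration of why outage frequency matters as much as outage duration.
# The per-incident recovery time below is an assumption for illustration only.

MINUTES_PER_YEAR = 365 * 24 * 60

# Downtime budget implied by "five nines" availability
five_nines_budget = MINUTES_PER_YEAR * (1 - 0.99999)
print(f"Five nines allows about {five_nines_budget:.2f} minutes of downtime per year")

recovery_per_incident = 10.0  # minutes per outage (assumed)

# Data Center A: one 5-minute outage; Data Center B: ten 30-second outages
outage_minutes_a, incidents_a = 5.0, 1
outage_minutes_b, incidents_b = 10 * 0.5, 10

total_a = outage_minutes_a + incidents_a * recovery_per_incident
total_b = outage_minutes_b + incidents_b * recovery_per_incident

print(f"Data Center A: {total_a:.0f} minutes of total downtime")  # 15 minutes
print(f"Data Center B: {total_b:.0f} minutes of total downtime")  # 105 minutes
```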

Measuring Availability and Reliability

How can we measure the availability and reliability of physical infrastructure, and how are they represented? The basic measure of a system’s reliability is called MTBF (Mean Time Between Failures). MTBF is the average period between two failures; in other words, it indicates how long an asset typically works before it breaks down. It is usually expressed in hours, and the higher the MTBF, the higher the reliability of the product. MTTR (Mean Time to Recover, or Repair) is the expected time to recover a system from a failure. This may include the time it takes to diagnose the problem, the time it takes to get a repair technician onsite, and the time it takes to physically repair the system. Like MTBF, MTTR is expressed in hours. MTTR impacts availability, not reliability: the longer the MTTR, the lower the availability. Simply put, if it takes longer to recover a system from a failure, the system is going to have lower availability. As MTBF goes up, availability goes up; as MTTR goes up, availability goes down.

The measurement of Availability is driven by time lost, whereas the measurement of Reliability is driven by the frequency and impact of failures. Mathematically, the Availability of a system can be treated as a function of its Reliability; in other words, Reliability can be considered a subset of Availability.
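
As a concrete illustration of that relationship, the commonly used steady-state formula expresses availability directly in terms of MTBF and MTTR: Availability = MTBF / (MTBF + MTTR). The sketch below applies it with assumed MTBF and MTTR values chosen purely for illustration.

```python
# Steady-state availability in terms of MTBF and MTTR:
#     Availability = MTBF / (MTBF + MTTR)
# The MTBF/MTTR figures below are assumed values for illustration only.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is expected to be operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A higher MTBF (more reliable) or a lower MTTR (faster recovery)
# both push availability up.
print(f"{availability(10_000, 1):.5%}")   # ~99.99000%
print(f"{availability(10_000, 8):.5%}")   # ~99.92006%
print(f"{availability(50_000, 1):.5%}")   # ~99.99800%
```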

Factors that Affect Availability and Reliability

It should be obvious that there are numerous factors that affect data center availability and reliability. Some of these include AC power conditions, lack of adequate cooling in the data center, equipment failure, natural and artificial disasters, and human error. Let’s look into each of these factors in detail:

Power Conditions

Let’s look first at the AC power conditions. Power quality anomalies are organized into seven categories based on wave shape:

  1. Transients
  2. Interruptions
  3. Sag / Undervoltage
  4. Swell / Overvoltage
  5. Waveform distortion
  6. Voltage fluctuations
  7. Frequency variations

Inadequate Cooling

Another factor that poses a significant threat to availability is a lack of cooling in the IT environment. Whenever electrical power is consumed, heat is generated. In the data center environment, where a large quantity of heat is generated, the potential exists for significant downtime unless this heat is removed from the space. Oftentimes, cooling systems are in place in the data center; however, if the cooling is not distributed properly, hot spots can occur. Hot spots within the data center further threaten availability. In addition, inadequate cooling significantly shortens the lifespan and reduces the availability of IT equipment. It is recommended that a hot aisle/cold aisle configuration be used when designing the data center layout. Hot spots can also be alleviated by properly sized cooling systems, supplemental spot coolers, and air distribution units.

Equipment Failures

The health of IT equipment is an important factor in ensuring a highly available system, as equipment failures pose a significant threat to availability. Failures can occur for a variety of reasons, including damage caused by prolonged exposure to poor utility power. Other causes include prolonged exposure to elevated or reduced temperatures, humidity, component failure, and equipment age.

Natural and Artificial Disasters

Disasters also pose a significant threat to availability. Hurricanes, tornadoes, floods, and the often subsequent blackouts that occur after these disasters all create tremendous opportunity for downtime. In many of these cases, downtime is prolonged due to damage sustained by the power grid or the physical site of the data center itself.

Human Error

According to Gartner Group, the largest single cause of downtime is human error or personnel issues. One of the most common causes of intermittent downtime in the data center is poor training. Data center staff or contractors should be trained on procedures for application failures/hangs, system update/upgrades, and other tasks that can create problems if not done correctly.

Another problem is poor documentation. As staff sizes have shrunk, and with all the changes in the data center due to rapid product cycles, it’s harder and harder to keep the documentation current. Patches can go awry as incorrect software versions are updated. Hardware fixes can fail if the wrong parts are used.

Another area of potential downtime is the management of systems. System Management has fragmented from a single point of control to vendors, partners, ASPs, outsource suppliers, and even a number of internal groups. With a variety of vendors, contractors and technicians freely accessing the IT equipment, errors are inevitable.

Cost of Downtime

At the start of this article, we saw the impact of downtime and how it can affect a business. Now let us understand the cost of downtime. It is important to understand the cost of downtime to a business and, specifically, how that cost changes as a function of outage duration. Lost revenue is often the most visible and easily identified cost of downtime, but it is only the tip of the iceberg when discussing the real costs to the organization. In many cases, the cost of downtime per hour remains constant: a business that loses at a rate of 100 dollars per hour in the first minute of downtime will still be losing 100 dollars per hour after an hour of downtime. An example of a company that might experience this type of profile is a retail store, where a constant revenue stream is present; when the systems are down, there is a relatively constant rate of loss.

Some businesses, however, may lose the most money in the first 500 milliseconds of downtime and then lose very little thereafter. For example, a semiconductor fabrication plant loses the most money in the first moments of an outage because, when the process is interrupted, the silicon wafers that were in production can no longer be used and must be scrapped.

Others yet may lose at a lower rate during a short outage (since revenue is not lost but simply delayed), and as the duration lengthens, there is an increased likelihood that the revenue will never be recovered. Regarding customer satisfaction, a short outage may often be acceptable, but as the duration increases, more customers become increasingly upset. An example of this might be a car dealership, where customers are willing to delay a transaction for a day. With significant outages, however, public knowledge often results in damaged brand perception and inquiries into company operations. All of these factors result in a downtime cost that accelerates quickly as the duration grows.

Direct and Indirect Costs

Costs associated with downtime can be classified as direct and indirect. Direct costs are easily identified and measured in terms of hard dollars. Examples include:

  1. Wages and costs of employees that are idled due to the unavailability of the network. Although some employees will be idle, their salaries and wages continue to be paid. Other employees may still do some work, but their output will likely be diminished.
  2. Lost revenues are the most obvious cost of downtime because if you cannot process customers, you cannot conduct business. Electronic commerce magnifies the problem, as eCommerce sales are entirely dependent on system availability.
  3. Wages and cost increases due to induced overtime or time spent checking and fixing systems. The same employees that were idled by the system failure are probably the same employees that will go back to work and recover the system via data entry. They not only have to do their ‘day job’ of processing current data, but they must also re-enter any data that was lost due to the system crash or enter new data that was handwritten during the system outage. This means additional hours of work, most often on an overtime basis.
  4. Depending on the nature of the affected systems, the legal costs associated with downtime can be significant. For example, if downtime problems result in a significant drop in share price, shareholders may initiate a class-action suit if they believe that management and the board were negligent in protecting vital assets. In another example, if two companies form a business partnership in which one company’s ability to conduct business is dependent on the availability of the other company’s systems, then, depending on the legal structure of the partnership, the first company may be liable to the second for profits lost during any significant downtime event.

Indirect costs are not easily measured but impact the business just the same. In 2000, Gartner Group estimated that 80% of all companies calculating downtime were including indirect costs in their calculations for the first time.

Examples include reduced customer satisfaction; the lost opportunity of customers that may have gone to direct competitors during the downtime event; damaged brand perception; and negative public relations.

Calculating Cost of Downtime

There are many ways to calculate the cost of downtime for an organization. For example, one way to estimate the revenue lost due to a downtime event is to look at normal hourly sales and then multiply that figure by the number of hours of downtime.

Revenue Lost = Normal Hourly Sales X Hours of downtime

Another example is the loss of productivity. The most common way to calculate the cost of lost productivity is to first take an average of the hourly salary, benefits and overhead costs for the affected group. Then, multiply that figure by the number of hours of downtime.

Lost Productivity = (Average Hourly Salary + Benefits and Overhead Costs) X Hours of downtime

Because companies are in business to earn profits, the value employees contribute is usually greater than the cost of employing them. Therefore, this method provides only a very conservative estimate of the labor cost of downtime. Remember, however, that this is only one component of a larger equation and, by itself, seriously underestimates the true loss.
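
As a rough illustration of the two formulas above, the sketch below plugs in assumed figures for normal hourly sales and for the affected group’s combined hourly labor cost; none of the numbers come from the article, and a real estimate would also include the indirect costs discussed earlier.

```python
# Minimal sketch of the two downtime-cost estimates above.
# All dollar figures and the outage length are assumed values for illustration.

def revenue_lost(normal_hourly_sales: float, downtime_hours: float) -> float:
    # Revenue Lost = Normal Hourly Sales x Hours of downtime
    return normal_hourly_sales * downtime_hours

def lost_productivity(group_hourly_cost: float, downtime_hours: float) -> float:
    # Lost Productivity = (Average hourly salary + benefits and overhead
    # for the affected group) x Hours of downtime
    return group_hourly_cost * downtime_hours

downtime_hours = 4.0                                         # assumed outage length
direct_revenue = revenue_lost(25_000, downtime_hours)        # assumed hourly sales
labor_cost = lost_productivity(9_000, downtime_hours)        # assumed group labor cost/hour

print(f"Estimated revenue lost:      ${direct_revenue:,.0f}")   # $100,000
print(f"Estimated lost productivity: ${labor_cost:,.0f}")       # $36,000
print(f"Conservative total (direct): ${direct_revenue + labor_cost:,.0f}")
```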

I hope you are now clear on the availability and reliability of a data center and its importance. Feel free to comment with any additional questions.

Knowledge Credits: Energy University by Schneider Electric
