What are the risks that may affect the availability of a data center?
The availability of a data center is the proportion of time it operates without failure; the goal is to keep that uptime as high as possible. Availability is determined by a system’s reliability and its recovery time. Because system downtime can have a major impact on a business, it is necessary to know which risks may affect the availability of a data center. This understanding can also give you ideas for precautions that help you avoid incidents in your own data center environment.
Generally, these risk factors can be divided into four categories:
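The link between reliability, recovery time, and availability can be made concrete with a small calculation. The sketch below assumes that reliability is expressed as mean time between failures (MTBF) and recovery time as mean time to repair (MTTR); those terms, the function names, and the sample figures are illustrative assumptions, not values from any survey mentioned here.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap years for simplicity


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures: one failure every 8,000 hours, 4 hours to recover.
    a = availability(mtbf_hours=8_000, mttr_hours=4)
    print(f"availability: {a:.5f}")
    print(f"expected downtime per year: {annual_downtime_hours(a):.2f} h")
```

Under this model, halving the recovery time improves availability about as much as doubling the time between failures, which is why recovery procedures matter as much as reliable equipment.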
- Nature
- Human
- Utility
- Equipment
Nature
This factor has one of the biggest impacts on the availability of data centers. We can’t predict nature, which may change at any time and cause a complete disaster. Natural threats include tornadoes, hurricanes, flooding, earthquakes, and so on. Humans have very little control over natural calamities, so they can seriously affect the availability of a data center. Maintaining data access in the event of a disaster can mean the difference between a company’s success or failure. So let us have a look at some incidents that occurred at various companies and their data centers.
- Lightning: They say lightning doesn’t strike the same place twice, but in 2015 one of Google’s European data centers was struck by lightning not once, but four times, causing errors in 5% of the disks responsible for Google Compute Engine (GCE) instances. Although the company restored many of the drives, an estimated 0.000001% of data stored in the data center was irrecoverably lost. While that might not sound like much, try telling that to the customers who were affected by it.
- Hurricanes: According to National Geographic, 2017 was the most expensive hurricane season in U.S. history, costing roughly $200 billion. With their combination of high winds, storm surge, and heavy rains, hurricanes are one of the most dangerous natural disasters data centers must contend with. The sudden flooding resulting from Hurricane Sandy in 2012 caused extensive data center outages in New York and New Jersey. These failures were made even worse by the fact that backup systems were located in the same geographic region and were knocked out by the same weather event.
- Tornadoes: A devastating 2011 tornado ripped through several hospital buildings in Joplin, Missouri, one of which was a data center. While none of the data lost was mission-critical, that was only because most of the information stored there had been migrated to a new offsite data center just a few weeks earlier. Hospital officials noted that if the tornado had hit a month earlier, the data loss would have been catastrophic and rendered the hospital completely inoperable.
- Flooding: Severe flooding in Leeds, UK caused a Vodafone data center to temporarily lose power during Christmas of 2015. While data loss was negligible, the power outage disrupted mobile phone service temporarily. Vodafone, of course, has a bit of history with flooding, having suffered one of the most infamous data center disasters when its Istanbul data center was devastated by flooding in 2009.
- Earthquakes: So far, data centers have been lucky. Modern architectural standards and additional precautions (such as special enclosures and rollers for server racks) have gone a long way towards protecting data centers from earthquakes, even in high-risk areas.
- The Unexpected…: Disaster planning is all about expecting the unexpected. Take, for instance, the squirrel that knocked Yahoo’s Santa Clara data center offline for several hours in 2010, or the truck that drove into a transformer feeding power into a Rackspace data center in 2007.
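The Hurricane Sandy example shows why a backup site should not share a region with the primary site. A minimal sketch of such a check follows, using the haversine great-circle distance between two sites. The 300 km minimum separation and the site coordinates are illustrative assumptions, not a standard; the right distance depends on the hazards in your region.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sites_too_close(primary: tuple, backup: tuple, min_km: float = 300.0) -> bool:
    """True if the backup site is within min_km of the primary site.

    min_km is an assumed planning value, not an industry requirement.
    """
    return haversine_km(*primary, *backup) < min_km


if __name__ == "__main__":
    new_york = (40.7128, -74.0060)  # approximate city coordinates
    newark = (40.7357, -74.1724)
    chicago = (41.8781, -87.6298)
    print("New York / Newark too close:", sites_too_close(new_york, newark))
    print("New York / Chicago too close:", sites_too_close(new_york, chicago))
```

Distance alone is only a first filter. Two sites far apart can still share a flood plain, a power grid, or a network carrier.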
Human
According to a survey conducted by the Aperture Research Institute, human errors are behind 57.3% of all data center outages. The second most common reason was improper failover with 43.7%.
Here are the details of another survey as well.
According to the Uptime Institute, 70% of data center outages are due to human error and not to a fault in the infrastructure design. Furthermore, “mistakes” that led to an outage can often be traced to a poor decision by senior management.
The results from the two organizations may differ because the surveys were probably conducted on different entities and in different environments. Summarizing both surveys, we can conclude that human mistakes cause far more data center outages than any other factor. Here are some examples of human-caused data center issues:
- Activation of the emergency power-off (EPO) switch
- Adjusting the temperature from Fahrenheit to Celsius
- Pulling power cords out of equipment
- Overloading a circuit
- Not following standard policies or procedures
To minimize the risk of the “human factor” affecting operations, it is important to have up-to-date documentation on everything connected to your data center and manuals on how different critical operations should be performed. Manuals and documentation together with scheduled tests should help you avoid many of the problems and outages described in this survey.
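Some of the mistakes listed above, such as mixing up Fahrenheit and Celsius, can also be caught by a simple guard in whatever tool operators use to change settings. The sketch below is a hypothetical example: the function names are invented for illustration, and the default range of 18 to 27 °C is an assumed policy (it matches the commonly cited ASHRAE recommended inlet range, but you should use the limits from your own procedures).

```python
def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature to Celsius; unit must be 'C' or 'F'."""
    unit = unit.strip().upper()
    if unit == "C":
        return float(value)
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown temperature unit: {unit!r}")


def validate_setpoint(value: float, unit: str,
                      low_c: float = 18.0, high_c: float = 27.0) -> float:
    """Return the setpoint in Celsius, or raise if it is outside the allowed range.

    low_c and high_c are assumed policy limits, not fixed requirements.
    """
    celsius = to_celsius(value, unit)
    if not low_c <= celsius <= high_c:
        raise ValueError(
            f"setpoint {value} {unit} = {celsius:.1f} C is outside "
            f"the allowed range {low_c}-{high_c} C"
        )
    return celsius


if __name__ == "__main__":
    print(f"72 F accepted as {validate_setpoint(72, 'F'):.1f} C")
    try:
        validate_setpoint(72, "C")  # the classic unit mix-up
    except ValueError as err:
        print("rejected:", err)
```

Forcing the operator to state the unit explicitly, and rejecting values outside the documented range, turns a silent mistake into an immediate error message.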
Utility
In a data center, the primary utility is the electric power supplied by local providers (which can be government or private entities). The secondary power sources are the diesel generators and UPS systems. All other mechanical systems in the data center depend directly or indirectly on the availability of the utility.
An Uptime Institute survey finds that the power usage effectiveness of data centers is better than ever. However, the same survey also indicates that power outages have increased significantly. The Global Data Center Survey report from Uptime Institute gathered responses from nearly 900 data center operators and IT practitioners, both from major data center providers and from private, company-owned data centers (you can download the report from the link above).
Even though we build redundancy into all equipment, there is a chance that these machines will not work as expected during an incident. One example: diesel rotary uninterruptible power supply (DRUPS) systems were implicated in power disruptions that in 2014 affected Amazon Web Services in Sydney, a former Telecity facility called Sovereign House in London, now owned by Digital Realty Trust, and the Singapore Stock Exchange. The disruption at Amazon was caused by what the company called “an unusually long voltage sag.” If you go through this incident, you will see that the root cause of the outage was a utility failure after which the backup machines failed to start. Some of the failures commonly reported during data center incidents are listed below:
- The generator fails to start.
- The generator fails after X hours of running.
- Utility power partially fails (usually phase loss, where one of the three phases drops out).
- The UPS fails to switch to battery.
- The UPS fails to switch from battery back to input power.
From these incidents we can say that periodic checks and preventive maintenance are really important and go a long way toward limiting the impact of failures.
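One of the failures listed above, partial utility failure through phase loss, is easy to check for during routine monitoring. The following is a minimal sketch that assumes you already have a voltage reading per phase; the 230 V nominal value, the 90% threshold, and the phase names are illustrative assumptions and should be replaced with the figures for your own supply.

```python
def lost_phases(voltages: dict, nominal: float = 230.0,
                min_fraction: float = 0.9) -> list:
    """Return the names of phases whose voltage is below min_fraction of nominal.

    voltages maps a phase name to its measured voltage.
    nominal and min_fraction are assumed values for illustration.
    """
    limit = nominal * min_fraction
    return [phase for phase, volts in voltages.items() if volts < limit]


if __name__ == "__main__":
    readings = {"A": 231.0, "B": 12.0, "C": 229.5}  # hypothetical readings
    failed = lost_phases(readings)
    if failed:
        print("phase loss detected on:", ", ".join(failed))
    else:
        print("all phases within limits")
```

A check like this only reports the condition. Whether the generator starts and the UPS transfers correctly still has to be proven by scheduled tests under load.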
Equipment
As you know, data center infrastructure is a large collection of equipment, and its success depends on all of it working together efficiently. Any electrical, mechanical, cooling, networking, or server equipment can fail at an unexpected time. Whether it’s a server reaching the end of its five-year expected lifespan or a UPS backup battery dying before it should, equipment failure is one of the most common causes of data center outages.
With today’s powerful data center infrastructure management (DCIM) tools, facilities can monitor the overall health of their own equipment as well as colocated assets. While it may not be possible to predict every failure, sophisticated algorithms can monitor equipment performance continually to anticipate when the hardware is reaching the end of its lifecycle or is prone to break down. When these problems are identified, data center personnel can plan to switch out faulty or outdated equipment without having to take critical systems offline. With the right redundancies, backups, and emergency spares in place, even an unexpected failure can be managed without compromising network performance.
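To give a feel for how such monitoring can anticipate a failure, here is a deliberately simple sketch. It is not how any particular DCIM product works. It fits a straight line to periodic battery capacity readings and estimates how many days remain before the capacity falls below a replacement threshold. The 80% threshold, the function names, and the sample readings are assumptions made for illustration.

```python
def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("readings must be taken on at least two different days")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days: list, capacity_pct: list,
                         threshold_pct: float = 80.0):
    """Estimate days from the last reading until capacity drops to threshold_pct.

    Returns None when the readings show no downward trend.
    threshold_pct is an assumed replacement threshold.
    """
    if len(days) < 2 or len(days) != len(capacity_pct):
        raise ValueError("need at least two readings of equal length")
    slope, intercept = fit_line(days, capacity_pct)
    if slope >= 0:
        return None
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - days[-1])


if __name__ == "__main__":
    # Hypothetical monthly battery tests: day number and measured capacity in %.
    days = [0, 30, 60, 90]
    capacity = [100.0, 98.0, 96.0, 94.0]
    remaining = days_until_threshold(days, capacity)
    print(f"estimated days until replacement threshold: {remaining:.0f}")
```

Real equipment rarely degrades in a straight line, so a projection like this is best treated as an early warning that prompts an inspection, not as a precise replacement date.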
Knowledge Credits: www.vxchnge.com & www.pingdom.com
Have a comment or points to be reviewed? Knowledge is power and it increases by sharing. Feel free to comment.