Preventing data center outages: common causes & solutions

High-profile outages show that no one is secure. We look at risk points, recovery procedures, and prevention techniques for greater data center security.

The TL;DR
• Data center outages are unacceptable for businesses and consumers but still possible
• The main causes are battery and UPS failures, cyber attacks, and human error
• These risks can be mitigated, but any data center will have unforeseen vulnerabilities and potential ‘cascade effects’
• Multi-vendor expertise from independent IT experts can help prevent system weaknesses and plan for efficient disaster recovery scenarios

In October 2021, Facebook, Instagram, and WhatsApp went down for over 5 hours. For companies at this level, there is an expectation that service will never be disrupted, and if it is, it will be restored in minutes, not hours.
The cause of the outage was reported to be a change in the configuration of the backbone routers that coordinate network traffic between the company’s data centers. This had a cascade effect, bringing all Facebook services to a halt.

If companies of this size have ‘cascade vulnerabilities’ that can lead to total downtime, anyone can. There are preventative measures you can take to reduce your risk and reactive strategies to reduce the impact of unexpected events.

What are the common causes of data center downtime and its costs?

The 11th Global Data Center Survey revealed that the most common causes of outages have stayed consistent for the past couple of years.

On-site power (electricity) remains the leading cause of outages, increasing from 37% to 43% of failures. The next three leading causes are network issues, cooling failures, and software or IT systems errors, which account for 14% of outages each. Though they remain rare, outages due to SaaS hosting, cybersecurity, and third-party cloud providers went from 7% to 11%.

The Electric Power Research Institute (EPRI) shows that 98% of outages last less than 10 seconds. But outages of this duration can still have a significant financial impact.

From 2019 to 2020, the share of outages with a total cost of downtime under $100,000 dropped from 60% to 39%. During the same period, the share of outages costing between $100,000 and $1 million rose to 47%. Even an outage that costs less than $100,000 can have a very significant impact, depending on the company’s size. Therefore, taking measures to protect your company against foreseeable outages is important.

How to protect your company against foreseeable outages?

Understanding these common causes of outages can help you implement pre-emptive strategies to reduce risk. Let’s have a look at how:

Battery and UPS failures

First, it is crucial to track battery age. Lead-acid batteries use a chemical process that involves interaction between positive and negative lead plates and a sulfuric acid gel electrolyte. As the battery ages, the impedance rises due to a build-up of sulfate crystals. This causes its electrical performance to deteriorate and increases the risk of power failure.

The ambient temperature around the battery is also important, as the higher the temperature, the faster the battery ages. The ideal temperature is 20-25˚C, and as a common rule of thumb, the expected service life halves for roughly every 10˚C above that range. Temperatures vary considerably within server rooms and even within racks. To combat this, batteries can be placed in a separate room with more stable temperature control.
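
As a rough illustration of how temperature shortens battery life, the sketch below estimates expected service life from the average ambient temperature. It assumes a simple halving rule of thumb with a 25˚C reference and a 10˚C halving interval. Both values, the function name, and the 5-year design life in the example are illustrative assumptions, so substitute the figures from your battery manufacturer's datasheet.

```python
def estimated_battery_life_years(design_life_years, ambient_c,
                                 reference_c=25.0, halving_step_c=10.0):
    """Estimate service life with a simple halving rule of thumb.

    Assumption: life halves for every `halving_step_c` degrees above
    `reference_c`. At or below the reference, the design life is returned.
    """
    if design_life_years <= 0 or halving_step_c <= 0:
        raise ValueError("design life and halving step must be positive")
    excess_c = max(0.0, ambient_c - reference_c)
    return design_life_years * 0.5 ** (excess_c / halving_step_c)


# Example: a battery with a hypothetical 5-year design life
# at three different average room temperatures.
for temp_c in (22, 30, 35):
    print(temp_c, round(estimated_battery_life_years(5, temp_c), 2))
```

The halving interval is kept as a parameter because the right value depends on the battery chemistry and the manufacturer's own derating data.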

When the data center is cooled to between 20 and 25˚C, it is also necessary to consider the effect of humidity. Humidity won’t affect battery performance, but high humidity can cause corrosion, meaning that parts such as cooling fans will need replacing more often.

As well as temperature, it’s important to monitor charge cycles. Most lead-acid batteries are designed to last 300-500 complete cycles. With an unstable mains power supply, the battery is discharged and recharged more often and may reach the end of its cycle life early.
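
One way to act on cycle monitoring is to compare the recorded cycle count against the rated cycle life and flag batteries that are approaching it. The sketch below is a minimal example. The default 300-cycle rating takes the lower end of the 300-500 range, while the 80% warning threshold and all names are assumptions chosen for illustration.

```python
def cycle_life_status(completed_cycles, rated_cycles=300, warn_fraction=0.8):
    """Return (fraction_used, status) for a battery's cycle life.

    rated_cycles: conservative rating (lower end of the 300-500 range).
    warn_fraction: hypothetical threshold at which to plan a replacement.
    """
    if rated_cycles <= 0:
        raise ValueError("rated_cycles must be positive")
    if completed_cycles < 0:
        raise ValueError("completed_cycles cannot be negative")
    fraction_used = completed_cycles / rated_cycles
    if fraction_used >= 1.0:
        status = "replace"
    elif fraction_used >= warn_fraction:
        status = "plan replacement"
    else:
        status = "ok"
    return fraction_used, status


# Example: three batteries with different recorded cycle counts.
for cycles in (150, 270, 320):
    print(cycles, cycle_life_status(cycles))
```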

Finally, when it comes to battery precautions, the battery should be sized to accommodate future growth. The larger the battery’s Ah rating is compared to the load, the less likely it is to run at near capacity, and so the longer its runtime and working-life expectancy will be in practice.

Most uninterruptible power supply (UPS) batteries are sized for 80% of the capacity of the UPS system. If more load is added, the theoretical maximum runtime of the battery will fall. UPS and batteries should therefore be sized to account for future growth in capacity. If they are not, running the system may shorten the working-life expectancy and make failure more likely earlier in the life cycle.
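
The sizing point can be checked with simple arithmetic: project the load forward at an expected growth rate and see when it crosses the 80% level mentioned above. The sketch below assumes compound annual growth. The function name, the 10% growth rate, and the kW figures in the example are hypothetical.

```python
def years_until_threshold(current_load_kw, ups_capacity_kw,
                          annual_growth=0.10, threshold=0.80, max_years=25):
    """Return the first whole year in which the projected load exceeds
    threshold * ups_capacity_kw, or None if it stays below within max_years.

    Assumption: load grows by a fixed percentage each year.
    """
    if ups_capacity_kw <= 0 or current_load_kw < 0:
        raise ValueError("capacity must be positive and load non-negative")
    limit_kw = threshold * ups_capacity_kw
    load_kw = current_load_kw
    for year in range(max_years + 1):
        if load_kw > limit_kw:
            return year
        load_kw *= 1 + annual_growth
    return None


# Example: a 60 kW load on a 100 kW UPS, growing 10% per year.
print(years_until_threshold(60, 100))
```

A result of 0 means the load is already above the threshold today, which suggests the UPS and batteries need resizing before any further growth.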

Cyberattacks

Falling victim to a cyber-attack is now highly probable, and these attacks have become a growing cause of outages across all kinds of companies and data types. As mentioned above, the share of outages in this category rose by 4 percentage points, from 7% to 11%, between 2019 and 2020.
Having a comprehensive prevention strategy will deter attacks and prevent businesses from counting the cost of downtime and the fallout from a malicious attack.

The best insurance against outages from cyber-attacks is to perform regular system audits and ensure all compliance certifications are up to date. Automating security management will help detect attacks, and simplifying patch management can also help prevent unplanned outages due to cybercrime.

The difficulty in preventing cyber-attacks is that as soon as an effective defense is found, attackers adapt their strategy to circumvent it. DDoS protection solutions can help to defend against more sophisticated attacks. It is also advisable to have specialists on the team who work proactively to prevent these outages and resolve them when they occur.

Human error

Finally, there is no way to avoid human error entirely, so conducting regular and comprehensive training for all data center staff is the best way to reduce instances and severity of errors.

Methods of procedure (MOPs) for performing complex actions should be documented to minimize errors. Documented, step-by-step, task-oriented procedures reduce the risk of mistakes when performing maintenance. Don’t limit the procedure to one vendor, and ensure backup plans are included in case of unforeseen events.

Keeping all operational procedures up to date and following the system’s instructions is critical. Switching devices and facility one-line diagrams must be labeled correctly to ensure the correct sequence of operation.

One of the more common causes of outages is accidental equipment shut-off. Ensure that only experienced professionals monitor, maintain, and manage the power and infrastructure 24/7.
Any individual with access to the data center, including IT, emergency, security, and facility personnel, should be given basic equipment training so this doesn’t happen. Having a sign-in policy will allow data center managers to know who is entering and exiting the facility, preventing inexperienced personnel from having the opportunity to increase risk inadvertently.

How to prepare for unforeseeable outages?

With the risks from the common causes of outages mitigated as much as possible, data center managers should then have a plan for unexpected events. Below, we discuss some of the ways data center managers can ensure that the effects, fallout, and financial impact will be minimized if outages do happen.

Hybrid maintenance with multi-vendor expertise

With so many variables affecting the potential lifecycle of various equipment, it’s important to have maintenance with multi-vendor and multi-discipline expertise that can accurately assess the vulnerabilities of a system rather than individual assets.

Third-party IT maintenance can provide more holistic IT lifecycle management that maintains assets beyond their traditional EOL (End-of-Life) with a single point of contact. This is meant to increase the life of your assets and protect your infrastructure from vulnerabilities and cascade effects because the infrastructure is treated as a cohesive unit.

Colocation and disaster recovery for mission-critical applications

Even so, it is essential to remember that no infrastructure can be entirely secure against outages. Organizations with mission-critical applications and sensitive enterprise data should consider colocating in a fully redundant and compliant data center with an excellent uptime track record. Colocation facilities are designed with resilient critical systems, redundant battery backup and cooling systems, and experienced professional data center managers.

Backing up data off site is one of the foundations of a good disaster mitigation service. If data assets are stored in a single location, organizations open themselves up to a variety of risks, including increased downtime, ransomware attacks, and, in a worst-case scenario, data loss.

Should the unforeseeable occur, it’s important to have a strategy to recover a data network, integrate new equipment to replace damaged infrastructure, and put in place a ‘swing environment’ to preserve operations and data availability to meet business needs in the short term.

The rise of independent and vendor-agnostic IT services will help companies protect and optimize their hybrid IT infrastructures. Multi-discipline IT experts can spot critical weaknesses and put in place contingency and recovery plans across their systems. A centralized service that provides a unified, single point of entry for any hybrid infrastructure helps to protect against outages.

At Ynvolve, we understand the impact that an outage can have on a company. Our maintenance experts are ready to help you find ways to make your equipment more resilient to unforeseeable outages.
