Data Center Risk Management: A Comprehensive and Effective Plan

Companies with data centers need to prepare for multiple natural and unnatural risks while maintaining compliance.

By Gary Hilson November 7, 2024

Data center risk management is more complex than ever thanks to rapid adoption of artificial intelligence (AI), soaring energy costs, geopolitical upheaval, and increasingly onerous regulatory and compliance obligations.

These trends are expanding data center risk assessment practices, even as the fundamental tenets stay the same, and the shift comes at a time when data center capacity is a limited resource, said Harmail Chatha, who leads ESG efforts for Nutanix.

“It's a seller's market versus a buyer's market. There just isn't any capacity available,” he said.  “Hyperscalers are starting to take up all the data center capacity. They're buying dirt waiting for the data center.”

That’s just one of the emerging variables that affect data center risk management, while some best practices remain unchanged. Smart CIOs continue to treat data centers as capital assets, with their own budgeting, management objectives and periodic upgrade necessities, while also bearing in mind new headwinds spurred by environmental, social and governance (ESG) requirements, increased wildfire activity and new regulatory frameworks such as the Digital Operational Resilience Act (DORA), which joins existing compliance obligations such as privacy legislation.

In the meantime, adoption of cloud computing, mobile applications, the Internet of Things (IoT), end-user computing (EUC) and remote work continues unabated, which means IT leaders still must manage external and internal risks to avoid downtime that can cost millions of dollars a day.

What is data center risk management?

Since data centers in their bare form are physical facilities that house business-critical data and applications, the risks they face are immense, regardless of whether they’re built and run within the enterprise, managed by an MSP, or hosted off-site by a cloud service provider.

Effective data center risk assessment requires that IT leaders identify every potential threat by assessing the role people, practices and technology play in mitigating risks to a particular data center. Those risks can include power outages, natural disasters, region-specific rules and even political instability.

Risk assessment is followed by planning how to minimize each threat: its consequences, the available mitigations and solutions, and what they will cost. Any risk mitigation strategies must be implemented without disrupting data center operations and service delivery to customers.

Data center risk management isn’t possible without a thorough assessment.

Identifying and mitigating all-pervasive risks involves a process called integrated risk management (IRM). Gartner defines IRM as “a set of practices and processes supported by a risk-aware culture and enabling technologies that improve decision making and performance through an integrated view of how well an organization manages its unique set of risks.”

Organizations need the right tools and processes to monitor each moving part of the data center and deal with any risks that come up at any point in time, including malicious cyberattacks. Big data and analytics are instrumental in forming an accurate and comprehensive assessment of the risks to various operations that the data center enables, such as data access, application mobility, and DevOps. They also enable the implementation and execution of dynamic disaster recovery plans.

But it’s people and processes that play the central role in creating these plans – specialists such as IT admins who are responsible for day-to-day IT operations to ensure uptime, said Tuhina Goel, senior product marketing manager of business continuity and disaster recovery at Nutanix.

“But decision makers such as the CIO, VP or Director of IT are ultimately responsible for data center risk management,” Goel said. “They own the budget and other resources to invest in right security measures, tooling and employee training.”

For organizations that need to comply with legal, contractual, or regulatory requirements, periodic data center risk assessments and disaster testing are inevitable. Not having a risk management plan in place can lead to the whole data center going down because of a single point of failure anywhere in the architecture, leading to significant disruptions to operations and consequent losses in revenue.

Assessment comes before management

Any risk management plan needs to be in place before a disaster occurs. Risk assessment and auditing is the first step here. This begins with an evaluation of your existing owned and operated facilities from the point of view of facility design, IT architecture and topology, as well as operational sustainability.

It’s also important to learn from past outages by conducting a postmortem to find the root cause so that you can identify and address any inadequacies specific to the parts of the ecosystem that were affected. If the organization has a hybrid infrastructure with multiple data centers in place and there are plans for data center expansion or consolidation, each asset needs to be individually assessed for resiliency.

It helps to create a chart or sheet for handy reference that lists the major risk categories, mentions all the crucial systems each category affects, estimates the damage and recovery costs, and makes it clear what to do in case of an incident.
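The reference sheet described above can be sketched as a small data structure. This is only an illustration: the field names, the sample categories and every dollar figure are hypothetical placeholders, not numbers from any real assessment.

```python
from dataclasses import dataclass


@dataclass
class RiskEntry:
    """One row of the risk reference sheet (all field names are illustrative)."""
    category: str
    affected_systems: list[str]
    damage_cost: int      # estimated damage in dollars (placeholder value)
    recovery_cost: int    # estimated recovery cost in dollars (placeholder value)
    response: str         # what to do in case of an incident

    @property
    def total_cost(self) -> int:
        # Combined estimate used to rank the rows.
        return self.damage_cost + self.recovery_cost


def build_register(entries: list[RiskEntry]) -> list[RiskEntry]:
    """Return the rows ordered from most to least costly."""
    return sorted(entries, key=lambda e: e.total_cost, reverse=True)


# Hypothetical sample rows.
REGISTER = build_register([
    RiskEntry("Power outage", ["racks", "cooling"], 500_000, 100_000,
              "Switch to UPS, then start backup generators"),
    RiskEntry("Fire", ["racks", "storage"], 2_000_000, 500_000,
              "Evacuate, trigger suppression, fail over to secondary site"),
    RiskEntry("Ransomware", ["storage", "applications"], 1_000_000, 400_000,
              "Isolate affected hosts, restore from clean backups"),
])

if __name__ == "__main__":
    for entry in REGISTER:
        systems = ", ".join(entry.affected_systems)
        print(f"{entry.category}: {systems} | ${entry.total_cost:,} | {entry.response}")
```

Ranking by combined cost is one possible ordering; a team could just as reasonably sort by likelihood or by the criticality of the affected systems.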

Chatha said it’s critical to check in with IT teams to understand what data they are managing so it can be protected and accessible with the right mechanisms, including multi-factor authentication (MFA) and VPNs, as well as applying ISO controls as part of a holistic security mindset. 

Risk Assessment for Natural Disasters

Source: Redlands Disaster Plan, Australia

Enduring data center risks

It isn’t easy to categorize or even list out all the kinds of risks that a data center faces. Consequently, CTOs and IT teams have many uncertainties to worry about.

Geographic threats: Topographical and climate risks should be evaluated at the time of choosing a data center location and then again during the facility planning phase. If areas at higher risk of natural disasters such as earthquakes, hurricanes, floods, and bushfires can’t be avoided, consider the use of stronger construction material in the buildings to offset the risk.

Risk Map of the U.S.

Source: Alert Systems Group

Luckily, many natural disasters can be forecasted, and therefore, prepared for. Further, data centers built in cooler climates have natural, renewable options for energy savings and cooling, which is why Nordic countries are a popular destination for building data centers.

In addition to natural hazards, data center managers should also consider man-made dangers. Make sure airports, power grids, chemical plants, military bases, and water bodies are a safe distance away. On the other hand, it helps if there is a fire station, hospital, and police station nearby.

Risk Management: Factors of Data Center Location

Credit: Bob Landstrom

Power outage: Power disruption can pose an existential threat to a mission-critical data center. Organizations need to make sure there is enough resilience built in with UPS-backed power routes to each rack and cooling system. Having dual power sources with direct connection to a multi-substation power grid for the site is a minimal protection against local substation power failure. On top of that, backup generators can be on standby as a last resort.

Water seepage: Water is a double-edged sword for data centers. Even a few drops on critical hardware can cause irreparable and permanent damage. At the same time, water supply and storage for cooling and fire control systems needs to be maintained at optimal levels. 

Acoustics: Exposure to high-decibel sounds for prolonged periods of time is one of the most overlooked risks when building data centers. Hard drives and storage systems are particularly susceptible to loud sounds – high-frequency sound vibrations can significantly lower read and write performance, possibly compromising data quality and integrity.

It follows that the data center should be located far away from airports, arenas, and the like. Acoustic suppression technologies play a critical role in reducing equipment exposure to sonic shockwaves from high-decibel noise sources such as security and fire alarms or other apparatus on and around the premises.

Fire: Fires in data centers are mostly caused by power surges in the electrical equipment. One fire could destroy thousands of dollars’ worth of devices if not detected and put out immediately. In the early stages of a fire, the amount of smoke is so low that it can’t be detected by smoke detectors. Further, air conditioning and circulating systems disperse it quickly. The solution is Aspirating Smoke Detectors (ASD) that detect smoke at a very early stage and alert users as soon as minimum thresholds are crossed.

Security: Security failures in a data center could include anything from a network breach to sabotage and damage caused by individuals present at the site. One of the biggest threats is cyberattacks that result in leakage of account data or personally identifiable information (PII) belonging to customers.

Certain application or system failures may result in security personnel being unable to verify card holders’ identity or authorize them to go to certain areas. Video cameras and doors with access control might lose their connection to the central system too.

Breaches and threats caused by ransomware can only be dealt with using a multilayered approach to data protection, which has three aspects: prevention, detection, and recovery. Specific defense mechanisms include educating end users, regular vulnerability scanning, role-based access control, and regular data backups (the proverbial last line of defense).

System failure: This is where the greatest number of things can go wrong, and most often. It is important to identify and fix every single point of failure in the IT infrastructure that could affect the data center.

This starts with a resilient network architecture and connectivity. Redundant fiber optic connectivity is the gold standard for data centers. Then come servers with multiple tenants or multiple applications running on them. Clustering, mirroring, and duplication help in ensuring continuous access and delivery and minimize the possibilities of downtime.
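As a rough illustration of hunting for single points of failure, the sketch below takes an inventory that maps each infrastructure role to the components providing it, and flags any role with only one provider. The inventory layout, role names and component names are all hypothetical. A real audit would also have to account for shared dependencies between components.

```python
def single_points_of_failure(providers: dict[str, list[str]]) -> dict[str, str]:
    """Return every role served by exactly one component, with that component.

    `providers` maps a role (for example "uplink") to the list of components
    that can fulfil it. A role with a single provider has no redundancy.
    """
    return {
        role: components[0]
        for role, components in providers.items()
        if len(components) == 1
    }


# Hypothetical inventory for illustration only.
INVENTORY = {
    "uplink": ["fiber-a", "fiber-b"],        # redundant fiber connectivity
    "power_feed": ["substation-1"],          # only one feed: flagged
    "core_switch": ["switch-1", "switch-2"],
    "backup_target": ["backup-store-1"],     # only one target: flagged
}

SPOFS = single_points_of_failure(INVENTORY)

if __name__ == "__main__":
    for role, component in SPOFS.items():
        print(f"No redundancy for {role}: depends solely on {component}")
```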

Both security and system failures are influenced by the hardware running in the data center. “Hardware lifecycle management is one approach that we're taking to address security,” Chatha said, as gear that is near end of life isn’t supported by the latest and greatest operating systems. 

Modern HCI-powered data centers now pack everything together and deliver IT infrastructure as a resilient, secure, and self-healing platform.

Another risk is when software applications go rogue on the data center and take down systems and servers with them. IT needs to make sure that these applications can run seamlessly over the entire infrastructure without causing any glitches in servers located in the data center or any other environment.

Backing up data and files is a routine procedure for most organizations, but immediate recovery of real-time or transactional data in the event of downtime should be a priority for data centers. This is done in different ways in different companies according to the regulatory standards applicable to their industry. Again, by consolidating multiple backup solutions into a single turnkey platform such as Nutanix Mine, organizations can simplify data lifecycle management and get complete visibility and control over their data.

Poor Disaster Recovery planning: Identifying and minimizing any and all risks isn’t the end of the story. Any risk management plan worth its salt should know exactly what to do when (not if) disaster strikes and include a step-by-step recovery plan for every imaginable undesirable event. This starts with having systems in place that monitor key environmental factors and alert the concerned people when certain thresholds are crossed.

Failing this, the situation might quickly get out of hand and losses will escalate in the event of a sudden disaster.
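The monitoring-and-alerting idea can be sketched as a simple threshold check. The metric names and limits below are hypothetical placeholders rather than recommended operating ranges, and a real deployment would read values from sensors and route alerts to on-call staff instead of returning strings.

```python
# Hypothetical limits per metric: (lowest acceptable, highest acceptable).
THRESHOLDS: dict[str, tuple[float, float]] = {
    "temperature_c": (18.0, 27.0),
    "humidity_pct": (20.0, 80.0),
    "ups_battery_pct": (50.0, 100.0),
}


def check_readings(readings: dict[str, float]) -> list[str]:
    """Return one alert message for each reading outside its limits.

    Metrics without a configured threshold are ignored.
    """
    alerts = []
    for metric, value in readings.items():
        limits = THRESHOLDS.get(metric)
        if limits is None:
            continue
        low, high = limits
        if value < low or value > high:
            alerts.append(f"{metric}={value} is outside the range {low} to {high}")
    return alerts


if __name__ == "__main__":
    # Sample readings, invented for illustration.
    sample = {"temperature_c": 31.5, "humidity_pct": 45.0, "ups_battery_pct": 35.0}
    for message in check_readings(sample):
        print("ALERT:", message)
```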

Having a disaster recovery plan in place is essential for data center risk management, said Chatha. “If a primary location goes down, is it going to impact the entire company or subset of the company? How critical is that?”

Platforms that are flexible and automated are critical for non-disruptive recovery in the event of a disaster. Nutanix Xi Leap is a DR orchestration solution that is simple to deploy and manage, as well as adaptable to on-premises or cloud sites. It eliminates data silos and facilitates replication and recovery from a single user interface.

Managing New Complexities and Threats

The data center business is more dynamic thanks to emerging technologies and workloads, and data center risk management is now facing new challenges.

Energy demands: Costing data center power consumption has never been more important, especially with rising energy costs and power-hungry AI workloads. As Vince Kellen, CIO of the University of California, San Diego told EdTech’s Tom Mangan, “We’re seeing that with every wave of hardware expansion in the supercomputer center, the type of computing is much more intensive, both from an energy and a heat standpoint.”

The article notes that it’s not just AI that’s putting pressure on power: supercomputing workloads are becoming more intensive with each hardware generation, and data-driven processes are on the rise.

Kellen told EdTech that if energy consumption continues to climb at current rates, some areas of the United States may not have enough energy to keep the computers humming, with a state being advantaged or disadvantaged based upon how it regulates its cost of energy.

If you're doing business in Europe, your data center risk management must factor in rising energy costs across the continent.

Sustainability and ESG Demands: Despite increased power demands on data centers, owners and operators are expected to continue to focus on sustainability, which adds to risk management complexity. 

In 2024, carbon offsets have fallen out of favor, noted Fixate.IO technology analyst Christopher Tozzi in DataCenter Knowledge, with more investment required in "green AI infrastructure" within data centers, such as processors that are designed to reduce the energy consumed by AI workloads, as well as an increase in water efficiency. 

Reporting and compliance: Sustainability and ESG demands will increasingly require more metrics reporting and disclosures, especially around water use efficiency, wrote Tozzi, which means data center risk management must include methods and processes to release hard data about their sustainability outcomes. In 2023, Amazon Web Services became the first cloud provider to release metrics related to water use, setting a precedent for other data center operators, while California has also set another precedent by implementing new regulations that lay out climate-related disclosure requirements on data centers.

Not all regulatory compliance is related to environmental concerns. Privacy legislation across different jurisdictions, whether it’s the General Data Protection Regulation (GDPR) in Europe or the California Consumer Privacy Act (CCPA), places demands on security if companies are to avoid the penalties for non-compliance. In 2025, DORA, which includes cloud providers because they are considered critical third-party platforms, will also need to be factored into data center risk management.

Location, Location, Location

If there’s something all data center risk management factors have in common, it’s location – proximity to areas prone to more wildfire activity, higher energy costs and expensive real estate.

Data centers have traditionally been located close to the company’s headquarters, but it also makes sense to have them close to your company's IT staff, who will need to monitor them as part of a data center risk management strategy or meet with a third-party managed services provider.

Safety from natural disasters isn’t a new requirement, but increasing wildfire activity has the potential to encroach on areas that have previously been a safe location for a data center.

Chatha said once natural disasters are taken into account, whether it be wildfires or flood zones, data center risk management is heavily influenced by municipalities, including zoning and the local power grid – is the power available? 

Location also determines proximity to connectivity and the quality of network providers in the area – you need reliability and speed that keep latency low for your end users.

“Connectivity used to be the biggest barrier in site selection, but that's much easier to manage and deploy these days versus the power constraint,” Chatha said.

Real estate costs are also a consideration when locating your data center, not just for a new build but when considering expansions. If you use a third party managed service provider, their real estate pressures could affect the cost of your services as well as available capacity. 

Balancing the Ecosystem with Data Center Risk Management

A data center has a thousand moving parts. It itself is a cog in the organizational wheel, so to speak. One small misalignment upsets the whole equilibrium of the organization, across departments.

Risk mitigation, therefore, is a shared responsibility. Each employee or stakeholder can help keep the facility operating at its optimal level either by following or by enforcing the rules and learning how to do both better. IT leaders should know exactly what it costs, and where the money goes, to keep everyone trained and give them access to the resources they need to carry out any tasks that involve the data center. The responsibility falls on the CTO or CIO to set expectations and give clarity on these operations.

Of course, neither data centers nor the IT infrastructure itself functions in isolation. Spending money on data center risk management may not necessarily be a top priority for all managers – most departmental objectives pale in comparison to meeting revenue targets.

“Conflicting goals can be hard to address, but one of the most effective methods of doing so is to have a highly efficient process for continuously identifying where a risk resides. You also need a predictable, reliable method of updating systems without impacting the overarching business goals of the organization,” said Gavin Millard, VP of Product Marketing at Tenable.

And in a competitive seller’s market, data center risk management has become increasingly dynamic, with power now the key consideration, Chatha said. “Connectivity used to be the biggest barrier, but it’s much easier to manage and deploy these days versus the power constraint.”

Chatha has been in the data center industry for decades, and he said even the sellers are constrained by power companies. “Small guys like us are just kind of going wherever we can find a little bit of power here and there and deploy.”

This is an updated version of the article originally published on April 15, 2021. 

Gary Hilson has more than 20 years of experience writing about B2B enterprise technology and the issues affecting IT decision makers. His work has appeared in many industry publications, including EE Times, Embedded.com, Network Computing, EBN Online, Computing Canada, Channel Daily News, and Course Compare. Find him on Twitter.

Dipti Parmar wrote the original article. Find her on X @dipTparmar and LinkedIn.

© 2024 Nutanix, Inc. All rights reserved. For additional information and important legal disclaimers, please go here.
