Using Machine Learning to Protect Your Data Center from Future Thermal Runaway Situations

Oct 13

Welcome to Keep Your Cool—a series tackling simple cooling optimization strategies for busy data center operators.

Data centers today are dynamic environments that continuously evolve. Gone are the days when designing a data center, conducting a CFD (Computational Fluid Dynamics) analysis, and commissioning with load banks would suffice for the life of the facility. The rise of AI workloads, GPU-powered servers, and fluctuating computing demands has reshaped the landscape, particularly for older, operational data centers. As servers are replaced and cooling demands grow, the original airflow analysis may no longer be valid, leading to hidden hotspots and ongoing concerns about the facility’s resilience during cooling unit failures.

This evolving environment has been further complicated by supply chain disruptions. For instance, should a rooftop condenser fail and require repair or replacement, parts may not be readily available, with lead times of up to 12 to 18 months. This presents two significant challenges for data center operations:

Electrical loads have expanded to meet growing IT demand, stretching cooling capacity.
Backup cooling units may be compromised or unavailable, reducing redundancy and risking thermal runaway in case of cooling system failures.

In these scenarios, the operations team often scrambles to implement emergency measures, hoping to prevent overheating and avoid server failures. Unfortunately, these reactive solutions are often untested and ad hoc, and they put personnel under high stress as they work in firefighting mode. To avoid this chaos, we propose a method that allows the operations team to be fully prepared with contingency plans, integrating them seamlessly into the facility’s Method of Procedure (MOP) and Standard Operating Procedures (SOP).

This technique ensures a proactive approach to maintaining data center resiliency, reducing the risk of downtime during cooling system failures, even in operational facilities where modifications to airflow and cooling performance are complex. It also lets users plan and execute preventive maintenance on critical cooling units, knowing how long the maintenance window can be without putting the facility at risk.

The Importance of Machine Learning in Data Center Resiliency

For greenfield data centers, CFD modeling can be applied before commissioning to simulate airflow and cooling capacity under various conditions. However, for operational data centers—where servers are energized and airflow patterns are already established—CFD models may no longer reflect current conditions. This is where machine learning becomes a powerful tool. By combining real-time environmental monitoring with predictive modeling, data centers can develop contingency plans without compromising service availability.

Let’s consider an example: A 20,000 sq. ft. data hall with 250 racks, commissioned 10 years ago. Over the years, the facility has undergone several rounds of server retrofits, increasing power consumption and cooling demand. The facility was originally commissioned with N+1 cooling redundancy (one backup air handler unit), but one of the rooftop chillers has since failed and, although it has been repaired, its reliability remains uncertain. The data center is operational, but the lack of resiliency leaves it vulnerable: one cooling unit failure could put the entire facility at risk.

Developing a Machine Learning-Driven Resiliency Plan

The following method outlines how an operational data center, like the one in our example, can develop a robust resiliency plan using real-time monitoring and machine learning:

Implement Real-Time Temperature Monitoring
A critical first step is the ability to monitor inlet air temperatures at multiple heights in real time. While some data centers are equipped with Building Management Systems (BMS) or Data Center Infrastructure Management (DCIM) systems capable of this, many are not. In such cases, Purkay Labs offers portable, Wi-Fi-enabled stands that provide real-time monitoring at strategic points across the facility. This enables comprehensive data collection without disrupting operations.
Establish a Temperature Threshold
Set a maximum allowable inlet air temperature for the data center. This could follow the ASHRAE recommended limit of 80.6°F (27°C) or a custom threshold suited to the facility's specific needs.
Baseline Data Collection
Using either the BMS/DCIM system or Purkay Labs’ Wi-Fi-enabled stands, collect baseline temperature data from key locations across the data center. This provides a snapshot of how the facility is performing under normal operating conditions.
Simulate Air Handler Failures with Machine Learning
The next step involves simulating an air handler failure. With Purkay Labs’ machine learning software running, the team initiates a controlled shutdown of one air handler. The software then calculates, in real time, how long it will take for each sensor to reach the critical temperature threshold. Projections are updated every minute based on additional temperature measurements, allowing the system to accurately forecast potential overheating events.
Monitor Time Windows
Once the software predicts that a sensor is less than two hours from reaching the critical temperature, the air handler that was shut down is turned back on. This process is repeated for each air handler unit, with time windows for all sensors recorded. The key point is that the inlet air temperature never reaches the critical limit, as the machine learning algorithm predicts the time remaining before thermal runaway occurs, allowing proactive intervention.
Identify Vulnerable Areas
By conducting this exercise, the operations team can identify the most vulnerable locations within the data center. For example, certain areas may have a much shorter time window before overheating, while others may offer several hours or even days of buffer time. With this information, the team can plan for emergency air handling rentals or pre-order replacement units if necessary.
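
The projection step described above can be illustrated with a rough sketch. The article does not say which model Purkay Labs’ software uses, so the example below substitutes a plain least-squares linear trend as a stand-in. The function names, the one-reading-per-minute input, and the 120-minute restore margin (taken from the two-hour rule above) are assumptions made for illustration only. Real thermal responses are rarely linear, so a straight-line estimate should be treated as approximate.

```python
from typing import Optional, Sequence

ASHRAE_LIMIT_F = 80.6  # recommended inlet limit cited in the article


def minutes_to_threshold(
    times_min: Sequence[float],
    temps_f: Sequence[float],
    threshold_f: float = ASHRAE_LIMIT_F,
) -> Optional[float]:
    """Estimate minutes from the latest reading until the threshold is reached.

    Illustrative stand-in only: fits a straight line to the readings and
    extrapolates. Returns 0.0 if the latest reading is already at or above
    the threshold, and None if the fitted trend is flat or falling.
    """
    n = len(times_min)
    if n < 2 or n != len(temps_f):
        raise ValueError("need at least two paired readings")
    if temps_f[-1] >= threshold_f:
        return 0.0

    mean_t = sum(times_min) / n
    mean_y = sum(temps_f) / n
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    if sxx == 0:
        raise ValueError("timestamps must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_min, temps_f))

    slope = sxy / sxx  # degrees F per minute
    if slope <= 0:
        return None  # not warming, so no crossing is projected

    intercept = mean_y - slope * mean_t
    crossing_time = (threshold_f - intercept) / slope
    return max(crossing_time - times_min[-1], 0.0)


def should_restore(minutes_left: Optional[float], margin_min: float = 120.0) -> bool:
    """Apply the two-hour rule: restore cooling once the projection drops below the margin."""
    return minutes_left is not None and minutes_left < margin_min
```

In use, a function like `minutes_to_threshold` would be called again each minute with the latest readings for every sensor, and the air handler would be turned back on as soon as `should_restore` returns true for any of them.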

Benefits of a Machine Learning-Based Approach

This technique offers several key benefits:

Proactive Identification of Hot Spots: The system pinpoints areas of the data center most susceptible to overheating, allowing the operations team to take targeted actions in the event of a cooling failure.
Measured Response Times: With detailed time windows for each sensor, the team knows exactly how long they have before temperatures reach critical levels. This enables better decision-making and reduces panic during emergencies.
Emergency Planning: Pre-planning for specific failure scenarios becomes easier. For instance, if a particular air handler is predicted to cause overheating within hours, the team can rent a backup air handler in advance, ensuring continuity of operations. In some cases, stand-by equipment orders may be necessary, depending on the vulnerability levels identified.
Organized Preventive Maintenance Windows: The data will enable the user to know exactly how long the time window should be to perform maintenance on critical cooling units without putting the facility at risk.
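
As a rough illustration of how the recorded time windows could feed these decisions, the sketch below takes one record per air handler and sensor and reports, for each air handler, the sensor that reaches the threshold first. The record layout and function names are hypothetical and are not part of any Purkay Labs product. The shortest window per air handler is taken as the upper bound on a safe maintenance window for that unit.

```python
from typing import Dict, List, Tuple

# (air handler id, sensor id, projected minutes until the threshold is reached)
WindowRecord = Tuple[str, str, float]


def worst_case_by_air_handler(records: List[WindowRecord]) -> Dict[str, Tuple[str, float]]:
    """For each air handler, find the sensor with the shortest time window."""
    worst: Dict[str, Tuple[str, float]] = {}
    for air_handler, sensor, minutes in records:
        if air_handler not in worst or minutes < worst[air_handler][1]:
            worst[air_handler] = (sensor, minutes)
    return worst


def rank_by_risk(worst: Dict[str, Tuple[str, float]]) -> List[Tuple[str, str, float]]:
    """Order air handlers from shortest to longest worst-case window."""
    ranked = sorted(worst.items(), key=lambda item: item[1][1])
    return [(air_handler, sensor, minutes) for air_handler, (sensor, minutes) in ranked]
```

The air handlers at the top of the ranked list are the ones whose loss leaves the least time to react, which makes them the natural candidates for pre-arranged rental units or stand-by equipment orders.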

Conclusion

Machine learning offers a powerful solution for preparing data centers to handle cooling system failures. By simulating failure scenarios and projecting thermal responses in real time, data center operators can replace ad-hoc emergency measures with a robust, data-driven contingency plan.

At Purkay Labs, we provide the tools and expertise to help data centers maintain optimal environmental conditions. Whether through our Wi-Fi-enabled monitoring systems or customized machine learning software, we empower data center teams to make informed decisions, protect their assets, and ensure uptime—even under the most challenging conditions.

About Purkay Labs

Purkay Labs specializes in environmental monitoring solutions that help data centers optimize airflow, temperature, and cooling performance. Our advanced tools, including real-time monitoring systems and machine learning algorithms, provide data center operators with the insights they need to prevent thermal runaway and ensure the continuous operation of critical infrastructure.

Purkay Labs https://www.purkaylabs.com