Purkay Labs

Predictive Cooling Failure Tests

Welcome to Keep Your Cool, a series on simple cooling optimization strategies for busy data center operators.

Running a live data center comes with its own set of challenges, particularly maintaining optimal cooling without risking operational downtime. At Purkay Labs, we understand the importance of assessing your data center's cooling resilience while keeping everything fully operational. That's why we've developed a non-intrusive testing methodology that gives data center operators the insights they need to manage cooling systems effectively, even during potential failure scenarios.

Project Overview

Recently, Purkay Labs was engaged by a client to conduct a series of resiliency tests at their data center. The goal was to determine how well the data center could sustain operations and maintain specified environmental conditions in the event of cooling system failures. This is crucial for operators who need to guarantee uptime without the luxury of shutting down for assessments.

What We Did

Using our AUDIT-BUDDY systems equipped with 42 TH1 sensors placed at strategic locations across the data center, we initiated a series of controlled tests. These tests were designed to simulate failures of individual cooling units as well as various combinations of these units under actual operational conditions.

  1. Baseline Establishment: Before beginning the tests, we established a baseline temperature range of 62-64°F, which reflects typical operating conditions. The baseline serves as the reference against which the temperature rise during each simulated failure is measured.

  2. Linear Regression Analysis: During each test, we continuously collected temperature data and analyzed it using linear regression. The analysis predicted how long the temperature at critical server inlets would take to reach the ASHRAE-specified critical threshold of 80.6°F. This predictive approach shows, in real time, when corrective action would become necessary, without letting temperatures reach a point that risks data integrity or operational continuity.

  3. Real-Time Monitoring and Predictive Alerts: Our custom-built dashboard updated these predictions every minute, providing a dynamic tool for monitoring and managing the cooling environment efficiently. If the regression models indicated that the critical temperature would be reached within one hour, we would conclude the test early to prevent any risk to data center operations.
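The regression and early-stop logic described in steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the dashboard's actual implementation: it assumes an ordinary least-squares line fitted to a recent window of readings from a single sensor, with time in minutes and temperature in °F. The function names (fit_line, minutes_to_threshold, should_abort) and the sample readings are hypothetical; only the 80.6°F threshold and the one-hour cutoff come from the text above.

```python
from typing import Optional, Sequence, Tuple

CRITICAL_F = 80.6         # critical inlet threshold quoted in the article
ABORT_WINDOW_MIN = 60.0   # "within one hour" early-stop rule from the article


def fit_line(minutes: Sequence[float], temps_f: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit: temp = slope * minute + intercept."""
    n = len(minutes)
    if n < 2 or n != len(temps_f):
        raise ValueError("need at least two paired samples")
    mean_t = sum(minutes) / n
    mean_y = sum(temps_f) / n
    sxx = sum((t - mean_t) ** 2 for t in minutes)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(minutes, temps_f))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t


def minutes_to_threshold(
    minutes: Sequence[float],
    temps_f: Sequence[float],
    threshold_f: float = CRITICAL_F,
) -> Optional[float]:
    """Minutes from the latest sample until the fitted line hits the threshold.

    Returns 0.0 if the fitted line is already at or above the threshold,
    and None if the trend is flat or falling (the threshold is never reached).
    """
    slope, intercept = fit_line(minutes, temps_f)
    now = minutes[-1]
    if slope * now + intercept >= threshold_f:
        return 0.0
    if slope <= 0:
        return None
    return (threshold_f - intercept) / slope - now


def should_abort(eta_minutes: Optional[float]) -> bool:
    """Apply the early-stop rule: end the test if the threshold is under an hour away."""
    return eta_minutes is not None and eta_minutes <= ABORT_WINDOW_MIN


if __name__ == "__main__":
    # Hypothetical readings: one sample per minute, rising 0.1 °F per minute.
    t = [0, 1, 2, 3, 4]
    temps = [63.0, 63.1, 63.2, 63.3, 63.4]
    eta = minutes_to_threshold(t, temps)
    print(f"predicted minutes to {CRITICAL_F} °F: {eta:.0f}, abort: {should_abort(eta)}")
```

In a setup like the one described, a check of this kind would be rerun each minute for every sensor, and the shortest predicted time across sensors would drive the decision to end the test. A straight-line fit is only a rough model of room heating, so the window length and the one-hour margin matter more than the precision of any single prediction.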

Outcome and Insights

The tests revealed that the data center could sustain operations for 4-6 hours with one cooling unit down, without exceeding critical temperature thresholds. However, in scenarios where two units failed simultaneously, the operational window was reduced to 1-2 hours. These figures let operators plan and implement corrective measures, such as deploying additional cooling resources or adjusting operational loads.

Operational Recommendations

Based on our findings, we recommended several enhancements to improve resilience and operational stability:

  • Upgrade Cooling Units: Replacing aging equipment with newer, more efficient units to ensure reliability and extend operational resilience.

  • Improve Mobility of Cooling Units: Making cooling units such as the MovinCool easier to move, so that airflow can be redirected to critical areas during cooling failures.

  • Continued Monitoring: Implementing ongoing monitoring and periodic testing to adapt to changes in load and environmental conditions, ensuring that the cooling infrastructure remains robust against potential failures.

Conclusion

For data center operators grappling with the dual challenges of ensuring cooling efficiency and operational uptime, Purkay Labs offers a proven methodology that minimizes risk while maximizing data-driven decision-making. Our approach not only safeguards your operational integrity during tests but also enhances your long-term cooling strategy. If you're experiencing cooling issues or want to test your data center's resilience against cooling failures, reach out to see how we can help you maintain a cool head even under potential system failures.

Visit us at Purkay Labs to learn more about our innovative solutions for data center management and resilience testing.