Testing Cooling System Resiliency in a Live Data Center
Project Overview
In August 2024, Purkay Labs conducted a resiliency test at a major retailer’s data center. The data center had previously experienced issues with its cooling units and had purchased temporary units. The client wanted to address the following questions:
Do we have a back-up plan if one of our cooling units fails?
How much redundancy is built into the system in the event of a failure?
The resiliency test allowed the client to assess how effectively the facility could maintain operating conditions if a cooling unit failed.
What We Did
Purkay Labs first met with the data center team to establish a testing script and parameters. Since the goal was to simulate cooling unit failures, we needed to agree on the critical threshold at which a failure would be deemed to occur. The client adopted the ASHRAE guidelines and set the threshold at 80.6°F (27°C), the upper end of ASHRAE's recommended inlet air temperature range. The test script outlined all procedures, ensuring that both the data center team and Purkay Labs understood and agreed on the steps to be taken.
On-site, we conducted a baseline test to measure the current temperature with all three cooling units operational. We then performed a series of three tests, turning off one cooling unit at a time to observe the impact. During each test, we used our predictive analysis feature, which applies linear regression to the measured temperatures to estimate how long it would take to reach the critical threshold. The results were categorized into three scenarios:
The test would run indefinitely without reaching the critical threshold.
The temperature would approach the critical threshold within 5-7 hours.
The temperature would reach the critical threshold within 2 hours.
If the temperature approached the threshold (indicated by color changes to orange or red), the cooling unit was turned back on, and the event was recorded.
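To make the predictive step concrete, here is a minimal sketch of how a linear regression over temperature readings can be turned into a time-to-threshold estimate. It is an illustration only, not the AUDIT-BUDDY implementation: the function names, the sample readings, and the choice to measure time in minutes are all assumptions made for this example. Only the 80.6°F threshold comes from the test described above.

```python
from typing import Optional, Sequence, Tuple

# Critical threshold adopted by the client (from the test script).
THRESHOLD_F = 80.6


def fit_line(minutes: Sequence[float], temps_f: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope in °F/min, intercept in °F)."""
    n = len(minutes)
    if n < 2 or n != len(temps_f):
        raise ValueError("need at least two paired readings")
    mean_t = sum(minutes) / n
    mean_y = sum(temps_f) / n
    sxx = sum((t - mean_t) ** 2 for t in minutes)
    if sxx == 0:
        raise ValueError("readings must span more than one point in time")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(minutes, temps_f))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t


def minutes_to_threshold(
    minutes: Sequence[float],
    temps_f: Sequence[float],
    threshold_f: float = THRESHOLD_F,
) -> Optional[float]:
    """Estimated minutes from the last reading until the fitted trend
    reaches the threshold.

    Returns 0.0 if the last reading is already at or above the threshold,
    and None if the trend is flat or falling (the threshold is never reached).
    """
    slope, intercept = fit_line(minutes, temps_f)
    if temps_f[-1] >= threshold_f:
        return 0.0
    if slope <= 0:
        return None
    crossing = (threshold_f - intercept) / slope
    return max(crossing - minutes[-1], 0.0)


if __name__ == "__main__":
    # Hypothetical readings taken every 10 minutes after a unit is shut off.
    sample_minutes = [0, 10, 20, 30]
    sample_temps = [72.0, 72.5, 73.0, 73.5]
    remaining = minutes_to_threshold(sample_minutes, sample_temps)
    if remaining is None:
        print("Trend is flat or falling; threshold not expected to be reached.")
    else:
        print(f"Estimated time to threshold: {remaining:.0f} minutes")
```

With the hypothetical readings above, the temperature rises 0.05°F per minute, so the estimate comes out at about 142 minutes from the last reading. A flat or falling trend corresponds to the first scenario, where the test could run indefinitely. A straight-line fit is a simplification, because room temperature often levels off as it approaches a new equilibrium, so the estimate is best refreshed as new readings arrive.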
Outcome and Insights
The tests showed that the data center could operate for up to 4-6 hours with one cooling unit down without exceeding the critical temperature threshold. However, when two units failed simultaneously, the operational window was reduced to just 1-2 hours. These insights were invaluable for planning corrective measures, such as deploying additional cooling resources or adjusting operational loads to maintain optimal conditions.
Operational Recommendations
Based on the test results, we recommended the following improvements to enhance the cooling system’s resilience:
Upgrade Cooling Units: Replace older units with more efficient models to improve system reliability and extend operational stability.
Enhance Cooling Unit Mobility: Introduce portable cooling units, like MovinCool, to provide flexible airflow redirection to critical areas during cooling failures.
Continued Monitoring: Implement ongoing monitoring and periodic testing to ensure the cooling infrastructure remains robust in response to changing loads and environmental conditions.
Conclusion
For data center operators focused on ensuring both cooling efficiency and operational uptime, Purkay Labs offers a proven approach to resiliency testing that minimizes risk and supports data-driven decision-making. Our methodology safeguards operational integrity during tests and strengthens long-term cooling strategies. To learn more about how we can help you maintain stability during potential cooling system failures, visit our AUDIT-BUDDY and Resiliency Testing pages.
About Purkay Labs:
Purkay Labs specializes in thermal surveys for new and operational data centers. Our advanced tools, like AUDIT-BUDDY, enable data center operators to assess system resiliency through predictive analysis and real-time monitoring. We help identify vulnerabilities, minimize risks, and enhance long-term operational stability.
Ready to safeguard your data center’s performance? Contact Purkay Labs today to learn more about how our resiliency testing can optimize your cooling infrastructure.