The Disconnect Between Design and Operations
While performing scheduled proactive maintenance in a large data center room (1,128 active cabinets), I discovered an issue. As part of our standard operating procedure, we capture thermal images of each circuit feeding each cabinet and record the associated amperage. We were working the aggregation row when I noticed that the normal pattern for deploying the bus plugs had not been followed. In addition, it appeared that the “A” and “C” phases were powering a majority of the cabinets and carrying the most load on both busways feeding this row. This is a concern because it upsets the balance of power between phases (which costs the data center operator money) and, more importantly, it can lead to an unexpected loss of redundancy.
When we opened the infeed unit, where the power is connected to the busway, to gather data and capture thermal images, it became quite clear that the A and C phases in both of the redundant feeds were as much as 93 amps out of balance with the B phase. The thermal images show this in dramatic fashion.
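To make the imbalance figure concrete, here is a minimal Python sketch of how a phase spread could be computed from per-phase amperage readings. The function name and the readings are hypothetical, chosen only so the spread matches the 93 amps mentioned above; they are not the values measured on site.

```python
def phase_spread(amps_by_phase):
    """Return the gap in amps between the most and least loaded phase."""
    loads = amps_by_phase.values()
    return max(loads) - min(loads)


# Hypothetical readings for one feed (illustrative, not measured data).
feed_readings = {"A": 125.0, "B": 40.0, "C": 133.0}

print(f"Phase spread: {phase_spread(feed_readings):.0f} A")
```

A spread near zero means the three phases share the load evenly; a large spread means one or two phases are doing most of the work.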


Because we collect amperage readings on every wire we image, another problem was evident. Though the total kW on the busways was close to ideal for using all available power, the overloaded A and C phases meant that redundancy had been lost. These wires are fed by a 225-amp breaker. If either the A Feed or B Feed breaker were to trip, the total load shifting to the surviving busway would exceed the capacity of its breaker, that breaker would open as well, and the entire row would lose power. Since this was the aggregation row for the room, the entire room would lose connectivity to the outside world.
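The redundancy reasoning can be sketched as a simple check: if one feed is lost, each phase of the surviving feed must carry the load of both, and that sum has to stay within the breaker rating. The Python below is an illustration under stated assumptions, not our actual tooling. It assumes the full load of the failed feed transfers to the same phase on the surviving feed, and the function names and readings are hypothetical. Only the 225-amp rating comes from the account above.

```python
BREAKER_RATING_AMPS = 225  # breaker rating stated in the article


def failover_load(feed_a, feed_b):
    """Per-phase load on the surviving feed if the other feed is lost.

    Assumption: all load on the failed feed moves to the same phase
    on the surviving feed.
    """
    return {phase: feed_a[phase] + feed_b[phase] for phase in feed_a}


def overloaded_phases(feed_a, feed_b, rating=BREAKER_RATING_AMPS):
    """Return the phases whose failover load would exceed the breaker rating."""
    combined = failover_load(feed_a, feed_b)
    return sorted(phase for phase, amps in combined.items() if amps > rating)


# Hypothetical per-phase readings in amps (illustrative, not measured data).
feed_a = {"A": 125, "B": 40, "C": 133}
feed_b = {"A": 120, "B": 38, "C": 128}

print(failover_load(feed_a, feed_b))
print(overloaded_phases(feed_a, feed_b))
```

With these illustrative numbers, each feed looks healthy on its own, yet the A and C phases would exceed 225 amps after a failover while the B phase has capacity to spare. Spreading the same total evenly across three phases would keep every phase under the rating, which is why rebalancing restores redundancy without shedding load.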
We called an all-stop to work in the room and contacted the client’s facility manager to present our findings. While going over the issue with the facility manager, we also realized that several new Cisco switches had been installed in the racks, with a work order to connect them to the busway the next day! Based on the power requirements of the new equipment, it was evident that had the switches been powered up on the phases they were assigned to, the breakers would have opened and communication to all of the room’s 1,128 cabinets would have been lost.
What followed went up the chain of command to the highest levels of the data center operator’s company. The work orders for the next day were cancelled, and no new equipment was allowed to be installed in the row until the issue had been corrected.
We were asked to develop a plan to correct the power issue while ensuring that no downtime incident would occur. Because we had the actual amperage readings for every cabinet feed, over the next six hours we were able to develop three viable options. Each of the options would effectively minimize the imbalance and restore redundancy, and one of them would bring the total phase imbalance to less than three amps.
Because we go beyond thermal imaging and capture all the relevant metadata, we are able to provide our clients with valuable, clear, actionable reports on their facilities. Our clients are able to decrease costs, avoid downtime incidents, and more effectively manage their critical facilities.
David Crites, Chief Operations Officer