Oct 10 – 14, 2016
San Francisco Marriott Marquis
America/Los_Angeles timezone

Vibration monitoring system for the RACF data center at BNL

Oct 13, 2016, 3:30 PM
1h 15m
San Francisco Marriott Marquis

Poster Track 7: Middleware, Monitoring and Accounting Posters B / Break

Speaker

Alexandr Zaytsev (Brookhaven National Laboratory (US))

Description

The RHIC & ATLAS Computing Facility (RACF) at BNL is a 15,000 sq. ft. facility hosting the IT equipment of the BNL ATLAS WLCG Tier-1 site, offline farms for the STAR and PHENIX experiments operating at the Relativistic Heavy Ion Collider (RHIC), BNL Cloud installations, various Open Science Grid (OSG) resources, and many other smaller-scale IT installations oriented toward physics research. The facility originated in 1990 and has grown steadily to its present configuration of 4 physically isolated IT areas with a maximum capacity of about 1000 racks and a total peak power consumption of 1.5 MW; about 400 of these racks, plus 9 large robotic tape frames, are currently deployed.

These IT areas have a raised floor and are cooled by a distributed group of chilled-water CRAC units deployed both on the raised floor (20 Liebert CRAC units distributed across the area) and in the basement of the data center building (two large units constructed as part of the original data center building in the late 1960s). The RACF data center currently has about 50 PB of storage deployed on approximately 20k spinning HDDs and 70 PB of data stored on 60k tapes loaded into robotic silos equipped with 180 tape drives, all of which are potentially sensitive to external sources of vibration. Excessive vibration could endanger the normal operation of IT equipment, cause equipment shutdowns, and even reduce the expected lifetime of the HDDs unless its source is detected and eliminated quickly. In our environment the CRAC units deployed on the raised floor cause such problems in most cases, but similar issues sometimes result from mechanical interference between equipment deployed in adjacent racks. Mechanical problems related to the CRAC units are normally caught within 12-24 hours through regular inspections of the area by RACF data center personnel. Nevertheless, in 2015-2016 the need was recognized for a dedicated, fully automated system that would provide early detection of unwanted vibration sources and gather historical data on the evolution of the energy spectrum of known, constantly present sources, such as nominally operating CRAC units and data storage equipment.

This contribution summarizes the initial design of the vibration monitoring system for the RACF data center and the related equipment evaluations performed in Q1-Q2 2016, and presents the results of the first deployment of this monitoring system (based on high-sensitivity MEMS triaxial accelerometers with DC response measurement capability) in one of the IT areas of the RACF data center.
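
The abstract does not describe the analysis software, so the following is only a minimal sketch of how early detection from an energy spectrum might work, not the method actually deployed at RACF. It assumes a single accelerometer axis sampled at a fixed rate, removes the static (DC) offset that a DC-response sensor reports, computes a power spectrum with a naive DFT, and compares the energy in a frequency band against a stored baseline. The function names, the 512 Hz sample rate, the 100-140 Hz band, the signal amplitudes, and the threshold factor of 3 are all illustrative assumptions.

```python
import cmath
import math


def power_spectrum(samples, sample_rate_hz):
    """Return (frequencies_hz, power) for a real signal using a naive DFT."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # drop the static (DC) offset, e.g. gravity
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(centered))
        freqs.append(k * sample_rate_hz / n)
        power.append(abs(acc) ** 2 / n ** 2)
    return freqs, power


def band_energy(freqs, power, lo_hz, hi_hz):
    """Sum spectral power over the closed band [lo_hz, hi_hz]."""
    return sum(p for f, p in zip(freqs, power) if lo_hz <= f <= hi_hz)


def is_anomalous(current, baseline, factor=3.0):
    """Flag a band whose energy exceeds the baseline by an assumed factor."""
    return current > factor * baseline


def synth(n, rate_hz, components, dc=0.0):
    """Synthetic test signal: DC offset plus a sum of (freq_hz, amplitude) sines."""
    return [dc + sum(a * math.sin(2 * math.pi * f * i / rate_hz)
                     for f, a in components)
            for i in range(n)]


RATE_HZ = 512   # hypothetical sample rate
N = 512         # one second of data, giving 1 Hz frequency resolution

# Hypothetical nominal unit: weak 120 Hz line plus a 50 Hz line, 1 g static offset.
nominal = synth(N, RATE_HZ, [(50, 0.01), (120, 0.005)], dc=1.0)
# Hypothetical degraded unit: the 120 Hz line has grown tenfold in amplitude.
degraded = synth(N, RATE_HZ, [(50, 0.01), (120, 0.05)], dc=1.0)

f_nom, p_nom = power_spectrum(nominal, RATE_HZ)
f_deg, p_deg = power_spectrum(degraded, RATE_HZ)

baseline_energy = band_energy(f_nom, p_nom, 100, 140)
degraded_energy = band_energy(f_deg, p_deg, 100, 140)

print("band energy ratio:", degraded_energy / baseline_energy)
print("anomalous:", is_anomalous(degraded_energy, baseline_energy))
```

In practice one would likely replace the naive DFT with an FFT routine, process all three axes, and store the per-band energies over time to build the historical record of nominally operating sources.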

Primary Keyword: Computing facilities
Secondary Keyword: Monitoring

Primary author

Alexandr Zaytsev (Brookhaven National Laboratory (US))

Co-authors

Christopher Hollowell (Brookhaven National Laboratory)
Costin Caramarcu (Brookhaven National Laboratory)
Tony Wong (Brookhaven National Laboratory)
William Strecker-Kellogg (Brookhaven National Lab)
