Description
The Scientific Data and Computing Center (SDCC) at Brookhaven National Laboratory provides data services (storage, transfer and management), computing resources and collaborative tools to our worldwide scientific communities. Growing needs from our various programs, coupled with data center floor space and power constraints, led us to construct and occupy a new power-efficient data center and to pursue ongoing studies of new hardware architectures that yield higher performance per unit of energy (kWh). This presentation describes BNL's sustainability activities in support of current and future needs of core programs such as RHIC, EIC, HL-LHC, Belle II, LQCD and NSLS-II.
Data center sustainability has come into sharper focus with the continuing evolution of AI, HPC and HTC systems, whose unprecedented rise in chip TDP is driving a rapid increase in carbon emissions. As demand for such systems grows exponentially, major challenges have surfaced in productivity, PUE and thermal/scheduling management, and deploying AI/HPC infrastructure in data centers will require substantial capital investment.
This study at the SDCC quantifies the energy footprint of this infrastructure by developing models based on the power demands of AI hardware during training. We measured the instantaneous power draw of an 8-GPU NVIDIA H100 HGX node while training open-source models, including the ResNet image classifier and the Llama2-13b large language model. The peak power draw observed was about 18% below the manufacturer’s rated TDP, even with GPUs near full utilization. For the image classifier, increasing the batch size from 512 to 4096 images reduced total training energy consumption by a factor of four with the model architecture held constant. These insights can help scientific data centers identify ‘stranded power’ within existing facilities, assist in capacity planning and provide staff with energy-use estimates. Future studies will explore the effects of liquid cooling technologies and carbon-aware scheduling on AI workload energy consumption. These results can also inform the development of software and operational models that may significantly reduce data center carbon footprints and identify opportunities for heat reuse.
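As a rough illustration of the measurement approach described above, the sketch below samples per-GPU board power over NVML (via the pynvml bindings) at a fixed interval during a training run and integrates the samples into an energy estimate. The 1-second sampling interval, the background-thread layout and the run_training() placeholder are assumptions for illustration, not the SDCC's actual instrumentation.

```python
# Sketch: estimate training energy by sampling GPU board power via NVML.
# Assumptions (not from the abstract): pynvml bindings, a 1 s sampling
# interval, and a placeholder run_training() standing in for the actual
# ResNet / Llama2-13b training job.
import threading
import time

import pynvml


def sample_power(stop_event, samples, interval_s=1.0):
    """Poll instantaneous board power (watts) summed over all visible GPUs."""
    pynvml.nvmlInit()
    try:
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
                   for i in range(pynvml.nvmlDeviceGetCount())]
        while not stop_event.is_set():
            # nvmlDeviceGetPowerUsage returns milliwatts.
            watts = sum(pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0
                        for h in handles)
            samples.append((time.time(), watts))
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()


def energy_kwh(samples):
    """Trapezoidal integration of (timestamp, watts) samples into kWh."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


def run_training():
    # Placeholder for the actual training loop (e.g., ResNet or Llama2-13b).
    time.sleep(60)


if __name__ == "__main__":
    samples, stop = [], threading.Event()
    sampler = threading.Thread(target=sample_power, args=(stop, samples))
    sampler.start()
    run_training()
    stop.set()
    sampler.join()
    print(f"Estimated training energy: {energy_kwh(samples):.3f} kWh")
```

Comparing the integrated kWh figure across runs that differ only in batch size (e.g., 512 vs. 4096 images) is one way the batch-size effect reported above could be reproduced.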