Mr Levente Hajdu (Brookhaven National Laboratory)
In statistically hungry science domains, data deluges can be both a blessing and a curse. They allow statistical errors on known measurements to be winnowed down and open the door to new scientific opportunities as the physics program matures, and they are also a testament to the efficiency of the experiment and accelerator and to the skill of their operators. However, the data samples need to be dealt with, and in experiments like those at RHIC, the planning for computing resources does not allow huge increases in computing capacity. One standard strategy has been to share resources across multiple experiments at a given facility. Another has been to use middleware that “glues” together resources across the world so that they can run the experimental software stack locally (either natively or virtually). In this presentation, we will describe a framework STAR has successfully used to reconstruct ~400 TB of data in over 100,000 jobs submitted from its Tier0 facility at Brookhaven National Laboratory to a remote site in Korea. The framework automates the full path, taking raw data files from tape and writing physics-ready output back to tape without operator or remote site intervention. Through hardening, we have demonstrated an efficiency of 99% over a period of 7 months of operation. We attribute the high performance to finite state checking with retries, which makes the system resilient to capricious and fallible infrastructure.
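
The idea of finite state checking with retries can be illustrated with a minimal sketch. This is not the STAR framework's code: the state names, the `run_job` function, the `actions` and `checks` callables, and the retry limit of 3 are all hypothetical, chosen only to show how a job might advance one verified state at a time and repeat a step when the infrastructure misbehaves.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical states for one job; the real framework's states are not given.
    STAGED = auto()         # raw file restored from tape
    TRANSFERRED = auto()    # raw file copied to the remote site
    RECONSTRUCTED = auto()  # reconstruction job finished
    RETURNED = auto()       # output copied back to the home facility
    ARCHIVED = auto()       # output written to tape
    FAILED = auto()         # retries exhausted at some step


# Assumed linear ordering of the steps a job must pass through.
PIPELINE = [
    State.STAGED,
    State.TRANSFERRED,
    State.RECONSTRUCTED,
    State.RETURNED,
    State.ARCHIVED,
]


def run_job(actions, checks, max_retries=3):
    """Advance one job through PIPELINE.

    actions[state]() performs the step; checks[state]() returns True only
    if the step's result can be verified. A step is repeated up to
    max_retries times before the job is marked FAILED.
    Returns (final_state, attempts_per_state).
    """
    attempts = {}
    for state in PIPELINE:
        for attempt in range(1, max_retries + 1):
            attempts[state] = attempt
            try:
                actions[state]()
            except Exception:
                continue  # count the exception as a failed attempt and retry
            if checks[state]():
                break  # state verified, move on to the next step
        else:
            # The inner loop ended without a verified state.
            return State.FAILED, attempts
    return State.ARCHIVED, attempts


def make_flaky(fail_times):
    """Build an action that raises on its first fail_times calls (for demos)."""
    calls = {"n": 0}

    def action():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise IOError("simulated transient failure")

    return action


if __name__ == "__main__":
    actions = {s: (lambda: None) for s in PIPELINE}
    checks = {s: (lambda: True) for s in PIPELINE}
    actions[State.TRANSFERRED] = make_flaky(2)  # transfer fails twice
    final_state, attempts = run_job(actions, checks)
    print(final_state.name, attempts[State.TRANSFERRED])
```

In this sketch a transfer that fails twice still ends in `ARCHIVED` on its third attempt, while a step that never passes its check ends the job in `FAILED` rather than letting it proceed with unverified output.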