21–25 May 2012
New York City, NY, USA
US/Eastern timezone

No file left behind - monitoring transfer latencies in PhEDEx

22 May 2012, 13:30
4h 45m
Rosenthal Pavilion (10th floor) (Kimmel Center)

Poster
Distributed Processing and Analysis on Grids and Clouds (track 3)
Poster Session

Speaker

Natalia Ratnikova (KIT - Karlsruhe Institute of Technology (DE))

Description

The CMS experiment has to move petabytes of data among dozens of computing centres with low latency in order to make efficient use of its resources. Transfer operations are well established and deliver the desired throughput, but operators lack a system that identifies early those transfers that will need manual intervention to reach completion. File transfer latencies are sensitive to underlying problems in the transfer infrastructure, so measuring them provides a prompt trigger for preventive action. For this reason, PhEDEx, the CMS transfer management system, has recently implemented a monitoring system that measures transfer latencies at the level of individual files. For the first time, the system can now predict the completion time for the transfer of a data set. Operators can detect abnormal patterns in transfer latencies early and correct the issues while the transfer is still in progress. Statistics are aggregated per block of files and recorded in a historical log to monitor the long-term evolution of transfer latencies; these statistics serve as cumulative metrics to evaluate the performance of the transfer infrastructure and to plan the global data placement strategy. In this contribution, we present the typical patterns of transfer latencies identified through the operational experience acquired with the latency monitor. We show how we detect the sources of latency arising from the underlying infrastructure (such as stuck files) that need operator intervention, and we identify the areas in PhEDEx where development effort can reduce latency. Finally, we demonstrate the improvement in transfer completion times achieved since the latency monitor was introduced in 2011.
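The abstract does not say how PhEDEx computes its latency metrics, so the following is only a minimal Python sketch of the ideas it names: per-file latency, block-level aggregation, completion-time prediction, and flagging of stuck files. Everything here is a hypothetical illustration: the `FileTransfer` record and its fields, the linear byte-rate extrapolation used for the prediction, and the "5 times the median completed latency" stuck-file threshold are assumptions chosen for clarity, not the PhEDEx schema or algorithm.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class FileTransfer:
    """Hypothetical per-file record; field names are illustrative, not PhEDEx's schema."""
    name: str
    size_bytes: int
    queued_at: float                  # time (seconds) when the file was queued for transfer
    done_at: Optional[float] = None   # None while the transfer is still pending

    def latency(self, now: float) -> float:
        """Time from queueing to completion, or to `now` if the file is still pending."""
        end = self.done_at if self.done_at is not None else now
        return end - self.queued_at


def block_summary(files: List[FileTransfer], now: float) -> dict:
    """Aggregate completed-file latencies into block-level statistics (illustrative metric set)."""
    lat = [f.latency(now) for f in files if f.done_at is not None]
    return {
        "files": len(files),
        "done": len(lat),
        "median_latency": median(lat) if lat else None,
        "max_latency": max(lat) if lat else None,
    }


def predict_block_completion(files: List[FileTransfer], now: float) -> Optional[float]:
    """Estimate when the block completes by linearly extrapolating the byte rate so far (assumption)."""
    start = min(f.queued_at for f in files)
    total_bytes = sum(f.size_bytes for f in files)
    done_bytes = sum(f.size_bytes for f in files if f.done_at is not None)
    if done_bytes == total_bytes:
        # Block already complete: report the time the last file arrived.
        return max(f.done_at for f in files if f.done_at is not None)
    if done_bytes == 0 or now <= start:
        return None  # no throughput observed yet, so no estimate
    rate = done_bytes / (now - start)  # bytes per second observed so far
    return now + (total_bytes - done_bytes) / rate


def find_stuck_files(files: List[FileTransfer], now: float, factor: float = 5.0) -> List[str]:
    """Flag pending files older than `factor` times the median latency of completed files."""
    completed = [f.latency(now) for f in files if f.done_at is not None]
    if not completed:
        return []
    threshold = factor * median(completed)
    return [f.name for f in files if f.done_at is None and f.latency(now) > threshold]
```

For example, a block of four 100-byte files queued at t=0 (one of them at t=60), two of which finished at t=10 and t=20, gives at t=100 a median completed latency of 15, a predicted completion at t=200, and flags only the file that has been pending since t=0. A real monitor would likely use more robust estimators than a single global rate, but the linear form keeps the sketch easy to follow.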

Authors

Natalia Ratnikova (KIT - Karlsruhe Institute of Technology (DE))
Dr Nicolo Magini (CERN)

Co-authors

Dr Alberto Sanchez Hernandez (Centro Invest. Estudios Avanz. IPN (MX))
Andrea Sartirana (Ecole Polytechnique (FR))
Dr Chih-Hao Huang (Fermi National Accelerator Laboratory)
Federica Moscato (Fermi National Accelerator Lab. (US))
Markus Klute (Massachusetts Institute of Technology)
Mingming Yang (Massachusetts Inst. of Technology (US))
Oliver Gutsche (Fermi National Accelerator Lab. (US))
Paul Rossman (Fermi National Accelerator Laboratory (FNAL))
Rapolas Kaselis (Vilnius University (LT))
Si Xie (Massachusetts Inst. of Technology (US))
Stefan Piperov (Bulgarian Academy of Sciences (BG))
Dr Thorsten Chwalek (KIT - Karlsruhe Institute of Technology)
Dr Tony Wildish (Princeton University (US))