20th International Conference on Computing in High Energy and Nuclear Physics (CHEP2013)

Name: 20th International Conference on Computing in High Energy and Nuclear Physics (CHEP2013)
Start: 2013-10-14T09:00:00+02:00
End: 2013-10-18T13:00:00+02:00
Location: Amsterdam, Beurs van Berlage

CHEP2013 Logistics Management

info@chep2013.org

Reliability Engineering analysis of ATLAS data reprocessing campaigns

14 Oct 2013, 13:52

22m

Graanbeurszaal (Amsterdam, Beurs van Berlage)

Oral presentation to parallel session Distributed Processing and Data Handling A: Infrastructure, Sites, and Virtualization

Speaker: Dmytro Karpenko (University of Oslo (NO))

During three years of LHC data taking, the ATLAS collaboration completed three petascale data reprocessing campaigns on the Grid, with up to 2 PB of data reprocessed every year. In reprocessing on the Grid, failures can occur for a variety of reasons, and Grid heterogeneity makes them hard to diagnose and repair quickly. As a result, Big Data processing on the Grid must tolerate a continuous stream of failures, errors, and faults. While ATLAS fault-tolerance mechanisms improve the reliability of Big Data processing on the Grid, their benefits come at a cost and introduce delays that make performance prediction difficult. Reliability Engineering provides a framework for a fundamental understanding of Big Data processing on the Grid; such understanding is not merely a desirable enhancement but a necessary requirement. In ATLAS, cost monitoring and performance prediction became critical for the success of the reprocessing campaigns conducted in preparation for the major physics conferences. In addition, our Reliability Engineering approach supported continuous improvements in data reprocessing throughput during LHC data taking. Throughput doubled from the 2010 to the 2011 reprocessing, then quadrupled from the 2011 to the 2012 reprocessing. We present the Reliability Engineering analysis of ATLAS data reprocessing campaigns, providing the foundation needed to scale up Big Data processing technologies beyond the petascale.
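
The throughput figures quoted in the description compound: a doubling followed by a quadrupling implies roughly an eightfold gain for 2012 relative to 2010. The short Python sketch below works through that arithmetic. It also illustrates, purely as an assumption and not as the authors' model, why retry-based fault tolerance costs extra processing: if each job attempt failed independently with probability `p_fail`, the mean number of attempts per successful job would be 1 / (1 - p_fail). The function names and the 10% failure rate are hypothetical and do not come from the talk.

```python
from math import prod

# Year-over-year throughput factors quoted in the abstract.
yearly_gain = {"2011 vs. 2010": 2.0, "2012 vs. 2011": 4.0}


def cumulative_gain(factors):
    """Multiply successive year-over-year factors into one overall factor."""
    return prod(factors)


def expected_attempts(p_fail):
    """Mean attempts until success when each attempt fails independently
    with probability p_fail (geometric distribution; an illustrative
    assumption, not the model used in the talk)."""
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    return 1.0 / (1.0 - p_fail)


overall = cumulative_gain(yearly_gain.values())
print(f"2012 vs. 2010 throughput: {overall:g}x")

# Hypothetical 10% per-attempt failure rate, chosen only for illustration.
print(f"mean attempts per job: {expected_attempts(0.10):.3f}")
```

Under the independence assumption, even a modest failure rate adds a predictable overhead in attempts. Real Grid failures are often correlated, for example across a single site, which is part of what makes performance prediction difficult.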

Author: Dr Alexandre Vaniachine (ANL)

Co-authors: Dmitri Golubkov (Institute for High Energy Physics (IHEP)); Dmytro Karpenko (University of Oslo (NO))

Slides

ATL-SOFT-SLIDE-2013-798.pdf
