21-27 March 2009
Prague
Europe/Prague timezone

Building a Reliable High Performance PanDA Facility

24 Mar 2009, 17:30
20m
Club C (Prague)


Prague Congress Centre 5. května 65, 140 00 Prague 4, Czech Republic
oral Distributed Processing and Analysis

Speaker

Dr Dantong Yu (BROOKHAVEN NATIONAL LABORATORY)

Description

PanDA, the ATLAS Production and Distributed Analysis framework, has been identified as one of the most important services provided by the ATLAS Tier 1 facility at Brookhaven National Laboratory (BNL), and has been enhanced into what is now a 24x7x365 production system. During this period, PanDA has remained under active development for additional functionality and bug fixes, and processing requirements have increased geometrically, leading to challenges in service provision. We used a RedHat Satellite system, cfEngine, and custom scripting to streamline the deployment, provisioning, and maintenance of the OS, Grid middleware, and PanDA. We deployed redundant hardware and multiple service instances for each critical PanDA component, and added high-performance, high-availability capability by placing a Layer 4/7 smart switch from F5 in front of some components. This cost-effective approach greatly improves throughput and reliability, and eliminates single points of failure caused by hardware, network, grid middleware, operating system, or local PanDA application issues. Its transparency allows flexible management of the heterogeneous service, with only minimal application-level configuration and coding needed to support integration with the smart switch. We have also implemented an extensive monitoring and alert system using Ganglia, Nagios (with extensive custom probes), RT (Request Tracker), and a custom-written ticket opening/escalation system. These tools work together to alert us to problems as they occur, and greatly assist in quickly troubleshooting any failures. In summary, our contributions in innovative hardware resilience, extensive monitoring, and automatic problem reporting and tracking significantly enhance the reliability of the evolving PanDA system while allowing PanDA developers ready access to the system for software improvement.
Our experiment shows that PanDA performance was more than triple that of the legacy PanDA instance, and that any single failure was transparent to ATLAS users.
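
The abstract mentions custom Nagios probes watching redundant service instances. As an illustration only, here is a minimal sketch of what such a probe might look like, assuming a plain TCP reachability check is an acceptable health signal. It is not the authors' probe: the function names, the degraded/critical policy, and the host names in the usage note are hypothetical. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import socket

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL = 0, 1, 2


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def classify(results):
    """Map per-instance health (name -> bool) to a Nagios status and message.

    Assumed policy: some instances down is a WARNING (service degraded but
    still redundant); all instances down is CRITICAL.
    """
    down = sorted(name for name, up in results.items() if not up)
    if not down:
        return OK, "OK - all %d instances reachable" % len(results)
    if len(down) < len(results):
        return WARNING, "WARNING - degraded, down: " + ", ".join(down)
    return CRITICAL, "CRITICAL - no instance reachable"


def main(instances):
    """Probe each (host, port) pair, print one status line, return exit code."""
    results = {"%s:%d" % (h, p): probe(h, p) for h, p in instances}
    status, message = classify(results)
    print(message)
    return status
```

A wrapper script could call `sys.exit(main([("panda-a.example.org", 443), ("panda-b.example.org", 443)]))`, where the hosts and port are placeholders. Separating `classify` from `probe` keeps the alerting policy testable without network access.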

Primary authors

Aaron Thor (BROOKHAVEN NATIONAL LABORATORY)
Dr Dantong Yu (BROOKHAVEN NATIONAL LABORATORY)
Dr Jason Smith (BROOKHAVEN NATIONAL LABORATORY)
John DeStefano (BROOKHAVEN NATIONAL LABORATORY)
John Hover (BROOKHAVEN NATIONAL LABORATORY)
Jose Caballero (BROOKHAVEN NATIONAL LABORATORY)
Prof. Kaushik De (University of Texas at Arlington)
Mark Sosebee (University of Texas at Arlington)
Dr Maxim Potekhin (BROOKHAVEN NATIONAL LABORATORY)
Mizuki Karasawa (BROOKHAVEN NATIONAL LABORATORY)
Dr Paul Nilsson (CERN)
Dr Torre Wenaus (BROOKHAVEN NATIONAL LABORATORY)
Tadashi Maeno (BROOKHAVEN NATIONAL LABORATORY)
Dr Tomasz Wlodek (BROOKHAVEN NATIONAL LABORATORY)
Dr Xin Zhao (BROOKHAVEN NATIONAL LABORATORY)
Dr Yuri Smirnov (BROOKHAVEN NATIONAL LABORATORY)
