Feb 13 – 17, 2006
Tata Institute of Fundamental Research
Europe/Zurich timezone

Advances in Fabric Management by the CERN-BARC collaboration

Feb 13, 2006, 11:00 AM
7h 10m

Tata Institute of Fundamental Research

Homi Bhabha Road Mumbai 400005 India
poster Computing Facilities and Networking Poster

Speaker

Mr William Tomlin (CERN)

Description

The collaboration between BARC and CERN is driving a series of enhancements to ELFms [1], the fabric management tool suite developed with support from the HEP community under CERN's coordination. ELFms components are used in production at CERN and at a large number of other HEP sites to automatically install, configure and monitor hundreds of clusters comprising thousands of nodes. Developers at BARC and CERN are working together to improve security, functionality and scalability in light of feedback from site administrators.

In a distributed Grid computing environment with thousands of users accessing thousands of nodes, reliable status and exception information is critical at each site and across the Grid. It is therefore important to ensure the integrity, authenticity and privacy of the information collected by the fabric monitoring system. A new layer has been added to Lemon, the ELFms monitoring system, to enable the secure transport of monitoring data between monitoring agents and servers; it uses a modular plug-in architecture that supports RSA/DSA keys and X.509 certificates. In addition, the flexibility and robustness of Lemon have been further enhanced by the introduction of a modular configuration structure, the integration of exceptions with the alarm system, and the development of fault-tolerant components that enable automatic recovery from exceptions.

To address operational scalability issues, CCTracker, a web-based visualization tool, is being developed. It provides both physical and logical views of a large Computer Centre and enables authorized users to locate objects and perform high-level operations across sets of objects. Operations staff will be able to view and plan elements of the physical infrastructure and initiate hardware management workflows such as mass machine migrations or installations. Service Managers will be able to easily manipulate clusters or sets of nodes, modifying settings, rolling out software updates and initiating high-level state changes.

[1] http://cern.ch/elfms
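The plug-in approach to securing monitoring traffic described for Lemon can be illustrated with a minimal sketch. This is not Lemon's actual code or API: the names `SecurityPlugin`, `HmacPlugin`, `load_plugin`, `encode_sample` and `decode_sample` are hypothetical, and a shared-key HMAC-SHA256 from the Python standard library stands in for the RSA/DSA and X.509 mechanisms named in the abstract. The sketch covers integrity and authenticity only; privacy would require an additional encrypting plug-in.

```python
import hashlib
import hmac
import json
from typing import Callable, Dict, Protocol


class SecurityPlugin(Protocol):
    """Hypothetical interface every transport-security plug-in implements."""

    def protect(self, payload: bytes) -> bytes: ...

    def verify(self, message: bytes) -> bytes: ...


class HmacPlugin:
    """Stand-in plug-in: shared-key HMAC instead of RSA/DSA or X.509."""

    TAG_LEN = 32  # size of a SHA-256 digest in bytes

    def __init__(self, key: bytes) -> None:
        self._key = key

    def protect(self, payload: bytes) -> bytes:
        # Prepend an authentication tag to the serialized sample.
        tag = hmac.new(self._key, payload, hashlib.sha256).digest()
        return tag + payload

    def verify(self, message: bytes) -> bytes:
        # Split tag and payload, recompute the tag, compare in constant time.
        tag, payload = message[: self.TAG_LEN], message[self.TAG_LEN:]
        expected = hmac.new(self._key, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("monitoring sample failed authentication")
        return payload


# Registry mapping a configured plug-in name to its factory (assumed layout).
PLUGINS: Dict[str, Callable[..., SecurityPlugin]] = {"hmac-sha256": HmacPlugin}


def load_plugin(name: str, **options) -> SecurityPlugin:
    """Instantiate the plug-in selected in the (hypothetical) configuration."""
    return PLUGINS[name](**options)


def encode_sample(plugin: SecurityPlugin, node: str, metric: str, value: float) -> bytes:
    """Agent side: serialize one sample and hand it to the plug-in."""
    payload = json.dumps(
        {"node": node, "metric": metric, "value": value}, sort_keys=True
    ).encode("utf-8")
    return plugin.protect(payload)


def decode_sample(plugin: SecurityPlugin, message: bytes) -> dict:
    """Server side: authenticate the message, then parse the sample."""
    return json.loads(plugin.verify(message).decode("utf-8"))
```

Because agent and server depend only on the `protect`/`verify` pair, a different mechanism could be swapped in by registering another class under a new name, without changing the code that produces or consumes samples.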
