Mr
Wojciech Lapka
(Unknown)
28/10/2009, 10:30
Monitoring Infrastructure and Tools
Since 2005, Worldwide LHC Computing Grid (WLCG) services have been monitored by the Service Availability Monitoring (SAM) system, which has been the main source of information for the monthly WLCG availability and reliability calculations.
During this time, the SAM framework has gained popularity among site and service managers and has been very useful in building a robust grid infrastructure.
Experience...
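To illustrate the kind of availability and reliability calculation mentioned in this abstract, here is a minimal sketch. It assumes the commonly cited WLCG-style definitions (availability excludes time with unknown status; reliability additionally excludes scheduled downtime). The function names and the hour-based inputs are hypothetical and are not taken from SAM itself.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of the known period during which the service was up.

    Assumed definition: uptime / (total time - time with UNKNOWN status).
    """
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no period with known status")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_down_hours=0.0, unknown_hours=0.0):
    """Like availability, but scheduled downtime is not counted against the site.

    Assumed definition:
    uptime / (total time - scheduled downtime - time with UNKNOWN status).
    """
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no period in which the service was expected to be up")
    return up_hours / expected


# Hypothetical 30-day month (720 h): 648 h up, 72 h of scheduled maintenance.
month_availability = availability(648, 720)
month_reliability = reliability(648, 720, scheduled_down_hours=72)
```

With these made-up numbers, the site is penalised for the maintenance window in the availability figure but not in the reliability figure.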
Mr
Thomas Davis
(NERSC/LBNL)
28/10/2009, 11:00
Monitoring Infrastructure and Tools
We present a method of monitoring the environment and performance using open source tools such as Nagios, Ganglia and Cacti to collect and display performance data as well as availability information for various components of large computing systems in an integrated fashion. We will present information on how the data is collected, viewed and analyzed, with specific examples from NERSC's Cray system.
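As a rough sketch of how one measurement can carry both availability and performance information in this kind of setup, the snippet below follows the standard Nagios plugin convention: a status code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a status line whose part after `|` is performance data that graphing tools can consume. The metric name and thresholds are hypothetical and are not taken from the NERSC configuration.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(label, value, warn, crit):
    """Compare a measured value against thresholds.

    Returns (exit_code, status_line). The text after '|' uses the
    Nagios performance-data form: label=value;warn;crit
    """
    if value >= crit:
        status = CRITICAL
    elif value >= warn:
        status = WARNING
    else:
        status = OK
    line = f"{STATUS_NAMES[status]} - {label}={value} | {label}={value};{warn};{crit}"
    return status, line


# Hypothetical metric: 1-minute load average with made-up thresholds.
code, line = check_threshold("load1", 3.5, 4.0, 8.0)
print(line)
```

A real plugin would end with `sys.exit(code)` so that Nagios can read the status from the process exit code.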
Frédéric AZEVEDO
(CC-IN2P3)
28/10/2009, 11:30
Due to the continuous load and intensive usage of our tape robotics, we regularly face hardware issues with tapes and tape drives. A recurrent issue is possible data loss, which leads to a long recovery process.
In order to improve our reliability, we have studied commercial solutions to avoid permanent write/read errors, or at least to anticipate errors before they occur. We've tested...