Speaker
Ulrich Schwickerath
(CERN)
Description
During summer 2010, a large LSF test cluster infrastructure was put in place to allow scalability tests of the batch software (LSF) at a scale exceeding that of the production instance by up to a factor of 5.
The response time of several central commands was measured as a function of the number of worker nodes and the number of batch nodes in the farm.
Several issues found during the tests were fixed on the fly by the vendor, which made it possible to scale up to 15,000 virtual worker nodes and more than 400,000 jobs in the system. Some results from these scalability tests will be presented, and lessons learned during the tests and possible consequences for planning will be discussed.
Author
Ulrich Schwickerath
(CERN)
Co-authors
Gavin McCance
(CERN)
Ricardo Silva
(CERN)