Mr Rune Sjoen (Bergen University College)
The ATLAS data network interconnects up to 2000 processors using up to 200 edge switches and five multi-blade chassis devices. Classical SNMP-based network monitoring provides statistics on aggregate traffic, but a different approach is needed to quantify individual traffic flows. sFlow is an industry standard that enables an Ethernet switch to sample the packets traversing it and send the samples to a collector for permanent storage. The packet samples are analyzed in software, and conversations at different network layers can be traced individually. Implementing statistical packet sampling in the ATLAS network gives us the ability to identify and examine the causes of unknown traffic patterns. As every switch in ATLAS supports sFlow, there is the potential to monitor over 4000 ports concurrently. Since brief transactions can be important, we operate sFlow at high sampling rates, up to one sample per 512 packets, which, together with the large number of ports in the system, generates a data handling problem of its own. This paper describes how this problem is addressed by making it possible to collect and store data either centrally or in a distributed fashion, according to need. The developed system consists of a collector, a service exposing the data, and a web interface. We discuss the methods used to present the results in a meaningful fashion for system analysts, and we explore the possibilities and limitations of this diagnostic tool, giving examples of its use in solving system problems that arise during ATLAS data taking.
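As an illustration of how sampled packets can be turned back into per-flow estimates, the following minimal sketch scales sample counts by the sampling rate. It is not the collector described in this paper: the function name, the (source, destination) flow key, and the input format are hypothetical, and the error bound is the approximation commonly quoted for sFlow-style packet sampling (about 196 * sqrt(1/c) percent at 95% confidence for c samples), which the abstract itself does not state.

```python
import math
from collections import Counter

# One sample per 512 packets, the highest sampling rate quoted in the abstract.
SAMPLING_RATE = 512


def estimate_flows(samples, sampling_rate=SAMPLING_RATE):
    """Scale per-flow sample counts up to estimated packet totals.

    `samples` is an iterable of hypothetical (source, destination) pairs,
    one entry per sampled packet. Returns a dict mapping each flow to
    (estimated_packets, approximate_relative_error_percent).
    """
    counts = Counter(samples)
    estimates = {}
    for flow, c in counts.items():
        # Each sample stands in for roughly `sampling_rate` real packets.
        estimated_packets = c * sampling_rate
        # Commonly quoted 95% confidence bound on the relative error;
        # an assumption here, not a figure taken from the paper.
        error_percent = 196.0 * math.sqrt(1.0 / c)
        estimates[flow] = (estimated_packets, error_percent)
    return estimates
```

The bound shrinks only with the square root of the number of samples, so a flow seen in a handful of samples has a very uncertain estimate. This is consistent with the abstract's point that brief transactions call for high sampling rates.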
Mr Ali Al-Shabibi (Ecole Polytechnique Federale de Lausanne (EPFL)); Mr Brian Martin (CERN); Mr Lucian Leahu (Polytechnic Institute of Bucharest); Mr Matei Ciobotaru (University of California - Department of Physics); Mr Rune Sjoen (Bergen University College); Ms Silvia Maria Batraneanu (Polytechnic Institute of Bucharest); Mr Stefan Stancu (University of California - Department of Physics)