Speaker
Mr Rune Sjoen (Bergen University College)
Description
The ATLAS data network interconnects up to 2000 processors using up to
200 edge switches and five multi-blade chassis devices. Classical
SNMP-based network monitoring provides statistics on aggregate traffic,
but a different approach is needed to quantify individual traffic
flows.
sFlow is an industry standard which enables an Ethernet switch to take a
sample of the packets traversing it and send them to a collector for
permanent storage. The packet samples are analyzed in software and
conversations at different network layers can be individually traced.
Implementing statistical packet sampling in the ATLAS network gives us
the ability to identify and examine the causes of unknown traffic
patterns.
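
To illustrate how packet samples can be turned into per-conversation estimates, here is a minimal sketch. It assumes each sample has already been decoded into a (source, destination, frame size) tuple and that roughly one packet in N is sampled; the function name and the tuple layout are hypothetical and are not taken from the system described here.

```python
from collections import defaultdict


def estimate_flows(samples, sampling_rate):
    """Scale sampled packets up to estimated per-conversation totals.

    samples: iterable of (src, dst, frame_bytes) tuples, one per sampled
             packet (hypothetical layout, assumed already decoded).
    sampling_rate: N, meaning roughly one packet in N is sampled.
    """
    totals = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, frame_bytes in samples:
        flow = totals[(src, dst)]
        # Each sample stands in for about N packets of the same conversation.
        flow["packets"] += sampling_rate
        flow["bytes"] += frame_bytes * sampling_rate
    return dict(totals)


if __name__ == "__main__":
    demo = [("a", "b", 1500), ("a", "b", 500), ("c", "d", 64)]
    for key, value in estimate_flows(demo, 512).items():
        print(key, value)
```

These figures are statistical estimates: a conversation that contributes only a few samples carries a large relative uncertainty, which is one reason short transactions call for a high sampling rate.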
As every switch in ATLAS supports sFlow, there is the potential to
concurrently monitor over 4000 ports. Since brief transactions can be
important, we operate sFlow at high sampling rates, up to one sample per
512 packets, which, together with the large number of ports in the system,
generates a data handling problem of its own.
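
A rough feel for the scale of this data handling problem can be obtained from simple arithmetic. The sketch below is an illustration only: the per-port packet rate is an assumed figure chosen for the example, not a measurement from the ATLAS network, and the function name is hypothetical.

```python
def samples_per_second(ports, packets_per_second_per_port, sampling_rate):
    """Expected number of samples arriving at the collector(s) each second.

    ports: number of monitored switch ports.
    packets_per_second_per_port: average packet rate per port (assumed value).
    sampling_rate: N, meaning one sample per N packets.
    """
    return ports * packets_per_second_per_port / sampling_rate


if __name__ == "__main__":
    # 4000 ports and 1-in-512 sampling come from the text above;
    # 10000 packets/s per port is a purely illustrative assumption.
    print(samples_per_second(4000, 10_000, 512))  # prints 78125.0
```

Under that assumed load, the collection system would need to absorb tens of thousands of samples per second, which is the kind of load that makes the choice between central and distributed collection relevant.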
This paper describes how this problem is addressed by making it possible
to collect and store data either centrally or in a distributed fashion,
according to need. The developed system consists of a collector, a service
exposing the data, and a web interface.
The methods used to present the results in a meaningful fashion for system
analysts are discussed, and we explore the possibilities and limitations
of this diagnostic tool, giving examples of its use in solving system
problems that arise during ATLAS data taking.
Authors
Mr Ali Al-Shabibi (Ecole Polytechnique Federale de Lausanne (EPFL))
Mr Brian Martin (CERN)
Mr Lucian Leahu (Polytechnic Institute of Bucharest)
Mr Matei Ciobotaru (University of California - Department of Physics)
Mr Rune Sjoen (Bergen University College)
Ms Silvia Maria Batraneanu (Polytechnic Institute of Bucharest)
Mr Stefan Stancu (University of California - Department of Physics)