Speaker
Sophie Lemaitre
(CERN)
Description
Data management and data transfer are currently among the problem areas for
sustainable WLCG operations. The systems involved (e.g. Castor, dCache, DPM,
FTS, gridFTP, OPN network) are complex and have multiple layers. Failures
can and do occur in any layer, and because the systems are diverse, hold
different information and use different log formats, debugging problems
across them is currently extremely manpower-intensive. The information is
often located at more than one WLCG site, which further complicates
debugging and increases the latency of problem resolution. Additionally, we
lack a good set of monitoring tools that provide a high-level,
operations-focused overview of what is happening on the transfer services
and where the current top problems are. The services involved have most of
the necessary information; we just do not collect all of it, join it and
present it in a useful view.
The paper will describe the current status of a set of operations tools
that allow a service manager to debug acute problems through the multiple
layers by showing how a request is handled across all the components
involved. It will also report on work towards an "operations dashboard"
that shows service managers what the current top problems in the system
are, and where they occur.
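As a rough illustration of the two ideas above (tracing one request across layers, and ranking the current top problems), the sketch below works on log records that have already been normalized into a common form. It is not the tools' actual implementation: the record layout, the shared request identifier, the helper names `trace_request` and `top_problems`, and the sample data are all hypothetical assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Hypothetical normalized record; the real layers log in differing formats
    # that would first have to be parsed into a common form like this one.
    timestamp: float   # seconds, any common clock
    layer: str         # e.g. "FTS", "gridFTP", "Castor"
    request_id: str    # assumed identifier shared across layers
    status: str        # "OK" or "ERROR"
    message: str


def trace_request(records, request_id):
    """Return all records for one request, ordered by time across layers."""
    matching = [r for r in records if r.request_id == request_id]
    return sorted(matching, key=lambda r: r.timestamp)


def top_problems(records, limit=3):
    """Count ERROR records per (layer, message); return the most frequent."""
    counts = defaultdict(int)
    for r in records:
        if r.status == "ERROR":
            counts[(r.layer, r.message)] += 1
    # Most frequent first; ties broken by (layer, message) for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Made-up sample data, for illustration only.
SAMPLE = [
    LogRecord(3.0, "Castor", "req-1", "ERROR", "disk pool unavailable"),
    LogRecord(1.0, "FTS", "req-1", "OK", "transfer submitted"),
    LogRecord(2.0, "gridFTP", "req-1", "OK", "connection opened"),
    LogRecord(4.0, "Castor", "req-2", "ERROR", "disk pool unavailable"),
    LogRecord(5.0, "gridFTP", "req-3", "ERROR", "connection timed out"),
]

if __name__ == "__main__":
    for record in trace_request(SAMPLE, "req-1"):
        print(record.timestamp, record.layer, record.status, record.message)
    for (layer, message), count in top_problems(SAMPLE):
        print(count, layer, message)
```

In this sketch the join key is a single request identifier; in practice each layer may use its own identifiers, so a mapping step between them would be needed before records can be grouped.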
Authors
Gavin McCance
(CERN)
Sophie Lemaitre
(CERN)