Sophie Lemaitre (CERN)
Data management and data transfer are among the current problem areas for sustainable WLCG operations. The systems involved (e.g. Castor, dCache, DPM, FTS, gridFTP, the OPN) are complex and multi-layered, and failures can and do occur in any layer. Because the systems are diverse, differ in the information they have available and use different log formats, debugging problems across them is currently extremely manpower-intensive. The relevant information is often spread over more than one WLCG site, which further complicates the problem and increases the latency of problem resolution. Additionally, we lack a good set of monitoring tools that provide a high-level, operations-focused overview of what is happening in the transfer services and where the current top problems are. The services involved have most of the necessary information; we just don't collect all of it, join it and provide a useful view. This paper describes the current status of a set of operations tools that allow a service manager to debug acute problems through the multiple layers by showing how a request is handled across all the components involved. It also reports on work towards an "operations dashboard" that shows service managers what (and where) the current top problems in the system are.
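
The "collect, join and provide a useful view" idea can be illustrated with a minimal Python sketch. It is not the tooling the paper describes. It assumes that each layer's log entries have already been normalised into a common record carrying a shared request identifier; the `Event` record, its field names, the site names and the error strings are all hypothetical. Real logs may carry no common identifier and would then need correlation by file name, timestamp or similar.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical normalised log record; not a real FTS/dCache/DPM format."""
    request_id: str          # assumed identifier shared by all layers
    layer: str               # e.g. "FTS", "gridFTP", "dCache"
    site: str                # site that produced the log entry
    timestamp: float         # seconds, on an assumed common clock
    error: Optional[str]     # None if the step succeeded


def trace_requests(events):
    """Join events from all layers by request ID, each trace ordered by time."""
    traces = defaultdict(list)
    for ev in events:
        traces[ev.request_id].append(ev)
    for evs in traces.values():
        evs.sort(key=lambda e: e.timestamp)
    return dict(traces)


def top_problems(events, n=3):
    """Rank (site, layer, error) combinations by number of failures."""
    counts = Counter(
        (e.site, e.layer, e.error) for e in events if e.error is not None
    )
    return counts.most_common(n)


# Invented sample data, deliberately out of time order.
events = [
    Event("req-1", "dCache", "SITE-B", 3.0, "pool offline"),
    Event("req-1", "FTS", "SITE-A", 1.0, None),
    Event("req-1", "gridFTP", "SITE-A", 2.0, None),
    Event("req-2", "FTS", "SITE-A", 1.5, None),
    Event("req-2", "dCache", "SITE-B", 2.5, "pool offline"),
    Event("req-3", "FTS", "SITE-A", 4.0, "proxy expired"),
]

# Per-request view: how one request travelled through the layers.
for ev in trace_requests(events)["req-1"]:
    print(ev.timestamp, ev.site, ev.layer, ev.error or "ok")

# Dashboard-style view: where the most frequent failures are.
for (site, layer, error), count in top_problems(events):
    print(count, site, layer, error)
```

The per-request trace corresponds to debugging one acute problem through the layers, while the ranked failure counts correspond to a dashboard-style summary of the current top problems.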