xrootd monitoring discussion


Attended:  Ilija Vukotic, Diego Davila, Marian Zvada, Borja Garrido Bear, Maarten Litmaath, Albert Rossi, Robert Currie, Costin Grigoras, David Smith, Andrew Hanushevsky, Paul Millar, Brian Bockelman, Brian Lin, Liz Sexton-Kennedy, Julia Andreeva

Apologies: Alessandra Forti

Discussion after Ilija's presentation

Brian: How do you deal with security issues? For example, core files may contain private keys...

Ilija:  Core dumps are not handled yet. Good to know such issues have to be taken into consideration.


Paul and others: So far the monitoring flow has worked with dedicated xCache instances used only by ATLAS. What happens if several VOs use the same instance and we would like to monitor/compare the behaviour of the different VOs?

Probably the best solution would be to use virtual xCache instances sharing the same hardware, each of them dedicated to a given VO. For xrootd it is better to use physically separated xrootd servers.


Julia: Would CMS be interested in trying/re-using the xCache monitoring developed by Ilija?

Brian and Liz for CMS: CMS would be interested in using a common solution/infrastructure, like MONIT at CERN, where all other CMS monitoring data go.

Xrootd monitoring discussion based on the presentation of Derek and Diego at the previous meeting

- Does anyone use the intermediate "Read" events, or can we filter them out?

- UCSD was looking into this information at some point, but not any more. Apparently, nobody is currently using these events. There is no need to filter them out: reporting of such events can be switched off in the server configuration. This can substantially decrease the amount of data to be sent and processed.
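For illustration only (the exact option names should be checked against the XRootD monitoring reference, and the collector host below is a placeholder), detailed per-read reporting is controlled by the I/O trace options of the xrootd.monitor directive; leaving them out keeps file open/close and user records while dropping individual read events:

```
# Hypothetical server-side monitoring configuration sketch:
# send file open/close, session info and user records to the collector,
# but omit the 'io'/'iov' trace options so individual read events
# are never reported on the wire.
xrootd.monitor all auth flush 30s window 5s dest files info user collector.example.org:9930
```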

- Brian explained that the component which aggregates events belonging to the same transfer/access is not the 'Shoveler' but the 'Collector'.

- Why was RabbitMQ selected as the implementation of the message queue?

Brian: The reason is that OSG relies on a RabbitMQ infrastructure run by an external provider.

Julia: It looks like a good solution for OSG, but what about European WLCG sites? Most WLCG monitoring applications rely on the ActiveMQ infrastructure provided by MONIT at CERN.

Brian: CMS and ATLAS xrootd monitoring is only a fraction of the OSG xrootd monitoring. Directing the whole OSG xrootd monitoring flow to CERN is not a good idea.

It looks like we need two slightly different implementations for OSG and European sites with respect to the message queue. For the European sites the ActiveMQ infrastructure at CERN might be a proper solution, rather than running yet another message queue. We might need to think about a plugin implementation of the collector (one for OSG sites, another one for Europe) and similarly for the shoveler, which converts UDP packets to a message-queue format.

Borja: Both RabbitMQ and ActiveMQ support the STOMP protocol, so the reporting should not differ much.
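A minimal sketch of why the two brokers look alike from the sender's side: a STOMP 1.2 SEND frame is just a command line, headers, a blank line, the body and a NUL terminator, so the same frame can be delivered to either broker's STOMP listener. The destination name and record fields below are invented for illustration; a real shoveler would more likely use a client library such as stomp.py rather than building frames by hand.

```python
import json

def stomp_send_frame(destination: str, record: dict) -> bytes:
    """Build a STOMP 1.2 SEND frame carrying a JSON monitoring record.

    Both RabbitMQ (via its STOMP plugin) and ActiveMQ accept frames of
    this shape; only the broker endpoint and destination naming differ.
    """
    body = json.dumps(record).encode("utf-8")
    headers = [
        ("destination", destination),
        ("content-type", "application/json"),
        ("content-length", str(len(body))),
    ]
    head = "SEND\n" + "".join(f"{k}:{v}\n" for k, v in headers) + "\n"
    # A frame ends with a single NUL byte per the STOMP specification.
    return head.encode("utf-8") + body + b"\x00"

# Hypothetical destination name, for illustration only.
frame = stomp_send_frame("/topic/xrootd.monitoring", {"site": "EXAMPLE", "bytes_read": 1024})
```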

Maarten: Are there known issues with the STOMP protocol?

Julia: At least FTS uses the STOMP protocol without problems.

Borja confirmed that it is successfully used by MONIT.

Julia: Can we have the site shoveler as part of the xrootd distribution?

Andrew: Yes, it is possible. As it is most easily implemented in Python, we would need to beware of incompatibilities between Python 2 and Python 3, because not every deployment may have switched to the latter yet.
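As an illustration of the shoveler's core job, here is a Python 3 sketch under the assumptions above: receive the raw xrootd monitoring UDP datagrams and wrap each one in a JSON envelope for publishing to a message queue. The port, envelope field names and the `publish` hook are all hypothetical, not a fixed schema.

```python
import base64
import json
import socket
import time

def wrap_packet(data: bytes, source: str) -> str:
    """Wrap one raw monitoring datagram in a JSON envelope.

    The binary payload is base64-encoded so it survives any
    text-based message-queue transport; field names are illustrative.
    """
    return json.dumps({
        "source": source,
        "received": time.time(),
        "payload": base64.b64encode(data).decode("ascii"),
    })

def run_shoveler(bind_addr=("0.0.0.0", 9930), publish=print):
    """Listen for UDP monitoring packets and hand each envelope to
    `publish` (a real shoveler would publish to RabbitMQ/ActiveMQ)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(bind_addr)
    while True:
        data, (host, port) = sock.recvfrom(65535)
        publish(wrap_packet(data, f"{host}:{port}"))
```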

Borja: What is the deployment scheme for the collectors?

Julia: They might sit next to the message queue which is getting data from the sites. In the case of European sites that might be CERN.

Conclusions and actions

- We agreed to drop reporting of 'read' events in the xrootd monitoring flow

- We need to foresee the possibility of using ActiveMQ for the European sites instead of the RabbitMQ component in the schema of Derek and Diego. This implies that the shoveler should be configurable to send data either to RabbitMQ or to ActiveMQ, and that the collector should have plugins for the two different implementations. We should discuss with the OSG developers whether they need help implementing the flow for the European sites.
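The plugin idea mentioned above could be sketched as follows. All class names, the backend-selection config and the in-memory `sent` buffer are invented for illustration; real backends would use an actual AMQP/STOMP client against the OSG or CERN MONIT brokers.

```python
import json

class Publisher:
    """Minimal plugin interface: each backend only needs publish()."""
    def publish(self, record: dict) -> None:
        raise NotImplementedError

class RabbitMQPublisher(Publisher):
    """Stub for the OSG side; a real one would use an AMQP client."""
    def __init__(self, url: str):
        self.url = url
        self.sent = []  # stand-in for an actual broker connection

    def publish(self, record: dict) -> None:
        self.sent.append(json.dumps(record))

class ActiveMQPublisher(Publisher):
    """Stub for the European side; a real one would speak STOMP to MONIT."""
    def __init__(self, url: str):
        self.url = url
        self.sent = []

    def publish(self, record: dict) -> None:
        self.sent.append(json.dumps(record))

BACKENDS = {"rabbitmq": RabbitMQPublisher, "activemq": ActiveMQPublisher}

def make_publisher(cfg: dict) -> Publisher:
    """Pick the message-queue backend from configuration."""
    return BACKENDS[cfg["backend"]](cfg["url"])
```

With this shape, switching a site between the two infrastructures is a one-line configuration change rather than a different shoveler/collector build.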

- We plan to meet again at the beginning of September, but please feel free to reach out with any questions, comments or suggestions in the meantime.


