ATLAS-MonIT meeting

Europe/Zurich
513/1-024 (CERN)

A lot of improvement since last year is acknowledged. The meeting discussed how to progress further from there.

* The main issue with the current implementation of the MonIT dashboards seems to be the performance of the accounting plots. In particular, the Job Accounting dashboard has many panels, and each of them triggers a query from Grafana to InfluxDB. Sometimes the time to return a plot is very long.

Short term actions: ATLAS should consider splitting a single dashboard into several, based on the use case. Each use case can be tuned depending on the kind of aggregation, filtering and granularity needed for the plots. Having fewer plots in the same dashboard or tab also helps the performance, particularly if the user is interested in only a few plots. Having different dashboards also gives the MonIT team the possibility to implement dedicated InfluxDB databases optimised for the given use case. We discussed how granular this process can be: there are many use cases and it is not practical or usable to have a dedicated dashboard for each, so a compromise should be found.

Medium term action: the performance of the system (InfluxDB + Grafana) should be reviewed and possible bottlenecks should be identified. This process will require frequent exchanges between ATLAS and MonIT. It was mentioned that for DDM the work of Thomas was very important in complementing the MonIT team effort, and the same level of experience would be very valuable for the accounting dashboards. The choice of technologies (InfluxDB and Grafana) for the accounting use case should also be reconsidered, as some limitations seem intrinsic to the tools and not easy to overcome. Better performance and functionality could come from new releases of InfluxDB, for example, and the MonIT team will test them as early as possible.

* The display aspects of the monitoring plots need to be improved. For some of them there is no obvious solution at the moment, such as the absence of a "calendared" notion. It was also not obvious how much leverage and available effort we have for implementing and contributing upstream the new or different features that we would need, rather than waiting for a new release that addresses the problems (which are well known and recognised in the communities using those tools). For some cosmetic aspects there are Grafana plug-ins that could give the desired functionality. It was suggested to try them, and the MonIT team could start with the WLCG dashboards they maintain, which suffer from some of the same cosmetic issues. The experience could then be fed back to the experiments through examples, documentation and best practices.

* The decommissioning of the old SSB and the migration to MonIT were discussed. Little effort from ATLAS is being put into that. The MonIT team is working on the migration of WLCG SAM3 from the Oracle-based solution to MonIT (so that the migration to the new version of Oracle is not needed). This will likely cover almost everything ATLAS needs, and a few additional metrics could simply be added to it, rather than migrating the full SSB, which at that point could simply be decommissioned.

  
