MONIT/CMS Meeting

Europe/Zurich
    • 14:00 → 14:30
      MONIT Status and Plans 30m
      Speaker: Pedro Andrade (CERN)
    • 14:30 → 15:00
      CMS Status and Plans 30m
      Speakers: Danilo Piparo (CERN), Federica Legger (Universita e INFN Torino (IT)), Valentin Y Kuznetsov (Cornell University (US))

      CMS feedback

      • stability of the infrastructure, especially during downtimes of CERN services, when monitoring information is most valuable
        • how to separate it from those services and run it in high-availability mode
      • since we rely on ES/InfluxDB, we need tutorials on their query languages
      • as the infrastructure and the number of dashboards grow, we need an easy tool to find the relevant information, similar to Google search
        • it may require data annotation, indexing, etc.
      • we need to be periodically informed about the R&D and directions MONIT is planning, so that we can contribute to the discussion on these subjects, e.g. through an internal Jira (ticketing system) that we could consult, if one exists
      • ability to specify the severity level of tasks/tickets

      CMS adaptation to MONIT

      • overall, we are starting to move more aggressively to the MONIT infrastructure
      • usage of ES/Kibana/Grafana is growing among different CMS groups
      • usage of HDFS is mostly limited to experts
      • HDFS workflows are hard for an average user to write and execute, so an additional layer may be desirable; Job Monitoring with ES+Spark is a good example
      • we are starting to see growth in usage of the MONIT CLI

      CMS Plans

      • expand usage of ES as a primary data store
      • start validating our data schemas during injection
      • migrate all CLI tools to Go to avoid dependencies, env setup, etc.
      • migrate high-cardinality dashboards to an ES+Spark aggregated index
      • intelligent alert notification system
      • explore Rumble [1] as a layer on top of HDFS to query the data

      [1] https://indico.cern.ch/event/908539/contributions/3822566/attachments/2030916/3398965/2020_05_Rumble.pdf
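
      As an illustration of the plan to validate data schemas during injection, the following is a minimal sketch only. The schema, the field names (`site`, `status`, `cpu_time_s`, `exit_code`) and the `validate` helper are hypothetical and are not taken from any actual CMS or MONIT schema; it only shows the general idea of checking each record against a declared schema before it is sent on.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> (expected type, required?)
JOB_SCHEMA: dict[str, tuple[type, bool]] = {
    "site": (str, True),
    "status": (str, True),
    "cpu_time_s": (float, True),
    "exit_code": (int, False),
}


def validate(record: dict[str, Any], schema: dict[str, tuple[type, bool]]) -> list[str]:
    """Return a list of problems found in the record; an empty list means it is valid."""
    errors: list[str] = []
    for field, (expected, required) in schema.items():
        if field not in record:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        value = record[field]
        # Integers are accepted where a float is expected.
        accepted = (int, float) if expected is float else (expected,)
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, accepted):
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Fields not declared in the schema are reported rather than silently passed on.
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


if __name__ == "__main__":
    bad = {"site": "T2_CH_CERN", "cpu_time_s": "12", "extra": 1}
    for problem in validate(bad, JOB_SCHEMA):
        print(problem)
```

      In such a setup, records with a non-empty error list could be rejected or routed to a separate index for inspection; which of the two is appropriate is a policy choice that is not specified in these minutes.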