HEPiX Batch Monitoring Working Group


Fifemon is in use at:

  • BNL: injects custom ClassAds into the hierarchy

  • CERN: working on RPM packaging (will share); added metrics to monitor cluster utilization per experiment (and to react to under-utilization relative to quota)

    • Mix: fifemon + collectd based layer

  • FNAL: added support for multiple schedds; quota collection (from accounting ClassAds, in case the negotiator cannot be contacted); config options for proper processing; these changes still need to be pushed upstream

    • ClassAds whose parameter names contain a number cause issues (WillSK)

  • Deployment with Docker? Dependency hell either way.

  • Python 3 migration? No one yet.

  • Elasticsearch / Filebeat configuration for fifemon repo as well

  • Nicolas (CC-IN2P3): modification in jobs.py to monitor jobs submitted to the local schedd (as opposed to the CE)

    • (BNL) We also had to remove some FNAL-specific assumptions in the code

  • Include cgroups monitoring in Fifemon?

    • (BNL) Separate tool for us, direct monitoring on node (https://github.com/HEPiX-batchmonitoring/condor_graphite)

  • Mixed solutions for monitoring with some overlap between what each tool can do

  • https://indico.cern.ch/event/778660/contributions/3245464/attachments/1770416/2876523/CERNHTCondorMonitoring.pdf

  • Common repository?
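
The per-experiment utilization check mentioned for CERN could look roughly like the sketch below. It is not taken from Fifemon or from any site's code; the function names, the `condor.experiments` metric prefix, the 0.5 threshold, and the input dictionaries are all assumptions made for illustration. The only external convention relied on is the Graphite plaintext format, `path value timestamp`.

```python
import re
import time


def sanitize(name):
    """Make a name safe as a single Graphite path component.

    Dots separate path levels in Graphite, so they are replaced too.
    """
    return re.sub(r"[^A-Za-z0-9_-]", "_", name)


def utilization_lines(usage, quota, prefix="condor.experiments",
                      threshold=0.5, now=None):
    """Return (graphite_lines, under_utilized) for per-experiment usage.

    usage, quota: hypothetical dicts mapping experiment name -> slot count.
    threshold: assumed fraction of quota below which an experiment is
    reported as under-utilized.
    """
    ts = int(time.time() if now is None else now)
    lines = []
    under = []
    for exp in sorted(quota):
        q = quota[exp]
        if q <= 0:
            # No quota assigned: a ratio would be meaningless, skip it.
            continue
        ratio = usage.get(exp, 0) / q
        lines.append(f"{prefix}.{sanitize(exp)}.utilization {ratio:.3f} {ts}")
        if ratio < threshold:
            under.append(exp)
    return lines, under


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    example_usage = {"atlas": 800, "cms": 100}
    example_quota = {"atlas": 1000, "cms": 500}
    metric_lines, under_utilized = utilization_lines(example_usage, example_quota)
    print("\n".join(metric_lines))
    print("under-utilized:", under_utilized)
```

In a real deployment the usage and quota values would come from the pool (for example the accounting ClassAds mentioned above) rather than from hard-coded dictionaries, and the lines would be sent to a Graphite endpoint instead of printed.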

    • 1
      HTCondor Monitoring (Fifemon) Overview

      I'll start the discussion with an overview of how we at BNL use the Fifemon (https://fifemon.github.io/) monitoring tool, including the changes we made to it.

      Speaker: William Edward Strecker-Kellogg (Brookhaven National Laboratory (US))
    • 2
      Discussion on Fifemon

      Other case studies are welcome, as well as feedback and input from the author (Kevin, who should be in attendance).