HEPiX Batch Monitoring Working Group
HEPiX Batchsys monitoring meeting
2018-09-20
https://indico.cern.ch/event/758427/
Not using fifemon:
KIT (tried but not using it), UWisc, Oxford, CC-IN2P3
Using fifemon: http://fifemon.github.io/
BNL (heavily modified), FNAL, CERN (modified?)
Oxford: HTCondor Python bindings, Telegraf, InfluxDB and Grafana for monitoring, plus an in-house Python script
UWisc: ganglia.d; pool metrics and user metrics go to Ganglia
The Ganglia server echoes them to Graphite
CC-IN2P3: collectd to collect metrics and syslog-ng for logs; normalize and enrich with Riemann, send everything to Elasticsearch (ES). Grafana and Kibana as UI (using PBS/Torque?)
Summary: everyone uses the CLI to call condor periodically and collect stats
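
The "call the CLI periodically and collect stats" pattern can be sketched roughly as below. This is only an illustration, not any site's actual script: the measurement name condor_jobs, the pool tag, the Graphite prefix and all function names are assumptions made up for the example. It counts jobs per owner and state from the output of condor_q -allusers -af Owner JobStatus and formats the counts as InfluxDB line protocol and as Graphite plaintext lines.

```python
import subprocess
from collections import Counter

# HTCondor JobStatus integer codes.
JOB_STATUS = {
    1: "idle", 2: "running", 3: "removed", 4: "completed",
    5: "held", 6: "transferring_output", 7: "suspended",
}


def fetch_queue_text():
    """Run the CLI once; needs a working HTCondor client on the host."""
    cmd = ["condor_q", "-allusers", "-af", "Owner", "JobStatus"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def count_jobs(text):
    """Count jobs per (owner, state) from 'owner status' lines."""
    counts = Counter()
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines instead of failing the whole run.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        owner, code = parts[0], int(parts[1])
        counts[(owner, JOB_STATUS.get(code, "unknown"))] += 1
    return counts


def to_influx_lines(counts, pool, timestamp_s):
    """InfluxDB line protocol: measurement,tags fields timestamp(ns).

    Assumes tag values contain no spaces, commas or '=' (no escaping done).
    """
    ts_ns = int(timestamp_s) * 1_000_000_000
    return [
        f"condor_jobs,pool={pool},owner={owner},state={state} count={n}i {ts_ns}"
        for (owner, state), n in sorted(counts.items())
    ]


def to_graphite_lines(counts, prefix, timestamp_s):
    """Graphite plaintext protocol: 'path value timestamp(s)'."""
    lines = []
    for (owner, state), n in sorted(counts.items()):
        # Dots separate path components in Graphite, so replace them.
        safe_owner = owner.replace(".", "_")
        lines.append(f"{prefix}.{safe_owner}.{state} {n} {int(timestamp_s)}")
    return lines
```

A cron job or timer would call fetch_queue_text() at a fixed interval, pass the result through count_jobs(), and ship the formatted lines to whichever backend the site runs.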
Two levels of monitoring:
1. Accounting
- Auditing (security?, CE and VO vs. machine in time).
2. Metric gathering
- Alerting: Nagios / Grafana
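
For the alerting side, a threshold check in the style of a Nagios plugin might look like the sketch below. The metric (held jobs), the thresholds and the function name are hypothetical and were not discussed at the meeting; only the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the "label=value;warn;crit" performance-data suffix follow the standard Nagios plugin conventions.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_held_jobs(held, warn, crit):
    """Return (exit_code, status_line) for a held-job count.

    'held' is the measured value; 'warn' and 'crit' are thresholds
    chosen by the site (hypothetical values in this sketch).
    """
    if held is None or held < 0:
        return UNKNOWN, "UNKNOWN - could not determine held job count"
    # Text after '|' is performance data: label=value;warn;crit
    perfdata = f"held={held};{warn};{crit}"
    if held >= crit:
        return CRITICAL, f"CRITICAL - {held} held jobs | {perfdata}"
    if held >= warn:
        return WARNING, f"WARNING - {held} held jobs | {perfdata}"
    return OK, f"OK - {held} held jobs | {perfdata}"
```

A real plugin would print the status line and exit with the returned code, for example via sys.exit(code), so that Nagios can pick up both.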
Reporting: Grid monitoring requirements (WLCG / OSG / EGI); how can they be met in a more standard way?
What do we expect from this WG?
KIT: common core; understand who’s doing what; monitoring and a bit of accounting; would be nice to find a common base for condor monitoring. Contribute to fifemon, make it better.
CERN: share tools & share experience & developments
Someone: share experience (how, by what means, with which tools)
INFN-T1: we produce APEL Records
https://wiki.egi.eu/wiki/APEL/Documentation
INFN: testing an HTCondor + HTCondor-CE instance; monitoring system based on InfluxDB + Sensu + Grafana. Nothing in place for condor yet; running LSF for now. Plan to move from LSF + CREAM-CE to HTCondor + HTCondor-CE.
Will set up Wiki page where we can have:
1. Case studies
2. Wishlist / Tech Discussion area
3. Summary matrix of people / technologies / sites
Next meeting in approximately a month; thereafter every month or two, upon agreement.