HEPiX Batch Monitoring Working Group

Timezone: America/New_York

HEPiX Batchsys monitoring meeting
2018-09-20
https://indico.cern.ch/event/758427/

Not using fifemon:
KIT (tried it but not using it), UWisc, Oxford, CCIN2P3

Using fifemon: http://fifemon.github.io/
BNL (heavily modified), FNAL, CERN (modified?)

Oxford: HTCondor Python bindings, Telegraf, InfluxDB and Grafana for monitoring, plus an in-house Python script
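
As an illustration of this kind of pipeline (not Oxford's actual script, which was not shown), the sketch below counts jobs by status and renders the result as an InfluxDB line-protocol record. The measurement name `condor_jobs`, the tag `schedd=schedd01` and the sample ads are hypothetical; in practice the ads would presumably come from the HTCondor Python bindings. It assumes tag values contain no spaces or commas, so no escaping is done.

```python
from collections import Counter

# Standard HTCondor JobStatus codes (only the common ones are named here).
JOB_STATUS = {1: "idle", 2: "running", 3: "removed", 4: "completed", 5: "held"}


def count_by_status(jobs):
    """Count jobs per status name; `jobs` is an iterable of dict-like job ads."""
    return Counter(JOB_STATUS.get(ad.get("JobStatus"), "other") for ad in jobs)


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one InfluxDB line-protocol record with integer fields.

    Assumption: keys and tag values need no escaping.
    """
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}i" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"


# Hypothetical sample data standing in for the result of a schedd query.
sample_jobs = [{"JobStatus": 1}, {"JobStatus": 2}, {"JobStatus": 2}, {"JobStatus": 5}]
counts = count_by_status(sample_jobs)
line = to_line_protocol("condor_jobs", {"schedd": "schedd01"}, counts, 1537434000000000000)
print(line)
```

A record like this could be handed to Telegraf or written to InfluxDB directly; which of the two Oxford does was not stated.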

UWisc: ganglia.d; pool metrics and user metrics go to Ganglia.
The Ganglia server echoes them to Graphite.
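
For reference, forwarding metrics to Graphite can be sketched with its plaintext protocol, where each line is `<path> <value> <timestamp>`. This is not UWisc's setup, only a minimal illustration; the prefix `condor.pool` and the metric names are hypothetical, and port 2003 is assumed to be the Carbon plaintext listener (its usual default).

```python
import socket


def graphite_lines(prefix, metrics, timestamp):
    """Format metrics as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    return "".join(
        f"{prefix}.{name} {value} {int(timestamp)}\n"
        for name, value in sorted(metrics.items())
    )


def send_to_graphite(payload, host, port=2003):
    """Send a plaintext payload to a Carbon listener (host is site-specific)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))


# Hypothetical pool metrics; real values would come from the pool itself.
pool_metrics = {"slots.claimed": 120, "slots.unclaimed": 8}
payload = graphite_lines("condor.pool", pool_metrics, 1537434000)
print(payload, end="")
```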

CCIN2P3: collectd to collect metrics and syslog-ng for logs; normalize and enrich with Riemann, send everything to Elasticsearch (ES). Grafana and Kibana as UI (using PBS/Torque?)

Summary: everyone uses the CLI to query HTCondor periodically and collect statistics
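
The common pattern can be sketched roughly as follows: run an HTCondor command on a schedule, parse its text output and count things. This is only an illustration, not any site's script. It assumes `condor_status -autoformat State Activity` prints one "State Activity" pair per slot; the function names and the sample output are hypothetical.

```python
import subprocess
from collections import Counter


def parse_slot_states(text):
    """Count slots per (State, Activity) pair from two-column text output."""
    counts = Counter()
    for raw in text.splitlines():
        parts = raw.split()
        if len(parts) == 2:  # skip blank or malformed lines
            counts[(parts[0], parts[1])] += 1
    return counts


def poll_slot_states():
    """Run condor_status once and count slot states (needs an HTCondor client)."""
    result = subprocess.run(
        ["condor_status", "-autoformat", "State", "Activity"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_slot_states(result.stdout)


# Hypothetical sample output, so the parser can be tried without a pool.
sample_output = "Claimed Busy\nClaimed Busy\nUnclaimed Idle\nOwner Idle\n"
slot_counts = parse_slot_states(sample_output)
print(dict(slot_counts))
```

Calling `poll_slot_states()` from cron or a simple loop would give the periodic collection described above.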

Two levels of monitoring:
1. Accounting
    - Auditing (security?, CE and VO vs. machine in time).
2. Metric gathering
    - Alerting (Nagios / Grafana)
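
To make the alerting side of metric gathering concrete, here is a minimal sketch of a Nagios-style threshold check. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention; the metric (held jobs) and the thresholds are hypothetical and were not discussed at the meeting.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_held_jobs(held, warn, crit):
    """Return (exit_code, message) for a held-job count against two thresholds."""
    if held is None:
        return UNKNOWN, "UNKNOWN - no data"
    if held >= crit:
        code = CRITICAL
    elif held >= warn:
        code = WARNING
    else:
        code = OK
    # Text after '|' is performance data in the form label=value;warn;crit
    message = f"{STATUS_NAMES[code]} - {held} held jobs | held_jobs={held};{warn};{crit}"
    return code, message


code, message = check_held_jobs(12, warn=10, crit=50)
print(code, message)
```

A real plugin would exit with `code` (for example via `sys.exit(code)`) so that Nagios can pick up the state.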

Reporting: Grid monitoring requirements (WLCG / OSG / EGI), how can they be met in a more standard way?

What do we expect from this WG?

KIT: find a common core and understand who is doing what; monitoring plus a bit of accounting. It would be nice to find a common base for HTCondor monitoring. Contribute to fifemon and make it better.

CERN: share tools, experience and developments

Someone: share experience (how, by what means, with which tools)

INFN-T1: we produce APEL Records
https://wiki.egi.eu/wiki/APEL/Documentation

INFN: test HTCondor + HTCondor-CE instance; monitoring system based on InfluxDB + Sensu + Grafana. Nothing in place for HTCondor yet, running LSF for now. Plan to move from LSF + CREAM-CE to HTCondor + HTCondor-CE.

We will set up a wiki page where we can have:
1. Case studies
2. Wishlist / Tech Discussion area
3. Summary matrix of people / technologies / sites

Next meeting in approximately a month; thereafter every month or two, upon agreement.

09:00 - 10:00: Initial Meeting (1h)
Define scope of work and gather information about current status of people's monitoring solutions, with a view towards how to collaborate and improve.