HEPiX Batch Monitoring Working Group

Name: HEPiX Batch Monitoring Working Group
Start: 2018-09-20T09:00:00-04:00
End: 2018-09-20T10:00:00-04:00
Location: No location set

Thursday 20 Sept 2018, 09:00 → 10:00 America/New_York


HEPiX Batchsys monitoring meeting
2018-09-20
https://indico.cern.ch/event/758427/

Not using fifemon:
KIT (tried it but not using it), UWisc, Oxford, CCIN2P3

Using fifemon: http://fifemon.github.io/
BNL (heavily modified), FNAL, CERN (modified?)

Oxford: monitoring with HTCondor Python bindings, Telegraf, InfluxDB and Grafana, plus an in-house Python script
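
A pipeline of this kind could be sketched roughly as follows. This is only an illustration, not Oxford's actual script: the measurement name `condor_jobs`, the `pool` tag and the sample ads are assumptions. In practice the job ads would come from a query through the HTCondor Python bindings, and the resulting lines would be handed to Telegraf or written to InfluxDB.

```python
from collections import Counter

# Standard HTCondor JobStatus codes from the job ClassAd.
JOB_STATUS = {1: "idle", 2: "running", 3: "removed", 4: "completed",
              5: "held", 6: "transferring_output", 7: "suspended"}

def count_by_status(job_ads):
    """Count jobs per status name; job_ads is any iterable of dict-like ads."""
    counts = Counter()
    for ad in job_ads:
        counts[JOB_STATUS.get(ad.get("JobStatus"), "unknown")] += 1
    return counts

def to_line_protocol(counts, pool, timestamp_ns):
    """Render counts as InfluxDB line protocol, one line per status."""
    return [
        f"condor_jobs,pool={pool},status={status} count={n}i {timestamp_ns}"
        for status, n in sorted(counts.items())
    ]

# Hypothetical sample ads standing in for a real schedd query result.
sample_ads = [{"JobStatus": 2}, {"JobStatus": 2}, {"JobStatus": 1}, {"JobStatus": 5}]
lines = to_line_protocol(count_by_status(sample_ads), "example", 1537448400000000000)
print("\n".join(lines))
```

Keeping the counting step separate from the formatting step means the same counts could also be sent to a different backend.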

UWisc: condor_gangliad; pool metrics and user metrics go to Ganglia
The Ganglia server forwards them to Graphite
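
For reference, metrics reach Graphite in its plaintext format, one `<path> <value> <timestamp>` line per metric. A minimal sketch of that formatting step, assuming a hypothetical metric prefix `condor.examplepool` and made-up values (not UWisc's actual metric names):

```python
import time

def graphite_lines(metrics, prefix, timestamp=None):
    """Format {name: value} as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [f"{prefix}.{name} {value} {ts}" for name, value in sorted(metrics.items())]

# Hypothetical pool metrics; real values would come from Ganglia or the collector.
pool_metrics = {"jobs.running": 120, "jobs.idle": 35, "slots.total": 200}
payload = "\n".join(graphite_lines(pool_metrics, "condor.examplepool", 1537448400)) + "\n"
print(payload, end="")
```

The payload would then be written to the Carbon plaintext listener (TCP port 2003 by default).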

CCIN2P3: collectd for metrics and syslog-ng for logs; normalized and enriched with Riemann, then everything sent to Elasticsearch (ES). Grafana and Kibana as UIs (using PBS/Torque?)

Summary: everyone periodically calls the condor CLI tools and collects statistics from the output
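
The common pattern can be sketched as below, assuming `condor_q` is installed and its `-json` output is used. The choice of attributes and the per-owner running count are illustrative assumptions, not any site's actual collector. The parsing is kept separate from the CLI call so it can be tried offline.

```python
import json
import subprocess
from collections import Counter

def parse_condor_q_json(text):
    """Return running-job counts per Owner from `condor_q -json` output."""
    text = text.strip()
    ads = json.loads(text) if text else []  # guard against empty output
    return Counter(ad.get("Owner", "unknown") for ad in ads if ad.get("JobStatus") == 2)

def poll_once():
    """Run condor_q once; requires HTCondor to be installed on this host."""
    out = subprocess.run(
        ["condor_q", "-allusers", "-json", "-attributes", "Owner,JobStatus"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_condor_q_json(out)

# Offline example: hand-written input in the shape of a JSON array of job ads.
sample = ('[{"Owner": "alice", "JobStatus": 2}, '
          '{"Owner": "bob", "JobStatus": 1}, '
          '{"Owner": "alice", "JobStatus": 2}]')
running = parse_condor_q_json(sample)
print(dict(running))
```

A scheduler such as cron or a simple loop would call `poll_once()` at a fixed interval and forward the counts to the metrics backend.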

Two levels of monitoring:
1. Accounting
- Auditing (security?; CE and VO vs. machine over time)
2. Metric gathering
- Alerting via Nagios / Grafana

Reporting: Grid monitoring requirements (WLCG / OSG / EGI), how can they be met in a more standard way?

What do we expect from this WG?

KIT: find a common core and understand who is doing what; monitoring plus a bit of accounting. It would be nice to find a common base for HTCondor monitoring, and to contribute to fifemon and make it better.

CERN: share tools & share experience & developments

Someone: share experience: how, by what means, with which tools

INFN-T1: we produce APEL Records
https://wiki.egi.eu/wiki/APEL/Documentation

INFN: test HTCondor + HTCondor-CE instance; monitoring system based on InfluxDB + Sensu + Grafana. Nothing in place for HTCondor yet, running LSF for now. Plan to move from LSF + CREAM-CE to HTCondor + HTCondor-CE.

Will set up Wiki page where we can have:
1. Case studies
2. Wishlist / Tech Discussion area
3. Summary matrix of people / technologies / sites

Next meeting in approx a month, thereafter every month or two upon agreement.


- 09:00 → 10:00
  
  Initial Meeting 1h
  
  Define scope of work and gather information about current status of people's monitoring solutions, with a view towards how to collaborate and improve.
