Two topics to discuss:
* Project plan
* HammerCloud functional tests
Kick-off meeting for the WLCG monitoring consolidation project
Pablo Saiz
local: Eddy, Alessandra, Marian, Maarten, Pablo, Nicolo, Lionel, Julia, Valentina, Luca, Andrea, Simone, Alessandro, Alberto
remote: David

Minutes taken by Nicolo

JIRA actions for February

Almost all tasks due for February are closed
ticket 28 on drupal - Julia will close it
ticket 52 on downtime information - still open, because ACE/MyWLCG/SUM all use different algorithms - discussion needed to agree which one to use
ticket 42 - simplification - Luca will update with a draft for the proposal next week
ticket 16 - HC schema - moved to March

Pablo reminds everyone to check the many tasks due for March

Valentina presents HammerCloud

Slide 8 on Job Templates:

Pablo: how to send tests to many sites?
Valentina: from a single template, define tests which submit tasks ("Ganga jobs", each containing many jobs) to many sites, replacing variables in the template.

Example template on slide 8: 'inputdata' is CRAB-specific; other variables are for Ganga to choose plugin
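
The idea of one template expanded for many sites can be sketched as follows. This is only an illustration using Python's standard string.Template; the template text, the variable names (site, backend, inputdata as laid out here) and the function render_for_sites are hypothetical and do not reproduce the actual HammerCloud or Ganga template format shown on the slide.

```python
from string import Template

# Hypothetical template text: the real HammerCloud template format
# is not reproduced in these minutes.
JOB_TEMPLATE = Template(
    "site = $site\n"
    "backend = $backend\n"
    "inputdata = $inputdata\n"
)


def render_for_sites(template, sites, common):
    """Return a dict mapping each site to its rendered job description."""
    rendered = {}
    for site in sites:
        # Per-site value overrides/extends the values shared by all sites.
        values = dict(common, site=site)
        rendered[site] = template.substitute(values)
    return rendered


# Site names and dataset path are placeholders for illustration only.
jobs = render_for_sites(
    JOB_TEMPLATE,
    ["SITE_A", "SITE_B"],
    {"backend": "CRAB", "inputdata": "/example/dataset"},
)
```

Here only the site changes between renderings; the shared values stay in one place, which is the point of defining the test from a single template.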

Demonstration of "Add template" GUI:

- parameters for test frequency/duration
- parameters for the location of the template/tarball files on the HC host. The files need to be copied to the HC host by an operator or a cron job.

Marian: how do we set command line args in test?
- extraargs parameter

Demonstration of existing template in "Template" GUI:

- configuration of sites, job pressure (max jobs in queue and running)

Pablo: when are CE/SE white lists set?
Andrea: for CRAB, SEs are taken from the 'Site' table in HammerCloud. This requires manual updates; tests stop if an SE name changes. CEs are taken from the BDII.

Slide 11:

Valentina: "Test create" creates the test entry in the DB; "Test generate" writes the Ganga job (only once per test).
Valentina: HC polls experiment WMSes for job monitoring, frequency configurable (usually 30 seconds)

Julia: how do you keep a constant load of jobs on a site?
Andrea: HC monitors jobs; when number of jobs falls below configurable minimum, HC submits a new task
Alessandro: functional tests have 1 job per task, so you stay constant at the minimum. Stress tests have many jobs per task, so you will go above the minimum.
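
The behaviour described by Andrea and Alessandro can be illustrated with a minimal sketch. It assumes that one monitoring cycle submits at most one task; the function name jobs_after_check and its parameters are hypothetical and are not taken from the HammerCloud code.

```python
def jobs_after_check(active_jobs, min_jobs, jobs_per_task):
    """Apply one monitoring cycle and return the resulting job count.

    Assumption: if the site is below its configured minimum, exactly one
    new task (containing jobs_per_task jobs) is submitted.
    """
    if active_jobs < min_jobs:
        return active_jobs + jobs_per_task
    return active_jobs


# Functional test: 1 job per task, so the count returns to the minimum.
functional = jobs_after_check(active_jobs=4, min_jobs=5, jobs_per_task=1)

# Stress test: many jobs per task, so the count overshoots the minimum.
stress = jobs_after_check(active_jobs=4, min_jobs=5, jobs_per_task=100)
```

With these example numbers the functional test ends at 5 jobs and the stress test at 104, which matches the remark that stress tests go above the minimum.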

Julia: how do we avoid too many queries to experiment WMSes if there are too many HC jobs?
Simone, Alessandro, Andrea, Nicolo': bulk queries to experiment WMSes
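
A rough sketch of what bulk querying means in practice: job identifiers are grouped into batches and each batch is sent as one query, instead of one query per job. The names poll_statuses, query_many and fake_query_many, and the batch size, are hypothetical; the real experiment WMS interfaces are not described in these minutes.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def poll_statuses(job_ids, query_many, batch_size=100):
    """Collect job statuses using one query per batch of job ids.

    query_many is a caller-supplied function (hypothetical here) that takes
    a list of job ids and returns a dict mapping job id to status.
    """
    statuses = {}
    for batch in chunked(job_ids, batch_size):
        statuses.update(query_many(batch))
    return statuses


# Stand-in for a WMS client, used only to count how many queries are made.
calls = []


def fake_query_many(batch):
    calls.append(len(batch))
    return {job_id: "running" for job_id in batch}


result = poll_statuses(list(range(250)), fake_query_many, batch_size=100)
```

In this example 250 jobs are covered by 3 queries rather than 250.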

Slide 15:
Valentina: to calculate some metrics (e.g. CPU/Walltime), HC needs to get information at the end of the job.
The implementation depends on the plugin, e.g. download the log and parse it.
To add a new metric, the plugin needs to be updated if the information is not already available.
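
As an illustration of the "download the log and parse it" approach, the sketch below computes a CPU/Walltime ratio from a job log. The log format (lines such as cpu_seconds=90) and the function name cpu_efficiency are assumptions made for this example; each plugin would have its own log format.

```python
import re

# Hypothetical log format: one "key=value" pair per line.
LOG_PATTERN = re.compile(
    r"^(cpu_seconds|wall_seconds)=(\d+(?:\.\d+)?)$", re.MULTILINE
)


def cpu_efficiency(log_text):
    """Return CPU time divided by wall time, or None if it cannot be computed."""
    values = {key: float(val) for key, val in LOG_PATTERN.findall(log_text)}
    if "cpu_seconds" not in values or "wall_seconds" not in values:
        return None
    wall = values["wall_seconds"]
    if wall <= 0:
        return None
    return values["cpu_seconds"] / wall


sample_log = "job started\ncpu_seconds=90\nwall_seconds=120\njob finished\n"
efficiency = cpu_efficiency(sample_log)
```

If a log does not contain both values, the function returns None; this corresponds to the case where the plugin would need to be updated before the metric can be added.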

Slide 16:
Alessandro: "Athena nightly build system" tests are to validate nightly releases

Slide 17: y axis is number of tests/month; on average ~1000 jobs/test

Discussion after presentation:

Alessandra: Role of Ganga?
Valentina: common interface to CRAB/PanDA/DIRAC. Also has local sqlite for job tracking

Pablo: could we use Ganga plugins to test different things, in addition to jobs? E.g. run local script to test SRM.
Valentina, Maarten: yes, but Ganga is designed around jobs.

Simone: do you propose to replace Nagios with HC functional tests?
Pablo: scheduling functionality is similar, and HC has more reports
Andrea: but Nagios is much more powerful for configuring scheduling

Valentina: HC copies job status from Ganga DB into HC DB
Julia: so job status is tracked three times: in the experiment WMS, the Ganga DB and the HC DB. It seems that Ganga is not needed for this, only as a common interface.
Andrea: Ganga is there for historical reasons, but removing it needs heavy rewriting

Luca: to evaluate HC for SAM, compare with what we do with Nagios:
- WN tests: HC could do it
- CE job submission tests: impossible with HC, cannot submit to specific CEs, HC not designed to do it
- other tests (e.g. SRM)

Pablo: two topics for next meeting:
1) definition of availability/reliability (currently three systems do it in different ways)
2) continue discussion on Nagios Ops


David: I can give summary slides of the UK discussion on monitoring
Alberto volunteers to take minutes next time
