In the room: Stefan, Alessandro, Lionel, Luca, Andrea, Julia, Pablo, Mike, Ivan, Eddie, Costin, Alberto
On the phone: Alessandra
Apologies from: Maarten, Marian, Pedro
Minutes taken by Costin
Pablo: Clean-up procedure. We have agreed that test results older than 3 months are not important. We have already cleaned them up from preproduction. If there are no objections, in two weeks we will also clean them from production.
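As an illustration of the retention rule above, here is a minimal sketch that separates results to keep from results to delete. It is not the actual clean-up script: the 90-day approximation of "3 months", the `timestamp` field and the function name are all assumptions made for this example.

```python
from datetime import datetime, timedelta

# Assumption: "3 months" is approximated as 90 days.
RETENTION = timedelta(days=90)


def split_by_age(results, now):
    """Split results into (kept, expired) around the retention cutoff.

    Each result is a dict with a hypothetical 'timestamp' key holding a datetime.
    """
    cutoff = now - RETENTION
    kept = [r for r in results if r["timestamp"] >= cutoff]
    expired = [r for r in results if r["timestamp"] < cutoff]
    return kept, expired
```

Running the split first on preproduction data and reviewing the `expired` list before deleting anything mirrors the preproduction-then-production order described above.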
1. ATLAS probes
Alessandro: the same code is called with different attributes, and several probes produce many different attributes
Julia: a future version will offer the option to have custom services or site-wide values. If that is good enough, then no effort will be spent on backporting this feature
Alessandro, Alessandra: ok
Pablo: the same storage probe would be usable by ATLAS, CMS and LHCb, right?
Alessandro: it should indeed be very easy to implement the experiment-specific function to get the list of services
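One possible way to picture a shared storage probe with an experiment-specific service list is sketched below. The registry, the decorator, the function names and the host names are hypothetical and serve only to illustrate the idea. They do not describe the existing probe code.

```python
from typing import Callable, Dict, List

# Hypothetical registry: experiment name -> function returning its storage services.
SERVICE_LISTERS: Dict[str, Callable[[], List[str]]] = {}


def register(experiment: str):
    """Decorator that registers the experiment-specific service lister."""
    def decorator(func: Callable[[], List[str]]):
        SERVICE_LISTERS[experiment] = func
        return func
    return decorator


@register("ATLAS")
def atlas_services() -> List[str]:
    # Placeholder host names, not real endpoints.
    return ["se1.example.org", "se2.example.org"]


@register("LHCb")
def lhcb_services() -> List[str]:
    return ["se3.example.org"]


def run_storage_probe(experiment: str, check: Callable[[str], object]) -> dict:
    """Run the same check against every service listed for the experiment."""
    services = SERVICE_LISTERS[experiment]()
    return {service: check(service) for service in services}
```

With this layout only the lister differs per experiment, while `run_storage_probe` and the check itself stay common.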
Julia: do you want to take this occasion to get rid of the grid monitoring framework in the Nagios probes?
Julia: do you reuse any components from this, like communication libraries?
Alessandro: Nagios should report directly to the message queue
Alessandro, Julia, Luca: discussion around the active/passive probes, since different files are used, and get and put are not synchronous (by using pre-placed files), and it might be complicated to publish the tests in a consistent way
Stefan: To clarify: LHCb sees the system as a black box, API is essential to publish/retrieve results, so the internals are not important for LHCb. Also the actual queue (Nagios or SAM) is not important since the message is the same.
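The point that the message is the same whichever queue carries it can be sketched as follows. The message fields, the `InMemoryQueue` class and the function names are assumptions made for this illustration. The real message schema and queue API are not specified in these minutes.

```python
import json


def build_result_message(service, metric, status, details=""):
    """Serialise one test result; the field names are hypothetical."""
    return json.dumps(
        {"service": service, "metric": metric, "status": status, "details": details},
        sort_keys=True,
    )


class InMemoryQueue:
    """Stand-in for any queue (Nagios or SAM) that exposes a publish() method."""

    def __init__(self, name):
        self.name = name
        self.messages = []

    def publish(self, message):
        self.messages.append(message)


def publish_result(queue, service, metric, status, details=""):
    """Build the message once and hand it to whichever queue is given."""
    message = build_result_message(service, metric, status, details)
    queue.publish(message)
    return message
```

A publisher written this way treats the transport as a black box: swapping one queue object for another does not change the message content.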
Alessandro: is it important to see the results in the Nagios box? We think not. So far they are visible.
Alessandra, Julia: if the tests are passively important it might be difficult to rerun the tests locally. Looking for alternatives. If the experiments inject messages directly, in any Nagios box, then the site ones or even the CERN one are not important.
Alessandro: re-run metrics: not if they are not active and in the local box. Notifications: in theory it is possible even if the tests are not local. Import by sites: useful, but could be replaced by an API, which sounds like a good option. Still, re-running the metrics is the important issue.
Services should be instrumented to publish error codes in Nagios. A Nagios box reimplementing APF doesn't make sense.
Costin: ALICE has the same approach, to publish status from the VoBox
Alberto: can we understand if the error is site or experiment-specific from a single instance?
Stefan: important point for LHCb, since the experiment is bringing the entire middleware with it, it might complicate the picture
Alessandro: submission is independent of the payload and there will be two different error states, one for the site and another for the experiment software
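The two error states could be modelled along the following lines. This is only a sketch: the state names and the rule that a failed submission is attributed to the site while a failed payload is attributed to the experiment software are assumptions for illustration, not an agreed design.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    SITE_ERROR = "site_error"
    EXPERIMENT_ERROR = "experiment_error"


def classify(submission_ok: bool, payload_ok: bool) -> Outcome:
    """Attribute a failure to the site or to the experiment software.

    Assumption: submission is evaluated first and independently of the payload.
    """
    if not submission_ok:
        return Outcome.SITE_ERROR
    if not payload_ok:
        return Outcome.EXPERIMENT_ERROR
    return Outcome.OK
```

Checking the submission first means a payload result is only interpreted when the job actually reached the site.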
Andrea: this development would mean deciding from the beginning that the Nagios boxes have to be decommissioned. And since the current system is working well and CMS is happy with it, we need more than an idea of change before committing to this.
Alessandro: ATLAS wants to automate and simplify the environment as much as possible.
Julia: functional tests are run once per hour and would not interfere at all with the automation
Alessandro: to be investigated
2. From Quattor to Agile Infrastructure Deployment
- postponed to the next meeting, which should take place next week, on Friday morning, in order to catch up