Minutes of the Monitoring Consolidation meeting of April 4, 2014




Alessandra (remote)
David C. (remote)
Maarten (minutes)

Apologies: Julia, Costin

Last person to arrive: Alessandro

Status of tasks (Pablo)

- Downtime information propagation needs more time.

- A test broker has been made available for ALICE.  LHCb can use it too.

- AI and a few other tasks are slightly delayed, but there is no reason to worry.

Feedback from UK sites (David)

- Pablo: separating tests of different importance and priorities can be
  done via different profiles.

- David: the questions stem from experience with the current system;
  responses from the team will be taken back to the sites.

- Maarten: a lot of info in REBUS taken from the BDII is unreliable,
  e.g. capacities reported in the wrong units.  How will we make the new
  REBUS better in that respect?

- Pablo: pledges will remain manual and capacities will still be taken
  from the BDII.

- Eddie: the values will be checked for sanity.

- Alessandro: VO shares?  Incorrect values have been observed.

- Pablo: currently once per year.

- Alessandro: compare capacities with pledges?

- Pablo: let's put that on the wish list.

- Pablo: what are the SUM "variations"?

- David: pages within the SUM portal.
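
The sanity check on capacities (Maarten, Eddie) and the comparison with
pledges (Alessandro) could look roughly like the sketch below.  It is only
an illustration: the record layout, the field names and the two thresholds
are assumptions and do not come from REBUS or the BDII.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, chosen only for illustration.
MIN_RATIO = 0.1   # published capacity below 10% of the pledge looks suspicious
MAX_RATIO = 10.0  # more than 10x the pledge suggests a unit mix-up


@dataclass
class SiteCapacity:
    site: str
    pledged: float    # entered manually
    published: float  # taken from the information system, same unit assumed


def check_capacity(entry: SiteCapacity) -> List[str]:
    """Return a list of warnings; an empty list means the value looks sane."""
    if entry.published <= 0:
        return ["published capacity is zero or negative"]
    if entry.pledged <= 0:
        return ["no pledge available for comparison"]
    ratio = entry.published / entry.pledged
    if ratio < MIN_RATIO:
        return [f"capacity is only {ratio:.2f}x the pledge"]
    if ratio > MAX_RATIO:
        return [f"capacity is {ratio:.0f}x the pledge, possible wrong units"]
    return []
```

A value that passes returns an empty list; anything else would be flagged
for a human to look at rather than corrected automatically.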

Simplification of test framework (Luca)

- Because the 3 use case classes for testing remain unchanged (see Fig. 2),
  the SAM architecture must remain fairly similar!

- Moving some of the current SAM aspects to SAM-3 can be considered normal.

- There are 4 main areas for evolution:

1. Configuration could be made independent of Nagios.

2. The reliability of running WN tests could be improved by using a custom
   script instead of Nagios on the WN, but then we would need to reinvent
   the features we now get for free from Nagios.

3. On the test submission hosts we could use some other framework instead,
   but Nagios still looks quite well suited for the foreseeable future.

- Pablo: in his presentation David mentioned that site admins would like
  to be able to trigger tests to be rerun.  Active vs. passive tests?

- Marian: that is easy for active tests and indirectly for passive tests,
  but could be difficult for external tests by the experiments.

- Alessandro: we can mitigate that by increasing the testing frequency.

- Marian: Nagios can still handle a lot more, while the limitations would
  come from the middleware to be tested, e.g. concerning timeouts.

- David: the timescale should be less than the few hours we may see today.

4. Metrics inference and on-demand testing.  Metrics might e.g. be based on
   results from accounting.

- Andrea: the experiments would need to implement that.

- Luca: a common framework could still be provided.

- Proposed strategy is to start with configuration simplification.

- Maarten: does the future of Nagios look good?

- Marian: yes, the Nagios core is open and developed by the community,
  while there also is the enterprise stack with commercial support.

- Andrea: though the Nagios configuration may have a steep learning curve,
  we have now been familiar with it for a few years.

- Luca: the configuration still has limitations and can be inflexible.

- Alberto: (web) templates?  Are changes really needed or just nice to have?

- Pablo: there are experiment requirements on the configuration tool and
  the current expertise is not at CERN but in EGI (Emir), which is a risk.

- Marian: if WLCG diverges from EGI, that also carries a risk; we currently have
  a well-tested common framework in which new probes can be added easily.
  Instead, the profile management may be able to help with the configuration.
  In any case expertise will be needed to understand how the system works.

- Andrea: it may be sufficient to improve the documentation a bit,
  as most of the changes are in the probe code, not in the configuration.

- Andrea: Nagios on the WN may need a lot of memory, which can cause the
  job to get killed by the batch system.

- Marian: that does not seem to be a big worry.  It is doubtful that we can
  easily move to an alternative that uses little memory.  We currently use
  a single statically compiled binary for all platforms, but we could have
  a different binary per platform if needed.
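
To give an idea of what "reinventing" Nagios features in a custom WN script
(area 2) would involve, here is a minimal sketch of a probe runner.  It
assumes the probes keep the standard Nagios plugin exit codes (0 OK,
1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function name, the default timeout
and the choice to report a timeout as UNKNOWN are assumptions made for this
sketch.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_probe(command, timeout_s=60):
    """Run one probe command; return (status name, first line of output)."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Choice made for this sketch: a hung probe is reported as UNKNOWN.
        return "UNKNOWN", f"probe timed out after {timeout_s}s"
    except OSError as exc:
        # Covers a missing or non-executable probe.
        return "UNKNOWN", f"probe could not be started: {exc}"
    status = STATUS.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.strip().splitlines()
    return status, lines[0] if lines else ""


if __name__ == "__main__":
    # Stand-in probe: a Python one-liner that prints a message and exits 0.
    print(run_probe([sys.executable, "-c", "print('fine')"]))
```

Scheduling, retries, result publication and memory control are not covered
here and would also have to be written, which is the cost mentioned under
area 2.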

Next meeting on Fri Apr 11

- Report from PIC on their Nagios plugin

- Availability comparisons between SAM and SAM-3

'Volunteers' for the minutes: Costin, Julia, Alessandro