Downtime

Europe/Zurich
513/R-068 (CERN)

Description
Two topics to discuss:
- Project plan
- Downtime info
Monitoring Consolidation - 14 March
https://indico.cern.ch/event/305345/
Room: Ivan, Pablo, Maarten, Lionel, Alberto, Stefan, Edi, Julia
Remote: Alessandro, Alessandra

Apologies: David C., Andrea, Nicolo, Costin, Luca, Marian

Minutes taken by Alberto
—————
Review of JIRA tasks
Most of the actions for February are already closed. The only three still marked as open are:
- Julia: Documentation will be on the TWiki instead of Drupal.
- Luca: The Nagios framework proposal has been made.
- Ivan: Availability calculation, which should be clearer after today.

There are still around 15 tasks to be completed before the end of March. The person responsible for each task is asked to verify that it is still on track.
————
Availability and Reliability in SAM
The report also covers improvements (see slides).

Slide 3: Fixed bug on aggregation of “unknown” (since 11 March)

Slide 4: improved load time by factor of 10

Slide 5: Storing all test results, not just the status changes
Maarten: Is it much slower?
Answer: Not really. Everything is kept, but it can be compressed by removing duplicates of the same value.
Julia: The uncompressed version could be kept if experiments ask for it.
Stefan: We could compress only the green values
Maarten: It could be useful to check the logs for periods of uptime and successful tests.
Julia: Move uncompressed data into an archive after 3 months.
Stefan: Before discussing this we should understand what the size is.
Pablo: Will check the size, but having a smaller DB also speeds up the UI.
Julia: For instance, Job Monitoring raw data is kept forever; users don’t look at it but may ask for a historical plot.
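
The compression idea discussed for Slide 5 can be sketched as follows. This is only an illustration, not the SAM implementation: the `(timestamp, status)` layout, the status names and the `compress` function are assumptions made for the example. The `only` argument mirrors Stefan's suggestion of compressing just the green values.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical record layout: (epoch timestamp, status string).
Result = Tuple[int, str]


def compress(results: Iterable[Result],
             only: Optional[Set[str]] = None) -> List[Result]:
    """Drop a result when it repeats the status of the previous result.

    If `only` is given, repeats are dropped just for those statuses
    (for example {"OK"}); results with any other status are all kept.
    """
    kept: List[Result] = []
    previous_status: Optional[str] = None
    for timestamp, status in results:
        is_repeat = status == previous_status
        compressible = only is None or status in only
        if not (is_repeat and compressible):
            kept.append((timestamp, status))
        previous_status = status
    return kept


raw = [(0, "OK"), (1, "OK"), (2, "CRITICAL"), (3, "CRITICAL"), (4, "OK")]
all_compressed = compress(raw)            # keeps status changes only
green_only = compress(raw, only={"OK"})   # keeps every non-OK result
```

With `only={"OK"}` the repeated CRITICAL result survives, so the details of failing periods stay available while the long green stretches shrink.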

Slide 6: Availability and Reliability algorithms
Scheduled downtime (SD) has an impact on availability but not on reliability.

Slide 7: Service downtime
Only the scheduled outages are reported.

Slide 8: Example
4 hours of downtime in the first plot: 20/24, about 83%
1 hour in critical status: 23/24
The second plot uses a daily bin
The export should contain the numbers instead of the colours
The third plot shows reliability
In the fourth plot the service is 100% reliable
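
The arithmetic behind the example can be sketched as below. The minutes do not spell out the formulas, so this assumes the commonly used WLCG definitions: availability counts scheduled downtime as unavailable time, reliability removes it from the denominator, and time in unknown status is excluded from both. The function names are hypothetical; the slides remain the reference for the actual algorithm.

```python
def availability(up: float, total: float, unknown: float = 0.0) -> float:
    """Fraction of known time in which the service was up.

    Scheduled downtime is not subtracted, so it lowers availability.
    """
    known = total - unknown
    return up / known if known > 0 else float("nan")


def reliability(up: float, total: float, scheduled_down: float,
                unknown: float = 0.0) -> float:
    """Fraction of the time the service was expected to be up that it was up.

    Scheduled downtime is removed from the denominator (assumed definition).
    """
    expected = total - scheduled_down - unknown
    return up / expected if expected > 0 else float("nan")


# A 24-hour day with 4 hours of scheduled downtime and no failures.
day_with_sd = (availability(20, 24), reliability(20, 24, scheduled_down=4))

# A 24-hour day with 1 hour in critical status and no scheduled downtime.
day_with_critical = (availability(23, 24), reliability(23, 24, scheduled_down=0))
```

Under these assumptions the first day is 20/24 available but fully reliable, while the hour in critical status lowers both values to 23/24.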

Slide 9
Site downtime metrics
Combines the services running on the site.
Alessandro: The combination of the services is also done differently depending on the service type. At the moment, the CEs are in OR and the SRMs are in AND.
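
Alessandro's point can be illustrated with a minimal sketch. Only the OR for CEs and the AND for SRMs come from the discussion; the boolean per-instance status, the `site_is_up` function and the choice to AND the two service types together are assumptions for the example.

```python
from typing import Dict, List


def site_is_up(services: Dict[str, List[bool]]) -> bool:
    """Combine per-instance status into one site status.

    CEs are combined with OR: one working CE is enough.
    SRMs are combined with AND: every SRM must be working.
    The two groups are then ANDed (an assumption of this sketch).
    """
    ce_ok = any(services.get("CE", []))    # False when the site lists no CE
    srm_ok = all(services.get("SRM", []))  # True when the site lists no SRM
    return ce_ok and srm_ok


one_ce_down = site_is_up({"CE": [True, False], "SRM": [True, True]})
all_ces_down = site_is_up({"CE": [False, False], "SRM": [True, True]})
one_srm_down = site_is_up({"CE": [True], "SRM": [True, False]})
```

A single failing CE leaves the site up, whereas a single failing SRM marks it down.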

The WLCG reports will look exactly the same; for some months both versions will be generated and compared.
---
Pablo: The experiments could also check if the metrics are correct.

Alberto: It would be useful to keep the history of the formula so that one can apply it to old data or go back to previous versions.
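
One possible way to keep such a history is a list of formula versions, each with the date from which it applies. This is a hedged sketch only: the `FormulaHistory` class, the dates and the two example formulas are hypothetical and not taken from the project.

```python
from bisect import bisect_right
from datetime import date
from typing import Callable, List, Tuple

# Hypothetical signature: (up, total, scheduled_down) -> metric value.
Formula = Callable[[float, float, float], float]


class FormulaHistory:
    """Keeps every version of a formula with the date it took effect."""

    def __init__(self) -> None:
        self._versions: List[Tuple[date, Formula]] = []

    def add(self, valid_from: date, formula: Formula) -> None:
        self._versions.append((valid_from, formula))
        self._versions.sort(key=lambda version: version[0])

    def at(self, day: date) -> Formula:
        """Return the formula that was in force on `day`."""
        dates = [valid_from for valid_from, _ in self._versions]
        index = bisect_right(dates, day) - 1
        if index < 0:
            raise LookupError(f"no formula defined on {day}")
        return self._versions[index][1]


history = FormulaHistory()
# Illustrative versions and dates only.
history.add(date(2013, 1, 1), lambda up, total, sd: up / total)
history.add(date(2014, 3, 1), lambda up, total, sd: up / (total - sd))

old_value = history.at(date(2013, 6, 1))(20, 24, 4)  # first version applies
new_value = history.at(date(2014, 4, 1))(20, 24, 4)  # second version applies
```

Old data can then be recomputed either with the version that was valid at the time or with any later version chosen explicitly.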
———
Next meeting (in 3 weeks)
- Report on UK Sites (David)
- Nagios for OPS (Maite)
- Simplification of Nagios (Luca)
Julia agrees to take the minutes in that meeting
    • 14:00 14:10
      JIRA actions for February 10m
      Review of the JIRA actions scheduled for this and next month https://its.cern.ch/jira/issues/?filter=13902
      Speaker: Pablo Saiz (CERN)
    • 14:10 14:30
      Downtime and availability/reliability algorithm 20m
      Speaker: Ivan Antoniev Dzhunov (CERN)
      Slides
    • 14:50 14:55
      Discussion 5m
    • 14:55 15:00
      Next meeting 5m