Probe information

Europe/Zurich
513/R-068 (CERN)

Description
A more in-depth look at the information-gathering layer
Date: 12-09-2013
Present:
Eddie, Dave, Alex, Marian, Costin, Markus, Pablo, Lionel, Andrea, Nicolo, Valentin, Maarten, Ivan

Apologies:
Pedro, Stefan

Minutes taken by Alex

#######################
#                     #
#  LHCb Presentation  #
#                     #
#######################
Pablo: In the message, are some fields (VO, site name) not important?
Marian: The site name is not important because the message is sent to the experiment Nagios box, which fills in the rest.

Marian: Why are the jobs sent through Nagios / DIRAC?
Valentin: It was suggested by the DIRAC people.
Marian: How often do these jobs run?
Valentin: Not often, no more than one every 30 min.

Pablo: What is the workflow of test submission?
Marian: DIRAC -> worker node -> Nagios -> SAM.
Pablo: Why are the results not sent directly from the worker node to SAM?
Andrea: Because you may want to see the results in the Nagios web UI.

########################
#                      #
#  ALICE Presentation  #
#                      #
########################

Pablo: Is the 24 hours timeout standard for all experiments?
Andrea: This is imposed by SAM (expiration date of a test).
Marian: If a job is sent and not running after 5 hours then it is cancelled.
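
Note (illustration only, not presented at the meeting): a minimal sketch of the two time limits mentioned above, assuming the 24-hour limit applies to the age of a test result and the 5-hour limit to a job that has not started running. The function and constant names are hypothetical and are not part of SAM or of the ALICE tools.

```python
from datetime import datetime, timedelta

# Values quoted in the minutes; all names here are hypothetical.
RESULT_EXPIRATION = timedelta(hours=24)  # SAM expiration of a test result
QUEUE_TIMEOUT = timedelta(hours=5)       # job cancelled if still not running


def should_cancel(submitted_at, started_at, now):
    """True if the job never started and has waited longer than QUEUE_TIMEOUT."""
    return started_at is None and now - submitted_at > QUEUE_TIMEOUT


def is_expired(result_time, now):
    """True if a test result is older than RESULT_EXPIRATION."""
    return now - result_time > RESULT_EXPIRATION
```

With these assumptions, a job submitted at 08:00 that is still queued at 14:00 would be cancelled, while a result published at 08:00 stays valid until 08:00 the next day.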

Marian: It should be easy to add the XRootD type to the A/R formula, since that service is already in GOCDB. In the current system, other services will also have to be registered in GOCDB.

Andrea: Which service flavor do you have for the CE test?
Maarten: At the moment, only CREAM-CEs are tested. In ALICE, some sites are not monitored because they don't use CREAM-CE. When we publish from MonALISA, we should use a 'meta-service', which would be any batch system of the site.
Pablo: The VOBox could be that 'meta-service'.
Maarten: Yes.

Maarten: The plan is to get rid of Nagios; ML could publish directly to the message bus.
Markus: Do you think this idea is shared by the sites? They already use Nagios for their facilities.
Maarten: This is part of a bigger discussion. Orthogonal to all experiments.
Marian: Could it simply be done by a plugin, with active polling when priority is required, or by using information extracted from the existing system?
Andrea: There are currently only 3 sites which want to see the data in their internal Nagios (Lyon, NIKHEF, PIC).
Marian: The other sites rely only on the API.


Pablo: There are two main differences between ALICE and LHCb: in ALICE the messages are sent centrally, while in LHCb every WN sends its own results; and ALICE sends directly to SAM, while LHCb goes through Nagios.


Marian: Can we monitor the status of the WN based on pilots (and submit monitoring data when there is no production activity)?
Andrea: Pilot submission depends on the load of the central queue (which is not constant). Pilots can't be the primary source of monitoring information.
Marian: From the reporting perspective, the absence of pilot jobs could mean everything is OK.
Maarten: The site should be notified as soon as a problem occurs.
Costin: This is just a theoretical problem; sites have jobs to execute almost all the time.
Andrea: WN tests are far from trivial and describe really complex use cases. They couldn't be included in the pilot because it's too complicated to instrument the generic one.
Maarten: We have to be sure that there are always jobs to be run on a site (or send monitoring jobs).
Marian, Andrea: That discussion makes sense only if we move away from Nagios completely.

Marian: Not only the data should be included in the monitoring bus, but also some indication of the ML state.
Maarten: Fair enough.
Costin: It's trivial to detect if ML is down as nothing will be received.
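
Note (illustration only, not presented at the meeting): the "nothing will be received" check could be sketched as a simple silence detector on the consumer side. The class name and the `max_silence` threshold are assumptions made for this example; the minutes do not specify how long a gap should count as "down".

```python
from datetime import datetime, timedelta


class SilenceDetector:
    """Flags a publisher as silent when no message arrived for too long.

    Hypothetical helper: neither the name nor the threshold comes from
    MonALISA or SAM.
    """

    def __init__(self, max_silence):
        self.max_silence = max_silence
        self.last_seen = None  # timestamp of the most recent message

    def record(self, received_at):
        """Remember the newest message timestamp seen so far."""
        if self.last_seen is None or received_at > self.last_seen:
            self.last_seen = received_at

    def is_silent(self, now):
        """True if nothing was ever received, or the last message is too old."""
        if self.last_seen is None:
            return True
        return now - self.last_seen > self.max_silence
```

A check of this kind only shows that nothing arrived. It cannot tell by itself whether the publisher, the message bus, or the network is at fault, which is one possible reading of the request for an explicit state indication.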

Pablo: The current expiration policy of the data is 3 months for metric output and 12 months for availability reports.
Andrea: A long expiration policy is important to present data over a longer period.
Markus: You should keep the data long enough to allow recomputation for reporting purposes.
Pablo: Recomputation is done in the ten days after the end of the month.
Maarten: Details are not so important, we just want the status.
Markus: Just keep the data even after 12 months.
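
Note (illustration only, not presented at the meeting): a minimal sketch relating the retention periods to the recomputation window quoted above, assuming retention is counted in calendar months back from today and the recomputation window ends ten days after the last day of the reported month. All names are hypothetical.

```python
import calendar
from datetime import date, timedelta

# Values quoted in the minutes; the names are hypothetical.
METRIC_RETENTION_MONTHS = 3
REPORT_RETENTION_MONTHS = 12
RECOMPUTATION_DAYS = 10


def months_before(day, months):
    """Date `months` calendar months before `day`, clamping the day of month."""
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def is_retained(record_day, today, retention_months):
    """True if a record from `record_day` is still inside the retention window."""
    return record_day >= months_before(today, retention_months)


def recomputation_deadline(year, month):
    """Last day on which the given month's report may still be recomputed."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day) + timedelta(days=RECOMPUTATION_DAYS)
```

Under these assumptions, the report for September 2013 could be recomputed until 10 October 2013, at which point the metric output from 1 September is still well inside the 3-month window.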

Pablo: We currently have many infrastructures: production, pre-production and validation. Do we really need all of them?
Maarten: OK to merge validation and pre-production.
Pablo: The tests are run several times, and availability might differ slightly between infrastructures.
Andrea: The main requirement is to have a machine on which to put development probes and test them manually inside the full machinery.
Marian: A probe is just a script which could be run anywhere.
Andrea: It is better to have Nagios installed, but the full machinery is less important.
Marian: ATLAS is doing a lot of development on storage probes and needs a lot of testing.


#########
#       #
#  AOB  #
#       #
#########

The next meeting will take place in two weeks (Friday 27 September at 11:00). ATLAS will give a talk on its probes, and another presentation will focus on the deployment of that work in production (Puppet templates, ...).

The web interface discussion will occur in the meeting after that.
    • 14:00 - 14:20
      Bridging LHCbDIRAC and SAM-Nagios (20m)
      Speaker: Valentin Volkl (CERN)
    • 14:20 - 14:40
      Publishing ALICE data (20m)
      Speakers: Costin Grigoras (CERN), Maarten Litmaath (CERN)
    • 14:40 - 14:55
      Probe discussion (15m)
    • 14:55 - 15:00
      Next meeting (5m)