Probe information

Name: Probe information
Start: 2013-09-12T13:15:00+02:00
End: 2013-09-12T15:20:00+02:00
Location: CERN

Thursday 12 Sept 2013, 13:15 → 15:20 Europe/Zurich

513/R-068 (CERN)


Description

A more in-depth look at the information-gathering layer


Date: 12-09-2013
Present:
Eddie, Dave, Alex, Marian, Costin, Markus, Pablo, Lionel, Andrea, Nicolo, Valentin, Maarten, Ivan

Apologies:
Pedro, Stefan

Minutes taken by Alex

#######################
#                     #
#  LHCb presentation  #
#                     #
#######################
Pablo: In the message, are some fields (VO, site name) not important?
Marian: The site name is not important because the message is sent to the experiment Nagios box, which fills in the rest.

Marian: Why send the jobs through Nagios/DIRAC?
Valentin: It was suggested by the DIRAC people.
Marian: How often do these jobs run?
Valentin: Not so often, no more than one every 30 minutes.

Pablo: What is the workflow of test submission?
Marian: DIRAC -> worker node -> Nagios -> SAM.
Pablo: Why are the results not sent directly from the worker node to SAM?
Andrea: Because you may want to see the results in the Nagios web UI.

########################
#                      #
#  ALICE presentation  #
#                      #
########################

Pablo: Is the 24-hour timeout standard for all experiments?
Andrea: This is imposed by SAM (expiration date of a test).
Marian: If a job has been sent and is still not running after 5 hours, it is cancelled.
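
The two limits mentioned above can be sketched as follows. This is only an illustration and not SAM or experiment code: the function names and the way timestamps are passed in are assumptions. Only the 24-hour expiration and the 5-hour cancellation limit come from the discussion.

```python
from datetime import datetime, timedelta

# Values taken from the discussion above.
RESULT_EXPIRY = timedelta(hours=24)  # SAM expiration of a test result
QUEUE_LIMIT = timedelta(hours=5)     # cancel a job that is still not running


def result_is_valid(reported_at: datetime, now: datetime) -> bool:
    """Hypothetical helper: True while a result is younger than the expiry."""
    return now - reported_at < RESULT_EXPIRY


def should_cancel(submitted_at: datetime, started: bool, now: datetime) -> bool:
    """Hypothetical helper: True for a job still queued after the limit."""
    return (not started) and (now - submitted_at >= QUEUE_LIMIT)
```

Whether the boundaries are inclusive or exclusive is not stated in the minutes; the comparisons above are one possible choice.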

Marian: It should be easy to add the XRootD type to the A/R formula, since that service is already in GOCDB. In the current system, other services will also have to be registered in GOCDB.

Andrea: Which service flavour do you have for the CE test?
Maarten: At the moment, only CREAM-CEs are tested. In ALICE, some sites are not monitored because they don't use CREAM-CE. When we publish from MonALISA, we should use a 'meta-service', which would represent any batch system at the site.
Pablo: The VOBox could be that "meta-service".
Maarten: Yes.

Maarten: The plan is to get rid of Nagios; ML could publish directly to the message bus.
Markus: Do you think this idea is shared by the sites? They already use Nagios for their facilities.
Maarten: This is part of a bigger discussion, orthogonal to all experiments.
Marian: Could it simply be done with a plugin, using active polling when priority is required and otherwise using information extracted from the existing system?
Andrea: There are currently only 3 sites that want to see the data in their internal Nagios (Lyon, NIKHEF, PIC).
Marian: The other sites rely only on the API.

Pablo: There are two main differences between ALICE and LHCb: messages are sent centrally by ALICE, while in LHCb every WN sends its own results; and ALICE sends directly to SAM, while LHCb goes through Nagios.

Marian: Can we monitor the status of the WNs based on pilots (and submit monitoring data when there is no production activity)?
Andrea: Pilot submission depends on the load of the central queue (which is not constant). Pilots can't be the primary source of monitoring information.
Marian: From the report perspective, no pilot job could mean everything is OK.
Maarten: The site should be notified as soon as a problem occurs.
Costin: This is just a theoretical problem; sites have jobs to execute almost all the time.
Andrea: WN tests are far from trivial and describe really complex use cases. They couldn't be included in the pilot because it's too complicated to instrument the generic one.
Maarten: We have to be sure that there are always jobs to be run on a site (or send monitoring jobs).
Marian, Andrea: That discussion makes sense only if we completely move away from Nagios.

Marian: Not only the data should be included on the monitoring bus, but also some indication of the ML state.
Maarten: Fair enough.
Costin: It's trivial to detect if ML is down as nothing will be received.
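
Detecting a silent publisher from the absence of messages, as Costin describes, could look like the sketch below. Everything here is hypothetical: the function name, the state labels and the one-hour silence threshold are assumptions made for illustration and were not discussed in the meeting.

```python
from datetime import datetime, timedelta
from typing import Optional


def publisher_state(last_message_at: Optional[datetime],
                    now: datetime,
                    max_silence: timedelta = timedelta(hours=1)) -> str:
    """Classify a publisher by the age of its last message.

    max_silence is an assumed threshold, not a value from the minutes.
    """
    if last_message_at is None:
        return "UNKNOWN"   # nothing has ever been received
    if now - last_message_at > max_silence:
        return "SILENT"    # no message within the allowed window
    return "ALIVE"
```

A check of this kind only shows that nothing arrived; it cannot tell whether the publisher or the transport failed, which may be one argument for also publishing an explicit state indication.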

Pablo: The current expiration policy for the data is 3 months for metric output and 12 months for availability reports.
Andrea: A long expiration policy is important for presenting data over longer periods.
Markus: You should keep the data long enough to recompute for reporting purposes.
Pablo: Recomputation is done in the ten days after the end of the month.
Maarten: The details are not so important; we just want the status.
Markus: Just keep the data even after 12 months.
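
The link between the recomputation window and the retention period can be illustrated with a small sketch. The function names are hypothetical, and treating "3 months" as 90 days is an assumption; only the ten-day window and the 3-month figure come from the discussion.

```python
import calendar
from datetime import date, timedelta


def recomputation_window(year: int, month: int) -> tuple:
    """Return (first_day, last_day) of the ten days after the month ends."""
    last_day_of_month = date(year, month, calendar.monthrange(year, month)[1])
    return (last_day_of_month + timedelta(days=1),
            last_day_of_month + timedelta(days=10))


def metric_output_retained(measured_on: date, today: date,
                           retention: timedelta = timedelta(days=90)) -> bool:
    """True if metric output is still inside the (assumed 90-day) retention."""
    return today - measured_on <= retention
```

For example, recomputing September 2013 would fall between 1 and 10 October 2013, and metric output from 1 September would still be retained on the last day of that window under the 90-day assumption.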

Pablo: We currently have many infrastructures: production, pre-production and validation. Do we really need all of them?
Maarten: OK to merge validation and pre-production.
Pablo: The tests are run several times, and availability might be slightly different on different infrastructures.
Andrea: The main requirement is to have a machine on which to put development probes and test them manually inside the full machinery.
Marian: A probe is just a script that can be run anywhere.
Andrea: It is better to have Nagios installed, but the full machinery is less important.
Marian: ATLAS is doing a lot of development on storage probes and needs a lot of testing.

#########
#       #
#  AOB  #
#       #
#########

The next meeting is in two weeks (Friday 27 September at 11:00). ATLAS will give a talk on its probes, and another presentation will focus on the deployment of that work in production (Puppet template, ...).

The web interface discussion will occur in the meeting after that.


- 14:00 → 14:20
  
  Bridging LHCbDIRAC and SAM-Nagios 20m
  
  Speaker: Valentin Volkl (CERN)
  
- 14:20 → 14:40
  
  Publishing ALICE data 20m
  
  Speakers: Costin Grigoras (CERN), Maarten Litmaath (CERN)
  
- 14:40 → 14:55
  
  Probe discussion 15m
- 14:55 → 15:00
  
  Next meeting 5m
