CondorG probes

Europe/Zurich
513/R-068 (CERN)

Description
Three topics to discuss:
* Project plan
* Update on the CondorG probes
* REBUS functionality (pledges)
Videoconference Rooms
WLCG_monitoring_consolidation
Name
WLCG_monitoring_consolidation
Description
Kick-off meeting for the WLCG monitoring consolidation project
Extension
109258925
Owner
Pablo Saiz

Minutes from WLCG Mon consolidation meeting on 17th of Jan 2014


People in the room: Pablo, Julia, Luca, David T., Eddie, Marian, Maarten, Ale, Simone, Lionel, Nicolo

Remote: Salvatore and David C.
Apologies: Stefan, Pedro

 



Pablo: From now on, the last person to arrive at a meeting will take the minutes at the next meeting. If anybody disagrees, please speak up.
(There was consensus in the room.)
Pablo: So, during the next meeting, Ale will take the minutes

Pablo: Version 1.0 of the document is out. We have added all the tasks of the project to JIRA. The timeline for the tasks runs until the end of the project.


Maarten: Did you identify any worrisome points? Would you be able to deliver by the end of the project?


Pablo: Time estimation was difficult; we have only a rough estimate at the moment. We will be flexible: as the tasks move forward, we might need to assign more people to a task or change its duration.


Questions on Luca’s presentation:


Ale on slide 6: I think it would be useful, while you develop, to contact the ATLAS and CMS experts regarding the error codes.


Maarten: The machine that runs all those submissions has to be very stable and properly configured. That’s where the expertise of ATLAS and CMS would be needed. The probe can base its logic on the official documentation.


Pablo: Regarding the timeout, what we have seen is that the validity of a test in SAM is 24 hours. In the beginning we said let’s change it to 2 hours, and we saw many gaps in the tests. For the record, the current validity is set to 6 hours for a test, and we will most probably reduce this number step by step in the future.


Ale: If we want to test the status of a site, 6 hours is enough. We all agree on that.
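The trade-off behind the validity discussion can be sketched in a few lines. This is only an illustration, not how SAM computes it: the function name find_gaps and the result timestamps are hypothetical, and the sketch assumes a gap is any period in which the latest result has expired before the next one arrives.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, validity):
    """Return (start, end) intervals in which no test result was still valid."""
    gaps = []
    ordered = sorted(timestamps)
    for previous, current in zip(ordered, ordered[1:]):
        expires = previous + validity
        if current > expires:
            # The previous result expired before the next one arrived.
            gaps.append((expires, current))
    return gaps


# Hypothetical result times: roughly hourly tests, with one run delayed by 4 hours.
start = datetime(2014, 1, 17, 0, 0)
results = [start + timedelta(hours=h) for h in (0, 1, 2, 6, 7)]

for hours in (2, 6, 24):
    gaps = find_gaps(results, timedelta(hours=hours))
    print(f"validity {hours:>2}h: {len(gaps)} gap(s)")
```

With these made-up timestamps, a 2-hour validity leaves one gap where the delayed run falls, while 6 or 24 hours cover it. A shorter validity reflects the site status more promptly but is less tolerant of delayed test runs.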


Slide 7

Ale: We would like to add the queue name to the VO feed; we don’t want to rely on the BDII.


Julia: The problem is that NCG does not support queue names. We are working on providing new probes and a new SAM system (SAM 3). For some time we could live with the limitation of getting the queues from the BDII. The new configuration system will take the VO feed into account, but this is work for the future SAM 3. We should also keep in mind that even the availability calculation currently cannot handle queues: it is not possible to take into account two queues from the same CE. This has to be considered. Also, the timeline for Nagios is not clear at the moment.


Luca: The target is to have a proposal for Nagios in February.


Ale: Are you also testing the OSG-CE? You should check it against an OSG-CE that doesn’t have a default queue.

Maarten: We do not lose anything with the new probe; we are doing what the WMS probes are doing currently. We still have the same constraints: we have changed the machinery of the probe, but we rely on the same information system that the WMS probes rely on.


Pablo: We won’t gain much by patching and hacking the old system, as it is going away. We should think about how to do it properly for the new system.


Maarten: Don’t retire the WMS probes if it causes problems; don’t rush it. Sure, they are not used in real-case environments by the experiments, but we have been living with them for quite some time. We could retire them when the current SAM system retires.


Simone: If possible, please retire the WMS probes.


Slide 12:


Julia: Will these changes (introducing new probes) have an impact on POEM, on profiles?


Marian: They need to be added.


Julia: Will it impact the availability calculation?


Luca: It is explained on the next slides.


Slide 15:


Pablo: Would it be possible to reuse this probe by other experiments as well?


Luca: Yes, it is not specific to CMS; it will be identical for ATLAS.


Maarten: The WN metrics stay the same but the CE metrics will change their name and this will affect the availability calculation.
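Why a metric rename matters for the profiles can be shown with a simplified sketch. It assumes, purely for illustration, that a profile is a set of metric names and that results for metrics outside the profile are ignored; the metric names below are hypothetical placeholders, not the real probe names, and this is not the actual POEM or availability code.

```python
def metrics_used(profile, results):
    """Keep only the results whose metric name is listed in the profile."""
    return {name: status for name, status in results.items() if name in profile}


# Hypothetical profile that still lists the old CE metric name.
profile = {"CE-submit-old", "WN-basic"}

# Hypothetical results before and after the CE metric is renamed.
old_results = {"CE-submit-old": "OK", "WN-basic": "OK"}
new_results = {"CE-submit-new": "OK", "WN-basic": "OK"}

print(metrics_used(profile, old_results))  # both metrics are used
print(metrics_used(profile, new_results))  # the renamed CE metric is dropped

# Metrics the profile expects but no longer receives.
missing = profile - set(new_results)
print(sorted(missing))
```

Under these assumptions, the renamed CE metric drops out of the calculation until the profile is updated, which is consistent with Marian's remark that the new metrics need to be added to the profiles.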


Slide 22:


Julia: This means that we register in preproduction a new profile and we compare them in preproduction?
Luca: Yes.


Simone: I think you missed a step between steps 2 and 3. You have no full WN test step. Step 2 is on pre-production. There is a big jump between steps 2 and 3.


Nicolo: It is a list of changes.


Maarten: Exactly, we will see if we could avoid it. Step 1 could also be done in production.

Ale: As of today this seems to be the plan. Let’s start and see how it goes.


Maarten: I have a question about the server side of CondorG: there is an instance at the moment, but it also needs to be of production quality.


Julia: It is sitting on the same box. It shouldn’t be a problem.


Pablo: Let’s postpone the REBUS talk and discuss it next week with the requirements from the experiments and the WLCG office. The next meeting will be on Friday 31st of Jan. I would like to discuss the topic of the VO feed a bit more, to make sure that we have what we need and only that.

    • 14:00 14:10
      Review of version 1.0 of the report 10m
      All the action items defined in the document have been entered in JIRA (https://its.cern.ch/jira/browse/WLCGMONCON). The current version of the document is also attached to this agenda.
      Speaker: Pablo Saiz (CERN)
    • 14:10 14:35
      CondorG probes 25m
      Speaker: Luca Magnoni (CERN)
      Slides
    • 14:35 14:40
      REBUS functionality 5m
    • 14:40 14:55
      Discussion 15m
    • 14:55 15:00
      Next meeting 5m