ADC Weekly
Europe/Zurich
3162/2-E01 (CERN)
Alessandro Di Girolamo (CERN), I Ueda (Department of Particle Physics-University of Tokyo)
15:40 → 15:45
possible delay 5m
15:45 → 16:00
Hot topics
16:00 → 16:15
AMOD/ADCoS report 15m
Speakers: Alexey Sedov (Universitat Autònoma de Barcelona (ES)), Helmut Wolters (LIP Coimbra, Portugal)
- The issue reported as a BNL proxy expiration was in fact caused by problems with the CERN VOMS server. The slide will be corrected.
- Should the VOMS proxy renewal include a step that checks the validity of the new proxy before it replaces the old one?
  - D.Cameron: rather than adding such a check, it would be better to keep a backup and roll back manually.
  - The problem is rather in voms-proxy-init: it should not return an invalid proxy. A fix is to be requested from the developers.
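The two options discussed for proxy renewal (validate the new proxy before it replaces the old one, and keep a backup for manual roll-back) can be combined. The following is only a minimal sketch under stated assumptions: the function name `install_proxy`, the `.bak` suffix and the caller-supplied `is_valid` check are hypothetical and do not describe the actual renewal procedure.

```python
import os
import shutil
from typing import Callable


def install_proxy(new_proxy: str, current_proxy: str,
                  is_valid: Callable[[str], bool]) -> bool:
    """Replace current_proxy with new_proxy only if the new file validates.

    Returns True if the proxy was replaced, False if the old one was kept.
    `is_valid` is a hypothetical caller-supplied check on the new file.
    """
    if not is_valid(new_proxy):
        # Leave the working proxy untouched; the caller can alert or retry.
        return False

    if os.path.exists(current_proxy):
        # Keep a copy of the old proxy so a manual roll-back stays possible.
        shutil.copy2(current_proxy, current_proxy + ".bak")

    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(new_proxy, current_proxy)
    return True
```

Writing the renewed proxy to a separate file first means that a failed renewal never overwrites a proxy that is still usable.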
16:15 → 16:30
Monitoring jobs failing-over to FAX 15m
Speaker: Ilija Vukotic (University of Chicago (US))
- At the last S&C week, it was announced that the FAX team had activated input-file fail-over to FAX for some selected sites. This triggered a discussion, and it was agreed that the FAX team would prepare instructions/procedures. This presentation is meant to address that.
- Slide 2: "we would suggest all the sites and all the queues to have it on"
  - "we" means the FAX team.
- ADC-ops does not recommend activating the fail-over to FAX before seeing results from stress tests. It is not up to the sites, the clouds or the FAX team to decide to switch it on.
  - Joining FAX is voluntary and up to the sites/clouds, but activating the fail-over could affect other sites, especially T1s, so it should be a decision by ADC-ops.
  - The level of stress needs to be agreed offline in a dedicated discussion by mail.
- When an SE is down, what happens to the output?
  - Ilija: it needs to be written to another site.
  - This means development is needed in PanDA/pilot.
- HC (AFT/PFT) should exclude queues when the SE is down, so the FAX fail-over cannot be used for this case.
  - AFT/PFT should ignore this flag.
- We need more numbers to monitor:
  - total number of successful jobs, total number of failed jobs, number of jobs that succeeded because of the fail-over, number of jobs that failed despite the fail-over
  - We should also monitor the negative impacts; e.g. if FAX takes a long time and jobs then fail after an unsuccessful fail-over, it is a loss of computing resources.
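The counters requested above could be derived from per-job records roughly as follows. This is a hedged sketch, not the actual monitoring code: the `JobRecord` fields and the `summarize` function are hypothetical, and "loss of computing resources" is approximated here as the wall-clock time of jobs that failed after attempting the fail-over.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class JobRecord:
    # Hypothetical fields; real job records may look different.
    succeeded: bool       # final job status
    used_failover: bool   # True if the job tried to read its input through FAX
    wall_seconds: float   # wall-clock time consumed by the job


def summarize(jobs: Iterable[JobRecord]) -> Dict[str, float]:
    """Tally the four job counts plus the wall time lost to failed fail-overs."""
    counts = Counter()
    wasted_seconds = 0.0
    for job in jobs:
        counts["succeeded" if job.succeeded else "failed"] += 1
        if job.used_failover:
            if job.succeeded:
                counts["succeeded_with_failover"] += 1
            else:
                counts["failed_despite_failover"] += 1
                # Time spent by a job that still failed counts as lost.
                wasted_seconds += job.wall_seconds
    return {
        "succeeded": counts["succeeded"],
        "failed": counts["failed"],
        "succeeded_with_failover": counts["succeeded_with_failover"],
        "failed_despite_failover": counts["failed_despite_failover"],
        "wasted_seconds": wasted_seconds,
    }
```

Reporting the fail-over counts next to the totals makes it possible to see both the benefit (jobs rescued) and the cost (resources spent on jobs that failed anyway).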