RAL Tier1 Experiments Liaison Meeting
Access Grid
RAL R89
-
-
13:30
→
13:31
Experiment Operational Issues 1m
-
13:35
→
13:45
VO-Liaison ATLAS 10m Speakers: Brij Kishor Jashal (Rutherford Appleton Laboratory), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
-
13:45
→
13:55
VO Liaison CMS 10m Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))
Katy is at CERN for O&C week and will not attend the liaison meeting today.
SAM test errors on Friday from disk gateway problems stemming from VMware machine back-ups and high read load from LHCb. Various improvements put in place to make system more robust.
AAA seeing occasional periods of SAM test errors although traffic is not high. ceph-svc20 has been worst this week; Jyothish is updating XRootd and priority scheduling.
UK mini-DC: I talked to Alessandra and she doesn't want to hurry the tape testing given the EOS nodes have only just arrived. She thinks we have enough tests for Echo to perform in the week of the 3rd March, and then we should be able to schedule tape tests for another week.
-
13:55
→
14:05
VO Liaison LHCb 10m Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
Operational issues:
- ECHO redirectors overload last Friday (GGUS 681943)
- Redirectors were overloaded, probably because of the high number of LHCb Sprucing jobs
- Fixed by a few server tweaks
- Writeable WN gateways could be helpful to remove some load from the redirectors
News:
- Preprod farm now has writeable WN gateways
- LHCb jobs are already using them!
-
14:10
→
14:20
VO Liaison ALICE 10m Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
-
14:20
→
14:30
VO Liaison LSST 10m Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))
- Components all deployed at RAL for MultiSite testing that is to happen in the next few weeks
- Butler now contactable from BatchFarm
- IngestD configured and ready for MultiSite test
- Kafka picking up messages from US
- Current tests passing green
- Potential issues that RAL could face
- Lancaster has run some tests with actual data and they have seen a large amount of I/O on their CephFS mount on the worker nodes
- Due to many small files
- Swapping to DAVS changed the job I/O behaviour and reduced this, but the I/O is still high
- Lancs considering using Ceph over CephFS for this workflow
- Job Slot limit increased to 1000 for when analysis work comes in
Jobs running well (SLAC / S3DF is in downtime, hence no jobs today).
Transfers were failing, but now seem to be working after last night's network changes for FTS.
-
14:30
→
14:40
VO Liaison APEL 10m Speaker: Thomas Dack
-
14:45
→
14:55
WP-D - GPU, Data Management, Other 10m Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
-
15:00
→
15:01
Major Incidents Changes 1m
-
15:05
→
15:15
Summary of Operational Status and Issues 10m Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
-
15:20
→
15:21
AOB 1m
-
15:22
→
15:32
Any other Business 10m Speakers: Brian Davies (Lancaster University (GB)), Darren Moore