US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
13:00 → 13:05
WBS 2.3 Facility Management News 5m. Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
We are working on the updated milestones, using input from our WBS 2.3 copy and the scrubbing presentations, with the goal of finishing by the end of this week.
Scrubbing outcomes are still being discussed and final descopes and changes should be known in a few weeks.
Even though we have procurement plans for FY25, all Tier-2s should hold off on any purchases until we can discuss them with WBS 2.3 and get approval from Paolo and Verena. We first need to understand the impact of the scrubbing and of the plan to not have equipment funds for the Tier-2s in FY26 and FY27.
-
13:05 → 13:10
OSG-LHC 5m. Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:10 → 13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Tier-1 Infrastructure 5m. Speaker: Jason Smith
-
13:15
Compute Farm 5m. Speaker: Thomas Smith
-
13:20
Storage 5m. Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
-
13:25
Tier1 Operations and Monitoring 5m. Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
-
13:30 → 13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Very good running at all Tier 2 sites for the last two weeks.
- There were two production reductions:
- The MWT2 Illinois site was down on July 16 for preventive maintenance.
- On July 22, most sites suffered a high failure rate while bad derivation requests were running.
- The failures were caused by memory leaks; in some cases servers ran out of memory and rebooted.
- Rod paused the requests and pushed on a PanDA ticket about developing an automated procedure to catch failing tasks.
- EL9/FY24 Equipment.
- AGLT2 finished installing their new equipment and updating to RHEL9 at the MSU site.
- This completed AGLT2 milestones #307 (EL9) and #313 (FY24 equipment) and also allowed marking milestone #311 (all sites deploy their FY24 compute) as complete.
- SWT2 CPB has set up its new storage running AlmaLinux 9. The CPB team is now draining sets of older storage servers and updating them from CentOS 7 to AlmaLinux 9; once a group of servers is drained it is updated, and the process will be repeated until all older servers are done, except for the very oldest storage servers, which are out of warranty and will be retired.
- All other sites have completed their EL9 updates and deployed their FY24 purchases.
- Scrubbing results:
- We did OK and the Tier 2 sites received full funding for FY25.
- The paperwork for the last increment of FY25 funding is in process right now.
- The outlook for FY26 is better than people had feared.
- It looks like the sites will get full funding for personnel plus $50k for equipment, supplies, and travel.
- Any spending beyond this (e.g., large equipment purchases) will be approved on a case-by-case basis.
- The current outlook for FY27 is that the situation will be the same: personnel + $50k.
- Due to delays in the INI spending and the need to spend the grant down to zero by January 31, 2027, when the current cooperative agreement (CA) ends, it still looks reasonable to expect substantial end-of-grant infrastructure funding. The current estimate is that this funding will be about $3 million, split among the four Tier 2 federations.
- We need to revisit the procurement plans to see if modifications are required.
- The likely answer is yes.
- Shawn is checking on when the FY25 equipment must be spent.
- It might be sensible to hold off purchasing for now and/or to buy longer warranties.
- In any case, there is a hard, unchangeable deadline: equipment purchases must be made at least 90 days before the end of the current CA.
- This date is approximately November 1, 2026.
- Shawn and Alexei have entered a proposed list of FY26 milestones, but these need to be discussed in detail.
- These milestones are one of the main outputs of the scrubbing.
-
13:40 → 13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
13:40
HPC Operations 5m. Speaker: Rui Wang (Argonne National Laboratory (US))
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m. Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:50 → 14:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
-
13:55
Analysis Facilities - SLAC 5m. Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
Analysis Facilities - Chicago 5m. Speaker: Fengping Hu (University of Chicago (US))
- Coffea-casa configuration updates
- Updated the link on the portal to point to the instance that runs coffea-casa as AF users.
- Updated the auth configuration to use the production Keycloak instance with an IdP hint (globus) to bypass the Keycloak login page's IdP selection screen (see the first sketch after this list).
- Updated user mapping to use the POSIX claim rather than a callout to the Connect API server.
- ServiceX
- Transformer OOMs occur only on the ADS nodes; memory peaked at 2.4 GB there, while on other nodes it stays below 500 MB.
- Increased the memory limit as a workaround (see the second sketch after this list); troubleshooting is ongoing.
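A minimal sketch of the IdP-hint change mentioned above: Keycloak supports a kc_idp_hint query parameter on its authorization endpoint which pre-selects an identity provider and skips the IdP chooser on the login page. The hostname, realm, and client ID below are illustrative placeholders, not the actual AF configuration.

    # Illustration only: host, realm, and client ID are placeholders, not the AF's real values.
    from urllib.parse import urlencode

    KEYCLOAK_BASE = "https://keycloak.example.org"   # placeholder production Keycloak host
    REALM = "af"                                     # placeholder realm
    CLIENT_ID = "coffea-casa"                        # placeholder OIDC client

    def authorize_url(redirect_uri: str, state: str) -> str:
        """Build an OIDC authorization URL that hints Keycloak to use Globus.

        kc_idp_hint is a documented Keycloak parameter: when present, Keycloak
        forwards the user directly to the named identity provider instead of
        showing the IdP selection screen.
        """
        params = {
            "client_id": CLIENT_ID,
            "response_type": "code",
            "scope": "openid profile email",
            "redirect_uri": redirect_uri,
            "state": state,
            "kc_idp_hint": "globus",
        }
        return f"{KEYCLOAK_BASE}/realms/{REALM}/protocol/openid-connect/auth?" + urlencode(params)

    print(authorize_url("https://coffea.example.org/hub/oauth_callback", "example-state"))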
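For the memory-limit workaround, the generic Kubernetes pattern is to raise the container memory limit on the affected workload. The sketch below only illustrates that pattern with the official Kubernetes Python client; the deployment, namespace, container name, and 3Gi value are placeholders, and the real ServiceX transformer limits are managed through its own deployment configuration.

    # Illustration only: names and the limit value are placeholders, not the AF's real settings.
    from kubernetes import client, config

    def raise_memory_limit(deployment: str, namespace: str, container: str, limit: str) -> None:
        """Patch a Deployment so the named container gets a higher memory limit."""
        config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
        patch = {
            "spec": {
                "template": {
                    "spec": {
                        "containers": [
                            {"name": container, "resources": {"limits": {"memory": limit}}}
                        ]
                    }
                }
            }
        }
        client.AppsV1Api().patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)

    raise_memory_limit("servicex-transformer", "servicex", "transformer", "3Gi")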
-
14:10 → 14:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m. Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- Ongoing migration to new Frontier service for Varnish (MWT2 yesterday, AGLT2 today, others to come)
- Greedy deletion deactivated on site storage after completion of SCRATCHDISK to DATADISK migration
- OOM node crashes at OU due to job memory overruns sparked discussion of how best to prevent such incidents
- Discussion of plans for ESNET xcache service and how best to integrate it into shifter monitoring
- LHC: Flawless p-p operation. We will probably move 2 PB from CERN datadisk to tzdisk for contingency.
- Low level of ADC Ops support this week due to several key people being on leave (Andreu, Rod, Timo, Fabio, Alex (HC), Dario (ES)). For the list of ADC people on leave, refer to this gdoc.
- Varnish deployment is proceeding. We also have a new, upgraded (latest Tomcat) k8s-based Frontier instance that is being tested.
- An evgen campaign overloaded the Rucio and IAM DBs with many small input files; it has been throttled.
- FTS:
- All transfers to US storage endpoints use the BNL FTS with tokens.
- Started thinking about storing/integrating the full FTS configuration in CRIC.
- CRIC: We are losing our main developer at the beginning of September. WLCG is going to take over the CRIC effort needed by ATLAS.
- Tokens: Added ADC recipes on how to implement tokens for CEs/SEs
-
14:15
Services DevOps 5m. Speaker: Ilija Vukotic (University of Chicago (US))
- XCache
- added Oxford back to testing
- new images with supervisord for ESnet node
- VP
- NTR
- Varnish
- added UAM local varnish
- moved the MWT2 and AGLT2 Varnish caches to use the new Frontier (CERN OpenStack k8s based); see the sketch after this list
- NET, ucsc, SWT2, BNL are already set to use it
- Removed SWT2_CPB from varnish testing
- Still using squids: BNL, SWT2_CPB, FZK_LCG2, Beijing
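As background to the Varnish-to-new-Frontier moves above, the frontier_client used by ATLAS jobs selects its server and caching proxies from the FRONTIER_SERVER setting, which lists the central Frontier launchpad and the site-local caches to route requests through. The sketch below only illustrates the shape of that setting; the URLs are placeholders rather than the real ATLAS Frontier or site Varnish endpoints, and production values are distributed to sites via CRIC.

    # Placeholders only: these are not the real ATLAS Frontier or site cache URLs.
    import os

    FRONTIER_SERVER = (
        "(serverurl=http://frontier.example.org:8000/atlr)"   # central Frontier launchpad
        "(proxyurl=http://varnish.site.example.org:6081)"     # site-local Varnish cache
        "(proxyurl=http://squid.site.example.org:3128)"       # fallback squid proxy
    )

    # frontier_client reads this environment variable and routes its requests to the
    # server through the listed proxies, in order.
    os.environ["FRONTIER_SERVER"] = FRONTIER_SERVER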
-
14:20
Facility R&D 5m. Speaker: Lincoln Bryant (University of Chicago (US))
- Lincoln back from vacation this week, NTR
- Aidan has done some very nice work demonstrating Kueue and Multi-Kueue between RP1 and UC AF.
- See Facility R&D notes
- He will start looking into sending HTCondor pods between clusters.
- David solved HTCondor auth issues on RP1
- Single shared HTCondor queue, maintained between Jupyter sessions.
- Now to solve the more challenging problem of syncing users into all HTCondor pods, so shared filesystems can be used.
- Looking into Keycloak --> LDAP sync tools developed by IceCube folks for a longer-term solution
- Meanwhile, Lincoln will put together a shim to allow the existing user sync scripts to work, a la UC AF
- NTR on EOS work, but it is still on our plate
-
14:25 → 14:35
AOB 10m