US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
Today we have quite a bit to cover!
- CHEP papers are due by the end of February, with NO EXTENSIONS to be provided! Please get working on your papers.
- Quarterly reports and milestone updates are submitted.
- We have a few milestones that have remained delayed for a while and we hope they will be resolved soon.
- Next week is ATLAS S&C week at CERN.
- Those reporting for the USATLAS cloud, please see some areas to cover (from Ivan's earlier email): https://docs.google.com/document/d/1tlV6Vl1ZmOHhK1VT0ZGdNZfHWwv83fQqUb_ZJ080bc4/edit?tab=t.0#heading=h.jquwsstxiga5
- This S&C is also a Sites Jamboree
- We still need a speaker for the US Analysis Facilities talk ... volunteers or suggestions?
- The OSG All-hands + HTCondor week has been set for June 2 – 6, 2025 in Madison, Wisconsin. USATLAS and USCMS have been asked to confirm their attendance.
- Any conflicts or issues?
- Can we let the organizers know we (USATLAS) will be participating?
- Paolo has requested a "Tier-2 Shopping List" of what could be purchased by December 2025 (when all funds must be spent out).
- This is complicated because this will be the last "extra" funds we can expect before HL-LHC starts in 2030
- Previously, when we received such funds, we targeted ensuring that our Tier-2 networks were appropriately sized without causing any ongoing expenses
- The complication is that our Tier-2s are expected to have resilient 400 Gbps connections by HL-LHC, but it is likely too early to invest in 400G by the end of this calendar year (for example, prices for 400G should drop significantly by 2029).
- Each Tier-2 will need to think carefully about their needs and the plans already in place for their institutional network upgrades
- We will need a separate meeting to discuss planning and next steps
- We need to have a separate discussion about Tier-1 procurements and plans
- February 2025 is designated as "US Capability Mini-Challenge" month, where USATLAS, USCMS and possibly others test capabilities.
- We have a Google folder to help organize what should be a bottom-up effort: https://drive.google.com/drive/folders/1Af7hWa0Zm30EuqsV1PbekSjb--gXAsVG?usp=sharing
- We (WBS 2.3) need to start filling out relevant documents to cover the who, what, when for each topic we want to test in February
- Some capabilities may not be tested in this round, but we can still begin organizing them
- USCMS has been invited to contribute and other experiments are welcome
- We plan to report on the tests that happen at a future WLCG DOMA meeting
- This will be briefly presented/discussed at ATLAS S&C next week.
-
13:05
→
13:10
OSG-LHC 5m Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
OSG Software
- Released last week: vo-client-137-4. Includes new Kubernetes-based IAM server in /etc/vomses
- Upcoming release (next week?):
- XRootD 5.7.3 upstreams a caching-related patch that OSG was carrying in XRootD 5.7.2
- vo-client cleanup, removing old CERN VOMS servers
- Built ARM VMs and are working on getting them into our integration testing pipeline
-
13:10
→
13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Tier-1 Infrastructure 5m Speaker: Jason Smith
-
13:15
Compute Farm 5m Speaker: Thomas Smith
-
13:20
Storage 5m Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
Compute Farm (Tom)
- Ongoing upgrade to HTCondor 24 and Alma 9.5. Done for 1 CE and ~12k slots.
- ATLAS T1 farm rolling Condor upgrade in progress: Condor v23 LTS to Condor v24 LTS
- gridgk03 upgraded (new router syntax)
- Some workers upgraded already; proceeding in batches to ensure uptime
Storage
- Rolling pool restarts (10 servers) were performed to update dCache pool memory on 01/29/25 and ZFS parameters on 01/23/25
-
13:25
Tier1 Operations and Monitoring 5m Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- Availability and Reliability: 99.7%
- Occupancy: 30.9k slots, 97%
- One drop ~12:00 on Monday 01/20 due to a central submission infrastructure certificate
- USFTS was not serving the US sites for 10 days (1/14 - 1/24) due to update of the lsc files (RQF3013164)
- Filling of SCRATCHDISK on Thursday, 01/17 due to sudden wave of transfers after switching to CERN FTS
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Reasonable running with some disruptions.
- All sites (worldwide!) drained because of central authorization issues.
- NET2 had some disruption due to network upgrade work.
- CPB having trouble scaling up to all servers running Alma Linux 9.
- EL9
- MSU has made some progress getting the install system for RHEL in service
- Illinois converted but still working on getting the new infrastructure.
- CPB is very close on the compute. Storage will be next.
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
13:40
HPC Operations 5m Speaker: Rui Wang (Argonne National Laboratory (US))
TACC (Rui): Running smoothly (40% allocation used), maintenance yesterday
Perlmutter (Xin): following up with the empty pilots (Jira)
- One node can start 300+ pilots in parallel; some pilots fail to get real jobs from the PanDA server when many of them ask for jobs at around the same time
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
Analysis Facilities - BNL 5m Speaker: Dr Quilan Huang (BNL)
-
13:55
Analysis Facilities - SLAC 5m Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:00
Analysis Facilities - Chicago 5m Speaker: Fengping Hu (University of Chicago (US))
- servicex updated to 1.5.5
- Claimed to be 100% reliable, but probably still only conditionally reliable: it appeared to put stress on S3 and then crash.
- Quarterly maintenance coming up in Feb
- Kubernetes/Rook updates, OS updates
- IPv6 issue: we now know what triggers it, so we can avoid it. It is related to nftables/iptables compatibility. Kubernetes support for nft is beta in v1.31.
- 200G challenge rerun on the ADS nodes
- Reaching 335G throughput with the new networking/storage gear.
- Next week (Wednesday 15:00 at CERN) we will meet with the CERN AF people to discuss what different AFs offer, monitoring, etc. There will be a Zoom call.
-
14:10
→
14:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
- USATLAS Occupancy drops:
- 1/29 (today): A central Harvester/Panda problem due to the switch to the new token issuer (ATLASPANDA-1291). Solved by patching Panda and Harvester
- 1/15: Due to a BNL-Rucio issue mentioned in the WBS 2.3.1.4 section (RQF3013164)
- DDM Ops: Fabio is back. He will define/summarize his US Ops goals at a USOps meeting at S&C.
- Switched from the old GGUS to the new GGUS helpdesk today.
- Ticket submission to all US sites was tested.
- To remove USATLAS sites from the system that are obsolete
- To start discovering new "features" when the system comes back up today.
- Move to new token issuer - postponed.
- USATLAS:
- AGL2: Corrected the A/R for December (from 78% to 90.2%)
- SWT2: Upgrade OS, Slurm, CE. Ongoing
- NET2: Network is almost there
-
14:15
Services DevOps 5m Speaker: Ilija Vukotic (University of Chicago (US))
- XCaches
- A number of restarts of the UK and German XCaches.
- gStream monitoring stopped working for xcaches on 5.7.2. Will try to debug it next week.
- VP
- Will be adding a VP configuration for the NET2 VP queue.
- Analytics
- Infrastructure works fine
- Again looking into getting branch reading analytics from EventLoop
- Started work on WFMS AI assistant
- Varnishes
- All working fine
- CREST
- Tomorrow morning another round of HTL testing.
- Prepared three different varnish configurations (node, top of the rack, node+top of the rack)
- ServiceX/Y
- stress testing
- development of the new transformer code.
-
14:20
Facility R&D 5m Speaker: Lincoln Bryant (University of Chicago (US))
- 200G challenge redux - 335Gbps achieved so far on new gear (~65Gbps per node - 2x100G capable)
- Flocking from AF to MWT2 via HTCondor Docker universe (sort of a Docker-based glidein), with AutoFS-mounted AF filesystems propagated into the container. Controlled scaling tests ongoing. Hopefully we will have a large-scale test during the next MWT2 downtime.
- Work continues on CephFS/RBD on stretched k8s, optimizing and profiling pool. Should be done in the next week or so.
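As a quick sanity check on the 200G challenge figures above, the Python sketch below works out how many nodes the 335 Gbps aggregate implies at ~65 Gbps each, and what fraction of each node's 2x100G capacity that represents. It assumes both numbers come from the same test and that all nodes contribute roughly equally; the function names are illustrative only and not part of any facility tooling.

```python
def implied_node_count(aggregate_gbps: float, per_node_gbps: float) -> float:
    """Number of equally loaded nodes needed to reach the aggregate rate."""
    return aggregate_gbps / per_node_gbps


def link_utilization(per_node_gbps: float, links: int, link_gbps: float) -> float:
    """Fraction of a node's total NIC capacity used at the given rate."""
    return per_node_gbps / (links * link_gbps)


# Figures quoted in the notes; treating them as one consistent test is an assumption.
aggregate = 335.0   # Gbps achieved so far
per_node = 65.0     # approximate Gbps per node

nodes = implied_node_count(aggregate, per_node)
util = link_utilization(per_node, links=2, link_gbps=100.0)

print(f"implied nodes: {nodes:.1f}")
print(f"per-node NIC utilization: {util:.1%}")
```

Under these assumptions the aggregate corresponds to roughly five nodes, each using about a third of its 200 Gbps of NIC capacity, which suggests the network links themselves are not yet the limiting factor.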
-
14:25
→
14:35
AOB 10m