US ATLAS Computing Facility (Possible Topical)
US/Eastern
Description
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
13:00 → 13:05
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
We have a lot going on in our Facilities!
- Procurement and ops documents for Tier-2
- 5-year planning development
- Detailed plans for use of $3-4M for Tier-2 facility
- Milestones and quarterly reporting updates due soon
- Ongoing mini-challenges (jumbo frames, cloud storage, Scitags, IPv6-only, capacity testing, etc.)
We continue the engagement with Trusted CI and have some homework from the #3 set to complete.
The LHCOPN/LHCONE and HEPiX meetings both took place during the last 3 weeks; both were very interesting and relevant for our facilities:
- LHCOPN/LHCONE: https://indico.cern.ch/event/1479019/
- HEPiX: https://indico.cern.ch/event/1477299/timetable/#20250331.detailed
Upcoming meetings include
- WLCG/HSF meeting in Lyon in early May
- HTC25 with joint USATLAS-USCMS meeting in Madison (June 2-6)
- ATLAS S&C in early July
- USATLAS Scrubbing in mid July
- USATLAS workshop at Michigan in late July
13:05 → 13:10
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- Release
- This week: Frontier Squid for ARM, vo-client removing old OpenShift IAM instances
- Holding off on XRootD 5.8.0 in favor of XRootD 5.8.1 due to reported stacktrace
- There was a request for cvmfs-2.12.7?
- ARM integration tests are complete now!
- Waiting on NET2 to configure a test Prometheus for the test cluster / namespace
13:10 → 13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
13:10
Tier-1 Infrastructure 5m
Speaker: Jason Smith
13:15
Compute Farm 5m
Speaker: Thomas Smith
13:20
Storage 5m
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
13:25
Tier1 Operations and Monitoring 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
13:30 → 13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Good running over the last couple of weeks.
- Some cvmfs issues.
- Short outage at NET2.
- A certificate expired over a weekend. The certificate belonged to the Harvester and was for the VM servicing Kubernetes.
- CPB has been having trouble staying full.
- Apparently the current job mixture is causing trouble.
- EL9/FY24 purchases:
- MSU still working on installation of EL9
- They will install their FY24 equipment after their EL9 installation is working.
- CPB is still working on converting their storage to AlmaLinux 9.
- So their FY24 storage is caught up in this.
- Rafael and I are working on an email about the following documents:
- The Jan-Mar quarterly reporting
- The site procurement plan
- The site operations plan
- The 5-year planning document for each site
- The proposed milestones for each site
- Once we get this information we will have a dedicated meeting to make sure that the regular and infrastructure planning is sensible and consistent.
- This will serve as the kickoff for writing the infrastructure proposal.
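The NET2 outage noted above came from a certificate that expired over a weekend. As a minimal sketch of how such an expiry could be flagged ahead of time, the helper below computes the days remaining from a certificate's `notAfter` string. The function names, the 14-day threshold, and the example dates are illustrative assumptions, not a description of how NET2 or Harvester actually monitor certificates.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Apr 11 00:00:00 2025 GMT". `now` must be timezone-aware.
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return (expires - now.timestamp()) / 86400.0


def needs_renewal(not_after: str, now: datetime, warn_days: float = 14.0) -> bool:
    """True when the certificate expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    # Example values only; a real check would read notAfter from the live certificate.
    now = datetime(2025, 4, 1, tzinfo=timezone.utc)
    print(needs_renewal("Apr 11 00:00:00 2025 GMT", now))
```

Run daily from cron or a monitoring probe, a check of this kind gives warning on a working day rather than over a weekend.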
13:40 → 13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
13:40
13:45
Integration of Complex Workflows on Heterogeneous Resources 5mSpeakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
13:50 → 14:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
13:50
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
- Interactive nodes for the AF changed from spar nodes to attsub[01-08]
- The documentation for the BNL part has been refreshed by Shuwei
- Created dCache user space for one AF user
13:55
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
14:00
Analysis Facilities - Chicago 5m
Speaker: Fengping Hu (University of Chicago (US))
- notebook service reorg
- putting up a BinderHub service that can launch all the existing notebooks offered via the homegrown JupyterLab service
- intuitive user interface that is easy to navigate
- Keycloak auth with multiple upstream identity providers; runs as the local user if an AF account can be matched.
- dask-gateway integration with Analysis base images.
- will run in parallel with the existing service; the old service will be retired if the new one is well received.
14:10 → 14:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
14:15
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
- XCaches
- UK nodes fixed again
- everything else works fine
- VP
- working fine
- Analytics
- more fixes for wlcg_wpad
- new dashboards for AnalysisFacilities (benchmarks, af-condor, condor-insights)
- Varnish
- Varnish for Frontier deployed at Rome1. Serving Roma and Milano, second choice for all of the Italian sites.
- Deployed at PIC. Serving PIC, second choice for all of the Iberian Peninsula sites.
- Waiting on IN2P3-CC to set one up for France.
- LRZ installed a "private" varnish instance.
- Discussed with ESnet a possibility to have an instance in Boston.
- Waiting on BNL to get one there.
- We should decide on US approach.
- Uni Victoria is now using CF Varnish.
- AI
- testing how we could use MCP (Model Context Protocol) to expose our analytics/accounting data to AI models/clients.
- ServiceX/Y
- rewriting part of the site to use Server-Sent Events (SSE) over HTTP.
- will be changing the client in the same way so it does not poll S3 all the time.
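For the ServiceX/Y item above, the point of SSE is that the server pushes updates over one long-lived HTTP response, so the client no longer needs to poll. As a rough illustration only, the sketch below parses the `text/event-stream` line format into events. It is not ServiceX code; the `status` event name and the JSON payload are hypothetical, and the `id` and `retry` fields of the format are ignored for brevity.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event from an iterable of text/event-stream lines."""
    event_type = "message"  # default event type when no "event:" field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event_type, "data": "\n".join(data)}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event_type = value
        elif field == "data":
            data.append(value)


if __name__ == "__main__":
    stream = [
        ": keepalive\n",
        "event: status\n",
        'data: {"files_ready": 3}\n',
        "\n",
    ]
    for event in parse_sse(stream):
        print(event["event"], event["data"])
```

In a real client the lines would come from a streaming HTTP response rather than a list, and each event would trigger a download instead of a periodic listing of the bucket.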
14:20
Facility R&D 5m
Speaker: Lincoln Bryant (University of Chicago (US))
- Debugging slow network setup time on stretched cluster. Symptoms are network timeouts for the first minute or so of the container being up.
- Seems to be related to Calico's utilization of XDP (eXpress Data Path) ... Calico keeps trying and failing to clean up XDP programs on the loopback device??
- Disabled it and things seem OK now, but not clear if disabling will degrade overall performance
- Work with Armada continues
- Work with Jupyter (via Coffea Casa? TBD) continues
14:25 → 14:35
AOB 10m