US ATLAS Computing Facility (Possible Topical)
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn McKee (University of Michigan (US))
We need to complete our Tier-2 planning and use of possible end-of-CA funds before the end of this month
As a facility, we should be thinking about how some of the work we do might be enhanced/improved by the use of AI/ML, since there may be funding options in the future
Today is the ESnet blueprint meeting from 2:30-3:30 PM Eastern, with topics:
- Tier-2 updates
- IPv6-only LHCOPN?
- System tuning work (capability challenge?)
- DC27 plans
Trusted CI engagement continuing (5th meeting tomorrow; US ATLAS one-on-one meeting next Wednesday)
- 13:05 → 13:10
-
13:10
→
13:30
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Tier-1 Infrastructure 5m
Speaker: Jason Smith
-
13:15
Compute Farm 5m
Speakers: Thomas Smith, Tom Smith (BNL (gmail))
-
Weather-related power dip on May 2nd; approximately ⅓ of the Tier 1 was affected (by core count)
-
Lost power to 1 row of compute for a few minutes at ~02:30 (eastern time)
-
Received notification; onsite work was done to recover the lost portion of the Condor pool
-
~99% recovery completed by 05:00, 100% recovery by 10:00
-
-
Initial testing has begun on a revised Condor memory (cgroups) configuration, which should better protect worker nodes (EPs) from becoming completely exhausted of memory
-
These changes don’t affect the Tier 1 (yet) but are on the horizon; currently being rolled out on one of our other pools
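As a rough illustration of the kind of knob involved, a cgroup memory policy in a Condor config might look like the sketch below. This is not the actual Tier-1 change; macro names follow the HTCondor manual, the values are invented, and exact knobs vary by HTCondor version.

```ini
# Hypothetical condor_config.local fragment (placeholder values).
# Enforce a hard cgroup memory limit per job slot so a runaway job
# is killed by the kernel instead of exhausting the worker node.
CGROUP_MEMORY_LIMIT_POLICY = hard

# Hold back some memory from what HTCondor advertises, leaving
# headroom for the OS and system services (value in MB).
RESERVED_MEMORY = 4096
```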
Also Storage: (I don't have permissions to add there)
-
Infrastructure
-
The local OSG CAs Puppet class has been improved, enhancing CRL updates and repository management.
-
Monitoring
-
Integration of various dCache components into the ELK infrastructure is underway.
-
Pools are currently being integrated to complete the deployment of Filebeat.
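For reference, a minimal Filebeat input for shipping pool logs into the ELK stack might look like the sketch below; the log paths and Logstash endpoint are placeholders, not the actual BNL deployment.

```yaml
# Hypothetical filebeat.yml fragment for dCache pool logs (paths/host invented).
filebeat.inputs:
  - type: filestream
    id: dcache-pool-logs
    paths:
      - /var/log/dcache/*.log

# Ship to Logstash for parsing before indexing into Elasticsearch.
output.logstash:
  hosts: ["logstash.example.bnl.gov:5044"]
```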
-
-
13:20
Storage 5m
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
-
13:25
Tier1 Operations and Monitoring 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
-
13:30
→
13:40
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Good running in the past two weeks.
- There were small draining incidents at most sites.
- Sites have mostly submitted Operations and Procurement plans.
- One Procurement plan is still outstanding.
- A few of us will discuss how to proceed later today and there will be a meeting on Friday with the sites.
- EL9 upgrade / FY24 equipment install continues at MSU and UTA.
- Discussed Varnish/NRP at today's daily meeting.
- I will check with Valentin Volkl whether the recommended cvmfs version is 2.12.7.
- The OSG repository only has cvmfs version 2.12.6, which on the client side should behave about the same.
- The NET2 tape system has been having difficulty keeping up with massive requests submitted all at one time.
- Otherwise the tape system is working well.
-
13:40
→
13:50
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
13:45
Integration of Complex Workflows on Heterogeneous Resources 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:50
→
14:10
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
-
Work on GPFS and dCache storage mounts in a pod with proper permissions on OpenShift
-
Successfully mounted both GPFS and dCache storage within a pod.
-
The pod is configured to run as a non-root user using securityContext with the assigned UID and GID, ensuring correct access control.
-
Read/write operations on both storage systems work as expected.
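The non-root setup described above can be sketched roughly as follows. The pod name, UID/GID values, image, and volume mechanism are placeholders; the actual OpenShift manifests were not shown, and the real deployment may use a CSI driver rather than a hostPath mount.

```yaml
# Hypothetical pod spec illustrating the securityContext approach (all values invented).
apiVersion: v1
kind: Pod
metadata:
  name: af-storage-test
spec:
  securityContext:
    runAsUser: 12345    # assigned UID (placeholder)
    runAsGroup: 6789    # assigned GID (placeholder)
    fsGroup: 6789       # ensures mounted volumes are group-accessible
  containers:
    - name: analysis
      image: example.registry/analysis:latest   # placeholder image
      volumeMounts:
        - name: gpfs
          mountPath: /gpfs
  volumes:
    - name: gpfs
      hostPath:
        path: /gpfs     # placeholder; real mount may use a CSI driver
```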
-
-
-
13:55
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:10
→
14:25
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
14:10
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
-
14:15
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
-
14:20
Facility R&D 5m
Speaker: Lincoln Bryant (University of Chicago (US))
- Diagnosing Coffea Casa deployment issues ongoing
- Duplicate resource errors when spawning notebooks
- EOS deployment ongoing
- Cluster up with 2PB of storage across several 90TB arrays (retired MWT2 storage)
- Need to understand the available authentication options; we don't really want to run Kerberos
- Gathering a list of issues/tweaks/workarounds with the Helm charts, would like to meet with the developers at some point to discuss further
- Experimenting with WireGuard 'routing node' features
- Don't have to install WireGuard on all nodes, but a node can be a NAT between the WG network and a private LAN
- Demonstrated connectivity from, e.g. umich001 to UChicago AF NFS server via the WG network[1]
- Was also able to mount /home and it seems to work. 200MB/s read/write - not great but probably due to MTU ~1500 as Aidan/Judith observed
- Sent Wei a demonstration `podman` command to create a pod to join the WG network
- Tested HTCondor glidein on NET2 (ostensibly to connect back to UChicago AF), caused HTCondor to segfault :)
- Kuantifier discussion tomorrow, use Facility R&D link:
[1]
[root@umich001 ~]# tracepath 192.168.240.133
1?: [LOCALHOST] pmtu 1280
1: 100.81.190.82 6.515ms
1: 100.81.190.82 6.275ms
2: 192.168.240.133 6.750ms reached
Resume: pmtu 1280 hops 2 back 2
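A minimal sketch of the "routing node" idea described above, assuming placeholder keys, subnets, and interface names (not the actual deployment config): the node joins the WG network itself and NATs traffic between WG peers and its private LAN, so hosts behind it need no WireGuard install.

```ini
# Hypothetical wg0.conf for a WireGuard routing node (all values placeholders).
[Interface]
Address = 192.168.240.1/24
PrivateKey = <routing-node-private-key>
ListenPort = 51820
# wg-quick hooks: masquerade traffic leaving toward the private LAN (eth0 assumed)
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer]
PublicKey = <remote-peer-public-key>
AllowedIPs = 192.168.240.133/32
```

Note the pmtu of 1280 in the tracepath output above, which is consistent with WireGuard's encapsulation overhead reducing the effective MTU.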
-
14:25
→
14:35
AOB 10m