US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      We need to complete our Tier-2 planning and use of possible end-of-CA funds before the end of this month

      As a facility, we should be thinking about how some of the work we do might be enhanced/improved by the use of AI/ML, since there may be funding options in the future

      Today is the ESnet blueprint meeting from 2:30-3:30 PM Eastern, with topics:

      • Tier-2 updates
      • IPv6-only LHCOPN?
      • System tuning work (capability challenge?)
      • DC27 plans

       

      Trusted CI engagement continuing (5th meeting tomorrow; US ATLAS one-on-one meeting next Wednesday)

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (this week)

      • XRootD 5.8.1
      • IGTF 1.135
        • Updated SlovakGrid trust anchor with extended validity (SK)
        • Withdrew the discontinued HPCI CA (JP)
    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith (BNL)
        • Weather-related power dip on May 2nd; approximately ⅓ of the Tier 1 (by core count) was affected

          • Lost power to 1 row of compute for a few minutes at ~02:30 (Eastern time)

          • Received notification; onsite work recovered the lost portion of the Condor pool

          • ~99% recovery completed by 05:00, 100% recovery by 10:00

        • Initial testing has begun on a revised HTCondor memory (cgroups) configuration that should better protect worker nodes (EPs) from having their memory completely exhausted (a hedged sketch follows below)

          • These changes don’t affect the Tier 1 (yet) but are on the horizon; they are currently being rolled out on one of our other pools
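          A minimal sketch of what such a configuration could look like, built from stock HTCondor knobs; the actual settings in use at BNL were not shown, and the file name and values below are assumptions:

            # Hypothetical /etc/condor/config.d/99-cgroup-memory.conf on an EP.
            # A hard cgroup limit lets the kernel kill a job that exceeds its
            # provisioned memory instead of letting it exhaust the whole node.
            CGROUP_MEMORY_LIMIT_POLICY = hard
            # Hold back some memory (in MB) for the OS and HTCondor daemons.
            RESERVED_MEMORY = 4096
            # Apply on the node with: condor_reconfig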

         

        Also Storage (added here since I don't have permissions to add to that slot):

        • Infrastructure

          • The local Puppet class for the OSG CAs has been improved, enhancing CRL updates and repository management.

        • Monitoring

          • Integration of various dCache components into the ELK infrastructure is underway.

            • Pools are currently being integrated to complete the Filebeat deployment (a hedged config sketch is below).
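          A hedged sketch of the Filebeat piece; the pool log path and output endpoint are illustrative assumptions, not the actual BNL deployment:

            # Hypothetical filebeat.yml fragment shipping dCache pool logs to ELK.
            filebeat.inputs:
              - type: filestream
                id: dcache-pool-logs
                paths:
                  - /var/log/dcache/*.log      # pool log location is an assumption
            output.logstash:
              hosts: ["logstash.example.bnl.gov:5044"]   # placeholder endpoint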

         

      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running in the past two weeks.
        • There were small draining incidents at most sites.
      • Sites have mostly submitted Operations and Procurement plans.
        • One Procurement plan is still outstanding.
        • A few of us will discuss how to proceed later today and there will be a meeting on Friday with the sites.
      • EL9 upgrade / FY24 equipment install continues at MSU and UTA.
      • Discussed Varnish/NRP at today's daily meeting.
      • I will check with Valentin Voikl whether the recommended cvmfs version is 2.12.7.
        • The OSG repository only has cvmfs 2.12.6, which on the client side should behave about the same (a quick version-check sketch follows this list).
      • The NET2 tape system has been having difficulty keeping up with large bursts of requests submitted all at once.
        • Otherwise the tape system is working well.
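      As a quick cross-check of the versions mentioned above, something like the following can be run on a client (standard commands; the 'osg' repository id is an assumption about the local yum setup):

        rpm -q cvmfs                                  # version currently installed
        cvmfs_config stat atlas.cern.ch | head -n 2   # version the running client reports
        yum --disablerepo='*' --enablerepo='osg' list available 'cvmfs*'   # what OSG ships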
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))
        • Perlmutter: stable week
        • TACC: LSCP people suggested we use the 'flex' queue, which is charged at 0.8× the normal rate (instead of 1×)
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Worked on mounting GPFS and dCache storage in a pod with proper permissions on OpenShift (a hedged pod-spec sketch follows this list)

          • Successfully mounted both GPFS and dCache storage within a pod.

          • The pod is configured to run as a non-root user using securityContext with the assigned UID and GID, ensuring correct access control.

          • Read/write operations on both storage systems work as expected.
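          A hedged sketch of the pod-level settings described above; the UID/GID values, image, and volume definitions are illustrative assumptions, not the actual BNL configuration:

            # Hypothetical pod spec (apply with: oc apply -f pod.yaml)
            apiVersion: v1
            kind: Pod
            metadata:
              name: af-storage-test
            spec:
              securityContext:
                runAsUser: 12345       # assigned non-root UID (assumption)
                runAsGroup: 6789       # assigned GID (assumption)
                fsGroup: 6789          # group ownership applied to mounted volumes
              containers:
                - name: test
                  image: registry.access.redhat.com/ubi9/ubi
                  command: ["sleep", "infinity"]
                  volumeMounts:
                    - name: gpfs
                      mountPath: /gpfs
                    - name: dcache
                      mountPath: /dcache
              volumes:
                - name: gpfs
                  hostPath:
                    path: /gpfs                      # host path is an assumption
                - name: dcache
                  persistentVolumeClaim:
                    claimName: dcache-pvc            # claim name is an assumption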

      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • Thanos deployed for long-term metrics persistence (a hedged sidecar sketch follows this list)
        • Finalizing BinderHub configurations
        • Harbor proxy cache deployed
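        A hedged sketch of the Thanos piece; the Prometheus URL, TSDB path, and bucket config location are assumptions about the local setup:

          # Hypothetical Thanos sidecar invocation: uploads Prometheus TSDB blocks to
          # object storage for long-term retention and serves them to Thanos queriers.
          thanos sidecar \
            --prometheus.url=http://localhost:9090 \
            --tsdb.path=/prometheus \
            --objstore.config-file=/etc/thanos/bucket.yml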
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • All BNL transfers migrated to new FTS; old service to be decommissioned in a few weeks
        • Some squid and varnish issues
        • BGP tagging confirmed at NET2
        • HTCondor update ongoing at AGLT2
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Diagnosing Coffea Casa deployment issues ongoing
          • Duplicate resource errors when spawning notebooks
        • EOS deployment ongoing
          • Cluster up with 2PB of storage across several 90TB arrays (retired MWT2 storage)
          • Need to understand options available for authentication. Don't really want to run Kerberos
          • Gathering a list of issues/tweaks/workarounds with the Helm charts; would like to meet with the developers at some point to discuss further
        • Experimenting with WireGuard 'routing node' features (a hedged routing sketch follows the tracepath output below)
          • Don't have to install WireGuard on all nodes, but a node can be a NAT between the WG network and a private LAN
            • Demonstrated connectivity from, e.g., umich001 to the UChicago AF NFS server via the WG network [1]
              • Was also able to mount /home and it seems to work; 200 MB/s read/write is not great, but probably due to MTU ~1500 as Aidan/Judith observed
        • Sent Wei a demonstration `podman` command to create a pod to join the WG network 
        • Tested an HTCondor glidein on NET2 (ostensibly to connect back to the UChicago AF); it caused HTCondor to segfault :)
        • Kuantifier discussion tomorrow, use Facility R&D link:

         

        [1]

        [root@umich001 ~]# tracepath 192.168.240.133
         1?: [LOCALHOST]                      pmtu 1280
         1:  100.81.190.82                                         6.515ms 
         1:  100.81.190.82                                         6.275ms 
         2:  192.168.240.133                                       6.750ms reached
             Resume: pmtu 1280 hops 2 back 2 
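        A minimal sketch of the routing-node idea, under stated assumptions: wg0 is the WG interface, 100.81.0.0/16 is the WG overlay (matching the addresses in the tracepath above), 192.168.240.0/24 is the private LAN, and eth0 is the LAN-facing NIC:

        # On the routing node: forward and NAT traffic from the WG overlay into the
        # private LAN, so LAN hosts need no WireGuard installation of their own.
        sysctl -w net.ipv4.ip_forward=1
        iptables -t nat -A POSTROUTING -s 100.81.0.0/16 -o eth0 -j MASQUERADE
        # On remote WG peers: widen AllowedIPs so the LAN is routed via this node, e.g.
        #   [Peer]
        #   PublicKey  = <routing-node public key>
        #   AllowedIPs = 100.81.190.82/32, 192.168.240.0/24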
    • 14:25 14:35
      AOB 10m