US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
-
1:00 PM
→
1:05 PM
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
1:05 PM
→
1:10 PM
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
1:10 PM
→
1:30 PM
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
1:10 PM
Tier-1 Infrastructure 5m
Speaker: Jason Smith
-
1:15 PM
Compute Farm 5m
Speaker: Thomas Smith
-
1:20 PM
Storage 5m
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
-
1:25 PM
Tier1 Operations and Monitoring 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
-
Unplanned power interruption (morning of 3/7)
-
Most things recovered within a few hours; a few VMs took several hours to recover
-
Casualties: a NIC on one OpenShift worker (a few days to replace) and a few corrupted VM disk images (one needs to be rebuilt; another was copied from RHEV again, since it was recently migrated and the old image was still present)
-
OpenShift: more than half of the migrations from RHEV are complete; close to supporting containers (ready for testing very soon)
WBS 2.3.1.3 Tier-1 Compute - Tom
-
BNL_ARM was not getting new jobs due to missing SW tags in CRIC. Solved.
-
Unplanned power outage on Friday, 7 March.
-
This led to a large number of job failures as worker nodes lost power
-
HTCondor recovered by ~15:45 (Eastern time)
-
Job ramp-up was gradual but successful
-
Some worker nodes came up in a bad state and were rebuilt. Full capacity restored.
-
There was an effort to recover additional previously downed worker nodes; capacity is slightly higher after the power outage as a result (34.2k cores -> 35.4k cores)
WBS 2.3.1.4 Tier-1 Storage - Carlos
-
Power glitch outage on 03/07/25.
-
The ATLAS production storage service was degraded
-
The Chimera server was down for 7 minutes but restarted without any issues or corruption.
-
Other dCache core services failed over to redundant components.
-
A mix of pool hosts restarted automatically, while a few others required manual hardware intervention. No data loss was observed.
-
A subset of doors was also affected and recovered without issue
-
The impact was limited to some READ operations and READ/WRITE transfers that were in progress during the power glitch.
-
The system was fully functional by 11 AM (EST).
-
Test/Integration instance affected due to the OpenShift issue
-
Work on DMZ pools: the underlying filesystem block size of the DMZ pools has been aligned with the block size of the NVMe devices, resulting in an improvement in READ IOPS.
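-
For context, a minimal sketch of how such a block-size check and alignment might be done, assuming hypothetical device names and an XFS pool filesystem (the actual BNL pool layout is not specified here):
# Inspect the logical/physical block sizes of a (hypothetical) NVMe pool device; requires nvme-cli
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
blockdev --getss --getpbsz /dev/nvme0n1
# Recreate the pool filesystem with a matching block/sector size (4096 is only illustrative)
mkfs.xfs -f -b size=4096 -s size=4096 /dev/nvme0n1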
WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan
-
All operations-related news was already reported above.
-
1:30 PM
→
1:40 PM
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Reasonable running over the last couple of weeks.
- OU had a scheduled downtime.
- Found a problem with the reliability reporting not playing well with sites putting only some services offline.
- I need to reply to an email from Borja from the CERN MONIT team.
- I am (slowly) working on templates for the procurement and operation plans.
- I have modified the v71 tab of the capacity sheet to calculate the meanRSS for each site.
- I will shortly add a power consumption calculation so that we can answer a question from the operations review.
- The BNL data on the capacity sheet seems out of date.
-
1:40 PM
→
1:50 PM
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
1:40 PM
-
1:45 PM
Integration of Complex Workflows on Heterogeneous Resources 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
1:50 PM
→
2:10 PM
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
1:50 PM
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
-
1:55 PM
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
2:00 PM
Analysis Facilities - Chicago 5m
Speaker: Fengping Hu (University of Chicago (US))
-
2:10 PM
→
2:25 PM
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
2:10 PM
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
-
2:15 PM
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
- Analytics
- we used to have all the data visible to ATLAS_USER and the anonymous user. That is no longer the case, and we now have to explicitly allow data into dashboards for these users. That "broke" a lot of dashboards and visualizations embedded in or shared to many places and people. I have been fixing them for the last few days. Please complain if you see a dashboard not showing correctly.
- XCaches
- required an update to the x509 proxy renewal container (see the sketch after this list)
- updated all UC AF xcaches
- had to fix dashboards
- building new image for gStream monitoring fix
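- A minimal sketch of the kind of renewal loop such a container might run, assuming a hypothetical robot certificate path and renewal interval (not the actual container contents):
# Periodically renew an ATLAS VOMS proxy from a (hypothetical) robot certificate
while true; do
  voms-proxy-init --voms atlas \
    --cert /etc/grid-security/robotcert.pem \
    --key /etc/grid-security/robotkey.pem \
    --hours 96 --out /tmp/x509up_shared
  sleep 6h
done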
- Varnishes
- All working fine
- ATLAS made a decision to move to Varnish for conditions.
- Ilija and Nurcan are preparing a grand plan document.
- Asked John to try installing one at BNL.
- VP
- working fine
- ServiceX and ServiceY
- x509 proxy renewal container update
- also for CMS
-
2:20 PM
Facility R&D 5mSpeaker: Lincoln Bryant (University of Chicago (US))
- Have access to NET2 K8S, doing some tests at a small scale. Coordinating with Eduardo on figuring out minimal privileges for e.g. WireGuard in OpenShift
- Aidan will try Armada for Kubernetes-level federation against this cluster as well
- stretched k8s upgraded to Kubernetes 1.31
- Have a working unprivileged WireGuard container with manual configuration. Capabilities added _in the namespace_ only; see the podman example below and the Kubernetes sketch after it.
-
[12:03]:~/wg-test/config $ podman run --cap-add=NET_RAW --cap-add=NET_ADMIN --cap-add=SYS_MODULE --sysctl="net.ipv4.conf.all.src_valid_mark=1" -p 51820:51820/udp -v /lib/modules:/lib/modules -v /home/lincolnb/wg-test/config/:/etc/wireguard wgtest3 /bin/bash -c "wg-quick up wg0; ping 10.20.10.1"
PING 10.20.10.1 (10.20.10.1) 56(84) bytes of data.
64 bytes from 10.20.10.1: icmp_seq=1 ttl=64 time=3.74 ms
64 bytes from 10.20.10.1: icmp_seq=2 ttl=64 time=1.85 ms
^C
--- 10.20.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 1.851/2.794/3.737/0.943 ms
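- For comparison, a rough sketch of how the same capability set could be requested from Kubernetes instead of podman; the namespace, image name, and the OpenShift SCC that would have to permit these capabilities are assumptions, not the actual NET2/UC configuration:
# Apply a pod spec that adds only the network capabilities to an otherwise unprivileged container
kubectl apply -n wg-test -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: wgtest
spec:
  containers:
  - name: wireguard
    image: wgtest3            # hypothetical image reference
    securityContext:
      privileged: false
      capabilities:
        add: ["NET_ADMIN", "NET_RAW"]   # SYS_MODULE omitted; module loading handled on the host
    ports:
    - containerPort: 51820
      protocol: UDP
EOF
# On OpenShift, the pod's service account also needs an SCC that allows these added capabilities.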
-
2:25 PM
→
2:35 PM
AOB 10m