US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 1:00 PM 1:05 PM
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Tier-2 facility will have separate mgmt meeting on Fridays (PIs/Co-PIs, mgmt)

      LHCONE/LHCOPN next week in Manchester UK

      Still looking for future topical presentations

      Lots of discussion about the demise of Skype and its replacement.

      OTF presentation on facilities issues & evolution

       

    • 1:05 PM 1:10 PM
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release

      This week

      • IGTF 1.134

      Next week

      • XRootD 5.7.3 with gstream fix patch

      Other

      • Successfully ran manual ARM integration tests: need to build Frontier Squid on ARM, install Ceph client now that it's available for ARM
      • Eduardo has passed along that the dev OKD cluster is ready
    • 1:10 PM 1:30 PM
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 1:10 PM
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 1:15 PM
        Compute Farm 5m
        Speaker: Thomas Smith
      • 1:20 PM
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 1:25 PM
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        WBS 2.3.1.2 Tier-1 Infrastructure - Jason

        • Unplanned power interruption (morning of 3/7)

          • Most things recovered in a few hours, a few VMs took several hours to recover

          • Casualties: the NIC on one OpenShift worker (a few days to replace) and a few corrupted VM disk images (one needs to be rebuilt; another was copied from RHEV again, since it had been migrated recently and the old image was still present)

        • OpenShift: more than half migrations from RHEV complete, close to supporting containers (ready for testing very soon)

        WBS 2.3.1.3 Tier-1 Compute - Tom

        • BNL_ARM - it was not getting new jobs due to missing SW tags in CRIC. Solved.

        • Unplanned power outage on Friday, 7 March.

          • This led to a large number of job failures as workers lost power

          • HTCondor recovery by ~15:45 (eastern time)

          • Job ramp up was gradual, but successful

          • Some worker nodes came up in a bad state and were rebuilt. Full capacity restored

        An effort was also made to recover previously downed worker nodes; as a result, capacity is slightly higher after the power outage (34.2k cores -> 35.4k cores).
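As a quick check of the capacity figures quoted above, the following minimal sketch works out the absolute and relative gain. The core counts are the rounded values from the notes, so the result is approximate; the variable names are illustrative only.

```python
# Rounded core counts taken from the notes above (34.2k -> 35.4k).
cores_before = 34_200
cores_after = 35_400

# Absolute and relative change in capacity.
added_cores = cores_after - cores_before
gain = added_cores / cores_before

print(f"Added cores: {added_cores}")
print(f"Relative gain: {gain:.1%}")
```

With these rounded inputs the gain comes to roughly 1.2k cores, or about 3.5%.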

        WBS 2.3.1.4 Tier-1 Storage - Carlos

        • Power glitch outage on 03/07/25. 

          • The ATLAS production storage service was degraded

          • The Chimera server was down for 7 minutes but restarted without any issues or corruption.

          • Other dCache core services failed over to redundant components. 

          • A mix of pool hosts restarted automatically, while a few others required manual hardware intervention. No data loss was observed.

          • A subset of doors were also affected and recovered without issue

          • The impact was limited to some READ operations and READ/WRITE transfers that were in progress during the power glitch. 

          • The system was fully functional by 11 AM (EST).

          • Test/Integration instance affected due to the OpenShift issue

         

        • Work on DMZ Pools: The underlying filesystem block size of DMZ pools has been aligned with the NVMe-based block size, resulting in an improvement in READ IOPS. 

        WBS 2.3.1.4 Tier-1 Operations & Monitoring - Ivan

        • All operations-related news was already reported above.

         

    • 1:30 PM 1:40 PM
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Reasonable running over the last couple of weeks.
        • OU had a scheduled downtime.
      • Found that the reliability reporting does not play well with sites putting only some services offline.
        • I need to reply to an email from Borja from the CERN monit team.
      • I am (slowly) working on templates for the procurement and operation plans.
      • I have modified the v71 tab of the capacity sheet to calculate the meanRSS for each site.
        • I will shortly add a power consumption calculation so that we can answer a question from the operations review.
        • The BNL data on the capacity sheet seems out of date.
    • 1:40 PM 1:50 PM
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 1:40 PM
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))
        • TACC & Perlmutter: Doug found an issue related to the PanDA token. Updated the token issuer to https://atlas-auth.cern.ch/
      • 1:45 PM
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))

        Fixed issue with PanDA tokens: the issuer changed from atlas-auth.web.cern.ch to atlas-auth.cern.ch

        Restarted BNLHPC_DATADISK after BNL power cut on Friday.
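For sites that want to confirm which issuer their PanDA token carries after the change noted above, a minimal sketch along these lines may help. It assumes the token is a compact JWT and only reads the `iss` claim; it does not verify the signature, so it is a diagnostic aid rather than a validation step. The function names and the `make_unsigned_token` helper are hypothetical and exist only for this illustration.

```python
import base64
import json

# New issuer as quoted in the notes above.
EXPECTED_ISSUER = "https://atlas-auth.cern.ch/"


def token_issuer(token: str) -> str:
    """Return the 'iss' claim of a compact JWT without verifying it."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a compact JWT (expected three dot-separated parts)")
    payload_b64 = parts[1]
    # base64url payloads usually drop padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims.get("iss", "")


def issuer_matches(token: str, expected: str = EXPECTED_ISSUER) -> bool:
    """Compare issuers, ignoring a trailing slash."""
    return token_issuer(token).rstrip("/") == expected.rstrip("/")


def make_unsigned_token(issuer: str) -> str:
    """Hypothetical helper: build an unsigned demo token for testing only."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{b64({'alg': 'none'})}.{b64({'iss': issuer})}."


if __name__ == "__main__":
    old = make_unsigned_token("https://atlas-auth.web.cern.ch/")
    new = make_unsigned_token("https://atlas-auth.cern.ch/")
    print("old issuer accepted:", issuer_matches(old))
    print("new issuer accepted:", issuer_matches(new))
```

Ignoring the trailing slash is a convenience for this check only; a service that validates tokens may well compare the issuer string exactly.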

    • 1:50 PM 2:10 PM
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 1:50 PM
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • Discussed accessing GPFS storage within a container with an OpenShift expert
        • Investigating how to set proper storage permissions automatically after user login via the JupyterHub portal

        • Working on a document describing the Jupyter + OpenShift workflow

         

      • 1:55 PM
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:00 PM
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
    • 2:10 PM 2:25 PM
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 2:10 PM
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        ADC Operations

        • Sectigo CA replacement has started. First case: Milano has chosen HARICA. This required updating the CA definition for storage nodes
        • HammerCloud: started updating the tests from rel. 22 to rel. 23. Score simulation will be replaced by evgen
        • Enabled Data Carousel for analysis
      • 2:15 PM
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • Analytics
          • We used to have all the data visible to ATLAS_USER and the anonymous user. That is no longer the case, and we now have to explicitly allow data into dashboards for these users. That "broke" a lot of dashboards and visualizations embedded in or shared with many places and people. I have been fixing them for the last few days. Please complain if you see a dashboard not showing correctly.
        • XCaches
          • required update to the x509 proxy renewal container
          • updated all UC AF xcaches
          • had to fix dashboards
          • building new image for gStream monitoring fix
        • Varnishes
          • All working fine
          • ATLAS made a decision to move to Varnish for conditions.
          • Ilija and Nurcan preparing a grand plan document.
          • Asked John to try installing one at BNL.
        • VP
          • working fine
        • ServiceX and ServiceY
          • x509 proxy renewal container update
          • also for CMS
      • 2:20 PM
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • Have access to NET2 K8S, doing some tests at a small scale. Coordinating with Eduardo on figuring out minimal privileges for e.g. WireGuard in OpenShift
          • Aidan will try Armada for Kubernetes-level federation against this cluster as well
        • stretched k8s upgraded to Kubernetes 1.31
        • Have a working unprivileged WireGuard container with manual configuration. Capabilities added _in the namespace_ only
          • [12:03]:~/wg-test/config $ podman run --cap-add=NET_RAW --cap-add=NET_ADMIN --cap-add=SYS_MODULE --sysctl="net.ipv4.conf.all.src_valid_mark=1" -p 51820:51820/udp -v /lib/modules:/lib/modules -v /home/lincolnb/wg-test/config/:/etc/wireguard wgtest3 /bin/bash -c "wg-quick up wg0; ping 10.20.10.1"
            PING 10.20.10.1 (10.20.10.1) 56(84) bytes of data.
            64 bytes from 10.20.10.1: icmp_seq=1 ttl=64 time=3.74 ms
            64 bytes from 10.20.10.1: icmp_seq=2 ttl=64 time=1.85 ms
            ^C
            --- 10.20.10.1 ping statistics ---
            2 packets transmitted, 2 received, 0% packet loss, time 1002ms
            rtt min/avg/max/mdev = 1.851/2.794/3.737/0.943 ms
    • 2:25 PM 2:35 PM
      AOB 10m