US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Today we have quite a bit to cover!

       

      • We note that CHEP papers are due by the end of February, with NO EXTENSIONS to be provided! Please get working on your papers.
      • Quarterly reports and milestone updates are submitted.
        • We have a few milestones that have remained delayed for a while and we hope they will be resolved soon.
      • Next week is ATLAS S&C week at CERN. 
      • The OSG All-hands + HTCondor week has been set for June 2 – 6, 2025 in Madison, Wisconsin. USATLAS and USCMS have been asked to confirm their attendance.
        • Any conflicts or issues?
        • Can we let the organizers know we (USATLAS) will be participating?
      • Paolo has requested a "Tier-2 Shopping List" of what could be purchased by December 2025 (when all funds must be spent out).
        • This is complicated because this will be the last "extra" funds we can expect before HL-LHC starts in 2030
        • Previously we have gotten such funds and targeted ensuring that our Tier-2 networks were appropriately sized and not causing any ongoing expenses
        • The complication is that our Tier-2s are expected to have resilient 400 Gbps connections by HL-LHC, but it is likely too early to invest in 400G by the end of this calendar year (for example, prices for 400G should drop significantly by 2029).
        • Each Tier-2 will need to think carefully about their needs and the plans already in place for their institutional network upgrades
        • We will need a separate meeting to discuss planning and next steps
      • We need to have a separate discussion about Tier-1 procurements and plans
      • February 2025 is designated as "US Capability Mini-Challenge" month, where USATLAS, USCMS and possibly others test capabilities.
        • We have a Google folder to help organize what should be a bottom-up effort:  https://drive.google.com/drive/folders/1Af7hWa0Zm30EuqsV1PbekSjb--gXAsVG?usp=sharing
        • We (WBS 2.3) need to start filling out relevant documents to cover the who, what, when for each topic we want to test in February
        • Some capabilities may not be tested in this round, but we can still begin organizing them
        • USCMS has been invited to contribute and other experiments are welcome
        • We plan to report on the tests that happen at a future WLCG DOMA meeting
        • This will be briefly presented/discussed at ATLAS S&C next week.
    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      OSG Software

      • Released last week: vo-client-137-4. Includes new Kubernetes-based IAM server in /etc/vomses
      • Upcoming release (next week?):
        • XRootD 5.7.3 upstreams a caching-related patch that OSG was carrying in XRootD 5.7.2
        • vo-client cleanup, removing old CERN VOMS servers
      • Built ARM VMs and are working on getting them into our integration testing pipeline

       

    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))

        Compute (for Tom)

        • Ongoing upgrade to HTCondor 24 and Alma 9.5. Done for 1 CE and ~12k slots.

        • ATLAS T1 farm rolling HTCondor upgrade in progress: v23 LTS to v24 LTS.

        • gridgk03 upgraded (new router syntax).

        • Some workers upgraded already; proceeding in batches to ensure uptime.

         

         

        Storage

        Rolling pool restarts were performed (10 servers) to update dCache pool memory (01/29/25) and ZFS parameters (01/23/25).

      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • Availability and Reliability: 99.7%
        • Occupancy: 30.9k slots, 97%
          • One drop at ~12:00 on Monday 01/20 due to a certificate problem in the central submission infrastructure
        • USFTS was not serving the US sites for 10 days (1/14 - 1/24) due to an update of the lsc files (RQF3013164)
        • SCRATCHDISK filled up on Thursday, 01/17 due to a sudden wave of transfers after switching to CERN FTS
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Reasonable running with some disruptions.
        • All sites (worldwide!) drained because of central authorization issues.
        • NET2 had some disruption due to network upgrade work.
        • CPB having trouble scaling up to all servers running Alma Linux 9.
      • EL9 
        • MSU has made some progress getting the install system for RHEL in service
        • Illinois converted but still working on getting the new infrastructure.
        • CPB is very close on the compute. Storage will be next.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        TACC (Rui): Running smoothly (40% allocation used), maintenance yesterday

        Perlmutter (Xin): following up with the empty pilots (Jira)

        • One node can start 300+ pilots in parallel; some pilots fail to get real jobs from the PanDA server when many of them ask for jobs at around the same time
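The burst described above can be illustrated with a small sketch. It is not something the notes say is being done on Perlmutter: it only shows, under assumed numbers, how adding a random start delay ("jitter") to each pilot spreads out simultaneous job requests. The function names, the 60-second window, and the one-second bucketing are all hypothetical choices made for illustration.

```python
import random

def jittered_start_delays(n_pilots, window_s, seed=None):
    """Return one random start delay (seconds) per pilot, uniform in [0, window_s]."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, window_s) for _ in range(n_pilots)]

def peak_requests_per_second(delays):
    """Count how many requests fall into the busiest one-second bucket."""
    buckets = {}
    for d in delays:
        key = int(d)  # one-second bucket index
        buckets[key] = buckets.get(key, 0) + 1
    return max(buckets.values())

# 300 pilots on one node, all asking for a job at the same instant.
no_jitter = [0.0] * 300
# The same 300 pilots, each delayed by a random amount within an assumed 60 s window.
with_jitter = jittered_start_delays(300, window_s=60.0, seed=1)

print("peak without jitter:", peak_requests_per_second(no_jitter))  # 300
print("peak with jitter:   ", peak_requests_per_second(with_jitter))
```

Whether this kind of spreading is appropriate depends on how the pilots are launched and on what the PanDA server actually limits; the sketch only makes the "many requests at the same time" effect concrete.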
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Dr Quilan Huang (BNL)
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • ServiceX updated to 1.5.5
          • Claimed to be 100% reliable, but probably still only conditionally reliable: it appeared to put stress on S3 and then crash after that.
        • Quarterly maintenance coming up in February
          • Kubernetes/Rook updates, OS updates
          • IPv6 issue: we now know what triggers it, so we can avoid it. It is related to nftables/iptables compatibility. Kubernetes support for nft is beta in v1.31.
        • 200G challenge rerun on the ADS nodes
          • Reaching 335G throughput with the new networking/storage gear.
        • Next week, Wednesday 15:00 at CERN, we will meet with the CERN AF people to discuss what the different AFs offer, monitoring, etc. There will be a Zoom call.
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • USATLAS Occupancy drops:
          • 1/29 (today): A central Harvester/Panda problem due to the switch to the new token issuer (ATLASPANDA-1291). Solved by patching Panda and Harvester
          • 1/15: Due to a BNL-Rucio issue mentioned in the WBS 2.3.1.4 section (RQF3013164)
        • DDM Ops: Fabio is back. He will define/summarize his US Ops goals at a US Ops meeting at S&C.
        • Switched from GGUS to GGUS helpdesk today.
          • Ticket submission to all US sites was tested.
          • To do: remove obsolete USATLAS sites from the system
          • To start discovering new "features" when the system comes back up today.
        • Move to new token issuer - postponed.
        • USATLAS:
          • AGLT2: Corrected the A/R for December (from 78% to 90.2%)
          • SWT2: Upgrade OS, Slurm, CE. Ongoing
          • NET2: Network is almost there
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCaches
          • A number of restarts of UK and German XCaches.
          • gStream monitoring stopped working for xcaches on 5.7.2. Will try to debug it next week.
        • VP
          • will be adding a VP configuration for NET2 VP queue.
        • Analytics
          • Infrastructure works fine
          • Again looking into getting branch reading analytics from EventLoop
          • Started work on WFMS AI assistant
        • Varnishes
          • All working fine
        • CREST
          • Tomorrow morning another round of HLT testing.
          • Prepared three different varnish configurations (node, top of the rack, node+top of the rack)
        • ServiceX/Y
          • stress testing
          • development of the new transformer code. 
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • 200G challenge redux - 335 Gbps achieved so far on new gear (~65 Gbps per node - 2x100G capable)
        • Flocking from AF to MWT2 via HTCondor Docker universe (sort of a Docker-based glidein), with AutoFS mounting the AF filesystems and propagating them into the container. Controlled scaling tests ongoing. Hopefully we will have a large-scale test during the next MWT2 downtime.
        • Work continues on CephFS/RBD on stretched k8s, optimizing and profiling the pool. Should be done in the next week or so.
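As a quick sanity check on the 200G challenge figures quoted above, the arithmetic can be sketched as follows. The node count is not stated in the notes, so the value computed here is only an implied estimate from the two quoted numbers; the function names are illustrative.

```python
# Figures quoted in the notes above.
aggregate_gbps = 335.0          # total throughput achieved so far
per_node_gbps = 65.0            # approximate per-node throughput
nic_capacity_gbps = 2 * 100.0   # each node is "2x100G capable"

def implied_node_count(aggregate, per_node):
    """Estimate how many nodes the aggregate corresponds to (not stated in the notes)."""
    return aggregate / per_node

def link_utilization(per_node, capacity):
    """Fraction of a node's nominal NIC capacity that is in use."""
    return per_node / capacity

nodes = implied_node_count(aggregate_gbps, per_node_gbps)
util = link_utilization(per_node_gbps, nic_capacity_gbps)

print(f"implied node count: {nodes:.1f}")
print(f"per-node link utilization: {util:.1%}")
```

Under these quoted numbers each node uses roughly a third of its nominal 200 Gbps, which suggests the per-node rate, rather than the NICs themselves, is what currently bounds the aggregate.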
    • 14:25 14:35
      AOB 10m