US ATLAS Computing Facility (Replaced Tech Presentation)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09


    • 1:00 PM → 1:05 PM
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Today is a regular facility meeting (we had no Topical Presentation lined up).   Please let us know if you have a topic you would like to present at a future meeting.

      There are a lot of things going on.

      • February 2025 is a "Capabilities" Testing and Demonstration month.   See current list of topics at https://drive.google.com/drive/folders/1Af7hWa0Zm30EuqsV1PbekSjb--gXAsVG?usp=drive_link 
        • Please consider participating in one or more and feel free to edit existing documents or add new ones
      • The Tier-2s need to come up with a plan for how to use extra funds during this calendar year.
        • Highest priority is ensuring each of our Tier-2s will have 400 Gbps links by the end of 2029 (but it may be too early to spend directly on that now)
        • Each Tier-2 should be engaging the relevant campus and regional networks to discuss their upgrade plans and timelines
        • Also consider using the funds to fix infrastructure issues (power, cooling)
        • The first version of a WBS 2.3.2 document is due by the end of this month, with details needed for the July scrubbing
      • Ongoing jumbo-frames testing is proceeding smoothly (a path-MTU probe sketch follows at the end of this list).
        • Today is the last "regular"-frame transfer test from CERN-PROD_PILOT to both NET2 and BNL; tomorrow and Friday will be jumbo-frame testing
      • Upcoming Meetings
        • For your calendar: we plan to have a US ATLAS facilities meeting as part of HTC25 in Madison, Wisconsin, June 2-6, 2025.
        • US ATLAS scrubbing dates are set for July 14/15 at Stony Brook (possibly moving to 15/16 to accommodate European travel).
          • While many of you won't need to attend, you may be asked for input or slides for the scrubbing.
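
      Returning to the jumbo-frames item above: a minimal way to check whether a path actually carries jumbo frames is to send a large UDP datagram with the don't-fragment flag set. The sketch below is illustrative only, not part of the official transfer tests; the target hostname is a placeholder, and the Linux socket constants are defined by hand since the Python socket module does not always expose them.

      ```python
      import socket

      # Linux constants for forcing path-MTU discovery (defined manually;
      # not always exposed by the Python socket module).
      IP_MTU_DISCOVER = 10
      IP_PMTUDISC_DO = 2  # set the DF bit; oversized sends then fail

      def probe(host: str, payload: int) -> bool:
          """Return True if a UDP datagram of `payload` bytes can go out
          toward `host` unfragmented (Linux only)."""
          s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
          try:
              s.sendto(b"\x00" * payload, (host, 9))  # port 9 = discard
              return True
          except OSError:  # EMSGSIZE once the (cached) path MTU is known
              return False
          finally:
              s.close()

      # A 1472 B payload fits a standard 1500 B MTU; 8972 B needs a 9000 B
      # jumbo MTU (20 B IP + 8 B UDP headers). Hostname is hypothetical.
      for size in (1472, 8972):
          print(size, "OK" if probe("dtn.example.net", size) else "blocked")
      ```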
    • 1:05 PM → 1:10 PM
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (this week)

      • vo-client
      • XRootD shoveler
      • xrdcl-pelican

      Release (aiming for next week)

      Other projects

      • ARM package integration testing: made some progress in getting ARM VMs started by HTCondor and are working through some minor invocation issues
      • Kuantifier: waiting on the NET2 authenticated Prometheus dev instance (a query sketch follows below)
        • Eduardo has nodes for this and is working on setting up the cluster
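
      Once that instance exists, Kuantifier-style accounting reduces to authenticated queries against the Prometheus HTTP API; a minimal sketch, in which the URL, token path, and metric are placeholders rather than the real NET2 configuration:

      ```python
      import requests

      # Hypothetical endpoint and token location; the real NET2 dev
      # instance is not yet available, per the note above.
      PROM_URL = "https://prometheus.net2.example.org"
      TOKEN = open("/etc/kuantifier/bearer.token").read().strip()

      resp = requests.get(
          f"{PROM_URL}/api/v1/query",
          params={"query": "sum(rate(container_cpu_usage_seconds_total[5m]))"},
          headers={"Authorization": f"Bearer {TOKEN}"},
          timeout=30,
      )
      resp.raise_for_status()
      for series in resp.json()["data"]["result"]:
          print(series["metric"], series["value"])
      ```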
    • 1:10 PM → 1:30 PM
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 1:10 PM
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 1:15 PM
        Compute Farm 5m
        Speaker: Thomas Smith
      • 1:20 PM
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 1:25 PM
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        WBS 2.3.1.2 Tier-1 Infrastructure - Jason

        • NTR

        WBS 2.3.1.3 Tier-1 Compute - Tom

        • Testing the HTCondor v24 LTS configuration on gridgk03
          • Some issues with jobs being evicted after 2 hours; the Condor developers have been contacted and are providing support
        • All WNs have been upgraded to HTCondor 24.0 LTS and AlmaLinux 9.5; worker operation has been smooth (a verification sketch follows)
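
        As a quick uniformity cross-check after such an upgrade, the HTCondor Python bindings can ask the collector what version and OS every startd advertises. A sketch, assuming the bindings and the local pool configuration are in place:

        ```python
        import htcondor
        from collections import Counter

        # Query the pool collector for each worker node's advertised
        # HTCondor version and OS, then summarize the combinations.
        coll = htcondor.Collector()
        ads = coll.query(
            htcondor.AdTypes.Startd,
            projection=["Machine", "CondorVersion", "OpSysAndVer"],
        )

        summary = Counter(
            (ad.get("CondorVersion"), ad.get("OpSysAndVer")) for ad in ads
        )
        for (version, os_ver), count in summary.most_common():
            print(f"{count:5d} slots  {os_ver}  {version}")
        ```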

        WBS 2.3.1.4 Tier-1 Storage - Carlos

        • Database hardware issue affecting Pinmanager, Bulk, TransferManager and SpaceManager services
          • Degradation of service mainly affecting WRITEs (02/01/25 5PM EST)
          • Service recovered 02/02/25
          • Work to synchronize the internal accounting (SpaceManager) tables after the service restore is ongoing
        • Enabling JumboFrames on all doors and storage servers for ongoing Capabilities testing
        • Bulk service restarted on 02/09/25
          • 130k staging requests stuck in QUEUE state
          • After the restart, the requests were submitted to HPSS and the entire workflow is working as expected. A follow-up ticket was filed with the dCache developers: https://github.com/dCache/dcache/issues/7746

        WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan

        • Occupancy: 92%, A/R: 100%
          • Occupancy is lower than expected due to:
            • 2/5/25: Site was emptied for several hours due to Harvester DB lock timeouts.
            • 2/1/25: The problem mentioned in the storage section above
    • 1:30 PM → 1:40 PM
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Some reduction in production in the last 30 days.
        • Two central outages:
          • 1/14/25-1/16/25: A change at CERN caused the BNL FTS to fail; sites drained until they were moved to the CERN FTS instance.
          • 2/6/25: One of the two Harvester instances at CERN had a database issue; US sites using HTCondor-CE drained.
            • This did not affect NET2 or the Kubernetes part of CPB.
        • For the month of January the Illinois site of MWT2 was offline, reducing MWT2 production by about 1/3.
          • Jan 2-15: the site was down for a move to a new building.
          • Jan 16-22 (approximately): authentication was not working.
          • Jan 23-31 (approximately): systems were rebuilt on RHEL9 using the new Puppet setup.
            • There were also various hardware and power balance issues.
        • NET2 had a couple of interruptions to get their 400G uplink working.
          • The good news is the 400G is in service and working well!
        • OU_OSCER_ATLAS has been generally stable, with lots of opportunistic jobs.
          • Some draining on 2/11/25.
        • SWT2_CPB spent most of January getting their site up and running on AlmaLinux 9.
          • Things stabilized on 2/3/25.
            • CPB did not refill for one whole day last week after the Harvester issue was fixed.
              • The cause of the slow refilling is under investigation.
      • Procurement Planning
        • By the end of February we need a list of the extra network gear on which to spend the $2-$4 million split between the Tier-2 sites.
        • Procurement plans will likely be due by the end of March now that the equipment funding levels are known.
      • Operations Planning
        • Now that we are past the EL9 updates (except MSU), we need to plan for what we do going forward.
          • Clearly, storage tokens will need to be supported at all sites.
          • Some sites need to update to OSG 24 / HTCondor 24.
          • All sites have all public-facing servers dual-stacked and supporting IPv6, except the CE at OU (a reachability sketch follows this list).
          • AGLT2 and CPB still need to go to jumbo frames.
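
        Dual-stack status can be spot-checked at the DNS level from anywhere: a host is IPv6-ready only if it resolves to an AAAA record. A small sketch; the hostnames are placeholders:

        ```python
        import socket

        # A host is dual-stacked (at the DNS level) if it resolves to both
        # A (IPv4) and AAAA (IPv6) records. Hostnames are placeholders.
        def stacks(host: str) -> str:
            found = []
            for family, label in ((socket.AF_INET, "IPv4"),
                                  (socket.AF_INET6, "IPv6")):
                try:
                    socket.getaddrinfo(host, None, family)
                    found.append(label)
                except socket.gaierror:
                    pass
            return "+".join(found) or "unresolvable"

        for host in ("gridftp.example.edu", "ce.example.edu"):
            print(host, "->", stacks(host))
        ```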
    • 1:40 PM → 1:50 PM
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 1:40 PM
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))
        • Perlmutter: jobs are running fine.
          • Empty pilots: Xin added an interval between submissions to reduce the number of job requests sent at the same time (a spacing sketch follows this list)
        • TACC: shared file system failure
          • The scratch file system has been down since Saturday; the work file system failed on Monday
          • No detailed status information yet
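
        The pilot-spacing fix above is essentially rate limiting; a hedged illustration of the idea, where the submit call, interval, and jitter are hypothetical rather than the actual Harvester parameters:

        ```python
        import random
        import time

        def submit(request):
            """Hypothetical stand-in for the real pilot-submission call."""
            print("submitting", request)

        def submit_with_spacing(reqs, base_interval=60.0, jitter=30.0):
            # Space submissions out with a randomized interval so bursts
            # of simultaneous requests are avoided.
            for req in reqs:
                submit(req)
                time.sleep(base_interval + random.uniform(0.0, jitter))

        submit_with_spacing(["pilot-1", "pilot-2", "pilot-3"])
        ```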
      • 1:45 PM
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 1:50 PM → 2:10 PM
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 1:50 PM
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))

        1. Investigated storage technologies for user home areas to ensure correct storage ACLs for NFS and GPFS within a container, including solutions like GPFS CSI and NAPP CSI.

      • 1:55 PM
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 2:00 PM
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • ServiceX updated to 1.5.6. It’s expected to be reliable, and Ben is confident that it’s ready for broader use.
        • Added Dask-Gateway support to the AB image (currently in a branch). Since it requires JupyterHub for launching, we are prepping BinderHub as the launch platform.
        • The coffea-casa cull timeout was adjusted from 1 hour to 1 day, to support users launching computations from the terminal (a config sketch follows this list).
        • Maintenance is scheduled for late February or early March.
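
        The cull-timeout change corresponds to a JupyterHub idle-culler setting; a sketch of what such a configuration can look like (the service layout here is an assumption, not the actual coffea-casa deployment):

        ```python
        # jupyterhub_config.py sketch: run jupyterhub-idle-culler as a hub
        # service with the timeout raised from 1 hour (3600 s) to 1 day.
        c.JupyterHub.load_roles = [
            {
                "name": "idle-culler",
                "scopes": ["list:users", "read:users:activity",
                           "delete:servers"],
                "services": ["idle-culler"],
            }
        ]
        c.JupyterHub.services = [
            {
                "name": "idle-culler",
                "command": [
                    "python3", "-m", "jupyterhub_idle_culler",
                    "--timeout=86400",  # was 3600 (1 hour)
                ],
            }
        ]
        ```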
    • 2:10 PM → 2:25 PM
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 2:10 PM
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • ADC Operations:
          • 05.02.2025: DB lock timeouts on one Harvester instance (out of two).
          • 29.01.2025: Panda issue due to token issuer change (ATLASPANDA-1291)
          • DDM Ops/US Ops: Fabio is back. His priorities were defined.
          • GPUs: Need CUDA > 12.8 on all PQs. Expect Helpdesk tickets (a version-check sketch follows this list).
          • SAM tests moved from python2@SL7 to python3@EL9.
        • US Cloud Operations
          • SWT2: Failed transfers due to ACT access problem. Ongoing.
          • Ongoing JumboFrames tests.
        • USATLAS Helpdesk Tickets (Link)
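
        For the CUDA requirement, sites can spot-check what their driver supports by parsing the nvidia-smi banner; a minimal sketch, assuming nvidia-smi is on the PATH:

        ```python
        import re
        import subprocess

        # The nvidia-smi banner reports the highest CUDA version the
        # installed driver supports, e.g. "CUDA Version: 12.8".
        out = subprocess.run(["nvidia-smi"], capture_output=True,
                             text=True, check=True).stdout
        m = re.search(r"CUDA Version:\s*([0-9]+)\.([0-9]+)", out)
        if m is None:
            raise SystemExit("could not parse CUDA version from nvidia-smi")
        version = (int(m.group(1)), int(m.group(2)))
        print("driver supports CUDA %d.%d:" % version,
              "OK" if version >= (12, 8) else "upgrade needed")
        ```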
      • 2:15 PM
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • XCaches
          • a few issues to look at.
          • the gStream issue is still not debugged.
        • VP
          • working fine
          • need to follow up on NET2 VP queue mails.
        • Varnishes
          • all working fine
          • there was a discussion of a wholesale move from squid to varnish.
          • now adding instances at NRP in NL and CZ to serve Frontier data.
        • ServiceY
          • retesting FAB server-side delivery.
          • new datasets, new cluster
        • ServiceX
          • upgraded to 1.5.6
          • new code gen images.
        • AI
      • 2:20 PM
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))

        rp1 Ceph storage is bottlenecked on the WireGuard interface at IU. The equipment there is much older (R720?), and the CPU might not be fast enough to handle the encryption overhead. Two solutions were implemented (throughput measured with iperf; a scripting sketch follows the list):

        • Increasing the k8s MTU from 1280 to 8780 increased iperf throughput from 1 Gbps to 4 Gbps.
        • Adding a non-WireGuard backhaul network for Ceph increased performance to 10 Gbps (line rate).
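
        A sketch of scripting such a before/after comparison with iperf3's JSON output; the server hostname is a placeholder:

        ```python
        import json
        import subprocess

        def iperf_gbps(server: str, seconds: int = 10) -> float:
            """Run an iperf3 client against `server` and return the
            received throughput in Gbps, from iperf3's JSON output."""
            out = subprocess.run(
                ["iperf3", "-c", server, "-J", "-t", str(seconds)],
                capture_output=True, text=True, check=True,
            ).stdout
            bps = json.loads(out)["end"]["sum_received"]["bits_per_second"]
            return bps / 1e9

        # Placeholder hostname; run once per configuration (MTU 1280 vs
        # 8780, WireGuard vs. dedicated backhaul) and compare.
        print(f"{iperf_gbps('ceph-node.example.org'):.1f} Gbps")
        ```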


        Testing the feasibility of unprivileged WireGuard on a VM at UChicago: podman seems to let us create tunnel interfaces in containers without root privileges on current (EL9+) kernels. This might have interesting implications for jobs.


        Ongoing re-testing of ServiceY on FAB. Fengping will present at the KNIT10 conference in March.


        Flocking tests from the UChicago AF to MWT2 are ongoing; they will be exercised at large scale during the upcoming MWT2 storage downtime.

    • 2:25 PM → 2:35 PM
      AOB 10m