US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
Forthcoming meetings
- ATLAS S&C week, Feb 5-9, https://indico.cern.ch/event/1340782/
- ADC @ S&C week (Sites Jamboree), Feb 6-8, https://indico.cern.ch/event/1355529/
- Tier2 Technical, Facility R&D, Joint WBS 2.3/5, and Facility Coordination next week all cancelled due to S&C.
- Facility Topical in two weeks (Feb 14). Start a default rotation through L3 areas. Last talk was from 2.3.1. Next: topical presentation from 2.3.2. Can be superseded as additional topics emerge.
- New 2.3/5 (and RAC) schedule:
- Wednesday, February 7th: canceled for S&C week
- Wednesday, February 21st: 2.3/5 and RAC
- Wednesday, March 6th: 2.3/5
- Wednesday, March 20th: 2.3/5 and RAC
- Wednesday, April 3rd: 2.3/5
- Wednesday, April 17th: 2.3/5 and RAC
- etc.
-
Mar 18-22: US ATLAS T1 jamboree @ BNL
-
May 13-17: HSF/WLCG workshop @ DESY
-
Jun 3-7: 78th ATLAS S&C week @ Oslo
-
Jul 8-12: HTC 24
-
July 15-17: US ATLAS Summer Workshop at U Washington (including WBS5 training and a co-located IRIS-HEP training event)
- July 19-20 Scrubbing
-
13:05
→
13:10
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (next week?)
- HTCondor 23.4.0 and 23.0.4
- We are planning on making changes to how we use the micro version number.
- Current proposal is to bump the micro version for every release candidate, so you may see larger jumps in the final number for what's available in the release repositories (a short illustration follows at the end of this section).
- XRootD 5.6.6
- This may get punted, depending on whether or not we can get our integration tests to cooperate
- voms-2.10.0-0.31.rc3.1
- We do not currently ship VOMS for EL9 and had to bump the version to get builds working
- We updated the version across all operating systems
- We need testers!
IRIS-HEP Y6Q4 (Aug 2024) milestones
- ARM work (host deployment, forays into OSG Software builds/tests)
- Retire OSG 3.6 and EL7
- Kubernetes Gratia Probe
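A minimal sketch of how the proposed micro-version policy might look in practice; the release-candidate sequence below is invented for illustration only:

```python
# Illustrative only: under the proposed policy, every release candidate bumps the
# micro version, so the number that finally reaches the release repositories can
# skip ahead of what users saw in the previous release.
release_candidates = ["23.4.0", "23.4.1", "23.4.2"]  # hypothetical internal RCs
published_version = release_candidates[-1]           # only the final RC is published
print(f"Version in the release repos: {published_version}")  # -> 23.4.2, not 23.4.0
```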
-
13:10
→
13:25
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Infrastructure and Compute Farm 5m
Speaker: Thomas Smith
-
13:15
Storage 5m
Speaker: Vincent Garonne (Brookhaven National Laboratory (US))
-
Operational interventions including HW and pool failures
-
ATLAS utilizes tokens for transfers, and DATADISK and SCRATCHDISK have been migrated to the CERN FTS Pilot instance
-
USATLAS storage upgraded to dCache 9.2.6 + java-17 (1/22)
-
A temporary workaround is in place for the dCache issue reported below; most recently, refreshing the state of the affected components appears to have reduced how often it occurs.
-
Issue found and reported (www.dcache.org #10575): configuration instabilities observed with WAN TPC writes.
-
Configuration for RHEL8 (HPSS, dCache pools) is ready. New data servers were just delivered; commissioning is starting.
- 1000 LTO8 tapes (= 12 PB) purchased and delivered to BNL receiving
- Procurement process of FY24 storage for ATLAS started
-
Data challenge 24 preparation:
-
Stress-testing and performance evaluation of USATLAS storage (tuning, etc.) with dedicated FTS servers
- Joint USATLAS-USCMS site transfer stress testing (1/24):
-
Achieved WAN READ throughput surpassing 400 Gb/s (a rough unit conversion is sketched at the end of this list).
-
No issues were identified on the core, door, or pool servers regarding CPU, memory, or load. Traffic distribution among the pools is uniform.
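For context on the 400 Gb/s figure, a rough back-of-the-envelope conversion (my own arithmetic, not part of the report):

```python
# Rough conversion of the sustained WAN READ rate quoted above (decimal units).
rate_gbps = 400                        # gigabits per second
rate_GBps = rate_gbps / 8              # = 50 gigabytes per second
per_day_PB = rate_GBps * 86400 / 1e6   # GB/day -> PB/day
print(f"{rate_GBps:.0f} GB/s, roughly {per_day_PB:.1f} PB/day")  # ~50 GB/s, ~4.3 PB/day
```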
-
13:20
Tier1 Services 5m
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- Networking:
- Accounting:
- WLCG 2023 official: over pledge; CPU/Disk/Tape: 110%/104%/102%
- December 2023: validated at 12632 kHS23*days (a rough conversion to average capacity is sketched at the end of this section)
- BNL provides 25% more CPU resources than accounted for due to HIMEM.
- BNL provided 13% (the most) of all ATLAS HIMEM jobslots for December 2023 (Monitoring link)
- Problem "solved" by limiting the number of HIMEM slots in CRIC to 5k.
- To be discussed at S&C Week
- Farm:
- Corepower:
- Starting to revise the BNL CPU corepower declared in CRIC. Current value: 12.69
- ALMA9 transition:
- Full chain is set up. HC jobs are segfaulting; debugging is ongoing
- IPv6 transition:
- One node fully operational. Observing performance. Potentially scaling up by the end of February
- Upgrade to HTCondor 10.x - nearly done.
- VP queue - running as expected. Limited by the number of slots opportunistically available
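As a rough cross-check of the December accounting figure quoted above, assuming the value is an integral over a 31-day month (my own arithmetic, not from the report):

```python
# Convert the validated integral (kHS23*days) into an average delivered capacity.
validated_khs23_days = 12632          # December 2023, validated accounting value
days_in_month = 31
avg_khs23 = validated_khs23_days / days_in_month
print(f"Average delivered capacity: ~{avg_khs23:.0f} kHS23")  # ~408 kHS23
```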
-
13:25
→
13:30
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Good production with all sites at or near capacity after MWT2 recovered from the dCache upgrade (and before too!)
- NET2 did extensive work on their network which led to some reduction in production.
- Quarterly reporting submitted before official deadline but well after Shawn's requested date.
- Still working on getting HEPSpec06 to run on the benchmarking machines that Dell provided.
- Needed to compare HEPScore to HEPSpec06 to understand how the $/HS stacks up.
- NET2 has negotiated a good price, ~$5.30/HEPSpec, for servers with dual 128C/256T AMD 9754 Bergamo CPUs (512 threads total); a rough sketch of the $/HS arithmetic follows this list.
- Dell tells me that NVMe is cheaper than SAS, and Rafael's results show that using NVMe improves the HEPScore significantly.
- There will be no Tier 2 technical meeting next week because of S&C Week and the reminder email will be marked as canceled.
- I have corrected the length and dates of the technical meetings in Indico.
- I will miss the Jamboree because of a conflicting TRT meeting in Bonn. I will join the S&C week on Wednesday morning.
- I need a speaker for the topical talk in two weeks...
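A minimal sketch of the $/HEPSpec comparison mentioned above; the server price and whole-server benchmark score are invented placeholders (only the ~$5.30/HS figure comes from the notes):

```python
# Illustrative $/HS calculation with placeholder numbers.
server_cost_usd = 30000        # hypothetical quote for a dual AMD 9754 (512-thread) server
server_hepspec = 5660          # hypothetical HEPSpec06 score for the whole server
dollars_per_hs = server_cost_usd / server_hepspec
print(f"~${dollars_per_hs:.2f}/HEPSpec")  # with these placeholders, ~$5.30/HEPSpec
```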
-
13:30
→
13:35
WBS 2.3.3 HPC Operations 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
- Perlmutter: Jobs are running and failing after switching from Shifter, now that the user namespace issue is resolved.
- Switch to usatlas Globus collaboration endpoint
- The connection appears to be not very stable; connection timeouts in many tasks
- Fixed the Python version issue with the pilot
- Preparing to switch to Podman-HPC
- Working with Asoka deSilva to sort things out.
- TACC:
- Switched to and tested the TACC Globus v5 endpoint
- Setup needed in both the CILogon and ACCESS accounts
- Fixed the errors with the Harvester environment with Python 3.9
- Help from Lincoln with the grid certificates
-
13:35
→
13:50
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:35
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
-
13:40
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:45
Analysis Facilities - Chicago 5m
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- Added a reserved queue to the HTCondor batch system. It currently consists of one node and would allow an analysis team to run whole-node jobs for the MadGraph application (a submit sketch follows at the end of this list).
- Working on taming the Ceph file system. It has some stability issues that warrant more investigation; very high memory usage on the MDS is observed during incidents.
- Volume mounts are monitored and alerted on for both HTCondor workers and the interactive login nodes.
- Will work on the Rook-Ceph upgrade. Had some trouble last time due to some K8s deprecations. It appears the newer Ceph version (v17) has an alertable metric (slow MDS ops) for what we usually observe during incidents.
- Will also update the OS.
- The AnalysisBase image has been updated to the latest release, 24.2.37, with all libraries updated. Now shipping the dev version of uproot, which includes many fixes for reading PHYSLITE data files.
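A minimal sketch, using the HTCondor Python bindings, of how a whole-node MadGraph job could target such a reserved node; the AF_Reserved machine attribute, script name, and resource numbers are assumptions for illustration, not the facility's actual configuration:

```python
import htcondor  # HTCondor Python bindings

# Hypothetical whole-node submit description aimed at the reserved node.
# "AF_Reserved" is an assumed machine ClassAd attribute, not the real site config.
sub = htcondor.Submit({
    "executable": "run_madgraph.sh",              # user's MadGraph wrapper (placeholder)
    "request_cpus": "64",                         # ask for the full node (illustrative)
    "request_memory": "128GB",
    "requirements": "TARGET.AF_Reserved == True",
    "output": "mg.out",
    "error": "mg.err",
    "log": "mg.log",
})

schedd = htcondor.Schedd()
schedd.submit(sub)  # queue one whole-node job on the reserved resource
```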
-
13:50
→
14:05
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Lots of updates on tokens in Rucio/FTS and monitoring for DC24 at today's WLCG DOMA General (link)
- LHCONE Jumbo Frames survey: https://limesurvey.web.cern.ch/977971
-
13:50
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
Open and dormant GGUS tickets:
- Storage Tokens at SWT2_CPB (link)
- CMS SAM/ETF jobs at BNL (link)
- Site Network Monitoring at SWT2_CPB (link)
- Site Network Monitoring at OU (link)
- Alma 9 image testing ongoing at BNL T1 farm and T3 interactive
- BNL xcache crashed yesterday (possible hw issue), otherwise smooth running
-
13:55
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Elasticsearch
- Some cluster stability issues. We need an upgrade anyway, so we will rebuild the cluster next week.
- XCache
- Added a new node at MWT2. We will rebuild the one causing issues.
- Need a stronger XCache node at AGLT2; the current one can't keep up.
- Added a SLATE-installed XCache node at Wuppertal; working on setting up their VP queue. This is a test for the larger deployment now that Germany is getting rid of ATLAS sites at universities and moving resources to HPCs.
- We need better ticketing when an XCache has issues. If the Squid people can monitor and ticket sites (and there are many more of them), we should be able to do the same for XCaches. We have enough tests; we just need another person to look at them every day and create a ticket if needed (a rough sketch of such a daily check follows below).
- Varnishes
- Work fine, but we get tickets when we move traffic from one to another.
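A minimal sketch of the kind of daily XCache check being suggested; the results endpoint, response format, and failure threshold are hypothetical placeholders, and opening the actual GGUS ticket would remain a manual step:

```python
import requests  # endpoint and JSON layout below are placeholders, not a real service

RESULTS_URL = "https://example.org/xcache-tests/latest.json"  # hypothetical test results

def failing_caches(max_failures: int = 3) -> list[str]:
    """Return XCache hosts whose recent functional tests exceed the failure threshold."""
    results = requests.get(RESULTS_URL, timeout=30).json()
    return [host for host, r in results.items() if r.get("failures", 0) >= max_failures]

if __name__ == "__main__":
    for host in failing_caches():
        # In practice, this is where one would open (or prompt for) a GGUS ticket.
        print(f"XCache {host} is failing tests - consider opening a ticket")
```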
-
14:00
Facility R&D 5mSpeakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
Presented on Stretched K8S options at last week's facility R&D call
Asking sites for their input and consideration
-
14:05
→
14:10
AOB 5m