US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
1. WBS 2.3 Facility Management News
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
- Spring HEPiX was last week and there are lots of interesting talks attached to the timetable: https://indico.cern.ch/event/1377701/ Note that Fall HEPiX will be at OU in early November. Start getting your talks ready!
- Quarterly reports are due. In WBS 2.3 we are missing: 2.3.2, MWT2, SWT2, HPC (2.3.3), 2.3.4.1, 2.3.4.3, CIOPS (2.3.5*).
- Milestone updates are due. This link has the list of known milestone updates/changes/additions. Let Rob, Alexei and Shawn know of others.
- Today through Friday is the Joint ATLAS / IRIS-HEP Kubernetes Hackathon (https://indico.cern.ch/event/1384683/), so quite a few people are unable to join our meeting today.
- The DC24 analysis for USATLAS sites is mostly complete but still needs a summary. We have already merged the site results into the WLCG DOMA final report.
- Working on L3/L4 management team composition.
Tier-1
- Discussion with ADC about bulk data rebalancing [BNL received ~the same data volume to archive in 30 days as for all of 2023].
  - From communication with ADC Coordination: "I see how this information might not propagate to the right places. We agreed that for the next time a similar tape campaign starts, an approximation of the data amount expected per site and the duration of the campaign is going to be given directly to the sites in advance."
- Rolling upgrade of dCache at BNL: now running 9.2.17.
- Scheduled for next week: upgrade the firmware of the two Oracle SL8500 ATLAS libraries. Oracle needs 2-3 hours of downtime and suggests next Monday or Tuesday (April 29 or 30) [green light from ADC Coordination].
- Ongoing discussion about HC auto-exclusion of sites.
2. OSG-LHC
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
3. Infrastructure and Compute Farm
Speaker: Thomas Smith
4. Storage
Speaker: Jason Smith
5. Tier1 Services
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- Issues:
  - CVMFS wrapper issues on one node. Solved.
  - Failed pilots due to HTCondor retries. Ongoing.
  - Blacklisted on 04/12/24, 20:20 CET due to a dCache pool problem. Solved.
- Interventions:
  - dCache rolling upgrade to v9.2.17 (GitHub: 7525). Finished on 04/24/2024.
  - Tape library firmware update, planned for 04/29. No downtime required.
- Misc:
  - DC24 report finished.
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Good running over the past few weeks with minor problems.
  - AGLT2: job failures involving CVMFS and certain jobsets.
  - NET2: almost all BU servers in production (a success, not a problem!).
  - SWT2 CPB: working on LSM and several other issues.
- Working on reporting and milestones.
- Reworking the capacity sheet to make it easier to enter data.
- I am really worried that some sites will not make the June 30 deadline to retire EL7.
- Also need to replace OSG version 3.6 with version 23 by June 30 (a minimal per-node check for both deadlines is sketched after this list).
- Still no news on the final funding increment?
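As a small aid for tracking the two June 30 deadlines above, here is a minimal per-node check sketch in Python. It assumes a standard /etc/os-release and that the OSG series can be read from the version of an "osg-release" RPM; both are assumptions about the node setup, so adapt paths and package names to each site's configuration management.

```python
# Hypothetical per-node check for the June 30 deadlines: flags hosts still
# on EL7 or still carrying the OSG 3.6 series. Assumes a standard
# /etc/os-release and that the OSG series is the version of the
# "osg-release" RPM; adjust for local tooling.
import re
import subprocess

def el_major_version(path="/etc/os-release"):
    """Return the EL major version (7, 8, 9, ...) from os-release, or None."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("VERSION_ID="):
                    return int(line.split("=", 1)[1].strip().strip('"').split(".")[0])
    except (OSError, ValueError):
        pass
    return None

def osg_series():
    """Return the installed OSG series ('3.6', '23', ...) via rpm, or None."""
    try:
        out = subprocess.run(
            ["rpm", "-q", "--qf", "%{VERSION}", "osg-release"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    match = re.match(r"\d+(\.\d+)?", out)
    return match.group(0) if match else None

if __name__ == "__main__":
    el, osg = el_major_version(), osg_series()
    print(f"EL major version: {el}, OSG series: {osg}")
    if el == 7:
        print("WARNING: still on EL7 -- must be retired by June 30")
    if osg is not None and osg.startswith("3.6"):
        print("WARNING: still on OSG 3.6 -- migrate to OSG 23 by June 30")
```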
6. WBS 2.3.3 HPC Operations
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
- TACC
  - Job failure due to input file:
    - Pilot fails to create the symbolic link under the PanDA_Pilot-* working area (no error message from pilot.copytool.mv); the PanDA_Pilot-* permission had been set to 750 instead of 770 (see the permission-check sketch after this list).
      - Switched to pilot v3.5.2.12. A local test in the debug queue succeeded.
    - Requesting a new test task.
      - The previous one was broken because of a wrong setting in the SLURM parameters for the regular queue; jobs failed to be submitted afterwards.
- NERSC
  - Starting to get user requests (based on an invite from Mike Hance).
  - Currently running 5-node SLURM submissions for up to 20 hours.
  - Seeing both evgen and simulation jobs.
  - Still need more work to keep up with the uniform-usage line.
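The 750-vs-770 problem noted for TACC is easy to spot from the filesystem side. Below is a minimal sketch (not the actual pilot or Harvester code) that scans a scratch area for PanDA_Pilot-* working directories and restores group rwx; the scratch path is a hypothetical placeholder for the site's real working area.

```python
# Sketch only (not pilot/Harvester code): find PanDA_Pilot-* working
# directories whose group permissions dropped to 750 and restore 770 so
# that symbolic-link creation by group members works again.
import glob
import os
import stat

SCRATCH = "/scratch/atlas"  # hypothetical scratch area; use the site's real path

for workdir in glob.glob(os.path.join(SCRATCH, "PanDA_Pilot-*")):
    if not os.path.isdir(workdir):
        continue
    mode = stat.S_IMODE(os.stat(workdir).st_mode)
    if mode & 0o070 != 0o070:  # group rwx not fully set (e.g. 0o750)
        print(f"fixing {workdir}: {oct(mode)} -> 0o770")
        os.chmod(workdir, 0o770)
```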
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
7. Analysis Facilities - BNL
Speaker: Ofer Rind (Brookhaven National Laboratory)
8. Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
9. Analysis Facilities - Chicago
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
10. ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- ADC
  - LHC operation started (ramped up to 2000 bunches). No problems on the data-processing side.
  - RPG / Data Consolidation campaign is currently running, which adds additional traffic to the tape systems.
  - ARM resources at CERN increased to 2k slots (Monit).
  - HC blacklisting:
    - 2 of 3 AFTs were relying on a single input file, effectively testing only the one disk server storing that file. These tests will be replaced by the same test with a newer software version and more, smaller files (DAODs instead of AODs).
  - IGTF Root CAs still use SHA-1 (deprecated 10 years ago), which forces EL9 WNs to re-enable SHA-1 (a sketch for spotting SHA-1-signed trust anchors follows this list).
    - ATLAS officially requested an upgrade of the CAs (IGTF) from WLCG.
    - For OSG, Alma9 nodes switch to SHA-1 (OSG documentation).
    - WLCG Service Report (link): "Being followed up, but a quick solution looks very unlikely".
    - Relevant GitHub issue: https://github.com/dlgroep/fetch-crl/issues/4
- USATLAS
  - SWT2
    - Higher load from high-I/O jobs overloads the storage, which makes HC jobs fail, which in turn blacklists the site.
      - Could limit the average I/O per PQ in CRIC.
      - It is not transparent what the configuration behind the internal xrootd door is.
    - No news on LSM decommissioning.
  - AGLT2 PQs reconfigured (agreed with AGLT2 admins; see the queue-split illustration after this list):
    - AGLT2: maxrss: 32000
    - AGLT2_VHIMEM: minrss: 32001, maxrss: 48000
  - SLAC
    - Wei managed to configure the local Frontier.
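As a small illustration of the IGTF SHA-1 item above, the sketch below lists which CA certificates on a worker node are still SHA-1-signed. It assumes the Python "cryptography" package and the conventional /etc/grid-security/certificates trust-anchor directory; adjust the path for non-standard installations.

```python
# Illustration of the SHA-1 CA issue: report trust-anchor certificates that
# are still signed with SHA-1. Assumes the 'cryptography' package and the
# conventional /etc/grid-security/certificates layout.
import glob

from cryptography import x509

CERT_DIR = "/etc/grid-security/certificates"

for path in sorted(glob.glob(f"{CERT_DIR}/*.pem")):
    try:
        with open(path, "rb") as f:
            cert = x509.load_pem_x509_certificate(f.read())
    except (OSError, ValueError):
        continue  # skip unreadable files and non-certificate PEMs
    algo = cert.signature_hash_algorithm
    if algo is not None and algo.name == "sha1":
        print(f"SHA-1 signed: {path} ({cert.subject.rfc4514_string()})")
```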
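And as a simple illustration of the reconfigured AGLT2 queue split (illustration only, not the actual PanDA brokering code), a job's maxRSS request in MB decides which of the two PQs it can land in:

```python
# Illustration only (not PanDA brokering code): route a job between the two
# reconfigured AGLT2 panda queues by its maxRSS request in MB.
def target_queue(maxrss_mb: int) -> str:
    """Pick the AGLT2 PQ for a job given its maxRSS request (MB)."""
    if maxrss_mb <= 32000:
        return "AGLT2"           # maxrss: 32000
    if maxrss_mb <= 48000:
        return "AGLT2_VHIMEM"    # minrss: 32001, maxrss: 48000
    return "no matching AGLT2 queue"

if __name__ == "__main__":
    for rss in (8000, 32000, 40000, 64000):
        print(f"{rss} MB -> {target_queue(rss)}")
```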
11. Service Development & Deployment
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
12. Facility R&D
Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
13. AOB