US ATLAS Computing Facility

US/Eastern

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Spring HEPiX was last week, and many interesting talks are attached to the timetable: https://indico.cern.ch/event/1377701/ Note that Fall HEPiX will be at OU in early November, so start getting your talks ready!

      Quarterly reports are due. In WBS 2.3 we are missing: 2.3.2, MWT2, SWT2, HPC (2.3.3), 2.3.4.1, 2.3.4.3, CIOPS (2.3.5*)

      Milestone updates are due. This link has the list of known milestone updates, changes, and additions. Let Rob, Alexei, and Shawn know of any others.

      Today through Friday is the Joint ATLAS / IRIS-HEP Kubernetes Hackathon (https://indico.cern.ch/event/1384683/), so quite a few people are unable to join our meeting today.

      The DC24 analysis for USATLAS sites is mostly complete but still needs a summary.  We have already merged the site results into the WLCG DOMA final report.

       

      Working on L3/L4 Management team composition

      Tier-1

      • Discussion with ADC about bulk data rebalancing [in 30 days BNL received about the same data volume to archive as in all of 2023]
        • From communication with ADC Coordination: "I see how this information might not propagate to the right places.
          We agreed that for the next time a similar tape campaign starts, an approximation of the data amount expected per site and the duration of the campaign is going to be given directly to the sites in advance."

      • dCache at BNL is now running 9.2.17 after a rolling upgrade
      • Scheduled for next week: firmware upgrade of the two Oracle SL8500 ATLAS libraries. Oracle needs 2-3 hours of downtime and suggests next Monday or Tuesday, April 29 or 30 [green light from ADC coordination]
      • Ongoing discussion about HC auto-exclusion of sites

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • 13:10 13:25
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Infrastructure and Compute Farm 5m
        Speaker: Thomas Smith
      • 13:15
        Storage 5m
        Speaker: Jason Smith
      • 13:20
        Tier1 Services 5m
        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
        • Issues:
          • CVMFS wrapper issues - from one node. Solved.
          • Failed pilots due to HTCondor retries. Ongoing.
          • Blacklisted on 04/12/24, 20:20 CET due to dCache pool problem. Solved.
        • Intervention:
          • dCache rolling upgrade to v.9.2.17 (GitHub:7525). Finished on 04/24/2024.
          • Tape Libraries firmware update - planned for 04/29. No downtime required.
        • Misc
          • DC24 report finished.
    • 13:25 13:35
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Good running over the past few weeks with minor problems.
        • AGLT2: job failures involving CVMFS and certain jobsets
        • NET2: almost all BU servers are in production (a success, not a problem!)
        • SWT2 CPB working on LSM and several other issues.
      • Working on reporting and milestones.
      • Reworking the capacity sheet to make it easier to enter data.
      • I am really worried that some sites will not make the June 30 deadline to retire EL7.
        • Also need to replace OSG version 3.6 with version 23 by June 30.
      • Still no news on the final funding increment?
    • 13:35 13:40
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
      • TACC
        • Job failure due to input file
          • Pilot fails to create the symbolic link under the working area PanDA_Pilot-* (no error message from pilot.copytool.mv)
            • PanDA_Pilot-* permissions had been set to 750 instead of 770
          • Switched to pilot v3.5.2.12. Local test in the debug queue succeeded
        • Requesting new test task
          • The previous one was broken because of an incorrect setting in the SLURM parameters for the regular queue; jobs failed to submit afterward
      • NERSC
        • Starting to get user requests (based on an invite from Mike Hance)
        • Currently running with 5-node SLURM submissions for up to 20 hrs.
        • seeing both evgen and simulation jobs. 
          • still need more work to keep up with the uniform usage line
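The TACC permission issue above (750 instead of 770 on PanDA_Pilot-*) can be illustrated with a minimal sketch. It assumes the process creating the symbolic link reaches the directory through its group permissions rather than as the owner; the minutes do not state this. The helper name `group_can_create_entries` is hypothetical and only inspects mode bits.

```python
import stat


def group_can_create_entries(mode: int) -> bool:
    """Return True if the group bits allow creating files or symlinks in a directory."""
    # Adding a directory entry needs both write and search (execute) permission.
    needed = stat.S_IWGRP | stat.S_IXGRP
    return (mode & needed) == needed


for mode in (0o750, 0o770):
    # stat.filemode renders the bits the way `ls -l` would for a directory.
    print(oct(mode), stat.filemode(stat.S_IFDIR | mode), group_can_create_entries(mode))
```

With mode 750 the group has read and search access only, so a group member cannot add a symlink to the working area; with 770 it can.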
    • 13:40 13:55
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:40
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
      • 13:45
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
    • 13:55 14:10
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 13:55
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
        • ADC
          • LHC Operation started (ramped up to 2000 bunches). No problems on data processing side.
          • RPG/Data Consolidation campaign is currently running, which adds additional traffic to tape systems.
          • ARM resources at CERN increased to 2k slots (Monit)
            • Running only Full Simulation (the only physics-validated ATLAS workflow).
            • Extremely low failure rate: 1/2650 failed/finished jobs (Monit)
            • For the same period at CERN x86 the failure rate for Full Simulation is 3% (Monit)
          • HC blacklisting:
            • 2/3 AFTs were relying on a single input file, which effectively tested only the one disk server storing that file. These tests will be replaced with the same test using a new software version and more, smaller files (DAODs instead of AODs).
          • IGTF Root CAs are still using SHA-1 (deprecated 10 years ago), which forces EL9 WNs to re-enable SHA-1.
            • ATLAS officially requested an upgrade of the CA (IGTF) from WLCG.
            • For OSG, Alma9 nodes switch to SHA-1 (OSG Documentation)
            • WLCG Service Report: Link: “Being followed up, but a quick solution looks very unlikely”
            • Relevant GitHub issue: https://github.com/dlgroep/fetch-crl/issues/4
        • USATLAS
          • SWT2
            • Higher load from high-I/O jobs overloads the storage, which makes HC jobs fail, which in turn blacklists the site.
              • Can limit the average I/O per PQ in CRIC
              • The configuration behind the internal xroot door is not transparent.
            • No news on LSM decommissioning
          • AGLT2 PQs reconfigured (agreed with AGLT2 admins):
            • AGLT2: maxrss:32000
            • AGLT2_VHIMEM: minrss:32001, maxrss:48000
          • SLAC
            • Wei managed to configure the local Frontier
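The AGLT2 queue split noted above can be read as two non-overlapping memory windows. The sketch below is illustrative only: it assumes the values are in MB, that both bounds are inclusive, and that AGLT2 has no lower bound (modeled as 0). The names `QueueLimits` and `eligible_queue` are hypothetical and do not reflect the actual PanDA brokerage code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class QueueLimits:
    name: str
    minrss: int  # assumed MB, inclusive lower bound (0 when unset)
    maxrss: int  # assumed MB, inclusive upper bound


# Values taken from the minutes; the minrss of 0 for AGLT2 is an assumption.
QUEUES = (
    QueueLimits("AGLT2", 0, 32000),
    QueueLimits("AGLT2_VHIMEM", 32001, 48000),
)


def eligible_queue(job_rss_mb: int, queues: Sequence[QueueLimits] = QUEUES) -> Optional[str]:
    """Return the first queue whose memory window contains the job's request."""
    for queue in queues:
        if queue.minrss <= job_rss_mb <= queue.maxrss:
            return queue.name
    return None  # no queue at this site accepts the request


print(eligible_queue(16000))
print(eligible_queue(32001))
print(eligible_queue(50000))
```

Under these assumptions a job requesting 32000 MB or less matches AGLT2, a job between 32001 and 48000 MB matches AGLT2_VHIMEM, and anything larger matches neither queue.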
      • 14:00
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • 14:05
        Facility R&D 5m
        Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    • 14:10 14:20
      AOB 10m