US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Facility discussion one week from Friday https://indico.cern.ch/event/1445942/.   We need site presentations describing site details and bottlenecks.  Please mark your calendar.

      Ongoing discussion about organizing an ADC TIM in Jan 2025 (SUNY or UMass) - focus topic R2R4.  Note that there is an R2R4 review in September (ATLAS is preparing).

      From ESnet:  "

      • Currently two of our 400GE links are out of service.
      • The announced emergency maintenance window for these two links runs until 23 August 2024."

       

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Brian Lin will likely be late due to a planning retreat

      Software

    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Conveners: Alexei Klimentov (Brookhaven National Laboratory (US)), Thomas Smith
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))
      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speaker: Doug Benjamin (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speakers: Ofer Rind (Brookhaven National Laboratory), Dr Quilan Huang (BNL)
        • AF Planning Discussion yesterday (indico)
        • Still need to add job wait time metric
        • Investigating prospect of getting more Discourse info for AI analytics
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))
        • Supporting a user doing "columnar analysis" on the AF; the user runs jobs on the coffea-casa instances deployed at the AF. One need is to write output parquet files after a Dask task finishes. Tried using an x509 proxy to write to EOS, but had trouble passing the environment variable to the workers for some code. Pointed the user to our dev instance, which allows writing to the /data area; this worked great. The process also exposed a user provisioning issue on the HTCondor workers, which Lincoln fixed this morning.
        • We also updated coffea-casa to make the ATLAS IAM tokens available in the notebooks.
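          The environment-variable problem noted above could be approached in several ways. The following is only a minimal sketch, assuming the workers are reachable through a `dask.distributed.Client`, whose `run` method executes a function on every connected worker. The helper names `set_worker_env` and `forward_env` are hypothetical, the variable name X509_USER_PROXY is an assumption, and this is not necessarily what was tried at the AF.

```python
import os


def set_worker_env(env):
    """Executed on a worker: copy the given variables into its environment."""
    os.environ.update(env)
    # Report back what the worker now sees, for a quick sanity check.
    return {name: os.environ[name] for name in env}


def forward_env(client, names):
    """Send selected local environment variables to all current workers.

    `client` is assumed to behave like dask.distributed.Client, i.e. to
    offer run(function, *args), which calls the function on each worker.
    Variables that are not set locally are skipped.
    """
    env = {name: os.environ[name] for name in names if name in os.environ}
    return client.run(set_worker_env, env)
```

          A call such as `forward_env(client, ["X509_USER_PROXY"])` would affect only the workers connected at that moment, and the proxy file itself would still have to be readable at the same path on each worker.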
    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • BNL experiencing significant issues related to the latest CVMFS update.  T2 admins are advised to proceed cautiously when updating, particularly on live systems.  Details here.  T2 admins are encouraged to report problems here.
      • SWT2 Alma 9 update is still in progress, in part due to Slurm issues.  Need to update delayed milestone with a new target date.
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
        • ADC
          • LHC: Full-steam data taking. The 10,000th LHC fill finished last night.
          • SAM: Removed GridFTP tests as critical tests for ATLAS. REST ARC-CE based tests to replace them - still in preproduction
          • Tape buffers: Overloaded at some sites that use the same buffer for staging in and out, due to full-steam data taking and the reprocessing campaign. Working on communication between Rucio and Data Carousel and on pinning of files by FTS.
          • CRIC: Restricting RSE actions only to DDM team members. Ongoing.
          • Draining of sites over the weekend, due to DDM not processing all transfer requests immediately as expected. Mitigated, but not understood.
          • VHIMEM Sherpa evgen
            • HIMEM campaign (working with sites to increase meanrss) - postponed until after the summer
            • working with Doug to run it on Perlmutter
        • USATLAS
          • All
            • Drained during the DDM problem.
          • AGLT2
            • Looking at HIMEM. Retired VHIMEM PQ and reduced meanrss to 1.8 GB/core
            • Looking at filling the farm at 99.5% level
            • Two separate sites - one with EL9 and one with SL7
            • requestDisk was added to the JDL to allow site admins to better handle merge jobs
            • Need to understand the relation between SAM and CRIC
          • NET2
            • Gradually draining for the past two weeks. To be understood.
            • Some problem with EventIndex jobs.
          • SWT2
            • Slurm “Kill task failed” investigations
          • TW-FTT
            • Running at the level of 500 slots due to the Alma 9 upgrade (should be at the 2200 level)
            • Running fine since limited to run only simulation and generation
            • New site communication e-group created
        • Details can be found in:
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
        • ServiceX - version 1.4.1 is in testing. Once validated it will be deployed.
        • XCache - rebuilt on EL9. Tested and deployed on all the SLATE sites. ESnet node not ready yet. AGLT2 UM node will be moved to NRP next week.
        • VP - working fine.
        • CREST - firewall is now open. Some issues with the TLS cert; opened a ticket for that. Preparing for testbed setup.
        • Analytics
          • will add new data for AF benchmarks once Juan has example benchmark doc.
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • (Aidan) Working on Keycloak OIDC integration with Netbird, to allow login with IAM
        • (Lincoln) Working on a user 'dashboard' page as a front end to Keycloak
    • 14:25 14:35
      AOB 10m