US ATLAS Computing Facility (Possible Topical)

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  993 2967 7148

Meeting password: 452400

Invite link:  https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Scrubbing prep continues.  WBS 2.3 scrubbing is Tuesday, July 15th (L2 and L3s).

      Quarterly reports for L3 areas in WBS 2.3 will be due Friday July 18th.  Please plan ahead to have them completed.

      Next week is the ATLAS S&C week, focused on R2R4: https://indico.cern.ch/event/1509063/ 

      We need to plan for the next capacity tests as well as capability tests. See https://drive.google.com/drive/folders/1Af7hWa0Zm30EuqsV1PbekSjb--gXAsVG?usp=drive_link

      Alexei: possible collaboration with Weka; details to be discussed. Also noted that a new Google contact for ATLAS has been assigned.

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release (probably next week)

      • cvmfs-2.13.1: fixes a bug, present since cvmfs-2.12.0, that prevents periodic resets to the closest stratum 1, causing performance degradation. Anyone who has upgraded to 2.12.0 or later is encouraged to update as soon as possible (one way to check which stratum 1 a client is using is sketched after this list).
      • The previously mentioned Frontier Squid 6.13 has been delayed and moved into osg-upcoming because it requires manual changes.
      • XRootD 5.8.3-1.3: same as 5.8.3-1.2 except with https://github.com/xrootd/xrootd/pull/2472
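
      For reference, one minimal way to check which stratum 1 a client is currently using, and to force a re-probe, is sketched below. It assumes the cvmfs client tools (cvmfs_talk, typically run as root) are installed and uses atlas.cern.ch only as an example repository name.

      """Check which stratum 1 a CVMFS client is using and force a re-probe (sketch)."""
      import subprocess

      REPO = "atlas.cern.ch"  # example repository; adjust for the repos mounted on the node

      def talk(*args):
          """Run a cvmfs_talk command for REPO and return its output."""
          cmd = ["cvmfs_talk", "-i", REPO, *args]
          return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

      print(talk("host", "info"))    # list the configured stratum-1 hosts and the active one
      print(talk("host", "probe"))   # re-probe so the client reorders hosts by round-trip time
      print(talk("host", "info"))    # confirm the closest stratum 1 is active again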

       

    • 13:10 13:30
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))

      WBS 2.3.1.2 Tier-1 Infrastructure - Jason

      • A GGUS ticket on Varnish monitoring is ongoing.

      WBS 2.3.1.3 Tier-1 Compute - Tom

      • Two black-hole nodes (acas0927 and acas0950), both with hardware issues, were taken offline by Oszkar.
      • VP queue
        • No longer runs merge jobs (thanks to the WFMS team)
        • Due to a misconfiguration it was running only on Intel WNs; now fixed. 

      WBS 2.3.1.4 Tier-1 Storage - Ivan (Carlos is on leave)

      • Internal Tape discussion:
        • Increased the limit on queued files for staging from BNL tape in the Data Carousel settings to 200k
        • Tim is compiling a list of files grouped by tape
        • Increasing the staging request timeout was discussed, but it is not possible from Rucio (the timeout is hardcoded for all sites) 

      WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan

      • NTR
      • 13:10
        Tier-1 Infrastructure 5m
        Speaker: Jason Smith
      • 13:15
        Compute Farm 5m
        Speaker: Thomas Smith
      • 13:20
        Storage 5m
        Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
      • 13:25
        Tier1 Operations and Monitoring 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
    • 13:30 13:40
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Reasonably good running over the past two weeks, with some problems.
        • AGLT2 had two partial drains on June 25-26 and June 29-30.
        • MWT2 was affected by a mistake made by IU Networking on June 17, causing a significant drain on June 17-18.
        • MWT2 took a downtime on June 24-25 to upgrade to dCache version 10.2.13.
        • NET2 had its tape cache fill up; transfers to NET2 tape are currently off while data already in the cache is written to tape and then removed from the cache.
        • OU was completely drained June 18-20. More recently a black-hole node caused some trouble.
        • TW-FTT continued to have job failures and periods of low CPU efficiency.
      • Work continues at MSU and UTA on updating to EL9 and putting FY24 purchases online.
        • MSU has made significant progress over the last two weeks and has their install process working.
        • UTA continues setting up a test storage cluster using EL9 and their new storage servers.
      • The quarterly report is due July 18 so that management can review the L3 reports while writing their own reports.
      • Rafael and I are still discussing various possible funding scenarios and how we would respond.
        • It takes careful thinking to understand how to minimize the effect of sharp budget decreases.
    • 13:40 13:50
      WBS 2.3.3 Heterogeneous Integration and Operations

      HIOPS

      Convener: Rui Wang (Argonne National Laboratory (US))
      • 13:40
        HPC Operations 5m
        Speaker: Rui Wang (Argonne National Laboratory (US))

        Perlmutter: running smoothly; both CPU and GPU are above expectations.

      • 13:45
        Integration of Complex Workflows on Heterogeneous Resources 5m
        Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
    • 13:50 14:10
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
        • ~6K dedicated cores added to Tier-3 Alma 9 pool (retired from Tier-1)

        • SL7 submit hosts will be shut down next Monday
        • Progress on dCache data display issue within a Pod
          • By configuring the OpenShift worker node to retrieve LDAP information like the other nodes, the pod is able to correctly display string-based user and group names for dCache data. This solution is simpler because it does not require a sidecar container (a minimal check of the name lookup is sketched after this list).

          • We have manually applied and verified this approach on a test node. However, there are still some issues related to the configuration file setup. 
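
          For reference, a minimal check of the name lookup this fix enables is sketched below. Run inside the pod, it assumes only a placeholder path on the dCache mount; the standard pwd/grp modules resolve names through the same NSS/LDAP configuration the worker node now provides.

          """Check whether numeric owners of dCache-mounted files resolve to names (sketch)."""
          import grp
          import os
          import pwd

          PATH = "/path/to/dcache/mount"  # placeholder; use the actual mount point in the pod

          st = os.stat(PATH)
          try:
              user = pwd.getpwuid(st.st_uid).pw_name   # resolved via NSS (LDAP if configured)
          except KeyError:
              user = str(st.st_uid)                    # falls back to the raw UID
          try:
              group = grp.getgrgid(st.st_gid).gr_name
          except KeyError:
              group = str(st.st_gid)
          print(f"{PATH}: owner={user} group={group}")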

        • Had a meeting to discuss AF status, current issues, and the following work:

          • Started looking at KubeSpawner to understand how to use UID/GID to start containers as a non-root user (see the configuration sketch after this list).

          • Work on the HTTPS proxy for external access to JupyterHub running within a pod
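
          As a rough illustration of the KubeSpawner item above, a possible jupyterhub_config.py fragment is sketched below. It is not a tested configuration: the account-lookup helper and the UID/GID values are placeholders, and the exact option names should be checked against the KubeSpawner documentation for the deployed version.

          # jupyterhub_config.py fragment (sketch only)
          c = get_config()  # noqa -- provided by JupyterHub's config loader

          c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

          def lookup_account(username):
              """Placeholder: look the user up in LDAP or a site database and
              return their numeric UID/GID; the values here are made up."""
              return {"uid": 12345, "gid": 6789}

          # KubeSpawner accepts callables (taking the spawner) for these options,
          # so each single-user container starts with the user's own UID/GID
          # instead of running as root.
          c.KubeSpawner.uid = lambda spawner: lookup_account(spawner.user.name)["uid"]
          c.KubeSpawner.gid = lambda spawner: lookup_account(spawner.user.name)["gid"]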

           

      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speaker: Fengping Hu (University of Chicago (US))

        Dask-Gateway / HTCondor Integration

        • Deployment is now functional.

        • Implemented a backend that runs the scheduler pod in Kubernetes while spawning workers as HTCondor jobs (a standalone sketch of the worker-submission side is shown after this list).

        • Currently, the code has some coupling with our user management (CIConnect); efforts are underway to decouple this before making the repository public so that other interested sites can adopt it as well.
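
        To illustrate the worker-side idea only (this is not the backend code itself, which is not yet public), the sketch below submits Dask workers as HTCondor jobs that dial back to a scheduler address, e.g. one exposed from a Kubernetes pod. The scheduler address, resource requests, and worker command are placeholders, and it assumes the htcondor Python bindings plus a Dask installation on the worker nodes.

        """Spawn Dask workers as HTCondor jobs pointing at an external scheduler (sketch)."""
        import htcondor

        SCHEDULER_ADDRESS = "tcp://dask-scheduler.example.org:8786"  # placeholder address

        submit = htcondor.Submit({
            "executable": "/usr/bin/env",
            "arguments": f"dask-worker {SCHEDULER_ADDRESS} --nthreads 4 --memory-limit 8GB",
            "request_cpus": "4",
            "request_memory": "8GB",
            "output": "dask-worker.$(Cluster).$(Process).out",
            "error": "dask-worker.$(Cluster).$(Process).err",
            "log": "dask-worker.$(Cluster).log",
        })

        schedd = htcondor.Schedd()               # local schedd
        result = schedd.submit(submit, count=2)  # start two workers
        print("Submitted worker cluster", result.cluster())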

    • 14:10 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:10
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))

        ADC Operations

        • First sites are moving to 16-core job slots under CMS pressure (IN2P3-CC). The impact on ADC workflows is under evaluation.
        • Transferring space from SCRATCHDISK to DATADISK is ongoing. Greedy deletions were enabled for BNL, SWT2 and MWT2 today (see the sketch after this list for how this is typically set per RSE).
        • Cleaned up Rucio VOMS roles today (Details)
        • Varnish deployment is ongoing.
        • Work still ongoing to add VP queue at NET2 pointing to the ESNET xcache
        • Transfer failures, seemingly related to certificate issues, are still ongoing at TW-FTT (GGUS)
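
        For reference, greedy deletion is normally switched on per RSE through an RSE attribute; a sketch using the Rucio client is below. The attribute name (greedyDeletion) and the RSE name are assumptions/placeholders, and admin credentials are required.

        """Enable greedy deletion on an RSE via the Rucio client (sketch)."""
        from rucio.client import Client

        client = Client()
        rse = "BNL-OSG2_SCRATCHDISK"  # placeholder RSE name

        # With this attribute set, the reaper deletes unlocked replicas on the RSE
        # without waiting for the usual free-space thresholds to be reached.
        client.add_rse_attribute(rse=rse, key="greedyDeletion", value=True)
        print(client.list_rse_attributes(rse))
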
      • 14:15
        Services DevOps 5m
        Speaker: Ilija Vukotic (University of Chicago (US))
      • 14:20
        Facility R&D 5m
        Speaker: Lincoln Bryant (University of Chicago (US))
        • EOS deployment moved to bare metal and Puppetized; working on understanding how to properly authorize users
        • HTCondor pool setup on rp1 is ongoing; working through auth issues between the schedd, collector, and startds 
        • Armada integration is ongoing; the executor is deployed on the UChicago AF so that work can ostensibly be sent from RP1, but the scheduler seems to be non-functional for reasons TBD 
    • 14:25 14:35
      AOB 10m