US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      WLCG workshop on analysis facilities the weekend before CHEP 2023: https://indico.cern.ch/event/1230126/


      From Zach:


      Thank you all for attending S&C Week #74! Recordings have been added to the agenda.


      Feel free to sign up for atlas-comp-deriv-phys-physlite, a new list for the coordination, discussion, and advertising of PHYS/LITE derivation production campaigns.

      The WLCG DOMA meeting was today:  https://indico.cern.ch/event/1232370/     Notes on preparing for the next WLCG Data Challenge are at https://docs.google.com/document/d/1qorG3JYNW5XZx51pTyJM6s_t7-blYr9r--FkYUyhQeQ/edit# 

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • EL9
        • Initial EL9 tests kicked off yesterday: found some missing packages
        • We have some concerns about issues with XRootD, EL9, and OpenSSL 3
      • HTCondor
        • 10.2.0 targeted for an upcoming release (this week?)
        • HTCondor-CE 6 targeted for some time in the next few weeks: the major version was bumped to align the authentication/authorization configuration with the rest of HTCondor, but it was done in such a way that it should be transparent to sites
        • See the F2F slides for a stab at the Venn diagram of GSI + EL9 + HTCondor versions: https://indico.cern.ch/event/1201515/contributions/5141108/attachments/2559076/4411772/2022-12-02.osg-lhc-atlas-f2f.pdf
        • N.B. it is important to differentiate between GSI and X.509 support: we are dropping GSI support at the authentication level; X.509 proxy delegation down to the worker node will continue to work
      • Throughput Computing 2023 in July
        • Joint HTCondor Week and OSG All Hands in Madison, WI
        • For the ATLAS (and CMS?) session, do you need AV equipment?
    • 13:20 13:25
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      smooth running

      Farm team: starting tests of HTCondor 10, including a new HTCondor-CE submitting to HTCondor 9 resources, so that a rolling HTCondor 10 upgrade can be done.

      Farm team successfully recovered more compute nodes (replaced failed hard drives)

    • 13:25 13:45
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonably good running since the previous meeting on January 3rd.
        • NET2.0 (BU) shut down for good on January 11th, except that the BU_NESE storage remains available until the NET2.1 (UMass) storage is online and the data can be transferred off of it to the new storage. At that point the old storage will be added to the new NET2.1 storage pool.
      • Progress continues on NET2.1 which Eduardo will describe in their report.
      • We should be able to procure in a couple of weeks once the paperwork is fully signed.
      • Operations plans are due for each Tier 2 on February 28.
        • The Google directory holding the operations plans is here. Please use the AGLT2 plan as a template.
      • 13:25
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen

        A file missing from a Rucio dataset caused transfer errors; we declared the file lost to Rucio.
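
        For reference, lost replicas are usually flagged through the Rucio client API (or the equivalent rucio-admin CLI). A minimal sketch, assuming a configured Rucio client environment; the PFN and reason below are placeholders, not the actual AGLT2 values:

        # Minimal sketch: declare a lost replica to Rucio so it is re-transferred
        # from another copy (or marked lost if no other copy exists).
        # Assumes a configured Rucio client (account + X.509/token auth); the PFN
        # below is a placeholder, not the real AGLT2 file.
        from rucio.client import Client

        client = Client()
        bad_pfns = ["davs://dcache-door.example.edu:2880/pnfs/path/to/lost/file.root"]
        client.declare_bad_file_replicas(bad_pfns, reason="file lost on disk at AGLT2")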

        One of the dCache doors at the UM site had its /var partition flooded by iptables logging, which caused a 40% job failure rate.
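
        Not discussed in the meeting, but as a preventive illustration: a minimal disk-usage check (the partition path and 85% threshold are assumptions) that could alert before runaway logging fills /var on a door again:

        # Minimal sketch: warn when a partition crosses a usage threshold, so that
        # runaway logging (e.g. iptables messages) is caught before it fills /var
        # and breaks a dCache door. Path and threshold are illustrative assumptions.
        import shutil

        def check_partition(path="/var", threshold=0.85):
            usage = shutil.disk_usage(path)
            used_fraction = usage.used / usage.total
            if used_fraction > threshold:
                print(f"WARNING: {path} is {used_fraction:.0%} full "
                      f"({usage.used / 1e9:.1f} of {usage.total / 1e9:.1f} GB)")
            return used_fraction

        if __name__ == "__main__":
            check_partition()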

        Fixed a few things (mostly adding compatible library files and more restrictive sudo authentication) to enable BOINC backfill jobs to continue running on CentOS Stream 8.

        We completed three milestones due at the end of December 2022 and January 2023:

        1. Enabled UEFI boot in Cobbler and used it to build the CentOS Stream 8 OS.
        2. Updated VMware from 6.7 to 7.0.
        3. Installed TrueNAS at MSU and upgraded both UM and MSU from Angelfish to Bluefin (the part of the TrueNAS milestone that was missing was having the storage for the MSU VMware).



      • 13:30
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Edward James Dambik (Indiana University (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
        • UIUC had a PM on January 18th. Everything came back up smoothly.
        • Working on upgrading to Condor 10.
        • Discussing procurement plan and deciding how to divide equipment between sites.
        • We switched from Squid to Varnish. Currently, we are testing it out and fixing any issues that come up.
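
        A simple way to sanity-check the new Varnish instance during this testing is to request the same URL twice and look at the caching headers Varnish adds; a minimal sketch, with the endpoint URL as a placeholder rather than the actual MWT2 service:

        # Minimal sketch: verify a Varnish cache is returning hits by fetching the
        # same URL twice and inspecting the Age / X-Varnish response headers.
        # The URL is a placeholder, not the real MWT2 endpoint.
        import requests

        url = "http://varnish.example.org:6081/frontier/or/cvmfs/path"

        for label in ("first", "second"):
            resp = requests.get(url)
            print(label, resp.status_code,
                  "Age:", resp.headers.get("Age"),
                  "X-Varnish:", resp.headers.get("X-Varnish"))

        # A cache hit typically shows a non-zero Age and two transaction IDs
        # in the X-Varnish header.
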
      • 13:35
        NET2 5m
        Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
        • Transfer of machines ongoing: one rack installed in the current UMass POD for demonstration; the remaining machines are crated until the new racks are prepared.
        • Racks: ongoing work with MGHPCC and UMass to start infrastructure work.
        • Network: ESnet connection will be done in Boston instead of NYC in preparation for high-speed connection later in the year. Being worked out this week.
        • Disks: dCache system being tested. Tape DTN being transferred to UMass.
      • 13:40
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA

        • Recurring problems with transfers
          • Disabled the test queue
          • Update door versions
            • Would like to test offline, but may have to install on production first
            • Potentially go to a slightly older version, as on the other door
        • Accounting discrepancy between the GRACC and ATLAS numbers
          • Probably related to the test queue, but more investigation is needed
        • Network Upgrade
          • Problem with the long cable runs between the ToR and core switches
          • Forward Error Correction (FEC) not negotiated correctly; it can be set manually
          • Dell's recommendation is to update the OS/firmware on the switches

        OU

        • Running well, no issues
        • Working on installing an AlmaLinux 9 HTCondor-CE test gatekeeper (GK)
    • 13:45 13:50
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      TACC 

       - Running jobs at the scale of ~10 nodes (1 node per Slurm job) for about a week now

       - Tried scaling to multi-node jobs but encountered SIGBUS errors; the cause has not been tracked down yet.


      NERSC

       - Some large job failures at Cori due to not being able to find Sim_tf. We believe this was because the queue was set to online instead of brokeroff after the downtime.

       - HammerCloud tests have been running on Perlmutter; the machine is currently down for emergency network maintenance.


    • 13:50 14:05
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • A user has been set up to run Athena GPU studies on the federated hub
          • Exposed an environment issue that Doug identified and quickly fixed
          • Need updated documentation
        • Doug has made progress launching notebooks on GPU nodes using HTCondor (+Dask)
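
        For context, one common way to launch Dask workers on GPU batch nodes from a notebook is dask-jobqueue's HTCondorCluster; a minimal sketch, with resource sizes and the GPU request directive as assumptions rather than the actual BNL configuration:

        # Minimal sketch: start Dask workers as HTCondor jobs on GPU nodes from a
        # notebook. Resource sizes and the request_gpus directive are illustrative
        # assumptions, not the BNL production settings.
        from dask_jobqueue import HTCondorCluster
        from dask.distributed import Client

        cluster = HTCondorCluster(
            cores=4,
            memory="16 GB",
            disk="10 GB",
            # Ask HTCondor for one GPU per worker job (recent dask-jobqueue releases
            # use job_extra_directives; older ones call the same option job_extra).
            job_extra_directives={"request_gpus": "1"},
        )
        cluster.scale(jobs=2)     # submit two worker jobs
        client = Client(cluster)  # notebook code can now submit work to the workers
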
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • Getting container image caching service deployed on AF at UChicago
          • Two components: a Harbor instance and a mutation controller (Kyverno with a cluster policy)
          • Transparent to users
          • Caches publicly accessible images; private images will not be cached.
    • 14:05 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Global Rucio meltdown on Monday (caused by URL-signing requests for Google cloud storage overloading the Rucio servers?)
      • WLCG DOMA General Meeting earlier today focused on discussion of DC24 planning document.
        • WLCG coordination and oversight, but no specific person-power for this.  Bottom-up approach with self-forming "topical splinter groups."
        • Target date is "early 2024," before the LHC run commences, lasting "multiple weeks." A possible "pre-challenge" concurrent with SC23 in November.
        • Target rates likely around 20-25% of HL-LHC rates
        • DUNE and Belle II added as participants; their targets are TBD. More direct involvement of NRENs as well.
        • Token auth only for disk endpoints.
        • Demonstrate Tape REST-API for recall operations on selected sites.
        • Involvement of Analysis Facilities?
      • Status of ARC-CE at SLAC? (GGUS)
      • BNL VP running well, up to 1.5K jobs.  Can we ramp up further?
        • Checksum errors completely dominated by accesses to SWT2 (link)
      • Possible attempt to update IAM again next week (GGUS)
      • RAC meeting tomorrow re: LOCALGROUPDISK requirements
      • 14:05
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 14:10
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCaches

        • working fine

        VP

        • a lot of errors from Taiwan and SWT2 endpoints
        • looking at PanDA brokering decisions to understand what is limiting the number of jobs in the different queues

        Varnish

        • found and fixed a misconfiguration (became apparent when Fermilab lost IPv6 networking). 
        • running smoothly at UC and AGLT2.

        ServiceX

        • testing 1.1.3. If all is fine, will deploy it on UC AF.


      • 14:15
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))
        • The cluster is running fine. At times there was a spike in stage-out/stage-in errors, which must be related to connectivity issues (ticketed in ggus:160124).
        • Occasionally there are OOM terminations of jobs running on the old nodes with 2 GB/core (never on the nodes with more memory per core). The resource requests on those nodes are optimized to run at maximum job occupancy, so when a task uses much more memory than it requested, the node may run out of memory and jobs get terminated. Looking into a solution that may work for all node types. While investigating this, I noticed that some PanDA pages (lookup by WorkerID) do not show any job selection; pinged pandamon support.
        • Trying to optimize the job CPU-request coefficient sent from Harvester (the default scale-down value is 0.9). The idea is to not overcommit the node CPU while still leaving CPU-request space for other system/auxiliary pods (see the worked sketch after this list).
        • Follow-up on the issue with the parameter name "resource_type_limits.SCORE_HIMEM" in CRIC, which had an extra-space typo in it. Fixed it for SWT2_CPB_K8S and also pinged Ryan (Victoria), who fixed it for his site too, but for a while there was no feedback from the CRIC experts, so Fernando deleted that parameter from CRIC. With some delay we got a response from Alexey: he checked the database and found no more "bad" names, and they now also have a mechanism in place to avoid introducing parameters with such typos.
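
        Regarding the CPU-request coefficient above, a worked sketch of the idea; the core counts, job size, and resulting headroom are illustrative assumptions, not the actual SWT2 K8s settings:

        # Worked sketch: scale down per-job CPU requests so that a full node's worth
        # of payload pods still leaves room for system/auxiliary pods.
        # Numbers (cores, coefficient, job size) are illustrative assumptions.
        cores_per_node = 48            # physical cores available to Kubernetes
        cpu_scale_coefficient = 0.9    # Harvester default scale-down factor
        cores_per_job = 8              # e.g. an 8-core ATLAS production job

        # CPU request actually placed on each pod, in Kubernetes millicores
        request_millicores = int(cores_per_job * cpu_scale_coefficient * 1000)
        print(f"pod CPU request: {request_millicores}m")   # 7200m instead of 8000m

        # Six such pods fit on the node while leaving headroom for other pods
        jobs_per_node = cores_per_node // cores_per_job
        leftover = cores_per_node * 1000 - jobs_per_node * request_millicores
        print(f"{jobs_per_node} jobs per node, {leftover}m left for system/auxiliary pods")
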
    • 14:25 14:35
      AOB 10m