US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      HEPiX is this week https://indico.cern.ch/event/1123214/timetable/#20220427.detailed

       

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release yesterday: https://opensciencegrid.org/docs/release/osg-36/#april-26-2022-cvmfs-292-upcoming-htcondor-981

      • CVMFS bugfix release
      • VOMS clients now generate 2048-bit proxies by default (see the sketch after this list)
      • osg-ce minor update that will help us track OSG 3.6 updates
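
      A quick way to confirm the new 2048-bit default is to inspect the key size of a freshly generated proxy. Below is a minimal Python sketch, assuming the proxy sits at the conventional /tmp/x509up_u<uid> path and the 'cryptography' package is available (neither detail comes from these notes):

      # Hedged sketch: check the key size of a VOMS proxy to confirm the new
      # 2048-bit default. Path and library are assumptions, not from the minutes.
      import os
      from cryptography import x509

      proxy_path = f"/tmp/x509up_u{os.getuid()}"
      with open(proxy_path, "rb") as f:
          pem = f.read()

      # The proxy certificate is the first PEM certificate block in the file.
      marker = b"-----END CERTIFICATE-----"
      first_cert = pem.split(marker)[0] + marker
      cert = x509.load_pem_x509_certificate(first_cert)
      print(f"proxy key size: {cert.public_key().key_size} bits")  # expect 2048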

      OSG 3.5 EOL on May 1!

      HTCondor Week registration is closing soon! See invitation:

      Greetings CHTC Users!

      We want to invite you to HTCondor Week 2022, our annual HTCondor user conference, May 23-26, 2022. This year, HTCondor Week will be a hybrid event: we are hosting an in-person meeting at the Fluno Center on the University of Wisconsin-Madison campus. This provides HTCondor Week attendees with a compelling environment in which to attend tutorials and talks from HTCondor developers, meet other users like you and attend social events. For those who cannot attend in person, we'll also be broadcasting the event online via a Zoom meeting.

      Registration for HTCondor Week 2022 is open now. The registration deadline for in-person attendees is May 2, 2022, and the cost is $90 per day to partake in conference food. For virtual-only attendance, registration is a flat $25 fee for the whole week.

      UW-Madison affiliates who attend conference talks in person only need to register for in-person participation (and pay) if they plan to partake in conference food. We also recommend the virtual registration (which still carries a fee) for UW-Madison affiliates who plan to participate virtually.
      You can register at http://htcondor.org/HTCondorWeek2022.

      There will be specific programming highlighting the UW-Madison campus community on Thursday, May 26, where you can meet other campus users of CHTC and HTCondor, as well as CHTC staff. We will separately contact some CHTC users to present their work that day!!

      On other days, we will have a variety of in-depth tutorials and talks where you can learn more about HTCondor and how other people are using and deploying HTCondor. Best of all, you can establish contacts and learn best practices from people in industry, government, and academia who are using HTCondor to solve hard problems, many of which may be similar to those you are facing.

      Hotel details and agenda overview are on the HTCondor Week 2022 site:

      http://htcondor.org/HTCondorWeek2022

      We hope to see you there,

      The Center for High Throughput Computing

       

    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      Migrated all gatekeepers to OSG 3.6

      - HC (HammerCloud) turned us off 5 times in the past week; investigation ongoing, as the issues look like non-site-related errors.


       

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Pretty good running over the past two weeks.
        • Lots of compute slots coming online.
        • The main issue ("EPoll:") is a pilot problem where the string "EPoll:" gets prepended to variable values, causing the pilot to kill the job after it has been running without problems (see the sketch after this list). This hurt job efficiency a lot and caused HammerCloud to kick sites offline when there was no site problem. It seems to affect sites running HTCondor and dCache, even though the working hypothesis is an XRootD issue.
      • We are down to the wire on getting OSG 3.6 into use.
      • How are NET2 and SWT2 doing on enabling IPv6?
      • Please keep these sheets up to date:
        • Service versions: https://docs.google.com/spreadsheets/d/1_fKB6GckfODTzEvOgRJu9sazxICM_RN95y039DZHF7U
        • Run 3 readiness: https://docs.google.com/spreadsheets/d/1KniOlqb4dbJ6dKUHBYYt9OfriKjhVpUqXguPvryIMY8
        • Site capacity: https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4
          • NB: I will not add the tabs for the current (April-June) quarter until I am sure that the data for the previous quarter actually reflects the situation on March 31.
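
      To illustrate the "EPoll:" failure mode above: a value the pilot reads back arrives with an unexpected "EPoll:" string prepended, so a strict parse fails even though the job itself is healthy. A purely hypothetical defensive parse (not the actual pilot code) might look like:

      # Illustrative only: strip a spurious "EPoll:" prefix before parsing,
      # so an otherwise numeric value does not trigger a job kill.
      def parse_reported_value(raw: str) -> int:
          cleaned = raw.strip()
          if cleaned.startswith("EPoll:"):      # spurious prefix described above
              cleaned = cleaned[len("EPoll:"):].strip()
          return int(cleaned)

      assert parse_reported_value("4096") == 4096
      assert parse_reported_value("EPoll: 4096") == 4096
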
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        Now on OSG 3.6 for both gatekeepers and worker nodes.

        We broke the Frontier squids while trying to fix Gratia probe problems.
        Our first fix attempt inadvertently re-enabled a local setup script that overrode the squid location variables.
        The Gratia issues are solved: the directory ownership was root instead of condor.
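
        A minimal Python sketch of the ownership check described above; the directory path is an assumption, not taken from these notes:

        # Verify the Gratia probe data directory is owned by 'condor' rather than root.
        import pwd
        from pathlib import Path

        data_dir = Path("/var/lib/gratia/data")   # hypothetical probe data directory
        owner = pwd.getpwuid(data_dir.stat().st_uid).pw_name
        if owner != "condor":
            print(f"{data_dir} is owned by {owner}; the probe expects 'condor'")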

         

        2 tickets:

        156868  15-Apr-2022   AGLT2: Failing jobs in panda with "Unable to identify specific exception"
        156873  17-Apr-2022   US AGLT2: High Transfer failures as source

        The job problems were traced to timeouts during stage-out.
        There was no single clear cause, but the likely suspect was dCache/Java running out of memory.
        We increased the memory for WebDAV on the doors and for dCacheDomain on the headnodes.
        We also added CPUs and memory to the VM doors.  That all helped.
        We also upgraded dCache from 6.2.35 to 7.2.15 (since we had to restart to load new CA certs anyway).
        The issues from both tickets disappeared after that.

         

        Maintenance:

        Mostly through with updating all worker nodes: new kernel, Dell firmware updates, and OSG updates (CVMFS).

         

        Network upgrades completed and tested:

        All new multi-path and multi-100G connections to ESnet and between MSU and UM are now fully deployed
        and were tested for proper failover in the event of a backhoe-versus-fiber incident.

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        All new compute and storage online at the three sites.

        Both gatekeepers upgraded to OSG 3.6.

        Restarted dCache after upgrading osg-ca-certs. Planning an upgrade to dCache 7.2 for the week of 9 May.

         

      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef

         

        Smooth operations.  New workers are in production. 

        NESE team preparing for a ~5-rack expansion of NESE Ceph, including NET2 storage; slowed down by Cisco switch delivery. This will allow retirement of NET2 GPFS and make more space for workers.

        Working on IPv6, then OSG 3.6; also upgrading ToR networking and NET2-NESE networking.

      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • Compute nodes from UTA_SWT2 have been integrated into SWT2_CPB
          • UTA_SWT2 is now disabled in CRIC
          • Working on updating some old compute nodes with additional memory
          • Still need to update Capacity Spreadsheet/OIM/CRIC to reflect changes
        • Received partial shipment of R6525 nodes (8 nodes of 48)
          • The machines are racked
          • Need to update Rocks install kernel to support RAID card before installation
        • Work is progressing on configuring the new OSG 3.6 CE.

        OU:

        - Drained some HEP nodes to move them; they should be back up later today.

        - Should get the rest of the newly arrived HEP nodes up and running soon as well.

         

    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      TACC

      • Running as normal; a bit over 10K SUs left.

      NERSC

      • The old task was taking 10+ hours to run; asked John Anders to send a new task. The new task seems to be running OK.
      • Working with Doug and Wei to get XRootD going at NERSC for DATADISK.
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • First of the new AF Forum series was held last week, with an update on k8s batch and the upcoming KubeCon
        • Met on Friday to work out details of the BNL/NERSC XRootD SE setup
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        AGC was a complete success. Everything worked, we had some new users, and there were very interesting discussions; the discussion at the ATLAS parallel session at the end was especially interesting.

         

    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • All BNL gatekeepers now updated and running OSG 3.6 HTCondor-CEs
      • On Monday, ATLAS jobs began triggering a machine check exception (MCE) on some older servers at BNL (Dell R640 Skylake).  These hosts are currently closed to jobs to apply a firmware update.
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        * XCache - working fine

        * VP - working fine - will summarize performance and the BHAM experience with switching to VP at the next DDM meeting.

        * ServiceX - works fine at 1.0.30. Next week will be dedicated to performance-improvement development.

        * Analytics - adding new functionality to ATLAS Alarm & Alert Frontend.

      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        All the existing Kubernetes worker nodes were updated with additional memory. Part of the hardware from the retired UTA_SWT2 cluster was also racked at CPB and added to the cluster: Kubernetes was installed on those nodes and they were joined to the existing cluster. The cluster is reporting healthy.
        Now trying to find out why grid jobs reach the workers but get stuck there in a waiting state (see the sketch below).
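
        A minimal diagnostic sketch for the waiting-state investigation, assuming the stuck grid jobs surface as Kubernetes pods whose containers remain in a 'waiting' state (an assumption; the hang could also be at the HTCondor level). It uses the official kubernetes Python client and a valid kubeconfig for the cluster:

        # List containers stuck in 'waiting' and print the recent events that explain why.
        from kubernetes import client, config

        config.load_kube_config()        # or load_incluster_config() on a cluster node
        v1 = client.CoreV1Api()

        for pod in v1.list_pod_for_all_namespaces().items:
            for cs in (pod.status.container_statuses or []):
                if cs.state.waiting:     # container has not started running yet
                    print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                          f"{cs.name} waiting ({cs.state.waiting.reason})")
                    events = v1.list_namespaced_event(
                        pod.metadata.namespace,
                        field_selector=f"involvedObject.name={pod.metadata.name}",
                    )
                    for ev in events.items:
                        print("   event:", ev.reason, "-", ev.message)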

    • 14:55 15:05
      AOB 10m