US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09


    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Pre-scrubbing schedule:

      • June 27 (all day) - Tier1 (Rob and Shawn in person at Brookhaven)
      • June 28 (morning) - 2.3.2, 2.3.3, 2.3.4, 2.3.5 (L3 managers join via Zoom)

      The date for the actual scrubbing is likely the first week of August, at UMass Amherst (Verena hosting). This might be combined with an all-US ATLAS S&C open technical meeting; TBD.

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • Release this week containing osg-wn-client (was missing voms-clients-cpp and stashcp)
      • 3.5 EOL/token transition
        • Feedback/questions? Any issues/difficulties?
        • Stopped updating 'fresh', '3.5-release', and 'release' image tags
        • Removed most OSG 3.5 documentation
    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        Joint CMS/ATLAS HPC/Cloud Blueprint Status/Updates 30m
        Speakers: Fernando Harald Barreiro Megino (University of Texas at Arlington), Lincoln Bryant (University of Chicago (US))

        Doug - have you talked with the Centers about injecting remote workloads? NERSC has a related "Superfacility" project.

        Brian Lin to Everyone (12:14 PM)
        @Doug are the various HPCs you were talking about looking into a common interface or are each of them putting together their own special sauce?

        Douglas Benjamin to Everyone (12:17 PM)
        look at NERSC superfacility talks from Debbie Bard, At OLCF there are talks on their SLATE setup.

        Kaushik: please don't lose focus on the three review questions that we really need to understand, with a first answer within the first six months: 1) What workloads work best on HPCs and clouds? 2) What is the cost, in people and hardware? There are costs. 3) What can be done jointly in the future?

        Note - CMS wants to enlarge 2) to include Tier1 and Tier2.  This requires a lot more work. 

        Doug: what about workloads that *don't* work well?

        Paolo: suggesting 

    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      dCache downtime next week (4 hours); it is entered in the downtime calendar.

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonable running over the last two weeks.
        • AGLT2: issues after a scheduled power outage, plus a certificate problem.
        • MWT2: dCache upgrade downtime and some trouble keeping the site full.
        • NET2: stability issues on the GPFS partition.
        • SWT2 CPB: a read-only disk clogged up job submission, twice draining the site.
        • There were several issues with the central services: Rucio suffered a network outage and a database issue.
      • Run 3 data-taking readiness:
        • AGLT2: fully updated and ready; some compute servers not yet received.
        • MWT2: fully updated and ready; some network gear not received (workaround in place).
        • NET2: needs to update to OSG 3.6, support IPv6, get XRootD WAN access up, finish the network upgrade, and transition storage entirely to Ceph with GPFS retired.
        • SWT2 OU: needs new hardware for the gatekeeper and SLATE in operation; needs to update to OSG 3.6.
        • SWT2 CPB: needs to update to OSG 3.6, support IPv6, and remove LSM; some compute servers not yet received.
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        05/01/2022

        There was a scheduled power shutdown in the UM Tier3 server room due to maintenance of the facility; the shutdown lasted 6 hours. A couple of things broke during the shutdown, including the network card of one UPS unit and the containerd/network forwarding service on one node of the SLATE kubelet cluster. (The containerd failure was caused by a wrong configuration of net.ipv4.conf.default.forwarding and net.ipv4.conf.all.forwarding; both should be set to 1.) The kubelet node problem took down one of the squid servers hosted on the cluster; all traffic went to the other squid server, so it did not cause job failures.
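
The forwarding fix described above can be pinned in a sysctl drop-in so that a reboot does not silently revert it; a minimal sketch (the file path and name are illustrative, not the actual AGLT2 configuration):

```
# /etc/sysctl.d/90-forwarding.conf  (name illustrative)
# containerd/Kubernetes pod networking needs IP forwarding enabled;
# these are the two keys found misconfigured after the power outage.
net.ipv4.conf.default.forwarding = 1
net.ipv4.conf.all.forwarding = 1
```

Apply without a reboot via `sysctl --system`; any config-management system (cfengine here) must distribute the same values, or it will revert them on its next run.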

        5/02/2022

        On the SLATE kubelet cluster node sl-um-es5, cfengine reverted the IP forwarding change, so the squid service went down again. This caused many BOINC job failures, as all BOINC clients are configured to use this proxy server. We switched the BOINC proxy server to sl-um-es3, which is located in the Tier2 server room and should be more robust; the BOINC jobs then started to refill the worker nodes. Later we fixed the sl-um-es5 node.

        5/5/2022

        During our annual renewal of the host certificates, we mistakenly requested the gatekeepers' host certificates from InCommon RSA instead of InCommon IGTF, which caused authentication errors on all gatekeepers for any incoming jobs. The change was made late in the afternoon, and the error was not caught until the next morning, so the site drained overnight. We replaced the RSA certs with IGTF certs on the gatekeepers, and the site started to ramp up. During the 17-hour draining period, BOINC jobs ramped up as designed and filled the whole cluster, so the overall CPU time used by ATLAS jobs stayed about the same as before the draining.
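
A mix-up like this can be caught before deployment by inspecting the issuer of the new certificate. A generic sketch (the self-signed demo cert below is only a stand-in; in practice point CERT at the gatekeeper's actual hostcert.pem):

```shell
# Create a throwaway self-signed cert purely as a stand-in for the demo.
CERT=/tmp/demo-hostcert.pem
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout /tmp/demo-hostkey.pem -out "$CERT" \
    -subj "/C=US/O=DemoGrid/CN=gate01.example.org" 2>/dev/null

# Print the issuer and expiry. For a production gatekeeper the issuer line
# should name the IGTF CA, not the plain InCommon RSA server CA.
openssl x509 -in "$CERT" -noout -issuer -enddate
```

Running this check right after requesting the renewal would have flagged the wrong CA before the overnight drain.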

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        dCache upgraded to 7.2.15

        Working on adding an additional gatekeeper at both IU and UC

        Upgraded workers to OSG 3.6

      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef
      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Running well, except for occasional xrootd overloads. Working with Andy and Wei to address this.

        - OSCER maintenance today to upgrade SLURM (critical vulnerability). We didn't schedule a downtime because jobs will just be held and then launched after the maintenance completes.

        - Got very good opportunistic throughput over the last few days while the cluster was draining for maintenance: up to 5,500 slots total, which I think is a record for OU.

         

        UTA:

        • Still receiving R6225s from the last purchase; 50% have been delivered to the lab.
        • Began testing the HTCondor-CE from OSG 3.6 this morning.
        • An odd node failure caused problems late last week.
          • The failure prevented the node health check from running correctly.
          • Jobs scheduled to the node failed to start.
          • The failed jobs were held (they look queued to HTCondor).
          • Pilot submission was choked off.
          • Will find a permanent fix.
    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      TACC

      • Allocation essentially finished. We have 1500 SUs left, less than 1%. Will use the rest to experiment with HostedCEs

      NERSC

      • Some recent job failures that we are looking into. A small permissions issue with shared ownership of the Harvester directory; not clear if related.
      • Ongoing work with XRootD setup at NERSC. 
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Working on pulling AF metrics for 2.3/5 meeting tomorrow
        • Federated Jupyterhub nearing approval after meeting with GUV center
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        A system is being set up to monitor Analysis Facility usage.

        The AF metrics collector repository contains simple scripts that collect basic data (logged-in users, Jupyter logs, Condor users, jobs, etc.). The data is sent to the UC Logstash and then on to Elasticsearch.

        Currently only the UC AF sends data. An initial dashboard is available.
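
The kind of simple collector script described above might look like the following sketch: gather logged-in users, wrap them in a timestamped JSON document, and print it (shipping to Logstash would replace the print). All field names here are illustrative assumptions, not the actual collector's schema:

```python
#!/usr/bin/env python3
"""Minimal sketch of an AF metrics collector (field names illustrative)."""
import json
import subprocess
from datetime import datetime, timezone

def collect_logged_in_users():
    """Return the sorted unique usernames with active login sessions."""
    try:
        # `who` lists one line per login session; the username is field 1.
        out = subprocess.run(["who"], capture_output=True, text=True).stdout
    except OSError:
        return []  # `who` unavailable on this host
    return sorted({line.split()[0] for line in out.splitlines() if line.split()})

def make_document(site="UC-AF"):
    """Build one timestamped JSON-serializable metrics document."""
    users = collect_logged_in_users()
    return {
        "site": site,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "logged_in_users": users,
        "n_users": len(users),
    }

if __name__ == "__main__":
    # In the real pipeline this document would be POSTed to Logstash.
    print(json.dumps(make_document(), indent=2))
```

Analogous one-shot scripts (run from cron) could cover Condor users and jobs by shelling out to the corresponding query tools.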

    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        • upgrading today to 5.4.3
        • BNL xcache was dead but sending heartbeats. Ofer fixed it.
        • Prague lost a disk.
        • AGLT2 server has network issues. Removed until it gets fixed.

        VP

        • All works fine 

        ServiceX

        • Improving log parsing
      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        The cluster is running fine. Grid jobs are reaching the workers but get stuck there in a waiting state. I looked into those pods, but the warning message in their descriptions was not very conclusive/helpful.
        I also see one Calico pod (in the calico-system namespace) that is running but not showing healthy. Though overall the internal network provided by Calico is working fine, there seems to be some configuration issue; that issue must be the source of the problem with the stuck pods.
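
A few standard kubectl commands can help narrow down where the Calico problem sits; this is a generic diagnostic sketch with placeholder pod and namespace names, not the exact commands used at UTA:

```
# Which calico-system pod is not fully Ready (e.g. 0/1 Running)?
kubectl get pods -n calico-system -o wide

# Why is a workload pod stuck waiting? The Events section at the bottom
# of the description is usually more specific than the pod status.
kubectl describe pod <stuck-pod> -n <namespace>

# Logs of the unhealthy Calico pod often name the misconfigured item.
kubectl logs -n calico-system <calico-pod>

# If Calico was installed via the Tigera operator, this summarizes health.
kubectl get tigerastatus
```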

    • 14:55 15:05
      AOB 10m