US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09


    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Pre-scrubbing schedule:

      • June 27 (all day) - Tier1 (Rob and Shawn in person at Brookhaven)
      • June 28 (morning) - 2.3.2, 2.3.3, 2.3.4, 2.3.5 (L3 managers join via Zoom)

      The date for the actual scrubbing is likely the first week of August, at UMass Amherst (Verena hosting). This might be combined with an all-US ATLAS S&C open technical meeting; TBD.

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • Release this week containing osg-wn-client (was missing voms-clients-cpp and stashcp)
      • 3.5 EOL/token transition
        • Feedback/questions? Any issues/difficulties?
        • Stopped updating 'fresh', '3.5-release', and 'release' image tags
        • Removed most OSG 3.5 documentation
    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        Joint CMS/ATLAS HPC/Cloud Blueprint Status/Updates 30m
        Speakers: Fernando Harald Barreiro Megino (University of Texas at Arlington), Lincoln Bryant (University of Chicago (US))

        Doug - have you talked with the Centers about injecting remote workloads? NERSC has a related "Superfacility" project.

        Brian Lin to Everyone (12:14 PM)
        @Doug are the various HPCs you were talking about looking into a common interface or are each of them putting together their own special sauce?

        Douglas Benjamin to Everyone (12:17 PM)
        look at NERSC superfacility talks from Debbie Bard, At OLCF there are talks on their SLATE setup.

        Kaushik: please don't lose focus on the three review questions that we really need to understand, with a first answer within the first six months: 1) What workloads work best on HPCs and clouds? 2) What is the cost, in people and hardware? There are costs. 3) What can be done jointly in the future?

        Note - CMS wants to enlarge 2) to include Tier1 and Tier2.  This requires a lot more work. 

        Doug: what about workloads that *don't* work well?

        Paolo: suggesting 

    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      dCache downtime next week (4 hours); it is entered in the downtime calendar.

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonable running over the last two weeks.
        • AGLT2: issues after a scheduled power outage, plus a certificate problem.
        • MWT2: dCache upgrade downtime and some trouble keeping the site full.
        • NET2: stability issues on the GPFS partition.
        • SWT2 CPB: a read-only disk clogged up job submission, twice draining the site.
        • There were several issues with the central services: Rucio suffered a network outage and a database issue.
      • Run 3 data-taking readiness:
        • AGLT2: fully updated and ready; some compute servers not yet received.
        • MWT2: fully updated and ready; some network gear not received (workaround in place).
        • NET2: needs to update to OSG 3.6, support IPv6, get XRootD WAN access up, finish the network upgrade, and transition storage entirely to Ceph with GPFS retired.
        • SWT2 OU: needs new hardware for the gatekeeper and SLATE in operation; needs to update to OSG 3.6.
        • SWT2 CPB: needs to update to OSG 3.6, support IPv6, and remove LSM; some compute servers not yet received.
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        05/01/2022

        There was a scheduled power shutdown in the UM Tier3 server room due to maintenance of the facility; the shutdown lasted 6 hours. A couple of things broke during the shutdown, including the network card of one UPS unit and the containerd/network forwarding service on one node of the SLATE kubelet cluster. (The containerd failure was caused by a wrong configuration of net.ipv4.conf.default.forwarding and net.ipv4.conf.all.forwarding; both should be set to 1.) The kubelet node problem took down one of the squid servers hosted on the cluster; all traffic went to the other squid server, so it did not cause job failures.
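
The forwarding fix described above can be pinned in a sysctl drop-in so that a reboot does not silently revert it; a minimal sketch (the file path and name are illustrative, not the actual AGLT2 configuration):

```
# /etc/sysctl.d/90-forwarding.conf  (name illustrative)
# containerd/Kubernetes pod networking needs IP forwarding enabled;
# these are the two keys found misconfigured after the power outage.
net.ipv4.conf.default.forwarding = 1
net.ipv4.conf.all.forwarding = 1
```

Apply without a reboot via `sysctl --system`; any config-management system (cfengine here) must distribute the same values, or it will revert them on its next run.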

        5/02/2022

        On the SLATE kubelet cluster node sl-um-es5, cfengine reverted the IP forwarding change, so the squid service went down again. This caused many BOINC job failures, as all BOINC clients are configured to use this proxy server. We switched the BOINC proxy server to sl-um-es3, which is located in the Tier2 server room and should be more robust; the BOINC jobs then started to refill the worker nodes. Later we fixed the sl-um-es5 node.

        5/5/2022

        During our annual renewal of the host certificates, we mistakenly requested the gatekeepers' host certificates from InCommon RSA instead of InCommon IGTF, which caused authentication errors on all gatekeepers for any incoming jobs. The change was made late in the afternoon, and the error was not caught until the next morning, so the site drained overnight. We replaced the RSA certs with IGTF certs on the gatekeepers, and the site started to ramp up. During the 17-hour draining period, BOINC jobs ramped up as designed and filled the whole cluster, so the overall CPU time used by ATLAS jobs stayed about the same as before the draining.
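
A mix-up like this can be caught before deployment by inspecting the issuer of the new certificate. A generic sketch (the self-signed demo cert below is only a stand-in; in practice point CERT at the gatekeeper's actual hostcert.pem):

```shell
# Create a throwaway self-signed cert purely as a stand-in for the demo.
CERT=/tmp/demo-hostcert.pem
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout /tmp/demo-hostkey.pem -out "$CERT" \
    -subj "/C=US/O=DemoGrid/CN=gate01.example.org" 2>/dev/null

# Print the issuer and expiry. For a production gatekeeper the issuer line
# should name the IGTF CA, not the plain InCommon RSA server CA.
openssl x509 -in "$CERT" -noout -issuer -enddate
```

Running this check right after requesting the renewal would have flagged the wrong CA before the overnight drain.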

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        dCache upgraded to 7.2.15

        Working on adding an additional gatekeeper at both IU and UC

        Upgraded workers to OSG 3.6

      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef
      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Running well, except for occasional xrootd overloads. Working with Andy and Wei to address this.

        - OSCER maintenance today to upgrade SLURM (critical vulnerability). We didn't schedule a downtime because jobs will just be held and then launched after the maintenance completes.

        - Got very good opportunistic throughput over the last few days while the cluster was draining for maintenance: up to 5,500 slots total, which I think is a record for OU.

         

        UTA:

        • Still receiving R6225s from the last purchase; 50% have been delivered to the lab.
        • Began testing the HTCondor-CE from OSG 3.6 this morning.
        • An odd node failure caused problems late last week.
          • The failure prevented the node health check from running correctly.
          • Jobs scheduled to the node failed to start.
          • The failed jobs were held (they look queued to HTCondor).
          • Pilot submission was choked off.
          • Will find a permanent fix.
    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      TACC

      • Allocation essentially finished. We have 1500 SUs left, less than 1%. Will use the rest to experiment with HostedCEs

      NERSC

      • Some recent job failures that we are looking into. A small permissions issue with shared ownership of the Harvester directory; not clear if related.
      • Ongoing work with XRootD setup at NERSC. 
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Working on pulling AF metrics for 2.3/5 meeting tomorrow
        • Federated Jupyterhub nearing approval after meeting with GUV center
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        A system is being set up to monitor Analysis Facility usage.

        The AF metrics collector repository contains simple scripts that collect basic data (logged-in users, Jupyter logs, Condor users, jobs, etc.). The data is sent to the UC Logstash and then on to Elasticsearch.

        Currently only the UC AF sends data. An initial dashboard is available.
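
The kind of simple collector script described above might look like the following sketch: gather logged-in users, wrap them in a timestamped JSON document, and print it (shipping to Logstash would replace the print). All field names here are illustrative assumptions, not the actual collector's schema:

```python
#!/usr/bin/env python3
"""Minimal sketch of an AF metrics collector (field names illustrative)."""
import json
import subprocess
from datetime import datetime, timezone

def collect_logged_in_users():
    """Return the sorted unique usernames with active login sessions."""
    try:
        # `who` lists one line per login session; the username is field 1.
        out = subprocess.run(["who"], capture_output=True, text=True).stdout
    except OSError:
        return []  # `who` unavailable on this host
    return sorted({line.split()[0] for line in out.splitlines() if line.split()})

def make_document(site="UC-AF"):
    """Build one timestamped JSON-serializable metrics document."""
    users = collect_logged_in_users()
    return {
        "site": site,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "logged_in_users": users,
        "n_users": len(users),
    }

if __name__ == "__main__":
    # In the real pipeline this document would be POSTed to Logstash.
    print(json.dumps(make_document(), indent=2))
```

Analogous one-shot scripts (run from cron) could cover Condor users and jobs by shelling out to the corresponding query tools.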

    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCache

        • upgrading today to 5.4.3
        • BNL xcache was dead but sending heartbeats. Ofer fixed it.
        • Prague lost a disk.
        • AGLT2 server has network issues. Removed until it gets fixed.

        VP

        • All works fine 

        ServiceX

        • Improving log parsing
      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        The cluster is running fine. Grid jobs are reaching the workers but get stuck there in a waiting state. I looked into those pods, but the warning message in their descriptions was not very conclusive/helpful.
        I also see one Calico pod (in the calico-system namespace) that is running but not showing healthy. Though overall the internal network provided by Calico is working fine, there seems to be some configuration issue; that issue must be the source of the problem with the stuck pods.
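
A few standard kubectl commands can help narrow down where the Calico problem sits; this is a generic diagnostic sketch with placeholder pod and namespace names, not the exact commands used at UTA:

```
# Which calico-system pod is not fully Ready (e.g. 0/1 Running)?
kubectl get pods -n calico-system -o wide

# Why is a workload pod stuck waiting? The Events section at the bottom
# of the description is usually more specific than the pod status.
kubectl describe pod <stuck-pod> -n <namespace>

# Logs of the unhealthy Calico pod often name the misconfigured item.
kubectl logs -n calico-system <calico-pod>

# If Calico was installed via the Tigera operator, this summarizes health.
kubectl get tigerastatus
```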

    • 14:55 15:05
      AOB 10m