Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
Pre-scrubbing schedule:
Date for the actual scrubbing is likely the first week of August, at UMass Amherst (Verena hosting). This might be combined with an all-US ATLAS S&C open technical meeting, TBD.
Doug - have you talked with the Centers about injecting remote workloads? NERSC has a related "Superfacility" project.
Brian Lin to Everyone (12:14 PM)
@Doug are the various HPCs you were talking about looking into a common interface or are each of them putting together their own special sauce?
Douglas Benjamin to Everyone (12:17 PM)
Look at NERSC Superfacility talks from Debbie Bard. At OLCF there are talks on their SLATE setup.
Kaushik: please don't lose focus on the three review questions that we really need to understand, with a first answer within the first six months: 1) What workloads work best on HPCs and clouds? 2) What is the cost, in both people and hardware? 3) What can be done jointly in the future?
Note - CMS wants to enlarge 2) to include Tier1 and Tier2. This requires a lot more work.
Doug: what about workloads that *don't* work well?
Paolo: suggesting
Updates on US Tier-2 centers
05/01/2022
There was a scheduled power shutdown in the UM Tier3 server room due to maintenance of the facility. The shutdown lasted 6 hours, and a couple of things broke during it, including the network card for one UPS unit and the containerd/network forwarding service on one of the nodes of the SLATE kubelet cluster. (The containerd failure was caused by a wrong configuration of net.ipv4.conf.default.forwarding and net.ipv4.conf.all.forwarding; they should both be set to 1.) The kubelet node problem took down one of the squid servers hosted on the kubelet cluster; all traffic went to the other squid server, so it did not cause job failures.
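For reference, a minimal sketch of restoring the forwarding settings mentioned above, assuming a standard sysctl setup on the node (the drop-in file name is arbitrary):

    # re-enable IPv4 forwarding immediately (the two keys mentioned above)
    sysctl -w net.ipv4.conf.default.forwarding=1
    sysctl -w net.ipv4.conf.all.forwarding=1
    # persist the settings across reboots via a sysctl.d drop-in (file name is arbitrary)
    printf 'net.ipv4.conf.default.forwarding = 1\nnet.ipv4.conf.all.forwarding = 1\n' > /etc/sysctl.d/99-ip-forwarding.conf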
05/02/2022
CFEngine reverted the IP forwarding change on the SLATE kubelet cluster node sl-um-es5, so the squid service went down again. This caused a lot of BOINC jobs to fail, since all BOINC clients are configured to use this proxy server. We switched the BOINC proxy server to sl-um-es3, which is located in the Tier2 server room and should be more robust. The BOINC jobs started to refill the worker nodes after we changed the proxy, and we later fixed the sl-um-es5 node.
05/05/2022
During our annual renewal of the host certificates, we mistakenly requested the gatekeepers' host certificates from the InCommon RSA CA instead of InCommon IGTF, which started to cause authentication errors on all gatekeepers for any incoming jobs. The change was made late in the afternoon and the error was not caught until the next morning, so the site drained overnight. We replaced the RSA certs with IGTF certs on the gatekeepers, and the site started to ramp back up. During the 17-hour draining period, BOINC jobs ramped up as designed and filled the whole cluster, so the overall CPU time used by ATLAS jobs stayed about the same as before the draining.
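One way to catch this kind of mix-up early is to check the issuer of the installed host certificate; a minimal sketch, assuming the usual grid certificate location (the exact issuer string differs per CA):

    # print issuer, subject, and validity of the installed host certificate (path assumed)
    openssl x509 -in /etc/grid-security/hostcert.pem -noout -issuer -subject -dates
    # the issuer should be the InCommon IGTF server CA, not the InCommon RSA server CA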
OU:
- Running well, except for occasional xrootd overloads. Working with Andy and Wei to address this.
- Today there is OSCER maintenance to upgrade SLURM (critical vulnerability). We didn't schedule a downtime because jobs will just be held and launched after the maintenance completes.
- Got very good opportunistic throughput the last few days while the cluster was draining for maintenance: up to 5,500 slots total, which I think is a record for OU.
UTA:
TACC
NERSC
A system is being set up to monitor Analysis Facilities usage.
The AF metrics collector repository contains simple scripts to collect basic data (logged-in users, Jupyter logs, Condor users, jobs, etc.). The data is sent to the UC Logstash and then on to Elasticsearch.
Currently only the UC AF sends data. An initial dashboard is available here.
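As an illustration of the flow described above, here is a minimal sketch of shipping one collected value to a Logstash HTTP input; the endpoint, port, and field names are assumptions, not the actual collector interface:

    # count logged-in users and post the value to a Logstash HTTP input (endpoint assumed)
    USERS=$(who | awk '{print $1}' | sort -u | wc -l)
    curl -s -X POST http://logstash.example.uchicago.edu:8080 \
         -H 'Content-Type: application/json' \
         -d "{\"site\": \"UC_AF\", \"metric\": \"logged_in_users\", \"value\": ${USERS}}"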
The cluster is running fine. The grid jobs are reaching the workers but are stuck there in a waiting state. I looked into those pods, but the warning message in their descriptions was not very conclusive or helpful.
I also see one Calico pod (in the calico-system namespace) that is running but not reporting healthy. Although the internal network provided by Calico is working fine overall, there seems to be some configuration issue, and that issue is most likely the source of the stuck-pod problem.
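A few standard kubectl checks that could help narrow this down; the pod names and namespace below are placeholders, not the actual ones on the cluster:

    # inspect a stuck grid-job pod and recent events in its namespace (names are placeholders)
    kubectl describe pod <stuck-pod-name> -n <namespace>
    kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp
    # check the unhealthy Calico pod and its logs
    kubectl get pods -n calico-system -o wide
    kubectl logs <calico-pod-name> -n calico-system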