US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
13:00 → 13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
Pre-scrubbing schedule:
- June 27 (all day) - Tier1 (Rob and Shawn in person at Brookhaven)
- June 28 (morning) - 2.3.2, 2.3.3, 2.3.4, 2.3.5 (L3 managers join via Zoom)
The actual scrubbing will likely be held the first week of August at UMass Amherst (Verena hosting). It might be combined with an all-US ATLAS S&C open technical meeting, TBD.
13:10 → 13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
13:20 → 13:50
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
13:20
Joint CMS/ATLAS HPC/Cloud Blueprint Status/Updates 30m
Speakers: Fernando Harald Barreiro Megino (University of Texas at Arlington), Lincoln Bryant (University of Chicago (US))
Doug: have you talked with the Centers about injecting remote workloads? NERSC has a related "Superfacility" project.
Brian Lin to Everyone (12:14 PM)
@Doug are the various HPCs you were talking about looking into a common interface, or is each of them putting together their own special sauce?
Douglas Benjamin to Everyone (12:17 PM)
Look at the NERSC Superfacility talks from Debbie Bard; at OLCF there are talks on their SLATE setup.
Kaushik: please don't lose focus on the three review questions that we really need to understand, with a first answer within the first six months: 1) what are the workloads that work best on HPCs and Clouds? 2) what is the cost, in people and hardware (there are costs)? 3) what can be done jointly in the future?
Note: CMS wants to enlarge 2) to include Tier1 and Tier2. This requires a lot more work.
Doug: what about workloads that *don't* work well?
Paolo: suggesting
13:50 → 13:55
13:55 → 14:15
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running over the last two weeks.
- AGLT2 issues after a scheduled power outage and a certificate problem.
- MWT2 dCache upgrade downtime and some trouble keeping the site full.
- NET2 stability issues on the GPFS partition.
- SWT2 CPB read-only disk clogged up job submission twice, draining the site.

- There were several issues with the central services: Rucio suffered a network outage and a database issue.
- Run 3 data taking readiness:
- AGLT2 fully updated and ready, some compute servers not yet received
- MWT2 fully updated and ready, some network gear not received (workaround in place)
- NET2 needs to update to OSG 3.6, support IPv6, get XRootD WAN access up, finish the network upgrade, and transition to storage entirely on Ceph with GPFS retired.
- SWT2 OU needs to get new hardware for the gatekeeper and SLATE into operation, and needs to update to OSG 3.6.
- SWT2 CPB needs to update to OSG 3.6, support IPv6, and remove LSM; some compute servers not yet received.
13:55
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
05/01/2022
There was a scheduled power shutdown in the UM Tier3 server room for facility maintenance; the shutdown lasted 6 hours. A couple of things broke during the shutdown, including the network card for one UPS unit and the containerd/network forwarding service on one of the nodes of the SLATE kubelet cluster. (The containerd failure was caused by a wrong configuration of net.ipv4.conf.default.forwarding and net.ipv4.conf.all.forwarding; they should be set to 1.) The kubelet node problem took down one of the squid servers hosted on the kubelet cluster; all traffic went to the other squid server and did not cause job failures.
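As a quick aid for this kind of failure, below is a minimal sketch that checks the two forwarding sysctls mentioned above; the /proc/sys paths are the standard Linux locations and the expected value of 1 comes from the note above.

```python
#!/usr/bin/env python3
# Minimal check that the IPv4 forwarding sysctls mentioned above are set to 1.
# The /proc/sys paths are the standard Linux locations; adjust if your nodes differ.
from pathlib import Path

SYSCTLS = [
    "/proc/sys/net/ipv4/conf/default/forwarding",
    "/proc/sys/net/ipv4/conf/all/forwarding",
]

def check_forwarding() -> None:
    for path in SYSCTLS:
        value = Path(path).read_text().strip()
        status = "OK" if value == "1" else "WRONG (expected 1)"
        print(f"{path} = {value} -> {status}")

if __name__ == "__main__":
    check_forwarding()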
05/02/2022
cfengine reverted the IP forwarding change on the SLATE kubelet cluster node sl-um-es5, so the squid service went down again. This caused many BOINC jobs to fail, as all BOINC clients are configured to use this proxy server. We switched the BOINC proxy server to sl-um-es3, which is located in the Tier2 server room and should be more robust. BOINC jobs started to refill the worker nodes after we changed the proxy, and we later fixed the sl-um-es5 node.
05/05/2022
During our annual renewal of the host certificates, we mistakenly requested the gatekeepers' host certificates from InCommon RSA instead of InCommon IGTF, which caused authentication errors on all gatekeepers for any incoming jobs. The change was made late in the afternoon and the error was not caught until the next morning, so the site drained overnight. We replaced the RSA certs with IGTF certs on the gatekeepers, and the site started to ramp up. During the 17-hour draining period, BOINC jobs ramped up as designed and filled the whole cluster, so the overall CPU time used by ATLAS jobs stayed about the same as before the draining.
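To catch this kind of mix-up earlier, here is a rough sketch that prints the issuer of each host certificate before deployment. It assumes the Python 'cryptography' package; the default /etc/grid-security/hostcert.pem path is only an example, and the string match on "IGTF" is a heuristic rather than a definitive check.

```python
#!/usr/bin/env python3
# Print the issuer of each host certificate so an InCommon RSA vs. InCommon IGTF
# mix-up is caught before the certs are deployed to the gatekeepers.
# Requires the 'cryptography' package; the default path below is only an example.
import sys
from cryptography import x509

def issuer_of(pem_path: str) -> str:
    with open(pem_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    return cert.issuer.rfc4514_string()

if __name__ == "__main__":
    for path in sys.argv[1:] or ["/etc/grid-security/hostcert.pem"]:
        issuer = issuer_of(path)
        # Heuristic flag: IGTF-profile InCommon CAs carry "IGTF" in the issuer name.
        flag = "" if "IGTF" in issuer else "  <-- check: not an IGTF issuer?"
        print(f"{path}: {issuer}{flag}")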
14:00
14:05
NET2 5m
Speaker: Prof. Saul Youssef
14:10
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- Running well, except for occasional XRootD overloads. Working with Andy and Wei to address this.
- OSCER maintenance today to upgrade SLURM (critical vulnerability). We didn't schedule a downtime for this because jobs will simply be held and then launched after the maintenance completes.
- Got very good opportunistic throughput over the last few days while the cluster was draining for maintenance: up to 5,500 slots total, which I think is a record for OU.
UTA:
- Still receiving R6225s from the last purchase; 50% have been delivered to the lab.
- Testing of the HTCondor-CE from OSG 3.6 began this morning.
- An odd node failure caused problems late last week:
- The failure prevented the node health check from running correctly.
- Jobs scheduled to the node failed to start.
- Failed jobs were held (they appear queued to HTCondor); see the sketch after this list.
- Pilot submission was choked off.
- Will find a permanent fix.
- Reasonable running over the last two weeks.
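The sketch referenced in the list above: a rough way to list held jobs and their hold reasons so a node that silently holds jobs is noticed quickly. It assumes the HTCondor Python bindings are installed and the local schedd is reachable; the attribute names are standard HTCondor job ClassAd attributes.

```python
#!/usr/bin/env python3
# Sketch: list held jobs and their hold reasons so a bad node that silently
# holds jobs (as described above) is noticed quickly.
# Assumes the HTCondor Python bindings and a reachable local schedd.
import htcondor

def report_held_jobs() -> None:
    schedd = htcondor.Schedd()
    held = schedd.query(
        constraint="JobStatus == 5",  # 5 == HELD
        projection=["ClusterId", "ProcId", "HoldReason", "LastRemoteHost"],
    )
    for job in held:
        print(f"{job['ClusterId']}.{job['ProcId']} "
              f"on {job.get('LastRemoteHost', '?')}: {job.get('HoldReason', '')}")
    print(f"{len(held)} held job(s) total")

if __name__ == "__main__":
    report_held_jobs()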
14:15 → 14:20
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
TACC
- Allocation essentially finished; we have 1500 SUs left, less than 1%. Will use the rest to experiment with HostedCEs.
NERSC
- Some recent job failures that we are looking into. There is a small permissions issue with shared ownership of the Harvester directory; it is not clear if this is related (a permissions-check sketch follows below).
- Ongoing work with the XRootD setup at NERSC.
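As a rough illustration of the kind of permissions check implied above, the sketch below verifies group ownership, group-writability, and the setgid bit on a shared directory. The path and group name are placeholders, not the actual NERSC/Harvester configuration.

```python
#!/usr/bin/env python3
# Sketch: check that a shared Harvester directory is group-owned as expected and
# has the setgid bit so new files inherit the group.
# The path and group below are placeholders, not the actual NERSC setup.
import grp
import os
import stat

SHARED_DIR = "/global/common/harvester"  # placeholder path
EXPECTED_GROUP = "atlas"                 # placeholder group

def check_shared_dir(path: str, group: str) -> None:
    st = os.stat(path)
    actual_group = grp.getgrgid(st.st_gid).gr_name
    setgid = bool(st.st_mode & stat.S_ISGID)
    group_writable = bool(st.st_mode & stat.S_IWGRP)
    print(f"{path}: group={actual_group} setgid={setgid} group_writable={group_writable}")
    if actual_group != group or not setgid or not group_writable:
        print("  -> ownership/permissions differ from what shared use expects")

if __name__ == "__main__":
    check_shared_dir(SHARED_DIR, EXPECTED_GROUP)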
14:20 → 14:35
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
14:20
14:25
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
14:30
Analysis Facilities - Chicago 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
A system is being set up to monitor Analysis Facility usage.
The AF metrics collector repository contains simple scripts to collect basic data (logged-in users, Jupyter logs, Condor users, jobs, etc.). The data is sent to the UC Logstash instance and then to Elasticsearch.
Currently only the UC AF sends data. An initial dashboard is available; a collector sketch follows below.
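For illustration, a minimal sketch of the kind of collector script described above: it counts logged-in users and ships the result as JSON to a Logstash HTTP input, which would then forward to Elasticsearch. The Logstash URL is a placeholder, not the actual UC endpoint, and the real scripts in the repository may differ.

```python
#!/usr/bin/env python3
# Sketch of a minimal AF metrics collector: count logged-in users and ship the
# result as JSON to a Logstash HTTP input (which forwards to Elasticsearch).
# The Logstash URL is a placeholder, not the actual UC endpoint.
import datetime
import json
import subprocess
import urllib.request

LOGSTASH_URL = "http://logstash.example.org:8080"  # placeholder endpoint

def collect_logged_in_users() -> dict:
    out = subprocess.run(["who"], capture_output=True, text=True, check=True).stdout
    users = {line.split()[0] for line in out.splitlines() if line.strip()}
    return {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "metric": "logged_in_users",
        "value": len(users),
        "users": sorted(users),
    }

def ship(doc: dict) -> None:
    req = urllib.request.Request(
        LOGSTASH_URL,
        data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

if __name__ == "__main__":
    ship(collect_logged_in_users())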
14:35 → 14:55
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Updates to ATLAS storage setup twiki (prompted by XRootd protocol access issues at NET2): https://twiki.cern.ch/twiki/bin/view/AtlasComputing/StorageSetUp
- Updating BNL XCache monitoring
14:35
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
14:40
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
14:45
Kubernetes R&D at UTA 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
The cluster is running fine. Grid jobs are reaching the workers but are stuck there in a waiting state. I looked into those pods, but the warning message in their descriptions was not very conclusive or helpful.
I also see one Calico pod (in the calico-system namespace) that is running but not showing healthy. Although the internal network provided by Calico is working fine overall, there seems to be some configuration issue, which must be the source of the problem with the stuck pods. A sketch for listing stuck pods and their events follows below.
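A sketch of the kind of debugging pass described above: list pods that are not Running or Succeeded along with their recent events, using the official Kubernetes Python client. It assumes a working kubeconfig on the node where it runs; the field selector and output format are only illustrative.

```python
#!/usr/bin/env python3
# Sketch: list pods that are not Running/Succeeded together with their recent
# events, to dig into the grid pods stuck in a waiting state described above.
# Assumes the official 'kubernetes' Python client and a working kubeconfig.
from kubernetes import client, config

def report_stuck_pods() -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for pod in v1.list_pod_for_all_namespaces().items:
        if pod.status.phase in ("Running", "Succeeded"):
            continue
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: phase={pod.status.phase}")
        selector = f"involvedObject.name={pod.metadata.name}"
        events = v1.list_namespaced_event(
            pod.metadata.namespace, field_selector=selector
        )
        for ev in events.items:
            print(f"  [{ev.type}] {ev.reason}: {ev.message}")

if __name__ == "__main__":
    report_stuck_pods()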
14:55 → 15:05
AOB 10m