US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- cvmfs-2.9.3 in testing
- scitokens-cpp has an FD leak, which will affect CEs and potentially XRootD hosts. Working with upstream to get a package into epel-testing ASAP
- Working on an xrootd shoveler bugfix release
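Until the fixed scitokens-cpp package is available, a site could watch for the FD leak by counting the open file descriptors of a CE or XRootD process over time. The sketch below is only an illustration and is not taken from the upstream fix: it assumes a Linux host with /proc mounted, and the function name count_open_fds is hypothetical.

```python
import os


def count_open_fds(pid):
    """Return the number of open file descriptors for pid (Linux /proc only)."""
    # Each entry in /proc/<pid>/fd is one descriptor held by the process.
    return len(os.listdir("/proc/{}/fd".format(pid)))


if __name__ == "__main__":
    # Demonstrate on the current process: opening a file adds one descriptor.
    own_pid = os.getpid()
    before = count_open_fds(own_pid)
    with open(os.devnull) as handle:
        during = count_open_fds(own_pid)
    print("before:", before, "while one extra file is open:", during)
```

A count that keeps growing across samples for a long-running daemon, rather than levelling off, is the pattern a leak would produce.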
-
13:20
→
13:50
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
- 13:50 → 13:55
-
13:55
→
14:15
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
-
13:55
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
The UM site had IPv6 issues after Merit's hardware maintenance, so we took the UM condor cluster offline to avoid failing more jobs. Merit resolved the issue the next day.
We found a subdirectory ownership issue on the condor-ce that had been causing 20% of SAM test jobs to fail (the site had only 75% Reliability and Availability in May). The ownership issue was introduced in late April while we were trying to fix the ownership of the gratia directories.
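An ownership regression like this one can be caught by scanning a directory tree for entries not owned by the expected account. The following is a minimal sketch, not the check used at AGLT2: the function name find_wrong_owner is hypothetical, and the directory to scan and the expected UID must be supplied by the site.

```python
import os


def find_wrong_owner(root, expected_uid):
    """Return paths under root (root included) not owned by expected_uid."""
    mismatched = []
    # lstat inspects the entry itself, so symlinks are not followed.
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched


if __name__ == "__main__":
    # Scan the current directory against the UID of the invoking user.
    for path in find_wrong_owner(".", os.getuid()):
        print(path)
```

Running such a scan right after a bulk ownership change would show any directories that were touched unintentionally.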
-
14:00
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- Still troubleshooting downed storage machine.
- Declared some more lost files.
- Retiring ANALY_MWT2_GPU
- Holding off on the update to condor 9.0.13 until we work out an issue where removing the condor-externals RPM causes jobs to fail.
- Working on fixing an issue where the IU squid restarts periodically.
- Running ALRB testing version on condor
-
14:05
NET2 5m
Speaker: Prof. Saul Youssef
-
14:10
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
-
14:15
→
14:20
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
-
14:20
→
14:35
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
- Preparing for pre-scrubbing
- Doug, Lincoln, Ofer met with EOS team at CERN
  - Okay to expand usage of fuse mounts
- Doug, Ofer met with Oksana Shadura and Alex Held at CERN
  - Will support efforts for development standards to support OKD
  - Will collaborate to get demo analyses running at BNL for testing/benchmarking
  - Oksana has already gotten access using FNAL login to federated jupyterhub
- Doug, Ofer met with Elena Gazzarini at CERN (taking over for Riccardo)
  - Discussed collaborating on development of DLAAS tools
-
14:25
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:30
Analysis Facilities - Chicago 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- We upgraded Kubernetes by two minor releases, from 1.21 to 1.23
- We upgraded HTCondor to 9.0.13 with OSG 3.6 on head/login nodes
- We did a yum update of all packages on the AF login and head nodes, including the latest mainline kernel from ELRepo
- We are still upgrading workers in the background
- We deferred the CephFS upgrade from v16 (Pacific) to v17 (Quincy): we found one node (c001) with what seems to be a hardware error, where all disks report "I/O error" when we try to mount them. We need to get the cluster clean before upgrading to a new major version.
-
14:35
→
14:55
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
14:35
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:40
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
14:45
Kubernetes R&D at UTA 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
In the Calico network configuration, the change to the parameter IP_AUTODETECTION_METHOD (the likely suspect) was being accepted, but the Calico pod on the master node did not show the update propagating correctly (it looked like something was overriding the change).
Lincoln suggested that it might be the Calico operator running in the background, and indeed, making the update at the operator level flipped that pod to healthy. All K8s components are now healthy, though submitted jobs are still waiting in the ContainerCreating state. I think I know the reason and am working on a fix.
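When the Calico operator manages the cluster, it reconciles the calico-node settings from its Installation resource, which would explain why a direct change to IP_AUTODETECTION_METHOD gets overridden. As a rough sketch of what "making the update at the operator level" can look like, the code below translates an IP_AUTODETECTION_METHOD value into a JSON merge patch for the nodeAddressAutodetectionV4 field. The value actually used at UTA is not recorded here, so "interface=eth.*" is a placeholder, and the function name autodetection_patch is hypothetical.

```python
import json

# Keys of nodeAddressAutodetectionV4; only one should be set at a time.
_ALL_KEYS = ("firstFound", "kubernetes", "interface",
             "skipInterface", "canReach", "cidrs")


def autodetection_patch(method):
    """Build an Installation merge patch from an IP_AUTODETECTION_METHOD value."""
    key, _, value = method.partition("=")
    # Start with every key set to None: in a JSON merge patch, null removes
    # a previously configured method so two methods are never set together.
    detection = dict.fromkeys(_ALL_KEYS)
    if key == "first-found":
        detection["firstFound"] = True
    elif key == "kubernetes-internal-ip":
        detection["kubernetes"] = "NodeInternalIP"
    elif key == "interface":
        detection["interface"] = value
    elif key == "skip-interface":
        detection["skipInterface"] = value
    elif key == "can-reach":
        detection["canReach"] = value
    elif key == "cidr":
        detection["cidrs"] = value.split(",")
    else:
        raise ValueError("unrecognised autodetection method: " + method)
    return {"spec": {"calicoNetwork": {"nodeAddressAutodetectionV4": detection}}}


if __name__ == "__main__":
    # Placeholder value, not the setting used at UTA.
    print(json.dumps(autodetection_patch("interface=eth.*")))
```

The printed JSON could then be passed to kubectl patch with --type merge against the Installation resource, which is usually named default (an assumption to verify on the cluster in question).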
-
14:55
→
15:05
AOB 10m
