US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
1
WBS 2.3 Facility Management News
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
2
OSG-LHC
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- cvmfs-2.9.3 in testing
- scitokens-cpp has an FD leak, which will affect CEs and potentially XRootD hosts. Working with upstream to get a package into epel-testing ASAP
- Working on an xrootd shoveler bugfix release
-
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
- 3
-
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
4
AGLT2
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
The UM site had IPv6 issues after Merit's hardware maintenance, so we took the UM condor cluster offline to avoid failing more jobs. Merit resolved the issue the next day.
We found a subdirectory ownership problem on the condor-ce that had been causing 20% of SAM test jobs to fail (the site had only 75% reliability and availability in May). The ownership problem was introduced in late April while we were fixing the ownership of the gratia directories.
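An ownership regression like the one described above can be caught with a small recursive check. The following is a minimal sketch, not the procedure AGLT2 used: the function name `find_wrong_owner` is hypothetical, and the directory and account that should own it are assumptions that would need to be filled in for the actual condor-ce host.

```python
import os


def find_wrong_owner(root, expected_uid):
    """Return every path under root (root included) not owned by expected_uid."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk yields each directory as dirpath, so checking dirpath plus
        # its files covers the whole tree exactly once.
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            # lstat reports a symlink's own owner instead of following it.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched
```

To use it, look up the numeric UID of the account that should own the tree (for example with `pwd.getpwnam(name).pw_uid`) and pass it together with the directory to audit. An empty list means the ownership is consistent.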
-
5
MWT2
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- Still troubleshooting downed storage machine.
- Declared some more lost files.
- Retiring ANALY_MWT2_GPU
- Holding off on updating to condor 9.0.13 until we work out an issue where removal of the condor-externals RPM causes jobs to fail.
- Working on fixing an issue where the IU squid restarts periodically.
- Running ALRB testing version on condor
-
6
NET2
Speaker: Prof. Saul Youssef
-
7
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
-
8
WBS 2.3.3 HPC Operations
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
-
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
9
Analysis Facilities - BNL
Speaker: Ofer Rind (Brookhaven National Laboratory)
- Preparing for pre-scrubbing
- Doug, Lincoln, Ofer met with EOS team at CERN
  - Okay to expand usage of fuse mounts
- Doug, Ofer met with Oksana Shadura and Alex Held at CERN
  - Will support efforts for development standards to support OKD
  - Will collaborate to get demo analyses running at BNL for testing/benchmarking
  - Oksana has already gotten access using FNAL login to federated jupyterhub
- Doug, Ofer met with Elena Gazzarini at CERN (taking over for Riccardo)
  - Discussed collaborating on development of DLAAS tools
-
10
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
11
Analysis Facilities - Chicago
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- We upgraded Kubernetes by two minor versions, from 1.21 to 1.23
- We upgraded HTCondor to 9.0.13 with OSG 3.6 on head/login nodes
- We did a yum update of all packages on the AF login and head nodes, including the latest mainline kernel from ELRepo
- We are still upgrading workers in the background
- We deferred the CephFS upgrade from v16 (Pacific) to v17 (Quincy): one node (c001) appears to have a hardware error, with all disks reporting "I/O error" when we try to mount them. We need to get the cluster clean before upgrading to a new major version.
-
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
12
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13
Service Development & Deployment
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
14
Kubernetes R&D at UTA
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
In the Calico network configuration, the change to the IP_AUTODETECTION_METHOD parameter (a possible suspect) was being accepted, but the Calico pod on the master node did not show the update propagating correctly; something appeared to be overriding the change.
Lincoln suggested that it might be the Calico operator running in the background, and indeed, making the update at the operator level flipped that pod to healthy. All K8s components are now healthy, but my submitted jobs are still waiting in the ContainerCreating state. I think I know the reason and am working on a fix.
-
15
AOB
-
