US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2022-06-22T13:00:00-04:00
End: 2022-06-22T15:10:00-04:00
Location: No location set

Wednesday 22 Jun 2022, 13:00 → 15:10 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 1
  
  WBS 2.3 Facility Management News
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
- 2
  OSG-LHC
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  cvmfs-2.9.3 in testing
  
  scitokens-cpp has a file descriptor (FD) leak, which will affect CEs and potentially XRootD hosts. Working with upstream to get a package into epel-testing ASAP.
  
  https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2022-1a3ee1e251
  
  https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2022-033762bcf7
  
  Working on an xrootd shoveler bugfix release
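
For sites that want to check whether a CE or XRootD host is affected by the FD leak noted above before the fixed package arrives, the following is a minimal sketch and not an OSG tool. It assumes a Linux host with `/proc` mounted; the function names `count_open_fds` and `looks_like_leak` and the `min_growth` threshold are hypothetical and were chosen only for illustration.

```python
import os


def count_open_fds(pid="self"):
    """Count open file descriptors of a process via /proc (Linux only).

    Assumption: /proc is mounted and the caller may read /proc/<pid>/fd.
    The directory handle used for the listing itself is included in the count.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def looks_like_leak(samples, min_growth=5):
    """Heuristic: FD counts never decrease and grow by at least min_growth.

    'samples' is a list of counts taken at successive times. The threshold
    is an illustrative assumption, not a value from the minutes.
    """
    if len(samples) < 2:
        return False
    never_decreases = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_decreases and (samples[-1] - samples[0]) >= min_growth


if __name__ == "__main__":
    # Example: sample this process once; for a real daemon, pass its PID
    # and collect samples periodically (for example from cron).
    print("open FDs for this process:", count_open_fds())
```

A steadily growing count for the daemon's PID over hours, with no matching growth in load, would be consistent with a leak; a count that rises and falls with activity would not.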
- Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
- 3
  
  WBS 2.3.1 Tier1 Center
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  
  No major issue
  
  Preparing for the pre-scrubbing
- WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  Error_Burst-20220622.png
  
  N_jobs_20220622.png
  
  Success_20220622.png
  
  Transfers_20220622.png
  Pretty good two weeks.
  
  MWT2 drained while testing the new ALRB release because of an issue with passing environment variables to the jobs. This is now fixed and the new ALRB seems good. Asoka wants to test for another few days.
  
  There was an incident last night affecting all ATLAS sites except those on NorduGrid.
  - 4
    
    AGLT2
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
    
    The UM site had IPv6 issues after Merit's hardware maintenance, so we had to take the UM condor cluster offline to avoid failing more jobs. Merit resolved the issue the next day.
    
    We found a subdirectory ownership issue on the condor-ce that had been causing 20% of SAM test jobs to fail (the site had only 75% reliability and availability in May). The ownership issue was introduced in late April when we were trying to fix the ownership of the gratia directories.
  - 5
    MWT2
    
    Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    Still troubleshooting downed storage machine.
    
    Declared some more lost files.
    
    Retiring ANALY_MWT2_GPU
    
    Waiting on updating to condor 9.0.13 until we work out an issue with the condor-externals RPM removal causing jobs to fail.
    
    Working on fixing an issue where the IU squid restarts periodically.
    
    Running ALRB testing version on condor
  - 6
    
    NET2
    
    Speaker: Prof. Saul Youssef
  - 7
    
    SWT2
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA:
    
    Deployed new storage (3.2PB) and retired ~600TB of oldest storage
    
    Cabled, installed, and characterized the R6525 nodes; now bringing them into the batch system. This provides 48 nodes x 96 slots (1455 HEPSpec per machine).
    
    Testing beginning on IPv6
    
    OU:
    
    Nothing to report, all running well.
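
The UTA figures above imply some totals that are not stated in the minutes. The short calculation below derives them from the reported numbers only. It assumes decimal units (1 PB = 1000 TB) and treats the "~600TB" of retired storage as exactly 0.6 PB, so the net storage figure is approximate.

```python
# Values reported in the UTA update.
NODES = 48
SLOTS_PER_NODE = 96
HEPSPEC_PER_NODE = 1455
NEW_STORAGE_PB = 3.2
RETIRED_STORAGE_PB = 0.6  # "~600TB" in the minutes; approximate, decimal units assumed

# Derived totals (not stated in the minutes).
total_slots = NODES * SLOTS_PER_NODE                   # 48 * 96 = 4608
total_hepspec = NODES * HEPSPEC_PER_NODE               # 48 * 1455 = 69840
hepspec_per_slot = HEPSPEC_PER_NODE / SLOTS_PER_NODE   # 1455 / 96
net_storage_pb = NEW_STORAGE_PB - RETIRED_STORAGE_PB   # net change in capacity

print(f"total slots:      {total_slots}")
print(f"total HEPSpec:    {total_hepspec}")
print(f"HEPSpec per slot: {hepspec_per_slot:.2f}")
print(f"net storage (PB): {net_storage_pb:.1f}")
```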
- 8
  WBS 2.3.3 HPC Operations
  
  Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
  NERSC
  
  Cori running fine
  
  Perlmutter integrated last week & running fine over the weekend
  
  1 job per node for right now
  
  There was an issue with Globus 5 credentials in the last day; it is fixed now.
  
  It is not clear how to renew endpoint credentials with Globus 5 before they expire; investigating.
- WBS 2.3.4 Analysis Facilities
  
  Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
  - 9
    Analysis Facilities - BNL
    
    Speaker: Ofer Rind (Brookhaven National Laboratory)
    
    Preparing for pre-scrubbing
    
    Doug, Lincoln, Ofer met with EOS team at CERN
    
    Okay to expand usage of fuse mounts
    
    Doug, Ofer met with Oksana Shadura and Alex Held at CERN
    
    Will support efforts for development standards to support OKD
    
    Will collaborate to get demo analyses running at BNL for testing/benchmarking
    
    Oksana has already gotten access using FNAL login to federated jupyterhub
    
    Doug, Ofer met with Elena Gazzarini at CERN (taking over for Riccardo)
    
    Discussed collaborating on development of DLAAS tools
  - 10
    
    Analysis Facilities - SLAC
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 11
    Analysis Facilities - Chicago
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    We upgraded Kubernetes two minor versions, from 1.21 to 1.23
    
    We upgraded HTCondor to 9.0.13 with OSG 3.6 on head/login nodes
    
    We did a yum update of all packages on the AF login and head nodes, including the latest mainline Kernel from ELRepo
    
    We are still upgrading workers in the background
    
    We deferred the CephFS upgrade from v16 (Pacific) to v17 (Quincy): we found one node (c001) with what seems to be a hardware error, with all of its disks reporting "I/O error" when we try to mount them. We need to get the cluster clean before upgrading to the next major version.
- WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  Working on pre-scrubbing slides and OTP
  
  Quiet week, except for ~2h downtime last night due to Panda/Harvester servers going offline
  
  Less than 2 weeks to physics collisions....
  - 12
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-6_15_22.pdf
    
    US-cloud-summary-6_22_22.pdf
  - 13
    Service Development & Deployment
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    XCache
    
    all working fine
    
    LRZ in downtime
    
    waiting for 4.5
    
    VP
    
    NET2 is running a lot of VP jobs
    
    Investigating some new http caching tools.
    
    Squid Fed Ops
    
    SLATE team working on testing new federation controller. Will migrate to using this in the next couple of weeks.
  - 14
    
    Kubernetes R&D at UTA
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    In the Calico network configuration, the change to the IP_AUTODETECTION_METHOD parameter (the likely suspect) was being accepted, but the Calico pod on the master node showed that the update was not propagating correctly (something appeared to be overriding the change).
    Lincoln suggested that it might be the Calico operator running in the background, and indeed, making the update at the operator level flipped that pod to healthy. All K8s components are now healthy, though the jobs I submitted are still waiting in the ContainerCreating state. I think I know the reason and am working on a fix.
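
To illustrate what "making the update at the operator level" can look like, here is a hedged sketch, not the change actually applied at UTA. It assumes the cluster uses the Tigera operator with an `Installation` resource named `default`, where address autodetection is configured under `spec.calicoNetwork.nodeAddressAutodetectionV4` rather than through the `IP_AUTODETECTION_METHOD` environment variable on the calico-node pods. The interface pattern `eth.*` and the helper names are hypothetical. The script only builds and prints a `kubectl patch` command; it does not contact any cluster.

```python
import json
import shlex


def build_autodetect_patch(method_field, value):
    """Build a JSON merge patch for the operator's Installation resource.

    Assumption: method_field is one of the operator's autodetection fields
    (for example "interface"); the value used below is purely illustrative.
    """
    return {
        "spec": {
            "calicoNetwork": {
                "nodeAddressAutodetectionV4": {method_field: value}
            }
        }
    }


def kubectl_patch_command(patch, resource="installation", name="default"):
    """Return the kubectl argv that would apply the merge patch."""
    return ["kubectl", "patch", resource, name,
            "--type=merge", "-p", json.dumps(patch)]


# Hypothetical choice: select interfaces whose names match "eth.*".
patch = build_autodetect_patch("interface", "eth.*")
cmd = kubectl_patch_command(patch)

# Print the command for review instead of running it.
print(shlex.join(cmd))
```

Because the operator reconciles the calico-node configuration from this resource, a change made only on the pods can be reverted, which would match the behavior described above. If another autodetection method is already set in the resource, it may need to be cleared (set to null in the merge patch) so that only one method remains.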
- 15
  
  AOB