Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
Updates on US Tier-2 centers
05/12/2022
Updated condor from 9.0.11 to 9.0.12
Updated the Gratia probes on all gatekeepers. The Gratia probe stopped working for a day after the upgrade; it was fixed by reconfiguring, then manually restarting condor-ce and running
su - condor -c "/usr/share/gratia/htcondor-ce/condor_meter"
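The manual recovery above might look like the following on a gatekeeper (a sketch based on the minutes; the log path is an assumption and may vary by Gratia version):

```shell
# Restart the CE so the probe's condor-ce integration reloads:
systemctl restart condor-ce

# Run the HTCondor-CE Gratia probe once by hand as the condor user:
su - condor -c "/usr/share/gratia/htcondor-ce/condor_meter"

# Check that accounting records are flowing again (log path assumed):
tail -n 20 /var/log/gratia/condor_meter.log
```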
05/17/2022
We migrated the Tier2 NFS server umfs02 to a virtual machine without taking downtime. This NFS server provides the home directory for all grid users. The migration hit two problems: 1) the MSU worker nodes could not mount the new NFS server because of routing issues; we added routing rules as a workaround. 2) This NFS server also serves as the archive directory for the dCache PostgreSQL databases' hot standby replication. For one of the database servers (head01), the hot standby replication did not transition smoothly during the 20-minute window when the NFS servers were swapped, so we ended up reseeding the database from head01 to its hot standby server d-head01.
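Reseeding a PostgreSQL hot standby such as the one described above is typically done with pg_basebackup; a minimal sketch (host names are from the minutes, but the data directory, replication user, and service name are assumptions):

```shell
# On the standby (d-head01): stop postgres and clear the stale data directory.
systemctl stop postgresql
rm -rf /var/lib/pgsql/data/*   # assumed data directory

# Take a fresh base backup from the primary (head01) with streamed WAL:
pg_basebackup -h head01 -U replication -D /var/lib/pgsql/data \
    --wal-method=stream --progress --checkpoint=fast

# Re-enable standby mode (recovery.conf on PostgreSQL < 12), then restart:
systemctl start postgresql
```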
We converted all 26 remaining SL7 servers at the UM site to CentOS 7, including all of the dCache pool nodes and Lustre storage nodes.
05/21/2022
The new NFS server (virtual machine) umfs02 became unreachable; increasing its memory and CPU allocation restored the service. The site drained to 10% usage on the 21st because of this incident.
05/23/2022
Gratia on the OSG gatekeeper (gate02) stopped working for 2 days. Restarting the condor-ce service fixed it.
MSU finished installing and phasing in the 3x new VMware AMD host nodes (ordered Sept 2021).
They are still using the old direct-attach SAS storage; the last step will be to start using the new NVMe storage via iSCSI (also received from the 2021 order).
OU:
- One of our 7 xrootd storage servers is having RAID6 issues, so we copied all of its contents to the OSCER Ceph scratch space and pointed xrootd there while we re-create the RAID6 array from scratch with two new drives; we will then copy everything back. This should take a few days.
- xrootd pointed at the ceph copy seems to work fine.
- We prevented new data from being stored on that server during this maintenance.
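Repointing xrootd at the Ceph copy and blocking new writes, as described above, can both be done in the xrootd configuration; a sketch (the config file path and mount point are assumptions):

```
# /etc/xrootd/xrootd-clustered.cfg (fragment, paths hypothetical)
# Serve files from the Ceph scratch copy instead of the local RAID6 array:
oss.localroot /mnt/oscer-ceph/xrootd
# Export the namespace read-only so no new data lands here during maintenance:
all.export / r/o
```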
SWT2_CPB:
- Installing compute nodes from our purchase earlier this year (48 nodes total).
- Still awaiting delivery of WN's from the previous purchase! Dell claims it's imminent...
- Working to finalize scheduling for our remaining to-do's in Fred's list.
- The partition holding the Slurm DB filled up on 5/20 (GGUS 157319). It took a while to clean the area and remove the debris. We will implement some configuration changes to avoid a recurrence.
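One common configuration change for keeping the accounting database from filling its partition is to have slurmdbd purge old records; a sketch of the relevant slurmdbd.conf settings (the retention periods here are illustrative assumptions, not the site's actual choice):

```
# /etc/slurm/slurmdbd.conf (fragment)
# Purge old accounting records so the DB partition stays bounded:
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=6months
PurgeStepAfter=6months
PurgeSuspendAfter=6months
```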
Started looking into the Calico network configuration; Lincoln suggested which parameters to modify first to see whether that fixes the issue. While working on that, we noticed a general networking problem.
We are waiting for that to be fixed before moving forward with any network-related configuration changes on the K8s side.
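Before changing any Calico parameters, a quick sanity check of the current state can help separate Calico issues from the general networking problem mentioned above (a sketch; assumes calicoctl is installed and configured against the cluster):

```shell
# Show the configured IP pools and their settings (CIDR, NAT, IPIP/VXLAN mode):
calicoctl get ippool -o wide

# Check BGP peering / node-to-node mesh health from this node's perspective:
calicoctl node status
```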