US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2022-06-08T13:00:00-04:00
End: 2022-06-08T15:10:00-04:00
Location: No location set

Wednesday 8 Jun 2022, 13:00 → 15:10 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
- 13:10 → 13:20
  OSG-LHC 10m
  
  Minutes
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  Release (hopefully tomorrow)
  
  Gratia Probe 2.6.1
  
  HTCondor 9.0.13
  
  HTCondor-CE 5.1.5
  
  XCache 3.1.0
  
  xrootd-multiuser 2.0.4
  
  Miscellaneous
  
  Who is the contact for AGIS/CRIC? The OSG Central Collector's AGIS compatibility layer has been down for quite some time.
- 13:20 → 13:50
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    Discussion with ESnet about preparing for HL-LHC 30m
    
    Speakers: Dale Carder (Lawrence Berkeley National Lab), Eli Dart, Kate Robinson (ESnet)
    
    ESnet update - ATLAS Meeting 2022-06-08.pdf
- 13:50 → 13:55
  
  WBS 2.3.1 Tier1 Center 5m
  
  Minutes
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  
  Smooth running
  
  Chris Hollowell gave a talk about experience with Kubernetes at BNL at the pre-GDB meeting on 6/7 https://indico.cern.ch/event/1096043/
- 13:55 → 14:15
  WBS 2.3.2 Tier2 Centers
  
  Minutes
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  N_Jobs-20220608.png
  
  Success_20220608.png
  
  Transfers-20220608.png
  Reasonably good running since the last meeting, especially in the last week.
  
  Central ADC issues: Rucio DB outage on 5/27 for a few hours. Planned ~1 hour Oracle upgrade today (6/8), with the production system paused for the duration.
  
  AGLT2: Second incident with NFS server on 5/27.
  
  MWT2: Job starvation on 5/29 & 5/30.
  
  NET2: Recover from annual power 5/26 & 5/27. Transfer issues on 5/30. Several small drainings in June.
  
  SWT2: Transfer errors (?) and/or job starvation caused draining from 5/27 to 5/31.
  
  Draft document for entering FY22 & FY23 equipment purchases (FY23 not included yet) is at:
  
  https://docs.google.com/document/d/10zRzY8yWXCUY3CVG6T4pZk091raN8qkIUJaxnWn1nx0
  
  The charts in this document are linked from a reworked WLCG-v60 tab, which is where you should enter numeric data.
  
  https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4
  
  This will be part of the preparation for the pre-scrubbing.
  
  I'll be asking everyone about their readiness for Run 3 data taking.
  - 13:55
    
    AGLT2 5m
    
    Minutes
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
    
    Incidents
    
        05/25/2022
        2nd instance of the umfs02 NFS server VM problem (the server is used for osghome and our management files).
        It became inaccessible again with the same error:
        “NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s!”.
        Migrated it to a different VMware host.
    
        05/27
        Trouble with the nfs server umfs02 again
        Now suspecting latency from VMware iSCSI storage (TrueNAS), so moved VM to local NVMe storage
        Unfortunately this problem causes apparent high "load" and BOINC job control throttles back
    
        06/02
        umfs02 in trouble again
        Realized we were missing the NFS server tuning after the transition from physical to VM
        (/etc/nfs.conf, changed threads from the default 8 to 512).
        Problem solved.
    
    Hardware
    
        06/03
        UM site received the 10 R6525 work nodes ordered in Sep 2021,
        nodes racked/cabled/labeled and provisioned and put in production in 2 days.
    
    Software
    
        6/07
        Updated dCache from 7.2.15 to 7.2.16,
        and also updated kernel and firmware (rebooted to install BIOS updates).
        The process went well.
    
        6/08
        Update condor from 9.0.12 to 9.0.13 from the osg testing repository.
        This will cause an automatic rolling draining and condor restart.
        We also set condor to drain and wait on the C6420 work nodes, so we could reboot them to apply the new BIOS updates.
    
        Update HTCondor-CE from 5.1.3 to 5.1.5 from the osg testing repository
        on the test gatekeeper gate04.
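
A minimal sketch of how the NFS thread tuning from the 06/02 incident could be checked after a physical-to-VM migration. It assumes the standard `[nfsd]` section and `threads` key of `/etc/nfs.conf`; the helper name `nfsd_threads` and the sample text are hypothetical, and the default of 8 is the value quoted in the minutes, not something read from the system.

```python
import configparser


def nfsd_threads(conf_text, default=8):
    """Return the nfsd thread count configured in nfs.conf-style text.

    Falls back to `default` (8, as quoted in the minutes) when the
    [nfsd] section or the threads key is absent.
    """
    parser = configparser.ConfigParser(strict=False)
    parser.read_string(conf_text)
    return parser.getint("nfsd", "threads", fallback=default)


# Hypothetical sample mirroring the change described above (8 -> 512).
sample = """
[nfsd]
# threads=8
threads=512
"""

print(nfsd_threads(sample))  # prints 512
print(nfsd_threads(""))      # prints 8 (no [nfsd] section, default used)
```

On a real host the text would come from reading `/etc/nfs.conf`; a result equal to the default after a migration would suggest the tuning was not carried over.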
  - 14:00
    
    MWT2 5m
    
    Minutes
    
    Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    Continuing to debug hardware issues on one of our dCache pool nodes.
    
    Upgrading condor on our workers to 9.0.13.
    
    Fourth gatekeeper in production and receiving jobs.
    
    UC and IU Squids reconfigured for multicore (4 cores/squid) and have heartbeats added.
  - 14:05
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef
  - 14:10
    SWT2 5m
    
    Minutes
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA:
    
    OSG 3.6
    
    All production jobs are running on OSG 3.6 CE
    
    Backup CE needs to be updated
    
    IPv6
    
    Moved all network connections to the IPv6 switch
    
    Build procedure for hosts in place
    
    Requesting IPv6 addresses (w/o DNS entries, for the moment)
    
    Testing still in progress
    
    GridFTP/LSM
    
    Disabled GridFTP as WAN transfer protocol
    
    Had to enable root protocol as WAN/1 to keep LSM
    
    Testing the removal of LSM is tricky (managed to offline production queue twice)
    
    Rucio mover/pilot may not be able to use internal ROOT door due to URL being registered.
    
    OU:
    
    - Still working with Dell to get the RAID6 array fixed on cstore13. In the meantime, xrootd is working fine using the copy of the data from that data server on the ourdisk Ceph scratch partition.
- 14:15 → 14:20
  WBS 2.3.3 HPC Operations 5m
  
  Minutes
  
  Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
  Cori operating normally
  
  Still waiting on validation samples for Perlmutter
- 14:20 → 14:35
  WBS 2.3.4 Analysis Facilities
  
  Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:20
    Analysis Facilities - BNL 5m
    
    Minutes
    
    Speaker: Ofer Rind (Brookhaven National Laboratory)
    
    Multiple topics (metrics, EOS mounting, Discourse, ...) on tap for tomorrow's 2.3/5 meeting
    
    Interesting presentation by Ricardo at last week's HSF AF Forum
  - 14:25
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:30
    
    Analysis Facilities - Chicago 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- 14:35 → 14:55
  WBS 2.3.5 Continuous Operations
  
  Minutes
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  GridFTP protocols removed from remaining RSEs
  
  Nevis_SE_0, NERSC-PDSFSRM_SE_0 now set to HTTP
  
  NERSC-PDSF_SE_0 now set to GLOBUS
  
  SWT2_CPB_SE_0 write_wan/1 now XROOTD
  
  No change in settings for non-Rucio pilot copytools
  
  ADCR Database intervention this morning. Rucio, Panda, CRIC all down from 9-10 CEST.
  
  Pre-GDB on k8s earlier this week (talks by Lincoln on SLATE, Chris on BNL OKD deployments)
  - 14:35
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-6_1_22.pdf
    
    US-cloud-summary-6_8_22.pdf
  - 14:40
    Service Development & Deployment 5m
    
    Minutes
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    BNL XCache updated to 5.4.3rc4
    
    Other caches work fine
    
    VP works fine
    
    A lot of VP jobs running at NET2
    
    Starting preparations for ES upgrade.
    
    Performance upgrade for ServiceX getting ready.
  - 14:45
    
    Kubernetes R&D at UTA 5m
    
    Minutes
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    Last week the cluster network was still misbehaving; at the end of the week Patrick replaced the switch that was locking up, which resolved the issue. In the Calico network configuration I modified IP_AUTODETECTION_METHOD, which was a possible suspect, and the system reported that it was updated. The process recreated the Calico pod for that node, but it is not clear that it did the trick (something could have overridden the parameter); at the least, it didn't resolve the connectivity issue for that Calico pod.
- 14:55 → 15:05
  
  AOB 10m
