US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00 → 13:10
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
The USATLAS technical meeting and pre-scrubbing in Bloomington is fast approaching: https://indico.cern.ch/event/1273590/
- First draft of L3 presentations due by COB this Friday
- Final versions of L3 presentations due by Friday, June 2
There will be a day devoted to LHC computing at the upcoming Throughput Computing 2023 (HTCondor Week + OSG All-Hands Meeting) on Wednesday, July 12th: a morning session for ATLAS and a joint afternoon session with CMS. Info on the event at https://path-cc.io/htc23
-
13:10 → 13:20
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- Remember to register for HTC '23 (https://path-cc.io/htc23). Schedule here https://agenda.hep.wisc.edu/event/2014/abstracts/
- New OSG Software Team member Matt Westphall started this week
- Expect a new HTCondor 9 release containing tools to help detect potential issues in a 9 -> 10 upgrade, followed by an HTCondor 10 upgrade about a week later.
- InCommon IGTF CA v2 is steadily making its way through the IGTF process, and you can expect a release in the coming months. For the time being, please avoid issuing v2 certs.
- Working on the workaround for the SHA1-signed certs issue with tight OS crypto defaults (https://opensciencegrid.atlassian.net/browse/SOFTWARE-5365). Anyone willing to give it a spin when we have a package ready? (A quick crypto-policy check is sketched below.)
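Not the OSG-packaged workaround itself, just a quick way to check whether a RHEL-family host's system-wide crypto policy is the one that rejects SHA-1 signatures (assumes the standard crypto-policies-scripts tooling is installed):

# Check the active system-wide crypto policy on a RHEL/Alma/Rocky host.
# On strict policies (e.g. plain DEFAULT on EL9), SHA1-signed certificates
# fail verification; LEGACY or a SHA1 subpolicy re-enables them.
import subprocess

policy = subprocess.run(
    ["update-crypto-policies", "--show"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print("active crypto policy:", policy)

if policy != "LEGACY" and "SHA1" not in policy:
    print("SHA-1 signatures are likely rejected on this host")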
-
13:20 → 13:40
Convener: Ofer Rind (Brookhaven National Laboratory)
- Some job drainage overnight Monday (cause?)
- Issues with replication of CVMFS nightlies at BNL affecting ART jobs; under investigation (JIRA, GGUS). A replication-lag check is sketched after this list.
- Intervention for ATLAS IAM DB migration planned Thursday overnight; estimated 7-minute downtime, should be fine according to P. Vokac (OTG)
- Squid and XRootD issues at OU, plus network problems over the weekend.
- Squid still does not appear on the monitoring page
- Deploying ATLAS token support for CERN EOS, planned for 5/31 (restart of Management and Metadata Server)
- Preparation for DC24
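A minimal sketch of the kind of replication-lag check one could script for the nightlies repository; the hostnames below are placeholders (the actual stratum-0/stratum-1 endpoints are not named in these minutes), and the repository name atlas-nightlies.cern.ch is assumed:

# Compare CVMFS repository revisions between a source server and a replica
# by reading the 'S' (revision) field of each .cvmfspublished manifest.
import urllib.request

REPO = "atlas-nightlies.cern.ch"
SERVERS = {
    "source":  f"http://cvmfs-source.example.org/cvmfs/{REPO}",   # placeholder
    "replica": f"http://cvmfs-replica.example.org/cvmfs/{REPO}",  # placeholder
}

def revision(base_url: str) -> int:
    """Return the repository revision published at base_url."""
    with urllib.request.urlopen(f"{base_url}/.cvmfspublished", timeout=10) as resp:
        for line in resp.read().decode(errors="replace").splitlines():
            if line.startswith("--"):      # signature block starts here; stop
                break
            if line.startswith("S"):
                return int(line[1:])
    raise RuntimeError(f"no revision field in manifest at {base_url}")

revs = {name: revision(url) for name, url in SERVERS.items()}
print(revs, "| replica behind by", revs["source"] - revs["replica"], "revision(s)")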
- 13:20
-
13:25
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
13:30
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
- Both clusters (SWT2_CPB_K8S & SWT2_CPB_K8S_TEST) have been running fine.
- Have been working on Prometheus monitoring. Installed it on SWT2_CPB_K8S_TEST using Helm charts; in addition, had to set up Persistent Volumes, and after some configuration adjustments Prometheus was working fine (a quick health check is sketched after this list).
- On SWT2_CPB_K8S we were starting to hit disk pressure on the master node when trying to set up new components, so going forward we needed to increase the size of the main partition for the new cluster, and the only way to do that was to reinstall the node.
- So the SWT2_CPB_K8S_TEST cluster was rebuilt and K8s was installed again. All went smoothly, and the cluster is running fine.
- Started draining the SWT2_CPB_K8S cluster in preparation for switching to the new cluster and scaling it up (to about 1000 cores). Patrick is preparing a diverse set of machines for that; most probably this will all be completed this week.
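A minimal health check one could run against the new Prometheus instance; the endpoint is a placeholder for whatever NodePort/ingress the Helm chart exposes on SWT2_CPB_K8S_TEST:

# Poll Prometheus' liveness endpoint and count scrape targets reporting up == 1.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.local:9090"   # placeholder endpoint

with urllib.request.urlopen(f"{PROM}/-/healthy", timeout=5) as resp:
    print("healthy:", resp.status == 200)

query = urllib.parse.urlencode({"query": "sum(up)"})
with urllib.request.urlopen(f"{PROM}/api/v1/query?{query}", timeout=5) as resp:
    result = json.load(resp)["data"]["result"]
print("targets up:", result[0]["value"][1] if result else 0)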
- 13:40 → 13:45
-
13:45 → 14:05
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running over the past 30 days...
- ADC has had trouble at times supplying sufficient work.
- AGLT2 and MWT2 have received much or all of their FY23 orders.
- NET2 is close to being in operation but I will let them explain that.
- I will be preparing the Tier 2 pre-scrubbing slides over the next couple of days.
- NET2 will present their status at the pre-scrubbing.
- All other sites should let me know if they have any input for the pre-scrubbing.
-
13:45
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen
2023 equipment purchase:
Added 11x R6525 with AMD 7443 (96 HT/node)
measured HEPscore = 17.48, HS06 = 17.75 (per HT)
total added: 1056 cores, 18.5k HEPscore / 18.7k HS06 (see the arithmetic sketch after the table below)
note: new PERC H355 not supported by OMSA on CentOS7
Retired 32x 24HT R410s (768 cores / 6.6k HS06) 9.84
Retired 24x 32HT R620s (736 cores / 8.1k HS06) 11.03
Retired 5x 40HT R620s (200 cores / 2.1k HS06) 10.68
(retirements above may not be exact, will correct after meeting)
Total change expected to be about neutral (~18.7k HS06 added vs ~16.8k HS06 retired)
Also added 7x of 12x R740xd2 at UM
(each with 24x 20TB disks= 370 TB usable in dcache)
4x R740xd2 at MSU waiting on network config (promised for tonight)
The other 5x R740xd2 at UM to compare RAID6 to JBOD/raidz3
Retired 4x MD3xxx shelves of 8T disks (~1.4 PB)
Waiting on MSU to be deployed before sitewide space re-balance
Did NOT YET increase dcache advertised space
Ultimately the grand total change will be about +4.5 PB
Events:
10-May: ZFS problem on NVMe mirror holding dcache database (Ticket 161890).
After file system recovery one postgres file remained flagged as possibly corrupted.
Recovered from backup/mirroring node.
18-May: MSU Data Center heats up during regular/yearly Fire Alarm testing.
All newer/hotter Worker Nodes (C6420s and R6525s) shut themselves down.
Unexpected. Could be operator error but no official report yet.
For fun/curiosity: coarse comparison HS06 vs HEPscore (as of May 2023)
| CPU      | 6132    | 6240R   | 7302    | 7413    | 7443    |
| Clock    | 2.60GHz | 2.40GHz | 3.00GHz | 2.65GHz | 2.85GHz |
| Tot HT   | 56      | 96      | 64      | 96      | 96      |
|----------+---------+---------+---------+---------+---------|
| HS06/HT  | 13.64   | 10.94   | 16.42   | 17.28   | 17.75   |
| HEPS/HT  | 13.16   | 12.22   | 16.17   | 16.97   | 17.48   |
|----------+---------+---------+---------+---------+---------|
note: HEPscore measured as average of 2 runs on only 1 node each
(except 7443 with 2 runs on 5 nodes)
HS06 taken from US facility spreadsheet for AGLT2.
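For reference, a trivial sketch reproducing the quoted totals for the new R6525 purchase from the per-HT scores above (all numbers are taken directly from this report):

# Reproduce the quoted AGLT2 totals for the 2023 R6525 purchase.
nodes, ht_per_node = 11, 96              # 11x R6525 with AMD 7443 (96 HT/node)
hepscore_per_ht, hs06_per_ht = 17.48, 17.75

ht_total = nodes * ht_per_node           # 1056 hyperthreads ("cores")
print(ht_total,
      round(ht_total * hepscore_per_ht / 1000, 1),   # ~18.5k HEPscore
      round(ht_total * hs06_per_ht / 1000, 1))       # ~18.7k HS06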
-
13:50
Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
-
13:55
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Configuration of IPv6 on NESE network is fixed. Doing final checks before configuring storage element.
NESE team is automating DNS for storage servers.
The installation of OpenShift is ongoing. Many layers of many technologies make the debugging process slow.
The perfSONAR server is in place in the MGHPCC.
Final fiber distribution is being installed.
Finalizing the purchase of racks with RDHX (rear-door heat exchangers) for the system.
-
14:00
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA
- A pair of incidents with the campus chilled water supply caused disruptions. We were able to maintain storage access, but in the first incident we had to drop all of the computational load. In the second incident we lost about 1/4 of the computational load.
- Scaling up the internal K8s cluster. The previous K8s cluster will be merged into SWT2_CPB.
- Power balancing operations have started.
OU
- Last Wednesday was OSCER maintenance; they upgraded network switches. That apparently didn't go too well: on Friday afternoon the core network collapsed, with core switches showing high CPU usage and what looked like broadcast storms. It was fixed Saturday morning.
- Also saw some fraction of stage-in transfer failures with a strange IPv4 network error. Not clear whether that started around the same time or had been there at a low level before. A restart of XRootD (both the proxy on se1 and the backend storage) seems to have fixed it.
- Old OCHEP squid server stopped reporting to CERN monitoring. Not sure yet what's going on there, investigating.
- Reasonable running over the past 30 days...
-
14:05 → 14:10
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
Perlmutter
- Fairly good running over the last week or so. Lustre issues killed a lot of jobs in the last 24h. Stable configuration at ~50K cores.
- Following up with Michal Svatos regarding a Squid / Frontier access issue he flagged
- The queuing time for the GPU nodes appears to be shorter now; would consider switching back to the full-node configuration
Frontera
- Difficult to get jobs through the queue. PanDA is cancelling jobs that have waited for days.
- Following up with Asoka on the cvmfsexec Squid configuration issue.
-
14:10 → 14:25
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
- 14:10
- 14:15
-
14:20
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
Images deployable on AF have been upgraded:
- base is now an NVIDIA image with CUDA 11.8 and a newer cuDNN.
- packages in both the ml-platform and conda images have been updated (a quick environment check is sketched below).
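A quick way to confirm what an image actually ships; this assumes PyTorch is installed in the ml-platform/conda images, which these minutes do not state:

# Report the CUDA / cuDNN versions bundled with the installed PyTorch build.
import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)        # expected to report 11.8 after the upgrade
print("cuDNN:", torch.backends.cudnn.version())   # integer, e.g. 8xxx for cuDNN 8.x
print("GPU visible:", torch.cuda.is_available())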
FAB ServiceX deployment
- FAB and LHCONE peering is up
- Sorting through XRootD authentication from an IPv6-only network
HTCondor issue - scheduler down due to stuck I/O. Restarted Ceph daemons as a temporary fix; updating kernels as a potential fix.
- 14:25 → 14:35