US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2023-07-19T13:00:00-04:00
End: 2023-07-19T15:10:00-04:00
Location: No location set

Wednesday 19 Jul 2023, 13:00 → 15:10 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
  
  Proposed milestones to be added by COB Friday: https://docs.google.com/spreadsheets/d/1CF5nSKi2UWiiF4hJpLbJIba_A-2aM00jS14lDDFcplY/edit#gid=634097696
  - Note we need more "detailed" milestones in EACH L3 area to cover all of calendar year 2024
  Quarterly reports deadline is Friday. All L3 WBS quarterlies should be in by COB today
  Working on scrubbing responses due ASAP.
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  Release (this week)
  New osg-xrootd + xcache versions
  HTCondor 10.0.6 in EL7 & EL8 release
  HTCondor 10.6.0 in upcoming (EL7, EL8) and release (EL9)
  NO XRootD 5.6.0 or 5.6.1: we caught issues in our integration testing
  OSG 23
  OSG 23 will be the next release series
  Looking like a September release
  See slides 10-12 about OSG 23 plans https://agenda.hep.wisc.edu/event/2014/contributions/28481/attachments/9167/11063/2023-07-11.htc23.osg-software-timeline.pdf
- 13:20 → 13:40
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  QR complete
  ANALY_BNL_VP queue maxWorkers doubled to 10000
  Usage has remained below ~500 slots. Multicore scheduling issue?
  Issue with home disks filling up at OU
  CVMFS squid failover issue at SLAC (GGUS) - Wei may have solved
  - 13:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speaker: Mark Sosebee (University of Texas at Arlington (US))
  - 13:25
    Service Development & Deployment 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    XCaches
    running fine
    some bypass at MWT2 and AGLT2 once traffic goes above 2TB/h
    VP
    next step in integration in Rucio now in PR.
    The probability of a dataset having a virtual replica at BNL increased by a factor of 5. We will need to look at the VP queue CRIC settings to get it to continuously run more jobs.
    ServiceX
    working fine on AF
    more performance optimizations merged
    running fine on FAB. Getting all ServiceX images to come with a special gai.conf
    Analytics
    all services work fine.
  - 13:30
    Kubernetes R&D at UTA 5m
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    Understanding the performance of the new cluster, looking into details if something doesn't look right. Overall the production is running fine.
    Noticed a drop in production level a couple of times, but it does not appear to be specific to the K8s cluster and looks like it was due to storage servers getting overloaded.
    With the new hardware, noticed that the nodes with more CPU cores (64/72/96) are overcommitting the node CPU. For the previous cluster I solved this issue by optimizing the job CPU request coefficient sent from Harvester. Have to look into this and probably readjust it.
    Noticed that K8s was trying to schedule production jobs on the master node. A NoSchedule taint was in place initially, but it looks like it was lost at some point. It has been reinstated.
    Working on reinstalling Prometheus on a dedicated node. And next setting up job accounting reporting.
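A lost taint like the one described above could be caught with a small periodic check. The following is a minimal sketch, not the procedure used at UTA: it assumes the node list has been obtained with `kubectl get nodes -o json` and parsed into a dictionary, and it assumes control-plane nodes carry one of the standard `node-role.kubernetes.io` labels. The function names are hypothetical.

```python
# Sketch only: flags control-plane nodes that have no NoSchedule taint.
# Input is the parsed JSON from `kubectl get nodes -o json`.

CONTROL_PLANE_LABELS = (
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",  # older label name
)


def has_noschedule_taint(node: dict) -> bool:
    """Return True if the node has at least one taint with effect NoSchedule."""
    taints = node.get("spec", {}).get("taints") or []
    return any(taint.get("effect") == "NoSchedule" for taint in taints)


def untainted_control_plane_nodes(node_list: dict) -> list:
    """Return names of control-plane nodes that would accept ordinary pods."""
    names = []
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels") or {}
        is_control_plane = any(label in labels for label in CONTROL_PLANE_LABELS)
        if is_control_plane and not has_noschedule_taint(node):
            names.append(node["metadata"]["name"])
    return names
```

An empty result means every labelled control-plane node still carries a NoSchedule taint; any returned name is a node where production jobs could land.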
- 13:40 → 13:45
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  Working fine
  Preparing answers to the post-scrubbing
  Milestones and Risks have been provided
  VP limit increased to 5,000 jobs (as discussed at OSG AHM) and reached 5,000 running jobs. The limit was then increased to 10,000, but the queue has never had more than 500 jobs since then. To be understood.
  Quarterly report published
- 13:45 → 14:05
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  The last 30 days had good running.
  CPB got most of their FY22 compute online but I leave it to Patrick to describe the status.
  NET2 is pretty close to being online but again I leave it to Eduardo to describe the status.
  Working on info for scrubbing response.
  Also doing the quarterly reporting in parallel.
  Looked at the Tier 2 milestones; they match what I was aware of.
  - 13:45
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen
    
    Incidents:
    We had two incidents with dCache. On July 8th, the PostgreSQL partition of the head node was filled up by the billing database, and it took us over 24 hours on the weekend to recover it. We are planning to rebuild an R6525 worker node with larger NVMe cards as the new head node, to host a bigger PostgreSQL partition (6TB vs 1TB).
    The second incident was on July 19th: 2 dCache nodes had all their pools offline, which caused some transfer failures. Restarting the pools fixed the issue.
    System update:
    We updated HTCondor from 9.0.17 to 10.0.5 and took the opportunity to apply firmware and kernel updates with the required system reboot. We ran into a token issue because in HTCondor 10 the default value of TRUST_DOMAIN changed to TRUST_UID, and the tokens used for daemon authentication need to be signed with the same TRUST_DOMAIN. Our fix was to set TRUST_DOMAIN back to the old value.
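One way to see which TRUST_DOMAIN an existing token was issued under is to read its issuer claim. The sketch below rests on an assumption: that HTCondor IDTOKENS are JSON Web Tokens whose `iss` claim carries the TRUST_DOMAIN in effect when the token was created. It only decodes the payload and does not verify the signature. The function names and the example domain are hypothetical.

```python
import base64
import json


def token_issuer(token: str) -> str:
    """Return the 'iss' claim of a JWT-style token without verifying it."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a token of the form header.payload.signature")
    payload_b64 = parts[1]
    # JWT segments drop base64 padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["iss"]


def issuer_matches(token: str, trust_domain: str) -> bool:
    """Return True if the token was issued under the given trust domain."""
    return token_issuer(token) == trust_domain


def make_demo_token(issuer: str) -> str:
    """Build an unsigned demo token for illustration (not a valid IDTOKEN)."""
    def encode(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).decode().rstrip("=")

    return ".".join([encode({"alg": "HS256"}), encode({"iss": issuer}), "sig"])
```

Comparing the decoded issuer with the TRUST_DOMAIN value reported by `condor_config_val TRUST_DOMAIN` on the daemon host would show whether a token predates a change of default.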
  - 13:50
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    Hardware issues on storage node at UC. Replaced controller and putting it back into service shortly.
    UIUC preventative maintenance today (7/19). Will make sure nodes all come back online in production once it ends.
    Planning to start setting up the WLCG SOC network monitoring hardware next week. Minor disruptions in the network could occur, but no downtime should be needed.
    Building our first set of el9 (AlmaLinux9) worker nodes at IU. Have one in production at UC at the moment and seems to be OK.
    UIUC compute has mostly come in (waiting on a couple chassis). Looking to install what has arrived by the end of the PM, but may have to wait a little longer.
  - 13:55
    
    NET2 5m
    
    Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
    
    Solving some failing transfers using CERN and BNL FTS services.pdf
    
    Storage
    Load tests
    - Manual transfers from lxplus and the CERN FTS service worked
    - Transfers from the BNL FTS service revealed some SSL issues regarding the use of SHA1 for signing
    - Tested an experimental package made by the OSG team, with no effect
    - A second problem, related to the transfer mode (pull) for FTS, was revealed when transferring from BNL. It was working fine from CERN because it allowed streaming mode. Setting webdav.authn.require-client-cert to true was preventing HTTP-TPC from working.
    - With FTS transfers working correctly we were able to saturate our network link; both WebDAV and XRootD were tested.
    - we are talking with Fabio to publish our storage
    webdav.data.net2.mghpcc.org
    xrootd.data.net2.mghpcc.org
    
    Openshift
    - Progressing, configuring X509 credentials for Kubernetes cluster access
    - Many problems due to the dual stack setup (network policy controllers not working, Security Context Constraints not working)
  - 14:00
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA
    Completed installation of available new machines
    A few machines need repair
    More than 20,500 cores, an additional 450 in repair, K8s cluster using 1,000
    Mostly have balanced power and cooling in the data center
    Have deployed one new rack and preparing to deploy another for additional space
    Further work should be invisible as we move machines in groups of one or two at a time to new rack
    Looking at replacing admin node in cluster
    OU
    Completed installation of new machines; now 5300 slots plus opportunistic OSCER nodes
    Ordered 3 more R6525, expected to arrive soon
    Have installed slate01.oscer.ou.edu with RockyLinux 9.2, in the process of configuring it
    Today OSCER maintenance, upgrading SLURM from v20 to v23 (or v22, if there are issues with v23)
- 14:05 → 14:10
  WBS 2.3.3 HPC Operations 5m
  
  Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
  Met with TACC support about possibly moving us to $SCRATCH2 for more IOPS
  Perlmutter at <15% allocation remaining and running well
  Rui has been working on a way for users to run a custom image on NERSC_Perlmutter_GPU
  https://gitlab.cern.ch/argonne_computing/hpcops/examples/perlmutter_gpu_queue_tensorflow
- 14:10 → 14:25
  WBS 2.3.4 Analysis Facilities
  
  Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:10
    Analysis Facilities - BNL 5m
    
    Speaker: Ofer Rind (Brookhaven National Laboratory)
    
    QR complete
    CHEP paper in progress - need help from authors
    Container development work ongoing (Shuwei), discussed at last week's 2.3/5 meeting
    What should be the procedure for announcing downtimes?
  - 14:15
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:20
    Analysis Facilities - Chicago 5m
    
    Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
    
    Downtime coming up next week for various upgrades (firmware, OS, Kubernetes, rook-ceph)
    ServiceX on FABRIC
    Slice stability issue (VMs disappear) raised with the FABRIC team. It seems to be a known issue for which they will deploy a fix.
    Should have found a solution for IPv6 preference (gai.conf to set preference; the default config prefers IPv4 over private IPv6)
- 14:25 → 14:35
  
  AOB 10m
