US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2021-10-27T13:00:00-04:00
End: 2021-10-27T15:10:00-04:00
Location: No location set

Wednesday 27 Oct 2021, 13:00 → 15:10 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 1
  
  WBS 2.3 Facility Management News
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
- 2
  OSG-LHC
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  Release (this week)
  
  HTCondor 9.3.0 in OSG 3.6 upcoming
  
  osg-ca-certs-updater for EL8
  
  Other
  
  Update your CEs to HTCondor-CE 5 and HTCondor 9 from OSG 3.5 upcoming!
  
  FaHui is updating central Harvester infrastructure to support token-based pilots
  
  For pilots, we expect to be able to consolidate token -> user mappings to a single user!
  
  Working with the HTCondor team to figure out a solution for mapping SAM/ETF tests to a separate user (scope based mappings?)
- Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 3
    
    Removing SRM from TPC for US Sites
    
    A quick presentation about removing SRMv2 protocol for Third party copy for our sites.
    
    Speaker: Shawn Mc Kee (University of Michigan (US))
    
    Changing CRIC to Remove SRM from TPC.pdf
    
    Removing SRM from TPC
- 4
  
  WBS 2.3.1 Tier1 Center
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  
  Smooth operation
  
  FTS upgrade on Nov 2, 4h downtime
- WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  N_jobs-20211027.png
  
  Success-20211027.png
  
  Transfers-20211027.png
  Relatively rough period over the last two weeks:
  
  Several problems on the CERN side
  
  Network outage on 10/15 (Friday)
  
  Draining problem over the weekend (10/17-10/18)
  
  Yesterday someone at CERN removed an "unnecessary" DB link.
  
  Other decreases were due to site maintenance and various site networking problems.
  
  The large increase in job slots for AGLT2 reflects a redefinition of a BOINC job as 8 slots rather than 1.
  - 5
    
    AGLT2
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
    
    Overall smooth 2-week operation (until today)
    
    - New problem: 60% of pilots are failing, but payloads in the other 40% run normally (2.4% failure)
    Being asked to check the condor config to make sure jobs that fail are not retried locally
    
    Also 2 events in BOINC queue, presumably unrelated
    - CERN operation error with DBRelease yesterday seemed to have caused job failures
    - monitoring: job count has jumped from ~2k to ~12k.
    Presumed to be proper accounting of multicore payload.
    
    Storage purchase
    
    - UM received 5x R740xd2 last week, installed, in production, adding 1180 TB usable
    Retired 2x older storage nodes (678 TB) with 4x MD3xxx shelves (including umfs11 which had the recent hardware problems)
    
    - MSU received 3x R740xd2 this week, racked, soon into production, adding 708 TB usable
    
    - net + 1210 TB
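
The net storage figure above can be checked with a quick calculation. This is a minimal sketch that uses only the capacities quoted in the AGLT2 report; the helper name `net_change_tb` is hypothetical and introduced purely for illustration.

```python
def net_change_tb(added_tb, retired_tb):
    """Return the net usable capacity change in TB: total added minus total retired."""
    return sum(added_tb) - sum(retired_tb)

# Usable capacities quoted in the AGLT2 report (TB)
um_added = 1180    # 5x R740xd2 at UM
msu_added = 708    # 3x R740xd2 at MSU
um_retired = 678   # 2x older storage nodes at UM

net = net_change_tb([um_added, msu_added], [um_retired])
print(f"net change: {net:+d} TB")  # prints "net change: +1210 TB"
```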
  - 6
    
    MWT2
    
    Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    Brief scidmz network outage at UChicago Oct 15
    
    IU downtime to reorganize racks and reconfigure network
    
    ICC PM downtime
    
    Updated vo-client on the MWT2 dCache nodes
    
    Removed SRMv2 from TPC list in CRIC
  - 7
    
    NET2
    
    Speaker: Prof. Saul Youssef
  - 8
    
    SWT2
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    OU:
    
    - Overall, site running well.
    
    - xrootd hung up on proxy (se1.oscer.ou.edu) again last night, restarted. Andy and Wei are looking into it.
    
    - Problem exacerbated by rucio copytool (site mover) still not doing write_lan correctly, meaning all stage-outs are still routed through se1 instead of directly to the local redirector. That really needs to get fixed soon, since it causes huge inefficiencies.
    
    SWT2_CPB:
    
    - XRootD on the webDAV host is more stable following the update to the curl libraries
    
    - Installed more storage from the most recent purchase (Summer)
    
    - Set up a new perfSONAR BW host. Both machines are now on new hardware.
    
    UTA_SWT2:
    
    - Retirement progressing
    
    - ddm ops re-started the cleanup of the SE. As of this morning ~10 TB of data known to rucio remain
- 9
  WBS 2.3.3 HPC Operations
  
  Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
  
  hpc-10-27.PNG
  TACC
  
  Needed to update local copy of ALRB etc to get the VO Client update at TACC (same as what affected Tier 2s over the weekend)
  
  System maintenance yesterday, some job failures because of it.
  
  Priority is looking better overall, jobs are generally going through.
  
  32% of allocation remaining
  
  NERSC
  
  Filesystem degraded on Monday causing job failures.
  
  60% of 20M additional node hour allocation remaining.
  
  Another 10-20M hours possibly coming. Will need to use or lose it before end of year.
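
To get a feel for what "use or lose" implies at NERSC, the average daily consumption needed can be estimated. This is a rough sketch under stated assumptions: the deadline is taken to be 31 Dec 2021, the same deadline is applied to the hours currently remaining (the notes only state it for the possible extra hours), and the extra grant is taken at 0, 10M, or 20M hours. The function name `required_daily_rate` is hypothetical.

```python
from datetime import date

def required_daily_rate(node_hours, today, deadline):
    """Average node-hours per day needed to consume node_hours by the deadline."""
    days_left = (deadline - today).days
    if days_left <= 0:
        raise ValueError("deadline must be after today")
    return node_hours / days_left

meeting_day = date(2021, 10, 27)   # date of this meeting
year_end = date(2021, 12, 31)      # assumed use-or-lose deadline

# 60% of the 20M additional node hours remain (figure from the notes)
remaining = 20_000_000 * 60 // 100

# Scenarios for the possible extra grant: none, 10M, or 20M hours (assumed values)
for extra in (0, 10_000_000, 20_000_000):
    rate = required_daily_rate(remaining + extra, meeting_day, year_end)
    print(f"extra {extra / 1e6:>4.0f}M hours: {rate:,.0f} node-hours/day on average")
```

The rates are plain averages over the calendar days left; they ignore maintenance windows and queue priority, so the achievable rate would need to be checked against actual throughput.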
- WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 10
    
    Analysis Facilities - BNL
    
    Speaker: Ofer Rind (Brookhaven National Laboratory)
  - 11
    
    Analysis Facilities - SLAC
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 12
    Analysis Facilities - Chicago
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    ML platform running fine.
    
    AF
    
    Andrew is working on reimplementing something like the ML platform on the AF
    
    Both xAOD and Uproot instances on AF running fine. A lot of improvements.
    
    Got a new SLATE-deployed XCache dedicated to ServiceX. Works great and is limited only by the NIC. Will get 2x25Gbps this week.
- WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  - 13
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-10_20_21.pdf
    
    US-cloud-summary-10_27_21.pdf
  - 14
    Service Development & Deployment
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    XCaches
    
    all running fine.
    
    5.3.2 is out and I am testing it. Will update everything this week.
    
    I am adding a container that will send rucio heartbeats.
    
    VP
    
    all running fine
    
    only issues with RAL. Should be solved with that 5.3.2 update.
    
    Rucio
    
    work on integrating VP
    
    json database for placements (PR is ready)
    
    adding heartbeats (almost ready)
    
    work on the placement engine not started yet
    
    ServiceX
    
    upgraded uproot xrootd plugin version
    
    much better performance with the big fast xcache.
    
    will try deploying in FABRIC.
  - 15
    
    Kubernetes R&D at UTA
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    UTA_SWT2 decommissioning nearing completion, for hardware to be used for Kubernetes cluster at CPB (see Mark's report).
- 16
  
  AOB
