US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
Data challenge is next week. One issue is our perfSONAR deployment, which is in need of some attention! Please review your toolkit deployment and make sure your new hardware is operational ASAP.
New monitoring dashboard in beta from ESnet. Not yet "announced" but have a look at: https://public.stardust.es.net/d/IkFCB5Hnk/lhc-data-challenge-overview?orgId=1 Verify your site is visible. Send feedback (for now) to Shawn. Announcement later this week?
Have a look at the DOMA projects list https://docs.google.com/document/d/1i5YLxgDaVFt_-0R4DHABCyAeCvzaZdynTaVLNiM9anA/edit#heading=h.qucfjz4ani2c This is supposed to briefly document all the ongoing DOMA related activities.
Packet marking / flow labeling RPM is available to install on storage nodes. Currently deployed at AGLT2 and BNL. The service watches netstat and sends firefly packets to the ESnet collector (a minimal sketch follows at the end of these notes). Let Shawn know if your site is interested in participating.
LHCOPN/LHCONE meeting in two weeks: https://indico.cern.ch/event/1022426/
Other relevant topics covered in WBS sub-areas.
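For context on the flow-labeling item above: a minimal sketch of the "firefly" idea in Python, assuming a hypothetical collector host/port and illustrative JSON fields. The real packet format and destination are defined by the SciTags/ESnet specification and the deployed RPM's configuration, not by this sketch.

    import json
    import socket
    import time

    # Hypothetical collector endpoint -- the real host/port come from the
    # deployed RPM's configuration, not from this sketch.
    COLLECTOR = ("collector.example.net", 10514)

    def send_firefly(src_ip, src_port, dst_ip, dst_port, experiment_id, activity_id):
        """Send one illustrative 'firefly' UDP datagram describing a network flow."""
        payload = {
            "version": 1,
            "flow-lifecycle": {
                "state": "start",
                "start-time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            },
            "flow-id": {
                "afi": "ipv4",
                "src-ip": src_ip, "src-port": src_port,
                "dst-ip": dst_ip, "dst-port": dst_port,
                "protocol": "tcp",
            },
            # The experiment/activity labels are what let ESnet attribute the flow.
            "context": {"experiment-id": experiment_id, "activity-id": activity_id},
        }
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.sendto(json.dumps(payload).encode(), COLLECTOR)

    # Example: label an outgoing transfer seen on a storage node.
    send_firefly("192.0.2.10", 1094, "198.51.100.20", 443, experiment_id=2, activity_id=1)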
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (expected this week)
- OSG 3.5 + 3.6
  - CVMFS 2.8.2
  - osg-wn-client
- OSG 3.5 upcoming
  - xrootd-multiuser 2.0.2
- OSG 3.5 upcoming and OSG 3.6
  - HTCondor 9.0.6
  - blahp 2.1.2
- OSG 3.6 upcoming
  - HTCondor 9.2.0 (see the updated versioning scheme, slides 8-9: https://indico.cern.ch/event/1059494/contributions/4532565/attachments/2312014/3934741/WhatsNew_European_Workshop_Sept_2021.pdf)
Token Transition
- HTCondor-CE 5.1.2 + HTCondor 9.0.6 available in OSG 3.6 and OSG 3.5-upcoming
- Oct 12 Pre-GDB Token/WebDAV transition https://indico.cern.ch/event/876809/
- Oct 14-15 Token transition workshop https://indico.fnal.gov/event/50597/overview
- OSG 3.5 + 3.6
-
13:20
→
13:35
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 10m
-
13:20
-
13:35
→
13:40
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:40
→
14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Smooth running over the last two weeks. The main problem was a transfer backlog at most sites; Hiro solved it by increasing the number of concurrent FTS transfers allowed, which cleared the backlogs at all sites.

- Investigating actual delivery date for Dell equipment.
- dCache version that supports SRR released.
- HTTP-TPC at the sites using XRootD doors is still not quite there for next week's data challenge.
- Will come back to IPV6 after the data challenge.
- Will try to update to OSG 3.5 upcoming soon.
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
09/16/2021
MSU Dell PO issued. Missing the information needed to find it on the Dell website; asked Dell reps, but no success yet. Only found the WN part, with an estimated delivery of 18-Jan-2022.

09/17/2021
Allowed all smaller worker nodes to run BOINC (using a larger swap file) after a long test period with measurement and verification of low impact on ATLAS jobs and node stability.

09/20/2021
One of the dCache pools (umfs11_6) went disabled again (twice in 2 weeks). We repaired the file system first, then started the pool. The disabled pool caused 110 failed jobs at stage-out. We finally decided to retire this pool and another pool on the same host because they each had unresponsive and pending-failure disks which we are no longer planning to replace. (This whole storage node was already targeted for retirement as soon as we get our new storage nodes, now estimated for Jan 2022.) With some struggle (the pool would disable itself during draining), we finally drained and retired the pool umfs11_6. Eventually we found that over 11K files were lost during the xfs_repair; we declared the lost files in JIRA ticket ATLDDMOPS-5575 on 09/29/2021.

09/22/2021
Updated dCache from 6.2.25 to 6.2.29 (for the new SRR support). We also applied system firmware and software (including kernel) updates and rebooted all dCache servers. Two dCache storage nodes (umfs11 and umfs19) had corrupted grub configuration files; we had to mount an ISO file to recover the grub files.

09/22/2021
Also applied the new firmware and kernel updates on the worker nodes, draining and rebooting the nodes in batches.
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
1. Did a live transition from GridFTP to WebDAV with XRootD 5.3, using private containers (a WebDAV check sketch follows this list).
2. GPFS disk failures => migrations needed; lost about 70 TB in BU GPFS.
3. Two 100G links from BU to NESE Ceph failed, causing many PanDA jobs to fail with stage-in/out timeouts, etc.
4. Top-of-rack switch failing, causing about 50% job failures. Drained those workers and investigating.
5. Lots of progress on NESE tape.
Priority: Getting NESE storage under WebDAV for the Data Challenge.
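A minimal sanity check of a WebDAV door ahead of the Data Challenge; the endpoint URL and proxy path are hypothetical, and this assumes X.509 client-certificate auth (token auth would use an Authorization header instead).

    import requests  # third-party: pip install requests

    # Hypothetical door and credentials -- substitute the real endpoint and proxy.
    ENDPOINT = "https://webdav.example.edu:1094/atlasdatadisk/"
    CERT = ("/tmp/x509up_u1000", "/tmp/x509up_u1000")  # (cert, key); a VOMS proxy works for both

    # PROPFIND with Depth: 1 lists the collection -- a quick end-to-end check
    # that the door answers WebDAV methods and that the auth chain works.
    resp = requests.request(
        "PROPFIND",
        ENDPOINT,
        headers={"Depth": "1"},
        cert=CERT,
        verify="/etc/grid-security/certificates",  # hashed CA directory
    )
    print(resp.status_code)   # expect 207 Multi-Status on success
    print(resp.text[:500])    # start of the XML listing

-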
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- All running well
- Had two xrootd-related issues in the last week:
  - xrootd was in a funny state on cstore13, causing that node to fail transfers; not sure why or how. An xrootd restart on cstore13 fixed it. The same thing happened on that node about a year ago.
  - xrootd hung up on se1, our proxy gateway, which completely halted all WAN xrootd transfers; again, an xrootd restart on se1 fixed it.
- It's possible that the first issue caused a few corrupted log files (adler32 mismatch). They were declared lost in Rucio (a checksum-verification sketch follows the OU items).
- Spent remaining hardware funds on more compute nodes and slate node; ETA a few months, of course ...
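For reference, a small sketch of how an adler32 mismatch like the one above can be confirmed locally before declaring a replica bad; the file path and catalog value are just examples, and the 8-character zero-padded hex matches the format Rucio records.

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Compute a file's adler32, returned as 8-char zero-padded hex."""
        value = 1  # adler32 seed
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return f"{value & 0xFFFFFFFF:08x}"

    # Compare the on-disk checksum with the value recorded in the catalog.
    local = adler32_of("/xrd/cstore13/atlas/somefile.root")  # hypothetical path
    catalog = "01a2b3c4"                                     # hypothetical catalog value
    print("MATCH" if local == catalog else "MISMATCH", local, catalog)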
UTA:
1) The UTA_SWT2_UCORE PanDA queue is now retired in favor of UTA_SWT2_UCORE_RET, whose I/O operations use SWT2_CPB_DATADISK.
2) We are in the process of retiring UTA_SWT2_DATADISK; additional space was made at SWT2_CPB_DATADISK to accept the migrated data, although half of UTA_SWT2's datasets are already in place there.
3) Our WebDAV door is in production at SWT2_CPB, and ATLAS prefers to use it for I/O even though the ATLAS CRIC settings give gridftp higher priority. This caused some problems at startup. We are also noticing a possible problem with the long-term stability of xrootd and are investigating the use of xrd.report to generate statistics about metrics within the service (see the sketch after this list).
4) Still working on the initial setup of the Kubernetes cluster.
5) Electrical work at SWT2_CPB *SHOULD* not affect operations overnight, but...
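A rough sketch of one way to look at xrd.report output: the directive sends periodic XML summary reports over UDP, so a small listener is enough to see which metrics are available. The port and the exact directive line are assumptions; check the XRootD documentation for the deployed version.

    import socket
    import xml.etree.ElementTree as ET

    # Assumes the xrootd config contains something like
    #   xrd.report <this-host>:9931 every 60s all
    # (exact directive syntax per the XRootD docs); 9931 is just an example port.
    LISTEN = ("0.0.0.0", 9931)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN)
    print("listening for xrd.report summaries on %d/udp" % LISTEN[1])

    while True:
        data, addr = sock.recvfrom(65535)
        try:
            root = ET.fromstring(data)  # summaries arrive as an XML <statistics> document
        except ET.ParseError:
            continue
        # Print which <stats id="..."> blocks each report carries.
        ids = [s.get("id") for s in root.findall("stats")]
        print(addr[0], root.get("tod"), "stats blocks:", ids)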
-
14:00
→
14:05
WBS 2.3.3 HPC Operations 5m
Speaker: Lincoln Bryant (University of Chicago (US))
Jobs are sitting in the queue at TACC for quite a while before getting cancelled, e.g. from PanDA (a queue-wait check sketch follows below):
pilot, 1236: Killed by Harvester due to worker queuing too long. 3504589 myjob normal phy20021 100 CANCELLED+ 0:0
NERSC queue revived after 20M additional hours were added. Globus transfers are going. All proxies were renewed on Tuesday.
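A small sketch for quantifying how long jobs like the one above sat in the TACC queue before being cancelled, assuming standard Slurm accounting (sacct) is available on the login node; the job ID is the one quoted above.

    import subprocess
    from datetime import datetime

    def queue_wait(jobid):
        """Return (state, wait time) for a Slurm job from sacct's Submit/Start fields."""
        out = subprocess.run(
            ["sacct", "-X", "-j", str(jobid), "--parsable2", "--noheader",
             "--format=JobID,State,Submit,Start"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        _jobid, state, submit, start = out.splitlines()[0].split("|")
        fmt = "%Y-%m-%dT%H:%M:%S"
        if start in ("Unknown", "None"):  # cancelled while still pending: never started
            return state, None
        return state, datetime.strptime(start, fmt) - datetime.strptime(submit, fmt)

    print(queue_wait(3504589))  # e.g. ('CANCELLED by ...', None)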
-
14:05
→
14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
Analysis Facilities - Chicago 5m
Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
David - Lincoln is troubleshooting the Broadcom NICs we are using.
The BCM57414 works OK with the stock kernel, but newer mainline kernels introduce a high rate of TCP retransmits (a retransmit-rate sketch follows these notes).
Ilija
- ML works fine
- AF now has two instances of ServiceX deployed (xAOD and UpROOT)
- ServiceX dedicated xCache deployed.
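To help with the NIC comparison above, a minimal sketch that samples the kernel's TCP counters from /proc/net/snmp, so retransmit rates under different kernels can be compared side by side (standard Linux counters; no assumptions beyond a Linux host).

    import time

    def tcp_counters():
        """Parse the Tcp: header/value line pair from /proc/net/snmp into a dict."""
        with open("/proc/net/snmp") as f:
            lines = [line.split() for line in f if line.startswith("Tcp:")]
        header, values = lines[0][1:], lines[1][1:]
        return dict(zip(header, (int(v) for v in values)))

    # Sample twice and report the retransmit rate over the interval.
    before = tcp_counters()
    time.sleep(10)
    after = tcp_counters()

    retrans = after["RetransSegs"] - before["RetransSegs"]
    sent = after["OutSegs"] - before["OutSegs"]
    print("%d retransmits / %d segments sent over 10s (%.3f%%)"
          % (retrans, sent, 100.0 * retrans / max(sent, 1)))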
-
14:05
-
14:20
→
14:40
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- dCache SRR REST API released yesterday; integrated into the frontend service, fixing the storageshares reporting (a verification sketch follows this list)
- Deployed at BNL
- Ready to test HTCondor-CE 5.1.2 deployment at BNL
- Production XRootd endpoint deployed at BU
- Data Challenge next week
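A quick way to eyeball the storageshares numbers the new SRR provides. The URL path below is an assumption (verify where your dCache frontend actually publishes the SRR JSON), and the field names follow the WLCG SRR layout.

    import json
    import urllib.request

    # Assumed SRR location on the dCache frontend -- confirm the real path/port
    # for your deployment before relying on this.
    SRR_URL = "https://dcache-frontend.example.org:3880/api/v1/srr"

    with urllib.request.urlopen(SRR_URL) as resp:
        srr = json.load(resp)

    # WLCG SRR layout: storageservice -> storageshares[], each with name/totalsize/usedsize.
    for share in srr["storageservice"]["storageshares"]:
        total = share.get("totalsize", 0)
        used = share.get("usedsize", 0)
        print("%-30s %8.1f / %8.1f TB used" % (share.get("name", "?"), used / 1e12, total / 1e12))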
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Oracle upgrade went smoothly!
- ES
- David added new nodes in the new computing center. All working fine. Data now getting redistributed.
- Next step is draining half of nodes in the old cluster and physically moving them to the new center.
- XCaches
- All working fine
- VP
- All working fine
- ServiceX
- Now using Flux deployed instances (6 on SSL cluster, 2 on AF)
- New, faster DID Finder in production: twice as fast, with much less load on Rucio
- A lot of developments and testing.
-
14:40
→
14:45
AOB 5m
-