US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
We have a request from Alessandra to fix our SRR setups. There is a new script you can run to check a specific cloud.
For the US Cloud I did the following:
setupATLAS -3
lsetup adctools
voms-proxy-init -voms atlas
checkSRR.sh US
I will attach the output as a separate file. The US cloud has some issues to fix here.
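For reference, the same check can be run for other clouds and each report kept in a file to attach; a minimal sketch for an interactive shell, assuming checkSRR.sh takes the cloud name as its only argument (as in the US example above; clouds other than US are only illustrative):
setupATLAS -3
lsetup adctools
voms-proxy-init -voms atlas
for cloud in US CA UK; do   # cloud names besides US are just examples
    checkSRR.sh "$cloud" > "srr-check-${cloud}-$(date +%F).txt"   # save each report for attaching
done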
Next week is Supercomputing
-
13:10
→
13:20
OSG-LHC 10m Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (this week; a sample update command follows this list)
- 3.5 upcoming
- XRootD 5.3.2
- xrootd-multiuser
- osg-ca-certs-updater (3.5, EL7+EL8)
- 3.6
- gratia-probe
- osg-flock
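As an illustration only (not an official recipe), picking up this week's 3.5-upcoming packages on an EL7 host that already has the osg-release repos configured might look like the following; check versions before applying:
yum clean expire-cache                                   # fetch fresh repo metadata
yum update --enablerepo=osg-upcoming \
    xrootd\* xrootd-multiuser osg-ca-certs-updater       # packages named in the list above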
Token Transition
- Default issuer + subject mappings (a deployment sketch follows the CE list below)
# ATLAS production
SCITOKENS /^https:\/\/atlas-auth.web.cern.ch\/,7dee38a3-6ab8-4fe2-9e4c-58039c21d817/ usatlas1
# ATLAS analysis
SCITOKENS /^https:\/\/atlas-auth.web.cern.ch\/,750e9609-485a-4ed4-bf16-d5cc46c71024/ usatlas3
# ATLAS SAM/ETF
SCITOKENS /^https:\/\/atlas-auth.web.cern.ch\/,5c5d2a4d-9177-3efa-912f-1b4e5c9fb660/ usatlas2
- CEs that need to update to HTCondor-CE 5 / HTCondor 9 from 3.5-upcoming:
- gate02.grid.umich.edu
- gate01.aglt2.org
- gate03.aglt2.org
- gridgk01.racf.bnl.gov
- gridgk02.racf.bnl.gov
- gridgk03.racf.bnl.gov
- gridgk04.racf.bnl.gov
- gridgk06.racf.bnl.gov
- gridgk07.racf.bnl.gov
- gridgk08.racf.bnl.gov
- iut2-gk.mwt2.org
- uct2-gk.mwt2.org
- mwt2-gk.campuscluster.illinois.edu
- atlas-ce.bu.edu
- deepthought.crc.nd.edu
- ouhep0.nhn.ou.edu
- tier2-01.ochep.ou.edu
- grid1.oscer.ou.edu
- gk04.swt2.uta.edu
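A minimal deployment sketch for the default mappings above on an HTCondor-CE 5 host; the mapfile name is illustrative, and the SCITOKENS lines are copied verbatim from the defaults listed earlier:
cat > /etc/condor-ce/mapfiles.d/10-atlas-scitokens.conf <<'EOF'
# ATLAS production
SCITOKENS /^https:\/\/atlas-auth.web.cern.ch\/,7dee38a3-6ab8-4fe2-9e4c-58039c21d817/ usatlas1
# ATLAS analysis
SCITOKENS /^https:\/\/atlas-auth.web.cern.ch\/,750e9609-485a-4ed4-bf16-d5cc46c71024/ usatlas3
# ATLAS SAM/ETF
SCITOKENS /^https:\/\/atlas-auth.web.cern.ch\/,5c5d2a4d-9177-3efa-912f-1b4e5c9fb660/ usatlas2
EOF
condor_ce_reconfig   # or restart the condor-ce service to pick up the new mapfile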
-
13:20
→
13:50
Topical Reports Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 30m
-
13:50
→
13:55
WBS 2.3.1 Tier1 Center 5m Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
dCache upgraded to the latest golden release. There is a bug in SRR; dCache has a workaround in place and a fix is coming.
New VP site and VP queue, to avoid injection of the local XCache address that caused problems for user jobs using data from BNLLAKE_DATADISK.
Shigeki will produce a list of questions for a collaboration between BNL SDCC, dCache, ATLAS ADC, Rucio, and FTS developers. The goal is to provide enough metadata to use the tape system at the Tier 1 site efficiently.
-
13:55
→
14:15
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- The past two weeks were OK running, but there were problems:
- AGLT2 grid job / dCache issues over last weekend
- BOINC did fill empty slots.
- However the grid prod job failure rate was 11%.
- MWT2 drained over the weekend before last.
- Issue not understood but unlikely to be a site problem.
- NET2 high failure rate for prod jobs: 24% over the two weeks.
- SWT2 OU: XRootD service hangs and 21% prod job failure rate
- SWT2 CPB: Some XRootD service issues but generally OK
- Rob discovered in talking to Dell that sites ordering C6525 servers (chassis) are getting them this month, but sites ordering R6525 servers (1U) are getting them in January.
- Do we want to change our orders to chassis or ask Dell if we can get better CPUs since we are waiting until January anyway for 1U servers?
- IU is getting its chassis servers this week and Illinois expects to receive theirs soon.
- Could UMich, MSU, UC, BU, and OU please email me with what was ordered and the expected dates to receive the gear.
- Once I have this info I will organize a planning meeting to discuss the next round of purchases this week or early next week. We must urgently start the process of spending down the rest of the equipment money by January 31...
- XRootD continued to have issues. From Wei (a CLOSE_WAIT monitoring sketch follows this list):
- We still see the problem at OU. I was finally able to reproduce the issue at SLAC.
- For a long time, SLAC ran xrootd 5.3.0 from the xrootd repo and we never saw the problem. OU runs xrootd 5.3.1 from OSG 3.5-upcoming and saw TCP CLOSE_WAIT connections accumulating in a short period of time.
- Since Oct 28, SLAC has been running the same version OU uses and we saw this problem after 9 days. SLAC has since switched to xrootd 5.3.1 from EPEL to see if we can reproduce the problem.
- Please provide updates on IPv6 (NET2, SWT2) and the moves (MWT2, SWT2).
- We need to retire SRM soon.
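A rough monitoring sketch for the CLOSE_WAIT symptom Wei describes, assuming XRootD listens on the default port 1094 (adjust per site); it simply counts stuck sockets once a minute so a site can see whether they accumulate:
PORT=${1:-1094}   # assumed XRootD port; pass another port as the first argument
while true; do
    count=$(ss -tan state close-wait "( sport = :${PORT} )" | tail -n +2 | wc -l)
    echo "$(date -Is)  CLOSE_WAIT sockets on port ${PORT}: ${count}"
    sleep 60
done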
-
13:55
AGLT2 5m Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
Hardware
The MSU site installed 3x R740xd2 and is now benchmarking different stripe sizes for the RAID6.
Service:
Updated dCache from 6.2.29 to 6.2.32 to address security issues. The update was smooth.
HS06 benchmark:
We ran it on two types of CPUs (Intel Xeon Gold 6240R and Intel Xeon E5-2650 v2), with two findings: 1) the new HS06 score is 6-8% higher than the old number (with the same benchmark toolkit) but a different kernel and firmware; 2) comparing the HS06 score with and without BOINC jobs running in the background, having BOINC reduces the score by 1.32-2.
Incidents:
Removal of a DBRelease file by ADC caused BOINC jobs to fail and to be mis-accounted.
XCache server sl-um-es4 crashed because of a failing disk. We replaced the disk (RAID 0) and recreated the XCache instance.
On 11/7/2021, starting at 03:00 UTC, the SrmManager on the dCache head node (head01) began to fail. This caused low transfer efficiency (60% failure), Rucio deletion failures (DATADISK was almost full), and the AGLT2 PanDA queue being set to test status (draining the site to only 20% job slot usage) because of the over-60% job failure rate. We first restarted dCache on the head node, which fixed the Rucio deletion failures. We still saw low transfer efficiency; the problem was traced to one xrootd door (dcdmsu01) and one pool node (msufs04). Restarting dCache on both nodes improved transfer efficiency, but we then saw more errors from other pool nodes, so we ended up restarting dCache on all nodes, which eventually solved all the problems (a rough restart-loop sketch follows).
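A rough sketch of the final "restart dCache on all nodes" step, assuming passwordless ssh from an admin host and the standard dcache wrapper script on each node; the host list is illustrative, not the actual AGLT2 inventory:
for host in head01 dcdmsu01 msufs04; do   # illustrative host list
    echo "=== ${host} ==="
    ssh "root@${host}" "dcache restart && dcache status"   # restart all domains, then confirm they are up
done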
-
14:00
MWT2 5m Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
Updated dCache from 6.2.30 to 6.2.33 for a security fix.
There were a handful of 1099 errors on a few workers at UC that all seemed to be failing to download the same file/dataset. Manually attempting to download the file succeeded (a reproduction sketch follows the UIUC items below).
Found a pilot bug where the pilot would tell PanDA the job had failed while the job was still running happily on the compute node until completion. Found while troubleshooting the 1099 errors and a spate of "lost heartbeat" errors.
Swing compute, for the move, is being benchmarked. Swing storage is in production.
First physical machine move is scheduled for December 6th.
UIUC:
- finalizing infrastructure changes to support the new compute purchase (shipment ETA this week)
- ICCP (Chit) is rebuilding the OS image for the SLATE node and Lincoln has offered to configure software
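A hypothetical reproduction of the "manual download succeeded" check mentioned above, assuming a standard Rucio client setup; the DID is a placeholder for the file from the failing transfers:
setupATLAS
lsetup rucio
voms-proxy-init -voms atlas
DID="<scope>:<filename>"                  # placeholder; substitute the affected file's DID
rucio download --dir /tmp/1099-debug "$DID"
ls -l /tmp/1099-debug                     # confirm the file arrived and check its size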
-
14:05
NET2 5m Speaker: Prof. Saul Youssef
SRR reporting endpoint updated => gridftp is no longer listed as a used protocol in CRIC. Brief bump where used space > total space...
Issue with migration slowing GPFS down too much, resulting in staging errors.
Smooth operations otherwise
Low level of squid failovers...
We're mostly working on preparations for new workers, the networking upgrade, NESE Tape preparation, and adding more NESE Ceph storage.
-
14:10
SWT2 5m Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
SWT2_CPB:
- System generally running smoothly
- Continue to monitor XRootD on SE gateway for hang-ups
- Adding additional hosts to distribute the load
UTA_SWT2:
- Decommissioning of the storage essentially done
- Working with data center personnel to schedule powering off, disconnecting network, etc.
- Final step is physically moving the hardware back to campus
OU:
- Generally smooth running, except
- XrootD on SE gateway also keeps hanging up
- Continue investigating with Andy, Wei, and rucio developers
- Some dark data on OU_OSCER_ATLAS_DATADISK, investigating
- Dark data was also responsible for the DATADISK being blacklisted for writing, which caused additional job failures since Rucio refused to stage out jobs. Rucio is supposed to still stage out in that case, and the developers found a bug which they're working on.
-
14:15
→
14:20
WBS 2.3.3 HPC Operations 5m Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
- TACC has good throughput for the last few days
- Cori filesystem has been unstable off and on for the past week
- 20M additional hours added to Cori; they need to be used by the end of the year. Need to ramp up the number of prod jobs going to Cori and use the regular priority queue to make sure we use the allocation.
- Perlmutter setup continues; no major blockers, just need to focus on configuration and testing.
-
14:20
→
14:35
WBS 2.3.4 Analysis Facilities Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
Analysis Facilities - BNL 5m Speaker: Ofer Rind (Brookhaven National Laboratory)
- Update of BNL shared-T3 documentation in progress.
- Very useful AGC Tools Workshop last week.
- 14:25
-
14:30
Analysis Facilities - Chicago 5m Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
14:35
→
14:55
WBS 2.3.5 Continuous Operations Convener: Ofer Rind (Brookhaven National Laboratory)
- Issue with Rucio prepending xcache to storage URL for jobs accessing data from BNLLAKE datadisk
- Config changed to: ATLAS site = BNL-ATLAS_VP, PanDA site = BNL_VP, PQ = ANALY_BNL_VP, Storage Unit = BNL_VP_SU
- Ilija/Ofer following up to finalize configuration
- Investigating why VP not declared down for dCache upgrade
- Status of XRootd site issues? Update to 5.3.2?
- Planning to test xrootd-standalone at BNL
-
14:35
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:40
Service Development & Deployment 5m Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
XCache
- All working fine.
- All SLATE-managed instances are on the latest version, 5.3.2
- BNL and LRZ-LMU asked to upgrade
- Issue with RAL as an origin. Debugging with Andy and Matevz
VP
- Working fine
- BNL VP queue moved to a new site (BNL_VP). This was needed due to the introduction of BNLLAKE: user jobs at BNL-ATLAS were having XCache paths prepended when reading from BNLLAKE.
- Oxford advised to move to VP
ServiceX
- Successful debut at the AGC Tools workshop last week.
- More tests. Documentation is being updated.
Rucio
- Still issues returning the closest replicas. Debugging with Martin.
- 14:45
-
14:55
→
15:05
AOB 10m