US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
1
WBS 2.3 Facility Management News
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
2
OSG-LHC
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (next week)
- OSG 3.5
- vo-client 115
- python-scitokens 1.6.2
- OSG 3.5-upcoming
- HTCondor 9.0.7 (has GSI, proxy delegation works)
- OSG 3.6
- vo-client 115
- XRootD 5.3.2
- xrootd-multiuser 2.0.3
- XCache 3.0.0
- osg-xrootd 3.6-10
- HTCondor 9.0.7 (no GSI, proxy delegation broken)
- blahp 2.2.0 (no GSI)
- python-scitokens 1.6.2
- OSG 3.6-upcoming
- HTCondor 9.3.0 (no GSI, proxy delegation works)
Miscellaneous
- How's testing of XRootD in 3.6 going?
- HTCondor-CE updates to support tokens
- Known issue with C-style comments outside of routes in JOB_ROUTER_ENTRIES (thanks for the report, Wenjing!): https://opensciencegrid.org/docs/release/notes/#known-issues
- CEs on token-supporting versions of HTCondor-CE
- gate01.aglt2.org
- gate02.grid.umich.edu
- gate04.aglt2.org
- gridgk05.racf.bnl.gov
- osg-gk.mwt2.org
- CEs on old versions of HTCondor-CE
- atlas-ce.bu.edu
- gk01.atlas-swt2.org
- gk04.swt2.uta.edu
- grid1.oscer.ou.edu
- gridgk01.racf.bnl.gov
- gridgk02.racf.bnl.gov
- gridgk03.racf.bnl.gov
- gridgk04.racf.bnl.gov
- gridgk06.racf.bnl.gov
- gridgk08.racf.bnl.gov
- iut2-gk.mwt2.org
- mwt2-gk.campuscluster.illinois.edu
- ouhep0.nhn.ou.edu
- spce01.sdcc.bnl.gov
- tier2-01.ochep.ou.edu
- uct2-gk.mwt2.org
- OSG 3.5
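The known issue with C-style comments in JOB_ROUTER_ENTRIES noted above can be illustrated with a minimal sketch (hypothetical route name and attributes, not any site's actual configuration); the comment placed outside the route's bracketed ad is what trips the parser in the token-supporting HTCondor-CE versions:

```
# JOB_ROUTER_ENTRIES is a list of ClassAd route definitions.
# A C-style comment sitting *between* routes, as below, triggers the
# known parsing issue; the commonly suggested workaround is to move
# the comment inside a route's [ ... ] ad or remove it entirely.

JOB_ROUTER_ENTRIES = \
   /* this comment is outside a route and triggers the bug */ \
   [ \
     GridResource = "condor localhost localhost"; \
     name = "Local_Condor"; \
   ]
```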
-
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
3
TBD
-
4
WBS 2.3.1 Tier1 Center
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running - some site problems....
- AGLT2 was down for a Condor upgrade when the network failed.
- Their new network redundancy scheme is not quite in service.
- MWT2 memcache problem last weekend caused the site to drain.
- Planning continues to expend the funding completely by the end of the grant.
- https://docs.google.com/spreadsheets/d/1-CV5UgeVsDYj8KrVMvuLP0lAVAcjNEQ8TgdYY6911vo
- XRootD still needs to be restarted once in a while.
- AGLT2 updated to HTCondor 9.0.6 and HTCondor-CE 5.1 (osg35) and ran into a weird bug involving ignored comments in a configuration file.
- We continue to bang on removing SRM and getting SRR reporting to be stable.
- As a side effect of this I have noticed that our storage element definitions are inconsistent. I think that Horst and Alessandra got this right at OU and we need to iterate at the other sites.
- There has been an extended discussion on setting various CRIC parameters.
- Ofer and I have been planning how to prioritize the updates that we need to get done between now and the start of Run 3.
- IPv6 is still not implemented at NET2 and CPB.
- Mark Sosebee will retire in January (though he might come back part time).
-
5
AGLT2
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
Hardware
3 R740xd2 servers from MSU are in the production system. IO benchmarks show that a 512K stripe size for RAID6 gives the best IO performance, about 10% better.
Incident:
11/22/2021, from 10am local time, the 10G commodity link connecting the AGLT2 UM site to Merit went down, so all nodes in the aglt2.org domain lost access to the Merit DNS servers. The issue was resolved around 7pm when Merit repaired the hardware connected to this link. During this window all data transfers were failing; the site had already drained to 8% because of a planned Condor update ahead of the network outage.
dCache pool umfs06_12 caused jobs to fail when staging in files; restarting the dCache service resolved the problem.
System update:
Condor was updated from 8.8.15 to 9.0.6, and condor-ce from 4.5.2 to 5.1.2. During this update we switched the authentication from host-based to token-based for the Condor cluster, which went smoothly because we had already practiced it on a testbed. But we hit an issue with condor-ce after the update: the CE could see incoming jobs, but the jobs could not be submitted to the local Condor system. A few hours of debugging traced the cause to the new htcondor-ce not supporting the comment format used in the job router configuration; this has been reported as a bug to the HTCondor development team. At about 13:00 on 11/23/2021 the site started to ramp up with jobs. During the entire period of draining and update problems, BOINC jobs were able to fill all the job slots of the site.
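The host-to-token authentication switch described above can be sketched as a minimal HTCondor configuration fragment (hedged: the knob values, identity, and file locations below are illustrative defaults, not AGLT2's actual settings):

```
# Prefer token (IDTOKENS) authentication, falling back to FS for local users.
SEC_DEFAULT_AUTHENTICATION_METHODS = IDTOKENS, FS
SEC_DEFAULT_AUTHENTICATION = REQUIRED

# Daemons read tokens from /etc/condor/tokens.d/ by default; a token for a
# daemon identity can be minted on the pool's signing key with, e.g.:
#   condor_token_create -identity condor@pool.example.org
```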
-
6
MWT2
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
One of our MD3460 s-nodes went offline again Nov 15, back up later that day. Discussing retirement plans for these nodes (also now out of warranty).
Site issue over last weekend caused the site to drain. Back online Monday.
IU scheduled to update to HTCondor CE 5 / HTCondor 9 November 29. UC will update the first week of January. UIUC will be scheduled after the UC update.
-
7
NET2
Speaker: Prof. Saul Youssef
Source of occasional HC bumping us offline probably found and dealt with.
Problem with the 2 x 100Gb networking between NET2 and NESE Ceph.
Minor post-XRootD bump: nodes rebooting where the container loses the GPFS mount.
Planning for networking re-arrangements, worker nodes.
Working on NESE Tape with NESE and MIT teams.
Todo:
- new perfsonar hardware
- ipv6
- OSG 3.5/3.6 upgrade
-
8
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- Site generally running well
- Still seeing xrootd hangups, have cron in place to restart hung services
- Waiting for next xrootd patch
- Trying to switch SAM/SiteMon over from GridFTP to XRootD for primary SE monitoring; that's currently causing UNKNOWN status, possibly because SiteMon is trying to monitor the internal xrootd door, which isn't possible
- SiteMon/MONIT team is looking at this
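OU's cron-based restart of hung xrootd services could look roughly like the following POSIX shell sketch (hypothetical: the probe and restart commands, service name, and port are assumptions, not OU's actual cron job):

```shell
#!/bin/sh
# watchdog PROBE_CMD RESTART_CMD:
# run PROBE_CMD; if it exits non-zero, run RESTART_CMD and return 1.
# Both arguments are shell command strings, so they can be mocked in tests.
watchdog() {
    if ! sh -c "$1" >/dev/null 2>&1; then
        sh -c "$2"
        return 1
    fi
    return 0
}

# Hypothetical cron usage (service name and port are assumptions):
# */10 * * * * root /usr/local/sbin/xrootd-watchdog
# where the script would call something like:
#   watchdog "timeout 10 xrdfs localhost:1094 query config version" \
#            "systemctl restart xrootd@clustered"
```

The probe uses a lightweight `xrdfs query` so a healthy server answers immediately, while a hung one trips the `timeout` and triggers the restart.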
UTA:
- New storage (~2 PB) is coming online; it will mostly be used to retire existing storage.
- The WebDAV door is performing fairly well, considering the load. Working on converting the existing GridFTP doors to include WebDAV.
- We have IPv6 addresses for the perfSONAR machines and are in the process of setting them up.
- We are investigating an issue where some jobs failed to use the correct FRONTIER_SERVER variable.
-
9
WBS 2.3.3 HPC Operations
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
- TACC
- running fairly well but ran out of jobs. Followed up with DPA and a new dedicated task has been assigned.
- "Texascale" mode coming up - will be offline for a week starting Dec 6 to run only the largest jobs
- NERSC
- Cori scheduled maintenance last week, plus ongoing filesystem instability
- No updates for Perlmutter this week.
-
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
10
Analysis Facilities - BNL
Speaker: Ofer Rind (Brookhaven National Laboratory)
-
11
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
12
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
Working numerous issues:
- Cleaning up SRR reporting and disabling SRM protocol fallback (particularly AGLT2 and maybe BNL)
- Fred and Ofer reviewing WLCG storage availability reporting
- Pointed out change in Kibana monitor auth to ADC (result of shift to OpenStack, requires membership in es-atlas-kibana e-group to view); also found plots now missing on brokerage page
- Need to clarify procedure for adding elements to CRIC/Topology
- F-S DevOps: working with Michal to understand squid failover threshold for ticket alerts
- SLATE squid container update to OSG 3.6?
- HTCondor-CE updates? (Pushed back at BNL until next week)
- xrootd standalone server deployed at BNL for testing; Qiulan will help to configure
- Prioritized readiness list for Run3
-
13
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14
Service Development & Deployment
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
XCaches
- All working fine.
- The AGLT2 networking issue was handled gracefully and automatically.
- Upgraded to 5.3.4
VP
- Working fine
- RAL has still not upgraded to 5.3.4, and most failures are coming from there.
Rucio
- VP integration development continues. Heartbeat endpoint PR now in review.
- Oracle DB change is in and working fine.
ServiceX
- AF-deployed instances are running stably.
- A lot of development and cleanup work.
- Testing FABRIC deployed instance.
-
16
AOB