US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2021-08-04T13:00:00-04:00
End: 2021-08-04T14:45:00-04:00
Location: No location set

Wednesday 4 Aug 2021, 13:00 → 14:45 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  Release
  
  XRootD 5.3.1 (including async.io fix)
  
  gratia-probe 1.24.0 (some bug fixes for condor batch)
  
  Misc
  
  Add SRM tape vs disk service types in Topology: https://opensciencegrid.atlassian.net/browse/SOFTWARE-4732
- 13:20 → 13:35
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    TBD 10m
- 13:35 → 13:40
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  Condor pool happily full (except for a brief outage, see below)
  
  In the middle of the HPSS upgrade. Will know tomorrow if the downtime needs to be extended.
  
  Need to work with OSG so that a tape-only downtime does not take all other storage offline.
  
  Increased the amount of space in the dCache Tape staging pools to account for increasing use of Data Carousel.
  
  Need better monitoring w/ respect to Data Carousel to understand why we have some transfer tasks that do not finish within 1 week.
  
  Before the HPSS downtime, copied off tape 210k DAOD and AOD files for which BNLLAKE held the only copy.
  
  Data was copied from BNLLAKE to BNL-OSG2_DATADISK
  
  16864 DAOD and 193889 AOD files remain to be copied off tape onto disk; will be done when
  
  Reconfigured BNLLAKE and DATATAPE/MCTAPE pools to increase DATATAPE/MCTAPE pool to 1.8 PB
  
  Preparatory work done to split the tape file families in two for BNLLAKE: one file family associated with the local group disk and one with the data disk.
  
  Setting up test stand for SRM + HTTPS tests.
  
  Once initial tests are successful, the plan is to use SRM + HTTPS for BNLLAKE and treat it like a tape endpoint until Rucio QoS goes into production sometime in FY22.
- 13:40 → 14:00
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  FY21-Purchasing.txt
  
  N_Jobs-20210804.png
  
  Success-20210804.png
  
  Transfer-30310804.png
  Reasonably good running. One CERN FTS issue affected most of the grid on 7/26-7/27.
  
  Lots of work preparing for the FY21 purchases.
  - 13:40
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
    
    1) Updated to the new CentOS7 security kernel 1160.36,
    did firmware updates on all nodes and rebooted them into the new kernel.
    Also in the process of rebuilding all remaining SL7 WNs to CentOS7, for uniformity.
    
    2) Did the two condor security updates (8.8.13->8.8.14->8.8.15)
    and updated the gatekeepers' condor-ce to 4.5.24.
    
    3) Since a reboot was needed for the new kernel, also updated dCache from 6.2.23 to 6.2.25;
    smooth update (also FW/BIOS).
    
    5) MSU site is moving the last batch of WNs to the Data Center today.
    All nodes moved, powered, currently getting connected. Will be done by end of day.
    
    6) Still have IPv6 issues at UM. We see them happening on the data switches too.
    
    7) Had one instance of job draining due to pending transfer jobs >4000.
    
    8) Adjusted space tokens in dCache; the result is an increase of AGLT2DATADISK by 290 TB.
  - 13:45
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    UC
    
    Power outage at UC took out some of our equipment, including our 40kW UPS. Running in bypass mode for now while we are relocating datacenters.
    
    More relocation hardware received. David is working on setup so that we can start data migrations
    
    Network downtime scheduled for Aug 16. Need to schedule in CRIC
    
    IU
    
    Nothing notable to report
    
    UIUC
    
    SLATE node received and in the process of being built
    
    perfSONAR nodes purchased, estimated delivery Aug 16
  - 13:50
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    0 GGUS tickets
    
    1 HC bump yesterday, stage-in timeouts
    
    MGHPCC annual maintenance down day coming up: August 9
    
    Xrootd containers working at BU; HTTP-TPC tests working with local adler callout. Some deletion errors, which will probably disappear when we expand. Next steps:
    
    Expand atlas-xrootd.bu.edu to all current gridftp endpoints
    
    Do likewise for NESE storage endpoints (NESE_DATADISK, NESE_SCRATCHDISK)
    
    Do likewise for NESE Tape endpoints
    
    Need to update OIM with Mark
    
    16 DTN endpoints arrived at NESE, racked and cabled.
    
    IPv6 set up on perfSONAR nodes. We still need to test, then expand.
    
    Preparing for major worker node purchase ASAP
    
    Start planning for NESE Ceph storage purchase in the Fall
    
    UMass is joining NET2: a new person to help with day-to-day operations; expansion into UMass space at MGHPCC; collaboration with BU, Harvard and UMass people on a large shared pool of worker nodes, roughly along the lines of the shared-storage NESE project.
  - 13:55
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    OU:
    
    - Running well
    
    - Had CVMFS problems with one worker node which caused strange rucio errors. Reboot fixed that.
    
    - Now using HTTP-TPC in production, seems to run well.
    
    UTA:
    
    - Two site issues:
    
    (i) squid outage on 7/24
    
    (ii) site drained on 7/30
    
    - Equipment from most recent purchase arriving. Will need to schedule a downtime for the installation of the LAN re-vamp.
    
    - Ongoing work / testing:
    
    (i) IPv6
    
    (ii) xrootd-TPC
- 14:00 → 14:05
  WBS 2.3.3 HPC Operations 5m
  
  Speaker: Lincoln Bryant (University of Chicago (US))
  
  nersc-080421.png
  
  tacc-080421.png
  TACC:
  
  We have used just over 50% of our TACC allocation.
  
  I messed up my proxy on 7-27 which led to the large number of failures, otherwise TACC running fine.
  
  NERSC:
  
  Generally running well.
  
  Only 3.5M hours remaining at NERSC. Do we need to dial it back more?
- 14:05 → 14:20
  WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:15
    Analysis Facilities - Chicago 5m
    
    Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
    
    Working on backend and getting ready for actual users with the team.
    
    Added a couple pre-alpha users to poke around internally.
    
    Waiting for our permanent networking to be configured and connected. Hopefully will be done in the next couple weeks.
    
    Dell GPU we purchased is slated to arrive in November due to shortages.
    
    "old" ML platform works fine. There were two instances of users hoarding GPUs.
- 14:20 → 14:40
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  Discussing adding a new OSG-CRIC mapping to distinguish between tape and disk SE downtimes
  
  Attempting to avoid site job starvation due to FTS problems causing backlog of jobs transferring out
  
  Updated PQ transferringlimit parameter to 5000 at MWT2, 3000 at other T2s (OU still default). BNL is 8000. BU at 20000?
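
A minimal sketch of how the limits quoted above might be applied, assuming a queue stops receiving new jobs once its number of jobs in the transferring state exceeds transferringlimit. The function names, the site labels used as keys and the strict ">" comparison are hypothetical and are not taken from PanDA or CRIC. BU and OU are left out because the notes do not confirm their values.

```python
from typing import Optional

# Limits quoted in the notes above. BU ("20000?") and OU ("still default")
# are excluded because their values are not confirmed there.
QUOTED_LIMITS = {"MWT2": 5000, "BNL": 8000}
OTHER_T2_LIMIT = 3000  # "3000 at other T2s"
UNCONFIRMED = {"BU", "OU"}


def transferring_limit(site: str) -> Optional[int]:
    """Return the quoted limit for a site, or None if the notes leave it open."""
    if site in UNCONFIRMED:
        return None
    # Any site not listed explicitly is treated as one of the "other T2s".
    return QUOTED_LIMITS.get(site, OTHER_T2_LIMIT)


def would_starve(site: str, n_transferring: int) -> Optional[bool]:
    """Hypothetical check: True if the transferring backlog exceeds the limit.

    The strict '>' comparison is an assumption made for illustration.
    """
    limit = transferring_limit(site)
    if limit is None:
        return None
    return n_transferring > limit
```

With these assumed values, a backlog of 4000 transferring jobs would be over the limit at a 3000-limit Tier-2 but not at MWT2 or BNL.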
  
  Enabling IPv6 on BNL gridftp and davs doors
  
  HTTP-TPC: OU looks good in Paul's smoke tests. BU still failing? What is status of CPB?
  
  Would like to deploy SLATE Squid at OU -- avoid reliance on network connection to UTA (Frontier-Squid seg faulting on failed DNS lookup).
  
  BNL VP queue re-enabled with new gStream monitoring
  
  Activity on disk cache still seems low - investigating
  - 14:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-7_28_21.pdf
    
    US-cloud-summary-8_4_21.pdf
  - 14:25
    
    Service Development & Deployment 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    Back from vacation.
    
    Found that everything was working fine without me.
    
    XCaches - all running fine.
    
    VP - all fine. Yesterday ANALY_BNL_VP queue put online. Will take quite some time to fill the cache with only 100 workers.
    
    This week testing Rucio changes related to replica ordering based on GeoIP.
  - 14:30
    
    Kubernetes R&D at UTA 5m
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    Nothing new from what we discussed the last time.
- 14:40 → 14:45
  
  AOB 5m