US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2022-01-05T13:00:00-05:00
End: 2022-01-05T15:10:00-05:00
Location: No location set

Wednesday 5 Jan 2022, 13:00 → 15:10 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
  
  We have an early deadline for quarterly reports because of the review at the end of the month. Reports are due by Friday, January 14, 2022 (a week from this coming Friday). To allow Rob and Shawn to complete our WBS 2.3 version, we need the level 3 (WBS 2.3.x) reports done by Wednesday, January 12, 2022. Please try to get these completed ASAP.
- 13:10 → 13:20
  
  OSG-LHC 10m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  
  Several packages ready for testing:
  
  3.5.53-upcoming and 3.6:
  - HTCondor-CE 5.1.3 (various bugfixes, see https://opensciencegrid.atlassian.net/browse/SOFTWARE-4951)
  - XRootD 5.4.0 (new features and bug fixes, see https://github.com/xrootd/xrootd/releases/tag/v5.4.0)
  3.6 only:
  - oidc-agent 4.2.4 (new major version, see https://github.com/indigo-dc/oidc-agent/releases for changes since 3.3.3)
  - cvmfs 2.9.0
- 13:20 → 13:50
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    TBD 30m
- 13:50 → 13:55
  
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  
  Moving CPU servers from the old data center to the new data center today. They are coming online later this week.
- 13:55 → 14:15
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  N_Jobs-20220105.png
  
  Success-20220105.png
  
  Transfers-20220105.png
  No big crises during the holiday break.
  
  All sites had some problems.
  
  There was a major issue at CERN that disrupted the monitoring, but the missing data is now available.
  
  The quarterly reporting is due early this year. I want your input by the end of the day next Tuesday. I listed 4 specific items that I want each site to address in its report (for some items, some sites will simply report that they are in the final configuration for the start of Run 3):
  
  Updating OSG and Condor versions.
  
  Updating storage version.
  
  Updates to the queuing system.
  
  IPv6.
  
  Seems like XRootD may be making progress toward stable HTTP-TPC transfers.
  - 13:55
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
    
    - updated ELK for log4j issues, applied other security updates.
    - MSU received SLATE node, being installed with Alma Linux (per SLATE team request)
    - MSU received network capture node, will also be Alma Linux (for Milan CPU)
    - MSU received VMware storage node
    - purchase plan: R740xd2 with 18 TB drives, R6525 with AMD 7452 (128 HT/node); final count to be determined after final quotes. Roughly $500k total, split roughly 50/50 between storage and compute.
    
    - Rebooting the Cisco border switches caused IPv4 issues among various machines at the UM site, which led to CVMFS failures and squid server failover. It took a couple of days of debugging (between UM ITS and Cisco support) to fix the issue.
    
    - Another SLATE squid issue: the squid did not show traffic on the CERN squid monitoring; we had to rejoin the nodes to the k8s cluster to fix it.
    
    - A patch applied to the Cisco border switches fixed the IPv6 forwarding issues (to the Dell management switches), so we were able to bring all the R620 worker nodes whose data connections go through the Dell management switches back into Condor.
    
    - Merit Networks has had another issue on MiLR (our network that connects us to Chicago and East Lansing). This broke our default route and access to and from AGLT2 from non-R&E (research and education) networks.
    
    - A typo in a routing-rule change caused IPv6 ping failures to all CERN machines, and many jobs failed with Rucio timeouts. It was fixed the next morning.
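The AGLT2 purchase plan above can be sanity-checked with a little arithmetic. The following is only a sketch: the budget and the 50/50 split come from the notes, the thread count assumes a dual-socket R6525 with 32-core EPYC 7452 CPUs and two hardware threads per core, and the node price is a made-up placeholder, not a quote.

```python
# Figures taken from the notes: roughly $500k total, split roughly 50/50.
TOTAL_BUDGET_USD = 500_000
STORAGE_FRACTION = 0.5

# Assumptions: AMD EPYC 7452 has 32 cores per socket with 2 hardware
# threads per core, and the R6525 is populated with 2 sockets.
CORES_PER_SOCKET = 32
THREADS_PER_CORE = 2
SOCKETS_PER_NODE = 2

# Hypothetical placeholder price, NOT taken from any quote.
HYPOTHETICAL_NODE_PRICE_USD = 10_000


def threads_per_node():
    """Hardware threads (HT) available on one compute node."""
    return CORES_PER_SOCKET * THREADS_PER_CORE * SOCKETS_PER_NODE


def split_budget(total, storage_fraction):
    """Return (storage_budget, compute_budget) for a given split."""
    storage = total * storage_fraction
    return storage, total - storage


def nodes_affordable(budget, price_per_node):
    """Whole number of nodes that fit in the budget."""
    return int(budget // price_per_node)


if __name__ == "__main__":
    storage, compute = split_budget(TOTAL_BUDGET_USD, STORAGE_FRACTION)
    nodes = nodes_affordable(compute, HYPOTHETICAL_NODE_PRICE_USD)
    print(f"storage budget: ${storage:,.0f}, compute budget: ${compute:,.0f}")
    print(f"{nodes} nodes x {threads_per_node()} HT = {nodes * threads_per_node()} HT")
```

Once final quotes arrive, replacing the placeholder price gives the node count and total HT directly.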
  - 14:00
    
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    Upgraded uct2-gk to htcondor-ce 5.1.2 and condor 9.0.8 this morning
    
    One of the UC dCache nodes went offline December 26th. Pools were brought back up that day
    
    Second set of dCache transfers finished for the UC server room relocation. Next move is scheduled for January 24th
    
    New IU and UIUC compute nodes online. Revised UC order submitted, still waiting on an estimated shipping date
    
    Surplus UC servers arrived at IU. Fred is in the process of installing them
    
    Discussing upcoming purchase order. Fred is working on benchmarking and quotes
  - 14:05
    NET2 5m
    
    Speaker: Prof. Saul Youssef
    
    Screen Shot 2022-01-05 at 1.12.00 PM.png
    
    - We had staging issues over the break and had to limit the total number of jobs by hand.
    - Downtime on Tuesday Jan 11 for:
      - Retiring 3TB pool (770TB)
      - NFS kernel upgrade
      - Preparations for new worker nodes
    - Adding 4 DTN nodes to increase the GPFS-worker bandwidth.
    - About to place orders on a new NESE Ceph rack to add to NESE_DATADISK. 3.8PB raw, 12 new DTN nodes.
    - NESE Tape working, coming online.
    - Pressing Harvard on IPv6.
    - Plans for NET2 expansion with UMass, bare metal cluster, etc. nearing finalization.
  - 14:10
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    OU:
    
    - Nothing to report, all running well.
    
    UTA:
    
    Problems still occurring with the WebDAV door. We are going to upgrade the version of XRootD and set up the existing gridftp servers to take the load of transfers.
    
    Over the break, we had one small downtime while the chilled water for the lab was being worked on. Fortunately the cooling was maintained and we were able to come back quickly.
- 14:15 → 14:20
  WBS 2.3.3 HPC Operations 5m
  
  Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
  TACC
  
  SLATE node has an issue - Kubernetes has broken itself.
  
  Working with TACC team to fix this
  
  Harvester broken as well, because it was using SLATE node for MySQL DB
  
  Standard SQLite installation won't work at TACC for some reason. Something strange in the environment?
  
  NERSC
  
  Allocation approved for Perlmutter: we have 500K CPU hours and 11K GPU hours for 1 year, starting Jan 19th
  
  Cori failing large number of jobs - logs indicate SLURM is cancelling the jobs after about 30 minutes.
- 14:20 → 14:35
  WBS 2.3.4 Analysis Facilities
  
  Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:20
    Analysis Facilities - BNL 5m
    
    Speaker: Ofer Rind (Brookhaven National Laboratory)
    
    NTR. Compiling info for next week's GDB presentation.
  - 14:25
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:30
    Analysis Facilities - Chicago 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    Tier3 status update provided by Fengping Hu et al... sent to Alessandra:
    
    https://docs.google.com/presentation/d/1RZeeTkCZ8biLEXGGDhKsuN-nOM5Pf6dHcwy0eXPWHMw/edit?usp=sharing
- 14:35 → 14:55
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  
  Screen Shot 2022-01-05 at 12.57.52 PM.png
  
  Screen Shot 2022-01-05 at 12.58.16 PM.png
  BNL HTCondor-CEs have been upgraded (thanks Xin!)
  
  ANALY_BNL_VP queue issues were traced to a deactivated CE, followed by a problematic "flavour" value in CRIC, then a maxWallTime=0 pilot setting. Jobs seem to be running again as of this morning
  
  File transfers, and staging, apparently continued during the BNL tape service downtime on 12/29 13:00-17:00 UTC (link). Why?
  
  MWT2 squid service briefly degraded after SLATE reconfiguration and failed restart
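The ANALY_BNL_VP chain of failures above (deactivated CE, problematic "flavour", maxWallTime=0) is the kind of thing a small pre-flight check could flag. Below is a minimal sketch, assuming a queue is described by a plain dictionary. The key names `ce_active`, `flavour` and `max_wall_time` are hypothetical and do not reflect the actual CRIC schema, and since the notes do not say what made the flavour value problematic, the sketch only checks that it is set.

```python
def find_queue_problems(queue):
    """Return a list of human-readable problems for one queue record.

    `queue` is a hypothetical dict; the keys used here are illustrative
    and are not the real CRIC field names.
    """
    problems = []
    if not queue.get("ce_active", False):
        problems.append("CE is deactivated")
    if not queue.get("flavour"):
        problems.append("flavour is missing or empty")
    if queue.get("max_wall_time", 0) <= 0:
        problems.append("maxWallTime must be positive")
    return problems


if __name__ == "__main__":
    # Illustrative record reproducing the three failure modes from the notes.
    broken = {"ce_active": False, "flavour": "", "max_wall_time": 0}
    for problem in find_queue_problems(broken):
        print(problem)
```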
  - 14:35
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-12_15_21.pdf
    
    US-cloud-summary-12_22_21.pdf
    
    US-cloud-summary-12_29_21.pdf
    
    US-cloud-summary-1_5_22.pdf
  - 14:40
    Service Development & Deployment 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    XCache
    
    Working fine. Restarted all SLATE instances to pick up some small changes.
    
    AGLT2 nodes needed intervention from Wenjing and Mohammad
    
    VP
    
    working fine
    
    BNL_VP now getting jobs but jobs failing. Xin and Ofer looking at it. Not related to XCache
    
    Analytics
    
    ES running fine. Preparing the next batch of servers for transport
    
    Updating all the Logstash instances; there are 4 running.
    
    Updating Alarm & Alert frontend.
    
    ServiceX
    
    stress testing
    
    testing for graceful handling of errors
  - 14:45
    
    Kubernetes R&D at UTA 5m
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    While waiting for the decommissioned UTA_SWT2 hardware, which will be used for the Kubernetes cluster at CPB, we are working on a faster option: starting with fewer machines before the main chunk of UTA_SWT2 hardware arrives.
- 14:55 → 15:05
  
  AOB 10m
