US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

      Pre-scrubbing prep meeting Friday at 2 PM Central.

       

      Sites should start FY21 equipment purchase planning ASAP due to the global chip shortage.

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      US ATLAS downtimes

      • According to Alexey, the AGLT2 SE downtime was not added to the calendar because the AGLT2_SE resource has an FQDN/service combination of head01.aglt2.org/GridFTP, while the ATLAS CRIC-registered GridFTP endpoint is associated with dcdum02.aglt2.org (https://atlas-cric.cern.ch/core/service/detail/AGLT2_SE_0/)
      • Still verifying with Alexey, but it looks like we need an audit so that each CRIC endpoint and its protocols correspond to a Topology-registered resource and its services
      • Topology service -> CRIC protocol mapping (an illustrative sketch follows this list). Currently CRIC only picks up GridFTP/SRM downtimes:
        • GridFtp -> GRIDFTP
        • SRMv2 -> SRMv2
        • WebDAV -> WEBDAV (not yet implemented)
        • XRootD component -> XROOTD (not yet implemented)
        • XRootD HA component -> XROOTD (not yet implemented)
        • XRootD cache server -> XROOTD (not yet implemented)
      • OSG will start working on the ability to more easily declare downtime for an entire site (i.e., Topology Resource Group): https://opensciencegrid.atlassian.net/browse/SOFTWARE-3526
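
      As a rough illustration of the mapping above (not an official OSG or CRIC tool; the service names are taken verbatim from the list, and the audit logic is hypothetical), a short Python sketch could flag which registered services of a resource would currently have their downtimes propagated to CRIC:

        # Topology service -> CRIC protocol, as listed above; None marks
        # combinations whose downtime propagation is not yet implemented.
        TOPOLOGY_TO_CRIC = {
            "GridFtp": "GRIDFTP",
            "SRMv2": "SRMv2",
            "WebDAV": None,
            "XRootD component": None,
            "XRootD HA component": None,
            "XRootD cache server": None,
        }

        def downtime_coverage(topology_services):
            """Split a resource's Topology services into those whose downtimes
            currently reach CRIC and those that do not."""
            covered = [s for s in topology_services if TOPOLOGY_TO_CRIC.get(s)]
            uncovered = [s for s in topology_services if not TOPOLOGY_TO_CRIC.get(s)]
            return covered, uncovered

        # Hypothetical SE with three registered services:
        covered, uncovered = downtime_coverage(["GridFtp", "WebDAV", "XRootD component"])
        print("Downtimes propagate for:", covered)            # ['GridFtp']
        print("Downtimes do NOT propagate for:", uncovered)   # ['WebDAV', 'XRootD component']
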
    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR))

      Mostly quiet at T1.

      dCache experts are currently performing deletion-rate tests, aiming to increase the deletion rate from 16 Hz to something much higher (see the rough estimate below). More results soon.
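
      For context on why 16 Hz is limiting, here is a back-of-the-envelope estimate (assuming a sustained, constant deletion rate; the actual target rate was not quoted):

        def drain_hours(n_files, rate_hz):
            """Hours needed to delete n_files at a constant rate_hz deletions per second."""
            return n_files / rate_hz / 3600.0

        # At ~16 Hz, one million queued deletions take roughly 17 hours;
        # a tenfold higher rate would bring that under 2 hours.
        for rate_hz in (16, 160):
            print(f"{rate_hz:>4} Hz: {drain_hours(1_000_000, rate_hz):.1f} h per 1M files")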

      Investigating 11 missing files (trying to understand why dCache removed the files soon after they were deleted).

       

    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Several serious problems over the last 2 weeks:


        • AGLT2: Main switch at UM failed.
        • MWT2: AC problems at UC. Also some trouble keeping site full.
        • NET2: Network issues plus other smaller issues.
        • SWT2_OU: One third of jobs trying to return output files to CERN fail.
          • Limited outgoing bandwidth: O(100 kb/s)
          • Incoming network works fine with reasonable bandwidth.
        • SWT2_CPB: Outage last Saturday and intermittent problems thereafter.
      • These site problems smoked out an issue preventing storage downtimes from being properly declared and propagated to CRIC (see the OSG-LHC report above).
      • Could NET2 and SWT2 please report on their IPv6 status?
      • Heard that there was progress on XRootD TPC transfers.
      • Preparing for pre-scrubbing.
      • Get your purchases in early, as there are long lead times for servers because of the worldwide chip shortage. Planning to arrange a presentation on Dell's newest servers.
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        1) MSU resolved the cooling issue in the server room (details in [1] below).

        2) On May 18th, the core switch (Mellanox 2700) died from a storage device issue. To recover, we replaced it with the new Dell S5232F switch. Afterwards, however, we saw a lot of packet loss among different hosts and had to connect all the other switches and core service hosts directly to the new switch. It appears to be a spanning-tree issue between the older Dell switches (whose OS cannot be updated to OS10) and the new switch. We cannot resolve this issue, especially since we have planned to replace all the switches in mid-June.

        During this incident, we lost 3 dCache pools with 150 TB of data on an aging, out-of-warranty MD3260 storage enclosure (the vdisk failed and could not be recovered without technical support). We declared the lost files.

        One big challenge in the recovery involved VMware: one of the VMware nodes had trouble booting into its system (due to the network issue), so we had to migrate the images from this host to the other two, and we are still working with VMware support to bring this host back into the cluster.

        This incident caused a downtime of 6 days. We still have ~30 worker nodes suffering from significant packet loss, but the ATLAS job failure rate is low (~3%); the impact is larger on Tier-3 jobs.

        3) Had difficulties setting the site downtime; it did not seem to match CERN and file-access expectations.
        Part of that was an operator problem (Philippe) until he started using the proper OSG wiki info.
        Part of that must be some missing Topology entries (more services besides GridFTP in the SE?) or the advertisement between OSG and CRIC.

         

        [1] MSU CRAC problems: We replaced the control board of CRAC#2, as it was still operating but could no longer read its temperature sensors.
        CRAC#2 had been relying on CRAC#1 for control, making CRAC#1 a potential single point of total failure if it ever stopped. This scenario was deemed too dangerous, even for the couple of months we have left, so we paid for a control-board replacement.
        Separately, CRAC#1 turned off one of its compressors last Sunday, a repeat of what happened a month ago.
        These concerns will disappear when we move to the MSU data center in June-July.

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Jessica Lynn Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))

        Multiple cooling failures at UC last Thursday and Friday. UC worker nodes were offline over the weekend, but storage remained online. Harvester took a while to fill the site each time we came out of downtime.

        Issues with declaring the MWT2 SE offline due to WebDAV; Fred followed up on this.

        Updated condor-ce on our GKs to 4.4.1-1.osg35.el7.

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Incident when one of the GPFS pools filled up => GGUS ticket, fixed immediately.

        Made some squid config progress with squid team.

        Working on XRootD 5.2.0 with containers.

        Preparing to buy worker nodes.

        One node being tested with IPv6.

        Met with Mark re: figuring out how to add new resources in OSG in the GitHub era.

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • IPv6 testing continues, along with setting up the deployment process. Will ask for AAAA records for the perfSONAR machines (a quick check is sketched after this list).
        • Starting to clear the logistics backlog that will allow decommissioning of UTA_SWT2.
        • An incident with a rack-level switch caused problems with production; looking at updating the firmware.
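
        A quick way to confirm the AAAA records once they are published might look like the following (standard library only; the hostname is a placeholder, not an actual SWT2 perfSONAR node):

          import socket

          def has_aaaa(hostname):
              """Return True if the host resolves to at least one IPv6 address."""
              try:
                  return bool(socket.getaddrinfo(hostname, None, family=socket.AF_INET6))
              except socket.gaierror:
                  return False

          print(has_aaaa("psonar01.example.edu"))  # placeholder hostname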

         

        OU:

        - Running well, getting great opportunistic throughput, up to 70% (2k slots) extra.

        - Seeing some OU-CERN outbound transfer failures/timeouts; working with both local network folks and WLCG experts to isolate the issue. Will likely migrate to the OU DMZ soon.

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speaker: Lincoln Bryant (University of Chicago (US))

      NERSC was failing jobs all weekend, perhaps related to a config change / teething problems in the handover from Doug to Lincoln. Reverted the config change; jobs aren't outright failing now, but a lot of jobs are "closed". Doug advises that PanDA is not getting results quickly enough from NERSC. Investigating.

      TACC is running more or less smoothly; need to switch the proxy over from the robot cert to a personal cert with the production role ASAP (a rough sketch follows).
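
      A minimal sketch of generating such a proxy, assuming voms-proxy-init is installed and the personal certificate is in its default location (this is not necessarily the exact procedure that will be used for TACC):

        import subprocess

        def make_production_proxy(hours=96):
            """Request a VOMS proxy carrying the /atlas/Role=production FQAN."""
            subprocess.run(
                ["voms-proxy-init",
                 "-voms", "atlas:/atlas/Role=production",
                 "-valid", f"{hours}:00"],
                check=True,
            )

        make_production_proxy()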

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        Analysis Facilities - Chicago 5m
        Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))

        All running fine on the legacy ML platform. 

        For the AF, equipment is being racked: hyperconverged storage/CPU and login servers. The GPU server will not arrive until September.

        Starting to work on setting up the AF - DNS, website, account provisioning.  

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind
      • OIM/CRIC downtime declaration issues during AGLT2 switch incident
        • Organized discussion as main topic at ADC Weekly Round Table
        • Bug found in CRIC; clarifications for OSG Topology
        • Brian/Alexey following up
      • HTCondor_CE version tracking added to Facility Services spreadsheet
      • BNL XRootd test shows stable memory profile
      • BNL_ANALY_VP now running well with transfer activity on BNL XCache
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCaches 

        • MWT2 - from time to time the load gets very high and data bypasses the cache
        • AGLT2 - had an issue with the main switch, but the caches are running fine
        • All others running fine.

        VP

        • runs fine

        ServiceX

        • stress testing with local (MWT2) data: 2.4 TB (1400 files) in ~8 min from up to 280 transformers (see the throughput estimate after this list)
        • bugging Rucio developers to improve replica ordering
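
        The quoted numbers imply roughly the following rates (purely illustrative arithmetic on the figures above):

          total_gb = 2.4 * 1000          # 2.4 TB transferred
          n_files = 1400
          seconds = 8 * 60               # ~8 minutes
          transformers = 280

          print(f"aggregate: ~{total_gb / seconds:.1f} GB/s")                              # ~5 GB/s
          print(f"per transformer: ~{total_gb * 1000 / seconds / transformers:.0f} MB/s")  # ~18 MB/s
          print(f"file rate: ~{n_files / seconds:.1f} files/s")                            # ~2.9 files/s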

        Squids

        • running fine (apart, again, from the AGLT2 switch issue)
        • fixed some CRIC settings
        • still some things to understand about Frontier Squid requests; talking to Julio and Dave.
    • 14:40 14:45
      AOB 5m