US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (tomorrow)
- HTCondor 9.0.2, BLAHP 2.1.0 (upcoming in OSG 3.5 and 3.6)
- XRootD 5.3.0 (3.5 upcoming)
- VOMS client update to support requesting VOMS proxies from IAM
- XCache 2.0.1 (3.5 upcoming)
Miscellaneous
- OSG Yum repos are down; subscribe to status updates at https://status.opensciencegrid.org/
- Multi-resource downtime page should be available this week or next
-
13:20
→
13:35
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 10m
-
13:20
-
13:35
→
13:40
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
- Currently adding more space from the DATALAKE to the ATLAS DATA-Tape and MC-Tape dCache staging pools. This should reduce the churn we are seeing (i.e., files copied from tape to staging disk and then removed before ATLAS copies them away).
Reminder: HPSS (tape system) downtime from 2-Aug-21 through 7:00 pm on 5-Aug-21
-
13:40
→
14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Some issues in the last couple of weeks:
- MWT2: Upgrades to the main enterprise network switches at UC this Wednesday and last Wednesday. Last week's change caused IPv6 issues.
- AGLT2: IPv6 issues and a full work area caused problems.
- MSU: Moving to the new location today.
- Illinois: Today is the quarterly preventive maintenance period.
- Get your reporting in today!
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
1) MSU site is moving 65 WNs to the new DC, i.e. all of the newer WNs (R620s, R630s, C6420s).
2) UM site is working on the IPv6 issues on the new network. There are two causes; we solved one set of problems by adding static IPv6 ND mappings to the gateway, and are still working on the second set, which comes from R620s whose data cables are connected to the management switches.
3) Job failures: 40% failure on 20 July due to two errors. "Payload metadata does not exist" disappeared on 21 July (AGLT2 has the largest number of failed jobs for this error within US ATLAS, but some other sites see similar errors). The "no local space" error occurred because the home directories for the usatlas users filled up after years of small files piling up; we cleaned the space and set up a cron job to keep it clean.
Details on 2):
More worker nodes are having IPv6 connectivity issues (they cannot reach the gateway). There are two sets of causes. The first is possibly a bug in either the Juniper or the Cisco border switches; the workaround is to add static IPv6 ND mappings to the Juniper gateway (we have added all worker nodes). Hopefully this will be resolved when we can retire the Juniper gateway (moving to Cisco) in August. The second is that the management switches (S3048) have IPv6 issues; ~20 R620s need to connect to the management switches for data connections. We haven't found a solution for that yet, so Condor is retired on all R620 worker nodes for now.
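The home-area cleanup cron mentioned above can be sketched as a small script that removes files not modified within a retention window. This is an illustrative assumption about how such a job could work, not AGLT2's actual script; the path and 7-day threshold are placeholders.

```python
# Hypothetical sketch of a home-directory cleanup job: delete regular
# files under `root` whose modification time is older than the cutoff.
# Run it from cron (e.g. nightly) against the shared usatlas home area.
import os
import time

def clean_old_files(root, max_age_days):
    """Remove files under `root` older than `max_age_days`; return removed paths."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                pass  # file vanished or is not removable; skip it
    return removed
```

A crontab entry such as `0 3 * * * python3 /usr/local/sbin/clean_home.py` (path illustrative) would run it nightly.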
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
UC:
- North border router was upgraded last week (7/14) and the south border router is being upgraded this week (7/21)
- After the north border router was swapped out, routing moved to the south which led to IPv6 issues over the weekend and early this week. UC network engineers worked on a fix, but ultimately moved IPv6 routing through the north border router temporarily.
- GGUS #153052 associated with the IPv6 issue (transfer issues)
- 0% transfer efficiency with NERSC-PDSF
- Relocation equipment trickling in.
IU:
- New management nodes up and running.
- Working on getting new PerfSonar machines set up
UIUC:
- SLATE node arrived. Needs to be built and configured.
- Quarterly PM today (7/21)
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
GGUS tickets: 0
HC blacklists: 0
o Smooth operations
o Site full except for a dip around 2021-07-15 (unknown if it's a widespread dip)
o Advanced stages of getting ready to buy worker nodes.
o xrd 5.3.0 installed and working in our custom container
o Successfully exporting NET2_DATADISK, _SCRATCHDISK, _LOCALGROUPDISK
o Endpoint atlas-xrootd.bu.edu registered in CRIC
o Configured for HTTP-TPC, custom adler32, both work successfully
o Getting put into "smoke tests" by Alessandra & co.
o Some problems remain, possibly related to transfers to dcache sites, Wei and Andy are investigating.
o NESE Tape ATLAS endpoints have arrived, expect to be racked and cabled this week.
o perfSONAR node rebuilt with new hardware; both nodes are IPv6 now.
o Annual MGHPCC power maintenance, August 9
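For context on the "custom adler32" item above: Adler-32 is the checksum ATLAS uses to verify transfers, and a storage endpoint typically computes it incrementally as the file streams through. A minimal sketch of that incremental computation with Python's zlib (illustrative only; NET2's actual implementation lives in their XRootD setup):

```python
# Compute an Adler-32 checksum over data delivered in chunks, the way
# an endpoint would checksum a file during a streaming transfer.
import zlib

def adler32_file_chunks(chunks):
    """Running Adler-32 over an iterable of byte chunks."""
    value = 1  # Adler-32 of the empty string is 1
    for chunk in chunks:
        value = zlib.adler32(chunk, value)  # fold each chunk into the running value
    return value & 0xFFFFFFFF
```

Feeding the data in chunks yields the same value as checksumming the whole file at once, which is what makes per-transfer verification cheap.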
-
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
Setting up a test host, as a proxy, with the test version of XRootD 5.3 from OSG. Software installed; working on configuration.
Operations mostly smooth over the period
OU:
- Smooth operations, ran low on jobs occasionally.
- XRootD 5.3.0 installed, HTTP-TPC working, waiting to be included in smoke tests.
-
14:00
→
14:05
WBS 2.3.3 HPC Operations 5m
Speaker: Lincoln Bryant (University of Chicago (US))
-
14:05
→
14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
Analysis Facilities - Chicago 5m
Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
A few more compute nodes showed up and have been racked, built, etc. and added to the cluster.
A couple more interactive machines showed up, but haven't been racked and built yet. These (along with the three machines mentioned above) aren't necessary for us to go into production.
Still waiting on the GPU machine. We believe sometime in November is when it will arrive (according to Dell).
We've gotten a Condor queue up and running. Jobs can be submitted from both of the submit hosts we're planning to have for users on day 1.
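For readers unfamiliar with the setup above: from either submit host, users would describe a job in an HTCondor submit file and hand it to `condor_submit`. A minimal sketch (file name, paths, and resource request are illustrative assumptions, not the facility's actual template):

```
# hello.sub - minimal illustrative HTCondor submit description
executable   = /bin/echo
arguments    = "hello from the analysis facility"
output       = hello.$(Cluster).out
error        = hello.$(Cluster).err
log          = hello.log
request_cpus = 1
queue
```

Submitting with `condor_submit hello.sub` queues one job; `condor_q` shows its status.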
-
14:20
→
14:40
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Created FedOps team email list and documentation page, waiting on RT queue
- XRootD 5.3.0 release should now be deployed at sites; need to add it into the DOMA smoke tests
- BNL xcache updated and operational again (lost one NVME drive) - will wait for Ilija before activating VP queue
- Mark working with Saul on topology clean-up for NET2
- Working on Quarterly Report
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
The document at https://docs.google.com/document/d/1YDvIHMYihczN9zX-wxyBp2k7pBzS8YoDnA0EPhXspUo/edit?usp=sharing describes the emerging FedOps procedure for Frontier-Squid.
-
14:30
-
14:40
→
14:45
AOB 5m
-