US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2021-09-01T13:00:00-04:00
End: 2021-09-01T14:45:00-04:00
Location: No location set

Wednesday 1 Sept 2021, 13:00 → 14:45 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
  Token Transition
  
  Token workshop Oct 14-15
  
  Fahui successfully submitted to a MWT2 CE from Harvester this morning
  
  Target versions to install for ATLAS CEs: HTCondor 9.0.5 (available in osg-upcoming-testing) and HTCondor-CE 5.1.1 (available in osg-upcoming).
- 13:20 → 13:35
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    TBD 10m
- 13:35 → 13:40
  
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  
  - Working with ADC to repurpose 3.8 PB of MAS storage as DATADISK-like storage in order to meet 2021 pledges while the disk purchase is being processed. Timescale: ~1-2 weeks.
  
  - 2021 disk purchase released 8/16. Delivery mid-October.
  
  - 2021 CPU purchase released 8/26. Delivery timescale 2-3 months.
  
  - 2022 purchases will be released early in FY22 in order to meet the WLCG April 1st, 2022 milestone.
  
  - All new purchases will be deployed in new data center.
  
  - HPSS core will be moved to new data center in October (after WLCG Tape test): 1 day HPSS downtime.
- 13:40 → 14:00
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  N_Jobs_20210901.png
  
  Success_20210901.png
  
  Transfers_20210901.png
  Some bumps in the road over the last 2 weeks:
  
  A number of the problems were on the ADC side - mostly FTS problems:
  
  8/22 access to RUCIO's authentication blocked by CERN firewall.
  
  8/30 another problem with RUCIO authentication
  
  Final (I think) quotes are available from Dell. Please order.
  
  Still working on HTTP-TPC at sites using XRootD.
  
  Still working on IPv6 at NET2 and SWT2.
  
  Please state whether you are using just the SLATE squid or continue to maintain your own squids too.
  
  UTA_SWT2 move status?
  
  The Naples grid site went down, blocking transfers. The AGLT2 transfer backlog went over the transferring limit and no new jobs were activated. Rod W raised the transferring limit and jobs started flowing. Weirdly, other US sites had transfer backlogs over their new, higher transfer limits and still had jobs activated.
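The limit behaviour described above can be illustrated with a minimal sketch. It assumes a simple rule (no new jobs are activated while the number of transferring jobs exceeds the site limit) and uses the AGLT2 figures reported at this meeting: 3750 transferring jobs against limits of 3000 and then 4000. The class and function names are hypothetical, and this is not the actual workload-management logic; the observation that other sites over their limits still had jobs activated suggests the real decision involves more than this one comparison.

```python
from dataclasses import dataclass


@dataclass
class SiteTransferState:
    """Snapshot of a site's transfer backlog (hypothetical structure)."""
    name: str
    transferring_jobs: int   # jobs waiting on output transfers
    transferring_limit: int  # per-site limit, assumed to come from CRIC


def can_activate_jobs(state: SiteTransferState) -> bool:
    """Simplified rule: activate new jobs only while within the limit."""
    return state.transferring_jobs <= state.transferring_limit


# AGLT2 figures from the 30-Aug incident, before and after the limit change.
before = SiteTransferState("AGLT2", transferring_jobs=3750, transferring_limit=3000)
after = SiteTransferState("AGLT2", transferring_jobs=3750, transferring_limit=4000)

print(can_activate_jobs(before))
print(can_activate_jobs(after))
```

Under this assumed rule the site drains at a limit of 3000 and resumes once the limit is raised to 4000, which matches what AGLT2 reported.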
  - 13:40
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
    
    Stable running, overall.
    
    1. 24-Aug: ~250 jobs failed due to stage-out errors.
       This was caused by 3 worker nodes with IPv6 issues: they could ping the gateway but not some dCache servers.
       We added them to the set of offline nodes with IPv6 issues; hopefully this can be resolved after getting rid
       of the Shinano border switch.
    
    2. 30-Aug: sites started to drain because of accumulated transferring jobs (3750, exceeding the limit of 3000 set in CRIC).
       The accumulated transfers are destined for the Napoli site, which is currently in unscheduled downtime.
       The transferring limit was raised to 4000, and jobs are slowly ramping up.
    
    3. We noticed frequent worker node crashes caused by BOINC jobs (with squashfs errors flooding /var/log/messages).
       As a workaround, we redirect the squashfs messages to another log file and run logrotate more often.
       ATLAS@home also released a new version, which does not seem to solve the problem.
       We are also testing removing squashfs/singularity from worker nodes,
       to force the BOINC jobs to use the cvmfs singularity image.
    
    4. MSU site migration to campus data center complete (T2 and T3).
       Now 2x100G to Chicago and ESnet.
       Old room in dept building emptied, now used to test cables for IceCube.
       Will ship EX9208 parts to UC.
       Issue with "Export Control" understood.
  - 13:45
    
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    Cooling failure in the UC datacenter over the past weekend. Storage was set offline until temperatures were stable Sunday morning.
    
    Declared another file lost from our dCache issues last December. Identified a larger list of empty files that also appear to be from the same time period and will declare these lost as well.
    
    All swing equipment has arrived. dCache storage nodes are installed and in the process of being benchmarked. Data migrations should start this week.
    
    UC and IU are moved to the SLATE squids. UIUC is still using the old squids.
    
    Finalizing compute purchasing.
  - 13:50
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    o I've ordered 88 r6525 worker nodes from DELL.
    o Smooth running, watching for problems as usual.
    o atlas-xrootd.bu.edu working with webdav, adler callout, containers. Expanding to 9 nodes very soon. NESE endpoints soon to follow.
    o I'm on vacation this week, email me for urgent matters.
    
    - Saul
  - 13:55
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    OU -
    
    1) Migrated SE to OFFN DMZ network. Running well; haven't tested the new 50 Gbps maximum or IPv6 connectivity yet.
    
    2) Today is OSCER maintenance, which will upgrade the LDAP/IPA server, which could cause intermittent job or transfer failures because of authentication issues.
    
    3) Had some job failures yesterday, caused by CVMFS issues on 3 compute nodes; rebooting fixed that.
    
    UTA -
    
    1) Compute nodes and storage from last purchase have been racked. Expect to bring them online next week.
    
    2) Downtime in September for LAN installation
    
    3) Preparing for compute node purchase using common Dell quotes
- 14:00 → 14:05
  
  WBS 2.3.3 HPC Operations 5m
  
  Speaker: Lincoln Bryant (University of Chicago (US))
  
  Completed the NERSC allocation this week. TACC-Frontera working fine.
- 14:05 → 14:20
  WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    
    Re-mounting /hpcgpfs01 FS on spar010X nodes as NFS due to GPFS-appliance-version conflicts
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:15
    
    Analysis Facilities - Chicago 5m
    
    Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
    
    (David)
    
    GPU unit shipped without rails. Contacting the vendor to get them to fix the problem.
    
    Looking to purchase 8 more machines for the new analysis facility cluster in the upcoming purchase (IRIS-HEP SSL funds).
    
    Now on production network gear. Testing on it now.
    
    (Ilija)
    
    ML platform needs a security-related change (completely changing the k8s client library). Should be finished in a day or two.
- 14:20 → 14:40
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  BU XRootD server looks good in smoke tests. Need to add the NESE endpoint and move both into production.
  
  UTA XRootD server should be ready to add to smoke tests.
  
  SRM+HTTPS tests ongoing at BNL
  
  Need to settle on Topology/CRIC changes for BNL to allow separate downtime for tape endpoint
  
  OU to purchase a SLATE host to deploy local Frontier/Squid
  - 14:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-8_25_21.pdf
    
    US-cloud-summary-9_1_21.pdf
  - 14:25
    Service Development & Deployment 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    XCaches
    
    all working fine
    
    VP queues
    
    AGLT2 today just a few jobs
    
    Some issues accessing data from SWT2-CPB.
    
    Squids
    
    Some failovers from MWT2. Investigations ongoing. CPU was maxing out, will add one more instance.
    
    Issues with the perfSONAR data pipeline: the Nebraska message bus went down. Not yet completely back.
  - 14:30
    
    Kubernetes R&D at UTA 5m
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    Compute nodes and storage from the last purchase have been racked in CPB, logistically part of the preparation for the move of UTA_SWT2 hardware to CPB (see Mark's report).
- 14:40 → 14:45
  
  AOB 5m
