US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 11:00 11:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Good running over the last couple of weeks.
        • MWT2 has a small decrease in the number of slots today caused by kernel updates at the Illinois site.
        • NET2 has experienced intermittent brief periods of being put offline. This was due to 800G tests interfering with connectivity to the storage.
        • OU had some job failures caused by high-memory jobs (probably pile-up).
        • A power issue caused a three-day unplanned downtime at SWT2_CPB. More recently there was another problem that caused partial draining.
      • Tickets
        • 1000848 NET2 Transfer and Staging Errors (Not a NET2 problem?)
        • 1000757 NET2 as the dst with deletion error "The requested service is not available at the moment"
        • 3255 NET2_Amherst: jobs failing with "Job has reached the specified backoff limit" (BackoffLimitExceeded)
        • 683424 Dual-stack on OU_OSCER_ATLAS
        • 1000849 Transfer Errors: From TW-FTT to KIT-T2 Transfer failure: The peers certificate with subject's DN
      • Please get your reporting in this week.
      • Again this week there will be no procurement meeting on Friday.
    • 11:10 11:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
      • Currently the network connection is 10 Gbit/s, of which 5 Gbit/s are dedicated to the TW-FTT site.

      • Ticket #1000849 "Transfer Errors: From TW-FTT"
        The reason is that the storage certificate expired. We are in the process of applying for a new certificate to resolve the issue.
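
        As an illustrative aside (not the site's actual monitoring), an expiring host certificate can be caught before it breaks transfers with a small check like the sketch below; the endpoint, port, and warning window are placeholders, and the Python 'cryptography' package is assumed to be available.

```python
#!/usr/bin/env python3
"""Warn when a storage endpoint's host certificate is close to expiring.

Minimal sketch: the endpoint is a placeholder, not the actual TW-FTT
storage host, and the 'cryptography' package is assumed installed.
"""
import ssl
import sys
from datetime import datetime

from cryptography import x509

ENDPOINT = ("storage.example.org", 1094)  # placeholder host/port
WARN_DAYS = 14                            # warn two weeks before expiry

def days_until_expiry(host: str, port: int) -> int:
    # Fetch the server certificate in PEM form (no validation) and parse notAfter.
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())
    # Newer 'cryptography' releases prefer not_valid_after_utc; this works on both.
    return (cert.not_valid_after - datetime.utcnow()).days

if __name__ == "__main__":
    days = days_until_expiry(*ENDPOINT)
    if days < WARN_DAYS:
        print(f"WARNING: certificate on {ENDPOINT[0]} expires in {days} day(s)")
        sys.exit(1)
    print(f"OK: certificate on {ENDPOINT[0]} valid for another {days} day(s)")
```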
    • 11:20 11:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      Overall stable running.

      Continue to have sporadic CVMFS issues, almost all at UM.
        Shared nodes with Dave for debugging.
        A new fix is reportedly coming in CVMFS 2.13.3.

      07-Oct: Cooling issue at the MSU data center; 31 worker nodes shut themselves off on over-temperature.
        This happened during planned maintenance to replace a compressor.
        There is plenty of cooling capacity/redundancy, but the nearest ACU must be prevented from circulating hot air.
        The data center management team is meeting with building maintenance to understand and address procedural problems.

      Continue tracing worker nodes with poor CPU utilization (too much idle/system time); deployed Metricbeat and a dashboard to identify suboptimal nodes.
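
      For reference, the same signal can be sampled directly from /proc/stat on a node; the sketch below is an illustration only, with arbitrary example thresholds rather than the values used in the Metricbeat dashboard.

```python
#!/usr/bin/env python3
"""Flag a worker node with too much idle or system CPU time.

Minimal sketch sampling /proc/stat; the thresholds are arbitrary
examples, not the ones used in the AGLT2 dashboard.
"""
import time

IDLE_THRESHOLD = 0.30    # example: >30% idle on a full batch node is suspicious
SYSTEM_THRESHOLD = 0.20  # example: >20% system time suggests a problem

def cpu_times():
    # First line of /proc/stat: cpu user nice system idle iowait irq softirq ...
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

def cpu_shares(interval: float = 5.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta) or 1
    return {"user": delta[0] / total, "system": delta[2] / total,
            "idle": delta[3] / total, "iowait": delta[4] / total}

if __name__ == "__main__":
    shares = cpu_shares()
    if shares["idle"] > IDLE_THRESHOLD or shares["system"] > SYSTEM_THRESHOLD:
        print(f"WARNING: suboptimal CPU usage: {shares}")
    else:
        print(f"OK: {shares}")
```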

      Optimized the BOINC control wrapper for better CPU usage (to scavenge more of the idle CPU).
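
      The idea behind such a wrapper is sketched below; it is an illustration, not the actual AGLT2 script, and assumes a local BOINC client controllable with 'boinccmd'.

```python
#!/usr/bin/env python3
"""Toggle BOINC backfill based on how busy the node already is.

Illustrative sketch of a control wrapper, not the actual AGLT2 script:
it assumes a local BOINC client reachable via 'boinccmd'.
"""
import os
import subprocess
import time

TARGET_LOAD_FRACTION = 0.90   # example: back off when the node is ~90% loaded

def set_boinc_mode(mode: str) -> None:
    # 'boinccmd --set_run_mode always|auto|never' controls the local client.
    subprocess.run(["boinccmd", "--set_run_mode", mode], check=False)

def control_loop(poll_seconds: int = 60) -> None:
    ncpu = os.cpu_count() or 1
    while True:
        load1, _, _ = os.getloadavg()
        if load1 / ncpu > TARGET_LOAD_FRACTION:
            set_boinc_mode("never")   # node is busy with payload jobs
        else:
            set_boinc_mode("auto")    # let BOINC scavenge idle cycles
        time.sleep(poll_seconds)

if __name__ == "__main__":
    control_loop()
```

      A production wrapper would measure the non-BOINC load (for example from cgroup accounting) rather than the raw load average, which includes BOINC's own processes and would otherwise cause the control loop to oscillate.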

    • 11:30 11:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • Pilot version updated to the new major version (3.11.0.29)
      • UIUC PM today; Illinois nodes drained for rebooting into the new image.
      • IU compute downtime October 28.
      • XCache ticket resolved; Ilija also deprecated two of the three XCaches.
    • 11:40 11:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Storage improvements are ongoing:

      • Faster connection (400 -> 800 Gbit/s)
      • ZFS parameters change on storage servers for better performance
      • SENSE deployment

      Unfortunately, this is leading to occasional blacklistings: testing can overload links, and the ZFS changes require draining pools, which can occasionally cause instability.

      Tape is working quite well, but it is also contributing to the occasional blacklistings. Surges of tape transfers can temporarily fill a pool's queue, leading to imbalance issues that have to be corrected manually, a problem that is exacerbated if some pools are being drained. Furthermore, since tape transfers share the same queue as other transfers, they can time out when other issues (like the problem with transfers to INFN-ROMA) clog the queue. We believe this is responsible for ticket 1000848.

      Backoff limit errors have been more prevalent lately, possibly because more multi-core jobs are running on the site; we are trying to figure out a pilot fix.
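
      Since the message comes from the Kubernetes Job controller, one way to quantify these failures while the pilot fix is worked out is to count Jobs whose failure reason is BackoffLimitExceeded. The sketch below assumes the official 'kubernetes' Python client and a placeholder namespace; it is not part of the NET2 tooling.

```python
#!/usr/bin/env python3
"""Count Jobs that failed with BackoffLimitExceeded in a namespace.

Minimal sketch using the official 'kubernetes' Python client; the
namespace is a placeholder, not necessarily the one NET2 uses.
"""
from kubernetes import client, config

NAMESPACE = "atlas-jobs"  # placeholder namespace

def backoff_limit_failures(namespace: str):
    config.load_kube_config()  # use load_incluster_config() when run inside the cluster
    batch = client.BatchV1Api()
    failed = []
    for job in batch.list_namespaced_job(namespace).items:
        for cond in job.status.conditions or []:
            if cond.type == "Failed" and cond.reason == "BackoffLimitExceeded":
                failed.append(job.metadata.name)
    return failed

if __name__ == "__main__":
    names = backoff_limit_failures(NAMESPACE)
    print(f"{len(names)} job(s) hit the backoff limit")
    for name in names:
        print("  ", name)
```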

      The VP queue has been getting very few jobs lately, and those that it does get don't use the XCache, so it has been difficult to see whether it can benefit us at all. Also, there were a couple of days of downtime over the holiday weekend due to an expired certificate.

    • 11:50 12:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • Power Incident 10/4

        • We experienced a power incident on 10/4, when a surge affected a main fuse. This caused the data center to lose cooling, lighting, and networking, although the servers and worker nodes remained online under full load. We arrived on-site and performed a shutdown of all nodes and the UPS to minimize equipment risk as ambient temperature rose rapidly. 

        • We coordinated with Facilities Management to restore power, then executed the appropriate checks and a phased restart. The UPS and racks were brought back online, and a full storage integrity review found no hardware damage, including no failed drives.

        • We are implementing additional safeguards and stronger alerting for similar scenarios, have initiated replacement of three end-of-life 1G switches, and have resolved strange behaviors observed on a few storage servers. We are also looking into adding a cellular-based notification system in case the network becomes unavailable, improving recovery scripts, correcting UPS labeling, and implementing automated server shutdowns at defined temperature thresholds (see the sketch after this item).

        • Overall, we are treating this as a learning opportunity to strengthen our processes and address issues discovered. 
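
        The automated temperature-based shutdown mentioned above could look roughly like the sketch below; the IPMI sensor name, threshold, and use of 'ipmitool' are assumptions for illustration, not the actual SWT2_CPB implementation.

```python
#!/usr/bin/env python3
"""Shut a server down if its inlet temperature passes a threshold.

Rough sketch of the automated-shutdown idea; the sensor name, threshold,
and polling interval are placeholders, not SWT2_CPB's actual values.
"""
import re
import subprocess
import time

SENSOR = "Inlet Temp"   # placeholder IPMI sensor name
SHUTDOWN_AT_C = 40.0    # placeholder threshold in degrees Celsius
POLL_SECONDS = 60

def read_temperature(sensor: str):
    # 'ipmitool sdr type temperature' prints lines like:
    #   Inlet Temp | 04h | ok | 7.1 | 25 degrees C
    out = subprocess.run(["ipmitool", "sdr", "type", "temperature"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith(sensor):
            match = re.search(r"(\d+(?:\.\d+)?) degrees C", line)
            if match:
                return float(match.group(1))
    return None

if __name__ == "__main__":
    while True:
        temp = read_temperature(SENSOR)
        if temp is not None and temp >= SHUTDOWN_AT_C:
            print(f"CRITICAL: {SENSOR} at {temp} C, shutting down")
            subprocess.run(["shutdown", "-h", "+1"])  # one-minute warning
            break
        time.sleep(POLL_SECONDS)
```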

      • Data Migration and New Storage

        • We continue to move data from additional MD3460 storage arrays and are using retired storage for testing. To accelerate progress, we have scaled up by running multiple threads across two storage systems at a time for now.

        • We have essentially completed migration of the fourth MD3460, with a small set of files still undergoing checks due to the October 4 incident.

      • EL9 migration

        • We have had limited time to make progress on this work and are deferring changes to additional head nodes while data migration is in progress.

        • The remaining EL9 migration depends on completing data migration so we can safely upgrade existing EL7 storage to EL9 and deploy new EL9 storage while retiring legacy EL7 systems.

      • Perfsonar

        • We are renewing the certificate on one of our perfSONAR machines ahead of its upcoming expiration. Unfortunately, psuta01 is again experiencing issues. 

        • We are working to resolve this. Although the node has been rebuilt multiple times, we suspect a configuration-related problem: tests run for a period of time and then stop producing results. We will continue investigating until it is resolved.

        • Shawn is helping us with this, and we will try the suggested fix. 
      • GGUS-Ticket-ID: #1000845

        • We experienced transfer issues from 10/11 to 10/13. We contacted campus networking on 10/13 to investigate the underlying cause. They performed some tests and are checking logs for more information.

        • We also experienced issues with certain services at UTA, making us suspect there may have been a networking issue. 

        • So far, the issues appear to have stopped as of the morning of 10/13.

      • Error Publishing Storage Usage Data

        • After the power incident on 10/4, our system for publishing the used storage size was malfunctioning until 10/13. From 10/6 to 10/13 the storage usage data was not updated, so Rucio did not stop writing data to our RSE. Over the weekend the total size of Rucio data was about 0.7 PB over our limit/quota, which is a possible cause of the DDM and job errors during the weekend.

        • The publish script is fixed now, and we will add WARNING alerts to it (see the sketch after this list).

        • We also contacted DDM Ops for more information that may help us understand what happened.
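
        As a rough outline of the publish-plus-WARNING pattern described above (not the actual SWT2_CPB script), the sketch below measures used/free space and logs a WARNING with a non-zero exit if publishing fails; the mount point and report path are placeholders.

```python
#!/usr/bin/env python3
"""Publish used/free storage size and emit a WARNING if publishing fails.

Illustrative outline only: the mount point, report path, and publish step
are placeholders, not the actual SWT2_CPB setup.
"""
import json
import logging
import os
import sys
import time

STORAGE_MOUNT = "/xrootd"               # placeholder mount point
REPORT_PATH = "/tmp/storage_usage.json" # placeholder; the real script feeds DDM/Rucio

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def measure_usage(path):
    # statvfs gives filesystem capacity and free space in bytes.
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return total - free, free

def publish_usage(used, free):
    # Placeholder publish step: write a small JSON report. The real script
    # would publish wherever Rucio/DDM reads the RSE usage from.
    with open(REPORT_PATH, "w") as f:
        json.dump({"used": used, "free": free, "timestamp": int(time.time())}, f)

if __name__ == "__main__":
    try:
        used, free = measure_usage(STORAGE_MOUNT)
        publish_usage(used, free)
        logging.info("published used=%d free=%d bytes", used, free)
    except Exception as exc:
        # The WARNING alert mentioned in the minutes: never fail silently.
        logging.warning("storage usage NOT published: %s", exc)
        sys.exit(1)   # non-zero exit so cron/monitoring can alert
```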


      OU:

      • Still see occasional compute nodes fall over because of high-memory jobs, but overall running well.
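
      For reference, a quick way to confirm that a node fell over to high-memory jobs is to scan the kernel log for OOM-killer events; the sketch below is only an illustration and assumes systemd's journalctl is available on the node.

```python
#!/usr/bin/env python3
"""Count recent OOM-killer events on a worker node.

Minimal sketch assuming systemd's journalctl; it reports kernel
'Out of memory' lines from the last day.
"""
import subprocess

def recent_oom_events(since: str = "1 day ago"):
    # '-k' restricts to kernel messages; OOM kills log "Out of memory: Killed process ..."
    out = subprocess.run(
        ["journalctl", "-k", "--since", since, "--no-pager"],
        capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if "Out of memory" in line]

if __name__ == "__main__":
    events = recent_oom_events()
    print(f"{len(events)} OOM event(s) in the last day")
    for line in events[-5:]:   # show the most recent few
        print("  ", line)
```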