US ATLAS Tier 2 Technical

US/Eastern
Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))
Description

Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.

Zoom Meeting ID
67453565657
Host
Fred Luehring
    • 10:00–10:10
      Introduction 10m
      Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
      • Very good running over the past two weeks!
        • AGLT2 had transfer failures on the S3 endpoint but that seems fixed.
        • AGLT2 had some trouble with cvmfs. It is still not understood why AGLT2 has more trouble with cvmfs than the other sites.
        • NET2 was blacklisted last night. For some reason two of the four HammerCloud tests have not been submitted for long periods.
        • OU had failures caused by large memory jobs.
        • Taiwan has had some user and group job failures. Production jobs look good.
      • NET2 is hoping to upgrade OKD next week.
      • CPB is still migrating data to their new storage.
      • The release notes for pilot 3.10.5.57 did contain a note saying that the sub-cgroup memory limits are not currently being set, in addition to stating that the limit mechanism functions correctly.
        • Fred Luehring missed that note when he incorrectly said that the pilot was already enforcing memory limits.
        • Paul Nilsson did verify that the pilot memory limits function correctly at sites running HTCondor 24.0.7 or newer, but decided to wait until all the experts are back from vacation in September before enabling the actual limits.
        • The bottom line is that the sub-cgroup limits would function as expected (killing the payload and not the pilot) if a limit were set, but currently no limit is set. A minimal sketch of the mechanism follows at the end of this list.
      • We are still waiting for the final outcome of the scrubbing before moving ahead with new procurement.
      • The US capacity mini-challenges are being held this week.
      • Rafael teaches from 10 am to 11 am EDT so we will move the meeting to 11:00 am.
        • For this semester we will have to hold separate meetings with TW-FTT at a time that works better for them.
        • Eric Yen says he will attend at the new time because even though it is late, he finds attending the full meeting to be informative.
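      A minimal sketch of the sub-cgroup mechanism described above, using cgroup v2 directly. This is illustration only, not the actual pilot code: it assumes the pilot already runs inside a delegated cgroup v2 directory that it may write to and that the memory controller can be enabled for its children; the payload command, paths, and the 1.4x factor are placeholder assumptions.

      #!/usr/bin/env python3
      """Sketch: cap the payload in a child cgroup so an OOM kill spares the pilot."""
      import os
      import subprocess

      CGROUP_ROOT = "/sys/fs/cgroup"

      def run_payload_with_limit(payload_cmd, mem_limit_bytes):
          # Find the cgroup this (pilot) process currently lives in.
          with open("/proc/self/cgroup") as f:
              parent = CGROUP_ROOT + f.read().strip().split("::")[-1]

          pilot_cg = os.path.join(parent, "pilot")
          payload_cg = os.path.join(parent, "payload")
          os.makedirs(pilot_cg, exist_ok=True)
          os.makedirs(payload_cg, exist_ok=True)

          # cgroup v2 does not allow a cgroup to both hold processes and delegate
          # controllers to children, so move the pilot into a leaf cgroup first.
          with open(os.path.join(pilot_cg, "cgroup.procs"), "w") as f:
              f.write(str(os.getpid()))
          with open(os.path.join(parent, "cgroup.subtree_control"), "w") as f:
              f.write("+memory")

          # Cap only the payload cgroup; if the kernel OOM-kills anything it will
          # be a payload process, and the pilot survives to report the failure.
          with open(os.path.join(payload_cg, "memory.max"), "w") as f:
              f.write(str(mem_limit_bytes))

          proc = subprocess.Popen(payload_cmd)
          with open(os.path.join(payload_cg, "cgroup.procs"), "w") as f:
              f.write(str(proc.pid))
          return proc.wait()

      if __name__ == "__main__":
          # Placeholder payload and numbers: a 2 GiB request with a 1.4x allowance.
          run_payload_with_limit(["./run_payload.sh"], int(1.4 * 2 * 1024**3))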
    • 10:10–10:20
      TW-FTT 10m
      Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
    • 10:20–10:30
      AGLT2 10m
      Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

      - CVMFS:

        CVMFS issue at MSU: all nodes had cvmfs problems, caused by the NRP Varnish server being down for a few hours (a cause but not a full explanation?), and we could not recover those nodes by any method.
        Eventually updated cvmfs from 2.13.1 to 2.13.2, which fixed the issue.

        Set up multiple cvmfs proxy testing groups to help understand and verify the stability of Varnish vs. Squid.

      - Condor: 

        Adjusted the cgroup setup in HTCondor to allow jobs to exceed their requested memory by 40%, which significantly reduced the number of killed jobs.

      - Tickets:

      S3 endpoint: added memory and CPUs to the VM and reduced the transfer concurrency.

      Squid monitoring: one former SLATE node was transferred to NRP; it now needs to be removed from CRIC so that it also drops out of the monitoring.


      - Mini data challenge on Tuesday:

      Found that the JSON monitoring file for MSU was not showing the proper traffic.
      The Python script looks up interfaces by SNMP index (not by interface name).
      One data center core switch had to be warranty-swapped, which changed the SNMP indexing.
      Also noticed that all traffic seems to be going through only one of the 2 data center core switches (2x100 instead of 2x2x100).
      More to be understood.
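      A minimal sketch of the index-by-name idea: resolve the ifIndex from IF-MIB::ifName at query time rather than hard-coding numeric indices that can change after a hardware swap. It assumes the Net-SNMP command-line tools and SNMPv2c read access; the switch hostname, community string, and interface name are placeholders, not the actual monitoring script.

      #!/usr/bin/env python3
      """Sketch: query switch counters by interface name instead of fixed ifIndex."""
      import re
      import subprocess

      def ifname_to_index(host, community):
          """Walk IF-MIB::ifName and return {interface name: ifIndex}."""
          out = subprocess.run(
              ["snmpwalk", "-v2c", "-c", community, "-Oq", host, "IF-MIB::ifName"],
              capture_output=True, text=True, check=True,
          ).stdout
          mapping = {}
          for line in out.splitlines():
              # Lines look like: IF-MIB::ifName.436207616 Ethernet1/1
              m = re.match(r"IF-MIB::ifName\.(\d+)\s+(\S+)", line)
              if m:
                  mapping[m.group(2)] = m.group(1)
          return mapping

      def in_octets(host, community, index):
          """Read the 64-bit input byte counter for one ifIndex."""
          out = subprocess.run(
              ["snmpget", "-v2c", "-c", community, "-Oqv", host,
               f"IF-MIB::ifHCInOctets.{index}"],
              capture_output=True, text=True, check=True,
          ).stdout
          return int(out.strip())

      if __name__ == "__main__":
          # Placeholder switch, community, and interface name.
          indices = ifname_to_index("core-switch.example.net", "public")
          # Name-based lookup keeps working even if a hardware swap renumbers ifIndex.
          print(in_octets("core-switch.example.net", "public", indices["Ethernet1/1"]))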

    • 10:30–10:40
      MWT2 10m
      Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • Reconfiguring the elasticsearch disks from JBOD to RAID because the JBOD option is deprecated
      • Added two new xrootd doors to dCache to help with the transfer load. Two more are ready to add in case they are needed
      • Finished another round of testing and gathering power consumption data for compute nodes
      • Adjusted cgroups memory limit on compute to 1.4x
      • Updated cvmfs to 2.13.2

       

      Note: While we were working with Dell on disk replacements for storage, they informed us that the following Seagate disk model is known to have issues; the latest firmware fixes them.

      Model: Seagate ST24000NM002H-3K

      Working firmware version: SUA5
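      A minimal sketch of how drives on the affected model with old firmware could be flagged using smartmontools. It assumes root access and that the disks appear as plain /dev/sd? devices (discovery behind a RAID controller is different); the model and firmware strings come from the note above.

      #!/usr/bin/env python3
      """Sketch: report ST24000NM002H drives that are not yet on firmware SUA5."""
      import glob
      import re
      import subprocess

      MODEL = "ST24000NM002H"
      GOOD_FIRMWARE = "SUA5"

      def drive_info(dev):
          """Return (model, firmware) parsed from 'smartctl -i' (SATA drive output)."""
          out = subprocess.run(["smartctl", "-i", dev],
                               capture_output=True, text=True).stdout
          model = re.search(r"Device Model:\s+(.*)", out)
          fw = re.search(r"Firmware Version:\s+(.*)", out)
          return (model.group(1).strip() if model else "",
                  fw.group(1).strip() if fw else "")

      if __name__ == "__main__":
          for dev in sorted(glob.glob("/dev/sd?")):
              model, fw = drive_info(dev)
              if MODEL in model and fw != GOOD_FIRMWARE:
                  print(f"{dev}: {model} firmware {fw} (expected {GOOD_FIRMWARE})")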

    • 10:40–10:50
      NET2 10m
      Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

      Briefly blacklisted on 8/13 due to jobs on two servers failing to resolve the Varnish host name. Restarted the hosts in question and haven't seen this error since. Nobody seems to have seen it before, so for now we are just keeping an eye out in case it comes back.
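      Since there is no root cause yet, a simple periodic resolution probe on the worker nodes could at least catch a recurrence early. A minimal sketch, with a placeholder host name and port (not existing NET2 tooling):

      #!/usr/bin/env python3
      """Sketch: alert when a node cannot resolve the Varnish host name."""
      import socket
      import sys

      VARNISH_HOST = "varnish.example.edu"   # placeholder for the site's Varnish alias

      def can_resolve(host):
          try:
              socket.getaddrinfo(host, 6081)  # 6081: a common Varnish port, assumed
              return True
          except socket.gaierror as err:
              print(f"cannot resolve {host}: {err}", file=sys.stderr)
              return False

      if __name__ == "__main__":
          sys.exit(0 if can_resolve(VARNISH_HOST) else 1)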

      Also blacklisted on 8/12 due to a network instability affecting the connection between servers in the campus data center and those in the main data center.  (These are different servers than those that had the Varnish issue.)  UMass IT fixed the issue pretty quickly and it hasn't recurred.

      Finally, blacklisted on 8/6 due to SCRATCHDISK becoming unavailable thanks to extremely high load on a dCache pool, caused by a user trying to stage a file on that pool for all of his thousands of jobs. With LRU as the load-balancing policy, cost-based pool-to-pool migration is tricky.

      Also, saw an unusual XCache problem, where xrootd crashed inside the XCache: as a result, caching wasn't working, but jobs continued to be submitted to the VP queue because the XCache itself was up.
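      A minimal sketch of a functional probe that would catch this failure mode by actually reading a file through the cache, rather than only checking that the xrootd process is alive. The cache endpoint, test file path, and timeout are placeholder assumptions, and it assumes the xrdcp client is installed on the probe host.

      #!/usr/bin/env python3
      """Sketch: declare the XCache healthy only if a read through it succeeds."""
      import subprocess
      import sys

      CACHE = "root://xcache.example.edu:1094"    # placeholder cache endpoint
      TEST_FILE = "//atlas/test/healthcheck.root"  # placeholder origin path

      def cache_is_serving(timeout=60):
          """Return True only if a real read through the cache succeeds."""
          cmd = ["xrdcp", "--force", f"{CACHE}{TEST_FILE}", "/dev/null"]
          try:
              result = subprocess.run(cmd, timeout=timeout,
                                      capture_output=True, text=True)
          except subprocess.TimeoutExpired:
              return False
          return result.returncode == 0

      if __name__ == "__main__":
          sys.exit(0 if cache_is_serving() else 1)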

      OKD upgrade preparation is almost done. Tests on bare metal are done and we are aiming to schedule a downtime for next Thursday.

    • 10:50–11:00
      SWT2 10m
      Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))

      SWT2_CPB:

      • EL9 Migration

        • We have started testing new modules for the EL9 migration without affecting the current behavior of the test cluster. 

        • We will not make changes to our test cluster’s functionality until testing of the new migration script is complete. 

        • Modules for different types of nodes are complete but not yet tested.

      • New Storage Deployment

        • Because files were lost with the original migration script, we have been redesigning it with additional safeguards and mechanisms to increase safety and robustness.

        • We have run multiple tests in the test cluster that simulate production. 

        • We created additional scripts to support our main migration script and perform post-migration checks. 

        • Once development of the new migration script is complete and testing results are good, we will resume migrating data from old storage to new.  

      • GGUS-Ticket-ID: #683657: Varnish 

        • After brief tests with the new Varnish server as primary, we saw good results and decided to keep it as primary. Our Squid is set at second priority in the CRIC settings as a failover.

        • With Ilija’s help, we solved a routing issue to Varnish at our site and have been monitoring access to our failover Squid.

        • The number of accesses to our Squid has dropped to 5%. We are investigating why the Squid is still being accessed at this level.

        • We discovered a bug in Frontier where the returned XML was malformed.

        • Based on our experience migrating to Varnish, ATLAS has decided to recommend that Squid servers be kept as backups until December 2025.

      • Dark Data

        • We are waiting for a response from DDM Ops regarding dark data at our SCRATCHDISK. 10 TB was deleted 2 weeks ago and 20 TB still has not been deleted. 

        • We found that the last scheduled dump on 7/26 failed because one of the new (EL9) storage nodes did not have the necessary dump script. We fixed this problem and started the dumps again, but the dump then ran into a timeout error, caused by 2 storage nodes taking longer than expected to generate their dump files. We changed the time limit to get around this issue for now, but will investigate these storage nodes in case there is some underlying issue. The last inventory dump was created on 8/13.
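        A minimal sketch of a per-node dump wrapper with an explicit time limit, so a slow node is reported rather than silently timing the whole dump out. The node names, dump command, and the 4-hour limit are placeholder assumptions, not the actual SWT2 tooling.

        #!/usr/bin/env python3
        """Sketch: run per-node inventory dumps with a configurable time limit."""
        import subprocess

        STORAGE_NODES = ["xrd-el9-01.example.org", "xrd-el9-02.example.org"]  # placeholders
        TIME_LIMIT_S = 4 * 3600  # example value; raise for nodes known to be slow

        def dump_node(node):
            """Run the dump on one node over ssh; report a timeout instead of hanging."""
            try:
                subprocess.run(["ssh", node, "/usr/local/bin/make_dump.sh"],
                               timeout=TIME_LIMIT_S, check=True)
                return True
            except subprocess.TimeoutExpired:
                print(f"{node}: dump exceeded {TIME_LIMIT_S}s; investigate this node")
                return False
            except subprocess.CalledProcessError as err:
                print(f"{node}: dump failed with exit code {err.returncode}")
                return False

        if __name__ == "__main__":
            # Run every node (no short-circuit) and exit non-zero if any dump failed.
            if not all([dump_node(n) for n in STORAGE_NODES]):
                raise SystemExit(1)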

      OU:

      • Nothing to report, site running well