US ATLAS Tier 2 Technical

Name: US ATLAS Tier 2 Technical
Start: 2025-04-30T10:00:00-04:00
End: 2025-04-30T11:00:00-04:00
Location: No location set

Wednesday 30 Apr 2025, 10:00 → 11:00 US/Eastern

Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), Robert William Gardner Jr (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))

Description

Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.

Fred Luehring

luehring@iu.edu

+1 812 855 1025

- 10:00 → 10:10
  
  Introduction¶ 10m
  
  Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
  
  Rafael_USTier2Meeting_04302025.pdf
- 10:10 → 10:20
  TW-FTT¶ 10m
  
  Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
  Network bandwidth between TW-FTT and ESNet/LHCONE will be upgraded to 5Gbps from 1 May 2025.
  
  Plan to finish the migration of Condor-CE to AlmaLinux9 (from ARC-CE) in May and then the rest 1,500 job slots will be online afterwards.
- 10:20 → 10:30
  
  AGLT2¶ 10m
  
  Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
  
  Stable running.
  
  Working on procurement and operation plan.
  
  EL9 at MSU:
  Still present issue with Satellite occasionally corrupting some node definitions entries that can't be recovered but recreated.
  Builds were BIOS PXE so far. Making progress on UEFI http boot.
  
  On 4/28, one of the cyberpower PDUs died in Rack 1,
  we powered off 12 work nodes to protect the other PDU in the rack from being tripped over.
  RMA should be issued this week
  
  added new milestone 491: We migrated from ElasticSearch (ES) to OpenSearch (OS), but currently we do not have a way to ingest new data into OS, the goal is to restore the data ingestion to OS from various places, like rsyslog, htcondor, all kinds of beats, and eventually create useful dashboards OS for good visibility.
  
  progress: After 2 weeks set the milestone, we finally got the syslog / filebeat (for squid and boinc logs)
  ingested to logstash/opensearch , and created dashboards
- 10:30 → 10:40
  
  MWT2¶ 10m
  
  Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US)), lincoln bryant
  
  Updated condor to 24.0.7-1 to apply the cgroups patch
  
  Drained the site to remount CVMFS following the update to 2.12.7
- 10:40 → 10:50
  
  NET2¶ 10m
  
  Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
  
  Unexpected downtime over the weekend due to power failure at the data center. Cluster initially failed to recover correctly due to two problematic servers, didn't refill until this was fixed.
  
  Ongoing problem due to the pilot seeing all the disk space on nodes, not just the disk that is available for jobs. On C6320 servers this can lead to the storage space being overcommitted, causing all jobs on the server to fail. A ticket was opened about a couple of tasks but the issue is not related to those tasks but is a more general one. We expect to upgrade the cluster to a new version of OKD fairly soon which should resolve the underlying issue.
  
  BGP tagging of LHCone prefixes should be in place now.
- 10:50 → 11:00
  SWT2¶ 10m
  
  Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
  SWT2_CPB:
  
  Operations
  
  Experienced some errors with long running jobs hitting max walltime limit, but this has essentially stopped now.
  
  We experienced a dip in running jobs on 4/24. We did not make any changes and only investigated. It appeared to resolve itself on 4/25. We did not see any clear issues locally when reviewing logs.
  
  Other than this, we have been running well and staying full the past two weeks.
  
  Most of our jobs have returned to being majority production jobs after the transition from SWT2_CPB_TEST to SWT2_CPB. We had majority analysis jobs with very little production jobs for a moment.
  
  We had several worker nodes that got removed for various reasons in Slurm, but this may have been due to the job mix. We investigated these, rebuilt them, and added them back into service. So far, we have not seen any more issues with these nodes.
  
  Noticed low transfer efficiency as destination site from Spain (ES) on 4/28. Requested assistance/info from DPA.
  
  Renewed certificate that is going to expire soon for an XRootD Proxy server. Will be implementing these soon.
  
  EL9 Migration
  
  Test cluster is complete. We are waiting for assistance to add our new test CE and XRootD proxy server to CRIC properly, so we can see if test jobs complete successfully with our new EL9 modules.
  
  We got hostname and IP assigned in DNS with campus networking.
  
  Received new certificate for test XRootD Proxy and CE.
  
  All required services and appliances are running.
  
  We want to test these modules and improve them further before implementing them in the production cluster.
  
  Storage module is complete. It requires testing.
  
  The XRootD Proxy module is complete, but continually improving. It requires testing.
  
  New Storage
  
  We physically installed all new storage servers.
  
  We found a way to avoid purchasing new rails or racks to address long storage rails issues. We slightly modified some of our racks to enable us space to install new storage without problems.
  
  Configured RAID and iDRAC on all new storage.
  
  Once tests in the newly finished test cluster show all is fine with storage modules, we will begin next steps in deploying new storage.
  
  Monitoring
  
  We are continually developing new internal monitoring to better troubleshoot our CE and Slurm.
  
  We are currently testing out different tools and having internal discussions on this.
  
  GGUS Ticket - Enable Network Monitoring
  
  Followed up again with campus networking concerning this ticket recently. We are waiting for a response. This includes a reminder of our request for this ticket and references to details of our last meeting.
  
  GGUS Ticket - GoeGrid Transfer Failures
  
  ESNET’s network experts have begun working on the connectivity problem between SWT2 and GOEGRID. We have repeated the tests from SWT2’s side and requested GOEGRID to perform the same tests. We first requested these tests via email, then via ticket.
  
  OU:
  
  Mostly running well
  
  Still working on SLURM network issue which occasionally drops nodes; admins believe they are close

Choose timezone

US ATLAS Tier 2 Technical