US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 site. The primary audience is the US Tier 2 site administrators but anyone interested is welcome to attend.
11:00 → 11:10 Introduction (10m)
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
News:
- Tentative next procurement meeting is next Friday (2/27). An email to the Tier-2 managers has been sent. We will try to avoid clashing with the US ATLAS management meeting like last time.
- dCache version 11.2 only includes fireflies for XRootD transfers. dCache version 11.2.1 is supposed to have fireflies for all protocols; it also adds tape metadata capability.
- If your site is testing it and has results, please share them during the round table (and enter them in the minutes).
- Short mini-capacity challenge happening today between PRG-NET2 [notes]
- Keep the capacity and services spreadsheets updated. Keep CRIC and OSG topology updated when servers are added or retired.
Certificate issues:
- The transfer problems between SWT2 and IFIC/TECHNION were due to X.509 certificates being presented with the full chain.
- StoRM systems only accept leaf certificates, which we agreed is not the correct behavior.
- In practice, WLCG and/or ATLAS have no way to force sites to accept full-chain certificates.
- A workaround has been applied at SWT2 (see more details in Zach's minutes below) by using a leaf certificate on some servers. Apparently other sites do the same.
- This is not the correct procedure, and it will cause problems in the future with systems that only accept full-chain certificates (as was the case for Google, for instance).
- Horst will open a ticket with OSG to understand what kind of certificate is being used.
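A quick way to tell a leaf-only certificate file from a full chain is to count the PEM certificate blocks: a leaf-only file has one, a full chain has two or more (leaf plus intermediate CA(s)). A minimal sketch; the PEM contents below are placeholders, not real certificates:

```python
# Count PEM certificate blocks to distinguish a leaf-only file
# from a full-chain file (leaf + intermediate CA certificates).
def count_cert_blocks(pem_text: str) -> int:
    return pem_text.count("-----BEGIN CERTIFICATE-----")

# Placeholder contents standing in for a host certificate file:
leaf_only = "-----BEGIN CERTIFICATE-----\n...leaf...\n-----END CERTIFICATE-----\n"
full_chain = leaf_only + "-----BEGIN CERTIFICATE-----\n...intermediate CA...\n-----END CERTIFICATE-----\n"

print(count_cert_blocks(leaf_only))   # → 1
print(count_cert_blocks(full_chain))  # → 2
```

On a real host the same check is simply `grep -c 'BEGIN CERTIFICATE'` on the certificate file the server is configured to present.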
Upcoming meetings:
Just finished: ATLAS S&C meeting [Feb 9-13 at CERN]
Upcoming:
- LHCOPN-LHCONE meeting #56 [Apr 15-16 in Montreal]
- HEPiX Spring 2026 Workshop [Apr 20-24 in Lisbon]
- dCache workshop [May 6-7 at Nikhef]
- CHEP 2026 [May 23-29 in Bangkok, Thailand]
Open tickets:
- ggus:1001568 SWT2/OU: xrootd version higher than 5.7.0 needed
- ggus:3559 SWT2/OU: Dual-stack [on hold]
- ggus:1001382 TW-FTT: failing transfers as SOURCE due to certificate issue
Operations:
- AGLT2
- MWT2
- NET2
- SWT2/CPB
- SWT2/OU
11:10 → 11:20 TW-FTT (10m)
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW)), Yi-Ru Chen (Academia Sinica (TW))
11:20 → 11:30 AGLT2 (10m)
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- 05-Feb: Upgraded to dCache 11.2.0 (from 10.2.18)
  - First golden release to support flow marking
  - But bug/omission: we see marking for XRootD but not WebDAV transfers
  - 11.2.1 coming soon; will upgrade when available
- 05-Feb: repeated mini-challenge MWT2<>AGLT2 (plots)
  - MWT2->AGLT2 very smooth, with 90/95 Gbps to UM/MSU
  - AGLT2->MWT2 somewhat puzzling
    - good: verified the dCache config bug is fixed: doors are redirecting on read
    - good: reads from UM saturated at the 80G bottleneck
    - odd?: reads from MSU "choppy", 10-90G
  - Will re-test with separate reads from just MSU and just UM
- Noticed cvmfs file /cvmfs/sft.cern.ch/lcg/lastUpdate stopped updating after Friday 06-Feb
  - Found no issue on our side; same at MWT2 and on lxplus
  - Contacted Dave & Valentin
  - Quickly verified it was not a cvmfs issue, but just that file; started a JIRA ticket
  - CCed Andre, who recognized the problem came from an omission while moving gitlab repositories
  - For reference, some useful commands to debug repo updating issues:
    - To check health and updates of the local repo:
      cvmfs_config status sft.cern.ch
    - To check the current revision time stamp on stratum 1:
      curl http://cvmfs-stratum-one.cern.ch:8000/cvmfs/sft.cern.ch/.cvmfspublished -s -o - | grep -a ^T
    - And/or to check the current revision number:
      https://cvmfs-monitor-frontend.web.cern.ch/sft.cern.ch
- Requested correction for the January A/R ticket
  - Reminder: we fixed a condor config bug that kept test jobs idle when too many merge jobs were present
  - AGLT2 should now have more consistent/correct reporting
- After a successful 2-day test of removing IPv4 from LHCONE, planning on removing it permanently
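The stratum-1 freshness check via `curl` can also be scripted. A minimal sketch of parsing the `.cvmfspublished` manifest, assuming the usual one-letter-key-per-line layout (`N` name, `S` revision, `T` unix timestamp of last publish, entries ending at the `--` signature marker); the sample manifest below is made up:

```python
import datetime

def parse_cvmfspublished(text: str) -> dict:
    """Parse a CernVM-FS .cvmfspublished manifest: each line is a
    one-letter key followed by its value; fields end at the '--' marker."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("--"):   # signature section follows; stop here
            break
        if line:
            fields[line[0]] = line[1:]
    return fields

# Made-up sample manifest for illustration:
sample = "Nsft.cern.ch\nS12345\nT1770000000\n--\n(signature)"
m = parse_cvmfspublished(sample)
print(m["N"], m["S"])  # repository name and revision
print(datetime.datetime.fromtimestamp(int(m["T"]), datetime.timezone.utc))
```

Fed with the output of the `curl` command above, this gives the last-publish time and revision without grepping by hand.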
11:30 → 11:40 MWT2 (10m)
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Discussing the procurement plan: what to purchase, estimating retirements
- Fred to work on quotes. The UC team will be setting up a meeting with the Dell rep.
- Discussing an elasticsearch analytics refresh in the plan as well
- IU network team updated the VRF for Technion to fix asymmetric routes between IU and Technion. Monitoring.
- Reran the mini capacity challenge on Feb 5, after the Feb 3 network and job saturation
- Waiting on dCache 11.2.1 for fireflies for WebDAV
11:40 → 11:50 NET2 (10m)
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Some issues over the last two weeks related to tape. Last week the dCache pool facing the tape froze, leading to failures as transfers accumulated. The previous week a couple of pools were not delivering enough throughput; the resulting errors filled the dCache head node's disk with logs, causing it to crash, and returning to normal operations took some effort.
The IPv4-off test on LHCONE went without problems.
11:50 → 12:00 SWT2 (10m)
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
- We rebuilt another R740 storage server from EL7 to EL9 while preserving the existing data.
- After the rebuild, we verified the data and it appears to have been preserved. We have a temporary backup of the data in case of any data loss.
- The server has been returned to production, and we have not observed any issues so far.
- We are currently creating backups of four R740 storage servers in preparation for migrating them from EL7 to EL9.
- On 2/12/2026, we rebuilt and upgraded one of our four XRootD proxy servers with improved hardware to increase performance.
- We have not observed any related transfer issues so far.
- We understand the cause of the transfer problems affecting IFIC-LCG2 and TECHNION-HEP and have closed GGUS-Ticket-ID: #1001633.
- We replaced the full-chain certificates on two of our four XRootD proxy servers with leaf-only certificates and have been monitoring.
OU:
- Site running well
- Some storage overload and high-memory jobs during the last week, but only sporadically.