US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
1. Introduction
Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
2. TW-FTT
Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
3. AGLT2
Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
Site blacklisted on Saturday 12-Jul
Both DATADISK and SCRATCHDISK filled with dark data.
Post-mortem understanding:
dCache space reporting or file deletion was not working.
On 11-Jul Rucio/DDM saw space usage exceed the threshold and started deleting cached files without seeing the reported usage decrease.
That deleted-but-not-reported space became ~2.5 PB of "dark data" (see the sketch after this block).
Resolution (Mon 14-Jul):
Restarted dCache and added 0.5 PB of space.
The dark data disappeared shortly afterwards.
Cause:
Suspect that a dCache service had not been working properly since the return from the 8-Jul downtime for street water works.
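As a rough illustration of how such a mismatch can be spotted, a minimal sketch (assuming a configured Rucio client environment; the RSE name and warning threshold are placeholders) that compares the usage Rucio has recorded from its own accounting against the number reported by the storage:

```python
# Minimal sketch: compare the space usage Rucio has recorded from different
# sources for an RSE. A large gap between the 'rucio' accounting and the
# 'storage' report is the kind of mismatch that showed up here as dark data.
from rucio.client import Client

RSE = "AGLT2_DATADISK"          # hypothetical RSE name for illustration
WARN_BYTES = 100 * 10**12       # flag discrepancies larger than 100 TB

def check_rse_usage(rse: str) -> None:
    client = Client()
    usage = {entry["source"]: entry["used"] for entry in client.get_rse_usage(rse)}
    rucio_used = usage.get("rucio")
    storage_used = usage.get("storage")
    if rucio_used is None or storage_used is None:
        print(f"{rse}: missing usage source(s): {sorted(usage)}")
        return
    gap = storage_used - rucio_used
    print(f"{rse}: storage={storage_used/1e12:.1f} TB "
          f"rucio={rucio_used/1e12:.1f} TB gap={gap/1e12:.1f} TB")
    if abs(gap) > WARN_BYTES:
        print("WARNING: storage report and Rucio accounting disagree; "
              "check dCache space reporting/deletion.")

if __name__ == "__main__":
    check_rse_usage(RSE)
```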
Issue with A/R tests on gate01 started 17-Jul and was resolved 18-Jul.
We first suspected an issue with updating certificates, but that was not the root cause.
Updated software and rebooted. Recent ETF jobs were in state I (idle) in condor-ce rather than R (running) or C (completed).
Found that the condor-ce service had restarted at 16:49 on 16-Jul and that the condor service was not running.
Enabling and starting the condor service let the ETF jobs reach the running state and resolved the A/R issue.
The cause was not clear, but Ansible had started a run right before that time which did not seem to complete.
A/R tests were failing but the site was otherwise working; we will need to request a correction. A sketch of a simple health check for this failure mode follows.
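A minimal sketch of such a check, assuming the usual condor-ce and condor systemd unit names; it flags the state seen here, where condor-ce is up but condor is down and incoming ETF jobs sit idle:

```python
# Minimal sketch, not the production monitoring: verify that both the
# condor-ce and condor systemd services are active, since a restarted
# condor-ce with a stopped condor leaves incoming ETF jobs stuck in Idle.
import subprocess
import sys

SERVICES = ["condor-ce", "condor"]   # assumed systemd unit names

def service_active(name: str) -> bool:
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    result = subprocess.run(["systemctl", "is-active", "--quiet", name])
    return result.returncode == 0

def main() -> int:
    status = {name: service_active(name) for name in SERVICES}
    for name, ok in status.items():
        print(f"{name}: {'active' if ok else 'NOT active'}")
    if status.get("condor-ce") and not status.get("condor"):
        print("WARNING: condor-ce is up but condor is down; "
              "ETF jobs will sit Idle in the CE queue.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```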
EL9 at MSU:
All issues with Red Hat Satellite have been resolved.
All FY24 equipment in production since 14-Jul
Issue with incomplete dCache pool draining (pinned & locked files) resolved:
The remaining files were from the Dec 6-10 database loss problem.
Un-pinning them (rep set sticky -all off) allowed the sweeper to purge them (a scripting sketch follows).
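A rough sketch of how that un-pinning could be scripted through the dCache admin ssh interface; the admin host, port, and pool names are placeholders, and the exact admin-shell syntax may differ by dCache version:

```python
# Rough sketch (not the exact procedure used at AGLT2): issue the quoted
# un-pin command on a list of pools via the dCache admin ssh interface.
# Assumes key-based login; host, port, and pool names are placeholders.
import subprocess

ADMIN_HOST = "dcache-admin.example.edu"   # hypothetical admin node
ADMIN_PORT = "22224"                      # common dCache admin ssh port
POOLS = ["pool_01", "pool_02"]            # pools being drained

def unpin_pool(pool: str) -> None:
    # Send the command to the pool cell, then quit the admin shell.
    admin_script = f"\\s {pool} rep set sticky -all off\n\\q\n"
    subprocess.run(
        ["ssh", "-p", ADMIN_PORT, f"admin@{ADMIN_HOST}"],
        input=admin_script, text=True, check=True)

if __name__ == "__main__":
    for pool in POOLS:
        print(f"Un-pinning replicas on {pool}")
        unpin_pool(pool)
```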
Status as of today for storage:
78% of nodes / 84% of space already upgraded; should finish today.
Then we will continue migrating data off the EL7 storage that is to be decommissioned.
Status as of today for compute:
69% of nodes / 80% of HEPspecs already upgraded; should essentially finish this week.
One set of 8x R6525s needs to be moved and re-addressed to work around a local airflow problem.
Then the remaining old EL7 WNs will be decommissioned.
4. MWT2
Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- UIUC PM on 7/17
- IU UPS replacement on 7/16
- Firewall hardening on IU nodes
- Moved 200TB from SCRATCHDISK to DATADISK per GGUS ticket #1000096
- Updating Elasticsearch cluster to 9.0 today (7/23)
- Pilot use of cgroups to enforce memory limits (a minimal mechanism sketch follows this list):
- Paul Nilsson has code working that allows cgroups memory limits to kill the payload without killing the pilot.
- This code only works with very recent versions of condor: 24.0.7 or higher.
- He has been testing release candidates using something akin to HammerCloud to send about 8 jobs to MWT2 with the new pilot versions.
- These tests are successful in the sense that the pilot runs without errors, but the test jobs are well behaved and do not exceed the memory limits, so the kill path is not actually exercised.
- I (Fred Luehring) have asked Paul to test that the kill-the-payload-without-killing-the-pilot functionality actually works.
- Last night I suggested using the derivation jobs that are badly leaking memory to test the code at MWT2.
- Previously I had suggested switching MWT2 over to the new pilot for a large-scale test, but that was before we had a perfect request for testing the cgroups memory limits.
- Doing tests at MWT2 requires sign off from the whole MWT2 team.
- The JIRA ticket tracking the cgroups work: https://its.cern.ch/jira/browse/ATLASPANDA-1251
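For reference, the mechanism the pilot work relies on: the payload runs in a child cgroup (v2) with its own memory.max, so the kernel OOM killer removes only the payload while the parent pilot process survives. A minimal sketch, not the pilot code itself; the cgroup path and limit are assumptions and a delegated, writable cgroup is required:

```python
# Illustrative cgroup-v2 sketch, not the actual pilot implementation:
# run a payload command inside a child cgroup with a hard memory limit so
# that the OOM killer can kill the payload without touching the parent.
import os
import subprocess

PARENT_CGROUP = "/sys/fs/cgroup/pilot_demo"   # assumed delegated cgroup
PAYLOAD_LIMIT_BYTES = 2 * 1024**3             # e.g. a 2 GiB payload limit

def run_payload_with_limit(cmd: list[str]) -> int:
    payload_cg = os.path.join(PARENT_CGROUP, "payload")
    os.makedirs(payload_cg, exist_ok=True)
    # Hard memory limit for the payload cgroup only.
    with open(os.path.join(payload_cg, "memory.max"), "w") as f:
        f.write(str(PAYLOAD_LIMIT_BYTES))

    def move_into_cgroup():
        # Runs in the child just before exec: move only the payload pid.
        with open(os.path.join(payload_cg, "cgroup.procs"), "w") as f:
            f.write(str(os.getpid()))

    proc = subprocess.Popen(cmd, preexec_fn=move_into_cgroup)
    rc = proc.wait()
    # An OOM kill shows up as termination by SIGKILL (-9); the parent
    # (the "pilot" here) keeps running and can report the failure.
    if rc == -9:
        print("payload exceeded its memory limit and was killed; parent survives")
    return rc

if __name__ == "__main__":
    run_payload_with_limit(["/bin/sh", "-c", "echo payload running"])
```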
5. NET2
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
The issue with corrupted CA files causing problems with RU/DE certificates has been solved.
The VP queue is running, though XCache crashed over the weekend. We got Justas to restart it and are waiting on a new version that runs under supervisord.
Working to understand the reason for the low efficiency of transfers to tape.
Work on the OKD upgrade is ongoing.
6. SWT2
Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
EL9 Migration:
- We are currently focusing on deploying the new storage, migrating old data, and retiring the old storage, but we do have new modules to test once we are done using the test cluster for the new storage deployment. Later, we will work toward migrating additional server types to EL9.
New Storage Deployment:
- We started migrating data from one MD3460 storage array to one of the new R760xd2 storage servers. We are carefully monitoring this, checking data integrity, and improving the process as we go (a checksum-comparison sketch appears after this block).
- All twelve new R760xd2 storage servers are installed with EL9. We are focusing on migrating the first server safely and building understanding before we use the other storage servers.
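The integrity checking can be as simple as comparing a checksum of every file on the old and new servers. A minimal sketch using adler32, with placeholder mount points (the real migration tooling may differ):

```python
# Minimal sketch (paths are placeholders): compare adler32 checksums of
# files under the old and new mount points to verify a migration copy.
import os
import zlib

OLD_ROOT = "/mnt/md3460/atlas"     # hypothetical source mount
NEW_ROOT = "/mnt/r760xd2/atlas"    # hypothetical destination mount

def adler32_of(path: str, chunk: int = 4 * 1024 * 1024) -> int:
    value = 1  # adler32 starting value
    with open(path, "rb") as f:
        while data := f.read(chunk):
            value = zlib.adler32(data, value)
    return value & 0xFFFFFFFF

def verify_tree() -> int:
    mismatches = 0
    for dirpath, _dirnames, filenames in os.walk(OLD_ROOT):
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(NEW_ROOT, os.path.relpath(src, OLD_ROOT))
            if not os.path.exists(dst):
                print(f"MISSING {dst}")
                mismatches += 1
            elif adler32_of(src) != adler32_of(dst):
                print(f"CHECKSUM MISMATCH {src}")
                mismatches += 1
    return mismatches

if __name__ == "__main__":
    print(f"{verify_tree()} problem file(s) found")
```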
GGUS-Ticket-ID: #681994: Enable network monitoring
- We met the requirements and received approval from campus network security for enabling network monitoring on a web server.
- We converted the SNMPv2c script provided by Shawn to use SNMPv3 instead and shared the result with Shawn in case other sites ever need it (a minimal SNMPv3 query sketch follows this list).
- Our network monitoring is now enabled.
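The conversion essentially replaces the SNMPv2c community string with authenticated and encrypted SNMPv3 (USM) credentials. A minimal query sketch, not Shawn's script, assuming the classic synchronous pysnmp hlapi; the host, user, passphrases, and interface index are placeholders:

```python
# Minimal SNMPv3 query sketch (not the actual converted script): read one
# interface counter from a switch using authPriv USM credentials.
from pysnmp.hlapi import (
    SnmpEngine, UsmUserData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
    usmHMACSHAAuthProtocol, usmAesCfb128Protocol,
)

def read_if_counter(host: str, user: str, authkey: str, privkey: str,
                    if_index: int = 1):
    error_indication, error_status, error_index, var_binds = next(getCmd(
        SnmpEngine(),
        # SNMPv3 user with SHA authentication and AES-128 privacy.
        UsmUserData(user, authkey, privkey,
                    authProtocol=usmHMACSHAAuthProtocol,
                    privProtocol=usmAesCfb128Protocol),
        UdpTransportTarget((host, 161)),
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifHCInOctets", if_index)),
    ))
    if error_indication or error_status:
        raise RuntimeError(error_indication or error_status.prettyPrint())
    return var_binds

if __name__ == "__main__":
    for oid, value in read_if_counter("switch.example.edu", "monuser",
                                      "auth-passphrase", "priv-passphrase"):
        print(f"{oid.prettyPrint()} = {value.prettyPrint()}")
```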
GGUS-Ticket-ID: #681997: Enable BGP Tagging
- We worked with campus networking to get approval, they implemented the tagging, and we then checked with Edoardo to confirm that it is correct.
GGUS-Ticket-ID: #683657: Implement Varnish
- We ran a test in the test cluster according to Ilija's documentation.
- We created a new Puppet module for Varnish, tested it, and then converted gk02 to a Varnish server. It is fully configured, running, and appearing in Ilija's Varnish monitoring dashboard.
- If possible, we want to run jobs from the test cluster through gk02, separately from the production PQ, before placing the gk02 Frontier proxy higher in the priority list in CRIC (a proxy-test sketch follows this list).
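One way to exercise gk02 before promoting it in CRIC is to send a request through it as an HTTP proxy and look for Varnish cache headers. A minimal sketch; the proxy host/port and the backend Frontier URL below are placeholders:

```python
# Minimal sketch for exercising gk02 as a caching proxy before promoting it
# in CRIC. The proxy host/port and the Frontier server URL are placeholders.
import urllib.request

PROXY = "http://gk02.example.org:6081"                      # hypothetical host:port
TEST_URL = "http://frontier-server.example.org:8000/atlr"   # placeholder backend

def fetch_via_proxy(url: str, proxy: str):
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    with opener.open(url, timeout=30) as response:
        # Varnish normally adds headers such as X-Varnish / Age / Via that
        # show whether the response came through (and from) the cache.
        return response.status, dict(response.getheaders())

if __name__ == "__main__":
    status, headers = fetch_via_proxy(TEST_URL, PROXY)
    print("status:", status)
    for name in ("X-Varnish", "Age", "Via"):
        print(f"{name}: {headers.get(name, '(not present)')}")
```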
GGUS-Ticket-ID: #1000094: Reallocating Scratchdisk
- We have adjusted the max size of SCRATCHDISK from 500 TB to 300 TB.
Dark Data:
- We sent our SCRATCHDISK storage dump file from 7/8 to DDM Ops and received a dump of what is considered good data on 7/17. The 7/8 dump showed 90 TB of dark data; Fabio cleaned it down to 30 TB, which the 7/17 inventory dump confirmed. A new inventory dump started on 7/23 completed recently and shows that dark data is now at 31.7 TB (a dump-comparison sketch follows).
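The dark-data number comes from comparing the site storage dump with the DDM dump of known-good files. A minimal sketch of that comparison, assuming simple text dump formats (one "path size" pair per line for the site dump, one path per line for the DDM dump), which may differ from the real formats:

```python
# Minimal sketch of the dark-data comparison: files present in the site
# storage dump but absent from the DDM "good data" dump are dark data.

def load_site_dump(path: str) -> dict[str, int]:
    files = {}
    with open(path) as f:
        for line in f:
            name, size = line.rsplit(maxsplit=1)
            files[name] = int(size)
    return files

def load_ddm_dump(path: str) -> set[str]:
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def dark_data(site_dump: str, ddm_dump: str) -> tuple[int, int]:
    site = load_site_dump(site_dump)
    known = load_ddm_dump(ddm_dump)
    dark = {name: size for name, size in site.items() if name not in known}
    return len(dark), sum(dark.values())

if __name__ == "__main__":
    # File names below are placeholders for the 7/23 dumps.
    count, size = dark_data("scratchdisk_dump_0723.txt", "ddm_good_0723.txt")
    print(f"{count} dark files, {size / 1e12:.1f} TB")
```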
Certificate Issue:
- We noticed low transfer efficiency to RU, FR, and IT with these errors: "Result HTTP 401: Authentication Error after 1 attempts" (1.64 K over 24 h, with SWT2 as destination DDM and all sites as source DDM).
- We discovered an issue with how we update our certificates. We updated the osg-ca-certs package in our static repo to version 1.136 and are considering changes so that we no longer have to do this manually.
- We are currently using OSG 23 and do not want to update it, since that would change package versions and could cause us issues; for now we updated only the osg-ca-certs package in our current static repo.
- After making this change we did not see a full recovery in transfer efficiency, but we may need to monitor and give it more time (a verification sketch follows this list).
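A quick way to confirm that the updated CA bundle validates the affected endpoints is a TLS handshake check against the grid certificates directory. A minimal sketch with a placeholder endpoint:

```python
# Minimal sketch: attempt a TLS handshake against a remote storage endpoint
# using the grid CA directory, to confirm the updated osg-ca-certs bundle
# validates the peer. The endpoint below is a placeholder.
import socket
import ssl

CA_PATH = "/etc/grid-security/certificates"   # standard grid CA directory
ENDPOINT = ("storage.example.ru", 443)        # placeholder remote SE

def check_endpoint(host: str, port: int) -> None:
    context = ssl.create_default_context()
    context.load_verify_locations(capath=CA_PATH)
    with socket.create_connection((host, port), timeout=15) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
            print(f"{host}:{port} verified OK; issuer:",
                  dict(x[0] for x in cert["issuer"]))

if __name__ == "__main__":
    try:
        check_endpoint(*ENDPOINT)
    except ssl.SSLError as exc:
        print("TLS verification failed:", exc)
```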
Hardware:
- We purchased new hardware to upgrade the head nodes and are waiting for shipment.
- We extended the warranties on several R740xd2s that were going to expire in mid-July of this year.
OU:
- Site running well, no issues.