Updates on US Tier-2 centers
1. Condor update:
The main goal is to update everything to 8.8.8 to address the security issue.
1.1) We started a major project of rebuilding all 400 work nodes of the condor cluster.
One motive for rebuilding was to separate the partition used by condor jobs from the tmp partition.
The rebuild also updated them from 8.6.13 to 8.8.8/8.8.7-1.
We started with the UM site and have finished rebuilding all the work nodes there (179);
2/3 of the UM nodes are running 8.8.7-1 and 1/3 are running 8.8.8, depending on the rebuild day.
The WNs at MSU are now being rebuilt in batches: about 1/3 yesterday, another ~1/3 today,
and the rest on Thursday and Friday.
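The rolling rebuild above can be sketched as a simple batching helper (a hypothetical illustration only; the node names and batch count are assumptions, not our actual tooling):

```python
def rebuild_batches(nodes, n_batches=3):
    """Split a node list into roughly equal rebuild batches,
    so most of the cluster keeps running jobs during each pass."""
    size, extra = divmod(len(nodes), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches absorb the remainder nodes
        end = start + size + (1 if i < extra else 0)
        batches.append(nodes[start:end])
        start = end
    return batches

# e.g. 179 UM-sized nodes split into three rebuild waves
nodes = [f"wn{i:03d}" for i in range(179)]
waves = rebuild_batches(nodes)
print([len(w) for w in waves])  # → [60, 60, 59]
```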
1.2) Updated (switched) the condor head node from SL6/8.6.13 to SL7/8.8.8.
1.3) During the update of the main gatekeeper, we encountered a problem:
idle ucore jobs were not getting scheduled onto unclaimed cores.
This was solved by updating the head node to 8.8.8
and also adding a workaround to the negotiator
(to address a possible negotiator bug in 8.8.8).
2. Job failures caused by OOM killer.
This is very likely caused by:
a) high-memory pile-up jobs (a single job used 56 GB of memory at peak)
running on our score queue (2 GB/core);
b) BOINC jobs also running at our site, which use extra memory on the work nodes.
To address this issue, we stopped the BOINC jobs;
c) the BOINC jobs will remain suspended until we understand more about the situation.
Now that we have solved the problems caused by the condor update,
it is a good time to monitor whether the same error still occurs.
Fred has been in contact with ADC to ask whether it is possible
to put the pile-up jobs in the high-memory queue.
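The mismatch in (a) can be quantified: on a 2 GB/core queue, a job peaking at 56 GB effectively consumes the memory budget of many cores' worth of slots. A small sketch (the numbers come from this report; the helper function itself is illustrative):

```python
import math

def cores_of_memory(peak_gb, gb_per_core=2.0):
    """Number of cores whose memory budget a job's peak usage
    effectively consumes on a fixed GB-per-core queue."""
    return math.ceil(peak_gb / gb_per_core)

# The observed pile-up job: 56 GB peak on the 2 GB/core score queue
print(cores_of_memory(56))  # → 28
```

So a single such job starves the memory of roughly 28 single-core slots, which is why co-scheduled BOINC jobs pushed the nodes into OOM territory.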
3. AGLT2 started running covid19 jobs last Wednesday.
We gave them a quota of up to 2000 cores, which can be expanded to 5000.
For now we do not see enough jobs queued to our site;
the average number of covid19 jobs we process is around 800.
4. Ticket 146371
A strange problem where a small set of files is accessible via xrootd but not gsiftp.
Restarting dcache on the pool node fixes it for a short time.
Shawn opened a ticket with dcache.
No resolution yet.
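Diagnosing ticket 146371 amounts to comparing per-protocol accessibility of the same file set. A minimal sketch of that comparison (the probe results here are hard-coded and hypothetical; in practice they would come from actual xrootd and gsiftp transfer tests):

```python
def protocol_mismatch(xrootd_ok, gsiftp_ok):
    """Return files readable via one door but not the other --
    the signature of the pool problem seen in ticket 146371."""
    return {
        "xrootd_only": sorted(xrootd_ok - gsiftp_ok),
        "gsiftp_only": sorted(gsiftp_ok - xrootd_ok),
    }

# Hypothetical probe results for three files on the affected pool
xrootd_ok = {"fileA", "fileB", "fileC"}
gsiftp_ok = {"fileA"}
print(protocol_mismatch(xrootd_ok, gsiftp_ok))
# → {'xrootd_only': ['fileB', 'fileC'], 'gsiftp_only': []}
```

A non-empty `xrootd_only` list after a pool restart would show the problem recurring, matching the "fixes it for a short time" observation.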
5. COVID19.
No change to access plan at UM or MSU
Smooth operations except for high temp alarms due to broken fans. Replacement fans ordered.
Site was not getting filled by PanDA for a few weeks, but it's better now.
Two more NESE gateways added anticipating ramping up. Working as NESE_DATADISK in AGIS & Rucio.
6PB NESE upgrade arrived, installed, tested, but switches from DELL have been delayed twice.
Converging on NET2/NESE tape Tier. Getting helpful feedback from BNL and others in HEP.
Fred noticed that a few of our oldest nodes were, strangely, seeing a ~50% failure rate from stage-out timeouts. The problem quickly disappeared, but we haven't yet figured out the cause.
A user complained about 5 missing files at NET2_DATADISK. They were indeed missing, marked as gone by DDM. We don't think this is related to a local issue.
UTA_SWT2:
SWT2_CPB:
OU:
- Nothing to report, all running well.