Tickets:
No current tickets, except for the SRR issue, which remains globally on hold.
Current investigation:
(Thanks to Fred for noticing and digging into every oddity.)
Investigating an occasional but significant rate of jobs failing a DNS lookup for squid.aglt2.org.
The problem appears almost exclusively at AGLT2 and surfaces as error code 65.
warn [frontier.c:1014]: Request 1278 on chan 7 failed at Tue May 12 09:05:42 2020: -6 [fn-urlparse.c:178]: host name squid.aglt2.org problem: Temporary failure in name resolution
The Frontier python code resolves the name by calling getaddrinfo.
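A minimal sketch (not the actual Frontier client code) of the same lookup, distinguishing the transient EAI_AGAIN failure ("Temporary failure in name resolution") seen in the log above from a permanent resolver error; the port 3128 and the retry parameters are assumptions for illustration:

    import socket
    import time

    HOST = "squid.aglt2.org"

    def probe(host, attempts=5, delay=2.0):
        """Retry getaddrinfo, reporting transient resolver failures."""
        for i in range(attempts):
            try:
                # The same call the Frontier python code relies on.
                infos = socket.getaddrinfo(host, 3128, proto=socket.IPPROTO_TCP)
                return [info[4][0] for info in infos]  # resolved addresses
            except socket.gaierror as err:
                if err.errno == socket.EAI_AGAIN:
                    # "Temporary failure in name resolution" -- the transient case
                    print(f"attempt {i + 1}: temporary failure, retrying")
                    time.sleep(delay)
                else:
                    raise  # permanent error, e.g. NXDOMAIN
        return []

    print(probe(HOST))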
Added IPv6 addresses to the round-robin DNS for squid.aglt2.org to match the IPv4 entries.
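A quick check (again, port 3128 is an assumption) that the round-robin name now returns addresses of both families:

    import socket

    addrs = socket.getaddrinfo("squid.aglt2.org", 3128, proto=socket.IPPROTO_TCP)
    by_family = {}
    for family, _type, _proto, _canon, sockaddr in addrs:
        by_family.setdefault(family, []).append(sockaddr[0])

    print("IPv4:", sorted(set(by_family.get(socket.AF_INET, []))))
    print("IPv6:", sorted(set(by_family.get(socket.AF_INET6, []))))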
Added an access rule to our squids for our IPv6 address space.
But transient errors like this are difficult to pin down.
Ongoing; no clear answer yet.
Software:
- preparing to update to condor 8.8.9 when it becomes available in the OSG release
- preparing for renewal of all our SSL certificates
Hardware:
- reconfigured all smaller nodes that had spare memory to a minimum of 2G per HT core
BOINC:
- no longer running boinc on WNs with <=2G/core; it now runs only on nodes with >=2.6G/core
- we also re-enabled boinc on only half of these larger nodes, to allow a comparison.
- We also changed the boinc processes' initial OOM score (see the sketch after this list).
1000 is the highest score we can give to boinc jobs.
800 is assigned to condor jobs by the condor starter.
The score evolves as oom_score = 10 x (percentage of memory used) + initial score.
Thus a condor job would have to use 20% of all memory to pass a boinc job.
This might be plausible on an 8-core node but is much less so on a 40-core node,
unless, of course, the job has a memory leak and thus should indeed be killed first.
- Since implementing the steps above we have seen only 4 instances of OOM kills:
3 were on nodes not running boinc,
and all were (or would have been) legitimate OOM kills of misbehaving processes.
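A worked sketch of the OOM-score arithmetic referenced above; the initial scores of 1000 and 800 and the 10-points-per-percent rule come from these notes, the per-node memory figures assume roughly 2G per core and are illustration only:

    # effective oom_score ~= 10 * (percent of total memory used) + initial score
    # (the initial score is applied by writing to /proc/<pid>/oom_score_adj)
    BOINC_INITIAL = 1000   # highest score we can give to boinc jobs
    CONDOR_INITIAL = 800   # set for condor jobs by the condor starter

    def oom_score(initial, mem_percent):
        return initial + 10 * mem_percent

    # A condor job overtakes a boinc job (using ~no memory) only once
    # 800 + 10*p > 1000, i.e. it uses more than 20% of the node's total memory.
    for p in (5, 10, 20, 25):
        print(f"condor job at {p:2d}% memory -> score {oom_score(CONDOR_INITIAL, p)}"
              f" vs boinc baseline {BOINC_INITIAL}")

    # Assuming ~2G/core, 20% of memory is ~3G on an 8-core node (plausible for
    # one job) but ~16G on a 40-core node (usually only a leaking job).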
COVID:
- asked via OSG for an increased time limit for COVID jobs (10h -> 36h),
as a large fraction of jobs was starting to fail to complete.