OSG 3.5.16 // 3.4.50 (this week)
Miscellaneous
Updates on US Tier-2 centers
It was a busy two weeks:
Tickets:
No current tickets,
except for the SRR issue, which is globally on hold.
Current investigation:
(Thanks to Fred for noticing and digging into every oddity.)
Investigating an occasional but significant rate of jobs failing the DNS/URL lookup for squid.aglt2.org.
This problem seems to happen almost exclusively at AGLT2. Error code 65.
warn [frontier.c:1014]: Request 1278 on chan 7 failed at Tue May 12 09:05:42 2020: -6 [fn-urlparse.c:178]: host name squid.aglt2.org problem: Temporary failure in name resolution
Frontier Python code calls getaddrinfo (a retry sketch follows this block).
Added IPv6 to the round-robin DNS for squid.aglt2.org to match IPv4.
Added an access rule to our squids for our IPv6 address space.
But transient errors are difficult to pin down.
Ongoing. No clear answer yet.
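A minimal sketch (retry parameters assumed; this is not the actual Frontier client code) of catching the transient EAI_AGAIN error behind the "Temporary failure in name resolution" message above and retrying the lookup:

    import socket
    import time

    def resolve_with_retry(host, port=3128, attempts=3, delay=2.0):
        """Resolve host, retrying on transient DNS failures.

        socket.gaierror with errno EAI_AGAIN is the error class behind the
        "Temporary failure in name resolution" seen in the log line above.
        Port 3128 is assumed here as the usual squid port.
        """
        for attempt in range(1, attempts + 1):
            try:
                # Returns both A (IPv4) and AAAA (IPv6) records now that the
                # round-robin DNS entry carries both address families.
                return socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
            except socket.gaierror as err:
                if err.errno != socket.EAI_AGAIN or attempt == attempts:
                    raise  # permanent failure, or retries exhausted
                time.sleep(delay)

    for family, _, _, _, sockaddr in resolve_with_retry("squid.aglt2.org"):
        print(family, sockaddr)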
Software:
- preparing to update to condor 8.8.9 when it becomes available in the OSG release
- preparing for renewal of all our SSL certificates (a certificate-expiry check sketch follows this list)
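As an aside, a minimal sketch (hostname hypothetical; this is not our actual renewal tooling) of reading a server certificate's notAfter date from Python, which can help audit which certificates are due for renewal:

    import socket
    import ssl

    def cert_expiry(host, port=443):
        """Return the notAfter field of the certificate served at host:port."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.getpeercert()["notAfter"]

    # Hypothetical hostname; substitute the services whose certificates are up for renewal.
    print(cert_expiry("www.example.org"))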
Hardware:
- reconfigured all smaller nodes with spare memory to a minimum of 2 GB per HT core
BOINC:
- no longer running BOINC on WNs with <= 2 GB/core, only on those with >= 2.6 GB/core
- re-enabled BOINC on only half of these larger nodes, to allow comparison
- we also changed the BOINC processes' initial OOM score (see the sketch at the end of this section)
1000 is the highest score we can give to BOINC jobs.
800 was assigned to condor jobs by the condor starter.
The score evolves as oom_score = 10 x (percentage of memory used) + initial score.
Thus a condor job would have to use 20% of all memory to pass a BOINC job.
This might be plausible on an 8-core node but much less so on a 40-core node.
Unless, of course, the job has a memory leak and thus should indeed be killed first.
- Since implementing the steps above, we have seen only 4 instances of OOM kills:
3 were on nodes not running BOINC;
all were (or would have been) legitimate OOM kills of misbehaving processes.
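A minimal sketch (hypothetical helper, not our actual tooling) of setting a process's initial OOM score through /proc/<pid>/oom_score_adj, the kernel interface behind the 1000 and 800 values above; under the scoring rule above, condor at 800 overtakes BOINC at 1000 only once it uses more than (1000 - 800) / 10 = 20% of memory:

    import os

    def set_oom_score_adj(pid, score):
        """Set the initial OOM score adjustment (-1000..1000) for pid.

        Per the scoring described above, the effective score grows as
        10 x (percent of memory used) + this initial value, so a process
        started at 1000 (BOINC) remains the preferred OOM victim until
        one started at 800 (condor) exceeds 20% of total memory.
        """
        if not -1000 <= score <= 1000:
            raise ValueError("oom_score_adj must be in [-1000, 1000]")
        with open(f"/proc/{pid}/oom_score_adj", "w") as f:
            f.write(str(score))

    # e.g. mark the current process (as a BOINC wrapper might) as the
    # preferred OOM victim; raising one's own score needs no privileges.
    set_oom_score_adj(os.getpid(), 1000)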
COVID:
- asked via OSG for an increased time limit for COVID jobs (10h -> 36h),
as a large fraction of jobs was starting to fail to complete.
GGUS Ticket 146935:
Certificate issue (mentioned by Brian). Fixed with a gPlasma and certificate restart; at least jobs with file transfers are no longer completely failing.
Asked to lower the number of COVID job pilots sent to our site; they were taking up a lot of capacity in place of ATLAS production.
UC:
Had two failing storage nodes; draining and retiring them. Currently working on offloading the last bit of storage (~3 TB of data) from the remaining pool.
Continuing to update storage nodes from EL6 to EL7.
IU:
Running fine. All clear.
UIUC:
Nishant was recently hired; he is going through UIUC/ICC onboarding and will soon go through MWT2 onboarding.
All clear.
Smooth operations except for a first problem on the NESE side and a couple of minor issues. NESE recovered and is investigating.
Still no Dell S5048-ON switches, which is delaying the NESE storage upgrade.
Mostly through with the rolling Linux kernel update.
Preparing materials for T2 review.
Low-level authentication issue now understood as of this meeting; will update CAs to fix it, as Horst did.
OU:
- Not much to report, site full and running smoothly.
- Had a few failed jobs because a SLURM reconfiguration inadvertently killed running jobs.
- DDM transfers to some UK sites had issues, most likely related to obsolete CA files on both sides. Updated at OU. Ticket 146909 closed.
UTA:
- Fixed job accounting issue at SWT2_CPB (not sure if it's a permanent solution).
- We were waiting on feedback from ADC ops regarding an issue with a failing SAM squid test (already fixed at UTA_SWT2). Ticket will be closed later today.
- Both SWT2_CPB & UTA_SWT2 are generally running OK.
15 M events simulated at NERSC this past week.
NERSC ALCC allocation charged almost 21 Mhrs out of 36 Mhrs (41.6% remaining). Hopefully we can use up this allocation effectively this month.
Will need to switch over to the ALCC allocation as soon as possible.
Lincoln is working on data-handling testing between MWT2 and TACC.
VP service working fine.
VP panda queues show same job and CPU efficiency as regular queues.
AGLT2 - upgraded server works better.
MWT2 - added an NVMe drive; now developing code to make better use of the NVMe performance.
Prague - running fine.
LRZ-LMU - working better. Limited by their spinning disk for metadata; will try moving it to a RAM disk.
Some issues with xcache reaching 100% full under high load.
Some dark data issues being investigated.