OSG 3.5.3 and 3.4.37 (tomorrow)
3 GGUS tickets:
1) Analysis jobs failed with "not enough resource available"; this was caused by a ulimit set on our Condor system and has been fixed.
2) UCORE jobs failed at authentication due to a problematic worker node; the node has been rebuilt.
3) 2 sites have transfer errors (timeouts) to AGLT2; unresolved. The ticket has been changed to low priority.
Received 3 R740xd2 storage nodes (dCache pool nodes): 1 for MSU, 2 for UM. Waiting for provisioning.
Update on BOINC operation at AGLT2:
Reminder: we had difficulties with the kernel not effectively applying a lower priority to the BOINC jobs. They ended up receiving roughly half of the CPU cycles, which had never been the goal. At the last meeting Wenjing reported on switching to cgroups to try to tame that behavior, but we did not have results or graphs to show at the time.
Here is a link to the current CPU efficiency, which is already back in a much more reasonable and acceptable range while we continue to tune this BOINC-backfilling model.
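For reference, a rough sketch of why cgroups help here: under the cgroup v1 CPU controller, sibling groups competing for a fully busy CPU receive time in proportion to their cpu.shares values (default 1024, minimum 2). The group names and share values below are hypothetical; these notes do not record the settings actually used at AGLT2.

```python
def contended_cpu_fractions(shares):
    """Return each group's expected CPU fraction when all groups are busy.

    `shares` maps a cgroup name to its cpu.shares weight. The fractions only
    apply under contention; an idle sibling's time is available to the others.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one group needs a positive weight")
    return {name: weight / total for name, weight in shares.items()}


# Hypothetical weights: equal shares reproduce the "roughly half" behavior,
# while a minimal weight for BOINC leaves it well under 1% under contention.
equal = contended_cpu_fractions({"atlas_jobs": 1024, "boinc": 1024})
backfill = contended_cpu_fractions({"atlas_jobs": 1024, "boinc": 2})

for name, fraction in backfill.items():
    print(f"{name}: {fraction:.2%}")
```

With a weight this low, BOINC would still use any cycles the production jobs leave idle, which is the intended backfilling behavior.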
Minor operations issues:
1. Low rate of CA errors, only on outgoing transfers and only to certain external sites. This might or might not be a problem on our end, but we have to investigate.
2. We're occasionally still seeing too many squid failovers.
o Storage purchase is out, including a SLATE node.
o More worker and storage purchases should go out in the next few days.
o We need a few manual operations from CERN DDM to set up the NESE GridFTP endpoint for testing.
- Nothing to report, all running well.
- Storage server issues have shown up in both SWT2_CPB and UTA_SWT2, which required moving data to other servers.
- Seeing high loads on some data servers, which is causing problems with Event Index jobs; we are testing whether a firmware update fixes the problem.
- SLATE node is being worked on.