PostgreSQL was updated from 9.3.11 to 9.5.1 in advance of a dCache upgrade from the 2.10 series to the 2.13 series. This was done on Tuesday of last week during a full downtime. At the same time our WNs were completely rebuilt, updating Condor to 8.4.4, cvmfs to 2.1.20, OSG-WN client to 3.3.8, glibc to 2.12-1.166.el6_7.7, and applying various other sl and sl-security updates. Gatekeepers were updated to OSG 3.3.9, using the OSG installation of Condor 8.4.3. The master Condor machine is also on Condor 8.4.4, which works around a possible issue with the collector process in 8.4.3.

Generally, all upgrades went smoothly, modulo interactions between the various components. The dCache update in particular surprised us with how quickly it went. Several items were not immediately obvious, but a search of the dCache documentation showed the way. The xrootd plugins required a bit more work, and consultations between Gerd, Ilija and Shawn will likely result in new plugin RPMs in the near future.

There are no outstanding issues with our site at this time. However, we have noticed some recent jobs that are crashing WNs. These jobs run a process called "JSAPrun.exe". For WNs running this process, condor_status will suddenly report a LoadAv of many tens, even many hundreds, and the WN then either crashes or becomes unresponsive. We then get hung_task_timeout dumps in /var/log/messages indicating processes that have been blocked for more than 120 seconds. We have only just discovered this and have not yet had a chance to do any further digging, but I mention it here because other sites may also be seeing this.
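
For sites that want to check whether they are seeing the same thing, the hung-task dumps can be tallied per process name with a short script. This is only a rough sketch, not something we have run in production: it assumes the usual kernel wording "INFO: task NAME:PID blocked for more than N seconds", and the function names count_hung_tasks and report are made up for illustration.

```python
import re
from collections import Counter

# Assumed kernel hung-task message wording:
#   "INFO: task <name>:<pid> blocked for more than <N> seconds"
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def count_hung_tasks(lines):
    """Return a Counter mapping process name to number of hung-task reports."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


def report(path="/var/log/messages"):
    """Print hung-task counts for one log file, most frequent first."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, errors="replace") as handle:
        for name, count in count_hung_tasks(handle).most_common():
            print(f"{count:6d}  {name}")
```

Calling report() on an affected WN (reading /var/log/messages normally needs root) would show whether JSAPrun.exe itself is among the blocked processes, or whether other processes on the node are the ones being starved.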