From Andy:
What needs to change for async caching support:
1) XrdPosix package to add async POSIX style I/O
2) XrdPss package to use async POSIX style I/O
3) XrdOucCache package to provide async cache interface; this also impacts the XrdPosix package because it is responsible for loading and using the caching interface.
The issue here is that all of these interfaces are public, which means we need to implement this without breaking ABI compatibility (i.e. it must be backward compatible).
Time estimates:
a) 1 week to design and code up the new caching interface (4/5/16).
b) 2 weeks to retrofit XrdPosix package to use (a) (3/21/16).
c) 1 week to retrofit XrdPss package to use (b) (3/25/16).
The above will always be available as work proceeds in the pssasync branch in the xroot github repo so other parallel work can proceed. Please be aware I go on vacation 3/28/16 for 12 days with limited, if any, internet connectivity, so it is likely that we will not have a production-quality version until 4/15/16 to 4/20/16, depending on how it goes.
pgsql was updated from 9.3.11 to 9.5.1 in advance of doing a dCache upgrade from the 2.10 to the 2.13 series. This occurred on Tuesday last week during a full downtime that day. At the same time our WNs were rebuilt completely, updating Condor to 8.4.4, cvmfs to 2.1.20, OSG-WN client to 3.3.8, glibc to 2.12-1.166.el6_7.7, and various other sl and sl-security updates. Gatekeepers were updated to OSG 3.3.9, utilizing the OSG installation of Condor 8.4.3. The master Condor machine is also on Condor 8.4.4, which works around a possible issue with the collector process in 8.4.3.
Generally all upgrades went smoothly, modulo interactions between the various components. The dCache update in particular surprised us with how quickly it went. Several items were not immediately obvious, but a dCache documentation search showed the way. The xrootd plugins required a bit more work, and consultations between Gerd, Ilija and Shawn will likely result in new plugin rpms in the near future.
There are no outstanding issues with our site at this time. However, we have noticed some recent jobs that are crashing WNs. These jobs run a process called "JSAPrun.exe". Condor will suddenly report jobs running this process with a (condor_status) LoadAv of many tens, even many hundreds, which results in the WN either crashing or becoming unresponsive. We then get hung_task_timeout dumps in /var/log/messages indicating processes that have been blocked for more than 120 seconds. We have only just discovered this, and so have not had a chance to do any further digging, but I mention it here because other sites may also be seeing this.
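For sites that want to check for the same symptoms, a minimal sketch of the two checks follows. It is not something the reporting site ran. The function names, the threshold of 50, and the two-column "name load" input format are assumptions for illustration; the input is imagined as coming from something like `condor_status -af Name LoadAvg`, and the log pattern follows the usual kernel hung-task warning text.

```python
import re

# Kernel hung-task warnings normally look like:
#   INFO: task JSAPrun.exe:4321 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"INFO: task (.+?):(\d+) blocked for more than (\d+) seconds"
)


def overloaded_nodes(status_text, threshold=50.0):
    """Return (name, load) pairs whose load average exceeds threshold.

    status_text is assumed to hold one "name load" pair per line;
    lines that do not fit that shape are skipped.
    """
    flagged = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, load = parts
        try:
            value = float(load)
        except ValueError:
            continue
        if value > threshold:
            flagged.append((name, value))
    return flagged


def hung_tasks(log_text):
    """Count hung-task warnings per process name in syslog text."""
    counts = {}
    for match in HUNG_TASK.finditer(log_text):
        process = match.group(1)
        counts[process] = counts.get(process, 0) + 1
    return counts
```

Feeding `hung_tasks` the contents of /var/log/messages from an affected WN would show whether JSAPrun.exe itself or some other process (for example a filesystem journal thread) is the one being reported as blocked.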
Site has been running well except for IU
Scan for latest OpenSSL bug (DROWN attack) shows MWT2 clean
Minor update of dCache to 2.10.56-1
New Disk at UChicago
OSG 3.3.9
minRSS and maxRSS now set
ATLAS Analytics
misc
UTA_SWT2
Facility electrical work forced a shutdown over the weekend. During the shutdown we added memory to nodes with 24GB of memory.
SWT2_CPB
Bringing 400TB of storage online.
UTA - Expecting network interruption this weekend.