Set up a new AGLT2_HOSPITAL queue. The difference from our other queues is that the input (read) data of its jobs is non-local, coming instead from various other US storage elements. It is also a multi-core queue.
Incidents:
Massive data-transfer failures occurred a few times between late December and early January, caused by a failure of the authentication service in dCache and by one storage node losing network connectivity for a short period.
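For failures like these, a lightweight probe against a dCache WebDAV door can distinguish an authentication-service outage from a node that has dropped off the network. The sketch below is illustrative only; the endpoint URL and proxy path are hypothetical placeholders, not our production monitoring.

    #!/usr/bin/env python3
    """Minimal availability probe for a dCache WebDAV door.

    A sketch only: the door URL and the X.509 proxy path are
    hypothetical placeholders, not AGLT2's actual configuration."""
    import requests

    ENDPOINT = "https://dcache.example.org:2880/pnfs/"   # hypothetical door URL
    PROXY = "/tmp/x509up_u1000"                          # hypothetical grid proxy

    def probe(url, cert):
        try:
            # PROPFIND with Depth: 0 is a cheap WebDAV request that still
            # exercises the authentication layer end to end.
            r = requests.request("PROPFIND", url, cert=cert,
                                 headers={"Depth": "0"},
                                 verify=False, timeout=30)
            if r.status_code in (401, 403):
                return "AUTH FAILURE (%d)" % r.status_code
            if r.status_code >= 500:
                return "SERVER ERROR (%d)" % r.status_code
            return "OK (%d)" % r.status_code
        except requests.exceptions.ConnectionError:
            return "NO NETWORK PATH"   # e.g. a node dropped off the network
        except requests.exceptions.Timeout:
            return "TIMEOUT"

    if __name__ == "__main__":
        print(probe(ENDPOINT, PROXY))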
Some of the Condor worker nodes show unusually high load (over 1000), with or without jobs using the CPU. The symptoms include high load, a hanging /tmp directory, lost connections to the Condor head node, 100% swap usage, and hanging sanity-check processes. We updated a few worker nodes from Condor 8.4.11 to 8.6.12 for debugging purposes.
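To triage nodes showing these symptoms, a small health probe can flag the directly measurable ones (runaway load, full swap, a wedged /tmp). A minimal sketch follows; the thresholds and the use of SIGALRM to guard against a hanging stat() are illustrative assumptions, not our production checks.

    #!/usr/bin/env python3
    """Sketch of a worker-node health probe for the symptoms above.
    Thresholds are illustrative assumptions only."""
    import os
    import signal

    LOAD_LIMIT = 1000.0     # 1-minute load average considered pathological
    SWAP_LIMIT = 0.99       # fraction of swap in use considered pathological

    def load_average():
        # First field of /proc/loadavg is the 1-minute load average.
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])

    def swap_usage():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":")
                info[key] = int(value.split()[0])   # values in kB
        total = info.get("SwapTotal", 0)
        if total == 0:
            return 0.0
        return 1.0 - info["SwapFree"] / total

    def tmp_responds(timeout=10):
        # A stat() on a wedged /tmp can block indefinitely, so guard it
        # with an alarm instead of calling it directly.
        def on_alarm(signum, frame):
            raise TimeoutError
        old = signal.signal(signal.SIGALRM, on_alarm)
        signal.alarm(timeout)
        try:
            os.stat("/tmp")
            return True
        except TimeoutError:
            return False
        finally:
            signal.alarm(0)
            signal.signal(signal.SIGALRM, old)

    if __name__ == "__main__":
        print("load1 = %.1f (limit %.0f)" % (load_average(), LOAD_LIMIT))
        print("swap  = %.0f%% used (limit %.0f%%)"
              % (swap_usage() * 100, SWAP_LIMIT * 100))
        print("/tmp responsive:", tmp_responds())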
System updates:
We performed two dCache updates this quarter, from 4.2.6 to 4.2.12 and then from 4.2.12 to 4.2.21; the latter adds support for the xrootd-TPC and HTTP-TPC tests. During the first dCache update, we also updated the system firmware and moved the OS to SL7.5.
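For context, HTTP-TPC moves a file directly between two storage endpoints: the client issues a single WebDAV COPY request and the endpoints stream the data between themselves. Below is a minimal pull-mode sketch, in which the COPY goes to the destination and the Source header names the file to fetch; the URLs, token, and proxy path are all hypothetical placeholders.

    #!/usr/bin/env python3
    """Minimal HTTP third-party-copy (HTTP-TPC) sketch, pull mode.
    URLs, token, and proxy path are hypothetical placeholders."""
    import requests

    SOURCE = "https://source-se.example.org:2880/pnfs/data/file1"
    DEST = "https://dest-se.example.org:2880/pnfs/data/file1"
    TOKEN = "..."   # credential the destination presents to the source

    def pull_copy(source, dest, token):
        # In pull mode the COPY request goes to the *destination*, with
        # the source named in the Source header. TransferHeaderAuthorization
        # carries the credential the destination should present to the
        # source when it fetches the data.
        headers = {
            "Source": source,
            "TransferHeaderAuthorization": "Bearer " + token,
        }
        r = requests.request("COPY", dest, headers=headers,
                             cert="/tmp/x509up_u1000",  # hypothetical proxy
                             verify=False, timeout=60)
        # The endpoint streams back periodic progress ("perf marker")
        # lines while the transfer runs; the final line reports the
        # overall success or failure of the copy.
        return r.status_code, r.text

    if __name__ == "__main__":
        status, body = pull_copy(SOURCE, DEST, TOKEN)
        print(status)
        print(body)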
The AFS client 1.8 was compiled and installed on our CentOS 7 host; the version available for SL7 is still 1.6. We have not yet tested 1.8 on the SL7 nodes.
All SL7 nodes, including worker nodes, grid service nodes, and interactive nodes, were upgraded from SL7.5 to SL7.6, and all security patches are applied promptly. All SL7.6 hosts have been rebooted to run the most recent kernel (3.10.0-957.1.3.el7.x86_64).
All worker nodes had their lustre-client upgraded from 2.10.4 to 2.10.6; this update is needed to support the most recent kernel (3.10.0-957.1.3.el7.x86_64).
All three of our OSG gatekeepers had Condor upgraded from 8.6.11 to 8.6.13.