Site is running well
- Full of ATLAS jobs (MCORE, SCORE, Analy and Opportunistic)
- Good efficiency
IU nodes now operational
- Over 13,500 cores
- HS 133,303, APEL Factor of 9.87
- Accounting updated in OIM, REBUS, V38 and WLCG-V38
New Disk at UChicago
- Ceph based
- Will move MWT2_UC_LOCALGROUPDISK from dCache to Ceph
- Might change name to MWT2_LOCALGROUPDISK to aid in transition
- Lincoln is testing by using gfal-copy to push data into the system
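A minimal sketch of such a test copy, assuming a hypothetical WebDAV endpoint in front of the Ceph gateway (hostname and path are illustrative, not the real endpoint):

```shell
# Push a local test file into the Ceph-backed endpoint;
# -p creates missing parent directories on the destination.
gfal-copy -p file:///tmp/testfile.dat \
    davs://ceph-gw.mwt2.org:443/localgroupdisk/test/testfile.dat

# Confirm the copy landed and check its size
gfal-ls -l davs://ceph-gw.mwt2.org:443/localgroupdisk/test/
```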
OSG 3.3.8
- All head nodes have been running the 3.3.x stack for a long time without problems
- CE (HTCondorCE)
- Squid
- CVMFS servers/clients
- GUMS
- Condor 8.4.3
- Still using 3.2.24 on worker nodes
- DCAP was removed, but our LSM still uses it
- Working on an LSM update to remove DCAP
- Will use GFAL2 (gfal-copy, gfal-rm, gfal-ls)
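The GFAL2-based transfer step in the updated LSM might look like the following sketch (the SURL endpoint and paths are hypothetical; the real LSM builds them from the job specification):

```shell
# Stage a job output file out to dCache storage (endpoint hypothetical)
gfal-copy file:///scratch/job123/output.root \
    srm://uct2-dc1.mwt2.org/pnfs/mwt2.org/atlasdatadisk/job123/output.root

# Verify the replica exists and check its size
gfal-ls -l srm://uct2-dc1.mwt2.org/pnfs/mwt2.org/atlasdatadisk/job123/output.root

# Remove the replica after a failed or partial transfer
gfal-rm srm://uct2-dc1.mwt2.org/pnfs/mwt2.org/atlasdatadisk/job123/output.root
```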
Virtual Memory issues
- Large jobs causing many problems
- OOM killing other jobs
- Nodes hanging/crashing
- "lostheartbeat" errors
- Upgrade to HTCondor 8.4.3 and cgroups help control large jobs
- cgroup "soft" allows flexible RSS
- hard virtual memory limit puts jobs into the HELD state
- Exposed inconsistent swapfile policy (little to no swap on some nodes)
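In HTCondor 8.4 the cgroup enforcement mode is a single startd knob; a minimal sketch of the relevant configuration (values are illustrative, not our exact production settings):

```
# Enforce per-job memory limits via cgroups. "soft" lets RSS float above
# the request while memory is free and reclaims only under pressure;
# "hard" enforces the limit strictly (and held jobs result).
CGROUP_MEMORY_LIMIT_POLICY = soft

# Name of the base cgroup under which the startd places each job
BASE_CGROUP = htcondor
```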
FAX Door issues
- Doors at IU were causing problems
- Some type of internal IU networking issue (low-level packet loss)
- Moved all doors to UC
- Will be moving doors off storage nodes onto a VM, as was done for SRM (both FAX and WebDAV)
WebDAV certificate issues
- Door is currently on a storage node (uct2-s13.mwt2.org)
- Needed a certificate subject with SubjectAltName entries:
- webdav.mwt2.org
- uct2-s13.mwt2.org
- Now supported in OSG PKI tools (osg-gridadmin-request -a)
- CILogon support added this Monday (2/1/2016)
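The SAN requirement can be checked locally. The sketch below generates a throwaway self-signed certificate carrying both hostnames and prints its SubjectAltName extension; it assumes OpenSSL 1.1.1+ (for -addext/-ext), and the file paths are illustrative:

```shell
# Create a self-signed cert whose SAN lists both door hostnames
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout /tmp/webdav.key -out /tmp/webdav.crt \
    -subj "/CN=webdav.mwt2.org" \
    -addext "subjectAltName=DNS:webdav.mwt2.org,DNS:uct2-s13.mwt2.org"

# Inspect the SAN extension -- both names should appear
openssl x509 -in /tmp/webdav.crt -noout -ext subjectAltName
```

The same inspection command works against the production host certificate to confirm what OSG PKI issued.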
minRSS and maxRSS now set