Software Update:
Updated the OSG software and HTCondor-CE to the most recent release on all 3 gatekeepers.
Frontier Squid was also updated to 4.10-1.1.osg34.el6.
We plan to upgrade all our SLC6 nodes to SLC7, including the dCache, HTCondor, and AFS services.
Job Errors:
Many jobs are failing with this error:
Non-zero return code from RAWtoESD (65); Logfile error in log.RAWtoESD: "AthMpEvtLoopMgr ERROR Failure in waiting or sub-process finished abnormally"
Some worker nodes failed 100% of their jobs. We identified and rebuilt around 15 affected worker nodes; since the rebuild, they no longer fail many jobs (failure rate below 10%).
Note: This error also appears in jobs at 8 other sites; AGLT2 fails 1/5 of them. There is no ticket, and we are not sure whether the error comes from the job itself or from the worker nodes.
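One way to separate node problems from job problems is to compute the failure rate per worker node: nodes that fail every job point to the node, while a uniform failure rate across nodes points to the job. The following is a minimal Python sketch of that check, not the procedure actually used at the site. The input format (pairs of node name and job exit code), the node names, and the min_jobs cutoff are all assumptions made for illustration.

```python
from collections import defaultdict


def failure_rates(jobs):
    """Map each node to (failed, total, rate) from (node, exit_code) pairs."""
    counts = defaultdict(lambda: [0, 0])  # node -> [failed, total]
    for node, exit_code in jobs:
        counts[node][1] += 1
        if exit_code != 0:  # any non-zero return code counts as a failure
            counts[node][0] += 1
    return {node: (f, t, f / t) for node, (f, t) in counts.items()}


def suspect_nodes(jobs, threshold=1.0, min_jobs=5):
    """Nodes with at least min_jobs jobs and a failure rate >= threshold."""
    rates = failure_rates(jobs)
    return sorted(
        node for node, (_, total, rate) in rates.items()
        if total >= min_jobs and rate >= threshold
    )


# Hypothetical sample: node names and job counts are made up for illustration.
sample_jobs = (
    [("wn001", 65)] * 6                        # fails every job
    + [("wn002", 0)] * 9 + [("wn002", 65)]     # fails 1 job in 10
    + [("wn003", 65)] * 2                      # too few jobs to judge
)

print(suspect_nodes(sample_jobs))  # ['wn001']
```

The min_jobs cutoff keeps a node with only one or two failed jobs from being flagged for a rebuild; lowering threshold (for example to 0.5) would widen the list of candidates.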