For all the sites that see a small percentage of jobs fail with timeouts on input/output:
we are investigating the interaction between the rucio mover, gfal2 and xrootd. In a number of cases the actual transfer was not even attempted, and the reason seems to be the way the rucio mover tries to stat the file and get its checksum. Hopefully a fix will come soon; once it is ready we will try to get it tested and deployed quickly. This does not exclude the possibility that there are other issues lurking there.
Fred:
It was an OK week for production.
- A number of tasks had high failure rates, but the cause was on the submission side.
- Most recently, in the last day, looping event generation jobs were killed as a group.
- I was going to mention the Rucio transfer issue but Ilija beat me to it by providing the notes above.
- There was also an unintended Rucio release, which caused trouble for about 1 day.
- Several sites had short-term issues.
- Covid jobs seemed to run OK, but of course they reduced ATLAS production.
- NET2 had some stage-out issues with the covid jobs.
- It looks like recovering just over a month (Feb 28 to Apr 8) of accounting data for CPB will be hard. Right now the official GRACC/APEL system has nothing reported from CPB for the entire month of March.
- Port scanning from LHCONE????