- Stable operations at the Tier-1, except for a brief outage of the SE.
- The dCache core domain (admin node) ran out of memory in the early morning of Dec 31. Hiro quickly fixed this by restarting the domain. As this has not happened before, we suspect the situation was related to unusual operations in conjunction with performance optimization of the replica creation processes.
- ADC operations flagged the Tier-1 for DDM transfer errors caused by missing files.
- Our investigations have shown that this is not a problem at the BNL site. All files reported as lost in the context of this ticket were created by jobs running at the ORNL_Titan site, which uses the BNL SE to store job output files. The pilot uses a specific site mover to move files produced at ORNL_Titan to the BNL SE. Some of these transfers suffer from a high failure rate and need many retries before they eventually succeed, according to the site mover. However, even when they are reported as successfully transferred, the files do not exist at the destination SE (BNL). We suspect a race condition in the site mover code, most likely due to timing issues in the transfer failure recovery section, that leads to the deletion of a file that was successfully transferred to BNL. Note that FTS handles such cases correctly, but FTS is not managing these particular transfers.
- The missing files were declared lost by the T1 Storage Management Group.
- Updated the FY16 capacity/procurement table.
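
For reference, the suspected site mover race noted above (a failure recovery step deleting a file that had already arrived at BNL) could be guarded against roughly as sketched below. This is not the pilot's site mover code. The function names (`checksum`, `destination_is_valid`, `transfer_with_verification`) are hypothetical, and a local file copy stands in for the actual transfer to the SE. The idea is to decide success or failure only by verifying the destination, and to clean up only when that verification fails.

```python
import hashlib
import os
import shutil


def checksum(path):
    """Return the SHA-256 hex digest of a file (stand-in for an SE checksum query)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def destination_is_valid(source, destination):
    """True only if the destination exists and matches the source checksum."""
    return os.path.isfile(destination) and checksum(source) == checksum(destination)


def transfer_with_verification(source, destination, copy=shutil.copyfile, max_attempts=3):
    """Copy with retries and return the attempt number that succeeded.

    Hypothetical sketch: `copy` is any callable taking (source, destination).
    An error raised by `copy` is not trusted on its own, because the data may
    have landed before the error was reported. The destination is removed
    only when verification fails, so an intact copy is never deleted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            copy(source, destination)
        except OSError:
            pass  # fall through: verify before deciding what to clean up
        if destination_is_valid(source, destination):
            return attempt
        if os.path.exists(destination):
            os.remove(destination)  # partial or corrupt copy only
    raise RuntimeError(f"transfer failed after {max_attempts} attempts: {destination}")
```

With this ordering, a transfer that reports an error after the file has fully arrived is treated as a success, and a reported success with no valid file at the destination is treated as a failure and retried.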