Description
The DUNE collaboration has an ongoing production effort to simulate the full detectors and to analyze data from the various prototypes that are currently running. Rucio is used to manage the 40 PB of files produced to date. When 500 or more jobs sent output to Rucio simultaneously via Rucio upload, we observed timeouts, unhandled exceptions, and Rucio server restarts due to slow performance. In collaboration with the core Rucio team, we carried out a full review of the Rucio upload code and identified several possible optimizations. We have also deployed the Ingress load balancer in front of our Rucio servers and added a database connection pooling utility. These changes led to significant improvements in both reliability and scalability, yet we anticipate that even better performance will eventually be required. In this paper we describe the initial state of the system, the debugging processes that were used, and our plans to further improve scalability.