Description
The increasing computational scale and complexity of frontier scientific experiments, such as the ATLAS experiment at the Large Hadron Collider, continue to motivate operational models that are resilient, automated, reproducible, and scalable. The University of Victoria (UVic) remains at the forefront of advancing cloud-native deployment patterns to address these challenges. Previous work established a cloud-native architecture for a complete ATLAS Tier 2 site on Kubernetes, including a functional prototype EOS storage element, but relied on a basic IPv4-only network design for the Kubernetes cluster.

To overcome the scalability and performance limitations associated with load balancing and software-defined routing in OpenStack, and to satisfy ATLAS inter-site connectivity requirements, we designed a new cluster network architecture using direct-attached IPv6 addresses. By switching to eBPF-based technology in the container network plane, we improved its performance, scalability, observability, and robustness, and streamlined service routing. We also migrated to an advanced load balancer capable of locality-aware address assignment, reducing latency and eliminating redundant lateral traffic flows within the cluster.

Following these enhancements, we assess bandwidth scalability through benchmarks and demonstrate a significant performance optimization using the EOS shared-filesystem redirection feature for direct CephFS access. Finally, we describe further improvements to the EOS Helm chart and the operational benefits, observed in production, of a fully containerized cloud-native deployment.