Description
The Kubernetes platform operated by CERN IT has supported scientific computing, online services and accelerator controls since 2016. It provides fully automated deployment and management of clusters, natively integrated with CERN storage systems (CVMFS, EOS, AFS, Ceph), authentication (SSO, Kerberos) and networking. Today the service spans more than 600 clusters across CERN’s two main datacenters and air-gapped environments within the Technical Network (TN), serving campus services, experiment workflows and critical accelerator applications.
This contribution reviews the evolution of the service, lessons learned from long-term production usage, and its adaptation to ongoing changes in Kubernetes and its ecosystem. We highlight operational improvements driven by increasing user diversity, scaling requirements and the need to reduce technical debt. As more sites in WLCG and the wider HEP community move to a similar stack, these lessons should be broadly applicable.
We then present the next-generation architecture, centred on ClusterAPI to unify and automate cluster provisioning and lifecycle management. In addition, a fully GitOps-based approach using ArgoCD enables automated management and upgrades of cluster add-ons, improving reproducibility, maintainability and service scalability across heterogeneous environments.
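To illustrate the GitOps idea in general terms, the following is a minimal conceptual sketch of a reconciliation step: the add-on versions declared in Git are compared with what is installed on a cluster, and the difference becomes a list of actions. It is not the CERN implementation and does not use the Argo CD API. The function names `plan_sync` and `apply_plan`, and the add-on names and versions in the usage note, are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from typing import Optional

# One action: (operation, add-on name, target version or None for removal).
Action = tuple[str, str, Optional[str]]


def plan_sync(desired: dict[str, str], live: dict[str, str]) -> list[Action]:
    """Compare the state declared in Git with the state on a cluster.

    Both arguments map an add-on name to a version string.
    """
    actions: list[Action] = []
    for name, version in sorted(desired.items()):
        if name not in live:
            actions.append(("install", name, version))
        elif live[name] != version:
            actions.append(("upgrade", name, version))
    # Anything on the cluster that Git no longer declares is removed ("pruned").
    for name in sorted(live):
        if name not in desired:
            actions.append(("remove", name, None))
    return actions


def apply_plan(live: dict[str, str], actions: list[Action]) -> dict[str, str]:
    """Return the cluster state after applying the actions (input is not mutated)."""
    updated = dict(live)
    for operation, name, version in actions:
        if operation == "remove":
            updated.pop(name)
        else:
            updated[name] = version
    return updated
```

For example, with a hypothetical desired state `{"cert-manager": "1.14", "ingress": "4.10"}` and a live state `{"ingress": "4.9", "legacy": "0.1"}`, `plan_sync` yields one install, one upgrade and one removal, and running it again after `apply_plan` yields an empty plan. That idempotence is what makes add-on management reproducible across many clusters. Real tools add safeguards on top of this; in Argo CD, for instance, automatic pruning is opt-in.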