Description
We present a series of case studies analyzing real-world network incidents within the Worldwide LHC Computing Grid (WLCG) infrastructure using traceroute and performance data from perfSONAR. Our methodology combines path-based anomaly detection with latency and throughput monitoring to identify routing disruptions and topological changes and to correlate them with performance degradation. The approach highlights common patterns such as persistent path inflation, detours via suboptimal transit networks, and silent degradations observable only through structural path analysis.
All case studies are linked to operator-confirmed events, demonstrating how integrated data analytics can support incident diagnosis and monitoring. We also introduce a community-maintained log of known or suspected incidents to foster collaborative validation. This work underscores the operational benefits of proactive, data-driven approaches to network reliability in large-scale distributed infrastructures like WLCG.
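As an illustration of the path-based idea described above, the following is a minimal sketch and not the method used in this work. It assumes traceroute samples have already been extracted into plain dictionaries; the field names (`ts`, `hops`, `rtt_ms`), the function name `detect_path_anomalies`, and the `rtt_factor` threshold are all hypothetical. The most frequent path is taken as the baseline, and each deviation is reported with its hop-count inflation and whether the latency rose alongside it.

```python
from collections import Counter
from statistics import median


def detect_path_anomalies(samples, rtt_factor=1.5):
    """Flag samples whose path differs from the most common (baseline) path.

    samples: list of dicts with hypothetical keys
        'ts' (timestamp), 'hops' (sequence of hop identifiers), 'rtt_ms' (float).
    rtt_factor: assumed threshold; a deviation counts as degraded when its
        RTT exceeds rtt_factor times the median RTT seen on the baseline path.
    """
    # Baseline = the path observed most often for this source/destination pair.
    path_counts = Counter(tuple(s["hops"]) for s in samples)
    baseline, _ = path_counts.most_common(1)[0]

    # Reference latency measured only while traffic followed the baseline path.
    base_rtt = median(
        s["rtt_ms"] for s in samples if tuple(s["hops"]) == baseline
    )

    events = []
    for s in samples:
        hops = tuple(s["hops"])
        if hops == baseline:
            continue
        events.append({
            "ts": s["ts"],
            # Positive values indicate path inflation relative to the baseline.
            "extra_hops": len(hops) - len(baseline),
            # Hops never seen on the baseline path (a possible detour).
            "new_hops": sorted(set(hops) - set(baseline)),
            # True when the path change coincides with higher latency.
            "degraded": s["rtt_ms"] > rtt_factor * base_rtt,
        })
    return baseline, events
```

A deviation with `degraded` set to `False` corresponds to a change that is visible only in the path structure, while one with `degraded` set to `True` links a routing change to a measurable performance drop.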