The end-to-end network performance troubleshooting cookbook

The growth in size and geographical distribution of scientific collaborations enables researchers to achieve ever more ambitious results, but it also poses new technological challenges. One of these is the additional effort needed to analyse and troubleshoot network flows that travel thousands of miles and traverse several different network domains. While day-to-day multi-domain monitoring, fault detection and fault handling procedures are firmly established and agreed upon by network operators in the R&E community, users still need more sophisticated end-to-end traffic analysis and troubleshooting, since network providers do not always have specific tools in place for this category of problems.
The well-known perfSONAR framework offers users several testing instruments for measuring transmission parameters such as latency, jitter and end-to-end throughput. Although it is highly effective in testing the path between two networks, proper end-to-end monitoring between two production servers is beyond the reach of perfSONAR.
Indeed, no single testing tool, whether software or hardware, can capture a complete end-to-end performance analysis covering all the pieces that take part in a data transfer between two end points. A data movement involves a long series of interactions between several components and domains: it starts at a storage device (which can itself be divided into smaller parts: hard disk, controller, FC network, etc.), passes through a CPU, a network interface card, a switch (more likely, a whole LAN), a firewall, a router, a network provider and one or more backbone networks, and then follows the reverse sequence when approaching the other end of the transmission. On top of this come the many software elements involved, including the science-specific applications, the authentication and authorization tools, the operating system sub-components, and so on.
It is therefore clear that meeting this challenge requires a set of techniques and good practices that draw on different tools, each able to interact with the different layers and domains that collectively form the end-to-end application data transfer.
We will present a structured and systematic approach to complete and effective network performance analysis, which can be carried out by any network or system manager with proper access rights to the local infrastructure. The talk will explain the different domains in which the analysis needs to be performed and identify the most appropriate tools to use and parameters to measure, which together are likely to reveal where along the path the problem lies.

