Summary
Monitoring and measurement must go beyond merely verifying that the installed system network is functional and that its performance meets the design specifications. The system will regularly be operated above its design point as applications are refined to consume all available computing and network resources optimally. Even before that stage, users will typically relax the filtering criteria to accept an increasing number of events until something fails, for example saturation due to insufficient computing power, or network overload and subsequent backpressure. For such a network, being overloaded is therefore the norm: links running continuously at 100% are to be expected, and packet discards in the thousands per second will happen. Throughout all this, the network monitoring system must still be able to distinguish real problems from self-inflicted ones, and to identify traffic anomalies that may be preventing the system from going even faster. Although redundancy is built in, equipment failure will still cause performance loss, so the appropriate repairs must be identified rapidly and accurately. Experience has shown that analysis of individual or aggregate traffic flows is extremely useful for system diagnosis and performance measurement, but the sheer size of the system raises scaling issues for any of the traditional measurement techniques. We describe the approach taken to meet these challenges.
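The distinction drawn above between real and self-inflicted problems can be illustrated with a minimal sketch. The thresholds, counter names, and classification rules below are illustrative assumptions, not the actual method described in the talk: they merely show how per-link counters might be used to tell expected saturation on a deliberately overdriven network apart from a genuine fault.

```python
from dataclasses import dataclass

@dataclass
class LinkSample:
    """One polling-interval snapshot of a link's counters (hypothetical schema)."""
    utilization: float      # fraction of line rate in use, 0.0 to 1.0
    discards_per_s: float   # packets discarded per second
    errors_per_s: float     # CRC/framing errors per second

def classify(sample: LinkSample) -> str:
    """Heuristic sketch: separate expected overload from genuine faults.

    On a network that is overloaded by design, discards on a saturated
    link are the norm, while discards or errors on an underutilized link
    point to a real problem worth investigating.
    """
    if sample.errors_per_s > 0:
        # Physical-layer errors are never an expected consequence of load.
        return "fault"
    if sample.discards_per_s > 0 and sample.utilization < 0.9:
        # Dropping packets while well below saturation: a traffic anomaly.
        return "anomaly"
    if sample.utilization >= 0.99:
        # Running flat out is the intended operating point, not a problem.
        return "expected-saturation"
    return "healthy"
```

In practice such rules would be evaluated continuously over every link, with the thresholds tuned to the design point of the system rather than the fixed values used here.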