Modern hardware is trending toward increasingly parallel and heterogeneous architectures. In contemporary machines, processors are spread across multiple sockets, and each socket can access some of the system memory faster than the rest, resulting in non-uniform memory access (NUMA). Utilizing these NUMA machines efficiently is becoming increasingly important. This paper examines the latest Intel Skylake and Xeon Phi NUMA node architectures and indicates possible performance problems for multi-threaded data processing applications caused by the kernel thread migration (TM) mechanism, which is designed to optimize power consumption. We discuss the NUMA-aware CLARA workflow management system, which defines the proper level of vertical scaling and process affinity by associating CLARA worker threads with particular processor cores. By minimizing thread migration and context-switching costs among cores, we were able to improve data locality and reduce cache-coherency traffic among the cores, resulting in sizable performance improvements.