Speaker
Description
The Notre Dame CMS XRootD storage element, originally designed for traditional CMSSW workloads, experienced heavy I/O-wait saturation under new data analysis workloads based on columnar analysis frameworks. These workloads, which use tools such as Uproot to load data into structures such as Awkward Arrays, have fundamentally changed the site's I/O profile. This presentation first shows how the initial bottleneck was addressed through Linux kernel-level tuning and a major revamp of the XRootD scheduler configuration, changing its behavior from the default asynchronous mode (designed to balance interactive and batch jobs) to a more dynamic threaded model, better suited to this high-throughput, high-concurrency workload type.
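As a rough illustration of the kind of scheduler change described above, the fragment below sketches the relevant XRootD directives. The thread-pool values shown are illustrative placeholders, not the actual Notre Dame production settings:

```
# xrootd config fragment -- illustrative values only, not production settings.
# Disable the default asynchronous I/O path so requests are served
# synchronously by dedicated scheduler threads instead.
xrootd.async off

# Size the scheduler thread pool for high concurrency:
# minimum threads, maximum threads, available-thread floor, idle timeout.
xrd.sched mint 64 maxt 2048 avlt 32 idle 780
```

With a larger, elastically sized thread pool, many concurrent columnar-analysis requests can be serviced in parallel rather than queuing behind the asynchronous event path.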
While these optimizations resolved the I/O-wait bottleneck and dramatically improved performance, they exposed a new bottleneck as transfer request volume grew again: CPU saturation.
The second part of this work presents an analysis of XRootD logs correlating this CPU load with the characteristics of the columnar analysis workloads: a shift from sequential ofs_read operations to a profile dominated by large numbers of vectorized read (readV) requests. The analysis shows a direct link between the number of readV requests per TCP connection and server CPU load, identifying this as the new performance-limiting factor. We outline our multi-stage tuning approach, discuss the analysis of these bottlenecks, and present a capacity planning model to help sites provision for these intensive columnar analysis workloads.
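A minimal sketch of the kind of capacity planning model described above, assuming a simple linear cost model in which server CPU load scales with the aggregate readV rate. The per-request CPU cost and utilization target are hypothetical placeholders; in practice they would be fitted from the XRootD log analysis:

```python
import math

def cores_needed(readv_per_sec, cpu_ms_per_readv, target_util=0.7):
    """Estimate CPU cores required to absorb an aggregate readV rate.

    readv_per_sec    -- expected aggregate readV requests per second
    cpu_ms_per_readv -- fitted CPU cost per readV, in milliseconds
                        (placeholder; derived from log analysis in practice)
    target_util      -- keep average CPU utilization below this fraction
    """
    # CPU-seconds consumed per wall-clock second across the request stream.
    busy_cores = readv_per_sec * cpu_ms_per_readv / 1000.0
    # Leave headroom so bursts do not push the server into saturation.
    return math.ceil(busy_cores / target_util)

# Example: 20,000 readV/s at an assumed 0.5 ms each, 70% target utilization.
print(cores_needed(20000, 0.5))  # -> 15
```

The same linear form can be inverted to set a per-server admission limit on readV requests per TCP connection, which is the quantity the log analysis identifies as the performance-limiting factor.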