Multicore processing
LCG Workshop - 20080415
Peter Elmer - Princeton University
CMS - Multicore processing
Given that our processing problem is inherently embarrassingly parallel (on
events), the transition from (multiple) single-core CPU's to (multiple)
multi-core CPU's was more natural for us than for many other applications
originally designed for single-core machines (e.g. video games, processing
applications and other desktop software). We simply run N copies on N cores,
as we ran N copies on N CPU's, with minor tweaks to our workload management
(WM) systems, queue configurations, etc. At least for small numbers of cores
this has permitted us to exploit all available cores at the > 90% level.
The purpose of this presentation is to summarize some (probably not all)
of the possibilities we are considering for exploiting multicore CPU's
in some way different from how we do so today.
Since we use multicore CPU's today, what are the reasons why we might change
something? We believe there are two:
- Physical memory (PM) - Our model for exploiting N core CPU's
requires N times the physical memory needed to run a single application.
Our applications have large memory footprints, however, reaching
1-2 GB, depending on the particular (combined) workflow we are running.
- If we coherently run different instances of the same
application/release, we benefit from the fact that the OS will
only load a single copy of each library. (This is often a 200MB+
contribution to the observed VSIZE.)
- We have made some progress in reducing the footprint recently:
I/O-related optimizations, reduced calibration sizes, etc.
Further improvements are probably possible (and should happen
in any case), but we are still at the limit of the memory budget
per core that we had originally planned.
- It is also probable that some increases in memory will be needed
for various reasons in the future (we don't even have real colliding
beam data yet).
- For some applications (e.g. Heavy-Ion, many-output skims, full
combined simulation workflows, etc.) we are pushing (well) beyond
the limit today.
- At this point this is primarily a cost tradeoff: it depends on future
increases in memory needs and on whether memory prices will fall faster
than the number of cores increases (to match the planned Moore's-Law
increases in computing power). Where we can, we buy memory instead of
additional cores, storage, etc. Responding to new memory needs by pushing
beyond the limit at some sites is, however, constrained by purchasing
cycles and delivery times, and this can be disruptive to operational
plans.
- This problem is reasonably well understood today (and can be
trivially updated for new releases and applications) with simple
tools (batch system memory use reports, top, ps, pmap, massif)
which are used routinely by a number of people. The relationship
between what we measure and the code is well understood in
most cases.
- If a significant amount of pressure could be removed from the
PM problem, using the native 64-bit build more widely might be
possible, with some gain in cpu-time/event. The last simple
benchmarks I did (for reco, IIRC) showed a 20-25% improvement,
although this needs to be redone with more recent releases.
- CPU memory caches (CC) - By running the full application per core,
we have structured things such that the maximum amount of
instructions/data needs to go down through the memory hierarchy
for every single core. Shared (e.g. L2) caches between cores mean
that the naive model that N cores on a single die behave as N
single-core CPU's can break down.
- This problem is much less well-characterized than
the PM problem above as the tools are more complex, they usually
require special kernel patches, their interpretation is
cpu-dependent, etc.
- To first order we have only very basic information for a few
limited applications:
- Some initial studies by Lassi Tuura with CMSSW_1_3_0 (nearly
a year ago) for the reconstruction application using perfmon
and looking at the CPU-level performance counters
- Follow-on observations that the total code size is extremely
(and probably unnecessarily) bloated
- valgrind cachegrind reports about cache misses (recall that
this is just a simulation, though)
- igprof total memory use reports (observations that the dynamic
memory use is extremely bloated). Much work has gone into
reducing this, but there is still much to do.
All but the last of these have received no systematic
attention. The relationship with what could be changed in the
code, if anything, is much less well understood, even at the level
of a single application running on the machine. How much we could
gain is also not clear. For multi-core effects we know basically
nothing.
- First indications of any degradation should, however, be visible
through the HEPiX-style benchmark of running 1, 2, ..., N
applications on N cores and looking for non-linear scaling of
cpu-time/event. Note that any observation of an effect
with such a benchmark does not tell us anything
about what is actually happening; we would need to return
to more specific tools, as above, to actually understand that
before embarking on implementing some solution. (A toy version of
such a scaling harness is sketched below.)
[Otherwise we are simply chit-chat theorizing (requires beer).]
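For illustration only, a toy version of such a scaling harness (plain C++/POSIX,
not the HEPiX benchmark itself, with a dummy kernel standing in for a real
application): it forks N copies of a fixed cpu-bound loop and reports the wall
time; if the wall time grows as N approaches the number of cores, scaling is
non-linear.

    // Toy scaling harness: run N concurrent copies of a cpu-bound kernel.
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>

    static void cpuKernel() {            // stand-in for one application's work
      volatile double x = 0.0;
      for (long i = 0; i < 200000000L; ++i) x += 1.0 / (i + 1);
    }

    int main(int argc, char** argv) {
      int n = (argc > 1) ? atoi(argv[1]) : 1;    // number of concurrent copies
      time_t start = time(0);
      for (int i = 0; i < n; ++i)
        if (fork() == 0) { cpuKernel(); _exit(0); }
      for (int i = 0; i < n; ++i) wait(0);
      printf("%d copies: %ld s wall time\n", n, (long)(time(0) - start));
      return 0;
    }

Running it for N = 1, 2, ..., ncores and comparing the wall times (which should
stay flat under perfect scaling) mimics the structure of the benchmark; a real
measurement would of course use the actual applications and cpu-time/event.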
Possible Changes
CMS has not done any specific work thus far on changing how we exploit
multiple cores, although a number of possibilities have been discussed.
In order of granularity:
- (1) Modifications to put the calibrations, magnetic field and geometry
into shared memory, to be used by multiple cmsRun applications
- We know (e.g. from the massif reports) that a reasonably large
fraction of the heap use in our applications comes from such
things, and that they are common between applications (currently
duplicated), like the shared libraries, when we coherently run
N instances on N cores.
- Should help with the PM problem, not obviously related to the CC
problem
- Could perhaps (for example) be done by some refactoring of the
EventSetup/CORAL/frontier stack such that multiple (heavy-weight
process) cmsRun applications communicate with some stub application
which loads the constants into shared memory. (A toy sketch of the
shared-memory mechanism itself follows this item.)
- This solution has the advantage that the code changes should
all be in "Framework" code and not in general reco/sim/analysis
code.
- A merge of the outputs of the client cmsRun applications could
be done directly on the WN before stage-out, reducing load on
the storage systems.
- It does not help with external packages (primarily Geant4/ROOT)
which have their own common data
- Since average processing time per event is similar, running one such
heavy-weight cmsRun "client" per core will still result in a
well-balanced processing load across the cores. (The behaviour
is not application specific.)
- A variant of this solution, where additional cmsRun's are forked
after startup (and after loading of the initial calibrations, etc.), was
also considered, since it could exploit copy-on-write from the OS
and thus address the PM problem. This solution is unlikely to
work, however, as there is no obvious single place to fork: we load
many things during event processing in the offline applications.
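For concreteness, a toy sketch of the shared-memory mechanism itself (plain
C++/POSIX; the CalibBlock layout and the /cms_calib segment name are invented
for illustration, and this says nothing about how the EventSetup/CORAL/frontier
refactoring would actually look): a stub process publishes a flat block of
constants in POSIX shared memory, and each client maps it read-only, so the
physical pages are shared by all clients on the node.

    // Toy shared-memory calibrations: publish once, map read-only in clients.
    // (compile with -lrt on Linux)
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    struct CalibBlock {                  // hypothetical flat block of constants
      int    nChannels;
      double pedestals[100000];
    };

    // "Stub" side: create and fill the segment once at startup.
    int publishCalibrations() {
      int fd = shm_open("/cms_calib", O_CREAT | O_RDWR, 0644);
      if (fd < 0) { perror("shm_open"); return 1; }
      ftruncate(fd, sizeof(CalibBlock));
      void* p = mmap(0, sizeof(CalibBlock), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }
      CalibBlock* blk = (CalibBlock*) p;
      blk->nChannels = 100000;
      for (int i = 0; i < blk->nChannels; ++i)
        blk->pedestals[i] = 0.0;         // in reality, filled from the conditions DB
      munmap(blk, sizeof(CalibBlock));
      close(fd);
      return 0;
    }

    // "Client" side: map the same segment read-only.
    const CalibBlock* attachCalibrations() {
      int fd = shm_open("/cms_calib", O_RDONLY, 0);
      if (fd < 0) { perror("shm_open"); return 0; }
      void* p = mmap(0, sizeof(CalibBlock), PROT_READ, MAP_SHARED, fd, 0);
      close(fd);
      if (p == MAP_FAILED) { perror("mmap"); return 0; }
      return (const CalibBlock*) p;
    }

    int main() {                         // demonstrate both sides in one process
      if (publishCalibrations() != 0) return 1;
      const CalibBlock* calib = attachCalibrations();
      if (calib) printf("mapped %d channels\n", calib->nChannels);
      return 0;
    }

In a real deployment the two sides would live in separate processes, the stub
would manage segment names and lifetimes, and the block layout would have to be
versioned against the release; that is where most of the actual work would be.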
- (2) Parallelization of event processing in multiple cmsRun worker threads
- Each thread processes an entire event, as a single cmsRun does
today
- The primary advantage here is that calibrations, magnetic field,
etc. would be shared between the threads, as in the previous
possibility, thus the PM problem is addressed. Again it isn't
obvious that there is any effect on the CC problem.
- Since average processing time per event is similar, running one such
cmsRun event-processing worker thread per core will still result
in a well-balanced processing load across the cores. (The behaviour
is not application specific.)
- Compared with the previous solution, this one has the advantage
that the input/output modules could be common between
the threads. (It reduces PM use further than the previous solution,
needs fewer open connections to the storage system for inputs, removes
the need for an explicit merge of outputs, etc.) It is, however, more
heavily dependent on proper thread-safe behaviour of ROOT in particular.
(A toy sketch of the event-loop structure follows this item.)
- Unlike the previous possibility, this method would place
additional requirements for thread-safe behaviour on all CMSSW
code, all externals used by CMSSW code, etc.
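For concreteness, a toy sketch of the event-loop structure for this option
(plain C++/pthreads; processEvent and the shared event counter are invented
stand-ins, and all the real thread-safety issues in CMSSW/ROOT are simply
ignored): each worker thread claims the next event from a shared counter and
processes it in full, while anything loaded once in the process (calibrations,
field, geometry) is automatically shared.

    // Toy event-level threading: one full event per worker thread.
    // (compile with -lpthread)
    #include <pthread.h>
    #include <cstdio>

    static const int kNEvents = 1000;
    static int nextEvent = 0;                            // shared "input source"
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void processEvent(int) {                      // stand-in for full per-event work
      volatile double x = 0.0;
      for (int k = 0; k < 1000000; ++k) x += k * 1e-6;
    }

    static void* worker(void*) {
      while (true) {
        pthread_mutex_lock(&lock);
        int i = (nextEvent < kNEvents) ? nextEvent++ : -1;   // claim the next event
        pthread_mutex_unlock(&lock);
        if (i < 0) break;
        processEvent(i);
      }
      return 0;
    }

    int main() {
      const int nThreads = 4;                            // one worker per core
      pthread_t t[nThreads];
      for (int i = 0; i < nThreads; ++i) pthread_create(&t[i], 0, worker, 0);
      for (int i = 0; i < nThreads; ++i) pthread_join(t[i], 0);
      printf("processed %d events with %d worker threads\n", kNEvents, nThreads);
      return 0;
    }

The load-balancing argument above is visible here: as long as the average time
per event is similar, the workers stay equally busy regardless of which events
they happen to get.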
- (3) Parallelization of module processing in multiple cmsRun worker threads
- Using the framework scheduler, we could process a single event
at a time across multiple cores by running (groups of) modules
processing the data from a given event simultaneously.
- Addresses the PM problem and could perhaps address in some way
(to be determined) CC issues
- It is much more difficult to balance the processor load to fully
utilize multiple cores, since in (for example) the reconstruction
a few modules dominate the processing time, with many things depending
on (for example) track reconstruction. Significant variation with
event complexity (i.e. primary dataset) will also be present. (See
the illustrative numbers after this item.)
- In general the cpu-time/module will not be known until run-time,
making the scheduling a bit complex for anything other than a few standard
applications. (Will this job run effectively with 2, 3, 4, ...
worker threads/cores? i.e. the behaviour is very application
specific.)
- Would place additional requirements for thread-safe behaviour on
all CMSSW code, all externals used by CMSSW code, etc.
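A purely illustrative example of the load-balancing limit (the numbers are
invented, not measurements): if a single module, say the dominant tracking
step, accounts for a fraction s = 0.5 of the per-event cpu-time and cannot
itself be subdivided, then by Amdahl's law the speedup on N cores is at most
1/(s + (1-s)/N), i.e. about 1.6 for N = 4 and never more than 1/s = 2 however
many cores are available. The dominant modules therefore put a hard ceiling on
what module-level parallelism can deliver for a given application.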
- (4) Fine-grained (sub-module) threading to process on multiple cores
- Rather than use parallel threads at the FWK-level (eventprocessor,
module), as in the previous two possibilities, localized threads
within specific modules could be used to parallelize parts
of the expensive processing done by that module (e.g. track
finding/fitting).
- Various implementations are possible.
- Still requires that some parts of CMSSW and the externals are
thread-safe, but less universally so than the previous two
solutions.
- This is of course very application specific, and achieving full
utilization/balancing of the processing load on all of the cores is
less obvious. Depending on the specific choices, it should improve
the PM problem somewhat and (perhaps, in theory) somewhat improve
CC-type problems in certain cases. (A toy sketch of such a localized
parallel loop follows this item.)
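For concreteness, a toy sketch of such a localized parallel loop (plain
C++/pthreads; TrackCandidate, fitTrack and fitAllCandidates are invented
stand-ins, not CMSSW interfaces): a single module splits its loop over track
candidates across a few local threads, joins them, and then continues
serially, so the threading stays contained inside that one module.

    // Toy sub-module threading: split a loop over track candidates.
    // (compile with -lpthread)
    #include <pthread.h>
    #include <vector>
    #include <cstdio>

    struct TrackCandidate { double chi2; };

    static void fitTrack(TrackCandidate& t) {        // stand-in for the expensive fit
      volatile double x = 0.0;
      for (int k = 0; k < 100000; ++k) x += k * 1e-9;
      t.chi2 = x;
    }

    struct Range { std::vector<TrackCandidate>* tracks; size_t begin; size_t end; };

    static void* fitRange(void* arg) {
      Range* r = (Range*) arg;
      for (size_t i = r->begin; i < r->end; ++i) fitTrack((*r->tracks)[i]);
      return 0;
    }

    // Called from inside one "module": fit all candidates with nThreads local threads.
    void fitAllCandidates(std::vector<TrackCandidate>& tracks, int nThreads) {
      std::vector<pthread_t> threads(nThreads);
      std::vector<Range> ranges(nThreads);
      size_t chunk = tracks.size() / nThreads + 1;
      for (int i = 0; i < nThreads; ++i) {
        size_t b = i * chunk;
        if (b > tracks.size()) b = tracks.size();
        size_t e = (b + chunk < tracks.size()) ? b + chunk : tracks.size();
        ranges[i].tracks = &tracks; ranges[i].begin = b; ranges[i].end = e;
        pthread_create(&threads[i], 0, fitRange, &ranges[i]);
      }
      for (int i = 0; i < nThreads; ++i) pthread_join(threads[i], 0);
    }

    int main() {
      std::vector<TrackCandidate> tracks(500);
      fitAllCandidates(tracks, 4);
      printf("fitted %d candidates\n", (int) tracks.size());
      return 0;
    }

Whether anything is gained in practice depends on how much of the module's
time such a loop actually covers and on the per-event overhead of creating and
joining the threads.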
- (5) Geant4 changes
- Geant4 is well encapsulated within our code and it is also a fairly
mature component of our system
- It is (and will be even after first data) also an important cpu
cost for CMS. Per the computing model it is also the one we
run at all sites (and in particular T2's), so it is where we hit the
PM problem most often (CERN, FNAL and some of the T1's have more
than the prescribed 1GB/core, whereas T2's usually have only the
prescribed memory/core).
- Solutions within Geant4 to parallelize processing across multiple
cores would presumably have far fewer implications for thread-safety
within CMSSW than solutions 2 and 3 above.
- (6) Localized I/O related changes
- For the specific case of I/O, other more specialized changes might
be possible.
- For example, storage will be a problem and there might be some
possibilities for additional compression, at additional cpu
cost, which could be run asynchronously on a separate thread. (I
think the ROOT team was considering things like this.) This
might be interesting in certain CMS applications like
repacking in the Tier-0, but what we care about at the moment
is total throughput and its cost, not time-to-run for a single
application. (A toy sketch of off-loading compression to a thread
follows this item.)
- Other solutions (e.g. for skimming, with multiple outputs and
a large dynamic range on each output) could involve carving
off the output module into a separate application and multiple
child cmsRun applications feeding that.
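For concreteness, a toy sketch of off-loading compression to a thread (plain
C++/pthreads, using zlib's compress() only as a stand-in for whatever heavier
compression might actually be applied; the single-buffer hand-off is
deliberately over-simplified): the main thread hands a buffer of serialized
data to a worker and can in principle carry on with the next event while the
worker compresses.

    // Toy asynchronous compression on a separate thread.
    // (compile with -lpthread -lz)
    #include <pthread.h>
    #include <zlib.h>
    #include <vector>
    #include <cstdio>

    struct CompressJob {
      const std::vector<unsigned char>* input;
      std::vector<unsigned char> output;
    };

    static void* compressWorker(void* arg) {
      CompressJob* job = (CompressJob*) arg;
      uLongf outLen = compressBound(job->input->size());
      job->output.resize(outLen);
      // A heavier compression level would trade extra cpu for smaller output;
      // the default level is used here.
      compress(&job->output[0], &outLen, &(*job->input)[0], job->input->size());
      job->output.resize(outLen);
      return 0;
    }

    int main() {
      std::vector<unsigned char> event(1000000, 42);   // stand-in for serialized event data
      CompressJob job;
      job.input = &event;

      pthread_t worker;
      pthread_create(&worker, 0, compressWorker, &job);
      // ... the main thread could go on preparing the next event here ...
      pthread_join(worker, 0);

      printf("compressed %lu bytes to %lu bytes\n",
             (unsigned long) event.size(), (unsigned long) job.output.size());
      return 0;
    }

For the throughput-per-cost question raised above, the relevant comparison is
not this single-job latency but whether the extra cpu spent compressing buys
back enough storage and network to be worth it.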
Conclusions and plans
- Much of our focus to date has been on improving the performance of
single applications running on a single core. We of course intend to
continue this work:
- Reducing dynamic memory use and footprint (an ongoing battle)
- Dealing with code size bloat (larger libraries or static binaries
for some applications)
- Profiles using cpu performance counters
- Of the solutions listed above, the short/medium term one we might
attempt is number 1 (shared memory for calibrations, etc.).
- Attempting the others (2,3,4) in CMSSW is probably both premature and
difficult at the moment. The rate at which CMSSW is growing/changing
(likely to continue for some time) is too high to impose stringent
requirements on thread-safety widely within the code base.
- Given its encapsulation we would welcome some solution
like number 5 (i.e. in Geant4). It could prove an interesting test
case for working through the Workflow Management and bookkeeping
issues for multicore applications with LCG/OSG and the sites (not
described above, but likely to require some effort).
- It is not specifically obvious what would help us in ROOT. (Since we
have a broad cross-section to changes in ROOT, we would also like to
understand in advance any changes that are made there for non-CMS
purposes.)