Multicore processing
LCG Workshop - 20080415
Peter Elmer - Princeton University
CMS - Multicore processing
Given that our processing problem is inherently embarrassingly parallel (on
events), the transition from (multiple) single-core CPU's to (multiple)
multi-core CPU's was more natural for us than for many other applications
originally designed for single-core machines (e.g. video games, processing
applications and other desktop software). We simply run N copies on N cores,
as we ran N copies on N CPU's, with minor tweaks to our workload management
(WM) systems, queue configurations, etc. At least for small numbers of cores
this has permitted us to exploit all available cores at the > 90% level.
The purpose of this presentation is to summarize some (probably not all)
of the possibilities we are considering for exploiting multicore CPU's
in some way different from how we do so today.
Since we use multicore CPU's today, what are the reasons why we might change
something? We believe there are two:
- Physical memory (PM) - Our model for exploiting N core CPU's
requires N times the physical memory needed to run a single application.
Our applications have large memory footprints, however, reaching
1-2 GB, depending on the particular (combined) workflow we are running.
- If we coherently run different instances of the same
application/release, we benefit from the fact that the OS will
only load a single copy of each library. (This is often a 200MB+
contribution to the observed VSIZE.)
- We have made some progress in reducing the footprint recently:
I/O-related optimizations, reduced calibration sizes, etc.
Further improvements are probably possible (and should happen
in any case), but we are still at the limit of the memory budget
per core that we had originally planned.
- It is also probable that some increases in memory will be needed
for various reasons in the future (we don't even have real colliding
beam data yet).
- For some applications (e.g. Heavy-Ion, many-output skims, full
combined simulation workflows, etc.) we are pushing (well) beyond
the limit today.
- At this point this is primarily a cost tradeoff: it depends on future
increases in memory needs and on whether memory prices will fall faster
than the number of cores increases (to match the planned Moore's-Law
increases in computing power). Where we can, we buy memory instead of
additional cores, storage, etc. Responding to new memory needs by pushing
beyond the limit at some sites is, however, constrained by purchasing
cycles and delivery times, and this can be disruptive to operational
plans.
- This problem is reasonably well understood today (and can be
trivially updated for new releases and applications) with simple
tools (batch system memory use reports, top, ps, pmap, massif)
which are used routinely by a number of people. The relationship
between what we measure and the code is well understood in
most cases.
- If a significant amount of pressure could be removed from the
PM problem, using the native 64-bit build more widely might be
possible, with some gain in cpu-time/event. The last simple
benchmarks I did (for reco, IIRC) showed a 20-25% improvement,
although this needs to be redone with more recent releases.
- CPU memory caches (CC) - By running the full application per core,
we have structured things such that the maximum amount of
instructions/data needs to go down through the memory hierarchy
for every single core. Shared (e.g. L2) caches between cores mean
that the naive model that N cores on a single die behave as N
single-core CPU's can break down.
- This problem is much less well-characterized than
the PM problem above as the tools are more complex, they usually
require special kernel patches, their interpretation is
cpu-dependent, etc.
- To first order we have only very basic information for a few
limited applications:
- Some initial studies by Lassi Tuura with CMSSW_1_3_0 (nearly
a year ago) for the reconstruction application using perfmon
and looking at the CPU-level performance counters
- Follow-on observations that the total code size is extremely
(and probably unnecessarily) bloated
- valgrind cachegrind reports about cache misses (recall that
this is just a simulation, though)
- igprof total memory use reports (observations that the dynamic
memory use is extremely bloated). Much work has gone into
reducing this, but there is still much to do.
All but the last of these have received no systematic
attention. The relationship with what could be changed in the
code, if anything, is much less well understood, even at the level
of a single application running on the machine. How much we could
gain is also not clear. For multi-core effects we know basically
nothing.
- First indications of any degradation should, however, be visible
through the HEPiX-style benchmark of running 1, 2, ..., N
applications on N cores and looking for non-linear scaling of
cpu-time/event. Note that any observation of an effect
with such a benchmark does not tell us anything
about what is actually happening; we would need to return
to more specific tools, as above, to actually understand that
before embarking on implementing some solution. (A toy version of
such a scaling harness is sketched below.)
[Otherwise we are simply chit-chat theorizing (requires beer).]
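For illustration only, a toy version of such a scaling harness (plain C++/POSIX,
not the HEPiX benchmark itself, with a dummy kernel standing in for a real
application): it forks N copies of a fixed cpu-bound loop and reports the wall
time; if the wall time grows as N approaches the number of cores, scaling is
non-linear.

    // Toy scaling harness: run N concurrent copies of a cpu-bound kernel.
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>

    static void cpuKernel() {            // stand-in for one application's work
      volatile double x = 0.0;
      for (long i = 0; i < 200000000L; ++i) x += 1.0 / (i + 1);
    }

    int main(int argc, char** argv) {
      int n = (argc > 1) ? atoi(argv[1]) : 1;    // number of concurrent copies
      time_t start = time(0);
      for (int i = 0; i < n; ++i)
        if (fork() == 0) { cpuKernel(); _exit(0); }
      for (int i = 0; i < n; ++i) wait(0);
      printf("%d copies: %ld s wall time\n", n, (long)(time(0) - start));
      return 0;
    }

Running it for N = 1, 2, ..., ncores and comparing the wall times (which should
stay flat under perfect scaling) mimics the structure of the benchmark; a real
measurement would of course use the actual applications and cpu-time/event.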
Possible Changes
CMS has not done any specific work thus far on changing how we exploit
multiple cores, although a number of possibilities have been discussed.
In order of granularity:
- (1) Modifications to put the calibrations, magnetic field and geometry
into shared memory, to be used by multiple cmsRun applications
- We know (e.g. from the massif reports) that a reasonably large
fraction of the heap use in our applications comes from such
things, and that they are common between applications (currently
duplicated), like the shared libraries, when we coherently run
N instances on N cores.
- Should help with the PM problem, not obviously related to the CC
problem
- Could perhaps (for example) be done by some refactoring of the
EventSetup/CORAL/frontier stack such that multiple (heavy-weight
process) cmsRun applications communicate with some stub application
which loads the constants into shared memory. (A toy sketch of the
shared-memory mechanism itself follows this item.)
- This solution has the advantage that the code changes should
all be in "Framework" code and not in general reco/sim/analysis
code.
- A merge of the outputs of the client cmsRun applications could
be done directly on the WN before stage-out, reducing load on
the storage systems.
- It does not help with external packages (primarily Geant4/ROOT)
which have their own common data
- Since average processing time per event is similar, running one such
heavy-weight cmsRun "client" per core will still result in a
well-balanced processing load across the cores. (The behaviour
is not application specific.)
- A variant of this solution, where additional cmsRun's are forked
after startup (and after loading of the initial calibrations, etc.), was
also considered, since it could exploit copy-on-write from the OS
and thus address the PM problem. This solution is unlikely to
work, however, as there is no obvious single place to fork: we load
many things during event processing in the offline applications.
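For concreteness, a toy sketch of the shared-memory mechanism itself (plain
C++/POSIX; the CalibBlock layout and the /cms_calib segment name are invented
for illustration, and this says nothing about how the EventSetup/CORAL/frontier
refactoring would actually look): a stub process publishes a flat block of
constants in POSIX shared memory, and each client maps it read-only, so the
physical pages are shared by all clients on the node.

    // Toy shared-memory calibrations: publish once, map read-only in clients.
    // (compile with -lrt on Linux)
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    struct CalibBlock {                  // hypothetical flat block of constants
      int    nChannels;
      double pedestals[100000];
    };

    // "Stub" side: create and fill the segment once at startup.
    int publishCalibrations() {
      int fd = shm_open("/cms_calib", O_CREAT | O_RDWR, 0644);
      if (fd < 0) { perror("shm_open"); return 1; }
      ftruncate(fd, sizeof(CalibBlock));
      void* p = mmap(0, sizeof(CalibBlock), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }
      CalibBlock* blk = (CalibBlock*) p;
      blk->nChannels = 100000;
      for (int i = 0; i < blk->nChannels; ++i)
        blk->pedestals[i] = 0.0;         // in reality, filled from the conditions DB
      munmap(blk, sizeof(CalibBlock));
      close(fd);
      return 0;
    }

    // "Client" side: map the same segment read-only.
    const CalibBlock* attachCalibrations() {
      int fd = shm_open("/cms_calib", O_RDONLY, 0);
      if (fd < 0) { perror("shm_open"); return 0; }
      void* p = mmap(0, sizeof(CalibBlock), PROT_READ, MAP_SHARED, fd, 0);
      close(fd);
      if (p == MAP_FAILED) { perror("mmap"); return 0; }
      return (const CalibBlock*) p;
    }

    int main() {                         // demonstrate both sides in one process
      if (publishCalibrations() != 0) return 1;
      const CalibBlock* calib = attachCalibrations();
      if (calib) printf("mapped %d channels\n", calib->nChannels);
      return 0;
    }

In a real deployment the two sides would live in separate processes, the stub
would manage segment names and lifetimes, and the block layout would have to be
versioned against the release; that is where most of the actual work would be.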
- (2) Parallelization of event processing in multiple cmsRun worker threads
- Each thread processes an entire event, as a single cmsRun does
today
- The primary advantage here is that calibrations, magnetic field,
etc. would be shared between the threads, as in the previous
possibility, thus the PM problem is addressed. Again it isn't
obvious that there is any effect on the CC problem.
- Since average processing time per event is similar, running one such
cmsRun event-processing worker thread per core will still result
in a well-balanced processing load across the cores. (The behaviour
is not application specific.)
- Compared with the previous solution, this one has the advantage
that the input/output modules could be common between
the threads. (It reduces PM use further than the previous solution,
needs fewer open connections to the storage system for inputs, removes
the need for an explicit merge of outputs, etc.) It is, however, more
heavily dependent on proper thread-safe behaviour of ROOT in particular.
(A toy sketch of the event-loop structure follows this item.)
- Unlike the previous possibility, this method would place
additional requirements for thread-safe behaviour on all CMSSW
code, all externals used by CMSSW code, etc.
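For concreteness, a toy sketch of the event-loop structure for this option
(plain C++/pthreads; processEvent and the shared event counter are invented
stand-ins, and all the real thread-safety issues in CMSSW/ROOT are simply
ignored): each worker thread claims the next event from a shared counter and
processes it in full, while anything loaded once in the process (calibrations,
field, geometry) is automatically shared.

    // Toy event-level threading: one full event per worker thread.
    // (compile with -lpthread)
    #include <pthread.h>
    #include <cstdio>

    static const int kNEvents = 1000;
    static int nextEvent = 0;                            // shared "input source"
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void processEvent(int) {                      // stand-in for full per-event work
      volatile double x = 0.0;
      for (int k = 0; k < 1000000; ++k) x += k * 1e-6;
    }

    static void* worker(void*) {
      while (true) {
        pthread_mutex_lock(&lock);
        int i = (nextEvent < kNEvents) ? nextEvent++ : -1;   // claim the next event
        pthread_mutex_unlock(&lock);
        if (i < 0) break;
        processEvent(i);
      }
      return 0;
    }

    int main() {
      const int nThreads = 4;                            // one worker per core
      pthread_t t[nThreads];
      for (int i = 0; i < nThreads; ++i) pthread_create(&t[i], 0, worker, 0);
      for (int i = 0; i < nThreads; ++i) pthread_join(t[i], 0);
      printf("processed %d events with %d worker threads\n", kNEvents, nThreads);
      return 0;
    }

The load-balancing argument above is visible here: as long as the average time
per event is similar, the workers stay equally busy regardless of which events
they happen to get.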
- (3) Parallelization of module processing in multiple cmsRun worker threads
- Using the framework scheduler, we could process a single event
at a time across multiple cores by running (groups of) modules
processing the data from a given event simultaneously.
- Addresses the PM problem and could perhaps address in some way
(to be determined) CC issues
- It is much more difficult to balance the processor load to fully
utilize multiple cores, since in (for example) the reconstruction
a few modules dominate the processing time, with many things depending
on (for example) track reconstruction. Significant variation with
event complexity (i.e. primary dataset) will also be present. (See
the illustrative numbers after this item.)
- In general the cpu-time/module will not be known until run-time,
making the scheduling a bit complex for anything other than a few standard
applications. (Will this job run effectively with 2, 3, 4, ...
worker threads/cores? i.e. the behaviour is very application
specific.)
- Would place additional requirements for thread-safe behaviour on
all CMSSW code, all externals used by CMSSW code, etc.
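A purely illustrative example of the load-balancing limit (the numbers are
invented, not measurements): if a single module, say the dominant tracking
step, accounts for a fraction s = 0.5 of the per-event cpu-time and cannot
itself be subdivided, then by Amdahl's law the speedup on N cores is at most
1/(s + (1-s)/N), i.e. about 1.6 for N = 4 and never more than 1/s = 2 however
many cores are available. The dominant modules therefore put a hard ceiling on
what module-level parallelism can deliver for a given application.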
- (4) Fine-grained (sub-module) threading to process on multiple cores
- Rather than use parallel threads at the FWK-level (eventprocessor,
module), as in the previous two possibilities, localized threads
within specific modules could be used to parallelize parts
of the expensive processing done by that module (e.g. track
finding/fitting).
- Various implementations are possible.
- Still requires that some parts of CMSSW and the externals are
thread-safe, but less universally so than the previous two
solutions.
- This is of course very application specific, and achieving full
utilization/balancing of the processing load on all of the cores is
less obvious. Depending on the specific choices, it should improve
the PM problem somewhat and (perhaps, in theory) somewhat improve
CC-type problems in certain cases. (A toy sketch of such a localized
parallel loop follows this item.)
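For concreteness, a toy sketch of such a localized parallel loop (plain
C++/pthreads; TrackCandidate, fitTrack and fitAllCandidates are invented
stand-ins, not CMSSW interfaces): a single module splits its loop over track
candidates across a few local threads, joins them, and then continues
serially, so the threading stays contained inside that one module.

    // Toy sub-module threading: split a loop over track candidates.
    // (compile with -lpthread)
    #include <pthread.h>
    #include <vector>
    #include <cstdio>

    struct TrackCandidate { double chi2; };

    static void fitTrack(TrackCandidate& t) {        // stand-in for the expensive fit
      volatile double x = 0.0;
      for (int k = 0; k < 100000; ++k) x += k * 1e-9;
      t.chi2 = x;
    }

    struct Range { std::vector<TrackCandidate>* tracks; size_t begin; size_t end; };

    static void* fitRange(void* arg) {
      Range* r = (Range*) arg;
      for (size_t i = r->begin; i < r->end; ++i) fitTrack((*r->tracks)[i]);
      return 0;
    }

    // Called from inside one "module": fit all candidates with nThreads local threads.
    void fitAllCandidates(std::vector<TrackCandidate>& tracks, int nThreads) {
      std::vector<pthread_t> threads(nThreads);
      std::vector<Range> ranges(nThreads);
      size_t chunk = tracks.size() / nThreads + 1;
      for (int i = 0; i < nThreads; ++i) {
        size_t b = i * chunk;
        if (b > tracks.size()) b = tracks.size();
        size_t e = (b + chunk < tracks.size()) ? b + chunk : tracks.size();
        ranges[i].tracks = &tracks; ranges[i].begin = b; ranges[i].end = e;
        pthread_create(&threads[i], 0, fitRange, &ranges[i]);
      }
      for (int i = 0; i < nThreads; ++i) pthread_join(threads[i], 0);
    }

    int main() {
      std::vector<TrackCandidate> tracks(500);
      fitAllCandidates(tracks, 4);
      printf("fitted %d candidates\n", (int) tracks.size());
      return 0;
    }

Whether anything is gained in practice depends on how much of the module's
time such a loop actually covers and on the per-event overhead of creating and
joining the threads.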
- (5) Geant4 changes
- Geant4 is well encapsulated within our code and it is also a fairly
mature component of our system
- It is (and will be even after first data) also an important cpu
cost for CMS. Per the computing model it is also the one we
run at all sites (and in particular T2's), so it is where we hit the
PM problem most often (CERN, FNAL and some of the T1's have more
than the prescribed 1GB/core, whereas T2's usually have only the
prescribed memory/core).
- Solutions within Geant4 to parallelize processing across multiple
cores would presumably have far fewer implications for thread-safety
within CMSSW than solutions 2 and 3 above.
- (6) Localized I/O related changes
- For the specific case of I/O, other more specialized changes might
be possible.
- For example, storage will be a problem and there might be some
possibilities for additional compression, at additional cpu
cost, which could be run asynchronously on a separate thread. (I
think the ROOT team was considering things like this.) This
might be interesting in certain CMS applications like
repacking in the Tier-0, but what we care about at the moment
is total throughput and its cost, not time-to-run for a single
application. (A toy sketch of off-loading compression to a thread
follows this item.)
- Other solutions (e.g. for skimming, with multiple outputs and
a large dynamic range on each output) could involve carving
off the output module into a separate application and multiple
child cmsRun applications feeding that.
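For concreteness, a toy sketch of off-loading compression to a thread (plain
C++/pthreads, using zlib's compress() only as a stand-in for whatever heavier
compression might actually be applied; the single-buffer hand-off is
deliberately over-simplified): the main thread hands a buffer of serialized
data to a worker and can in principle carry on with the next event while the
worker compresses.

    // Toy asynchronous compression on a separate thread.
    // (compile with -lpthread -lz)
    #include <pthread.h>
    #include <zlib.h>
    #include <vector>
    #include <cstdio>

    struct CompressJob {
      const std::vector<unsigned char>* input;
      std::vector<unsigned char> output;
    };

    static void* compressWorker(void* arg) {
      CompressJob* job = (CompressJob*) arg;
      uLongf outLen = compressBound(job->input->size());
      job->output.resize(outLen);
      // A heavier compression level would trade extra cpu for smaller output;
      // the default level is used here.
      compress(&job->output[0], &outLen, &(*job->input)[0], job->input->size());
      job->output.resize(outLen);
      return 0;
    }

    int main() {
      std::vector<unsigned char> event(1000000, 42);   // stand-in for serialized event data
      CompressJob job;
      job.input = &event;

      pthread_t worker;
      pthread_create(&worker, 0, compressWorker, &job);
      // ... the main thread could go on preparing the next event here ...
      pthread_join(worker, 0);

      printf("compressed %lu bytes to %lu bytes\n",
             (unsigned long) event.size(), (unsigned long) job.output.size());
      return 0;
    }

For the throughput-per-cost question raised above, the relevant comparison is
not this single-job latency but whether the extra cpu spent compressing buys
back enough storage and network to be worth it.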
Conclusions and plans
- Much of our focus to date has been on improving the performance of
single applications running on a single core. We of course intend to
continue this work:
- Reducing dynamic memory use and footprint (an ongoing battle)
- Dealing with code size bloat (larger libraries or static binaries
for some applications)
- Profiles using cpu performance counters
- Of the solutions listed above, the short/medium term one we might
attempt is number 1 (shared memory for calibrations, etc.).
- Attempting the others (2,3,4) in CMSSW is probably both premature and
difficult at the moment. The rate at which CMSSW is growing/changing
(likely to continue for some time) is too high to impose stringent
requirements on thread-safety widely within the code base.
- Given its encapsulation we would welcome some solution
like number 5 (i.e. in Geant4). It could prove an interesting test
case for working through the Workflow Management and bookkeeping
issues for multicore applications with LCG/OSG and the sites (not
described above, but likely to require some effort).
- It is not specifically obvious what would help us in ROOT. (Since we
have a broad cross-section to changes in ROOT, we would also like to
understand in advance any changes that are made there for non-CMS
purposes.)