GeantV Profiling and Benchmarking (Preliminary)

GeantV 2019 Pre-Beta

Application Profiling Results Benchmark
FullCMS (pre-beta-7) Open|Speedshop AVX AVX2 SSE4.A IgProf (Mem) AVX AVX2 SSE4.A
FullCMS (pre-beta-6) Open|Speedshop (CPU) IgProf (Mem) AVX AVX2 SSE4.A
FullCMS (pre-beta-4) (AVX2) Open|Speedshop (CPU) IgProf (Mem) AVX2
FullCMS (pre-beta-4) Open|Speedshop (CPU) IgProf (Mem) AVX AVX2
FullCMS (branch canal/Scalabity v1) Open|Speedshop (CPU) IgProf (Mem) Summary
FullCMS (branch canal/Scalabity) Open|Speedshop (CPU) IgProf (Mem) Summary
FullCMS (Master-feb-21) Open|Speedshop (CPU) IgProf (Mem) Summary
FullCMS (Master-feb-04) Open|Speedshop (CPU) IgProf (Mem) Summary
FullCMS (pre-beta-3) Open|Speedshop (CPU) IgProf (Mem) Summary
FullCMS (Master-Jan-08) Open|Speedshop (CPU) IgProf (Mem) Summary

GeantV 2018 Fall Sprint (Oct. 2018, CERN): Profiling Results

Profiled on the Wilson cluster using Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (12 cores)
Application       Performance            Compare
FullCMS (Oct.10)  Open|Speedshop IgProf  Summary

GeantV 2017 Fall Sprint (Nov. 2017, CERN): Profiling Results

Profiled on the Wilson cluster using Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Application    Performance            Compare
caloApp (+PE)  Open|Speedshop IgProf  Summary
caloApp        Open|Speedshop IgProf  Summary

VecGeom Performance with Geant4 (Sept. 2017)

Profiled on the Wilson cluster using AMD 6128HE Opteron 2GHz and Intel Xeon X5650 (2.67GHz)

*: VecGeom v00.04.00
**: cms2018 (the upgraded CMS pixel tracker and muon system)
Geant4 Version               Application      Performance                    Summary
10.4.Beta (VecGeom+cms2018)  cmsExp (Vector)  Open|Speedshop IgProf(Memory)  CPU MEM
10.4.Beta (VecGeom+cms2018)  cmsExp (Scalar)  Open|Speedshop IgProf(Memory)  CPU MEM
10.4.Beta (cms2018**)        cmsExp           Open|Speedshop IgProf(Memory)  CPU MEM
10.4.Beta (VecGeom*)         cmsExp           Open|Speedshop IgProf(Memory)  CPU MEM
10.4.Beta                    cmsExp           Open|Speedshop IgProf(Memory)  CPU MEM
Using VecGeom: master Oct. 31, 2016
Geant4 Version      Application  Performance                      Summary
10.2.p02 (G4Solid)  cmsExp       Simple Profiler Memory Profiler  CPU MEM
10.2.p02 (VecGeom)  cmsExp       Simple Profiler Memory Profiler  CPU MEM
geant4.10.2.p02-vecgeom-summary

GeantV 2015 Spring Sprint (March 2015, FNAL): Profiling Results

*: recompiled and rerun on the date shown
Profiled on the Wilson cluster using Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, & gcc 4.9.2
GeantV Version Application Performance Summary
master-2016-02-26 CMS + MinBias Open|Speedshop IgProf HCM

Check Lists

1) Consistency check between test-complex and runCMS setups (i.e., cuts, B-field, input files) -> both using a 1 MeV cut

To-do Lists

1) Build profiling/benchmarking on an isolated node (@FNAL: phi nodes on the phi cluster)

Performance Breakdown

1) Overall performance gain w.r.t. Geant4: baseline
   R = CPU[Geant4 + G4Geometry + TabPhys] / CPU[GeantV + VecGeom-Vector + TabPhys] = R1 x R2 x R3
2) Performance gain by the GeantV framework (i.e., scheduler) - need a Geant4 interface to VecGeom-Scalar
   R1 = CPU[Geant4 + VecGeom-Scalar + TabPhys] / CPU[GeantV + VecGeom-Scalar + TabPhys]
3) Performance gain by VecGeom-Scalar (improvement in geometry algorithm)
   R2 = CPU[Geant4 + G4Geometry + TabPhys] / CPU[Geant4 + VecGeom-Scalar + TabPhys]
4) Performance gain by VecGeom-Vector (vectorization)
   R3 = CPU[GeantV + VecGeom-Scalar + TabPhys] / CPU[GeantV + VecGeom-Vector + TabPhys]
5) Overall performance gain w.r.t. Geant4: production
   Rf = CPU[Geant4 + G4Geometry + G4Physics] / CPU[GeantV + VecGeom-Vector + VecPhys-Vector]
6) Need a multi-threaded application of test-complex (i.e., using Geant4-MT):
   optional, to compare Geant4 and GeantV with N threads
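
The factorization R = R1 x R2 x R3 holds because the two intermediate configurations (Geant4 + VecGeom-Scalar and GeantV + VecGeom-Scalar) cancel in the product. The following is a minimal Python sketch of that bookkeeping. The dictionary keys, the function name `breakdown`, and all CPU times are hypothetical placeholders chosen for illustration; they are not measured results.

```python
import math

# Hypothetical CPU times in seconds (placeholders, NOT measurements).
# Every configuration is assumed to use tabulated physics (TabPhys).
cpu = {
    ("Geant4", "G4Geometry"): 120.0,
    ("Geant4", "VecGeom-Scalar"): 100.0,
    ("GeantV", "VecGeom-Scalar"): 80.0,
    ("GeantV", "VecGeom-Vector"): 50.0,
}


def breakdown(cpu):
    """Return the overall gain R and its factors R1, R2, R3."""
    g4_native = cpu[("Geant4", "G4Geometry")]
    g4_scalar = cpu[("Geant4", "VecGeom-Scalar")]
    gv_scalar = cpu[("GeantV", "VecGeom-Scalar")]
    gv_vector = cpu[("GeantV", "VecGeom-Vector")]
    return {
        "R": g4_native / gv_vector,    # overall gain, baseline
        "R1": g4_scalar / gv_scalar,   # framework (scheduler)
        "R2": g4_native / g4_scalar,   # geometry algorithm
        "R3": gv_scalar / gv_vector,   # vectorization
    }


gains = breakdown(cpu)
# The product of the three factors reproduces R up to rounding.
assert math.isclose(gains["R1"] * gains["R2"] * gains["R3"], gains["R"])
for name, value in gains.items():
    print(f"{name} = {value:.3f}")
```

Keeping the four timings in one table makes it clear which measurement is still missing: R1 and R2 both depend on the Geant4 + VecGeom-Scalar run, which requires the Geant4 interface noted in item 2.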

2) Add Other Performance (HWC) metrics

1) Flat profile: Exclusive time (program counter sampling)
2) Inclusive/Exclusive/% time (call path profiling)
3) Memory and resource access patterns
4) Instructions/cycle or cycles/instruction (osshwcsamp PAPI_TOT_CYC,PAPI_TOT_INS)
5) Cache behavior; long-latency instruction impact (TLB, L1/L2 data/cache misses)
6) Floating point/Vectorization efficiency
7) Branch mispredictions
8) Pipeline stalls
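
Several of the metrics above are simple ratios of hardware counter totals. As a rough sketch, the Python below derives instructions/cycle and cycles/instruction from the PAPI_TOT_INS and PAPI_TOT_CYC totals named in item 4, and provides a generic ratio helper for rate-type metrics such as mispredictions per branch. The function names and the counter values are hypothetical; real totals would be taken from the profiler report.

```python
def ipc_and_cpi(tot_ins, tot_cyc):
    """Return (instructions per cycle, cycles per instruction).

    tot_ins and tot_cyc are the PAPI_TOT_INS and PAPI_TOT_CYC totals.
    """
    if tot_ins <= 0 or tot_cyc <= 0:
        raise ValueError("counter totals must be positive")
    return tot_ins / tot_cyc, tot_cyc / tot_ins


def event_rate(events, base):
    """Generic ratio of two counters, e.g. cache misses per access."""
    if base <= 0:
        raise ValueError("base counter must be positive")
    return events / base


# Hypothetical counter totals (placeholders, NOT measurements).
ipc, cpi = ipc_and_cpi(tot_ins=3_000_000_000, tot_cyc=2_000_000_000)
print(f"IPC = {ipc:.2f}, CPI = {cpi:.2f}")

# Hypothetical branch counts: mispredicted branches / all branches.
print(f"misprediction rate = {event_rate(5_000_000, 400_000_000):.4f}")
```

Which counters can be sampled together depends on the CPU and on the PAPI installation, so the set of ratios that can be reported from a single run is an assumption to verify on the profiling node.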

3) Throughput and Memory Reduction

1) Event Throughput (CPU, Xeon Phi)
2) Memory Reduction
3) Regression Plots
4) Comparison between two data sets (versions)
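
One possible way to summarize items 1, 2, and 4 is to reduce each run to an event throughput and a peak memory figure, then report the relative change between two versions. The sketch below assumes a run is described by an event count, a wall-clock time, and a peak resident memory; the field names, function names, and numbers are hypothetical placeholders, not results from the profiles listed on this page.

```python
def throughput(n_events, wall_seconds):
    """Events processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return n_events / wall_seconds


def relative_change(reference, candidate):
    """Fractional change of candidate with respect to reference."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (candidate - reference) / reference


def compare(ref, cand):
    """Compare two runs given as dicts with events, wall_s, peak_mem_mb."""
    thr_ref = throughput(ref["events"], ref["wall_s"])
    thr_cand = throughput(cand["events"], cand["wall_s"])
    return {
        "throughput_gain": relative_change(thr_ref, thr_cand),
        # Positive value means the candidate uses less memory.
        "memory_reduction": -relative_change(ref["peak_mem_mb"],
                                             cand["peak_mem_mb"]),
    }


# Hypothetical runs of two versions (placeholders, NOT measurements).
old_run = {"events": 1000, "wall_s": 500.0, "peak_mem_mb": 2000.0}
new_run = {"events": 1000, "wall_s": 400.0, "peak_mem_mb": 1500.0}

for metric, value in compare(old_run, new_run).items():
    print(f"{metric}: {value:+.1%}")
```

Collecting the same two numbers for every tagged version would also give the points needed for the regression plots in item 3.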

4) Useful Links for Performance Tools and Optimization

  1. Performance Tools: Open|SpeedShop