Hydra

A framework for data analysis in massively parallel platforms

Presented DIANA meeting
A. Augusto Alves Jr
University of Cincinnati
CERN, Geneve 2017-04-03

Outline


  • Overview: design, strategies and goals
  • Functionalities
  • Examples(improvements and new features)
    • New SOA generic container
    • Numerical integration (one- and multi-dimensional quadrature)
    • Phase-Space Monte Carlo (performance boost)
    • Interface to Minuit2
  • Summary and prospects

Hydra


Hydra is a header only templated C++ library to perform data analysis on massively parallel platforms.

Hydra

  • It is implemented on top of the C++ Standard Library and a variadic version of the Thrust library.
  • Hydra runs on Linux systems and can perform calculations using CUDAOpenMP and TBB enabled devices.
  • It is focused on portability, usability, performance and precision.

Design


  • The library is structured using static polymorphism: no virtual members.
  • No function pointers: stack behavior is known at compile time. (very important for CUDA backend)
  • No need to write explicit CUDA, OpenMP or TBB code.
  • Clean and concise semantics. Interfaces are designed to be easy to use correctly and hard to use incorrectly.
  • Common C++ idioms: iterator access to containers, named parameter semantics, RAII, CRTP…

The source files written using Hydra components and standard C++ compiles for GPU, CPU, Xeon Phi…

  • One can just exchanging the extension from .cu to .cpp or…
  • setting the -x language flag in the compiler

Functionality currently implemented


  • Generation of phase-space Monte Carlo samples with any number of particles in the final states. Sequentential decays, weighed and unweighted samples are supported.
  • Sampling of multidimensional PDFs.
  • Fitting of binned and unbinned multidimensional data sets with Minuit2.
  • Evaluation of multidimensional functions over heterogeneous datasets.
  • Numerical integration of multidimensional functions using Monte Carlo-based methods( flat and adaptive Vegas-like) or quadrature rules ( Gauss-Kronrod and Genz-Malik ).

Functors


In Hydra, all the calculations are performed using function objects. Hydra adds features, type information and interfaces to generic functors using the CRTP idiom.

//Gaussian with two parameters: mean and sigma
struct Gauss:public BaseFunctor<Gauss,double, 2>
{
  //client need to implement the Evaluate(T* ) method for homogeneous data.
  template<typename T>
  __host__ __device__
  inline double Evaluate(T* x){/* implementation */}

  //or heterogeneous data. In this case <T> represents a variadic tuple.
  template<typename T>
  __host__ __device__
  inline double Evaluate(T x){/* implementation */}
}
  • For all functors deriving from hydra::BaseFunctor<Func, ReturnType, NParams>, if the calculation is expensive or the functor will be called many times, the results of the first call can be cached.

Arithmetic and composition with functors


Hydra provides a lot “syntax sugar” to deal with function objects.

  • All the basic arithmetic operators are overloaded. Composition is also possible. If A, B and C are Hydra functors, the code below is completely legal.
 //basic arithmetic operations
 auto plus_functor = A + B;  auto minus_functor = A - B;
 auto multiply_functor = A * B;  auto divide_functor = A/B; 

 //composition of basic operations
 auto any_functor = (A - B)*(A + B)*(A/C);
 
 // C(A,B) is represented by:
 auto composed_functor = compose(C, A, B)
  • The functors resulting from arithmetic operations and composition can be cached.
  • No intrinsic limit on the number of functors participating on arithmetic or composition mathematical expressions.

Support for C++11 lambdas


  • C++11 Lambdas are fully supported in Hydra.
  • In the client code one can define a lambda function at any point and convert it into a Hydra functor using hydra::wrap_lambda():
double two = 2.0;
//define a simple lambda and capture "two" by value
auto my_lambda = [] __host__ __device__(double* x)
    { return two*sin(x[0]); };

//convert is into a Hydra functor
auto my_lamba_wrapped  = wrap_lambda(my_lambda);

CUDA 8.0 supports lambda functions in device and host code.

Data containers


  • Most of Hydra algorithms can operate over any iterable C++ container.
  • hydra::PointVector is an iterable collection of hydra::Point objects: hydra::Point represents multidimensional data points with coordinates and value, error of the coordinates and error of the value(usage mandatory for data fitting).
 //two dimensional data set
 PointVector<device, double, 2> data_d(1e6);

 // fill data_d and get data from device and fill a ROOT 2D histogram
 PointVector<host> data_h(data_d);

 TH2D hist("hist", "my histogram", 100, min, max); 
for(auto point : data_h )
{
   std::cout << point << std::endl; // stream the points to std::out
   hist.Fill(point.GetCoordinate(0), point.GetCoordinate(1));
   }

hydra::multivector


  • This class allows to store in memory a table with columns of different types.

  • The table can be accessed line-by-line or element-by-element, in a given column.

  • It is basicaly a SoA generic container with limitations to the type and number of columns.

  • Interface mimics std::vector and thrust::host(device)_vector

    //each line is modeled as a tuple
    typedef thrust::tuple<unsigned int, float, double> line_t;
    
    //prototype like it was a vector of lines  
    typedef thrust::host_vector<line_t>   prototype_h;
    typedef thrust::device_vector<line_t> prototype_d;
    
    //the table
    typedef hydra::experimental::multivector<prototype_h> table_h;
    typedef hydra::experimental::multivector<prototype_d> table_d;
    
    //so use that...
    table_d table;
    table.resize(10);
    table.push_back(thrust::make_tuple( 1, 2.0 , 3.0));
    

Using hydra::multivector to collect objects


  • Users can manage collections of objects using the macro _DeclareStorable(class_name, args...).

  • Objects will get stored as a hydra::multivector and individual objects can be accessed without cache misses and stored without padding.

  • Example:

    class Vector4R
    {
    
    public: 
    // ctors and public class methods etc
    
    private:
    
    GReal_t v[4]; //<< four-vector components are stored here
    
    _DeclareStorable(Vector4R, v[0], v[1] , v[2], v[3])
    };
    
    typedef thrust::host_vector<Vector4R::args_type> prototype_t; 
    //--------------------------^^^^^^^^^^^^^^^^^^^--------------
    
    typedef multivector<prototype_t> vector_particles; 
    //_______________________________________________________
    //exemple of usage: an array of daughter particle vectors.
    
    vector_particles fDaughters;
    
    fDaughters.push_back( Vector4R(5.279, 0, 0, 0) );
    

hydra::multivector deployments


Deployed in several classes and algorithms already:

  • Phase-space Monte Carlo generator
  • Generic multidimensional data container for fits and MC generation
  • Quadratures etc
  • Huge performance gain in all cases!

Generic function evaluation


Functors can be evaluated over large data sets using the template function hydra::eval .

hydra::eval(Iterarors ...itors) will return a vector of tuples with the results.

//allocate a vector [nentries] real numbers in the device
RealVector_d angles_d(nentries);

// Generate angles uniformly in the range generation will be performed in the device
Generator.Uniform(0 , 2*PI, angles_d.begin(), angles_d.end() );

// A lambda to calculate sin(x) and  cos(x)
auto sinL = [] __host__ __device__(double* x){ return sin(x[0]);};
auto sinW = wrap_lambda(sinL);

auto cosL = [] __host__ __device__(double*  x){ return cos(x[0]);};
auto cosW = wrap_lambda(cosL);

// Evaluation
auto functors = thrust::make_tuple( sinW, cosW);
auto range    = make_range( angles_d.begin(), angles_d.end());
auto result   = eval( functors, range);

Multidimensional PDF sampling


The template class hydra::Random manages the multidimensional PDF sampling in Hydra.

  • Generic PDF sampling using accept-reject method.
  • hydra::Random also provides some basic 1D distributions: Uniform, Gaussian, Breit-Wigner…
  • Can be configured via template parameters (policies) to use different random number generators.
  • The sampler takes the target container iterators as input and performs the generation on the host or device backend.

Multidimensional PDF sampling


//Random object with current time count as seed.
Random<thrust::random::default_random_engine>
Generator( std::chrono::system_clock::now().time_since_epoch().count() );

//1D host buffer
hydra::mc_host_vector<double>   data_h(nentries);

//Uniform
Generator.Uniform(-5.0, 5.0, data_h.begin(), data_h.end());

//Gaussian
Generator.Gauss(0.0, 1.0, data_h.begin(), data_h.end());

//Exponential
Generator.Exp(1.0, data_h.begin(), data_h.end());

//Breit-Wigner
Generator.BreitWigner(2.0, 0.2, data_h.begin(), data_h.end());

Multidimensional PDF sampling


//------------------------------
// Two gaussians hit-and-miss
//------------------------------
// Gaussian 1
std::array<double, 2>  means1  ={2.0, 2.0 };
std::array<double, 2>  sigmas1 ={1.5, 0.5 };
Gauss<2> Gaussian1(means1, sigmas1 );

// Gaussian 2
std::array<double, 2>  means2  ={-2.0, -2.0 };
std::array<double, 2>  sigmas2 ={0.5, 1.5 };
Gauss<2> Gaussian2(means2, sigmas2 );

//add the pdfs
auto Gaussians = Gaussian1 + Gaussian2;

//2D range
std::array<double, 2>  min  ={-5.0, -5.0 };
std::array<double, 2>  max  ={ 5.0,  5.0 };

auto data_d = Generator.Sample<device>(Gaussians,min, max, ntrials );
auto data_h = Generator.Sample<host>(Gaussians,min, max, ntrials );

Multidimensional PDF sampling


HIST_2D_gaussians_DEVICE

HIST_2D_gaussians_HOST

  • Same algorithm and functor invoked on the device and on the host.
  • To perform 10 million trials takes:
    • GeForce Titan-Z (2× 2880 CUDA cores): 30 ms
    • Intel i7 4 cores @ 3.60GHz : 450 ms

Phase-Space Monte Carlo


Hydra supports the production of phase-space Monte Carlo samples. The generation is managed by the class hydra::PhaseSpace and the storage is managed by the specialized container hydra::Events.

  • Policies to configure the underlying random number engine and the backend used for the calculation.
  • No limitation on the number of final states.
  • Support the generation of sequential decays.
  • hydra::Events is iterable and therefore is fully compatible with C++11 iteration over range semantics.
  • Generation of weighted and unweighted samples.

The Hydra phase-space generator supersedes the library MCBooster, from the same developer.

Phase-Space Monte Carlo


// B0 -> J/psi K pi
Vector4R B0(5.27961, 0.0, 0.0, 0.0);
std::vector<double> 
massesB0{3.096916, 0.493677, 0.13957018 };

// PhaseSpace object for B0-> K pi J/psi
PhaseSpace<3> phsp(B0.mass(), massesB0);
// container
Events<3, device> Evts_d(10e7);
// generate...
phsp.Generate(B0, Evts_d.begin(), Evts_d.end() );

//copy events to the host
Events<3, host> Evts_h(Evts_d);

for(auto event: Evts_h)
{ /*fill histogram*/}

PHSP

Time to generate 10M events:

  • GeForce Titan-Z (2× 2880 CUDA cores): < 20 ms
  • Intel i7 4 cores @ 3.60 GHz: < 250 ms

Multidimensional numerical integration


Hydra implements some popular methods for one- and multi-dimensional numerical integration:

  • Plain Monte Carlo: algorithm samples points randomly from the integration region to estimate the integral and its error.
  • Vegas-like algorithm of Lepage is based on self-adaptive importance sampling.

The upcoming release implements 4 quadrature methods:

  • Static and self-adaptive Gauss-Kronrod: Implemented rules G7K15, G10K21, G15K31, G20K41, G25K51, G30K61
  • Multidimensional Genz-Malik: Nested Rule and corresponding self-adaptive algorithm.

General strategy is to parallelize the function calls.

Interface to Minuit2 and fitting


Hydra implements an interface to Minuit2 that parallelizes the FCN calculation. This accelerates dramatically the calculation over large datasets.

  • The fit parameters are represented by the class hydra::Parameter and managed by the class hydra::UserParameters.
  • hydra::UserParameters has the same semantics of Minuit2::MnUserParameters.
  • Any positive definite Hydra-functor can be converted into PDF.
  • The PDFs are normalized on-the-fly.
  • The estimator is a policy in the FCN.
  • Data is passed via iterators of a hydra::PointVector.

The FCN provided by Hydra should be used directly in Minuit2.

Interface to Minuit2 and data fitting


Parameter definition syntax is very intuitive:

// 1) using the Named Parameter Idiom
Parameter  mean  = Parameter::Create().Value(3.0).Name("Mean").Error(0.001).Limits( 1.0, 4.0);
Parameter  sigma = Parameter::Create().Name("Sigma").Value(0.5).Error(0.001).Limits(0.1, 1.5);

or

// 2) using unnamed parameter idiom
Parameter mean("Mean", 3.0, 0.001, 1.0, 4.0) ;
Parameter sigma("Sigma", 0.5, 0.001, 0.1, 1.5) ;

Pass the parameters to Hydra and Minuit2:

hydra::UserParameters upar;
upar.AddParameter(&mean);
upar.AddParameter(&sigma);

//Printing the registered parameters
upar.PrintParameters();

Interface to Minuit2 and data fitting


Defining the functor (current implementation):

struct Gauss:public BaseFunctor<Gauss,double, 2>
{
Gauss(Parameter const& mean, Parameter const& sigma, GUInt_t position=0 ):
  BaseFunctor<Gauss,double,2>()
  fPosition(position),
  fM(mean),//<--- not necessary anymore
  fS(sigma)//<--- to declare a member for each parameter
  {RegistryParameters({&fM, &fS}); /* <-- explicit registration not necessary */ }

// Note: Copy ctor and operator= definitions are mmited for brevity

  template<typename T>
    __host__ __device__    inline double Evaluate(T* x)
    {    return exp(-((x[fPosition] - fM) * (x[fPosition] - fM)) / (2 * fS * fS));}

    GUInt_t  fPosition;
    Parameter fM;//<--- not necessary anymore
    Parameter fS;//<--- not necessary anymore
};

Interface to Minuit2 and data fitting


Defining the functor (new implementation):

struct Gauss:public BaseFunctor<Gauss,double, 2>
{
Gauss(Parameter const& mean, Parameter const& sigma, unsigned position=0 ):
  BaseFunctor<Gauss,double,2>({mean, sigma}), //registry parameters here 
  fPosition(position)
  { }

// Note: Copy ctor and operator= definitions are mmited for brevity

template<typename T>
__host__ __device__    inline double Evaluate(T* x)
{
  // _par[i] <--- access the ith parameter
  double y = exp(-((x[fPosition]-_par[0])*(x[fPosition]-_par[0]))/(2*_par[1]*_par[1]))
  return y ;
}

unsigned  fPosition;
};

Interface to Minuit2 and data fitting


Complete example:

  • Model = Gaussian1 + Gaussian2 + Exponential
  • 2 + 2 + 1 + 3 free parameters
  • Unbinned maximum likelihood extended fit
  • Data: 10 Million events.
//Allocate memory to store data 
PointVector<device, double, 1> data_d(nentries);

//Fill data container...

//Get the FCN
auto modelFCN = make_loglikehood_fcn(model, data_d.begin(), data_d.end() );
//Strategy
MnStrategy strategy(1);
//Create the Migrad minimizer
MnMigrad migrad(modelFCN, upar.GetState(),  strategy);
//Perform the fit
FunctionMinimum minimum = migrad();

Interface to Minuit2 and data fitting


Fit

  • Black dots: data points
  • Red line: fit result
  • Blue shadowed area: events sampled from fit result.

Time spent performing the fit:

  • GeForce Titan-Z (2× 2880 CUDA cores): ~2.0 s
  • Intel i7 4 cores @ 3.60 GHz: ~20 s

Testing


OpenMP and TBB backends (x86_64 Virtual Machines):

CUDA

  • Fedora 23: kernel-4.7.7, gcc-5.3.1, cuda-8.0, arch Kepler (Titan Z)
  • Fedora 23: kernel-4.7.7, gcc-5.3.1, cuda-8.0, arch Quadro (K2200)
  • Fedora 23: kernel-4.7.7, gcc-5.3.1, cuda-8.0, arch Maxwell (GTX970)
  • Fedora 23: kernel-4.7.7, gcc-5.3.1, cuda-8.0, arch Maxwell (GTX980M)
  • SUSE: kernel-4.8.0(?) and gcc-5.4.0, cuda-8.0, arch Pascal (GTX1080)

LHCb Analyses using Hydra


  • Charged Kaon mass measurement:
    • 1D and 2D fit 20M+ events
    • A lot of toy MC studies
  • Andrea will present some slides about that.

In progress…


  1. At design/implementation phase :

    • Sub-thread parallelization using SIMD where is possible.
    • Statistics module.
    • Mathematical functions module.
    • Unit-tests and performance measuments for each class/algorithm.
    • Generic convolution algorithm.
  2. Testing phase (to be in the next release):

    • New very fast multidimensional data containers: 100% “prefetcher friendly” on CPU and GPU.
    • Super fast phase-space generator: 15 million events/ ~7 microseconds.
    • Support to phase-space integrals
    • Multidimensional adaptive and non adaptive quadrature (Genz-Malik).
    • One-dimensional adaptive and non-adaptive quadrature (Gauss-Kronrod).
    • Many improvments on the API.

Summary


Please, visit the page of the project, give a try, report bugs, make suggestions and if you like it give a star. Thanks !