In this contribution we report on the activity of the GAP project, which aims to investigate the deployment of Graphics Processing Units (GPUs) in different contexts of real-time scientific applications. The areas of interest span a wide range of data processing rates, bandwidths and computational intensities of the executed algorithms. We focus in particular on the application of GPUs in asynchronous systems, such as the software trigger systems of particle physics experiments and the reconstruction of nuclear magnetic resonance images. All these applications can benefit from an implementation on the massively parallel architecture of GPUs, each optimizing different aspects.
As a first application we discuss how specific trigger algorithms can be naturally parallelized and thus benefit from an implementation on the GPU architecture, in terms of both execution speed and the complexity of the events that can be analyzed. The two benchmark application environments under investigation are the NA62 and ATLAS experiments at CERN.
The NA62 experiment aims at the measurement of ultra-rare kaon decays, recording data from the SPS high-intensity hadron beam. A selective trigger, based on sequential hardware and software layers, is essential in order to identify in real time the interesting events, which occur at a rate of about one in 10^10. GPUs can be exploited to build trigger primitives of offline-reconstruction quality, which allow the definition of highly pure and efficient selection criteria. Although the NA62 collaboration is considering the application of GPUs in both the hardware and the software trigger, in this contribution we focus on the latter, which is devoted to reducing the data collection rate from 1 MHz to ~10 kHz. We discuss the benefits achievable from a GPU implementation of the reconstruction algorithms for the rings in the NA62 RICH detector and for the tracking spectrometer. In both cases innovative algorithms have been designed specifically to benefit from the massive parallelism of the GPU architecture.
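As a rough illustration of why ring reconstruction lends itself to parallel execution, the sketch below fits a circle to the hit positions of each event with a textbook algebraic least-squares fit. This is not the NA62 algorithm, whose details are not given here; the function names (`solve3`, `fit_ring`, `fit_rings`) and the representation of an event as a list of (x, y) hit coordinates are assumptions made for the example. The point is only that every event is fitted independently, so the loop over events could map onto one GPU thread or thread block per event.

```python
import math


def solve3(m, v):
    """Solve a 3x3 linear system m x = v by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / a[r][r]
    return x


def fit_ring(hits):
    """Algebraic circle fit (illustrative, not the NA62 method).

    Writes the circle as x^2 + y^2 = A x + B y + C and solves the linear
    least-squares normal equations for (A, B, C). Then the centre is
    (A/2, B/2) and the radius is sqrt(C + cx^2 + cy^2).
    """
    n = float(len(hits))
    sx = sum(x for x, _ in hits)
    sy = sum(y for _, y in hits)
    sxx = sum(x * x for x, _ in hits)
    syy = sum(y * y for _, y in hits)
    sxy = sum(x * y for x, y in hits)
    sz = sum(x * x + y * y for x, y in hits)
    sxz = sum(x * (x * x + y * y) for x, y in hits)
    syz = sum(y * (x * x + y * y) for x, y in hits)
    a, b, c = solve3(
        [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]],
        [sxz, syz, sz],
    )
    cx, cy = a / 2.0, b / 2.0
    return cx, cy, math.sqrt(c + cx * cx + cy * cy)


def fit_rings(events):
    """Fit every event independently; this loop is the part that would run in parallel."""
    return [fit_ring(hits) for hits in events]


def make_ring(cx, cy, r, n_hits=12):
    """Generate noiseless hits on a circle (synthetic test data only)."""
    return [
        (cx + r * math.cos(2 * math.pi * k / n_hits),
         cy + r * math.sin(2 * math.pi * k / n_hits))
        for k in range(n_hits)
    ]
```

For example, `fit_rings([make_ring(1.5, -2.0, 11.0), make_ring(0.0, 3.0, 9.5)])` recovers the two centres and radii used to generate the hits. A fit of this kind has a fixed, small amount of arithmetic per event, which is one reason such problems suit a massively parallel device.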
The ATLAS experiment records data from the LHC pp collisions through a hybrid multi-stage trigger. The first level is synchronous and based on custom electronics, while the subsequent stage is asynchronous and based on software algorithms run on a commodity PC farm. The benchmark activity we are carrying out involves the software trigger algorithms used for muon reconstruction in the detector. Muon reconstruction executes the same algorithms a large number of times to reconstruct and match segments of particle trajectories in the detector, and can therefore benefit from massively parallel execution on GPUs.
We will discuss in detail the implementation of such algorithms on a GPU-based system. We will characterize the performance of this new implementation and benchmark it against the performance of the present ATLAS muon algorithm. The GPU is integrated within the current data acquisition system through a client-server structure that can manage different tasks and their execution on a given device, such as the GPU. This element is flexible, able to deal with different computing devices, and adds almost no overhead to the total latency of the algorithm execution. With the help of this structure it is possible to isolate the muon trigger algorithm itself and optimize it for execution on the GPU. This will imply the translation of the algorithm to the CUDA programming language and the optimization of the different tasks that can be naturally parallelized. In this way the dependence of the execution time on the complexity of the processed events will be reduced. A similar approach has been investigated in the past for the deployment of GPUs in different ATLAS trigger algorithms, with promising results. The foreseen evolution of the ATLAS trigger system, which will merge the higher-level trigger layers into a single software processing stage, can take even more advantage of the use of GPUs. More complex algorithms, with offline-like resolution, can be implemented on a thousand-core device with significant speedup factors. The timing comparison between the serial and the parallel implementations of the trigger algorithm is performed on the data collected in the past year, and also on simulated data that reproduce the data-taking conditions foreseen with the LHC luminosity upgrade, with an increased number of multiple interactions in the collisions.
A similar improvement can be obtained by exploiting GPUs in medical imaging. Diagnostic techniques such as Nuclear Magnetic Resonance (NMR) imaging make it possible to visualize parts of the body using information on the diffusion of water molecules. The most advanced processing techniques are based on the calculation of ~1M non-linear functions, a task that is naturally parallelizable and computationally demanding. In this project we are focusing on the kurtosis diffusion method K, which currently takes ~20 hours to reconstruct a brain image precisely. These algorithms, currently implemented in Matlab, can be converted to a parallel version for GPUs thanks to available compatibility libraries. Performance measurements will be presented for the parallel implementation of the image reconstruction algorithms and of the Monte Carlo simulation techniques.
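To make the per-voxel independence concrete, the sketch below uses the standard diffusion kurtosis signal model, ln S(b) = ln S0 - b D + (1/6) b^2 D^2 K, and solves it in closed form from the signal at b = 0 and at two non-zero b-values. This is a deliberately simplified stand-in for the non-linear fits of the full method, not the Matlab implementation referred to above; the function names, the three-measurement input layout and the units (b in ms/µm², D in µm²/ms) are assumptions made for the example.

```python
import math


def dki_signal(b, s0, d, k):
    """Diffusion kurtosis signal model: S(b) = S0 * exp(-b D + (b D)^2 K / 6)."""
    return s0 * math.exp(-b * d + (b * d) ** 2 * k / 6.0)


def fit_voxel(s0, s1, s2, b1, b2):
    """Closed-form estimate of (D, K) for one voxel (illustrative simplification).

    s0, s1, s2 are the signals at b = 0, b1 and b2. Dividing the log-signal
    ratio by b gives ln(S/S0)/b = -D + b q with q = D^2 K / 6, so two
    non-zero b-values determine q and D exactly.
    """
    y1 = math.log(s1 / s0) / b1
    y2 = math.log(s2 / s0) / b2
    q = (y1 - y2) / (b1 - b2)
    d = b1 * q - y1
    k = 6.0 * q / (d * d)
    return d, k


def fit_volume(voxels, b1, b2):
    """Process every voxel independently; this loop is what a GPU would parallelize."""
    return [fit_voxel(s0, s1, s2, b1, b2) for (s0, s1, s2) in voxels]


def synthetic_voxel(s0, d, k, b1, b2):
    """Noiseless synthetic measurements for one voxel (test data only)."""
    return (s0, dki_signal(b1, s0, d, k), dki_signal(b2, s0, d, k))
```

For instance, `fit_volume([synthetic_voxel(100.0, 1.0, 0.8, 1.0, 2.0)], 1.0, 2.0)` recovers D = 1.0 and K = 0.8 up to rounding. Because no voxel depends on any other, a volume of about a million voxels maps directly onto as many independent GPU threads; a realistic fit would use more b-values and directions and a non-linear minimization per voxel, which increases the work per thread but not the structure of the parallelism.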
The client-server structure is obtained using APE, an ATLAS tool developed independently of this project.
D. Emeliyanov, J. Howard, J. Phys.: Conf. Ser. 396, 012018, 2012.
J.H. Jensen, J.A. Helpern, NMR Biomed. 23 (7), 698-710, 2010.