
Visit of EPFL Blue Brain. Talk: Accelerating the Pipeline of Brain Tissue Simulations with Apache Spark

Europe/Zurich
513/1-024 (CERN)

Description

The aim of this meeting is to meet and share experience in data management and data engineering between CERN teams and EPFL Blue Brain.

    • 14:00 14:40
      Accelerating the Pipeline of Brain Tissue Simulations with Apache Spark 40m

      In recent years, increasing computational power has made possible larger scientific experiments with high computational demands, such as brain tissue simulations. In general, larger simulations imply dealing with larger amounts of input and output data that need to be read, processed, and analyzed. In this context, we foresee a 10x increase in simulation data in the next year, which will make the current implementation of certain stages of our workflow infeasible in the near future. Therefore, we are exploring how to accelerate selected critical parts of the simulation pipeline with big data technologies such as Apache Spark. In this talk, we will present how we address these challenges at two different stages of the pipeline: circuit building and simulation analysis.
      The current implementation of brain circuit building places neurons in the volume of a brain region according to scientific rules, followed by spatial detection of touching neuron components. The resulting synapses are filtered to match the biological distributions of connections between cell types. Due to the data sizes involved, we chose to implement the last iteration of this step using Apache Spark, making extensive use of modern features such as Pandas UDFs.
      Regarding the analysis stage, we build RDDs/DataFrames from the simulation output data and perform different scientific data queries and transformations. After significant engineering and programming efforts, we have implemented our analysis stage in five different ways, combining RDDs, DataFrames, different data structures and representations, and different data partitioning schemes to evaluate which Spark features best fit our use case. We will present the outcome of our experiments, run at large scale on a data analysis cluster.
      We would like to share with the audience the lessons we have learned: how Spark features can benefit the pipeline of our neuroscience research area and what types of decisions can impact performance. Moreover, we would also like to open up potential collaborations and discussions on Spark and big data, to address the current challenges as a joint community effort.

      Speaker bio: Judit Planas received her PhD in Computer Architecture from the Technical University of Catalonia (UPC, Spain) in 2015. She worked at the Barcelona Supercomputing Center from 2008 to 2015, where she completed her MSc and PhD work on programming models for heterogeneous architectures (GPGPUs, Intel Xeon Phi). During this period, she participated as a teaching assistant in a number of workshops and courses on parallel programming models and CUDA programming. In addition, as part of her PhD, she did an internship at the University of Illinois at Urbana-Champaign (Illinois, USA) under the supervision of Prof. Wen-mei W. Hwu. Since 2015, she has been a Postdoctoral Researcher at the Blue Brain Project, Ecole Polytechnique Federale de Lausanne (EPFL, Switzerland). Her work focuses on memory-intensive and I/O-demanding neuroscientific applications. In this context, she is developing software and techniques to accelerate applications and take advantage of the latest hardware and software technologies, such as non-volatile memory (NVM) and big data solutions. She has published her work at international conferences and in journals, and has been invited to participate in different events as a speaker, panelist, or program committee member.

      Speaker: Judit Planas (EPFL)
    • 14:40 15:00
      Discussion 20m