Third Computational and Data Science school for HEP (CoDaS-HEP 2019)

US/Eastern
407 Jadwin Hall (Princeton University)

407 Jadwin Hall

Princeton University

Princeton Center For Theoretical Science (PCTS)
Description

The third school on tools, techniques and methods for Computational and Data Science for High Energy Physics (CoDaS-HEP 2019) will take place on 22-26 July, 2019, at Princeton University.

Advanced software is a critical ingredient to scientific research. Training young researchers in the latest tools and techniques is an essential part of developing the skills required for a successful career both in research and in industry.

The CoDaS-HEP school aims to provide a broad introduction to these critical skills as well as an overview of applications in High Energy Physics. Specific topics to be covered at the school include:

  • Parallel Programming 
  • Big Data Tools and Techniques
  • Machine Learning 
  • Practical skills like performance evaluation, use of git, etc.

The school offers a limited number of young researchers an opportunity to learn these skills from experienced scientists and instructors. Successful applicants will receive travel and lodging support to attend the school.

School website: http://codas-hep.org

The school lectures will take place in 407 Jadwin Hall, in the main lecture hall of the Princeton Center for Theoretical Science (PCTS).

This project is supported by National Science Foundation grants OAC-1829707, OAC-1829729, OAC-1836650 and OAC-1450377, the Princeton Institute for Computational Science and Engineering (PICSciE), the Princeton Physics Department, the Office of the Dean for Research of Princeton University and the Enrico Fermi Institute at the University of Chicago. Any opinions, findings, conclusions or recommendations expressed in this material are those of the developers and do not necessarily reflect the views of the National Science Foundation.

 

    • 08:30 09:00
      Breakfast 30m 407 Jadwin Hall

    • 09:00 09:15
      Welcome and Overview 15m 407 Jadwin Hall

      Speaker: Peter Elmer (Princeton University (US))
    • 09:15 10:00
      Computational and Data Science Challenges 45m 407 Jadwin Hall

      Including What Every Physicist Should Know About Computer Architecture...

      Speaker: Ian Cosden (Princeton University)
    • 10:00 10:30
      Setup on local compute systems 30m 407 Jadwin Hall

    • 10:30 11:00
      Coffee Break 30m 407 Jadwin Hall

    • 11:00 12:30
      Version Control with Git and Github 1h 30m 407 Jadwin Hall

      Fundamentally, a Version Control System (VCS) is a system that records changes to a file or set of files over time, so that you can recall specific versions later.

      Git is a modern VCS that is fast and flexible to use thanks to its lightweight branch creation. Git is very popular, due in part to the availability of cloud hosting services like GitHub, Bitbucket and GitLab. Hosting a Git repository on a remote service like GitHub greatly facilitates working collaboratively and lets you frequently back up your work on a remote host.

      We will start this talk by introducing the fundamental concepts of Git. The second part of the talk will show how to publish to a remote repository on GitHub.

      No prior knowledge of Git or version control will be necessary, but some familiarity with the Linux command line will be expected.

      Speaker: David Luet (Princeton University)
    • 12:30 13:30
      Lunch 1h 407 Jadwin Hall

    • 13:30 15:00
      Parallel Programming - An introduction to parallel computing with OpenMP 1h 30m 407 Jadwin Hall

      We start with a discussion of the historical roots of parallel computing and how they appear in a modern context. We'll then use OpenMP and a series of hands-on exercises to explore the fundamental concepts behind parallel programming.

      Speaker: Tim Mattson (Intel)
    • 15:00 15:30
      Coffee Break 30m 407 Jadwin Hall

    • 15:30 17:30
      Parallel Programming - The OpenMP Common Core 2h 407 Jadwin Hall

      We will explore through hands-on exercises the common core of OpenMP; that is, the features of the API that most OpenMP programmers use in all their parallel programs. This will provide a foundation of understanding you can build on as you explore the more advanced features of OpenMP.

      Speaker: Tim Mattson (Intel)
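
      The agenda gives no code for these sessions. As a rough illustration of the fork-join style with a reduction, which OpenMP expresses with a parallel loop and a reduction clause, here is a sketch in Python rather than C/OpenMP. The choice of problem (midpoint-rule integration of 4/(1+x^2) to estimate pi) and the names partial_sum and estimate_pi are assumptions made for this example. Threads in CPython do not speed up CPU-bound pure-Python loops, so the sketch shows the structure only, not the performance.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_sum(start, stop, step):
    """Midpoint-rule sum of 4 / (1 + x^2) over slices [start, stop)."""
    total = 0.0
    for i in range(start, stop):
        x = (i + 0.5) * step
        total += 4.0 / (1.0 + x * x)
    return total


def estimate_pi(num_steps=200_000, num_workers=4):
    """Fork: split the loop into chunks. Join: add the partial sums."""
    step = 1.0 / num_steps
    # Chunk boundaries; together the chunks cover range(num_steps) exactly once.
    bounds = [num_steps * w // num_workers for w in range(num_workers + 1)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(partial_sum, bounds[w], bounds[w + 1], step)
            for w in range(num_workers)
        ]
        # The reduction step: combine one private result per worker.
        return step * sum(f.result() for f in futures)


if __name__ == "__main__":
    print(estimate_pi(), math.pi)
```

      Each worker accumulates into its own local total, so no shared variable is updated concurrently; the partial results are combined only once, after all workers finish.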
    • 18:00 20:30
      Welcome Reception 2h 30m Palmer House

    • 08:00 08:30
      Breakfast 30m 407 Jadwin Hall

    • 08:30 10:30
      Parallel Programming - Working with OpenMP 2h 407 Jadwin Hall

      We'll explore more complex OpenMP problems and get a feel for how to work with OpenMP with real applications.

      Speaker: Tim Mattson (Intel)
    • 10:30 10:40
      Group Photo - Jadwin Hall plaza 10m
    • 10:40 11:00
      Coffee Break 20m 407 Jadwin Hall

    • 11:00 12:30
      Parallel Programming - The world beyond OpenMP 1h 30m 407 Jadwin Hall

      Parallel programming is hard. There is no way to avoid that reality. We can mitigate these difficulties by focusing on the fundamental design patterns from which most parallel algorithms are constructed. Once mastered, these patterns make it much easier to understand how your problems map onto other parallel programming models. Hence for our last session on parallel programming, we'll review these essential design patterns as seen in OpenMP, and then show how they appear in cluster computing (with MPI) and GPGPU computing (with OpenCL and a bit of CUDA).

      Speaker: Tim Mattson (Intel)
    • 12:30 13:30
      Lunch 1h 407 Jadwin Hall

    • 13:30 15:00
      The Scientific Python Ecosystem 1h 30m 407 Jadwin Hall

      In recent years, Python has become a glue language for scientific computing. Although code written in Python is generally slow, it has a good connection with compiled C code and a common data abstraction through Numpy. Many data processing and statistical packages, and most machine learning software, have a Python interface as a matter of course.

      This tutorial will introduce you to core Python packages for science, such as Numpy, SciPy, Matplotlib, Pandas, and Numba, as well as HEP-specific tools like iminuit, particle, pyjet, uncertainties, and pyhf. We'll especially focus on accessing ROOT data in PyROOT and uproot.

      Speaker: Jim Pivarski (Princeton University)
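
      To make the "common data abstraction through Numpy" concrete, the following sketch computes the invariant mass of particle pairs with whole-array operations instead of a Python loop. The kinematic values are made up, the particles are assumed massless, and the function name invariant_mass is chosen for this example; none of it is taken from the tutorial material.

```python
import numpy as np

# Hypothetical kinematics for three two-particle events (made-up numbers).
pt1 = np.array([40.0, 25.0, 60.0])    # transverse momentum, GeV
eta1 = np.array([0.5, -1.2, 0.0])
phi1 = np.array([0.1, 2.0, 0.0])
pt2 = np.array([35.0, 30.0, 60.0])
eta2 = np.array([-0.3, 0.4, 0.0])
phi2 = np.array([-2.9, -1.0, np.pi])


def invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Mass of a pair of massless particles, computed for all events at once."""
    m2 = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    return np.sqrt(m2)


masses = invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2)
print(masses)
```

      The third event is a simple cross-check: two back-to-back 60 GeV particles at zero pseudorapidity give a mass of 120 GeV.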
    • 15:00 15:30
      Coffee Break 30m 407 Jadwin Hall

    • 15:30 17:30
      Machine Learning 2h 407 Jadwin Hall

      Machine learning (ML) is a thriving field with active research topics. It has found numerous practical applications in natural language processing, understanding of speech and images as well as fundamental sciences. ML approaches are capable of replicating and often surpassing the accuracy of hypothesis-driven first-principles simulations and can provide new insights into a research problem.

      Here we provide an overview of the content of the Machine Learning tutorials. Although the theory and practice sessions are described separately, they will alternate with one another throughout the four lectures. In this way, after we've introduced new concepts, we can immediately use them in a tailored exercise, which will help us absorb the material covered.

      Theory

      We will start with a gentle introduction to the ML field, introducing the three learning paradigms: supervised, unsupervised, and reinforcement learning. We'll then delve into the two supervised sub-categories, regression and classification, using neural nets' forward and backward propagation. We'll soon see that smart choices can be made to exploit the nature of the data we're dealing with, and introduce convolutional and recurrent neural nets. We'll move on to unsupervised learning, and we'll familiarise ourselves with generative models such as variational autoencoders and adversarial networks.

      Practice

      We will introduce machine learning technology focusing on the open source software stack PyTorch. We'll go over a brief introduction to PyTorch architecture, primitives and automatic differentiation, implementing multi-layer perceptron and convolutional layers, a deep dive into recurrent neural networks for sequence learning tasks, and finally some generative models. Python programming experience and Numpy exposure are highly desirable, but previous experience with PyTorch is not required.

      Speakers: Alfredo Canziani (NYU Center for Data Science), Henry Fredrick Schreiner (Princeton University), Savannah Jennifer Thais (Princeton University (US))
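
      The tutorials use PyTorch, where automatic differentiation performs the backward pass. As a minimal sketch of what forward and backward propagation mean for classification, the example below writes both by hand in NumPy for the simplest case, a single linear layer with a sigmoid (logistic regression). The toy data, learning rate and function names are assumptions made for this illustration and do not come from the course.

```python
import numpy as np


def forward(w, b, x):
    """Forward pass: linear layer followed by a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


def loss_fn(w, b, x, y):
    """Mean binary cross-entropy between predictions and labels."""
    p = np.clip(forward(w, b, x), 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def backward(w, b, x, y):
    """Backward pass: gradients of the loss with respect to w and b."""
    p = forward(w, b, x)
    grad_z = (p - y) / len(y)       # d(loss)/d(pre-activation)
    return x.T @ grad_z, grad_z.sum()


# Toy data (made up): label is 1 when the two features sum to more than zero.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
initial_loss = loss_fn(w, b, x, y)
for _ in range(500):                # plain gradient descent
    grad_w, grad_b = backward(w, b, x, y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b
final_loss = loss_fn(w, b, x, y)
accuracy = np.mean((forward(w, b, x) > 0.5) == (y == 1.0))
print(initial_loss, final_loss, accuracy)
```

      In a deeper network the same chain-rule step is repeated layer by layer, which is the bookkeeping that an automatic differentiation library takes over.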
    • 18:30 20:30
      Social Mixer - Prospect House 2h

      Food and drinks at Prospect House

    • 08:00 08:30
      Breakfast 30m 407 Jadwin Hall

    • 08:30 09:30
      The Use and Abuse of Random Numbers 1h 407 Jadwin Hall

      Speaker: Daniel Sherman Riley (Cornell University (US))
    • 09:30 10:30
      Floating Point Arithmetic Is Not Real 1h 407 Jadwin Hall

      Speaker: Bei Wang (Princeton University)
    • 10:30 11:00
      Coffee Break 30m 407 Jadwin Hall

    • 11:00 12:30
      Columnar Data Analysis 1h 30m 407 Jadwin Hall

      Data analysis languages, such as Numpy, MATLAB, R, IDL, and ADL, are typically interactive with an array-at-a-time interface. Instead of performing an entire analysis in a single loop, each step in the calculation is a separate pass, letting the user inspect distributions each step of the way.

      Unfortunately, these languages are limited to primitive data types: mostly numbers and booleans. Variable-length and nested data structures, such as different numbers of particles per event, don't fit this model. Fortunately, the model can be extended.

      This tutorial will introduce awkward-array, the concepts of columnar data structures, and how to use them in data analysis, such as computing combinatorics (quantities depending on combinations of particles) without any for loops.

      Speaker: Jim Pivarski (Princeton University)
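
      The columnar idea described above can be sketched with plain NumPy: a variable-length ("jagged") collection is stored as one flat content array plus per-event counts or offsets. This is not the awkward-array API, only an illustration of the underlying layout; the array names and the pT values are made up for the example.

```python
import numpy as np

# Hypothetical jagged data: 3 events with 2, 0 and 3 particles (made-up pT values).
counts = np.array([2, 0, 3])
content = np.array([50.0, 20.0, 35.0, 10.0, 5.0])   # all particles, flattened

# content[offsets[i]:offsets[i + 1]] holds the particles of event i.
offsets = np.concatenate(([0], np.cumsum(counts)))

# Event index of every particle, built without a Python loop.
event_index = np.repeat(np.arange(len(counts)), counts)

# Per-event scalar sum of pT; the empty event correctly gets 0.
sum_pt = np.bincount(event_index, weights=content, minlength=len(counts))

# Number of distinct particle pairs per event: n * (n - 1) / 2.
n_pairs = counts * (counts - 1) // 2

print(offsets, event_index, sum_pt, n_pairs)
```

      Every step is a single pass over whole arrays, so intermediate results such as event_index or sum_pt can be inspected before moving on.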
    • 12:30 13:30
      Lunch 1h 407 Jadwin Hall

    • 13:30 15:00
      Accelerating Python 1h 30m 407 Jadwin Hall

      Python (or really its standard implementation, CPython) is notoriously slow, but it can be fast with the right techniques. Casting problems in Numpy is one way to do it, though algorithms that must "iterate until converged" don't fit Numpy's array-at-a-time model well.

      Numba is an alternative that compiles Python to run as fast as C, but only if the code consists purely of numbers and arrays that don't change type. Quite a few tools, such as pybind11, Cython, and PyROOT, call out to C++, which is another way of escaping Python for tight loops. There are also many tools to parallelize Python, though there are some pitfalls to consider.

      In this session, we'll survey these methods and their strengths and weaknesses.

      Speaker: Jim Pivarski (Princeton University)
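
      One way to see why "iterate until converged" sits awkwardly in an array-at-a-time model: each element may need a different number of iterations, so the array code has to track which elements are still active. The sketch below does this for Newton's method applied to square roots. The function name newton_sqrt and the choice of problem are assumptions made for this example, and the inputs are assumed to be a one-dimensional array of strictly positive numbers.

```python
import numpy as np


def newton_sqrt(values, tol=1e-12, max_iter=100):
    """Square roots by Newton's method; each element stops when it converges."""
    values = np.asarray(values, dtype=float)
    guess = np.where(values > 1.0, values, 1.0)    # start at or above the root
    active = np.ones(values.shape, dtype=bool)     # elements still iterating
    for _ in range(max_iter):
        idx = np.flatnonzero(active)
        if idx.size == 0:
            break
        g = guess[idx]
        new = 0.5 * (g + values[idx] / g)
        converged = np.abs(new - g) <= tol * new
        guess[idx] = new
        # Retire converged elements so later passes skip them.
        active[idx[converged]] = False
    return guess


if __name__ == "__main__":
    print(newton_sqrt([1.0, 2.0, 1.0e6, 0.25]))
```

      The input 1.0 finishes after one pass while 1.0e6 needs many more; the mask bookkeeping is the price of staying in NumPy, whereas a compiled loop would let each element exit on its own.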
    • 15:00 15:30
      Coffee Break 30m 407 Jadwin Hall

    • 15:30 17:30
      Machine Learning 2h 407 Jadwin Hall

      Speakers: Alfredo Canziani (NYU Center for Data Science), Henry Fredrick Schreiner (Princeton University), Savannah Jennifer Thais (Princeton University (US))
    • 18:00 20:00
      Dinner on your own 2h
    • 08:00 08:30
      Breakfast 30m 407 Jadwin Hall

    • 08:30 10:30
      Vector Parallelism on Multi-Core Processors 2h 407 Jadwin Hall

      All modern CPUs boost their performance through vector processing units (VPUs). VPUs are activated through special SIMD instructions that load multiple numbers into extra-wide registers and operate on them simultaneously. Intel's latest processors feature a plethora of 512-bit vector registers, as well as 1 or 2 VPUs per core, each of which can operate on 16 floats or 8 doubles in every cycle. Typically these SIMD gains are achieved not by the programmer directly, but by (a) the compiler through automatic vectorization of simple loops in the source code, or (b) function calls to highly vectorized performance libraries. Either way, vectorization is a significant component of parallel performance on CPUs, and to maximize performance, it is important to consider how well one's code is vectorized.

      In the first part of the presentation, we take a look at vector hardware, then turn to simple code examples that illustrate how compiler-generated vectorization works and the crucial role of memory bandwidth in limiting the vector processing rate. What does it really take to reach the processor's nominal peak of floating-point performance? What can we learn from things like roofline analysis and compiler optimization reports? And what can a developer do to help out the compiler?
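
      The arithmetic behind these questions can be sketched in a few lines. The lane counts follow directly from the 512-bit register width quoted above; the machine parameters (20 cores, 2.0 GHz, 100 GB/s) and the factor of 2 for fused multiply-add are hypothetical values chosen for illustration, and the function names are invented for this example.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements one vector register holds."""
    return register_bits // element_bits


def peak_gflops(cores, ghz, vpus_per_core, lanes, flops_per_lane=2):
    """Nominal peak in GFLOP/s; flops_per_lane=2 assumes fused multiply-add."""
    return cores * ghz * vpus_per_core * lanes * flops_per_lane


def roofline_gflops(peak, bandwidth_gb_s, flops_per_byte):
    """Attainable rate: capped either by compute or by memory traffic."""
    return min(peak, bandwidth_gb_s * flops_per_byte)


floats = simd_lanes(512, 32)    # single-precision lanes per register
doubles = simd_lanes(512, 64)   # double-precision lanes per register

# Hypothetical machine: 20 cores, 2.0 GHz, 2 VPUs per core, 100 GB/s memory.
peak = peak_gflops(cores=20, ghz=2.0, vpus_per_core=2, lanes=doubles)

# y[i] = a * x[i] + y[i] in double precision: 2 flops per 24 bytes moved
# (read x, read y, write y).
axpy = roofline_gflops(peak, 100.0, 2 / 24)
print(floats, doubles, peak, axpy)
```

      With these assumed numbers the simple loop is limited to roughly 8 GFLOP/s out of a nominal 1280, which is the sense in which memory bandwidth, not the vector units, sets the rate for low-intensity code.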

      In the second part, we consider how a physics application may be restructured to take better advantage of vectorization. In particular, we focus on the Matriplex concept that is used to implement parallel Kalman filtering in our collaboration's particle tracking R&D project. Drastic changes to data structures and loops were required to help the compiler find the SIMD opportunities in the algorithm. In certain places, vector operations were even enforced through calls to intrinsic functions. We examine a suite of test codes that helped to isolate the performance impact of the Matriplex class on the basic Kalman filter operations.

      Speaker: Steven R Lantz (Cornell University (US))
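
      As a rough picture of the kind of data-structure change described above, the sketch below stores many small matrices so that the same element of every matrix is contiguous, and then multiplies them element by element across the whole set. This is a NumPy stand-in written for this page under that reading of the Matriplex idea; it is not the collaboration's C++ class, and all names and sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_matrices, dim = 16, 3

# Conventional layout: a sequence of small matrices, multiplied one at a time.
a_list = rng.normal(size=(n_matrices, dim, dim))
b_list = rng.normal(size=(n_matrices, dim, dim))
c_loop = np.stack([a @ b for a, b in zip(a_list, b_list)])

# Matriplex-like layout: the matrix index is the innermost (contiguous) axis,
# so element (i, j) of all matrices sits side by side in memory.
a_plex = np.ascontiguousarray(np.moveaxis(a_list, 0, -1))   # shape (dim, dim, n)
b_plex = np.ascontiguousarray(np.moveaxis(b_list, 0, -1))

c_plex = np.zeros((dim, dim, n_matrices))
for i in range(dim):
    for j in range(dim):
        for k in range(dim):
            # One statement updates element (i, j) of every matrix at once.
            c_plex[i, j] += a_plex[i, k] * b_plex[k, j]

# Both layouts give the same products.
assert np.allclose(np.moveaxis(c_plex, -1, 0), c_loop)
```

      The innermost operation is now a long, unit-stride multiply-add over the matrix index, which is the shape of loop that compilers vectorize most readily.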
    • 10:30 11:00
      Coffee Break 30m 407 Jadwin Hall

    • 11:00 12:30
      Introduction to Performance Tuning & Optimization Tools 1h 30m 407 Jadwin Hall

      Improving the performance of scientific code is something that is often considered to be some combination of difficult, mysterious, and time consuming, but it doesn't have to be. Performance tuning and optimization tools can greatly aid in the evaluation and understanding of the performance of scientific code. In this talk we will discuss how to approach performance tuning and introduce some measurement tools to evaluate the performance of compiled-language (C/C++/Fortran) code. Powerful profiling tools, such as Intel VTune and Advisor, will be introduced as well as demonstrated in practical applications. A hands-on example will allow students to gain some familiarity using VTune in a simple, yet realistic setting. Some of the more advanced features of VTune, including the ability to access the performance hardware counters on modern CPUs, will be introduced.

      Speaker: Bei Wang (Princeton University)
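
      VTune and Advisor target compiled code and are not reproduced here. The same measure-first workflow (profile, find the hot spot, then optimize) can be sketched with Python's standard-library profiler. The workload functions slow_sum and analysis are hypothetical and exist only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive hot spot: sum of squares in a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def analysis():
    """Hypothetical workload with one expensive call and one cheap one."""
    return slow_sum(200_000) + sum(range(100))


profiler = cProfile.Profile()
profiler.enable()
result = analysis()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

      The report lists where the time went per function, so the first optimization target is chosen from measurements rather than from guesses.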
    • 12:30 13:30
      Lunch 1h 407 Jadwin Hall

    • 13:30 15:00
      Machine Learning 1h 30m 407 Jadwin Hall

      Speakers: Alfredo Canziani (NYU Center for Data Science), Henry Fredrick Schreiner (Princeton University), Savannah Jennifer Thais (Princeton University (US))
    • 15:00 15:30
      Coffee Break 30m 407 Jadwin Hall

    • 15:30 16:30
      Collaborative Programming 1h 407 Jadwin Hall

      Speaker: David Luet (Princeton University)
    • 16:30 17:15
      Charged Particle Tracking Reconstruction (Guest Lecture) 45m 407 Jadwin Hall

      Speaker: Slava Krutelyov (Univ. of California San Diego (US))
    • 18:00 20:00
      School Dinner - Despana (235A Nassau Street, corner of Nassau St and Olden) 2h

      https://despanaprinceton.com/

    • 08:30 09:00
      Breakfast 30m 407 Jadwin Hall

    • 09:00 10:30
      Machine Learning 1h 30m 407 Jadwin Hall

      Speakers: Alfredo Canziani (NYU Center for Data Science), Henry Fredrick Schreiner (Princeton University), Savannah Jennifer Thais (Princeton University (US))
    • 10:30 11:00
      Coffee Break 30m 407 Jadwin Hall

    • 11:00 12:30
      Closing Session 1h 30m 407 Jadwin Hall
