February 26, 2024 to March 1, 2024
Virtual
UTC timezone

What is an analysis pipeline?

No one does a data analysis once. After the exploratory phase, computations that were found to be useful are formalized as reusable programs that convert input data into final results, and these programs are run over and over, with updates, as new corrections and considerations come to mind. This is a data analysis pipeline.

It is very important for a data analysis pipeline to be reproducible. After all, you want to draw conclusions about your data by running the pipeline under different conditions and seeing how the results change, but they would not be valid conclusions if running it under the same conditions also yields different results! A clean workbench is an essential part of the scientific method, and your data analysis code is part of your scientific workbench.
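As a minimal sketch of what "same conditions, same results" can look like in practice, the check below runs a toy pipeline step twice with identical inputs and asserts that the outputs match. The function `run_pipeline`, its smearing and selection logic, and the sample values are all hypothetical and stand in for a real analysis step; they are not part of the workshop material.

```python
import random


def run_pipeline(events, seed):
    """Hypothetical pipeline step: smear the inputs, select, and summarize."""
    # A local, explicitly seeded generator avoids hidden global state,
    # which is a common source of run-to-run differences.
    rng = random.Random(seed)
    smeared = [e + rng.gauss(0.0, 0.1) for e in events]
    selected = [e for e in smeared if e > 1.0]
    mean = sum(selected) / len(selected) if selected else 0.0
    return {"n_selected": len(selected), "mean": mean}


def test_pipeline_is_reproducible():
    """Running under the same conditions must give the same results."""
    events = [0.5, 1.2, 2.3, 0.9, 1.7]  # illustrative values only
    first = run_pipeline(events, seed=42)
    second = run_pipeline(events, seed=42)
    assert first == second


if __name__ == "__main__":
    test_pipeline_is_reproducible()
    print("pipeline step is reproducible")
```

A check like this is the kind of test that a continuous testing setup could run automatically on every change to the analysis code.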

In addition, scientific results need to be reproducible after your experiment is done. Ensuring reproducibility during your analysis simplifies the process of preserving your analysis for future research. (This training workshop was previously called "Analysis Preservation.")

Reproducibility is a concern for software developers as well, and many of the tools that have been developed for the software industry can be applied to data analysis.

This training event is for data analysts who are already familiar with analysis tools and concepts (e.g. C++, Python, event selection, limit setting) and who want to learn how to make their analysis pipelines robust using continuous testing (CI/CD) and containerization (Podman, Docker, and Apptainer).

What is the format of this workshop?

The main part of the workshop is asynchronous: you learn from pre-recorded videos and training material. This means that you can profit from this workshop no matter your time zone! Throughout the workshop, we offer assistance via Slack.

On the first day, we offer a central kickoff session, help with the setup, and one live lecture. These take place in only one time zone, but don't worry if you cannot make it: they will be recorded, and you can still profit from the rest of the workshop.

On the last day, we will offer small-group mentoring sessions (different sessions for all time zones) to help you answer additional/advanced questions and apply your new knowledge to your own analysis. There will also be one more live lecture.

What exactly will I learn?

Are there any prerequisites?

Yes! 

  • Familiarity with git (very important!)
    • Know how to create repositories
    • Know how to edit and push files
    • Have an account with github.com, gitlab.com, or gitlab.cern.ch
  • Some familiarity with the Linux command line
  • Some familiarity with Python


Also, see "Setup: do this first" in the left sidebar.

Who is supporting this?

This event is supported by CERN and the U.S. National Science Foundation Cooperative Agreement OAC-1836650 (IRIS-HEP).

Who is teaching this thing?

This is a hands-on training that consists of asynchronous lectures delivered by the instructors as video recordings. In addition, mentors will give participants individual attention and debugging assistance via chat tools. The people filling these roles are listed below.

Instructors: 

  • Podman (Docker):
    • Michel Hernandez Villanueva (Brookhaven)
  • Apptainer (Singularity):
    • Marco Mambelli (Fermilab)
  • GitHub CI/CD:
    • Andres Rios-Tascon (Princeton Univ)
  • GitLab CI/CD:
    • Guillermo Fidalgo (Univ of Puerto Rico Mayaguez) 

Mentors (on Slack): 

  • Lera Lukashenko (Nikhef)
  • Marco Mambelli (Fermilab)
  • Jim Pivarski (Princeton Univ)
  • Richa Sharma (Univ of Puerto Rico Mayaguez) 
  • Michel Hernandez Villanueva (Brookhaven)
  • Alexander Moreno Briceño (Universidad Antonio Nariño)
  • Roy Cruz Candelaria (Univ of Puerto Rico Mayaguez) 