High Throughput Distributed Processing of Future HEP Data
Introduction
- The challenges of HEP data processing in the post-upgrade scenarios.
- Scientific software as the key to achieving the deliverables of the (HL-)LHC Physics Programme
- Parallelism, performance and programming models for exploitation of resources on a single box or on a cluster.
- The central role of data management, input and output.
- Evolution of hardware and platforms and their requirements on data analysis and tools.
Track 1: Technologies and Platforms
(4h lectures + 4h exercises)
"Introduction to Efficient Computing" by Andrzej Nowak
- The evolution of computing hardware and what it means in practice
- The seven dimensions of performance
- Controlling and benchmarking your computer and software
- Software that scales with the hardware
- Advanced performance tuning in hardware
"Intermediate Concepts in Efficient Computing" by Andrzej Nowak
- Memory architectures, hardware caching and NUMA
- Scaling out: Big Data – Big Hardware
- The role of compilers and VMs
- A brief look at accelerators and heterogeneity
"Data Oriented Design" by Andrzej Nowak
- Hardware vectorization in detail – theory vs. practice
- Software design for vectorization and smooth data flow
- How can compilers and other tools help?
"Summary and Future Technologies Overview" by Andrzej Nowak
- Teaching program summary and wrap-up
- Next-generation memory technologies and interconnect
- Rack-sized data centres and future computing evolution
- Software technologies – forecasts
Track 2: Parallel and optimised scientific software development
(5h lectures + 6h exercises)
"Computational Challenges of Run III and HL-LHC" by Danilo Piparo
- HEP data processing: from acquisition to analysis
- The upgrades of the LHC detectors and of the accelerators
- Upgrades: challenges of the new dataset and implications for scientific software
- Commonalities and differences with other disciplines such as genomics, plasma physics and astronomy
"Scientific programming: a modern approach" by Danilo Piparo
- Introduction: Amdahl's law, performance and correctness of codebases
- Modern C++: new constructs and their advantages
- Exploiting modern architectures using Python
- Near the hardware: the role of compilers
- Understanding the differences and commonalities of data structures, metrics for their classification, concrete examples
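As an illustration of the Amdahl's law item in this lecture, here is a minimal sketch of the standard formula: the speedup on N workers when a fraction p of the runtime can be parallelised is 1 / ((1 - p) + p / N). The function name and the example values are assumptions chosen for illustration and are not taken from the lecture material.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup predicted by Amdahl's law.

    parallel_fraction: share of the runtime that can run in parallel (0..1).
    workers: number of parallel workers (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the runtime is parallelisable.
    for n in (1, 2, 8, 64, 1024):
        print(f"{n:5d} workers -> speedup {amdahl_speedup(0.9, n):.2f}")
```

With 90% of the work parallelisable, the speedup can never exceed 10, however many workers are added, since the serial 10% always remains.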
"Expressing Parallelism Pragmatically" by Danilo Piparo
- Trivial asynchronous execution
- Task and data decomposition
- Threads and the thread pool model
- In-depth comparison of threads and processes, guidelines for choosing the best option
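The data decomposition and thread pool items above can be sketched as follows, using Python's standard concurrent.futures module. The workload (a sum of squares) and the helper names are hypothetical, chosen only to keep the example short. Note that in the standard CPython build, threads do not speed up CPU-bound pure-Python code because of the global interpreter lock; the sketch illustrates the pattern, not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Work done on one chunk of the data (the 'task')."""
    return sum(x * x for x in chunk)


def chunked(data, n_chunks):
    """Data decomposition: split data into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, n_workers=4):
    """Submit one task per chunk to a thread pool and combine the partial results."""
    chunks = chunked(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

Replacing ThreadPoolExecutor with ProcessPoolExecutor from the same module gives the process-based variant of the same pattern, which is one way to explore the threads-versus-processes trade-off.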
"Protection of Resources and Thread Safety" by Danilo Piparo
- The problem of synchronization
- Useful design principles
- Replication, atomics, transactions and locks
- Lock-free programming techniques
- Functional programming style and elements of map-reduce
- Third-party libraries and high-level solutions
"Optimizing existing large codebase" by Sebastien Ponce
- How to measure performance: key indicators, tools and their pros and cons
- The nightmare of thread safety
- Data structures for performant computation in modern C++
- What to expect from vectorization of existing code
Track 3: Effective I/O for Scientific Applications
(3h lectures + 2h exercises)
"Many ways to store data" by Sebastien Ponce
- Storage devices and their characteristics
- Data federation
- Parallelizing file storage
- Introduction to the Map/Reduce pattern
"Preserving Data" by Sebastien Ponce
- Risks of data loss and corruption
- Data consistency (checksumming)
- Data safety (redundancy, parity, erasure coding)
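The checksumming and parity items above can be illustrated with a minimal sketch. It assumes SHA-256 as the checksum and a single XOR parity block over equally sized data blocks; both are illustrative choices, not necessarily the schemes covered in the lecture, and the function names are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Consistency: a digest stored alongside the data to detect corruption."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected: str) -> bool:
    """Recompute the digest and compare it with the stored one."""
    return checksum(data) == expected


def xor_parity(blocks):
    """Safety: byte-wise XOR of equally sized blocks.

    XOR-ing the parity block with all surviving data blocks
    reconstructs a single missing block.
    """
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    parity = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


if __name__ == "__main__":
    blocks = [b"abcd", b"efgh", b"ijkl"]
    stored_digest = checksum(blocks[1])
    parity = xor_parity(blocks)

    # Simulate the loss of the second block and rebuild it from the rest.
    recovered = xor_parity([blocks[0], blocks[2], parity])
    print(is_intact(recovered, stored_digest))
```

A single parity block tolerates the loss of one block only; erasure codes generalise the idea to several simultaneous losses.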
"Key Ingredients to achieve effective I/O" by Sebastien Ponce
- Asynchronous I/O
- I/O optimizations
- Caching
- Influence of data structures on I/O efficiency