* How to reach the goal of recurrent mass production with timely schedules
* Run-based approach for data and MC processing - interaction with calibration procedure
* New tools
Machine Learning in Data Processing
- still an open point
- training production strategy is clear and works; it needs a few tweaks
Run-based approach (proposal being written by Valentin)
- main issue comes from the large number of files
- also to allow an improvement in efficiency
- run-wise means:
- data processing is defined by a set of configurations
- produce for a given run all the files and merge them at a certain point
- most likely at trigger level
- merge by type: 1 file per run for data, 1 for atmospheric muons, 1 (or 2 if 2 different light propagators are used) for neutrinos
- checks are done before merging
- if checks fail, no merging is done and the step is rerun
- query all the inputs before the simulation
- raw data
- calibration - which requires its own processing chain?
- Take care of how event weighting and headers are treated
- iRODS upload at the final step of fully successful runs
- Is it possible to merge runs instead of files per run?
- it's a design choice, to be addressed when decisions are made.
- Understand how bookkeeping should be done
- incorporate all tests - to be agreed between DPDQ, Comp&Soft, Simulation and Analysis WG
- Allow for at least 2 ways to run
- GRID (DIRAC?)
- Local (batch on cluster, nextflow?)
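As an illustration of the run-wise scheme discussed above (checks before merging, merging by type within a run, rerun on failure, upload only for fully successful runs), here is a minimal Python sketch. It is not taken from the proposal: the names `ProducedFile`, `plan_run_merges` and `uploadable_runs`, the type labels, and the idea of a single pass/fail flag per file are all assumptions made for the example, and no real merging or iRODS call is performed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProducedFile:
    """One file produced for a run (hypothetical bookkeeping record)."""
    run: int
    kind: str   # assumed labels, e.g. "data", "muon", "neutrino"
    name: str
    ok: bool    # outcome of the checks done before merging


def plan_run_merges(files):
    """Group files by (run, kind) and decide what to merge or rerun.

    A group is merged only if every file in it passed its checks;
    otherwise nothing is merged and the failed files are listed for rerun.
    """
    groups = defaultdict(list)
    for f in files:
        groups[(f.run, f.kind)].append(f)

    to_merge, to_rerun = {}, {}
    for key in sorted(groups):
        members = groups[key]
        failed = [f.name for f in members if not f.ok]
        if failed:
            to_rerun[key] = failed
        else:
            to_merge[key] = [f.name for f in members]
    return to_merge, to_rerun


def uploadable_runs(to_merge, to_rerun):
    """Runs in which every group could be merged (candidates for upload)."""
    failed_runs = {run for run, _ in to_rerun}
    return sorted({run for run, _ in to_merge} - failed_runs)
```

With this layout the bookkeeping unit is the (run, type) pair, which is one possible answer to the bookkeeping question raised above; merging across runs would need a different grouping key.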
Action point
- Valentin is writing a proposal. Discuss it when ready. Comp&Soft Workshop to think about it, too.