Joint Simulation and DPDQ Meeting

Room TBD, Department of Physics
University of Salerno, Via Giovanni Paolo II 132, Fisciano

Alba Domi (University of Amsterdam), Carla Distefano (LNS-INFN), Luigi Antonio Fusco (University of Salerno)
Description

Face-to-face meeting joining efforts from the Simulation and the Data Processing and Data Quality (DPDQ) working groups

The meeting will be hosted by the University of Salerno, and remote connection via Zoom will be available. If the number of registered participants is low, the meeting will proceed as online-only.

Salerno can be reached by plane by flying to Naples and then taking a bus (~1 h) to Salerno, or by going to the Naples train station and catching a train to Salerno (which can be faster and may offer more options in terms of arrival times).

Salerno can also be reached by high-speed train from most Italian cities, so alternative flights are possible (e.g. flying to Rome, followed by a 1.5-2 h train trip).

The University campus is located in Fisciano, which can be reached from the city centre by bus in 30-60 minutes depending on traffic.

Google drive folder for material: https://drive.google.com/drive/u/0/folders/1duv0h7Q1bJfR5uCpBlGlT0cOI3cwh8gx

Zoom Meeting ID: 65853428775
Host: Carla Distefano
    • 10:00–13:00
      Simulation Benchmarks

      Action points:

      • Add generation volume in gSeaGen output
      • Add tau decay mode
      • Add tau polarisation
      • GENIE systematics: check with the Oscillation WG
      • Production-level test of new JSirene developments (ORCA6 and ARCA8)
      • "Task force" on highQE PMTs

       

      Open tasks, need for personpower:

      • Work on muon propagation (Proposal in gSeaGen)
      • Developer for JSirene
      • Study highQE PMTs in ARCA
      • Work on tests using Gedanken experiments

       

      • 10:00
        Effective Volumes 1h

        https://git.km3net.de/working_groups/simulations/-/issues/55

        * Discussion on effective volume / effective area computations

         

        - Summary of the current status

        • GENHEN: generation of vertices in a volume, straightforward computation of the effective volume. gSeaGen: generates starting points over a surface -> the natural figure of merit is an effective area
        • Generalise the computations of both Effective Volumes and Areas. Git Issue: https://git.km3net.de/working_groups/simulations/-/issues/55

         

        - What is needed in the files?

        • Generation volume in the events would make things easy (see the sketch below):
          • Generation Area x Interaction Length. This Interaction Length is, however, dependent on the neutrino cross section
          • The generation cylinder can span both water and rock. How is this treated? Via an effective mass?
          • To be put in the w2list
          • Document effective mass/volume, also per decay mode
            • (using parent/daughter relations - the information must be available, e.g. tau-to-muon, tau-to-electron, tau-to-hadrons)
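
        A minimal sketch of how these quantities could be computed once the generation volume is stored per event; the function and field names below are illustrative assumptions, not the actual w2list layout.

        # Minimal sketch, not official KM3NeT code: effective volume from
        # generation information; all inputs are illustrative assumptions.
        def generation_volume(a_gen_m2, l_int_m):
            # gSeaGen-style surface generation: V_gen ~ A_gen x L_int, where the
            # interaction length L_int depends on the neutrino cross section
            return a_gen_m2 * l_int_m

        def effective_volume(v_gen_m3, n_selected, n_generated):
            # V_eff = V_gen * N_selected / N_generated (GENHEN-style counting)
            return v_gen_m3 * n_selected / n_generated

        # e.g. 0.2 km^3 generation volume, 150 events kept out of 1e6 generated
        print(effective_volume(0.2e9, 150, 1e6))  # -> 30000.0 m^3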

         

        - Differences between generators

        • km3buu will work inside gSeaGen, so generation volumes will be inherited
        • GENHEN: no developments foreseen unless interest arises.
      • 11:00
        Event Kinematics/Cross section/Weighting 1h

        https://git.km3net.de/working_groups/simulations/-/issues/56

        * Generation with modified simulation inputs

        • some information is in the native gSeaGen file; how should it be stored in the data format?
          • some items are not in the gSeaGen structure:
            • tau polarisation: storing the vector
            • tau decay channel (see the effective volume discussion)
          • the new gSeaGen has a tool to convert from the native format to the KM3NeT data format
            • not the GENIE format, the gSeaGen one. The GENIE format can be an output of the processing with a specific option. If one wants more info, the gSeaGen class for writing files should be expanded.
          • GENIE systematics: written in the native GENIE file
            • an interface is present in gSeaGen. Systematics for individual parameters (out of a list) can be stored.
            • how to write them in the data format is open.
            • inputs are needed from the Oscillation Working Group.
        • the original configuration file is maintained, with different versions for each tune.
          • Q: The configuration file is passed via a path: how does this work in a container?
            • it can be redefined in one's own directories.
          • All cross sections (physically or via a symbolic link - <tune_name>.xml) in the same location, so that they can be used directly. Just one name per tune; see the sketch after this list. This works if all configuration files are OK for all the tunes.
            • Otherwise: define the variables not in the container but in the data_processing https://git.km3net.de/common/data_processing/-/issues/113
            • this will depend on what we actually want to do, e.g. different energy ranges.
            • this will clarify the situation, e.g. for the low-energy tune used in mass production
          • Decay mode in the new container to be tested. Understand what can also go into mass production.
            • Note: in the new GENIE version, UnstableParticleDecayer.xml is a separate file
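
        A possible illustration of the one-name-per-tune convention above; the XSEC_DIR variable and the default path are assumptions standing in for whatever the container or data_processing actually defines.

        # Illustrative sketch of the <tune_name>.xml convention; XSEC_DIR and
        # the default location are assumptions, not the real configuration.
        import os

        def xsec_path(tune):
            base = os.environ.get("XSEC_DIR", "/opt/genie/xsec")  # site-redefinable
            return os.path.join(base, tune + ".xml")

        print(xsec_path("G18_02a_00_000"))  # -> /opt/genie/xsec/G18_02a_00_000.xml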

         

        * Cross sections

        • both differential and total cross sections are stored in gSeaGen; with the event (x, y) one could reweight (see the sketch below)
        • GENIE systematics (see above)
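
        A sketch of the reweighting idea, assuming the stored differential cross section can be evaluated at the event (x, y) for both the original and the alternative model; all inputs are hypothetical arrays.

        # Hedged sketch: reweight events by the ratio of differential cross
        # sections evaluated at the stored (x, y); inputs are hypothetical.
        import numpy as np

        def reweight(w_old, dsigma_old, dsigma_new):
            # w_new = w_old * (d2sigma_new/dxdy) / (d2sigma_old/dxdy)
            return np.asarray(w_old) * np.asarray(dsigma_new) / np.asarray(dsigma_old)

        print(reweight([1.0, 2.0], [0.5, 0.4], [0.6, 0.2]))  # -> [1.2 1.]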

         

      • 12:00
        Light and Particle propagation 1h

        * Light generation and particle/light propagation

        - Recent developments

        • delta-rays in JSirene, non-isotropic light emission profile. Still a sizeable difference between KM3Sim and JSirene.
        • output from Gedanken experiments
        • Muon scattering implemented
          • note: need for developers (Valentin on KM3Sim)
        • Ready to go at production level
          • ORCA6 KM3Sim/JSirene production like v7.0.1
          • ARCA8 mass production

         

        * comparison between different software

        • software is available (JDomino at light level, JPizza at generation level, ...)
          • at generation level, errors can already be thrown
          • at light level, histograms can be compared, but this is not yet at a level where errors can be thrown automatically
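
        At the light level, such a comparison could look like the following sketch: a per-bin chi-square between the hit-time histograms of two propagators (e.g. KM3Sim vs JSirene). The arrays are placeholders; the real comparisons run through the Jpp tools named above.

        # Sketch only: per-bin chi2 between two hit histograms; values are
        # placeholders, not actual simulation output.
        import numpy as np

        def chi2_bins(h1, h2):
            h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
            denom = np.where(h1 + h2 > 0, h1 + h2, 1.0)  # avoid division by zero
            return (h1 - h2) ** 2 / denom

        print(chi2_bins([120, 340, 560], [110, 360, 540]).sum())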

         

        * Different PMTs: a specific meeting is needed

        • A homogeneous-detector simulation (ORCA) has been run; the results still need to be examined
        • we can already run Gedanken experiments
    • 15:30–18:00
      Discussion and time for work
    • 10:00–13:00
      Processing benchmarks & Data Quality

      Discussion point: 

      * integrating dynamic calibration into the ML framework

      • 10:00
        Data/MC Comparison & Quality Tools 1h

        * tools to perform data/MC comparisons at trigger and reconstruction level: at trigger level there are JRA and JMRA in Jpp; do they miss any useful plots? At reconstruction level, in which framework would it be best to include this? Which variables should be looked at?

        * Data/MC vs time

        * tools to check the quality of the data and of the MC: for the data there is JRA, which can also be used for MC. For MC, some sanity checks are needed: how many jobs finished/crashed, is the number of generated events OK, is the can size OK, is the number of MC hits reasonable, etc. (see the sketch below)
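
        The MC sanity checks could be collected in a small helper along these lines; the job-summary fields are assumptions, not the actual data_processing bookkeeping format.

        # Sketch of the listed MC sanity checks; all field names are
        # hypothetical placeholders for the real bookkeeping.
        def mc_sanity_issues(job):
            issues = []
            if job["status"] != "done":
                issues.append("job crashed or did not finish")
            if job["n_generated"] != job["n_expected"]:
                issues.append("wrong number of generated events")
            if job["can_radius_m"] <= 0 or job["can_height_m"] <= 0:
                issues.append("bad can size")
            if job["n_mc_hits"] == 0:
                issues.append("unreasonable number of MC hits")
            return issues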

         

        ARCA7/8 - DATA/MC comparisons (Vasileios):

        Studied the effect of cuts on WR and BR events for ARCA6 and ARCA7, looking at atmospheric muon MC files.

        Differences are found, so the question is whether they are expected. They should be, if we only look at atmospheric muons. The differences should be due to the events reconstructed as up-going. To get an answer, it is also necessary to look at the neutrino files.

        Another possible cause is the dynamic calibration, which is available only for ARCA8 and not for ARCA6 and ARCA7. It is suggested to compare the effect of dynamic versus static calibration.

        JShowerFit DATA/MC (Chiara):

        Update on JShowerFit processing (in which 3 steps have been automated) and DATA/MC comparison (better agreement, with a small asymmetry for high-current runs). The work has been done on an ORCA6 test sample.

        We need to split the files into subruns for the data and the atm. muon MC runs. Splitting and re-merging files can be problematic. There were no failures of the new procedure running on the test sample, but checks are needed for the mass productions to prevent and handle possible failures during splitting and merging.

        Chiara suggests a set of safety checks to be done as intermediate steps during the processing.

        JDAQSplit can fail if the file is not on irods; how can this be prevented?

        There are problems with JSF if a subrun is empty; how can this be prevented?

        Suggestion: check not only that the output file is provided but also the number of events in the file, since a file can be present yet contain an empty tree (see the sketch below).
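
        A minimal version of this check using uproot, assuming the event tree is named "E" (the usual aanet convention; adapt the name as needed):

        # Sketch: the output file must exist *and* contain a non-empty tree.
        import os
        import uproot

        def output_ok(path, tree="E"):  # tree name "E" is an assumption
            if not os.path.exists(path):
                return False
            with uproot.open(path) as f:
                return tree in f and f[tree].num_entries > 0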

        Are all the files in the run list on irods? Some MC files may be missing (e.g. they cannot be processed because the calibrations are missing). Missing files are documented.

        Running time remains high, with an average duration of 12 hours. A speed-up of the code is expected, which would also allow a reduction in the number of subruns, mitigating the splitting issues.

        We have DQ tools to run in the pre-analysis phase that list runs based on run id.
        It is also required to add step-by-step checks in the data_processing. Each code must provide the checks to be implemented.

         

        • ARCA7/8 & ARCA6 commissioning period - DATA/MC comparisons 20m
          Speaker: Vasileios Tsourapis
        • JShowerFit DATA/MC 20m
          Speaker: Chiara Filomena Lastoria (Aix-Marseille Université, CNRS/IN2P3, CPPM, Marseille (France))
      • 11:00
        DQ parameter thresholds 1h

        General DATA/MC strategy to set the DQ parameter thresholds:

        * start with "bad runs" and perform data/MC comparisons as a function of each DQ parameter to check its threshold.

        * Vasilis presented a preliminary investigation with ARCA6 and the OOS parameter -> to be updated with ARCA8.

        * Valentin showed a preliminary investigation for ORCA6 and HRV.

         

        Use of a general tool for DATA/MC, MC/MC and DATA/DATA comparisons that can be included in the DP chain:

        * Valentin presented a tool written in PyROOT and tested on ORCA. The tool creates a list of histograms with a set of parameters to check the run-by-run data/MC comparison and builds a correlation matrix of the parameters (see the sketch below).
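
        The correlation-matrix step could be as simple as the following sketch; the parameter columns are illustrative, not the tool's actual list.

        # Sketch: run-by-run DQ parameters as columns, correlation via numpy.
        import numpy as np

        dq = np.array([  # rows = runs; columns e.g. OOS fraction, HRV, mean rate
            [0.01, 0.12, 1.00],
            [0.03, 0.15, 0.98],
            [0.20, 0.40, 0.75],
        ])
        print(np.corrcoef(dq, rowvar=False))  # 3x3 parameter correlation matrix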

        Proposals:

        - adapt the tool to use JMRA histograms

        - add the tool in the data_processing

        - use the tool to set the DQ parameter thresholds

        - add OOS parameter in the tool

        - extend the use of the tool to ARCA

         

        It has been suggested to add DATA/DATA and MC/MC comparisons to better understand the impact of OOSs in the data.

         

      • 12:00
        Online check of Data Quality 1h

        * plots in the webpages used by shifters: are they all useful? should we remove some plots? should we add any new plots?

        * Communication with DPDQ group if issues are spotted.

        * Is the list of issues included in the weekly shift report the best way to keep track of problems? Discuss a possible webpage with list of issues for the runs.

         

        It has been suggested to explore the xwiki functions to create a page with the run-by-run issue info starting from the elog entries.

        In the case of an alert, can we run JDataQuality at run time on an event-by-event basis? Not at the moment; the proposal is to launch the DQ tools at the end of the run without querying the DB.

        ARCA19 online monitoring: the increase in the number of DOMs requires updates to the plots. Volunteers are needed to support Tamas in this work.

         

         

    • 15:30–18:00
      Discussion and time for work
    • 10:00–15:05
      Data processing and simulation interaction
      • 10:00
        Data Processing & Simulation documentations 20m

        * DP documentation: can it be improved? how? volunteers?

        * Simulation documentation: are all the simulation tools well documented? if not, which should be improved? volunteers?

         

        3 layers of documentation pertain to the DP WG

        1. Codes

        Code maintainers should check that their codes are properly documented

        Action points - can be started at the Comp&Soft WG workshop:

        • it should be checked that the documentation is still valid
        • newbie pages should point to that simulation git

        2. Data processing scripts

        • Can the README be updated?
        • collect feedback from users (Luigi)
        • Provide a processing workflow diagram and have a description of how to make it work in the data_processing
        • Documentation should be close to the code -> git

        Action point - next 2 weeks:

        • collect feedback from data_processing users, and send it to the Comp&Soft WG

        3. Production Output

        • git issue to report problematic files, also with txt files to be linked
          • tables reporting summary of issues
        • check on individual steps of the data processing to be implemented (see Wednesday)
          • have direct feedback on list of files and usability of production in the form of a txt file
        • plots accompanying the production
          • store them on disk, link them on the wiki pages
          • discussed on Wednesday
        • versioning of the production (see later)
      • 10:20
        Software development interaction with data processing 40m

        How to improve the passage from developments to production

        1. Tests in the git CI/CD pipelines

        • unit tests
        • benchmark tests
        • functionality tests
          • all codes should have some testing implemented; the level of detail varies from code to code. In theory, unit tests should also be provided.
          • There are some codes that do not have tests - to be verified: MUPAGE and KM3Sim?

        2. Pre-release tests

        • before tagging a version, there should be a large-enough-scale test of real application scenarios
        • Large statistics, not too many files
        • Do it on real detectors (e.g. the most recent production, in order to allow comparison) and ideal detectors.
          • For example, a few files of ORCA6, ARCA6, ORCA115, ARCA115. Full-chain simulation (or data).
          • need to define a "clean" simulation scheme with standard fixed inputs, and tests to be performed at the end.
          • If a major release -> require the complete simulation to be run
          • if a minor or patch release -> basic chain (e.g. re-run only the specific step that was changed and check); see the sketch below
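
        A sketch of this release-dependent test scope, assuming semantic versioning; the scope labels are placeholders for the test sets still to be agreed.

        # Hedged sketch: choose the test scope from the version bump.
        def test_scope(old, new):
            if new.split(".")[0] != old.split(".")[0]:
                return "complete simulation"           # major release
            return "basic chain: re-run changed step"  # minor or patch release

        print(test_scope("14.2.1", "15.0.0"))  # -> complete simulation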

        2.1 Pre-mass processing tests

        • define a list of runs (that are OK from the DQ standpoint) on which full simulation and data processing are performed.
          • with real "mass production" per-file statistics and inputs
        • provide data/MC comparisons
          • ideally, all bugs should be discovered in the pre-release software tests via the full chains and validation

        3. Mass production

        • versioning (see later)
        • add "run 0" MC (ideal conditions production)

        Action points

        - complete the simulation/documentation git

        - write down explicitly the testing strategy and agree with software developers on details

        - prepare "run 0" set

      • 11:00
        Software versions and Data Processing 1h

        Discussion on software versioning and data processing versioning

         

        In an ideal world:

        - mass production at fixed times in the year (twice per year, e.g. April and October)

        - software is frozen only once validated by the testing phase described above - this corresponds to a version

        - the test production (the last step of preparation) should not be the place where bugs and problems are found; only data_processing should be modified there. The test production is not the software test: that is accomplished in step 2 of the software development tests described above

        - If a problem is found: only solve it, do not add additional modifications

        - the software tests should be reported to the DPDQ working group

        - Software releases and mass processing are decoupled processes.

        - GANTT

         

        Action point:

        - write down all the above in a reasonable way (Luigi). Then discuss it

      • 12:00
        Future developments for data processing 1h

        * How to reach the goal of recurrent mass production with timely schedules

        * Run-based approach for data and MC processing - interaction with calibration procedure

        * New tools

         

        Machine Learning in Data Processing

        • still an open point
        • the training-production strategy is clear and works - needs a few tweaks

        Run-based approach (proposal being written by Valentin)

        • the main issue comes from the large number of files
        • also to allow an improvement in efficiency
        • run-wise means (see the sketch after this list):
          • data processing is a set of configurations
          • produce all the files for a given run and merge them at a certain point
            • most likely at trigger level
            • merge by type: 1 file per run for data, 1 for atmospheric muons, 1 (or 2, if 2 different light propagators are used) for neutrinos
            • checks are done before merging
            • if things fail, no merging is done and the step is rerun
          • query all the inputs before the simulation
            • raw data
            • calibration - which requires its own processing chain?
          • Take care of how event weighting and headers are treated
          • irods upload as the final step of fully successful runs
        • Is it possible to merge runs instead of files per run?
          • it's a design choice, to be addressed when decisions are made.
        • Understand how bookkeeping should be done
        • incorporate all tests - to be agreed between the DPDQ, Comp&Soft, Simulation and Analysis WGs
        • Allow for at least 2 ways:
          • GRID (DIRAC?)
          • Local (batch on a cluster, nextflow?)
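
        A schematic of the run-wise logic described above; all function names are placeholders, not data_processing APIs.

        # Sketch: per-run production, checks before merging, rerun on failure.
        def process_run(run_id, producers, check, merge, rerun):
            outputs = {kind: produce(run_id) for kind, produce in producers.items()}
            failed = [kind for kind, files in outputs.items() if not check(files)]
            if failed:
                for kind in failed:
                    rerun(run_id, kind)  # no merging if any step failed
                return None
            # one merged file per run and per type (data, atm. muons, neutrinos)
            return {kind: merge(run_id, files) for kind, files in outputs.items()}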

        Action point

        - Valentin is writing a proposal. Discuss it when ready. Comp&Soft Workshop to think about it, too.