KM3NeT ML meeting
A big thanks to Mehdy B. for the minutes!
Lukas Hennig
Need to produce a new MC dataset in order to benchmark our new ML models
Test the robustness of data/MC agreement and improve it
IceCube has this kind of test
Two papers from IceCube, one of them on GNNs was published in Science
Data/MC agreement tests: why are they important for our models?
Models might be sensitive to MC mismodelling, so we need to pay extra attention with neural-network models
Problems:
- Multivariate disagreements: some of the disagreements are only visible when looking at all the variables together
- Use the usual metrics for classification problems (ROC curves, AUC); an AUC close to random guessing indicates agreement (see the sketch after this list)
- Variable importance is discussed, but what kind of importance, and with respect to which scores?
Provide additional features describing the pulses
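A common way to make such a multivariate data/MC test concrete is to train a classifier to separate real data from MC events and check how far its AUC is from random guessing. The sketch below illustrates the idea; it is not code shown in the talk, and the features are placeholders.

```python
# Minimal sketch of a multivariate data/MC agreement test: train a classifier to
# separate real data from MC; an AUC close to 0.5 (random guessing) means the
# classifier cannot tell them apart, i.e. good agreement across all variables at once.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def data_mc_agreement_auc(X_data: np.ndarray, X_mc: np.ndarray) -> float:
    """Return the cross-validated AUC of a data-vs-MC classifier (0.5 = agreement)."""
    X = np.vstack([X_data, X_mc])
    y = np.concatenate([np.ones(len(X_data)), np.zeros(len(X_mc))])  # 1 = data, 0 = MC
    clf = GradientBoostingClassifier()
    scores = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, scores)

# Toy check: identical distributions should give an AUC close to 0.5.
rng = np.random.default_rng(0)
print(data_mc_agreement_auc(rng.normal(size=(1000, 5)), rng.normal(size=(1000, 5))))
```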
Questions:
- Change calibrations? Yes, this can be done.
- Do we understand the differences between the real data and the Monte Carlo data? How can we understand our Monte Carlo modelling at the graph level?
The outputs of the GNN come from the data and not from the MC simulation.
The different systematics tell us whether a model is sensitive to certain features; for instance, the CNN is not sensitive to time features.
- How much statistics do we use? Should the standard ML production be bigger, since we need statistics for the models?
Data generation proposal for a first assessment:
- 1 standard ML prod
- 1 ML prod with +10% PMT efficiency
- 1 ML prod with -10% PMT efficiency
- 1 ML prod with +10% light absorption
- 1 ML prod with -10% light absorption
- 1 ML prod with wrong time calibration
Ivan Mozún
Questions:
- The transferred model achieves a 20% better AUROC than a model trained from scratch.
- Different versions of the MC: are you aware there was a bug in the data?
- Santiago: do you know what kind of effect you should get?
- Slide 7: the black line is the ROC-AUC achieved with the transformer on ORCA115. With the fine-tuned model you can achieve very good track/shower performance when training on ORCA6.
- Weight the events by energy.
- Davit: the number of hits differs between configurations, so how do you handle the sequence length? The sequence is fixed to 300 hits; the distribution of hits was examined and 300 was chosen because it was optimal (see the sketch after this list).
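For illustration, a minimal sketch (not Ivan's actual code) of fixing the hit sequence length at 300: longer events are truncated and shorter ones are zero-padded, with a mask marking the real hits. The simple truncation rule used here is an assumption.

```python
import numpy as np

MAX_HITS = 300  # chosen from the hit-multiplicity distribution, per the discussion

def fix_sequence_length(hits: np.ndarray, max_hits: int = MAX_HITS):
    """Pad/truncate one event's (n_hits, n_features) array to (max_hits, n_features)."""
    n_hits, n_features = hits.shape
    padded = np.zeros((max_hits, n_features), dtype=hits.dtype)
    mask = np.zeros(max_hits, dtype=bool)
    n_keep = min(n_hits, max_hits)
    padded[:n_keep] = hits[:n_keep]  # truncate events with more than max_hits hits
    mask[:n_keep] = True             # True where a real hit is present
    return padded, mask

# Example: a toy event with 12 hits and 4 features (e.g. x, y, z, time).
padded, mask = fix_sequence_length(np.random.rand(12, 4))
print(padded.shape, int(mask.sum()))  # (300, 4) 12
```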
Santiago
Questions:
- How do you plan to implement domain adversarial training? (A sketch of one standard approach follows below.)
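For reference, a minimal sketch of the standard gradient-reversal-layer approach to domain adversarial training (Ganin et al., DANN). This is one common way to implement it, not necessarily the plan discussed; the network sizes and names are placeholders.

```python
# The feature extractor is trained so a domain classifier cannot tell MC from
# real data, while the task head (e.g. track/shower) is trained on labelled MC.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # reverse (and scale) the gradient

class DANN(nn.Module):
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.task_head = nn.Linear(64, n_classes)  # trained on MC labels
        self.domain_head = nn.Linear(64, 2)        # data vs MC

    def forward(self, x, lambd: float = 1.0):
        z = self.features(x)
        return self.task_head(z), self.domain_head(GradReverse.apply(z, lambd))

# Both cross-entropy losses are summed; the reversed gradient pushes the
# feature extractor towards domain-invariant representations.
model = DANN(n_features=16, n_classes=2)
task_logits, domain_logits = model(torch.randn(8, 16))
```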
Questions & Answers:
- Ivan: GraphNeT is more flexible; the only thing we would keep from the Orca side is km3pipe, but even that is not necessary because we can use uproot and tools from km3io or OrcaSong.
- Lukas: GraphNeT is better than OrcaNet because of its larger user community and more tools. A master's student who had to do a few-month research project was given the task of implementing tau neutrino identification with GraphNeT, and he did pretty well.
- ECAP / Jutta: Would like to advocate for a writer from GraphNeT to the KM3NeT internal format.
- Antonin: Maintenance of km3pipe and km3io is needed, but we also need to find people within the group to help maintain those packages.
- Ivan: GraphNeT works only for supervised learning; it is easy to take utilities from GraphNeT, such as data loading or data featurization, and use them for KM3NeT.
- Lukas said he has a master's student who is trying to reproduce what you have done; he is not trying to optimize it and works with the hyperparameters that Lukas has found. When he has finished his report, he will make it available to the collaboration.
- Santiago to Jutta: If we go the way of implementing our own readers or writers, should we make these tools open?
- Jutta: GraphNeT uses an SQLite format; this would make our tools more understandable and more accessible.
- Ivan: For GraphNeT they take the ROOT files directly and transform them into GraphNeT files. There is a repo where we have our own branch where everything is implemented; it is the KM3NeT branch of the repo.
- Jutta: We can maintain our own version of GraphNeT, but it would be unofficial; what would be great is for our changes to be taken into the official version of GraphNeT. A way forward would be to maintain the readers and writers as a KM3NeT package, so that we control their version and the version of GraphNeT they are compatible with.
- Ivan: It is not easy to keep track of changes both in GraphNeT and in KM3NeT and to get the KM3NeT changes into the official branch of GraphNeT. So yes, it would be better to do it ourselves.
- Jutta: Go from ROOT file to SQLite file with nothing in between (a rough sketch of such a conversion is appended at the end of these minutes). We are the only ones using ROOT files. Maintaining something that is not part of the collaboration will be trickier than doing it ourselves.
- Ivan: There is a mix of dependencies in all the possibilities.
- Jutta: If it is our own writer, we can handle the dependencies more easily.
- Antonin: GraphNeT seems a good way to go, to lower the barrier for starting and developing. Pause the meetings now and restart the ML meetings in the second half of August or early September. Aim to have an update on rolling out MLflow then and to discuss it more widely with people, but it is looking good.
- Jutta: End-of-August meeting with IceCube, thinking of using this method for neutrino oscillations, or machine learning and data formats. It should be on the last Friday of August.
- Antonin: There is an AI summer school that week in Caen, but we can prepare the meeting so that ML and data formats are discussed. Will try to be available in the afternoon.
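Appendix to the readers/writers discussion: a rough sketch of the "ROOT file to SQLite with nothing in between" idea, using uproot and pandas. This is not the GraphNeT converter, and the tree and branch names below are placeholders, not the real KM3NeT schema.

```python
import sqlite3

import pandas as pd
import uproot

def root_to_sqlite(root_path: str, sqlite_path: str,
                   tree_name: str = "E", branches=("dom_id", "t", "tot")) -> None:
    """Dump selected branches of one ROOT tree into an SQLite table named 'hits'."""
    with uproot.open(root_path) as f:
        df = f[tree_name].arrays(list(branches), library="pd")  # pandas DataFrame
    with sqlite3.connect(sqlite_path) as con:
        df.to_sql("hits", con, if_exists="replace")

# Usage (placeholder paths):
# root_to_sqlite("run.offline.root", "run.db")
```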