Topics likely to be discussed at PHYSTAT's "Statistics meet Machine Learning" include:

**Statistical tests we should require of ML**

What are the statistical checks that should be performed when using ML procedures? Are they different for classification, regression, anomaly detection, etc.?

**Generative models for many fast simulations from few full ones**

What do we gain by using ML to learn from a few fully simulated events how to generate a large number of events quickly?

For example, suppose we believe some (x, y) data should lie on a straight line and are interested in the gradient. With difficulty we do a full simulation of 4 (x, y) points. Our ML procedure learns from these 4 points how to generate new data, and produces 1000 new (x, y) pairs. The statistical uncertainty on the gradient is greatly reduced, but there is a large systematic related to the particular choice of the initial 4 points. Is there anything useful we can learn from the larger sample? Are generative methods different from this in some subtle way?
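The straight-line scenario can be made concrete with a deliberately crude sketch. Here the "generative model" is just "sample from the line fitted to the 4 original points" — an illustrative stand-in for an ML generator, not any specific method discussed above; all numbers (true slope 2, noise 0.5, the seed) are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# The 4 expensive "fully simulated" points: truth is y = 2x + Gaussian noise.
true_slope, sigma = 2.0, 0.5
x4 = np.array([1.0, 2.0, 3.0, 4.0])
y4 = true_slope * x4 + rng.normal(0.0, sigma, size=4)

def fit_line(x, y):
    """Least-squares straight-line fit; returns slope, intercept,
    residual scatter, and the statistical uncertainty on the slope."""
    xm, ym = x.mean(), y.mean()
    sxx = np.sum((x - xm) ** 2)
    slope = np.sum((x - xm) * (y - ym)) / sxx
    intercept = ym - slope * xm
    resid = y - (slope * x + intercept)
    scatter = np.sqrt(np.sum(resid ** 2) / (len(x) - 2))
    return slope, intercept, scatter, scatter / np.sqrt(sxx)

slope4, icept4, scatter4, err4 = fit_line(x4, y4)

# The "generator": sample 1000 fast events from what was learned,
# i.e. literally the fitted line plus the fitted residual scatter.
x1k = rng.uniform(1.0, 4.0, size=1000)
y1k = slope4 * x1k + icept4 + rng.normal(0.0, scatter4, size=1000)
slope1k, _, _, err1k = fit_line(x1k, y1k)

print(f"4-point fit:    slope = {slope4:.3f} +/- {err4:.3f}")
print(f"1000-point fit: slope = {slope1k:.3f} +/- {err1k:.3f}")
# The quoted statistical error shrinks, but the 1000-point slope can only
# wander around slope4: the tie to the original 4 points remains.
```

The quoted uncertainty on the gradient indeed drops by roughly the expected factor, yet the large sample contains no information beyond the 4 points it was generated from — which is exactly the systematic in question.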

**Interpretability**

There is the probably apocryphal story of an ML classifier learning to distinguish cats from dogs because, in the training sample, all the cats were photographed curled up on living-room couches, while the dogs were running outdoors in fields.

How do we ensure that the distinction between, say, signal and background is based on significant features in the data, rather than on the particular way that soft particles are simulated?

Can interpretability help us diagnose this? Is it important for our methods to be interpretable, or is it enough just to check their properties? Is interpretability becoming an unrealistic goal? Just as we would not expect a 10-year-old to understand how a single-hidden-layer NN works, why should a very sophisticated ML procedure be interpretable by a mere human physicist?

**Ambiguities**

Often, interpreting the underlying features of an event in terms of the observed particles can be ambiguous. For example, in events with two top quarks, each decaying via the sequence t -> bW, W -> mu + neutrino, there are 6 unknowns (the components of each neutrino's 3-momentum) and 6 constraints; because some of the constraints are quadratic, we can get 4 different solutions. ML procedures can distinguish among them, but how are they doing this? Are they using extra information, not used by the analytic solutions, or is it via the training samples we are using (see next point)?

**Relevance of training samples**

Imagine we are training an ML procedure to learn the solutions of the equation a^2 + b^2 = c^2, i.e. we give it a and b, and it should provide c. Our training examples consist of sets of (a, b, c). Then we give it a = 3 and b = 4, and are happy when it says c = 5. But what about c = -5? Presumably which one it gives us depends on the training samples.
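A tiny numerical sketch of this point, using a 1-nearest-neighbour lookup as a deliberately simple stand-in for an ML procedure (the training ranges and seed are arbitrary assumptions): whichever sign branch the training labels use is the branch the "learner" returns for (3, 4).

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Training triples (a, b, c): every label here uses the +sqrt branch.
a = rng.uniform(0.0, 10.0, size=500)
b = rng.uniform(0.0, 10.0, size=500)
c_pos = np.hypot(a, b)
c_neg = -c_pos          # the equally valid -sqrt branch

def predict(c_train, a_q, b_q):
    """Return the label of the training point nearest to the query."""
    d2 = (a - a_q) ** 2 + (b - b_q) ** 2
    return c_train[np.argmin(d2)]

print(predict(c_pos, 3.0, 4.0))   # close to +5: positive-branch training
print(predict(c_neg, 3.0, 4.0))   # close to -5: negative-branch training
```

The two "trained" models agree perfectly on their own training data and disagree maximally on the sign of every prediction — the ambiguity is resolved entirely by the convention baked into the training sample.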

How do we decide whether we are immune from such problems in the more complicated 2-neutrino example above?

**Mis-modelling of training samples**

Given that our ML procedures are capable of finding multi-dimensional anomalies that may not show up in lower dimensions, when we check that our simulations are a good enough approximation to reality, do we have to look at all possible multi-D distributions?

**Systematic from specific ML procedure**

Do we have to worry about assigning a systematic uncertainty to our result because of the specifics of our ML procedure (e.g. the particular architecture and hyper-parameters we are using, failure to find the global optimum, etc.)? Or do such effects merely result in somewhat degraded performance, which we account for in our analysis?

There is also the more general question of how the magnitudes of systematic effects are estimated when using an ML procedure.
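One commonly used recipe — offered here as a sketch of an assumption, not something the text above prescribes — is to redo the analysis while varying an arbitrary procedural choice and quote the spread of results as a systematic. In this toy, the "arbitrary choice" is the polynomial degree of a fit (the data, seed, and evaluation point are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Pseudo-data: the truth is a straight line, y = 2x + noise.
x = np.linspace(0.0, 4.0, 40)
y = 2.0 * x + rng.normal(0.0, 0.3, size=x.size)

# Redo the analysis for each choice of degree; look at the spread
# of the resulting predictions at a reference point x = 2.5.
preds = [np.polyval(np.polyfit(x, y, deg=d), 2.5) for d in (1, 2, 3)]

central = preds[0]
syst = max(preds) - min(preds)   # crude envelope over the choices
print(f"prediction at x = 2.5: {central:.3f} +/- {syst:.3f} (syst)")
```

Whether such an envelope is a defensible estimate of the ML-specific systematic — or merely a measure of how unstable the procedure is — is precisely the open question.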

**Checking that training data ‘covers’ multi-D data**

How do we check that all our data fall within the training samples in multi-dimensional space?

**Level of data**

What are the pros and cons of giving our ML procedure low-level information (e.g. the basic information from our detector) as input, rather than higher-level variables which we think might be helpful for our analysis?

And do we have to be careful about feeding our ML procedure irrelevant information, which might degrade its performance and/or increase the training time?
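That irrelevant inputs can genuinely hurt is easy to demonstrate with a minimal sketch (a leave-one-out 1-nearest-neighbour classifier on invented Gaussian data — the class separations, dimensionalities, and seed are all assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Two well-separated classes in 2 informative dimensions.
n = 200
X = np.vstack([rng.normal(+2.0, 1.0, size=(n, 2)),
               rng.normal(-2.0, 1.0, size=(n, 2))])
labels = np.array([1] * n + [0] * n)

def nn_accuracy(X, labels):
    """Leave-one-out 1-nearest-neighbour classification accuracy."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)      # a point may not vote for itself
    return np.mean(labels[np.argmin(d2, axis=1)] == labels)

acc_clean = nn_accuracy(X, labels)

# Append 100 pure-noise features: no information added, distances degraded.
X_noisy = np.hstack([X, rng.normal(0.0, 1.0, size=(2 * n, 100))])
acc_noisy = nn_accuracy(X_noisy, labels)

print(f"accuracy, 2 informative features:       {acc_clean:.2f}")
print(f"accuracy, plus 100 irrelevant features: {acc_noisy:.2f}")
```

The noise features carry no information, yet they dominate the distance computation and wash out the two informative dimensions — a distance-based method has no built-in way to ignore them.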

**Types of networks, Architecture, Loss functions, etc.**

Is it obvious which type of network is best for a particular type of analysis problem? What is the role of the loss function in determining the performance of an ML procedure? And similarly for the different possible activation functions for the nodes of a neural network.
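That the loss function alone can change the answer is a standard result, shown here numerically as a sketch: fitting a single constant to skewed data, the squared-error optimum is the sample mean, while the absolute-error optimum is the sample median (the exponential data, grid, and seed are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(seed=4)
data = rng.exponential(scale=1.0, size=10_000)   # skewed distribution

# Fit a single constant c to the data under two different losses,
# by brute-force scan over a grid of candidate values.
grid = np.linspace(0.0, 3.0, 3001)
l2_loss = [np.mean((data - c) ** 2) for c in grid]
l1_loss = [np.mean(np.abs(data - c)) for c in grid]

best_l2 = grid[np.argmin(l2_loss)]
best_l1 = grid[np.argmin(l1_loss)]
print(f"squared-error optimum:  {best_l2:.3f} (sample mean   {data.mean():.3f})")
print(f"absolute-error optimum: {best_l1:.3f} (sample median {np.median(data):.3f})")
```

The same data and the same model class give materially different answers depending on the loss — a one-dimensional hint of how much the loss can shape what a full ML procedure learns.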