IML meeting: August 25, 2016
Peak people on Vidyo: 43
Peak people in the room: 28

Sergei: Intro and news
- Next meeting: Sept 29
- Likely topic: software and tools updates
- New ROOT release with large ML updates via TMVA, on a similar timescale to the next meeting
- QCHS statistics session next week
- Today: unsupervised learning

Oksana: Unsupervised learning and GeantV
- Supervised learning: generates a function that maps labeled inputs to desired outputs
- Semi-supervised: combines both labeled and unlabeled examples
- Unsupervised: only unlabeled examples
- Unsupervised learning is about finding hidden structure in unlabeled data
- PCA: principal component analysis:
  - Three uses: pre-processing for empirical modeling, data compression, noise suppression
  - Idea: the least important principal components are noise to be suppressed
  - Projection of the data matrix onto the basis of eigenvectors
- KPCA: transform the dataset to another high-dimensional space, then perform PCA on the data in that space
  - Example with circles: linear PCA doesn't separate them, but applying a kernel and then PCA separates them well (see the sketch at the end of this section)
- Uncentered PCA: instead of the covariance matrix, use the matrix of non-central second moments
- Clustering: the unsupervised learning task of organizing or partitioning data into groups
  - Partitioning algorithms, hierarchy algorithms, density-based, grid-based, model-based
  - Across all these approaches, one must determine how to select the best method: very different parameters and use cases
  - Very nice example of the different methods applied to several datasets
- Denoising autoencoder: captures the structure of the data
  - The output of the network tries to reconstruct the input
- Black-box optimisation: search to minimize a function whose analytical form is not known
- Evolutionary algorithms: old and slow, but can be used for any type of problem
  - Ideal for brute force; good for NP-complete problems
- New genetic algorithm operator: a noise-reduction genetic operator based on PCA
  - Using this, results start to emerge earlier; demonstration in the slides
- GeantV: next generation of simulation software
  - Geant4 or new, improved physics models
  - Performance 2 to 5 times greater than Geant4
  - Full simulation and fast simulation options
  - Portable and usable by all
  - Needs optimization: in parallel running there is both working and waiting time, and waiting must be minimized
- Studied GeantV tuning as a black-box optimisation task, especially for HPC sites
- Performance optimisation of large-scale jobs could be simplified using ML
- Question (Vidyo): what is your cost function?
- Oksana: In the case of GeantV, FPU utilisation is the main function, based on standard performance metrics
- Question (Steven): any quantitative estimate of improvements?
- Oksana: Mostly conceptual at the moment; on a single node, 5% seems definitely possible
- Question (Sergei): nice overview of methods; have you looked at what tools are available for these methods?
- Oksana: Using our own mini-libraries mixed with existing software
- Question (Sergei): more details on PCA applied to the GA?
- Oksana: Skipping the noise helps a lot, as a GA usually gets stuck in noise minima at the beginning
  - Adding this increases the convergence rate quite significantly
  - GAs are an old-style approach, but combining them with other approaches could make them really work
- Question (Sergei): evolution of the GeantV RunManager slide, details on the tree?
- Oksana: Each propagator carries some amount of separate threading functionality
- Question (Room): coming back to the metrics, most require code restructuring or parameter changes; does testing candidates require re-compiling for each epoch?
- Oksana: There is a dynamic set of parameters where you can just change arguments and it changes the behaviour of the simulation
  - Currently 4-5 threads with different geometry; more possibilities to change parameters are coming
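The circles example above is easy to reproduce with scikit-learn; below is a minimal sketch (the dataset parameters and the RBF kernel width are illustrative choices, not taken from the talk):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: not linearly separable, so plain PCA cannot help.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA just rotates the data; the two classes still overlap radially.
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel implicitly maps to a high-dimensional
# feature space first; there the two circles become linearly separable.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

# Compare the class means along the first component of each projection:
# near-identical for PCA, clearly separated for kernel PCA.
print("PCA:       inner", X_pca[y == 1, 0].mean(), "outer", X_pca[y == 0, 0].mean())
print("KernelPCA: inner", X_kpca[y == 1, 0].mean(), "outer", X_kpca[y == 0, 0].mean())
```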

Nikita: Clustering with gradient descent
- Clustering: the task of grouping a set of objects such that objects in the same group (cluster) are more similar to each other than to those in other groups
- There is no objectively correct clustering algorithm
- However, what if you know what makes a good cluster for your problem?
- Switching from clustering to continuous optimization:
  - Write down your loss as a function of the cluster assignments
  - Assume each object has a probability to be in each cluster, and write down the expectation of the loss as a function of these probabilities
  - If calculating the precise expectation is difficult, use an approximation that is continuous and reduces to the exact loss for integer probabilities
  - Handling the boundary conditions (probabilities in [0,1], summing to one): use the softmax trick from deep learning (a toy version is sketched after this section)
- Used Theano for this work: super fast, transparent GPU support, symbolic differentiation support
- LHCb streams and processing pipeline:
  - An event is selected if it passes at least one HLT line
  - Lines are grouped into streams
  - The data format only supports sequential access, and tasks can only be launched on entire streams
  - The smaller the streams, the faster they run
  - Events are duplicated if they fall in multiple streams, so more streams means storing more events
  - There is a limit on the number of streams from the computing side
- Quality metric:
  - Assume each stream is read as many times as it has lines, and each read scans all N events in the stream
  - Time is thus proportional to (N events in stream) * (N lines in stream), summed over streams
- Shows large gains for optimized streams compared to streams built by grouping physically similar lines
- Adding more constraints: some lines should go together because they are often used together, and people expect to find them in the same stream
- The method makes it possible to solve the clustering problem for a wide range of loss functions
- The method was tested on a sample of 2016 LHCb HLT data
  - Potential decrease of IO time by 20% compared to the baseline while maintaining analysis/computing constraints
- Question (Steven): How much further can this 20% be tweaked?
- Nikita: Slide 16 shows a spread; probably we can do more
  - Next step is to measure and find a better way to approximate reality, then repeat the exercise
- Question (Sergei): you mentioned Theano, have you tried any other tools?
- Nikita: No comparison with other frameworks; took Theano as the state-of-the-art framework
  - Did compare different algorithms within Theano: tried SGD/AdaMax/etc, and AdaMax was the best
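A toy version of the relaxation described above, in plain numpy/scipy rather than Theano (a generic optimizer with numerical gradients stands in for Theano's symbolic differentiation; the line/stream/event counts and firing rate are made up). Soft stream assignments come from a softmax, and the relaxed cost reduces to the exact cost for hard 0/1 assignments:

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup: n_lines HLT lines to be grouped into n_streams streams.
# selected[l, e] = True if line l fired on event e (random stand-in data).
rng = np.random.RandomState(0)
n_lines, n_streams, n_events = 12, 3, 500
selected = rng.rand(n_lines, n_events) < 0.1

def softmax(w):
    e = np.exp(w - w.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def expected_cost(w_flat):
    # p[l, s] = probability that line l is assigned to stream s.
    p = softmax(w_flat.reshape(n_lines, n_streams))
    exp_lines = p.sum(axis=0)                       # expected lines per stream
    # An event is written to a stream if at least one of its lines is there:
    # P(event e in stream s) = 1 - prod_l (1 - p[l, s] * selected[l, e]).
    miss = 1.0 - p[:, :, None] * selected[:, None, :]
    exp_events = (1.0 - miss.prod(axis=0)).sum(axis=1)
    # Continuous stand-in for: time ~ sum_s N_events(s) * N_lines(s).
    return float(exp_events @ exp_lines)

res = minimize(expected_cost, 0.1 * rng.randn(n_lines * n_streams),
               method="L-BFGS-B")
hard = softmax(res.x.reshape(n_lines, n_streams)).argmax(axis=1)
print("stream per line:", hard, " relaxed cost:", res.fun)
```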

Enric: SWAN, Service for Web-based ANalysis
- CERN service created as a collaboration between EP-SFT and IT-ST
- Built on Jupyter notebook technology
  - Can include multiple languages
  - Comments and text inline with the code, and the code can also be run inline
  - Content can be plain text, interactive images, etc: one document combining everything
- The idea is to produce a notebook that tells a story and share it with others so they can see what you did
  - They can then take it, modify it, and share it with others too
- SWAN is a service that delivers Jupyter notebooks on demand
  - Only a web browser is needed; the user is prompted with the Jupyter notebook interface
  - Enables platform-independent analysis
  - Everything happens in the cloud and runs on the server side; nothing is installed on your machine
  - Results produced when you run commands are also in cloud storage
  - Can be used for teaching: students can run and complete exercises, then show you the results
- ROOT and TMVA are provided in SWAN, but SWAN is not limited to any specific tool
  - Integration with R, Python, ...
- ROOT has been fully integrated with the Jupyter technology
  - ROOT C++ kernel, Python kernel, JavaScript interactive visualization, other goodies (tab completion, etc)
  - Full TMVA integration is ongoing, to be presented at the next IML meeting
- SWAN relies on production technologies at CERN
  - SSO authentication
  - Infrastructure of virtual machines in the OpenStack cloud
  - Software distribution via CVMFS, managed by EP-SFT
  - Storage access via CERNBox and EOS, so all experimental data is potentially available
  - Also some external technologies: JupyterHub and Docker
- Software environment: strategy to configure the environment
  - Docker: a single thin image, not managed by the user, to distribute the session
  - CVMFS: configurable environment via "views", the main software location
  - CERNBox: custom user environment, non-CVMFS software
  - The environment can be configured to point to your own software
- Software environment for ML:
  - TMVA up to date with new features
  - Scikit-learn already there
  - Spark MLlib: tutorial by IT-DB next week
  - Open to incorporating new ML libraries into CVMFS, to be coordinated with IML and the EP-SFT librarian
  - The number of users and potential impact should be enough to justify an addition
  - For testing, installation in CERNBox is possible from within SWAN, e.g. "pip install --user mypackage"
- Cloud vs local, sync and share
  - All commands etc run and are stored in the cloud
  - With a local CERNBox install, anything put in the folder is synced to the cloud
- SWAN at this moment:
  - Pilot service released at the beginning of June, all main components in place
  - In the beta testing phase with ~200 users and growing
  - Contact swan-talk@cern.ch if interested, and please provide feedback
  - Now accessible from outside of CERN
  - Several examples available in SWAN, including for ML (mostly TMVA); see swan.web.cern.ch and click on "galleries"
- Question (Vidyo): is it possible to share a notebook as per slide 11?
- Enric: This is work in progress; for now you have to use CERNBox for sharing
  - In the future we want to make this very easy, with such a button
- Question (Vidyo): how can we access an EOS file from within the notebook? Public only or also private?
- Enric: You can access it directly as EOS is fuse-mounted, using your credentials for private files
- Room: as ROOT is supported, one can always use the XRootD protocol as well (see the snippet at the end of this section)
- Question (Sergei): in my experience with multiple users on the same notebook, you get things like "notebook changed"; do you foresee sharing where people modify together, or static sharing?
- Enric: There are several options: one can be a copy to your CERNBox, another is a symlink to your file, so both can edit it
  - Jupyter does not support concurrent editing: not like Google Docs, you can't have multiple people editing simultaneously (at least not yet)
- Question (Sergei): how do you balance the CPU resources across all users?
- Enric: This is up to the management of the Spark cluster
  - Right now there is no resource management: you arrive and send your stuff, and it runs if there are sufficient resources
  - Quotas are up to others
- Question (Steven): you mentioned access to EOS, but what about AFS? Also grid/university clusters/etc?
- Enric: AFS is intentionally not there; we want to move away from it and embrace CVMFS for software
  - Any remote ROOT file can be opened with ROOT (e.g. files on a cluster's DPM storage)
- Question (Steven): you mentioned additional CVMFS support; what about Theano and similar? Are these planned?
- Enric: We can definitely add Theano if there is sufficient interest, which seems likely
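A minimal example of the XRootD route mentioned above, using PyROOT from a notebook (the server, file path, and tree name below are placeholders, not real files):

```python
import ROOT

# EOS is also reachable over the network via the XRootD protocol;
# this path and the tree name "events" are hypothetical examples.
f = ROOT.TFile.Open("root://eosuser.cern.ch//eos/user/j/jdoe/data.root")
tree = f.Get("events")
print(tree.GetEntries())
```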

Anton: (Kernel) Density Estimation
- Density estimation is an example of unsupervised learning
  - Given a training sample, try to predict the response (probability density) at points not in the sample
- The simplest method of density estimation is histogramming
  - A frequent extension at LHCb: cubic spline smoothing; however, the binning effect is still present
- The next step is kernel density estimation
  - Using an adaptive kernel: the first iteration is fixed, the second is given by the PDF of the first, ...
- Tried several KDE implementations: RooFit, scikit-learn, and a personal implementation (Meerkat)
  - The personal implementation attempts to solve problems related to boundary effects and dimensionality
- Boundary effects are important for KDE
  - Methods to address this: data reflection, kernel modification
  - Simple correction: divide the result of the KDE by the convolution of the kernel with a flat density (sketched below)
  - More advanced corrections were also tried for the case where the PDF's behaviour at the boundaries is approximately known
  - Represent the PDF as a product: P_corr(x) = f(x) * P_appr(x), where P_appr is known and approximately describes the narrow regions and boundaries
  - Also useful because it helps in multi-dimensional cases
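A minimal 1D illustration of the simple boundary correction above, with a flat truth on [0, 1] so the boundary bias and its removal are easy to see (sample size and kernel width are made-up numbers):

```python
import numpy as np
from scipy.stats import norm

# Flat truth on [0, 1]: the raw KDE undershoots near the edges because
# part of each kernel's mass leaks outside the allowed interval.
rng = np.random.RandomState(1)
sample = rng.uniform(0.0, 1.0, size=2000)
h = 0.05                                   # fixed Gaussian kernel width
x = np.linspace(0.0, 1.0, 201)

# Raw KDE: average of Gaussian kernels centred on the sample points.
raw = norm.pdf((x[:, None] - sample[None, :]) / h).mean(axis=1) / h

# Kernel convolved with a flat density on [0, 1]: what the raw KDE would
# return for a perfectly flat PDF, i.e. the boundary bias itself.
flat_response = norm.cdf((1.0 - x) / h) - norm.cdf(-x / h)

corrected = raw / flat_response            # ~1 everywhere, edges included
print(raw[0], corrected[0])                # raw ~0.5 at the edge, corrected ~1.0
```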
- The Meerkat library implements this procedure; usage:
  - Create a phase space from the building blocks provided
  - Optionally create an approximation density
  - Fill the relative KDE PDF
  - Store the binned versions of it in a file
  - Use it in your ML fits
- The library includes C++ and Python interfaces and is used in several LHCb analyses
- Also tried Gaussian Mixture Models
  - Generally better results than KDE for small samples and PDFs with irregularities (such as low statistics); a minimal scikit-learn version is sketched at the end of these minutes
  - Works especially well with modern hardware like GPUs; ~20 Gaussian components are usable
- Model-independent density estimation is crucial for many HEP analyses; many techniques have been tried in LHCb
- Question (Sergei): many talks today on very different methods
  - Would like to set up benchmark datasets to compare these different methods
  - Would be interested to know what types of samples would be useful for trying these techniques
- Anton: we have quite a lot of samples with different properties
- Question (Sergei): your new package, how integrated is it with ROOT/etc?
- Anton: it is based on ROOT and can take ROOT trees as input to the PDF estimation
  - The result can be used either as the histograms or in a RooFit PDF
  - Not tightly integrated: a standalone library using ROOT
- Question (Sergei): who all is working on Meerkat?
- Anton: mostly myself (development), but many users in LHCb provide useful feedback
  - It has only been in use for less than a year
- Question (Sergei): you showed that the approach assumes PDFs are factorizable; did you find the residuals were significant?
- Anton: this is not going to solve the curse of dimensionality in all cases
  - You can build it up dimension by dimension, adding the variables it depends on most
  - The residual is quite small
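The Gaussian mixture sketch referenced above: a minimal scikit-learn density estimate (the toy sample and component count are illustrative, not the ~20-component GPU fits mentioned in the talk):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1D sample: two overlapping Gaussian populations (made-up parameters).
rng = np.random.RandomState(2)
sample = np.concatenate([rng.normal(-1.0, 0.3, 300),
                         rng.normal(1.5, 0.8, 700)])[:, None]

# Fit a mixture and evaluate the estimated density on a grid;
# score_samples returns the log-density at each point.
gmm = GaussianMixture(n_components=5, random_state=0).fit(sample)
x = np.linspace(-3.0, 5.0, 200)[:, None]
density = np.exp(gmm.score_samples(x))
```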