SWAN (Service for Web-based ANalysis) is CERN's Jupyter notebook service, giving access to data from EOS and CERNBox, software from CVMFS and connectivity to CERN Spark clusters.
The goal of this workshop is to share use cases among the community and for the SWAN team to learn in more detail how the service is used and how it should evolve to suit the users’ needs.
We propose to integrate the Rucio data management system with SWAN. This will allow users, such as ATLAS and CMS scientists, to interact with experiment data without having to resort to manual operations for searching and downloading files. We will present several user workflow scenarios, the boundaries of the experiment infrastructure, and propose potential architectures to support this development.
This talk will provide a short overview of how we use SWAN within ATLAS trigger and DAQ operations. I will cover live, online use and also show how this transitions into offline use.
CloudStor SWAN (Service for Web based ANalysis) is AARNet’s first attempt at providing data processing and analysis in the cloud to the research community in Australia. CloudStor has many similarities to CERNBox in that they are both based on ownCloud and use EOS for their underlying storage. In this talk I will discuss our SWAN deployment, the users who use it, the training we provide and the general feedback we have received from our users.
The CMS detector at the Large Hadron Collider (LHC) is currently being upgraded in the context of the future High-Luminosity (HL-LHC) phase, which will integrate 10 times the LHC luminosity. The CMS experiment plans to replace the current endcap calorimeter with a high granularity calorimeter (HGCAL).
In this talk I will focus on my personal use case, the HGCAL test beam data analysis using the Jupyter notebook interface. I will describe the main tools used for the analysis, as well as the SWAN functionalities that have been useful during my work. I will also try to point out the main SWAN-related problems and issues that an analyzer may face while using this platform.
ALICE Data Analysis without wired PC feat. SWAN
Many kinds of devices have been used for HEP data analysis. Most of them were wired desktops, which have a stable network connection, enough storage and analysis-specific software. Thanks to the abundant computing resources provided by the CERN computing group, we can bring this to client devices through CERNBox and SWAN. This talk will cover the plan for applying SWAN to ALICE analysis and its current status.
In my day to day work, I usually use SWAN when I need to share with my colleagues some fast computation, some toy Monte Carlo or some quick analysis.
I also use SWAN when performing some tests or preparing some physics figures so I can always have them stored for future reference.
I find SWAN useful when working together with many people: it is nice to have an isolated environment that anyone can access.
It also worked quite nicely with summer students, even though they had some difficulty getting started.
It is easy to prepare tutorials based on SWAN so that newcomers can learn by trial and error without any serious consequences.
Here I will try to explain my experience in a bit more detail, give some examples and possibly offer some suggestions.
The TOTEM and CMS experiments use the CMSSW software framework for offline data processing. The framework provides the data formats needed for accessing and analysing RECO and AOD objects. In our study we investigated the feasibility of using SWAN, RDataFrame and Spark technologies to examine, analyse and reduce a 500 TB data sample generated by the TOTEM detectors. The sample itself is a collection of ROOT files in RECO format, stored temporarily on an EOS instance at CERN.
We successfully implemented the analysis code using RDataFrame in C++ and ran it on the lxplus cluster with the standard CMSSW setup. Next, we attempted to run it on SWAN in order to use the Spark cluster. This attempt has not been successful so far, for the reasons described below.
SWAN relies on CVMFS for software releases, letting the user choose from the latest LCG releases and nightlies. The core functionalities thus make use of software under the sft.cern.ch CVMFS namespace, while the current analysis use case was originally developed within the CMSSW framework and still relies on software and libraries inside it, especially those regarding the AOD data format.
Keeping in mind the final goal, which is to exploit the potential of the Spark clusters to run the analysis, the simplest first step was to run the code directly in the SWAN terminal. This is possible via some modifications to PATH environment variables in the SWAN user session. The next step involved using a Jupyter notebook to run the same code. This could be achieved by creating an “environment script” in bash, but some silent errors started to show up that could only be seen in the SWAN logs. The final step, connecting to the Spark clusters to run the analysis, was simply impossible with the current status of the platform.
The main problem revolves around having two CVMFS software stacks, namely SFT and CMS, clashing at different levels. For instance, ROOT also relies on external tools such as the gcc compiler and the python interpreter, which were picked from the corresponding CMS stack. At the same time, Jupyter and its extensions as well as the whole Spark framework are in the SFT stack but not in the CMS one, making the connection to the clusters incompatible with the environment needed for the analysis.
In summary, we observed that integrating CMSSW in SWAN is only possible when using SWAN as a terminal interface, that is, picking only ROOT and the CMS data format libraries by modifying some environment variables. Using the Jupyter notebook and the Spark clusters is at this moment incompatible with the environment needed for the analysis. Such a setup would involve picking ROOT from the CMS repository and the Jupyter extensions and Spark framework from the SFT repository.
SWAN usage at COMPASS
My PhD subject in COMPASS is the extraction of the cross section of the Deeply Virtual Compton Scattering process ($\mu p \to \mu p \gamma$), in order to gain insight into the structure of the proton. The collaboration already has working software for the reconstruction of the data, and also for the further analysis.
I chose to extract from the existing software all the variables relevant to my analysis, and to store everything in TTrees. Then, for flexibility, I chose an interactive way to code my further analysis. This is done in SWAN, using the recent and powerful RDataFrame framework, the Python kernel and PyROOT.
However, the volume of data is still large even after prefiltering and takes a lot of time to process interactively. That is why I use the PyRDF module developed by Javier Cervantes Villanueva to interface my RDataFrames with Spark clusters transparently from the user's point of view.
This solution works and is efficient, but some issues are still present, such as slow or impossible connection to SWAN with certain user configurations. Also, some interesting new possibilities might be developed for RDataFrames to ease some studies.
I will discuss the AWAKE dataset and analysis package, focusing on image analysis techniques.
The NXCALS system, provided by BE-CO, is the successor of CALS. It is based on Hadoop Big Data technologies and uses cluster computing power for data analysis. It stores data from devices related to the accelerator complex. It is perfectly integrated into the SWAN environment, which is offered as one of the NXCALS data access methods.
The presentation will focus on how SWAN helps our users to analyse, plot and share logging data. It will be followed by the results of the user survey taken at the beginning of the year.
The purpose of the presentation is not only to show some use cases, but also to demonstrate how we imagine SWAN's role for the new NXCALS system.
SWAN can be used to extract machine measurements, query machine settings and run beam dynamics simulations out of the box thanks to the pytimber, pjlsa and cpymad packages installed in the LCG stack. We present the status of those packages and our wishes for the SWAN platform.
The CERN SWAN service is being used extensively in the ABT group. In this presentation, the main use cases are reported with some examples. Some feature requests that we believe could improve our experience with this already very complete system are also highlighted.
The Large Hadron Collider is a complex machine composed of many interconnected systems serving a broad scientific community. Although its performance has been a considerable achievement, there have already been several cases of hardware deterioration. Early detection of fault precursors reduces fault severity and facilitates maintenance planning. To this end, the LHC undergoes rigorous testing during the Hardware Commissioning campaigns prior to restarting operation. However, LHC systems are not extensively monitored during operation, even though the signals representing them have been continuously logged. Hence, there is a lot of data, in terms of both volume and variety, which can be used to gather information about reference values of signals for the most critical systems. Signal features engineered this way will serve as feedback on performance evaluation and design decisions and will support predictive maintenance.
The LHC Signal Monitoring project aims at monitoring signals in order to detect deviations from standard operating values. In other words, we aim at introducing predictive maintenance capabilities for the LHC. The project scope includes superconducting circuits (magnets, busbars, power converters, protection devices, etc.). To this end, the project is divided into three stages: signal exploration, signal analysis, and signal monitoring. The first two stages involve accessing various signal databases (Post Mortem, CALS, NXCALS) and numerous iterations of feature extraction and signal processing algorithms, for which SWAN is a suitable environment.
In this contribution, we will present selected notebooks and share our experience in using SWAN by a team of engineers analysing electrical signals of superconducting circuits. An important aspect of the collaborative environment is our software stack enabling versioning of code, automatic testing, generation of documentation, and packing of our API.
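The idea of detecting deviations from reference values can be illustrated with a minimal sketch. It is not the project's actual algorithm: it assumes a simple band of a few standard deviations around the mean of historical samples, and the function names and sample values are hypothetical.

```python
from statistics import mean, stdev

def reference_band(reference, n_sigma=3.0):
    """Return (low, high) limits derived from historical reference samples."""
    mu = mean(reference)
    sigma = stdev(reference)
    return mu - n_sigma * sigma, mu + n_sigma * sigma

def find_deviations(signal, band):
    """Return the indices of samples that fall outside the reference band."""
    low, high = band
    return [i for i, value in enumerate(signal) if not (low <= value <= high)]

# Hypothetical logged values of one signal during normal operation.
reference = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9]
band = reference_band(reference)

# Hypothetical new measurements to be checked against the band.
new_signal = [10.0, 10.1, 12.5, 9.9, 7.0]
outliers = find_deviations(new_signal, band)
print(outliers)
```

In practice the reference samples would come from the signal databases mentioned above, and the feature and threshold would be chosen per signal.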
High energy physics requires the massive data sets that it does because the quantum randomness of nature is baked into our results. Only with orders of magnitude more data than other disciplines do we achieve any results at all. All this data needs some rather advanced, novel and specific statistical tools.
The statistical methods web page aims to link high-level statistical concepts, such as profile likelihoods, subsidiary measurements and non-parametric inference, directly with analysis-ready, hands-on instructions for how these methods can be implemented in practice. SWAN handles the hands-on part of the page, allowing users to see a plot, click a button and investigate the effect of changing parameters and randomising data.
The project hopes to one day work across experimental boundaries, providing an in-depth, interactive, community-driven textbook on statistical methods. However, hosting it this way brings permission issues related to single sign-on and data storage, requires duplicating repositories between GitHub and GitLab, and limits the availability of this resource to those with CERN lightweight computing accounts.
I'd love to one day make the page truly interactive and available to all, but for now the facilities available from SWAN are incredibly useful for making the CERN statistical world open to all.
The SWAN web interface for Jupyter notebooks provides several advantages compared to the commonly used offline data analysis tools (ROOT, Matlab, Python, etc.). A SWAN project is practically a directory in the user's CERNBox folder. It can contain input data (e.g. a radionuclide inventory in txt format), Jupyter notebooks with Python scripts for data analysis, step-by-step documentation written in a lightweight markup language, and live visualization of data. This creates self-consistent and self-explanatory documentation, which can be easily shared with and understood by colleagues.
In operational radiation protection, the benefits offered by SWAN have been employed for data analysis and post-processing of, e.g., radionuclide inventories, calibration data of radiation monitoring devices and benchmark of Monte Carlo simulations with experimental data. The notebooks are then shared with colleagues for integration, checks and validation. The SWAN projects constitute a solid base for editing technical notes and, in addition, they conserve the logic and make the radiological assessments easily reproducible.
However, sharing a SWAN project has some limitations. When it is shared, the recipient creates a copy of the original project in his/her own CERNBox folder. After editing, he/she shares the modified project, creating additional copies. This is rather inefficient and can easily lead to a hard-to-track and ever-growing set of project copies.
A combined use of SWAN and GitLab overcomes this aspect. The CERNBox client makes the created SWAN project locally available on the user's machine. Then, the folder can be set as local repository for a GitLab project. This technique allows the contributors to work on projects using the SWAN service and, at the same time, gives them access to the branching and sharing capabilities of GitLab. As a result, version controlling of the SWAN project becomes more transparent.
To increase the efficiency of using SWAN combined with GitLab, a committing interface could be integrated in the SWAN service. This interface would provide a direct link between the two and would also be useful for those who are not so familiar with the use of GitLab.
Although GitLab already has a viewer for Jupyter notebooks, this feature currently has some limitations:
- Live code cells of the notebook are shown as static text in GitLab.
- Although plots from Python scripts are supported in GitLab, the Markdown syntax for inserting pictures is not.
- Aligning text in table columns to the left, right, or center is useful for scientific documentation. However, the Markdown header syntaxes (:---, :---: or ---:) are recognized neither by SWAN nor by GitLab. Moreover, text in tables is aligned to the left in GitLab, but to the right in SWAN.
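For reference, the alignment syntax mentioned in the last point places the markers in the delimiter row directly under the table header. The sketch below builds such a table as plain text; the helper name, column titles and values are hypothetical placeholders, and whether the alignment is honoured depends on the viewer that renders the result.

```python
# Markers used in the delimiter row of a Markdown table.
ALIGN_MARKERS = {"left": ":---", "center": ":---:", "right": "---:"}

def markdown_table(headers, alignments, rows):
    """Build a Markdown table whose delimiter row carries the alignment markers."""
    header_line = "| " + " | ".join(headers) + " |"
    delimiter_line = "| " + " | ".join(ALIGN_MARKERS[a] for a in alignments) + " |"
    body_lines = ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join([header_line, delimiter_line, *body_lines])

# Placeholder content, for illustration only.
table = markdown_table(
    ["Nuclide", "Activity (Bq)"],
    ["left", "right"],
    [["Co-60", 1200], ["Na-22", 35]],
)
print(table)
```

Pasting the printed text into a Markdown cell is a quick way to check how a given viewer treats the alignment row.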
A SWAN notebook was developed that automatically generates input files for LEDET, a tool of the STEAM framework (https://cern.ch/steam) used to simulate electromagnetic and thermal transients in superconducting magnets. This program was used to simulate quench transients in more than 25 magnets for LHC, HL-LHC, FCC, and other projects. In the coming years, a library of all LHC and HL-LHC magnet models is expected to be completed.
The SWAN notebook is used as a front-end to acquire input data from other sources, manually define parameters, perform repetitive actions needed to generate models, visualize the most relevant defined parameters, and finally generate a LEDET input file, which is an Excel worksheet.
The main advantages of generating LEDET files through the SWAN notebook are the possibility to develop input files quickly and easily; a lower probability of mistakes thanks to the visualization of the defined parameters; versioning of the model generation scripts, for example using GitLab; generation of magnet models with the same features by different users; and rapid updates of reference models when new features are developed, which gives uniformity among different magnet models.
In the first months after its development, the SWAN notebook was used by seven users from different laboratories to generate a dozen magnet models.
Services like SWAN offer exciting prospects in the realm of education. However, educational scenarios also pose additional requirements which we will discuss from a user’s perspective.
Firstly, we’ll report on the use of CoCalc (a similar service) with high school students on standard curricular physics topics and what can be derived from this.
Secondly, we’ll present an example where custom Jupyter notebooks are developed to acquire and process real-time data from a pixel detector used in educational settings.
The LHC data are stored mainly on tape. When the experiments want to analyse a dataset, they issue a recall request, to move the data from tape to disk so that they are available for analysis.
The request propagates through a number of systems, with a number of distinct events in the process: the user issues the request, the request is received by the tape system, the tape is mounted, etc.
In order to better understand the performance of the system in its entirety and identify bottlenecks, I'm using SWAN and Spark to combine, filter, and process the logs of the different systems into a unified "request timeline". That way we can know which component has the largest delays and thus focus our effort on optimizing it.
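The "request timeline" idea can be sketched in a few lines. The example below uses plain Python rather than Spark, and the record layout, system names, event names and timestamps are hypothetical stand-ins for the real logs.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log records from different systems, deliberately out of order:
# (request id, system, event, ISO timestamp)
records = [
    ("req-1", "tape", "tape_mounted", "2019-01-01T10:12:05"),
    ("req-1", "user", "request_issued", "2019-01-01T10:00:00"),
    ("req-1", "disk", "file_on_disk", "2019-01-01T10:15:05"),
    ("req-1", "tape", "request_received", "2019-01-01T10:00:05"),
]

def build_timelines(records):
    """Group events by request id and sort each group chronologically."""
    timelines = defaultdict(list)
    for request_id, system, event, stamp in records:
        timelines[request_id].append((datetime.fromisoformat(stamp), system, event))
    for events in timelines.values():
        events.sort()
    return dict(timelines)

def stage_delays(timeline):
    """Return (stage label, delay in seconds) for each pair of consecutive events."""
    return [
        (f"{first[2]} -> {second[2]}", (second[0] - first[0]).total_seconds())
        for first, second in zip(timeline, timeline[1:])
    ]

def slowest_stage(timeline):
    """Return the stage with the largest delay."""
    return max(stage_delays(timeline), key=lambda stage: stage[1])

timelines = build_timelines(records)
slowest = slowest_stage(timelines["req-1"])
print(slowest)
```

With Spark, the same grouping and ordering would be expressed over the full log volume; the per-request logic stays the same.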
The CERN Open Data portal provides access to a rapidly growing collection of data and software originating from research at CERN. Documentation and examples are included as well, which makes the platform well suited for newcomers to the HEP community and for the general public.
This talk presents how SWAN currently integrates into this ecosystem and discusses additional requirements for education and outreach.