Analysis facility discussion

Europe/Zurich
513/R-070 - Openlab Space (CERN)


Draft of the minutes of the meeting we had on October 6th. (Markus)

Initial discussion on how we define “interactive analysis”. The HEPIC Analysis facilities doc tried to do this, but the text is, with some justification, not accepted by everyone.

Markus noted that Brian Bockelman recently suggested introducing certain time scales, like “a coffee” or “over lunch”, but this has been very early…

In addition, there was consensus that the transition between interactive work and scaled-out work (batch, multiple days, etc.) should be actively supported by the tool chain. Moving code and config manually risks introducing inconsistencies.

There was then a discussion on the role of Dask and how it can be used with very different batch systems. Here at CERN we run it successfully with Condor.
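To illustrate the point about Dask and different batch systems, here is a minimal sketch assuming the dask-jobqueue package is used (the minutes do not say which package the CERN setup relies on). The helper names `cluster_class_name` and `make_cluster` and all resource values are made up for this example; `HTCondorCluster` and `SLURMCluster` are the class names dask-jobqueue provides.

```python
# Hypothetical helpers; resource values are placeholders, not site settings.
CLUSTER_CLASSES = {
    "htcondor": "HTCondorCluster",
    "slurm": "SLURMCluster",
}


def cluster_class_name(batch_system):
    """Return the dask-jobqueue class name for a batch system label."""
    key = batch_system.strip().lower()
    if key not in CLUSTER_CLASSES:
        raise ValueError(f"unsupported batch system: {batch_system!r}")
    return CLUSTER_CLASSES[key]


def make_cluster(batch_system, n_jobs=10):
    """Start Dask workers as batch jobs on the chosen system."""
    import dask_jobqueue  # imported lazily: only needed on a submit node

    cluster_cls = getattr(dask_jobqueue, cluster_class_name(batch_system))
    kwargs = {"cores": 1, "memory": "2GB"}
    if batch_system.strip().lower() == "htcondor":
        kwargs["disk"] = "1GB"  # HTCondorCluster also needs a disk request
    cluster = cluster_cls(**kwargs)
    cluster.scale(jobs=n_jobs)  # ask the batch system for n_jobs worker jobs
    return cluster
```

On a submit node one would then connect with `dask.distributed.Client(make_cluster("htcondor"))`; the analysis code behind the client stays the same when only the batch system label changes.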

Markus asked whether anyone knew why the German National Analysis Facility at DESY decided to use a different scheduler (SLURM?).
Ben remembered having talked about this with a person from DESY, but it did not become clear why (or I didn’t understand it).

There was a common agreement that ONE batch system is highly desirable.

There was a brief but interesting discussion between Enric, Andrea and Simone. Both Coffea and RDataFrame are based on a paradigm that differs from the event-loop approach that is central to almost all current analysis code. A transition of existing analysis code to these new approaches will face some resistance (active or passive); it will be primarily the new analysis projects that embrace them. Simone did not seem to agree with this 100%.
Someone noted that new analysis code often starts by copying old code (“derived from…”). This doesn’t look promising. Maybe we should discuss incentives…
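For readers less familiar with the paradigm difference mentioned above, here is a toy sketch in plain Python. `LazyFrame` is a hypothetical stand-in written for these minutes; it is not the Coffea or RDataFrame API, and the field names and cuts are invented. It only shows the shift from "the user writes the loop" to "the user declares the steps and the framework runs the loop".

```python
# Toy data: one dict per event (field names are illustrative only).
events = [
    {"pt": 12.0, "eta": 0.3},
    {"pt": 45.0, "eta": 2.9},
    {"pt": 60.0, "eta": -1.1},
    {"pt": 33.0, "eta": 0.8},
]


def event_loop(events):
    """Imperative style: the user writes the loop and the bookkeeping."""
    selected = []
    for ev in events:
        if ev["pt"] > 30.0 and abs(ev["eta"]) < 2.5:
            selected.append(ev["pt"] * 2.0)
    return selected


class LazyFrame:
    """Hypothetical declarative frame: steps are recorded, not executed."""

    def __init__(self, events, steps=()):
        self._events = events
        self._steps = tuple(steps)

    def filter(self, predicate):
        return LazyFrame(self._events, self._steps + (("filter", predicate),))

    def define(self, name, func):
        return LazyFrame(self._events, self._steps + (("define", (name, func)),))

    def take(self, name):
        """Only here does the framework run its own loop over the data."""
        out = []
        for ev in self._events:
            row = dict(ev)
            keep = True
            for kind, payload in self._steps:
                if kind == "filter":
                    if not payload(row):
                        keep = False
                        break
                else:
                    column, func = payload
                    row[column] = func(row)
            if keep:
                out.append(row[name])
        return out


declarative = (
    LazyFrame(events)
    .filter(lambda ev: ev["pt"] > 30.0)
    .filter(lambda ev: abs(ev["eta"]) < 2.5)
    .define("pt2", lambda ev: ev["pt"] * 2.0)
    .take("pt2")
)
```

Both versions give the same selection. Because the framework owns the loop in the second style, it is free to split the work across cores or nodes, which is also why existing event-loop code cannot simply be moved over unchanged.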


During this discussion we also touched on the overall scale of the problem. Simone pointed out that whatever happens there will be analysis at CERN. 
The overall “cost” of analysis (distributed over the grid) for an experiment like ATLAS is about 5-10% of the overall resource capacity. Since the rest of the computing needs (reco, MC, etc.) aren’t compressible, future analysis facilities have to fit into the current resource envelope.

As an intro to his demo, Enric (?) showed a slide with the general setup and the relation between the different components.

This triggered a few discussions: RDataFrame vs Coffea, and the dependency constraints of Coffea (currently runs only with the bleeding edge of LCG releases…). This was seen as a real issue. Markus remarked that in his view this is more of an indication that the software is under very active development, and that by the time it reaches widespread use this will either have been solved or will be solved by evolutionary pressure…

This led to a discussion on versioning, persistency and the transition to scale-out. It was noted that for the Analysis Grand Challenge (?) a system has been developed that links the versioning of the notebooks with Git, removing the plots from the notebooks and creating compact “code only” versions.
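The “code only” idea can be sketched with the standard library alone. This is not the system mentioned above, whose details were not discussed; it is a minimal illustration, assuming notebooks in the nbformat 4 JSON layout, of how outputs can be dropped before a notebook is committed. The function names are made up for this example.

```python
import json


def strip_outputs(notebook):
    """Return a copy of an nbformat-4 notebook dict without cell outputs."""
    stripped = dict(notebook)
    cells = []
    for cell in notebook.get("cells", []):
        cell = dict(cell)  # shallow copy so the input is left untouched
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop plots, tables, printed text
            cell["execution_count"] = None  # avoid noisy diffs on re-runs
        cells.append(cell)
    stripped["cells"] = cells
    return stripped


def strip_file(src_path, dst_path):
    """Read a notebook file and write its code-only version."""
    with open(src_path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(strip_outputs(notebook), f, indent=1)
        f.write("\n")
```

The stripped file is small and diffs cleanly, so it is the version that would go under Git.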

I remember that we discussed the question of portability on two occasions. One time the sensitivity of Coffea was mentioned as a limitation; the other time it all sounded very optimistic, but I can’t remember a clear statement along the lines of: “Don’t worry! Do your code, run it here on 1% of the data, press the pink button and ship the compressed tar file to the farm in Farawayistan to run on 102% of the data”

Ben mentioned in this context that Binder could play a role in this.

During the demo (which went really smoothly and contained Coffea and RDataFrame examples) the discussion circled around the persistency of the activity.

Two aspects were discussed.

1) Large-scale output. Histos, scatter plots, etc. currently live in the notebook. From there they can be written to file. Some considered this less than ideal, especially for very large results.

2) The notebook has to be kept alive, since the Dask scheduler lives in it. If the notebook goes away, the connected nodes are freed and all work is lost.
   In the discussion it looked as if people have been looking at ways to handle this (“code for this COULD be developed” etc.), but no concrete solution exists. It was also noted that this and 1) are interconnected.
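To make the connection between 1) and 2) concrete, here is a small sketch of one possible direction, not something that was presented or agreed at the meeting: partial results are written to a file outside the notebook after every chunk, so that a lost session only costs the chunk in flight. It does not keep the Dask scheduler alive. All names, the JSON layout and the chunking are assumptions made for this example.

```python
import json
import os
import tempfile


def fill_histogram(values, edges):
    """Count values per bin; bins are half-open [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def load_checkpoint(path, n_bins):
    """Load earlier progress, or start from an empty histogram."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {"done": [], "counts": [0] * n_bins}


def save_checkpoint(path, state):
    """Write to a temp file and rename, so no half-written file is left."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def process(chunks, edges, path):
    """chunks maps a chunk name to its values; finished chunks are skipped."""
    state = load_checkpoint(path, len(edges) - 1)
    for name, values in chunks.items():
        if name in state["done"]:
            continue  # already accounted for in an earlier session
        partial = fill_histogram(values, edges)
        state["counts"] = [a + b for a, b in zip(state["counts"], partial)]
        state["done"].append(name)
        save_checkpoint(path, state)
    return state["counts"]
```

Calling `process` again after a restart picks up from the checkpoint file and only handles the chunks that are still missing.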


It was agreed that this is central for the practical use of the approach. 


It was also agreed to meet again in this slot.
