- 14:00 → 15:00
  Summary of the discussion with the ROOT team 1h 513/R-070 - Openlab Space
  
  CERN
  
  
  Speakers: Giovanni Guerrieri (CERN), Markus Schulz (CERN), Marta Czurylo (CERN), Stephan Hageboeck (CERN), Dr Vincenzo Eduardo Padulano (CERN), ASRITH KRISHNA RADHAKRISHNAN (Universita e INFN, Bologna (IT))
  Data Format Choices
  
  We currently use CSV files for data preprocessing, but this forces compromises in the workflow.
  
  CSVs are seen as suitable for exploratory analysis, but not optimal for large-scale or production ML workflows because they scale poorly.
  
  Handling Missing Values
  
  RDF (ROOT's RDataFrame) does support missing-value handling.
  
  The plan is to use attention masking to handle artificial (previously missing) values while training the ML model.
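
The masking idea can be sketched with plain NumPy. This is only an illustration under assumptions: each input value is treated as one position that attention can attend to, missing entries arrive as NaN, and the helper names `build_mask_and_fill` and `masked_attention_weights` are hypothetical, not part of ROOT or any ML framework.

```python
import numpy as np


def build_mask_and_fill(x, fill_value=0.0):
    """Return (filled, mask); mask is True where a value was really observed."""
    mask = ~np.isnan(x)
    filled = np.where(mask, x, fill_value)  # artificial value replaces NaN
    return filled, mask


def masked_attention_weights(scores, mask):
    """Softmax over the last axis with zero weight on masked-out positions."""
    # A very negative score keeps masked positions from dominating the maximum.
    masked_scores = np.where(mask, scores, -1e9)
    # Subtract the row maximum for numerical stability.
    shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
    # Force masked positions to exactly zero, even in fully masked rows.
    weights = np.where(mask, np.exp(shifted), 0.0)
    total = weights.sum(axis=-1, keepdims=True)
    # Rows without any observed value get all-zero weights instead of NaN.
    return np.divide(weights, total, out=np.zeros_like(weights), where=total > 0)


features = np.array([[1.0, np.nan, 3.0],
                     [np.nan, np.nan, 2.0]])
filled, mask = build_mask_and_fill(features)
scores = np.zeros_like(filled)  # hypothetical raw attention scores
weights = masked_attention_weights(scores, mask)
```

With this approach the artificial values still sit in the input tensor, but they receive no attention weight, so they should not influence the output of the attention layer.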
  
  Workflow and Performance Concerns
  
  Operations like `Snapshot` and `AsNumpy` in ROOT are not lazy by default (but can be made lazy); they trigger the computation immediately and can block execution, leading to single-core performance bottlenecks.
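
The difference between eager and lazy actions can be illustrated with a toy sketch. `ToyFrame` and its methods are hypothetical names invented for this illustration and are not ROOT's API; in ROOT itself, `Snapshot` can be made lazy by setting `fLazy` in `RSnapshotOptions`. The point shown is that each eager call blocks and costs a full pass over the data, while lazily booked actions share a single pass.

```python
class ToyFrame:
    """Toy stand-in for a dataframe (illustration only, not ROOT's API)."""

    def __init__(self, rows):
        self._rows = rows
        self._booked = []  # (update function, result holder) pairs
        self.passes = 0    # number of full loops over the data so far

    def book(self, update, initial):
        """Lazy: register an action; its holder is filled by the next run()."""
        holder = {"value": initial}
        self._booked.append((update, holder))
        return holder

    def run(self):
        """Loop over the data once, updating every booked action per row."""
        self.passes += 1
        for row in self._rows:
            for update, holder in self._booked:
                holder["value"] = update(holder["value"], row)
        self._booked.clear()

    def compute_now(self, update, initial):
        """Eager: book an action and immediately run a full pass for it."""
        holder = self.book(update, initial)
        self.run()
        return holder["value"]


rows = [1.0, 2.0, 3.0, 4.0]

# Eager style: every result blocks and triggers its own loop over the data.
eager = ToyFrame(rows)
total_eager = eager.compute_now(lambda acc, x: acc + x, 0.0)
count_eager = eager.compute_now(lambda acc, x: acc + 1, 0)

# Lazy style: book everything first, then a single loop fills all results.
lazy = ToyFrame(rows)
total = lazy.book(lambda acc, x: acc + x, 0.0)
count = lazy.book(lambda acc, x: acc + 1, 0)
lazy.run()
```

After this runs, `eager.passes` is 2 while `lazy.passes` is 1, although both variants produce the same sum and count.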
  
  Feeding Data into ML Infrastructure
  
  Current solutions do not work efficiently in distributed environments. Keeping data in RAM (e.g., with NumPy) is optimal but not always feasible.
  
  A distributed batch generator could address memory and scalability issues, especially when processing massive datasets such as the Open Data, ~O(50 TB).
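
A minimal sketch of the batch-generator idea, under stated assumptions: the input is split into files, each worker takes a disjoint subset of them, and only one file's worth of data plus a small remainder is held in memory at a time. The names `shard`, `batch_generator` and `fake_load` are hypothetical; a real loader would read ROOT files instead of fabricating arrays.

```python
import numpy as np


def shard(files, worker_index, n_workers):
    """Give each worker a disjoint, round-robin subset of the input files."""
    return files[worker_index::n_workers]


def batch_generator(files, load_chunk, batch_size):
    """Yield fixed-size batches, holding one chunk plus a remainder in memory."""
    leftover = None
    for path in files:
        chunk = load_chunk(path)
        if leftover is not None and len(leftover):
            # Prepend rows that did not fill a whole batch in the previous file.
            chunk = np.concatenate([leftover, chunk])
        n_full = len(chunk) // batch_size
        for i in range(n_full):
            yield chunk[i * batch_size:(i + 1) * batch_size]
        leftover = chunk[n_full * batch_size:]
    if leftover is not None and len(leftover):
        yield leftover  # final batch, possibly smaller than batch_size


def fake_load(path):
    """Hypothetical loader: pretends every file holds 5 rows of 2 features."""
    file_index = int(path.split("_")[1])
    return np.full((5, 2), float(file_index), dtype=np.float32)


files = [f"file_{i}" for i in range(4)]
my_files = shard(files, worker_index=0, n_workers=2)  # this worker's share
batches = list(batch_generator(my_files, fake_load, batch_size=4))
```

Here worker 0 of 2 reads `file_0` and `file_2` (10 rows in total) and produces batches of 4, 4 and 2 rows. In a real training loop the generator would be consumed batch by batch instead of being collected into a list.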
  
  An existing batch generator in ROOT: `ROOT.TMVA.Experimental.CreateTFDatasets` (see this example).
  
  Action:
  - Move away from CSVs and try ROOT files with efficient batch generators.
  
  Dask and ROOT Integration Issues
  
  Problem: memory is not released after Dask jobs, leading to resource saturation. This issue comes from both Dask and ROOT.
  
  In SWAN sessions, the Dask Python process remains tied to the Condor job and the session lifetime so that users can use it at will; this causes lingering memory usage and process persistence. To be tested with lazy snapshots.
  
  ROOT uses compilers for jitting (compiling on the fly), which makes the memory retention worse, as the compilers do not release memory while the Dask process is active.
  
  These problems are site-specific; for example, INFN is not configured in the same way and does not show this issue.
  
  Do we know what INFN does differently? Could we ask them?
- 15:00 → 15:20
  
  Additional News 20m 513/R-070 - Openlab Space
  
  Speaker: Markus Schulz (CERN)

Analysis Facility Pilot (Weekly Discussion)

Useful information and links:

Overall description and useful information

Mattermost Channel

Workbook

Minutes
