Analysis Facility Pilot (Weekly Discussion)

Europe/Zurich
513/R-070 - Openlab Space (CERN)

Description

Useful information and links:

e-mail list: cern-analysis-facility@cern.ch

Overall description and useful information

Mattermost Channel

Workbook

Minutes

Zoom Meeting ID
61085982895
Host
Markus Schulz
Alternative host
Ben Jones
    • 14:00 - 15:00
      Summary of the discussion with the ROOT team (1h, 513/R-070 - Openlab Space)

      Speakers: Giovanni Guerrieri (CERN), Markus Schulz (CERN), Marta Czurylo (CERN), Stephan Hageboeck (CERN), Dr Vincenzo Eduardo Padulano (CERN), ASRITH KRISHNA RADHAKRISHNAN (Universita e INFN, Bologna (IT))

      Data Format Choices

      • We currently use CSV files for data preprocessing, but this forces compromises in the workflow.
        • Using CSVs is seen as suitable for exploratory analysis, but not optimal for large-scale or production ML workflows because of scalability issues.
        • Handling Missing Values
          • RDF (ROOT's RDataFrame) does support handling of missing values.
          • The plan is to use attention masking so that artificial (previously missing) values are handled correctly during ML model training.
        • Workflow and Performance Concerns
          • Operations like `Snapshot` and `AsNumpy` in ROOT are not lazy by default (but can be made lazy); they trigger the computation immediately and can block execution, which leads to single-core performance bottlenecks.
        • Feeding Data into ML Infrastructure
          • Current solutions do not work efficiently in distributed environments. Keeping data in RAM (e.g., with NumPy) is optimal but not always feasible.
          • A distributed batch generator could address memory and scalability issues, especially when processing massive datasets such as the Open Data, ~O(50 TB).
          • Action:
                - Move away from CSVs and try ROOT files with efficient batch generators.
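
      The batch-generator and masking ideas above can be illustrated with a minimal pure-Python sketch. It is not the ROOT batch generator and not the group's code: `masked_batches`, `FILL_VALUE`, and the toy `events` input are hypothetical names chosen for illustration. Rows are read lazily, so only one batch is held in memory at a time, and each batch carries a mask that marks which values were originally missing.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]
Batch = Tuple[List[List[float]], List[List[int]]]

FILL_VALUE = 0.0  # assumed placeholder for missing entries


def masked_batches(rows: Iterable[Row], batch_size: int) -> Iterator[Batch]:
    """Yield (values, mask) batches lazily from any iterable of rows.

    mask[i][j] is 1 where the value was present and 0 where it was
    missing and has been replaced by FILL_VALUE.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        # Pull at most batch_size rows; nothing is read before this point.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        values = [
            [FILL_VALUE if x is None else float(x) for x in row]
            for row in chunk
        ]
        mask = [[0 if x is None else 1 for x in row] for row in chunk]
        yield values, mask


# Toy input: three events with two features each; None marks a missing value.
events = [(1.0, None), (2.5, 0.3), (None, 4.0)]
batches = list(masked_batches(events, batch_size=2))
```

      A training loop could pass the mask to the model together with the values, so that filled-in entries are ignored by the attention layers. In a real setup, `rows` would be an iterator over a file or a remote source rather than an in-memory list.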

       

      Dask and ROOT Integration Issues

      • Problem: memory is not released after Dask jobs, leading to resource saturation. This issue comes from both Dask and ROOT.
        • In SWAN sessions, the Dask Python process remains tied to the Condor job and session lifetime so that users can use it at will; as a result, the process persists and its memory stays in use. To be tested with lazy snapshots.
        • ROOT uses a compiler for JITting (compiling on the fly), which makes the memory retention worse, as the compiler does not release memory while the Dask process is active.
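
      The mechanism behind these points (memory stays allocated for as long as the worker process lives) can be illustrated without Dask or ROOT. The following is a minimal sketch, not the SWAN or Dask configuration: `run_in_fresh_process` and `TASK` are hypothetical names. Each task runs in a short-lived child interpreter, so whatever it allocates is returned to the operating system when the child exits.

```python
import os
import subprocess
import sys


def run_in_fresh_process(snippet: str) -> str:
    """Run a Python snippet in a child interpreter and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# Hypothetical stand-in for a memory-hungry task: allocate a large list,
# then report which process did the work.
TASK = "import os; scratch = [0] * 1_000_000; print(os.getpid())"

parent_pid = str(os.getpid())
child_pids = [run_in_fresh_process(TASK) for _ in range(2)]
```

      The trade-off is that a fresh process pays the start-up cost again and loses anything cached in memory, including JIT-compiled code, so this is an illustration of the effect rather than a recommended fix.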

       

      These problems are site-specific; for example, INFN is not configured in the same way and does not show this issue.

      Do we know what they do? Could we ask them?

       

      [Figure: screenshot of a graph]

    • 15:00 - 15:20
      Additional News (20m, 513/R-070 - Openlab Space)

      Speaker: Markus Schulz (CERN)