
Analysis Facility Pilot (Weekly Discussion)

Europe/Zurich
513/R-068 (CERN)

Description

Useful information and links:

e-mail list: cern-analysis-facility@cern.ch

Overall description and useful information

Mattermost Channel

Workbook

Minutes

    • 14:00 → 15:00
      Follow up on the discussion with DESY (1h, 513/R-068)

      The answers given offline by Yves Kemp:

      Overview:

      • What triggered the deployment of the AF?
        -> In 2007: the Grid was working for production at large scale, with Grid data at DESY
        -> A complementary facility for analysis by individual users and groups: interactive, low latency, low entry threshold
        -> All German institute users (later: global for Belle), ATLAS, CMS, LHCb, (ILC) ... later also HERA and new DESY HEP experiments
        -> Model for the Photon Science community at DESY
        • Were users demanding Jupyter/Dask/Scale-out etc.?
          -> in 2007 the buzzword was PROOF ... which we implemented, and no one used it
          -> since ~2019: first SPARK tests with Belle
          -> presented SPARK in 2019/2020 to the local ATLAS & CMS experiments; little to no interest
          -> Jupyter RAM troubles in 2023: larger notebooks solved the problem, no requests for DASK/SPARK anymore
          -> DASK only gained momentum towards the end of 2024
      • Main services offered by the AF
        -> Interactive, Batch, Storage, Support & Consulting, Software, Connection to the Grid
        -> Interactive: SSH, FastX, (scaling) Jupyter (for all communities), Gitlab pipeline integration
        -> Storage: Grid dCache data (plus additional space), additional local project space on NFS, $TMP on SSD (~20 GB/core), plus access to RAW data and tape; integrated into the DAQ and the experiments' data management (Rucio, ...)
      • Scale: amount of resources and number of users
        -> Interactive: several O(10) hypervisors with VMs for login
        -> Batch: ~180 WNs, 9000 cores (no hyper-threading, in contrast to the Grid, which uses HT) (some GPUs; if more GPU capacity is required: cross-usage with the Maxwell HPC system)
        -> dCache Storage: ~20 PB (Full Tier-2 plus additional space e.g. "LOCALGROUPDISK")
        -> NFS project space: 5 PB (of which ~2 PB for NAF)
        -> Users: see next point
      • What has been the adoption by users?
        -> currently > 200 users per month (as seen by dCache storage) ... ~50% are Belle users
        • Is the usage increasing over time?
          -> yes (+10% over the last 5 years)
          -> for user support, the sheer number of users matters less than the number of workflows
          -> CPU usage fluctuates.
      • Do you only support interactive analysis?
        • If not, what does non-interactive analysis look like?
          -> HTCondor batch jobs
          -> Jupyter integrated into HTCondor
      • Can the AF be used via a terminal/batch?
        -> yes: SSH + HTCondor (see the sketch after this list)
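
      A minimal sketch of the batch path mentioned above, assuming the HTCondor Python bindings are available on a login node; the script name, arguments and resource requests are hypothetical placeholders:

      ```python
      # Sketch: submitting a batch job via the HTCondor Python bindings.
      # Executable, arguments and resource requests are placeholder values.
      import htcondor

      submit_description = htcondor.Submit({
          "executable": "run_analysis.sh",      # hypothetical user script
          "arguments": "--input sample.root",
          "request_cpus": "1",
          "request_memory": "2GB",
          "output": "job.out",
          "error": "job.err",
          "log": "job.log",
      })

      schedd = htcondor.Schedd()                # local scheduler
      result = schedd.submit(submit_description)
      print("Submitted cluster", result.cluster())
      ```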

      Metrics

      • How do you measure how many (active) users are using the facility?
        -> Kibana (HTCondor jobs); a toy example of counting active users from HTCondor job records follows after this list
        -> Users accessing dCache per month (time series)
        -> Room for improvement
      • What other user metrics do you measure?
        -> extensive WGS, HTCondor, dCache, ... service monitoring
        • Real-time or historical time series?
          -> both
          -> per-job metrics are possible
          -> working on eBPF monitoring for extensive debugging (especially compute-storage interaction)
      • Which KPIs do you consider to define success from the point of view of AF operations?
        -> #users, #jobs, #papers, #tickets, #complaints in user meetings
        -> no quantitative metric for "interactivity" so far
        -> 100% utilization is not an aim!
      • How do you monitor storage (XCache or other) usage and performance?
        -> dCache and DUST/NFS are extensively monitored for performance and stability ... but traffic is unmanaged
        -> a combined storage-CPU view is important, and is being developed at DESY
        -> ideally, jobs should tell us what storage/data they need, as input for scheduling (we are not there yet)
      • Number of users?
        -> see above
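
      As a toy illustration of the "active users" metric above, one could count distinct job owners in the HTCondor history of a single scheduler; the 30-day window is an arbitrary choice, and DESY's actual accounting runs through Kibana and dCache time series:

      ```python
      # Sketch: counting distinct HTCondor job owners over the last 30 days
      # as one possible proxy for "active users" on a single schedd.
      import time
      import htcondor

      cutoff = int(time.time()) - 30 * 24 * 3600
      schedd = htcondor.Schedd()
      ads = schedd.history(
          f"CompletionDate > {cutoff}",   # jobs that finished in the window
          ["Owner"],                      # only fetch the owner attribute
          -1,                             # no limit on matches
      )
      active_users = {ad.get("Owner") for ad in ads}
      print(f"{len(active_users)} distinct users in the last 30 days")
      ```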

      Software distribution

      • How are the client software libraries made available to the users?
        -> CVMFS for central experiment software (ATLAS, CMS, Belle)
        -> Python virtual environments (stored on NFS or AFS)
        -> Some group software in AFS or NFS
        -> Connection to Gitlab to be improved
        -> Some level of support for locally installed development software
        -> Support for containers (WGS, HTCondor), either CERN-CVMFS-unpacked, from Gitlab pipelines, or NFS-based
      • To what extent, and how, can users customize their software stack (e.g. versions of Coffea and dependencies)?
        -> Python virtualenvs, containers, local/group installations (see the sketch after this list)
      • To what extent do you provide support on the client software side? E.g. having one or a few software distributions that are "guaranteed" to work?
        -> We provide a compute infrastructure, but no experiment-specific implementation (for HEP; for Photon Science: extensive provisioning of software)
        -> We could envisage offering vanilla DASK (but not experiment-specific, e.g. no Coffea, Scikit-HEP, ...)
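
      A minimal sketch of the virtualenv route mentioned above, assuming a user builds a project environment on shared NFS space; the path and package names are hypothetical examples:

      ```python
      # Sketch: creating a project-specific virtual environment on shared (NFS)
      # space and installing a user-chosen analysis stack into it.
      import subprocess
      import venv

      env_dir = "/nfs/dust/myexperiment/user/alice/venvs/analysis"  # hypothetical path
      venv.create(env_dir, with_pip=True)                           # stdlib venv module

      # Install (and optionally pin) the packages the analysis needs,
      # independently of the system Python.
      subprocess.run(
          [f"{env_dir}/bin/pip", "install", "coffea", "dask[distributed]"],
          check=True,
      )
      ```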

      Other aspects from DESY:

      • Storage centricity / Integration
        -> data at the center; CPU is designed according to how data is presented and needed
      • Resource access control & User Identities
        -> based on POSIX UID/GID/IP security: simple and efficient
        -> tokens and remote access via XRootD might be less efficient
      • Access protocols for data, compute
        -> Rely on standard, community-overarching protocols, e.g. NFS, HTTP, POSIX.
      • Efficient utilization ("green", funding, ...)
      • Support, governance
        -> Monthly NAF Users Committee meetings
        -> weekly contributions in the local group meetings
        -> yearly user meetings
      • Interdisciplinarity (two LHC experiments, other HEP, different communities)
        -> No HEP-specific solutions (and even less: ATLAS/CMS/...-only solutions)
      • Takes on DASK:
        • We should offer something with DASK that makes ~90% of DASK users happy (see the sketch after this list)
        • Offer feedback on what works and what does not: support and/or automated feedback channels and/or contain users
        • There will always be ~10% other users
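
      A minimal sketch of what a "vanilla DASK" offering on top of the existing HTCondor batch system could look like, using dask-jobqueue; the resource requests and worker counts are illustrative only:

      ```python
      # Sketch: vanilla Dask on an HTCondor batch system via dask-jobqueue.
      from dask_jobqueue import HTCondorCluster
      from dask.distributed import Client
      import dask.array as da

      cluster = HTCondorCluster(
          cores=1,            # cores per worker job
          memory="2GB",       # memory per worker job
          disk="10GB",        # scratch disk per worker job
      )
      cluster.scale(jobs=10)  # submit 10 worker jobs to HTCondor

      client = Client(cluster)

      # Trivial check that scheduler and workers are usable.
      x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
      print(x.mean().compute())
      ```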

      Did we forget to ask about local and/or remote data access? (Markus)

      Speaker: Markus Schulz (CERN)