Analysis Facility Pilot (Weekly Discussion)
Friday 25 April 2025 - 14:00
Follow up on the discussion with DESY - Markus Schulz (CERN)
14:00 - 15:00
Room: 513/R-068
# The answers given offline by Yves Kemp:

## Overview

- *What triggered the deployment of the AF?*
  -> In 2007 the Grid was working for production at large scale, with Grid data at DESY
  -> A complementary facility for analysis by individual users and groups: interactive, low latency, low entry threshold
  -> All German institute users (later: global for Belle), ATLAS, CMS, LHCb, (ILC) ... later also HERA and new DESY HEP experiments
  -> Model for the Photon Science community at DESY
- *Were users demanding Jupyter/Dask/scale-out etc.?*
  -> In 2007 the buzzword was "PROOF" ... which we implemented, and no one used it
  -> Since ~2019: first SPARK tests with Belle
  -> SPARK presented in 2019/2020 to the local ATLAS & CMS groups, little to no interest
  -> Jupyter RAM troubles in 2023: larger notebooks solved the problem, no requests for DASK/SPARK anymore
  -> DASK only gained momentum towards the end of 2024
- *Main services offered by the AF*
  -> Interactive, batch, storage, support & consulting, software, connection to the Grid
  -> Interactive: ssh, FastX, (scaling) Jupyter (for all communities), GitLab pipeline integration
  -> Storage: Grid dCache data (plus additional space), additional local NFS project space, $TMP on SSD (~20 GB/core), plus access to RAW data and tape; integrated into the DAQ and the experiments' data management (Rucio, ...)
- *Scale: amount of resources and number of users*
  -> Interactive: several O(10) hypervisors with VMs for login
  -> Batch: ~180 WNs, 9000 cores (no HT, in contrast to the Grid, which uses HT); some GPUs, and if larger GPU capacity is required: more cross-usage with the Maxwell HPC system
  -> dCache storage: ~20 PB (full Tier-2 plus additional space, e.g. "LOCALGROUPDISK")
  -> NFS project space: 5 PB (of which ~2 PB for the NAF)
  -> Users: see next point
- *What has been the adoption by users?*
  -> Currently > 200 users per month (as seen by dCache storage) ... ~50% are Belle users
- *Is the usage increasing over time?*
  -> Yes (last 5 years: +10%)
  -> The sheer number of users matters less than the number of workflows when it comes to supporting people
  -> CPU usage fluctuates
- *Do you only support interactive analysis? If not, what does non-interactive analysis look like?*
  -> HTCondor batch jobs
  -> Jupyter integrated into HTCondor
- *Can the AF be used via a terminal/batch?*
  -> Yes: SSH + HTCondor (see the sketch after this list)
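A minimal sketch of what such a non-interactive, terminal-style workflow could look like via the HTCondor Python bindings; the script name and resource requests are illustrative assumptions, not NAF defaults:

```python
# Hedged sketch: submitting a batch analysis job through the HTCondor
# Python bindings, as an alternative to a plain submit file over SSH.
# "analysis.py" and the resource requests are placeholders, not NAF defaults.
import htcondor

job = htcondor.Submit({
    "executable": "/usr/bin/python3",
    "arguments": "analysis.py",              # hypothetical user analysis script
    "request_cpus": "1",
    "request_memory": "2GB",
    "request_disk": "10GB",
    "output": "analysis.$(ClusterId).out",
    "error": "analysis.$(ClusterId).err",
    "log": "analysis.$(ClusterId).log",
})

schedd = htcondor.Schedd()                   # scheduler on the submit/login node
result = schedd.submit(job, count=1)         # queue one job
print("submitted cluster", result.cluster())
```

The same job description works as a classic submit file with `condor_submit` from an SSH session; the bindings simply make it scriptable from Python or a notebook.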
## Metrics

- *How do you measure how many (active) users are using the facility?*
  -> Kibana (HTCondor jobs)
  -> Users accessing dCache per month (time series)
  -> Room for improvement
- *What other user metrics do you measure?*
  -> Extensive WGS, HTCondor, dCache, ... service monitoring
- *Real-time or historical time series?*
  -> Both
  -> Per-job metrics are possible
  -> Working on eBPF monitoring for extensive debugging (especially the compute-storage interaction)
- *What KPI do you consider to define success from the point of view of AF operations?*
  -> #users, #jobs, #papers, #tickets, #complaints raised in user meetings
  -> No quantitative metric for "interactivity" so far
  -> 100% utilization is *not* an aim!
- *How do you monitor storage (XCache or other) usage and performance?*
  -> dCache and DUST/NFS are extensively monitored for performance and stability ... but traffic is unmanaged
  -> A combined storage-CPU view is important and is being developed at DESY
  -> Ideally, jobs should tell us what storage/data they need, as input for scheduling (we are not yet there)
- *Number of users?*
  -> See above

## Software distribution

- *How are the client software libraries made available to the users?*
  -> CVMFS for central experiment software (ATLAS, CMS, Belle)
  -> Python virtual environments (stored on NFS or AFS)
  -> Some group software in AFS or NFS
  -> Connection to GitLab to be improved
  -> Some level of support for local installation of development software
  -> Support for containers (WGS, HTCondor): CERN CVMFS "unpacked", GitLab pipelines, or NFS-based
- *To what extent, and how, can users customize their software stack (e.g. versions of Coffea and dependencies)?*
  -> Python virtualenvs, containers, local/group installations
- *To what extent do you provide support on the client software side? E.g. having one or a few software distributions that are "guaranteed" to work?*
  -> We provide a compute infrastructure, but no experiment-specific implementation (for HEP; for Photon Science: extensive provisioning of software)
  -> We could envisage offering *vanilla* DASK (but nothing experiment specific, e.g. no Coffea, Scikit-HEP, ...)

## Other aspects from DESY

- Storage centricity / integration
  -> Data at the center: CPU is designed according to how data is presented and needed
- Resource access control & user identities
  -> Based on POSIX UID/GID/IP security: simple and efficient
  -> Tokens and remote access via xrd might be less efficient
- Access protocols for data and compute
  -> Rely on standard, community-overarching protocols, e.g. NFS, HTTP, POSIX
- Efficient utilization ("green", funding, ...)
- Support, governance
  -> Monthly NAF users committee meetings
  -> Weekly contributions in the local group meetings
  -> Yearly user meetings
- Interdisciplinarity (two LHC experiments, other HEP experiments, different communities)
  -> No HEP-specific solutions (and even less an ATLAS/CMS/...-only solution)
- Takes on DASK (see the sketch below):
  - We should offer something with DASK that makes ~90% of DASK users happy
  - Offer feedback on what works and what does not: support and/or automated feedback channels and/or contain users
  - There will always be ~10% other users

**Did we forget to ask about local and/or remote data access?**

Markus
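Sketch referenced under "Takes on DASK" above: a minimal *vanilla* DASK setup that scales workers out through an HTCondor pool using dask-jobqueue. The resource values, job counts, and the choice of dask-jobqueue are illustrative assumptions, not a description of the NAF configuration.

```python
# Hedged sketch: "vanilla" Dask scaled out over an HTCondor pool via
# dask-jobqueue, with no experiment-specific layer on top (no Coffea etc.).
# Worker resource requests and job counts are illustrative placeholders.
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster
import dask.array as da

cluster = HTCondorCluster(cores=1, memory="2 GB", disk="2 GB")  # per-worker request
cluster.scale(jobs=4)      # ask HTCondor for four worker jobs
client = Client(cluster)   # attach the interactive session (e.g. Jupyter) to them

# Toy workload standing in for a user analysis: a mean over a large chunked array.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
print(x.mean().compute())  # executed on the HTCondor-provisioned workers
```

Something at this level of generality would target the ~90% of DASK users mentioned above; anything experiment specific (Coffea, Scikit-HEP stacks, ...) would remain on the user or experiment side.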