The answers below were given offline by Yves Kemp:
Overview:
- What triggered the deployment of the AF?
-> In 2007 the Grid was working for large-scale production, with Grid data at DESY
-> A complementary facility for analysis by individual users and groups: interactive, low latency, low entry threshold
-> All German institute users (later: worldwide for Belle) of ATLAS, CMS, LHCb, (ILC) ... later also HERA and new DESY HEP experiments
-> Model for the Photon Science community at DESY
- Were users demanding Jupyter/Dask/Scale-out etc.?
-> In 2007 the buzzword was "PROOF" ... which we implemented, and no one used it
-> Since ~2019: first Spark tests with Belle
-> Presented Spark in 2019/2020 to the local ATLAS & CMS experiments; little to no interest
-> Jupyter RAM troubles in 2023: larger notebooks solved the problem, no requests for Dask/Spark anymore
-> Dask only gained momentum towards the end of 2024
- Main services offered by the AF
-> Interactive, Batch, Storage, Support&Consulting, Software, Connection to the Grid
-> Interactive: SSH, FastX, (scaling) Jupyter (for all communities), Gitlab pipeline integration
-> Storage: Grid dCache data (plus additional space), additional local project space on NFS, $TMP on SSD (~20 GB/core), plus access to RAW data and tape; integrated into the DAQ and into the experiments' data management (Rucio, ...) (see the sketch below)
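A minimal sketch of how a batch job could use these storage layers together: read input from the dCache NFS export, work on the node-local SSD $TMP, and keep only the compact results on NFS project space. All paths below are hypothetical placeholders, not the actual NAF mount points.

    import os
    import shutil

    # Hypothetical paths: dCache exported via NFS, NFS project space, node-local SSD scratch.
    DCACHE_INPUT = "/pnfs/example-site.de/data/myexperiment/sample.root"  # placeholder
    PROJECT_OUTPUT = "/nfs/project/myexperiment/user/jdoe/results"        # placeholder
    scratch = os.environ.get("TMP", "/tmp")                               # node-local SSD

    # Stage the input onto fast local scratch and run the analysis there ...
    local_input = shutil.copy(DCACHE_INPUT, scratch)
    local_result = os.path.join(scratch, "histograms.root")
    # ... the analysis reads local_input and writes local_result ...

    # ... then copy only the small result file back to the shared project space.
    os.makedirs(PROJECT_OUTPUT, exist_ok=True)
    shutil.copy(local_result, PROJECT_OUTPUT)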
- Scale: amount of resources and number of users.
-> Interactive: several (O(10)) hypervisors with VMs for login
-> Batch: ~180 worker nodes, ~9000 cores (no hyper-threading, in contrast to the Grid, which uses HT); some GPUs, and if larger GPU capacity is required, cross-usage with the Maxwell HPC system
-> dCache storage: ~20 PB (full Tier-2 plus additional space, e.g. "LOCALGROUPDISK")
-> NFS project space: 5 PB (of which ~2 PB for the NAF)
-> Users: see next point
- What has been the adoption by users?
-> Currently > 200 users per month (as seen by dCache storage) ... ~50% are Belle users
- Is the usage increasing over time?
-> Yes (+10% over the last 5 years)
-> The sheer number of users is less important than the number of workflows for supporting people
-> CPU usage fluctuates.
- Do you only support interactive analysis?
- If not, what does non-interactive analysis look like?
-> HTCondor batch jobs
-> Jupyter integrated into HTCondor
- Can the AF be used via a terminal/batch?
-> Yes: SSH + HTCondor (see the sketch below)
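For illustration, a minimal sketch of a non-interactive submission using the HTCondor Python bindings; the executable, arguments and resource requests are placeholders, and the same job could equally be described in a plain submit file from an SSH session.

    import htcondor  # HTCondor Python bindings (assumed to be available on the login nodes)

    # Placeholder job description; adjust executable, arguments and resources as needed.
    job = htcondor.Submit({
        "executable": "run_analysis.sh",      # hypothetical user script
        "arguments": "--input sample.root",
        "request_cpus": "1",
        "request_memory": "2GB",
        "request_disk": "10GB",
        "output": "job.out",
        "error": "job.err",
        "log": "job.log",
    })

    schedd = htcondor.Schedd()                # talk to the local scheduler
    result = schedd.submit(job)               # queue one instance of the job
    print("submitted cluster", result.cluster())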
Metrics:
- How do you measure how many (active) users are using the facility?
-> Kibana (HTCondor jobs)
-> Users accessing dCache per month (time series)
-> Room for improvement (a sketch of the counting idea follows below)
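As an illustration of the counting idea only (the actual NAF accounting relies on Kibana dashboards and dCache access records), distinct batch users over a time window could be derived from HTCondor job history:

    import datetime
    import htcondor

    # Count distinct job owners whose jobs completed in the last 30 days (illustrative only).
    since = int((datetime.datetime.now() - datetime.timedelta(days=30)).timestamp())
    schedd = htcondor.Schedd()
    owners = {
        ad["Owner"]
        for ad in schedd.history(f"CompletionDate >= {since}", projection=["Owner"])
    }
    print(f"{len(owners)} distinct batch users in the last 30 days")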
- What other user metrics do you measure?
-> Extensive service monitoring of WGS (workgroup servers), HTCondor, dCache, ...
- Real-time or historical time series?
-> both
-> Per-job metrics are possible
-> Working on eBPF monitoring for extensive debugging (especially compute-storage interaction)
- What KPI do you consider to define success from the point of view of AF operations?
-> #users, #jobs, #papers, #tickets, #complaints in user meetings
-> No quantitative metric for "interactivity" so far.
-> 100% utilization is not an aim!
- How do you monitor storage (XCache or other) usage and performance?
-> dCache and DUST/NFS are extensively monitored for performance and stability ... but traffic is unmanaged
-> A combined storage-CPU view is important and is being developed at DESY
-> Ideally, jobs should tell us what storage/data they need, as input for scheduling (we are not there yet)
- Number of users?
-> see above
Software distribution:
- How are the client software libraries made available to the users?
-> CVMFS for the central experiment software of ATLAS, CMS, Belle
-> Python virtual environments (stored on NFS or AFS)
-> Some group software in AFS or NFS
-> Connection to Gitlab to be improved
-> Some level of support for local installation of development software
-> Support for containers (WGS, HTCondor): either via CERN CVMFS unpacked images, Gitlab pipelines, or NFS-based
- To what extent, and how, can users customize their software stack (e.g. versions of Coffea and dependencies)?
-> Python virtualenv, containers, local/group installations (see the sketch below)
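A minimal sketch of such a per-user customization: a Python virtual environment kept on shared project space with an explicitly chosen analysis stack. The path is hypothetical and the package names are examples only; versions can be pinned as needed.

    import subprocess
    import venv

    # Hypothetical location on shared project space; adjust to your group area.
    ENV_DIR = "/nfs/project/myexperiment/user/jdoe/venvs/coffea-analysis"

    # Create the virtual environment with pip available.
    venv.EnvBuilder(with_pip=True).create(ENV_DIR)

    # Install the desired analysis stack; pin versions (e.g. "coffea==<version>")
    # for reproducibility. Package names here are examples only.
    subprocess.run(
        [f"{ENV_DIR}/bin/pip", "install", "coffea", "dask[distributed]"],
        check=True,
    )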
- To what extent do you provide support of the client software side? E.g. having one or few software distributions that are "guaranteed" to work?
-> We provide a compute infrastructure, but no experiment-specific implementation for HEP (Photon Science: extensive provisioning of software)
-> We could envisage offering vanilla Dask (but nothing experiment-specific, e.g. no Coffea, Scikit-HEP, ...); a sketch follows below
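A vanilla Dask offering on top of the existing HTCondor pool could look roughly like the following sketch using dask-jobqueue; package availability and the resource numbers are assumptions, not a description of a current NAF service.

    from dask.distributed import Client
    from dask_jobqueue import HTCondorCluster

    # Start Dask workers as jobs in the existing HTCondor pool (resource values are placeholders).
    cluster = HTCondorCluster(cores=2, memory="4GB", disk="10GB")
    cluster.scale(jobs=10)        # request 10 worker jobs

    client = Client(cluster)      # connect a Dask client to the cluster

    # Trivial sanity check: run a small task on the workers.
    total = client.submit(sum, range(100)).result()
    print(total)                  # 4950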
Other aspects from DESY:
- Storage centricity / Integration
-> Data is at the center; CPU is designed according to how data is presented and needed.
- Resource access control & User Identities
-> Based on POSIX UID/GID/IP security: simple and efficient
-> Tokens and remote access via XRootD might be less efficient
- Access protocols for data, compute
-> Rely on standard, community-overarching protocols, e.g. NFS, HTTP, POSIX.
- Efficient utilization ("green", funding, ...)
- Support, governance
-> Monthly NAF Users Committee meetings
-> Weekly contributions to the local group meetings
-> Yearly user meetings
- Interdisciplinarity (two LHC, different HEP, different communities)
-> No HEP-specific solutions (and even less so: ATLAS/CMS/...-only solutions)
- Takes on Dask:
- We should offer something with Dask that makes ~90% of Dask users happy
- Offer feedback on what works and what does not: support and/or automated feedback channels and/or contain users
- There will always be ~10% other users
Did we forget to ask about local and/or remote data access? (Markus)
Speaker: Markus Schulz (CERN)