25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

An On-Grid deployment of ML Inference as a service at a Tier-2

26 May 2026, 16:51
18m
MHMK 202

Oral Presentation Track 4 - Distributed computing

Speaker

Albert Gyorgy Borbely (University of Glasgow (GB))

Description

Recent developments demonstrate that HEP software can run effectively on
GPUs, while advances in ML models have shown predictable scaling laws
for compute, data, and model size, consistent with trends across the
wider AI community. As a result, there is growing demand within HEP for
inference using larger models that have already delivered significant
physics gains, such as b-tagging in ATLAS with the GN2 transformer-based
neural network.

At present, ML inference in HEP is largely performed on CPUs using
translation libraries such as ONNX. However, a sharp rise in RAM
costs—driven by supply constraints and strong demand for HBM2
high-bandwidth memory—makes it increasingly unlikely that WLCG sites
will move far beyond the 2 GB per-job memory limit. In response, both
the ATLAS and CMS collaborations have proposed inference-as-a-service
solutions to simplify model deployment while addressing memory
constraints and rapidly growing model sizes.

One possible implementation is an on-Grid inference-as-a-service
deployment that uses site-local GPUs with the NVIDIA Triton inference
server and standard Grid tools, including ARC-CE, HTCondor, CVMFS, and
XCache. We describe progress on this approach at the Glasgow Tier-2 WLCG
site, along with tests involving the submission of Grid jobs. Reusing
underutilised GPU resources already available at Grid sites could offer
a pragmatic way to meet the increasing demand for this type of service.
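
As an illustration of how a Grid job might talk to such a service, the following is a minimal sketch of a client request to a Triton server over its HTTP/REST inference protocol (KServe v2), using only the Python standard library. The host name, model name, and input tensor name are hypothetical placeholders, not details of the Glasgow deployment; port 8000 is Triton's default HTTP port.

```python
import json
import math
import urllib.request


def build_infer_request(host, model, input_name, values, shape, datatype="FP32"):
    """Build the URL and JSON body for a KServe v2 inference request.

    `values` is the flattened (row-major) tensor content and `shape`
    its dimensions. All names passed in are caller-supplied assumptions.
    """
    if math.prod(shape) != len(values):
        raise ValueError("shape does not match the number of values")
    url = f"http://{host}/v2/models/{model}/infer"
    payload = {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(values),
            }
        ]
    }
    return url, json.dumps(payload).encode("utf-8")


def infer(host, model, input_name, values, shape, timeout=10.0):
    """Send the request to the server and return the list of output tensors."""
    url, body = build_infer_request(host, model, input_name, values, shape)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["outputs"]


if __name__ == "__main__":
    # Hypothetical endpoint and model; replace with site-specific values.
    outputs = infer(
        "triton.example.org:8000", "demo_model", "INPUT0",
        [0.1, 0.2, 0.3, 0.4], (1, 4),
    )
    print(outputs)
```

In this arrangement the model weights stay in GPU memory on the server, so the job itself holds only the input and output tensors, which is the memory argument made above. A production client would more likely use a dedicated client library and a binary or gRPC transport rather than JSON.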

Author

Albert Gyorgy Borbely (University of Glasgow (GB))

Co-authors

David Britton (University of Glasgow (GB))
Emanuele Simili (University of Glasgow (GB))
Gordon Stewart
Samuel Cadellin Skipsey
