25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

JobLens: A Lightweight Job Observability Collector for High-Throughput HEP Computing

Not scheduled
1m

Poster Presentation | Track 4 - Distributed computing | Poster

Speaker

Mr Zhenyuan Wang (Computing center, Institute of High Energy Physics, CAS, China)

Description

As the processing demands of modern high-energy physics (HEP) experiments grow, traditional monitoring tools struggle in high-throughput production environments: they are cumbersome to deploy and offer only coarse-grained observability. JobLens is a lightweight, one-click-deployable data collector that provides fine-grained, job-level observability for HEP workloads. Its architecture rests on three components: (1) eBPF-based kernel instrumentation that dynamically traces process lifecycles and system calls with near-zero overhead and without kernel modifications; (2) a highly configurable plugin architecture with asynchronous double-buffered pipelines that export metrics to multiple backends (Elasticsearch, Prometheus, Kafka) while keeping average CPU overhead under 5%; and (3) a Lua-scripted rule engine that dynamically registers monitoring policies to detect and track specific job categories in HTCondor-managed HEP clusters. Operators define custom matching rules (by experiment, user group, or resource template) as scripts; the rules are evaluated at runtime to instantiate per-job collectors, which removes the need for manual configuration. Design analysis and preliminary benchmarks indicate support for more than 200 concurrent jobs on a single worker node, with a design target of sub-second 99th-percentile collection latency. Comprehensive validation at production scale across HEP experiment workflows is under way.
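
To make two of the ideas above concrete, the following is a minimal Python sketch, not the JobLens implementation. It shows (a) rule matching on job attributes and (b) a double-buffered hand-off between a collector and an exporter. Everything named here is a hypothetical stand-in: the `Rule` and `DoubleBuffer` classes, the `select_rules` and `flush` helpers, the attribute keys `experiment`, `group`, and `template`, and the values `expA` and `gpu`. Python predicates replace the Lua scripts described in the abstract, and eBPF tracing is not shown.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical matching rule: a name plus a predicate over job attributes."""
    name: str
    matches: Callable[[Dict[str, str]], bool]


def select_rules(job: Dict[str, str], rules: List[Rule]) -> List[str]:
    """Return the names of all rules whose predicate accepts this job."""
    return [rule.name for rule in rules if rule.matches(job)]


class DoubleBuffer:
    """Producers append to the active buffer; the exporter takes the filled one."""

    def __init__(self) -> None:
        self._active: List[dict] = []
        self._lock = threading.Lock()

    def append(self, sample: dict) -> None:
        with self._lock:
            self._active.append(sample)

    def swap(self) -> List[dict]:
        # Install a fresh buffer under the lock and return the filled one,
        # so the (possibly slow) export happens without blocking producers.
        with self._lock:
            full, self._active = self._active, []
        return full


def flush(buffer: DoubleBuffer, export: Callable[[List[dict]], None]) -> int:
    """Swap buffers and pass the filled batch to a backend exporter."""
    batch = buffer.swap()
    if batch:
        export(batch)
    return len(batch)


if __name__ == "__main__":
    # Hypothetical rules keyed on experiment and resource template.
    rules = [
        Rule("expA-jobs", lambda job: job.get("experiment") == "expA"),
        Rule("gpu-template", lambda job: job.get("template") == "gpu"),
    ]
    job = {"experiment": "expA", "group": "physics", "template": "gpu"}
    buffer = DoubleBuffer()
    for rule_name in select_rules(job, rules):
        buffer.append({"rule": rule_name, "job": job})
    # Stand-in exporter; a real backend plugin would send the batch elsewhere.
    flush(buffer, lambda batch: print(f"exported {len(batch)} samples"))
```

In a real collector, `flush` would presumably run on a separate exporter thread or timer, so that backend latency never stalls sample collection; the sketch calls it inline only to stay short.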

Authors

Jingyan Shi (IHEP); Mr Zhenyuan Wang (Computing center, Institute of High Energy Physics, CAS, China)
