Description
The integration of Large Language Models (LLMs) into research workflows introduces a largely opaque source of energy consumption and associated carbon emissions. Existing approaches to estimating AI energy consumption rely on time-based heuristics or static hardware profiling, which fail to capture the non-deterministic nature of generative inference. Variations in prompt design, quantization, and decoding strategy can cause significant fluctuations in energy use for the same workload, so estimates built on fixed assumptions give an unreliable basis for sustainability assessments.
This paper introduces the One Token Model (OTM), a unified framework that redefines energy measurement through output-normalized attribution, expressed in joules per output token. OTM integrates telemetry across three layers: infrastructure dynamics, model architecture, and inference behavior.
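As a rough illustration of what an output-normalized metric involves, the sketch below integrates sampled power over the duration of one request and divides by the number of tokens generated. It is not taken from the OTM implementation: the function names, the trapezoidal integration, and the sample values are all assumptions made for illustration.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_per_token(
    timestamps_s: Sequence[float], power_w: Sequence[float], output_tokens: int
) -> float:
    """Normalize the energy of one request by the number of tokens it produced."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return energy_joules(timestamps_s, power_w) / output_tokens


# Hypothetical telemetry for a single request: five power samples over 2 seconds.
sample_times_s = [0.0, 0.5, 1.0, 1.5, 2.0]
sample_power_w = [200.0, 300.0, 300.0, 300.0, 200.0]

request_energy_j = energy_joules(sample_times_s, sample_power_w)        # 550.0 J
request_j_per_token = joules_per_token(sample_times_s, sample_power_w, 110)  # 5.0 J/token
```

Normalizing by output tokens rather than wall-clock time is what makes two requests of different length, or two differently configured systems, comparable on the same scale.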
We validate OTM through a real-time monitoring system that quantifies the marginal energy cost of individual inference requests. By enabling fine-grained, comparable measurements across systems, OTM supports energy-aware optimization and promotes more sustainable, transparent research computing practices.
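One common way to think about the marginal energy cost of a request is as the energy drawn during the request minus what the hardware would have drawn while idle over the same interval. The sketch below follows that definition; the abstract does not say how the monitoring system defines the baseline, so the idle-power subtraction, the function names, and the numbers are assumptions for illustration only.

```python
def marginal_energy_joules(
    request_energy_j: float, idle_power_w: float, duration_s: float
) -> float:
    """Energy attributable to a request above an assumed constant idle baseline."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    baseline_j = idle_power_w * duration_s
    # Clamp at zero so measurement noise cannot yield a negative cost.
    return max(0.0, request_energy_j - baseline_j)


def marginal_joules_per_token(
    request_energy_j: float, idle_power_w: float, duration_s: float, output_tokens: int
) -> float:
    """Marginal energy of a request, normalized by the tokens it generated."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return marginal_energy_joules(request_energy_j, idle_power_w, duration_s) / output_tokens


# Hypothetical request: 550 J measured over 2 s on hardware idling at 100 W.
marginal_j = marginal_energy_joules(550.0, idle_power_w=100.0, duration_s=2.0)  # 350.0 J
marginal_jpt = marginal_joules_per_token(550.0, 100.0, 2.0, output_tokens=100)   # 3.5 J/token
```

On shared hardware serving several requests at once, a constant idle baseline is no longer enough, and the measured energy would also have to be apportioned among concurrent requests.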