Speaker
Description
Modern machine learning (ML) algorithms are sensitive to the specification of non-trainable parameters called hyperparameters (e.g., learning rate or weight decay). Without guiding principles, hyperparameter optimization is the computationally expensive process of sweeping over various model sizes and, at each size, re-training the model over a grid of hyperparameter settings. However, recent progress from the ML theory community has produced a prescription for scaling hyperparameters with respect to model size such that (1) the optimal hyperparameters identified for small models of a fixed architecture remain optimal for their larger counterparts (hyperparameter transfer) and (2) larger models perform better than their smaller counterparts (limiting behavior). When satisfied, these desiderata yield large computational savings and stable performance, useful, for example, when computing neural scaling laws. In this talk, we will present a recipe for achieving hyperparameter transfer and limiting behavior in graph transformers, transformer variants combining simple message passing with sparse attention computed over the edges of each input graph. Though relatively new, graph transformers have been shown to outperform simple GNNs and transformers on a variety of benchmark tasks, and they are particularly relevant to scientific datasets, where edges may encode known physical interactions and measurements. We will demonstrate the promise of these principled graph transformers on benchmark datasets and encourage discussion about how these results may be extended to tackle more challenging scenarios in particle physics.
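The scaling prescription referenced above is in the spirit of the Maximal Update Parameterization (μP) line of work; the following is a minimal sketch, not the speakers' actual recipe, of the width-dependent rules it implies for hidden layers (Adam learning rate scaled down by the width multiplier, initialization variance proportional to 1/fan_in). The helper name and the specific keys are hypothetical, introduced only for illustration.

```python
import math

def mup_scaled_hparams(base_lr, base_width, target_width):
    """Hypothetical helper: muP-style hyperparameter scaling sketch.

    Under muP-style scaling, a learning rate tuned at base_width is
    reused at target_width by dividing the hidden-layer (Adam) learning
    rate by the width multiplier, while hidden weights are initialized
    with standard deviation ~ 1/sqrt(fan_in).
    """
    width_mult = target_width / base_width
    return {
        "hidden_lr": base_lr / width_mult,                 # LR shrinks as width grows
        "hidden_init_std": 1.0 / math.sqrt(target_width),  # variance ~ 1/fan_in
        "output_mult": 1.0 / width_mult,                   # output logits scaled down
    }

# Example: tune hyperparameters at width 256, then transfer to width 4096.
small = mup_scaled_hparams(1e-3, base_width=256, target_width=256)
large = mup_scaled_hparams(1e-3, base_width=256, target_width=4096)
```

The point of the sketch is that `large` is obtained from the same tuned `base_lr` as `small`, so only the cheap small-width sweep is ever run.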
Significance
These results are novel and represent the first time principles for hyperparameter transfer and limiting behavior have been applied to graph transformers. They will have a significant impact on the particle physics community because (1) these powerful model scalings have not yet been widely adopted by scientists and (2) particle physicists place particular emphasis on graph-structured data due to the sparse and irregular nature of collider data.
| Experiment context, if any |
|---|
| N/A |