Fast Machine Learning for Science Conference 2024
Stewart Center 306 (Third floor)
Purdue University

We are pleased to announce Fast Machine Learning for Science, a four-day event hosted by Purdue University from October 15 to 18, 2024. The first three days will be workshop-style, with invited and contributed talks. The last day will be dedicated to technical demonstrations and satellite meetings. The event will be hybrid, with an in-person venue and the option to join virtually. For those attending in person, there will be a social reception on the evening of Tuesday, October 15, and a dinner on Thursday, October 17.
As advances in experimental methods create growing datasets and higher resolution and more complex measurements, machine learning (ML) is rapidly becoming the major tool to analyze complex datasets over many different disciplines. Following the rapid rise of ML through deep learning algorithms, the investigation of processing technologies and strategies to accelerate deep learning and inference is well underway. We envision this will enable a revolution in experimental design and data processing as a part of the scientific method to accelerate discovery greatly. This workshop is aimed at current and emerging methods and scientific applications for deep learning and inference acceleration, including novel methods of efficient ML algorithm design, ultrafast on-detector inference and real-time systems, acceleration as-a-service, hardware platforms, coprocessor technologies, distributed learning, and hyper-parameter optimization.
Abstract submission deadline: September 16th, 2024
Registration deadline: October 1st, 2024

Organising Committee:
Mia Liu (Chair) 
Maria Dadarlat (Co-chair)
Andy Jung
Norbert Neumeister
Wei Xie
Paul Duffel
Haitong Li
Guang Ling
Eugenio Culurciello
Yong Chen
Alexandra Boltasseva
Laimei Nie
Scientific Committee:
Thea Aarrestad (ETH Zurich)
Javier Duarte (UCSD) 
Phil Harris (MIT)
Burt Holzman (Fermilab) 
Scott Hauck (U. Washington) 
Shih-Chieh Hsu (U. Washington) 
Sergo Jindariani (Fermilab) 
Mia Liu (Purdue University) 
Allison McCarn Deiana (Southern Methodist University) 
Mark Neubauer (U. Illinois Urbana-Champaign)
Jennifer Ngadiuba (Fermilab)
Maurizio Pierini (CERN)
Sioni Summers (CERN)
Alex Tapper (Imperial College) 
Nhan Tran (Fermilab)
Verena Martinez Outschoorn (UMass Amherst)

                    - 
        
            
        08:15
    
    
        →
        
            09:00
        
    
        
        Registration 45m
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
- 
        
            
                
        09:00
    
    
        →
        
            09:10
        
    
            
        
        Welcome 10m
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
- 
        
            
                
        09:10
    
    
        →
        
            10:55
        
    
            
        
        Invited talks: Chair: Prof. Shih-Chieh Hsu
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        Invited talks: 30 mins + 5 mins Q&A
        - 
        
            
                
        09:10
    
    
            
        
        [Remote] Opening Talk 35m
        Speaker: Javier Mauricio Duarte (Univ. of California San Diego (US))
- 
        
            
                
        09:45
    
    
            
        
        Fast Machine Learning at the LHC 35m
        Speaker: Dylan Sheldon Rankin (University of Pennsylvania (US))
- 
        
            
                
        10:20
    
    
            
        
        Enabling real-time detection, characterization and inference for the time-domain sky 35m
        Speaker: Gautham Narayan (UIUC)
 
- 
        
            
                
        10:55
    
    
        →
        
            11:20
        
    
            
        
        Coffee/Posters
        Stewart Center 302 (Third floor), Purdue University, 128 Memorial Mall, West Lafayette, IN 47907
        Information about posters: The display area of each board is 69” wide x 46” tall. Both portrait and landscape orientations fit, though portrait will probably work best. The rolling whiteboards have magnets for attaching the posters.
- 
        
            
                
        11:20
    
    
        →
        
            13:05
        
    
            
        
        Invited talks: Chair: Prof. Phil Harris
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        Invited talks: 30 mins + 5 mins Q&A
        - 
        
            
                
        11:20
    
    
            
        
        [Remote] Machine learning and electronic structure calculation in materials and beyond 35m
        I will briefly outline the huge importance of density functional theory (DFT) calculations to modern materials design (and to chemistry, warm dense matter, etc.). I will then discuss the impact of machine learning on the field, especially the rise of machine-learned potentials. I will briefly mention my own work in using ML to improve DFT.
        Speaker: Kieron Burke
- 
        
            
                
        11:55
    
    
            
        
        ML for Accelerator control and design 35m
        Speaker: Daniel Ratner (SLAC)
- 
        
            
                
        12:30
    
    
            
        
        Topological diagnostics of ML and AI algorithms 35m
        It is now standard practice across science to use models that have been trained, fit, or learned based on a set of data. Many of these models involve a large number of parameters that make direct interpretation of the model challenging and a near black-box model view appropriate. We explore the possibilities of using ideas based on topological analysis methods to understand and evaluate these AI and ML-based functions. These show a surprising ability to generate easy-to-understand insights into these black boxes.
        Speaker: David Gleich (Purdue)
 
- 
        
            
        13:05
    
    
        →
        
            14:00
        
    
        
        Lunch 55m
- 
        
            
                
        14:00
    
    
        →
        
            15:15
        
    
            
        
        Contributed talks: Chair: Prof. Dylan Rankin
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        Contributed talks: 13 mins + 2 mins Q&A
        - 
        
            
                
        14:00
    
    
            
        
        [Remote] Randomized Point Serialization-Based Efficient Point Transformer in High-Energy Physics Applications 15m
        This study introduces a novel transformer model optimized for large-scale point cloud processing in scientific domains such as high-energy physics (HEP) and astrophysics. Addressing the limitations of graph neural networks and standard transformers, our model integrates local inductive bias and achieves near-linear complexity with hardware-friendly regular operations. One contribution of this work is the quantitative analysis of the error-complexity tradeoff of various sparsification techniques for building efficient transformers. Our findings highlight the superiority of using locality-sensitive hashing (LSH), especially OR & AND-construction LSH, in kernel approximation for large-scale point cloud data with local inductive bias. Based on this finding, we propose LSH-based Efficient Point Transformer (HEPT), which is based on randomized point serialization via E$^2$LSH with OR & AND constructions and is built upon regular computations. HEPT demonstrates remarkable performance on two critical yet time-consuming HEP tasks (tracking & pileup mitigation), significantly outperforming existing GNNs and transformers in accuracy and computational speed, marking a significant advancement in geometric deep learning and large-scale scientific data processing. Our code is available at https://github.com/Graph-COM/HEPT.
        Speaker: Siqi Miao (Georgia Tech)
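The OR & AND constructions named in the abstract above can be illustrated with a small, self-contained sketch. It is not the HEPT implementation (the authors' code is at the repository linked in the abstract); the function names, the bucket width, and the parameters `k` (hashes per table) and `m` (number of tables) are assumptions chosen for illustration. Each hash is an E2LSH-style random projection, `floor((a . x + b) / width)`.

```python
import math
import random


def make_hash(dim, width, rng):
    """One E2LSH-style hash: h(x) = floor((a . x + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda x: math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / width)


def make_tables(dim, k, m, width, seed=0):
    """Build m tables (OR construction), each holding k hashes (AND construction)."""
    rng = random.Random(seed)
    return [[make_hash(dim, width, rng) for _ in range(k)] for _ in range(m)]


def signature(tables, x):
    """Return one k-tuple of bucket indices per table."""
    return [tuple(h(x) for h in table) for table in tables]


def is_candidate_pair(tables, x, y):
    """AND: all k hashes in a table must agree. OR: any one table suffices."""
    return any(sx == sy for sx, sy in zip(signature(tables, x), signature(tables, y)))
```

Raising `k` makes each table stricter (fewer false matches), while raising `m` gives nearby points more chances to collide (fewer missed neighbors). Tuning the two together is what shapes the collision probability as a function of distance.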
- 
        
            
                
        14:15
    
    
            
        
        A Streamlined Neural Model for Real-Time Analysis at the First Level of the LHCb Trigger 15m
        One of the most significant challenges in tracking reconstruction is the reduction of "ghost tracks," which are composed of false hit combinations in the detectors. When tracking reconstruction is performed in real-time at 30 MHz, it introduces the difficulty of meeting high efficiency and throughput requirements. A single-layer feed-forward neural network (NN) has been developed and trained to address this challenge. The simplicity of the NN allows for parallel evaluation of many track candidates to filter ghost tracks using CUDA within the Allen framework. This capability enables us to run this type of NN at the first level of the trigger (HLT1) in the LHCb experiment. This neural network approach is already utilized in several HLT1 algorithms and is becoming an essential tool for Run 3. Details of the implementation and performance of this strategy will be presented in this talk.
        Speaker: Jiahui Zhuo (Univ. of Valencia and CSIC (ES))
- 
        
            
                
        14:30
    
    
            
        
        ML4GW: An all-encompassing software framework for real-time deep learning applications to search for gravitational waves 15m
        Deep Learning (DL) applications for gravitational wave (GW) physics are becoming increasingly common, but the infrastructure to validate them at scale or deploy them in real time is lacking. Gravitational-wave analysis requires a real-time time-series workflow. With ever more sensitive GW observing runs beginning in 2023-5 and progressing through the next decade, the demand for robust real-time processing in GW experimental pipelines will grow. We present ml4gw, an end-to-end software framework for optimized training and ML inference for real-time gravitational wave processing. This framework is rapidly being adopted for many applications, including denoising, binary black hole detection, anomaly detection, and real-time parameter estimation. These tools allow for the development of deep learning-powered GW physics applications, which are faster, more intuitive, and better able to leverage the powerful modeling techniques available in the GW literature. We present the ML4GW toolkit and discuss how it optimally leverages heterogeneous computing. Finally, we discuss the future of real-time heterogeneous computing within GW detection and how it can be used to probe our ever-expanding universe.
        Speaker: Will Benoit
- 
        
            
                
        14:45
    
    
            
        
        SONIC: A Portable framework for as-a-service ML serving 15m
        Computing demands for large scientific experiments, including experiments at the Large Hadron Collider and the future DUNE neutrino detector, will increase dramatically in the next decades. Heterogeneous computing provides a way to meet these growing computing demands despite the limitations brought on by the end of Dennard scaling. However, to effectively exploit heterogeneous computing, software needs to be adapted, and resources need to be balanced. We explore the novel approach of Services for Optimized Network Inference on Coprocessors (SONIC) and present a strategy for optimized integration of heterogeneous coprocessors, including GPUs, FPGAs, Graphcore IPUs and others. Focusing on ML algorithms, we demonstrate how SONIC can be designed to dynamically allocate heterogeneous resources in a fully optimized mode. With the rapid adoption of deep learning models for core algorithms at big scientific experiments, we present a path towards rapid integration of deep learning models, and a strategy for future large-scale computing at big experiments including the CMS and ATLAS detectors at the Large Hadron Collider. We show our proposed path clears the way for substantially improved data processing by optimally exploiting resources while simultaneously increasing the bandwidth and overall computational power of these future experiments.
        Speaker: Dmitry Kondratyev (Purdue University (US))
- 
        
            
                
        15:00
    
    
            
        
        Deep(er)RICH: Reconstruction of Imaging Cherenkov Detectors with Swin Transformers and Normalizing Flow Models 15m
        The Deep(er)RICH architecture integrates Swin Transformers and normalizing flows, and demonstrates significant advancements in particle identification (PID) and fast simulation. Building on the earlier DeepRICH model, Deep(er)RICH extends its capabilities across the entire kinematic region covered by the DIRC detector in the GlueX experiment. It learns PID tasks as a continuous function of the charged particle's momentum, direction, and point of impact on the DIRC plane, showing superior performance over traditional geometric reconstruction methods for PID. Leveraging GPU deployment, we have achieved state-of-the-art time performance, with an effective inference time for identifying a charged particle of $\leq \mathcal{O}(10)\,\mu$s, comparable to the first version of DeepRICH, and an effective simulation time of $< 1\,\mu$s per hit. This ideally enables near real-time applications, which are of particular interest for future high-luminosity experiments aiming to implement deep learning architectures in high-level triggers or more sophisticated streaming readout schemes like those under development at the EIC. The high quality of reconstruction and the fast computing time are two compelling features of the Deep(er)RICH architecture. The possibility of combining enhanced PID and fast simulations also enables handling complicated topologies arising from overlapping hit patterns detected in the same optical box and generated by simultaneously detected tracks, a problem that traditional methods currently cannot cope with. Consequently, Deep(er)RICH could contribute to important physics channels at both JLab and EIC.
        Deep(er)RICH is extremely portable; it is agnostic to the data injected, photon yield, and detector geometry, and can therefore be adapted to other experiments and imaging Cherenkov detectors beyond the DIRC at GlueX, such as the hpDIRC at ePIC.
        Speaker: James Giroux (W&M)
 
- 
        
            
                
        15:15
    
    
        →
        
            16:25
        
    
            
        
        Lightning talks: Chair: Dr. Vladimir Loncar
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        - 
        
            
                
        15:15
    
    
            
        
        Robust and interpretable deep learning by leveraging domain knowledge 5m
        Recently, compelling evidence for the emission of high-energy neutrinos from our host Galaxy - the Milky Way - was reported by IceCube, a neutrino detector instrumenting a cubic kilometer of glacial ice at the South Pole. This breakthrough observation is enabled by advances in AI, including a physics-driven deep learning method capable of exploiting available symmetries and domain knowledge. This reconstruction method combines deep learning with maximum-likelihood estimation.
        Analogously to Monte Carlo simulations, the neural network architecture is defined in the forward direction, which allows for the decoupling of physics and detector effects and thus direct incorporation of domain knowledge in the network architecture. Due to the exploitation of this prior knowledge, the required amount of training data is reduced, and training convergence is facilitated. The resulting model can robustly extrapolate along built-in symmetries, while retaining beneficial properties of maximum-likelihood estimation such as uncertainty quantification and explainability. The presented hybrid reconstruction method is therefore well suited for applications in simulation-based domains that require a high standard of interpretability and robustness.
        Speaker: Mirco Hünnefeld (University of Wisconsin-Madison)
- 
        
            
                
        15:20
    
    
            
        
        Intelligent experiments through real-time AI: GNN-based trigger pipeline for sPHENIX 5m
        This R&D project, initiated by the DOE Nuclear Physics AI-Machine Learning initiative in 2022, explores advanced AI technologies to address data processing challenges at RHIC and future EIC experiments. The main objective is to develop a demonstrator capable of efficient online identification of heavy-flavor events in proton-proton collisions (~1 MHz) based on their decay topologies, while optimizing the use of a limited DAQ bandwidth (~15 kHz). This showcases the transformative potential of AI and FPGA technologies in real-time data processing for high-energy nuclear and particle experiments. We deploy an attention-GNN-based trigger algorithm with a target latency of 10 $\mu$s, trained on simulated sPHENIX p+p collision data. The target device is a FELIX-712 board equipped with a Xilinx Kintex UltraScale FPGA. hls4ml and FlowGNN were used to create two IP cores for the AI inference.
        In this talk we report the latest progress of the project. First, we showcase the attention-based model used for the trigger pipeline, comparing it to other state-of-the-art GNN algorithms. Second, we report on the utilization of the FPGA infrastructure, data decoder, clusteriser, and a simplified trigger detection pipeline based on GARNET translated with both hls4ml and FlowGNN.
        Speaker: Jovan Mitrevski (Fermi National Accelerator Lab. (US))
- 
        
            
                
        15:25
    
    
            
        
        Interpreting and Accelerating Transformers for Jet Tagging 5m
        Attention-based transformers are ubiquitous in machine learning applications from natural language processing to computer vision. In high energy physics, one central application is to classify collimated particle showers in colliders based on the particle of origin, known as jet tagging. In this work, we study the interpretability and prospects for acceleration of Particle Transformer (ParT), a state-of-the-art model that leverages particle-level attention to improve jet-tagging performance. We analyze ParT's attention maps and particle-pair correlations in the $\eta$-$\phi$ plane, revealing intriguing features, such as a binary attention pattern that identifies critical substructure in jets. These insights enhance our understanding of the model's internal workings and learning process and hint at ways to improve its efficiency. Along these lines, we also explore low-rank attention, attention alternatives, and dynamic quantization to accelerate transformers for jet tagging. With quantization, we achieve a 50% reduction in model size and a 10% increase in inference speed without compromising accuracy. These combined efforts enhance both the performance and the interpretability of transformers in high-energy physics, opening avenues for more efficient and physics-driven model designs.
        Speakers: Aaron Wang (University of Illinois at Chicago (US)), Vivekanand Gyanchand Sahu (University of California San Diego)
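For readers unfamiliar with the dynamic quantization mentioned in the abstract above, the following is a minimal sketch of the general idea. It assumes symmetric 8-bit linear quantization, which the abstract does not specify, and it is not the speakers' implementation. Weights are quantized once ahead of time, while activation scales are computed at call time. The function names `quantize` and `quantized_dot` are hypothetical.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization: return (integer codes, scale)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def quantized_dot(weights_q, w_scale, activations):
    """Dot product with pre-quantized weights and dynamically quantized activations.

    The activation scale is derived from the values seen at call time
    (the "dynamic" part); accumulation happens on integers and the
    result is rescaled to floating point once at the end.
    """
    acts_q, a_scale = quantize(activations)
    acc = sum(w * a for w, a in zip(weights_q, acts_q))
    return acc * w_scale * a_scale
```

A short usage example: `w_q, s = quantize([0.5, -1.0, 0.25])` followed by `quantized_dot(w_q, s, [2.0, 1.0, -4.0])` returns a value close to the exact dot product of -1.0, with a small rounding error from the 8-bit codes.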
- 
        
            
                
        15:30
    
    
            
        
        S-QUARK: A Scalable Quantization-Aware Training Framework for FPGA Deployment based on Keras-v3 5m
        In this work, we present the Scalable QUantization-Aware Real-time Keras (S-QUARK), an advanced quantization-aware training (QAT) framework for efficient FPGA inference built on top of Keras-v3, supporting the TensorFlow, JAX, and PyTorch backends. The framework inherits all perks from the High Granularity Quantization (HGQ) library, and extends it to support fixed-point numbers with different overflow modes and different parametrizations of the fixed-point quantizers. Furthermore, it extends the HGQ library to support bit-accurate softmax and multi-head attention layers. Bit-exact minifloat quantizers with differentiable mantissa and exponent bits, as well as exponent bias, are also supported. On the TensorFlow and JAX backends, all layers provided by the framework support JIT compilation, which can significantly speed up the training process when the training process is I/O-bound. The speedup ranges from 1.5x to more than 3x compared to the HGQ framework, with a 10% to 100% overhead in training performance relative to native TensorFlow or JAX with Keras, depending on the exact model, dataset, and hardware used. The library is available under the LGPLv3 license at https://github.com/calad0i/s-quark.
        Speaker: Chang Sun (California Institute of Technology (US))
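The "overflow modes" of fixed-point numbers mentioned in the abstract above can be made concrete with a small sketch. This is a generic illustration of saturating versus wrapping overflow for signed fixed-point values, not the S-QUARK or HGQ API. The function name `to_fixed`, the mode labels `"SAT"` and `"WRAP"`, and the round-half-up rounding are assumptions made for this example.

```python
import math


def to_fixed(x, total_bits, int_bits, overflow="SAT"):
    """Quantize x to a signed fixed-point value.

    int_bits counts the sign bit; the remaining bits are fractional.
    overflow="SAT" clamps to the representable range, while
    overflow="WRAP" wraps around as two's-complement hardware would.
    """
    frac_bits = total_bits - int_bits
    n = math.floor(x * 2 ** frac_bits + 0.5)        # round half up to an integer code
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    if overflow == "SAT":
        n = max(lo, min(hi, n))
    elif overflow == "WRAP":
        n = (n - lo) % 2 ** total_bits + lo
    else:
        raise ValueError(f"unknown overflow mode: {overflow}")
    return n / 2 ** frac_bits
```

With 8 total bits and 4 integer bits, the representable range is -8.0 to 7.9375 in steps of 0.0625. An in-range input such as 1.3 becomes 1.3125 in either mode, whereas the out-of-range input 9.0 saturates to 7.9375 but wraps to -7.0. Wrapping is cheaper in hardware, so the choice is a trade-off between resource usage and robustness to outliers.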
- 
        
            
                
        15:35
    
    
            
        
        Online track reconstruction with graph neural networks on FPGAs for the ATLAS experiment 5m
        The next phase of high energy particle physics research at CERN will involve the High-Luminosity Large Hadron Collider (HL-LHC). In preparation for this phase, the ATLAS Trigger and Data AcQuisition (TDAQ) system will undergo upgrades to the online software tracking capabilities. Studies are underway to assess a heterogeneous computing farm deploying GPUs and/or FPGAs, together with the use of modern machine learning algorithms such as Graph Neural Networks (GNNs). We present a study on the reconstruction of tracks in the new all-silicon ATLAS Inner Tracker using GNNs on FPGAs for the Event Filter system. We explore each of the steps in a GNN-based tracking pipeline: graph construction, edge classification using an interaction network, and segmentation of the graph into track candidates. We investigate optimizations of the GNN approach that aim to minimize FPGA resource utilization and maximize throughput while retaining the high track reconstruction efficiency and low fake rates required for the ATLAS Event Filter tracking system. These studies include model hyperparameter tuning, model pruning and quantization-aware training, and sequential processing of regions of the detector as graphs.
        Speaker: Jared Burleson (University of Illinois at Urbana-Champaign)
- 
        
            
                
        15:40
    
    
            
        
        IceSONIC - Network AI Inference on Coprocessors for IceCube Offline Processing 5m
        An Artificial Intelligence (AI) model will spend “90% of its lifetime in inference.” Fully utilizing coprocessors, such as FPGAs or GPUs, for AI inference requires O(10) CPU cores to feed work to the coprocessors. Traditional data analysis pipelines will not be able to effectively and efficiently use the coprocessors to their full potential. To allow for distributed access to coprocessors for AI inference workloads, the LHC’s Compact Muon Solenoid (CMS) experiment has developed the concept of Services for Optimized Network Inference on Coprocessors (SONIC) using NVIDIA’s Triton Inference Servers. We have extended this concept for the IceCube Neutrino Observatory by deploying NVIDIA’s Triton Inference Servers in local and external Kubernetes clusters, integrating an NVIDIA Triton Client with IceCube’s data analysis framework, and deploying an OAuth2-based HTTP authentication service in front of the Triton Inference Servers. We will describe the setup and our experience adding this to IceCube’s offline processing system.
        Speaker: Benedikt Riedel
- 
        
            
                
        15:45
    
    
            
        
        Towards Online Machine Learning in DUNE Data Acquisition 5m
        Processing large volumes of sparse neutrino interaction data is essential to the success of liquid argon time projection chamber (LArTPC) experiments such as DUNE. High rates of radiological background must be eliminated to extract critical information for track reconstruction and downstream analysis. Given the computational load of this rejection, and potential real-time constraints of downstream analysis for certain physics applications, we propose the integration of machine learning based online data filtering into DUNE's data acquisition (DAQ) software through the Services for Optimized Network Inference on Coprocessors (SONIC) framework. This talk presents the current status of data processing methods for online data filtering within DUNE-DAQ. We show the status of implementing the NVIDIA Triton client-server model into DUNE DAQ, and compare directly to a real-time FPGA-based implementation from raw WIB readout data. We use the physics case of supernova pointing to benchmark the signal efficiency, latency, and throughput of our ML algorithms under various input and hardware configurations.
        Speaker: Andrew Mogan
- 
        
            
                
        15:50
    
    
            
        
        Fast Simulation of Particle Physics Calorimeters 5m
        Detector simulation is a key component of physics analysis and related activities in particle physics. In the upcoming High Luminosity LHC era, simulation will be required to use a smaller fraction of computing in order to satisfy resource constraints, at the same time as experiments are being upgraded with new higher-granularity detectors, which require significantly more resources to simulate. This computing challenge motivates the use of generative machine learning models as fast surrogates to replace full physics-based simulators. We introduce CaloDiffusion, a new model which applies state-of-the-art diffusion models to simulate particle showers in calorimeters.
        The simulations produced by CaloDiffusion are found to be nearly indistinguishable from those of full physics-based simulation, and can be generated up to 1000 times faster.
        Speaker: Oz Amram (Fermi National Accelerator Lab. (US))
- 
        
            
                
        15:55
    
    
            
        
        Real-Time AI-Based Data Selection in LArTPC Experiments Using Accelerated FPGA Platforms 5m
        The demand for machine learning algorithms on edge devices, such as Field-Programmable Gate Arrays (FPGAs), arises from the need to process and intelligently reduce vast amounts of data in real-time, especially in large-scale experiments like the Deep Underground Neutrino Experiment (DUNE). Traditional methods, such as thresholding, clustering, multiplicity checks, or coincidence checks, struggle to extract complex features from large data volumes. In contrast, certain machine learning algorithms offer more efficient, accurate, and power-efficient processing, making real-time analysis feasible and minimizing the need for costly offline data processing. We designed 2D convolutional neural networks (2DCNNs) to effectively detect rare events and reject background noise, demonstrating the viability of CNN-based algorithms for this application. Modern tools like hls4ml and HLS have streamlined the deployment of these models on FPGA hardware. The deployment of this model on Xilinx Alveo U250/U55c accelerator cards has demonstrated promising performance, comfortably meeting resource budget and latency targets. This talk will showcase the potential for expanding the model to classify a wider range of signals with greater precision, along with the FPGA optimizations we have adopted to make it suitable for DUNE.
        Speaker: Akshay Malige
- 
        
            
                
        16:00
    
    
            
        
        Benchmarking and Interpreting Real-Time Quench Detection Algorithms 5m
        Detecting quenches in superconducting (SC) magnets by non-invasive means is a challenging real-time process that involves capturing and sorting through physical events that occur at different frequencies and appear as various signal features. These events may be correlated across instrumentation type, thermal cycle, and ramp. Together, these events build a more complete picture of continuous processes occurring in the magnet, and may allow us to flag potential precursors for quench detection. We build upon our existing work on autoencoders for acoustic sensors and quench antenna (QA) by comparing autoencoder reconstruction loss under various algorithm training conditions to event distributions generated by an event detection framework we have developed. We also highlight our work on integrating QA and acoustic data streams to develop a unified dynamic quench detection algorithm for multi-modal data. All algorithms are evaluated in a simulated real-time environment, where instrumentation data is continuously streamed into the autoencoder. This allows us to gain a more concrete understanding of the performance of our algorithms relative to physical events occurring in the magnet, and also provides a baseline software tool to generically evaluate autoencoders relative to their capture of quench precursors for SC magnets.
        Speaker: Maira Khan (Fermi National Accelerator Laboratory)
- 
        
            
                
        16:05
    
    
            
        
        Real-time Reinforcement Learning on AI Engines with Online Training for Autonomous Accelerators 5m
        Reinforcement Learning (RL) is a promising approach for the autonomous AI-based control of particle accelerators. Real-time requirements for these algorithms often cannot be satisfied with conventional hardware platforms.
        In this contribution, the unique KINGFISHER platform being developed at KIT will be presented. Based on the novel AMD-Xilinx Versal platform, this system provides cutting-edge general microsecond-latency RL agents, specifically designed to perform online training in a live environment.
        The successful application of this system to dampen horizontal betatron oscillations at the KArlsruhe Research Accelerator (KARA) will be discussed. Additionally, preliminary results of the application of the system to the highly non-linear problem of controlling microbunching instabilities will be presented.
        Special focus will be given to the implementation of low-latency neural networks targeting the AI Engines of Versal, and the necessary integration into the accelerator control system.
        Speaker: Luca Scomparin
- 
        
            
                
        16:10
    
    
            
        
        AI Red Teaming for Science 5m
        AI Red Teaming, an offshoot of traditional cybersecurity practices, has emerged as a critical tool for ensuring the integrity of AI systems. An underexplored area has been the application of AI Red Teaming methodologies to scientific applications, which increasingly use machine learning models in workflows. I'll explain why this is important and how AI Red Teaming can expose vulnerabilities unique to ML-based systems used in scientific research. This approach not only protects against malicious actors but enhances the routine functioning of AI systems in scientific research. I will also briefly introduce FABRIC, an NSF testbed for optimizing science cyberinfrastructure, and show how it might be used for AI Red Teaming.
        Speaker: Anita Nikolich (UIUC)
- 
        
            
                
        16:15
    
    
            
        
        An Efficient Multiply Accumulate Tree for Real-time Quantized Neural Networks 5m
        Neural networks with a latency requirement on the order of microseconds, like the ones used at the CERN Large Hadron Collider, are typically deployed on FPGAs fully unrolled. A bottleneck for the deployment of such neural networks is area utilization, which is directly related to the number of Multiply Accumulate (MAC) operations in matrix-vector multiplications. In this work, we present the Multiply Accumulate Tree (MAC tree), an algorithm that optimizes the area usage of fully parallel vector-dot products on chips by exploiting self-similar patterns in the network's weights. We implement the algorithm with the hls4ml library, a FOSS library for running real-time neural network inference on FPGAs, and compare the resource usage and latency with the original hls4ml implementation on different networks. The results show that the proposed MAC tree can achieve a reduction of LUT utilization by up to 50% in realistic quantized neural networks, while reducing the latency by up to a factor of a few. Furthermore, the proposed MAC tree provides an accurate estimation of the post-P&R resource utilization (error within ~10%) and reasonably good latency estimation, which can be used during the design phase to optimize the neural networks.
        Speaker: Chang Sun (California Institute of Technology (US))
- 
        
            
                
        16:20
    
    
            
        
        A gradient-based hardware-aware neural architecture search framework for hls4ml 5m
        In software-hardware co-design, balancing performance with hardware constraints is critical, especially when using FPGAs for high-energy physics (HEP) applications with hls4ml. Limited resources and stringent latency requirements exacerbate this challenge. Existing frameworks such as AutoQKeras use Bayesian optimization to balance model size/energy and accuracy, but they are time-consuming, rely on early-stage training that can lead to inaccurate configuration evaluations, and often require significant trial and error. In addition, these metrics often do not reflect actual hardware usage. In this work, we present a gradient-based Neural Architecture Search (NAS) framework tailored for hardware-aware optimization within the hls4ml workflow. Our approach incorporates practical hardware resource metrics into the search process and dynamically adapts to different HLS designs, tool versions, and FPGA devices. Unlike AutoQKeras, our design is fully trained during the search process, requiring only minimal fine-tuning afterwards. This framework allows users to efficiently explore trade-offs between model performance and hardware usage for their specific tasks in a single shot. Key contributions include: (1) a user-friendly interface for easy customization of the search space; (2) deep integration with hls4ml, allowing users to define and experiment with their own HLS synthesis configurations for FPGA; and (3) flexibility, allowing users to define custom hardware metrics for optimization, such as combinations of multiple FPGA resources. We demonstrate the effectiveness of our approach using a 1.8M parameter convolutional neural network for an energy reconstruction task in calorimeters.
        Compared to the baseline model, the searched model achieved a 48.01% reduction in parameter count, as well as reductions in LUT usage of 29.73%, FF of 31.62%, BRAM of 16.06%, and DSP of 23.92%, with only a 0.84% degradation in MAE. The entire search process took approximately 2 GPU hours, demonstrating the potential of our framework to accelerate FPGA deployment in resource-constrained environments. Furthermore, this method can be extended beyond HEP to enable more efficient and scalable FPGA deployments in various fields, such as edge computing and autonomous systems.
        Speaker: ChiJui Chen
 
- 
        
            
                
        16:25
    
    
        →
        
            17:00
        
    
            
        
        Coffee/Posters
        Stewart Center 302 (Third floor), Purdue University, 128 Memorial Mall, West Lafayette, IN 47907
        Information about posters: The display area of each board is 69” wide x 46” tall. Portrait and landscape both work, although portrait would probably work best. The rolling white boards have magnets on them that you will use to attach the posters.
- 
        
            
                
        17:00
    
    
        →
        
            18:25
        
    
            
        
        Contributed talks: Chair: Dr. Jan Schulte
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        13 mins + 2 mins Q&A
- 
        
            
                
        17:00
    
    
            
        
        [Remote] Visualizing Loss Landscapes for Scientific Edge Machine Learning 15m
        Characterizing the loss of a neural network can provide insights into local structure (e.g., smoothness of the so-called loss landscape) and global properties of the underlying model (e.g., generalization performance). Inspired by powerful tools from topological data analysis (TDA) for summarizing high-dimensional data, we are developing tools for characterizing the underlying shape (or topology) of loss landscapes. We are now exploring real-time scientific edge machine learning applications (e.g., high energy physics, microscopy) and using our tools to help design models and understand their robustness (e.g., how to quantify and visualize model diversity, how noise or quantization changes the loss landscape). In this talk, I will focus on two of our recent collaborations. First, we evaluate how LogicNets ensembles perform on scientific machine learning tasks like the data compression task at the CERN Large Hadron Collider (LHC) Compact Muon Solenoid (CMS) experiment. By quantifying and visualizing the diversity of LogicNets ensembles, we hope to understand when ensembling can improve performance and how to decide which models to include in an ensemble. Second, we look at new physics-constrained neural network architectures designed for the rapid fitting of force microscopy data. We visualize loss landscapes and their topology, observing sharp valleys in the loss landscapes of successfully trained models, likely reflecting the physical constraints. In contrast, we observe flatter but shallower basins in the loss landscapes of lower-performing models, suggesting that training may be difficult and can fail to find a physically reasonable solution in some cases. These results highlight potential failure modes similar to those observed for other physically constrained architectures.
        Speaker: Caleb Geniesse (Lawrence Berkeley National Laboratory)
- 
        
            
                
        17:15
    
    
            
        
        GWAK: Low-Latency Machine Learning for Real-Time Detection of Unmodeled Gravitational Wave Transients 15m
        Matched-filtering detection techniques for gravitational-wave (GW) signals in ground-based interferometers rely on having well-modeled templates of the GW emission. Such techniques have traditionally been used in searches for compact binary coalescences (CBCs) and have been employed in all known GW detections so far. However, interesting science cases aside from compact mergers do not yet have accurate enough modeling to make matched filtering possible, including core-collapse supernovae and sources where stochasticity may be involved. Therefore, the development of techniques to identify sources of these types is of significant interest. We present a method of anomaly detection based on deep recurrent autoencoders to extend the search to unmodeled transients. Our approach, which we name “Gravitational Wave Anomalous Knowledge” (GWAK), employs a semi-supervised strategy designed for low-latency machine learning applications. GWAK runs in real time, offering a faster alternative to matched filtering and other classical burst identification techniques by leveraging the efficiency of deep learning algorithms. While the semi-supervised nature of the problem comes with a cost in terms of accuracy compared to supervised techniques, there is a qualitative advantage in generalizing experimental sensitivity beyond pre-computed signal templates. We construct a low-dimensional embedded space using the GWAK method, capturing the physical signatures of distinct signals on each axis of the space. By introducing signal priors that capture some of the salient features of GW signals, we allow for the recovery of sensitivity even when an unmodeled anomaly is encountered. We then show the newly public results for the GWAK algorithm on the third LIGO observing run (O3), including events not identified by any previous burst-detection algorithms.
        Speaker: Eric Anton Moreno (Massachusetts Institute of Technology (US))
- 
        
            
                
        17:30
    
    
            
        
        Realtime Anomaly Detection in the CMS Experiment 15m
        We present the development, deployment, and initial recorded data of an unsupervised autoencoder trained for unbiased detection of new physics signatures in the CMS experiment during LHC Run 3. The Global Trigger makes the final hardware decision to read out or discard data from each LHC collision, which occur at a rate of 40 MHz, within nanosecond latency constraints. The anomaly detection algorithm, AXOL1TL, makes a prediction for each event within these constraints, selecting anomalous events for further analysis. The implementation occupies a small percentage of the resources of the system's Virtex 7 FPGA, fitting seamlessly into the existing trigger logic. AXOL1TL was integrated into the Level-1 Trigger menu in May 2024, with bandwidth allocated primarily in the High-Level Trigger scouting data streams. We describe the methodology to achieve ultra-low-latency anomaly detection and show the integration of the algorithm into the trigger system, as well as the monitoring and validation of the algorithm required to commission the trigger for data-taking. Finally, we present the first data recorded in 2024 by the anomaly detection trigger.
        Speaker: Noah Alexander Zipper (University of Colorado Boulder (US))
- 
        
            
                
        17:45
    
    
            
        
        Active Machine Learning for Projection Multi-photon 3D Printing 15m
        The rapidly developing frontiers of additive manufacturing, especially multi-photon lithography, create a constant need for optimization of new process parameters. Multi-photon lithography is a 3D printing technique which uses the nonlinear absorption of two or more photons from a high intensity light source to induce highly confined polymerization. The process can 3D print structures with submicron features. However, the serial scanning nature of the typical process is slow. The recently developed projection multi-photon lithography process used in this work has presented a way to increase throughput by several orders of magnitude. Yet, like all additive manufacturing techniques, the process can require time-consuming experimentation and costly measurement techniques to determine optimal process parameters. In this work, an active machine learning based framework is presented for quickly and inexpensively determining optimal process parameters for the projection 3D printing process. The framework uses Bayesian optimization to guide experimentation for collection of the most informative data for training of a Gaussian process regression machine learning model. This model serves as a surrogate for the projection multi-photon lithography manufacturing process by predicting optimal patterns for achieving a target geometry. Three primitive 2D shapes at three different scales are used as test cases for this framework. In each case, the active learning framework improves the geometric accuracy, reducing the geometric error to within measurement accuracy in just four iterations (five experiments) of the Bayesian optimization, with each case requiring the collection of only a few hundred training data points.
        Speaker: Mr Jason Edward Johnson (Purdue University)
- 
        
            
                
        18:00
    
    
            
        
        [Remote] Rapid, High-Resolution Coherent Diffractive Imaging with Physics-Informed Machine Learning 15m
        Coherent diffractive imaging (CDI) techniques like ptychography enable nanoscale imaging, bypassing the resolution limits of lenses. Yet, the need for time-consuming iterative phase recovery hampers real-time imaging. While supervised deep learning strategies have increased reconstruction speed, they sacrifice image quality. Furthermore, these methods’ demand for extensive labeled training data is experimentally burdensome. Here, we propose an unsupervised physics-informed neural network reconstruction method, PtychoPINN, that retains the factor of 100-to-1000 speedup of deep learning-based reconstruction while improving reconstruction quality by combining the diffraction forward map with real-space constraints from overlapping measurements. In particular, PtychoPINN gains a factor of 4 in linear resolution and an 8 dB improvement in PSNR while also accruing improvements in generalizability and robustness. We validate PtychoPINN's performance on a range of datasets, spanning simulated objects and experimental measurements from the XPP endstation at LCLS, in both ptychographic and CDI modalities. The framework's novel combination of speed and accuracy offers new possibilities for real-time nanoscale imaging at x-ray light sources and beyond.
        Speaker: Oliver Hoidn
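For readers less familiar with PSNR, the 8 dB figure quoted in the abstract above can be translated into a mean-squared-error ratio. The sketch below assumes the standard definition PSNR = 10 log10(MAX^2 / MSE) and an unchanged peak value; the function names are illustrative and are not part of PtychoPINN.

```python
import math


def psnr_db(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE shrinks when PSNR rises by gain_db (same peak value)."""
    return 10.0 ** (gain_db / 10.0)


# An 8 dB PSNR gain corresponds to roughly a 6.3x lower mean squared error.
print(round(mse_ratio_for_gain(8.0), 2))  # 6.31
```

Under this definition, a gain expressed in dB depends only on the ratio of the two errors, not on the image peak value.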
 
- 
        
            
        18:30
    
    
        →
        
            22:00
        
    
        
        Reception 3h 30m
        Union Club Hotel, North Ballroom, 1st Floor, 201 Grant St, West Lafayette, IN 47906
 
- 
        
            
        08:15
    
    
        →
        
            09:00
        
    
        
        
- 
                    
                        
                            
                        
                    
                    - 
        
            
                
        09:00
    
    
        →
        
            10:45
        
    
            
        
        Invited talks: Chair: Prof. Maria C. Dadarlat
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
- 
        
            
                
        09:00
    
    
            
        
        (fast) Machine Learning Applications for Neuroscience 35m
        Speaker: Jai Yu (U Chicago)
- 
        
            
                
        09:35
    
    
            
        
        [Remote] Efficient Deep Learning with Sparsity 35m
        Speaker: Zhijian Liu (UCSD)
- 
        
            
                
        10:10
    
    
            
        
        AI’s Energy Challenge and four A’s to address it 35m
        Speaker: Anand Raghunathan (Purdue University)
 
- 
        
            
                
        10:45
    
    
        →
        
            11:10
        
    
            
        
        Coffee/Posters
        Stewart Center 302 (Third floor), Purdue University, 128 Memorial Mall, West Lafayette, IN 47907
        Information about posters: The display area of each board is 69” wide x 46” tall. Portrait and landscape both work, although portrait would probably work best. The rolling white boards have magnets on them that you will use to attach the posters.
- 
        
            
                
        11:10
    
    
        →
        
            12:20
        
    
            
        
        Invited talks: Chair: Prof. Yong Chen
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        Invited talks: 30 mins + 5 mins Q&A
- 
        
            
                
        11:10
    
    
            
        
        Deep Learning Complexity in Neuromorphic Quantum Materials 35m
        Spatially resolved surface probes have recently revealed rich electronic textures at the nanoscale and mesoscale in many quantum materials. Rather than transitioning from insulator to metal all at once, VO2 forms an intricate network of metallic puddles that extend like filigree over a wide range of temperatures. We developed a convolutional neural network to harvest information from both optical microscope and scanning near field optical images of the metallic filigree. The neural network was able to identify the factors that cause electrons to clump during the transition, such as interactions with defects in the material and the strength of the electron-electron interactions. This reveals that the intricate patterns share universal features with domain structures in magnets, stripe orientation fractals in superconductors, and antiferromagnetic domains in rare earth nickelates, pointing to a universal origin of electron clumping in quantum materials. This identification opens the door to using hysteresis effects to sculpt the filigree, in order to improve the function of VO2 in novel electronic applications such as neuromorphic devices. [Phys. Rev. B 107, 205121 (2023); Nature Commun. 14, 2622 (2023); Nature Commun. 10, 4568 (2019)]
        Speaker: Erica Carlson
- 
        
            
                
        11:45
    
    
            
        
        [Remote] P-bits (quantum-inspired probabilistic computing) 35m
        Speaker: Supriyo Datta (Purdue University)
 
- 
        
            
                
        12:20
    
    
        →
        
            12:45
        
    
            
        
        Contributed talks: NSF HDR ML challenges
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        13 mins + 2 mins Q&A
- 
        
            
        12:45
    
    
        →
        
            14:00
        
    
        
        Lunch 1h 15m
- 
        
            
                
        14:00
    
    
        →
        
            15:10
        
    
            
        
        Invited talks: Chair: Prof. Josh Agar
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        Invited talks: 30 mins + 5 mins Q&A
- 
        
            
                
        14:00
    
    
            
        
        Tools, Methodologies, and Co-design Principles for Building Microelectronics Artifacts for ML 35m
        Speaker: Seda Ogrenci (Northwestern University)
- 
        
            
                
        14:35
    
    
            
        
        [Remote] ML for material science 35m
        Speaker: Sergei Kalinin
 
- 
        
            
                
        15:10
    
    
        →
        
            16:15
        
    
            
        
        Lightning talks: Chair: Prof. Laimei Nie
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
- 
        
            
                
        15:10
    
    
            
        
        Model-Independent Real-Time Anomaly Detection at CMS with CICADA 5m
        In the search for new physics, real-time detection of anomalous events is critical for maximizing the discovery potential of the LHC. CICADA (Calorimeter Image Convolutional Anomaly Detection Algorithm) is a novel CMS trigger algorithm operating at the 40 MHz collision rate. By leveraging unsupervised deep learning techniques, CICADA aims to enable physics-model independent trigger decisions, enhancing sensitivity to unanticipated signals. One of the key challenges is deploying such a system on resource-constrained hardware without compromising performance. This is addressed by using knowledge distillation to replicate the performance of larger unsupervised anomaly detection models in smaller supervised models that maintain high detection sensitivity while significantly reducing the memory footprint and computational demands. The final compressed model is deployed on FPGAs, allowing CICADA to perform real-time decision-making while operating within the stringent constraints of the CMS trigger system during Run 3 data taking. In this talk, we will detail the architecture of CICADA, describe the knowledge distillation process, and evaluate its performance.
        Speaker: Lino Oscar Gerlach (Princeton University (US))
- 
        
            
                
        15:15
    
    
            
        
        Unsupervised Learning Methods of Real-Time Anomaly Detection for Data Selection and Detector Monitoring in Liquid Argon Time Projection Chambers 5m
        Unsupervised learning algorithms enable insights from large, unlabeled datasets, allowing for feature extraction and anomaly detection that can reveal latent patterns and relationships often not found by supervised or classical algorithms. Modern particle detectors, including liquid argon time projection chambers (LArTPCs), collect a vast amount of data, making it impractical to save everything for offline analysis. As a result, these experiments need to employ real-time analysis techniques during data acquisition. In this talk, I will present developments in building real-time, intelligent computer vision programs with unsupervised learning, both for selection of “rare signals” in the data and for detector monitoring applications in LArTPCs.
        Speaker: Jack Henry Cleeve (Columbia University)
- 
        
            
                
        15:20
    
    
            
        
        An open platform for in-situ high-speed computer vision with hls4ml 5m
        Low latency machine learning inference is vital for many high-speed imaging applications across various scientific domains. From analyzing fusion plasma [1] to rapid cell-sorting [2], there is a need for in-situ fast inference in experiments operating in the kHz to MHz range. External PCIe accelerators are often unsuitable for these experiments due to the associated data transfer overhead, high inference latencies, and increased system complexity. Thus, we have developed a framework to streamline the process of deploying standard streaming hls4ml neural networks and integrating them into existing data readout paths and hardware in these applications [3]. This will enable a wide range of high-speed intelligent imaging applications with off-the-shelf hardware. Typically, dedicated PCIe machine vision devices, so-called frame grabbers, are paired with high-speed cameras to handle high throughputs, and a protocol such as CoaXPress is used to transmit the raw camera data between the systems over fiber or copper. Many frame grabbers implement this protocol, as well as additional pixel preprocessing stages, on an FPGA because of the flexibility and relatively low cost of FPGAs compared to ASICs. Some manufacturers, such as Euresys, have enabled easy access to their frame grabber FPGAs’ firmware reference design. This reference design, aptly named CustomLogic [4], allows the user to implement custom image processing functions on the available portion of the FPGA. Moreover, open-source co-design workflows like hls4ml enable easy translation and deployment of neural networks to FPGA devices, and have demonstrated latencies on the order of nanoseconds to microseconds [5]. Successful applications using a variety of FPGA accelerators have been demonstrated in many domains including particle physics and materials science.
        We provide the necessary wrappers, support files, and instructions to integrate an hls4ml model onto a frame grabber device with a few lines of code. We will present two comprehensive tutorials in collaboration with Euresys to demonstrate the full quantization-aware training-to-deployment and benchmarking process, in addition to hls4ml’s advanced feature set. We will also discuss and explore existing and potential applications. This work ultimately provides a convenient framework for performing in-situ inference on frame grabbers for high-speed imaging applications.
        References
        [1] Wei, Y., Forelli, R. F., Hansen, C., Levesque, J. P., Tran, N., Agar, J. C., Di Guglielmo, G., Mauel, M. E., Navratil, G. A., “Low latency optical-based mode tracking with machine learning deployed on FPGAs on a tokamak,” Review of Scientific Instruments 95(7), 073509 (2024), https://doi.org/10.1063/5.0190354.
        [2] Nitta, N., Sugimura, T., Isozaki, A., Mikami, H., Hiraki, K., Sakuma, S., et al., “Intelligent Image-Activated Cell Sorting,” Cell 175(1), 266-276.e13 (2018), https://doi.org/10.1016/j.cell.2018.08.028.
        [3] hls4ml-frame-grabbers, GitHub repository, https://github.com/fastmachinelearning/hls4ml-frame-grabbers.
        [4] Euresys, “CustomLogic,” https://www.euresys.com/en/CustomLogic, Euresys S.A., Seraing, Belgium (2021).
        [5] Duarte, J., Han, S., Harris, P., Jindariani, S., Kreinar, E., Kreis, B., Ngadiuba, J., Pierini, M., Rivera, R., Tran, N., Wu, Z., “Fast inference of deep neural networks in FPGAs for particle physics,” J. Instrum. 13, P07027 (2018), https://doi.org/10.1088/1748-0221/13/07/P07027.
        Speaker: Ryan Forelli (Northwestern University)
- 
        
            
                
        15:25
    
    
            
        
        PearNets for Pearson Correlated Latent Optimization of Nanophotonic Devices 5m
        Recent advancements in generative artificial intelligence (AI), including transformers, adversarial networks, and diffusion models, have demonstrated significant potential across various fields, from creative art to drug discovery. Leveraging these models in engineering applications, particularly in nanophotonics, is an emerging frontier. Nanophotonic metasurfaces, which manipulate light at the subwavelength scale, require highly optimized meta-atom designs. Traditionally, optimizing such designs relied on computationally expensive, gradient-based methods, navigating exponentially large design spaces. In this work, we propose a novel machine learning-driven latent optimization approach, which improves the surrogate function correlation of latent optimization methods by enforcing a Pearson correlation through the usage of PearNets. Utilizing variational neural annealing, this technique effectively samples design candidates, achieving thermophotovoltaic efficiencies of up to 96.7%. Our method presents a scalable alternative for the design and optimization of nanophotonic devices, offering both reduction in computational complexity and improvements in accuracy in topological optimization.
        Speaker: Michael Tan Bezick
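As a reminder of the quantity the abstract above refers to, the following is a minimal sketch of the Pearson correlation coefficient between a surrogate score and a measured figure of merit. It is not the PearNets training objective; the variable names and the toy numbers are hypothetical and chosen only to make the example runnable.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: surrogate scores for four candidate designs and
# the corresponding simulated efficiencies (illustrative values only).
surrogate_scores = [0.1, 0.4, 0.5, 0.9]
simulated_efficiency = [0.2, 0.5, 0.6, 1.0]
r = pearson_r(surrogate_scores, simulated_efficiency)
```

A coefficient close to 1 means the surrogate ranks and spaces candidates in the same way as the true figure of merit, which is the property a latent optimizer relies on.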
- 
        
            
                
        15:30
    
    
            
        
        EnsembleLUT: Scaling up LUT-based Neural Networks with Ensemble Learning 5m
        Applications like high-energy physics and cybersecurity require extremely high throughput and low latency neural network (NN) inference. Lookup-table-based NNs address these constraints by implementing NNs purely as lookup tables (LUTs), achieving inference latency on the order of nanoseconds. Since LUTs are a fundamental FPGA building block, LUT-based NNs map to FPGAs easily. LogicNets (and its successors) form one such class of LUT-based NNs that target FPGAs, mapping neurons directly to LUTs to meet the low latency constraints with minimal resources. However, it is difficult to implement larger, more performant LUT-based NNs like LogicNets because LUT usage increases exponentially with respect to neuron fan-in (i.e., number of synapses $\times$ synapse bitwidth). A large LUT-based NN therefore quickly runs out of LUTs on an FPGA. Our work EnsembleLUT addresses this issue by creating ensembles of smaller LUT-based NNs that scale linearly with respect to the number of models, achieving higher accuracy within the resource constraints of an FPGA. We demonstrate that EnsembleLUT improves the scalability of LUT-based NNs on various scientific machine learning benchmarks such as jet substructure classification and high-granularity endcap calorimeter data compression found at the LHC CMS experiment, reaching higher accuracy with fewer resources than the largest LogicNets.
        Speaker: Olivia Weng
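The scaling argument in the abstract above can be illustrated with a back-of-envelope sketch. It assumes, as a simplification, that each neuron is realized as one truth table with an entry for every input bit pattern, so a neuron with F synapses of b bits each needs 2^(F*b) entries per output bit. The function names and example sizes are hypothetical and are not taken from EnsembleLUT or LogicNets.

```python
def truth_table_entries(num_synapses: int, bitwidth: int) -> int:
    """Entries in a truth table addressed by num_synapses * bitwidth input bits."""
    return 2 ** (num_synapses * bitwidth)


def ensemble_entries(num_models: int, num_synapses: int, bitwidth: int) -> int:
    """Total entries for an ensemble of identical small neurons (linear in num_models)."""
    return num_models * truth_table_entries(num_synapses, bitwidth)


# One neuron with fan-in 6 at 2 bits per synapse: 2**12 entries.
single_large = truth_table_entries(num_synapses=6, bitwidth=2)

# Four neurons with fan-in 3 at 2 bits per synapse: 4 * 2**6 entries.
ensemble_small = ensemble_entries(num_models=4, num_synapses=3, bitwidth=2)

print(single_large, ensemble_small)  # 4096 256
```

Doubling the fan-in squares the table size, whereas doubling the number of ensemble members only doubles it, which is the exponential versus linear contrast the abstract describes.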
- 
        
            
                
        15:35
    
    
            
        
        An Efficient and Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks 5m
        Recent advancements in Vision-Language Models (VLMs) have enabled complex multimodal tasks by processing text and image data simultaneously, significantly enhancing the field of artificial intelligence. However, these models often exhibit biases that can skew outputs towards societal stereotypes, thus necessitating debiasing strategies. Existing debiasing methods focus narrowly on specific modalities or tasks and require extensive retraining, which adds to computational cost. To address these limitations, this paper introduces Selective Feature Imputation for Debiasing (SFID), a novel methodology that integrates feature pruning and low-confidence imputation (LCI) to effectively reduce biases in VLMs. SFID not only eliminates the need for retraining but also ensures that debiasing is achieved without increasing computational cost during inference, preserving efficiency throughout. Our experimental results demonstrate SFID's effectiveness across various VLM tasks, including zero-shot classification, text-to-image retrieval, image captioning, and text-to-image generation, significantly reducing gender biases without compromising performance. This approach enhances the fairness of VLM applications while maintaining computational efficiency across diverse scenarios.
        Speaker: Hoin Jung (Purdue University)
- 
        
            
                
        15:40
    
    
            
        
        wa-hls4ml: A benchmark and dataset for ML accelerator resource estimation 5m
        As machine learning (ML) increasingly serves as a tool for addressing real-time challenges in scientific applications, the development of advanced tooling has significantly reduced the time required to iterate on various designs. Despite these advancements in areas that once posed major obstacles, newer challenges have emerged. For example, processes that were not previously considered bottlenecks, such as model synthesis, are now becoming limiting factors in the rapid iteration of designs. In an attempt to reduce these emerging constraints, multiple efforts have been launched towards designing an ML-based surrogate model for resource estimation of synthesized accelerator architectures, which would help reduce the iteration time when attempting to design a solution within a set of given hardware constraints. This approach shows considerable potential, but as it stands, the effort is early and would benefit from coordination and standardization to assist future works as they emerge. In this work, we introduce wa-hls4ml, an ML accelerator resource estimation benchmark and corresponding dataset of 100,000+ synthesized dense neural networks. In addition to the resource utilization data provided in our dataset, we also offer the generated artifacts and logs for many of the synthesized neural networks with the intention to support future research in ML-based code generation. This benchmark evaluates performance against multiple common ML model architectures, primarily originating from scientific domains. The selected models are implemented through hls4ml on Xilinx FPGAs. We measure the performance of a given model through multiple metrics, including $R^2$ Score and SMAPE on regression tasks, as well as FLOPS and inference time to further characterize the estimator under test.
        Speaker: Ben Hawks (Fermi National Accelerator Lab)
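The two regression metrics named in the abstract above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: SMAPE has several variants in the literature, and the one below (mean of |prediction - truth| divided by the average of their magnitudes, in percent) is an assumption rather than the definition confirmed for wa-hls4ml.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error in percent (0 to 200).

    Assumed variant: |p - t| / ((|t| + |p|) / 2), with 0/0 counted as zero error.
    """
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2.0
        terms.append(0.0 if denom == 0 else abs(p - t) / denom)
    return 100.0 * sum(terms) / len(terms)
```

For example, a resource estimator that predicts 300 LUTs where synthesis reports 100 has a SMAPE of 100% under this variant, while a perfect estimator scores an R^2 of 1 and a SMAPE of 0.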
- 
        
            
                
        15:45
    
    
            
        
        Neural Architecture Codesign for Fast Physics Applications 5m
        We develop an automated pipeline to streamline neural architecture codesign for physics applications, to reduce the need for ML expertise when designing models for a novel task. Our method employs a two-stage neural architecture search (NAS) design to enhance these models, including hardware costs, leading to the discovery of more hardware-efficient neural architectures. The global search stage explores a wide range of architectures within a flexible and modular search space to identify promising candidate architectures. The local search stage further fine-tunes hyperparameters and applies compression techniques such as quantization-aware training (QAT) and network pruning. We synthesize the optimal models to high-level synthesis code for FPGA deployment with the hls4ml library. Additionally, our hierarchical search space provides greater flexibility in optimization, which can easily extend to other tasks and domains. We demonstrate this with two case studies: Bragg peak finding in materials science and jet classification in high energy physics.
        Speaker: Dmitri Demler
- 
        
            
                
        15:50
    
    
            
        
        Comprehensive Analysis of UNet Variants in Cardiac Image Segmentation 5m
        Deep learning, particularly employing the Unet architecture, has become pivotal in cardiology, facilitating detailed analysis of heart anatomy and function. The segmentation of cardiac images enables the quantification of essential parameters such as myocardial viability, ejection fraction, cardiac chamber volumes, and morphological features. These segmentation methods operate autonomously with minimal user intervention. Challenges arise in distinguishing the right ventricle from structures like the pulmonary artery, atrium, and aorta at their base, complicating accurate segmentation. To address these challenges, fully convolutional network models have been developed and implemented, optimizing learning parameters. Deep learning approaches for cardiac image segmentation demonstrate promising levels of accuracy. This study comprehensively assesses four variants of the Unet architecture (Attention-Unet, TransUnet, U2Net, and Unet++) specifically for the segmentation of cardiac MRI. The evaluation metrics include the Dice coefficient, IoU coefficient, accuracy, and loss. The analysis focuses on identifying architectural modifications and resource-efficient models that enhance performance. The findings contribute empirical evidence and credibility to inform future model selection for segmentation and analysis purposes.
        Speaker: Niharika Das (G H Raisoni University)
- 
        
            
                
        15:55
    
    
            
        
        Edge SpAIce: Enabling On-Board Data Compression With Machine Learning On FPGAs 5m
        The number of CubeSats launched for data-intensive applications is increasing due to the modularity and reduced cost these platforms provide. Consequently, there is a growing need for efficient data processing and compression. Tailoring onboard processing with Machine Learning to specific mission tasks can optimise downlink usage by focusing only on relevant data, ultimately reducing the required bandwidth. The Edge SpAIce project showcases onboard data filtering and reduction by using Machine Learning to identify plastic litter in the oceans. The deployment pipeline, including drastic model compression and deployment using the open-source hls4ml and QONNX tools, enables high-performance, low-power, low-cost computation on onboard FPGA processors. We present lab-based demonstration results, highlighting performance in terms of accuracy, throughput, and power consumption, and discuss planned deployment aspects.
        Speaker: Nicolò Ghielmetti (CERN)
 
- 
        
            
                
        16:15
    
    
        →
        
            16:45
        
    
            
        
        Coffee/Posters
        Stewart Center 302 (Third floor), Purdue University, 128 Memorial Mall, West Lafayette, IN 47907
        Information about posters: The dimensions of the display areas of the boards are 69” wide x 46” tall. Portrait and landscape both work. However, portrait would probably work best. The rolling white boards have magnets on them that you will use to attach the posters.
- 
        
            
                
        16:45
    
    
        →
        
            18:20
        
    
            
        
        Contributed talks: Chair: Dr. Nhan Tran
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        13 mins + 2 mins Q&A
- 
        
            
                
        16:45
    
    
            
        
        Fast Data, Faster Science: Connecting Instruments to Real-Time AI Compute 15m
        Modern scientific instruments generate vast amounts of data at increasingly higher rates, outpacing traditional data management strategies that rely on large-scale transfers to offline storage for post-analysis. To enable next-generation experiments, data processing must be performed at the edge, directly alongside the scientific instruments. By integrating these instruments with high-bandwidth, low-latency computational resources, real-time data insights can be harnessed to optimize data acquisition and experimental strategy, ultimately enabling higher-impact scientific discovery. In this talk, we will highlight our work in enabling real-time data processing for scientific instruments, focusing on how NVIDIA’s advancements facilitate AI-driven workflows at the edge. We will further discuss networking and software solutions that allow for high-throughput data streaming from front-end sensors to GPUs, significantly reducing latency and increasing bandwidth to meet the needs of next-generation scientific experimentation.
        Speaker: Denis Leshchev
- 
        
            
                
        17:00
    
    
            
        
        An end-to-end ML-enabled platform for precision neuroscience 15m
        In situ machine learning data processing for neuroscience probes can have wide-reaching applications from data filtering, event triggering, and ultimately real-time interventions at kilohertz frequencies intrinsic to natural systems. In this work, we present the integration of Machine Learning (ML) algorithms on an off-the-shelf neuroscience data acquisition platform by Spike Gadgets. The algorithms process data in situ on FPGAs in the head unit to extract phase information from neurological data of rodent brain signals to study the behavior of rats. Our goal is to obtain the analytic signal from the recorded EEG in real-time and estimate the phase angle of the analytic signal. We employ hls4ml to synthesize models integrated into the head unit hardware. The first stage of our work was synthesizing a dense MLP, after training it on the rat data, to extract the phase information and implementing it on the FPGA platform. To improve performance, we are further extending our algorithmic approach from simple MLP to use an FFT-based Hilbert Transform. Finally, we have created a more sophisticated model using Discrete Cosine Transforms that performed significantly better and produced more accurate results. Our work enables future exploration of optimized and hardware-efficient algorithms for in situ precision neuroscience.
        Speaker: Emadeldeen Hamdan (University of Illinois Chicago)
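Editorial illustration for the abstract above: the FFT-based Hilbert transform it mentions can be sketched in a few lines. The sketch computes the analytic signal of a real sequence and reads off its instantaneous phase. It is a pure-Python reference under stated assumptions, not the speakers' FPGA implementation: the function names, the naive O(N²) DFT standing in for an FFT, and the cosine test signal are all hypothetical choices made for clarity.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Return x + i*H[x], where H is the discrete Hilbert transform."""
    n = len(x)
    spectrum = dft(x)
    # Keep DC (and Nyquist for even n), double the positive frequencies,
    # and zero the negative frequencies.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        for k in range(1, n // 2):
            gain[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    return dft([s * g for s, g in zip(spectrum, gain)], inverse=True)


def instantaneous_phase(x):
    """Phase angle (radians, wrapped to (-pi, pi]) of the analytic signal."""
    return [cmath.phase(z) for z in analytic_signal(x)]


# Hypothetical test signal: 4 full cycles of a cosine over 64 samples.
N, CYCLES = 64, 4
signal = [math.cos(2 * math.pi * CYCLES * t / N) for t in range(N)]
phase = instantaneous_phase(signal)
```

For a pure cosine the recovered phase advances linearly with the sample index, and the real part of the analytic signal reproduces the input. A real-time system would process a sliding window rather than a whole recording.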
- 
        
            
                
        17:15
    
    
            
        
        Artificial Brains for Artificial Intelligence: A Novel Neurophysically Inspired Neural Network 15m
        Artificial neural networks (ANNs) are capable of complex feature extraction and classification with applications in robotics, natural language processing, and data science. Yet, many ANNs have several key limitations; notably, current neural network architectures require enormous training datasets and are computationally inefficient. It has been posited that biophysical computations in single neurons of the brain can inspire computationally more efficient algorithms and neural network architectures. Recently, research on dendrites, including work from our lab, suggests that each biological neuron is endowed with a hidden subcellular computational power more complex than the conventional perceptron, providing a promising substrate for more efficient ANNs.
        To this end, we propose the Interconnected Dendritic Network (IDN), a new type of ANN that takes close inspiration from cellular and subcellular networks of pyramidal neurons. Each neuron in an IDN has a set of dendrites that receive inputs; these dendrites are subdivided into branches of multiple orders to closely mimic dendritic organization in biological neurons. We further employ a family of physiologically-inspired activation functions to characterize neuronal input-output transformations. Instead of discrete layers, neurons are arranged in an n-dimensional space following topographical connectivity rules, forming a recurrent network. The network is composed of both excitatory and inhibitory neurons, which act in unison to regulate network activity. Learning happens by altering synaptic weights based on Hebbian plasticity, approximating synaptic weight distributions observed in biological networks. The IDN reached over 95% accuracy on written digit classification after training on only 400 data points and utilizing 3.6% of computational operations compared to a traditional dense model performing the same task with similar performance. Additionally, we show that the IDN is capable of predicting the motion of simulated physical systems in a reservoir-computing paradigm. Thus, we present a model that performs in low-data applications, is computationally efficient, and utilizes biophysically grounded mechanisms to mitigate the limitations of current ANNs.
        Speaker: Lorenzo Cacciapuoti
- 
        
            
                
        17:30
    
    
            
        
        Smart Pixels: Towards radiation hard ASIC with on-chip machine learning in 28nm CMOS 15m
        We introduce a smart pixel prototype readout integrated circuit (ROIC) fabricated using a 28 nm bulk CMOS process, which integrates a machine learning (ML) algorithm for data filtering directly within the pixel region. This prototype serves as a proof-of-concept for a potential Phase III pixel detector upgrade of the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC). This chip, the second in a series of ROICs, employs a fully connected two-layer neural network (NN) to process data from a cluster of 256 pixels, identifying patterns corresponding to high-momentum particle tracks for selection and readout. The digital NN is embedded between the analog processing regions of the 256 pixels, maintaining the original pixel size. Its fully combinatorial digital logic circuit implementation minimizes power consumption, avoids clock distribution, and activates only upon receiving an input signal. The NN performs momentum classification based on cluster patterns, achieving a data rejection rate of 54.4% to 75.4% with a modest momentum threshold, opening up the possibility of using pixel information at 40 MHz for trigger purposes. The NN itself consumes around 300 µW. The overall power consumption per pixel, including analog and digital functions, is 6 µW, resulting in approximately 1 W/cm², within the permissible limits of the HL-LHC experiments. This presentation will showcase the preliminary testing results using Spacely, an open-source framework for post-silicon validation of analog, digital, and mixed-signal ASICs. Spacely maximizes hardware and software reuse, streamlining the testing process for small ASIC design teams in academia and research institutions.
        Speaker: Ms Jieun Yoo (UIC)
- 
        
            
                
        17:45
    
    
            
        
        Bit-Width Optimization of Power-Efficient Hardware Accelerators for Neural Networks using Catapult AI NN 15m
        The application of neural networks (NNs) has expanded across industries (e.g., autonomous vehicles, manufacturing, and natural-language processing) owing to their improved accuracy. This was made possible by the increased complexity of these networks, which requires more computation and memory. As a result, there is more demand for specialized NN hardware accelerators that can be used for efficient inference tasks, especially on resource-constrained edge devices (e.g., wearable devices). NNs are typically modeled and trained using high-level languages like Python. To implement NNs in hardware platforms such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), the Python-level code needs to be translated into register-transfer level (RTL) designs. This code transformation requires significant human effort and strong hardware expertise, which can be challenging, especially when the NN architecture is not fixed. Catapult AI NN, an extension to the HLS4ML open-source project developed by Fermilab, converts the NN Python code into synthesizable C++, from which Catapult’s downstream high-level synthesis flow produces the RTL code. Compared to other backends of HLS4ML (e.g., Vivado), Catapult allows designers to target ASIC platforms. By automating this transformation, designers can save time and focus on tuning hardware-related parameters (such as the amount of parallelism) to explore the design space and obtain the best design for power, performance and area (PPA) in a shorter time. We present a case study using Catapult AI NN to synthesize the design.
        When converting the floating-point data types into bit-level fixed-point representation through Quantization-Aware Training, value range analysis is performed to validate that optimal bit-widths are chosen and no overflow or saturation errors are present.
        Speaker: Marzieh Vaez Torshizi (Siemens EDA)
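Editorial illustration for the abstract above: a value range analysis can be sketched as a check of observed values against the representable range of a signed fixed-point type. The sketch assumes the common convention in which a type has a total width and an integer width that includes the sign bit. The function names and sample activations are hypothetical, and this is not the Catapult AI NN flow itself; it checks range only, not quantization error.

```python
def fixed_range(total_bits, int_bits):
    """Representable range of a signed fixed-point type with `int_bits`
    integer bits (sign included) and total_bits - int_bits fractional bits."""
    step = 2.0 ** -(total_bits - int_bits)
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - step
    return lo, hi


def min_int_bits(values, total_bits):
    """Smallest integer-bit count (sign included) whose range covers `values`."""
    for int_bits in range(1, total_bits + 1):
        lo, hi = fixed_range(total_bits, int_bits)
        if lo <= min(values) and max(values) <= hi:
            return int_bits
    return None  # no split of total_bits covers the observed range


def overflows(values, total_bits, int_bits):
    """Values that would overflow or saturate in the given format."""
    lo, hi = fixed_range(total_bits, int_bits)
    return [v for v in values if v < lo or v > hi]


# Hypothetical activations collected from a calibration run.
activations = [-3.2, 0.5, 7.9, 1.25]
needed = min_int_bits(activations, total_bits=8)
clipped = overflows(activations, total_bits=8, int_bits=3)
```

With 8 total bits, these sample values need 4 integer bits; with only 3 integer bits the value 7.9 falls outside the representable range and would saturate.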
- 
        
            
                
        18:00
    
    
            
        
        End-to-end workflow for ML-based qubit readout with QICK + hls4ml 15m
        High-fidelity single-shot quantum state readout is crucial for advancing quantum technology. Machine-learning (ML) assisted qubit-state discriminators have shown high readout fidelity and strong resistance to crosstalk. By directly integrating these ML models into FPGA-based control hardware, fast feedback control becomes feasible, which is vital for quantum error correction and other applications. Here, we developed an end-to-end workflow for real-time ML-based qubit readout by integrating a neural network designed through hls4ml into the Quantum Instrumentation Control Kit (QICK). In our recent experimental test of single transmon qubit readout, we achieved single-shot readout fidelity of 92% in 1.3 µs readout time with an inference latency of less than 50 ns and resource usage of approximately 10% LUTs and 2% FFs for the FPGA RFSoC that hosts the QICK system. Our work can also serve as guidance for others to use these tools for their own research.
        Speaker: Botao Du (Purdue University)
 
                    - 
        
            
                
        09:00
    
    
        →
        
            10:45
        
    
            
        
        Invited talks: Chair: Prof. Seda Ogrenci
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        Invited talks 30 mins + 5 mins Q&A
- 09:00
- 
        
            
                
        09:35
    
    
            
        
        LLMs for chip design 35m
        Speaker: Siddharth Garg
- 10:10
 
- 
        
            
                
        10:45
    
    
        →
        
            11:10
        
    
            
        
        Coffee/Posters: Conference photo then coffee!
        Stewart Center 302 (Third floor), Purdue University, 128 Memorial Mall, West Lafayette, IN 47907
        Information about posters: The dimensions of the display areas of the boards are 69” wide x 46” tall. Portrait and landscape both work. However, portrait would probably work best. The rolling white boards have magnets on them that you will use to attach the posters.
- 
        
            
                
        11:10
    
    
        →
        
            12:20
        
    
            
        
        Invited talks: Chair: Prof. Wei Xie
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        Invited talks 30 mins + 5 mins Q&A
- 
        
            
                
        11:10
    
    
            
        
        AI and ML at the future Electron Ion Collider 35m
        The Electron Ion Collider (EIC) promises unprecedented insights into nuclear matter and quark-gluon interactions, with advances in artificial intelligence (AI) and machine learning (ML) playing a crucial role in unlocking its full potential. This talk will explore potential opportunities for AI/ML integration within the EIC program, drawn from broader discussions in the AI4EIC forum. I will begin by exploring the impact of AI-assisted detector design for future EIC experiments and its broader applications in future nuclear and high-energy physics experiments. These AI-driven methods have the potential to optimize detector performance and push experimental design beyond traditional limits. I will focus on how ML enhances event-level reconstruction and particle identification (PID), particularly for Cherenkov detectors, a key technology at EIC energy scales. I will demonstrate how ML models enable faster simulations and reconstruction, improving data analysis efficiency and expanding physics reach. I will also discuss the use of machine learning in kinematic reconstruction for key reaction mechanisms at the EIC, including deep learning to address uncertainty quantification, which is crucial for interpreting precise measurements. As an example, I will highlight its application in Deep Inelastic Scattering. If time permits, I will briefly mention the community’s efforts in streaming readout and its potential for real-time AI/ML applications, thereby introducing the subsequent talk on real-time PID and tracking in nuclear physics. This talk highlights the transformative role of AI/ML at the EIC, addressing key computational challenges and emphasizing their broader scientific impact on nuclear and particle physics.
        Speaker: Cristiano Fanelli (William & Mary)
- 
        
            
                
        11:45
    
    
            
        
        Real-time ML-FPGA filter for particle identification and tracking in nuclear physics 35m
        Speaker: Sergey Furletov (Jefferson Lab)
 
- 
        
            
        12:20
    
    
        →
        
            13:10
        
    
        
        Lunch 50m
- 
        
            
                
        13:10
    
    
        →
        
            15:20
        
    
            
        
        Contributed talks: Chair: Dr. Dmitry Kondratyev
        Stewart Center 306 (Third floor), Purdue University, 128 Memorial Mall Dr, West Lafayette, IN 47907
        13 mins + 2 mins Q&A
- 
        
            
                
        13:10
    
    
            
        
        rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA 15m
        Deploying Machine Learning (ML) models on Field-Programmable Gate Arrays (FPGAs) is becoming increasingly popular across various domains as a low-latency and low-power solution that helps manage large data rates generated by continuously improving detectors. However, developing ML models for FPGA deployment is often hindered by the time-consuming synthesis procedure required to evaluate resource usage and latency. In particular, the synthesis has a chance of failing depending on the ML architecture, especially if the required resources exceed the capacity of the target FPGA, which in turn makes the development process slow and repetitive. To accelerate this development, we introduce rule4ml, an open-source tool designed to predict the resource utilization and inference latency of Neural Networks (NNs) before their synthesis and implementation on FPGA. We leverage hls4ml, a framework that helps translate NNs into high-level synthesis (HLS) code, to synthesize a diverse dataset of NN architectures and train resource utilization and inference latency predictors. While hls4ml requires full synthesis to obtain resource and latency insights, our method uses trained regression models for immediate pre-synthesis predictions. The prediction models estimate key FPGA metrics, including the usage of Block RAM (BRAM), Digital Signal Processors (DSP), Flip-Flops (FF), and Look-Up Tables (LUT), as well as the inference clock cycles. Evaluation on both synthetic and benchmark NN architectures demonstrates high prediction accuracy, with R² scores between 0.8 and 0.98, and sMAPE values ranging from 10% to 30%. This presentation will focus on introducing rule4ml, showcasing how this tool allows immediate assessment of the feasibility and performance of NNs on FPGAs.
We will also explore the data generation and the regression models' training and validation, presenting the predictive performance and the current limitations of our approach as well as potential future improvements. By providing these insights, we aim to show how rule4ml can significantly streamline the deployment of ML models in real-time applications, ultimately reducing development time and enhancing productivity in workflows that rely on NN-to-FPGA frameworks such as hls4ml. Speaker: Hamza Ezzaoui Rahali (University of Sherbrooke)
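Editorial illustration for the abstract above: the two reported metrics, R² and sMAPE, can be computed as follows. Several sMAPE variants exist and the abstract does not say which one the authors use; this sketch assumes the common form that divides by the mean of the absolute actual and predicted values. The function names and the LUT counts are hypothetical sample data, not rule4ml results.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        total += 0.0 if denom == 0 else abs(p - a) / denom
    return 100.0 * total / len(actual)


# Hypothetical LUT counts: synthesized ground truth vs. estimator output.
actual_luts = [1000, 2000, 4000, 8000]
predicted_luts = [1100, 1900, 4400, 7600]

r2 = r2_score(actual_luts, predicted_luts)
err = smape(actual_luts, predicted_luts)
```

R² rewards getting large designs right because it is dominated by large absolute errors, while sMAPE weights every design equally in relative terms, which is one reason to report both for a resource estimator.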
- 
        
            
                
        13:25
    
    
            
        
        Large Neural Network Partitioning for Distributed Inference on FPGAs 15m
        Ultra-high-speed detectors are crucial in scientific and healthcare fields, such as medical imaging, particle accelerators and astrophysics. Upcoming large dark matter experiments, like the ARGO detector with an anticipated 200 m² detector surface, will generate massive amounts of data across a large number of channels, increasing hardware, energy and environmental costs. Simultaneously, there are also increasing concerns about cybersecurity for edge devices, such as Internet of Things devices, which currently cannot detect attacks in real-time while maintaining normal operations. Many of these devices do not have the compute power to host complex cybersecurity algorithms. To address these challenges, future experiments and systems need effective real-time computation on edge devices. Many new approaches utilize machine learning (ML) algorithms at the source to analyze data in real-time on field-programmable gate arrays (FPGAs) using tools like HLS4ML. However, the models are often too complex and too large to fit on a single FPGA, hence the need for a distributed approach across multiple FPGAs. This work introduces a method to divide and distribute large neural network models across multiple FPGAs for inference. By decomposing the network layer by layer, we address the limitations of fitting expansive models on a single FPGA. When a layer is too large, it can be divided into multiple parallel components. We employ a partitioning tool using rule4ml to accelerate this process, ensuring efficient resource allocation and allowing for low-latency optimization. Alternatively, the method can be applied manually for more customized splitting and distribution. Utilizing a pipelined architecture, we mitigate the network-induced latency between each node.
        As a proof of concept, we implemented this approach by deploying a fully connected neural network (FCNN) for the MNIST dataset and a convolutional neural network (CNN) for a cybersecurity classifier on five small PYNQ-Z2 boards, handling models with 12k and 73k parameters, respectively. This technique not only accommodates large models but also reduces model-to-FPGA tuning, making it ideal for applications requiring fast development cycles. This presentation will discuss the necessity of a distributed approach for large ML models on FPGAs and detail the methodology to split and distribute large neural network models across multiple FPGAs, showing a quick demonstration of the process from start to finish on FPGAs. The tested models show latencies ranging from the millisecond to the microsecond range with no loss of accuracy when compared to inference on CPU. Finally, we will discuss future directions for scaling this method to accommodate even larger models and more complex neural network architectures.
        Speaker: Charles-Étienne Granger (Université de Sherbrooke)
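Editorial illustration for the abstract above: layer-by-layer partitioning with splitting of oversized layers can be sketched as a greedy packing problem. The sketch assumes each layer has a single scalar resource cost and each FPGA a single budget. It is not the authors' tool and does not call rule4ml; the function names, the costs, and the budget are hypothetical, and communication latency and pipelining are ignored.

```python
import math


def split_layer(cost, budget):
    """Split one oversized layer into equal parallel parts that each fit."""
    parts = math.ceil(cost / budget)
    return [cost / parts] * parts


def partition(layer_costs, budget):
    """Greedily pack consecutive layers onto devices with a per-device budget.

    Returns a list of devices; each device is a list of (layer_index, cost).
    """
    devices, current, used = [], [], 0.0
    for idx, cost in enumerate(layer_costs):
        pieces = split_layer(cost, budget) if cost > budget else [cost]
        for piece in pieces:
            # Start a new device when the next piece would exceed the budget.
            if current and used + piece > budget:
                devices.append(current)
                current, used = [], 0.0
            current.append((idx, piece))
            used += piece
    if current:
        devices.append(current)
    return devices


# Hypothetical per-layer resource costs (for example, estimated DSP usage).
layers = [30, 50, 120, 20]
plan = partition(layers, budget=80)
```

In this example the third layer exceeds the budget and is split into two parallel halves that land on different devices. In practice the per-layer costs would come from an estimator or from synthesis reports rather than from hard-coded numbers.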
- 
        
            
                
        13:40
    
    
            
        
        Accelerating Reproducible FPGA Machine Learning Research With a Workflow Management Framework 15m
        High-Level Synthesis (HLS) techniques, coupled with domain-specific translation tools such as HLS4ML, have made the development of FPGA-based Machine Learning (ML) accelerators more accessible than ever before, allowing scientists to develop and test new models on hardware with unprecedented speed. However, these advantages come with significant costs in terms of implementation complexity. The process of taking code written in a high-level language and translating it to a synthesized hardware IP is a long and fraught one, and configuration and workflow choices made at every step along the way will affect the finished product. Properly documenting each of these subtle choices is difficult, and failing to do so can make it hard or impossible to reproduce the model. Further complicating matters is the question of optimization: the breadth of possible design choices in modern hardware ML systems is vast, and optimizing these systems by hand is an intractable process. Efficient design space exploration methods are essential in these development flows. Modern tooling often supports mechanisms to improve the efficiency of design-space explorations. TensorFlow and PyTorch, for example, both support data streaming APIs, where training and test datasets can be loaded into memory only as needed. These APIs can have significant performance benefits and enable the exploration of model options that would otherwise have been infeasible to evaluate due to system resource limitations; our preliminary investigations demonstrated peak memory usage improvements of two orders of magnitude on a very large dataset. However, using these mechanisms requires both awareness that they exist and an investment of development time to make use of them.
As a step toward resolving these issues, we introduce a new open-source framework, the Experimental Setup and Optimization System for HLS4ML (ExSeOS-HLS), which aims to enable optimized and reproducible ML model development flows on hardware systems. By centrally managing all of the steps in the design process, from preprocessing raw input data to extracting result metrics from vendor toolchain reports, ExSeOS-HLS can automatically and reproducibly optimize hyperparameters, HLS settings, and even model architectures for a user-defined target metric or combination thereof. Additionally, it can take advantage of tool-specific development optimizations by default, reducing system resource usage and accelerating the research process with minimal effort on the part of the researcher. Experiment configurations can be exported to a single file, which can be sent to collaborators or published online in order to allow exact reproduction of the research workflow. In introducing this system, our goal is to enable collaborative, reproducible workflows for FPGA-based ML acceleration across many scientific application domains. Speaker: Alexis Shuping (Northwestern University)
- 
        
            
                
        13:55
    
    
            
        
        Differentiable Weightless Neural Networks 15m
        We introduce the Differentiable Weightless Neural Network (DWN), a model based on interconnected lookup tables. Training of DWNs is enabled by a novel Extended Finite Difference technique for approximate differentiation of binary values. We propose Learnable Mapping, Learnable Reduction, and Spectral Regularization to further improve the accuracy and efficiency of these models. We evaluate DWNs in three edge computing contexts: (1) an FPGA-based hardware accelerator, where they demonstrate superior latency, throughput, energy efficiency, and model area compared to state-of-the-art solutions, (2) a low-power microcontroller, where they achieve preferable accuracy to XGBoost while subject to stringent memory constraints, and (3) ultra-low-cost chips, where they consistently outperform small models in both accuracy and projected hardware area. DWNs also compare favorably against leading approaches for tabular datasets, with higher average rank. Overall, our work positions DWNs as a pioneering solution for edge-compatible high-throughput neural networks.
        Speaker: Alan T. L. Bacellar (University of Texas at Austin)
- 
        
            
                
        14:10
    
    
            
        
        [Remote] Machine Learning Inference on FPGAs Using HLS4ML with oneAPI Backend 20m
        The increasing demand for efficient machine learning (ML) acceleration has intensified the need for user-friendly yet flexible solutions, particularly for edge computing. Field Programmable Gate Arrays (FPGAs), with their high configurability and low-latency processing, offer a compelling platform for this challenge. Our presentation gives an update on an end-to-end ML acceleration flow utilizing the oneAPI backend for the HLS4ML compiler to translate models from open-source frameworks such as Keras and PyTorch into FPGA-ready kernels. These kernels, once synthesized, generate optimized bitstreams that implement core ML operations such as layers, activation functions, and normalization, orchestrated by the host for real-time inference. The key challenge of optimizing ML inference on FPGA lies in the architectural differences compared to traditional CPUs and GPUs. Our approach leverages domain-specific optimizations, including pipelined kernel execution, input streaming, fine-grained parallelism control, and improved memory organization. These techniques are critical to achieving superior results in terms of reduced resource utilization, higher maximum clock frequency (fMAX), and lower latency, as demonstrated in synthesis reports targeting Agilex™ FPGAs. Though hardware-based benchmarks are still in progress, we will present preliminary performance estimates and sample outputs from the HLS4ML compilation with the oneAPI backend. The goal of this presentation is to introduce how FPGAs can be used effectively for machine learning tasks, focusing on the oneAPI backend for HLS4ML. We aim to show how this approach simplifies the process of running ML models on FPGAs, making it easier for developers to prototype and deploy solutions. By integrating familiar software frameworks with FPGA hardware, this work provides a practical path toward fast ML inference on edge devices.
Speaker: Haoyan Wang (Intel Corporation)
- 
        
            
                
        14:30
    
    
            
        
        [Remote] BRAM-Aware Quantization for Efficient Transformer Inference via Tile-based Architecture on a FPGA 15m
        Transformers are becoming increasingly popular in fields such as natural language processing, speech processing, and computer vision. However, due to the high memory bandwidth and power requirements of Transformers, contemporary hardware is increasingly unable to keep pace with the trend of larger models. To improve hardware efficiency, increase throughput, and reduce latency, there has been a shift towards using FPGAs to implement existing Transformer algorithms. Compared to GPUs, FPGAs offer more on-chip Block RAM, allowing for the deployment of more medium-sized models for acceleration on FPGAs. However, as input sequences grow longer, larger buffers are needed on the FPGA to temporarily store these long sequences. Therefore, this paper proposes using Flash Attention within a blocked computation flow architecture to reduce the Block RAM usage and bandwidth requirements of Attention. Nonetheless, due to the necessity of using more Block RAM to store QKV instead of accessing HBM, designs often struggle to complete Place & Route. As a solution, before the hardware synthesis compilation stage, an optimized mixed precision configuration is derived using post-training quantized models along with a Block RAM estimator in conjunction with a simulated annealing method. This approach not only significantly reduces the design period, but also allows for a reduction of Block RAM utilization by approximately 20% to 40% without substantially affecting accuracy. When implementing Transformer-related algorithms on FPGAs using high-level synthesis techniques, power efficiency can be improved by 61% to 321% compared to other studies.
        Speaker: Ling-Chi Yang (Institute of Electronics in National Yang Ming Chiao Tung University)
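Editorial illustration for the abstract above: combining a Block RAM estimator with simulated annealing to choose a mixed-precision configuration can be sketched as a toy search. Everything in the sketch is an assumption made for illustration: the 18 Kb block size, the bit-width choices, the logarithmic accuracy proxy that stands in for evaluating post-training-quantized models, the penalty weight, and all function names. It is not the authors' estimator or cost function.

```python
import math
import random

BIT_CHOICES = [4, 8, 16]   # hypothetical per-layer weight precisions
BLOCK_BITS = 18 * 1024     # assumed capacity of one Block RAM


def bram_estimate(bits, sizes):
    """Toy estimator: blocks needed to store each layer's weights."""
    return sum(math.ceil(b * n / BLOCK_BITS) for b, n in zip(bits, sizes))


def accuracy_proxy(bits):
    """Toy stand-in for measured accuracy: more bits give a higher score."""
    return sum(math.log2(b) for b in bits)


def cost(bits, sizes, budget):
    """Lower is better: reward accuracy, heavily penalize exceeding the budget."""
    over = max(0, bram_estimate(bits, sizes) - budget)
    return -accuracy_proxy(bits) + 100.0 * over


def anneal(sizes, budget, steps=2000, t0=2.0, seed=0):
    rng = random.Random(seed)
    current = [min(BIT_CHOICES)] * len(sizes)
    best = list(current)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-6  # linear cooling schedule
        candidate = list(current)
        candidate[rng.randrange(len(sizes))] = rng.choice(BIT_CHOICES)
        delta = cost(candidate, sizes, budget) - cost(current, sizes, budget)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current, sizes, budget) < cost(best, sizes, budget):
                best = list(current)
    return best


# Hypothetical weight counts per layer and a Block RAM budget.
layer_sizes = [65536, 65536, 262144, 65536]
BUDGET = 250
config = anneal(layer_sizes, BUDGET)
```

Because the search starts from the lowest precision, which fits the budget here, and the penalty dominates the accuracy term, the returned configuration stays within the budget while scoring at least as well as the starting point on the proxy.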
- 
        
            
                
        14:45
    
    
            
        
        Episodic reinforcement learning for 0νββ decay signal discrimination 15m
Neutrinoless double beta ($0 \nu \beta \beta$) decay is a Beyond the Standard Model process that, if discovered, could prove the Majorana nature of neutrinos, meaning that they are their own antiparticles. In their search for this process, $0 \nu \beta \beta$ decay experiments rely on signal/background discrimination, which is traditionally approached as a supervised learning problem. However, the experiment data are by nature unlabeled, and producing ground-truth labels for each data point is an involved process if using traditional methods. As such, we reformulate the task of classifying $0 \nu \beta \beta$ decay experiment data as a weakly-supervised learning task and develop an episodic reinforcement learning (RL) algorithm with Randomized Return Decomposition to address it, training and validating our algorithm on real data produced by the Majorana Demonstrator experiment. We find that the RL-trained classifier slightly outperforms a standard supervised learning model trained under the same conditions. Our classifier serves as a proof of concept and shows potential for application in future $0 \nu \beta \beta$ decay experiments like LEGEND.
Speaker: Sonata Simonaitis-boyd
- 
        
            
                
        15:00
    
    
            
        
        Towards a machine learning trigger for high-purity germanium spectrometers 15m
High-purity germanium spectrometers are widely used in fundamental physics and beyond. Their excellent energy resolution enables the detection of electromagnetic signals and recoils at ionization energies of 1 keV and below. However, the detectors are also very sensitive to all types of noise, which can overwhelm the trigger routines of the data acquisition system and significantly increase file sizes. This ultimately limits how low the trigger threshold can be set, so the full potential of the detectors cannot be exploited.
In my talk, I will present time-series clustering algorithms to identify this noise and show a concept for using anomaly detection algorithms for triggering, in order to overcome the shortcomings of traditional trigger algorithms and lower the energy threshold. I will also illustrate the gains for fundamental physics detections, using coherent elastic neutrino-nucleus scattering as an example.
Speaker: Janina Dorin Hakenmueller (Duke University)
 
- 
        
            
                
        13:10
    
    
            
        
        
- 
        
            
                
        15:20
    
    
        →
        
            15:30
        
    
            
        
        Coffee/Posters: Walk to PHYS
        Stewart Center 302 (Third floor), Purdue University, 128 Memorial Mall, West Lafayette, IN 47907
Information about posters: The dimensions of the display areas of the boards are 69” wide x 46” tall. Portrait and landscape both work. However, portrait would probably work best. The rolling white boards have magnets on them that you will use to attach the posters.
- 
        
            
                
        15:30
    
    
        →
        
            17:00
        
    
            
        
        Physics Colloquium [No remote participation] 1h 30m
        PHYS 112, Purdue University, 525 Northwestern Ave, West Lafayette, IN 47907, USA
Speaker: Philip Coleman Harris (Massachusetts Inst. of Technology (US))
- 17:00 → 18:00
- 
        
            
        18:30
    
    
        →
        
            21:30
        
    
        
        Conference Dinner 3h
        Union Club Hotel, East/West Faculty Lounge, 2nd Floor, 201 Grant St, West Lafayette, IN 47906
 
- 
        
            
                
        09:00
    
    
        →
        
            10:45
        
    
            
        
        
- 
                    
                        
                            
                        
                    
                    - 
        
            
                
        09:00
    
    
        →
        
            11:00
        
    
            
        
        HLS4ML tutorial
        Room 105, Lambert Fieldhouse
The hls4ml tutorial will take place in Lambert Fieldhouse (LAMB), room 105. The tutorial will use the Anvil computing cluster at Purdue. Access is controlled through the NSF ACCESS system, and participants need to create an account at https://operations.access-ci.org/identity/new-user. You will need to let us know your ACCESS username using this form: https://forms.gle/5YD5q9ViQU2NQzGY7. Please do so no later than 48 hours before the tutorial.
Convener: Jan-Frederik Schulte (Purdue University (US))
- 
        
            
                
        09:00
    
    
        →
        
            10:00
        
    
            
        
        SONIC tutorial
        BHEE 234, Purdue University, 465 Northwestern Ave, West Lafayette
The meeting/tutorial will be held in BHEE (501 Northwestern Ave), room 234. The tutorial will use the Anvil computing cluster at Purdue. Access is controlled through the NSF ACCESS system, and participants need to create an account at https://operations.access-ci.org/identity/new-user. You will need to let us know your ACCESS username using this form: https://forms.gle/5YD5q9ViQU2NQzGY7. Please do so no later than 48 hours before the tutorial.
Convener: Yuan-Tang Chou (University of Washington (US))
- 
        
            
                
        10:00
    
    
        →
        
            11:00
        
    
            
        
        ECE Distinguished Lecture: Optics à la mode – a new way of making, using and understanding optics: Prof David Miller (Stanford)
        MSEE 112, Purdue University, 501 Northwestern Ave, West Lafayette, IN 47907, USA
- 
        
            
                
        10:00
    
    
        →
        
            12:00
        
    
            
        
        Next Generation Triggers
        PHYS 390, Purdue University, 525 Northwestern Avenue, West Lafayette
- 10:00
- 
        
            
                
        10:20
    
    
            
        
        NGT From CMS (HLT) 20m
Speaker: Marco Rovere (CERN)
- 10:40
- 
        
            
                
        11:20
    
    
            
        
        Discussion 40m
 
- 
        
            
                
        10:00
    
    
        →
        
            11:00
        
    
            
        
        SONIC developer meeting
        BHEE 234, Purdue University
Convener: Yuan-Tang Chou (University of Washington (US))
- 
        
            
                
        11:00
    
    
        →
        
            12:00
        
    
            
        
        HDR ML challenge hands-on session
        Room 105, Lambert Fieldhouse, Purdue University
Convener: Yuan-Tang Chou (University of Washington (US))
- 
        
            
        12:00
    
    
        →
        
            13:45
        
    
        
        Lunch 1h 45m
        PHYS 242 (CERN)
- 
        
            
                
        13:45
    
    
        →
        
            14:30
        
    
            
        
        Awards and Closeout
        PHYS 112
- 
        
            
                
        13:45
    
    
            
        
        Poster awards 5m
- 
        
            
                
        13:50
    
    
            
        
        Closeout talk 25m
Speaker: Sasha Boltasseva (Purdue University)
 
- 
        
            
                
        13:45
    
    
            
        
        
- 
        
            
                
        15:00
    
    
        →
        
            17:00
        
    
            
        
        Fast ML not-for-profit chat
        390 (Physics building)
 
- 
        
            
                
        09:00
    
    
        →
        
            11:00