Bites of FM4S: [2] LLMs for experiments in fundamental physics

Europe/Zurich
Description

Following up on the Foundation Models for Science mini-workshop, the CMS ML innovation team is excited to announce Bites of Foundation Models for Science, a series of deep dives into specific themes in this area.

 The second of these bites will be held remotely on June 3rd 2024 (15:00-19:00 CERN time) with the theme of LLMs for experiments in fundamental physics. We aim to connect researchers from CMS, ATLAS, and beyond the LHC who are building LLM-based tools to support the scientific experimental pipeline.

 We invite you to share your work! Abstract submission for the first workshop is open until May 15th. 

 We plan to follow up with the next themes soon. Please register below to receive updates on future events. Abstract submission for these is open as well. Feel free to submit an abstract for a theme that better fits your work; we will keep it in mind for future events!

 Don’t hesitate to contact us at cms-conveners-ml-innovation@cern.ch for more details or questions!


Organizing committee:

  • Javier Duarte (UCSD)
  • Gaia Grosso (IAIFI / MIT)
  • Raghav Kansal (Caltech / FNAL)
  • Pietro Vischia (Universidad de Oviedo / ICTEA)
Zoom Meeting ID
64721985332
Host
Gaia Grosso
Alternative host
Raghav Kansal
Passcode
87730735
    • Invited talks
      • 1
        AccGPT: A Chatbot for CERN Internal Knowledge

        AccGPT is an innovative pilot project utilizing Large Language Models (LLMs) to create a chatbot for interacting with CERN's extensive internal knowledge base. This initiative is primarily led by the CERN Beams and IT departments, with the objective of making the chatbot available to the entire CERN community. AccGPT is designed to provide quick and straightforward answers to queries, similar to ChatGPT, thereby enhancing productivity and decreasing the time experts spend on support tasks. Looking ahead, there are plans to expand AccGPT's functionalities, for example by adding enhanced agent features.

        Speaker: Dr Florian Rehm (CERN)
      • 2
        chATLAS: An AI Assistant for the ATLAS Collaboration

        The ATLAS Collaboration is composed of around 6,000 scientists, engineers, developers, students and administrators, with decades of institutional documentation spread across wikis, code docs, meeting agendas, recommendations, publications, tutorials, and project management systems. With the advent of retrieval-augmented generation (RAG) and sophisticated large language models (LLMs) such as GPT-4, there is now an opportunity to produce a “front door” to this intimidatingly large corpus. chATLAS is an attempt to provide this entry point, as ATLAS’ official AI assistant and search system. In this contribution, we review the past year of developments, present the latest updates to the system, and introduce ongoing work to improve back-end performance, agentic information gathering, and science-centric design components.

        Speaker: Daniel Thomas Murnane (Niels Bohr Institute, University of Copenhagen)
      • 3
        LLM-based physics analysis assistant at BESIII

        Data processing and analysis are among the main challenges at HEP experiments. To accelerate physics analysis and drive new physics discoveries, rapidly developing Large Language Models (LLMs) are the most promising approach: they have demonstrated astonishing capabilities in the recognition and generation of text, from which most parts of physics analysis can benefit. In this talk we will discuss the construction of a dedicated LLM-based intelligent agent, an AI assistant named Dr.Sai at BESIII, its potential use in boosting hadron spectroscopy studies, and future plans towards an AI scientist.

        Speakers: Beijiang Liu, Changzheng YUAN, Ke Li (Chinese Academy of Sciences (CN)), Zhengde Zhang (Institute of High Energy Physics, Chinese Academy of Sciences)
    • 16:15
      Break
    • Invited talks
      • 4
        Multi-Agent Research Validator & Enabler Using LLMs (MARVEL): Experiences from LIGO

        Gravitational-wave research at the Advanced LIGO observatories integrates complex, interconnected elements of experimental physics, computational simulations, and theoretical astrophysics. However, decades of valuable knowledge remain scattered across unstructured, multi-modal data and fragmented codebases. Efficient dissemination of this knowledge using large language models (LLMs) can significantly accelerate scientific discovery. In this talk, we share experiences from developing MARVEL, a modular, multi-agent research framework designed to provide scientific assistance in highly technical domains. MARVEL leverages open LLMs to ensure data privacy and is designed to be flexible enough to accommodate a broader range of scientific domains. We highlight challenges, including limitations in fine-tuning, pitfalls of naive Retrieval-Augmented Generation (RAG), model hallucinations, context window constraints, and difficulties in processing scientific documents via optical character recognition. To enhance factual accuracy and reasoning capabilities, MARVEL integrates tool usage and leverages test-time computing at the expense of increased latency. Finally, we emphasize the importance of modular workflows and custom benchmarks for adapting rapidly to advances in foundation models.

        Speaker: Nikhil Mukund (MIT)
      • 5
        The SpeakYSE: An Agentic LLM for Supernova Science

        Time-domain astronomy is rapidly entering a data-rich era in which wide-field surveys discover millions of transients per year, overwhelming traditional, hand-driven analysis. In this talk we present SpeakYSE, an agentic and open-source language model that turns natural-language requests into end-to-end analyses for the Young Supernova Experiment. The SpeakYSE links literature retrieval, database reasoning, and low-level tool-calling for on-the-fly exploratory data analysis. We describe the SpeakYSE’s architecture, early results from its use within the collaboration, and future directions enabled by next-generation reasoning models. Domain-specific LLM agents like SpeakYSE will be essential for exploiting next-generation surveys such as the Rubin LSST.

        Speaker: Alex Gagliano (IAIFI/MIT/Harvard)
      • 6
        Bridging LLMs and Scientific Infrastructure: A2rchi for Context-Aware Research Support

        We introduce A2rchi (AI-Augmented Research Chat Intelligence), an intelligent, domain-adaptable open source chatbot designed to support research and education workflows.
        Beyond its core functionalities, A2rchi also integrates with common communication and workflow tools, including email systems, ticketing platforms, and collaboration platforms like Mattermost, offering seamless assistance across multiple channels.

        Leveraging Retrieval-Augmented Generation (RAG), A2rchi combines foundational large language models with custom, project-specific data—such as course materials and documentation—to deliver accurate, context-aware responses.

        Originally developed and deployed to support classroom instruction at MIT and the Physics Department’s analysis facility, A2rchi is now being expanded to serve the CMS experiment at CERN, with applications ranging from Tier-0 operations and data management to end-user physics analysis support.

        We present the current implementation, lessons learned from real-world deployments, and the roadmap for scaling A2rchi into a robust, domain-aware assistant for large-scale scientific collaborations.

        Speaker: Mariarosaria D'Alfonso (Massachusetts Inst. of Technology (US))
    • Round table [CMS internal]

      Brainstorming on ideas for LLM applications in CMS

      Conveners: Gaia Grosso (IAIFI, MIT), Raghav Kansal (Caltech / Fermilab)