Open (and Closed) Data in the Age of AI

US/Central
Description

The aim of the workshop is to explore what it means to build “cross-experiment multi-modal” foundation models in a landscape where the relevant scientific data can be subject to a range of policies, from “open data” to experiment-restricted data (either proprietary or limited-distribution/raw), and where simulations are used alongside that data. Both data and simulation also come with a range of latency/embargo periods, “initial use vs reuse/reinterpretation” distinctions and experiment governance structures. The workshop will examine the technical, policy and cyberinfrastructure questions that arise when pursuing such shared models.

Specific questions:

  • What does it actually mean in practice to build a “foundation model” across experiments with different detector designs, data formats, and physics goals?

  • What does it mean in practice to do pre-training on “diverse data” in a shared environment vs fine-tuning in a restricted (experiment) environment?

  • How is benchmarking of the models done? How are the models validated when issues arise that span both the pre-training and the restricted fine-tuning stages?

  • What are the technical and cyberinfrastructure implications?

  • Who owns the resulting models and what are the implications given different experiment governance structures? (And data ownership by international collaborations?)

  • If initially trained on a set of current and archived data, how do these models evolve going forward as new data appears from new experiments/upgrades/detector configurations?

  • If industry is involved in parts of this process, how do we avoid issues related to vendor lock-in and/or retain the “public” expectation that underlies most of the government funding of fundamental science?

This event is sponsored in part by the National Science Foundation through grants OAC-2226378, OAC-2226379 and OAC-2226380 (FAIROS-HEP). Any opinions, findings, conclusions or recommendations expressed in this material are those of the developers and do not necessarily reflect the views of the National Science Foundation.
  • Thursday 23 April
    • 09:00 09:15
      Introduction 15m
      Speakers: Peter Elmer (Princeton University (US)), Robert William Gardner Jr (University of Chicago (US))
    • 09:15 09:30
      TREASURE Project 15m
      Speaker: Paolo Calafiura (Lawrence Berkeley National Lab. (US))
    • 09:30 09:45
      American Science Cloud 15m
      Speaker: Oliver Gutsche (Fermi National Accelerator Lab. (US))
    • 09:45 10:00
      ATLAS Open Data 15m
      Speaker: Zach Marshall (Lawrence Berkeley National Lab. (US))
    • 10:00 10:15
      CMS Open Data 15m
      Speaker: Matthew Bellis (Cornell University/Siena College (US))
    • 10:15 10:30
      Neutrinos Open Data 15m
      Speaker: Prof. Jianming Bian (University of California, Irvine (US))
    • 10:30 11:00
      Coffee Break 30m
    • 11:00 11:15
      FM4NPP 15m
      Speaker: Shuhang Li
    • 11:15 12:30
      Discussion 1h 15m
    • 12:30 13:30
      Lunch 1h
    • 13:30 15:00
      Discussion 1h 30m
    • 15:00 15:30
      Coffee Break 30m
    • 15:30 17:00
      Discussion 1h 30m
    • 19:00 21:00
      Workshop Dinner 2h
  • Friday 24 April
    • 09:00 10:30
      Discussion 1h 30m
    • 10:30 11:00
      Coffee Break 30m
    • 11:00 12:30
      Summary Discussion 1h 30m
    • 12:30 13:30
      Lunch 1h
    • 13:30 15:00
      Small Group Working Sessions/Discussions 1h 30m