EUCaif Open Data and Simulator Day (Astro/Cosmo)

Europe/Zurich
Description

This meeting will be held on Zoom only, in the afternoon. The Zoom link is available at the bottom of this page, for registered participants only.

Open, large-scale datasets are essential for training and evaluating modern Machine Learning models, especially foundation models that learn versatile, reusable representations.

Focusing on astrophysics and cosmology (with an emphasis on multi-messenger astronomy across γ-rays, X-rays, optical/IR, radio, GWs, neutrinos, CMB, and catalogs), this Data Day aims to survey available Open Data and open simulators relevant for building and testing foundation models.

If you have released an Open Data dataset, are preparing a release, or are developing an open simulator or benchmark, please contact the organisers (below).
If you are developing data-hungry ML models (self-supervised, multimodal, generative, retrieval-augmented, agents, etc.), we also encourage you to participate. We will close with an open discussion to collect needs and align on next steps (i.e. joint tasks).

Please note that a similar event concerning LHC data took place on 4 November 2025.

This event is organised by EUCaif WG1 (Foundation Model, see intro):

David Rousseau (rousseau@ijclab.in2p3.fr), Roberto Ruiz de Austri (ruiz@ific.uv.es) and Ik Siong Heng (ik.heng@glasgow.ac.uk)

The Zoom recording passcode is mceA2^VH

Participants
  • Adnan Ghribi
  • Amine Lahouel
  • Anastasiia Petrovych
  • Andre Sznajder
  • Anna Zaborowska
  • Axel Naumann
  • Cedric Bhihe
  • Christian Glaser
  • Darius Jurciukonis
  • David Rousseau
  • Dmitriy Kostunin
  • Dmitry Malyshev
  • Enrico Lupi
  • Florian List
  • Francesco Xotta
  • Giacomo Principe
  • GuangZai Ye
  • Hannes Jakob Hansen
  • Hanyue Guo
  • Herwins Gangaram
  • Humberto Reyes-González
  • Ignacio Sevilla
  • Ik Siong Heng
  • Ippocratis Saltas
  • James Alvey
  • Jiefeng Chen
  • John Veitch
  • João A. Gonçalves
  • Judit Pérez-Romero
  • Judita Mamuzic
  • Marcos Cruz
  • Marina Migliaccio
  • Martin Eriksen
  • Martín Rodríguez Monroy
  • Mauricio Bustamante
  • Maurizio Pierini
  • Michael Kramer
  • Miguel Cárdenas-Montes
  • Nachiketa Chakraborty
  • Nikhil Mukund
  • Oscar Jose Pellicer Valero
  • Peter Pang
  • Piet Nogga
  • R. Belén Barreiro
  • Raulian-Ionut Chiorescu
  • Reem Alfaidi
  • Roberto Ruiz De Austri
  • Saptashwa Bhattacharyya
  • Sascha Caron
  • Thibeau Wouters
  • Thomas Vuillaume
  • Tobias Golling
  • Tsun Ho Pang
  • Waleed Esmail
  • Xiaofei Dong
  • Zhihua Liang
    • 14:00 14:10
      Introduction 10m
      Speaker: Roberto Ruiz De Austri (IFIC (UV-CSIC))
    • 14:10 14:30
      An introduction to Foundation models 20m

      Foundation models (FMs) represent a paradigm shift in artificial intelligence, marking a move from small, specialized systems to large, multi-modal, general-purpose ones. Starting from the seminal paper that coined the term, this talk will provide an overview of the rapidly evolving landscape and highlight key milestones, from GPT and CLIP up to the latest state-of-the-art models such as DINOv3. The talk will conclude with a critical look at the zero-shot promise of FMs, assessing where the reality currently stands in relation to the hype.

      Speaker: Oscar Pellicer (Universitat de Valencia)
    • 14:30 14:50
      Gamma-rays 20m

      The all-sky gamma-ray emission in the GeV-TeV range contains a treasure trove of information on the Galactic interstellar medium, the Galactic cosmic ray (CR) population and CR accelerators.
      In this talk, I will give a brief introduction to the current state-of-the-art satellite and ground-based gamma-ray detectors that collect these high-energy photons.
      Amidst the large number of detected γ-ray sources, there is still an undetected population of faint sources hidden within the Interstellar Emission (IEM), and possibly contributions from Dark Matter annihilation and decay.
      After going over the current state-of-the-art deep-learning-based pipeline for point source detection and characterization, I will discuss the challenges in building a fully automated multi-wavelength source detection and characterization pipeline.

      Speaker: Saptashwa Bhattacharyya (University of Nova Gorica)
    • 14:50 15:10
      NMMA, a comprehensive nuclear-physics and multi-messenger astrophysics framework 20m

      The multi-messenger detection of the gravitational-wave signal GW170817, the corresponding kilonova AT2017gfo, the short gamma-ray burst GRB170817A, and the observed afterglow has delivered a scientific breakthrough. For an accurate interpretation of all these different messengers, one requires robust theoretical models that reliably describe the emitted gravitational wave, electromagnetic emission, and dense matter. In addition, one needs efficient, accurate computational tools to ensure correct cross-correlation between the models and the observational data. For this purpose, we have developed the Nuclear-physics and Multi-Messenger Astrophysics framework NMMA. The code allows incorporation of nuclear-physics constraints at low densities, as well as X-ray and radio observations of isolated neutron stars, and the processing of standardized multi-band light curves, gravitational-wave strain data, and gamma-ray spectral observations. In this talk, we show how NMMA simultaneously analyzes the gravitational-wave signal, the kilonova, and the gamma-ray burst afterglow, and how machine learning techniques have been used across various aspects of NMMA to achieve this goal, and we discuss its future prospects.

      Speaker: Peter Tsun Ho Pang (Nikhef)
    • 15:10 15:20
      Redback project 10m
      Speaker: Nikhil Sarin (University of Cambridge)
    • 15:20 15:40
      Tokenizing the Sky: Diffusion Autoencoding for Multimodal, Irregular Astronomical Data 20m

      Self-supervised learning (SSL) has transformed representation learning, yet most encoders are validated on regularly-sampled inputs (images, audio, video). Scientific data collected by synoptic survey telescopes violate these assumptions: measurements often arrive as long, irregular, heterogeneous sequences spanning multiple data modalities. I will present daep (Diffusion Autoencoder with Perceivers), a framework that tokenizes heterogeneous measurements, compresses them with a Perceiver encoder, and reconstructs them with a Perceiver-IO diffusion decoder. This design natively accommodates variable length, missingness, and cross-modal fusion while scaling to large datasets. Across both observed ZTF data and synthetic LSST observations, daep attains lower reconstruction error, yields more discriminative latent spaces, and better preserves fine-scale structure than popular SSL baselines. These results position diffusion-based autoencoders as a powerful architecture for foundation model pre-training in the LSST era, with natural extensions to multi-messenger astrophysics.

      Speaker: Alex Gagliano (MIT)
    • 15:40 15:50
      Break 10m
    • 15:50 16:10
      Modeling High-Energy Astrophysical Neutrino Production 20m

      High-energy astrophysical neutrinos (TeV-PeV) hold immense potential for revealing the physics of the most extreme cosmic accelerators. Yet, twelve years after their discovery, we have identified fewer than a handful of their sources, relying on coincident high-energy electromagnetic emission. Given the scarcity of high-purity neutrino data and the rarity of coincident observations, it is critical to maximize the physical insight extracted from the limited observations available, while accounting for the limitations of theoretical models, including unknown quantities and parameter correlations. In this talk, I will outline two classes of theoretical models for high-energy neutrino and gamma-ray production: a simplified framework and a more sophisticated one. I will propose how these models could be used to train machine-learning methods that run on publicly accessible data to infer the physical parameters driving particle production. This approach ensures that when the next coincident neutrino-electromagnetic detection occurs, we are prepared to glean the maximum possible physical insight from it.

      Speaker: Mauricio Bustamante (Niels Bohr Institute, University of Copenhagen)
    • 16:10 16:30
      Neutrino data from neutrino telescopes 20m

      I will give a brief overview of the experimental landscape of (cosmic) neutrino telescopes in the TeV to EeV energy range. I will discuss the detection technique and make suggestions on how neutrino data can be integrated into foundation models. I will present sensitivities of current and future telescopes that relate neutrino fluxes to expected event counts, which can serve as the basis of simulation studies. I will show public event catalogues and the IceCube real-time neutrino alerts, and discuss the observation of a binary neutron star merger and a flaring blazar as examples of multi-messenger (MM) astronomy with neutrinos.

      Speaker: Christian Glaser (TU Dortmund University)
    • 16:30 16:50
      Cross-Correlations between the CMB and Large-Scale Structure: Linking Data and Simulations 20m

      In the coming years, deep and wide galaxy surveys will deliver an unprecedented wealth of large-scale structure (LSS) data, complementing the high-sensitivity Cosmic Microwave Background (CMB) measurements from Planck and current ground-based experiments. Together, these datasets probe a wide range of physical scales and cosmic epochs, making it timely to investigate their complementarities and combined constraining power. We will present how cross-correlating CMB and LSS observables that respond to the same underlying physics offers a powerful route to enhance cosmological constraints while mitigating instrumental and astrophysical systematics that affect each probe individually. Fully exploiting this synergy, however, demands robust theoretical modelling and rigorous validation of analysis pipelines, which in turn require high-fidelity simulations that consistently implement multiple astrophysical and cosmological effects within the same lightcone.

      Speaker: Marina Migliaccio (University of Rome)
    • 16:50 17:20
      Discussion 30m