The Journey of Developing Specialized Text Embedding Models - 15'+5'

11 Apr 2025, 10:20
20m
80/1-001 - Globe of Science and Innovation - 1st Floor (CERN)

Esplanade des Particules 1, 1211 Meyrin, Switzerland
Invited talks, LLMs and AI Assistants

Speaker

Thorsten Hellert

Description

The specialized terminology and complex concepts of physics pose significant challenges for Natural Language Processing (NLP), particularly when relying on general-purpose models. In this talk, I will discuss the development of physics-specific text embedding models designed to overcome these obstacles, beginning with PhysBERT, the first model pre-trained exclusively on a curated corpus of 1.2 million arXiv physics papers. Building on this foundation, we turn to accelerator physics, a subfield with even more intricate language and concepts. To capture the nuances of this domain, we developed AccPhysBERT, a sentence embedding model fine-tuned specifically for accelerator physics literature. A key aspect of this development was the extensive use of Large Language Models (LLMs) to generate annotated training data. AccPhysBERT enables advanced NLP applications such as semantic paper-reviewer matching and integration into Retrieval-Augmented Generation (RAG) systems.
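
As an illustration of the paper-reviewer matching application named in the description, the following is a minimal sketch of how sentence embeddings could be used to rank reviewers by cosine similarity. It assumes the embeddings have already been computed. The toy vectors, the reviewer names, and the `rank_reviewers` helper are hypothetical and are not outputs or APIs of AccPhysBERT; the approach presented in the talk may differ.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def rank_reviewers(paper_vec, reviewer_vecs):
    """Sort reviewers by similarity to the paper, most similar first."""
    scores = [
        (name, cosine_similarity(paper_vec, vec))
        for name, vec in reviewer_vecs.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings; real sentence embeddings
# typically have hundreds of dimensions.
paper = [0.9, 0.1, 0.0]
reviewers = {
    "reviewer_a": [0.8, 0.2, 0.1],
    "reviewer_b": [0.0, 0.3, 0.9],
}

ranking = rank_reviewers(paper, reviewers)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The same ranking step could, under similar assumptions, serve as the retrieval stage of a Retrieval-Augmented Generation system, with text passages in place of reviewer profiles.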
