25–29 May 2026
Chulalongkorn University
Asia/Bangkok timezone

Migrating INSPIREHEP Document Harvesting and Curation Workflows to Apache Airflow

Poster Presentation, Track 6 - Software environment and maintainability

Speaker

Ioannis Tsanaktsidis (CERN)

Description

The continuous ingestion of scientific documents from external sources into INSPIREHEP created challenges in scalability, transparency, and long-term maintenance. This contribution describes the migration of our document harvesting and curation pipeline to the open-source workflow orchestrator Apache Airflow. The work involved re-engineering legacy scripts and cron-based tasks into modular Airflow Directed Acyclic Graphs (DAGs), and improving error handling, dependency management, metadata validation, and automated notifications within the scheduler.
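
As an illustration of the ideas named above (tasks arranged in a dependency graph, retries on failure, metadata validation, and a notification when a task finally fails), the following is a minimal sketch using only the Python standard library. It is not Airflow code and not the INSPIREHEP pipeline: the task names, the record fields (`title`, `doi`), and the `run_pipeline` helper are all hypothetical and chosen only to show the pattern that an orchestrator such as Airflow provides out of the box.

```python
from graphlib import TopologicalSorter


def harvest():
    # Hypothetical records standing in for documents fetched from an external source.
    return [
        {"title": "A paper", "doi": "10.1000/example"},
        {"title": "", "doi": None},  # incomplete metadata, should be filtered out
    ]


def validate(records):
    # Keep only records that carry the fields this sketch treats as mandatory.
    return [r for r in records if r.get("title") and r.get("doi")]


def load(records):
    # Stand-in for ingestion: report how many records would be ingested.
    return len(records)


# Hypothetical task registry and dependency graph (task -> upstream tasks).
TASKS = {"harvest": harvest, "validate": validate, "load": load}
DEPENDENCIES = {"harvest": [], "validate": ["harvest"], "load": ["validate"]}


def run_pipeline(tasks, dependencies, retries=2, notify=print):
    """Run tasks in dependency order, retrying failures and notifying on final failure."""
    results = {}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, ())]
        for attempt in range(1, retries + 2):
            try:
                results[name] = tasks[name](*upstream)
                break
            except Exception as exc:
                if attempt == retries + 1:
                    # Out of retries: send a notification, then stop the run.
                    notify(f"task {name!r} failed after {attempt} attempts: {exc}")
                    raise
    return results


results = run_pipeline(TASKS, DEPENDENCIES)
```

In this sketch, `results["load"]` holds the number of records that passed validation. In an actual Airflow deployment, the ordering, retry policy, and failure notifications would be declared on the DAG and its tasks rather than hand-written as here.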

Integrating Airflow with our curation and ingestion systems provided full end-to-end traceability, reduced manual intervention, and simplified the maintenance of complex workflows. Performance observations and lessons learned during the transition demonstrate clear gains in reliability, observability, and reproducibility.
