Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio

Not scheduled
15m
OIST

1919-1 Tancha, Onna-son, Kunigami-gun Okinawa, Japan 904-0495
poster presentation Track4: Middleware, software development and tools, experiment frameworks, tools for distributed computing

Speaker

Dr Mario Lassnig (CERN)

Description

This contribution details the deployment of Rucio, the ATLAS Distributed Data Management system. The main complication is that Rucio interacts with a wide variety of external services and connects globally distributed data centres under different technological and administrative control, at an unprecedented data volume. It is therefore not possible to create a duplicate instance of Rucio for testing or integration. Every software upgrade or configuration change is thus potentially disruptive and requires fail-safe software and automatic error recovery. Rucio uses a three-layer scaling and mitigation strategy based on quasi-realtime monitoring. This strategy mainly employs independent stateless services, automatic failover, and service migration. The technologies used for deployment and mitigation include OpenStack, Puppet, Graphite, HAProxy, Apache, and nginx. This contribution discusses the reasons and design decisions for the deployment, the actual implementation, and an evaluation of all involved services and components.

Primary authors

Dr Mario Lassnig (CERN), Ralph Vigne (University of Vienna (AT))

Co-authors

Cedric Serfon (CERN), Martin Barisits (CERN), Thomas Beermann (Bergische Universitaet Wuppertal (DE)), Vincent Garonne (CERN)
