
CHEP 2018 Conference, Sofia, Bulgaria

Europe/Sofia
Sofia, Bulgaria

National Culture Palace, Boulevard "Bulgaria", 1463 NDK, Sofia, Bulgaria
Peter Hristov (CERN), Vasil Vasilev (Princeton University (US)), Petya Vasileva (CERN)
Description

CHEP 2018 took place on 9-13 July 2018 at the National Palace of Culture, Sofia, Bulgaria.
Details about the conference can be found at CHEP2018.org.

PLATINUM SPONSORS

SILVER SPONSORS

Participants
  • Adam Aurisano
  • Adam Morris
  • Adam Wegrzynek
  • Adrian Alan Pol
  • Adrian Coveney
  • Adriana Telesca
  • Adriano Di Florio
  • Alastair Dewhurst
  • Alastair Pidgeon
  • Alberto Aimar
  • Alessandra Doria
  • Alessandra Forti
  • Alessandro De Salvo
  • Alessandro Lonardo
  • Alex Kastanas
  • Alex Martyniuk
  • Alexander Undrus
  • Alexandre Sousa
  • Alexei Klimentov
  • Alexey Anisenkov
  • Alexey Rybalchenko
  • Alja Mrak Tadel
  • Alkis Papadopoulos
  • Alvaro Fernandez Casani
  • Amber Boehnlein
  • Andre Sailer
  • Andrea Ceccanti
  • Andrea Manzi
  • Andrea Perrotta
  • Andrea Rizzi
  • Andrea Sciaba
  • Andrea Valassi
  • Andreas Heiss
  • Andreas Joachim Peters
  • Andreas Nowack
  • Andreas Petzold
  • Andreas Salzburger
  • Andrei Gheata
  • Andrei Kazarov
  • Andrew Hanushevsky
  • Andrew John Washbrook
  • Andrew McNab
  • Andrew Melo
  • Andrew Wightman
  • Andrey Baginyan
  • Andrey Dolbilov
  • Andrey Lebedev
  • Andrey Nechaevskiy
  • Anna Fatkina
  • Anna Woodard
  • Anton Josef Gamel
  • Antoni Shtipliyski
  • Antonio Augusto Alves Junior
  • Antonio Dias
  • Antonio Perez-Calero Yzquierdo
  • Aresh Vedaee
  • Arif Bayirli
  • Aristotelis Kyriakis
  • Armenuhi Abramyan
  • Armin Nairz
  • Artem Petrosyan
  • Arturo Sanchez Pineda
  • Arun Kumar
  • Athanasios Kotisis
  • Attila Krasznahorkay
  • Axel Naumann
  • Baosong Shan
  • Barthelemy Von Haller
  • Bartlomiej Rachwal
  • Beijiang Liu
  • Belmiro Moreira
  • Ben Couturier
  • Benedetto Gianluca Siddi
  • Benjamin Fischer
  • Benjamin Krikler
  • Benjamin Morgan
  • Benjamin Richards
  • Beraldo Leal
  • Birgit Lewendel
  • Bjarte Kileng
  • Boretto Marco
  • Boris Bauermeister
  • Boris Wagner
  • Borja Aparicio Cotarelo
  • Bouhali Othmane
  • Bradley Spitzbart
  • Brian Paul Bockelman
  • Bruno Heinrich Hoeft
  • Carl Lundstedt
  • Catherine Biscarat
  • Catrin Bernius
  • Cesare Calabria
  • Charles Leggett
  • Chekalina Viktoriia
  • Chris Burr
  • Chris Lee
  • Christian Caron
  • Christian Voss
  • Christoph Heidecker
  • Christoph Wissing
  • Christopher Jones
  • Christopher Tunnell
  • Chun-Yu Lin
  • Claire Adam Bourdarios
  • Clara Gaspar
  • Claudio Grandi
  • Concezio Bozzi
  • Costin Grigoras
  • Dagmar Adamova
  • Dainius Simelevicius
  • Damian Dechev
  • Daniel Crawford
  • Daniel S. Katz
  • Daniela Bauer
  • Daniele Cesini
  • Daniele Spiga
  • Danilo Piparo
  • Dario Berzano
  • Dave Dykstra
  • David Cameron
  • David Colling
  • David Crooks
  • David Dossett
  • David Groep
  • David Kelsey
  • David Lange
  • David Lawrence
  • David Martin Clavo
  • David Rohr
  • David Rousseau
  • David Schultz
  • David Smith
  • Dennis Klein
  • Diego Da Silva Gomes
  • Dimitri Bourilkov
  • Diogo Di Calafiori
  • Dirk Duellmann
  • Dirk Hufnagel
  • Dirk Hutter
  • Dmitry Popov
  • Dmytro Kresan
  • Domenico Giordano
  • Dominik Muller
  • Dorin-Daniel Lobontu
  • Doris Ressmann
  • Duncan Rand
  • Edgar Fajardo Hernandez
  • Edoardo Martelli
  • Eduardo Rodrigues
  • Edward Karavakis
  • Edward Moyse
  • Elizabeth Gallas
  • Elizabeth Sexton-Kennedy
  • Emanouil Atanassov
  • Emanuel Gouveia
  • Enric Tejedor Saavedra
  • Enrico Fasanelli
  • Enrico Fattibene
  • Enrico Gamberini
  • Enrico Guiraud
  • Eric Lancon
  • Eric Vaandering
  • Esteban Fullana Torregrosa
  • Fabian Lambert
  • Fabio Hernandez
  • Fabrizio Furano
  • FaHui Lin
  • Federica Legger
  • Federico Carminati
  • Federico Stagni
  • Felice Pantaleo
  • Fernando Abudinen
  • Fernando Harald Barreiro Megino
  • Filippo Costa
  • Fons Rademakers
  • Francesco Di Capua
  • Francesco Tenchini
  • Frank Berghaus
  • Frank Gaede
  • Frank Winklmeier
  • Gancho Dimitrov
  • Gang Chen
  • Gene Van Buren
  • Gennady Ososkov
  • Geonmo Ryu
  • Gerhard Raven
  • German Cancio Melia
  • Ghita Rahal
  • Giacomo Cucciati
  • Gianfranco Sciacca
  • Gianpaolo Carlino
  • Gilles Grasseau
  • Gioacchino Vino
  • Giulio Eulisse
  • Giuseppe Avolio
  • Gleyzer Sergei
  • Gonzalo Martínez Lema
  • Gordon Watts
  • Graeme Stewart
  • Grigori Rybkin
  • Grigorii Latyshev
  • Grigory Kozlov
  • Grzegorz Jereczek
  • Guilherme Amadio
  • Guy Barrand
  • Gvozden Neskovic
  • Hadrien Benjamin Grasland
  • Hannah Short
  • Hannes Sakulin
  • Hasib Muhammad
  • Helge Meinhard
  • Henry Fredrick Schreiner
  • Herve Rousseau
  • Holger Schulz
  • Hristo Mohamed
  • Hugo Gonzalez Labrador
  • Håvard Helstrup
  • Ian Bird
  • Ian Collier
  • Ian Fisk
  • Ifan Williams
  • Ignacio Asensi Tortajada
  • Ignacio Heredia
  • Igor Pelevanyuk
  • Igor Soloviev
  • Ilaria Vai
  • Illya Shapoval
  • Imma Riu
  • Irina Filozova
  • Ivan Glushkov
  • Ivan Kadochnikov
  • Ivan Sokolov
  • Ivana Hrivnacova
  • Jack Cranshaw
  • Jakob Blomer
  • James Catmore
  • James Letts
  • Jan Erik Sundermann
  • Jan Knedlik
  • Jan Strube
  • Jana Schaarschmidt
  • Janusz Martyniak
  • Jaroslava Schovancova
  • Javier Montejo Berlingen
  • Jean-Roch Vlimant
  • Jean-Yves Le Meur
  • Jeff Dost
  • Jeff Porter
  • Jeff Templon
  • Jennifer Ngadiuba
  • Jerome Lauret
  • Jerome Odier
  • Jeronimо Bezerra
  • Jessica Stietzel
  • Jiaheng Zou
  • Jim Pivarski
  • Joao Vitor Viana Barbosa
  • Joel Closier
  • Johan Bregeon
  • Johannes Elmsheuser
  • Johannes Lehrbach
  • Jose Castro Leon
  • Jose Hernandez
  • Josep Flix Molina
  • João Fernandes
  • Julia Andreeva
  • Julian Wojciech Myrcha
  • Julie Kirk
  • Julien Leduc
  • Juraj Smiesko
  • Jurry de la Mar
  • Ka Vang Tsang
  • Kai Leffhalm
  • Karol Hennessy
  • Katarzyna Maria Dziedziniewicz-Wojcik
  • Kenyi Hurtado Anampa
  • Kevin Patrick Lannon
  • Kevin Pedro
  • Khalil Bouaouda
  • Khalil Chawoshi
  • Kilian Schwarz
  • Kim Albertsson
  • Konstantin Gertsenberger
  • Konstantin Skarbuskas
  • Kosuke Takeda
  • Kristin Fanebust Hetland
  • Kurt Rinnert
  • Kyle Knoepfel
  • Latchezar Betev
  • Laura Promberger
  • Laurent Aphecetche
  • LeManuel Bitsoi
  • Leo Piilonen
  • Li Haibo
  • Lindsey Gray
  • Lorenzo Moneta
  • Lorenzo Rinaldi
  • Lotta Mether
  • Luca dell'Agnello
  • Luca Perrozzi
  • Lucia Grillo
  • Luis Granado Cardoso
  • Luisa Arrabito
  • Lukas Heinrich
  • Lukasz Dutka
  • Lynn Wood
  • Maarten Litmaath
  • Maciej Maciejewski
  • Maciej Pawel Szymanski
  • Maiken Pedersen
  • Maksym Zyzak
  • Malachi Schram
  • Mantas Stankevicius
  • Manuel Giffels
  • Manuel Jesus Rodriguez
  • Marcel Andre Schneider
  • Marco Canaparo
  • Marco Cattaneo
  • Marco Clemencic
  • Marco Mascheroni
  • Marcus Ebert
  • Maria Girone
  • Marica Antonacci
  • Marilena Bandieramonte
  • Marina Sahakyan
  • Mario Lassnig
  • Mark Neubauer
  • Marko Petric
  • Markus Frank
  • Martin Adam
  • Martin Barisits
  • Martin Gasthuber
  • Martin Ritter
  • Martin Sevior
  • Mateusz Jacek Goncerz
  • Matevz Tadel
  • Matteo Concas
  • Matteo Cremonesi
  • Matteo Rama
  • Matthias Jochen Schnepf
  • Matthias Pink
  • Matthias Richter
  • Matthieu Marinangeli
  • Mauro Verzetti
  • Max Fischer
  • Maxim Borisyak
  • Maxim Potekhin
  • Mayank Sharma
  • Michael Andrews
  • Michael Bender
  • Michael Davis
  • Michael Papenbrock
  • Michael Russell
  • Michal Orzechowski
  • Michal Svatos
  • Michel Jouvin
  • Miguel Martinez Pedreira
  • Mihaela Gheata
  • Mikhail Hushchyn
  • Mikhail Titov
  • Milos Lokajicek
  • Mirena Paneva
  • Mohammad Al-Turany
  • Monica Rush
  • Moritz Kiehn
  • Narine Manukyan
  • Nico Madysa
  • Nicola Tarocco
  • Nicolas Liampotis
  • Nicolo Magini
  • Nikita Balashov
  • Nikolay Voytishin
  • Nikos Kasioumis
  • Oksana Shadura
  • Oleg Samoylov
  • Oliver Lantwin
  • Ondrej Subrt
  • Oxana Smirnova
  • Pablo Llopis Sanmillan
  • Pablo Martin Zamora
  • Panos Paparrigopoulos
  • Paolo Calafiura
  • Pardi Silvio
  • Patricia Mendez Lorenzo
  • Patrick Fuhrmann
  • Patrick Meade
  • Patrick Robbe
  • Paul James Laycock
  • Paul Millar
  • Paul Seyfert
  • Pavel Goncharov
  • Pavlo Svirin
  • Pedro Ferreira
  • Pekko Metsä
  • Pere Mato
  • Peter Clarke
  • Peter Elmer
  • Peter Hristov
  • Peter Love
  • Peter Onyisi
  • Petr Fedchenkov
  • Petya Vasileva
  • Philipp Sitzmann
  • Philippe Canal
  • Philippe Charpentier
  • Pier Paolo Ricci
  • Placido Fernandez Declara
  • Prasanth Kothuri
  • Predrag Buncic
  • Preslav Konstantinov
  • Qiumei Ma
  • Radu Popescu
  • Rafal Grzymkowski
  • Ralf Spiwoks
  • Ralf Vamosi
  • Ran Du
  • Randall Sobie
  • Raul Lopes
  • Raymond Ehlers
  • Remi Ete
  • Remi Mommsen
  • Rene Caspart
  • Richard Cavanaugh
  • Richard Hughes-Jones
  • Rob Appleyard
  • Robert Andrew Currie
  • Rodney Walker
  • Roel Aaij
  • Rok Pestotnik
  • Roland Kunz
  • Rolf Seuster
  • Rosario Esposito
  • Rosen Matev
  • Rosie Bolton
  • Ryan Taylor
  • Ryosuke Itoh
  • Samuel Skipsey
  • Sandro Wenzel
  • Sang Un Ahn
  • Santiago Gonzalez De La Hoz
  • Scott Rowan
  • Scott Snyder
  • Sean Murray
  • Sebastian Merkt
  • Sebastien Binet
  • Sebastien Gadrat
  • Sebastien Ponce
  • Semen Lebedev
  • Sergey Gorbunov
  • Sergey Linev
  • Sergio Traldi
  • Serguei Kolos
  • Sharad Agarwal
  • Shawn Mc Kee
  • Shota Hayashida
  • Sibel Yasar
  • Simon Blyth
  • Simon C. Lin
  • Simon Fayer
  • Simone Campana
  • Simone Sottocornola
  • Sioni Paris Summers
  • Slava Krutelyov
  • Sofia Vallecorsa
  • Stefan Roiser
  • Stefan Stancu
  • Stefan-Gabriel Chitic
  • Stefano Bagnasco
  • Stefano Spataro
  • Steven Farrell
  • Steven Goldfarb
  • Stewart Martin-Haugh
  • Storetvedt Maksim
  • Sören Fleischer
  • Tadashi Maeno
  • Tadashi Murakami
  • Tadeas Bilka
  • Tadej Novak
  • Tai Sakuma
  • Tao Lin
  • Tatiana Strizh
  • Teng Li
  • Teo Mrnjavac
  • Thomas Charman
  • Thomas Hartmann
  • Thomas Hauth
  • Thomas Kress
  • Thomas Kuhr
  • Thomas Maier
  • Thomas Vuillaume
  • Thorsten Kollegger
  • Tian Yan
  • Tibor Simko
  • Tigran Mkrtchyan
  • Tim Hallyburton
  • Tobias Stockmanns
  • Tobias Wegner
  • Todd Seiss
  • Tomas Lindén
  • Tomasz Piotr Trzcinski
  • Tommaso Boccali
  • Tomoaki Nakamura
  • Tony Cass
  • Torre Wenaus
  • Toru Sugitate
  • Traynor Daniel
  • Ulf Behrens
  • V Daniel Elvira
  • Valentin Kuznetsov
  • Valentina Akishina
  • Valerio Formato
  • Valery Mitsyn
  • Vardan Gyurjyan
  • Vasco Chibante Barroso
  • Vasil Georgiev Vasilev
  • Vassil Verguilov
  • Victor Estrade
  • Vidar Tonaas Fauske
  • Vincente Beisner
  • Vincenzo Capone
  • Virginia Azzolini
  • Vito Di Benedetto
  • Vladimir Ivantchenko
  • Vladimir Korenkov
  • Volker Friese
  • Volodimir Begy
  • Volodymyr Yurchenko
  • Wahid Bhimji
  • Walter Lampl
  • Waseem Kamleh
  • Wei Yang
  • Wenjing Wu
  • William Kalderon
  • William Leight
  • Wojciech Jan Krzemien
  • Xiaobin Ji
  • Xiaoguang Yue
  • Yao Zhang
  • Yaodong Cheng
  • Yuji Kato
  • Yuka Takahashi
  • Zach Marshall
  • Zbigniew Baranowski
  • Zhang Xiaomei
  • Zhechka Toteva
  • Ziyan Deng
    • 08:00
      Registration

      National Palace of Culture

    • Plenary: S1 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      • 1
      • 2
        Integrated HPC Infrastructure for the Research Communities in South-East Europe

        The region of South-East Europe has a long history of successful collaboration in sharing resources and managing distributed electronic infrastructures for the needs of research communities. HPC resources such as supercomputers and large clusters with low-latency interconnects are especially valuable and scarce in the region. Building upon the successfully tested operational and policy models, the VI-SEEM project is focusing on establishing an integrated Virtual Research Environment that comprises not only HPC but also Grid, Cloud and Data management resources. By providing a common Authentication, Authorisation and Accounting system, the VI-SEEM infrastructure simplifies the access to resources for research groups with complex workflows. Because of the diversity of hardware technologies chosen by the different partners, the integrated infrastructure is able to fulfil widely different needs for HPC resources. Applications outside the initial target user communities (Climatology, Life Sciences and Digital Cultural Heritage) are also accepted through open calls with a peer review process. Motivated by the challenges that the individual nations in the region face in providing HPC resources for their researchers, the integrated VI-SEEM infrastructure achieves optimisation of the use of the available resources and retention of the human capital of the region.

        Speaker: Emanouil Atanassov (Unknown)
      • 3
        HNSciCloud, a Hybrid Cloud for Science

        Helix Nebula Science Cloud (HNSciCloud) has developed a hybrid cloud platform that links together commercial cloud service providers and research organisations’ in-house IT resources via the GEANT network.
        The platform offers data management capabilities with transparent data access, where applications can be deployed without modification on both sides of the hybrid cloud, and compute services accessible via the eduGAIN and ELIXIR federated identity and access management systems. In addition, it provides support services, account management facilities, full documentation and training.
        The cloud services are being tested by a group of 10 research organisations from across Europe (http://www.hnscicloud.eu/partner-type/buyers ) against the needs of use-cases (http://www.hnscicloud.eu/hnscicloud-user-groups ) from 7 ESFRI research infrastructures:

        • Euro-BioImaging: European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences
        • CTA: Cherenkov Telescope Array
        • BBMRI: Biobanking and BioMolecular resources Research Infrastructure
        • ELIXIR: A distributed infrastructure for life-science information
        • ESRF Upgrades: Extremely Brilliant Source
        • European-XFEL: European X-Ray Free-Electron Laser Facility
        • HL-LHC: High-Luminosity Large Hadron Collider

        The capacity procured by the 10 research organisations from the commercial cloud service providers to support these use-cases during 2018 exceeds 20,000 cores and 2 petabytes of storage with a network bandwidth of 40Gbps.
        All the services are based on open source implementations that do not require licenses in order to be deployed on the in-house IT resources of research organisations connected to the hybrid platform.
        An early adopter scheme has been put in place so that more research organisations can connect to the platform and procure additional capacity to support their research programmes.

        This contribution will give an overview of the status of the project, go through the current findings and provide an outlook on the future beyond HNSciCloud.

        Speaker: João Fernandes (CERN)
      • 4
        HNSciCloud – Large-scale data processing and HPC for science with T-Systems hybrid cloud

        As the result of joint R&D work with 10 of Europe’s leading public research organisations, led by CERN and funded by the EU, T-Systems provides a hybrid cloud solution, enabling science users to seamlessly extend their existing e-Infrastructures with one of the leading European public cloud services based on OpenStack – the Open Telekom Cloud.
        With this new approach large-scale data-intensive and HPC-type scientific use cases can now be run more dynamically, reaping the benefits of the on-demand availability of commercial cloud services at attractive costs.
        Over the course of the last year, the prototyping and piloting has confirmed that science users can get seamless, performant, secure and fully automated access to cloud resources over the GÉANT network, simplified by the identity federation with eduGAIN and Elixir AAI. Users can work in a cloud-native way, maintaining existing toolsets, or choose other OpenStack- and S3-compatible tools from a large and fast-growing community, e.g. Ansible and Terraform, to run and manage applications. Users remain in full control and have access to all native functions of the cloud resources, either through a web browser, APIs or the CLI. Cloud Management Platforms or Broker solutions are not needed, but may be added if further abstraction is required.
        The extensive service menu of Open Telekom Cloud – based on OpenStack – is opening up new functionality and performance for scientific use cases with built-in support for e.g. Docker, Kubernetes, MapReduce, Data Management, Data Warehouse and Data Ingestion services. The services can be combined with a wide range of compute and storage options. Compute can consist of any combination of containers, virtual, dedicated or bare metal servers. Server types can be optimized for disk-intensive, large-memory, HPC or GPU applications. The extensive network and security functions enable users to maintain a private and secure environment, whereby access to services can make full use of 10G networking.
        The keynote will give an overview of the performance and scale of use cases that have been successfully deployed. It will address how large-scale data can be processed at new performance levels with hundreds of containers and how data can be processed in an intelligent way by pre-fetching the data or leaving the data remote at the existing infrastructure, making use of the state-of-the-art Onedata Data Management solution from Cyfronet. Furthermore, the new, higher level of transparency and budget control that has been developed will be demonstrated. Details will be available at the T-Systems booth.
        Ten of Europe’s leading public research organisations led by CERN launched the Helix Nebula Science Cloud (HNSciCloud) Pre-Commercial Procurement to establish a European hybrid cloud platform that will support the high-performance, data-intensive scientific use-cases of this “Buyers Group” and of the research sector at large. The intention is to pave the way for the European Open Science Cloud, enabling more and better science.

        Speaker: Jurry de la Mar (T-Systems International GmbH)
      • 5
        HNSciCloud Multi-Cloud Approach – An open hybrid cloud for science

        Ten of Europe’s leading public research organisations led by CERN launched the Helix Nebula Science Cloud (HNSciCloud) Pre-Commercial Procurement to establish a European hybrid cloud platform that will support the high-performance, data-intensive scientific use-cases of this “Buyers Group” and of the research sector at large. It calls for the design and implementation of innovative Infrastructure as a Service (IaaS) solutions for compute, storage, network connectivity, Federated Identity Management and Service Payment Models, to augment & enhance the science community’s existing systems.

        The RHEA Group HNSciCloud Nuvla multi-cloud solution provides the European science community a means to access, via their existing academic network infrastructure, large scale commercial cloud resources (computing and storage) from multiple cloud providers securely and transparently. The Nuvla platform allows users to authenticate with their academic on-line credentials to manage compute resources and to access local and cloud-based data locally. Users can deploy Virtual Machines (VMs) and/or Containers to multiple clouds and monitor their usage and cloud performance. Nuvla’s brokering system means users (or the procuring authorities) can choose which cloud to use based on price, performance, location or other factors which are important to them. The Nuvla service is completely neutral with respect to cloud providers, so by using Nuvla, users can deploy their VMs and containers to any supported cloud. Users may have their own orchestration tools and can optionally deploy directly to the clouds of their choice by using their native APIs.

        Users signing in via Nuvla can still access the underlying cloud resources directly; such users still benefit from the data management, usage and performance monitoring offered by Nuvla, so that they maintain control while Nuvla helps them get the best value for money and performance from the resources to meet their needs. In the future, almost any other public/commercial cloud could be integrated provided it is connected to GÉANT and meets the minimum performance and reliability requirements, as demonstrated with the inclusion of the Advania cloud.

        As well as delivering processing and resource at scale via the GÉANT network, the platform supports the eduGAIN and Elixir AAI (SAML 2.0) identity federations, allowing users to access cloud resources via a web browser, Application Programming Interface (API) or Command Line Interface (CLI) – with access rights accorded by their unique identity. OneData allows data to be shared across multiple clouds as well as with local infrastructures. Organisations with their own internal cloud can also access that cloud via Nuvla allowing commercial and internal clouds to be managed from one interface. Nuvla is based on the SlipStream open-source software and Cyfronet’s Onedata extended within the HNSciCloud project to support the diverse scientific requirements of the Buyers Group with three European commercial clouds - Open Telekom, Exoscale and Advania – that provide secure, high-performance cloud resources, including High-Performance Computing (HPC) as a service and access to Graphical Processing Units (GPU).

        Speaker: Mr Alastair Pidgeon (RHEA System S.A.)
    • 10:30
      Break
    • T1 - Online computing: S1 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Adriana Telesca (CERN)
      • 6
        The ToolDAQ DAQ software Framework and its use in the Hyper-K, E61 & ANNIE detectors

        Data AQuisition (DAQ) systems are a vital component of every experiment. The purpose of the underlying software of these systems is to coordinate all the hardware components and detector states, providing the means of data readout, triggering, online processing, persistence, user control and the routing of data. These tasks are made more challenging when also considering fault tolerance, scalability, computing and data distribution, tolerance to hardware changes and usability.
        We will present ToolDAQ, a C++ DAQ framework which has been designed to be easy and fast to adapt in a modular and simple way. It has many features, such as built-in service discovery, dynamic reconfiguration, remote control/monitoring via web and terminal interfaces, and a highly scalable, fault-tolerant network communication infrastructure built on ZMQ. It is also compatible with both newer and older hardware, being very lightweight and with few dependencies.
        The framework is currently in use on the ANNIE experiment at Fermilab and has been used to develop the DAQ for Hyper-Kamiokande and the E61 intermediate water Cherenkov detector at J-PARC. We will report on the experience gained in adapting and running the framework for the ANNIE experiment, as well as on the Hyper-Kamiokande and E61 developments and use cases.
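
        As an illustration of the kind of ZeroMQ-based communication such a framework builds on, the sketch below shows a minimal publish/subscribe pair in Python (pyzmq). It is purely illustrative and does not use the ToolDAQ API itself; the endpoints and message topics are hypothetical.

          import zmq

          def readout_publisher(endpoint="tcp://*:5556"):
              """Publish raw event messages, e.g. from a readout Tool."""
              ctx = zmq.Context.instance()
              pub = ctx.socket(zmq.PUB)
              pub.bind(endpoint)
              for event_id in range(10):
                  pub.send_multipart([b"raw", str(event_id).encode()])

          def monitor_subscriber(endpoint="tcp://localhost:5556"):
              """Subscribe to the raw stream, e.g. an online-monitoring Tool."""
              ctx = zmq.Context.instance()
              sub = ctx.socket(zmq.SUB)
              sub.connect(endpoint)
              sub.setsockopt(zmq.SUBSCRIBE, b"raw")
              topic, payload = sub.recv_multipart()
              print("received event", payload.decode())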

        Speaker: Dr Benjamin Richards (Queen Mary University London)
      • 7
        The Continuously Running iFDAQ of the COMPASS Experiment

        Recently, the stability of the Data Acquisition System (DAQ) has become a vital precondition for successful data taking in high energy physics experiments. The intelligent, FPGA-based Data Acquisition System (iFDAQ) of the COMPASS experiment at CERN is designed to read out data at the maximum rate of the experiment and to run without any stops. DAQ systems fulfilling such requirements reach efficiencies of up to 99%. The iFDAQ runs nonstop 24/7, regardless of nights, weekends or bank holidays, for most of the calendar year. This puts stress on the reliability and robustness of the system, since every undesirable interruption of data taking results in a possible loss of physics data. To improve the iFDAQ stability, the communication library DIALOG for inter-process communication has been implemented, and the DAQ Debugger has been developed for error detection and incorporated into all iFDAQ processes. Moreover, the continuously running mode allows data to be collected in runs with 200 spills without any further user intervention. Such a mode affects all processes of the iFDAQ, with high emphasis on reliability and precise synchronization, and it helped to collect more physics data in the 2017 run. In the paper, we present the continuously running mode in more detail and discuss the overall iFDAQ stability.

        Speaker: Ondrej Subrt (Czech Technical University (CZ))
      • 8
        The Data acquisition system of the NA62 experiment at CERN

        The NA62 experiment looks for the extremely rare kaon decay K+ -> pi+ nu nubar and aims at measuring its branching ratio with a 10% accuracy.
        In order to do so, a very high intensity secondary beam from the CERN SPS is used to produce charged kaons, whose decay products are detected by many detectors installed along a 150 m decay region.

        The NA62 Data Acquisition system exploits a multilevel trigger system: following a L0 trigger decision, the 1 MHz data flux from about 60 sources is read by a PC farm, the partial event is built and then passed through a series of L1 algorithms to further reduce the trigger rate. Events passing this level are completed with the missing, larger data sources (~400 units) at a rate of 100 kHz.

        The DAQ is built around a high performance Ethernet network interconnecting the detectors to a farm of 30 PCs. After an overall description of the system design and of the main implementation choices that allowed the required performance and functionality to be reached, this paper describes the overall behaviour of the DAQ in the 2017 data taking period, which was the first year of high intensity and high efficiency data taking for the experiment. It then concludes with an outlook on possible improvements and upgrades that may be applied to the system in the future.
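
        The staged reduction described above can be summarised schematically as in the Python sketch below; this is not NA62 code, and the detector names and selection are placeholders.

          def build_partial_event(fast_fragments):
              """Assemble the ~60 'fast' sources read out after a L0 accept (1 MHz)."""
              return {"fragments": list(fast_fragments)}

          def l1_accept(event):
              """Stand-in for the chain of L1 software algorithms reducing the rate."""
              return len(event["fragments"]) >= 2          # dummy selection

          def complete_event(event, large_fragments):
              """Add the remaining ~400 larger sources for accepted events (~100 kHz)."""
              event["fragments"] += list(large_fragments)
              return event

          partial = build_partial_event(["CHOD", "RICH"])
          if l1_accept(partial):
              full = complete_event(partial, ["LKr-%d" % i for i in range(4)])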

        Speaker: Mr Marco Boretto (CERN)
      • 9
        Experience with Shifter Assistant: an intelligent tool to help operations of ATLAS TDAQ system in LHC Run 2

        The Trigger and DAQ (TDAQ) system of the ATLAS experiment is a complex
        distributed computing system, composed of O(30000) applications
        running on a farm of computers. The system is operated by a crew of
        operators on shift. An important aspect of operations is to minimize
        the downtime of the system caused by runtime failures, such as human
        errors, unawareness, miscommunication, etc.

        The paper describes recent developments in one of the “intelligent” TDAQ
        frameworks, the Shifter Assistant (SA), and summarizes the experience
        of its use in operations of ATLAS in the course of LHC Run 2.

        SA is a framework whose main aim is to automate routine system
        checks, error detection and diagnosis, event correlation, etc., in
        order to help the operators react to runtime problems promptly and
        effectively. The tool is based on CEP (Complex Event Processing)
        technology. It constantly processes the stream of operational events
        (O(100kHz)) over a set of “directives” (or rules) in the knowledge
        base, producing human-oriented alerts and making shifters aware of
        operational issues.

        More than 200 directives were developed by TDAQ and detector experts
        for different domains. In this paper we also describe different types
        of directives and present examples of the most interesting ones,
        demonstrating the power of CEP for this type of application.
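
        To make the idea of a directive concrete, the following sketch shows a CEP-style rule in Python that correlates operational events over a time window and emits an alert. It is only illustrative; the real Shifter Assistant directives are written for a dedicated CEP engine, and the event names and thresholds here are hypothetical.

          from collections import deque
          import time

          def busy_directive(event_stream, threshold=5, window_s=10.0):
              """Alert if more than `threshold` BUSY events arrive within `window_s` seconds."""
              recent = deque()
              for ts, name, source in event_stream:        # (timestamp, event name, source)
                  if name != "ROS_BUSY":
                      continue
                  recent.append(ts)
                  while recent and ts - recent[0] > window_s:
                      recent.popleft()                      # drop events outside the window
                  if len(recent) > threshold:
                      yield f"ALERT: {len(recent)} BUSY events from {source} in {window_s}s"

          # example usage with a synthetic stream of operational events
          stream = [(time.time() + 0.1 * i, "ROS_BUSY", "ros-42") for i in range(8)]
          for alert in busy_directive(iter(stream)):
              print(alert)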

        Speaker: Andrei Kazarov (Petersburg Nuclear Physics Institut (RU))
      • 10
        Operational experience with the new CMS DAQ expert

        The data acquisition (DAQ) system of the Compact Muon Solenoid (CMS) at CERN reads out the detector at the level-1 trigger accept rate of 100 kHz, assembles events with a bandwidth of 200 GB/s, provides these events to the high-level trigger running on a farm of 26000 cores and records the accepted events. Comprising custom-built and cutting-edge commercial hardware and several thousand instances of software applications, the DAQ system is complex in itself and failures cannot be completely excluded. Moreover, problems in the detectors, in the first level trigger system or in the high level trigger may provoke anomalous behaviour of the DAQ system which sometimes cannot easily be differentiated from a problem in the DAQ system itself. In order to achieve high data taking efficiency with operators from the entire collaboration and without relying too heavily on the on-call experts, an expert system, the DAQ Expert, has been developed that can pinpoint the source of most failures and give advice to the shift crew on how to recover in the quickest way. The DAQ Expert constantly analyzes monitoring data from the DAQ system and the high level trigger by making use of logic modules written in Java that encapsulate the expert knowledge about potential operational problems. The results of the reasoning are presented to the operator in a web-based dashboard, may trigger sound alerts in the control room and are archived for post-mortem analysis - presented in a web-based timeline browser. We present the design of the DAQ Expert and report on the operational experience since 2017, when it was first put into production.

        Speaker: Dr Hannes Sakulin (CERN)
      • 11
        Software-based data acquisition system for Level-1 end-cap muon trigger in ATLAS Run-3

        In 2019, the ATLAS experiment at CERN is planning an upgrade
        in order to cope with the higher luminosity requirements. In this
        upgrade, the installation of the new muon chambers for the end-cap
        muon system will be carried out. Muon track reconstruction performance
        can be improved, and fake triggers can be reduced. It is also
        necessary to develop a readout system for the trigger data of the Level-1
        end-cap muon trigger.

        We have decided to develop a software-based data acquisition system.
        Therefore, we have implemented SiTCP technology, which connects an FPGA
        to the network, on the FPGAs of the new trigger processor boards.

        Due to this implementation, this new DAQ system can take advantage of
        the latest developments in the computing industry. The new readout system
        architecture is based on multi-process software and can assemble
        events at a rate of 100 kHz. For data collection, a 10 Gbit Ethernet
        network switch is used. Moreover, we have optimized these processes to
        send data to the downstream system without any error, so the
        built events can be sent with an average throughput of ~211 Mbps.

        Our newly developed readout system is very generic, flexible for
        modifications and extensions, and easy to debug. In this talk, we will
        present the details of the new software-based DAQ system and report
        the development status for ATLAS Run-3.
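
        The sketch below illustrates, in Python and with a hypothetical fragment format, the kind of software event building described above: fragments arriving over TCP from the trigger processor boards are grouped by Level-1 event ID and yielded once all boards have reported.

          import socket
          import struct
          from collections import defaultdict

          HEADER = struct.Struct("!II")   # (l1_event_id, payload_length); format is hypothetical

          def assemble_events(connections):
              """Group fragments by L1 event ID; yield an event when all boards have reported."""
              pending = defaultdict(list)
              n_boards = len(connections)
              while True:
                  for conn in connections:        # sockets connected to the readout boards
                      header = conn.recv(HEADER.size, socket.MSG_WAITALL)
                      l1_id, length = HEADER.unpack(header)
                      pending[l1_id].append(conn.recv(length, socket.MSG_WAITALL))
                      if len(pending[l1_id]) == n_boards:
                          yield l1_id, b"".join(pending.pop(l1_id))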

        Speaker: Kosuke Takeda (Kobe University (JP))
    • T2 - Offline computing: S1 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Victor Daniel Elvira (Fermi National Accelerator Lab. (US))
      • 12
        A fully parametric option in the LHCb simulation framework

        Faster alternatives to a full, GEANT4-based simulation are being pursued within the LHCb experiment. In this context the integration of the Delphes toolkit in the LHCb simulation framework is intended to provide a fully parameterized option.
        Delphes is a modular software package designed for general-purpose experiments such as ATLAS and CMS to quickly propagate stable particles using a parametric approach and to provide reconstructed physics objects as output. It includes propagation inside a magnetic field and parameterized responses for the tracking and muon systems as well as for the calorimeters. Particle energies are computed by smearing the initial visible particles' momenta according to the detector resolution.
        The present contribution illustrates the current status of the Delphes toolkit in Gauss, the LHCb simulation framework. In this integration the particle transport performed by GEANT4 and the subsequent mimicking of the detector response and reconstruction have been replaced with a parametric response of the various detector elements.
        The implementation required significant changes to Delphes itself to constrain the particle transport inside the detector acceptance and to match the LHCb dipole magnetic field. The configuration of the various parameterizations of resolution and efficiency is also
        a major aspect of the work to provide a fully functional simulation. The output of the resulting fast simulation is formatted in such a way that it can be used in the LHCb physics analysis framework.
        An outline of the use of Delphes to evaluate future detector upgrade options is also given.
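
        A minimal illustration of the parametric approach, with made-up resolution numbers rather than the LHCb parameterisation: instead of transporting particles through a detailed geometry, the visible quantities are smeared according to an assumed resolution.

          import random

          def smear_momentum(p_gev, sigma_rel=0.005):
              """Return a 'reconstructed' momentum drawn around the true value."""
              return random.gauss(p_gev, sigma_rel * p_gev)

          def smear_calo_energy(e_gev, stochastic=0.10, constant=0.01):
              """Calorimeter-like resolution: sigma/E = a/sqrt(E) (+) b, added in quadrature."""
              sigma = e_gev * ((stochastic / e_gev ** 0.5) ** 2 + constant ** 2) ** 0.5
              return max(0.0, random.gauss(e_gev, sigma))

          print(smear_momentum(50.0), smear_calo_energy(50.0))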

        Speaker: Benedetto Gianluca Siddi (Universita di Ferrara & INFN (IT))
      • 13
        Improved Fast Calorimeter Simulation in ATLAS

        ATLAS relies on very large samples of simulated events for delivering high-quality
        and competitive physics results, but producing these samples takes a long time and
        is very CPU intensive when using the full GEANT4 detector simulation.
        Fast simulation tools are a useful way of reducing CPU requirements when detailed
        detector simulations are not needed. During the LHC Runs 1 and 2, a fast
        calorimeter simulation (FastCaloSim) was successfully used in ATLAS.
        FastCaloSim provides a simulation of the particle energy response at the calorimeter
        read-out cell level, taking into account the detailed particle shower shapes and the
        correlations between the energy depositions in the various calorimeter layers. It is
        interfaced to the standard ATLAS digitization and reconstruction software, and it can
        be tuned to data more easily than GEANT4.
        Now an improved version of FastCaloSim is in development, incorporating the experience
        with the version used during Run-1. The new FastCaloSim makes use of machine
        learning techniques, such as principal component analysis and neural networks, to
        optimise the amount of information stored in the ATLAS simulation infrastructure. This
        allows for further performance improvement by reducing the I/O time and the memory
        usage during the simulation job.
        A prototype is being tested and validated, and it has shown significant improvements in the
        description of cluster level variables in electromagnetic and hadronic showers. ATLAS
        plans to use this new FastCaloSim parameterization to simulate several billion events in the
        upcoming LHC runs.
        It will be combined with other fast tools used in the ATLAS production chain. In this
        Fast Chain the simulation, digitisation and reconstruction of the events are handled by fast
        tools. In this talk, we will describe this new FastCaloSim parametrisation and the current status
        of the ATLAS Fast Chain.
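
        A toy sketch of the two ingredients mentioned above, using synthetic data only: per-layer energy fractions are compressed with a principal component analysis, and a small neural network learns the compressed representation as a function of the particle energy. This is not the ATLAS parameterisation, just the idea.

          import numpy as np
          from sklearn.decomposition import PCA
          from sklearn.neural_network import MLPRegressor

          rng = np.random.default_rng(0)
          energies = rng.uniform(10.0, 200.0, size=(1000, 1))             # truth energy [GeV]
          layer_fractions = rng.dirichlet(alpha=[4, 8, 2, 1], size=1000)  # fake longitudinal profile

          pca = PCA(n_components=2).fit(layer_fractions)                  # compress the shower shape
          compressed = pca.transform(layer_fractions)

          model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
          model.fit(energies, compressed)                                 # learn shape vs energy

          # at simulation time: predict the compressed shape, then expand back to layers
          predicted_layers = pca.inverse_transform(model.predict([[100.0]]))
          print(predicted_layers)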

        Speaker: Jana Schaarschmidt (University of Washington (US))
      • 14
        Fast calorimeter simulation in the LHCb Gauss framework

        In HEP experiments, the CPU resources required by MC simulations are constantly growing and becoming a very large fraction of the total computing power (greater than 75%). At the same time, the pace of performance improvements delivered by technology is slowing down, so the only solution is a more efficient use of resources. Efforts are ongoing in the LHC experiment collaborations to provide multiple options for simulating events in a faster way when higher statistics are needed. A key to the success of this strategy is the possibility of enabling these fast simulation options in a common framework with minimal action by the final user.

        In this talk we will describe the solution adopted in Gauss, the LHCb simulation software framework, to selectively exclude particles from being simulated by the Geant4 toolkit and to insert the corresponding hits generated in a faster way. The approach, integrated within the Geant4 toolkit, has been applied to the LHCb calorimeter but it could also be used
        for other subdetectors. The hit generation can be carried out by any external tool, e.g. a static library of showers or more complex machine-learning techniques. A first implementation of the calorimeter hit generation will be described. Detailed timing measurements and a comparison with the default simulation for reference physical quantities will also be presented.

        Speaker: Matteo Rama (Universita & INFN Pisa (IT))
      • 15
        Generative Models for Fast Calorimeter Simulation: LHCb Case

        The goal of obtaining more precise physics results in current collider experiments drives the plans to significantly increase the instantaneous luminosity collected by the experiments. The increasing complexity of the events due to the resulting increased pileup requires new approaches to triggering, reconstruction, analysis,
        and event simulation. The last task leads to a critical problem: generating the significantly higher amount of Monte Carlo (MC) data required for the analysis of data collected at higher collider luminosity, without a drastic increase in computing resources, requires a significant speed-up of the simulation algorithms.
        The largest part of computer resources in simulation is currently spent in the detailed GEANT modeling of particles interacting with the material of the experimental apparatus, in particular the shower development in electromagnetic and hadronic calorimeters.
        To accelerate these computations we use an approach based on sample generation with generative models, which are nowadays widely used in computer vision and image processing.
        These models are based on maximizing the likelihood between the real data and the samples produced by a generator. The two main approaches to this problem are Generative Adversarial Networks (GANs), which take into account an explicit description of the real data, and Variational Autoencoders (VAEs), which use latent variables to describe it.
        In this contribution we present both approaches applied to the calorimeter simulation of the LHCb experiment at the LHC, discuss the advantages and possible problems of each approach, and compare the results.
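
        As an illustration of the adversarial setup (the VAE case is analogous, but with an encoder/decoder pair and a reconstruction plus KL loss), the minimal tensorflow.keras skeleton below generates flattened calorimeter "images" from a latent vector. The real study uses far more elaborate, conditioned architectures; the layer sizes here are arbitrary.

          import numpy as np
          from tensorflow.keras import layers, models

          LATENT, CELLS = 16, 30 * 30          # hypothetical latent size and cell count

          generator = models.Sequential([
              layers.Dense(128, activation="relu", input_shape=(LATENT,)),
              layers.Dense(CELLS, activation="relu"),      # non-negative energy deposits
          ])

          discriminator = models.Sequential([
              layers.Dense(128, activation="relu", input_shape=(CELLS,)),
              layers.Dense(1, activation="sigmoid"),       # real vs generated shower
          ])
          discriminator.compile(optimizer="adam", loss="binary_crossentropy")

          # adversarial stack: freeze the discriminator while training the generator
          discriminator.trainable = False
          gan = models.Sequential([generator, discriminator])
          gan.compile(optimizer="adam", loss="binary_crossentropy")

          noise = np.random.normal(size=(8, LATENT))
          fake_showers = generator.predict(noise)          # a training step would follow here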

        Speaker: Viktoriia Chekalina (Yandex School of Data Analysis (RU))
      • 16
        A Machine Learning tool for fast simulation

        Machine Learning techniques have been used in different applications by the HEP community: in this talk, we discuss the case of detector simulation. The amount of simulated events, expected in the future for LHC experiments and their High Luminosity upgrades, is increasing dramatically and requires new fast simulation solutions. We will describe an R&D activity, aimed at providing a configurable tool capable of training a neural network to reproduce the detector response and replace standard Monte Carlo simulation. This represents a generic approach in the sense that such a network could be designed and trained to simulate any kind of detector response. Eventually, it could be extended to replace the whole data processing chain in order to get, directly in one step, the final reconstructed quantities, in just a small fraction of time. We will present the first application of three-dimensional convolutional Generative Adversarial Networks to the simulation of high granularity electromagnetic calorimeters. We will describe detailed validation studies comparing our results to Geant4 Monte Carlo simulation, showing, in particular, the very good agreement we obtain for high level physics quantities (such as energy shower shapes) and detailed calorimeter response (single cell response). Finally we will show how this tool can easily be generalized to describe a larger class of calorimeters, opening the way to a generic machine learning based fast simulation approach.

        Speaker: Sofia Vallecorsa (Gangneung-Wonju National University (KR))
      • 17
        A scalable and asynchronous detector simulation system based on ALFA

        In the context of the common online-offline computing infrastructure for Run3 (ALICE-O2), ALICE is reorganizing its detector simulation software to be based on FairRoot, offering a common toolkit to implement simulation based on the Virtual-Monte-Carlo (VMC) scheme. Recently, FairRoot has been augmented by ALFA, a software framework developed in collaboration between ALICE and FAIR, offering portable building blocks to construct message-based and loosely-coupled multiprocessing systems.

        We will report here on the implementation of a scalable and asynchronous detector simulation system which is based on ALFA. The system offers parallelization at the primary-track level, going beyond the usual inter-event parallelism of Geant4-MT, and the possibility to asynchronously and simultaneously process simulation data for the purpose of digitization and clusterization. Core advantages of our implementation are an ideal reduction of the processing time per event as well as a reduction of the memory footprint, allowing us to make significantly better use of opportunistic resources, such as HPC backfills, than before. Moreover, the track-level parallelism opens up the interesting possibility to use different simulation engines (such as Geant4 and Fluka) concurrently, based on simple selection filters on the primary particles. The integration of fast MC processes, such as machine learning kernels running on a dedicated GPU, are a natural extension to the system.
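
        The following toy Python sketch only mimics the idea of track-level parallelism (primaries of one event distributed over worker processes, partial hits merged afterwards); the actual O2 implementation uses ALFA message queues and real transport engines rather than a multiprocessing pool, and the hit structure shown is invented.

          from multiprocessing import Pool

          def simulate_track(primary):
              """Stand-in for transporting one primary particle through the detector."""
              return [{"det": "TPC", "edep": 0.1 * primary["p"]}]   # fake hits

          def simulate_event(primaries, workers=4):
              with Pool(workers) as pool:
                  hit_lists = pool.map(simulate_track, primaries)   # track-level parallelism
              return [hit for hits in hit_lists for hit in hits]    # merge partial hits

          if __name__ == "__main__":
              print(simulate_event([{"p": 1.0}, {"p": 2.5}, {"p": 0.7}]))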

        Speaker: Sandro Christian Wenzel (CERN)
    • T3 - Distributed computing: Experiment Frameworks and HPC Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: David Cameron (University of Oslo (NO))
      • 18
        Experience running IceCube simulation workloads on the Titan supercomputer

        The IceCube Neutrino Observatory is a neutrino detector located at the South Pole. Here we present experiences acquired when using HTCondor to run IceCube’s GPU simulation workloads on the Titan supercomputer. Titan is a large supercomputer geared for High Performance Computing (HPC). Several factors make it challenging to use Titan for IceCube’s High Throughput Computing (HTC) workloads: (1) Titan is designed for MPI applications, (2) Titan scheduling policies heavily favor very large resource reservations, (3) Titan compute nodes run a customized version of Linux, (4) Titan compute nodes cannot access the outside network. In contrast, IceCube’s simulation workloads consist of large numbers of relatively small independent jobs intended to run in standard Linux environments, and may require connectivity to public networks. Here we present how we leveraged the HTCondor batch scheduler within Singularity containers to provide an HTC-friendly interface to Titan suitable for IceCube’s GPU workloads.
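
        For illustration, submitting such a GPU payload through the HTCondor Python bindings can look like the sketch below; the executable, image path and the way the Singularity image is requested are hypothetical, site-dependent details rather than the actual IceCube configuration.

          import htcondor

          submit = htcondor.Submit({
              "executable":        "run_photon_prop.sh",      # hypothetical payload script
              "arguments":         "--dataset 12345 --gpu",
              "request_gpus":      "1",
              "request_memory":    "4GB",
              "+SingularityImage": '"/cvmfs/icecube.opensciencegrid.org/containers/osg-el7.sif"',
              "output":            "job.$(ClusterId).out",
              "error":             "job.$(ClusterId).err",
              "log":               "job.log",
          })

          schedd = htcondor.Schedd()
          schedd.submit(submit, count=10)      # queue ten such jobs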

        Speaker: David Schultz (University of Wisconsin-Madison)
      • 19
        Production experience and performance for ATLAS data processing on a Cray XC-50 at CSCS

        Predictions of the requirements for LHC computing in Run 3 and Run 4 (HL-LHC) over the course of the next 10 years show a considerable gap between required and available resources, assuming budgets will globally remain flat at best. This will require some radical changes to the computing models for the data processing of the LHC experiments. The use of large-scale computational resources at HPC centres worldwide is expected to increase substantially the cost-efficiency of the processing. In order to pave the path towards the HL-LHC data processing, the Swiss Institute of Particle Physics (CHIPP) has taken the strategic decision to migrate the processing of all the Tier-2 workloads for ATLAS and other LHC experiments from a dedicated x86_64 cluster that has been in continuous operation and evolution since 2007, to Piz Daint, the current European flagship HPC system, which ranks third in the TOP500 at the time of writing. We report on the technical challenges and solutions adopted to migrate to Piz Daint, and on the experience and measured performance for ATLAS in over one year of running in production.

        Speaker: Gianfranco Sciacca
      • 20
        Enabling production HEP workflows on Supercomputers at NERSC

        Many HEP experiments are moving beyond experimental studies to making large-scale production use of HPC resources at NERSC, including the Knights Landing architecture on the Cori supercomputer. These include ATLAS, ALICE, Belle II, CMS, LSST-DESC, and STAR among others. Achieving this has involved several different approaches and has required innovations on both the NERSC and experiment sides. We detail the approaches taken, comparing and contrasting the benefits and challenges. We also describe the innovations and improvements needed particularly in the areas of data transfer (via DTNs), containerization (via Shifter), I/O (via burst buffer, Lustre, or Shifter per-node-cache), scheduling (via developments in SLURM), workflow (via grid services or on-site engines), databases, external networking from compute nodes (via a new approach to networking on Cray systems), and software delivery (via a new approach to CVMFS on Cray systems).
        We also outline plans, and initial development, for future support of experimental science workloads at NERSC, via a ‘Superfacility API’ that will provide a more common, plug-and-play base for such workflows, building on best practices to provide a lower bar of entry to HPC for new experiments as well as consistency and performance.

        Speaker: Wahid Bhimji (Lawrence Berkeley National Lab. (US))
      • 21
        BigPanDA Workflow Management on Titan for HENP and extreme scale applications

        The Titan supercomputer at Oak Ridge National Laboratory prioritizes the scheduling of large leadership class jobs, but even when the supercomputer is fully loaded and large jobs are standing in the queue to run, 10 percent of the machine remains available for a mix of smaller jobs, essentially ‘filling in the cracks’ between the very large jobs. Such utilisation of the computer resources is called “backfill”.
        Smaller scientific groups and data science experiments are able to conduct their computations on Titan in backfill mode. For example, simulations in high-energy physics don’t require large-scale MPI jobs on a supercomputer. PanDA was already used successfully as a universal job scheduler to optimize Titan utilization by scheduling and submitting ATLAS MC simulation jobs. The current R&D project is aimed at enabling the convergence of the HPC and HTC paradigms for a range of application types and communities in particle physics and beyond.
        In March 2017, we implemented a new PanDA server instance within ORNL operating under Red Hat OpenShift Origin - a powerful container cluster management and orchestration system - in order to serve various experiments at the Titan supercomputer. We implemented a set of demonstrations serving diverse scientific workflows, including LQCD and IceCube, biological studies of genes and the human brain, and molecular dynamics studies.

        Speaker: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 22
        PanDA and RADICAL-Pilot Integration: Enabling the Pilot Paradigm on HPC Resources

        PanDA executes millions of ATLAS jobs a month on Grid systems with more than
        300k cores. Currently, PanDA is compatible with only a few HPC resources due to
        different edge services and operational policies, does not implement the pilot
        paradigm on HPC, and does not dynamically optimize resource allocation among
        queues. We integrated the PanDA Harvester service and the RADICAL-Pilot (RP)
        system to overcome these disadvantages and enable the execution of ATLAS,
        Molecular Dynamics and other workflows on HPC resources.

        Harvester is a commonality layer which brings coherence to diverse HPC
        systems, providing integration with PanDA workflows at job and event level. RP
        is a pilot system capable of executing short/long-running single/many-core
        tasks on diverse HPC machines, supporting CPUs, GPUs, and multiple MPI
        implementations.

        We integrated Harvester and RP on Titan at ORNL, prototyping a Next Generation
        Executor (NGE) to expose RP capabilities and manage the execution of PanDA
        workflows. RP acquires Titan resources via queues and backfill capabilities
        and publishes the available resources to NGE. Harvester requests available
        resources and submits tasks for execution to NGE. NGE uses RP to execute those
        tasks, managing input and output staging, and holding the states of resources
        and tasks on a dedicated database.
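
        Purely schematic pseudocode for the Harvester-to-NGE hand-off described above; every class, method and number below is hypothetical and serves only to make the division of roles explicit.

          class NextGenerationExecutor:
              def available_resources(self):
                  """Resources RADICAL-Pilot has acquired from Titan's backfill (made up)."""
                  return {"cores": 1024, "walltime_min": 55}

              def submit(self, tasks):
                  print(f"executing {len(tasks)} tasks via RADICAL-Pilot")

          def harvester_cycle(nge, panda_tasks):
              """Harvester asks NGE what is available and hands over fitting tasks."""
              res = nge.available_resources()
              fitting = [t for t in panda_tasks if t["cores"] <= res["cores"]]
              nge.submit(fitting)

          harvester_cycle(NextGenerationExecutor(),
                          [{"name": "atlas_sim_evnt", "cores": 128}])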

        Speaker: Pavlo Svirin
      • 23
        #585 slot
    • T4 - Data handling: S1 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Convener: Tigran Mkrtchyan (DESY / A.Alikhanyan National Science Laboratory (AM))
      • 24
        JADE Long Term Archive

        IceCube is a cubic-kilometer neutrino detector located at the South Pole. Every year, 29 TB of data are transmitted via satellite, and 365 TB of data are shipped on archival media, to the data warehouse in Madison, WI, USA. The JADE Long Term Archive (JADE-LTA) software indexes and bundles IceCube files and transfers the archive bundles for long-term storage and preservation into tape silos at the collaborating institutions DESY and NERSC. The software and its features are discussed, including integration with Globus Online, and later migration to FTS, to manage data transfers.

        Speaker: Patrick Meade (University of Wisconsin-Madison)
      • 25
        CERN Tape Archive – from development to production deployment

        The first production version of the CERN Tape Archive (CTA) software is planned to be released by the end of 2018. CTA is designed to replace CASTOR as the CERN tape archive solution, in order to face the scalability and performance challenges arriving with LHC Run-3.

        This contribution will describe the main commonalities and differences of CTA with CASTOR. We outline the functional enhancements and integration steps required to add the CTA tape back-end to an EOS disk storage system. We present and discuss the different deployment and migration scenarios for replacing the five CASTOR instances at CERN, including a description of how FTS will interface with EOS and CTA.

        Speaker: Michael Davis (CERN)
      • 26
        dCache: storage for advanced scientific use-cases and beyond

        The dCache project provides open-source storage software deployed internationally to satisfy ever more demanding scientific storage requirements. Its multifaceted approach provides an integrated way of supporting different use-cases with the same storage, from high-throughput data ingest through wide access to easy integration with existing systems.

        In supporting new communities, such as photon science and microbiology, dCache is evolving to provide new features and access to new technologies.

        In this paper, we describe some of these recent features that facilitate the use of storage to maximise the gain from stored data, including quality-of-service management, support for distributed and federated systems, and improvements with support for parallel NFS (pNFS).

        Speaker: Tigran Mkrtchyan (DESY)
      • 27
        The GridKa Tape System: status and outlook

        Tape storage is still a cost-effective way to keep large amounts of data over a long period of time, and it is expected that this will continue in the future. The GridKa tape environment is a complex system of many hardware components and software layers. Configuring this system for optimal performance for all use cases is a non-trivial task and requires a lot of experience. We present the current status of the GridKa tape environment, report on recent upgrades and improvements, and describe plans to further develop and enhance the system, especially with regard to the future requirements of the HEP experiments for their large data centers. The short-term planning mainly includes the transition from TSM to HPSS as the backend and the effects on the connection of dCache and xrootd. Recent changes in the vendor situation for certain tape technologies require a precise analysis of the impact and a possible adaptation of the mid-term planning, in particular with respect to the scalability challenge that comes with HL-LHC on the horizon.

        Speaker: Dr Doris Ressmann (KIT)
      • 28
        The archive solution for distributed workflow management agents of the CMS experiment at LHC

        The CMS experiment at the CERN LHC developed the Workflow Management Archive system to persistently store unstructured framework job report documents produced by distributed workflow management agents. In this talk we present its architecture, implementation, deployment, and integration with the CMS and CERN computing infrastructures, such as central HDFS and Hadoop Spark cluster. The system leverages modern technologies such as a document oriented database and the Hadoop eco-system to provide the necessary flexibility in order to reliably process, store, and aggregate ~1M documents on a daily basis. We will discuss the data transformation, the short and long term storage layers, the query language, along with the aggregation pipeline developed to visualize various performance metrics to assist CMS data operators in assessing the performance of the CMS computing system.
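
        An illustrative PySpark snippet of the kind of daily aggregation described above; the HDFS paths and field names are hypothetical, not the actual FWJR schema.

          from pyspark.sql import SparkSession
          from pyspark.sql import functions as F

          spark = SparkSession.builder.appName("wmarchive-aggregation").getOrCreate()

          # read one day of framework job report documents (path is hypothetical)
          fwjr = spark.read.json("hdfs:///path/to/fwjr/2018/07/09/*.json")

          summary = (fwjr
                     .groupBy("task", "site", "exitCode")
                     .agg(F.count("*").alias("jobs"),
                          F.sum("cpuTimeSec").alias("cpu_seconds")))

          summary.write.mode("overwrite").parquet("hdfs:///path/to/summaries/2018-07-09")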

        Speaker: Valentin Y Kuznetsov (Cornell University (US))
      • 29
        WLCG space accounting in the SRM-less world

        The WLCG computing infrastructure provides distributed storage capacity hosted at the geographically dispersed computing sites.
        In order to effectively organize storage and processing of the LHC data, the LHC experiments require a reliable and complete overview of the storage capacity in terms of the occupied and free space, the storage shares allocated to different computing activities, and the possibility of detecting “dark” data that occupies space while being unknown to the experiment’s file catalog. The task of the WLCG space accounting activity is to provide such an overview and to assist the LHC experiments and WLCG operations in managing storage space and understanding future requirements.
        Several space accounting solutions developed by the LHC experiments are currently based on the Storage Resource Manager (SRM). In the coming years SRM will become an optional service for sites which do not provide tape storage. Moreover, already now some of the storage implementations do not provide an SRM interface. Therefore, the next generation of space accounting systems should not be based on SRM. In order to enable storage topology and space accounting information to be exposed in a common way, the Storage Resource Reporting (SRR) proposal has been agreed between the LHC experiments, sites and storage providers. This contribution describes the WLCG storage resource accounting system which is being developed based on the Storage Resource Reporting proposal.
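
        For illustration, the sketch below shows how a site's published JSON storage summary might be consumed by an accounting system; the URL, field names and layout are placeholders, since the actual schema is defined by the agreed SRR proposal.

        # Hypothetical consumer of a site's JSON storage summary (schema is illustrative).
        import json
        import urllib.request

        URL = "https://se.example.org:8443/static/srr/storagesummary.json"  # hypothetical endpoint

        with urllib.request.urlopen(URL) as resp:
            summary = json.loads(resp.read().decode())

        # Report used versus total space per storage share.
        for share in summary.get("storageservice", {}).get("storageshares", []):
            used, total = share.get("usedsize", 0), share.get("totalsize", 0)
            ratio = used / total if total else 0.0
            print(f"{share.get('name', '?'):20s} {used / 1e12:8.1f} TB / {total / 1e12:8.1f} TB ({ratio:.0%})")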

        Speaker: Julia Andreeva (CERN)
    • T5 - Software development: S1 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Patricia Mendez Lorenzo (CERN)
      • 30
        ALFA: ALICE-FAIR new message queuing based framework

        ALFA is a message queuing based framework for online and offline data processing. It is a flexible framework that supports an actor-based computational model and allows an experiment-defined data model to be implemented on top. The design of ALFA is modular, with separate layers for data transport, process management and process deployment. Although still under active development, ALFA is already being tested and used by different experiments in different stages of data processing, as it offers easy integration of heterogeneous hardware and software. Examples of ALFA usage in different stages of event processing will be presented: in detector read-out as well as in online reconstruction and in the purely offline world of detector simulation.

        Speaker: Mohammad Al-Turany (CERN)
      • 31
        The ALICE Analysis Framework for LHC Run 3

        The ALICE experiment at the LHC (CERN) is currently developing a new software framework designed for Run 3: detector and software will have to cope with Pb-Pb collision rates 100 times higher than today, leading to the combination of core Online-Offline operations into a single framework called O².
        The analysis code is expected to run on a few large Analysis Facilities of around 20k cores sustaining a 100 GB/s throughput: this requires a joint effort on the definition of the data format, the configuration of the Analysis Facilities and the development of the Analysis Framework.
        We present the prototype of a new Analysis Object Data format based on timeframes and optimized for continuous readout. Such a format is designed to be extensible and to be transported efficiently over the network. We also present the first iteration of the Analysis Framework, based on the O² Data Processing Layer and leveraging message passing across a topology of processes.
        We will also illustrate the implementation and benchmarking of a compatibility layer designed to mitigate the transition from the current event-oriented analysis model to the new time-oriented one.
        Finally, we will give a status report on the integration of the Analysis Framework and Analysis Facilities for Run 3 into our current organized analysis model.

        Speaker: Dr Dario Berzano (CERN)
      • 32
        From Catalogs to Cosmology: An Automated Weak Gravitational Lensing Analysis Pipeline

        Weak gravitational lensing is an extremely powerful probe for gaining insight into the nature of two of the greatest mysteries of the universe -- dark energy and dark matter. To help prepare for the massive amounts of data coming from next generation surveys like LSST that hope to advance our understanding of these mysteries, we have developed an automated and seamless weak lensing cosmic shear analysis pipeline that begins with survey catalogs and performs all the necessary analysis steps in order to derive the cosmological parameters at the end. This pipeline is designed to be generic enough to handle data from any galaxy survey and has been tested by re-analyzing existing datasets from four different surveys -- CFHTLens, DES-SV, KiDS-450, and DLS -- and comparing their results. It is flexible enough to allow running each pipeline step on distributed resources at different sites in order to optimally match the computational capabilities of each site to the unique requirements of each task. This talk will begin with a brief overview of the analysis performed by this pipeline and then delve into a discussion of the architecture and performance of an actual implementation using modern work-flow engines and description languages.

        Speaker: Eric Vaandering (Fermi National Accelerator Lab. (US))
      • 33
        Software development for the NICA experiments: MpdRoot & BmnRoot

        The software for detector simulation, reconstruction and analysis of physics data is an essential part of every high-energy physics experiment. A new generation of experiments in relativistic nuclear physics is expected to start up in the coming years at the Nuclotron-based Ion Collider facility (NICA), under construction at the Joint Institute for Nuclear Research in Dubna: the fixed-target experiment BM@N (Baryonic Matter at Nuclotron), whose technical runs started in 2015, and the future MPD (Multi-Purpose Detector) experiment on ion collisions, which will operate at the storage rings of the NICA facility. The event data model of the experiments is presented. The status of the software frameworks MpdRoot and BmnRoot, developed for MPD and BM@N, is reviewed. For these tasks many additional systems have been developed, such as a raw data converter, monitoring systems, an event display, databases and parallelization tools.

        Speaker: Konstantin Gertsenberger (Joint Institute for Nuclear Research (RU))
      • 34
        ATLAS Technical Coordination Expert System

        Technical details of the directly manipulated systems and the impact on non-obviously connected systems are required knowledge when preparing an intervention in a complex experiment like ATLAS. In order to improve the understanding of the parties involved in an intervention a rule-based expert system has been developed. On the one hand this helps to recognize dependencies that are not always evident and on the other hand it facilitates communication between experts with different backgrounds by translating domain specific vocabularies. To simulate an event this tool combines information from diverse areas such as detector control (DCS) and safety systems (DSS), gas, cooling, ventilation, and electricity distribution. The inference engine provides a fast response of the impacted systems that are connected at a low level although they belong to different domains. It also predicts the probability of failure for each of the components affected by an intervention. Risk assessment models considered are fault tree analysis and principal component analysis. The user interface is a web-based application that uses graphics and text to provide different views of the detector system adapted to the different user needs and to interpret the data.
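
        As a generic illustration of rule-based inference of this kind (not the actual ATLAS tool, its rule syntax or its systems), the sketch below forward-chains simple dependency rules to find which systems are impacted when one component is switched off; all names and rules are invented for the example.

        # Toy forward-chaining inference over invented dependency rules.
        # This only illustrates the idea of propagating an intervention's impact.
        RULES = [
            # (if this system is off, then these systems are impacted)
            ("cooling_station_A", ["rack_power_7", "rack_power_8"]),
            ("rack_power_7", ["dcs_crate_12"]),
            ("dcs_crate_12", ["pixel_readout_sector_3"]),
        ]

        def impacted(initial_off):
            """Return the closure of systems affected by switching off `initial_off`."""
            affected, frontier = set(), [initial_off]
            while frontier:
                current = frontier.pop()
                for cause, effects in RULES:
                    if cause == current:
                        for e in effects:
                            if e not in affected:
                                affected.add(e)
                                frontier.append(e)
            return affected

        print(impacted("cooling_station_A"))
        # -> {'rack_power_7', 'rack_power_8', 'dcs_crate_12', 'pixel_readout_sector_3'}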

        Speaker: Ignacio Asensi Tortajada (Univ. of Valencia and CSIC (ES))
      • 35
        The AMI (ATLAS Metadata Interface) 2.0 metadata ecosystem: new design principles and features

        AMI (ATLAS Metadata Interface) is a generic ecosystem for metadata aggregation, transformation and cataloguing. Benefiting from more than 15 years of feedback in the LHC context, the second major version was recently released. We describe the design choices and their benefits for providing high-level metadata-dedicated features. In particular, we focus on the implementation of the Metadata Querying Language (MQL), a domain-specific language that allows databases to be queried without knowing the relations between entities. Finally, we emphasise the assets of the ecosystem for deploying an AMI service and extending its functionalities.

        Speaker: Jerome Odier (IN2P3/CNRS (FR))
    • T6 - Machine learning and physics analysis: S1 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Conveners: Andrea Rizzi (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P), Mr Sergei Gleyser
      • 36
        Partnering with industry for machine learning at HL-LHC

        The High Luminosity LHC (HL-LHC) represents an unprecedented computing challenge. For the program to succeed the current estimates from the LHC experiments for the amount of processing and storage required are roughly 50 times more than are currently deployed. Although some of the increased capacity will be provided by technology improvements over time, the computing budget is expected to be flat and to close the gap huge gains in the efficiency for processing and analyzing events must be achieved. An area that has the potential for a significant breakthrough is Machine Learning. In recent years industry has invested heavily in both hardware and software to develop machine learning techniques to filter, process, analyze, and derive correlations from very large scale heterogeneous datasets. Through CERN openlab, with industry partners, and the R&D projects of the LHC experiments we are attempting to build on the industry investments to utilize these techniques for science. In this presentation we will discuss the activities of the CERN openlab industry partnerships in machine learning and how they can be extended to science applications. Industry has shown the ability to monitor the health of complex systems and predict failures and maintenance needs. We will show how these techniques could be applied to detector health. We will discuss how industry developments in anomaly detection might be applicable to monitoring data quality and identifying new signals. Industry has demonstrated the feasibility of automated resource scheduling and optimization. We will show how these techniques could be used for data placement and workflow execution. Industry has advanced systems for high speed and high accuracy image recognition. We will discuss explorations of how these techniques could be applied to physics object identification. In recent years there have been advancements in the use of adversarial networks to improve the accuracy and robustness of training. These techniques may be applicable generally to some physics machine learning applications, but are potentially particularly interesting for tuning fast event simulation. We will present examples of industry activity and how all of these are being explored in LHC applications. At the end we will look at data processing techniques and speed requirements for physics processing and how those compare to similar real time industry processing applications.

        Speaker: Maria Girone (CERN)
      • 37
        New Machine Learning Developments in ROOT/TMVA

        In this talk, we will describe the latest additions to the Toolkit for Multivariate Analysis (TMVA), the machine learning package integrated into the ROOT framework. In particular, we will focus on the new deep learning module that contains robust fully-connected, convolutional and recurrent deep neural networks implemented on CPU and GPU architectures. We will present the performance of these new libraries on benchmark datasets from high-energy physics. Additionally, we will discuss new developments in parallelization, cross-validation, regression and unsupervised learning, and new interfaces to external machine learning frameworks such as TensorFlow and scikit-learn.
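
        A minimal PyROOT sketch of booking a TMVA deep learning method is shown below; the input file, tree names, variables and option strings are placeholders and would need to be adapted to a real dataset and ROOT version.

        # Sketch of a TMVA classification setup using the deep learning method (kDL).
        # File names, variables and option strings are illustrative only.
        import ROOT

        out = ROOT.TFile.Open("tmva_output.root", "RECREATE")
        factory = ROOT.TMVA.Factory("TMVAClassification", out,
                                    "!V:!Silent:AnalysisType=Classification")

        loader = ROOT.TMVA.DataLoader("dataset")
        for var in ("pt", "eta", "mass"):          # hypothetical input variables
            loader.AddVariable(var, "F")

        data = ROOT.TFile.Open("input.root")       # hypothetical input file
        loader.AddSignalTree(data.Get("sig_tree"), 1.0)
        loader.AddBackgroundTree(data.Get("bkg_tree"), 1.0)
        loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:NormMode=NumEvents")

        factory.BookMethod(loader, ROOT.TMVA.Types.kDL, "DNN",
                           "Layout=DENSE|64|RELU,DENSE|64|RELU,DENSE|1|LINEAR:"
                           "TrainingStrategy=LearningRate=1e-2,BatchSize=128,MaxEpochs=20")

        factory.TrainAllMethods()
        factory.TestAllMethods()
        factory.EvaluateAllMethods()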

        Speaker: Kim Albertsson (Lulea University of Technology (SE))
      • 38
        The Scikit-HEP project

        The Scikit-HEP project is a community-driven and community-oriented effort with the aim of providing Particle Physics at large with a Python scientific toolset containing core and common tools. The project builds on five pillars that embrace the major topics involved in a physicist’s analysis work: datasets, data aggregations, modelling, simulation and visualisation. The vision is to build a user and developer community engaging collaboration across experiments, to emulate scikit-learn's unified interface with Astropy's embrace of third-party packages, and to improve discoverability of relevant tools. The Scikit-HEP toolset comprises the scikit-hep package and numerous third-party “affiliated” packages. These affiliated packages include well-known packages such as root_numpy and root_pandas, but also newer packages that add functionality such as reading ROOT files natively, and interfaces between NumPy and Pythia or FastJet. The project's website is http://scikit-hep.org/ and the GitHub repositories are at https://github.com/scikit-hep. An overview of the overall project, toolset and main functionality will be presented.
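
        As an example of the "reading ROOT files natively" functionality mentioned above, the sketch below uses the affiliated uproot package to pull a TTree branch directly into a NumPy array without a ROOT installation; the file, tree and branch names are hypothetical.

        # Reading a ROOT TTree branch into a NumPy array with uproot.
        # "events.root", the tree name and branch names are placeholders.
        import uproot

        tree = uproot.open("events.root")["events"]
        pt = tree.array("muon_pt")          # NumPy array for one branch
        eta = tree.array("muon_eta")
        print(len(pt), pt.mean())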

        Speaker: Eduardo Rodrigues (University of Cincinnati (US))
      • 39
        HIPSTER - a Python package for particle physics analyses

        HIPSTER (Heavily Ionising Particle Standard Toolkit for Event Recognition) is an open source Python package designed to facilitate the use of TensorFlow in a high energy physics analysis context. The core functionality of the software is presented, with images from the MoEDAL experiment Nuclear Track Detectors (NTDs) serving as an example dataset. Convolutional neural networks are selected as the classification algorithm for this dataset and the process of training a variety of models with different hyperparameters is detailed. Next, results are shown for the MoEDAL problem, demonstrating the rich information output by HIPSTER that enables the user to probe the performance of their model in detail.
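
        To make the workflow concrete, the sketch below builds a small convolutional classifier in Keras/TensorFlow of the kind that could be trained on NTD image patches; it is a generic illustration with invented shapes and hyperparameters, not the HIPSTER API.

        # Generic CNN classifier sketch (not the HIPSTER API); shapes are invented.
        import numpy as np
        from tensorflow import keras
        from tensorflow.keras import layers

        model = keras.Sequential([
            layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 1)),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(32, (3, 3), activation="relu"),
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(64, activation="relu"),
            layers.Dense(1, activation="sigmoid"),      # signal vs background
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

        # Placeholder data standing in for labelled NTD image patches.
        x = np.random.rand(256, 64, 64, 1).astype("float32")
        y = np.random.randint(0, 2, size=(256,))
        model.fit(x, y, epochs=2, batch_size=32, validation_split=0.2)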

        Speaker: Thomas Paul Charman (University of London (GB))
      • 40
        Interactive, scalable, reproducible data analysis with containers, Jupyter, and Parsl

        In the traditional HEP analysis paradigm, code, documentation, and results are separate entities that require significant effort to keep synchronized, which hinders reproducibility. Jupyter notebooks allow these elements to be combined into a single, repeatable narrative. HEP analyses, however, commonly rely on complex software stacks and the use of distributed computing resources, requirements that have been barriers to notebook adoption. In this presentation we describe how Jupyter can be combined with Parsl (Parallel Scripting Library) and containers to enable intuitive and interactive high performance computing in Python.

        Parsl is a pure Python library for orchestrating the concurrent execution of multiple tasks. Parsl is remarkable for its simplicity. Its primary construct is an “app” decorator, which the programmer uses to indicate that certain functions (either pure Python or wrappers around shell programs) are to be treated as “apps.” App function calls then result in the creation of a new “task” that runs concurrently with the main program and other tasks, subject to dataflow constraints defined by the availability of app function input data. Data dependencies can be in-memory objects or external files. App decorators can further specify which computation resources to use and the required software environment to run the decorated function. Parsl abstracts hardware details, allowing a single script to be executed efficiently on one or more laptops, clusters, clouds, and/or supercomputers. To manage complex execution environments on various resources and also to improve reproducibility, Parsl can use containers (lightweight, virtualized constructs for packaging software with its environment) to wrap tasks.

        In this presentation we 1) show how a real-world complete HEP analysis workflow can be developed with Parsl and 2) demonstrate efficient and reproducible execution of such workflows on heterogeneous resources, including leadership-class computing facilities, using containers to wrap analysis code, Parsl to orchestrate the execution of these containers, and Jupyter as the interface for writing and executing the Parsl script.
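
        A minimal sketch of the Parsl pattern described above: a function is marked as an app, calls return futures, and dataflow is expressed by passing those futures between apps. The local-threads configuration and the function bodies are illustrative.

        # Minimal Parsl sketch: apps return futures; dependent apps wait on them.
        import parsl
        from parsl.app.app import python_app
        from parsl.configs.local_threads import config   # simple local executor

        parsl.load(config)

        @python_app
        def simulate(seed):
            import random
            random.seed(seed)
            return [random.gauss(0, 1) for _ in range(1000)]

        @python_app
        def summarize(samples):
            return sum(samples) / len(samples)

        # Launch several concurrent "simulation" tasks, then aggregate their results.
        futures = [simulate(s) for s in range(4)]
        means = [summarize(f).result() for f in futures]   # passing a future creates a dependency
        print(means)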

        Speaker: Ms Anna Elizabeth Woodard (Computation Institute, University of Chicago)
      • 41
        #182 slot
    • T7 - Clouds, virtualization and containers: S1 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Convener: Andrew McNab (University of Manchester)
      • 42
        Lightweight WLCG Sites

        The WLCG unites resources from over 169 sites spread across the world, and the number is expected to grow in the coming years. However, setting up and configuring new sites to support WLCG workloads is still not a straightforward task and often requires significant assistance from WLCG experts. A survey presented at CHEP 2016 revealed a strong wish among site admins for a reduction of overheads through the use of prefab Docker containers or OpenStack VM images, along with the adoption of popular tools like Puppet for configuration. In 2017, the Lightweight Sites project was initiated to construct shared community repositories providing such building blocks. In this contribution, we present the first sets of components made available and discuss the technical design choices and the subsequent roadmap. We describe a modular and extensible core system that abstracts low-level details through a YAML-based site-wide configuration file which is used to configure all distributed components through a single command. To accommodate the diverse scenarios at different sites, the system will enable site admins to cherry-pick their underlying technologies and methodologies for orchestration (Puppet, Ansible, ...), clustering (Docker Swarm, Kubernetes, ...) and networking (dedicated networks, custom overlay networks or a combination of both). As an alternative to setting up classic grid sites using the aforementioned technologies, we also discuss a different approach altogether to provide lightweight computing resources, featuring HTCondor and BOINC.
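
        Purely as an illustration of the single site-wide configuration idea (the actual schema is defined by the project), a hypothetical YAML description might be parsed and dispatched to the chosen orchestration backend as sketched below.

        # Hypothetical sketch of a YAML-driven site configuration; the schema is invented.
        import yaml

        SITE_CONFIG = """
        site_name: EXAMPLE-SITE
        orchestration: puppet        # or ansible
        clustering: docker_swarm     # or kubernetes
        components:
          - compute_element
          - batch_system
          - squid_proxy
        """

        cfg = yaml.safe_load(SITE_CONFIG)

        def deploy(component, backend):
            # In a real system this would invoke the chosen orchestration tool.
            print(f"deploying {component} via {backend}")

        for component in cfg["components"]:
            deploy(component, cfg["orchestration"])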

        Speaker: Mayank Sharma (CERN)
      • 43
        Evaluating Kubernetes as an orchestrator of the High Level Trigger computing farm of the Trigger and Data Acquisition system of the ATLAS experiment at the Large Hadron Collider

        The ATLAS experiment at the LHC relies on a complex and distributed Trigger and Data Acquisition (TDAQ) system to gather and select particle collision data. The High Level Trigger (HLT) component of the TDAQ system is responsible for executing advanced selection algorithms, reducing the data rate to a level suitable for recording to permanent storage. The HLT functionality is provided by a computing farm made up of thousands of commodity servers, each executing one or more processes.
        Moving the HLT farm management towards a containerized solution is one of the main themes of the ATLAS TDAQ Phase-II upgrade in the area of online software; it would open new possibilities for fault tolerance, reliability and scalability.
        This paper presents the results of an evaluation of Kubernetes as a possible orchestrator of the ATLAS TDAQ HLT computing farm. Kubernetes is a system for advanced management of containerized applications in large clusters.
        We will first highlight some of the technical solutions adopted to run the offline version of today’s HLT software in a Docker container. Then we will focus on some scaling performance measurements executed with a cluster of 1000 CPU cores. In particular, we will:
        - Show the way Kubernetes scales in deploying containers as a function of the cluster size;
        - Demonstrate how a proper tuning of the Queries Per Second (QPS) Kubernetes parameter set can improve the scaling of applications.
        Finally, we will conclude with an assessment of the possibility of using Kubernetes as an orchestrator of the HLT computing farm in LHC’s Run IV.

        Speaker: Giuseppe Avolio (CERN)
      • 44
        Lightweight on-demand computing with Elasticluster and Nordugrid ARC

        The cloud computing paradigm allows scientists to elastically grow or shrink computing resources as requirements demand, so that resources only need to be paid for when necessary. The challenge of integrating cloud computing into distributed computing frameworks used by HEP experiments has led to many different solutions in the past years, however none of these solutions offers a complete, fully integrated cloud resource out of the box. This paper describes how to offer such a resource using stripped-down minimal versions of existing distributed computing software components combined with off-the-shelf cloud tools. The basis of the cloud resource is Elasticluster, and the glue to join it to the HEP computing infrastructure is provided by the NorduGrid ARC middleware and the ARC Control Tower. These latter two components are stripped down to bare minimum edge services, removing the need for administering complex grid middleware, yet still provide the complete job and data management required to fully exploit the cloud resource. The ARC architecture delegates all the grid-related interactions (e.g. data staging, communication with the workload management system) to the edge service so that no grid middleware or sensitive credentials are required on the rest of the cloud. This model can be applied not only to cloud computing, but also to restrictive environments such as High Performance Computing centres. We show how to configure Elasticluster to automatically create the cloud resource and run these services at the click of a button, and present results of the setup running real payloads for the ATLAS experiment.

        Speaker: Maiken Pedersen (University of Oslo (NO))
      • 45
        CVMFS: Stratum0 in Kubernetes

        IceCube is a cubic-kilometre neutrino detector located at the South Pole. CVMFS is a key component of IceCube’s distributed high-throughput computing analytics workflow, used for sharing 500 GB of software across datacenters worldwide. Building the IceCube software suite across multiple platforms and deploying it into CVMFS has until recently been a manual, time-consuming task that does not fit well within an agile continuous delivery framework.

        Within the last 2 years a plethora of tooling around microservices has created an opportunity to upgrade the IceCube software build and deploy pipeline. We present a framework using Kubernetes to deploy Buildbot. The Buildbot pipeline is a set of pods (Docker containers) in the Kubernetes cluster that builds the IceCube software across multiple platforms, tests the new software for critical errors, syncs the software to a containerized CVMFS server, and finally executes a publish. The time from code commit to CVMFS publish has been greatly reduced and has enabled the capability of publishing nightly builds to CVMFS.
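
        Since Buildbot pipelines are themselves configured in Python, a fragment of the kind of build factory described above might look like the sketch below; the repository URL, build scripts and CVMFS repository name are invented, and the real IceCube configuration will differ.

        # Hypothetical fragment of a Buildbot master.cfg build factory (names invented).
        from buildbot.plugins import steps, util

        factory = util.BuildFactory()
        factory.addStep(steps.Git(repourl="https://github.com/example/icecube-software.git",
                                  mode="incremental"))
        factory.addStep(steps.ShellCommand(name="build",
                                           command=["./build.sh", "--platform", "el7"]))
        factory.addStep(steps.ShellCommand(name="test",
                                           command=["./run_tests.sh"]))
        # Open a CVMFS transaction on the (containerized) Stratum-0, sync, then publish.
        factory.addStep(steps.ShellCommand(name="publish",
                                           command="cvmfs_server transaction example.opensciencegrid.org && "
                                                   "rsync -a build/ /cvmfs/example.opensciencegrid.org/ && "
                                                   "cvmfs_server publish example.opensciencegrid.org"))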

        Speaker: David Schultz (University of Wisconsin-Madison)
      • 46
        Exploiting private and commercial clouds to generate on-demand CMS computing facilities with DODAS

        Reducing time and cost through increased setup and operational efficiency is key when exploiting private or commercial clouds. In turn, this means that reducing the learning curve, as well as the operational cost of managing community-specific services running on distributed environments, has become key to success and sustainability, even more so for communities seeking to exploit opportunistic cloud resources. DODAS (Dynamic On Demand Analysis Service) is a tool based on INDIGO-DataCloud services which allows on-demand generation of a container-based HTCondor batch system over cloud infrastructures, implementing a “Batch System as a Service” that can be easily customized to cover multiple scenarios. DODAS has been designed to completely automate the process of provisioning, creating, managing and accessing a pool of heterogeneous computing and storage resources over cloud infrastructures. As such, DODAS provides a complete abstraction of the underlying IaaS, which is only required to provide VMs or bare metal resources. It uses a combination of TOSCA templates and Ansible roles to orchestrate and configure cloud resources, and it relies on Apache Mesos to abstract CPU, RAM and storage. Marathon is adopted as the container orchestration platform on top of Mesos. Marathon takes care of setting up both HTCondor and any additional services (e.g. squid proxy, proxy certificate cache) that experiments might require. Credential management is completely based on JSON Web Tokens and relies on the Identity Access Management (IAM) and Token Translation Services (TTS) provided by INDIGO. The DODAS architecture is highly flexible and customizable; as such, it has evolved to also provide the configuration and automatic setup of the services and dependencies required by a regular CMS computing facility, a Tier site. This means that DODAS allows CMS to utilize “any cloud provider” to generate sites on demand, with almost zero effort. We describe the overall architecture of DODAS, including the integration strategy adopted to connect it with the CMS HTCondor Global Pool. We also present performance, scalability and efficiency results of DODAS-generated Tiers used to process CMS jobs, showing how DODAS has been effectively used at different CMS sites in order to consolidate disparate computing resources being used by CMS scientists. We will present the different ways DODAS can be used to integrate different usage patterns, especially how it can fit the final stages of CMS analysis.
        Finally, we present and compare results obtained on both private and commercial clouds, ranging from Microsoft Azure and Open Telekom Cloud to private OpenStack instances.

        Speaker: Daniele Spiga (Universita e INFN, Perugia (IT))
      • 47
        Digital archive as a service: automatic deployment of an Invenio-based repository using TOSCA orchestration and Apache Mesos

        In the framework of the H2020 INDIGO-DataCloud project we have implemented an advanced solution for the automatic deployment of digital data repositories based on Invenio, the digital library framework developed by CERN. Exploiting cutting-edge technologies, like Docker and Apache Mesos, and standard interfaces like TOSCA, we are able to provide a service that simplifies the process of creating and managing repositories of various digital assets using cloud resources.
        An Invenio-based repository consists of a set of services (e.g. database, message queue, cache, worker and frontend) that need to be properly installed, configured and linked together. These operations along with the provisioning of the resources and their monitoring and maintenance can be challenging for individual researchers or small-to-moderate-sized research groups.
        To this purpose the INDIGO-DataCloud platform provides advanced features for orchestrating the deployment of complex virtual infrastructures on distributed cloud environments: it is able to provision the required resources automatically over heterogeneous and/or hybrid cloud infrastructures and to configure them, ensuring dynamic elasticity and resilience. This approach has been successfully adopted to support the “Digital Research Infrastructure for the Arts and Humanities” (DARIAH) use case.

        Speaker: Marica Antonacci
    • 12:30
      Lunch break
    • T1 - Online computing: S2 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Clara Gaspar (CERN)
      • 48
        LHCb MiniDAQ Control System

        LHCb is one of the 4 experiments at the LHC accelerator at CERN, specialized in b-physics. During the next long shutdown period, the LHCb experiment will be upgraded to a trigger-less readout system with a full software trigger in order to be able to record data with a much higher instantaneous luminosity. To achieve this goal, the upgraded systems for trigger, timing and fast control (TFC) and data acquisition (DAQ) will have new electronic boards and a new software stack will be introduced for data acquisition. In the development stage, all these components are housed in servers named MiniDAQ. These new components will need to be integrated in an upgraded Experiment Control System (ECS) based on the WinCC OA SCADA and the CERN JCOP framework. The ECS provides full monitoring and control for all these sub-systems and many important features like configuration recipes and automated actions. This paper will describe the implementation of the upgraded ECS and its component fwMiniDAQ, which integrates all the new developments and can be easily distributed to the sub-detector developers and configured for their various setups.

        Speaker: Luis Granado Cardoso (CERN)
      • 49
        A Web-based control and monitoring system for DAQ applications

        Data acquisition and control play an important role in science applications, especially in modern high energy physics (HEP) experiments. A comprehensive and efficient monitoring system is a vital part of any HEP experiment. In this paper we describe the web-based software framework which is currently used by the CMD-3 Collaboration during data taking with the CMD-3 Detector at the VEPP-2000 Collider, and partially by the Muon g-2 experiment at Fermilab, to monitor the status of data acquisition and control the quality of physics data taken in the experiments.
        The system is designed to meet the typical requirements and cover the various use cases of (MIDAS-oriented) DAQ applications, from a central configuration view, through slow control data monitoring and user-oriented visualization, to full control of the DAQ modules.
        Being intermediate middleware between the front-end electronics and the DAQ applications, the system is focused on providing a high-level coherent view for shifters and experts for robust operations and, in particular, is used to integrate various experiment-dependent monitoring modules and tools into a unified web-oriented portal with an appropriate access control policy implemented.

        The paper describes the design and overall architecture of the system, recent developments and the most important aspects of the framework implementation and features like real-time interactive monitoring plots, trend histograms, run log details, remote script execution, built-in widget concept (template tags), fine-granular authorization and others.

        Speaker: Alexey Anisenkov (Budker Institute of Nuclear Physics (RU))
      • 50
        The protoDUNE Single Phase Detector Control System

        This paper presents the Detector Control System (DCS) that is being designed and implemented for the NP04 experiment at CERN. NP04, also known as protoDUNE Single Phase (SP), aims at validating the engineering processes and detector performance of a large LAr Time Projection Chamber in view of the DUNE experiment. The detector is under construction and will be operated on a tertiary beam of the CERN SPS accelerator starting in 2018.
        After an overall description of the distributed control architecture that has been chosen for the control of this experiment, focus will be put on describing the software system design, based on the CERN control frameworks UNICOS and JCOP (built on top of WINCC OA), the hardware implementation, the deployment of the system into production and the integration of the detector subsystems and instrumentation into the overall NP04 DCS.
        The challenging specifications of the SP detector, such as the temperature reading accuracy and the sensitivity of the SP detector electronics to electromagnetic noise, as well as the measures applied at the hardware layout level, are also described. The knowledge acquired during the operation of the Anode Plane Assemblies (APA) sub-detector components within the NP Coldbox test facility is discussed and improvements are presented.

        Speaker: Manuel Jesus Rodriguez Alonso (Centro de Investigaciones Energéti cas Medioambientales y Tecno)
      • 51
        Software based control and monitoring of a hardware based track reconstruction system for the ATLAS experiment

        During the Run-2 of the Large Hadron Collider (LHC) the instantaneous luminosity exceeds the nominal value of 10^{34} cm^{−2} s^{−1} with a 25 ns bunch crossing period and the number of overlapping proton-proton interactions per bunch crossing increases up to about 80. These conditions pose a challenge to the trigger system of the experiments that has to control rates while keeping a good efficiency for interesting physics events.

        This document summarizes the software based control and monitoring of a hardware-based track reconstruction system for the ATLAS experiment, called Fast Tracker (FTK), composed of associative memories and FPGAs operating at the rate of 100 kHz and providing high quality track information within the available latency to the high-level trigger. In particular, we will detail the commissioning of the FTK within the ATLAS online software system presenting the solutions adopted for scaling up the system and ensuring robustness and redundancy. We will also describe the solutions to challenges such as controlling the occupancy of the buffers, managing the heterogeneous and large configuration, and providing monitoring information at sufficient rate.

        Speaker: Simone Sottocornola (Universita and INFN (IT))
      • 52
        Front-End Electronics Control and Monitoring for the LHCb Upgrade

        The LHCb experiment, one of the four experiments operating at the LHC, will undergo a major upgrade of its electronics during the third long shutdown period of the particle accelerator. One of the main objectives of the upgrade effort is to implement a 40 MHz readout of collision data. For this purpose, the Front-End electronics will make extensive use of a radiation-resistant chipset, the Gigabit Transceiver (GBT), for readout as well as for slow control, monitoring and synchronization. At LHCb, the tools to operate the front-end electronics are developed by a central team and distributed to the users. This contribution describes the architecture of the system that implements the slow control and monitoring of all Front-End electronics using the GBT chipset, namely the GBTx and GBT-SCA. The system is implemented in three layers, starting with an FPGA-based electronic board that interfaces the GBT chipset directly through optical fibers. The second layer is composed of a PCIe driver and a number of processes to operate these boards. The user operates the system through the third layer, the WinCC OA SCADA, which is interfaced with the Front-Ends via a message broker called DIM. The requirements of the system as well as the design and integration of each layer are discussed in detail. The results of the firmware implementation in hardware and operational tests are shown and the overall performance of the system is discussed.

        Speaker: Joao Vitor Viana Barbosa (CERN)
      • 53
        Status report on the architecture and future upgrades of the CMS Electromagnetic Calorimeter Control and Safety Systems

        The Electromagnetic Calorimeter (ECAL) is one of the sub-detectors of the Compact Muon Solenoid (CMS) experiment of the Large Hadron Collider (LHC) at CERN. For more than 10 years, the ECAL Detector Control System (DCS) and the ECAL Safety System (ESS) have supported the experiment operation, contributing to its high availability and safety. The evolution of both systems to fulfill new requirements and constraints, in addition to optimizations towards improving usage and process automation, has led to several changes to their original design. This paper presents the current software/hardware architecture of both the CMS ECAL DCS and ESS and reviews the major changes applied to both systems during the past years. Furthermore, in view of the CMS Phase-II upgrade of this sub-detector, the corresponding plans for the control and safety systems are also discussed.

        Speaker: Mr Diogo Di Calafiori (Eidgenoessische Technische Hochschule Zuerich (ETHZ) (CH))
      • 54
        Towards the ALICE Online-Offline (O2) control system

        The ALICE Experiment at the CERN LHC (Large Hadron Collider) is under preparation for a major upgrade that is scheduled to be deployed during Long Shutdown 2 in 2019-2020 and that includes new computing systems, called O2 (Online-Offline).
        To ensure the efficient operation of the upgraded experiment along with its newly designed computing system, a reliable, high performance and automated control system will be developed with the goal of managing the lifetime of all the O2 processes, and of handling the various phases of the data taking activity by interacting with the detectors, the trigger system and the LHC.
        The ALICE O2 control system will be a distributed system based on state of the art cluster management and microservices which have recently emerged in the distributed computing ecosystem. Such technologies weren’t available during the design and development of the original LHC computing systems, and their use will allow the ALICE collaboration to benefit from a vibrant and innovative open source community.
        This paper illustrates the O2 control system architecture. It evaluates several solutions that were considered during an initial prototyping phase and provides a rationale for the choices made. It also provides an in-depth overview of the components, features and design elements of the actual system.

        Speaker: Teo Mrnjavac (CERN)
      • 55
        The online DQM of BESIII

        The BESIII detector is a magnetic spectrometer operating at BEPCII, a double-ring e+e- collider with center-of-mass energies between 2.0 and 4.6 GeV and a peak luminosity of $10^{33}$ cm$^{-2}$ s$^{-1}$. The event rate is about 4 kHz after the online event filter (L3 trigger) at the J/$\psi$ peak.
        The BESIII online data quality monitoring (DQM) system is used to monitor the data and the detector in real time using fully reconstructed data. It is an essential supplement to the DAQ monitoring, which uses unprocessed data directly from the detector. We will give a review of the BESIII DQM system. The system is designed to be weakly coupled to the DAQ online system, so that it can be developed freely within the offline environment. The system includes the data server, which is used to get data from the DAQ, the reconstruction processes for the data reconstruction, the histogram server, which is used to collect all histograms from all reconstruction processes, its own error handling, and so on.

        Speaker: Xiaobin Ji (IHEP, Beijing, China)
    • T2 - Offline computing: S2 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Lucia Grillo (University of Manchester (GB))
      • 56
        Impact of Detector Simulation in Particle Physics Collider Experiments

        Detector simulation has become fundamental to the success of modern high-energy physics (HEP) experiments. For example, the Geant4-based simulation applications developed by the ATLAS and CMS experiments played a major role in producing physics measurements of unprecedented quality and precision with faster turnaround, from data taking to journal submission, than any previous hadron collider experiment. The material presented here adds to a recent review on the impact of detector simulation in collider particle physics experiments published in Ref. [1]. The most recent examples illustrate the application of simulation to detector design and optimization, software development and testing of computing models, detector commissioning and data calibration studies, as well as physics signal and background estimations used in data analysis for physics measurements. The cost and economic impact of simulation in the CMS experiment are also discussed. Future experimental programs present new challenges for increasingly sophisticated detectors, events, and data samples. The presentation finishes with a short discussion of potential solutions to mitigate the impact of this new environment on computing performance, given that physics accuracy needs to be preserved or improved. These are based on re-engineering detector simulation software using modern software and computing techniques to run on emerging hardware architectures.

        [1] Physics Reports 695 (2017) 1-54, arXiv:1706.04293

        Speaker: Victor Daniel Elvira (Fermi National Accelerator Lab. (US))
      • 57
        Current and Future Performance of the CMS Simulation

        The CMS full simulation using Geant4 has delivered billions of simulated events for analysis during Runs 1 and 2 of the LHC. However, the HL-LHC dataset will be an order of magnitude larger, with a similar increase in occupancy per event. In addition, the upgraded CMS detector will be considerably more complex, with an extended silicon tracker and a high granularity calorimeter in the endcap region. Increases in conventional computing resources are subject to both technological and budgetary limitations, so novel approaches are needed to improve software efficiency and to take advantage of new architectures and heterogeneous resources. Several projects are in development to address these needs, including the vectorized geometry library VecGeom and the GeantV transport engine, which uses track-level parallelization. The current computing performance of the CMS simulation will be presented as a baseline, along with an overview of the various optimizations already available for Geant4. Finally, the progress and outlook for integrating VecGeom and GeantV in the CMS software framework will be discussed.

        Speaker: Kevin Pedro (Fermi National Accelerator Lab. (US))
      • 58
        New techniques for pile-up simulation in ATLAS

        The high-luminosity data produced by the LHC leads to many proton-proton interactions per beam crossing in ATLAS, known as pile-up. In order to understand the ATLAS data and extract the physics results it is important to model these effects accurately in the simulation. As the pile-up rate continues to grow towards an eventual rate of 200 for the HL-LHC, this puts increasing demands on the computing resources required for the simulation, and the current approach of simulating the pile-up interactions along with the hard-scatter for each Monte Carlo production is no longer feasible. The new ATLAS "overlay" approach to pile-up simulation is presented. Here a pre-simulated set of minimum bias interactions, either from simulation or from real data, is created once and events drawn from this are overlaid with the hard-scatter event being simulated. This leads to significant improvements in CPU time. The contribution will discuss the technical aspects of the implementation in the ATLAS simulation and production infrastructure and compare the performance, both in terms of computing and physics, to the previous approach.

        Speaker: Tadej Novak (Jozef Stefan Institute (SI))
      • 59
        FairRoot and ALICE O2 multithreading simulation

        To address the challenges of the major upgrade of the experiment, the ALICE simulations must be able to make efficient use of computing and opportunistic supercomputing resources available on the GRID. The Geant4 transport package, the performance of which has been demonstrated in a hybrid multithreading (MT) and multiprocessing (MPI) environment with up to ¼ million threads, is therefore of particular interest.
        The O2 simulation framework is based on FairRoot, which itself is based on the Virtual Monte Carlo (VMC). The integration of MT into the VMC design and its impact on Geant4 VMC was presented at CHEP 2014. Geant4 VMC MT and the scaling behavior of the computing time with the number of cores have then been tested using a simplified but realistic multithreaded simulation application.
        The focus was then put on the integration of MT in the FairRoot classes as the necessary step towards MT in the FairRoot based experimental frameworks. The new O2 framework is the first for which the migration to MT has been achieved for all currently included detectors. The planned integration of the Geant4 MT mode in the standard FairRoot and O2 testing suites should allow this mode to be kept operational during the period of further developments and enhancements in 2018.
        In this contribution we will present the progress with the integration of MT in the FairRoot classes, the work on thread-safety in the O2 simulation classes and the experience with the integration of the MT mode in testing. We will also discuss plans for further performance studies with a realistic detector setup.

        Speaker: Ivana Hrivnacova (IPNO, Université Paris-Saclay, CNRS/IN2P3)
      • 60
        Status of parallelized JUNO simulation software

        The Jiangmen Underground Neutrino Observatory (JUNO) is a multi-purpose neutrino experiment. It consists of a central detector, a water pool and a top tracker. The central detector, which is used for neutrino detection, consists of 20 kt of liquid scintillator (LS) and about 18,000 20-inch photomultiplier tubes (PMTs) to collect light from the LS.
        Simulation software is one of the important parts of the JUNO offline software. To speed up simulation, a parallelized simulation framework has been developed based on the SNiPER framework and Geant4 version 10. The SNiPER task components are in charge of the event loop, which can run in sequential mode, Intel TBB mode and other modes. Based on SNiPER, the simulation framework and its underlying parallel libraries have been decoupled. However, parallelized simulation of correlated events is a challenge. In order to keep the correct event order, a component called the global buffer has been developed in SNiPER.
        In this presentation, an overview of the parallelized JUNO simulation framework will be presented first. The global buffer is used in the parallelized event correlation simulation. An event generator produces events with timestamps in sequential mode. These events are put into the global buffer and simulated by detector simulation algorithms in different tasks. After simulation, the events are saved into ROOT files by I/O services running in extra threads. Finally, we will present the software performance.
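
        The global-buffer idea, in which sequentially generated, time-ordered events are fanned out to parallel workers and their outputs written back in the original order, can be illustrated with the generic Python sketch below; it is only a schematic of the pattern, not the SNiPER implementation.

        # Schematic of the global-buffer pattern: parallel processing, ordered output.
        # This is a generic illustration, not SNiPER code.
        from concurrent.futures import ThreadPoolExecutor

        def generate_events(n):
            """Sequential generator producing (timestamp, event) pairs in time order."""
            for i in range(n):
                yield (float(i), {"id": i})

        def simulate(event):
            """Stand-in for detector simulation of one event."""
            event["nhits"] = event["id"] * 10
            return event

        with ThreadPoolExecutor(max_workers=4) as pool:
            # The executor plays the role of the global buffer: events are dispatched to
            # parallel tasks, but map() returns results in submission (i.e. time) order,
            # so the output stream can be written with the correct event ordering.
            for timestamp, event in pool.map(lambda te: (te[0], simulate(te[1])), generate_events(20)):
                print(f"t={timestamp:5.1f}  event {event['id']}  nhits={event['nhits']}")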

        Speaker: Dr Tao Lin (Institute of High Energy Physics, CAS)
      • 61
        Current status for simulation chain of the neutrino events in the NOvA experiment

        The NOvA experiment is a two-detector, long-baseline neutrino experiment operating since 2014 in the NuMI muon neutrino beam (FNAL, USA). NOvA has already collected about 25% of its expected statistics in both neutrino and antineutrino modes for electron-neutrino appearance and muon-neutrino disappearance analyses. Careful simulation of neutrino events and backgrounds is required for precise measurements of the neutrino oscillation parameters. We will describe updates to the NOvA simulation chain in comparison to the previous one presented at CHEP 2015, including: a revision of the detector geometries, upgrading the GENIE neutrino event generator to access an array of 2p2h models, upgrading GEANT to improve neutron modeling, modifying our custom electronics simulation package to model energy-dependent shaping changes, improvements to our fiber and scintillator light models, and modifications to the overlay procedure to allow for mixing cosmic ray data with simulated neutrino events as well as enhanced realism in the addition of beam-related activity from outside the detector to the detector simulation.

        Speaker: Oleg Samoylov (Joint Institute for Nuclear Research)
      • 62
        Adopting new technologies in the LHCb Gauss simulation framework

        The increase in luminosity foreseen in the future years of operation of the Large Hadron Collider (LHC) creates new challenges in computing efficiency for all participating experiments. These new challenges extend beyond the data-taking alone, because data analyses require more and more simulated events, whose creation already takes a large fraction of the overall computing resources. For Run 3 of the LHC, the LHCb collaboration needs to simulate about two orders of magnitude more Monte Carlo events to exploit the increased luminosity.
        Therefore, the LHCb simulation framework (Gauss) will go through a significant renovation, mostly driven by the upgraded core software framework (Gaudi) and the availability of a multithreaded version of Geant4. The upgraded Gaudi framework replaces single-threaded processing by a multithreaded approach, allowing concurrent execution of tasks within a single event as well as of multiple events in parallel. A major task of the required overhaul of Gauss is the implementation of a new interface to the multithreaded version of Geant4. Furthermore, in order to implement fast simulation options for different sub-detectors, this interface needs to be designed to be as modular as possible to allow seamless customisation of the detector simulation and a different treatment based on particle type.
        In this talk, we will give an overview of the whole plan and discuss the status of the project, as well as the lessons learned facing the challenges above. Furthermore, we present the work on Gaussino, a new Gaudi-based core simulation framework which forms the basis of Gauss for Run 3. Gaussino is a cross-experiment simulation framework with no dependency on LHCb software and provides an ideal testbed for the implementation of future technologies such as GeantV.

        Speaker: Dominik Muller (CERN)
      • 63
        #573 slot
    • T3 - Distributed computing: Facilities Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: Julia Andreeva (CERN)
      • 64
        Advances and enhancements in the FabrIc for Frontier Experiments Project at Fermilab

        The FabrIc for Frontier Experiments (FIFE) project within the Fermilab Scientific Computing Division is charged with integrating offline computing components into a common computing stack for the non-LHC Fermilab experiments, supporting experiment offline computing, and consulting on new, novel workflows. We will discuss the general FIFE onboarding strategy, the upgrades and enhancements in the FIFE toolset, and plans for the coming year.
        These enhancements include: a refactoring of the Fermilab General Purpose Grid cluster for integration with the HEPCloud Project; expansion of opportunistic computing resources (including GPU and high-performance computing resources); assistance with commissioning computing resources at European sites for individual experiments; StashCache repositories for experiments; enhanced job monitoring tools; and a custom workflow management service. Additionally, we have completed the first phase of a Federated Identity Management system to make it easier for FIFE users to access Fermilab computing resources. There is also a new service, known as FERRY, that will provide a central repository for authorization and job management attributes.

        Speaker: Vito Di Benedetto (Fermi National Accelerator Lab. (US))
      • 65
        HEPCloud, an Elastic Hybrid HEP Facility using an Intelligent Decision Support System

        HEPCloud is rapidly becoming the primary system for provisioning compute resources for all Fermilab-affiliated experiments. In order to reliably meet the peak demands of the next generation of High Energy Physics experiments, Fermilab must either plan to locally provision enough resources to cover the forecasted need, or find ways to elastically expand its computational capabilities. Commercial cloud and allocation-based High Performance Computing (HPC) resources both have explicit and implicit costs that must be considered when deciding when to provision these resources, and at what scale. In order to support such provisioning in a manner consistent with organizational business rules and budget constraints, we have developed a modular intelligent decision support system (IDSS) to aid in the automatic provisioning of resources spanning multiple cloud providers, multiple HPC centers, and grid computing federations. In this paper, we discuss the goals and architecture of the HEPCloud Facility, the architecture of the IDSS, and our early experience in using the IDSS for automated facility expansion both at Fermilab and Brookhaven Lab.

        Speaker: Eric Vaandering (Fermi National Accelerator Lab. (US))
      • 66
        Modeling and Simulation of Load Balancing Strategies for Computing in High Energy Physics

        The amount of data to be processed by experiments in high energy physics will increase tremendously in the coming years. For the first time in history the expected technology advance itself will not be sufficient to cover the arising gap between required and available resources, assuming the current flat-budget hardware procurement strategy is maintained. This leads to unprecedented challenges for both HEP software development and computing models. A first step towards meeting these challenges is the development of load balancing strategies for the existing workflow management systems in order to ensure the most effective resource utilization during processing of the comprehensive and world-wide distributed datasets.

        We report on using Palladio [1], an existing simulator for simulating the performance of abstract software architectural models, to model and simulate the performance of computing jobs executed at GridKa Tier 1 center. We validate the model with real-world performance measurements. With this model, we will enable model-based evaluation of different load balancing strategies.

        [1] Becker, Steffen, Heiko Koziolek, and Ralf Reussner. "The Palladio component model for model-driven performance prediction." Journal of Systems and Software 82, no. 1 (2009): 3-22.

        Speaker: Manuel Giffels (KIT - Karlsruhe Institute of Technology (DE))
      • 67
        The LZ UK Data Centre

        LZ is a Dark Matter experiment based at the Sanford Underground Research Facility. It is currently under construction and aims to start data taking in 2020. Its computing model is based on two data centres, one in the USA (USDC) and one in the UK (UKDC), both holding a complete copy of its data. During stable periods of running both data centres plan to concentrate on different aspects of reconstruction and analysis. However, both data centres are expected to be able to handle all aspects of data processing, including user analysis, to provide a failover should one of the data centres be temporarily unavailable.
        Unlike the USDC where both storage and processing facilities are located at NERSC, the UKDC uses a distributed approach on GridPP-provided computing resources. The data themselves will be hosted at the Tier-2 at Imperial College London, while their processing and analysis will take place at various UK Tier 2s and possibly other collaborating European institutes.
        Considerable effort has gone into providing an intuitive interface to the UKDC, hiding much of the complexity of the underlying infrastructure from the end users.
        We report on the setup of the UK Data Centre with an emphasis on the user interface, in addition to the planned interface to the US Data Centre and the experience gained from the mock data challenges.

        Speaker: Daniela Bauer (Imperial College (GB))
      • 68
        The JINR distributed computing environment

        Computing in the field of high energy physics requires the usage of heterogeneous computing resources and IT, such as grid, high performance computing, cloud computing and big data analytics, for data processing and analysis. The core of the distributed computing environment at the Joint Institute for Nuclear Research is the Multifunctional Information and Computing Complex (MICC). It includes the Tier-1 for the CMS experiment, a Tier-2 site for all LHC experiments and other non-LHC grid VOs, such as BIOMED, COMPASS, NICA/MPD, NOvA, STAR and BESIII, as well as cloud and HPC infrastructures. A brief status overview of each component is presented. Particular attention is given to the development of distributed computations performed in collaboration with CERN, BNL, FNAL, FAIR, China, and the JINR Member States. One of the directions for the cloud infrastructure is the development of methods to integrate the various cloud resources of the JINR Member States in order to perform common tasks, and also to distribute the load across the integrated resources. We have integrated the cloud resources of several institutes in Azerbaijan, Ukraine, Russia, Bulgaria and Belarus. Extension of the HPC component will be carried out through a specialized HPC engineering infrastructure being created at the MICC, which makes use of the contact liquid cooling technology implemented by the Russian company JSC "RSC Technologies". We present our plans to further develop the MICC as a center for scientific computing within the multidisciplinary research environment of JINR and the JINR Member States, and particularly for megascience projects such as NICA.

        Speaker: Vladimir Korenkov (Joint Institute for Nuclear Research (RU))
      • 69
        Extending CERN computing to volunteers - LHC@home consolidation and outlook

        LHC@home has provided computing capacity for simulations under BOINC since 2005. Following the introduction of virtualisation with BOINC to run HEP Linux software in a virtual machine on volunteer desktops, initially started on test BOINC projects like Test4Theory and ATLAS@home, all CERN applications distributed to volunteers have been consolidated under a single LHC@home BOINC project. As part of an effort to unite CERN's batch, cloud, grid and volunteer computing efforts, the BOINC service has been integrated with the other compute services at CERN, notably HTCondor, in terms of job submission and accounting. The paper will also address contributions to improve the BOINC software and the community effort to evolve BOINC towards a sustainable volunteer computing environment. Furthermore, we discuss future challenges to reduce the effort required by volunteers to run virtual machines for experiment simulations and improvements to BOINC to widen the appeal of volunteer computing.

        Speaker: David Cameron (University of Oslo (NO))
      • 70
        Many hands make light work: Experiences from a shared resource WLCG Tier-2 computing site

        The Edinburgh (UK) Tier-2 computing site has provided CPU and storage resources to the Worldwide LHC Computing Grid (WLCG) for close to 10 years. Unlike other sites, resources are shared amongst members of the hosting institute rather than being exclusively provisioned for Grid computing. Although this unconventional approach has posed challenges for troubleshooting and service delivery, there are unique advantages, such as being able to run above baseline expectations through opportunistic use of the cluster during quiet periods. Furthermore, the economy of scale of a larger shared facility enables access to niche equipment without additional cost (e.g. large-scale GPU processing). This operational model has recently been expanded to incorporate access to national supercomputing and data storage facilities.

        This presentation will describe the steps taken to consolidate disparate computing and storage resources into a coherent Grid site endpoint. This effort includes: cloud provisioning and workload management using OpenStack and Cloud Scheduler; enabling container technologies to ease workload management on HPC systems; connecting remote data facilities to grid storage middleware. We will reflect upon our experiences fostering links between stakeholders and consider how this model could be adopted at other institutes.

        Speaker: Andrew John Washbrook (The University of Edinburgh (GB))
      • 71
        Advances in ATLAS@Home towards a major ATLAS computing resource

        The volunteer computing project ATLAS@Home has been providing a stable computing resource for the ATLAS experiment since 2013. It has recently undergone some significant developments and as a result has become one of the largest resources contributing to ATLAS computing, by expanding its scope beyond traditional volunteers and into exploitation of idle computing power in ATLAS data centres. Removing the need for virtualization on Linux and instead using container technology has made the entry barrier for data centre participation significantly lower, and in this paper we describe the implementation and results of this change. We also present other recent changes and improvements in the project. In early 2017 the ATLAS@Home project was merged into a combined LHC@Home platform, providing a unified gateway to all CERN-related volunteer computing projects. The ATLAS Event Service shifts data processing from file-level to event-level and we describe how ATLAS@Home was incorporated into this new paradigm. The finishing time of long tasks was also greatly improved by implementing a reassignment scheduling algorithm to assign late jobs to “reliable” volunteer hosts. Finally, the steps taken to allow regular ATLAS grid sites to move completely to ATLAS@Home are shown.

        Speaker: David Cameron (University of Oslo (NO))
    • T4 - Data handling: S2 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Conveners: Tigran Mkrtchyan (DESY), Tigran Mkrtchyan (A.Alikhanyan National Science Laboratory (AM))
      • 72
        Caching technologies for Tier-2 sites: a UK perspective.

        Pressures from both WLCG VOs and externalities have led to a desire to "simplify" data access and handling for Tier-2 resources across the Grid. This has mostly been imagined in terms of reducing book-keeping for VOs, and the total number of replicas needed across sites. One common direction of motion is to increase the amount of remote access to data for jobs, which is also seen as enabling the development of administratively cheaper Tier-2 subcategories, reducing manpower and equipment costs.
        Caching technologies are often seen as a "cheap" way to ameliorate the increased latency (and decreased bandwidth) introduced by ubiquitous remote-access approaches, but the usefulness of caches is strongly dependent on the reuse of the data thus cached.
        We report on work done in the UK at four GridPP Tier-2 sites - ECDF, Glasgow, RALPP and Lancaster - to investigate the suitability of transparent caching via the recently-rebranded XCache (Xrootd Proxy Cache) for both ATLAS and CMS workloads. We discuss issues with authentication, and measure the actual hit rates against the caches and their effect on efficiency of jobs at sites.
        We also examine the use of Xroot Proxy Caches backed by solid state media as internal caches within a site, buffering popular files between the site Storage Element and the workers.

        Speaker: Samuel Cadellin Skipsey
      • 73
        A data caching model for Tier-2 WLCG computing centres using XCache

        The XCache (XRootD Proxy Cache) provides a disk-based caching proxy for data access via the XRootD protocol. This can be deployed at WLCG Tier-2 computing sites to provide a transparent cache service for the optimisation of data access, placement and replication.

        We will describe the steps to enable full read/write operations to storage endpoints consistent with the distributed data management systems provided by LHC experiments using an example Tier-2 computing site (Edinburgh, UK). Performance studies will also be outlined to assess the benefits and limitations of using the XCache service, leading to the generation of configuration templates and best practices for other grid computing sites to follow. We will also discuss how the XCache service can provide access to large regional storage pools for diskless or modest disk capacity sites as part of a regionally federated layout. The possibility of using a single XCache service for multiple VO support is also discussed.

        Speaker: Dr Teng LI (University of Edinburgh)
      • 74
        Advancing throughput of HEP analysis work-flows using caching concepts

        High throughput and short turnaround cycles are core requirements for the efficient processing of I/O-intensive end-user analyses. Together with the tremendously increasing amount of data to be processed, this leads to enormous challenges for HEP storage systems, networks and the data distribution to end-users. This situation is further compounded when taking into account opportunistic resources without dedicated storage systems as a possible extension of traditional HEP computing facilities for end-user analyses.

        Enabling data locality via local caches on the processing units is a very promising approach to solve throughput limitations and to ensure short turnaround cycles of end-user analyses. Therefore, two different caching concepts have been studied at the Karlsruhe Institute of Technology. Both are transparently integrated into the HTCondor batch system in order to avoid job-specific adaptations for end-users.

        The first concept relies on coordinated caches on SSDs in the worker nodes. Data locality is taken into account by custom-developed components around the HTCondor batch system, ensuring that jobs are assigned to the nodes holding their input data.
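
        As an illustration of the data-locality idea behind this first concept (a minimal sketch under assumed inputs, not the actual KIT/HTCondor components), worker nodes can be ranked by how much of a job's input they already hold in their local SSD cache:

        # Illustrative sketch only: rank candidate worker nodes by cached-input overlap.
        # The cache map, node names and job description below are hypothetical.

        def rank_nodes_by_locality(job_inputs, node_caches):
            """Return node names sorted by how many of the job's input files they cache."""
            scores = {
                node: len(set(job_inputs) & cached_files)
                for node, cached_files in node_caches.items()
            }
            return sorted(scores, key=scores.get, reverse=True)

        node_caches = {
            "wn01": {"/store/run1/file_a.root", "/store/run1/file_b.root"},
            "wn02": {"/store/run1/file_b.root"},
            "wn03": set(),
        }
        job_inputs = ["/store/run1/file_a.root", "/store/run1/file_b.root"]
        print(rank_nodes_by_locality(job_inputs, node_caches))  # ['wn01', 'wn02', 'wn03']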

        The second concept utilizes Ceph as a distributed file system acting as a system-wide cache. In this case no data-locality-specific adjustments need to be applied to the HTCondor batch system. In combination with the developed XRootD caching and data locality plug-ins, this approach is also very well suited to tackling bandwidth limitations on opportunistic resources like HPC centers offering parallel file systems.

        In this talk an overview of the technologies used, the data locality concepts and the current status of the project will be presented.

        Speaker: Christoph Heidecker (KIT - Karlsruhe Institute of Technology (DE))
      • 75
        Evolution of the Hadoop platform for HEP

        The interest in using Big Data solutions based on the Hadoop ecosystem is constantly growing in the HEP community. This drives the need for increased reliability and availability of the central Hadoop service and underlying infrastructure provided to the community by the CERN IT department.
        This contribution will report on the overall status of the Hadoop platform and the recent enhancements and features introduced in many areas, including service configuration, availability, alerting, monitoring and data protection, in order to meet the new requirements posed by the user community.

        Speaker: Zbigniew Baranowski (CERN)
      • 76
        Disk failures in the EOS setup at CERN - A first systematic look at 1 year of collected data

        The EOS deployment at CERN is a core service used both for scientific data
        processing and analysis and as a back-end for general end-user storage (e.g. home directories/CERNBox).
        The disk failure metrics collected over a period of one year from a deployment
        of some 70k disks allow a first systematic analysis of the behaviour
        of different hard disk types for the large CERN use cases.

        In this presentation we will describe the data collection and analysis,
        summarise the measured rates and compare them with other large disk
        deployments. In a second part of the presentation we will present a first
        attempt to use the collected failure and SMART metrics to develop a machine
        learning model predicting imminent failures and hence avoid service degradation
        and repair costs.
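
        As an illustration of the second part (a minimal sketch on synthetic data, not the actual EOS failure data or model), a classifier can be trained on SMART-like features to flag disks at risk:

        # Illustrative sketch only: train a classifier on synthetic stand-ins for SMART
        # metrics (e.g. reallocated sectors, pending sectors, power-on hours) to flag
        # disks likely to fail; the data and features are placeholders.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import classification_report

        rng = np.random.default_rng(0)
        n = 5000
        X = rng.normal(size=(n, 3))                      # three synthetic SMART-like features
        y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 2).astype(int)  # synthetic failure label

        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
        clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
        clf.fit(X_train, y_train)
        print(classification_report(y_test, clf.predict(X_test)))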

        Speaker: Dirk Duellmann (CERN)
      • 77
        The challenges of mining logging data in ATLAS

        Processing ATLAS event data requires a wide variety of auxiliary information from geometry, trigger, and conditions database systems. This information is used to dictate the course of processing and refine the measurement of particle trajectories and energies to construct a complete and accurate picture of the remnants of particle collisions. Such processing occurs on a worldwide computing grid, necessitating wide-area access to this information.
        Event processing tasks may deploy thousands of jobs. Each job calls for a unique set of information from the databases via SQL queries to dedicated Squid servers in the ATLAS Frontier system, a system designed to pass queries to the database only if that result has not already been cached from another request. Many queries passing through Frontier are logged in an Elastic Search cluster along with pointers to the associated tasks and jobs, various metrics, and states at the time of execution. PanDA, which deploys the jobs, stores various configuration files as well as many log files after each job completes. Information is stored at each stage, but no system contains all information needed to draw a complete picture.
        This presentation describes the challenges of mining information from these sources to compile a view of database usage by jobs and tasks as well as assemble a global picture of the coherence and competition of tasks in resource usage to identify inefficiencies and bottlenecks within the overall system.
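
        As an illustration of how such logged queries can be mined (a minimal sketch; the index and field names are hypothetical placeholders, not the actual ATLAS schema), an aggregation over the Elasticsearch cluster might look like this:

        # Illustrative sketch only: aggregate logged Frontier queries per task over the
        # last day; index and field names are hypothetical placeholders.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://es-cluster.example:9200"])
        query = {
            "size": 0,
            "query": {"range": {"@timestamp": {"gte": "now-24h"}}},
            "aggs": {
                "per_task": {
                    "terms": {"field": "taskid", "size": 20},
                    "aggs": {"mean_query_time": {"avg": {"field": "querytime_ms"}}},
                }
            },
        }
        result = es.search(index="frontier-logs-*", body=query)
        for bucket in result["aggregations"]["per_task"]["buckets"]:
            print(bucket["key"], bucket["doc_count"], bucket["mean_query_time"]["value"])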

        Speaker: Elizabeth Gallas (University of Oxford (GB))
      • 78
        Parallel Event Selection Performance on HPC Systems

        In their measurement of the neutrino oscillation parameters (PRL 118, 231801
        (2017)), NOvA uses a sample of approximately 27 million reconstructed spills to
        search for electron-neutrino appearance events. These events are stored in an
        n-tuple format, in 180 thousand ROOT files. File sizes range from a few hundred KiB to a
        few MiB; the full dataset is approximately 3 TiB. These millions of events are
        reduced to a few tens of events by the application of strict event selection
        criteria, and then summarized by a handful of numbers each, which are used in
        the extraction of the neutrino oscillation parameters.

        The NOvA event selection code is currently a serial C++ program that reads
        these n-tuples. The current table data format and organization and the
        selection/reduction processing involved provide us with an opportunity to
        explore alternate approaches to represent the data and implement the
        processing. We represent our n-tuple data in an HDF5 format that is optimized for
        the HPC environment and which allows us to use the machine's high-performance
        parallel I/O capabilities. We use MPI, numpy and h5py to implement our approach
        and compare the performance with the existing approach. We study the
        performance implications of using thousands of small files of different sizes
        as compared with one large file using HPC resources. This work has been done as
        part of the SciDAC project, “HEP analytics on HPC” in collaboration with the
        ASCR teams at ANL and LBNL.
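
        As an illustration of the event-level parallelism described above (a minimal sketch; the file, dataset and selection are hypothetical and this is not the SciDAC code), each MPI rank can read its own slice of an HDF5 column and apply the selection locally:

        # Illustrative sketch only: parallel selection over an HDF5 event table with
        # mpi4py + h5py (requires h5py built against parallel HDF5); dataset name,
        # column and cut are hypothetical placeholders.
        from mpi4py import MPI
        import h5py
        import numpy as np

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        with h5py.File("events.h5", "r", driver="mpio", comm=comm) as f:
            column = f["events/energy"]          # hypothetical per-event column
            n = column.shape[0]
            lo, hi = rank * n // size, (rank + 1) * n // size
            energy = column[lo:hi]               # each rank reads only its slice

        selected = np.count_nonzero(energy > 2.0)        # local event selection
        total = comm.reduce(selected, op=MPI.SUM, root=0)
        if rank == 0:
            print("selected events:", total)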

        Speaker: Holger Schulz (Fermilab)
      • 79
        The ATLAS & Google "Data Ocean" Project for the HL-LHC era

        With the LHC High Luminosity upgrade the workload and data management systems are facing new major challenges. To address those challenges ATLAS and Google agreed to cooperate on a project to connect Google Cloud Storage and Compute Engine to the ATLAS computing environment. The idea is to allow ATLAS to explore the use of different computing models, to allow ATLAS user analysis to benefit from the Google infrastructure, and to give Google real science use cases to improve their cloud platform. Making the output of a distributed analysis from the grid quickly available to the analyst is a difficult problem. Redirecting the analysis output to Google Cloud Storage can provide an alternative, faster solution for the analyst. First, Google's Cloud Storage will be connected to the ATLAS Data Management System Rucio. The second part aims to let jobs run on Google Compute Engine, accessing data from either ATLAS storage or Google Cloud Storage. The third part involves Google implementing a global redirection between their regions to expose Google Cloud Storage as a single global entity. The last part will deal with the economic model necessary for sustainable cloud resource usage, including Google Cloud Storage costs, network costs, and peering costs with ESnet.

        Speaker: Mario Lassnig (CERN)
    • T5 - Software development: S2 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Mr Barthelemy Von Haller (CERN)
      • 80
        A tale of two programs: driving innovation in HEP computing

        In spite of the fact that HEP computing has evolved considerably over the years, the understanding of the evolution process seems to be still incomplete. There is no clear procedure to replace an established product with a new one, and most of the successful major transitions (e.g. PAW to ROOT or Geant3 to Geant4) have involved a large dose of serendipity and have caused splits in the community, while many other attempts have failed. In general, software innovation has proceeded with a relatively low efficiency, in an environment that is moreover already lacking sufficient resources for code development. The accelerating pace of evolution of the hardware and of the algorithms is certainly good news, since it offers the opportunity to address HEP's dire need of computing resources, but only if the innovation process is able to take advantage of it in a timely and efficient way. One case in point is the evolution of the simulation code, responsible for 50% of the total computing resource consumption. This talk will describe the history of the GEANTV project, aimed at designing the next generation of simulation codes. In particular, we will describe how the concept of “disruptive innovation” has been adapted to our environment and how it has successfully led to the development of a revolutionary simulation prototype while continuously feeding technological innovation into the existing product. This process, begun with a no-compromise extreme design, has now led to the merging of the development activities into the main line, hopefully leading to a smooth and efficient transition.

        Speaker: Federico Carminati (CERN)
      • 81
        Software training for the next generation of Physicists: joint experience of LHCb and ALICE

        The need for good software training is essential in the HEP community. Unfortunately, current training is non-homogeneous and the definition of a common baseline is unclear, making it difficult for newcomers to proficiently join large collaborations such as ALICE or LHCb.
        In recent years, both collaborations have started separate efforts to tackle this issue through training workshops: the Analysis Tutorials (organized by the ALICE Juniors since 2014) and the StarterKit (organized by LHCb students since 2015).
        In 2017, ALICE and LHCb joined efforts for the first time to provide combined training by identifying common topics, such as version control systems (Git) and programming languages (e.g. Python). Given the positive experience and feedback, this collaboration will be repeated in the future.
        We will illustrate the teaching methods, experience and feedback from our first common training workshop. We will also discuss our efforts to extend our format to other HEP experiments for future iterations.

        Speakers: Dario Berzano (CERN), Chris Burr (University of Manchester (GB))
      • 82
        Exploring polyglot software frameworks in ALICE with FairMQ and fer

        In order to meet the challenges of the Run-3 data rates and volumes, the ALICE collaboration is merging the online and offline infrastructures into a common framework: ALICE-O2.
        O2 is based on FairRoot and FairMQ, a message-based, multi-threaded and multi-process control framework.
        In FairMQ, processes (possibly on different machines) exchange data via message queues either through 0MQ or nanomsg.
        FairMQ allows building sophisticated topologies of processes: router/dealer, request/reply, publish/subscribe, client/server, etc.
        In turn, this enables developers to write their reconstruction or analysis process in whatever language they choose or deem appropriate for the task at hand, as long as that programming language can send and receive data through these message queues.

        This paper presents "fer", a Go-based library compatible and interoperable with FairMQ.
        We will start with a brief introduction to the built-in features that make Go a solid choice when dealing with I/O and concurrency.
        We will then describe the principal components of fer and how they interact with C++ FairMQ.
        Finally, we will report on the performance (CPU, VMem) of fer and conclude with the main figures of merit of fer in the context of deployment in a distributed computing setup.

        Speaker: Dr Sebastien Binet (IN2P3/LPC)
      • 83
        HEPCon - A Cross-Platform Mobile Application for HEP Events

        Collaboration in research is essential, as it saves time and money. The field of high-energy physics (HEP) is no different: the higher the level of collaboration, the stronger the community. The HEP field encourages organizing events of various formats and sizes, such as meetings, workshops and conferences. Making attendance at a HEP event easier fosters cooperation and dialogue, and this is what makes the Indico service a de facto community standard.

        We would like to present HEPCon, a cross-platform mobile application which collects all information available on Indico on a portable device. It keeps most of the data locally, which speeds up the interaction. HEPCon uses a shared codebase which allows easy multi-platform development and support. There are iOS and Android implementations available for free download.

        The project is based on C# and we use the Xamarin mobile app technology for building native iOS and Android apps. An SQLite database is responsible for retrieving and storing conference data. The app can be used to preview data from past CHEP conferences, but the tool is implemented generically enough to support other kinds of Indico events.

        Speaker: Mr Martin Vasilev (University of Plovdiv)
      • 84
        Application extensibility and customization - Indico's case

        Indico is a general-purpose event management system currently in use by more than 150 institutions worldwide. Despite having been born at CERN and primarily adopted by the High Energy Physics community, it has recently gained adoption in other communities (for example, the United Nations and its agencies) and received the attention of commercial vendors worldwide. This growth in adoption implied finding better ways to customize the system to the needs of each individual organization. The application is written in Python and uses the Flask web framework, as well as SQLAlchemy as an ORM to PostgreSQL.

        Indico 2.0 brings improvements that make it an easily configurable application, starting from a simpler installation process and a customizable user interface using CSS and a templating language, and ending with a plugin system providing easy solutions for developing custom plugins, which allow functionality to be added without changes to the application's core. The project’s philosophy has always been that of “simplicity by default” and “optional complexity”.

        In this paper we will explain in detail the mechanisms we put in place in order to make Indico more extensible, configurable and easy to customize; we will describe some of the technical challenges that creating a feature-complete yet flexible web application involves and some of the technical solutions we have found and which can be reused by others. We will also detail the approaches that led to the development of flask-pluginengine and flask-multipass, two standalone modules that implement most of the application's extensibility and which can be reused by any Flask application. Finally, we will go over the parts of the tool suite which powers Indico, including the alembic-based migration utility and the asset generating pipeline, and analyze the implications this work had on them.

        Speaker: Pedro Ferreira (CERN)
      • 85
        Extending ROOT through Modules

        The ROOT software framework is foundational for the HEP ecosystem, providing capabilities such as IO, a C++ interpreter, GUI, and math libraries. It uses object-oriented concepts and build-time modules to layer between components. We believe additional layering formalisms will benefit ROOT and its users.

        We present the modularization strategy for ROOT, which aims to formalize the description of existing source modules, make the dependencies and other metadata available outside the build system, and allow post-install additions of functionality in the runtime environment. Modules can then be grouped into packages, installable from external repositories as a post-install step to deliver missing packages. This provides a mechanism for the wider software ecosystem to interact with a minimalistic install. Reducing intra-module dependencies improves maintainability and code hygiene. We believe helping maintain the smallest “base install” possible will help embedding use cases.

        The modularization effort draws inspiration from the Java, Python, and Swift ecosystems. Keeping aligned with modern C++, this strategy relies on forthcoming features such as C++ modules. We hope formalizing the module layer will provide simpler ROOT installs, improve extensibility, and decrease the complexity of embedding ROOT in other ecosystems.

        Speaker: Oksana Shadura (University of Nebraska Lincoln (US))
      • 86
        A Python upgrade to the GooFit package for parallel fitting

        The GooFit highly parallel fitting package for GPUs and CPUs has been substantially upgraded in the past year. Python bindings have been added to allow simple access to the fitting configuration, setup, and execution. A Python tool to write custom GooFit code given a (compact and elegant) MINT3/AmpGen amplitude description allows the corresponding C++ code to be written quickly and correctly. New PDFs have been added. The most recent release was built on top of the December 2017 2.0 release that added easier builds, new platforms, and a more robust and efficient underlying function evaluation engine.
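
        As an illustration of the Python bindings (a minimal sketch of a one-dimensional Gaussian fit; the class and method names follow the GooFit 2.x Python API, but exact signatures should be treated as assumptions and checked against the package documentation):

        # Illustrative sketch only: a simple Gaussian fit through the GooFit Python
        # bindings; treat the exact constructor signatures as assumptions.
        import numpy as np
        from goofit import Observable, Variable, GaussianPdf, UnbinnedDataSet, FitManager

        x = Observable("x", -5, 5)
        mean = Variable("mean", 0.2, 0.01, -1, 1)
        sigma = Variable("sigma", 1.1, 0.01, 0.5, 2.0)

        data = UnbinnedDataSet(x)
        for value in np.clip(np.random.normal(0.0, 1.0, 10000), -5, 5):
            x.value = value
            data.addEvent()

        gauss = GaussianPdf("gauss", x, mean, sigma)
        gauss.setData(data)
        FitManager(gauss).fit()
        print(mean.value, sigma.value)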

        Speaker: Henry Fredrick Schreiner (University of Cincinnati (US))
      • 87
        Vectorization of ROOT Mathematical Libraries

        In order to take full advantage of new computer architectures and to satisfy the requirement of minimising CPU usage with the increasing amount of data to analyse, parallelisation and SIMD vectorisation have been introduced in the ROOT mathematical libraries. The VecCore library provides a very convenient solution to abstract SIMD vectorisation and it has been found extremely useful for integrating vectorisation into existing software. VecCore features a simple API for writing SIMD-enabled algorithms, avoiding the direct use of lower-level vectorisation libraries. Furthermore, VecCore supports several back-ends such as Vc and UME::SIMD.
        We describe first how VecCore is integrated into ROOT and how it has been applied to implement SIMD vectorisation in some of the most commonly used mathematical functions. We report as well how SIMD vectorisation has been deployed for describing model functions, which can be used for fitting. In particular, we show how ROOT functions built with a string expression, such as those based on the ROOT TFormula class, or functions provided by users, can be vectorised and used to speed up fitting on large data sets. We report performance figures from benchmarks on different architectures, such as Intel Xeon and Xeon Phi processors, showing the gains from using SIMD parallelism.

        Speaker: Lorenzo Moneta (CERN)
    • T6 - Machine learning and physics analysis: S2 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Convener: Sofia Vallecorsa (Gangneung-Wonju National University (KR))
      • 88
        Study of Neural Network Size Requirements for Approximating Functions Relevant to HEP

        Neural networks, and recently, specifically deep neural networks, are attractive candidates for machine learning problems in high energy physics because they can act as universal approximators. With a properly defined objective function and sufficient training data, neural networks are capable of approximating functions for which physicists lack sufficient insight to derive an analytic, closed-form solution. There are, however, a number of challenges that can prevent a neural network from achieving a sufficient approximation of the desired function. One of the chief challenges is that there is currently no fundamental understanding of the size (both in terms of the number of layers and the number of nodes per layer) necessary to approximate any given function. Networks that are too small are doomed to fail, and networks that are too large often encounter problems converging to an acceptable solution or develop issues with overtraining. In an attempt to gain some intuition, we performed a study of neural network approximations of functions known to be relevant to high energy physics, such as calculating the invariant mass from momentum four-vector components, or calculating the momentum four-vector of a parent particle from the four-vectors of its decay products. We report on the results of those studies and discuss possible future directions.
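
        As an illustration of the kind of study described (a minimal sketch on synthetic data; the architecture, sample size and training settings are arbitrary placeholders, not those used in the study), a small dense network can be trained to approximate the invariant mass of a pair of massless particles from their four-vector components:

        # Illustrative sketch only: regress the invariant mass of two massless particles
        # from their four-vectors; data and network size are arbitrary placeholders.
        import numpy as np
        from tensorflow import keras

        rng = np.random.default_rng(0)
        n = 100_000
        p = rng.uniform(-50, 50, size=(n, 6))                 # (px, py, pz) of two particles
        e1 = np.linalg.norm(p[:, 0:3], axis=1)                # massless: E = |p|
        e2 = np.linalg.norm(p[:, 3:6], axis=1)
        X = np.hstack([e1[:, None], p[:, 0:3], e2[:, None], p[:, 3:6]])
        # Target: m^2 = (E1 + E2)^2 - |p1 + p2|^2
        m = np.sqrt(np.maximum((e1 + e2) ** 2
                               - np.linalg.norm(p[:, 0:3] + p[:, 3:6], axis=1) ** 2, 0.0))

        model = keras.Sequential([
            keras.layers.Dense(64, activation="relu", input_shape=(8,)),
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        model.fit(X, m, epochs=5, batch_size=256, validation_split=0.1)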

        Speaker: Jessica Stietzel (College of the Holy Cross)
      • 89
        Adversarial event generator tuning with Bayesian Optimization

        High Energy Physics experiments often rely on Monte-Carlo event generators. Such generators often contain a large number of parameters and need fine-tuning to closely match experimentally observed data. This task traditionally requires expert knowledge of the generator and the experimental setup as well as vast computing power. Generative Adversarial Networks (GANs) are a powerful method to match the distribution of samples produced by a parametrized generator to a set of observations. Following the recently proposed study on adversarial variational optimization of a non-differentiable generator, we adopt Bayesian Optimization as an efficient gradient-free optimization method for adversarial fine-tuning of event generators. The proposed method requires minimal prior knowledge; nevertheless, it allows expert insights to be straightforwardly incorporated into the method. In this talk, we briefly describe a theoretical approach to the problem and show the results for parameter tuning of the PYTHIA event generator.
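
        As an illustration of gradient-free tuning (a minimal sketch with a toy objective; the real case would run the event generator and an adversarial discriminator rather than the simple discrepancy measure used here):

        # Illustrative sketch only: Bayesian Optimization (scikit-optimize) of two
        # "generator" parameters against a toy discrepancy with an observed sample.
        import numpy as np
        from skopt import gp_minimize

        observed = np.random.normal(loc=1.5, scale=0.8, size=5000)   # stand-in for data

        def discrepancy(params):
            loc, scale = params
            generated = np.random.normal(loc=loc, scale=scale, size=5000)
            # Toy objective: distance between summary statistics of the two samples.
            return abs(observed.mean() - generated.mean()) + abs(observed.std() - generated.std())

        result = gp_minimize(discrepancy, dimensions=[(0.0, 3.0), (0.1, 2.0)],
                             n_calls=40, random_state=0)
        print("best parameters:", result.x, "objective:", result.fun)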

        Speaker: Mr Maxim Borisyak (National Research University Higher School of Economics)
      • 90
        Anomaly detection using Deep Autoencoders for the assessment of the quality of the data acquired by the CMS experiment

        The certification of the CMS data as usable for physics analysis is a crucial task to ensure the quality of all physics results published by the collaboration. Currently, the certification conducted by human experts is labor intensive and can only be segmented on a run-by-run basis. This contribution focuses on the design and prototype of an automated certification system assessing data quality on a per-luminosity-section (i.e. 23 seconds of data taking) basis. Anomalies caused by detector malfunctions or sub-optimal reconstruction are unpredictable and occur rarely, making it difficult to use classical supervised classification methods such as feedforward neural networks. We base our prototype on a semi-supervised model which employs deep autoencoders. This approach has been qualified successfully on CMS data collected during the 2016 LHC run: we demonstrate its ability to detect anomalies with high accuracy and low fake rate, when compared against the outcome of the manual certification by experts. A key advantage of this approach over other ML technologies is the great interpretability of the results, which can be further used to ascribe the origin of problems in the data to a specific sub-detector or physics object.
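
        As an illustration of the semi-supervised idea (a minimal sketch on synthetic inputs; the feature vector, network size and threshold are placeholders, not the CMS prototype), an autoencoder is trained on good luminosity sections and anomalies are flagged by their reconstruction error:

        # Illustrative sketch only: a dense autoencoder trained on "good" lumisection
        # feature vectors; inputs with large reconstruction error are flagged as anomalous.
        import numpy as np
        from tensorflow import keras

        rng = np.random.default_rng(0)
        X_good = rng.normal(size=(20_000, 50))     # stand-in for per-lumisection DQM features

        inputs = keras.Input(shape=(50,))
        encoded = keras.layers.Dense(16, activation="relu")(inputs)
        decoded = keras.layers.Dense(50)(encoded)
        autoencoder = keras.Model(inputs, decoded)
        autoencoder.compile(optimizer="adam", loss="mse")
        autoencoder.fit(X_good, X_good, epochs=5, batch_size=256, validation_split=0.1)

        def is_anomalous(x, threshold=2.0):
            """Flag a lumisection whose reconstruction error exceeds the chosen threshold."""
            err = np.mean((autoencoder.predict(x[None, :], verbose=0) - x) ** 2)
            return err > threshold

        print(is_anomalous(rng.normal(size=50) + 5.0))   # a shifted input is likely flagged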

        Speaker: Adrian Alan Pol (Université Paris-Saclay (FR))
      • 91
        Online detector monitoring using AI: challenges, prototypes and performance evaluation for automation of online quality monitoring of the CMS experiment exploiting machine learning algorithms.

        Online Data Quality Monitoring (DQM) in High Energy Physics experiments is a key task which, nowadays, is extremely expensive in terms of human resources and required expertise.

        We investigate machine learning as a solution for automatised DQM. The contribution focuses on the peculiar challenges posed by the requirement of setting up and evaluating the AI algorithms in the online environment; it also presents the successful application of modern machine learning techniques, in particular deep learning, to concrete examples of detector monitorables (e.g. based on the Muon Spectrometer) integrated in the production DQM infrastructure of CMS.

        This pioneering work paves the way to the automation of many of the tasks currently performed in the online DQM system, allowing large volumes of data to be checked in real time, improving the ability to detect unexpected failures and simultaneously reducing the manpower requirements.

        Speaker: Adrian Alan Pol (Université Paris-Saclay (FR))
      • 92
        ATLAS Analytics and Machine Learning Platforms

        In 2015 ATLAS Distributed Computing started to migrate its monitoring systems away from Oracle DB and decided to adopt new big data platforms that are open source, horizontally scalable, and offer the flexibility of NoSQL systems. Three years later, the full software stack is in place, the system is considered to be in production and is operating at near maximum capacity (in terms of storage capacity and tightly coupled analysis capability). The new model provides several tools for fast and easy-to-deploy monitoring and accounting. The main advantages are: ample ways to do complex analytics studies (using technologies such as Java, Pig, Spark, Python, Jupyter), flexibility in the reorganization of data flows, and near-real-time and inline processing. The analytics studies improve our understanding of different computing systems and their interplay, thus enabling whole-system debugging and optimization. In addition, the platform provides services to alarm or warn on anomalous conditions, and several services closing feedback loops with the Distributed Computing systems. Here we briefly describe the main system components and data flows, but concentrate on the hardware and software tools we use for in-depth analytics/simulations and on support for machine learning algorithms, specifically artificial neural network training and reinforcement learning techniques. We describe several applications the platform enables, and discuss ways to scale it up further.

        Speaker: James Catmore (University of Oslo (NO))
      • 93
        REANA: A System for Reusable Research Data Analyses

        The revalidation, reinterpretation and reuse of research data analyses require access to the original computing environment, the experimental datasets, the analysis software, and the computational workflow steps which were used by the researcher to produce the original scientific results in the first place.

        REANA (=Reusable Analyses) is a nascent platform enabling researchers to structure their research data analyses in view of enabling future reuse. The analysis is described by means of a YAML file that captures sufficient information about the analysis assets, parameters and processes. The REANA platform offers a set of micro-services that permit launching and monitoring container-based computational workflow jobs on the cloud. The REANA user interface and the command-line client enable researchers to easily rerun analysis workflows with new input parameters. The REANA platform aims at supporting several container technologies (Docker), workflow engines (CWL, Yadage), shared storage systems (Ceph, EOS) and compute cloud infrastructures (Kubernetes/OpenStack, HTCondor) used by the community.

        REANA was developed with the particle physics use case in mind and profits from synergies with general reusable research data analysis patterns in other scientific disciplines, such as bioinformatics and life sciences.

        Speaker: Tibor Simko (CERN)
      • 94
        Continuous Analysis Preservation and Streamlining for the ATLAS Experiment

        We present recent work within the ATLAS collaboration to centrally provide tools that facilitate analysis management and highly automated container-based analysis execution, in order both to enable non-experts to benefit from these best practices and to allow the collaboration to track and re-execute analyses independently, e.g. during their review phase.

        Through integration with the ATLAS GLANCE system, users can request a pre-configured, but customizable version control setup, including continuous integration for automated build and testing as well as continuous Linux Container image building for software preservation purposes.

        As analyses typically require many individual steps, analysis workflow pipelines can then be defined using such images and the yadage workflow description language. The integration into the workflow execution service REANA allows the interactive or automated reproduction of the main analysis results by orchestrating a large number of container jobs using Kubernetes.

        For long-term archival, we present integration with the CERN Analysis Preservation Portal (CAP), where data, analysis repositories, software images and workflows can be stored and indexed for later re-use such as reinterpretations.

        Speaker: Lukas Alexander Heinrich (New York University (US))
      • 95
        Allocation Optimization for the ATLAS Rebalancing Data Service

        The distributed data management system Rucio manages all data of the ATLAS collaboration across the grid. Automation such as replication and rebalancing is an important part of ensuring minimal workflow execution times. In this paper, a new rebalancing algorithm based on machine learning is proposed. First, it can run independently of the existing rebalancing mechanism and can be modularised. It collects data from other services and learns optimality as it runs in the background. Periodically this learning agent takes a subset of the global datasets and proposes them for redistribution to reduce waiting times. The user can interact and choose to accept, decline, or override the dataset placement suggestions. The accepted items are shifted continuously between destination data centres as a background service while taking network and storage utilisation into account.

        Speaker: Ralf Vamosi (CERN)
    • T8 - Networks and facilities: S2 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Conveners: Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno), Sang Un Ahn (Korea Institute of Science & Technology Information (KR))
      • 96
        Improving WLCG Networks Through Monitoring and Analytics

        WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, has focused on collecting, storing and making available all the network-related metrics for further analysis and discovery of issues that might impact network performance and operations. This has been facilitated by the already existing network of perfSONAR instances that have been commissioned to operate in production. In order to help sites and experiments better understand and fix networking issues, the WLCG Network Throughput working group was formed; it works on the analysis and integration of the network-related monitoring data collected by the OSG/WLCG infrastructure and operates a support unit to help find and fix network performance issues.
        In this talk we'll describe the current state of the OSG network measurement platform and summarise the activities undertaken by the working group, focusing mainly on the throughput issues that have been reported and resolved during the recent period with the help of the perfSONAR network. We will also cover updates on the few higher-level services that were developed to help bring the platform to its full potential, and describe the currently ongoing analytical activities related to networking and their results. Finally, we will discuss and propose possible R&D areas to help improve our networks in general as well as prepare the infrastructure for the foreseen major changes in the way networks will be provisioned and operated in the future.

        Speaker: Edoardo Martelli (CERN)
      • 97
        IPv6 in production: its deployment and usage in WLCG

        The fraction of general internet traffic carried over IPv6 continues to grow rapidly. The transition of WLCG central and storage services to dual-stack IPv4/IPv6 is progressing well, thus enabling the use of IPv6-only CPU resources as agreed by the WLCG Management Board and presented by us at CHEP2016. By April 2018, all WLCG Tier 1 data centres will provide access to their services over IPv6. The LHC experiments have requested all WLCG Tier 2 centres to provide dual-stack access to their storage by the end of LHC Run 2.

        For several years the HEPiX IPv6 Working Group has been testing WLCG services to ensure their IPv6 compliance. Many Tier 2s are now successfully running their services in dual-stack mode, but a large number are still not. The working group, driven by the requirements of the LHC VOs to be able to use IPv6-only opportunistic resources, continues to encourage wider deployment of dual-stack services and has been monitoring the transition.

        We will present the progress of the transition to IPv6. This includes making experiment central services dual stack as well as the storage and data services. The monitoring, accounting and information services that are used by jobs are also being upgraded. Finally, we will assess the timetable for being able to turn off IPv4 completely at a WLCG data centre, this being the point at which the transition can be declared truly complete.

        Speaker: David Kelsey (STFC-Rutherford Appleton Laboratory (GB))
      • 98
        Using Kytos SDN platform to enhance international big data transfers

        Data-intensive science collaborations still face challenges when transferring large data sets between globally distributed endpoints. Many issues need to be addressed to orchestrate the network resources in order to better exploit the available infrastructure. In multi-domain scenarios, the complexity increases because network operators rarely export the network topology to researchers and users, resulting in slow inter-domain circuit provisioning. The LHC at CERN and the LSST are two examples of such scientific initiatives.

        Kytos SDN Platform is a new solution, developed at the Sao Paulo Research and Analysis Center (SPRACE) at Unesp, that enables a fast way to deploy an SDN infrastructure. Kytos was designed to be easy to install, use, develop and share via Network Apps (NApps). A circuit provisioning application has been developed on top of the Kytos platform, known as the "Kytos MEF E-Line". It is a service type defined by the Metro Ethernet Forum for connecting exactly two User Network Interfaces (UNI), so they can communicate only with each other. This NApp can be used to schedule a circuit with a minimum bandwidth before the beginning of a specific data transfer between two endpoints. We present in detail the Kytos open source SDN platform and introduce the "Kytos MEF E-Line" NApp, developed in collaboration with AmLight. We also show how this can be integrated with FTS via a GFAL plugin in order to allow the provisioning of end-to-end circuits over a Software-Defined Network.

        Speaker: Beraldo Costa Leal (UNESP - Universidade Estadual Paulista (BR))
      • 99
        Integration and evaluation of QUIC and TCP-BBR in long-haul WLCG data transfers

        Recent years have seen the mass adoption of streaming in mobile computing, an increase in size and frequency of bulk long-haul data transfers
        in science in general, and the usage of big data sets in job processing
        demanding real-time long-haul accesses that can be greatly affected by
        variations in latency. It has been shown in the Physics and climate research communities that the need to exchange petabytes of data with
        global collaborators can be seriously hampered by the TCP congestion
        control and latency. Demands for faster and lower-latency transfers have
        also been driven by the increasing need for encryption both in mobile
        computing and in computational science.

        Two recent and promising additions to the internet protocols are TCP-BBR and QUIC. BBR implements a congestion policy that promises
        better control of TCP bottlenecks on long-haul transfers. TCP-BBR is
        implemented in Linux kernels above 4.9. It has been shown, however,
        to demand some fine tuning in its interaction, for example, with the Linux
        Fair Queue. QUIC, on the other hand, replaces HTTP and TLS with a
        protocol on top of UDP and a thin layer to serve HTTP. It has been
        reported to account today for 7% of Google’s traffic. It hasn’t been used
        in server-to-server transfers, even if its creators see that as a real possibility.
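
        As an illustration of how an application can opt in to BBR on a per-connection basis on Linux (a minimal sketch; it assumes a kernel of at least 4.9 with the tcp_bbr module available, and the endpoint shown is hypothetical):

        # Illustrative sketch only: request the BBR congestion control algorithm on a
        # single TCP socket (Linux >= 4.9 with the tcp_bbr module loaded).
        import socket

        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
        print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))
        # sock.connect(("transfer-endpoint.example", 1094))   # hypothetical endpoint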

        Our work evaluates the applicability and tuning of TCP-BBR and
        QUIC for WLCG and data science transfers. We describe the integration
        of each of them into the transfer tool iperf and the xroot protocol. Possibly for the first time, a server-to-server deployment of QUIC is presented,
        together with a performance evaluation of both QUIC and TCP-BBR on
        long-haul transfers involving WLCG servers.

        Speaker: Raul Cardoso Lopes (Brunel University (GB))
      • 100
        Integrating Networking into ATLAS

        Networking is foundational to the ATLAS distributed infrastructure and there are many ongoing activities related to networking both within and outside of ATLAS. We will report on the progress in a number of areas exploring ATLAS's use of networking and our ability to monitor the network, analyze metrics from the network, and tune and optimize application and end-host parameters to make the most effective use of the network. Specific topics will include work on Open vSwitch for production systems, network analytics, FTS testing and tuning, and network problem alerting and alarming.

        Speaker: Shawn McKee (University of Michigan (US))
      • 101
        Entry Stage for the CBM First-level Event Selector

        The First-level Event Selector (FLES) is the main event selection
        system of the upcoming CBM experiment at the future FAIR facility in
        Germany. As the central element, a high-performance compute
        cluster analyses free-streaming, time-stamped data delivered from the
        detector systems at rates exceeding 1 TByte/s and selects data
        for permanent storage.
        While the detector systems are located in a dedicated CBM building, the
        FLES compute resources will be hosted in the central Green IT Cube
        data center approx. 700 meters away from the experiment. This demands
        single-mode optics and protocols not susceptible to delay. Previous
        concepts therefore included an FPGA-based concentrator stage and custom
        optical links to the FLES cluster.
        Exploiting the newest network technologies allows for a more flexible, yet
        cost-effective design using COTS equipment for the long-haul link. A
        FLES entry cluster, placed near the detector, will terminate detector
        links in custom PCIe cards. It aggregates processing components and
        delivers them via a specially configured InfiniBand network to the main
        compute cluster. The network design facilitates flexible staging and
        local processing for testing and detector commissioning purposes.
        An overview of the FLES entry cluster and network architecture, as well
        as the latest performance measurements, will be presented.

        Speaker: Dirk Hutter (Johann-Wolfgang-Goethe Univ. (DE))
      • 102
        Netbench – large-scale network device testing with real-life traffic patterns

        Network performance is key to the correct operation of any modern datacentre infrastructure or data acquisition (DAQ) system. Hence, it is crucial to ensure the devices employed in the network are carefully selected to meet the required needs.

        The established benchmarking methodology [1,2] consists of various tests that create perfectly reproducible traffic patterns. This has the advantage of being able to consistently assess the performance differences between various devices, but comes at the disadvantage of always using known, pre-defined traffic patterns (frame sizes and traffic distribution) that do not stress the buffering capabilities of the devices to the same extent as real-life traffic would.
        Netbench is a network-testing framework, relying on commodity servers and NICs, that aims at overcoming the previously mentioned shortcoming. While not providing identical conditions for every test, netbench enables assessing a device's behaviour when handling multiple TCP flows, which closely resembles real-life usage, e.g.:
        - a full-mesh traffic distribution [3] is most likely the best benchmarking pattern for a network device to be employed in a multi-purpose data-centre;
        - a unidirectional partial-mesh traffic distribution [3] will closely mimic the event-building traffic pattern from a DAQ system.

        Due to the prohibitive cost of specialized hardware equipment that implements RFC tests [1,2], few companies/organisations can afford a large-scale test setup. The compromise that is often employed is to use two hardware tester ports and feed the same traffic to the device multiple times through loop-back cables (the so-called “snake test”). This test fully exercises the per-port throughput capabilities, but barely stresses the switching engine of the device. The per-port price of a netbench test setup is significantly smaller than that of a testbed made using specialized hardware, especially if we take into account the fact that generic servers can be time-shared between netbench and day-to-day usage. Thus, a large-scale multi-port netbench setup is affordable, and enables organisations/companies to complement the snake test with benchmarks that stress-test the switching fabric of network devices.

        Netbench has a layered architecture and uses standard technologies. At its core, it relies on iperf3 [4] as an engine to drive TCP flows between servers, but it can easily be extended to support other traffic generators. The orchestration platform that sets up multiple iperf3 sessions is written in Python and relies on XML-RPC for fast provisioning of flows. Per-flow statistics are gathered into a PostgreSQL database, and the results visualisation is based on a Python REST API and a web page using JavaScript and the D3.js library for displaying graphs. Statistics are presented at different levels of detail, allowing the human tester to quickly assess the overall state of a test from both per-node and per-pair (source-destination) statistics.
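
        As an illustration of the orchestration idea (a minimal sketch, not the netbench code itself; the host names are placeholders and flows are driven over ssh rather than XML-RPC), iperf3 flows can be launched between node pairs and their JSON output collected:

        # Illustrative sketch only: drive iperf3 flows between node pairs and collect
        # per-flow throughput from iperf3's JSON output; hosts are placeholders.
        import json
        import subprocess

        pairs = [("wn01.example", "wn02.example"), ("wn03.example", "wn04.example")]
        results = {}

        for src, dst in pairs:
            # Run the iperf3 client on the source host, targeting the destination's iperf3 server.
            out = subprocess.run(
                ["ssh", src, "iperf3", "-c", dst, "-t", "30", "-P", "4", "--json"],
                capture_output=True, text=True, check=True,
            ).stdout
            stats = json.loads(out)
            results[(src, dst)] = stats["end"]["sum_received"]["bits_per_second"] / 1e9

        for (src, dst), gbps in results.items():
            print(f"{src} -> {dst}: {gbps:.2f} Gbit/s")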

        During its last call for tender for high-end routers, CERN has employed netbench for evaluating the behaviour of network devices when exposed to full and partial mesh TCP traffic. We will present sample results from the evaluation. Furthermore, during the evaluation it became apparent that, due to the temporary congestion caused by competing TCP flows, netbench provides a good estimation of the devices’ buffering capabilities.

        To summarize, we present netbench, a tool that allows provisioning TCP flows with various traffic distributions (pairs, partial and full-mesh). We consider netbench an essential complement to synthetic RFC tests [1][2], as it enables affordable, large-scale testing of network devices with traffic patterns that closely resemble real-life conditions.

        [1] RFC 2544, Bradner, S. and McQuaid J., "Benchmarking Methodology for Network Interconnect Devices"
        [2] RFC 2889, Mandeville, R. and Perser J., "Benchmarking Methodology for LAN Switching Devices"
        [3] RFC 2285, Mandeville, R., "Benchmarking Terminology for LAN Switching Devices"
        [4] iperf3 http://software.es.net/iperf/

        Speaker: Stefan Nicolae Stancu (CERN)
      • 103
        Long-term experiences in keeping balance between safety and usability in research activities in KEK

        We provide the KEK general purpose network to support various kinds of research activities in the fields of high-energy physics, material physics, and accelerator physics. Since the end of the 20th century, cyber attacks on the network have occurred on an almost daily basis, and attack techniques change rapidly and drastically. In such circumstances, we constantly face difficult tradeoffs and are required to keep a balance between safety and usability in research activities. While restriction of network services/applications and network monitoring may improve safety, too much restriction must be avoided and network monitoring should be done with full consideration.

        In this talk, we present our long-term experiences in keeping a balance between safety and usability in KEK research activities. The key points are ensuring traceability and security management in a reasonable manner. We have been operating security devices such as firewalls, an intrusion detection system, and a vulnerability management system to realize this balance. In addition, CSIRT (Computer Security Incident Response Team) activities and cooperative activities among research organizations are presented.

        Speaker: Tadashi Murakami (KEK)
    • 16:00
      Break
    • Plenary: S2 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Marco Cattaneo (CERN)
    • Social events: Welcome drink Central Military Club

      Central Military Club

      Tsar Osvoboditel Blvd 7, 1000 Sofia Center
    • Plenary: S3 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Maria Girone (CERN)
      • 108
        Next generation generative neural networks for HEP

        Initial studies have suggested that generative adversarial networks (GANs) have promise as fast simulations within HEP. These studies, while promising, have been insufficiently precise and also, like GANs in general, suffer from stability issues. We apply GANs to generate full particle physics events (not individual physics objects), and to generate large weak-lensing cosmology convergence maps. We explore recent developments in convergence, such as ‘optimal transport’ GANs; explore representations that match the structure of the detector better than images; and evaluate the precision and generalisability of the generated datasets.
        In addition we demonstrate a framework making use of distributed computing on the Cori supercomputer at NERSC, launched via interactive jupyter notebook sessions. This allows for tackling high-resolution detector data, model selection and hyper-parameter tuning in a productive yet scalable deep learning environment.

        Speaker: Steven Andrew Farrell (Lawrence Berkeley National Lab. (US))
      • 109
        Synthesizing Machine Learning algorithms on FPGA

        Machine learning methods are becoming ubiquitous across particle physics. However, the exploration of such techniques in low-latency environments like L1 trigger systems has only just begun. We present here new software, based on High Level Synthesis (HLS), to generically port several kinds of network models (BDTs, DNNs, CNNs) into FPGA firmware. As a benchmark physics use case, we consider the task of tagging high-pT jets as H->bb candidates using jet substructure. We map out resource usage and latency versus the types of machine learning algorithms and their hyper-parameters. We present a set of general practices to efficiently design low-latency machine-learning algorithms on FPGAs.

        Speaker: Jennifer Ngadiuba (INFN, Milano)
      • 110
        The obsolescence of Information and Information Systems

        The CERN Digital Memory project was started in 2016 with the main goal of preventing the loss of historical content produced by the organisation. The first step of the project targeted the risk of deterioration of the most vulnerable materials, mostly the multimedia assets created in analogue formats from 1954 to the late 1990s, such as still and moving images on films or magnetic tapes. In parallel, today's best practices to guarantee a long life for digital content, either born digital or resulting from a digitization process, were studied. While traditional archives and libraries have grown up over centuries, establishing recognized standards to deal with the preservation of printed content, the field of digital archiving is in its infancy.

        This talk will briefly present the most surprising challenges encountered when migrating hundreds of thousands of audio recordings, slides, negatives, videotapes and films from the analogue to the digital era. It will then describe how a Digital Memory platform is being built that conforms to the principles of the ISO 16363 standard for digital object management, which defines trustworthy digital repositories. Finally, as all information repository managers are faced with the necessary migration of underlying systems and the obsolescence of the information itself, the talk will explain how a digital archiving platform focused solely on content preservation can be of direct interest for most live systems.

        Speaker: Jean-Yves Le Meur (CERN)
    • 10:30
      Break
    • T1 - Online computing: S3 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Ryosuke Itoh (KEK)
      • 111
        Simultaneous usage of the LHCb HLT farm for Online and Offline processing workflows

        LHCb is one of the four LHC experiments and continues to revolutionise data acquisition and analysis techniques. Two years ago the concepts of “online” and “offline” analysis were already unified: the calibration and alignment processes take place automatically in real time and are used in the triggering process, such that online data are immediately available offline for physics analysis (Turbo analysis). Since then, the computing capacity of the HLT farm has been used simultaneously for different workflows: the synchronous first-level trigger, the asynchronous second-level trigger, and Monte Carlo simulation. Thanks to a common software environment, LHCb can switch seamlessly and quickly between online and offline workflows, run them simultaneously, and thus maximize the usage of the HLT farm computing resources.

        Speaker: Joel Closier (CERN)
      • 112
        Alignment and Calibration of the Belle II Detector

        In spring 2018 the SuperKEKB electron-positron collider at the High Energy Accelerator Research Organization (KEK, Tsukuba, Japan) will deliver its first collisions to the Belle II experiment. The aim of Belle II is to collect a data sample 50 times larger than that of the previous generation of B-factories, taking advantage of the unprecedented SuperKEKB design luminosity of 8x10^35 cm^-2 s^-1. The Belle II detector will allow precise measurements to be conducted in the harsh collider environment, probing for signs of physics beyond the standard model at the precision frontier.

        In order to deliver data suitable for physics analysis, the detector has to be properly calibrated on a regular basis. Among other calibrations, the detector alignment plays a key role. For example, precise measurements of time-dependent CP violation rely on the accurate alignment of the new vertex detector, as well as on the determination of the beamspot position and size. To automate the calibration procedures and manage the large amount of data and processing power needed for detector calibration, a software framework has been developed which allows the complete workflow to be defined and executed on a computing cluster. The framework integrates the Millepede II algorithm to solve the large minimization problem emerging in the track-based alignment and calibration of the pixel and strip detectors, the central drift chamber, and the muon system.

        The first collision data will allow the alignment and calibration procedures to be tested, further improved and tuned. Although the vertexing capabilities will be limited because only a small slice of the full vertex detector is installed, the commissioning phase will allow most of the alignment procedure features to be exercised and will prepare for full operation.
        We will present the results achieved during the first data taking, the experience gained, and the plans for the first physics run with the full detector.

        Speaker: Tadeas Bilka (Charles University, Prague)
      • 113
        Spy-Agency service for monitoring of automated real-time calibration workflows and steering the injection of prompt reconstruction workflows at the CMS experiment

        The calibration of the detector in almost real time is key to the exploitation of the large data volumes at the LHC experiments. For this purpose the CMS collaboration has deployed a complex machinery involving several components of the processing infrastructure and of the conditions DB system. Accurate reconstruction of data starts only once all the calibrations become available for consumption, and it relies on continuous and detailed monitoring of the calibration machinery and of the physics performance of its products. This monitoring task requires aggregating, digesting, and reacting upon information from all the different components, which are based on very heterogeneous technologies: the Tier-0 processing farm, the Oracle-based conditions DB, the data quality monitoring framework, and various other logging and bookkeeping services. A dedicated application has been designed and deployed which spies data from these various sources, presents them on a web-based interface, and communicates to the computing infrastructure the readiness for reconstruction of any chunk of data. The presentation reports on the design choices and operational experience of this new tool.

        Speaker: Arun Kumar (National Taiwan University (TW))
      • 114
        LHCb full-detector real-time alignment and calibration: latest developments and perspectives

        A key ingredient of the data taking strategy used by the LHCb experiment in Run-II is the novel real-time detector alignment and calibration. Data collected at the start of the fill are processed within minutes and used to update the alignment, while the calibration constants are evaluated hourly. This is one of the key elements which allow the reconstruction quality of the software trigger in Run-II to be as good as the offline quality of Run-I.
        The most recent developments of the real-time alignment and calibration paradigm enable the fully automated updates of the RICH detectors' mirror alignment and a novel calibration of the calorimeter systems. Both evolutions improve the particle identification performance stability resulting in higher purity selections. The latter leads also to an improvement in the energy measurement of neutral particles, resulting in a 15% better mass resolution of radiative b-hadron decays.
        A large variety of improvements has been explored for the last year of Run-II data taking and is under development for the LHCb detector upgrade foreseen in 2021. These range from the optimization of the data sample selection and strategy to the study of a more accurate magnetic field description. Technical and operational aspects as well as performance achievements are presented, focusing on the new developments for both the current and upgraded detector.

        Speaker: Chris Burr (University of Manchester (GB))
      • 115
        Data handling in the ALICE O2 event processing

        The ALICE experiment at the Large Hadron Collider (LHC) at CERN is planned to be operated in a continuous data-taking mode in Run 3. This will allow data from all collisions to be inspected at a rate of 50 kHz for Pb-Pb, giving access to rare physics signals embedded in a large background.

        Based on experience with real-time reconstruction of particle trajectories and event properties in the ALICE High Level Trigger, the ALICE O2 facility is currently being designed and developed to support the processing of a continuous, trigger-less stream of data segmented into so-called time frames.

        ALICE O2 uses the ALFA software framework and message queue communication to distribute workload among many processes running on multiple compute nodes. After aggregation of all data forming a time frame on an Event Processing Node (EPN), algorithms can efficiently navigate through large data sets and add new data to the data stream. Both data format and software framework have been designed with emphasis on lightweight data organization and optimized in-memory data format, supporting an efficient pipelined data processing.

        In this contribution, the organization of data reconstruction in the O2 event processing together with prototype studies for detector reconstruction and performance measurements will be presented.

        Speaker: Matthias Richter (University of Oslo (NO))
      • 116
        Improving the use of data quality metadata via a partnership of technologies and resources between the CMS experiment at CERN and industry

        The CMS experiment dedicates a significant effort to supervising the quality of its data, both online and offline. Real-time data quality (DQ) monitoring is in place to spot and diagnose problems as promptly as possible and to avoid data loss. The a posteriori evaluation of processed data is designed to categorize the data in terms of their usability for physics analysis. These activities produce DQ metadata.
        The DQ evaluation relies on a visual inspection of monitoring features. This practice has a high cost in terms of human resources and is naturally subject to human judgement. Potential limitations are linked to the ability to spot a problem within the overwhelming number of quantities to monitor, or to the understanding of evolving detector conditions.
        In view of Run 3, CMS aims at integrating deep learning techniques into the online workflow to promptly recognize and identify anomalies and to improve the precision of the DQ metadata.
        The CMS experiment has engaged in a partnership with IBM with the objective of supporting the online operations with automation and of generating technological benchmarking results. The presentation will cover the research goals, agreed within the CERN Openlab framework, how they matured into a demonstration application, and how they are being achieved through a collaborative contribution of technologies and resources.

        Speaker: Virginia Azzolini (Massachusetts Inst. of Technology (US))
    • T2 - Offline computing: S3 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Victor Daniel Elvira (Fermi National Accelerator Lab. (US))
      • 117
        AREUS - a software framework for ATLAS Readout Electronics Upgrade Simulation

        The design of readout electronics for the LAr calorimeters of the ATLAS detector to be operated at the future High-Luminosity LHC (HL-LHC) requires a detailed simulation of the full readout chain in order to find optimal solutions for the analog and digital processing of the detector signals. Due to the long duration of the LAr calorimeter pulses relative to the LHC bunch crossing time, out-of-time signal pile-up needs to be taken into account and realistic pulse sequences must be simulated together with the response of the electronics. For this purpose, the ATLAS Readout Electronics Upgrade Simulation framework (AREUS) has been developed based on the Observer design pattern to provide a fast and flexible simulation tool. Energy deposits in the LAr calorimeters from fully simulated HL-LHC collision events are taken as input. Simulated and measured analog pulse shapes proportional to these energies are then combined in discrete time series with proper representation of electronics noise. Analog-to-digital conversion, gain selection and digital signal processing are modeled at bit precision, including digitization noise and detailed electronics effects. In this way signal processing techniques can be optimized with respect to physics parameters like reconstructed energy and signal time in each channel. Finally, trigger and object reconstruction algorithms are taken into account in the optimization process. The software implementation of AREUS, the concepts of its main functional blocks and examples of obtained simulation results will be presented.

        Speaker: Nico Madysa (Technische Universitaet Dresden (DE))
      • 118
        Fitting fixed target and ion collisions in the LHCb Gauss simulation framework

        The LHCb experiment is a fully instrumented forward spectrometer designed for precision studies in the flavour sector of the standard model with proton-proton collisions at the LHC. As part of its expanding physics programme, LHCb also collected data during the LHC proton-nucleus collisions in 2013 and 2016 and during nucleus-nucleus collisions in 2015. All the collected datasets are unique, owing to the peculiar pseudo-rapidity range not covered by any other LHC experiment. Furthermore, in 2015 LHCb commissioned the internal gas target SMOG, becoming the only LHC experiment with a fixed-target physics programme. Each of these particular collision conditions required a different operational setup, as well as dedicated simulation productions based on heavy-ion Monte Carlo event generators and on interface extensions of the standard LHCb simulation framework. In this talk, we present the work done to implement such a variety of simulation productions for heavy-ion collisions and to validate the produced samples. The future perspectives of heavy-ion collision simulations at LHCb will also be discussed.

        Speaker: Patrick Robbe (Université Paris-Saclay (FR))
      • 119
        Opticks : GPU Optical Photon Simulation for Particle Physics with NVIDIA OptiX

        Opticks is an open source project that integrates the NVIDIA OptiX GPU ray tracing engine with Geant4 toolkit based simulations. Massive parallelism brings drastic performance improvements, with the optical photon simulation speedup expected to exceed 1000 times Geant4 on workstation GPUs.

        Optical physics processes of scattering, absorption, reemission and boundary interactions are implemented as CUDA OptiX programs based on the Geant4 implementations. Wavelength-dependent material and surface properties, as well as inverse cumulative distribution functions for reemission, are interleaved into GPU textures, providing fast interpolated property lookup or wavelength generation. OptiX handles the creation and application of a choice of acceleration structures, such as bounding volume hierarchies, and the transparent use of multiple GPUs.

        A major recent advance is the implementation of GPU ray tracing of complex constructive solid geometry shapes, enabling automated translation of Geant4 geometries to the GPU without approximation. Using common initial photons and random number sequences allows the Opticks and Geant4 simulations to be run point-by-point aligned. Aligned running has reached near perfect equivalence with test geometries.

        Speaker: Dr Simon Blyth (National Taiwan University)
      • 120
        Progress of Geant4 electromagnetic physics developments and applications

        We report developments of the Geant4 electromagnetic (EM) physics sub-packages for Geant4 release 10.4 and beyond. Modifications have been introduced to the models of the photo-electric effect, bremsstrahlung, gamma conversion, and multiple scattering. Important developments for calorimetry applications were carried out for the modeling of single and multiple scattering of charged particles. Corrections to the scattering of positrons and to the sampling of displacement have recently been added to the default Geant4 Urban model. The fully theory-based Goudsmit-Saunderson (GS) model for electron/positron multiple scattering has recently been reviewed and a new, improved version is available. For testing novel calorimeters we provide a configuration of electron scattering based on the GS model or on the single scattering (SS) model instead of the Urban model. In addition, the GS model with Mott corrections enabled is included in the EM physics constructor recommended for accurate simulations. This EM configuration provides the most accurate results for the scattering of electrons and positrons. To improve usability, a new set of User Interface (UI) commands and corresponding C++ interfaces have been added for easier configuration of EM physics. In particular, the photo-absorption ionisation model may be enabled per detector region using the corresponding UI command. Interfaces to material properties have been improved, including the density-effect parameterization for compounds and the optical parameters of materials. The latter modification allows a significant speed-up of the simulation of optical photon transport. These new interfaces and models are already used in LHC applications and may be useful for any type of simulation. We discuss their effective usage for various applications, including possible improvements for the simulation of LHC and dark matter experiments.

        Speaker: Vladimir Ivantchenko (CERN)
      • 121
        Towards full electromagnetic physics vectorization in the GeantV transport framework

        The development of the GeantV Electromagnetic (EM) physics package has evolved following two necessary paths towards code modernization. A first phase required the revision of the main electromagnetic physics models and their implementation. The main objectives were to improve their accuracy, extend them to the new high-energy frontiers posed by the Future Circular Collider (FCC) programme and allow a better adaptation to a multi-particle flow. Most of the EM physics models in GeantV have been reviewed from a theoretical perspective and rewritten with vector-friendly implementations, and are now available in scalar mode in the alpha release. The second phase consists of a thorough investigation of the possibility to vectorise the most CPU-intensive parts of the physics code, such as final state sampling. We have shown the feasibility of implementing electromagnetic physics models that take advantage of SIMD/SIMT architectures, thus obtaining gains in performance. After this phase, the time has come for the GeantV project to take a step forward towards the final proof of concept. This takes shape through the testing of the full simulation chain (transport + physics + geometry) running in vectorised mode. In this paper we present the first benchmark results obtained after vectorizing a set of electromagnetic physics models, starting from the photoelectric effect.

        Speaker: Dr Marilena Bandieramonte (CERN)
      • 122
        A top to bottom framework approach to vectorization

        SIMD acceleration can potentially boost application throughput by large factors. However, achieving efficient SIMD vectorization for scalar code with complex data flow and branching logic goes way beyond breaking loop dependencies and relying on the compiler. Since the re-factoring effort scales with the number of lines of code, it is important to understand what kind of performance gains can be expected in such complex cases. The GeantV R&D started, a couple of years ago, a top-to-bottom vectorization approach to particle transport simulation. Percolating multiple data to algorithms was mandatory since not all the components offer natural internal vectorization capability. Vectorizing low-level algorithms such as position/direction geometry classifiers or field propagators was certainly necessary, but not sufficient to achieve relevant SIMD gains. Overheads for maintaining the concurrent vector data flow and for data copying had to be minimized. GeantV developed a framework allowing different categories of scalar and vectorized components to co-exist, dealing with data flow management and real-time heuristic optimizations. The paper will describe our approach to coordinating SIMD vectorization at the framework level, with a detailed quantitative analysis of the SIMD gain versus overheads, broken down by component in terms of geometry, physics and magnetic field propagation. The more general context of the GeantV work and the goals for 2018 will also be presented.

        Speaker: Andrei Gheata (CERN)
    • T3 - Distributed computing: Testing, Monitoring and Accounting Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: Julia Andreeva (CERN)
      • 123
        EGI Dataset Accounting and the WLCG

        While the WLCG and EGI have both made significant progress towards solutions for storage space accounting, one area that is still quite exploratory is that of dataset accounting. This type of accounting would enable resource centre and research community administrators to report on dataset usage to the data owners, data providers, and funding agencies. Eventually, decisions could be made about the location and storage of datasets to make more efficient use of the infrastructure. By giving insight into data usage, dataset accounting also assists scientists in assessing the impact of their work.

        This paper reviews the status of the prototype dataset accounting developed during EGI-Engage and how it could be used to complement the view that the WLCG has of its datasets. This is a new feature of the EGI resource accounting system that will enable storing information on dataset usage such as who has accessed a dataset and how often, the transfer volumes and end points etc. The design of this new feature has been led by the users' requirements collected in the first part of the project, from which a set of dataset accounting metrics were derived. In these trials, the EGI Accounting Repository was integrated with the data provider Onedata (the underlying technology powering the EGI Open Data Platform and EGI DataHub) as an example of a generic data provider.

        Speaker: Mr Adrian Coveney (STFC)
      • 124
        GRACC: GRid ACcounting Collector

        The OSG has long maintained a central accounting system called Gratia. It uses small probes on each computing and storage resource in order to collect usage. The probes report to a central collector which stores the usage records in a database. The database is then queried to generate reports. As the OSG aged, the size of the database grew very large and became too large for the database technology to query efficiently when generating detailed reports.

        The design of a replacement required data storage that could be queried efficiently to generate multi-year reports. Additionally, it required the flexibility to add new attributes to the collected data.

        In this paper we will describe the GRACC architecture. GRACC uses modern web technologies that were designed for large data storage, query, and visualization. That includes the open source database Elasticsearch, message broker software RabbitMQ, and Grafana and Kibana as data visualization platforms. It uses multiple agents that perform operations on the data to transform it for easier querying and summarization.
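
        As an illustration of that pipeline, the following hedged sketch shows how a usage record might be published to RabbitMQ and later summarised from Elasticsearch; the queue and index names, record fields and endpoints are assumptions, not the actual GRACC schema.

        ```python
        import json
        import pika
        from elasticsearch import Elasticsearch

        # publish a single usage record to RabbitMQ (hypothetical queue name and record fields)
        record = {"ProbeName": "condor:ce.example.edu", "VOName": "osg",
                  "WallDuration": 3600, "Njobs": 1}
        connection = pika.BlockingConnection(pika.ConnectionParameters(host="collector.example.org"))
        channel = connection.channel()
        channel.queue_declare(queue="gracc.raw.records", durable=True)
        channel.basic_publish(exchange="", routing_key="gracc.raw.records", body=json.dumps(record))
        connection.close()

        # later, aggregate wall time per VO from Elasticsearch (hypothetical index name)
        es = Elasticsearch(["http://elasticsearch.example.org:9200"])
        query = {
            "size": 0,
            "aggs": {"per_vo": {"terms": {"field": "VOName"},
                                "aggs": {"wall_time": {"sum": {"field": "WallDuration"}}}}},
        }
        summary = es.search(index="gracc.raw-*", body=query)
        ```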

        Speaker: Brian Paul Bockelman (University of Nebraska Lincoln (US))
      • 125
        Evolution of HammerCloud to commission CERN Compute resources

        HammerCloud is a testing service and framework to commission, run continuous tests or on-demand large-scale stress tests, and benchmark computing resources and components of various distributed systems with realistic full-chain experiment workflows.

        HammerCloud, used by the ATLAS and CMS experiments in production, has been a useful service for commissioning both compute resources and various components of the complex distributed systems of the LHC experiments, as well as an integral part of the monitoring suite that is essential for the computing operations of the experiments and their automation.

        In this contribution we review recent developments of the HammerCloud service that allow the HammerCloud infrastructure to be used to test Data Centre resources in the early phases of the infrastructure and services commissioning process. One of the benefits we believe HammerCloud can provide is the ability to tune the commissioning of new infrastructure, with functional and stress testing as well as benchmarking with "standard candle" workflows and experiment-realistic workloads that can be heavy on CPU, I/O, IOPS, or everything together. This extension of HammerCloud has been successfully used by CERN IT during the prototype phase of the "BEER" Batch on EOS (Evaluation of Resources) project, and is being integrated with the continuous integration/continuous deployment suite for Batch service VMs.

        Speaker: Jaroslava Schovancova (CERN)
      • 126
        Monitoring system for the Belle II distributed computing

        Belle II is an asymmetric-energy e+e- collider experiment at KEK, Japan. It aims to reveal physics beyond the standard model with a data set of about 5×10^10 BB^bar pairs and starts its physics run in 2018. In order to store such a huge amount of data, including simulated events, and analyze it in a timely manner, Belle II has adopted a distributed computing model based on DIRAC (Distributed Infrastructure with Remote Agent Control).

        The monitoring system for Belle II computing has been developed as an extension of DIRAC. It collects and investigates various pieces of information related to job processing or data transfer stored in the DIRAC database, and then visualizes them. We have also developed a system which regularly performs accessibility tests of the various components, such as the computing/storage elements, the database storing calibration information, and the DIRAC servers. Finally, the system makes a total health check of the Belle II computing using all of the investigation and test results. Detected issues are displayed in a single place so that even a non-expert shifter can easily find the problem.

        In this contribution, we will present the details of the monitoring system as well as the experience gained during the simulation data production campaign and the first few months of data taking.

        Speaker: Yuji Kato
      • 127
        Towards the integrated ALICE Online-Offline (O2) monitoring subsystem

        ALICE (A Large Ion Collider Experiment) is preparing for a major upgrade of the detector, readout system and computing for LHC Run 3. A new facility called O2 (Online-Offline) will play a major role in data compression and event processing. To efficiently operate the experiment, we are designing a monitoring subsystem, which will provide a complete overview of the O2 overall health, detect performance degradation and component failures. The monitoring subsystem will receive and collect up to 600 kHz of performance parameters. It consists of a custom monitoring library and a server-side, distributed software covering five main functional tasks: parameter collection and processing, storage, visualization and alarms.
        To select the most appropriate tools for these tasks, we evaluated three server-side systems: MonALISA, Zabbix and “Modular stack”. The latter one consists of a toolkit including collectd, Apache Flume, Apache Spark, InfluxDB, Grafana and Riemann.
        This paper describes the monitoring subsystem functional architecture. It goes through a complete evaluation of the three considered solutions, the selection processes, risk assessment and justification for the final decision. The in-depth comparison includes functional features, latency and throughput measurement to ensure the required processing and storage performance.
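
        For the “Modular stack” option, a minimal sketch of how one performance parameter could be pushed into InfluxDB for later visualisation in Grafana is given below; the host, database, measurement and tag names are assumptions.

        ```python
        from influxdb import InfluxDBClient

        # connect to a hypothetical InfluxDB instance used by the monitoring subsystem
        client = InfluxDBClient(host="monitoring.example.org", port=8086,
                                database="o2_monitoring")

        # one data point: a performance parameter tagged with its origin
        point = {
            "measurement": "readout_rate",
            "tags": {"node": "epn001", "detector": "TPC"},
            "fields": {"value": 42.5},
        }
        client.write_points([point])
        ```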

        Speaker: Adam Wegrzynek (CERN)
      • 128
        CRIC: a unified information system for WLCG and beyond

        The WLCG Information System (IS) is an important component of a huge, heterogeneous distributed infrastructure. Considering the evolution of LHC computing towards the high-luminosity era, and analyzing the experience accumulated by the computing operations teams and the limitations of the current information system, the WLCG IS evolution task force proposed the development of the Computing Resource Information Catalogue (CRIC), which aims to consolidate WLCG topology information and to complement it with the experiment-specific configuration required by the computing operations, workload management and data management systems of the LHC experiments. CRIC represents an evolution of the ATLAS Grid Information System (AGIS), generalized to serve not only ATLAS but any other virtual organization relying on a large-scale distributed infrastructure, as well as the WLCG at the global scope.

        The contribution describes the CRIC architecture and the implementation of its data models, collectors, UIs/APIs and advanced fine-grained authentication/authorization.

        Speaker: Alexey Anisenkov (Budker Institute of Nuclear Physics (RU))
    • T4 - Data handling: S3 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Convener: Costin Grigoras (CERN)
      • 129
        Conditions evolution of an experiment in mid-life, without the crisis (in ATLAS)

        The ATLAS experiment is approaching mid-life: the long shutdown period (LS2) between LHC Runs 1 and 2 (ending in 2018) and the future collision data-taking of Runs 3 and 4 (starting in 2021). In advance of LS2, we have been assessing the future viability of existing computing infrastructure systems. This will permit changes to be implemented in time for Run 3. In systems with broad impact such as the conditions database, making assessments now is critical as the full chain of operations from online data-taking to offline processing can be considered: evaluating capacity at peak times, looking for bottlenecks, identifying areas of high maintenance, and considering where new technology may serve to do more with less.

        We have been considering changes to the ATLAS conditions database related storage and distribution infrastructure based on similar systems of other experiments. We have also examined how new technologies may help and how we might provide more RESTful services to clients. In this presentation, we give an overview of the identified constraints and considerations, and our conclusions for the best way forward: balancing preservation of critical elements of the existing system with the deployment of the new technology in areas where the existing system falls short.

        Speaker: Lorenzo Rinaldi (Universita e INFN, Bologna (IT))
      • 130
        Performance of the Belle II Conditions Database

        The Belle II experiment at KEK is preparing for first collisions in early 2018. Processing the large amounts of data that will be produced requires conditions data to be readily available to systems worldwide in a fast and efficient manner that is straightforward for both the user and maintainer. This was accomplished by relying on industry-standard tools and methods: the conditions database is built as an HTTP REST service using tools such as Swagger for the API interface development, Payara for the Java EE application server, and Squid for the caching proxy. This talk will present the design of the Belle II conditions database production environment as well as go into detail about the capabilities and performance during both Monte Carlo campaigns and data reprocessing.
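
        A hedged sketch of what a client-side lookup against such an HTTP REST service could look like is shown below; the endpoint URL, resource paths and parameters are illustrative assumptions, not the actual Belle II API.

        ```python
        import requests

        BASE = "https://conditions.belle2.example.org/v1"   # hypothetical service endpoint

        # list the available global tags, then fetch the payloads valid for a given run
        tags = requests.get(f"{BASE}/globalTags", timeout=10).json()
        payloads = requests.get(
            f"{BASE}/globalTags/main_2018/payloads",
            params={"experiment": 3, "run": 1234},          # illustrative identifiers
            timeout=10,
        ).json()
        ```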

        Speaker: Lynn Wood (Pacific Northwest National Laboratory, USA)
      • 131
        The Open High Throughput Computing Content Delivery Network

        LHC experiments make extensive use of Web proxy caches, especially for software distribution via the CernVM File System and for conditions data via the Frontier Distributed Database Caching system. Since many jobs read the same data, cache hit rates are high and hence most of the traffic flows efficiently over Local Area Networks. However, it is not always possible to have local Web caches, particularly for opportunistic cases where experiments have little control over site services. The Open High Throughput Computing (HTC) Content Delivery Network (CDN), openhtc.io, aims to address this by using Web proxy caches from a commercial CDN provider. Cloudflare provides a simple interface for registering DNS aliases of any Web server and does reverse proxy Web caching on those aliases. The openhtc.io domain is hosted on Cloudflare's free tier CDN which has no bandwidth limit and makes use of data centers throughout the world, so the average performance for clients is much improved compared to reading from CERN or a Tier 1. The load on WLCG servers is also greatly reduced. WLCG Web Proxy Auto Discovery is used to select local Web caches when they are available and otherwise use openhtc.io caching. This paper describes the Open HTC CDN in detail and provides initial results from its use for USCMS opportunistic computing and LHC@Home.

        Speaker: Dave Dykstra (Fermi National Accelerator Lab. (US))
      • 132
        A new mechanism to use the Conditions Database REST API to serve the ATLAS detector description

        An efficient and fast access to the detector description of the ATLAS experiment is needed for many tasks, at different steps of the data chain: from detector development to reconstruction, from simulation to data visualization. Until now, the detector description was only accessible through dedicated services integrated into the experiment's software framework, or by the usage of external applications. In this work we explore the possibility of using a web access-based conditions database to store and serve the detector description, aiming at a simplification of the software architecture of the experiment and a reduction in the number of software packages to be maintained. We will also share the lessons learned while developing the new system to efficiently serve an experiment's geometry data to clients through a REST API.

        Speaker: Alessandro De Salvo (Sapienza Universita e INFN, Roma I (IT))
      • 133
        A Git-based Conditions Database backend for LHCb

        LHCb has been using the CERN/IT-developed Conditions Database library COOL for several years, during LHC Run 1 and Run 2. With the opportunity window of the second long shutdown of the LHC, in preparation for Run 3 and the upgraded LHCb detector, we decided to investigate alternatives to COOL as the Conditions Database backend. In particular, given our conditions and detector description data model, we investigated the possibility of reusing the internal Git repository database as conditions storage, and we have adopted it since the 2017 data taking. The adoption of Git gave us improved performance, smaller storage size and simplified maintenance and deployment.
        In this paper we describe the implementation of our Git Conditions Database and the way it simplified our detector description and conditions development workflow.
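
        The idea of reading a condition payload directly from a Git repository can be sketched as follows with GitPython; the repository path, tag name and file layout are hypothetical, and the real LHCb implementation is integrated into the detector description framework rather than written this way.

        ```python
        import git  # GitPython

        # open a hypothetical local clone of the conditions repository
        repo = git.Repo("/path/to/conditions-db")

        # a tag (or commit) identifies a consistent snapshot of all conditions
        snapshot = repo.commit("alignment-2017-v1")

        # navigate the tree to the condition file and read its content
        blob = snapshot.tree / "Conditions" / "Velo" / "Alignment.yml"
        payload = blob.data_stream.read().decode("utf-8")
        ```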

        Speaker: Marco Clemencic (CERN)
      • 134
        EventDB: an event indexer and caching system for BESIII experiment

        The Beijing Spectrometer (BESIII) experiment has produced hundreds of billions of events and has collected the world's largest data samples of J/ψ, ψ(3686), ψ(3770) and ψ(4040) decays. The typical branching fractions for interesting physics channels are of the order of O(10^-3). The traditional event-wise access of BOSS (BES Offline Software System) is not effective for such selective access at the level of O(10^-3). We proposed and implemented an event indexer system, EventDB, which extracts the characteristics of all events and stores them in HBase. Currently, the characteristics contain 22 variables, including BeamEnergy, NTracks, NShowers, BeamVx, BeamVy, BeamVz and so on. We adopted inverted-index technology to store the event index in HBase, and the number of index entries is less than 1/1000 of the number of events. We also put real event data into HBase to cache hot events and accelerate event access. With EventDB, the data analysis workflow of the BESIII experiment changes as follows: the analysis program first queries the corresponding event index from the database, then gets the event data from the database if the event is cached, or from the DST file using the ROOT framework if it is not. Finally, tests on a one-billion-event dataset showed that the query speed was improved by more than a factor of 10.
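
        A hedged sketch of the two-step access pattern, using the happybase Python client for HBase, is shown below; the table names, row-key layout and column family are illustrative assumptions rather than the actual EventDB schema.

        ```python
        import happybase

        connection = happybase.Connection("hbase.example.org")   # hypothetical Thrift gateway
        index_table = connection.table("bes3_event_index")
        event_table = connection.table("bes3_event_cache")

        # step 1: scan the inverted index for events matching a selection (illustrative key scheme)
        matching_keys = [key for key, _ in index_table.scan(row_prefix=b"NTracks=4|")]

        # step 2: fetch cached hot events from HBase; otherwise fall back to the DST file via ROOT
        for key in matching_keys:
            row = event_table.row(key)
            if row:
                event = row[b"d:payload"]     # cached event data
            else:
                pass                          # read the event from its DST file with ROOT
        ```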

        Speaker: Yaodong Cheng (Chinese Academy of Sciences (CN))
    • T5 - Software development: S3 - Concurrency Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Barthelemy Von Haller (CERN)
      • 135
        I/O in the ATLAS multithreaded framework

        Scalable multithreading poses challenges to I/O, and the performance of a thread-safe I/O strategy
        may depend upon many factors, including I/O latencies, whether tasks are CPU- or I/O-intensive, and thread count.
        In a multithreaded framework, an I/O infrastructure must efficiently supply event data to and collect it from many threads processing multiple events in flight.
        In particular, on-demand reading from multiple threads may challenge caching strategies that were developed for serial processing and may need to be enhanced.
        This I/O infrastructure must also address how to read, make available, and propagate in-file metadata and other non-event data needed as context for event processing.

        We describe the design and scheduling of I/O components in the ATLAS multithreaded control framework, AthenaMT, for both event and non-event I/O.
        We discuss issues associated with exploiting the multithreading capabilities of our underlying persistence technology, ROOT,
        in a manner harmonious with the ATLAS framework’s own approach to thread management.
        Finally, we discuss opportunities for evolution and simplification of I/O components that have successfully supported ATLAS event processing for many years
        from their serial incarnations to their thread-safe counterparts.

        Speaker: Jack Cranshaw (Argonne National Laboratory (US))
      • 136
        Hydra: A framework for data analysis in massively parallel platforms

        Hydra is a templatized, header-only, C++11-compliant library for data analysis on massively parallel platforms targeting, but not limited to, the field of High Energy Physics research.
        Hydra supports the description of particle decays via the generation of phase-space Monte Carlo, generic function evaluation, data fitting, multidimensional adaptive numerical integration and histogramming.
        Hydra is open source and the code is hosted in GitHub.

        The library deploys a series of techniques in order to achieve optimal performance in both computing and management of memory resources. The overall design heavily exploits C++ variadic templates in order to implement static polymorphism, kernel fusion and coalesced memory access patterns, completely avoiding the use of function pointers, virtual method calls and other constructs known to potentially degrade performance.

        Hydra is developed on top of the Thrust library, runs on Linux systems, and can transparently use NVidia CUDA-enabled GPUs as well as multicore CPUs and accelerators.

        This contribution summarizes the main features of Hydra. A basic description of the user interface and some examples of applications are provided, along with measurements of performance in a variety of environments.

        Speaker: Antonio Augusto Alves Junior (University of Cincinnati (US))
      • 137
        Supporting Future HEP Data Processing with a Parallelised ROOT

        In the coming years, HEP data processing will need to exploit parallelism on present and future hardware resources to sustain the bandwidth requirements.
        As one of the cornerstones of the HEP software ecosystem, ROOT embraced an ambitious parallelisation plan which delivered compelling results.
        In this contribution the strategy is characterised as well as its evolution in the medium term.
        We discuss the units of the ROOT framework where task and data parallelism have been introduced, with runtime and scaling measurements, and give an overview of concurrent operations in ROOT, for instance in the areas of I/O (reading and writing of data), fitting / minimization, and data analysis.
        The presentation introduces the programming model and use cases for explicit and implicit parallelism, where the former is explicit in user code and the latter is implicitly managed by ROOT internally.

        Speaker: Danilo Piparo (CERN)
      • 138
        Implementing Concurrent Non-Event Transitions in CMS

        Since the beginning of LHC Run 2 in 2016 the CMS data processing framework, CMSSW, has been running with multiple threads during production of data and simulation via the use of Intel's Threading Building Blocks (TBB) library. The TBB library utilizes tasks as concurrent units of work. CMS used these tasks to allow both concurrent processing of events as well as concurrent running of modules processing the same event. This design has served CMS well and has allowed jobs to use less memory per core as well as reducing the number of jobs that must be tracked by CMS's workflow management system. As CMS has begun to utilize a greater number of threads per job, the effect of serialization points in the framework has decreased the jobs' CPU efficiency. One major serialization point occurs when the framework processes a non-Event transition. These transitions occur when a new Run or LuminosityBlock is to be processed. In this talk we will discuss how the different transitions define the processing model for CMSSW and how we were able to successfully allow concurrent processing of those transitions using TBB via task queues. We will also show CPU efficiency comparisons between the same work being done with and without the additional concurrency.

        Speaker: Christopher Jones (Fermi National Accelerator Lab. (US))
      • 139
        Writing ROOT Data in Parallel with TBufferMerger

        The LHC experiments produce tens of petabytes of new data in ROOT format per year that need to be processed and analysed. In the next decade, following the planned upgrades of the LHC and the detectors, this rate is expected to increase at least ten-fold.
        Therefore, optimizing the ROOT I/O subsystem is of critical importance to the success of the LHC physics programme. This contribution presents ROOT's approach to writing data from multiple threads to the same file in a highly efficient way. Technical aspects of its implementation, the TBufferMerger class, as well as examples of its programming model are described. Runtime performance and writing rate measurements, and the associated improvements with respect to the single-writer case, are discussed in the context of standalone examples and production parallel data processing workflows such as CMS reconstruction.
        The interplay of parallel dataset writing with other parallelisation mechanisms in ROOT is discussed, together with the future developments of the strategy aiming to maximise the experiments’ data processing throughput.

        Speaker: Guilherme Amadio (CERN)
      • 140
        A modular software framework for test-beam data analysis

        In high energy physics experiments, silicon detectors are often subjected to a harsh radiation environment, especially at hadron colliders. Understanding the impact of radiation damage on the detector performance is an indispensable prerequisite for successful operation throughout the lifetime of the planned experiment.

        A dedicated irradiation programme followed by detailed studies with particle beams is essential for the proper evaluation of detector prototypes and the prediction of their performance after accumulating the design fluence.

        In order to perform precise measurements with the LHCb VELO detector prototypes, a dedicated high-resolution pixel beam telescope was developed based on 8 Timepix3 detector planes. This telescope has been taking data at CERN in the PS and SPS facilities since 2014. The Timepix3 can be read out in data-driven mode with very precise timestamps, which makes triggering unnecessary.
        At the centre of the telescope a Device Under Test (DUT) can be installed on a motion stage allowing angular rotations about the y axis, and x and y translations, where the z coordinate is the beam direction. The telescope provides precise measurements of particle trajectories with a pointing resolution of ∼2 µm and a time resolution of ∼1 ns allowing in-depth analysis of the DUT performance.

        The data produced by the telescope can easily incorporate the signals from the DUT, and in particular for Timepix3 devices the analysis is straightforward. The LHCb software embedded in the so-called Kepler project performs the decoding of raw data and produces track objects inside LHCb's official Gaudi framework. In this way it is simple to produce histograms and ntuples with track and cluster information.

        Subsequently, the offline analysis compares the performance after irradiation with several fluences of protons or neutrons, from 2 to 8 × 10^15 1 MeV neq/cm2, for different silicon substrates (n-on-p or n-on-n), distances from the last pixel to the edge, guard ring designs and vendors. Charge collection efficiency (CCE), track resolution (TR) and Eta correction are of particular interest as a function of fluence.

        For the purposes described above, a modular software framework has been developed. It allows the test beam data for a set of runs, as well as a set of different DUTs (Devices Under Test), to be handled in one go, and provides an integrated workflow to present the results.

        All code is written in C++, a general-purpose object-oriented programming language. A class-based design makes it easy to add new features for the device under investigation following a plug-in scheme. Currently, the toolkit handles different types of analysis, such as CCE, TR and Eta correction, implemented for non-uniformly irradiated sensors.

        The presentation will describe the implemented analysis framework as a proposal for a prototype of a general framework design for test beam campaigns, which could be followed to quickly obtain a ROOT-based application for the complex test beam data analyses being performed in different facilities and with different irradiation sources or profiles. The interface is implemented for a series of algorithms in a user-friendly way. We believe it is a valuable complement to be used by different groups performing detector R&D programmes.

        Speaker: Bartlomiej Rachwal (AGH University of Science and Technology (PL))
    • T6 - Machine learning and physics analysis: S3 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Convener: Sofia Vallecorsa (Gangneung-Wonju National University (KR))
      • 141
        Columnar data processing for HEP analysis

        In the last stages of data analysis, only order-of-magnitude computing speedups translate into increased human productivity, and only if they're not difficult to set up. Producing a plot in a second instead of an hour is life-changing, but not if it takes two hours to write the analysis code. Fortunately, HPC-inspired techniques can result in such large speedups, but unfortunately, they can be difficult to use in a HEP setting.

        These techniques generally favor operating on columns (arrays representing a single attribute across events, rather than whole events individually), which allows data to stream predictably from disk media to main memory and finally to CPU/GPU/KNL onboard memory (e.g. L* cache) for prefetching, and sometimes allows for vectorization. However, the need to work with variable-length structures in HEP, such as different numbers of particles per event, makes it difficult to apply this technique to HEP problems.

        We will describe several new software tools to make it easier to compute analysis functions with columnar arrays in HEP: array-at-a-time I/O in ROOT ("BulkIO") and Python/Numpy ("uproot"), compiling object-oriented analysis code into columnar operations ("oamap" for "object-array mapping"), and storage solutions with columnar granularity. We will show performance plots and usage examples.
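
        As a flavour of this columnar, array-at-a-time style, a small hedged example using uproot and awkward arrays follows; the file, tree and branch names are assumptions.

        ```python
        import uproot
        import awkward as ak
        import numpy as np

        # read two jagged branches for all events at once, as columns rather than event objects
        events = uproot.open("events.root")["Events"]      # hypothetical file and tree names
        arrays = events.arrays(["Muon_pt", "Muon_eta"])

        pt, eta = arrays["Muon_pt"], arrays["Muon_eta"]
        n_muons = ak.num(pt)                                # muons per event, one array operation
        central_pt = pt[abs(eta) < 2.4]                     # selection applied to all events at once
        counts, edges = np.histogram(ak.to_numpy(ak.flatten(central_pt)),
                                     bins=50, range=(0, 200))
        ```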

        Speaker: Jim Pivarski (Princeton University)
      • 142
        Using Big Data Technologies for HEP analysis

        The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results in a timely and efficient way. Recently, new technologies and new approaches have been developed in industry to answer the need to retrieve information as quickly as possible when analyzing PB- and EB-scale datasets. Providing scientists with more modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother.

        In this talk, we present the latest developments and the most recent results on the usage of Apache Spark for HEP analysis. The study aims at evaluating the efficiency of the application of the new tools both quantitatively, by measuring the performance, and qualitatively, by focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data collected by the CMS experiment to 1 TB of data, in a format suitable for physics analysis, in 5 hours.

        The second goal is achieved by implementing multiple physics use cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses, up to the publication plots, on different hardware, the feasibility, usability and portability are compared to those of a traditional ROOT-based workflow.
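
        A hedged sketch of the kind of Spark-based reduction described above follows; the paths, column names and input format are assumptions, not the actual CMS workflow.

        ```python
        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.appName("hep-data-reduction").getOrCreate()

        # read a preprocessed dataset (hypothetical path and schema) and apply a simple skim
        events = spark.read.parquet("hdfs:///data/cms/preprocessed/events.parquet")
        skim = (events
                .filter(F.col("nMuon") >= 2)
                .filter(F.col("HLT_IsoMu24"))
                .select("run", "luminosityBlock", "event", "Muon_pt", "Muon_eta"))

        # write the reduced dataset back out for interactive analysis
        skim.write.mode("overwrite").parquet("hdfs:///user/analyst/skims/dimuon.parquet")
        ```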

        Speaker: Matteo Cremonesi (Fermi National Accelerator Lab. (US))
      • 143
        Pandas DataFrames for F.A.S.T. binned analysis at CMS

        Many analyses at CMS are based on histograms, used throughout the workflow from data validation studies to fits for physics results. Binned data frames are a generalisation of multidimensional histograms, in a tabular representation where histogram bins are denoted by category labels. Pandas is an industry-standard tool, providing a data frame implementation that allows easy access to “big data” scientific libraries, including I/O, visualisation, and machine learning tools.

        F.A.S.T. (Faster Analysis Software Taskforce) runs regular hackdays and tutorials in the UK. We present a new toolkit where Pandas DataFrames are used as the basis for binned physics analysis. We demonstrate how this engenders faster, more robust, and more flexible development, using fewer lines of code, as well as improving accessibility for newcomers.

        The toolkit is presented in the context of a typical CMS search analysis, where we look for evidence of new physics signal in a multidimensional parameter space, but is not analysis specific. The code structure is simple and modular, with built-in bookkeeping and documentation. We envisage its adoption by other UK binned analyses, and plan associated tools, helping physicists to focus more on “what” their analysis should do rather than technically “how” it is done.
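
        A small hedged example of the underlying idea, building a binned data frame with Pandas from toy events, is given below; the variable names and binning are illustrative and do not reflect the F.A.S.T. toolkit API.

        ```python
        import numpy as np
        import pandas as pd

        # toy events: one row per event with a region label and a kinematic variable
        rng = np.random.default_rng(seed=1)
        events = pd.DataFrame({
            "region": rng.choice(["signal", "control"], size=10000),
            "met":    rng.exponential(scale=80.0, size=10000),
        })

        # a binned data frame: one row per (region, MET bin), the tabular analogue of a histogram
        events["met_bin"] = pd.cut(events["met"], bins=np.arange(0, 501, 50))
        binned = (events.groupby(["region", "met_bin"])
                        .size()
                        .rename("n_events")
                        .reset_index())
        ```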

        Speaker: Dr Benjamin Krikler (University of Bristol (GB))
      • 144
        RDataFrame: Easy Parallel ROOT Analysis at 100 Threads

        The Physics programmes of LHC Run III and HL-LHC challenge the HEP community. The volume of data to be handled is unprecedented at every step of the data processing chain: analysis is no exception.
        Physicists need to be provided with first-class analysis tools that are easy to use, exploit bleeding-edge hardware technologies and allow parallelism to be expressed seamlessly.
        This contribution discusses the declarative analysis engine of ROOT, RDataFrame, and gives details about how it profitably exploits commodity hardware as well as high-end servers and manycore accelerators, thanks to the synergy with the existing parallelised ROOT components.
        Real-life analyses of LHC experiments’ data expressed in terms of RDataFrame are presented, highlighting the programming model provided to express them in a concise and powerful way. The recent developments which make RDataFrame a lightweight data processing framework are described, for example callbacks and I/O capabilities.
        The flexibility of RDataFrame and its ability to read data formats other than ROOT’s are characterised, as an example it is discussed how RDataFrame can directly read and analyze LHCb's raw data format MDF.
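
        A brief hedged example of this declarative, implicitly parallel style from Python follows; the tree, file and branch names are assumptions.

        ```python
        import ROOT

        ROOT.ROOT.EnableImplicitMT()          # let RDataFrame use all available cores

        # hypothetical tree, file and branch names
        df = ROOT.RDataFrame("Events", "data.root")
        hist = (df.Filter("nMuon == 2", "two muons")
                  .Define("pt_sum", "Muon_pt[0] + Muon_pt[1]")
                  .Histo1D(("pt_sum", ";p_{T} sum [GeV];events", 100, 0.0, 200.0), "pt_sum"))

        hist.Draw()                           # the (parallel) event loop runs lazily, only here
        ```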

        Speaker: Enrico Guiraud (CERN, University of Oldenburg (DE))
      • 145
        A further reduction in CMS event data for analysis: the NANOAOD format

        A new event data format has been designed and prototyped by the CMS collaboration to satisfy the needs of a large fraction of physics analyses (at least 50%) with a per-event size of order 1 kB. This new format is more than a factor of 20 smaller than the MINIAOD format and contains only the top-level information typically used in the last steps of the analysis. The talk will review the current analysis strategy from the point of view of the event formats used in CMS (both skims and formats such as RECO, AOD, MINIAOD, NANOAOD) and will describe the design guidelines for the new NANOAOD format.

        Speaker: Andrea Rizzi (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P)
      • 146
        Developing a Declarative Analysis Language: LINQToROOT

        The HEP community is preparing for the LHC’s Runs 3 and 4. One of the big challenges for physics analysis will be developing tools that efficiently express an analysis and are able to efficiently process the ten times more data expected. Recently, interest has focused on declarative analysis languages: a way of specifying a physicist’s intent, leaving everything else to the underlying system. The underlying system takes care of finding the data, powering the event processing loop, and even exactly how to most efficiently apply a desired jet selection. If this works, it would allow analyzers to test their algorithm on a small amount of data on their GPU-less laptop and then run it on a large amount of data on a server with multiple large GPUs without having to alter their code. The LINQToROOT project, started almost seven years ago, fits this model. It has been used and tested in three published ATLAS analyses. LINQToROOT is based on the Language Integrated Query system built into the cross-platform C# language. It enables writing strongly-typed queries on a ROOT TTree's data, and transcribes them into a C++ algorithm that can run in ROOT. Recent work on this system has had two goals: improving analysis efficiency and better understanding the requirements of a declarative analysis language. For example, a good analysis language should be able to abstract away the backend: recent work has increased the possible backends from the single Windows ROOT backend to ones that run on Linux, the Windows Subsystem for Linux, and an experimental one that allows for PROOF-like parallel processing, all with almost no change to the analysis code itself. Any analysis language must also be rich enough to support an experiment's data model. To test this, some experiments with the full ATLAS xAOD data model have been performed. All of this has been done while attempting to keep the project close to its original goals: quick turn-around for real ATLAS physics analysis. This work will be discussed in some detail, along with thoughts and lessons that have helped shape our thinking about an analysis language and perhaps our approach to future physics analysis employing declarative analysis.

        Speaker: Gordon Watts (University of Washington (US))
    • T7 - Clouds, virtualization and containers: S3 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Convener: Fabio Hernandez (IN2P3/CNRS Computing Centre)
      • 147
        Optimizing OpenStack Nova for Scientific Workloads

        The CERN OpenStack Cloud provides over 200,000 CPU cores to run data processing analyses for the Large Hadron Collider (LHC) experiments. Delivering these services with high performance and reliable service levels, while at the same time ensuring continuously high resource utilization, has been one of the major challenges for the CERN Cloud engineering team.

        Several optimizations, such as NUMA-aware scheduling and huge pages, have been deployed to improve the performance of scientific workloads, but the CERN Cloud team continues to explore new possibilities such as pre-emptible instances and containers on bare metal.

        In this presentation we will dive into the concept and implementation challenges of pre-emptible instances and containers on bare metal for scientific workloads. We will also explore how they can improve scientific workload throughput and infrastructure resource utilization.

        We will present the ongoing collaboration with the Square Kilometre Array (SKA) community to develop the necessary upstream enhancements to further improve OpenStack Nova to support large-scale scientific workloads.
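
        As an illustration of the kind of optimization mentioned above, NUMA topology and huge pages are typically exposed to guests through Nova flavor extra specs; the flavor name and values below are assumptions, not the CERN production configuration:

        # Hedged sketch: flavor extra specs enabling NUMA-aware scheduling and
        # huge pages in Nova.  Values are illustrative only.
        import subprocess

        extra_specs = {
            "hw:numa_nodes": "2",          # spread the guest across two NUMA nodes
            "hw:mem_page_size": "large",   # back guest memory with huge pages
            "hw:cpu_policy": "dedicated",  # dedicated (pinned) vCPUs
        }

        properties = " ".join(f"--property {k}={v}" for k, v in extra_specs.items())
        # Applied with the standard OpenStack CLI, e.g.:
        subprocess.run(f"openstack flavor set {properties} m1.hpc.example",
                       shell=True, check=True)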

        Speaker: Belmiro Moreira (CERN)
      • 148
        Advanced features of the CERN OpenStack Cloud

        The CERN OpenStack cloud has been delivering a wide variety of services to its 3000 customers since it entered production in 2013. Initially, standard resources such as Virtual Machines and Block Storage were offered. Today, the cloud offering includes advanced features such as Container Orchestration (for Kubernetes, Docker Swarm mode, and Mesos/DCOS clusters), File Shares and Bare Metal, and the Cloud team is preparing the addition of Networking and Workflow-as-a-Service components.

        In this presentation, we will describe these advanced features, the OpenStack projects that provide them, and some of the main use cases that benefit from them. We will present the ongoing work on those services that will increase their functionality, such as container orchestration upgrades and networking features such as private networks and floating IPs.

        Speaker: Jose Castro Leon (CERN)
      • 149
        Sim@P1: Using Cloudscheduler for offline processing on the ATLAS HLT farm

        The Simulation at Point1 (Sim@P1) project was built in 2013 to take advantage of the ATLAS Trigger and Data Acquisition High Level Trigger (HLT) farm. The HLT farm provides more than 2,000 compute nodes, which are critical to ATLAS during data taking. When ATLAS is not recording data, this large compute resource is used to generate and process simulation data for the experiment. The Sim@P1 system uses virtual machines, deployed by OpenStack, in order to isolate the resources from the ATLAS technical and control network. During the upcoming long shutdown in 2019 (LS2), the HLT farm, including the Sim@P1 infrastructure, will be upgraded. A previous paper on the project emphasized the need for “simple, reliable, and efficient tools” to quickly switch between data acquisition operation and offline processing. In this contribution we assess various options for updating and simplifying the provisioning tools. Cloudscheduler is a tool for provisioning cloud resources for batch computing that has been managing cloud resources in HEP offline computing since 2012. We present the argument for choosing Cloudscheduler, and describe technical details regarding optimal utilization of the Sim@P1 resources.

        Speaker: Frank Berghaus (University of Victoria (CA))
      • 150
        Experience with dynamic resource provisioning of the CMS online cluster using a cloud overlay

        The primary goal of the online cluster of the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) is to build event data from the detector and to select interesting collisions in the High Level Trigger (HLT) farm for offline storage. With more than 1100 nodes and a capacity of about 600 kHEPSpec06, the HLT machines represent up to 40% of the combined Tier-0/Tier-1 capacity of the Worldwide LHC Computing Grid (WLCG). Moreover, the cluster is currently connected to the CERN IT datacenter via a dedicated 160 Gbps network connection and can hence access the remote EOS-based storage with high bandwidth. In the last few years, a cloud overlay based on OpenStack has been commissioned to use these resources for the WLCG when they are not needed for data taking. This online cloud facility was designed for parasitic use of the HLT, which must never interfere with its primary function as part of the DAQ system. It also makes it possible to abstract away the different types of machines and their underlying segmented networks. For monitoring, an infrastructure based on Graphite for metric storage and aggregation, and Grafana for visualisation, has been deployed. During the LHC technical stop periods, the HLT cloud is set to its static mode of operation, where it acts like other grid facilities. The online cloud was also extended to make dynamic use of resources during the periods between LHC fills. These periods are a priori unscheduled and of undetermined length, typically several hours, occurring once or more a day. For that, the system dynamically follows the LHC beam states and hibernates Virtual Machines (VMs) accordingly. Finally, this work presents the design and implementation of a mechanism to dynamically ramp up VMs when the DAQ load on the HLT decreases towards the end of a fill.

        Speaker: Diego Da Silva Gomes (CERN)
      • 151
        Elastic resource allocation for multi-queue HEP workloads in cloud based environments

        With the development of cloud computing, clouds are increasingly being applied in high-energy physics. OpenStack is generally considered the future of cloud computing. However, the OpenStack resource allocation model assigns a fixed number of resources to each group. This is not well suited to scientific computing such as high energy physics applications, whose resource demands vary, often with strong peaks. In a traditional static cluster, a fixed number of virtual machines is pre-allocated to the job queues of the different experiments. What often happens is that some queues have jobs waiting while others sit idle; as a result, the overall efficiency of the virtual cluster is rather low. To solve this problem, we developed a resource allocation service that provides elastic scheduling on top of OpenStack. In this implementation, each experiment queue has a {Min, Max} quota pair, where Min represents the minimum number of resources guaranteed to the experiment and Max the maximum it may use. The final resource allocation for each experiment is determined by a fair resource scheduling algorithm.

        In the paper, we discuss the development of this elastic resource allocation. A use case with VCondor (a virtual cluster with HTCondor, implemented by IHEP) is given; the results show that it can greatly improve cloud resource efficiency.
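
        A minimal sketch of the kind of fair allocation described above, assuming each queue carries a {Min, Max} quota and a pending-job demand (all names and numbers are illustrative, not the IHEP implementation):

        # Hedged sketch of fair allocation between experiment queues with
        # {Min, Max} quotas.
        def allocate(total, queues):
            """queues: name -> dict(min=..., max=..., demand=...)."""
            alloc = {q: spec["min"] for q, spec in queues.items()}
            spare = total - sum(alloc.values())
            # Hand remaining slots one by one to the queue with the largest
            # unmet demand that is still below its Max quota.
            while spare > 0:
                candidates = [q for q, s in queues.items()
                              if alloc[q] < s["max"] and s["demand"] > alloc[q]]
                if not candidates:
                    break
                neediest = max(candidates,
                               key=lambda q: queues[q]["demand"] - alloc[q])
                alloc[neediest] += 1
                spare -= 1
            return alloc

        print(allocate(100, {
            "exp-a": {"min": 10, "max": 60, "demand": 80},
            "exp-b": {"min": 10, "max": 50, "demand": 5},
            "exp-c": {"min": 10, "max": 40, "demand": 30},
        }))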

        Speaker: Haibo li (Institute of High Energy Physics Chinese Academy of Science)
      • 152
        Cyber security detection and monitoring at IHEP private cloud for web services

        To improve hardware utilization and save manpower in system management, over the last few years we have migrated most of the web services at our institute (Institute of High Energy Physics, IHEP) to a private cloud built upon OpenStack. However, cyber security attacks have progressively become a serious threat to the cloud. Therefore, a detection and monitoring system for cyber security threats is necessary for such an important platform.

        This system collects network traffic data through the OpenStack Neutron API and processes the traffic with the Bro IDS; its logs and the web/system logs of the Virtual Machines (VMs) are collected by Logstash and Filebeat. All log data are stored on a storage server as well as in Elasticsearch; the latter is used for quick searches during forensics. A group of analysis jobs checks the logs against security policies; these policies are stored in databases and can be updated by security operators. The real-time analysis results are displayed in a web UI, and email alerts are sent to security operators when an incident is detected.
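
        As a hedged illustration of the kind of policy check run against the collected logs, a periodic job might query Elasticsearch for recent IDS notices; the endpoint, index pattern and field names are assumptions:

        # Hedged sketch: query Elasticsearch for recent Bro/IDS notice logs.
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://localhost:9200"])
        result = es.search(index="bro-notice-*", body={
            "query": {
                "bool": {
                    "must": [{"range": {"@timestamp": {"gte": "now-15m"}}}],
                    "filter": [{"term": {"note": "Scan::Port_Scan"}}],
                }
            },
            "size": 100,
        })
        for hit in result["hits"]["hits"]:
            print(hit["_source"].get("src"), hit["_source"].get("note"))
        # In the real system a match would feed the alerting chain (web UI, email).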

        A prototype of this system has been developed and deployed at IHEP to enhance the security of the private cloud for web services.

        Speaker: Tian Yan (Institution of High Energy Physics, Chinese Academy of Science)
    • 12:30
      Lunch break
    • T1 - Online computing: S4 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Catrin Bernius (SLAC National Accelerator Laboratory (US))
      • 153
        The design and performance of the ATLAS Inner Detector trigger in high pileup collisions at 13 TeV at the Large Hadron Collider

        The design and performance of the ATLAS Inner Detector (ID) trigger
        algorithms running online on the High Level Trigger (HLT) processor
        farm for 13 TeV LHC collision data with high pileup are discussed.
        The HLT ID tracking is a vital component in all physics signatures
        in the ATLAS trigger for the precise selection of the rare or
        interesting events necessary for physics analysis without
        overwhelming the offline data storage in terms of both size and
        rate. To cope with the high interaction rates expected in the 13
        TeV LHC collisions the ID trigger was redesigned during the 2013-15
        long shutdown. The performance of the ID trigger in both the 2016
        and 2017 data from 13 TeV LHC collisions has been excellent and
        exceeded expectations, even at the very high interaction
        multiplicities observed at the end of data taking in 2017. The
        detailed efficiencies and resolutions of the trigger in a wide range
        of physics signatures are presented for the Run 2 data. The superb
        performance of the ID trigger algorithms in these extreme pileup
        conditions demonstrates how the ID tracking continues to lie at the
        heart of the trigger performance to enable the ATLAS physics
        program, and will continue to do so in the future.

        Speaker: Julie Hart Kirk (STFC-Rutherford Appleton Laboratory (GB))
      • 154
        Kalman Filter track reconstruction on FPGAs for acceleration of the High Level Trigger of the CMS experiment at the HL-LHC

        Track reconstruction at the CMS experiment uses the Combinatorial Kalman Filter. The algorithm computation time scales exponentially with pile-up, which will pose a problem for the High Level Trigger at the High Luminosity LHC. FPGAs, which are already used extensively in hardware triggers, are becoming more widely used for compute acceleration. With a combination of high performance, energy efficiency, and predictable low latency, FPGA accelerators are an interesting technology for high energy physics. Here, progress towards porting of the CMS track reconstruction is shown, using Maxeler Technologies’ DataFlow Engines, programmed with their high level language MaxJ. The performance is compared to CPUs, and further steps to optimise for the architecture are presented.

        Speaker: Sioni Paris Summers (Imperial College Sci., Tech. & Med. (GB))
      • 155
        Fast Kalman Filtering: new approaches for the LHCb upgrade

        In order to profit from the greatly increased instantaneous luminosity provided by the accelerator in Run III (2021-2023), the upgraded LHCb detector will make use of a fully software-based trigger, with real-time event reconstruction and selection performed at the bunch crossing rate of the LHC (~30 MHz). This implies much tighter timing constraints for the event reconstruction than in Run II (2015-2018). The reconstruction of particles’ trajectories passing through the LHCb detector uses, already at an early stage, the Kalman filter, a linear quadratic estimator of the track parameters. In the present configuration, the Kalman filter is one of the most time-consuming parts of the reconstruction. For this reason, different approaches are currently being explored in order to reduce its execution time while maintaining high reconstruction performance. The latest developments and their impact will be presented. These include vectorization, use of different computer architectures, simplifications of the algorithm, use of a parametrized approach and changes in the formalism.
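
        For readers unfamiliar with the method, a single predict/update step of a linear Kalman filter looks like the following numpy sketch; the 2D state and matrices are toy values, not the LHCb track model:

        # Toy predict/update step of a linear Kalman filter; the real LHCb fit
        # uses a 5-parameter track state and includes material effects.
        import numpy as np

        F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state propagation
        H = np.array([[1.0, 0.0]])               # measurement model
        Q = 1e-4 * np.eye(2)                     # process noise
        R = np.array([[0.01]])                   # measurement noise

        x = np.array([0.0, 0.1])                 # state: position, slope
        P = np.eye(2)                            # state covariance

        def kalman_step(x, P, z):
            # Predict to the next measurement surface.
            x_pred, P_pred = F @ x, F @ P @ F.T + Q
            # Update with the measurement z.
            S = H @ P_pred @ H.T + R
            K = P_pred @ H.T @ np.linalg.inv(S)
            x_new = x_pred + K @ (z - H @ x_pred)
            P_new = (np.eye(2) - K @ H) @ P_pred
            return x_new, P_new

        for z in ([0.11], [0.22], [0.35]):       # toy hit positions
            x, P = kalman_step(x, P, np.array(z))
        print(x)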

        Speaker: Placido Fernandez Declara (University Carlos III (ES))
      • 156
        Fast Boosted Decision Tree inference on FPGAs for triggering at the LHC

        Boosted Decision Trees (BDTs) are used extensively in offline analysis and reconstruction in high energy physics. The computation time of ensemble inference has previously prohibited their use in online reconstruction, whether at the software or hardware level. An implementation of BDT inference for FPGAs, targeting low latency by leveraging the platform’s enormous parallelism, is presented. Full pipelining provides high throughput, and the use of a high-level programming language adds flexibility to the construction of new ensembles. Classification latencies of tens of nanoseconds are achieved within the FPGA, as well as a 600-fold speedup over inference on a single-threaded CPU when the FPGA is used as a coprocessor. A use case within the CMS Level-1 track trigger is presented, demonstrating use of the implementation on a real classification problem, including resource- and latency-aware hyperparameter tuning.
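
        As a hedged, CPU-side illustration of what the FPGA implementation parallelises, evaluating a tree ensemble is just a set of threshold comparisons followed by a sum; the toy depth-one trees below are invented:

        # Toy sketch of BDT inference: each (depth-one) tree is a threshold
        # comparison, and the ensemble score is the sum of the leaf values.
        # On the FPGA all trees and comparisons are evaluated in parallel.
        TREES = [
            # (feature index, threshold, left leaf value, right leaf value)
            (0, 0.5, -0.2, +0.4),
            (1, 1.3, +0.1, -0.3),
            (0, 2.0, -0.1, +0.2),
        ]

        def bdt_score(features):
            score = 0.0
            for feat, cut, left, right in TREES:
                score += left if features[feat] < cut else right
            return score

        print(bdt_score([0.7, 0.9]))   # toy input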

        Speaker: Sioni Paris Summers (Imperial College Sci., Tech. & Med. (GB))
      • 157
        Comparison of Track Reconstruction Methods of the PANDA Experiment

        PANDA is one of the main experiments of the future FAIR accelerator facility at Darmstadt. It utilizes an anti-proton beam with a momentum up to 15 GeV/c on a fixed proton or nuclear target to investigate the features of strong QCD.

        The reconstruction of charged particle tracks is one of the most challenging aspects in the online and offline reconstruction of the data taken by PANDA. Several features of the experiment make the reconstruction special in comparison to others:

        • The anti-proton beam of PANDA is quasi continuous with a Poisson distribution of the time differences between two consecutive events with a mean time of 50 ns. This leads to a very high probability that events overlap and to an a-priori unknown event time.

        • PANDA uses two different magnetic fields, a solenoidal one around the interaction point and a dipole field in the forward direction. Therefore, different track models have to be used and a connection between the different regions is necessary.

        • The experiment is equipped with straw tube trackers which measure the drift time inside a straw. This leads to an isochrone ring for each hit where the track has passed. The radius of the ring depends on the unknown start time of the track.

        • PANDA reads out and reconstructs the complete detector data online. This imposes strong requirements onto the reconstruction speed of the online track finding and fitting algorithms.

        Many different track reconstruction algorithms have been developed and tested both for online and offline reconstruction and on different hardware platforms such as CPUs, GPUs and FPGAs. In this presentation, an overview and a comparison of the different methods will be given.

        Speaker: Tobias Stockmanns (Forschungszentrum Jülich GmbH)
      • 158
        ATLAS Hardware based track-finding: Present and Future

        The ATLAS Fast TracKer (FTK) is a hardware based track finder for the ATLAS trigger infrastructure currently under installation and commissioning. FTK sits between the two layers of the current ATLAS trigger system, the hardware-based Level 1 Trigger and the CPU-based High-Level Trigger (HLT). It will provide full-event tracking to the HLT with a design latency of 100 µs at a 100 kHz event rate, reducing the need for computationally intensive tracking in the HLT, which is several orders of magnitude slower. Having tracking information available at the trigger level will significantly improve trigger efficiencies and thresholds for many physics objects, including b-jets and tau leptons. FTK uses a two-stage tracking procedure implemented with massive parallelization in FPGAs and associative memory ASICs for fast pattern matching. In the first stage, hits are compared in parallel to 1 billion precomputed patterns shared across 8192 associative memory chips, and tracks in the matched patterns are then fit at a design speed of 1 fit per nanosecond per FPGA across 512 fitter FPGAs. The second stage tracks are extrapolations of the first stage tracks into additional layers of the tracking detector. The proposed Hardware Track Trigger (HTT) for the ATLAS Phase-II upgrade at the High Luminosity LHC (HL-LHC) uses a conceptually similar two-stage pattern matching tracking procedure. However, the HTT is instead a co-processor to the CPU-based Event Filter (EF) trigger, providing regional tracking at the full 1 MHz EF input event rate as well as full-event tracking on request at 100 kHz. This talk reviews the current design and performance of FTK, as well as the conceptual design and performance of the HTT co-processor for the HL-LHC.
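
        A very rough software analogue of the associative-memory pattern matching (hypothetical, coarse pattern IDs only; the hardware matches against on the order of a billion patterns in parallel and tolerates missing layers):

        # Hedged software analogue of associative-memory pattern matching: hits
        # are reduced to coarse-grained IDs per layer and looked up in a bank.
        PATTERN_BANK = {
            (3, 7, 12, 18): "pattern-A",
            (3, 8, 12, 19): "pattern-B",
        }

        def match(coarse_hits_per_layer):
            """coarse_hits_per_layer: tuple of coarse hit IDs, one per layer."""
            return PATTERN_BANK.get(tuple(coarse_hits_per_layer))

        print(match((3, 7, 12, 18)))   # -> "pattern-A"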

        Speaker: Todd Michael Seiss (University of Chicago (US))
      • 159
        GPU-based Online Track Reconstructions for the ALICE TPC in Run 3 with Continuous Read Out

        In LHC Run 3, ALICE will increase the data taking rate significantly to 50 kHz continuous read out of minimum bias Pb-Pb collisions.
        The reconstruction strategy of the online-offline computing upgrade foresees a first synchronous online reconstruction stage during data taking, enabling detector calibration, and a subsequent calibrated asynchronous reconstruction stage.
        Many new challenges arise, among them continuous TPC read out, more overlapping collisions, no a priori knowledge of the primary vertex and of location-dependent calibration in the synchronous phase, identification of low-momentum looping tracks, and a distorted refit to improve track model entropy coding for a total TPC compression factor of 20.
        The tracking algorithm for the Time Projection Chamber (TPC) will be based on a Cellular automaton and the Kalman filter.
        The reconstruction shall run online, processing 50 times more collisions than today, while yielding results comparable to current offline reconstruction.
        Our TPC track finding leverages the potential of hardware accelerators via the OpenCL and CUDA APIs in a shared source code for CPUs and GPUs for both reconstruction stages.
        We give an overview of the status of Run 3 tracking including track finding efficiency, resolution, treatment of continuous read out data, and performance on processors and GPUs.

        Speaker: David Rohr (CERN)
      • 160
        Quantum Associative Memory in HEP Track Pattern Recognition

        We have entered the Noisy Intermediate-Scale Quantum era. A plethora of quantum processor prototypes allows evaluation of the potential of the quantum computing paradigm in applications to pressing computational problems of the future. Growing data input rates and detector resolution foreseen in High-Luminosity LHC (2030s) experiments expose the often high time and/or space complexity of classical algorithms. Quantum algorithms can potentially become the lower-complexity alternatives in such cases. In this work we discuss the potential of Quantum Associative Memory (QuAM) in the context of LHC data triggering. We examine the practical limits of storage capacity, as well as store and recall efficiencies, from the viewpoints of state-of-the-art quantum hardware and LHC real-time charged track pattern recognition requirements. We present experimental tests of QuAM on the IBM 5Q chip, a cloud-based 5-qubit superconducting quantum processor. We further compare the results to QuAM simulations on LIQUi|>, Microsoft’s quantum simulator toolsuite, as well as to theoretical expectations of QuAM efficiency bounds. We also review several difficulties in integrating end-to-end quantum pattern recognition into a real-time production workflow, and discuss possible mitigations.

        Speaker: Illya Shapoval (Lawrence Berkeley National Laboratory)
    • T2 - Offline computing: S4 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Gene Van Buren (Brookhaven National Laboratory)
      • 161
        Synergy between the tracker layer layout and track reconstruction in collider environment with high pileup

        The majority of currently planned or considered hadron colliders are expected to deliver data in collisions with hundreds of simultaneous interactions per beam bunch crossing on average, including the high-luminosity LHC upgrade currently in preparation and the possible high-energy LHC upgrade or a future circular collider FCC-hh. Running charged-particle track reconstruction for the general-purpose detectors at these colliders is expected to be at least comparable in cost to the construction and operation of the tracking detectors themselves. We show that the sensitive layer arrangement in the design of a tracking detector can play a vital role in reducing the complexity and cost of charged-particle track reconstruction. Our case study is based on a realistic simulation of a tracking detector similar to that of CMS to be operated at the HL-LHC. We show that a customary layout with equal layer separation is inferior to one with grouped/paired layers with a smaller distance within a group, which allows for more performant track segment reconstruction. This synergy between the hardware and software choices can allow for a reduction of the total cost of a future collider experiment.

        Speaker: Slava Krutelyov (Univ. of California San Diego (US))
      • 162
        A High-Precision Tracking Algorithm for Mass Reconstruction of Heavy-Ion Fragments in the R3B Experiment at FAIR

        The multi-purpose R$^{3}$B (Reactions with Relativistic Radioactive Beams) detector at the future FAIR facility in Darmstadt will be used for various experiments with exotic beams in inverse kinematics. The two-fold setup will serve for particle identification and momentum measurement upstream and downstream of the secondary reaction target. In order to perform a high-precision charge identification of heavy-ion fragments and achieve a momentum resolution of 10$^{-4}$, a ToF measurement with 100 ps accuracy, a position determination to better than 0.5 mm, and a dedicated algorithm for heavy-ion tracking in a highly non-homogeneous dipole field are required. With these constraints, a tracking package is being developed and tested within the R$^{3}$B software framework; this package has to go into production in the fall of 2018. An iterative approach has been chosen for simultaneous track finding and fitting. The design and concept of the package will be introduced, and the tests and efficiencies measured with simulated data will be presented.

        Speaker: Dmytro Kresan (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 163
        Parallelized and Vectorized Tracking Using Kalman Filters with CMS Detector Geometry and Events

        The High-Luminosity Large Hadron Collider (HL-LHC) at CERN will be characterized by higher event rate, greater pileup of events, and higher occupancy. Event reconstruction will therefore become far more computationally demanding, and given recent technology trends, the extra processing capacity will need to come from expanding the parallel capabilities in the tracking software. Existing algorithms at the LHC are based on Kalman filter techniques, which have proven themselves to be robust and offer good physics performance. We have therefore developed Kalman-filter-based methods for track finding and fitting that are adapted for many-core SIMD processors, since this type of hardware is increasingly dominant in high-performance systems.

        This effort has been underway for some time now, and our software has matured in several important ways. (1) The detector geometry now includes two endcaps as well as the barrel, and tracks can propagate through layers of both types, as well as the transition regions between them. (2) We are therefore able to reconstruct events in realistic detector geometries, including an accurate representation of the CMS-2017 tracker. (3) Hit data can be imported from CMSSW-generated events, including pileup, and is no longer restricted to artificial muon-only simulations. (4) The reconstructed tracks can be validated against either the CMSSW simulation that generated the hits, or the CMSSW reconstruction of the tracks. (5) Groups of track candidates can now be tracked and assessed all the way through the detector using a single, consistent set of vectorizable data structures. In general, the code's computational performance has continued to improve while the above capabilities were being added.

        The presentation summarizes the latest features of this software, beginning with the data structures and code constructs that facilitate vectorization, as well as the multiple levels of parallel tracking tasks that have been multithreaded with TBB. We demonstrate that the present Kalman filter implementation is able to reconstruct events with comparable physics performance to CMSSW, while providing generally better computational performance. Further plans for advancing the software are discussed.
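
        As a generic illustration of a vectorization-friendly data layout (not the actual structures used in this software), a structure-of-arrays form lets identical operations on many track candidates map onto SIMD lanes:

        # Hedged sketch: structure-of-arrays layout.  One contiguous array per
        # track parameter lets a single vectorized statement update all
        # candidates at once, which is what SIMD-friendly tracking relies on.
        import numpy as np

        n_tracks = 1024
        pt = np.random.uniform(1.0, 50.0, n_tracks)
        chi2 = np.zeros(n_tracks)
        residual = np.random.normal(0.0, 0.1, n_tracks)

        # One pass over all candidates, no per-track Python loop.
        chi2 += (residual / 0.1) ** 2
        print(chi2[:5])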

        Speaker: Matevz Tadel (Univ. of California San Diego (US))
      • 164
        COMET multi turn track fitting

        One of the tasks of track reconstruction for the COMET Phase-I drift chamber is to fit multi-turn curling tracks. A method based on the Deterministic Annealing Filter, which implements a global competition between hits from different turns, is introduced. This method assigns the detector measurements to the track hypotheses based on the weighted mean of the fitting quality on the different turns. The method has been studied on simulated tracks in the COMET drift chamber, and we show that it is a promising candidate for determining the turn number of a hit.

        Speaker: Ms Yao Zhang
      • 165
        Belle II Track Reconstruction and Results from first Collisions

        In early 2018, e+e- collisions at the SuperKEKB B-factory will be recorded by the Belle II detector in Tsukuba (Japan) for the first time. The new accelerator and detector represent a major upgrade with respect to the previous Belle experiment and will achieve a 40-times higher instantaneous luminosity. Special considerations and challenges arise for track reconstruction at Belle II due to multiple factors. The high-luminosity configuration of the collider increases the beam-induced background many-fold compared to Belle, and new track reconstruction software has been developed from scratch to achieve excellent physics performance in this busy environment.
        Even though on average only eleven signal tracks are present in one event, all of them need to be reconstructed down to a transverse momentum of 50 MeV, and no fake tracks should be present in the event. Many analyses at Belle II rely on the advantage that the initial state at B-factories is well known, and a clean event reconstruction is possible if no tracks are left over after assigning all tracks to particle hypotheses.

        This contribution will introduce the concepts and algorithms of the Belle II tracking software. Special emphasis will be put on the mitigation techniques developed to perform track reconstruction in high-occupancy events. First results from the data taking with the Belle II detector will be presented.

        Speaker: Thomas Hauth (KIT - Karlsruhe Institute of Technology (DE))
      • 166
        Track Fitting for the Belle II experiment

        The Belle II experiment is ready to take data in 2018, studying e+e- collisions at the KEK facility in Tsukuba (Japan) in the centre-of-mass energy range of the bottomonium states. The tracking system includes a combination of hit measurements coming from the vertex detector, made of pixel detectors and double-sided silicon strip detectors, and a central drift chamber, inside a solenoid providing a 1.5 T magnetic field. Once the pattern recognition routines have identified the track candidates, the hit measurements are fitted together, taking into account the different information coming from the different detectors, the energy loss in the materials and the inhomogeneity of the magnetic field. Track fitting is performed by the generic track-fitting software GENFIT, which includes a Kalman filter improved by a deterministic annealing filter in order to reject outlier hits that were incorrectly associated by the pattern recognition. Several mass hypotheses are used in the fit in parallel, in order to achieve the best track parameter estimation for each particle type.
        This talk will present the design of the track fitting in the Belle II software, showing results in terms of track parameter estimation as well as computing performance.

        Speaker: Stefano Spataro (University of Turin)
      • 167
        Performance in offline event reconstruction landscape at CMS

        CMS offline event reconstruction algorithms cover simulated and acquired data processing, starting from the detector raw data on input and providing high-level reconstructed objects suitable for analysis. The landscape of supported data types and detector configuration scenarios has been expanding and covers past and expected future configurations, including proton-proton collisions and collisions with heavy ions through the high-luminosity LHC era. All of this landscape is handled in the same software release development cycle. Every year the landscape is dominated by the current year's proton-proton collision data taking in terms of computational and output data volume. By 2018 the landscape is relatively well established and application performance for Run 2 and Run 3 of the LHC can be understood. We provide an in-depth look at the performance of a typical production application running offline event reconstruction, focusing on specific areas of subdetector or higher-level object reconstruction. We highlight the recent trends in developments and their impact on the resources required to run the application. An outlook on the future challenges of running in the HL-LHC era is provided, based on the latest state-of-the-art event reconstruction application setup.

        Speaker: Slava Krutelyov (Univ. of California San Diego (US))
      • 168
        JANA2 Multi-threaded Event Reconstruction

        Development of the JANA multi-threaded event processing framework began in 2005. Its primary application has been for GlueX, a major Nuclear Physics experiment at Jefferson Lab. Production data taking began in 2016 and JANA has been highly successful in analyzing that data on the JLab computing farm. Work has now begun on JANA2, a near-complete rewrite emphasizing features targeted at large offsite computing facilities. Successful and less-than-successful strategies employed in JANA, and how they are being addressed in JANA2, will be presented, as well as new features suited to modern and future trends in data analysis.

        Speaker: David Lawrence (Jefferson Lab)
    • T3 - Distributed computing: Performance Optimization, Security and Federated Identity Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: David Cameron (University of Oslo (NO))
      • 169
        Minimising wasted CPU time with interruptible LHCb Monte Carlo

        During 2017 LHCb developed the ability to interrupt Monte Carlo
        simulation jobs and cause them to finish cleanly with the events
        simulated so far correctly uploaded to grid storage. We explain
        how this functionality is supported in the Gaudi framework and handled
        by the LHCb simulation framework Gauss. By extending DIRAC, we have been
        able to trigger these interruptions when running simulation on
        unoccupied capacity of the LHCb High Level Trigger farm, and are able to
        reclaim this capacity when needed for online data taking tasks. This has
        increased the opportunities for running Monte Carlo simulation during
        data taking runs as well as interfill periods and technical stops. We
        have also applied this mechanism to grid and cloud resources at external
        sites, providing the ability to reclaim capacity for operational reasons
        without long draining periods. In addition, the mechanism is used to
        increase the efficiency of the "job masonry" of packing
        single and multiprocessor jobs into the time slots on a single worker node,
        without the need for draining periods when multiple free processors must be
        assembled for a multiprocessor job. We explain how the Machine/Job
        Features mechanism is instrumental in communicating the desired finish
        time to LHCb jobs and virtual machines.
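
        The Machine/Job Features interface exposes values such as the advertised shutdown time as small files under directories pointed to by environment variables; the following is a hedged sketch of a job consulting it (the key name follows our reading of the MJF specification and should be checked against it):

        # Hedged sketch: read the advertised shutdown time via Machine/Job
        # Features and decide whether to start another event batch.
        import os
        import time

        def read_feature(env_var, key):
            base = os.environ.get(env_var)
            if not base:
                return None
            try:
                with open(os.path.join(base, key)) as fh:
                    return float(fh.read().strip())
            except (OSError, ValueError):
                return None

        shutdown = read_feature("MACHINEFEATURES", "shutdowntime")  # epoch seconds
        batch_estimate = 1800  # assumed seconds needed for one more batch
        if shutdown is not None and shutdown - time.time() < batch_estimate:
            print("not enough time left: finish cleanly and upload events")
        else:
            print("start another batch")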

        Speaker: Andrew McNab (University of Manchester)
      • 170
        ATLAS Grid Workflow Performance Optimization

        The CERN ATLAS experiment grid workflow system routinely manages 250 to
        500 thousand concurrently running production and analysis jobs
        to process simulation and detector data. In total more than 300 PB
        of data is distributed over more than 150 sites in the WLCG.
        At this scale small improvements in the software and computing
        performance and workflows can lead to significant resource usage gains.
        ATLAS is reviewing together with CERN IT experts several typical
        simulation and data processing workloads for potential performance
        improvements in terms of memory and CPU usage, disk and network I/O.
        All ATLAS production and analysis grid jobs are instrumented to collect
        many performance metrics for detailed statistical studies using
        modern data analytics tools like ElasticSearch and Kibana.
        This presentation will review and explain the performance gains of
        several ATLAS simulation and data processing workflows and present
        analytics studies of the ATLAS grid workflows.

        Speaker: Johannes Elmsheuser (Brookhaven National Laboratory (US))
      • 171
        Improving efficiency of analysis jobs in CMS

        Hundreds of physicists analyse data collected by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) using the CMS Remote Analysis Builder (CRAB) and the CMS GlideinWMS global pool to exploit the resources of the Worldwide LHC Computing Grid. Efficient use of such an extensive and expensive resource is crucial. At the same time, the CMS collaboration is committed to minimizing the time to insight for every scientist, by pushing for the fewest possible access restrictions to the full data sample and for freedom in choosing the application to run. Supporting such varied workflows while preserving efficient resource usage poses special challenges, such as: scheduling jobs in a multicore/pilot model where several single-core jobs with undefined runtimes run inside pilot jobs with a fixed lifetime; balancing the use of every available CPU against the use of CPUs close to the data; avoiding that too many concurrent reads from the same storage push jobs into I/O wait mode, making CPU cycles go idle; and watching over user activity to detect low-efficiency workflows and prod them into smarter usage of the resources.

        In this paper we report on two complementary approaches adopted in CMS to improve the scheduling efficiency of user analysis jobs: automatic job splitting, and automatic tuning of the estimated running time. They both aim at finding an appropriate value for the scheduling runtime, a number that indicates how much walltime the user job needs and that is used during scheduling to fit users' jobs into pilots that have enough remaining lifetime. With the automatic splitting mechanism, an estimation of the runtime of the jobs is performed upfront so that an appropriate value can be chosen for the scheduling runtime. With the automatic time-tuning mechanism instead, the scheduling runtime is dynamically modified by analyzing the real runtime of jobs after they finish. We also report on how we used the flexibility of the global computing pool to tune the number, kind and running locations of jobs allowed to run while exploiting remote access to the input data.

        We discuss the concepts, details and operational experience of these strategies, highlighting the pros and cons, and we show how such efforts helped improve computing efficiency in CMS.

        Speakers: Todor Trendafilov Ivanov (University of Sofia (BG)), Jose Hernandez (CIEMAT)
      • 172
        Using AWS Athena analytics to monitor pilot job health on WLCG compute sites

        ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are neglected and not brought back into production because the problems are not visible to the WMS. In this paper we describe a method to analyse logs from the ADC resource provisioning system (AutoPyFactory) and provide monitoring views which target poorly performing resources and help diagnose the issues in good time. Central to this analysis is the use of Amazon Web Services (AWS) to provide an inexpensive and stable analytics platform. In particular we use the AWS Athena service as an SQL query interface for logging data stored in the AWS S3 service. We describe details of the data handling pipeline and services involved leading to a summary of key metrics suitable for ADC operations.
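
        A hedged sketch of the kind of Athena query involved: the SQL runs over logs laid out in S3 and the results land back in S3 (database, table, column and bucket names are invented):

        # Hedged sketch of querying pilot-factory logs with AWS Athena via boto3.
        import boto3

        athena = boto3.client("athena", region_name="eu-west-1")
        query = """
            SELECT queue, count(*) AS failed_pilots
            FROM apf_logs
            WHERE status = 'failed' AND dt >= date_add('day', -1, current_date)
            GROUP BY queue
            ORDER BY failed_pilots DESC
        """
        response = athena.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": "adc_monitoring"},
            ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
        )
        print("query id:", response["QueryExecutionId"])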

        Speaker: Peter Love (Lancaster University (GB))
      • 173
        Federated Identity Management for Research

        Federated identity management (FIM) is an arrangement that can be made among multiple organisations that lets subscribers use the same identification data to obtain access to the secured resources of all organisations in the group. In many research communities there is an increasing interest in a common approach to FIM as there is obviously a large potential for synergies. FIM4R [1] provides a forum for communities to share challenges and ideas, and to shape the future of FIM for our researchers. Current participation covers high energy physics, life sciences and humanities, to mention but a few. In 2012 FIM4R converged on a common vision for FIM, enumerated a set of requirements and proposed a number of recommendations for ensuring a roadmap for the uptake of FIM is achieved [2]. FIM4R is currently working on an updated version of this paper, to be published in spring 2018.

        The HEP Community has been heavily involved in creating both the original whitepaper and this new version, which will document the progress made in FIM for Research, in addition to the current challenges. During this presentation we will share the conclusions of this second FIM4R whitepaper and present a summary of the identified requirements and recommendations. We shall focus particularly on the direction being taken by WLCG, through the WLCG Authorisation Working Group, and the requirements gathered from the HEP Community.

        [1] https://fim4r.org
        [2] https://fim4r.org/documents/
        [3] https://aarc-project.eu
        [4] https://www.geant.org/Projects/GEANT_Project_GN4

        Speaker: Hannah Short (CERN)
      • 174
        Operational security, threat intelligence & distributed computing: the WLCG Security Operations Center Working Group

        The modern security landscape for distributed computing in High Energy Physics (HEP) includes a wide range of threats employing different attack vectors. The nature of these threats is such that the most effective method for dealing with them is to work collaboratively, both within the HEP community and with partners further afield - these can, and should, include institutional and campus security teams. In parallel with this work, an appropriate technology stack is essential, incorporating current work on Big Data analytics.

        The work of the Worldwide LHC Computing Grid (WLCG) Security Operations Center (SOC) working group is to pursue these goals to form a reference design (or guidelines) for WLCG sites of different types. The strategy of the group is to identify necessary components - starting with threat intelligence (MISP) and network monitoring (Bro) - and build a working model over time.

        We present on the progress of the working group thus far, in particular on the programme of workshops now underway. These workshops give an opportunity to engage with sites to allow the development of advice and procedures for deployment, as well as facilitating wider discussions on how to best work with trust groups at different levels. These trust groups vary in scope but can include institutes, NGIs and the WLCG as a whole.

        Speaker: David Crooks (University of Glasgow (GB))
      • 175
        Macaroons: looking back and looking forward

        X.509 is the dominant security infrastructure used in WLCG. Although
        this technology has worked well, it has some issues. One is that,
        currently, a delegated proxy can do everything the parent credential
        can do. A stolen "production" proxy could be used from any machine in
        the world to delete all data owned by that VO on all storage systems
        in the grid.

        Generating a delegated X.509 credential is also computationally
        expensive. As a result, credentials tend to be cached and reused,
        potentially increasing the likelihood of theft.

        In October 2017, dCache v3.2 was released, which introduced support
        for macaroons: a pure authorisation, bearer token that supports
        delegation with autonomous attenuation.

        In contrast to X.509, minting a macaroon is much faster, so that
        creating a macaroon per request is feasible. Macaroons also support
        attenuation when delegating: the delegated macaroon can be limited to
        a specific task and to a specific machine.

        This paper reviews the current state of this support in dCache and
        presents the different use cases and projects that are using or are
        evaluating macaroons. It also compares macaroons to other pure bearer
        token solutions, such as the OAuth2 approach in SciToken, highlighting
        the relative strengths and weaknesses of each.
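
        As a hedged illustration, a client can ask a dCache WebDAV door to mint a macaroon scoped to a path, an activity and a lifetime; the endpoint, caveat syntax and authentication below reflect our understanding of the dCache interface and are not guaranteed to match a given deployment:

        # Hedged sketch: request an attenuated macaroon from a dCache WebDAV door.
        # URL, path, caveats and authentication are illustrative assumptions.
        import json
        import requests

        door = "https://dcache.example.org:2880/pnfs/example.org/data/myfile"
        request_body = {
            "caveats": ["activity:DOWNLOAD"],   # limit what the token may do
            "validity": "PT1H",                 # ISO 8601 duration: one hour
        }
        response = requests.post(
            door,
            data=json.dumps(request_body),
            headers={"Content-Type": "application/macaroon-request"},
            cert=("usercert.pem", "userkey.pem"),  # any accepted authentication
            verify="/etc/grid-security/certificates",
        )
        response.raise_for_status()
        macaroon = response.json()["macaroon"]
        # The bearer token can now be passed on, further attenuated if desired.
        print(macaroon[:40], "...")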

        Speaker: Paul Millar (DESY)
      • 176
        EOSC-hub AAI: A federated authentication and authorisation infrastructure for international scientific collaboration at scale

        The European Open Science Cloud (EOSC) aims to enable trusted access to services and the re-use of shared scientific data across disciplinary, social and geographical borders. The EOSC-hub will realise the EOSC infrastructure as an ecosystem of research e-Infrastructures leveraging existing national and European investments in digital research infrastructures. EGI Check-in and EUDAT B2ACCESS are the initial basis of an integrated EOSC-hub AAI that will allow the use of federated identities for accessing services. The adoption of standards and open technologies, including SAML 2.0, OpenID Connect, and OAuth 2.0, facilitates integration with web-based services. Options to support non-web services, which traditionally relied on X.509 certificates, are based around the concept of online authorities with attached credential stores, such as RCauth.eu with a tightly-coupled MyProxy server. Such techniques allow science gateways to obtain credentials on behalf of the end-user that can be used to directly authenticate to services. Another user-centric approach considers certificate proxies as opaque tokens that can be obtained from a credential store from the command-line using SSH authentication. The deployed RCauth.eu and MasterPortal service from AARC features both these capabilities and has been shown to work for the production EGI and WLCG environments.

        The presentation will provide an overview of the EOSC-hub AAI architecture and the various integration workflows in support of today’s federated access requirements, with an eye to the non-web use cases.

        Speaker: Mr Nicolas Liampotis (Greek Research and Technology Network - GRNET)
    • T4 - Data handling: S4 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Convener: Costin Grigoras (CERN)
      • 177
        Echo - Experiences running an erasure coded object store

        Since February 2017, the RAL Tier-1 has been storing production data from the LHC experiments on its new Ceph backed object store called Echo. Echo has been designed to meet the data demands of LHC Run 3 and should scale to meet the challenges of HL-LHC. Echo is already providing better overall throughput than the service it will replace (CASTOR) even with significantly less hardware deployed.

        Echo relies on erasure coding rather than hardware RAID to provide data resilience. Of the publicly known Ceph clusters around the world, Echo is the largest running erasure coding in production. This paper describes the erasure coding setup, its advantages over hardware RAID, and our experience relying on it for data resilience.
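
        For illustration only, an erasure-coded pool in Ceph is defined through an erasure-code profile; the k/m split and pool parameters below are assumptions, not necessarily Echo's production values:

        # Hedged sketch: create an erasure-code profile and an EC pool with the
        # standard Ceph CLI (wrapped in Python for convenience).  The 8+3 split,
        # failure domain and placement-group count are illustrative assumptions.
        import subprocess

        commands = [
            "ceph osd erasure-code-profile set ecprofile_example "
            "k=8 m=3 crush-failure-domain=host",
            "ceph osd pool create atlas_example 2048 2048 erasure ecprofile_example",
        ]
        for cmd in commands:
            subprocess.run(cmd, shell=True, check=True)
        # With k=8, m=3 any three OSDs/hosts in a placement group can be lost
        # without losing data, at a storage overhead of m/k = 37.5%.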

        At the start of 2017, the LHC experiments had more than 14 PB of data stored on disk in CASTOR. Migrating this to Echo is no small challenge and will take approximately two years. This paper describes the different approaches taken by the experiments as well as their current usage of Echo. It also describes the growing usage of the S3 and Swift APIs and the lessons learnt.

        In the first year of operation there have been many pieces of scheduled work, including the addition of new hardware which resulted in significant data rebalancing, major software updates as well as security patching. There have also been operational problems such as a power cut and high disk failure rates. This paper describes how Echo has coped with these events and the higher level of data availability it is able to provide.

        Speaker: Rob Appleyard (STFC)
      • 178
        dCache - joining the noWORM storage club.

        The life cycle of scientific data is well defined: data is collected, then processed,
        archived and finally deleted. Data is never modified. The original data is used or new,
        derived data is produced: Write Once Read Many times (WORM). With this model in
        mind, dCache was designed to handle immutable files as efficiently as possible. Currently,
        data replication, HSM connectivity and data-server independent operations are only
        possible due to the immutable nature of the stored data.

        dCache is seen increasingly as a general-purpose file system, helped by its support for
        the NFSv4.1 standard, especially by new communities, such as photon science and
        microbiology. Although many users are aware of the immutability of data stored in
        dCache, some applications and use cases still require in-place update of stored files.
        Satisfying these requires some fundamental changes to dCache's core design. However,
        those changes must not compromise any aspect of existing functionality.

        In this presentation we will show the new developments in dCache that will turn it
        into a regular file system. We will discuss the challenges of building a POSIX-compliant
        distributed storage system, one that can handle multiple replicas and that remains
        backward compatible by providing both WORM and non-WORM capabilities within
        the same system.

        Speaker: Mr Tigran Mkrtchyan (DESY)
      • 179
        A milestone for DPM (Disk Pool Manager)

        The DPM (Disk Pool Manager) system is a multiprotocol scalable technology for Grid storage that supports about 130 sites for a total of about 90 Petabytes online.

        The system has recently completed the development phase that had been announced in the past years, which consolidates its core component (DOME: Disk Operations Management Engine) as a full-featured high performance engine that can also be operated with standard Web clients and uses a fully documented REST-based protocol.

        Together with a general improvement on performance and with a comprehensive administration command-line interface, this milestone also brings back features like the automatic disk server status detection and the volatile pools for deploying experimental disk caches.

        In this contribution we also discuss the end of support for the historical DPM components (which also include a dependency on the Globus toolkit), whose deployment is now only linked to the usage of the SRM protocols and which can hence be uninstalled when these are no longer needed by the site.

        Speaker: Fabrizio Furano (CERN)
      • 180
        Providing large-scale disk storage at CERN

        The CERN IT Storage group operates multiple distributed storage systems and is responsible for supporting the infrastructure that accommodates all CERN storage requirements, from the physics data generated by LHC and non-LHC experiments to users' personal files.

        EOS is now the key component of the CERN storage strategy. It allows operation at high incoming throughput for experiment data taking while running concurrent complex production workloads. This high-performance distributed storage now provides more than 250 PB of raw disk and is the key component behind the success of CERNBox, the CERN cloud synchronisation service, which allows syncing and sharing files on all major mobile and desktop platforms to provide offline availability for any data stored in the EOS infrastructure.

        CERNBox has recorded exponential growth over the last couple of years in terms of files and data stored, thanks to its increasing popularity within the CERN user community and to its integration with a multitude of other CERN services (Batch, SWAN, Microsoft Office).

        In parallel, CASTOR is being simplified and is transitioning from an HSM into an archival system, focusing mainly on the long-term recording of the primary data from the detectors and preparing the road to the next-generation tape archival system, CTA.

        The storage services at CERN also cover the needs of the rest of our community: Ceph as the data back-end for the CERN OpenStack infrastructure, NFS services and S3 functionality; AFS for the legacy home directory filesystem service and its ongoing phase-out; and CVMFS for software distribution.

        In this paper we will summarise our experience in supporting all our distributed storage systems and the ongoing work in evolving our infrastructure, testing very dense storage building blocks (nodes with more than 1 PB of raw space) for the challenges ahead.

        Speaker: Herve Rousseau (CERN)
      • 181
        Scaling the EOS namespace

        The EOS namespace has outgrown its legacy in-memory implementation, presenting the need for an alternative solution. In response to this need we developed QuarkDB, a highly-available datastore capable of serving as the metadata backend for EOS. Even though the datastore was tailored to the needs of the namespace, its capabilities are generic.

        We will present the overall system design, and our efforts in providing comparable performance with the in-memory approach, both when reading, through the use of extensive caching on the MGM, and when writing through the use of latency-hiding techniques involving a persistent, back-pressured local queue for batching updates to the QuarkDB backend.

        We will also discuss the architectural decisions taken when designing our datastore, including the choice of consensus algorithm to maintain strong consistency between identical replicas (raft), the choice of underlying storage backend (rocksdb) and communication protocol (redis serialization protocol - RESP), as well as the overall testing strategy to ensure correctness and stability of this important infrastructure component.
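
        Because QuarkDB speaks the redis serialization protocol, a standard redis client can, in principle, be pointed at it for inspection; the host, port and keys below are purely illustrative, and the supported command subset should be checked against the QuarkDB documentation:

        # Hedged sketch: talk to a RESP-speaking QuarkDB node with a stock redis
        # client.  Endpoint and keys are illustrative; QuarkDB implements only a
        # subset of redis commands, so this is a sketch rather than a recipe.
        import redis

        client = redis.Redis(host="quarkdb.example.org", port=7777)
        client.set("example-key", "example-value")
        print(client.get("example-key"))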

        Speaker: Andrea Manzi (CERN)
      • 182
        CERNBox: the CERN cloud storage hub

        CERNBox is the CERN cloud storage hub. It allows synchronising and sharing files on all major desktop and mobile platforms (Linux, Windows, MacOSX, Android, iOS) aiming to provide universal access and offline availability to any data stored in the CERN EOS infrastructure.

        With more than 12000 users registered in the system, CERNBox has responded to the high demand in our diverse community for an easily accessible cloud storage solution that also provides integration with other CERN services for big science: visualisation tools, interactive data analysis and real-time collaborative editing.

        Collaborative authoring of documents is now becoming standard practice with public cloud services, and within CERNBox we are looking into several options: from the collaborative editing of shared office documents with different solutions (Microsoft, OnlyOffice, Collabora), to integrating markdown and LaTeX editors, to exploring the evolution of Jupyter Notebooks towards collaborative editing, where the latter leverages the existing SWAN physics analysis service.

        We report on our experience managing this technology and applicable use-cases, also in a broader scientific and research context and its future evolution with highlights on the current development status and future roadmap. In particular we will highlight the future move to an architecture based on microservices to easily adapt and evolve the service to the technology and usage evolution, notably to unify CERN home directory services.

        Speaker: Hugo Gonzalez Labrador (CERN)
      • 183
        Cloud Storage for data-intensive sciences in science and industry

        In the last few years we have seen constant interest in technologies providing effective cloud storage for scientific use, matching the requirements of price, privacy and scientific usability. This interest is not limited to HEP and extends to other scientific fields due to the rapid increase in data: for example, "big data" is a characteristic of modern genomics, energy and financial services, to mention a few.

        The provision of cloud storage accessible via synchronisation and sharing interfaces has become an essential element of the service portfolios offered by research laboratories and universities. "Dropbox-like" services were created and now support HEP and other communities in their day-to-day tasks. The scope of these systems is therefore much broader than HEP: we will describe the usage of, and the plans to adopt, part of the tools originally conceived for our community in other areas. The adoption of cloud storage services in the main workflow for data analysis is the challenge we are now facing, extending the functionality of "traditional" cloud storage.

        What are the ingredients for these new classes of services? Is HEP today proposing interesting solutions for other future projects on the timescale of the high-luminosity LHC?

        The authors believe that HEP-developed technologies will constitute the backend for a new generation of services. Namely, our solution for exascale geographically distributed storage (EOS), the access and the federation of cloud storage across different domains (CERNBox) and the possibility to offer effective heavy-duty interactive data analysis services (SWAN) growing from this novel data infrastructure are the three key enablers for future evolution.

        In this presentation we will describe the usage of these technologies to build large content-delivery networks (e.g. AARNET Australia), the collaboration with other activities (e.g. the handling of satellite images from the Copernicus programme at JRC) and different partnerships with companies active in this field.

        Speaker: Hugo Gonzalez Labrador (CERN)
      • 184
        Ceph File System for the CERN HPC Infrastructure

        The Ceph File System (CephFS) is a software-defined network filesystem built upon the RADOS object store. In the Jewel and Luminous releases, CephFS was labeled as production-ready with horizontally scalable metadata performance. This paper seeks to evaluate that statement in relation to both the HPC and general IT infrastructure needs at CERN. We highlight the key metrics required by four users, including: POSIX compliance, single-user small-file latency, multi-user metadata throughput, and metadata horizontal scalability and failure tolerance. We will report on the experience so far and the future evolution of the service.

        In addition, we describe a new suite of micro-benchmarks which measure the small latencies of buffered and synchronous filesystem operations and can be used to quantify the evolving quality of a storage cluster over time. We also introduce a simple ping-like latency tool---fsping---which evaluates the time needed for two clients to notice file modifications in a shared filesystem. These tests, in combination with several classical HPC benchmarks run in single- and multi-user scenarios, paint a picture of CephFS as an emerging viable option for HPC storage and NFS-appliance replacements.
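
        By way of illustration only (not the actual fsping implementation), the measurement idea can be sketched in a few lines of Python: one client periodically rewrites a file on the shared filesystem while a second client polls it and reports how long the modification took to become visible. Paths, the polling interval and the command-line handling are all invented for this sketch.

          import os
          import sys
          import time

          def writer(path, interval=1.0):
              """On client A: periodically rewrite the shared file with the current time."""
              while True:
                  with open(path, "w") as f:
                      f.write(str(time.time()))
                  time.sleep(interval)

          def reader(path):
              """On client B: poll the file and report how old its content is when a change appears."""
              last_mtime = 0.0
              while True:
                  try:
                      mtime = os.stat(path).st_mtime
                  except FileNotFoundError:
                      time.sleep(0.01)
                      continue
                  if mtime != last_mtime:
                      last_mtime = mtime
                      with open(path) as f:
                          written_at = float(f.read() or 0.0)
                      print(f"propagation delay: {time.time() - written_at:.3f} s")
                  time.sleep(0.01)

          if __name__ == "__main__":
              role, path = sys.argv[1], sys.argv[2]   # e.g. "writer /cephfs/shared/fsping.txt"
              writer(path) if role == "writer" else reader(path)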

        Speaker: Herve Rousseau (CERN)
    • T5 - Software development: S4 - Frameworks Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Patricia Mendez Lorenzo (CERN)
      • 185
        The core software framework for the LHCb Upgrade

        The LHCb detector will be upgraded for LHC Run 3. The new, full software trigger must be able to sustain the 30 MHz proton-proton inelastic collision rate. The Gaudi framework currently used in LHCb has been re-engineered in order to enable the efficient usage of vector registers and of multi- and many-core architectures. This contribution presents the critical points that had to be tackled, the current status of the core software framework, and an outlook on the work programme that will address the challenges of the software trigger.

        Speaker: Sebastien Ponce (CERN)
      • 186
        Evolution of the ALICE Software Framework for LHC Run 3

        ALICE is one of the four major LHC experiments at CERN. When the accelerator enters the Run 3 data-taking period, starting in 2021, ALICE expects almost 100 times more Pb-Pb central collisions than now, resulting in a large increase of data throughput. In order to cope with this new challenge, the collaboration had to extensively rethink the whole data processing chain, with a tighter integration between Online and Offline computing worlds. Such a system, codenamed ALICE O2, is being developed in collaboration with the FAIR experiments at GSI. It is based on the ALFA framework which provides a generalised implementation of the ALICE High Level Trigger approach, designed around distributed software entities coordinating and communicating via message passing.

        We will highlight our efforts to integrate ALFA within the ALICE O2 environment. We analyse the challenges arising from the different running environments for production and development, and conclude on the requirements for a flexible and modular software framework. In particular we will present the ALICE O2 Data Processing Layer, which exploits ALICE-specific requirements in terms of the data model. The main goal is to reduce the complexity of developing algorithms and managing a distributed system, thereby leading to a significant simplification for the large majority of ALICE users. We will show examples of the usage of the Data Processing Layer in different contexts: local reconstruction after detector read-out, full reconstruction, simulation, and quality control. For each, we will provide a brief description, an evaluation of the performance and an assessment of the added value of the Data Processing Layer.

        Finally we will give an outlook on future developments, and on possible generalisations of some of the concepts and components we developed and their reabsorption into the ALFA framework.

        Speaker: Giulio Eulisse (CERN)
      • 187
        The ATLAS multithreaded offline framework

        In preparation for Run 3 of the LHC, scheduled to start in 2021, the ATLAS
        experiment is revising its offline software so as to better take advantage
        of machines with many cores. A major part of this effort is migrating the
        software to run as a fully multithreaded application, as this has been
        shown to significantly improve the memory scaling behavior. This talk will
        outline changes made to the software framework to support this migration,
        as well as describe a static code checker used to assist in locating
        violations of thread-safety rules. Preliminary experiences in migrating
        reconstruction algorithms will also be discussed.

        Speaker: Scott Snyder (Brookhaven National Laboratory (US))
      • 188
        GNA: new framework for statistical data analysis

        Measurement of physical parameters is usually done by fitting a numerical model of the experiment to the data. High-precision experiments require detailed models with a large number of uncertain parameters. The models should be computationally efficient; at the same time they should be flexible enough, since the analysis preparation requires a lot of testing.
        We address these problems by introducing the GNA framework. The computational model in GNA is represented by a directed acyclic graph. Each node is an operation on an array (matrix multiplication, reweighting or even a cross-section calculation) implemented in C++ for efficiency. The model is then assembled by connecting the nodes, outputs to inputs. The assembly is done within Python. Such an approach enables the user to build complex but flexible and efficient lazily evaluated models. GNA offers tools for handling a large number of parameters with uncertainties and correlations, for fitting and for statistical analysis.
        GNA stands for “global neutrino analysis”; it was originally developed for neutrino physics, for the data analysis of experiments such as Daya Bay, JUNO and NOvA.
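
        As an illustration of the idea, and not the actual GNA API, a minimal lazily evaluated computational graph could look as follows in Python; the class and function names are hypothetical, and a real framework would also propagate invalidation to dependent nodes when a parameter changes.

          import numpy as np

          class Node:
              """A lazily evaluated node of a computational graph (illustration only)."""
              def __init__(self, func, *inputs):
                  self.func = func        # operation on arrays, e.g. a matrix product
                  self.inputs = inputs    # upstream nodes whose outputs feed this node
                  self._cache = None      # cached result, recomputed only when marked dirty
                  self._dirty = True

              def value(self):
                  if self._dirty:
                      self._cache = self.func(*(n.value() for n in self.inputs))
                      self._dirty = False
                  return self._cache

          class Parameter(Node):
              """A leaf node holding a (possibly uncertain) parameter value."""
              def __init__(self, value):
                  super().__init__(None)
                  self._cache, self._dirty = np.asarray(value), False

              def set(self, value):
                  # a full framework would also invalidate every dependent node here
                  self._cache = np.asarray(value)

          # Assemble a toy model: spectrum = response @ (norm * flux)
          flux = Parameter(np.ones(10))
          norm = Parameter(2.0)
          response = Parameter(np.eye(10))
          scaled = Node(lambda n, f: n * f, norm, flux)
          spectrum = Node(lambda r, s: r @ s, response, scaled)
          print(spectrum.value())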

        Speaker: Ms Anna Fatkina (JINR)
      • 189
        Physics data processing with micro-services

        In this paper, we present a micro-services framework for developing data processing applications.
        We discuss functional decomposition strategies that help transition existing data processing applications into a micro-services environment. We will also demonstrate the advantages and disadvantages of this framework in terms of operational elasticity, vertical and horizontal scalability, productivity and maintenance. The discussion will be accompanied by two real-world implementations of the presented framework: 1) the JLAB CLAS12 event reconstruction and analysis application (a new design, using Java, C++ and Python micro-services within the same application), and 2) NASA EOS satellite data fusion and analytics (an application designed using legacy software components written in Fortran, presented as micro-services).

        Speaker: Vardan Gyurjyan (Jefferson Lab)
      • 190
        A generic data quality monitoring framework for HEP

        Data quality monitoring is the first step towards the certification of the recorded data for offline physics analysis. Dedicated monitoring frameworks have been developed by many experiments in the past and usually rely on the event data model (EDM) of the experiment, leading to a strong dependency on the data format and storage. We present here a generic data quality monitoring system, DQM4HEP, that has been developed without any assumption on the EDM. This improves code maintainability, portability across different experiments and re-usability for future experiments.
        We present the framework architecture and the various tools provided by the software package, as well as performance figures such as memory usage, stability and network bandwidth. We give an overview of the different experiments using DQM4HEP and the foreseen integration into other future experiments. We finally present the ongoing and future software development for DQM4HEP and the long-term prospects.

        Speaker: Remi Ete (DESY)
      • 191
        Performance optimization of the air shower simulation code for the Cherenkov Telescope Array

        The Cherenkov Telescope Array (CTA), currently under construction, is the next-generation instrument in the field of very high energy gamma-ray astronomy. The first data are expected by the end of 2018, while scientific operations will start in 2022 for a duration of about 30 years. In order to characterise the instrument response to the Cherenkov light emitted by atmospheric cosmic-ray showers, detailed Monte Carlo simulations will be regularly performed in parallel to CTA operations. The estimated CPU time associated with these simulations is very high, of the order of 200 million HS06 hours per year. Reducing the CPU time devoted to simulations would allow either a reduction of the infrastructure cost or a better coverage of the large phase space.
        In these proceedings, we focus on the main computing step (70% of the whole CPU time) implemented in the corsika program, and specifically on the module responsible for the propagation of Cherenkov photons in the atmosphere. We present our preliminary studies of different options for code optimization, with a particular focus on vectorization facilities (SIMD instructions) together with computing-precision tuning. Our proposals take care, as automatically as possible, of the hardware-portability constraints introduced by the grid computing environment that hosts these simulations. A performance evaluation in terms of running time and accuracy is provided.
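
        To illustrate the kind of gain that vectorization can bring, the sketch below contrasts a scalar loop with an array-level formulation using numpy; this is only a stand-in for the SIMD work on the corsika Cherenkov module, with photon propagation reduced to a toy straight-line step.

          import time
          import numpy as np

          n = 1_000_000
          rng = np.random.default_rng(0)
          x, y, z = rng.normal(size=(3, n))      # toy photon positions
          cx, cy, cz = rng.normal(size=(3, n))   # toy direction cosines (not normalised)
          step = 100.0

          # Scalar loop: one photon at a time
          t0 = time.perf_counter()
          out = np.empty((3, n))
          for i in range(n):
              out[0, i] = x[i] + step * cx[i]
              out[1, i] = y[i] + step * cy[i]
              out[2, i] = z[i] + step * cz[i]
          t_loop = time.perf_counter() - t0

          # Vectorized: the same propagation expressed on whole arrays
          t0 = time.perf_counter()
          out_vec = np.stack([x + step * cx, y + step * cy, z + step * cz])
          t_vec = time.perf_counter() - t0

          assert np.allclose(out, out_vec)
          print(f"loop: {t_loop:.3f} s, vectorized: {t_vec:.3f} s")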

        Speaker: Johan Bregeon (Laboratoire Univers et Particules, Université de Montpellier Place Eugène Bataillon - CC 72, CNRS/IN2P3, F-34095 Montpellier, France )
      • 192
        Floating-point profiling of ACTS using Verrou

        Numerical stability is not only critical to the correctness of scientific computations, but also has a direct impact on their software efficiency as it affects the convergence of iterative methods and the available choices of floating-point precision.

        Verrou is a Valgrind-based tool which challenges the stability of floating-point code by injecting random rounding errors in computations (a restricted form of Monte Carlo Arithmetic). Through bisection techniques, this tool is able to locate which parts of a codebase are correlated with significant losses of numerical accuracy.
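
        The random-rounding idea can be illustrated as follows; this is a rough emulation in Python of the concept, not Verrou itself, which instruments compiled code at the Valgrind level. The helper names are invented and math.nextafter requires Python 3.9 or later.

          import math
          import random

          def perturb_ulp(x):
              """Randomly round a result up or down by one unit in the last place (ULP)."""
              if x == 0.0 or not math.isfinite(x):
                  return x
              direction = math.inf if random.random() < 0.5 else -math.inf
              return math.nextafter(x, direction)

          def unstable_sum(values):
              """Naive summation with each partial sum randomly perturbed."""
              total = 0.0
              for v in values:
                  total = perturb_ulp(total + v)
              return total

          # The exact result is 1000.0, but catastrophic cancellation makes the naive
          # sum unstable; the spread across repetitions reveals the loss of accuracy.
          values = [1e16, 1.0, -1e16] * 1000
          samples = [unstable_sum(values) for _ in range(20)]
          print(min(samples), max(samples))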

        The ACTS project aims to establish an experiment-agnostic track reconstruction toolkit. It originates from the ATLAS Run 2 tracking software, has already seen strong adoption by FCC-hh, and is being evaluated for possible use by the CLICdp and Belle II experiments.

        We have used Verrou to evaluate the numerical stability of ACTS' Runge-Kutta track propagation. As this step is known to be a reconstruction bottleneck for many experiments, we expect significant performance gains to be achievable by introducing reduced-precision arithmetic in it. In this talk, we will present the results of this study and discuss the more general applicability of Monte Carlo arithmetic to High Energy Physics software.

        Speakers: Hadrien Benjamin Grasland (Université Paris-Saclay (FR)), Bruno Lathuilière
    • T6 - Machine learning and physics analysis: S4 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Convener: Mr Sergei Gleyser
      • 193
        Machine learning at the Cosmic Frontier

        Within the field of dark matter direct detection, there has been very little penetration of machine learning. This is primarily due to the difficulty of modeling such low-energy detectors for training sets (the keV energies are a factor of $10^{-10}$ smaller than at the LHC). Xenon detectors have been leading the field of dark matter direct detection for the last decade. The current front runner is XENON1T, which has invested heavily in full-chain simulations and a flexible Python-stack reconstruction chain. Accordingly, we have been able to explore what opportunities exist at the Cosmic Frontier with machine learning. I will report on three things. First, a new type of energy and spatial reconstruction algorithm that may enable this technology to reduce a major source of background while opening up double-beta-decay measurements in later such TPCs; this regression is achieved using deep convolutional neural networks. Second, how machine learning has been used to reduce certain nonphysical backgrounds that we have encountered, primarily using random forests. Third, how machine learning can be used in signal processing for classifying the two signal types that we observe. These three machine-learning R&D topics are applicable to all such detectors in our field, and this work constitutes the first known robust exploration of such opportunities for dark matter experiments.

        Speaker: Dr Christopher Tunnell (University of Chicago)
      • 194
        Deep Learning applied to the Cherenkov Telescope Array data analysis

        The Cherenkov Telescope Array (CTA) is the next generation of ground-based gamma-ray telescopes for gamma-ray astronomy. Two arrays will be deployed, composed of 19 telescopes in the Northern hemisphere and 99 telescopes in the Southern hemisphere. Observatory operations are planned to start in 2021, but first data from prototypes should be available already in 2019. Due to its very high sensitivity, CTA will record a tremendous amount of data that represents a computing challenge for the reconstruction software. Moreover, the vast majority of triggered events come from protons, whereas only photons are of interest for the study of astrophysical sources.
        Deep learning developments in the last few years have shown tremendous improvements in the analysis of data in many domains. Thanks to the huge amount of simulated data, and later of real data, produced by CTA, these algorithms look particularly well suited and very promising. Moreover, the trained neural networks show very good computing performance during execution.
        Here we present a first study of deep learning architectures applied to CTA simulated data to perform the energy reconstruction of the particles as well as their discrimination.

        Speaker: Dr Thomas Vuillaume (LAPP, CNRS, Univ. Savoie Mont-Blanc)
      • 195
        Machine Learning Techniques in the CMS Higgs to Di-muons Search

        With the accumulation of large datasets at an energy of 13 TeV, the LHC experiments can search for rare processes, where the extraction of the signal from the copious and varying Standard Model backgrounds poses increasing challenges. Techniques based on machine learning promise to achieve optimal search sensitivity and signal-to-background ratios for such searches. Taking the search for the Higgs boson decay to two muons in the CMS experiment as an example, we highlight several techniques for event classification and introduce a new application for creating automatic event categories, ultimately used to achieve the highest sensitivity of the analysis. We conclude with insights about strategies for the optimal utilization of machine learning algorithms with the data expected to be accumulated during Run 2 of the LHC.

        Speaker: Dimitri Bourilkov (University of Florida (US))
      • 196
        Using Generative Adversarial Networks for fast simulations in the ALICE Experiment

        Data Quality Assurance (QA) is an important aspect of every High-Energy Physics experiment, especially in the case of the ALICE Experiment at the Large Hadron Collider (LHC) whose detectors are extremely sophisticated and complex devices. To avoid processing low quality or redundant data, human experts are currently involved in assessing the detectors’ health during the collisions’ recording. However, for Run 3 where the amount of collected data is 100 times higher, manual detector health checks will not be feasible.

        To mitigate this problem, we use unsupervised machine learning methods, namely Generative Adversarial Networks (GANs), for fast simulations of trajectories reconstructed in the detector after particle collisions. We then compare the results of those simulations with the physical parameters reconstructed in the detector to identify anomalies that might lead to the corruption of recorded data. Since GAN-based simulation is fast enough to be performed during runtime, our method allows fully automatic online health checks of running conditions to be performed in real time. Furthermore, since the GAN's parameters can be updated using the incoming stream of data, our method can automatically adjust itself to changing conditions resulting, for example, from the detector's aging.

        Speaker: Dr Tomasz Piotr Trzcinski (Warsaw University of Technology (PL))
      • 197
        The use of adversaries for optimal neural network configuration

        Data from B-physics experiments at the KEKB collider have a substantial background from $e^{+}e^{-}\to q \bar{q}$ events. To suppress this we employ deep neural network algorithms, which provide improved signal-to-background discrimination. However, the neural network develops a substantial correlation with the $\Delta E$ kinematic variable used to distinguish signal from background in the final fit, due to its relationship with the input variables. The effect of this correlation is counteracted by deploying an adversarial neural network. Overall the adversarial deep neural network performs better than an unoptimised commercial package, NeuroBayes.

        Speaker: Prof. Martin Sevior (University of Melbourne)
      • 198
        Systematics aware learning: a case study in High Energy Physics

        Experimental science often has to cope with systematic errors that coherently bias data. We analyze this issue in the analysis of data produced by experiments at the Large Hadron Collider at CERN, as a case of supervised domain adaptation. The dataset used is a representative Higgs to tau tau analysis from ATLAS, released as part of the Kaggle Higgs ML challenge. Perturbations have been introduced into this dataset to mimic systematic errors. A classifier is trained to separate the Higgs signal from the background. The goal is to reduce the sensitivity of the classifier with respect to systematic uncertainties. The figure of merit is the total uncertainty, including statistical and systematic uncertainties.

        Systematics-aware learning should create an efficient representation that is insensitive to perturbations induced by the systematic effects. Different techniques have been experimented with and will be reported: (i) data augmentation (training on a mix of data generated by varying the nuisance parameter), (ii) adversarial learning (using the Pivot technique, an adversarial network is trained simultaneously with the classifier to reduce the classifier's sensitivity), and (iii) tangent propagation (regularizing the partial derivative of the classifier score with respect to the nuisance parameter).
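
        A minimal sketch of technique (i), data augmentation over the nuisance parameter, might look as follows; the global energy-scale shift used as the "systematic effect" and all names are invented for illustration.

          import numpy as np

          def energy_scale_shift(x, z):
              """Illustrative systematic effect: a global scale factor on all features."""
              return x * (1.0 + z)

          def augment(features, labels, nuisance_values, apply_systematic):
              """Replicate the training set at several nuisance-parameter values."""
              x_aug = np.concatenate([apply_systematic(features, z) for z in nuisance_values])
              y_aug = np.concatenate([labels] * len(nuisance_values))
              return x_aug, y_aug

          rng = np.random.default_rng(0)
          x_train = rng.normal(size=(1000, 5))
          y_train = rng.integers(0, 2, size=1000)

          # Train the classifier on a mix of nuisance-parameter values instead of only z = 0
          x_aug, y_aug = augment(x_train, y_train, [-0.03, 0.0, 0.03], energy_scale_shift)
          print(x_aug.shape, y_aug.shape)   # (3000, 5) (3000,)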

        Speaker: Mr Victor Estrade (LRI, UPSud, Université Paris-Saclay)
      • 199
        Binary classifier metrics for event selection optimization in HEP

        This presentation discusses some of the metrics used in HEP and other scientific domains for evaluating the relative quality of binary classifiers that are built using modern machine learning techniques. The use of the area under the ROC curve, which is common practice in the evaluation of diagnostic accuracy in the medical field and has now become widespread in many HEP applications, is critically reviewed and compared to other alternatives. In particular, the "precision-recall curve" that is routinely used in the information retrieval domain is pointed out as a more relevant tool for HEP applications, where it is equivalent to the use of signal selection efficiency and purity. Qualitative and quantitative arguments are presented to support this statement, including in particular the argument that the number of True Negatives (rejected background events) is irrelevant in HEP. Some specific metrics that are relevant to the optimization of various HEP analyses are also discussed. In particular, the relevance of the product of purity and efficiency is recalled for point estimation problems, where this metric has a simple interpretation as the fraction of Fisher information about the measured parameter which is retained after the selection of events (globally for counting measurements or locally in each histogram bin in the case of fits to differential distributions). While many of these concepts have been common knowledge since the 1990s, this presentation reviews them in the language of modern machine learning methodologies, also pointing out the many similarities and differences to other scientific domains where ML tools are used.
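
        As a small numerical illustration of the relation between the precision-recall curve and the purity-times-efficiency figure of merit, the following sketch uses scikit-learn on invented, HEP-like imbalanced toy scores; it is not taken from the presentation itself.

          import numpy as np
          from sklearn.metrics import roc_auc_score, precision_recall_curve

          rng = np.random.default_rng(0)
          # Toy scores: 1 000 signal events and 100 000 background events (HEP-like imbalance)
          y_true = np.concatenate([np.ones(1000), np.zeros(100000)])
          scores = np.concatenate([rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 100000)])

          print("ROC AUC:", roc_auc_score(y_true, scores))

          # Precision corresponds to purity and recall to signal efficiency; their product
          # is the figure of merit discussed above for counting (point-estimation) measurements.
          purity, efficiency, thresholds = precision_recall_curve(y_true, scores)
          product = purity[:-1] * efficiency[:-1]   # drop the last point, which has no threshold
          best = np.argmax(product)
          print("best cut:", thresholds[best],
                "purity:", purity[best], "efficiency:", efficiency[best])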

        Speaker: Andrea Valassi (CERN)
      • 200
        Application of a Convolutional Neural Network for image classification to the analysis of collisions in High Energy Physics

        The application of deep learning techniques using convolutional neural networks to the classification of particle collisions in High Energy Physics is explored. An intuitive approach to transform physical variables, like momenta of particles and jets, into a single image that captures the relevant information, is proposed. The idea is tested using a well known deep learning framework on a simulation dataset, including leptonic ttbar events and the corresponding background at 7 TeV from the CMS experiment at LHC, available as Open Data. This initial test shows competitive results when compared to more classical approaches, like those using feedforward neural networks.
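
        A minimal sketch of the general idea of mapping per-event kinematics onto a single image is shown below; the choice of (eta, phi) binning, ranges and pT weighting is illustrative and not necessarily that of the paper.

          import numpy as np

          def event_to_image(eta, phi, pt, n_bins=32):
              """Build a 2D image: pixel intensity = sum of pT falling into an (eta, phi) cell."""
              image, _, _ = np.histogram2d(
                  eta, phi, bins=n_bins,
                  range=[[-2.5, 2.5], [-np.pi, np.pi]],
                  weights=pt,
              )
              return image

          # Toy event with a handful of reconstructed objects
          eta = np.array([0.3, -1.2, 2.0, 0.8])
          phi = np.array([0.1, 2.5, -1.0, 3.0])
          pt = np.array([45.0, 30.0, 20.0, 60.0])
          print(event_to_image(eta, phi, pt).shape)   # (32, 32), ready to feed to a CNN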

        Speaker: Mr Ignacio Heredia Cacha (Instituto de Física de Cantabria)
    • T8 - Networks and facilities: S4 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Conveners: Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 201
        Next Generation of HEP CPU Benchmarks

        Benchmarking is a consolidated activity in High Energy Physics (HEP) computing, where large computing power is needed to support scientific workloads. In HEP, great attention is paid to the speed of the CPU in accomplishing high-throughput tasks characterised by a mixture of integer and floating-point operations and a memory footprint of a few gigabytes.

        Since 2009, HEP-SPEC06 (HS06) has been the benchmark adopted by the WLCG community to describe experiments' computing requirements, assess data centres' computing capacity and procure new hardware. It has been defined by the HEPiX Benchmarking Working Group and is based on a subset of the industry-standard SPEC CPU2006 benchmark suite.

        In recent years, following the evolution of CPU architectures and the adoption of new programming paradigms such as multi-threading and vectorization, it has turned out that HS06 is less representative of the relevant applications running on the WLCG infrastructure. In the meantime, in 2017, a new SPEC generation of benchmarks for CPU-intensive workloads (SPEC CPU 2017) has been delivered.

        This report summarises the findings of the HEPiX Benchmarking Working Group in comparing SPEC CPU 2017, as well as other HEP fast benchmarks, with typical WLCG workload mixes.

        Speaker: Domenico Giordano (CERN)
      • 202
        Sharing server nodes for storage and compute

        Based on the observation of low average CPU utilisation of several hundred disk servers in the EOS storage system at CERN, the Batch on EOS Extra Resources (BEER) project developed an approach to utilise these resources for batch processing. After initial proof-of-concept tests, showing almost no interference between the batch and storage services, a model for production has been developed and implemented. This has been deployed on part of the CERN EOS preproduction instance. The implementation and test results will be presented. The potential for additional resources at the CERN T0 centre is of the order of ten thousand hardware threads, and this is also a step towards a hyper-converged infrastructure.

        Speaker: David Smith (CERN)
      • 203
        MONIT: Monitoring the CERN Data Centres and the WLCG Infrastructure

        The new unified monitoring (MONIT) for the CERN Data Centres and for the WLCG Infrastructure is now based on established open-source technologies for the collection, streaming and storage of monitoring data. The previous solutions, based on in-house development and commercial software, are being replaced with widely-recognized technologies such as Collectd, Flume, Kafka, ElasticSearch, InfluxDB, Grafana and others. The monitoring infrastructure, fully based on CERN cloud resources, covers the whole workflow of the monitoring data: from collecting and validating metrics and logs to making them available for dashboards, reports and alarms.

        The deployment in production of this new DC and WLCG monitoring is well under way and this contribution provides a summary of the progress, hurdles met and lessons learned in using these open source technologies. It also focuses on the choices made to achieve the required levels of stability, scalability and performance of the MONIT monitoring service.

        Speaker: Alberto Aimar (CERN)
      • 204
        Challenges, solutions and lessons learnt in 7 years of Service Management at CERN

        CERN has been using ITIL Service Management methodologies and ServiceNow since early 2011. Initially a joint project between just the Information Technology and the General Services Departments, now most of CERN is using this common methodology and tool, and all departments are represented totally or partially in the CERN Service Catalogue.

        We will present a summary of the current situation of Service Management at CERN, as well as its recent evolution. Some topics will be explained in more detail, such as the incident versus request distinction, the Knowledge Base evolution, how to gather feedback about tickets from users, how to bring new services on board, how the recent renewals of the Service Desk and IT Helpdesk contracts were handled, and service privacy management.

        The various challenges, adopted solutions and lessons learnt in these topics will be presented. Finally, ongoing and future work will be presented, such as the upgrade of the CERN Service Portal, increased automation, new integrations and new services being brought on board.

        Speaker: David Martin Clavo (CERN)
      • 205
        Notifications workflows using the CERN IT central messaging infrastructure

        In the CERN IT agile infrastructure, Puppet, the CERN IT central messaging infrastructure and the roger application are the key constituents handling the configuration of the machines of the computer centre. The machine configuration at any given moment depends on its declared state in roger, and Puppet ensures the actual implementation of the desired configuration by running the puppet agent on the machine at regular intervals, typically every 90 minutes. Sometimes it is preferable that a configuration change is propagated immediately to the targeted machine, ahead of the next scheduled puppet agent run on this machine.
        The particular need to handle notifications in a highly scalable manner for a large-scale infrastructure has been satisfied with the implementation of the CERN Megabus architecture, based on the ActiveMQ messaging system.
        The design and implementation of the CERN Megabus architecture are introduced, followed by the implementation of the roger notification workflow. The choice of ActiveMQ is analysed, and the message flow between the roger notification producer and the CASTOR, EOS, BATCH and Load Balancing consumers is presented. The employment of predefined consumer modules in order to speed up the on-boarding of new CERN Megabus use cases is also described.
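
        As a rough sketch of what an ActiveMQ-based notification producer can look like with the stomp.py client; the broker address, credentials, destination name and message payload are placeholders and not the actual Megabus configuration.

          import json
          import stomp  # stomp.py client for ActiveMQ/STOMP brokers

          # Hypothetical broker endpoint and topic; the real Megabus destinations differ.
          conn = stomp.Connection([("megabus.example.cern.ch", 61613)])
          conn.connect("producer", "secret", wait=True)

          # A roger-like notification asking consumers to refresh the state of one node
          notification = {"hostname": "node1234.cern.ch", "appstate": "draining"}
          conn.send(destination="/topic/roger.notifications", body=json.dumps(notification))
          conn.disconnect()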

        Speaker: Zhechka Toteva (CERN)
      • 206
        Deploying a "push" model Prometheus

        Prometheus is a leading open-source monitoring and alerting tool. Prometheus utilizes a pull model, in the sense that it pulls metrics from monitored entities rather than receiving them as a push. But sometimes this can be a major headache, even without security in mind, when performing network gymnastics to reach your monitored entities. Moreover, system metrics might be required for consumption more than once (for example, you want to graph them, but at the same time you want to feed them to your machine learning pipeline in Apache Spark).
        Luckily, Prometheus' main market competitor, InfluxDB, arrives on the scene to help the situation with the main accompanying InfluxDB product: Telegraf.
        Telegraf is a flexible, low-profile, easily extensible metrics collector with a big community.
        In this talk we will offer our insight on implementing a "push"-model monitoring system, where metrics can be consumed multiple times, based on Telegraf, Kafka and Prometheus. We will discuss some pitfalls we met during the implementation, scaling issues and how we overcame them.
        We will also cover monitoring our monitoring system.

        Speaker: Hristo Umaru Mohamed (CERN)
      • 207
        Detection of erratic behavior in load balanced clusters of servers using a machine learning based method

        With the explosion of the number of distributed applications, a new dynamic server environment has emerged, grouping servers into clusters whose utilization depends on the current demand for the application.

        To provide reliable and smooth services it is crucial to detect and fix possible erratic behavior of individual servers in these clusters. Use of standard techniques for this purpose delivers suboptimal results.

        We have developed a method based on machine learning techniques which allows us to detect outliers indicating a possible problematic situation. The method inspects the performance of the rest of the cluster and provides system operators with additional information which allows them to quickly identify the failing nodes. We applied this method to develop a Spark application using the CERN MONIT architecture, and with this application we analyzed monitoring data from multiple clusters of dedicated servers in the CERN data centre.
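
        A condensed sketch of the idea, flagging a server whose metrics deviate from those of its cluster peers, is shown below using scikit-learn; the metric names and values are invented, and the production system described above is a Spark application running on CERN MONIT data rather than this standalone snippet.

          import pandas as pd
          from sklearn.ensemble import IsolationForest

          # Toy per-server metrics for one cluster at one point in time (invented values)
          metrics = pd.DataFrame({
              "host": [f"node{i:03d}" for i in range(8)],
              "cpu_load": [0.61, 0.58, 0.64, 0.60, 0.97, 0.59, 0.62, 0.63],
              "req_rate": [220, 230, 210, 225, 40, 228, 215, 219],
          })

          # Fit on the cluster itself: a node is an outlier if it deviates from its peers
          model = IsolationForest(contamination=0.1, random_state=0)
          labels = model.fit_predict(metrics[["cpu_load", "req_rate"]])
          print(metrics.loc[labels == -1, "host"].tolist())   # e.g. ['node004']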

        In this contribution we present our results achieved with this new method and with the Spark application for analytics of CERN monitoring data.

        Speaker: Martin Adam (Acad. of Sciences of the Czech Rep. (CZ))
      • 208
        Monitoring System of the AMS Science Operation Centre

        The Alpha Magnetic Spectrometer (AMS) is a high-energy physics experiment installed and operating on board the International Space Station (ISS) since May 2011 and expected to last through 2024 and beyond. The Science Operation Centre is in charge of the offline computing for the AMS experiment, including flight data production, Monte Carlo simulation, data management, data backup, etc. This paper introduces the design and implementation of the new monitoring system of the AMS SOC, from the monitoring data collection (backend) to the visualisation (frontend). The system is designed to be distributed, collecting data from various sources in different AMS computing centres, and intelligent, attempting to analyse the causes of service incidents and to suggest resolutions.

        Speaker: Baosong Shan (Beihang University (CN))
    • Posters: S1
      • 209
        FELIX: the new detector interface for the ATLAS experiment

        During the next major shutdown from 2019-2021, the ATLAS experiment at the LHC at CERN will adopt the Front-End Link eXchange (FELIX) system as the interface between the data acquisition, detector control and TTC (Timing, Trigger and Control) systems and new or updated trigger and detector front-end electronics. FELIX will function as a router between custom serial links from front end ASICs and FPGAs to data collection and processing components via a commodity switched network. Links may aggregate many slower links or be a single high bandwidth link. FELIX will also forward the LHC bunch-crossing clock, fixed latency trigger accepts and resets received from the TTC system to front-end electronics. The FELIX system uses commodity server technology in combination with FPGA-based PCIe I/O cards. The FELIX servers will run a software routing platform serving data to network clients. Commodity servers connected to FELIX systems via the same network will run the new Software Readout Driver (SW ROD) infrastructure for event fragment building and buffering, with support for detector or trigger specific data processing, and will serve the data upon request to the ATLAS High Level Trigger for Event Building and Selection. This presentation will cover the design and status of FELIX, the SW ROD, and results of early performance testing.

        Speaker: Serguei Kolos (University of California Irvine (US))
      • 210
        A cluster finding algorithm for free-streaming input data

        In position-sensitive detectors with segmented readout (pixels or strips), charged particles generally activate several adjacent read-out channels. The first step in the reconstruction of the hit position is thus to identify clusters of active channels associated with one particle crossing the detector. In conventionally triggered systems, where the association of raw data to events is given by a hardware trigger, this is an easy-to-solve problem. It becomes more involved, however, in untriggered, free-streaming read-out systems like the one employed by the CBM experiment. Here, the time coordinate of the single-channel measurement must be taken into account to decide whether neighbouring active channels belong to a cluster. A simple extension of well-known cluster-finding algorithms is not satisfactory because it involves increasing combinatorics, which is prohibitive for reconstruction in real time. In this talk, we present a cluster-finding solution for the Silicon Tracking System of the CBM experiment which avoids any combinatorics or loops over detector channels. Its execution time is thus independent of the size of the data packages (time slices) delivered by the data acquisition, making it suitable for use in online reconstruction.
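
        For illustration of the problem, a naive time-aware extension of a conventional cluster finder over channel-sorted digis might look as follows; note that this sketch still loops over channels and is therefore precisely what the loop-free CBM solution avoids. The data layout and the time cut are invented.

          def find_clusters(digis, time_cut=10.0):
              """Group digis into clusters of adjacent channels with compatible times.

              `digis` is a list of (channel, time, charge) tuples sorted by channel;
              a new cluster is opened whenever the channel is not adjacent to the
              previous one or the time difference exceeds `time_cut` (arbitrary units).
              """
              clusters, current = [], []
              for channel, time, charge in digis:
                  if current:
                      last_channel, last_time, _ = current[-1]
                      if channel != last_channel + 1 or abs(time - last_time) > time_cut:
                          clusters.append(current)
                          current = []
                  current.append((channel, time, charge))
              if current:
                  clusters.append(current)
              return clusters

          digis = [(10, 100.0, 1.2), (11, 102.0, 3.4), (12, 250.0, 0.8), (40, 101.0, 2.0)]
          print(find_clusters(digis))   # -> [[ch 10, ch 11], [ch 12], [ch 40]]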

        Speaker: Volker Friese (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 211
        A fully modular framework for detector simulations in ROOT

        To study the performance of the Micro Vertex Detector (MVD), a fully modularized framework has been developed. The main goals of this framework are easy adaptability to new sensor specifications or changes in the geometry, together with tight constraints on performance and memory usage.
        To achieve these goals, a framework has been built which decouples the functional model of the detector from the geometry model. This allows the framework to be adapted to any given functional model and geometry without any need for changes in the software.
        The active material of the detector is used as the smallest building block of this framework.
        The framework will be presented using the example of the MVD in the CbmRoot software, and an overview of the performance will be given. Possibilities for the use of multi-threading will also be shown.

        Speaker: Philipp Sitzmann
      • 212
        A new approach for ATLAS Athena job configuration

        The offline software framework of the ATLAS experiment (Athena) consists of many small components of various types like Algorithm, Tool or Service. To assemble these components into an executable application for event processing, a dedicated configuration step is necessary. The configuration of a particular job depends on the workflow (simulation, reconstruction, high-level trigger, overlay, calibration, analysis ...) and the input data (real or simulated data, beam-energy, ...) leading to a large number of possible configurations.

        The configuration step is done by executing python code. The resulting configuration depends on optionally pre-set flags as well as meta-data about the data to be processed that is found by peeking into the input file and even into databases. For the python configuration code, there is almost no structure enforced, leaving the full power of python to the user.

        While this approach did work, it proved to be error-prone and complicated to use. It also leads to jobs containing more components than they actually need.

        For LHC Run 3 a more robust system is envisioned. It is still based on python but enforces some structure and emphasises modularity. Concepts of the current system that were found to be useful are maintained, in particular the auto-configuration based on the input file.

        This contribution will briefly report about the configuration system used during LHC Run 1 and Run 2 and detail the prototype of an improved system to be used in Run 3 and beyond.

        Speaker: Walter Lampl (University of Arizona (US))
      • 213
        A new cluster finder for ALICE MUON arm for Run3

        Muon reconstruction is currently done entirely offline for ALICE. In Run 3 this is supposed to move online, with ALICE running in continuous readout with a minimum-bias Pb-Pb interaction rate of 50 kHz.

        There are numerous obstacles to getting the muon software to achieve the required performance, with the muon cluster finder being replaced and moved to run on a GPU inside the new O2 computing farm.
        We will give an overview of where the muon reconstruction currently spends most of its time, how the new cluster finder has been redone in line with the O2 performance requirements, and how it improves the performance of current offline muon reconstruction and of future online reconstruction.

        Speaker: Mr Sean Murray (University of Cape Town (ZA))
      • 214
        Addressing scalability with message queues: architecture and use cases for DIRAC interware

        The Message Queue architecture is an asynchronous communication scheme that provides an attractive solution for certain scenarios in the distributed computing model. The introduction of an intermediate component (the queue) in between the interacting processes allows the end-points to be decoupled, making the system more flexible and providing high scalability and redundancy. Message queue brokers such as RabbitMQ, ActiveMQ or Kafka are proven technologies that are widely used nowadays.
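
        By way of illustration of the decoupling provided by a message queue (this is not the DIRAC MQ interface itself), a producer and a consumer attached to a RabbitMQ queue could look as follows with the pika client; the broker address and queue name are invented.

          import pika

          # Producer side: publish a message and return immediately
          connection = pika.BlockingConnection(pika.ConnectionParameters("mq-broker.example.org"))
          channel = connection.channel()
          channel.queue_declare(queue="pilot.logging", durable=True)
          channel.basic_publish(exchange="", routing_key="pilot.logging",
                                body='{"pilot": "abc123", "status": "Running"}')
          connection.close()

          # Consumer side (typically a separate process): messages are delivered asynchronously,
          # so the two end-points never need to be available at the same time.
          def on_message(ch, method, properties, body):
              print("received:", body)
              ch.basic_ack(delivery_tag=method.delivery_tag)

          connection = pika.BlockingConnection(pika.ConnectionParameters("mq-broker.example.org"))
          channel = connection.channel()
          channel.queue_declare(queue="pilot.logging", durable=True)
          channel.basic_consume(queue="pilot.logging", on_message_callback=on_message)
          channel.start_consuming()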

        DIRAC is a general-purpose interware software for distributed computing systems, which offers a common interface to a number of heterogeneous providers and guarantees transparent and reliable usage of the resources. The DIRAC platform has been adopted by several scientific projects, including High Energy Physics communities like LHCb, the Linear Collider and Belle II.

        A generic Message Queue interface has been incorporated into the DIRAC framework to help solve the scalability challenges that must be addressed during LHC Run 3, starting in 2021. It allows the MQ scheme to be used for message exchange among the DIRAC components, or to communicate with third-party services. Within this contribution we will describe the integration of MQ systems with DIRAC, and several use cases will be shown. The focus will be put on the incorporation of MQ into the pilot logging system. Message Queues are also foreseen to be used as a backbone of the DIRAC component logging system and of monitoring.

        The results of the first performance tests will be presented.

        Speaker: Wojciech Jan Krzemien (National Centre for Nuclear Research (PL))
      • 215
        Adoption of ARC-CE and HTCondor at GridKa Tier 1

        The GridKa Tier 1 data and computing center hosts a significant share of WLCG processing resources. Providing these resources to all major LHC and other VOs requires an efficient, scalable and reliable cluster management. To satisfy this, GridKa has recently migrated its batch resources from CREAM-CE and PBS to ARC-CE and HTCondor. This contribution discusses the key highlights of the adoption of this middleware at the scale of a European Tier 1 center:

        As the largest WLCG Tier 1 using the ARC-CE plus HTCondor stack, GridKa is exemplary for migrating more than 20,000 cores over a time span of just a few weeks. Supporting multiple VOs, we have extensively studied the constraints and possibilities of scheduling jobs with vastly different requirements. We present a robust and maintainable optimisation of resource utilisation which still respects the constraints desired by the VOs. Furthermore, we explore the dynamic extension of our batch system, integrating cloud resources with a lightweight configuration mechanism.

        Speaker: Matthias Jochen Schnepf (KIT - Karlsruhe Institute of Technology (DE))
      • 216
        Advanced Analytics service to enhance workflow control at the ATLAS Production System

        Modern workload management systems that are responsible for central data production and processing in High Energy and Nuclear Physics experiments have highly complicated architectures and require a specialized control service for balancing resources and processing components. Such a service represents a comprehensive set of analytical tools, management utilities and monitoring views aimed at providing a deep understanding of internal processes, and is considered an extension of the situational-awareness analytic service. Its key points are the analysis of task processing, e.g., the selection and regulation of the key task features that affect its processing the most; the modeling of processed data lifecycles for further analysis, e.g., to generate guidelines for a particular stage of data processing; and the forecasting of processes with a focus on data and task states as well as on the management system itself, e.g., to detect the source of any potential malfunction.
        The prototype of the advanced analytics service will be an essential part of the analytical service of the ATLAS Production System (ProdSys2). The advanced analytics service uses tools such as Time-To-Complete estimation of processing units (i.e., tasks and chains of tasks) to control the processing state and to be able to highlight abnormal operations and executions. The obtained metrics are used in decision-making processes to regulate the system behaviour and resource consumption.

        Speaker: Mikhail Titov (National Research Centre Kurchatov Institute (RU))
      • 217
        Advanced Scheduling in IaaS Clouds

        IaaS clouds brought us greater flexibility in managing computing infrastructures, enabling us to mix different computing environments (e.g. Grid systems, web servers and even personal desktop-like systems) in the form of virtual machines (VMs) within the same hardware. The new paradigm automatically increased efficiency by switching from single-task dedicated physical servers to a greater number of smaller virtual servers, each running its own computing tasks. In this work we describe an approach to further improve hardware utilization in IaaS clouds through dynamic reallocation of VMs and overcommitment. This approach introduces new risks and challenges in automated cloud management, but can help increase the efficiency of cloud resource utilization, which may otherwise become a serious issue given the constant growth of computing infrastructures and the variety of modern workloads. We introduce a meta-scheduling framework and an example scheduling strategy able to handle dynamic reallocation of virtual machines, which currently supports the OpenNebula platform. The framework has a modular design and can be extended to support other platforms (e.g. OpenStack) which provide an API to communicate with the cloud.

        Speaker: Nikita Balashov (JINR)
      • 218
        ALICE O2 setup for TPC read-out tests for the LHC run 3

        ALICE (A Large Ion Collider Experiment) is one of the four big experiments at the Large Hadron Collider (LHC). For ALICE Run 3 there will be a major upgrade of several detectors as well as of the compute infrastructure, with a combined Online-Offline computing system (O2) to support continuous readout at much higher data rates than before (3 TB/s). The ALICE Time Projection Chamber (TPC) is the main contributor to the data volume, which is read out via the Common Readout Unit (CRU).
        For the first read-out test of a complete TPC inner read-out chamber (IROC) at the beginning of 2018, and the following tests of a complete TPC sector at the end of Q2 2018, we built an O2 prototype system using the O2 software stack which is currently being developed for LHC Run 3. The setup consists of 5 First-Level-Processing servers (FLPs) with two CRUs each as input nodes and 2 Event-Processing Nodes (EPNs). With this test setup we can evaluate the feasibility of the planned software tools on a small scale, as well as the read-out with the O2 processing and analysis framework. During the long shutdown this setup will be extended for pre-commissioning tests of the upgraded TPC detector.
        We give a summary of the deployed test setup for the TPC, as well as the experience so far and an outlook on the future steps towards the final O2 system.

        Speaker: Johannes Lehrbach (Johann-Wolfgang-Goethe Univ. (DE))
      • 219
        AlphaTwirl: a python library for summarizing event data into multi-dimensional categorical data

        AlphaTwirl is a python library that loops over event data and summarizes them into multi-dimensional categorical (binned) data as data frames. Event data, input to AlphaTwirl, are data with one entry (or row) for one event: for example, data in ROOT TTree with one entry per collision event of an LHC experiment. Event data are often large -- too large to be loaded in memory -- because they have as many entries as events. Multi-dimensional categorical data, the output of AlphaTwirl, have one row for one category. They are usually small -- small enough to be loaded in memory -- because they only have as many rows as categories. Users can, for example, import them as data frames into R and pandas, which usually load all data in memory, and can perform categorical data analyses with a rich set of data operations available in R and pandas. In this presentation, I will show (a) an example workflow of data analysis using AlphaTwirl and data frames, (b) the user interface of AlphaTwirl, e.g., how to specify conditions of event selection, binning and categories, and methods to summarize data in each category, and (c) features of implementation, such as concurrency in looping over large event data. In addition, I will mention particular analyses in CMS using AlphaTwirl. I will also discuss possibilities for future development.
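
        The underlying idea, summarising per-event rows into one row per category, can be illustrated with pandas; this is only a conceptual sketch and not the AlphaTwirl interface, and the variable names and binning are invented.

          import numpy as np
          import pandas as pd

          # Toy per-event data: one row per event, as in a flat ROOT TTree
          rng = np.random.default_rng(1)
          events = pd.DataFrame({
              "njets": rng.integers(0, 6, size=100000),
              "met": rng.exponential(50.0, size=100000),
          })

          # Summarize into multi-dimensional categorical data: one row per (njets, met bin)
          events["met_bin"] = pd.cut(events["met"], bins=[0, 50, 100, 200, np.inf])
          summary = (events.groupby(["njets", "met_bin"], observed=True)
                           .size().reset_index(name="n"))
          print(summary.head())   # small enough to fit in memory, unlike the event data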

        Speaker: Dr Tai Sakuma (University of Bristol (GB))
      • 220
        An integrated system for data quality and conditions assessment for the ATLAS Tile Calorimeter

        The ATLAS experiment records data from the proton-proton collisions produced by the Large Hadron Collider (LHC). The Tile Calorimeter is the hadronic sampling calorimeter of ATLAS in the region |eta| < 1.7. It uses iron absorbers and scintillators as active material. Jointly with the other calorimeters it is designed for the reconstruction of hadrons, jets, tau-particles and missing transverse energy. It also assists in muon identification. The TileCal is regularly monitored by several different systems, which were developed independently to meet distinct collaboration requirements. Any problems are reported and immediately investigated. The data quality (DQ) efficiency achieved was 100% in 2015, 98.9% in 2016 and 99.4% in 2017. Changes to detector status or calibrations are entered into a dedicated conditions database called COOL. Experts maintain the tools used by DQ shifters and the calibration teams during normal operation, and prepare new conditions for data reprocessing and MC production campaigns.

        Tile-in-ONE is a unique system which integrates all the web DQ monitoring and calibration systems and tools used by the TileCal, with a standard development technology and documentation. It also intends to abstract away from the user the need to know where and how to get the wanted data, by providing a user-friendly interface. It is based on a server containing a core, which represents the basic framework that loads the configuration, manages user settings and loads plug-ins at run-time; a set of services, which provide common features to be used by the plug-ins, such as connectors to different databases and resources; and the plug-ins themselves, which provide features at the top-level layer for the users. Moreover, a web environment is being designed to allow collaborators to develop their own plug-ins, test them and add them to the system. To make this possible, an API is used which allows any kind of application to be interpreted and displayed in a standard way.

        Speaker: Juraj Smiesko (Comenius University (SK))
      • 221
        Application of Deep Learning on Integrating Prediction, Provenance, and Optimization

        We investigate novel approaches using Deep Learning (DL) for efficient execution of workflows on distributed resources. Specifically, we studied the use of DL for job performance prediction, performance classification, and anomaly detection to improve the utilization of the computing resources.

        • Performance prediction:
          • capture the performance of workflows on multiple resources
          • consider intra-node task assignment

        • Performance classification: prediction of job success/failure
          • predict at regular intervals whether a job will succeed or fail (site reliability)
          • long short-term memory (LSTM) neural networks

        • Performance anomaly detection:
          • example: functions that consume unexpectedly large/small amounts of time

        We used the Belle II distributed computing workflow and modifications to the DIRAC system for these studies.

        Speakers: Dr Malachi Schram, Malachi Schram (Pacific Northwest National Laboratory)
      • 222
        ATLAS Distributed Computing: Its Central Services core

        The ATLAS Distributed Computing (ADC) Project is responsible for the off-line processing of data produced by the ATLAS experiment at the Large Hadron Collider (LHC) at CERN. It facilitates data and workload management for ATLAS computing on the Worldwide LHC Computing Grid (WLCG).

        ADC Central Services operations (CSops) is a vital part of ADC, responsible for the deployment and configuration of services needed by ATLAS computing and for the operation of those services on CERN IT infrastructure, providing knowledge of CERN IT services to ATLAS service managers and developers, and supporting them in case of issues. Currently this entails the management of thirty-seven different OpenStack projects, with more than five thousand cores allocated for these virtual machines, as well as overseeing the distribution of twenty-nine petabytes of storage space in EOS for ATLAS.

        As the LHC begins to get ready for the next long shut-down, which will bring in many new upgrades to allow for more data to be captured by the on-line systems, CSops must not only continue to support the existing services, but plan ahead for the expected increase in data, users, and services that will be required. In this paper we explain the current state of CSops as well as the systems put in place to maintain the services provided.

        Speaker: Chris Lee (University of Cape Town (ZA))
      • 223
        ATLAS Global Shares Implementation in the PanDA Workload Management System

        PanDA (Production and Distributed Analysis) is the workload management system for ATLAS across the Worldwide LHC Computing Grid. While analysis tasks are submitted to PanDA by over a thousand users following personal schedules (e.g. PhD or conference deadlines), production campaigns are scheduled by a central Physics Coordination group based on the organization’s calendar. The Physics Coordination group needs to allocate the amount of Grid resources dedicated to each activity, in order to manage sharing of CPU resources among various parallel campaigns and to make sure that results can be achieved in time for important deadlines.
        While dynamic and static shares on batch systems have been around for a long time, we are trying to move away from local resource partitioning and manage shares at a global level in the PanDA system. The global solution is not straightforward, given different requirements of the activities (number of cores, memory, I/O and CPU intensity), the heterogeneity of Grid resources (site/HW capabilities, batch configuration and queue setup) and constraints on data locality. We have therefore started the Global Shares project that follows a requirements-driven multi-step execution plan, starting from definition of nestable shares, implementing share-aware job dispatch, aligning internal processes with global shares and finally implementing a pilot stream control for controlling the batch slots while keeping late binding.
        This contribution will explain the development work and architectural changes in PanDA to implement Global Shares, and provide an operational point of view with the difficulties we found along the way.

        Speaker: Fernando Harald Barreiro Megino (University of Texas at Arlington)
      • 224
        ATLAS Software Installation on Supercomputers

        PowerPC and high-performance computers (HPC) are important resources for computing in the ATLAS experiment. The future LHC data processing will require more resources than Grid computing, currently using approximately 100,000 cores at well over 100 sites, can provide. Supercomputers are extremely powerful as they use the resources of hundreds of thousands of CPUs joined together. However, their architectures have different instruction sets. ATLAS binary software distributions for x86 chipsets do not fit these architectures, as emulation of these chipsets results in a huge performance loss. This presentation describes the methodology of ATLAS software installation from source code on supercomputers. The installation procedure includes downloading the ATLAS simulation release code, with 0.7 million C++ and Python lines, as well as the source code of more than 50 external packages, such as ROOT and Geant4, followed by compilation and rigorous unit and integration testing. The presentation reports the application of this procedure to the Titan HPC and the Summit PowerPC system at the Oak Ridge Leadership Computing Facility (OLCF).

        Speaker: Alexander Undrus (Brookhaven National Laboratory (US))
      • 225
        ATLAS utilisation of the Czech national HPC center

        The Czech national HPC center IT4Innovations, located in Ostrava, provides two HPC systems, Anselm and Salomon. The Salomon HPC has been amongst the hundred most powerful supercomputers on Earth since its commissioning in 2015. Both clusters were tested for usage by the ATLAS experiment for running simulation jobs. Several thousand core hours were allocated to the project for tests, but the main aim is to use free resources waiting for large parallel jobs of other users.

        Multiple strategies for ATLAS job execution were tested on the Salomon and Anselm HPCs. The solution described herein is based on the ATLAS experience with other HPC sites. ARC Compute Element (ARC-CE) installed at the grid site in Prague is used for job submission to Salomon. The ATLAS production system submits jobs to the ARC-CE via ARC Control Tower (aCT). The ARC-CE processes job requirements from aCT and creates a script for a batch system which is then executed via ssh. Sshfs is used to share scripts and input files between the site and the HPC cluster. The software used to run jobs is rsynced from the site's CVMFS installation to the HPC's scratch space every day to ensure availability of recent software.

        Speaker: Michal Svatos (Acad. of Sciences of the Czech Rep. (CZ))
      • 226
        Automating calibration at the Belle II detector

        In 2018 the Belle II detector will begin collecting data from $e^+e^-$ collisions at the SuperKEKB electron-positron collider at the High Energy Accelerator Research Organization (KEK, Tsukuba, Japan). Belle II aims to collect a data sample 50 times larger than the previous generation of B-Factories, taking advantage of the SuperKEKB design luminosity of $8\times10^{35} cm^{-2} s^{-1}$.

        It is crucial for the detector components at Belle II to be calibrated quickly and accurately to provide analysts with the best possible data promptly so that Belle II can remain competitive in key analyses. A calibration framework (CAF) has been developed in the Belle II Analysis Software Framework (basf2). This allows basf2 users to write and run calibration code themselves in an easy and common format, while taking advantage of parallelized submission to local batch systems.

        A system to perform the automatic submission and monitoring of CAF processes on newly acquired data is currently in development at Belle II. In recent years workflow management software (WMS) using Directed Acyclic Graphs (DAGs) has become increasingly popular as a way to specify complex and potentially dynamic workflows. A WMS + DAG system is used in the automated CAF submission system. The current status of this, and a brief review of the various WMS projects investigated, will be presented.
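
        As an illustration of the WMS + DAG idea, the minimal sketch below (plain Python, not the actual Belle II submission system) runs calibration tasks in dependency order; the task names and the runner are hypothetical.

            # Hypothetical sketch: execute calibration "jobs" in the order implied
            # by a directed acyclic graph of dependencies (a toy stand-in for a WMS).
            from graphlib import TopologicalSorter  # Python 3.9+

            def run(task):
                print("running calibration task:", task)   # placeholder for real submission

            # edges: task -> set of tasks it depends on (all names are made up)
            dag = {
                "pedestal_calib": set(),
                "timing_calib":   {"pedestal_calib"},
                "gain_calib":     {"pedestal_calib"},
                "validation":     {"timing_calib", "gain_calib"},
            }

            for task in TopologicalSorter(dag).static_order():
                run(task)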

        Speaker: David Dossett (University of Melbourne)
      • 227
        Automation and Testing for Simplified Software Deployment

        Creating software releases is one of the more tedious occupations in the life of
        a software developer. For this purpose we have tried to automate as many as
        possible of the repetitive tasks involved, from collecting the commits to running
        the software. For this simplification we rely in large part on free collaborative
        services built around GitHub: issue tracking, code review (GitHub), continuous
        integration (Travis-CI), static code analysis (Coverity). The dependencies and
        compilers used in the continuous integration are obtained via CVMFS used inside
        Docker. This allows one to run any desired compiler version (e.g., gcc 6.2,
        llvm 3.9) or tool (e.g., clang-format, pylint). To create tags for the software
        package the powerful GitHub API is used. A script was developed that first
        collates the release notes from the description of each pull request, commits
        the release notes file, and finally makes a tag. This moves the burden of
        writing release notes to the individual developer and away from the package
        maintainer. The deployment of software releases to CVMFS is handled via
        GitLab-CI. When a tag is made the software is built and automatically
        deployed. In this presentation we will describe the software infrastructure used
        for the iLCSoft and iLCDirac projects, which are used by CLICdp and the ILC detector
        collaborations, and give many examples of automation which might be useful
        for other collaborations as well.
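
        A minimal sketch of the tagging step described above, using the public GitHub REST API via the requests library; the repository, token placeholder and tag name are assumptions and this is not the actual iLCSoft script.

            # Hypothetical sketch: collate release notes from merged pull requests
            # and create a release (tag) through the GitHub REST API.
            import requests

            API = "https://api.github.com/repos/example-org/example-package"  # assumed repo
            HEADERS = {"Authorization": "token <personal-access-token>"}       # placeholder

            def collect_release_notes():
                """Concatenate the descriptions of recently merged pull requests."""
                prs = requests.get(API + "/pulls", headers=HEADERS,
                                   params={"state": "closed", "base": "master"}).json()
                notes = [pr["title"] + "\n" + (pr["body"] or "")
                         for pr in prs if pr.get("merged_at")]
                return "\n\n".join(notes)

            def make_release(tag):
                """Create the tag/release with the collated notes as its body."""
                payload = {"tag_name": tag, "name": tag, "body": collect_release_notes()}
                requests.post(API + "/releases", headers=HEADERS, json=payload).raise_for_status()

            make_release("v01-02")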

        Speaker: Andre Sailer (CERN)
      • 228
        Background sensitivity studies for GEM-based detectors using a Geant4 simulation

        One of the main challenges the CMS collaboration must overcome during the Phase-2 upgrade is the radiation damage to the detectors from the high integrated luminosity of the LHC and the very high pileup. The LHC will produce collisions at a rate of about 5x10^9/s. The particles emerging from these collisions and the radioactivity they induce will cause significant damage to the detectors and on-board electronics. In addition, background particles from this radiation field can induce spurious signals, resulting in a progressive degradation of the detector performance. A detailed simulation study of background particles and their interactions in a Gas Electron Multiplier based detector was performed using a standalone Geant4 simulation. Relevant information was extracted, such as the probability to generate a signal (sensitivity) and the properties of secondary particles, among others. This information leads to reduced uncertainties on experimental observables (e.g. accumulated charge) and could contribute to optimizing the detector design and to an accurate estimation of the exposure time in the irradiation facilities. Finally, the simulation results are compared with experimental measurements collected in past irradiation campaigns of GEM detectors at CERN.

        Speaker: Ilaria Vai (Universita and INFN (IT))
      • 229
        Bayesian optimisation of the SHiP active muon shield

        The SHiP experiment is a new general-purpose fixed-target experiment designed to complement collider experiments in the search for new physics. A 400 GeV/c proton beam from the CERN SPS will be dumped on a dense target to accumulate $2\times10^{20}$ protons on target in five years.

        A crucial part of the experiment is the active muon shield, which allows the detector to operate at a very high beam intensity while maintaining a zero-background environment for the search for new physics. To achieve this, the muon flux of $10^{11}$ muons per second has to be reduced by six orders of magnitude in the shortest distance possible.

        The presentation will describe the concept of the active muon shield, the particular challenges of this optimisation problem, which necessitate the use of modern optimisation techniques, and how they are overcome with these techniques. Recent results and their implications for the SHiP comprehensive design study and beyond are presented.
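
        A minimal sketch of a Bayesian optimisation loop of the kind referred to above, using scikit-optimize; the two shield parameters and the toy objective are purely illustrative and do not correspond to the actual SHiP shield parametrisation.

            # Hypothetical sketch: Gaussian-process based Bayesian optimisation of
            # a toy "shield" objective (weight penalty + residual muon flux term).
            from skopt import gp_minimize

            def objective(x):
                """Toy stand-in for the expensive simulation of the shield."""
                length, field = x
                weight_term = 0.01 * length * field         # heavier shield costs more
                flux_term = 100.0 / (length * field + 1.0)  # more bending reduces flux
                return weight_term + flux_term

            result = gp_minimize(objective,
                                 dimensions=[(10.0, 40.0),   # magnet length (m), assumed range
                                             (0.5, 2.0)],    # field strength (T), assumed range
                                 n_calls=30, random_state=1)
            print("best parameters:", result.x, "objective:", result.fun)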

        Speaker: Oliver Lantwin (Imperial College (GB))
      • 230
        Belle II Documentation Effort using Sphinx

        Over the last seven years the software stack of the next generation B factory experiment Belle II has grown to over one million lines of C++ and Python code, counting only the part included in offline software releases. This software is used by many physicists for their analysis, many of whom will be students with no prior experience in HEP software. Beginner-friendly and up-to-date documentation is vital for quick and efficient analysis.

        High energy physics experiments have traditionally been documented using a mix of wiki pages and auto-generated C++ documentation. This approach has a few drawbacks: much of the documentation which is not directly code-related is prone to becoming outdated, and the C++ documentation is seldom suited to new users.

        Sphinx, the well-known documentation tool for Python, is aimed at providing a single, structured manual. It provides comfortable support for automatic Python documentation. Moreover, it can be easily extended and thus allows us to document most of our custom concepts in an automated manner, using ROOT to expose the necessary interface to Python.

        This contribution will show our ongoing efforts to provide a beginner-friendly user manual for the Belle II software, using a novel combination of Sphinx and custom extensions to provide up-to-date automatic documentation of our framework and analysis tools.
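
        As an illustration of the kind of custom Sphinx extension mentioned above, the sketch below registers a simple directive; the directive name and its output are hypothetical and not the actual Belle II extension.

            # Hypothetical sketch of a tiny Sphinx extension: a directive that inserts
            # an auto-generated list of framework modules into the manual.
            from docutils import nodes
            from docutils.parsers.rst import Directive

            class ModuleListDirective(Directive):
                def run(self):
                    # In the real system this information would be obtained from the
                    # framework itself (e.g. via its Python bindings).
                    modules = ["EventInfoSetter", "ParticleLoader", "VariablesToNtuple"]
                    items = [nodes.list_item("", nodes.paragraph(text=m)) for m in modules]
                    return [nodes.bullet_list("", *items)]

            def setup(app):
                app.add_directive("b2-modulelist", ModuleListDirective)
                return {"version": "0.1", "parallel_read_safe": True}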

        Speaker: Dr Martin Ritter (LMU / Cluster Universe)
      • 231
        Belle2Lab - Interactive Tool for Public Analysis of Belle II Data

        Several data samples from the Belle II experiment will be available to the general public as part of the experiment's outreach activities. Belle2Lab is designed as an interactive graphical user interface to reconstructed particles, offering users basic particle selection tools. The tool is based on the Blockly JavaScript graphical code generator and can be run in an HTML5-capable browser. It allows users to describe different particle decays by selecting and combining particles from the data file, provides easy histogramming tools, and displays the results using the JSROOT library. During the analysis, the user can apply cuts on selected variables. Pseudocode generated by the user interface is sent to the execution server, which returns the histograms; these can also be fitted interactively. Belle2Lab is accessible in two ways: hosted on a single public web server, or as part of a virtual appliance consisting of a Linux operating system, a data sample, an analysis framework and a private web server. The former can be used for single access, while the latter is more suited for use in a classroom. In the presentation, I will describe the design and implementation of the interface and demonstrate its use. I will also outline our plans for future development.

        Speaker: Rok Pestotnik (Jozef Stefan Institute (SI))
      • 232
        Belle2VR - A Virtual Reality Visualization of Subatomic Particle Physics

        I describe a novel interactive virtual reality visualization of subatomic particle physics, designed as an educational tool for learning about and exploring the subatomic particle collision events of the Belle II experiment. The visualization is designed for untethered, locomotive virtual reality, allowing multiple simultaneous users to walk naturally through a virtual model of the Belle II detector and interact with and gather information about the particles that result from electron-positron collisions in the SuperKEKB accelerator. Belle2VR displays the detailed GEANT4-simulated history of each collision event superimposed on the complete detector geometry; the user can move freely through the scalable detector geometry and manipulate the history timeline with handheld controllers. Developed by an interdisciplinary team of researchers in physics, education, and virtual environments, the simulation will be integrated into the undergraduate physics curriculum at Virginia Tech. I describe the tool, including visualization features and design decisions, and outline our plans for future development.

        Speaker: Leo Piilonen (Virginia Tech)
      • 233
        Beyond the run boundaries: monitoring and calibrating the detector and the trigger validation of the CMS experiment across runs

        Many of the workflows in the CMS offline operation are designed around the concept of the acquisition of a run: a period of data-taking with stable detector and accelerator conditions. The capability of integrating statistics across several runs is an asset for statistically limited monitoring and calibration workflows. Crossing run boundaries requires careful evaluation of the conditions of the experiment, the operation mode of the accelerator and the version of the reconstruction software in order to build a robust aggregation logic capable of producing homogeneous datasets. The CMS collaboration invested in the automation of this process, building a framework for the continuous running of calibration workflows, data quality monitoring tasks and performance studies across run boundaries. The contribution will illustrate the design principles of this new tool and report on the operational experience during LHC Run 2 and the prospects for its future development.

        Speaker: Gianluca Cerminara (CERN)
      • 234
        CAOS: a tool for OpenStack accounting management

        The analysis and understanding of resources utilization in shared infrastructures, such as cloud environments, is crucial in order to provide better performance, administration and capacity planning.

        The management of resource usage of the OpenStack-based cloud infrastructures hosted at INFN-Padova, the Cloud Area Padovana and the INFN-PADOVA-STACK instance of the EGI Federated Cloud, started with the deployment of Ceilometer, the OpenStack component responsible for collecting and managing accounting information. However, by using Ceilometer alone we found some limiting problems related to the way it handles information: among others, the imbalance between storage and data retention requirements, and the complexity of computing custom metrics.

        In this contribution we present a tool, called CAOS, which we have been implementing to overcome the aforementioned issues. CAOS collects, manages and presents the data concerning resource usage of our OpenStack-based cloud infrastructures. By gathering data from both the Ceilometer service and OpenStack API, CAOS enables us to track resource usage at different levels (e.g. per project), in such a way that both current and past consumption of resources can be easily determined, stored and presented.
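
        A minimal sketch of the per-project aggregation idea described above, using plain Python on samples of the kind a collector could return from Ceilometer or the OpenStack API; the project names and field layout are assumptions.

            # Hypothetical sketch: aggregate raw usage samples into per-project totals.
            from collections import defaultdict

            # Each sample: (project, metric, value) -- e.g. as returned by a collector.
            samples = [
                ("cms-analysis", "vcpu_hours", 120.0),
                ("cms-analysis", "vcpu_hours",  80.0),
                ("egi-fedcloud", "vcpu_hours",  40.0),
            ]

            usage = defaultdict(float)
            for project, metric, value in samples:
                if metric == "vcpu_hours":
                    usage[project] += value

            for project, hours in sorted(usage.items()):
                print(f"{project}: {hours:.1f} vCPU hours")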

        Speakers: Fabrizio Chiarello (INFN - National Institute for Nuclear Physics), Dr Sergio Traldi (INFN - Sezione di Padova), Paolo Andreetto (Universita e INFN, Padova (IT))
      • 235
        CDS Videos - The new platform for CERN videos

        CERN Document Server (CDS, cds.cern.ch) is the CERN Institutional Repository based on the Invenio open source digital repository framework. It is a heterogeneous repository, containing more than 2 million records, including research publications, audiovisual material, images, and the CERN archives. Its mission is to store and preserve all the content produced at CERN as well as to make it easily available to any outlet interested.

        CDS aims to be CERN's document hub. To achieve this we are transforming CDS into an aggregator over specialized repositories, each having its own software stack, with features enabled based on the repository's content. The aim is to enable each content-producing community to have its own identity, both visually and functionally, as well as increased control over the data model and the submission, curation, management, and dissemination of the data. This separation is made possible by using the Invenio 3 framework.
        The first specialized repository created is CDS Videos (videos.cern.ch). It was launched in December 2017, and is the first step in the long-term project to migrate the entire CDS to the Invenio 3 framework.

        CDS Videos provides an integrated submission, long-term archival and dissemination of CERN video material. It offers a complete solution for the CERN video team, as well as for any department or user at CERN, to upload video productions. The CDS Videos system will ingest the video material, interact with the transcoding server for generating web and broadcaster subformats, mint DOI persistent identifiers, generate embeddable code to be reused by any other website, and store the master files for long-term archival.

        The talk will detail the software architecture of the CDS Videos as well as the infrastructure needed to run such a large-scale web application. It will present the technical solutions adopted, including the Python-based software stack (using among others Flask, IIIF, ElasticSearch, Celery, RabbitMQ) and the new AngularJS-based user interface which was exclusively designed for CDS Videos. It will also present our solution to a lossless migration of data: more than 5'000 videos from 1954 to 2017, summing up to 30TB of files, have been migrated from DFS to EOS in order to populate the CDS Videos platform. All this could be of high interest to other institutes wanting to reuse the CDS Videos open source code for creating their own video platform. Last but not least, the talk will detail how the user community at CERN and beyond can take advantage of the CDS Videos platform for creating and disseminating video content.

        Speakers: Mr Flavio Costa (CERN), Mr Esteban Gabancho (CERN), Mr Jose Benito Gonzalez Lopez (CERN), Mrs Ludmila Marian (CERN), Mr Nicola Tarocco (CERN), Mr Sebastian Witowski (CERN)
      • 236
        Cherenkov Light in Liquid Scintillator at the NOvA Experiment: A Cautionary Tale

        The observation of neutrino oscillations provides evidence of physics beyond the Standard Model, and the precise measurement of those oscillations remains an essential goal for the field of particle physics. The NOvA experiment is a long-baseline neutrino experiment composed of two finely-segmented liquid-scintillator detectors located off-axis from the NuMI muon-neutrino beam. Its primary goals are to perform precision measurements of the atmospheric mass splitting, determine the mass hierarchy, and probe CP violation in the leptonic sector.

        In early NOvA analyses, the systematic uncertainty on the non-linear scintillator response to the hadronic recoil component of neutrino interactions was one of the largest uncertainties considered. The initial scintillator model used second-order Birks-Chou suppression optimized to improve the data/Monte Carlo agreement for energy loss along proton and muon tracks; however, the optimal values were unrealistic and overestimated the brightness of untracked low-energy protons produced by neutron scattering. Subsequently, we determined that Cherenkov light produced at short wavelengths and absorbed and reemitted by the scintillator at experimentally useful wavelengths can almost entirely account for this mismodeling. This dramatically improved data/Monte Carlo agreement and reduced uncertainties due to scintillator response to nearly negligible levels. We will discuss the implementation and data-driven tuning of the Cherenkov absorption-reemission model and the impact of the new scintillator model on recent results.

        Speaker: Adam Aurisano (University of Cincinnati)
      • 237
        CMS ECAL DAQ Monitoring system

        The Large Hadron Collider (LHC) at CERN Geneva has entered the Run 2 era, colliding protons at a center of mass energy of 13 TeV at high instantaneous luminosity. The Compact Muon Solenoid (CMS) is a general-purpose particle detector experiment at the LHC. The CMS Electromagnetic Calorimeter (ECAL) has been designed to achieve excellent energy and position resolution for electrons and photons. The ECAL DAQ system is composed of more than 200 off-detector electronic boards. Monitoring the status of the DAQ boards during data taking is essential for an efficient and stable acquisition system. For this purpose a new web application, EcalView, has been developed. EcalView runs on a light JavaScript server framework based on Node.js and Express.js. It is composed of several routines (mainly one specific routine per board type) that cyclically collect the status of the electronics and serve the information to web requests. On the client side, graphical interfaces based on Vue.js libraries retrieve the data (only if new information is available) and display information on the electronics status and errors. For each board, detailed information can be loaded and presented in specific pages if requested by the expert. Server-side routines store information on electronics errors in an SQLite database in order to perform offline analysis of the long-term status of the boards.

        Speaker: Giacomo Cucciati (CERN)
      • 238
        CMS ECAL DAQ system

        In 2017 the Large Hadron Collider (LHC) at CERN provided an astonishing 50 fb-1 of proton-proton collisions at a center of mass energy of 13 TeV. The Compact Muon Solenoid (CMS) detector was able to record 90.3% of this data. During this period, the CMS Electromagnetic Calorimeter (ECAL), based on 75000 scintillating PbWO4 crystals and a silicon and lead preshower, has continued exhibiting excellent performance with a very stable data acquisition (DAQ) system. The ECAL DAQ system follows a modular and scalable scheme: the 75000 crystals are divided into sectors (FEDs), each of them controlled by three interconnected boards. These boards are responsible for the configuration and control of the front-end electronics, the generation of trigger primitives for the central CMS L1 trigger, and the collection of data. A multi-machine distributed software system configures the electronic boards and follows the life cycle of the acquisition process. The modular configuration of the ECAL electronics is reflected in the software, where a tree control structure is applied. Through a master web application, the user controls the communication with the sub-applications which are responsible for the off-detector board configurations. Since the beginning of Run 2 in 2015, many improvements to the ECAL DAQ have been implemented to reduce occasional errors, to mitigate single event upsets in the front-end electronics, and to improve the efficiency. Efforts at the software level have been made to introduce automatic recovery in case of errors. These procedures are mandatory for a reliable and efficient acquisition system.

        Speaker: Giacomo Cucciati (CERN)
      • 239
        CMS event reconstruction status in Run 2

        LHC Run 2 began in April 2015 with the restart of collisions in the CERN Large Hadron Collider. From the perspective of offline event reconstruction, the most relevant detector updates appeared in 2017: the restructuring of the pixel detector, with an additional layer closer to the beams, and improved photodetectors and readout chips for the hadron calorimeter, which will eventually allow a finer longitudinal segmentation. The long shutdown between Run 1 and Run 2 was instrumental for the optimization of the reconstruction code and for the introduction of new algorithms to mitigate the sensitivity to increased pileup, especially to the out-of-time contribution that came with the 25 ns separation between collisions, compared to 50 ns in Run 1. Such optimization continued in the following years, as the reconstruction code of CMS evolved together with the improving performance of the LHC. We will describe here the current status of the reconstruction software of the CMS experiment, with emphasis on some of the recently integrated developments.

        Speaker: Andrea Perrotta (Universita e INFN, Bologna (IT))
      • 240
        CMS Workflow Failures Recovery Panel, Towards AI-assisted Operation

        The central production system of CMS utilizes the LHC grid and effectively about 200 thousand cores, over about a hundred computing centers worldwide. Such a wide and unique distributed computing system is bound to sustain a certain rate of failures of various types. These are appropriately addressed a posteriori with site administrators. With up to 50 different campaigns ongoing concurrently, the range and diversity of the workload is wide and complex, leading to a certain amount of misconfigurations despite all efforts in request preparation. Most of the 2000 to 4000 datasets produced each week are produced in full automation, and datasets are delivered within an agreed level of completion. Despite efforts to reduce the rate of failure, there remains a good fraction of workflows that requires non-trivial intervention. This work remains for computing operators to do. We present here a tool which was developed to facilitate and improve this operation, with a view to reducing delays in delivery. A dense and comprehensive representation of the errors that occurred during the processing of a request helps expedite the investigation. Workflows that suffered from similar failures are bundled and presented as such to the operator. A realistically simplified operating-panel front-end is connected to a back-end automating the technical operations required, for ease of operation. The framework was built such that it collects both the decision and the information available to the operator when taking that decision. It is therefore possible to employ machine learning techniques to learn from the operator by training on labelled data. The operator's procedure is automated further by applying the decisions that are predicted with acceptable confidence. We present this tool, which improves operational efficiency and will lead to further development in handling failures in distributed computing resources using machine learning.
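
        A minimal sketch of the "learn from the operator" step described above, assuming hypothetical error-report features and operator decisions; the actual tool and its feature set are not shown.

            # Hypothetical sketch: train a classifier on labelled operator decisions and
            # only auto-apply predictions made with acceptable confidence.
            import numpy as np
            from sklearn.ensemble import RandomForestClassifier

            # Toy features per workflow: [n_failed_jobs, frac_site_errors, frac_memory_errors]
            X = np.array([[500, 0.9, 0.0], [20, 0.1, 0.8], [300, 0.2, 0.1], [15, 0.0, 0.9]])
            y = np.array(["resubmit", "raise_memory", "resubmit", "raise_memory"])  # made-up labels

            clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

            def suggest_action(features, threshold=0.8):
                """Return the predicted action only if the classifier is confident enough."""
                proba = clf.predict_proba([features])[0]
                best = proba.argmax()
                if proba[best] >= threshold:
                    return clf.classes_[best]
                return None  # leave the decision to the operator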

        Speaker: Jean-Roch Vlimant (California Institute of Technology (US))
      • 241
        Collaborative Scientific Authoring at CERN: A user-centered approach

        For over a year and a half we ran a CERN-wide trial of collaborative authoring platforms, understanding how the CERN community authors and co-authors, gathering the user needs and requirements and evaluating the available options. As a result, the Overleaf and ShareLaTeX cloud platforms are now fully available to the CERN Community. First, we will explain our user-centered approach step-by-step, how it led us to a decision and why it was important to follow this approach. Then, we will highlight how collaborative scientific authoring at CERN will benefit from the chosen cloud platforms.

        Speaker: Nikos Kasioumis (CERN)
      • 242
        Compact data stream for jets at ATLAS

        The LHC delivers an unprecedented number of proton-proton collisions
        to its experiments. In kinematic regimes first studied by earlier
        generations of collider experiments, the limiting factor to more
        deeply probing for new physics can be the online and offline
        computing, and offline storage, requirements for the recording and
        analysis of this data. In this contribution, we describe a strategy
        that the ATLAS experiment employs to overcome these limitations and
        make the most of LHC data during Run-2 - a compact data stream
        involving trigger-level jets, recorded at a far higher rate than is
        possible for full event data. We discuss the implementation of this
        stream and outline its technical challenges, as well as developments
        to further streamline it for more demanding 2018 conditions.

        Additionally, the results of an analysis of this dataset are shown to
        highlight the competitiveness and complementarity with traditional
        data streams.

        Speaker: William Kalderon (Lund University (SE))
      • 243
        Concurrent Adaptive Load Balancing at CERN

        CERN is using an increasing number of DNS-based load balanced aliases (currently over 600). We explain the Go-based concurrent implementation of the Load Balancing Daemon (LBD), how it is being progressively deployed using Puppet and how concurrency greatly improves scalability, ultimately allowing a single master-slave pair of OpenStack VMs to serve all LB aliases. We describe the Lbclient, which has been rewritten to use Collectd metrics to determine the status of a node. We also used Go for the Lbclient rewrite, as its concurrency features reduce the real time needed for the whole procedure, in particular when a node belongs to multiple LB aliases. We explain how the LBD server acts as an arbiter, getting feedback on load and health from the backend nodes using SNMP to decide which IP addresses the LB alias will present. While this architecture has long been used at CERN for DNS-based aliases, the LBD code is generic enough to drive other load balancers, so we explain how we investigated using it with HAProxy to provide adaptive responses to load and health monitoring.

        Speaker: Zhechka Toteva (CERN)
      • 244
        Conditions data handling in the multithreaded ATLAS framework

        In preparation for Run 3 of the LHC, the ATLAS experiment is migrating
        its offline software to use a multithreaded framework, which will allow
        multiple events to be processed simultaneously. This implies that the
        handling of non-event, time-dependent (conditions) data,
        such as calibrations and geometry, must also be extended to allow
        for multiple versions of such data to exist simultaneously. This has
        now been implemented as part of the new ATLAS framework. The detector
        geometry is included in this scheme by having sets of time-dependent
        displacements on top of a static base geometry.
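
        A minimal sketch of the idea of holding several versions of a conditions payload simultaneously, keyed by their interval of validity (IOV); this is illustrative Python only, not the ATLAS framework classes.

            # Hypothetical sketch: a conditions container that can hold multiple
            # payload versions simultaneously, each valid from a given time onwards.
            import bisect

            class ConditionsContainer:
                def __init__(self):
                    self._starts = []    # sorted IOV start times
                    self._payloads = []  # payload valid from the corresponding start

                def add(self, iov_start, payload):
                    i = bisect.bisect(self._starts, iov_start)
                    self._starts.insert(i, iov_start)
                    self._payloads.insert(i, payload)

                def get(self, event_time):
                    """Return the payload whose IOV covers the given event time."""
                    i = bisect.bisect_right(self._starts, event_time) - 1
                    if i < 0:
                        raise KeyError("no conditions for this event time")
                    return self._payloads[i]

            # Two calibration versions coexist; concurrently processed events pick their own.
            calib = ConditionsContainer()
            calib.add(0, {"pedestal": 1.0})
            calib.add(1000, {"pedestal": 1.2})
            assert calib.get(999)["pedestal"] == 1.0
            assert calib.get(1500)["pedestal"] == 1.2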

        Speaker: Charles Leggett (Lawrence Berkeley National Lab. (US))
      • 245
        Containers usage on the ATLAS grid infrastructure

        Containerization is a lightweight form of virtualization that provides reproducibility and isolation, responding to a number of long-standing use cases in running the ATLAS software on the grid. The development of Singularity, in particular its capability to run as a standalone executable, allows containers to be integrated into the submission framework of ATLAS (and other experiments). Operating system upgrades, data and software preservation, users' development cycles, software distribution at sites that do not have middleware and CVMFS, and isolation of the payload from the pilot environment are all use cases which can find a solution in containers. While Singularity seems easy enough to run, the variety of grid site configurations and workflows still makes it a challenge to use everywhere seamlessly. As usual the answer is to maintain a flexible system. This paper will describe the deployment of containers in the ATLAS experiment.

        Speaker: Alessandra Forti (University of Manchester (GB))
      • 246
        Continuous Performance Benchmarking Framework for ROOT

        Foundational software libraries such as ROOT are under intense pressure to avoid software regressions, including performance regressions. Continuous performance benchmarking, as a part of continuous integration and other code quality testing, is an industry best practice for understanding how the performance of a software product evolves over time. We present a framework, built from industry best practices and tools, to help understand ROOT code performance and monitor the efficiency of the code for several processor architectures. It additionally allows historical performance measurements for the ROOT I/O, vectorization and parallelization sub-systems.

        We utilize the Google benchmarking library to execute micro-benchmarks of selected hotspot functions in ROOT and related libraries. This provides detailed measurements, including memory usage and CPU instruction counters. Additionally, the framework manages traditional benchmarking pitfalls by repeating unstable benchmarks and providing a stable performance environment over time. The performance data points from continuous benchmarking are fed into an InfluxDB database and provided to the developer community via a Grafana-based dashboard. This performance benchmarking framework, built on generic and flexible infrastructure, is meant to be reusable by other projects.
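
        A minimal sketch of feeding a benchmark data point into InfluxDB with the influxdb Python client, as in the pipeline described above; the database, measurement and tag names are assumptions, not the actual ROOT setup.

            # Hypothetical sketch: push one benchmark data point to InfluxDB so that it
            # can be plotted on a Grafana dashboard.
            from influxdb import InfluxDBClient

            client = InfluxDBClient(host="localhost", port=8086, database="rootbench")  # assumed

            point = {
                "measurement": "microbenchmark",
                "tags": {"benchmark": "BM_TTree_Fill", "arch": "x86_64"},   # made-up names
                "fields": {"cpu_time_ns": 1520.0, "peak_rss_mb": 310.0},
            }
            client.write_points([point])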

        Speaker: Oksana Shadura (University of Nebraska Lincoln)
      • 247
        Data acquisition and software for the ATLAS Tile Calorimeter Phase-II Upgrade Demonstrator

        The LHC has planned a series of upgrades culminating in the High Luminosity LHC (HL-LHC), which will have an average luminosity 5-7 times larger than the design LHC value. The Tile Calorimeter (TileCal) is the hadronic sampling calorimeter installed in the central region of the ATLAS detector. It uses iron absorbers and scintillators as active material. TileCal will undergo a substantial upgrade to accommodate the HL-LHC parameters. The TileCal read-out electronics has been fully redesigned, introducing a new read-out strategy.
        The photomultiplier signals will be digitized and transferred to the TileCal PreProcessors (TilePPr) located off-detector for every bunch crossing, requiring a data bandwidth of 40 Tbps. The TilePPr will provide preprocessed information to the first level of trigger through a low and deterministic data path for every bunch crossing (40 MHz). In parallel, the data samples will be stored in pipeline memories and the data of the events selected by the ATLAS central trigger system will be transferred to the ATLAS global Data AcQuisition (DAQ) system for further processing.
        Extensive standalone tests have been performed recently with beam at the CERN accelerator facilities. External beam detectors have been used to measure the beam position and to generate a trigger signal when the beam impinges on the detector module, while a Demonstrator system of the TileCal electronics for the HL-LHC upgrade has been successfully employed to read out the calorimeter signals.
        This contribution describes in detail the data processing and the hardware, firmware, software and simulation components of the TileCal Demonstrator readout system. It presents the beam detector elements and the logic to detect different beam particles and to generate the trigger signals. In addition, the system integration tests and results from the tests with beam performed at CERN will be presented.

        Speaker: Xiaoguang Yue (Ruprecht Karls Universitaet Heidelberg (DE))
      • 248
        Data Quality Monitoring and Prompt Processing in the protoDUNE-SP experiment

        The DUNE Collaboration is pursuing an experimental program (named protoDUNE)
        which involves a beam test of two large-scale prototypes of the DUNE Far Detector
        at CERN in 2018. The volume of data to be collected by the protoDUNE-SP (the single-phase detector) will amount to a few petabytes and the sustained rate of data sent to mass
        storage will be in the range of a few hundred MB per second. After collection the data
        will be committed to storage at CERN and immediately transmitted to Fermi National
        Accelerator Laboratory in the US for processing, analysis and long-term preservation.
        The protoDUNE experiment requires substantial Data Quality Monitoring
        capabilities in order to ascertain the condition of the detector and its various subsystems.
        We present the design of the protoDUNE Prompt Processing System, its deployment at
        CERN and its performance during the data challenges conducted in 2017 and 2018.

        Speaker: Maxim Potekhin (Brookhaven National Laboratory (US))
      • 249
        dCache development and testing on Openstack

        Various sites providing storage for experiments in high energy particle physics and photon science deploy dCache as a flexible and modern large-scale storage system. As such, dCache is a complex and elaborate software framework, which needs test-driven development in order to ensure a smooth and bug-free release cycle. So far, tests for dCache have been performed on dedicated hosts emulating the usual HEP environment. Since the tests are only run as part of the release cycle, the load of the machines is on average quite small. In order to use the computing resources more efficiently, the dCache testing infrastructure has been ported to the DESY OpenStack installation. Using the flexible resource allocation provided by OpenStack, the test machines are spawned on demand and configured by Puppet similarly to regular worker nodes. After finishing these tests, the machines are deleted, freeing the computing resources for other users.

        Speaker: Christian Voss (Rheinisch-Westfaelische Tech. Hoch. (DE))
      • 250
        DDS – The Dynamic Deployment System

        The Dynamic Deployment System (DDS) is a tool-set that automates and significantly simplifies a deployment of user-defined processes and their dependencies on any resource management system (RMS) using a given topology. DDS is a part of the ALFA framework.

        A number of basic concepts are taken into account in DDS. DDS implements a command line tool-set and API following the single-responsibility principle. The system treats a user's task as a black box: it can be an executable or a script. It also provides watchdogging and rule-based execution of tasks. DDS implements a plug-in system to abstract from the RMS. Additionally, it ships SSH and localhost plug-ins which can be used when no RMS is available. DDS doesn't require pre-installation or pre-configuration on the worker nodes. It deploys private facilities on demand within isolated sandboxes. The system provides a key-value property propagation engine, which can be used to configure tasks at runtime. DDS also provides a lightweight API for tasks to exchange messages, so-called custom commands.

        In this report a detailed description, current status and future development plans of the DDS will be highlighted.

        Speaker: Andrey Lebedev (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 251
        Deployment and monitoring of grid infrastructure and jobs

        We investigate the automatic deployment and scaling of grid infrastructure components as virtual machines in OpenStack. To optimize the CVMFS usage per hypervisor, we study different approaches to sharing CVMFS caches and cache VMs between multiple client VMs.
        For monitoring, we study container solutions and extend them to monitor non-containerized applications within cgroups resource controls. This allows standard container monitoring toolkits to be harnessed for non-containerized applications without the immediate need to migrate to a fully containerized deployment.

        Speaker: Thomas Hartmann (Deutsches Elektronen-Synchrotron (DE))
      • 252
        Design and development of vulnerability management portal for DMZ admins powered by DBPowder

        It is difficult to promote cyber security measures in research institutes, especially in a DMZ network that allows connections from outside networks. This difficulty mainly comes from two kinds of variety. One is the various requirements of the servers operated by each research group. The other is the divergent skill levels among server administrators. Uniform procedures rarely fit the management of such servers.

        One of the solutions is vulnerability management. To address the challenges, our approaches are as follows. One is to offer a simple and powerful vulnerability management service to administrators of DMZ hosts (DMZ admins). The other is flexibility and efficiency in the development process of the service.

        To achieve these requirements, we designed and developed a vulnerability management portal site for DMZ admins, named DMZ User's Portal. This talk presents the design of the DMZ User's Portal and the development process with a development framework named DBPowder. Using the DMZ User's Portal, each DMZ admin can easily perform vulnerability scans on his/her own servers, and can then grasp and manage their security by himself/herself. Also, we are developing the DBPowder object-relational mapping (ORM) framework for flexibility and efficiency in the development process of the DMZ User's Portal.

        Speaker: Tadashi Murakami (KEK)
      • 253
        Developing a monitoring system for Cloud-based distributed datacenters

        Besides their increasing complexity and the variety of resources and services they provide, large data centers nowadays often belong to a distributed network and need non-conventional monitoring tools. This contribution describes the implementation of a monitoring system able to provide active support for problem solving to the system administrators.
        The key components are information collection and analysis. Information is acquired from multiple levels in order to allow algorithms to recognize a malfunction and suggest possible root causes, reducing service downtime.
        The project has been developed using the Bari ReCaS data center as a testbed. The information is gathered from Zabbix, OpenStack and HTCondor, as the local monitoring system, cloud platform and batch system respectively.
        Big Data solutions belonging to the Hadoop ecosystem have been selected: Flume and Kafka as transport layer and Spark as analysis component. Multiple tools have been used to store data, such as the Hadoop Distributed File System, HBase and Neo4j. InfluxDB-Grafana and Elasticsearch-Kibana are used as visualization components. Event extraction, correlation and propagation algorithms have also been implemented, using Artificial Intelligence and graph libraries to provide the root-cause feature. Results are forwarded to experts by email or Slack, using Riemann.
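
        A minimal sketch of the transport/analysis boundary described above, consuming monitoring events from Kafka with the kafka-python client; the topic and field names are assumptions.

            # Hypothetical sketch: consume raw monitoring events from Kafka and flag
            # simple anomalies before they are passed on to the analysis layer.
            import json
            from kafka import KafkaConsumer

            consumer = KafkaConsumer("datacenter-metrics",                 # assumed topic
                                     bootstrap_servers="localhost:9092",
                                     value_deserializer=lambda v: json.loads(v.decode()))

            for message in consumer:
                event = message.value            # e.g. {"host": ..., "metric": ..., "value": ...}
                if event.get("metric") == "load1" and event.get("value", 0) > 50:
                    print("possible malfunction on", event.get("host"))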

        Speaker: Gioacchino Vino (Universita e INFN, Bari (IT))
      • 254
        Developing a Software Management and Testing Platform-as-a-Service for Research Software

        There is a growing need to incorporate sustainable software practices into High Energy Physics. Widely supported tools offering source code management, continuous integration, unit testing and software quality assurance can greatly help improve standards. However, for resource-limited projects there is an understandable inertia in diverting effort to cover systems maintenance and application support for these tools. An externally hosted turnkey solution is therefore required to enable projects to deliver optimal code without incurring additional overhead.

        This presentation describes the development of a Platform-as-a-Service (PaaS) solution that combines software management and testing applications into a consolidated service. We will report on how cloud computing platforms can satisfy the desired objectives by providing agile deployment, service orchestration and the automatic scaling of resources. We will outline how software management and testing experiences from larger HEP experiments have informed the platform design to make a solution that benefits the wider research software community. Based upon feedback from early adopters we will show how this solution can lead to cost-effective resource pooling and promote the sharing of expertise between projects through collaborative code review.

        Speaker: Andrew John Washbrook (The University of Edinburgh (GB))
      • 255
        Development of profiling system for low-energy physics

        The Standard Model of particle physics is well established. However, searches for new physics beyond the Standard Model, such as dark matter, require a thousand to a million times more simulated events than Standard Model studies. This calls for further software development, especially of the simulation toolkits. In addition, computing is evolving, and the simulation toolkits must be developed to accommodate the evolving computing architectures. An efficient simulation toolkit is therefore needed, together with a profiling system to verify it. For Geant4, a widely used simulation toolkit, profiling systems for the higher-energy physics domain, such as the LHC experiments, are well developed and have contributed to the development of the software. Profiling systems for the low-energy physics domain, however, are still at an early stage. We therefore develop such a system and show performance results obtained with it, with profiling carried out as the software develops. These profiling systems could also be used to follow the development of the software for evolving computing architectures.

        Speaker: Sang Un Ahn (Korea Institute of Science & Technology Information (KR))
      • 256
        Distributed and Integrated IaaS and PaaS Cloud for Research - The INFN Corporate Cloud solution.

        INFN Corporate Cloud (INFN-CC) is a geographically distributed private cloud infrastructure, based on OpenStack, that has recently been deployed in three of the major INFN data centres in Italy. INFN-CC has a twofold purpose: on one hand, its fully redundant architecture and its resiliency characteristics make it the perfect environment for providing critical network services to the INFN community; on the other hand, the fact that it is hosted in large, modern data centres makes INFN-CC the platform of choice for a number of scientific computing use cases.
        INFN-CC also deploys a higher PaaS layer, developed within the EU-funded project INDIGO-DataCloud (www.indigo-datacloud.eu), in order to provide the INFN scientific communities not only with easier access to computing and storage resources, but also with automatic instantiation and configuration of services or applications used in their everyday work, such as batch systems on demand or big-data analytics facilities. The PaaS layer, together with the underlying IaaS, is able to provide automatic scalability of the instantiated clusters and fault tolerance in case of single-node or complete site failures.
        We present the hardware and software architecture, the challenges and advantages of this solution, together with the results of the first supported use cases.

        Speaker: Stefano Stalio (INFN)
      • 257
        Equal cost multi pathing in high power systems with TRILL

        This work is devoted to the results of creating the first module of the data processing center at the Joint Institute for Nuclear Research for modeling and processing experiments. The issues related to handling the enormous data flow from the LHC experimental installations and the problems of distributed storage are considered. The article presents a hierarchical diagram of the network farm and a basic model of the network architecture levels. Protocols for deploying full-mesh network topologies are considered. Two systems for monitoring the network segment and the load of the data channels are described: the former is standard packaged software, the latter is newly designed software with a graph-drawing application. Data obtained experimentally from the 40G interfaces by each of the monitoring systems are presented, and their behavior is analyzed. The main result is a discrepancy between the experimental data and the theoretical expectation of equal weight balancing of the traffic when transmitting packet data over equivalent edges of the graph. It is shown that the distribution of the traffic over such routes is arbitrary and inconsistent with the documented balancing formula. In conclusion, the behavior of the traffic under extreme conditions is analyzed. There are two main questions to be answered: how is the data transfer distributed over four equivalent routes, and what happens if an overload takes place? The need to compare the traffic behavior in various data centers with the help of traffic generators is suggested.

        Speaker: Andrey Baginyan (Joint Institute for Nuclear Research (RU))
      • 258
        Establishment of a new WLCG Tier Center using HTCondorCE on UMD middleware.

        WLCG, the Grid computing infrastructure used by CERN researchers, is based on two kinds of middleware. One of them, UMD middleware, is widely used in many European research groups to build a grid computing environment. The most widely used system in the UMD middleware environment was the combination of CREAM-CE and the batch job manager Torque. In recent years, however, there have been many difficulties due to problems with Torque. As an alternative, there has been a lot of effort to use HTCondor instead of Torque. One option is to use HTCondorCE in UMD middleware. HTCondorCE was developed by the OSG middleware group; however, HTCondor, which is the foundation of HTCondorCE, does not depend on the middleware and can be used with UMD middleware as well. The KISTI center, recently registered as a CMS Tier-2 center, has built a computing center using HTCondorCE based on UMD middleware. We are prepared to share with other groups the experience and the trial and error gathered during the establishment process.

        Speaker: Dr Geonmo Ryu (Korea Institute of Science & Technology Information (KR))
      • 259
        Evolution of monitoring, accounting and alerting services at INFN-CNAF Tier1

        CNAF is the national center of INFN for IT services. The Tier-1 data center operated at CNAF provides computing and storage resources mainly to scientific communities such as those working on the four LHC experiments and 30 more experiments in which INFN is involved.
        In past years, every CNAF department used to choose its preferred tools for monitoring, accounting and alerting. In particular, the teams at the Tier-1 department
        were using several software packages, such as Lemon (developed at CERN and customized to suit the needs of a datacenter managing scientific data), Nagios (mainly used for alerting purposes), a system based on the Graphite database, and a number of in-house developed services for web reporting.
        In 2015, a task force started with the goal of defining and implementing a common infrastructure based on Sensu, InfluxDB and Grafana, to be adopted by the various CNAF departments.
        Once the new infrastructure was deployed, a major task was to adapt the whole monitoring, accounting and alerting services.
        In this paper we present the steps followed at the Tier-1 in order to fully complete the migration, having all the new services put into production and the old ones dismissed or no more needed. We describe how the monitoring sensors and alerting checks have been redesigned to be adapted to the Sensu platform. Furthermore, we explain the collection of accounting data for storage and computing. The design of web dashboards for data presentation on a common web portal is then reported. Finally, we outline the porting of historical data from Lemon and Graphite systems to InfluxDB.

        Speaker: Mr Enrico Fattibene (INFN - CNAF)
      • 260
        Evolving CERN’s Network Configuration Management System

        The CERN IT Communication Systems group is in charge of providing various wired and wireless communication services across the laboratory. Among them, the group designs, installs and manages a large complex of networks: external connectivity, the data-centre network (serving central services and the WLCG), the campus network (providing connectivity to users on site), and last but not least dedicated networks for the LHC accelerator control and experiments.

        Overall, these networks comprise approximately 400 routers and 4000 switches from multiple vendors and different generations, with heterogeneous configurations depending on the network area they serve. To ensure a consistent, reproducible configuration across all these devices, an in-house software package (denoted "cfmgr", Perl-based) has been developed and augmented over the past 20 years. Based on a central network database (which stores the network model information for all connected devices at CERN), cfmgr is able to derive and enforce the desired configuration on all these network devices.

        The cloud computing demand for increased agility in network provisioning has resulted in the development of a network configuration ecosystem (both from network vendors and from the open-source community). Faced with the requirement of deploying a new generation of routers, we have evaluated the functionality of various network configuration tools and libraries, and we are planning to evolve our network configuration management platform by modularizing it into components that can be changed independently, making use of open-source libraries:
        - the generation of an abstract network device configuration;
        - the translation of the abstract model to a device specific configuration, and
        - enforcing the configuration on the network devices.

        We will present an overview of our study of the available network configuration tools (mainly NAPALM and the orchestration platforms that have modules for using it), as well as the architecture of our original and evolved network configuration management software.

        Speaker: Stefan Nicolae Stancu (CERN)
      • 261
        Experience in using public Cloud for Belle II experiment within HNSciCloud Project.

        The current level of flexibility reached by Cloud providers enables physicists to take advantage of extra resources to extend the distributed computing infrastructure supporting High Energy Physics experiments. However, the discussion about the optimal usage of such resources is still ongoing. Moreover, because each Cloud provider offers its own interfaces, API set and service level, integration into an experiment's computing model requires specific work for each provider.
        In this paper, we summarize the experience gained in the usage of Cloud resources offered by different providers within the European project Helix Nebula Science Cloud for the Belle II experiment. The goal of the use case was to run Monte Carlo production jobs properly on dynamically created virtual machines.
        To do that, we defined a golden image, based on CentOS, and then set up the basic environment, the internal and external network and a local DNS service.
        For each cloud, we implemented a cache server based on Squid, while the orchestrator has been centralized on a single server located in the INFN-Napoli infrastructure. Finally, we integrated the available resources into the Belle II production framework based on DIRAC, using VCYCLE as Virtual Machine life cycle manager.
        After a tuning activity, which required the optimization of several aspects including image distribution and network configuration, we were able to run production jobs successfully on the different Clouds.

        Speaker: Silvio Pardi (INFN)
      • 262
        Factory Monitoring for the 21st century

        A key aspect of pilot-based grid operations is the pilot (glidein) factories. The proper and efficient use of any central building block of the grid infrastructure is essential for operations, and glideinWMS factories are no exception. The monitoring package for glideinWMS factories was originally developed when the factories were serving a couple of VOs and tens of sites. Nowadays, with factories serving tens of VOs and hundreds of sites around the globe, an update of the monitoring is due. Moreover, the availability of industry-standard and open-source storage and graphing packages opens new opportunities. In this work we present the changes made to the factory monitoring to leverage different technologies: ElasticSearch, RabbitMQ, Grafana and InfluxDB, providing a centralized view of the status and work of several glideinWMS factories located on different continents.

        Speaker: Edgar Fajardo Hernandez (Univ. of California San Diego (US))
      • 263
        Fair Share Scheduler for OpenNebula (FaSS): implementation and performance tests

        A small Cloud infrastructure for scientific computing likely operates in a saturated regime, which imposes constraints on the free auto-scaling of applications. Tenants typically pay a priori for a fraction of the overall resources. Within this business model, an advanced scheduling strategy is needed in order to optimize the data centre occupancy.
        FaSS, a Fair Share Scheduler service for OpenNebula (ONE), addresses this issue by satisfying resource requests according to an algorithm which prioritizes tasks according to an initial weight and to the historical resource usage of the project. In this contribution, we describe the implementation of FaSS version 1.0, released in March 2017 as a product of the INDIGO-DataCloud project. The software was designed to be as little intrusive as possible in the ONE code, and interacts with ONE exclusively through its XML-RPC interface. The native ONE scheduler is preserved for matching requests to available resources.
        FaSS is made up of five functional components: the Priority Manager (PM), a set of fair-share algorithms, the Terminator (TM), the XML-RPC interface and the database. The main module, the PM, periodically requests the list of pending Virtual Machines (VMs) from ONE and recalculates the priorities in the queue by interacting with an algorithm module of choice. In FaSS 1.0, the default algorithm is Slurm's MultiFactor. The TM module runs asynchronously with respect to the PM and is responsible for removing from the queue VMs that have been in the pending state for too long, as well as terminating or suspending running VMs after a configurable time-to-live. The XML-RPC server of FaSS intercepts the calls from the First-In-First-Out scheduler of ONE and sends back the reordered VM queue. The FaSS database is InfluxDB. It stores the initial and recalculated VM priorities and some additional information for accounting purposes. No information already present in the ONE DB is duplicated in FaSS.
        In this contribution we are also going to show the results of FaSS functional and stress tests performed at the Cloud infrastructure of the INFN-Torino computing centre.
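
        A minimal sketch of a multifactor priority recalculation of the kind described above, with made-up weights and a toy pending queue; it is not the actual FaSS or Slurm MultiFactor code.

            # Hypothetical sketch: recompute priorities of pending VMs from an initial
            # project weight and the project's historical resource usage, then reorder.
            pending_vms = [
                {"id": 1, "project": "theory", "initial_weight": 1.0},
                {"id": 2, "project": "cms",    "initial_weight": 2.0},
                {"id": 3, "project": "theory", "initial_weight": 1.0},
            ]
            historical_usage = {"theory": 5000.0, "cms": 20000.0}   # vCPU hours, made up

            def priority(vm, w_share=1.0, w_usage=0.5):
                """Higher initial weight raises priority; heavy past usage lowers it."""
                used = historical_usage.get(vm["project"], 0.0)
                return w_share * vm["initial_weight"] - w_usage * used / 10000.0

            reordered = sorted(pending_vms, key=priority, reverse=True)
            print([vm["id"] for vm in reordered])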

        Speakers: Dr Stefano Bagnasco (Istituto Nazionale di Fisica Nucleare, Torino), Sara Vallero (Universita e INFN Torino (IT)), Valentina Zaccolo (Universita e INFN Torino (IT))
      • 264
        Fast online reconstruction of short-lived particles with KF Particle Finder in the CBM experiment

        The future heavy-ion experiment CBM at the FAIR facility will study the QCD phase diagram in the region of high baryon chemical potential at relatively moderate temperatures, where a complex structure is predicted by modern theories. In order to detect possible signatures of these structures, the physics program of the experiment includes a comprehensive study of extremely rare probes such as charmed particles, dileptons, multi-strange particles and their antiparticles. Thus, the operation scenario assumes extremely high interaction rates of up to $10^{7}$ collisions per second. Having no clear signatures for a hardware trigger, CBM will perform the full event reconstruction online, including the reconstruction of short-lived particles.

        The KF Particle Finder package was developed to solve this task. The package searches for more than 100 decay channels covering signals from most of the physics cases of the CBM experiment: strange particles, strange resonances, hypernuclei, low-mass vector mesons, charmonium, and open-charm particles. The package provides rich functionality for the reconstruction of short-lived particles, including the reconstruction of decay trees. Based on the Kalman filter mathematics, it allows not only the parameters but also their errors to be estimated.

        For fast operation, the algorithms for particle reconstruction were optimised with respect to speed while keeping the same reconstruction quality and efficiency. The KF Particle Finder package was fully SIMDised and parallelised. For better utilisation of the computational power of modern processing units, all calculations were implemented in single precision. The tools for track fitting that prepare the input data were also SIMDised and parallelised. The performance of KF Particle Finder on many-core computing architectures is presented and discussed.

        Speaker: Maksym Zyzak (GSI)
      • 265
        Feature Updates In Pyglidein, An HTCondor Glidein Generator

        IceCube is a cubic-kilometer neutrino detector located at the South Pole. IceCube’s simulation and production processing requirements far exceed the number of CPUs and GPUs available in house. Collaboration members commit resources in the form of cluster time at institutions around the world. IceCube also signs up for allocations from large clusters in the United States such as XSEDE. All of these disparate cluster resources are homogenized with IceCube’s own glidein software, Pyglidein. Pyglidein uses a pull model to launch HTCondor glideins around the world, which reduces the need for complicated firewall configuration changes at glidein sites. We present the most recent Pyglidein feature enhancements, including improvements to monitoring and integration testing. Glidein jobs now ship logs directly to S3 using presigned PUT URLs, monitor the state of resources via HTCondor startd cron jobs, and send metrics back to IceCube. The increased visibility reduces debugging time and frustration for administrators and data processors. The Pyglidein client has been containerized alongside HTCondor, Torque, and SLURM to automate integration testing across all three schedulers. New features can now be released quickly and confidently without disrupting processing jobs.
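
        The presigned-URL log shipping mentioned above can be illustrated with a short boto3 sketch; the bucket and key names are invented and the real Pyglidein configuration may differ.

        import boto3
        import requests

        # Submit side: credentials are available here, so a presigned PUT URL can be generated.
        s3 = boto3.client("s3")
        url = s3.generate_presigned_url(
            "put_object",
            Params={"Bucket": "pyglidein-logs", "Key": "site-A/job-1234.log"},
            ExpiresIn=3600,  # the glidein has one hour to upload its log
        )

        # Worker side: no credentials are needed, a plain HTTP PUT against the URL suffices.
        with open("job-1234.log", "rb") as f:
            requests.put(url, data=f).raise_for_status()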

        Speaker: David Schultz (University of Wisconsin-Madison)
      • 266
        FERRY: Access Control and Quota Management Service

        Fermilab is developing the Frontier Experiments RegistRY (FERRY) service, which provides a centralized repository for access control and job management attributes such as batch and storage access policies, quotas, batch priorities and NIS attributes for cluster configuration. This paper describes the FERRY architecture, deployment and integration with services that consume the stored information.
        The Grid community has developed several access control management services over the last decade. Over time, support for Fermilab experiments has required the collection and management of more access control and quota attributes. At the same time, the various services used for this purpose, namely VOMS-Admin, GUMS and Vulcan (an in-house authorization service for CMS analysis jobs running locally), are being abandoned by the community.
        FERRY has multiple goals: maintaining a central repository for currently scattered information related to users' attributes; providing a RESTful API that allows uniform data retrieval by services; and providing a replacement for all the abandoned Grid services.
        FERRY will be integrated with the ServiceNow (SNOW) ticketing service and will use it as its user interface. In addition to the standard workflows for request approval and task creation, SNOW will invoke orchestration that automates access to the FERRY API. Our expectation is that FERRY will drastically improve the user experience as well as decrease the effort spent on support by service administrators.
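
        As an illustrative sketch of how a consuming service might query such a RESTful API, the snippet below uses Python requests against a hypothetical endpoint with hypothetical parameter names; the actual FERRY API paths, parameters and authentication scheme may differ.

        import requests

        FERRY_URL = "https://ferry.example.org:8443/api"   # placeholder base URL

        def get_storage_quota(username: str, resource: str) -> dict:
            # X.509 client authentication, as is common for Grid services (assumption).
            r = requests.get(
                f"{FERRY_URL}/getStorageQuota",             # hypothetical endpoint name
                params={"username": username, "resourcename": resource},
                cert=("usercert.pem", "userkey.pem"),
                verify="/etc/grid-security/certificates",
            )
            r.raise_for_status()
            return r.json()

        if __name__ == "__main__":
            print(get_storage_quota("jdoe", "dcache"))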

        Speaker: Vito Di Benedetto (Fermi National Accelerator Lab. (US))
      • 267
        g4tools/4.x with an HDF5 IO driver and row-wise ntuple to handle single file IO in the ROOT format in a parallel-computing context.

        g4tools is a collection of pure header classes intended to be a technical low-level layer of the analysis category introduced in Geant4 release 9.5 to help Geant4 users manage their histograms and ntuples in various file formats. In the g4tools bundled with the latest Geant4 release (10.4, December 2017), we introduced a new HDF5 IO driver for histograms and column-wise paged ntuples, as well as a "row-wise" paged ntuple logic for the ROOT format, to support the management of a single file in a multi-threaded or MPI parallel environment. We will present these developments, with particular emphasis on the usage of the HDF5 binary IO library to write/read typical HEP objects such as histograms and paged ntuples; this library is widely used in other domains of science but has so far been largely ignored in HEP.
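
        To give a flavour of the kind of HDF5 layout involved, the sketch below writes and reads back a one-dimensional histogram using Python and h5py rather than the g4tools C++ classes; the group and dataset names are arbitrary and do not reproduce the g4tools schema.

        import h5py
        import numpy as np

        counts, edges = np.histogram(np.random.normal(size=10000), bins=50)

        with h5py.File("histos.h5", "w") as f:
            g = f.create_group("histo/gauss")
            g.create_dataset("bin_edges", data=edges)       # 51 edges for 50 bins
            g.create_dataset("bin_contents", data=counts)
            g.attrs["title"] = "toy Gaussian"

        with h5py.File("histos.h5", "r") as f:
            g = f["histo/gauss"]
            print(g.attrs["title"], g["bin_contents"][:5])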

        Speaker: Dr Guy Barrand (CNRS/IN2P3/LAL)
      • 268
        gopaw stands for "Good Old PAW". gopaw is a rewriting of CERN/PAW using the softinex inlib/exlib C++ classes.

        A user once said: with PAW I had the impression of doing physics, with ROOT I have the impression of typing C++. So why not return to doing physics?! We will present how gopaw is built, with particular emphasis on its portability, its handling of multiple file formats (including ROOT/IO and HDF5), its unified graphics based on the inlib/sg scene graph manager (see CHEP 2013 for softinex) and its openness, which permits the support of multiple scripting languages (at least KUIP, but also Lua and Python). We also want to address the deeper issue of whether we have the right language(s) to manipulate HEP data, as well as the relationship between users and doers (physicists and engineers) around software in HEP. In the 1990s, when doers pushed to leave FORTRAN for C++ in order to have better IO, better plotting and better graphics, was it unavoidable to push users to leave KUIP, a command system that, all things considered, was perhaps closer to the physics than a C++ interpreter prompt exposing technicalities not essential for a user? In the 1990s, was it not possible to simply keep the facade of CERN/PAW for users and, behind the scenes, have the doers smoothly migrate everything toward new technologies? Today more and more (scripting) languages are around: Python, Lua, Julia, etc., but all of these are generic languages that do not know HEP data natively. To do analysis in a more intuitive (and reliable) way, the author has the strong feeling that a piece is still lacking in our computing environment: some kind of strongly HEP-data-oriented language with keywords such as "run", "event", "cut", "histo", "ntuple", "fit" and "plot" beside generic keywords such as "for", "if" and "class". We would also like to take the occasion of this CHEP, and this presentation, to share thoughts (and moods) about all these points.

        Speaker: Dr Guy Barrand (Laboratoire de l'Accélérateur Linéaire, Université Paris-Sud, CNRS-IN2P3, Orsay, France.)
      • 269
        GridKa network development towards LHC run 3 and run 4 (HL)

        The LAN and WAN development of DE-KIT will be shown from the very beginning to the current status. DE-KIT is the German Tier-1 center collaborating with the Large Hadron Collider (LHC) at CERN. This includes the local area network capacity ramp-up from 10 Gbps over 40 Gbps to 100 Gbps, as well as the wide area connections. It will be demonstrated how the deployed setup serves the current capacity requirements and enables scaling.
        The changes include the IPv6 dual-stack setup of DE-KIT, some of the challenges overcome during the deployment, the tweaking necessary to satisfy all routing requirements, and the avoidance of the two-interface strategy used with IPv4, where one interface is for internal use only with private addresses and the second interface has a public address; all systems will receive only a single IPv6 address.
        The talk will also present a forecast of the LAN and WAN of an LHC collaborating center, including the two LHC VPNs, LHCOPN and LHCONE, which will offer the resources for LHC Run 3 and, looking further ahead, Run 4 (High Luminosity). One focus will be the challenges of building network connections and offering bandwidth capacity to cloud providers collaborating with the LHC. This last part will be prepared in close cooperation with the NRENs.

        Speaker: Bruno Heinrich Hoeft (KIT - Karlsruhe Institute of Technology (DE))
      • 270
        HPC resource integration into CMS Computing via HEPCloud

        The higher energy and luminosity from the LHC in Run 2 have put increased pressure on CMS computing resources. Extrapolating to even higher luminosities (and thus higher event complexities and trigger rates) beyond Run 3, it becomes clear that simply scaling up the current model of CMS computing alone will become economically unfeasible. High Performance Computing (HPC) facilities, widely used in scientific computing outside of HEP, have the potential to help fill the gap. Here we describe the USCMS efforts to integrate US HPC resources into CMS Computing via the HEPCloud project at Fermilab. We present advancements in our ability to use NERSC resources at scale and efforts to integrate other HPC sites as well. We present experiences in the elastic use of HPC resources, quickly scaling up usage when required by CMS workflows. We also present performance studies of the CMS multi-threaded framework on both Haswell and KNL HPC resources.

        Speaker: Dirk Hufnagel (Fermi National Accelerator Lab. (US))
      • 271
        Improving ATLAS computing resource utilization with HammerCloud

        HammerCloud is a framework to commission, test, and benchmark ATLAS computing resources and components of various distributed systems with realistic full-chain experiment workflows. HammerCloud contributes to ATLAS Distributed Computing (ADC) operations and automation efforts, providing automated resource exclusion and recovery tools that help re-focus operational manpower on areas which have yet to be automated, and improving the utilization of available computing resources.

        We present the recent evolution of the auto-exclusion/recovery tools: faster inclusion of new resources in the testing machinery, machine learning algorithms for anomaly detection, the categorization of resources as master vs. slave for the purpose of blacklisting, and a tool for auto-exclusion/recovery of resources triggered by Event Service job failures that is being extended to other workflows besides the Event Service.

        We describe how HammerCloud helped commissioning various concepts and components of distributed systems: simplified configuration of queues for workflows of different activities (unified queues), components of Pilot (new movers), components of AGIS (controller), distributed data management system (protocols, direct data access, ObjectStore tests).

        We summarize updates that brought HammerCloud up to date with developments in ADC and improved its flexibility to adapt to the new activities and workflows to respond to evolving needs of the ADC Operations team in a timely manner.

        Speaker: Jaroslava Schovancova (CERN)
      • 272
        Improving the Scheduling Efficiency of a Global Multi-core HTCondor Pool in CMS

        Scheduling multi-core workflows in a global HTCondor pool is a multi-dimensional problem whose solution depends on the requirements of the job payloads, the characteristics of the available resources, and boundary conditions such as fair share and prioritization imposed on the matching of jobs to resources. Within the context of a dedicated task force, CMS has significantly increased the scheduling efficiency of workflows in reusable multi-core pilots by addressing limitations of the glideinWMS pilots and improving the accuracy of resource requests, the efficiency and speed of the HTCondor infrastructure, and the job matching algorithms.

        Speaker: James Letts (Univ. of California San Diego (US))
      • 273
        Increasing Windows security by hardening PC configurations

        Over 8000 Windows PCs are actively used on the CERN site for tasks ranging from controlling the accelerator facilities to processing invoices. PCs are managed through CERN's Computer Management Framework and Group Policies, with configurations deployed based on machine sets and a lot of autonomy left to the end-users. While the generic central configuration works well for the majority of the users, a specific hardened PC configuration is now provided for users who require stronger resilience against external attacks. This presentation will describe the technical choices and configurations involved and discuss the effectiveness of the hardened PC approach.

        Speaker: Pablo Martin Zamora (CERN)
      • 274
        Integrated automation for configuration management and operations in the ATLAS online computing farm

        The online farm of the ATLAS experiment at the LHC, consisting of
        nearly 4000 PCs with various characteristics, provides configuration
        and control of the detector and performs the collection, processing,
        selection, and conveyance of event data from the front-end electronics
        to mass storage.

        Different aspects of the farm management are already accessible via
        several tools. The status and health of each host are monitored by a
        system based on Icinga2 and Ganglia. PuppetDB gathers centrally all
        the status information from Puppet, the configuration management tool
        used to ensure configuration consistency of every host. The in-house
        Configuration Database controls DHCP and PXE, integrating also
        external information sources.

        In this paper we present our roadmap for integrating these and other
        data sources and systems, and building a higher level of abstraction
        on top of this foundation. An automation and orchestration tool will
        be able to use these systems and replace lengthy manual procedures,
        some of which also require interactions with other systems and teams,
        e.g. for the repair of a faulty host. Finally, an inventory and
        tracking system will complement the available data sources, keep track
        of host history, and improve the evaluation of long-term lifecycle
        management and purchase strategies.

        Speaker: Arturo Sanchez Pineda (Abdus Salam Int. Cent. Theor. Phys. (IT))
      • 275
        Integrating a dynamic data federation into the ATLAS distributed data management system

        Input data for applications that run in cloud computing centres can be stored at remote repositories, typically with multiple copies of the most popular data stored at many sites. Locating and retrieving the remote data can be challenging, and we believe that federating the storage can address this problem. In this approach, the closest copy of the data is used based on geographical or other information. Currently, we are using the dynamic data federation, Dynafed, a software solution developed by CERN IT. Dynafed supports several industry standards for connection protocols, such as Amazon S3, Microsoft Azure and HTTP with WebDAV extensions. Dynafed functions as an abstraction layer under which protocol-dependent authentication details are hidden from the user, requiring the user to only provide an X509 certificate. We have set up an instance of Dynafed and integrated it into the ATLAS distributed data management system, Rucio. We report on the challenges faced during the installation and integration.

        Speaker: Frank Berghaus (University of Victoria (CA))
      • 276
        Integration of a heterogeneous compute resource in the ATLAS workflow

        High-Performance Computing (HPC) and other research cluster computing resources provided by universities can be useful supplements to the collaboration’s own WLCG computing resources for data analysis and production of simulated event samples. The shared HPC cluster "NEMO" at the University of Freiburg has been made available to local ATLAS users through the provisioning of virtual machines incorporating the ATLAS software environment, analogously to a WLCG center. In addition to the provisioning of the virtual environment, the talk describes the on-demand integration of these opportunistic resources into the Tier-3 scheduler in a dynamic way. Resources are scheduled using an intermediate layer that monitors requirements and requests the resources needed.

        The performance of the virtualized environment is evaluated. Recent developments on monitoring and work towards a more flexible scheduling of opportunistic resources are discussed. This flexibility is essential due to the different requirements needed for various data analysis tasks.
        
        Speaker: Felix Buhrer (Albert Ludwigs Universitaet Freiburg (DE))
      • 277
        IT Service Management at CERN: Data Centre and Service monitoring and status

        The Information Technology department at CERN has been using ITIL Service Management methodologies and ServiceNow since early 2011. In recent years, several developments have been accomplished regarding the data centre and service monitoring, as well as status management.

        ServiceNow has been integrated with the data centre monitoring infrastructure, via GNI (General Notification Infrastructure), in order to implement event management and generate incidents from hardware, network, operating system and application alarms. The CERN Service Portal, built on top of ServiceNow, also shows a Service Availability Dashboard, which displays the status of CERN services. Finally, the Service Portal also houses the CERN Service Status Board, which informs end users and supporters of ongoing service incidents, planned interventions and service changes.

        We will present how these developments were implemented, and how they help supporters monitor and solve issues and keep users informed of service status. We will also highlight which lessons were learnt after the implementation. Finally, future possible improvements will be discussed, such as enabling users to subscribe to changes in the Service Status Board.

        Speaker: David Martin Clavo (CERN)
      • 278
        KairosDB and Chronix as long-term storage for Prometheus - For those who don't want to deal with HBase.

        Prometheus is a leading open source monitoring and alerting tool. Prometheus's local storage is limited in its scalability and durability, but it integrates very well with other solutions that provide robust long-term storage. This talk will cover two such solutions which interface excellently with Prometheus and do not require us to deal with HBase: KairosDB and Chronix. The intended audience is people looking to evaluate a long-term storage solution for their Prometheus data.
        This talk will cover the CERN@LHCb Online experience of choosing a monitoring solution for our data processing cluster. It will address two technologies on the market, Chronix and KairosDB, which do not require us to maintain an HBase cluster.
        I will try to cover the pros and cons of the two technologies (an illustrative sketch of the KairosDB ingestion API is shown after the list below):

        • Why we wanted to escape HBase at LHCb Online
        • What is Chronix and why one might use it
        • Pros/Cons of Chronix compared to other TSDBs
        • Integrating KairosDB with Prometheus and pros/cons
        • Our experience running KairosDB
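
        As an illustrative example (the metric name, tags and hosts below are invented), a datapoint can be pushed into KairosDB through its REST ingestion endpoint; in the Prometheus case the write would normally go through a remote-write adapter rather than hand-written code like this.

        import json
        import time
        import requests

        payload = [{
            "name": "node_load1",
            "datapoints": [[int(time.time() * 1000), 0.42]],   # [timestamp in ms, value]
            "tags": {"host": "lhcb-online-01"},
        }]

        r = requests.post("http://kairosdb.example.org:8080/api/v1/datapoints",
                          data=json.dumps(payload),
                          headers={"Content-Type": "application/json"})
        r.raise_for_status()
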
        Speaker: Hristo Umaru Mohamed (CERN)
      • 279
        Limits of the HTCondor transfer system

        In the past, several scaling tests have been performed on the HTCondor batch system regarding its job scheduling capabilities. In this talk we report on a first set of scalability measurements of the file transfer capabilities of the HTCondor batch system. Motivated by the needs of the GLUEX experiment, we evaluate the limits and the possible use of HTCondor as a solution for transporting the output of jobs back to the submitter, as a function of the number of concurrent jobs, the size of the output and the distance between the submitting node and the computing node.

        Speaker: Edgar Fajardo Hernandez (Univ. of California San Diego (US))
      • 280
        Local Reconstruction Algorithms in the Cathode Strip Chambers of CMS

        The design of the CMS detector is specially optimized for muon measurements and includes gas-ionization detector technologies to make up the muon system. Cathode strip chambers (CSC) with both tracking and triggering capabilities are installed in the forward region. The first stage of muon reconstruction deals with information from within individual muon chambers and is thus called local reconstruction. First, rechits with well-defined locations are built from the electrical signals induced by the charged particles traversing the chamber. Next, from the reconstructed hits, straight-line track segments are built within each chamber. The local reconstruction becomes particularly challenging at high instantaneous luminosities such as those expected at the HL-LHC. The high rate of particles traversing the detectors leads to an increased rate of spurious rechits and segments, thus increasing the combinatorial background. In this respect, work on improving the current algorithms and developing new ones is essential and in progress. This talk presents the local reconstruction algorithms used in the CMS cathode strip chambers. Their performance as well as ongoing efforts towards the HL-LHC will be discussed.

        Speaker: Mirena Paneva (University of California Riverside (US))
      • 281
        Low power, large scale HPC platforms for scientific and engineering applications: status of ExaNeSt and EuroExa H2020 FETHPC projects.

        In the last few years the European Union has launched several initiatives aiming to support the development of a European HPC industrial and academic eco-system made of scientific and data analysis application experts, software developers and computer technology providers. The ExaNeSt and EuroExa projects, funded respectively under the H2020 research framework programme calls FETHPC-2014 and FETHPC-2016, are part of this framework.
        The ExaNeSt project has been active since 2015 and its main goals are to evaluate enabling technologies and to design the architecture of a novel system-level interconnect, distributed NVM (Non-Volatile Memory) storage and an advanced cooling infrastructure for ARM-based ExaFlops-class supercomputers. ExaNeSt is building a small-scale (200 computing nodes) demonstrator based on last-generation high-end SoC FPGAs, interconnected by “ExaNet” - a novel, unified (for data and storage traffic), low-latency, high-throughput, RDMA-based interconnect architecture - and benchmarked through a rich set of real scalable applications.
        The EuroExa project, started in September 2017, will leverage ExaNeSt results to deliver a world-class pre-Exascale HPC platform prototype. EuroExa will build a brand-new high-end microprocessor embedding multiple ARM A73 cores directly interfaced to large FPGAs, integrating computing accelerators and a scalable network architecture.
        INFN's involvement is in the area of network development and benchmarking through scientific applications, providing a proprietary neural network simulation code (DPSNN) and Lattice Boltzmann methods.
        In this talk, after a brief introduction of the motivations, goals and planned activities of the ExaNeSt and EuroExa projects, I will report on the status of the system R&D, in particular on the network architecture design, implementation and evaluation through the execution of application kernels.

        Speaker: Alessandro Lonardo (Sapienza Universita e INFN, Roma I (IT))
      • 282
        Machine Learning based global particle identification algorithms at the LHCb experiment

        One of the most important aspects of data processing at LHC experiments is the particle identification (PID) algorithm. In LHCb, several different sub-detector systems provide PID information: the Ring Imaging Cherenkov detectors, the hadronic and electromagnetic calorimeters, and the muon chambers. Charged PID based on the sub-detector responses is treated as a machine learning problem, solved in different modes: one-vs-rest, one-vs-one and multi-class classification, which affect model training and prediction. To improve charged particle identification for pions, kaons, protons, muons and electrons, several neural networks and gradient boosting models have been tested. In most cases, these approaches provide a larger area under the receiver operating characteristic curve than the existing implementations. To reduce the systematic uncertainty arising from the use of PID efficiencies in certain physics measurements, it is also beneficial to achieve a flat dependence of the efficiencies on spectator variables such as particle momentum. For this purpose, "flat" algorithms based on boosted decision trees that guarantee the flatness property for efficiencies have also been developed. This talk presents approaches based on state-of-the-art machine learning techniques and their performance evaluated on Run 2 data and simulation samples. A discussion of the performances is also presented.
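
        A toy-scale illustration of the one-vs-rest mode mentioned above is given below, using scikit-learn in place of the LHCb production models; the features and particle labels are synthetic, so the resulting ROC AUC is only a placeholder.

        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.multiclass import OneVsRestClassifier
        from sklearn.metrics import roc_auc_score
        from sklearn.preprocessing import label_binarize

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 8))       # stand-in for sub-detector responses
        y = rng.integers(0, 5, size=2000)    # 5 classes: pion, kaon, proton, muon, electron

        clf = OneVsRestClassifier(GradientBoostingClassifier(n_estimators=50))
        clf.fit(X[:1500], y[:1500])

        proba = clf.predict_proba(X[1500:])
        auc = roc_auc_score(label_binarize(y[1500:], classes=list(range(5))), proba, average="macro")
        print(f"macro one-vs-rest ROC AUC: {auc:.3f}")   # ~0.5 on purely random features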

        Speaker: Mikhail Hushchyn (Yandex School of Data Analysis (RU))
      • 283
        Managing a heterogeneous scientific computing cluster with cloud-like tools: ideas and experience

        Current computing paradigms often involve concepts like microservices, containerisation and, of course, Cloud Computing.
        Scientific computing facilities, however, are usually conservatively managed through plain batch systems and as such can cater to a limited range of use cases. On the other hand, scientific computing needs are in general orthogonal to each other in several dimensions.
        We have been operating the Open Computing Cluster for Advanced data Manipulation (OCCAM), a multi-purpose heterogeneous HPC cluster, for more than one year adopting a cloud-like paradigm. Each computing application is mapped to a dynamically expandable virtual farm, tuned and configured to the application’s needs and able to access special hardware like GPU accelerators or low-latency networks as needed, thus delivering computational frameworks that are well consolidated within the community (for a smooth end user experience) while leveraging modern computing paradigms.
        By using mostly mainstream software tools like Docker (used throughout our architecture to run both service and computational tasks), Calico for virtual network management, Mesos and Marathon for orchestration and by exploiting some of the work made in the context of the INDIGO-DataCloud project, we aimed at minimising the development and maintenance effort, while using a high-quality software stack.
        In this work we present the status of the system, operational experience, lessons learnt, and our outlook for further development. We will also present some preliminary performance comparisons between containerized and bare-metal scientific computing applications in an HPC environment.

        Speaker: Stefano Bagnasco (Istituto Nazionale di Fisica Nucleare, Torino)
      • 284
        Measurement and simulation of the background in the CMS muon detectors

        The CMS muon system presently consists of three detector technologies equipping different regions of the spectrometer. Drift Tube chambers (DT) are installed in the muon system barrel, while Cathode Strip Chambers (CSC) cover the end-caps; both serve as tracking and triggering detectors. Moreover, Resistive Plate Chambers (RPC) complement DT and CSC in the barrel and end-caps respectively and are mostly used in the trigger. Finally, Gas Electron Multiplier (GEM) chambers are being installed in the muon spectrometer end-caps at different stages of the CMS upgrade programme. The study of the different backgrounds the muon detectors are exposed to is fundamental to assess the system longevity and to project its performance to the conditions expected for the HL-LHC. In this respect, an accurate modelling of the backgrounds in simulation is of prime importance, as many studies rely on simulation-based predictions while these conditions have never been experienced in reality. The state of the art of the work carried out to understand the backgrounds observed with data collected during the LHC runs, as well as at the CERN high-intensity gamma irradiation facility (GIF++), will be presented. Furthermore, the effort made to improve the accuracy of the FLUKA and Geant4 based simulations of the background will be thoroughly described.

        Speaker: Cesare Calabria (Universita e INFN, Bari (IT))
      • 285
        Merging OpenStack based private clouds: the case of CloudVeneto.it

        The Cloud Area Padovana (CAP) has been, since 2014, a scientific IaaS cloud, spread across two different sites: the INFN Padova Unit and the INFN Legnaro National Labs. It provides about 1100 logical cores and 50 TB of storage. The entire computing facility, owned by INFN, satisfies the computational and storage demands of more than 100 users belonging to about 30 research projects, mainly related to HEP and Nuclear Physics. The Padova data centre has also hosted and operated since 2015 an independent IaaS cloud managing network, storage and computing resources owned by 10 departments of the University of Padova, spanning a broader range of scientific and engineering disciplines. This infrastructure provides about 480 logical cores and 90 TB of storage and supports more than 40 research projects. These two clouds share only a limited set of ICT services and tools (mainly for configuration, monitoring and accounting), whereas their daily operations and maintenance are carried out separately by INFN and University personnel. At the end of 2017 we planned to merge the two infrastructures in order to optimise the use of resources (both human and ICT) and to avoid needless duplication of services. We discuss here how we plan to implement such an integration, which will result in the provision of a single Cloud infrastructure named CloudVeneto.

        Speaker: Dr Marco Verlato (INFN - Sezione di Padova)
      • 286
        Modeling Allocation Utilization Strategies on Supercomputers

        Most supercomputers provide computing resources that are shared between users and projects, with utilization determined by predefined policies, load and quotas. The efficiency of resource utilization per user or project depends on factors such as the particular supercomputer policy and the dynamic workload of the supercomputer resulting from users' activities. The load on a resource is characterized by the number and parameters of jobs: the number of required nodes, the required execution time (walltime), and the job generation rate.
        In this work we identify execution strategies geared towards maximizing the probability of utilizing the allocated resources on a supercomputer. The execution strategies consist of finding the optimal set of essential job parameters: number, size, length and rate. A simplified model for the utilization of allocation time and a simulator based on queueing theory (reflecting the requirements of the Titan supercomputer) were designed; the model was tested on both synthetic and real log data covering many months of Titan's operation, and the identified strategies were compared with other possible strategies.
        Experiments conducted using the simulator showed that in most cases the identified strategies increase the probability of utilizing the allocation faster than a random choice of job processing parameters. We also find that the accuracy of the model is higher if the amount of resources to be utilized is larger, the analyzed time intervals are longer, and the supercomputer's state is steadier over these intervals.
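
        The flavour of such a study can be conveyed with a small Monte Carlo sketch; the wait-time distribution and all numbers below are invented for illustration and are not Titan policy or results from this work.

        import random

        def utilized_fraction(allocation_node_hours, job_nodes, job_walltime_h,
                              submit_rate_per_h, horizon_h, mean_wait_h=2.0, seed=0):
            """Fraction of the allocation burned within the horizon for one strategy."""
            random.seed(seed)
            used, t = 0.0, 0.0
            while t < horizon_h and used < allocation_node_hours:
                t += random.expovariate(submit_rate_per_h)         # next submission
                start = t + random.expovariate(1.0 / mean_wait_h)   # random queue wait
                if start + job_walltime_h <= horizon_h:
                    used += job_nodes * job_walltime_h
            return min(used / allocation_node_hours, 1.0)

        # Compare two strategies: many small/short jobs vs. a few large/long ones.
        print(utilized_fraction(10000, job_nodes=16, job_walltime_h=2, submit_rate_per_h=4, horizon_h=720))
        print(utilized_fraction(10000, job_nodes=256, job_walltime_h=12, submit_rate_per_h=0.2, horizon_h=720))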

        Speakers: Alexey Poyda (National Research Centre Kurchatov Institute (RU)), Mikhail Titov (National Research Centre Kurchatov Institute (RU))
      • 287
        Monitoring tools for the CMS muon detector: present workflows and future automation

        The CMS muon system presently consists of three detector technologies equipping different regions of the spectrometer. Drift Tube chambers (DT) are installed in the muon system barrel, while Cathode Strip Chambers (CSC) cover the end-caps; both serve as tracking and triggering detectors. Moreover, Resistive Plate Chambers (RPC) complement DT and CSC in the barrel and end-caps respectively and are mostly used in the trigger. Finally, Gas Electron Multiplier (GEM) chambers are being installed in the muon spectrometer end-caps at different stages of the CMS upgrade programme. The CMS muon system has been operated successfully during the two LHC runs, allowing the collection of a very high fraction of data whose quality fulfills the requirements for use in physics analysis. Nevertheless, the workflows used nowadays to run and monitor the detector are rather expensive in terms of human resources. Focus is therefore being put on improving such workflows, both by applying automated statistical tests and by exploiting modern machine learning algorithms, in view of the future LHC runs. The ecosystem of tools presently in use will be presented, together with the state of the art of the developments toward more automated monitoring and the roadmap for the future.

        Speaker: Cesare Calabria (Universita e INFN, Bari (IT))
      • 288
        Monitoring virtual machines and containers with VacMon

        At the start of 2017, GridPP deployed VacMon, a new monitoring system
        suitable for recording and visualising the usage of virtual machines and
        containers at multiple sites. The system uses short JSON messages
        transmitted by logical machine lifecycle managers such as Vac and
        Vcycle. These are directed to a VacMon logging service which records the
        messages in an ElasticSearch database. The records can be viewed
        graphically using the VacMon web portal, with selections based on time
        ranges, virtual organisation, site, subclusters within sites, or
        individual hypervisors.

        Speaker: Dr Andrew McNab (University of Manchester)
      • 289
        namespace root

        We propose here a smooth migration plan for ROOT in order to have, by 2040, at least and at last an acceptable histogram class (a goal clearly not stated in the HSF common white paper for HL-LHC for 2020), but also to have by then a rock-solid foundation for a good part of this toolkit (IO, plotting, graphics, UI, math, etc.). The proposal is technical, being centred on a background library conceived in the spirit of the STL (hence the namespace "root" in lowercase), a library strongly inspired by what is done in my softinex inlib/exlib code (see CHEP 2013), which is highly portable, pure header, strictly layered (a used class having no relationship to a using class) and without any implicit automatic management of objects (thus no writable statics and no singletons). The idea is the same in spirit as what is done with the g4tools code used as a background library for the Geant4 analysis category. The proposal is obviously also a bit sociological: how to organise things so that both users (physicists) and doers (software engineers) are satisfied around ROOT (and in fact Geant4), in order to have the "best analysis (and simulation) software ever" for HEP in 2040.

        Speaker: Dr Guy Barrand (Laboratoire de l'Accélérateur Linéaire, Université Paris-Sud, CNRS-IN2P3, Orsay, France.)
      • 290
        New approaches for track reconstruction in LHCb's Vertex Locator

        Starting with Upgrade 1 in 2021, LHCb will move to a purely software-based trigger system. The new trigger strategy is therefore to process events at the full rate of 30 MHz. Given that the increase of CPU performance has slowed down in recent years, the predicted performance of the software trigger currently falls short of the necessary 30 MHz throughput. To cope with this shortfall, LHCb's real-time reconstruction will have to be sped up significantly.

        We aim to help solve this shortfall by speeding up the track reconstruction of the Vertex Locator which currently takes up roughly a third of the time spent in the first phase of the High Level Trigger. In order to obtain the needed speedup, profiling and technical optimizations are explored as well as new algorithmic approaches. For instance, a clustering based algorithm can reduce the event rate prior to the track reconstruction by separating hits into two sets - hits from particles originating from the proton-proton interaction point, and those from secondary particles - allowing the reconstruction to treat them separately.
        We present an overview of our latest efforts in solving this problem, which is crucial to the success of the LHCb upgrade.

        Speaker: Christoph Hasse (CERN / Technische Universitaet Dortmund (DE))
      • 291
        New Developments in DD4hep

        For a successful experiment, it is of utmost importance to provide a consistent detector description originating from a single source of information. This is also the main motivation behind DD4hep, which addresses detector description in a broad sense, including the geometry and the materials used in the device and, additionally, parameters describing, e.g., the detection techniques, constants required for alignment and calibration, the description of the readout structures and conditions data. An integral part of DD4hep is DDG4, a powerful tool that converts arbitrary DD4hep detector geometries to Geant4 and also gives access to all Geant4 action stages. It is equipped with a comprehensive plugin suite that includes the handling of different IO formats, Monte Carlo truth linking and a large set of segmentation and sensitive detector classes, allowing the simulation of a wide variety of detector technologies. In this presentation, we will showcase recent developments in DD4hep/DDG4, such as the addition of a ROOT-based persistency mechanism for the detector description and the development of framework support for DDG4. Through this mechanism an experiment's data processing framework can expose its essential tools to all DDG4 actions. This allows for the easy integration of DD4hep into existing frameworks like Gaudi.

        Speaker: Marko Petric (CERN)
      • 292
        New fitting concept in ATLAS muon tracking for the LHC Run II

        Muons with high momentum -- above 500 GeV/c -- are an important constituent of new physics signatures in many models. Run-2 of the LHC is greatly increasing ATLAS's sensitivity to such signatures thanks to an ever-larger dataset of such particles. The ATLAS Muon Spectrometer chamber alignment contributes significantly to the uncertainty of the reconstruction of these high-momentum objects. The proper treatment of measurements during tracking and the correct propagation of the alignment effects is extremely challenging. Recently, an innovative approach that imposes Gaussian constraints on ensembles of detector hits was implemented. It provides a significant improvement to high-momentum tracking without increasing the CPU budget. Furthermore, it allows for the verification of the expected alignment quality using high-statistics collision data. A detailed discussion of the algorithmic realization is given, the expected performance gains are presented and prospects for further applications of the approach are outlined.

        Speaker: Zachary Louis Marshall (University of California Berkeley (US))
      • 293
        Numerical predictions of GEM nonlinear mechanical properties under large deformations

        Gas Electron Multiplier (GEM) based detectors have been used in many applications since their introduction in 1997. Large areas of GEM are foreseen in several experiments, such as the future upgrade of the CMS muon detection system, where triple-GEM based detectors will be installed and operated. During assembly and operation, GEM foils are stretched in order to keep the vertical distance between adjacent horizontal layers constant and uniform across the large area. This is a serious challenge for the assembly process. Indeed, the deformation of the perforated holes in the GEM is proven to have a significant effect on the detector gain. Numerical calculations are therefore required to optimize the stretching forces to ensure uniform GEM sheet assembly and consequently uniform detector gain. In this work, two methods are used to predict the nonlinear mechanical properties of the GEM foil using the finite element method. In the first method, linear elastic properties are predicted in the longitudinal and transverse directions of the GEM sheet. In the second method, plastic deformations are included in the model to predict the load limits. The results show a slight variation of the elastic moduli between the longitudinal and transverse directions due to the hexagonal pattern of the perforated holes, while the directional Poisson ratios are nearly identical. The nonlinear analysis shows local plastic strains at the edges of the perforated holes. These plastic deformations start at very low loads and develop without affecting the linearity of the deformation-load relation. We investigate the linearity limit for both longitudinal and transverse loading and define the safety margin for the GEM stretching operation. Results are compared to recent measurements.

        Speaker: Othmane Bouhali (Texas A & M University (US))
      • 294
        Object store stress testing for ATLAS distributed computing

        Various workflows used by ATLAS Distributed Computing (ADC) now use object stores as a convenient storage resource via the boto S3 libraries. The load and performance requirements vary widely across the different workflows, and for the heavier cases it has been useful to understand the limits of the underlying object store implementation. This work describes the performance of various object stores used by ADC and describes a tool which runs periodic functional tests and on-demand stress tests. Connection concurrency limits are measured, as are the effects of concurrency on error rates and object writing times.
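
        The kind of measurement such a tool performs can be sketched with boto3 against an S3-compatible endpoint; the endpoint URL, bucket name and object size below are placeholders, not the actual ADC test configuration.

        import time
        import boto3

        s3 = boto3.client("s3", endpoint_url="https://objectstore.example.org")
        payload = b"x" * (4 * 1024 * 1024)   # 4 MiB test object

        times = []
        for i in range(10):
            t0 = time.perf_counter()
            s3.put_object(Bucket="adc-stress-test", Key=f"probe/{i}", Body=payload)
            times.append(time.perf_counter() - t0)

        print(f"mean write time: {sum(times)/len(times):.3f} s for {len(payload)} byte objects")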

        Speaker: Peter Love (Lancaster University (GB))
      • 295
        Online reconstruction of the collision topology in the CBM experiment

        The CBM experiment is a future fixed-target experiment at FAIR/GSI (Darmstadt, Germany). It is being designed to study heavy-ion collisions at extremely high interaction rates of up to 10 MHz. The experiment will therefore use a novel concept of data processing based on free-streaming, triggerless front-end electronics. In CBM, time-stamped data will be collected into a readout buffer in the form of continuous time-slices and delivered to a large computer farm to be reconstructed and selected online. The full online reconstruction of the event topology is therefore required already at the First Level Event Selection (FLES) stage. In this case both the efficiency and the speed of the reconstruction algorithms are crucial.

        The task of reconstructing individual collisions out of continuous time-stamped data, where information on different events overlaps in time, must be performed in software and requires so-called 4-dimensional (4D) reconstruction: both in space and time. The FLES reconstruction package consists of several modules: track finding, track fitting, event building, short-lived particles finding, and event selection. The package is optimised with respect to speed and parallelised both at the data (using the SIMD instruction set) and the task (using OpenMP and Pthreads) levels.

        The input data are distributed within the FLES farm in the form of time-slices, which are processed in parallel. The particle trajectories are reconstructed with the 4D Cellular Automaton track finder, fitted with the 4D Kalman Filter method and combined into event-corresponding clusters according to the estimated production time. Such track clusters are given as an event input to the KF Particle Finder package for short-lived particle reconstruction. The last stage of the FLES package is a selection of events according to the requested trigger signatures. The reconstruction procedure and the results of its application to simulated collisions for the CBM detector setup are presented.

        Speaker: Dr Valentina Akishina (Johann-Wolfgang-Goethe Univ. (DE))
      • 296
        Online reconstruction of tracks and vertices using the Upgraded Inner Tracking System of ALICE at the LHC

        In view of the LHC Run3 starting in 2021, the ALICE experiment is preparing a major upgrade including the construction of an entirely new inner silicon tracker (the Inner Tracking System) and a complete renewal of its Online and Offline systems (O²).

        In this context, one of the requirements for a prompt calibration of external detectors and a fast offline data processing is to run online the reconstruction of tracks in the Upgraded Inner Tracking System.

        A new algorithm based on Cellular Automata has been developed to tackle this issue. In this algorithm, the tracking is split in multiple phases to profit from data locality and using the same approach it is possible to determine the primary vertex position.

        To cope with the specification of the O² and with the foreseen Pb-Pb interaction rate of 50 kHz, this algorithm has been developed exploiting the most common parallelisation technologies available, such as SIMD and multi-threading on x86 CPUs and offloading to GPU using CUDA and OpenCL.

        In this contribution we will show the speedup obtained using different technologies to implement the vertexing and tracking algorithms. The obtained computing and physics performance are compliant with the requirements of ALICE for Run 3.

        Speaker: Matteo Concas (INFN e Politecnico di Torino (IT))
      • 297
        Online track fit for the ALICE TPC detector in Online-Offline framework

        The upcoming LHC Run 3 brings new challenges for the ALICE online reconstruction which will be used also for the offline data processing in the O2 (combined Online-Offline) framework. To improve the accuracy of the existing online algorithms they need to be enhanced with all the necessary offline features, while still satisfying speed requirements of the synchronous data processing.

        Here we present our enhancements to the track fit algorithm which is currently used in the ALICE High Level Trigger (HLT) for the online reconstruction. The algorithm is based on the Kalman filter method. The fit is applied at the final stage of the track reconstruction in order to improve parameters of the trajectories found by the combinatorial track finder.

        We describe various features of the track fit, such as trajectory parameterisation, strategy of linearisation of equations, correction of the track model due to minor magnetic field components and parameterisation of the magnetic field for efficient computation. Finally we demonstrate the quality of the fit.
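
        For readers unfamiliar with the method, the dimensionally reduced Python sketch below shows the predict/update steps of a Kalman filter on a two-parameter (position, slope) straight-line track model with scalar measurements; the real TPC fit uses five track parameters, material corrections and a parameterised magnetic field, none of which appear here.

        import numpy as np

        def predict(x, C, dz):
            F = np.array([[1.0, dz], [0.0, 1.0]])    # straight-line propagation over dz
            return F @ x, F @ C @ F.T

        def update(x, C, meas_y, sigma_y):
            H = np.array([[1.0, 0.0]])               # only the position is measured
            S = H @ C @ H.T + sigma_y ** 2           # innovation covariance (1x1)
            K = C @ H.T / S                          # Kalman gain
            x = x + (K * (meas_y - H @ x)).ravel()
            C = (np.eye(2) - K @ H) @ C
            return x, C

        x, C = np.array([0.0, 0.0]), np.diag([1.0, 0.1])       # loose initial estimate
        for z, y in [(1.0, 0.11), (2.0, 0.19), (3.0, 0.32)]:    # toy hits along the track
            x, C = predict(x, C, 1.0)
            x, C = update(x, C, y, sigma_y=0.02)
        print("fitted position and slope:", x)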

        Speaker: Sergey Gorbunov (Johann-Wolfgang-Goethe Univ. (DE))
      • 298
        Operation of the ATLAS distributed computing

        We describe the central operation of the ATLAS distributed computing system. The majority of compute intensive activities within ATLAS are carried out on some 350,000 CPU cores on the Grid, augmented by opportunistic usage of significant HPC and volunteer resources. The increasing scale, and challenging new payloads, demand fine-tuning of operational procedures together with timely developments of the production system. We describe several such developments, motivated directly from operational experience. Optimization of inefficient task requests, from both official production and users, is made possible by automatic detection of payload properties. User education, job shaping or preventative throttling help to increase the overall throughput of the available resources.

        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
      • 299
        Optimising Lattice QCD for GPUs

        The University of Adelaide has invested several million dollars in the Phoenix HPC facility. Phoenix features a large number of GPUs, which were
        critical to its entry in the June 2016 Top500 supercomputing list. The status of high performance computing in Australia relative to other nations
        poses a unique challenge to researchers, in particular those involved in computationally intensive fields such as Lattice QCD.

        Quantum chromodynamics (QCD) is the fundamental theory that describes the strong interaction. Lattice QCD provides a computational framework to study
        the highly non-perturbative phenomena that arise in QCD. Lattice QCD calculations demand the most powerful HPC resources available. Their highly
        parallel nature makes GPUs an ideal hardware platform for Lattice QCD. The technological power offered by GPUs has transformed the Centre for the
        Subatomic Structure of Matter's research program at the University of Adelaide, using high performance computing to solve the fundamental equations
        that describe the interactions of subatomic particles and reveal their internal structure. The quark propagator code is a key piece of software in
        these calculations, and we discuss how this code has been optimised to run within the CUDA framework.

        Speaker: Dr Waseem Kamleh (University of Adelaide)
      • 300
        Optimizing Frameworks' Performance Using C++ Modules-Aware ROOT

        The LLVM community is advancing its C++ Modules technology, providing an I/O-efficient, on-disk code representation capable of reducing build times and peak memory usage. A significant amount of effort has been invested in teaching ROOT and its toolchain to operate with clang's implementation of C++ Modules. Currently, C++ Modules files are used by: cling to avoid header re-parsing; rootcling to store I/O information in a cross-platform way; and ROOT's reflection layer to implement implicit and explicit runtime shared library loading.

        ROOT builds and works with C++ Modules in its dictionary system, showing promising performance. The work conducted paved the way for optimizing experiments' software stacks to achieve performance even beyond that observed in ROOT. The new technology adds three new requirements. First, every dictionary header needs to be parsable on its own. Secondly, it should not depend on outside macro definitions which can alter its content. Thirdly, all its direct or indirect includes should also be in a C++ Module. This talk shows the status of the C++ modularization in ROOT and CMSSW. It argues for a bottom-up adoption approach and outlines a set of tools to aid the process. The authors share performance results and implementation experience gained when migrating CMSSW and its dependencies such as HepMC, Geant and boost.

        Speaker: Yuka Takahashi (University of Cincinnati (US))
      • 301
        PaaS for web applications as key to the evolution of CERN Web Services

        Following the deployment of OpenShift Origin by the CERN Web Frameworks team in 2016, this Platform-as-a-Service (PaaS) solution oriented towards web applications has rapidly become a key component of the CERN Web Services infrastructure. We will present the evolution of the PaaS service since its introduction, detailed usage trends and statistics, its integration with other CERN services, and its present and future use as the main backend for a wide range of web framework applications and infrastructure components.

        Speaker: Andreas Wagner (CERN)
      • 302
        PanDA WMS for Lattice QCD Computations

        Lattice QCD (LQCD) is a well-established non-perturbative approach to solving the quantum chromodynamics (QCD) theory of quarks and gluons. It is understood that future LQCD calculations will require exascale computing capacities and a workload management system (WMS) in order to manage them efficiently.
        In this talk we will discuss the use of the PanDA WMS for LQCD simulations. The PanDA WMS was developed by the ATLAS experiment at the LHC to manage data analysis and detector simulations on distributed and heterogeneous computing resources, which include hundreds of Grid and Cloud sites as well as HPC machines. Currently PanDA is also used for projects and experiments outside of ATLAS.
        For this project we have created a prototype on the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). In order to provide communication with the PanDA server as well as job submission to the local batch system, we have deployed dedicated PanDA edge services on Titan's data transfer nodes. The system was tested with realistic LQCD workloads submitted via a PanDA server instance running in a Docker container at the OLCF. In our talk we will present the results of these tests and discuss future plans for extending the current setup to other HPC sites.

        Speaker: Pavlo Svirin (Brookhaven National Laboratory (US))
      • 303
        Particle track building with Recurrent Neural Networks

        A core component of particle tracking algorithms in LHC experiments is the Kalman Filter. Its capability to iteratively model dynamics (linear or non-linear) in noisy data makes it powerful for state estimation and extrapolation in a combinatorial track builder (the CKF). In practice, the CKF computational cost scales quadratically with the detector occupancy and will become a heavy burden on computing resources during the HL-LHC. Alternatively, recurrent neural networks (RNNs) have also been shown to be effective at state estimation and modeling non-linear dependencies in sequence data. We therefore propose an approach to particle track building that utilizes an RNN-based tree search algorithm. The RNN model, utilizing Long Short-term Memory (LSTM) units, takes a sequence of candidate track hits and predicts the location of a hit on a successive detector layer. We have developed two types of prediction models: one which produces single value predictions and another which produces Gaussian predictions in the form of central values and a covariance matrix. The latter model is thus able to express uncertainty in its predictions and is trained using a log-likelihood cost function. The hit predictions are used to select candidate hits from the data. By including an additional overall track quality estimate, we are able to efficiently explore the combinatorial search tree by prioritizing and extrapolating only the most likely hits until the track is fully formed. In this contribution we will present the RNN track building algorithm and discuss its performance for track-finding on simulated ACTS data compared to traditional solutions.
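
        A minimal PyTorch sketch of the single-value prediction variant is shown below: an LSTM reads a sequence of hit coordinates and predicts the coordinates of the hit on the next detector layer. Layer sizes, the toy data and the MSE loss are illustrative only; the Gaussian variant described above would instead predict a mean and covariance and train with a negative log-likelihood.

        import torch
        import torch.nn as nn

        class NextHitPredictor(nn.Module):
            def __init__(self, hit_dim=3, hidden=64):
                super().__init__()
                self.lstm = nn.LSTM(hit_dim, hidden, batch_first=True)
                self.head = nn.Linear(hidden, hit_dim)      # predicted next-hit coordinates

            def forward(self, hits):                         # hits: (batch, n_hits, 3)
                out, _ = self.lstm(hits)
                return self.head(out[:, -1])                 # prediction for the next layer

        model = NextHitPredictor()
        seeds = torch.randn(32, 4, 3)     # 32 candidate tracks, 4 hits each (toy data)
        targets = torch.randn(32, 3)      # "true" hits on the next layer (toy data)

        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss = nn.functional.mse_loss(model(seeds), targets)
        loss.backward()
        optimizer.step()
        print("toy training loss:", float(loss))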

        Speaker: Steven Andrew Farrell (Lawrence Berkeley National Lab. (US))
      • 304
        Patatrack

        Starting from 2017, during CMS Phase-I, the increased accelerator luminosity, with the consequently increased number of simultaneous proton-proton collisions (pile-up), will pose significant new challenges for the CMS experiment. The main goal of the HLT is to apply a specific set of physics selection algorithms and to accept the events with the most interesting physics content. To cope with the incoming event rate, the online reconstruction of a single event for the HLT has to be done within 220 ms on average. The increasing complexity of events will make track reconstruction especially challenging. For this reason, the reconstruction of Pixel Tracks is either not executed for every event or is executed only in regions of interest (ROIs).

        The quest to retain those events which are potentially interesting for searches for new physics phenomena led to the evaluation of GPUs for enhancing the existing computing infrastructure used at the High-Level Trigger (HLT).

        We will show the results of the effort in reducing the effect of pile-up in CMS Tracking by redesigning the seeding with novel algorithms which are intrinsically parallel and executing these new algorithms on massively parallel architectures. We will also show how Pixel Tracks can be evaluated globally for every event on GPUs.

        Speaker: Felice Pantaleo (CERN)
      • 305
        Performance of the AMS Offline Software at National Energy Research Scientific Computing Centre and Argonne Leadership Computing Facility

        The Alpha Magnetic Spectrometer (AMS) is a high energy physics experiment installed and operating on board the International Space Station (ISS) since May 2011 and expected to last through 2024 and beyond. More than 50 million CPU hours have been delivered for AMS Monte Carlo simulations using NERSC and ALCF facilities in 2017. The details of porting the AMS software to the 2nd-generation Intel Xeon Phi Knights Landing architecture are discussed, including the MPI emulation module that allows the AMS offline software to be run as multiple-node batch jobs. The performance of the AMS simulation software on the NERSC Cori (KNL 7250), ALCF Theta (KNL 7230) and Mira (IBM BG/Q) farms is also discussed.

        Speaker: Baosong Shan (Beihang University (CN))
      • 306
        Perspectives for the migration of the LHCb geometry to the DD4hep toolkit

        The LHCb experiment uses a custom-made C++ detector and geometry description toolkit, integrated with the Gaudi framework, designed in the early 2000s when the LHCb software was first implemented. With the LHCb upgrade scheduled for 2021, it is necessary for the experiment to review this choice to adapt to the evolution of software and computing (need to support multi-threading, importance of vectorization, etc.).

        The Detector Description Toolkit for High Energy Physics (DD4hep) is a good candidate to replace LHCb's own geometry description framework: it can be integrated with Gaudi, and its features, in principle, match what LHCb needs in terms of geometry and detector description, as well as the possibility to add detector alignment parameters and the integration with simulation tools.

        In this paper we will report on detailed studies undertaken to compare the feature set offered by the DD4hep toolkit to what is needed by LHCb. We will show not only how the main description could be migrated, but also how to integrate the LHCb real-time alignment tools in this toolkit, in order to identify the main obstacles to the migration of the experiment to DD4hep.

        Speaker: Ben Couturier (CERN)
      • 307
        Power Usage Effectiveness analysis and optimization in the INFN CNAF Tier-1 data center infrastructure.

        The accurate calculation of the power usage effectiveness (PUE) is the most important factor when analysing the overall efficiency of the power consumption in a large data center. At the INFN CNAF Tier-1, a new monitoring infrastructure acting as a Building Management System (BMS) was implemented during the last years using the Schneider StruxureWare Building Operation (SBO) software. During the design phase of this new BMS, great attention was given to the possibility of collecting detailed information about the electric absorption of different and specific parts of the facility. In particular, all the relevant technical plant areas, consisting of 3 transformers, 2 redundant 1.4 MW diesel rotary uninterruptible power supplies (DRUPS) and 6 free-cooling chillers of 320 kW, were included in the analysis of the Power Usage Effectiveness (PUE) and partial PUE (pPUE) of the data center. Considering the annual trends of these parameters and the demand for reducing the operating costs, it became clear that some improvements were needed in the very short term. For this reason, a hardware upgrade of the cooling chillers and chilled-water pump distribution system was seriously considered, and we therefore started the design of a new project using innovative cooling technology. We focused on chillers using the Danfoss Turbocor centrifugal compressor technology, which uses magnetic levitation and an oil-free approach to obtain the best efficiency in HVAC (heating, ventilation and air conditioning) applications. Subsequently, we studied a solution that could easily compensate for the initial investment during the first years of usage (considering the Total Cost of Ownership of the project) and that will improve the overall PUE of the Tier-1 data center, in addition to the usage of low-GWP (Global Warming Potential) refrigerants in the chillers, following the guidelines of the Green IT program. The future realization of this important infrastructure upgrade could greatly increase the energy efficiency of our Tier-1 with a sustainable, efficient solution that will bring great benefits in the maintenance costs of our center.
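
        For reference, a minimal sketch of the PUE and partial PUE figures of merit used in such an analysis, under the usual definition PUE = total facility power / IT equipment power; the numbers and the simplified pPUE formula below are purely illustrative.

            def pue(total_facility_kw, it_load_kw):
                """PUE = total facility power / IT equipment power (ideal value: 1.0)."""
                return total_facility_kw / it_load_kw

            def partial_pue(it_load_kw, subsystem_kw):
                """pPUE for one subsystem (e.g. cooling), in a common simplified form."""
                return (it_load_kw + subsystem_kw) / it_load_kw

            # Illustrative numbers only
            it_load = 1000.0        # kW drawn by servers, storage, network
            cooling = 350.0         # kW drawn by chillers and pumps
            distribution = 80.0     # kW lost in UPS/transformers
            print("PUE  =", round(pue(it_load + cooling + distribution, it_load), 2))
            print("pPUE =", round(partial_pue(it_load, cooling), 2))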

        Speaker: Pier Paolo Ricci (INFN CNAF)
      • 308
        Preliminary results coupling SMF and BLOB with Geant4

        Despite their frequent use, the hadronic models implemented in Geant4 have shown severe limitations in reproducing the measured yield of secondaries in ion interactions below 100 MeV/A, in terms of production rates, angular and energy distributions [1,2,3]. We will present a benchmark of the Geant4 models against double-differential cross sections and angular distributions of the secondary fragments produced in 12C fragmentation at 62 MeV/A on a thin carbon target; such a benchmark includes the recently implemented model INCL++ [4,5]. Moreover, we will present the preliminary results obtained in simulating the same interaction with SMF [6] and BLOB [7]. Both SMF and BLOB are semiclassical one-body approaches to solve the Boltzmann-Langevin equation. They include an identical treatment of the mean-field propagation, on the basis of the same effective interaction, but they differ in the way fluctuations are included.
        In particular, while SMF employs a Uehling-Uhlenbeck collision term and introduces fluctuations as projected on the density space, BLOB introduces fluctuations in full phase space through a modified collision term where nucleon-nucleon correlations are explicitly involved. Both SMF and BLOB have been developed to simulate heavy-ion interactions in the Fermi-energy regime. We will show their capabilities in describing 12C fragmentation coupled with the de-excitation phase of Geant4, as their implementation in Geant4 is foreseen.

        Speaker: Carlo Mancini-Terracciano (INFN - Roma1)
      • 309
        Producing Madgraph5_aMC@NLO gridpacks and using TensorFlow GPU resources in the CMS HTCondor Global Pool

        The CMS experiment has an HTCondor Global Pool, composed of more than 200K CPU cores available for Monte Carlo production and the analysis of data. The submission of user jobs to this pool is handled either by CRAB3, the standard workflow management tool used by CMS users to submit analysis jobs requiring event processing of large amounts of data, or by CMS Connect, a service focused on final-stage, condor-like analysis jobs and applications that already have a workflow job manager in place. The latter scenario can bring cases in which workflows need further adjustments in order to work efficiently in a globally distributed pool of resources. For instance, the generation of matrix elements for high energy physics processes via Madgraph5_aMC@NLO and the usage of tools not (yet) fully supported by the CMS software, such as TensorFlow with GPU support, are tasks with particular requirements. A special adaptation, either at the pool factory level (advertising GPU resources) or at the execute level (e.g. handling special parameters that describe certain needs of the remote execute nodes during submission), is needed in order to work adequately in the CMS Global Pool. This contribution describes the challenges and efforts performed towards adapting such workflows so they can properly profit from the Global Pool via CMS Connect.
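
        As an illustration of the kind of adaptation mentioned above, the sketch below composes a hypothetical HTCondor submit description requesting a GPU slot for a TensorFlow payload; the attribute values, requirements expression and file names are assumptions for illustration, not the actual CMS Connect configuration.

            # Hypothetical HTCondor submit description for a GPU-enabled TensorFlow job.
            # Attribute values and file names are illustrative only.
            submit_description = {
                "executable":           "run_tensorflow.sh",   # hypothetical wrapper
                "arguments":            "train.py --epochs 10",
                "request_cpus":         "1",
                "request_memory":       "4 GB",
                "request_gpus":         "1",                    # ask the pool for a GPU slot
                "requirements":         "(CUDACapability >= 3.5)",
                "transfer_input_files": "train.py",
                "output": "job.out", "error": "job.err", "log": "job.log",
            }

            # Render it in submit-file syntax; actual submission would go through
            # CRAB3 or CMS Connect rather than a hand-written file.
            lines = [f"{key} = {value}" for key, value in submit_description.items()]
            lines.append("queue 1")
            print("\n".join(lines))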

        Speaker: Kenyi Paolo Hurtado Anampa (University of Notre Dame (US))
      • 310
        Recent developments in glideinWMS: minimizing resource wastages

        GlideinWMS is a workload management system that allows different scientific communities, or Virtual Organizations (VOs), to share computing resources distributed over independent sites. A dynamically sized pool of resources is created by different VO-independent glideinWMS pilot factories, based on the requests made by the several VO-dependent glideinWMS frontends. For example, the CMS VO employs a pool that ranges from 80k to over 200k cores, distributed worldwide over 50 sites. With such big numbers, calculating the proper pilot pressure is essential to provision resources efficiently, and being able to effectively drain resources during site downtimes also helps minimize waste in the system. However, the often spiky nature of the demand, and the geometry of certain sites that may have multiple entry points, make it difficult to tightly couple pilot submission to the actual demand, especially in a multicore environment. In this contribution we discuss recent enhancements to glideinWMS that aim at improving the efficiency of the resource provisioning by adding the concept of a metasite, and by introducing new handles the VOs can use to manage the pilot pressure. New ways of draining sites using the Machine/Job Features mechanism and plans for the future are also discussed.

        Speaker: Marco Mascheroni (Univ. of California San Diego (US))
      • 311
        Research and Exploit of Resource Sharing Strategy at IHEP

        At IHEP, computing resources are contributed by different experiments including BES, JUNO, DYW, HXMT, etc. The resources were divided into different partitions to satisfy the dedicated data processing requirements of each experiment. IHEP ran a local Torque/Maui cluster with 50 queues serving more than 10 experiments. The separate resource partitions led to an imbalanced resource load: sometimes the BES partition was completely busy, with no free slot and many jobs waiting in the queue, while the JUNO resources stayed idle for a long time; at other times the situation was reversed.
        After migrating resources from Torque/Maui to HTCondor in 2016, job scheduling efficiency improved considerably. To address the imbalanced resource load, we designed and present an efficient sharing strategy to improve the overall resource utilization. We created a sharing pool to support all experiments. The resources of each experiment were divided into two parts: dedicated resources and sharing resources. The slots in the dedicated resources only run jobs of the owning experiment, while the slots in the sharing resources are shared by jobs of all experiments. The default ratio of dedicated to sharing resources is 1:4. To maximize the sharing effect, the ratio is adjusted dynamically between 0:5 and 4:1 based on the number of queued jobs from each experiment (a simplified sketch of this adjustment is given below).
        We have developed a central control system to allocate resources to each experiment group. This system consists of two parts: a server side and a client side. A management database is built on the server side, storing resource, group and experiment information. Whenever the sharing ratio needs to be adjusted, the resource groups are changed and updated in the database. The resource group information is published to the server buffer in real time. The client periodically pulls the resource group information from the server buffer via HTTPS, and the resource scheduling conditions on the client side are changed based on the dynamic resource group information. Through this process, the sharing ratio can be regulated dynamically.
        We have implemented the resource sharing strategy by combining the central control system with HTCondor. The ClassAd mechanism and accounting groups provided by HTCondor make it straightforward to apply our sharing strategy at the IHEP computing cluster. With the sharing strategy, the overall resource utilization of the IHEP computing cluster has increased dramatically from about 50% to more than 90%. The total wall time without the sharing strategy in 2016 was 40,645,124 CPU hours, while with the sharing strategy in 2017 it was 73,341,585 CPU hours, an increase of 80.44%. The results indicate that the sharing strategy is efficient and benefits the data processing of all experiments.
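
        A minimal sketch of the ratio adjustment described above: the dedicated:sharing split starts from the default 1:4 and moves between 0:5 and 4:1 depending on how much of its own demand an experiment has queued. The thresholds and function below are illustrative assumptions, not the production policy.

            def dedicated_share(queued_own_jobs, total_slots, default=1, low=0, high=4, parts=5):
                """Return (dedicated, sharing) out of `parts` units of an experiment's resources.

                The more of its own work an experiment has queued, the more of its
                resources are kept dedicated; otherwise they are released to the
                shared pool. Thresholds are illustrative only.
                """
                demand = queued_own_jobs / float(total_slots) if total_slots else 0.0
                if demand > 1.0:        # more queued jobs than own slots: keep most dedicated
                    dedicated = high
                elif demand > 0.5:
                    dedicated = default + 1
                elif demand > 0.1:
                    dedicated = default
                else:                   # almost idle: give nearly everything to the shared pool
                    dedicated = low
                return dedicated, parts - dedicated

            for queued in (0, 200, 1500, 6000):
                d, s = dedicated_share(queued, total_slots=2000)
                print(f"{queued:>5} queued jobs -> dedicated:sharing = {d}:{s}")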

        Speakers: Xiaowei Jiang (IHEP, Chinese Academy of Sciences), Jingyan Shi (IHEP), Jiaheng Zou (IHEP)
      • 312
        Run Control Software for the Upgrade of the ATLAS Muon to Central Trigger Processor Interface (MUCTPI)

        The Muon to Central Trigger Processor Interface (MUCTPI) of the ATLAS experiment at the
        Large Hadron Collider (LHC) at CERN is being upgraded for the next run of the LHC in order
        to use optical inputs and to provide full-precision information for muon candidates to the
        topological trigger processor (L1TOPO) of the Level-1 trigger system. The new MUCTPI is
        implemented as a single ATCA blade with high-end processing FPGAs which eliminate double
        counting of muon candidates in overlapping regions, send muon candidates to L1TOPO,
        and muon multiplicities to the Central Trigger Processor (CTP), as well as readout data to
        the data acquisition system of the experiment. A Xilinx Zynq System-on-Chip (SoC) with a
        programmable logic part and a processor part is used for the communication to the processing
        FPGAs and the run control system. The processor part, based on ARM processor cores, is running
        embedded Linux prepared using the framework of the Linux Foundation's Yocto project. The
        ATLAS run control software was ported to the processor part and a run control application
        was developed which receives, at configuration, all data necessary for the overlap handling
        and candidate counting of the processing FPGAs. During running, the application provides
        ample monitoring of the physics data and of the operation of the hardware.

        Speaker: Ralf Spiwoks (CERN)
      • 313
        Running Oracle WebLogic on containers

        The CERN IT department provides production services to run container technologies. Given that, the IT-DB team, responsible for running the Java-based platforms, has started a new project to move the WebLogic deployments from virtual or bare-metal servers to containers: Docker together with Kubernetes allows us to improve the overall productivity of the team, reducing operations time and speeding up the time-to-delivery, as well as introducing repeatable and reproducible environments. The scope of the project ranges from the design and production of Docker images and Kubernetes-based deployment environments to all procedures and operations, including the tools needed to hand over to users the management of their deployed applications.

        In this presentation we will show how at CERN we have faced all the technical and design aspects of running it in a production environment, that is, the implementation of the needed solutions, including monitoring, logging, high availability, security and traceability. Finally, we will show the integration of the developed product into the CERN IT ecosystem, including SSO, storage and the user directory, among others.

        Speaker: Borja Aparicio Cotarelo (CERN)
      • 314
        Service monitoring system for JINR Tier-1

        The Tier-1 for CMS was created at JINR in 2015. It is important to keep an eye on the Tier-1 center at all times in order to maintain its performance. One monitoring system is based on Nagios: it monitors the center on several levels: engineering infrastructure, network and hardware. It collects many metrics, creates plots and determines statuses such as HDD state, temperatures and loads, among many others. But this information is not always enough to tell whether the Tier-1 services are working properly. For that purpose a service monitoring system was developed to collect data from different resources, including WLCG monitoring services. The purpose of this system is to aggregate data from different sources, determine statuses based on new and historical data, and react according to predefined instructions. The system's general idea and architecture are described and analyzed in this work.
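
        One simple way such a service monitoring system can combine heterogeneous inputs is a "worst status wins" aggregation over all sources, sketched below with illustrative status names and sources; the actual system also uses historical data and predefined reactions, which are not shown.

            # Severity ordering for a simple "worst status wins" aggregation
            SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

            def aggregate(statuses):
                """Combine per-source statuses into one service status (worst wins)."""
                if not statuses:
                    return "UNKNOWN"
                return max(statuses, key=lambda s: SEVERITY.get(s, SEVERITY["UNKNOWN"]))

            # Illustrative inputs, e.g. local hardware checks plus external WLCG tests
            sources = {
                "nagios_disk":   "OK",
                "nagios_temps":  "WARNING",
                "wlcg_sam_test": "OK",
            }
            print("service status:", aggregate(list(sources.values())))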

        Speaker: Igor Pelevanyuk (Joint Institute for Nuclear Research (RU))
      • 315
        SGSI project at CNAF

        The Italian Tier-1 center is mainly focused on LHC and physics experiments in general. Recently we tried to widen our area of activity and established a collaboration with the University of Bologna to set up an area inside our computing center for hosting experiments with strict security and privacy requirements on stored data. The first experiment we are going to host is Harmony, a project that is part of IMI's Big Data for Better Outcomes programme (IMI stands for Innovative Medicines Initiative).
        In order to be able to accept this kind of data we had to make a subset of our computing center compliant with the ISO 27001 standard.

        ISO/IEC 27001:2013 is a specification for an information security management system (ISMS). Organizations that meet the standard may be certified compliant by an independent and accredited certification body on successful completion of a formal compliance audit.
        Achieving accredited certification to ISO 27001 demonstrates that an institute/company is following information security best practice, and provides an independent, expert verification that information security is managed in line with international best practice and business objectives.

        In this talk we will describe the SGSI project (Sistema Gestione Sicurezza Informazioni, Information Security Management System) with details of all the processes we have been through in order to become ISO 27001 compliant, with a particular focus on the separation of the project's dedicated resources from all the others hosted in the center.
        We will also describe the software solutions adopted to allow this project to accept, in the future, any experiment or collaboration in need of this kind of security procedures.

        Speaker: Enrico Fattibene (INFN - National Institute for Nuclear Physics)
      • 316
        Simulation approach for improving the computing network topology and performance of the China IHEP Data Center

        The goal of the project is to improve the computing network topology and performance of the China IHEP Data Center, taking into account the growing numbers of hosts, experiments and computing resources. Analysing the computing performance of the IHEP Data Center in order to optimize its distributed data processing system is a really hard problem due to the great scale and complexity of the computing and storage resources shared between various HEP experiments. In order to fulfil the requirements, we adopt the simulation program SymSim, developed in the IT laboratory of the Joint Institute for Nuclear Research [1]. This simulation system is focused on improving the efficiency of developing grid/cloud structures by using the work quality indicators of a real system. SymSim facilitates decisions regarding the required equipment and resources. The simulation uses input parameters from the database of the IHEP computing infrastructure; in addition, we use some data from the BESIII [2] experiment to define workflow and data flow parameters for simulating two different variants of organizing the IHEP computing infrastructure. The first simulation results show that the proposed approach allows us to make an optimal choice of the network topology, improving its performance and saving resources.

        References
        1. Korenkov V. V., Nechaevskiy A. V., Ososkov G. A., Pryahina D. I., Trofomov V. V., Uzhinskiy A. V., Simulation concept of NICA-MPD-SPD Tier0-Tier1 computing facilities // Particles and Nuclei Letters. — 2016. — Vol. 13, No 5. — P. 1074–1083.
        2. BESIII Experiment [Electronic resource]: http://bes3.ihep.ac.cn/

        Speaker: Andrey Nechaevskiy (JINR)
      • 317
        Simulation of a Fast Timing Micro-Pattern Gaseous Detector for TOF-PET and future accelerators

        Full MC simulation is a powerful tool for designing new detectors and guiding the construction of new prototypes.
        Improved micro-structure technology has led to the rise of Micro-Pattern Gas Detectors (MPGDs), with the main features being flexible geometry, high rate capability, excellent spatial resolution, and reduced radiation length. A new detector layout, the Fast Timing MPGD (FTM), could combine both the high spatial resolution (100 um) and the high rate capability (100 MHz/cm^2) of the current state-of-the-art MPGDs with a high time resolution of 100 ps. Adopting the FTM design for the detection of photons can potentially reach sub-millimeter spatial resolution and 100 ps time resolution. Such a detector would allow the development of an affordable TOF-PET scanner with improved image contrast.
        This contribution introduces the FTM prototype as an innovative PET imaging detector concept and emphasizes the importance of full detector simulation in guiding the design of the detector geometry.
        The interaction of particles with different materials is simulated with the GEANT4 toolkit. Electric fields are calculated with ANSYS and COMSOL, while the detector response and detector gain are simulated with Magboltz and Garfield++. Furthermore, the detector simulations are validated against first measurements of detector prototypes.

        Speaker: Raffaella Radogna (Universita e INFN, Bari (IT))
      • 318
        Software packaging and distribution for LHCb using Nix

        Software is an essential and rapidly evolving component of modern high energy physics research. The ability to be agile and take advantage of new and updated packages from the wider data science community is allowing physicists to efficiently utilise the data available to them. However, these packages often introduce complex dependency chains and evolve rapidly introducing specific, and sometimes conflicting, version requirements which can make managing environments challenging. Additionally, there is a need to replicate old environments when generating simulated data and to utilise pre-existing datasets.

        Nix is a "purely functional package manager" which allows for software to be built and distributed with fully specified dependencies, making packages independent from those available on the host. Builds are reproducible and multiple versions/configurations of each package can coexist with the build configuration of each perfectly preserved.

        Here we will give an overview of Nix followed by the work that has been done to use Nix in LHCb and the advantages and challenges that this brings.

        Speaker: Chris Burr (University of Manchester (GB))
      • 319
        Spanish ATLAS Federated Tier2 and Tier1 perspective on computing over the next years

        Since the beginning of the WLCG project, the Spanish ATLAS computing centres have contributed reliable and stable resources, as well as personnel, to the ATLAS Collaboration.

        Our contribution to the ATLAS Tier-2 and Tier-1 computing resources (disk and CPU) in the last 10 years has been around 5%, even though the Spanish contribution to the ATLAS detector construction, as well as the number of authors, are both close to 3%. In 2015 an international advisory committee recommended revising our contribution in line with the participation in the ATLAS experiment. In this scenario, we are optimising the federation of three sites located in Barcelona, Madrid and Valencia, taking into account that the ATLAS collaboration has developed workflows and tools to flexibly use all the resources available to the collaboration, enlarging the original statically-tiered structure of WLCG with a more dynamic structure nucleated around highly reliable centers with sufficient disk storage resources.

        In this contribution, we would like to show the evolution and technical updates of the Spanish ATLAS Federated Tier-2 and Tier-1. Some developments we are involved in, like the Event Index and Event WhiteBoard projects, as well as the use of opportunistic resources, will be useful to reach our goal. We discuss the foreseen/proposed scenario towards a sustainable computing environment for the Spanish ATLAS community in the Run 3 and HL-LHC period.

        Speaker: Santiago Gonzalez De La Hoz (Univ. of Valencia and CSIC (ES))
      • 320
        Speedup approaches in a TPC Cellular Automaton track finder

        The track finding procedure is one of the key steps of event reconstruction in high energy physics experiments. Track finding algorithms combine hits into tracks and reconstruct the trajectories of particles flying through the detector. The tracking procedure is an extremely time-consuming task because of the large combinatorics. Thus, calculation speed is crucial in heavy-ion experiments, especially for online reconstruction. The Cellular Automaton (CA) algorithm provides a perfect solution for this task.
        In this work we consider improvements of the CA track finder, which was developed and applied for a TPC detector. The track finder is being prepared to operate in online mode and thus requires the maximum possible speedup. To achieve this goal, the data structures were improved for better vectorisation and parallelisation (a schematic illustration is given below). The algorithm was optimised to decrease the combinatorial level and simplify the calculations. As a result, the achieved speedup allows the algorithm to be used for online reconstruction with an efficiency as good as before, even in the case of high particle multiplicity.
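
        A schematic numpy analogue of the data-structure change that enables vectorisation is shown below: hits kept as an array of structures are rearranged into a structure of arrays, so a distance-based neighbour search between two stations becomes a single broadcasted operation instead of a per-hit loop. This is only an illustration of the idea, not the C++/SIMD implementation of the CA track finder.

            import numpy as np

            # Array-of-structures view: one record per hit (station, x, y)
            hits_aos = [
                {"station": 0, "x": 1.0, "y": 0.2},
                {"station": 0, "x": 2.5, "y": -0.1},
                {"station": 1, "x": 1.1, "y": 0.25},
                {"station": 1, "x": 4.0, "y": 3.0},
            ]

            # Structure-of-arrays view: contiguous coordinate arrays, friendly to SIMD
            station = np.array([h["station"] for h in hits_aos])
            x = np.array([h["x"] for h in hits_aos])
            y = np.array([h["y"] for h in hits_aos])

            # Vectorised neighbour search between station 0 and station 1:
            # all pairwise distances are computed in one shot via broadcasting.
            i0, i1 = np.where(station == 0)[0], np.where(station == 1)[0]
            dist2 = (x[i0, None] - x[None, i1]) ** 2 + (y[i0, None] - y[None, i1]) ** 2
            pairs = np.argwhere(dist2 < 0.5)           # candidate "cells" of the CA
            print([(int(i0[a]), int(i1[b])) for a, b in pairs])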

        Speaker: Grigory Kozlov (FIAS, JINR)
      • 321
        SPT-3G Computing

        SPT-3G, the third-generation camera on the South Pole Telescope (SPT), was deployed in the 2016-2017 Austral summer season. The SPT is a 10-meter telescope located at the geographic South Pole and designed for observations in the millimeter-wave and submillimeter-wave regions of the electromagnetic spectrum. The SPT is primarily used to study the Cosmic Microwave Background (CMB). The upgraded camera produces an order of magnitude more data than the previous generations of SPT cameras. The telescope is expected to collect a petabyte (PB) of data over the course of five years, which is a significantly larger data volume than for any other CMB telescope in operation. The increase in data rate required radical changes to the SPT computing model both at the South Pole and at the University of Chicago. This paper will describe the overall integration of distributed storage and compute resources into a common interface, the deployment of on-site data reduction and storage infrastructure, and the usage of the Open Science Grid (OSG) by the South Pole Telescope collaboration.

        Speaker: Benedikt Riedel (University of Chicago)
      • 322
        The ATLAS Access Manager Policy Browser: state-of-the-art web technologies for a rich and interactive data visualization experience

        The ATLAS experiment is operated daily by many users and experts working concurrently on several aspects of the detector.
        The safe and optimal access to the various software and hardware resources of the experiment is guaranteed by a role-based access control (RBAC) system provided by the ATLAS Trigger and Data Acquisition (TDAQ) system. The roles are defined by an inheritance hierarchy. Depending on their duties, every ATLAS user has a well-defined set of access privileges (rules) corresponding to a specific set of assigned roles. In total, there are several hundred roles and several thousand users. Over the years, the system grew in terms of users and roles, motivating the deployment of a visualization tool named "Policy Browser". Currently, it is the primary tool for role administrators to manage all aspects of the Access Management via a rich web-based interface.

        This paper presents the requirements, design and implementation of the "Policy Browser". The tool is able to aggregate and correlate all the information provided by the RBAC system and offers a visual representation of the interrelations occurring among roles, users, hosts and rules. Additionally, the "Policy Browser" implements a powerful and flexible query mechanism facilitating the browsing of all the authorizations granted by the system. As an example of the available visual representations, the "Policy Browser" is capable of dynamically generating graphs to quickly display the roles that grant a user some given privileges. A graph explorer is also provided in order to browse the role inheritance hierarchy (a simplified traversal is sketched below).
        The "Policy Browser" is implemented using robust JavaScript frameworks: AngularJS, Bootstrap and D3.js for the front-end, and Django, a Python framework, for the back-end. The use cases and the results based on an informal evaluation provided by the role administrators are also presented.

        Speaker: Igor Soloviev (University of California Irvine (US))
      • 323
        The ATLAS Muon Trigger

        Events containing muons in the final state are an important signature
        for many analyses being carried out at the Large Hadron Collider
        (LHC), including both standard model measurements and searches for new
        physics. To be able to study such events, it is required to have an
        efficient and well-understood muon trigger. The ATLAS muon trigger
        consists of a hardware based system (Level 1), as well as a software
        based reconstruction (High Level Trigger). Due to high luminosity and
        pile up conditions in Run-2, several improvements have been
        implemented to keep the trigger rate low while still maintaining a
        high efficiency. Some examples of recent improvements include
        requiring coincidence hits between different layers of the muon
        spectrometer, improvements for handling overlapping muons, and
        optimised muon isolation. We will present an overview of how we
        trigger on muons, recent improvements, and the performance of the muon
        trigger in Run-2 data.

        Speaker: Shota Hayashida (Nagoya University (JP))
      • 324
        The ATLAS Online Luminosity Software

        The Online Luminosity software of the ATLAS experiment has been upgraded in the last two years to improve scalability, robustness, and redundancy and to increase automation keeping Run-3 requirements in mind.
        The software package is responsible for computing the instantaneous and integrated luminosity for particle collisions at the ATLAS interaction point at the Large Hadron Collider (LHC). The software processes and calibrates the information coming from the luminometer detectors, archives the ATLAS luminosity information for physics data analysis, handles Van Der Meer scans, and feeds back luminosity and beam-related information to the LHC control system.
        Enhancements to the software will both enable its use in 2018 in emittance scans (publishing to LHC per bunch information
        and errors, and handling the scan protocol in an automatic way) and, with its multithreading and improved robustness,
        benefit physics data taking as well.
        The online luminosity software package and the operational experience during the LHC Run-2 will be described in this article.

        Speaker: Alex Kastanas (KTH Royal Institute of Technology (SE))
      • 325
        The ATLAS Trigger Menu design for higher luminosities in Run 2

        The ATLAS experiment records about 1 kHz of physics collisions, starting from an LHC design bunch crossing rate of 40 MHz. To reduce the large background rate while maintaining a high selection efficiency for rare and Beyond-the-Standard-Model physics events, a two-level trigger system is used.

        Events are selected based on physics signatures, such as the presence
        of energetic leptons, photons, jets or large missing energy. A Trigger Menu is the compilation of about 2,000 individual triggers, specifying the selection algorithms to be used during data taking and the rate and bandwidth a given trigger is allocated. Trigger menus also take into consideration the instantaneous luminosity and limitations from the detector readout and offline processing farm. For Run-2, the ATLAS trigger has been enhanced to be able to handle higher instantaneous luminosities (up to 2.0x10^{34}cm^{-2}s^{-1}) and to ensure the selection robustness against higher average multiple interactions per bunch crossing.

        In this presentation we discuss several aspects of the trigger menu design, such as physics goals, detector calibration, and constraints on rate, bandwidth, and CPU. We give an overview of the updated 2E34 Trigger Menu and the online system that is used to monitor the trigger performance in real time.

        Speaker: Javier Montejo Berlingen (CERN)
      • 326
        The ATLAS Trigger Simulation with Legacy Software

        Physics analyses at the LHC require accurate simulations of the detector response and the event selection processes. The accurate simulation of the trigger response is crucial for determining the overall selection efficiencies and signal sensitivities. For the generation and reconstruction of simulated event data, the most recent software releases are used to ensure the best agreement between simulated and real data. However for the simulation of the trigger selection process the same software release with which real data were taken should be used. This requires potentially running with legacy software dating back many years.
        The requirements and possibilities for such a simulation scheme within the ATLAS software framework were examined and a proof-of-concept simulation chain has been successfully implemented. One of the greatest challenges was the choice of a data format which promises long term compatibility with old and new software releases, databases and other external support services. The encountered problems and developed solutions will be presented, and proposals for future development will be discussed. Some ideas reach beyond the retrospective trigger simulation scheme in ATLAS as they also touch more generally aspects of data preservation.

        Speaker: Catrin Bernius (SLAC National Accelerator Laboratory (US))
      • 327
        The Cloud of geographically distributed data centers

        ITMO University (ifmo.ru) is developing a cloud of geographically distributed data centers. Here, “geographically distributed” means data centers (DCs) located in different places, far from each other by hundreds or thousands of kilometers. The usage of geographically distributed data centers promises a number of advantages for end users, such as the opportunity to add additional DCs and service availability through redundancy and geographical distribution.
        Services like data transfer, computing, and data storage are offered to users in the form of virtual objects such as virtual machines, virtual storage, and virtual data transfer links.
        A user may request to create and use one virtual object, or a group of them, via the cloud dashboard. Each user request for a virtual object is checked for the feasibility of its deployment and integration with the existing objects and the restrictions in the cloud. For this purpose, an abstraction layer over the available hardware components has been developed.
        The above features are implemented in the form of software agents interconnected by distributed message queues. The authors plan to describe experiments with the distributed multi-agent system for the management, monitoring and deployment of the cloud.

        Speaker: Mr Petr Fedchenkov (ITMO University)
      • 328
        The ComputeOps project: containers for HPC

        The High Performance Computing (HPC) domain aims to optimize code in order to use the latest multicore and parallel technologies, including specific processor instructions. In this computing framework, portability and reproducibility are key concepts. A way to handle these requirements is to use Linux containers. These "light virtual machines" allow an application to be encapsulated together with its environment in Linux processes. Containers have recently been rediscovered thanks to their ability to provide both a multi-infrastructure environment for developers and system administrators and reproducibility through the image build file. Two container solutions are emerging: Docker (1) for micro-services and Singularity (2) for computing applications.

        We present here the ComputeOps project, which has the goal of studying the benefit of containers for HPC applications. This three-year project, starting in 2018, will provide innovative deliverables such as containerized tutorials, a Singularity Hub for research, a catalog of manifests enabling reproducible computing, and a multi-resource scheduler based on XtremWeb-HEP (3) that can launch jobs both on virtual machines and in bare-metal containers. In collaboration with public and private partners, the project will provide developers with testing platforms featuring recent processors such as Nvidia GPUs and the Xeon Phi. Based on several use cases, such as ray tracing and machine learning applications using the TensorFlow library, we will present the first results of the ComputeOps project.

        (1) Docker: https://www.docker.com/
        (2) Singularity: http://singularity.lbl.gov/
        (3) XtremWeb-HEP: https://xtremweb-hep.lal.in2p3.fr/

        Speaker: Cecile Cavet (APC)
      • 329
        The EU Up to University Project - Bridging the gap between secondary and higher education

        In January 2017, a consortium of European companies, research labs, universities, and education networks started the “Up to University” project (Up2U). Up2U is a 3-year EU-funded project that aims at creating a bridge between high schools and higher education. Up2U addresses both the technological and methodological gaps between secondary school and higher education by (a.) provisioning the tools commonly used in academia and big science to secondary school students and by (b.) training and enabling teachers to use new technologies and experiment with novel pedagogical methods in their lectures.

        Up2U proposes the development of an innovative e-learning environment for high school students and teachers. This is a platform consisting of tools for real-time interactions on-line, educational content harvesting, lectures recording and publishing, document creation, collaborative editing, cloud file storage and sharing. All tools are open source, accessible from a web interface and entirely cloud-based. More importantly, they can be easily distributed and allow for portability leveraging on container technologies. This gives the possibility to National Research and Education Networks (NRENs), or even individual schools, to deploy the Up2U platform on their cloud infrastructure, customizing and scaling the deployment according to their needs.

        The ultimate goal of Up2U is to remove the disconnection between high schools and higher education by creating new communities of interested people sharing and improving high-quality educational content, re-elaborating scientific papers to make them accessible by a wider audience, and using the same technology tools and components. In such a way, students now in high schools will have the unique opportunity to experiment and familiarize with tools, approaches, and methodologies used by scientists in demanding environments and consequently develop the required skills for a seamless transition to university.

        Speaker: Enric Tejedor Saavedra (CERN)
      • 330
        The GridKa WLCG Tier-1 Center: Status and Plans

        The GridKa center serves the ALICE, ATLAS, CMS, LHCb and Belle-II experiments as one of the biggest WLCG Tier-1 centers worldwide, providing compute and storage resources. It is operated by the Steinbuch Centre for Computing at the Karlsruhe Institute of Technology in Germany. In this presentation, we will describe the current status of the compute, online and offline storage resources, and we will present details on operational aspects and on the configuration and deployment management infrastructure in use at GridKa. In addition, our plans for expanding the resources, including the utilization of external cloud resources and the integrated operation of GridKa and the Large Scale Data Facility at KIT, will be presented.

        Speaker: Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE))
      • 331
        The impact of applying WildCards to dead modules for FTK pattern banks on efficiency and data flow.

        Online selection is an essential step to collect the most interesting collisions among the very large number of events delivered by the ATLAS detector at the Large Hadron Collider (LHC). The Fast TracKer (FTK) is a hardware-based track finder for the ATLAS trigger system that rapidly identifies important physics processes through their track-based signatures in the Inner Detector pixel and SCT layers. FTK will perform global track reconstruction after each Level-1 trigger for all tracks with pT > 1 GeV, to enable the Level-2 trigger to have early access to tracking information. Tracking is seeded by a pattern recognition step. Patterns are recognized with hits in seven out of eight possible detector layers. Because the detectors are not perfect, the presence of dead modules leads to a loss of efficiency in some regions. To recover efficiency, a WildCards (WC) algorithm will be implemented in the FTK. The modules to which the WC algorithm is applied are assumed to always have hits. The WC algorithm recovers the inefficiency but also causes a higher data flow, which may exceed the hardware limits. To correct this, we will apply a penalty on patterns with WildCards, which slightly reduces the efficiency but allows the data flow to be brought back to values closer to the ideal ones.
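
        A minimal sketch of the matching logic described above: a pattern fires when at least seven of its eight layers are matched, wildcard layers (covering dead modules) always count as matched, and patterns relying on wildcards receive a score penalty used to limit the extra data flow. The module ids and the scoring below are illustrative assumptions.

            N_LAYERS = 8
            MIN_MATCHED = 7        # a pattern fires with hits in 7 out of 8 layers

            def match_pattern(pattern, hit_modules, wildcard_modules, wc_penalty=1):
                """Return (fires, score) for one pattern.

                `pattern` is a list of 8 module ids (one per layer); wildcard modules
                are treated as always hit; each wildcard used costs `wc_penalty`.
                """
                matched, wildcards_used = 0, 0
                for module in pattern:
                    if module in wildcard_modules:
                        matched += 1
                        wildcards_used += 1
                    elif module in hit_modules:
                        matched += 1
                fires = matched >= MIN_MATCHED
                score = matched - wc_penalty * wildcards_used
                return fires, score

            # Illustrative event: modules with hits, and one dead module covered by a WC
            hits = {101, 202, 303, 404, 505, 606, 707}
            wildcards = {808}                    # dead module on the last layer
            pattern = [101, 202, 303, 404, 505, 606, 707, 808]
            print(match_pattern(pattern, hits, wildcards))   # (True, 7)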

        Speaker: Khalil Bouaouda (Universite Hassan II, Ain Chock (MA))
      • 332
        The INFN scientific computing infrastructure: present status and future evolution

        The INFN scientific computing infrastructure is composed of more than 30 sites, ranging from CNAF (the Tier-1 for LHC and the main data center for nearly 30 other experiments) and 9 LHC Tier-2s to ~20 smaller sites, including LHC Tier-3s and non-LHC experiment farms.
        A comprehensive review of the installed resources, together with plans for the near future, was collected during the second half of 2017, and provides a general view of the infrastructure, its costs and its potential for expansion; it also shows the general trends in software and hardware solutions utilized in a complex reality such as INFN.
        As of the end of 2017, the total installed CPU power exceeded 800 kHS06 (~80,000 cores), while the total net storage capacity was over 57 PB on disk and 97 PB on tape; the vast majority of resources (95% of cores and 95% of disk) are concentrated in the 16 largest centers.
        Future evolutions are explored and point towards consolidation into big centers; this has required a rethinking of the access policies and protocols in order to enable diverse scientific communities, beyond LHC, to fruitfully exploit the INFN resources. On top of that, such an infrastructure will be used beyond INFN experiments, and will be part of the Italian infrastructure comprising other research institutes, universities and HPC centers.

        Speakers: Tommaso Boccali (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P), Gianpaolo Carlino (INFN Napoli), Luca dell'Agnello (INFN), Donatella Lucchesi (INFN Padova)
      • 333
        The next generation PanDA Pilot for and beyond the ATLAS experiment

        The Production and Distributed Analysis system (PanDA) is a pilot-based workload management system that was originally designed for the ATLAS Experiment at the LHC to operate on grid sites. Since the coming LHC data taking runs will require more resources than grid computing alone can provide, the various LHC experiments are engaged in an ambitious program to extend the computing model to include opportunistically used resources such as High Performance Computers (HPCs), clouds and volunteer computers. To this end, PanDA is being extended beyond grids and ATLAS to be used on the new types of resources as well as by other experiments. A new key component is being developed, the next generation PanDA Pilot (Pilot 2). Pilot 2 is a complete rewrite of the original PanDA Pilot which has been used in the ATLAS Experiment for over a decade. The new Pilot architecture follows a component-based approach which improves system flexibility, enables a clear workflow control, evolves the system according to modern functional use-cases to facilitate coming feature requests from new and old PanDA users.
        The paper describes Pilot 2, its architecture and its place in the PanDA hierarchy. Furthermore, its ability to be used either as a command-line tool or through APIs is explained, as well as how its workflows and components are being streamlined for usage on both grids and opportunistically used resources for and beyond the ATLAS experiment.

        Speaker: Paul Nilsson (Brookhaven National Laboratory (US))
      • 334
        The Online Monitoring API for the DIALOG Library of the COMPASS Experiment

        Modern experiments demand a powerful and efficient Data Acquisition System (DAQ). The intelligent, FPGA-based Data Acquisition System (iFDAQ) of the COMPASS experiment at CERN is composed of many processes communicating with each other. The DIALOG library covers the communication mechanism between processes and establishes a communication layer for each of them. It was introduced to the iFDAQ in Run 2016 and improved the stability of the system significantly. The paper presents the online monitoring API for the DIALOG library. Communication between processes is challenging from a synchronization, reliability and robustness point of view. Online monitoring tools for the communication between processes are capable of revealing communication problems to be fixed in the future. The debugging capability was crucial during the library's introduction period in the iFDAQ. On the other hand, based on the measurement of communication between processes, a proper load balancing of processes among machines can improve the stability of the system. The online monitoring API offers a general approach to the implementation of many monitoring tools with different purposes. In the paper, its fundamental concept and its integration into a new monitoring tool are discussed, and a few examples of monitoring tools are given.

        Speaker: Ondrej Subrt (Czech Technical University (CZ))
      • 335
        The Security model of the ALICE next generation Grid framework

        JAliEn (Java-AliEn) is ALICE's next-generation Grid framework, which will be used for the top-level management of distributed computing resources during LHC Run 3 and onward. While preserving an interface familiar to the ALICE users, its performance and scalability are an order of magnitude better than those of the currently used system.
        To enhance the JAliEn security, we have developed the so-called Token Certificates – short-lived full Grid certificates, generated by central services automatically or on the client's request. The new system provides fine-grained control over user/client authorization, e.g. filtering out unauthorized requests based on the client's type: generic user, job agent, job payload. These and other parameters (like the job ID) are encrypted in the token by the issuing service and cannot be altered.
        The client-side security implementation is also described in the context of the interaction between user jobs and job agents. User jobs will use JAliEn tokens for authentication and authorization to the central JAliEn services. These tokens are passed from the job agent through a pipe stream, and thus are not stored on disk or in an environment visible to anyone except the job process. Furthermore, we foresee an improvement in the isolation of users' payloads by running them in containers.
        While JAliEn doesn't rely on X509 proxies, backward compatibility is kept to assure interoperability with services which require them.

        Speaker: Volodymyr Yurchenko (National Academy of Sciences of Ukraine (UA))
      • 336
        The service for parallel applications based on the JINR cloud and HybriLIT resources

        Cloud computing has become a routine tool for scientists in many domains. The JINR cloud infrastructure provides JINR users with computational resources for performing various scientific calculations. In order to speed up the achievement of scientific results, the JINR cloud service for parallel applications was developed. It consists of several components and implements a flexible and modular architecture which allows both more applications and various types of resources to be utilized as computational backends. Moreover, this architecture increases the utilization of idle cloud resources.
        An example of using the Cloud&HybriLIT resources in scientific computing is the study of superconducting processes in stacked long Josephson junctions (LJJ). LJJ systems are the subject of intensive research because of the prospect of practical applications in nano-electronics and quantum computing. The respective mathematical model is described by a system of sine-Gordon-type partial differential equations [1,2], where the spatial derivatives are approximated with standard finite-difference formulas and the resulting system of ODEs is solved numerically by means of the 4th-order Runge-Kutta procedure (a minimal numerical sketch is given below). A parallel MPI implementation of the numerical algorithm was developed in [3,4]. Preliminary results of numerical experiments on LJJ calculations at the JINR cloud infrastructure are presented in [5]. In this contribution, we summarize the experience of applying the Cloud&HybriLIT resources to the high-performance computing of physical characteristics of the LJJ system.
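
        As an illustration of the numerical scheme mentioned above (not the production code), the sketch below integrates a single one-dimensional sine-Gordon equation, u_tt = u_xx - sin(u), with centred finite differences in space and a classical 4th-order Runge-Kutta step in time.

            import numpy as np

            def rhs(state, dx):
                """Right-hand side of the first-order system (u, v), with v = du/dt."""
                u, v = state
                u_xx = np.zeros_like(u)
                u_xx[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2  # centred 2nd derivative
                return np.array([v, u_xx - np.sin(u)])

            def rk4_step(state, dt, dx):
                """One classical 4th-order Runge-Kutta step."""
                k1 = rhs(state, dx)
                k2 = rhs(state + 0.5 * dt * k1, dx)
                k3 = rhs(state + 0.5 * dt * k2, dx)
                k4 = rhs(state + dt * k3, dx)
                return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

            # Kink initial condition on a 1D grid (fixed boundaries)
            x = np.linspace(-20, 20, 401)
            dx, dt = x[1] - x[0], 0.02
            u0 = 4.0 * np.arctan(np.exp(x))        # static sine-Gordon kink
            state = np.array([u0, np.zeros_like(x)])

            for _ in range(500):                    # evolve to t = 10
                state = rk4_step(state, dt, dx)
            print("u at the centre of the grid:", round(float(state[0][200]), 3))
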
        The second example is the provisioning of the HybriLIT cloud service for training and testing deep recurrent neural networks (RNN) specially designed for BM@N track reconstruction [6]. The trained RNN can process 6500 simulated track candidates per second on a single Nvidia Tesla M60 with 97.5% recognition efficiency.

        The work is supported by RFBR under grant 15-29-01217.

        1. I. R. Rahmonov, Y. M. Shukrinov, and A. Irie, JETP Letters 99, 632 (2014);
        2. I. R. Rahmonov, Yu. M. Shukrinov, P. Kh. Atanasova, E. V. Zemlyanaya, and M. V. Bashashin,JETP 124 131 (2017);
        3. M.V. Bashashin, E.V. Zemlyanaya, I.R. Rahmonov, Yu.M. Shukrinov, P.Kh. Atanasova, A.V. Volokhova Numerical approach and parallel implementation for computer simulation of stacked long Josephson Junctions // Computer Research and Modeling, T.8, № 4, 2016, P.593–604;
        4. E. V. Zemlyanaya, M. V. Bashashin, I. R. Rahmonov, Yu. M. Shukrinov, P. Kh. Atanasova, and A. V. Volokhova. Model of stacked long Josephson junctions: Parallel algorithm and numerical results in case of weak coupling. AIP Conference Proceedings 1773, 110018(1-9) (2016);
        5. Aleksandrov E.I., Bashashin M.V., Belyakov D.V., Volohova A.V., Zemlyanaya E.V., Zuev M.I., Kutovskiy N.A., Matveev M.A., Nechaevskiy A.V., Ososkov G.A., Podgainy D.V., Rahmonov I.R., Streltsova O.I., Trofimov V.V., Shukrinov Yu.M. Investigation of efficiency of MPI-calculations on cloud and heterogeneous infrastructures of MICC JINR. Materials of the All-Russian Conference with International Participation "Information and Telecommunication Technologies and Mathematical Modeling of High-Tech Systems" (April 24-28, 2017, Moscow), PFUR, 2017, 206-208;
        6. D.Baranov, S.Mitsyn, G.Ososkov, P.Goncharov, A.Tsytrinov, Novel approach to the particle track reconstruction based on deep learning methods, Proceedings of the XXVI International Symposium on Nuclear Electronics & Computing (NEC’2017), pp. 37-45.

        Speaker: Prof. G. Ososkov (Joint Institute for Nuclear Research)
      • 337
        The SLAC RCE Platform for ProtoDUNE-SP

        The ProtoDUNE-SP is a single-phase liquid argon time projection chamber (LArTPC) prototype for the Deep Underground Neutrino Experiment (DUNE). Signals from 15,360 electronic channels are received by 60 Reconfigurable Cluster Elements (RCEs), which are processing elements designed at SLAC for a wide range of applications and are based upon the "system-on-chip” Xilinx Zynq family of FPGAs. The RCEs are housed in industry-standard ATCA shelves on a custom blade, called the Cluster on Board (COB). The RCE platform and its processing functions for the ProtoDUNE-SP will be presented. Additionally, a conceptual design of the RCE upgrade and its potential applications to the full-scale DUNE detectors will be discussed.

        Speaker: Dr Ka Vang Tsang (SLAC National Accelerator Laboratory)
      • 338
        Thoughts on using python, numpy, and scikit-learn for HEP analysis

        The HEP community has voted strongly with its feet to adopt ROOT as the current de facto analysis toolkit. It is used to write out and store our RAW data and our reconstructed data, and to drive our analysis. Almost all modern data models in particle physics are written in ROOT. New tools from industry are making an appearance in particle physics analysis, however, driven by the massive interest in Machine Learning. Further, datasets in industry rival or exceed those in particle physics. Given the large number of people outside HEP devoting time to optimizing these tools, it makes a lot of sense for HEP to adopt what it can and devote its resources to things unique to HEP. There are several external toolkits – the most popular are based on the R language and Python. Python seems to have the most interest within the HEP community, and parts of the community are devoting serious resources to its incorporation. This work discusses some of the high-level differences between the ROOT approach and the numpy/scikit-learn approach and the technical details driving them, along with some ideas of what could be adopted in the long run for use in our community. It is based on work implementing an analysis in the ATLAS experiment.
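
        To give a concrete flavour of the numpy/scikit-learn approach, the sketch below trains a gradient-boosted classifier on flat numpy arrays standing in for two ntuple branches; in a real analysis the arrays would be read from ROOT files (for instance through a ROOT-to-numpy bridge) rather than generated, and the variables are invented for illustration.

            import numpy as np
            from sklearn.ensemble import GradientBoostingClassifier
            from sklearn.metrics import roc_auc_score
            from sklearn.model_selection import train_test_split

            rng = np.random.RandomState(42)

            # Toy stand-ins for two "branches" (pt, eta) for signal and background events
            n = 5000
            sig = np.column_stack([rng.normal(60, 15, n), rng.normal(0.0, 1.0, n)])
            bkg = np.column_stack([rng.normal(40, 20, n), rng.normal(0.0, 2.0, n)])
            X = np.vstack([sig, bkg])
            y = np.concatenate([np.ones(n), np.zeros(n)])

            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
            clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
            clf.fit(X_train, y_train)

            scores = clf.predict_proba(X_test)[:, 1]
            print("ROC AUC:", round(roc_auc_score(y_test, scores), 3))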

        Speaker: Gordon Watts (University of Washington (US))
      • 339
        Three weeks long hackathon - LHCb's Puppet 3.5 to Puppet 4.9 migration.

        Up until September 2017, LHCb Online was running on a non-redundant Puppet 3.5 Master/Server architecture. As a result, we had problems with outages, both planned and unplanned, as well as with scalability (how do you run 3000 nodes at the same time? How do you even run 100 without bringing down the Puppet Master?). On top of that, Puppet 5.0 was released, so we were now running two versions behind!
        As Puppet 4.9 was the de facto standard, something had to be done right away, so a quick, self-inflicted, three-week-long non-stop hackathon had to happen. This talk will cover the pitfalls, mistakes and architecture decisions we took when migrating our entire Puppet codebase, nearly from scratch, to a more modular one, addressing existing exceptions and anticipating ones arising in the future – all while our entire infrastructure was running in physics production, and causing zero outages. We will cover the mistakes we had made in our Puppet 3 installation and how we fixed them in the end, in order to lower the catalogue compile time and reduce our overall codebase by around 50%.
        We will also cover how we set up a quickly scalable Puppet core (Masters, CAs, Foreman, etc.) infrastructure.

        Speaker: Hristo Umaru Mohamed (CERN)
      • 340
        Towards a responsive CernVM-FS architecture

        The CernVM File System (CernVM-FS) provides a scalable and reliable software distribution service implemented as a POSIX read-only filesystem in user space (FUSE). It was originally developed at CERN to assist High Energy Physics (HEP) collaborations in deploying software on the worldwide distributed computing infrastructure for data processing applications. Files are stored remotely as content-addressed blocks on standard web servers and are retrieved and cached on-demand through outgoing HTTP connections only. Repository metadata is recorded in SQLite catalogs, which represent implicit Merkle tree encodings of the repository state. For writing, CernVM-FS follows a publish-subscribe pattern with a single source of new content that is propagated to a large number of readers.

        This talk focuses on the work to move the CernVM-FS architecture in the direction of a responsive data distribution system. A new distributed publication backend allows scaling out large publication tasks across multiple machines, reducing the time to publish. For the faster propagation of new published content, the addition of a notification system allows clients to subscribe to messages about changes in the repository and to request new root catalogs as soon as they become available. These developments make CernVM-FS more responsive and are particularly relevant for use cases where a short propagation delay from repository down to individual clients is important, such as using CernVM-FS as an AFS replacement for distributing software stacks. Additionally, they permit the implementation of more complex workflows, with producer-consumer pipelines, as for example in the ALICE analysis trains system.
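
        A schematic sketch of the subscribe-and-refresh pattern described above, with an in-process queue standing in for the real notification transport; the message fields and the refresh call are illustrative assumptions and not the CernVM-FS client API.

            import queue
            import threading
            import time

            notifications = queue.Queue()   # stands in for the real notification transport

            def refresh_repository(repo, root_hash):
                """Placeholder for asking the client to load the new root catalog."""
                print(f"[client] {repo}: switching to root catalog {root_hash}")

            def subscriber(repo):
                """Block on change messages for one repository and trigger a refresh."""
                while True:
                    msg = notifications.get()
                    if msg is None:
                        break                              # shutdown signal
                    if msg["repo"] == repo:
                        refresh_repository(repo, msg["root_hash"])

            t = threading.Thread(target=subscriber, args=("sw.example.repo",))
            t.start()

            # The publisher side would emit a message after each successful publish
            notifications.put({"repo": "sw.example.repo", "root_hash": "abc123"})
            time.sleep(0.1)
            notifications.put(None)
            t.join()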

        Speaker: Radu Popescu (CERN)
      • 341
        Track extrapolation and muon identification in Belle II event reconstruction

        I describe the charged-track extrapolation and muon-identification modules in the Belle II data-analysis code framework (basf2). These modules use GEANT4E to extrapolate reconstructed charged tracks outward from the Belle II Central Drift Chamber into the outer particle-identification detectors, the electromagnetic calorimeter, and the K-long and muon detector (KLM). These modules propagate the position, momentum, 6-dimensional covariance matrix, and time of flight through the detailed Belle II detector geometry to permit comparison of the extrapolated track with the hits detected in the outer detectors. In the KLM, a Kalman filter is employed to adjust the extrapolation based on the matching measurements. Downward-moving cosmic rays in the upper half of the detector are back-propagated upward. Several modifications were made to permit GEANT4E to interoperate with GEANT4 within a single basf2 event-generation/simulation/reconstruction job and to expand the number of particle species that are extrapolated.

        Speaker: Leo Piilonen (Virginia Tech)
      • 342
        Tracking System Performance of the BM@N Experiment

        The Baryonic Matter at Nuclotron (BM@N) experiment represents the first phase of the Nuclotron-based Ion Collider fAcility (NICA) megascience project at the Joint Institute for Nuclear Research. It is a fixed-target experiment built for studying nuclear matter in conditions of extreme density and temperature.
        The tracking system of the BM@N experiment consists of three main detector systems: Multiwire Proportional Chambers (MWPC) situated before the magnet, Gas Electron Multipliers (GEM) placed inside the magnet and Drift Chambers (DC) placed after the magnet. These systems provide the reconstruction of charged particles’ trajectories and their momenta in the magnetic field. This information is further used by time-of-flight detectors for the particle identification procedure.
        The performance of the system is reviewed, and the spatial resolutions along with the efficiencies of the detectors are estimated using the data collected in the recent physics runs of the Nuclotron.

        Speaker: Nikolay Voytishin (Joint Institute for Nuclear Research (RU))
      • 343
        Trident: A three pronged approach to analysing node utilisation

        We describe the development of a tool (Trident) that uses a three-pronged approach to analysing node utilisation while aiming to be user-friendly. The three areas of focus are data I/O, CPU core and memory.

        Compute applications running in a batch system node will stress different parts of the node over time. It is usual to look at metrics such as CPU load average and memory consumed. However, this often does not provide enough information to form a detailed picture of how the system is performing and in most cases detecting performance problems is impossible.

        Monitoring and collecting further performance metrics in near real time is intended to give a better understanding of compute demands and of which changes can improve utilisation. We are investigating methodologies at the CERN Tier-0 to allow collection of metrics such as memory bandwidth, detailed CPU core utilisation and active processor cycles. This is done with minimal overhead and without instrumenting the user code. When combined with modern analytics, the metrics can provide information relevant to users, developers and site administrators. The raw metrics are often difficult to interpret, hence the development of a tool that allows the target communities to both collect and interpret resource utilisation data more easily.
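
        As an illustration of this kind of lightweight, non-intrusive collection, the following sketch samples per-core busy fractions from /proc/stat on Linux; it is not Trident itself, and the deeper metrics mentioned above (memory bandwidth, active cycles) would require hardware performance counters rather than this approach.

        import time

        def read_cpu_times():
            """Return {core: (busy, total)} jiffies parsed from /proc/stat."""
            times = {}
            with open("/proc/stat") as f:
                for line in f:
                    if line.startswith("cpu") and line[3].isdigit():
                        fields = line.split()
                        vals = [int(v) for v in fields[1:]]
                        idle = vals[3] + vals[4]          # idle + iowait
                        times[fields[0]] = (sum(vals) - idle, sum(vals))
            return times

        def utilisation(interval=1.0):
            """Sample twice and return the busy fraction per core."""
            before = read_cpu_times()
            time.sleep(interval)
            after = read_cpu_times()
            return {core: (after[core][0] - before[core][0]) /
                          max(1, after[core][1] - before[core][1])
                    for core in after}

        if __name__ == "__main__":
            for core, frac in sorted(utilisation().items()):
                print(f"{core}: {100 * frac:.1f}% busy")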

        Speaker: Servesh Muralidharan (CERN)
      • 344
        Trigger Rate Monitoring Tools at CMS

        One of the major challenges for the Compact Muon Solenoid (CMS) experiment is the task of reducing the event rate from roughly 40 MHz down to a more manageable 1 kHz while keeping as many interesting physics events as possible. This is accomplished through the use of a Level-1 (L1) hardware-based trigger as well as a software-based High-Level Trigger (HLT). Monitoring and understanding the output rates of the L1 and HLT triggers is of key importance for determining the overall performance of the trigger system and is intimately tied to what type of data is being recorded for physics analyses. We present here a collection of tools used by CMS to monitor the L1 and HLT trigger rates. One of these tools is a script (run in the CMS control room) that gives valuable real-time feedback on trigger rates to the shift crew. Another useful tool is a plotting library that is used for observing how trigger rates vary over a range of beam and detector conditions, in particular how the rates of individual triggers scale with event pile-up.
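
        As a toy illustration of the pile-up scaling study mentioned above, the sketch below fits a linear rate-versus-pile-up model to invented numbers; it is not the CMS tooling, and the values carry no physics meaning.

        import numpy as np
        import matplotlib.pyplot as plt

        pileup = np.array([20, 25, 30, 35, 40, 45, 50], dtype=float)
        rate_hz = np.array([41.0, 52.3, 63.1, 75.0, 86.8, 99.5, 111.2])   # one trigger path

        # Many single-object triggers scale roughly linearly with pile-up.
        slope, offset = np.polyfit(pileup, rate_hz, deg=1)
        print(f"rate ~ {offset:.1f} Hz + {slope:.2f} Hz per unit pile-up")

        plt.plot(pileup, rate_hz, "o", label="measured")
        plt.plot(pileup, offset + slope * pileup, "-", label="linear fit")
        plt.xlabel("mean pile-up")
        plt.ylabel("trigger rate [Hz]")
        plt.legend()
        plt.savefig("rate_vs_pileup.png")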

        Speaker: Andrew Wightman (University of Notre Dame (US))
      • 345
        Triggering on hadronic signatures in ATLAS – developments for 2017 and 2018

        Hadronic signatures are critical to the ATLAS physics program, and are used extensively for both Standard Model measurements and searches for new physics. These signatures include generic quark and gluon jets, as well as jets originating from b-quarks or the decay of massive particles (such as electroweak bosons or top quarks). Additionally, missing transverse momentum from non-interacting particles provides an interesting probe in the search for new physics beyond the Standard Model.

        Developing trigger selections that target these events is a huge challenge at the LHC due to the enormous rates associated with hadronic signatures. This challenge is exacerbated by the amount of pile-up activity, which continues to grow. In order to address these challenges, several new techniques were developed to significantly improve the potential of the 2017 dataset.

        This talk presents an overview of how we trigger on hadronic signatures at the ATLAS experiment, outlining the challenges of hadronic object triggering and describing the improvements performed over the course of the Run-2 LHC data-taking program. The performance in Run-2 data is shown, including demonstrations of the new techniques being used in 2017. We also discuss further critical developments implemented for the rest of Run-2 and their performance in early 2018 data.

        Speaker: Emanuel Gouveia (LIP Laboratorio de Instrumacao e Fisica Experimental de Particu)
      • 346
        Understanding the evolution of conditions data access through Frontier for the ATLAS Experiment

        The ATLAS Distributed Computing system uses the Frontier system to access the Conditions, Trigger, and Geometry database data stored in the Oracle Offline Database at CERN by means of the HTTP protocol. All ATLAS computing sites use squid web proxies to cache the data, greatly reducing the load on the Frontier servers and the databases. One feature of the Frontier client is that in the event of a failure it retries against different services. While this allows transient errors and scheduled maintenance to be handled transparently, it does open the system up to cascading failures if the load is high enough.

        Throughout LHC Run 2 there has been an ever-increasing demand on the Frontier service. There have been multiple incidents where parts of the service failed due to high load, so a significant improvement in the monitoring of the Frontier service was required. The monitoring was needed both to identify problematic tasks, which could then be killed or throttled, and to identify failing site services, where the risk of a cascading failure is much higher. This presentation describes the implementation and features of the monitoring system.
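
        The failover behaviour described above can be pictured with the following toy client, which tries local squid proxies first and falls back to a central server; the endpoint URLs are placeholders, not real ATLAS services.

        import urllib.request
        import urllib.error

        ENDPOINTS = [
            "http://local-squid-1.example.org:3128/frontier",
            "http://local-squid-2.example.org:3128/frontier",
            "http://central-frontier.example.org:8000/frontier",
        ]

        def fetch_with_failover(query: str) -> bytes:
            last_error = None
            for base in ENDPOINTS:
                try:
                    with urllib.request.urlopen(f"{base}?{query}", timeout=10) as resp:
                        return resp.read()
                except (urllib.error.URLError, OSError) as exc:
                    last_error = exc        # transient error or maintenance: try the next one
            raise RuntimeError(f"all endpoints failed, last error: {last_error}")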

        Speaker: Michal Svatos (Acad. of Sciences of the Czech Rep. (CZ))
      • 347
        Upgrade of CMS Full Simulation for Run-2

        We report on the status of the CMS full simulation for Run 2. Initially, Geant4 10.0p02 was used in sequential mode; about 16 billion events were produced for the analysis of 2015-2016 data. In 2017, the CMS detector was updated: a new pixel tracking detector was installed, the hadronic calorimeter electronics were modified, and extra muon detectors were added. Corresponding modifications were introduced in the full simulation, and Geant4 10.2p02 was adopted for the 2017 simulation production; it includes improvements for the multi-threaded mode, which became the default for 2017.
        For the 2018 Monte Carlo productions, the full simulation has been updated. The new Geant4 version 10.4 will be used; it was adopted for production after detailed validation against test-beam and collision data. The results of these validations will be described in detail. Several aspects of the migration to Geant4 10.4 and the corresponding modifications of the CMSSW simulation software will also be discussed.

        Speaker: Prof. Vladimir Ivantchenko (CERN)
      • 348
        Using Fat Containers on HPCs for the ATLAS Experiment

        CVMFS helps ATLAS in distributing software to the Grid, and isolating software lookup to batch nodes’ local filesystems. But CVMFS is rarely available in HPC environments. ATLAS computing has experimented with "fat" containers, and later developed an environment to produce such containers for both Shifter and Singularity. The fat containers include most of the recent ATLAS software releases, conditions database, and other tools extracted from CVMFS. The approach not only helped ATLAS to distribute software automatically to HPCs with an environment identical to those in CVMFS, but also significantly reduced the metadata I/O to HPCs’ shared filesystems. The production operation at NERSC has proved that by using fat containers, we can both scale up to run many more jobs simultaneously, and transparently fit into the previously developed ATLAS operation methods.

        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 349
        Using the autopilot pattern to deploy container resources at a WLCG Tier-2

        Containers are becoming ubiquitous within the WLCG, with CMS announcing a requirement for Singularity at supporting sites in 2018. The ubiquity of containers means it is now possible to reify configuration along with applications as a single easy-to-deploy unit, rather than via a myriad of configuration management tools such as Puppet, Ansible or Salt. This allows greater use of industry devops techniques such as Continuous Integration (CI) and Continuous Deployment (CD) within the operations domain, leading to faster upgrades and more secure systems.

        One interesting technique is the autopilot pattern [1], which provides mechanisms for application lifecycle management from within the container itself. Using modern service discovery techniques, each container manages its own configuration, monitors its own health and adapts to changing requirements through the use of event triggers.

        In this paper we expand on previous work [2] to create and deploy resources to a WLCG Tier-2 via containers to investigate the viability of using the autopilot pattern at a WLCG site to deploy and manage computational resources.

        [1] http://autopilotpattern.io/
        [2] G. Roy et al. "A container model for resource provision at a WLCG Tier-2", ACAT 2017
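
        A toy sketch of the autopilot idea follows: the process inside the container checks its own health and reacts to configuration changes. The paths, port and probe are invented for illustration; real deployments typically rely on tools such as ContainerPilot together with a service-discovery backend.

        import json
        import os
        import time
        import urllib.request

        CONFIG_PATH = "/etc/app/config.json"     # hypothetical rendered configuration

        def healthy() -> bool:
            try:
                with urllib.request.urlopen("http://localhost:8080/health", timeout=2):
                    return True
            except OSError:
                return False

        def main():
            last_mtime = 0.0
            while True:
                if os.path.exists(CONFIG_PATH):
                    mtime = os.path.getmtime(CONFIG_PATH)
                    if mtime != last_mtime:      # configuration changed: reload it
                        with open(CONFIG_PATH) as f:
                            print("reloaded config:", json.load(f))
                        last_mtime = mtime
                print("healthy" if healthy() else "unhealthy: deregister from discovery")
                time.sleep(10)

        if __name__ == "__main__":
            main()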

        Speaker: Gareth Douglas Roy (University of Glasgow (GB))
      • 350
        WISE Information Security for Collaborating e-Infrastructures

        As most are fully aware, cybersecurity attacks are an ever-growing problem as larger parts of our lives take place on-line. Distributed digital infrastructures are no exception and action must be taken to both reduce the security risk and to handle security incidents when they inevitably happen. These activities are carried out by the various e-Infrastructures and it has become very clear in recent years that collaboration with others both helps to improve the security and to work more efficiently.

        WISE provides a trusted forum where security experts can share information on topics such as risk management, experiences about certification processes and threat intelligence. With participants from e-Infrastructures such as WLCG, EGI, EUDAT, PRACE, XSEDE, NRENs and more, WISE focuses on standards, guidelines and practices, and promotes the protection of critical infrastructure. To date WISE has published two documents: a risk management template and a second version of the SCI framework, endorsed by multiple large-scale e-Infrastructures. In 2018 WISE will begin work on new areas of relevance to the HEP community, including a focus on operational security and incident response for interoperating infrastructures.

        We present an overview of the available WISE recommendations, and extend an invitation to participate in our working groups.

        [1] https://wise-community.org

        Speaker: Hannah Short (CERN)
      • 351
        A Historic Data Quality Monitor (HDQM) tool for the CMS TRACKER Detector

        The Historic Data Quality Monitor (HDQM) of the CMS experiment is a framework developed by the Tracker group of the CMS collaboration that permits web-based monitoring of the time evolution of measurements (S/N ratio, cluster size, etc.) in the Tracker silicon micro-strip and pixel detectors. In addition, it provides a flexible way of extending HDQM to the other detector systems of the CMS experiment. The framework offers a way to build and deploy trend plots based on two steps. In the first step, data are retrieved periodically from the offline DQM database via a cron-scheduled job and a list of datasets is produced. In the second step, the web interface dynamically and interactively produces the trend plots from the datasets. The monitoring of the time evolution of detector-sensitive quantities is fundamental to the control of data quality. At the same time, the effect of calibration sequences performed during a period is checked. The overall organization of the tool will be presented along with its internal structure and representative examples.
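
        The second step described above amounts to turning per-run values into a trend plot; a minimal sketch of that step is shown below with invented run numbers and S/N values, leaving out the retrieval from the offline DQM database.

        import matplotlib.pyplot as plt

        runs = [315000, 315100, 315250, 315400, 315600]          # hypothetical runs
        signal_over_noise = [18.9, 18.7, 18.8, 18.4, 18.5]       # hypothetical per-run values

        plt.plot(runs, signal_over_noise, "o-")
        plt.xlabel("run number")
        plt.ylabel("strip cluster S/N")
        plt.title("HDQM-style trend (illustrative data)")
        plt.savefig("sn_trend.png")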

        Speaker: Dimitrios Loukas (Nat. Cent. for Sci. Res. Demokritos (GR))
      • 352
        A Simplified Computing Framework for FPGA-Accelerated Workloads

        We introduce SWiF - Simplified Workload-intuitive Framework - a workload-centric application programming framework designed to simplify the large-scale deployment of FPGAs in end-to-end applications. SWiF intelligently mediates access to shared resources by orchestrating the distribution and scheduling of tasks across a heterogeneous mix of FPGA and CPU resources in order to improve utilization and maintain system requirements. We implemented SWiF atop the Intel Accelerator Abstraction Layer (AAL) and deployed the resulting software stack in a datacenter with an Intel-based Xeon+FPGA server running Apache Spark. We demonstrate that by using SWiF's API, developers can flexibly and easily deploy FPGA-enabled applications and frameworks with the addition of only 3 lines of code to an existing code base. In particular, we demonstrate that by offloading Spark's compression workload to the FPGA through SWiF, we realize a 3.2X speedup in total job execution, and up to 5X when Spark's Resilient Distributed Datasets (RDDs) are persisted.

        Speaker: David Nonso Ojika (University of Florida (US))
      • 353
        A simulation tool for ALICE storage and computing resource usage

        The LHC experiments produce petabytes of data each year, which must be stored, processed and analyzed. This requires a significant amount of storage and computing resources. In addition, the requirements on these resources are increasing over the years, with each LHC running period.
        In order to predict the resource usage requirements of the ALICE Experiment for a particular LHC Run period, we developed a flexible and highly configurable simulation tool, which performs a discrete event simulation of ALICE data flow processes by taking into account the LHC running conditions and schedule, data management/removal policies, conferences and other criteria.
        The tool provides a web GUI, which allows all the necessary parameters to be entered and the results of the simulations to be visualized graphically; the simulations are performed using the SIM.JS JavaScript library.
        In this paper we will describe the design and the functional structure of the aforementioned tool and present the preliminary results of the simulations.
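
        The tool itself is built on the SIM.JS JavaScript library; a comparable discrete-event sketch in Python with SimPy is shown below, modelling raw data arriving on a disk buffer and being drained at a limited rate. All rates and capacities are invented for illustration.

        import simpy

        DISK_TB = 40_000            # disk buffer capacity
        ARRIVAL_TB_PER_DAY = 90     # raw data rate during data taking
        DRAIN_TB_PER_DAY = 60       # archival/processing drain rate

        def data_taking(env, disk):
            while True:
                yield env.timeout(1)                    # one day passes
                yield disk.put(ARRIVAL_TB_PER_DAY)      # new raw data lands on disk

        def draining(env, disk):
            while True:
                yield env.timeout(1)
                amount = min(DRAIN_TB_PER_DAY, disk.level)
                if amount:
                    yield disk.get(amount)              # moved to tape / reclaimed

        env = simpy.Environment()
        disk = simpy.Container(env, capacity=DISK_TB, init=0)
        env.process(data_taking(env, disk))
        env.process(draining(env, disk))
        for day in (30, 60, 90):
            env.run(until=day)
            print(f"day {day}: {disk.level} TB on disk")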

        Speakers: Armenuhi Abramyan (A.Alikhanyan National Science Laboratory (AM)), Narine Manukyan (A.Alikhanyan National Laboratory (AM))
      • 354
        An open source data transfer toolkit for research data

        Many areas of academic research are increasingly catching up with the LHC experiments
        when it comes to data volumes, and just as in particle physics they require large data sets to be moved between analysis locations.
        The LHC experiments have built a global e-Infrastructure in order to handle hundreds of
        petabytes of data and massive compute requirements. Yet there is nothing particle-physics-specific in this infrastructure. However, its command-line interfaces and its use of an individual’s X.509 certificate as a means of user identification can be seen as barriers by other communities. These users typically have access to multiple endpoints via site-specific logins, but no X.509 certificates or any other shared authentication infrastructure.
        This project aims to provide an easy to use ‘drag-and-drop’ web interface for users to transfer files between institutions that do not have a grid infrastructure in place. Consideration was given to the simplicity of deployment, by providing packaged versions for common operating systems. The underlying technology, e.g. the automatic generation of X.509 certificates, is hidden from the user but leverages standard grid technologies.
        We report on the development of a prototype for the UK distributed supercomputing facility (DiRAC) and the planned extension of the project to other communities.

        Speaker: David Colling (Imperial College (GB))
      • 355
        Backup Infrastructure at CERN

        CERN's current Backup and Archive Service hosts 11 PB of data in more than 2.1 billion files. We have over 500 clients which back up or restore an average of 80 TB of data each day. At the current growth rate, we expect to have about 13 PB by the end of 2018.

        In this contribution we present CERN's Backup and Archive Service based on IBM Spectrum Protect (previously known as Tivoli Storage Manager). We show the architecture and design of the system, the user requirements, the operational issues, the growth challenges and the options for resolving them.

        Speakers: Alex Iribarren (CERN), Julien Leduc (CERN)
      • 356
        Best Practices in Accessing Tape-Resident Data in HPSS

        Tape is an excellent choice for archival storage because of the capacity, cost per GB and long retention intervals, but its main drawback is the slow access time due to the nature of the sequential medium. Modern enterprise tape drives now support Recommended Access Ordering (RAO), which is designed to improve recall/retrieval times.

        BNL's mass storage system currently holds more than 100 PB of data on tapes, managed by HPSS. Starting with HPSS version 7.5.1, a new feature called “Tape Order Recall” (TOR) has been introduced. It uses the RAO mechanism to improve access times over sequential media, claiming a performance improvement of 30% to 60%.

        Prior to HPSS 7.5.1, we had been using in-house developed scheduling software called ERADAT. ERADAT accesses files based on the order of the logical position of each file and has consequently demonstrated great performance at BNL. We have compared the new TOR method to the method of using the logical position under different conditions, such as the number of access requests.

        In this presentation we will demonstrate a series of tests which indicate how effective TOR (RAO) is under different scenarios and what the best methods are for restoring data from tape storage under different conditions.
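
        The pre-RAO strategy mentioned above (ordering recalls by logical position, as ERADAT does) can be sketched as follows; the request tuples are invented for illustration.

        from collections import defaultdict

        # (file name, cartridge, logical position on tape)
        requests = [
            ("fileA", "TAPE01", 5821), ("fileB", "TAPE02", 120),
            ("fileC", "TAPE01", 47),   ("fileD", "TAPE01", 9903),
            ("fileE", "TAPE02", 4410),
        ]

        by_cartridge = defaultdict(list)
        for name, cartridge, position in requests:
            by_cartridge[cartridge].append((position, name))

        for cartridge, files in sorted(by_cartridge.items()):
            ordered = [name for _, name in sorted(files)]    # ascending logical position
            print(cartridge, "->", ordered)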

        Speaker: David Yu (Brookhaven National Laboratory (US))
      • 357
        Broadcasting dynamic metadata content to external web pages using AMI (ATLAS Metadata Interface) embeddable components

        AMI (ATLAS Metadata Interface) is a generic ecosystem for metadata
        aggregation, transformation and cataloguing. Often, it is interesting
        to share up-to-date metadata with other content services such as wikis.
        Here, we describe the cross-domain solution implemented in the AMI Web
        Framework: a system of embeddable controls, communicating with the
        central AMI service and based on the AJAX and CORS technologies. We
        also describe the main available controls and their basic usage.

        Speaker: Mr Fabian Lambert (LPSC Grenoble IN2P3/CNRS (FR))
      • 358
        Dataclient

        Since the current data infrastructure of the HEP experiments is based on GridFTP, most computing centres have adapted and based their own data access on X.509. This is an issue for smaller experiments that do not have the resources to train their researchers in the complexities of X.509 certificates and that would clearly prefer an approach based on username/password.

        On the other hand, asking computing centres to support different access strategies is not so straightforward, since it would require a significant expenditure of effort and manpower.

        At CNAF we tackled this problem by creating a layer on top of the gridftp client/server that completely hides the X.509 infrastructure under an authentication/authorization process based on the Kerberos realm of our centre, and therefore based on username/password. We called this 'dataclient.'

        In this article we will describe both the principles that drove its design and its general architecture, with the measures taken to simplify the user's experience and maintenance burden.

        Speaker: Enrico Fattibene (INFN - National Institute for Nuclear Physics)
      • 359
        Decay Chain Reconstruction in Belle II

        The Belle II detector will begin its data taking phase in 2018. Featuring a state of the art vertex detector with innovative pixel sensors, it will record collisions of e+e- beams from the SuperKEKB accelerator which is slated to provide luminosities 40x higher than KEKB.
        This large amount of data will come at the price of an increased beam background, as well as an operating point providing a lowered Lorentz boost when compared to Belle.

        This leads to new challenges for analysts seeking to perform precision measurements, as increasingly precise vertex position reconstruction and background rejection power are needed to take advantage of the unprecedented statistics provided. I will discuss the decay reconstruction techniques available to make such measurements possible.

        I will present an approach to particle decay reconstruction based on the global fit of the full decay chain, implemented using a Kalman filter. Unlike traditional reconstruction workflows where vertices are individually reconstructed, this technique (first used in BaBar [arXiv:physics/0503191]) is able to fit underconstrained vertices by exploiting well resolved elements elsewhere in the decay chain; this is especially well suited to the treatment of processes with neutral decays. It simultaneously allows for more precise vertex position resolution (directly impacting e.g. time-dependent CP violation measurements), improved background rejection, and increased statistical power through the incorporation of a larger number of decay modes.

        Speaker: Dr Francesco Tenchini (University of Melbourne)
      • 360
        Deploying and extending CMS Tier 3s using VC3 and the OSG Hosted CE service

        CMS Tier 3 centers, frequently located at universities, play an important role in the physics analysis of CMS data. Although different computing resources are often available at universities, meeting all the requirements to deploy a valid Tier 3 able to run CMS workflows can be challenging in certain scenarios. For instance, providing the right operating system (OS) with access to the CERNVM File System (CVMFS) on the worker nodes, or having a Compute Element (CE) on the submit host, is not always allowed or possible due to, e.g., a lack of root access to the nodes, TCP port network policies, the maintenance of a CE, etc. The Notre Dame group operates a CMS Tier 3 with ~1K cores. In addition to this, researchers have access to an opportunistic pool with over 25K cores that are used via Lobster for CMS jobs, but cannot be used with other standard CMS submission tools on the grid like CRAB, as these resources are not part of the Tier 3 due to their opportunistic nature. This work describes the use of VC3, a service for automating the deployment of virtual cluster infrastructures, in order to provide the environment (user-space CVMFS access and a customized OS via Singularity containers) needed for CMS workflows to work, as well as its integration with the OSG Hosted CE service, adding these resources to CMS as part of our existing Tier 3 in a seamless way.

        Speaker: Kenyi Paolo Hurtado Anampa (University of Notre Dame (US))
      • 361
        Distributed caching system for a multi-site DPM storage

        The experience gained in several years of storage system administration has shown that the WLCG distributed grid infrastructure performs very well for the needs of the LHC experiments. However, an excessive number of storage sites leads to inefficiencies in system administration because of the need for experienced manpower at each site and the increased burden on central operations. On the other hand, user analysis is often based on clusters hosted at small sites such as Tier3s: thus, it is important to provide dynamic and efficient data access at such sites as well. Both these requirements can be met by diskless Tier3s with a local data cache.
        A prototype of a system fulfilling such requirements is presented in this work. The system storage relies on the possibility to implement volatile storage pools behaving as caches, offered by the latest releases of the Disk Pool Manager (DPM) with its new core called DOME (Disk Operation Management Engine).
        Leveraging the fast and reliable connections among different sites, this study proposes a configuration where a primary site represents a single entry point for the whole storage system that includes disk areas located at remote sites. A remote disk pool, configured as volatile, works as local cache with zoning access mechanisms. It may take as data source the other pools of the same system or external data areas, with mechanisms under investigation in terms of scalability and performance. With such a system, in a Tier3 the user analysis would be able to access locally cached input data, relieving the local system administrator from managing a full storage system and making the site transparent for central operations.

        Speaker: Alessandra Doria (INFN, Napoli (IT))
      • 362
        Dynafed: Common directions for http and cloud storage federations

        Dynafed is a system that allows the creation of flexible and seamless storage federations out of participating sites that expose WebDAV, HTTP, S3 or Azure interfaces. The core components have been considered stable for a few years, and the recent focus has been on supporting various important initiatives willing to exploit the potential of cloud storage in the context of Grid computing for various VOs. These include the XDC project, ATLAS, ATLAS@home, CMS@home, Belle-II and EUDAT.

        Here we report on and discuss various common features that these initiatives exploit in similar ways (like the authorization subsystem), and on some other planned features that would be of common interest, like aggregating additional metadata or checksums, aggregating disk caches or using Dynafed as a "protocol translator" for the use case of massive data transfers through FTS.

        Speaker: Fabrizio Furano (CERN)
      • 363
        Dynamic and on-demand data streams

        Replicability and efficiency of data processing on the same data samples are a major challenge for the analysis of data produced by HEP experiments. High-level data analyzed by end-users are typically produced as a subset of the whole experiment data sample in order to study interesting selections of data (streams). For standard applications, streams may eventually be copied from servers and analyzed at local computing centers or on user machines. The creation of streams as copies of a subset of the original data results in redundant information stored in filesystems and may not be efficient: if the definition of a stream changes, it may force a reprocessing of the low-level files, with a consequent impact on the efficiency of the data analysis.

        We propose a flexible solution for implementation of an analysis framework based on a database of lookup tables intended for dynamic and on-demand definition of data streams. This enables the end-users, as the data analysis strategy evolves, to explore different definitions of streams with minimal cost in computing resources. We also present a prototype demonstration application of this database for the analysis of the AMS-02 experiment data.
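
        The lookup-table idea can be pictured with the small sketch below: a stream is only a stored selection resolved to event identifiers at read time, so redefining it copies no event data. The schema and values are invented for illustration and are unrelated to the actual AMS-02 database.

        import sqlite3

        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE events (run INTEGER, event INTEGER, ntracks INTEGER)")
        db.executemany("INSERT INTO events VALUES (?, ?, ?)",
                       [(1, 1, 3), (1, 2, 12), (1, 3, 7), (2, 1, 15)])

        # A "stream" is just a named selection, not a copy of the data.
        db.execute("CREATE TABLE streams (name TEXT, selection TEXT)")
        db.execute("INSERT INTO streams VALUES ('high_multiplicity', 'ntracks >= 10')")

        (selection,) = db.execute(
            "SELECT selection FROM streams WHERE name = 'high_multiplicity'").fetchone()
        ids = db.execute(f"SELECT run, event FROM events WHERE {selection}").fetchall()
        print("events in stream:", ids)    # -> [(1, 2), (2, 1)]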

        Speaker: Valerio Formato (Universita e INFN, Perugia (IT))
      • 364
        Estimating Time To Complete for ATLAS data transfers

        Transfer Time To Complete (T³C) is a new extension for the data management system Rucio that allows predictions to be made about the duration of a file transfer. The extension has a modular architecture which allows predictions to be based on models ranging from simple to more sophisticated, depending on the available data and computation power. The ability to predict file transfer times with reasonable accuracy provides a tool for better transfer scheduling and thus reduces both the load on storage systems and on the associated networks. The accuracy of the model requires fine-tuning of its parameters on a per-link basis. As the underlying infrastructure varies depending on the source and destination of the transfer, the parameters modeling the network between these sites will also be studied.
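
        One of the simplest per-link models alluded to above is a fixed overhead plus size divided by bandwidth, with both parameters fitted from past transfers on that link; the sketch below shows that idea on invented numbers and is not the actual T³C code.

        import numpy as np

        # Past transfers on one source->destination link: (bytes, seconds).
        sizes = np.array([1e9, 2e9, 5e9, 8e9, 1e10])
        durations = np.array([45.0, 80.0, 190.0, 300.0, 370.0])

        # Linear least squares: duration ~ overhead + size / bandwidth.
        inv_bandwidth, overhead = np.polyfit(sizes, durations, deg=1)

        def predict_ttc(size_bytes: float) -> float:
            return overhead + inv_bandwidth * size_bytes

        print(f"overhead ~ {overhead:.0f} s, bandwidth ~ {1 / inv_bandwidth / 1e6:.0f} MB/s")
        print(f"predicted TTC for a 4 GB file: {predict_ttc(4e9):.0f} s")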

        Speaker: Joaquin Ignacio Bogado Garcia (Universidad Nacional de La Plata (AR))
      • 365
        Evolution of the VISPA-project

        VISPA (Visual Physics Analysis) is a web-platform that enables users to work on any SSH reachable resource using just their web-browser. It is used successfully in research and education for HEP data analysis.
        The emerging JupyterLab is an ideal choice for a comprehensive, browser-based, and extensible work environment and we seek to unify it with the efforts of the VISPA project. The primary objective is to provide the user with the freedom to access any external resources at their disposal, while maintaining a smooth integration of preconfigured ones including their access permissions. Additionally, specialized HEP tools, such as native format data browsers (ROOT, PXL), are being migrated from VISPA- to JupyterLab-extensions as well.
        We present these concepts and their implementation progress.

        Speaker: Benjamin Fischer (RWTH Aachen University)
      • 366
        Extending the Manchester site with containers and cloud technology

        We describe how the Blackett facility at the University of Manchester
        High Energy Physics group has been extended to provide Docker container and
        cloud platforms as part of the UKT0 initiative. We show how these new
        technologies can be managed using the facility's existing fabric
        management based on Puppet and Foreman. We explain how use of the
        facility has evolved beyond its origins as a WLCG Tier-2 centre and now
        also supports astronomy workloads including LSST and SKA.

        Speaker: Andrew McNab (University of Manchester)
      • 367
        Final Analysis Sample Metadata

        IceCube is a cubic kilometer neutrino detector located at the South Pole. Data are processed and filtered in a data center at the South Pole. After transfer to a data warehouse in the north, data are further refined through multiple levels of selection and reconstruction to reach analysis samples. So far, the production and curation of these analysis samples has been handled in an ad-hoc way in IceCube. New tools have been developed to capture and validate the metadata associated with these data samples in a consistent and machine-readable specification. Development was driven by analysis use cases and by the pursuit of reproducibility of scientific results.

        Speaker: PATRICK MEADE (University of Wisconsin-Madison)
      • 368
        Gaining insight from large data volumes with ease

        Efficient handling of large data volumes becomes a necessity in today's world. It is driven by the desire to get more insight from the data and to gain a better understanding of user trends, which can be transformed into economic incentives (profits, cost reduction and various optimizations of data workflows and pipelines). In this talk we discuss how modern technologies are transforming a well-established pattern of handling data based on relational databases. This trend, originally introduced in the commercial world, is being adopted at a rapid pace in HEP. We present a few examples of such systems successfully designed and deployed in CMS to handle millions of Framework Job Report documents from distributed agents, as well as to efficiently aggregate large data volumes coming from a variety of data providers (Data Bookkeeping System, data location system, EOS, AAA, CRAB, WMArchive) and to understand data trends through built-in dashboards.

        Speaker: Valentin Y Kuznetsov (Cornell University (US))
      • 369
        Geant4 validation web application

        One of the key factors for the successful development of a physics Monte-Carlo is the ability to properly organize regression testing and validation. Geant4, a world-standard toolkit for HEP detector simulation, is one such example that requires thorough validation. The CERN/SFT group, which contributes to the development, testing, deployment and support of the toolkit, is also responsible for the monthly validation of the development releases of Geant4, based on a set of community-developed tests.

        In this talk we present the web application "geant-val" developed specifically to analyse and visualise results whilst comparing them between different Geant4 releases. The application is written using the Node.js and Angular frameworks, and uses PostgreSQL for storing test results. The output is visualised using ROOT and JSROOT. In addition to a pure visual comparison, we perform different statistical tests (chi-square, Kolmogorov-Smirnov etc) on the client side using HTML5 Web Workers.

        The "geant-val" application has been demonstrated to be a robust tool for the regression testing and validation of the Geant4 toolkit. The generic design is such that it can be applied to compare any histograms no matter from which Monte-Carlo code/generator the data were produced.

        Speakers: Ivan Razumov (Institute for High Energy Physics (RU)), Witold Pokorski (CERN)
      • 370
        GenEx - a modular software framework of MC event generator for exclusive processes

        Central Exclusive Production (CEP) is a class of diffractive processes studied at the Large Hadron Collider that offers a very clean experimental environment for probing the low-energy regime of Quantum Chromodynamics.
        Like any other analysis in High Energy Physics, it requires a large amount of simulated Monte Carlo data, usually created by means of so-called MC event generators. In most cases, general-purpose applications (e.g. Pythia8) are enough, but in specialized areas of study, where phenomenological models are used, they tend to be cumbersome.
        GenEx is a simple class structure for the construction of Monte Carlo event generators that is able to self-adapt to the provided matrix element and acceptance cuts. Based on it, a new Monte Carlo event generator for CEP has been introduced into the LHCb experiment framework, with a standalone version also available.
        Written in C++, it is aimed at replacing the older generation of FORTRAN-based generators with a lightweight and modern package. The key idea is to provide the community with a user-friendly and effective way of testing new models and generating Monte Carlo samples. The standalone version relies heavily on programming techniques established in recent years, such as concurrent computing.
        The presentation will describe the general package structure and various issues related to the implementation of new exclusive physics models. In addition, a generator-level production optimization, the performance test results for generation on many threads and a comparison with other existing generators will also be presented. We believe it is a valuable complement for CEP physics to the big, general-purpose generators available on the market.
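
        GenEx itself is a C++ package; the toy accept-reject loop below, written in Python, only illustrates how a generator core can be expressed against a user-supplied matrix element and acceptance cut, which is the kind of adaptability referred to above. The one-dimensional "phase space" and the exponential |M|^2 are invented for the example.

        import math
        import random

        def generate(matrix_element_sq, in_acceptance, weight_max, n_events):
            """Accept-reject sampling against a user-supplied squared matrix element."""
            events = []
            while len(events) < n_events:
                t = random.uniform(0.0, 2.0)          # toy phase space: |t| in [0, 2] GeV^2
                if not in_acceptance(t):
                    continue
                if random.uniform(0.0, weight_max) < matrix_element_sq(t):
                    events.append(t)
            return events

        events = generate(matrix_element_sq=lambda t: math.exp(-4.0 * t),
                          in_acceptance=lambda t: t > 0.05,
                          weight_max=1.0,
                          n_events=1000)
        print(f"generated {len(events)} events, <|t|> = {sum(events) / len(events):.3f}")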

        Speaker: Mateusz Jacek Goncerz (AGH University of Science and Technology (PL))
      • 371
        Grid production with the ATLAS Event Service

        ATLAS has developed and previously presented a new computing architecture, the Event Service, that allows real-time delivery of fine-grained workloads which process
        dispatched events (or event ranges) and immediately stream outputs.
        The principal aim was to profit from opportunistic resources such as commercial
        cloud, supercomputing, and volunteer computing, and otherwise unused cycles on clusters and grids.
        During the development and deployment phase, its utility also on the grid and conventional clusters for the exploitation of otherwise unused cycles became apparent.
        Here we describe our experience commissioning the Event Service on the grid in the ATLAS
        production system. We study the performance compared with standard simulation production.
        We describe the integration with the ATLAS data management
        system to ensure scalability and compatibility with object stores.
        Finally, we outline the remaining steps towards a fully commissioned system.

        Speaker: Esteban Fullana Torregrosa (Univ. of Valencia and CSIC (ES))
      • 372
        GRID-based off-line infrastructure for the PADME experiment at the DAFNE BTF

        The long-standing problem of reconciling the cosmological evidence for the existence of dark matter with the lack of any clear experimental observation of it has recently revived the idea that the new particles are not directly connected with the Standard Model gauge fields, but only through mediator fields or "portals", connecting our world with new "secluded" or "hidden" sectors. One of the simplest models just adds an additional U(1) symmetry, with its corresponding vector boson A'.

        At the end of 2015 INFN formally approved a new experiment, PADME (Positron Annihilation into Dark Matter Experiment), to search for invisible decays of the A' at the DAFNE BTF in the INFN Frascati National Laboratories (LNF). The experiment is designed to detect dark photons produced in positron on fixed target annihilations ($e^+e^-\to \gamma A'$) decaying to dark matter by measuring the final state missing mass.

        The collaboration aims to complete the design and construction of the experiment by April 2018 and to collect $\sim 10^{13}$ positrons on target by the end of 2018, thus allowing the $\epsilon \sim 10^{-3}$ sensitivity to be reached up to a dark photon mass of $\sim 24$ MeV/c$^2$.

        Data coming from the experiment’s DAQ system, currently estimated at a total of 300 TB of raw data for the first year of running, will be transferred to the GRID-interfaced storage system of the Tier 2 site at INFN LNF. Here the data will be reconstructed and analyzed using GRID resources initially located at INFN LNF, INFN CNAF, and Sofia University. All GRID resources will also be used for the simulation, reconstruction, and analysis of MC events.

        A full copy of the data will be stored to the INFN CNAF tape library with an emergency backup copy on the KLOE2 tape library, located at INFN LNF.

        Given the limited amount of data, of GRID sites, and of available manpower, instead of using one of the powerful but complex GRID infrastructures in use by, e.g., the four LHC experiments, we decided to handle job submissions and data transfers using a set of self-developed services based on standard GRID tools.

        This talk will describe the off-line infrastructure of the PADME experiment with details on data production and handling, on the available CPU and storage resources, and on the services created to handle the whole system.

        Speaker: Alessandro De Salvo (Sapienza Universita e INFN, Roma I (IT))
      • 373
        Improvements to the LHCb software performance testing infrastructure using message queues and big data technologies

        Software is an essential component of the experiments in High Energy Physics. Because it is upgraded on relatively short timescales, software provides flexibility, but at the same time it is susceptible to issues introduced during the development process, which calls for systematic testing. We present recent improvements to LHCbPR, the framework implemented at LHCb to measure the physics and computational performance of complete applications. Such an infrastructure is essential for keeping track of the optimisation activities related to the upgrade of the computing systems, which is crucial to meet the requirements of the LHCb detector upgrade for the next stage of data taking at the LHC. The latest developments in LHCbPR include the use of a messaging system to trigger the tests right after the corresponding software version is built within the LHCb Nightly Builds infrastructure. We will also report on the investigation of using big data technologies in LHCbPR. We have found that tools such as Apache Spark and the Hadoop Distributed File System may significantly improve the functionality of the framework, providing interactive exploration of the test results with efficient data filtering and flexible development of reports.

        Speaker: Maciej Pawel Szymanski (University of Chinese Academy of Sciences (CN))
      • 374
        JADE3

        IceCube is a cubic kilometer neutrino detector located at the South Pole. Data handling has been managed by three separate applications: JADE, JADE North, and JADE Long Term Archive (JADE-LTA). JADE3 is the new version of JADE that merges these diverse data handling applications into a configurable data handling pipeline (“LEGO® Block JADE”). The reconfigurability of JADE3 has enabled specialized handling for real-time astrophysical alerts, a new IceCube data source which originates in the Northern Hemisphere.

        Speaker: PATRICK MEADE (University of Wisconsin-Madison)
      • 375
        JavaScript ROOT v5

        The new version of JSROOT provides a full implementation of the ROOT binary I/O, now including TTree. The powerful JSROOT.TreeDraw functionality provides a simple way to inspect complex data in web browsers directly, without the need to involve ROOT-based code.

        JSROOT is now fully integrated into the Node.js environment. Without binding to any C++ code, one gets direct access to all kinds of ROOT data. JSROOT is available as an NPM package and can be installed with a single command. Node.js is used for extensive testing of the JSROOT functionality itself.

        The TBufferJSON class, which was designed to store object data in JSON format, is now also capable of reading objects from JSON. This greatly simplifies data exchange in client (JSROOT) - server (ROOT C++) applications, where complex object data can now be sent in both directions.

        Recent JSROOT versions provide an interactive graphical representation for the majority of commonly-used ROOT classes.

        Speaker: Serguei Linev (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 376
        Kudu prototypes for the ATLAS EventIndex and the future Event WhiteBoard

        The ATLAS EventIndex has been in operation since the beginning of LHC Run 2 in 2015. Like all software projects, its components have been constantly evolving and improving in performance. The main data store in Hadoop, based on MapFiles and HBase, can work for the rest of Run 2, but new solutions are being explored for the future. Kudu offers an interesting environment, with a mixture of BigData and relational database features, which looks promising at the design level and has been used to build a prototype to measure the scaling capabilities as functions of data input rates, total data volumes, and data query and retrieval rates.
        The Event WhiteBoard will be an extension of the EventIndex functionalities to support the ATLAS Event Service and the concept of Virtual Datasets. These additional requirements are tested on the same Kudu prototype, in order to estimate the system performance and response times for different internal data organisations.
        This talk will report on the selected data schemas and on the current performance measurements with the Kudu prototype.

        Speaker: Zbigniew Baranowski (CERN)
      • 377
        Managing data recovery for Long Term Data Preservation

        In recent years, CNAF has worked on a project of Long Term Data Preservation (LTDP) for the CDF experiment, which ran at Fermilab from 1985 onwards. Part of this project has the goal of archiving data produced during Run I onto recent and reliable storage devices, in order to preserve their availability for further access through non-obsolete technologies. In this paper, we report and explain the work done to manage the process of retrieving the aforementioned data, which were stored on about four thousand 2.5/5 GB 8mm tape cartridges from different manufacturers, widely popular in the nineties. The hardware setup for tape reading is briefly detailed. Particular focus is on describing the in-house software tools and the backend database that have been set up to drive and orchestrate the tape readers and to deal with the high number of possible problems arising while reading data from barely reliable media. The outcome of each operation is recorded in the database, making it possible to monitor the overall progress and to retry unsuccessful read attempts at a later stage. The implemented solution has proved effective at reading a first 20% of the total amount, and the process is currently ongoing. Even though a few aspects of this work are strictly dependent on how the CDF experiment organized its datasets, we believe that several of the decisions taken and the overall organization still make sense for a variety of use cases where a significant amount of data has to be retrieved from obsolete media.

        Speaker: Dr Stefano Dal Pra (INFN)
      • 378
        Optimising XRootD access to erasure coded object stores

        Since the start of 2017, the RAL Tier-1’s Echo object store has been providing disk storage to the LHC experiments. Echo provides access via both the GridFTP and XRootD protocols. GridFTP is primarily used for WAN transfers between sites while XRootD is used for data analysis.

        Object stores, and those using erasure coding in particular, are designed to efficiently serve entire objects which are normally assumed to be small (a few MB). Some experiment workflows fit this use case extremely well; however, certain workflows have been shown to be extremely inefficient when run directly against Echo. To solve this problem, XRootD caches were deployed. This paper describes the performance testing of various configurations of both disk and memory caches.

        Access to Echo happens via gateways. Echo has a small number of dedicated gateway machines that provide external connectivity. Jobs running on the RAL batch farm can access Echo through an XRootD gateway running inside a container on every worker node. This paper describes the setup of these worker node gateways which also provide a cache and the ability to redirect requests in the event a file is not found.

        For CMS, a dedicated XCache service has been provided to allow remote jobs to access data directly from Echo. This paper describes the setup and testing of this service.

        Speakers: Alastair Dewhurst (STFC-Rutherford Appleton Laboratory (GB)), Rob Appleyard (STFC)
      • 379
        Optimizing access to conditions data in ATLAS event data processing

        The processing of ATLAS event data requires access to conditions data which is stored in database systems. This data includes, for example, alignment, calibration, and configuration information which may be characterized by large volumes, diverse content, and/or information which evolves over time as refinements are made in those conditions. Additional layers of complexity are added by the need to provide this information across the worldwide ATLAS computing grid and the sheer number of simultaneously executing processes on the grid, each demanding a unique set of conditions to proceed. Distributing this data to all the processes that require it in an efficient manner has proven to be an increasing challenge with the growing needs and numbers of event-wise tasks.
        In this presentation, we briefly describe the systems in which we have collected information about the use of conditions in event data processing. We then proceed to explain how this information has been used not only to refine reconstruction software and job configuration but also to guide modifications of the underlying conditions data configuration and, in some cases, rewrites of the data in the database into a more harmonious form for offline usage in the processing of both real and simulated data.

        Speaker: Lorenzo Rinaldi (Universita e INFN, Bologna (IT))
      • 380
        Particle Flow reconstruction in the Level-1 trigger at CMS for the HL-LHC

        With the planned addition of the tracking information in the Level-1 trigger in CMS for the HL-LHC, the algorithms for the Level-1 trigger can be completely reconceptualized. Following the example of offline reconstruction in CMS, which uses complementary subsystem information to mitigate pileup, we explore the feasibility of using Particle Flow-like and pileup-per-particle identification techniques at the hardware trigger level. We present the challenges of adapting these algorithms to the timing and resource constraints of the Level-1 trigger, the first prototype implementations, and the expected performance for physics object reconstruction.

        Speaker: Jennifer Ngadiuba (INFN, Milano)
      • 381
        Performance Analysis of Effective Symbolic Methods for Solving Band Matrix SLEs

        Systems of linear algebraic equations (SLEs) with heptadiagonal (HD), pentadiagonal (PD) and tridiagonal (TD) coefficient matrices arise in many scientific problems. Three symbolic algorithms for solving SLEs with HD, PD and TD coefficient matrices are considered. The only assumption on the coefficient matrix is nonsingularity. These algorithms are implemented using the GiNaC library for C++ and the SymPy library for Python.
        The choice of mathematical methods and software is crucial for the effectiveness of most programs for the computer modelling of processes, especially when solving complex problems of large dimension. Usually, these problems cannot be modelled on ordinary PCs within a reasonable amount of time, which necessitates the use of supercomputers and computer clusters instead. Therefore, the aim of this paper is to investigate the performance characteristics of the considered methods, with their two implementations executed on modern (as of 2018) computer clusters. To that purpose, the experimental setup and the results from the computations conducted on the individual computer systems are presented and discussed.

        Keywords: heptadiagonal matrix SLEs, pentadiagonal matrix SLEs, tridiagonal matrix SLEs, symbolic algorithms, symbolic libraries, HPC, performance analysis.
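
        Purely as an illustration of symbolic solving (the benchmarked algorithms are specialized band-matrix routines, not this generic call), a small tridiagonal system can be solved exactly with SymPy as follows.

        import sympy as sp

        a, b, c, d1, d2, d3 = sp.symbols("a b c d1 d2 d3")

        # A 3x3 tridiagonal coefficient matrix with symbolic entries.
        A = sp.Matrix([[b, c, 0],
                       [a, b, c],
                       [0, a, b]])
        rhs = sp.Matrix([d1, d2, d3])

        solution = A.LUsolve(rhs)       # exact symbolic solution
        print(sp.simplify(solution))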

        Speaker: Milena Veneva (Joint Institute for Nuclear Research)
      • 382
        RDMA-accelerated data transport in ALFA

        ALFA is a modern software framework for simulation, reconstruction and analysis of particle physics experiments. ALFA provides building blocks for highly parallelized processing pipelines required by the next generation of experiments, e.g. the upgraded ALICE detector or the FAIR experiments. The FairMQ library in ALFA provides the means to easily create actors (so-called devices) that communicate via asynchronous message passing. Several data transport implementations in FairMQ exist today to leverage the powers of modern communication technologies. The available implementations are based on ZeroMQ, nanomsg and shared memory; they can be selected at run-time. In order to achieve highest inter-node data throughput and to exploit CPU-offloading features offered by RDMA-capable network fabrics (e.g. Infiniband), a new transport implementation based on the OFA libfabric technology is under development. The new transport implementation as well as first performance evaluations will be presented and discussed.

        Speaker: Dennis Klein (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 383
        Ready to Go Data Transfers: supporting the long tail of science

        The "File Transfer Service" (FTS) has been proven capable of satisfying the equirements – in terms of functionality, reliability and volume – of three major LHC experiments: ATLAS, CMS and LHCb.

        We believe small experiments, or individual scientists, can also benefit from FTS advantages and integrate it into their frameworks, allowing them to effectively outsource the complexities of data transfer.

        For this, we not only provide a running, stable and well-monitored service, but also offer end-to-end, easy-to-run software packages that reduce the entry barrier. These range from containerized, ready-to-use GridFTP/FTP servers to a browser-based solution supporting file transfers from users' PCs/laptops to remote storage/cloud endpoints (known as Last Mile Transfers), not forgetting user assistance and support.

        Two successful examples of one of these use cases are the experiments NA62 and ArDM, both of which now base their data acquisition framework on FTS, moving their data from where it is generated to the CERN Data Center.

        Speaker: Andrea Manzi (CERN)
      • 384
        Remote data access in computational jobs on the ATLAS data grid

        This work describes the technique of remote data access from computational jobs on the ATLAS data grid. In comparison to traditional data movement and stage-in approaches it is well suited for data transfers which are asynchronous with respect to the job execution. Hence, it can be used for optimization of data access patterns based on various policies. In this study, remote data access is realized with the HTTP and WebDAV protocols, and is investigated in the context of intra- and inter-computing site data transfers. In both cases, the typical scenarios for application of remote data access are identified. The paper also presents an analysis of parameters influencing the data goodput between heterogeneous storage element - worker node pairs on the grid.
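
        The basic operation behind remote access of this kind is a partial (ranged) read over HTTP, so that only the requested bytes cross the network instead of the whole file being staged in; a minimal sketch is shown below, with a placeholder URL rather than a real storage endpoint.

        import urllib.request

        URL = "https://storage.example.org/atlas/data/file.root"    # placeholder endpoint

        request = urllib.request.Request(URL, headers={"Range": "bytes=0-1023"})
        with urllib.request.urlopen(request, timeout=30) as response:
            chunk = response.read()
            print(f"HTTP status {response.status}, read {len(chunk)} bytes")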

        Speaker: Volodimir Begy (University of Vienna (AT))
      • 385
        Securing and sharing Elasticsearch resources with ReadonlyREST

        In early 2016 CERN IT created a new project to consolidate and centralise Elasticsearch instances across the site, with the aim of offering a production-quality new IT service to experiments and departments. We'll present the solutions we adopted for securing the system using only open-source tools, which allow us to consolidate up to 20 different use cases on a single Elasticsearch cluster.

        Speaker: Zhechka Toteva (CERN)
      • 386
        Shared Memory Transport for ALFA

        The high data rates expected for the next generation of particle physics experiments (e.g. new experiments at FAIR/GSI and the upgrade of the CERN experiments) call for dedicated attention with respect to the design of the needed computing infrastructure. The common ALICE-FAIR framework ALFA is a modern software layer that serves as a platform for simulation, reconstruction and analysis of particle physics experiments. Besides the standard services needed for simulation and reconstruction of particle physics experiments, ALFA also provides tools for data transport, configuration and deployment. The FairMQ module in ALFA offers building blocks for creating distributed software components (processes) that communicate with each other via message passing.

        The abstract "message passing" interface in FairMQ has at the moment three implementations: ZeroMQ, nanomsg and shared memory. The newly developed shared memory transport will be presented, that provides significant performance benefits for transferring large data chunks between components on the same node. The implementation in FairMQ allows users to switch between the different transports via a trivial configuration change. This presentation will highlight the design decisions, implementation details and performance numbers of the shared memory transport in FairMQ/ALFA.

        Speaker: Alexey Rybalchenko (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 387
        SHiP Spectrometer Optimization using Bayesian optimization with Gaussian processes

        SHiP is a newly proposed fixed-target experiment at the CERN SPS accelerator. The goal of the experiment is to search for hidden particles predicted by models of Hidden Sectors. The purpose of the SHiP Spectrometer Tracker is to reconstruct the tracks of charged particles from the decay of neutral New Physics objects with high efficiency. The track reconstruction efficiency depends on the spectrometer geometry, whose parameters can be optimized to achieve a higher efficiency. One of the most popular optimization methods is grid search. This method is reasonable when the number of parameters is small, but the grid size grows exponentially with the number of parameters, requiring large computational resources to check all grid nodes. This is especially important when checking a single node is expensive and requires running the spectrometer simulation. Bayesian optimization is a method for optimizing expensive objective functions: at each iteration it estimates a new point to check where the objective function may have its optimum, which allows the optimum to be found with fewer evaluations than a grid search. In this study the optimization of the SHiP Spectrometer Tracker using Bayesian optimization with Gaussian processes is considered. The study has been done on MC data, and the results of the optimization are presented.
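
        The following minimal sketch (illustrative only, using scikit-optimize rather than the actual SHiP tooling) shows such a Bayesian-optimization loop with a Gaussian-process surrogate; the objective function and parameter names are hypothetical stand-ins for the expensive spectrometer simulation:

          # Illustrative Bayesian optimization with a Gaussian process (scikit-optimize).
          from skopt import gp_minimize
          from skopt.space import Real

          def negative_efficiency(params):
              """Hypothetical placeholder: run the simulation for the geometry described
              by `params` and return minus the track reconstruction efficiency."""
              x, y = params
              return -(1.0 - 0.1 * (x - 2.0) ** 2 - 0.05 * (y - 1.0) ** 2)  # toy surrogate

          search_space = [Real(0.5, 5.0, name="station_spacing_m"),
                          Real(0.2, 3.0, name="magnet_gap_m")]

          result = gp_minimize(negative_efficiency, search_space,
                               n_calls=30,        # far fewer evaluations than a full grid
                               random_state=42)
          print("best geometry:", result.x, "efficiency:", -result.fun)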

        Speaker: Mikhail Hushchyn (Yandex School of Data Analysis (RU))
      • 388
        The ATLAS Wide-Range Database & Application Monitoring

        In HEP experiments at the LHC the database applications often become complex by reflecting the ever-demanding requirements of the researchers. The ATLAS experiment has several Oracle DB clusters with over 216 database schemas, each with its own set of database objects. To effectively monitor them, we designed a modern and portable application with exceptionally good characteristics. Some of them include: a concise view of the most important DB metrics; top SQL statements based on CPU, executions, block reads, etc.; volume growth plots per schema and DB object type; a database jobs section with signaling for problematic ones; and in-depth analysis in case of contention on data or processes.
        This contribution also describes the technical aspects of the implementation. The project can be separated into three independent layers. The first layer consists of highly optimized database objects hiding all complicated calculations. The second layer is a server providing REST access to the underlying database backend. The third layer is a JavaScript/AngularJS web interface. In addition, we will summarize the continuous integration cycle of the application, which uses gitlab-ci pipelines for basic testing, containerization and deployment on the CERN Openshift infrastructure.
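
        As an illustration of how the second (REST) layer can sit on top of the optimized database objects of the first layer, the sketch below exposes a per-schema volume metric; the framework choice, endpoint, view name and credentials are illustrative assumptions, not the actual implementation:

          # Minimal sketch of a REST layer exposing a database metric; all names are hypothetical.
          from flask import Flask, jsonify
          import cx_Oracle  # assumes an Oracle client is available

          app = Flask(__name__)

          @app.route("/api/schemas/<schema>/volume")
          def schema_volume(schema):
              # The heavy lifting is hidden in an optimized DB-side view (first layer);
              # the REST layer just queries it and serializes the result.
              conn = cx_Oracle.connect("monitor_user/secret@atlas-db")  # placeholder DSN
              try:
                  cur = conn.cursor()
                  cur.execute("SELECT object_type, gigabytes FROM volume_summary "
                              "WHERE owner = :1", [schema.upper()])
                  rows = [{"object_type": t, "gigabytes": g} for t, g in cur]
              finally:
                  conn.close()
              return jsonify(schema=schema, volume=rows)

          if __name__ == "__main__":
              app.run(port=8080)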

        Speaker: Petya Tsvetanova Vasileva (CERN)
      • 389
        The GridKa Tape System: monitoring and failure analysis

        A tape system usually comprises many tape drives, several thousand or even tens of thousands of cartridges, robots, software applications and the machines running these applications. All involved components are able to log failures and statistical data. However, correlating them is a laborious and ambiguous process, and a wrong interpretation can easily result in a wrong decision. A single defective drive or cartridge can silently put the data on many other cartridges at stake, so it is extremely important to discover problems as early as possible. The longer it takes to identify and isolate a defective component, the more extensive the damage. To be able to take the right decision at the right time, an effective monitoring system is essential. But how effective is the monitoring system? Is there any off-the-shelf software which can do the whole work for you? This paper gives an insight into the failures of tape drives and tape cartridges we have been experiencing over the years at GridKa and the procedures we have developed to keep the data on tape as safe as possible.

        Speaker: Mr Dorin Lobontu (Karlsruhe Institut of Technology)
      • 390
        Using Lustre and SLURM to process Hadoop workloads and extending to the WLCG

        The Queen Mary University of London Grid site has investigated the use of its Lustre file system to support Hadoop workflows using the newly open-sourced Hadoop adaptor for Lustre. Lustre is an open-source, POSIX-compatible, clustered file system often used in high performance computing clusters and is often paired with the SLURM batch system, as it is at Queen Mary. Hadoop is an open-source software framework for distributed storage and processing of data, normally run on dedicated hardware utilising the HDFS file system and Yarn batch system. Hadoop is an important modern tool for data analytics used by a large range of organisations including CERN. By using our existing Lustre file system and SLURM batch system we remove the need for dedicated hardware and only have to maintain a single platform for data storage and processing. The motivation and benefits of using Lustre rather than HDFS with Hadoop are presented. The installation, benchmarks and real-world performance are presented and future plans discussed. We also investigate using the standard WLCG grid middleware services (Cream CE, Storm SRM) to provide a grid-enabled Hadoop service.

        Speaker: Daniel Peter Traynor (University of London (GB))
      • 391
        Using Machine Learning for Precision Measurements

        The use of machine learning techniques for classification is well established. They are applied widely to improve the signal-to-noise ratio and the sensitivity of searches for new physics at colliders. In this study I explore the use of machine learning for optimizing the output of high precision experiments by selecting the variables most sensitive to the quantity being measured. The precise determination of the electroweak mixing angle at the Large Hadron Collider using random forests is developed as a test study.
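
        A toy sketch of the idea (not the analysis code): train a random forest on candidate observables and use its feature importances to rank which variables are most sensitive to the quantity being measured; all names and numbers below are illustrative:

          # Illustrative variable ranking with a random forest (scikit-learn).
          import numpy as np
          from sklearn.ensemble import RandomForestRegressor

          rng = np.random.default_rng(0)
          n = 20_000
          true_param = rng.uniform(0.22, 0.24, n)              # toy "measured quantity"
          features = np.column_stack([
              true_param + rng.normal(0, 0.002, n),            # strongly sensitive observable
              0.5 * true_param + rng.normal(0, 0.01, n),       # weakly sensitive observable
              rng.normal(0, 1, n),                             # uninformative observable
          ])

          forest = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
          forest.fit(features, true_param)
          for name, imp in zip(["fwd_bwd_asymmetry", "lepton_angle", "noise"],
                               forest.feature_importances_):
              print(f"{name}: importance {imp:.3f}")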

        Speaker: Dr Dimitri Bourilkov (University of Florida)
      • 392
        Using ZFS to manage Grid storage and improve middleware resilience

        ZFS is a powerful storage management technology combining filesystem, volume management and software RAID technology into a single solution. The WLCG Tier-2 computing at Edinburgh was an early adopter of ZFS on Linux, with this technology being used to manage all of our storage systems, including servers with aging components. Our experiences of ZFS deployment have been shared with the Grid storage community, which has led to additional sites adopting this technology.

        ZFS is highly configurable, allowing systems to be tuned to give the best performance under diverse workloads. This paper highlights our experiences in tuning our systems for best performance when combining ZFS with DPM storage management. This has resulted in reduced system load and better data throughput. This configuration also provides the high redundancy required for managing older storage servers. We also demonstrate how ZFS can be combined with Distributed Replicated Block Device (DRBD) technology to provide a performant and resilient hypervisor solution to host multiple production Grid services.

        Speakers: Robert Andrew Currie (The University of Edinburgh (GB)), Teng LI (Shandong University, China)
      • 393
        Xcache in the ATLAS Distributed Computing Environment

        Built upon the XRootD Proxy Cache (Xcache), we developed additional features to adapt it to the ATLAS distributed computing and data environment, especially its data management system Rucio, in order to help improve the cache hit rate, as well as features that make Xcache easy to use, similar to the way the Squid cache is used for the HTTP protocol. We packaged the software in CVMFS and in Singularity containers in order to standardize the deployment and reduce the cost of resolving issues at remote sites. We are also integrating it into Rucio as a volatile storage system, and into various ATLAS workflows such as user analysis and event streaming services.

        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 394
        XRootD Erasure Coding Plugin

        XRootD is a distributed, low-latency file access system with its own communication protocol and a scalable, plugin-based architecture. It is the primary data access framework for the high-energy physics community, and the backbone of the EOS service at CERN.

        In order to bring the potential of Erasure Coding (EC) to the XrootD / EOS ecosystem an effort has been undertaken to implement a native EC XrootD plugin based on the Intel Storage Acceleration Library (ISAL). In this contribution we discuss the architecture of the plugin, carefully engineered in order to enable low latency data streaming and 2D erasure coding. Moreover, we give an overview of the data repair and placement strategies. Finally, we report on the observed performance of the EC plugin and the project outcomes.
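
        As a toy illustration of the erasure-coding principle only (the actual plugin uses Reed-Solomon codes from ISAL and supports 2D coding), even a single XOR parity chunk already allows one lost data chunk to be rebuilt:

          # Toy single-parity erasure coding: k data chunks plus one XOR parity chunk.
          def encode(block: bytes, k: int):
              """Split `block` into k equal data chunks plus one XOR parity chunk."""
              size = -(-len(block) // k)                                 # ceil division
              chunks = [block[i*size:(i+1)*size].ljust(size, b"\0") for i in range(k)]
              parity = bytes(size)                                       # all zeros
              for c in chunks:
                  parity = bytes(p ^ x for p, x in zip(parity, c))
              return chunks, parity

          def repair(chunks, parity, lost_index):
              """Rebuild one missing data chunk from the surviving chunks and the parity."""
              rebuilt = parity
              for i, c in enumerate(chunks):
                  if i != lost_index:
                      rebuilt = bytes(r ^ x for r, x in zip(rebuilt, c))
              return rebuilt

          chunks, parity = encode(b"some large object stored in EOS", k=4)
          assert repair(chunks, parity, lost_index=2) == chunks[2]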

        Speakers: Michal Kamil Simon (CERN), Andreas Joachim Peters (CERN)
      • 395
        XRootD plug-in based solutions for site specific requirements

        XRootD has been established as a standard for WAN data access in HEP and HENP. Site-specific features, like those existing at GSI, have historically been hard to implement with native methods. XRootD allows basic functionality of native XRootD functions to be replaced by custom implementations through the use of plug-ins; XRootD clients have supported this since version 4.0. In this contribution, our XRootD-based developments motivated by their use in the current ALICE Tier 2 Centre at GSI and the upcoming ALICE Analysis Facility will be shown. Among other things, an XRootD redirector plug-in which redirects local clients directly to a shared filesystem, as well as the needed changes to the XRootD base code, which are publicly available since XRootD version 4.8.0, will be presented. Furthermore, a prototype for an XRootD-based disk caching system for opportunistic resources has been developed.

        Speaker: Jan Knedlik (GSI Helmholtzzentrum für Schwerionenforschung)
    • Plenary: S4 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Graeme Stewart
      • 396
        ATLAS and CMS Trigger and Data Acquisition Upgrades for the High Luminosity LHC

        The ATLAS and CMS experiments at CERN are planning a second phase of upgrades to prepare for the "High Luminosity LHC", with collisions due to start in 2026. In order to deliver an order of magnitude more data than previous runs, protons at 14 TeV center-of-mass energy will collide with an instantaneous luminosity of 7.5 x 10^34 cm^-2 s^-1, resulting in much higher pileup and data rates than the current experiments were designed to handle. While this is essential to realise the physics programme, it is a huge challenge for the detectors, trigger, data acquisition and computing. The detector upgrades themselves also present new requirements and opportunities for the trigger and data acquisition systems. Based on the written ATLAS Technical Design Report and the CMS DAQ and L1 interim Technical Design Reports, the ATLAS baseline and CMS preliminary designs of the TDAQ upgrades will be described. The ATLAS system comprises: a hardware-based low-latency real-time Trigger, Data Acquisition which combines custom readout with commodity hardware and networking, and an Event Filter which combines offline-like algorithms on a large commodity compute service augmented by fast hardware tracking. The CMS trigger system is designed in two levels, with the first level including tracking information from the Outer Tracker for the first time. Throughout the system, the use of precision algorithms running on FPGAs or commodity hardware is pushed to lower latencies and higher rates than before. Precision calorimeter reconstruction with offline-style clustering and jet-finding in FPGAs, and track reconstruction in Associative Memory and FPGAs, are used to combat pileup in the Trigger. The physics motivation and expected performance will be shown for key physics processes.

        Speaker: Imma Riu (IFAE Barcelona (ES))
      • 397
        HEP strategy on multithreading / vectorisation
        Speaker: Gerhard Raven (Natuurkundig Laboratorium-Vrije Universiteit (VU)-Unknown)
      • 398
        Community driven common tracking software - the ACTS project

        The reconstruction of particle trajectories is one of the most complex and CPU-intensive tasks of event reconstruction at current LHC experiments. The growing particle multiplicity stemming from an increasing number of instantaneous collisions, as foreseen for the upcoming high luminosity upgrade of the LHC (HL-LHC) and future hadron collider studies, will intensify this problem significantly. In addition, the HL-LHC will see a factor 10 in data rates to be processed and written to disk. The well-performing track reconstruction programs of the current LHC experiments have successfully served the event processing of several billions of events. However, parts of the software concepts and code base stem from the pre-LHC era and need to undergo significant modernization, most prominently the preparation for concurrent code execution. Such disruptive changes to the software structure are almost impossible to carry out within the existing software stacks of the experiments, as these need to maintain their day-to-day performance.
        ACTS (A Common Tracking Software) was created in 2016 in order to serve as a detector-independent development platform for track reconstruction by and for the community. It was initiated by the need of ATLAS to develop a modern track reconstruction setup for the HL-LHC and to serve at the same time as the backbone of the FCC-hh event reconstruction. It is an open-source toolkit initially based on code and concepts of the ATLAS track reconstruction software, and has attracted an increasing number of both developers and clients from the HEP community. Software development of common modules to be used by several experiments has been identified by the HEP Software Foundation (HSF) as one of the necessary steps towards maintaining and managing the future computing challenges, as published in the community white paper released in 2017. Consequently, ACTS has been embedded into the HSF structure. The open access policy of ACTS allows and encourages contributions from across and outside the field, including inputs from the growing research sector of machine learning or, e.g., as a supported HSF project for the Google Summer of Code 2018. We present the structure and concepts of ACTS, the modern open-source workflow including extensive unit testing, and benchmark performance examples for some of the main ACTS clients.

        Speaker: Andreas Salzburger (CERN)
    • Plenary: S5 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Simone Campana (CERN)
      • 399
        The HEP Software Foundation Community White Paper

        Most HEP experiments coming in the next decade will have computing requirements that cannot be met by adding more hardware (HL-LHC, FAIR, DUNE...). A major software re-engineering and more
        collaboration between experiments on software development are needed. This was the reason for setting up the HEP Software Foundation (HSF) in 2015. In 2017, the HSF published "A Roadmap for HEP Software and Computing R&D for the 2020s", a Community White Paper (CWP) which is the result of a one-year community-wide discussion process.

        This presentation will highlight the main topics covered by the CWP and describe the main actions that have been started since the beginning of the year, based on the proposed roadmap.

        Speaker: Michel Jouvin (Université Paris-Saclay (FR))
      • 400
        Belle II at the start of data taking

        The Belle II experiment is taking first collision data in 2018. This is an exciting time for the collaboration, as it allows not only the performance of the accelerator and detector to be assessed, but also that of the computing system and the software. Is Belle II ready to quickly process the data and produce physics results? Which parts are well prepared and where do we have to invest more effort? The collaboration was founded ten years ago in 2008. At that time a broader discussion about software and computing also started. The software development history involves (revisions of) decisions about the framework, the event data model, the geometry package, the conditions data handling, the development infrastructure, the coding conventions, and the release procedures. We will take a look back, assess the decisions from the current point of view, and share the lessons learned, including those from first data analyses. Based on this experience, we will also anticipate further challenges and present the plans to resolve them.

        Speaker: Thomas Kuhr
      • 401
        The SKA project and development of Regional Data Centres
        Speaker: Rosie Bolton
      • 402
        The DAQ systems of the DUNE Prototypes at CERN

        DUNE will be the world's largest neutrino experiment, due to take data in 2025. Here we describe the data acquisition (DAQ) systems for both of its prototypes, ProtoDUNE single-phase (SP) and ProtoDUNE dual-phase (DP), due to take data later this year. The ProtoDUNEs also break records as the largest beam test experiments yet constructed, and are the fundamental elements of CERN's Neutrino Platform. This renders each ProtoDUNE an experiment in its own right, and the design and construction have been chosen to meet this scale. Due to the aggressive timescale, off-the-shelf electronics have been chosen to meet the demands of the experiments where possible. The ProtoDUNE-SP cryostat comprises two primary subdetectors: a single-phase liquid argon TPC and a companion Photon Detector. The TPC has two candidate readout solutions under test in ProtoDUNE-SP: RCE (ATCA-based) and FELIX (PCIe-based). Fermilab's artDAQ is used as the dataflow software for the single-phase experiment. ProtoDUNE-DP will read out the dual-phase liquid argon detector using a microTCA readout solution. Timing and trigger electronics and software are also described. Compression and triggering will take ~480 Gb/s (SP) and ~130 Gb/s (DP) of data from the front-end and reduce it to the 20 Gb/s (each) bandwidth to permanent data storage in CERN's EOS infrastructure. This paper describes the design and implementation of the TDAQ systems as well as first measurements of their performance.

        Speaker: Karol Hennessy (University of Liverpool (GB))
    • 11:00
      Break
    • T1 - Online computing: S5 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Catrin Bernius (SLAC National Accelerator Laboratory (US))
      • 403
        The Belle II Online Reconstruction and Software Trigger

        The Belle II detector is currently being commissioned for operation in early 2018. It is designed to record collision events with an instantaneous luminosity of up to 8 x 10^35 cm^-2 s^-1, which is delivered by the SuperKEKB collider in Tsukuba, Japan. Such a large luminosity is required to significantly improve the precision of measurements of B and D meson and Tau lepton decays, to probe for signs of physics beyond the standard model.

        After a fast, hardware-based level 1 trigger stage, a high-level trigger (HLT) system is used to combine the measurement signals from the sub-detectors. This HLT comprises an x86-based computing cluster of up to 20 units with 320 cores per unit for highly parallel event processing. A software-based reconstruction is performed to fully reject or prescale background processes and to ensure that relevant signal channels are stored for later analysis. Furthermore, the regions of the innermost pixel detector which should be read out are determined using tracks found in the outer detectors and fed back to the ONSEN read-out system.

        The most challenging aspect of a software trigger at Belle II is the large beam-induced background due to the much increased instantaneous luminosity compared to Belle.

        This contribution will report on the design of the online reconstruction software, the properties and performance of the event selection, and on first results and insights gained during the 2018 data taking with the Belle II HLT and online reconstruction.

        Speaker: Thomas Hauth (KIT - Karlsruhe Institute of Technology (DE))
      • 404
        The ATLAS Trigger in 2017 - improvements, performance and challenges

        The ATLAS Trigger system has been operating successfully during 2017; its excellent performance has been vital for the ATLAS physics program.

        The trigger selection capabilities of the ATLAS detector have been significantly enhanced for Run-2 compared to Run-1, in order to cope with the higher event rates and with the large number of simultaneous interactions (pile-up). The improvements at Level-1 resulted in more pile-up-robust selection efficiencies and event rates. A new hardware system, designed to analyse event-topologies at Level-1, came to full use in 2017. A hardware-based track reconstruction system, expected to be inaugurated in early 2018, is designed to provide track information to the high-level software trigger at its full input rate. The high-level trigger selections were replaced with more offline-like reconstruction techniques. In the middle of 2017 an unexpected change in the LHC operating conditions resulted in much higher pile-up and a much-increased CPU usage of the software trigger and required adaptation of the software and the general trigger design.

        This presentation gives a brief yet comprehensive review of the dramatic improvements of the ATLAS trigger system in 2017. Special focus will be put on the significant CPU usage improvements that were made in response to the high pileup conditions at the LHC.

        Speaker: Alex Christopher Martyniuk (University College London)
      • 405
        Event reconstruction of free-streaming data for the RICH detector in the CBM experiment

        The Compressed Baryonic Matter (CBM) experiment at the future FAIR facility requires fast and efficient event reconstruction algorithms. CBM will be one of the first HEP experiments to operate in a triggerless mode: data received by the DAQ from the detectors will no longer be associated with events by a hardware trigger. Instead, all raw data within a given period of time will be collected continuously into containers, so-called time-slices. The task of the reconstruction algorithms is to create events out of this raw data stream.

        In this contribution, the adaptation of the RICH detector reconstruction software to the free-streaming data flow is presented. We present concepts and implementations for adapting the classical event-based reconstruction to the free-streaming mode. Two possible options are discussed, namely: hit preselection in a time window, and ring reconstruction with the time measurement as an additional parameter. The possibility of event overlaps is discussed in detail.
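
        A toy sketch of the first option (hit preselection in a time window) on a simulated time-slice; window size, thresholds and names are illustrative assumptions, not the CBM implementation:

          # Toy time-window hit preselection: group hits wherever a window contains
          # enough of them, and hand each group to the ring-reconstruction step.
          import numpy as np

          def preselect_hits(hit_times_ns, window_ns=5.0, min_hits=8):
              """Return index arrays of hits falling in dense time windows."""
              order = np.argsort(hit_times_ns)
              t = hit_times_ns[order]
              groups, i = [], 0
              while i < len(t):
                  j = np.searchsorted(t, t[i] + window_ns, side="right")
                  if j - i >= min_hits:
                      groups.append(order[i:j])   # candidate event: pass to ring finder
                      i = j                       # jump past this window
                  else:
                      i += 1
              return groups

          # time-slice with two bursts of RICH hits on top of uniform background
          rng = np.random.default_rng(1)
          times = np.concatenate([rng.uniform(0, 1000, 50),
                                  rng.normal(200, 1.0, 30),
                                  rng.normal(750, 1.0, 25)])
          print(len(preselect_hits(times)), "candidate windows selected")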

        Speaker: Dr Semen Lebedev (Justus Liebig University Giessen)
      • 406
        Overview of the HL-LHC Upgrade for the CMS Level-1 Trigger

        The High-Luminosity LHC will open an unprecedented window on the weak-scale nature of the universe, providing high-precision measurements of the standard model as well as searches for new physics beyond the standard model. Such precision measurements and searches require information-rich datasets with a statistical power that matches the high-luminosity provided by the Phase-2 upgrade of the LHC. Efficiently collecting those datasets will be a challenging task, given the harsh environment of 200 proton-proton interactions per LHC bunch crossing. For this purpose, CMS is designing an efficient data-processing hardware Level-1 trigger that will include tracking information and high-granularity calorimeter information. The current conceptual system design is expected to take full advantage of advances in FPGA and link technologies over the coming years, providing a high-performance, low-latency computing platform for large throughput and sophisticated data correlation across diverse sources.

        Speaker: Rick Cavanaugh (University of Illinois at Chicago (US))
      • 407
        Implementation of the ATLAS trigger within the ATLAS Multi-Threaded Software Framework AthenaMT

        We present an implementation of the ATLAS High Level Trigger (HLT)
        that provides parallel execution of trigger algorithms within the
        ATLAS multi-threaded software framework, AthenaMT. This development
        will enable the HLT to meet future challenges from the evolution of
        computing hardware and upgrades of the Large Hadron Collider (LHC) and
        ATLAS Detector. During the LHC data-taking period starting in 2021,
        luminosity will reach up to three times the original design value. In
        the following data-taking period (2026) upgrades to the ATLAS trigger
        architecture will increase the HLT input rate by a factor of 4-10,
        while the luminosity will increase by a further factor of 2-3.

        AthenaMT provides a uniform interface for offline and trigger
        algorithms, facilitating the use of offline code in the HLT.
        Trigger-specific optimizations provided by the framework include early
        event rejection and reconstruction within restricted geometrical
        regions. We report on the current status, including experience of
        migrating trigger selections to this new framework, and present the
        next steps towards a full implementation of the redesigned ATLAS
        trigger.

        Speaker: Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))
      • 408
        A 30 MHz software trigger for the LHCb upgrade

        The first LHCb upgrade will take data at an instantaneous luminosity of $2\times10^{33}\mathrm{cm}^{-2}s^{-1}$ starting in 2021. Due to the high rate of beauty and charm signals LHCb will read out the entire detector into a software trigger running on commodity hardware at the LHC collision frequency of 30 MHz. In this talk we present the challenges of triggering in the MHz signal era. We pay particular attention to the need for flexibility in the selection and reconstruction of events without sacrificing performance.

        Speaker: Rosen Matev (CERN)
    • T2 - Offline computing: S5 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Lucia Grillo (University of Manchester (GB))
      • 409
        DyTER - A framework for Dynamic Track and Event Reconstruction

        The upcoming PANDA experiment at FAIR in Darmstadt, Germany, will belong to a new generation of accelerator-based experiments relying exclusively on software filters for data selection. Due to the similarity of signal and background as well as the multitude of investigated physics channels, this paradigm shift is driven by the need for full and precise information from all detectors in order to perform a reliable data selection. In our experiment, this software trigger will need to cope with incoming event rates of up to 20 MHz, corresponding to a raw data rate of up to 200 GB/s.
        At Uppsala University, our PANDA group’s physics interest is focused on hyperon reactions. With respect to the reconstruction algorithms, hyperons impose a particular challenge due to their relatively long-lived nature. This causes their decay vertices to be separated from the beam-target interaction point by up to several metres. In order to filter interesting data, Uppsala is, in collaboration with other international groups, developing a framework for Dynamic Track and Event Reconstruction (DyTER). DyTER draws upon a variety of reconstruction algorithms, such as a cellular automaton and pattern matching. This presentation will give an overview of the general concept as well as showcase the current development.

        Speaker: Michael Papenbrock (Uppsala University)
      • 410
        Multi-threaded Output in CMS using ROOT

        CMS has worked aggressively to make use of multi-core architectures, routinely running 4 to 8 core production jobs in 2017. The primary impediment to efficiently scaling beyond 8 cores has been our ROOT-based output module, which has been necessarily single threaded. In this presentation we explore the changes made to the CMS framework and our ROOT output module to overcome the previous scaling limits, using two new ROOT features: the TBufferMerger asynchronous file merger, and Implicit Multi-Threading. We examine the architecture of the new parallel output module, the specific accommodations and modifications that were made to ensure compatibility with the CMS framework scheduler, and the performance characteristics of the new output module.

        Speaker: Christopher Jones (Fermi National Accelerator Lab. (US))
      • 411
        High Performance Computing applied to the Cherenkov Telescope Array data analysis

        The Cherenkov Telescope Array (CTA) is the next generation of ground-based gamma-ray telescopes for gamma-ray astronomy. Two arrays will be deployed, composed of 19 telescopes in the Northern hemisphere and 99 telescopes in the Southern hemisphere. Observatory operations are planned to start in 2021, but first data from prototypes should be available already in 2019. Due to its very high sensitivity, CTA will record a tremendous amount of data that represents a computing challenge to the reconstruction software. We develop High Performance Computing (HPC) solutions able to tackle these challenges and to perform the reconstruction of CTA raw data, maximizing the usage of computing resources and thus minimizing their cost. The developed algorithms are organized in a reduction pipeline.
        Here we present this pipeline, including the compression and decompression of data with our HPC compression algorithm, a data format automatically generated with HPC features, the computation of Hillas parameters, and their application to event reconstruction. Physics and computing performance are presented and compared to the state of the art.

        Speaker: Dr Thomas Vuillaume (LAPP, CNRS, Univ. Savoie Mont-Blanc)
      • 412
        Data Preparation for NA62

        In 2017, NA62 recorded over a petabyte of raw data, collecting around a billion events per day of running. Data are collected in bursts of 3-5 seconds, producing output files of a few gigabytes. A typical run, a sequence of bursts with the same detector configuration and similar experimental conditions, contains 1500 bursts and constitutes the basic unit for offline data processing. A sample of 100 random bursts is used to make timing calibrations of all detectors, after which every burst in the run is reconstructed. Finally the reconstructed events are filtered by physics channel with an average reduction factor of 20, and data quality metrics are calculated.

        Initially a bespoke data processing solution was implemented using a simple finite state machine with limited production system functionality. In 2017, the ATLAS Tier-0 team offered the use of their production system, together with the necessary support. Data processing workflows were rewritten with better error-handling and IOPs were minimised, the reconstruction software was improved and conditions data handling was changed to follow best practices suggested by the HSF conditions database working group. This contribution describes the experience gained in using these tools and methods for data-processing on a petabyte scale experiment.

        Speaker: Paul James Laycock (CERN)
      • 413
        Teaching PROFESSOR New Math

        We present a range of conceptual improvements and extensions to the popular
        tuning tool "Professor".

        Its core functionality remains the construction of multivariate analytic
        approximations to an otherwise computationally expensive function. A typical
        example would be histograms obtained from Monte-Carlo (MC) event generators for
        standard model and new physics processes.

        The fast Professor model then allows for numerical optimisation in a number of
        different contexts such as chi-square minimisation and likelihood evaluation.

        Previously, Professor was based on ordinary polynomials. Those, albeit highly
        successful, showed limitations whenever the true functional form exhibited some
        form of 1/x behaviour (e.g. due to masses in propagators). We describe our efforts to
        replace the polynomials with rational, or "Pade", approximations as well as
        radial basis functions (RBF).

        Further, we introduce a new and better optimization routine that replaces the
        gradient-based optimization inside Professor by an RBF-based approach that can
        be shown to generate superior parameter fits.

        We illustrate our improvements for the task of MC-generator tuning and limit setting.
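
        A toy illustration of the motivation (not the Professor code): approximating a response with 1/x-like behaviour from a few anchor points, comparing an ordinary polynomial with a radial basis function surrogate; the "observable" below is an invented stand-in for an MC-generator response:

          # Polynomial vs. RBF surrogate for a pole-like response (illustrative only).
          import numpy as np
          from scipy.interpolate import Rbf

          def expensive_observable(m):          # stand-in for an MC-generator response
              return 1.0 / (m - 0.5) + 0.1 * m  # 1/x behaviour from a mass in a propagator

          m_train = np.linspace(1.0, 5.0, 15)   # sparse "anchor" points in parameter space
          y_train = expensive_observable(m_train)

          poly = np.poly1d(np.polyfit(m_train, y_train, deg=3))    # ordinary polynomial
          rbf = Rbf(m_train, y_train, function="multiquadric")     # RBF surrogate

          m_test = np.linspace(1.0, 5.0, 200)
          truth = expensive_observable(m_test)
          print("polynomial max error:", np.max(np.abs(poly(m_test) - truth)))
          print("RBF max error:       ", np.max(np.abs(rbf(m_test) - truth)))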

        Speaker: Dr Holger Schulz (Fermi National Accelerator Laboratory)
      • 414
        Monte Carlo Generators in CMS for Run II and the Needs for Run III

        We describe the CMS computing model for MC event generation, and technical integration and workflows for generator tools in CMS. We discuss the most commonly used generators, standard configurations, their event tunes, and the technical performance of these configurations for Run II as well as the needs for Run III.

        Speaker: Luca Perrozzi (Eidgenoessische Technische Hochschule Zuerich (ETHZ) (CH))
    • T3 - Distributed computing: Experiment Frameworks and Operational Experiences (1) Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: Hannah Short (CERN)
      • 415
        COMPASS Grid Production System

        The LHC Computing Grid was a pioneering integration effort that managed to unite computing and storage resources all over the world, making them available to the experiments at the Large Hadron Collider. During a decade of LHC computing, the Grid software has learned to effectively utilise different types of computing resources, such as classic computing clusters, clouds and high-performance computers. While the resources the experiments use are the same, the data flow differs from experiment to experiment. A crucial part of each experiment's computing is the production system, which describes the logic of and controls the data processing of the experiment. COMPASS has always relied on CERN facilities and, when CERN, during a hardware and software upgrade, started migrating to resources available only via the Grid, COMPASS faced the problem of having insufficient resources to process its data. To enable COMPASS data processing via the Grid, the development of a new production system was started. Key features of the modern production system for COMPASS are: distributed data processing, support of different types of computing resources, and support of an arbitrary number of computing sites. The building blocks of the production system are taken from the achievements of the LHC experiments, but the data processing logic is COMPASS-specific.

        Speaker: Artem Petrosyan (Joint Institute for Nuclear Research (RU))
      • 416
        Overview of the ATLAS distributed computing system

        The CERN ATLAS experiment successfully uses a worldwide
        computing infrastructure to support the physics program during LHC
        Run 2. The grid workflow system PanDA routinely manages 250 to
        500 thousand concurrently running production and analysis jobs
        to process simulation and detector data. In total more than 300 PB
        of data is distributed over more than 150 sites in the WLCG and
        handled by the ATLAS data management system Rucio. To prepare
        for the ever-growing LHC luminosity in future runs, new
        developments are underway to use opportunistic resources
        such as HPCs even more efficiently and to utilize new technologies.
        This presentation will review and explain the outline and the
        performance of the ATLAS distributed computing system and give an
        outlook to new workflow and data management ideas for the
        beginning of the LHC Run 3.

        Speaker: Johannes Elmsheuser (Brookhaven National Laboratory (US))
      • 417
        IceProd - A dataset management system for IceCube: Update

        IceCube is a cubic kilometer neutrino detector located at the South Pole. IceProd is IceCube's internal dataset management system, keeping track of where, when, and how jobs run. It schedules jobs from submitted datasets to HTCondor, keeping track of them at every stage of the lifecycle. Many updates have happened in the last years to improve stability and scalability, as well as to increase user access. Major additions include better heterogeneous job scheduling, resource requests and tracking, monitoring, and multi-user support. Along the way, the IceProd codebase switched from Python 2 to Python 3.

        Speaker: David Schultz (University of Wisconsin-Madison)
      • 418
        LHCb and DIRAC strategy towards the LHCb upgrade

        The DIRAC project is developing interware to build and operate distributed computing systems. It provides a development framework and a rich set of services for both Workload and Data Management tasks of large scientific communities. DIRAC is adopted by a growing number of collaborations, including LHCb, Belle2, the Linear Collider, and CTA.

        The LHCb experiment will be upgraded during the second long shutdown (2019-2020). At the restart of data taking in Run 3, the instantaneous luminosity will increase by a factor of five. The LHCb computing model will also need an upgrade. Oversimplifying, this translates into the need for significantly more computing (power or resources) and more storage with respect to what LHCb uses right now. The DIRAC interware will remain the tool that handles all of LHCb's distributed computing resources.

        Within this contribution, we will highlight the ongoing and planned efforts to ensure that DIRAC will be able to make optimal use of its distributed computing resources. This contribution focuses on DIRAC's plans for increasing the scalability of the overall system, taking into consideration that the main requirement is to keep a running system working, with continuity. This requirement translates into the need for studies and developments within the current DIRAC architecture. We believe that scalability is about traffic growth, dataset growth, and maintainability: within this contribution we will address all of them, showing the technical solutions we are adopting.

        Speaker: Federico Stagni (CERN)
      • 419
        Challenges of processing growing volumes of data for the CMS experiment during the LHC Run2

        In recent years the LHC delivered a record-breaking luminosity to the CMS experiment, making it a challenge to successfully handle all the demands for efficient Data and Monte Carlo processing. In this presentation we will review the major issues in managing such requests and how we were able to address them. Our main strategy relies on increased automation and dynamic workload and data distribution. We maximize the sharing of CPU resources using an HTCondor-based global pool, which was recently expanded with the dedicated Tier-0 resources. To avoid underutilization of Tier-2 sites, we rely heavily on remote data access (AAA). Multicore resizable jobs reduce the load on the workflow management tools and improve efficiency across all types of resources. A wide range of opportunistic resources such as the CMS trigger farm, supercomputing centers and cloud resources were integrated into the global pool, which provides access to more than 250k CPU cores.

        Speaker: Matteo Cremonesi (Fermi National Accelerator Lab. (US))
      • 420
        #461 slot
    • T4 - Data handling: S5 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Convener: Costin Grigoras (CERN)
      • 421
        Towards an Event Streaming Service for ATLAS data processing

        The ATLAS experiment is gradually transitioning from the traditional file-based processing model to dynamic workflow management at the event level with the ATLAS Event Service (AES). The AES assigns fine-grained processing jobs to workers and streams out the data in quasi-real time, ensuring fully efficient utilization of all resources, including the most volatile. The next major step in this evolution is the possibility of intelligently streaming the input data itself to workers. The Event Streaming Service (ESS) is now in development to asynchronously deliver only the input data required for processing when it is needed, protecting the application payload from WAN latency without creating expensive long-term replicas. In the current prototype implementation, ESS processes run on compute nodes in parallel to the payload, reading the input event ranges remotely over the network and replicating them in small input files that are passed to the application. In this contribution, we present the performance of the ESS prototype for different types of workflows in comparison to tasks accessing remote data directly. Based on the experience gained with the current prototype, we are now moving to the development of a server-side component of the ESS. The service can evolve progressively into a powerful CDN-like capability for data streaming, ultimately enabling the delivery of 'virtual data' generated on demand.

        Speaker: Nicolo Magini (INFN e Universita Genova (IT))
      • 422
        Development and operational experience of the web based application to collect, manage, and release the alignment and calibration configurations for data processing at CMS

        Alignment and calibration workflows in CMS require a significant operational effort, due to the complexity of the systems involved. To serve the variety of condition data management needs of the experiment, the alignment and calibration team has developed and deployed a set of web-based applications. The Condition DB Browser is the main portal to search, navigate and prepare a consistent set of calibrations to be consumed in reconstruction of data for physics. It also hosts various data management tools for the conditions including a customized display for certain calibration sets, an automatic user-defined notification agent for updates, a logging service for the user and the automatic online-to-offline uploads. In this presentation we report on the operational experience of this web application from 2017 data taking, with focus on new features and tools incorporated during this period.

        Speaker: Hasib Md (University of Delhi (IN))
      • 423
        Performance and impact of dynamic data placement in ATLAS

        For high-throughput computing, the efficient use of distributed computing resources relies on an evenly distributed workload, which in turn requires wide availability of the input data used in physics analysis. In ATLAS, the dynamic data placement agent C3PO was implemented in the ATLAS distributed data management system Rucio; it identifies popular data and creates additional, transient replicas to make data more widely and more reliably available.

        This contribution presents studies on the performance of C3PO and the impact it has on throughput rates of distributed computing in ATLAS. This includes analysis of the placement algorithm selection behaviour regarding the data considered for replication and destination storage elements, usage after the placement decision of the chosen datasets in general and the newly created copies in particular, and the impact on metrics like job waiting times, task completion times and failure rates of tasks.

        Speaker: Thomas Maier (Ludwig Maximilians Universitat (DE))
      • 424
        Distributed Data Collection for the Next Generation ATLAS EventIndex Project

        The ATLAS EventIndex currently runs in production in order to build a
        complete catalogue of events for experiments with large amounts of data.

        The current approach is to index all final produced data files at CERN Tier0,
        and at hundreds of grid sites, with a distributed data collection architecture
        using Object Stores to temporarily maintain the conveyed information, with
        references to them sent with a Messaging System. The final backend of all the
        indexed data is a central Hadoop infrastructure at CERN; an Oracle
        relational database is used for faster access to a subset of this information.

        In the future of ATLAS, instead of files, the event should be the atomic
        information unit for metadata. This motivation arises in order to accommodate
        future data processing and storage technologies. Files will no longer be static
        quantities, possibly dynamically aggregating data, and also allowing event-level
        granularity processing in heavily parallel computing environments. It also
        simplifies the handling of loss and/or extension of data. In this sense
        the EventIndex will evolve towards a generalized event WhiteBoard,
        with the ability to build collections and virtual datasets for end users.

        This paper describes the current Distributed Data Collection Architecture of the
        ATLAS EventIndex project, with details of the Producer, Consumer and Supervisor
        entities, and the protocol and information temporarily stored in the ObjectStore.
        It also shows the data flow rates and performance achieved since the new Object
        Store as temporary store approach was put in production in July 2017.

        We review the challenges imposed by the expected increasing rates that will
        reach 35 billion new real events per year in Run 3, and 100 billion new real
        events per year in Run 4. For simulated events the numbers are even higher, with
        100 billion events/year in Run 3, and 300 billion events/year in Run 4.

        We also outline the challenges we face in order to accommodate this approach for
        the future Event White Board in ATLAS.

        Speaker: Alvaro Fernandez Casani (Univ. of Valencia and CSIC (ES))
      • 425
        Bootstrapping a New LHC Data Transfer Ecosystem

        GridFTP transfers and the corresponding Grid Security Infrastructure (GSI)-based authentication and authorization system have been data transfer pillars of the Worldwide LHC Computing Grid (WLCG) for more than a decade. However, in 2017, the end of support for the Globus Toolkit - the reference platform for these technologies - was announced. This has reinvigorated and expanded efforts to replace these pillars. We present an end-to-end alternative utilizing HTTP-based WebDAV as the transfer protocol, and bearer tokens for distributed authorization.

        This alternative ecosystem, integrating significant pre-existing work and ideas in the area, adheres to common industry standards to the fullest extent possible, with minimal agreed-upon extensions or common interpretations of the core protocols. The bearer token approach allows resource providers to delegate authorization decisions to the LHC experiments for experiment-dedicated storage areas.

        This demonstration touches the entirety of the stack - from multiple storage element implementations to FTS3 to the Rucio data management system. We show how the traditional production and user workflows can be reworked utilizing bearer tokens, eliminating the need for GSI proxy certificates for storage interactions.
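
        A minimal sketch of what a single transfer step can look like in such an ecosystem: an HTTP PUT to a WebDAV endpoint authorized with a bearer token instead of a GSI proxy. The endpoint URL, token location and CA path are hypothetical placeholders:

          # Illustrative token-authorized upload over HTTP/WebDAV with the requests library.
          import requests

          token = open("/tmp/bt_u1000").read().strip()      # bearer token obtained beforehand
          url = "https://storage.example.org:1094/store/user/somefile.root"

          with open("somefile.root", "rb") as payload:
              resp = requests.put(url,
                                  data=payload,
                                  headers={"Authorization": f"Bearer {token}"},
                                  verify="/etc/grid-security/certificates")  # CA bundle/dir
              resp.raise_for_status()
          print("upload complete:", resp.status_code)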

        Speaker: Brian Paul Bockelman (University of Nebraska Lincoln (US))
      • 426
        Capability-Based Authorization for HEP

        Outside the HEP computing ecosystem, it is vanishingly rare to encounter user X509 certificate authentication (and proxy certificates are even more rare). The web never widely adopted the user certificate model, but increasingly sees the need for federated identity services and distributed authorization. For example, Dropbox, Google and Box instead use bearer tokens issued via the OAuth2 protocol to authorize actions on their services. Thus, the HEP ecosystem has the opportunity to reuse recent work in industry that now covers our needs. We present a token-based ecosystem for authorization tailored for use by CMS.

        We base the tokens on the SciTokens profile for the standardized JSON Web Token (JWT) format. The token embeds a signed description of what capabilities the VO grants the bearer; the site-level service can verify the VO’s signature without contacting a central service.

        In this paper, we describe the modifications done to enable token-based authorization in various software packages used by CMS, including XRootD, CVMFS, and HTCondor. We describe the token-issuing workflows that would be used to get tokens to running jobs in order to authorize data access and file stageout, and explain the advantages for hosted web services. Finally, we outline what the transition would look like for an experiment like CMS.
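
        A minimal sketch of issuing and verifying a capability-bearing token with PyJWT; the claim names follow the spirit of the SciTokens profile, but the exact values, key files and issuer are illustrative assumptions, not the CMS configuration:

          # Illustrative capability token: the VO signs it, a site verifies it locally.
          import time
          import jwt  # PyJWT (RS256 requires the 'cryptography' package)

          private_key = open("vo_signing_key.pem").read()    # VO-held signing key (hypothetical)
          public_key = open("vo_signing_key.pub").read()     # distributed to sites

          claims = {
              "iss": "https://cms-auth.example.org",          # token issuer (the VO)
              "sub": "production-workflow-1234",
              "scope": "read:/store write:/store/user/prod",  # capabilities granted to the bearer
              "exp": int(time.time()) + 3600,                 # short-lived token
          }
          token = jwt.encode(claims, private_key, algorithm="RS256")

          # A site-level service verifies the signature without contacting a central service:
          decoded = jwt.decode(token, public_key, algorithms=["RS256"],
                               issuer="https://cms-auth.example.org")
          print("granted capabilities:", decoded["scope"])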

        Speaker: Brian Paul Bockelman (University of Nebraska Lincoln (US))
    • T5 - Software development: S5.1 Interactive Frameworks Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Maiken Pedersen (University of Oslo (NO))
      • 427
        Beautiful, simple and remote ROOT graphics and GUI

        For two decades, ROOT brought its own window system abstraction (for X11, GL, Cocoa, and Windows) together with its own GUI library. X11 is nearing the end of its lifetime; new windowing systems shine with performance and features. To make best use of them, the ROOT team has decided to re-implement its graphics and GUI subsystem using web technology.

        This presentation introduces the model, data and control flow. The concepts are likely similar for any application that wishes to offer remote GUI and graphics, and follow a clear client-server concept. Here the server is a C++ application, which produces data and prepares it for visualization. The client(s) will be JavaScript-based code, creating an HTML/SVG/WebGL representation of the provided data and implementing all kinds of interactivity – zooming, tooltips, context menus, etc. The communication will be done via a websocket-based protocol, allowing data push from the server side. Local displays will be implemented with libraries like the Chromium Embedded Framework (CEF).

        Taking the new TCanvas as a prominent example, different aspects of the new concept will be discussed: data organization on the application side; creation of the data model and versioning of the different components in the canvas; communication patterns with multiple clients; and efficient JavaScript-based rendering with a significant amount of code sharing with JSROOT. The example of the new FitPanel will be used to show how the main interactive part can be implemented with an open-source GUI library, while the main fitting job is performed with standard ROOT methods.

        Speaker: Serguei Linev (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 428
        Exploring server/web-client event display for CMS

        The divergence of windowing systems among modern Linux distributions and OSX is making the current mode of operation difficult to maintain. In order to continue supporting the CMS experiment event display, aka Fireworks, we need to explore other options beyond the current distribution model of centrally built tarballs.

        We think that a C++-server / web-client event display is a promising direction that can maintain the full functionality of Fireworks, including operation from the full experiment framework. In addition, it brings new features like multi-user debugging and the possibility to implement more elaborate visualization of non-event data through remote access to independent services.

        We have been exploring mainly in the direction of Fireworks-based C++ server and thin web-client user interface as it allows for a large degree of reuse of existing algorithms as well as for full access to CMS data formats and accompanying functions that are crucial for correct physical interpretation of event data. We will show the basic architecture of the system, discuss the communication protocol between server and client, and show existing prototypes that demonstrate feasibility of advanced event display features.

        Speaker: Alja Mrak Tadel (Univ. of California San Diego (US))
    • T6 - Machine learning and physics analysis: S5 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Convener: Andrea Rizzi (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P)
      • 429
        End-to-end Deep Learning Applications for Event Classification at CMS

        An essential part of new physics searches at the Large Hadron Collider (LHC) at CERN involves event classification, or distinguishing signal events from the background. Current machine learning techniques accomplish this using traditional hand-engineered features like particle 4-momenta, motivated by our understanding of particle decay phenomenology. While such techniques have proven useful for simple decays, they are highly dependent on our ability to model all aspects of the phenomenology and detector response. Meanwhile, powerful deep learning algorithms are capable of not only training on high-level features, but of performing feature extraction. In computer vision, convolutional neural networks have become the state of the art for many applications. Motivated by their success, we apply deep learning algorithms to raw detector-level data from the CMS experiment to directly learn useful features, in what we call “end-to-end event classification”. We demonstrate the power of this approach in the context of a physics search and offer solutions to some of the inherent challenges, such as image construction, image sparsity, combining multiple sub-detectors, and de-correlating the classifier from the search observable, among others.
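
        A toy sketch of the end-to-end idea: a small convolutional network trained directly on detector-image-like arrays instead of hand-engineered features; shapes, layer sizes and the random stand-in data are illustrative only, not the CMS setup:

          # Illustrative CNN on "calorimeter images" (random stand-ins) with Keras.
          import numpy as np
          import tensorflow as tf

          x = np.random.random((1000, 32, 32, 2)).astype("float32")  # 32x32 cells, 2 sub-detector channels
          y = np.random.randint(0, 2, size=1000)                     # signal vs background labels

          model = tf.keras.Sequential([
              tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 2)),
              tf.keras.layers.MaxPooling2D(),
              tf.keras.layers.Conv2D(32, 3, activation="relu"),
              tf.keras.layers.GlobalAveragePooling2D(),
              tf.keras.layers.Dense(1, activation="sigmoid"),
          ])
          model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
          model.fit(x, y, epochs=2, batch_size=64, validation_split=0.2)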

        Speaker: Michael Andrews (Carnegie-Mellon University (US))
      • 430
        Machine learning techniques for jet flavour identification at CMS

        Jet flavour identification is a fundamental component of the physics program of the LHC-based experiments. The presence of multiple flavours to be identified leads to a multiclass classification problem. We present results from a realistic simulation of the CMS detector, one of two multi-purpose detectors at the LHC, and the respective performance measured on data. Our tagger, named DeepJet, relies heavily on applying convolutions to lower-level physics objects, like individual particles. This approach allows the usage of an unprecedented amount of information with respect to what is found in the literature. DeepJet stands out as the first proposal that can be applied to multi-classification for all jet flavours as well as different widths of jets. We demonstrate significant improvements from the new approach in the classification capabilities of the CMS experiment in simulation for several of the tested classes. At high momentum, improvements of nearly 90% fewer false positives at a standard operating point are reached. The application of jet flavour identification in the high-level trigger of CMS is also discussed, based on recent deep learning developments.

        Speaker: Mauro Verzetti (CERN)
      • 431
        The Belle II flavor tagger

        Measurements of time-dependent CP violation and of $B$-meson mixing at B-factories require a determination of the flavor of one of the two exclusively produced $B^0$ mesons. The predecessors of Belle II, the Belle and BaBar experiments, developed so-called flavor tagging algorithms for this task. However, due to the novel high-luminosity conditions and the increased beam-backgrounds at Belle II, a new flavor tagging algorithm had to be developed with a specific and dedicated optimization in order to be applied in precision measurements while coping with the harsher experimental
        conditions.

        The new Belle II flavor tagging algorithm, the so-called Belle II flavor tagger, is based on inclusive methods that exploit the flavor-specific signatures of $B^0$ decays, considering more signatures than the previous algorithms and employing novel multivariate approaches. It proceeds in two steps: the event level, where candidate particles providing flavor-specific signatures are selected; and the combiner, where all the possible signatures are combined into the final output. The framework is now complete and ready for use in data analysis, reaching an effective tagging efficiency of ca. 37$\%$ on Belle II Monte Carlo (MC).

        A major test consists in applying the Belle II flavor tagger algorithm in analysis with Belle data. For this, the Belle II flavor tagger was trained with Belle MC and tested with Belle data reaching an effective efficiency of $34\%$ (Belle reached 30$\%$). Additionally, the shape of the Belle II flavor tagger output shows good agreement between Belle data and Belle MC, meeting also the requirements to be used for analysis of Belle data.

        Speaker: Mr Fernando Abudinen (Max-Planck-institut für Physik)
      • 432
        Boosting Neutral Particles Identification by Boosting Trees: LHCb case

        Reconstruction and identification in the calorimeters of modern High Energy Physics experiments are complicated tasks. Solutions are usually driven by a priori knowledge about the expected properties of reconstructed objects. Such an approach is also used to distinguish single photons in the electromagnetic calorimeter of the LHCb detector at the LHC from overlapping photons produced in high-momentum pi0 decays. We studied an alternative solution based on applying machine learning techniques to the primary calorimeter information, namely the energies collected in the individual cells around the energy cluster. The ML-based model employs the extreme gradient boosting trees approach, which is widely used nowadays, and separates pi0 and photon responses from “first principles”, i.e. the plain energies deposited in the calorimeter cells. This approach allowed us to improve the separation performance score from 80% to 93% on simulated data, which means reducing the primary-photon fake rate by a factor of four. This presentation will describe the approach used for this identification problem and compare the performance obtained on simulated and real data. Possible extensions and generalisations of the approach will also be discussed.
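        As a hedged illustration of this kind of classifier (not the LHCb code), the sketch below trains gradient-boosted trees on the raw energies of a 5x5 cell window around a cluster seed; the window size, the synthetic inputs and all names are assumptions made for the example.

            import numpy as np
            from xgboost import XGBClassifier
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import roc_auc_score

            # Synthetic stand-ins: energies of the 25 cells in a 5x5 window around
            # each cluster seed, labelled 1 for a single photon, 0 for a merged pi0.
            n_clusters = 5000
            cell_energies = np.random.rand(n_clusters, 25)
            is_photon = np.random.randint(0, 2, size=n_clusters)

            X_train, X_test, y_train, y_test = train_test_split(cell_energies, is_photon, test_size=0.3)
            clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
            clf.fit(X_train, y_train)
            print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))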

        Speaker: Viktoriia Chekalina (Yandex School of Data Analysis (RU))
      • 433
        Application of machine learning techniques at BESIII experiment

        The BESIII detector is a general purpose spectrometer located at BEPCII. BEPCII is a double ring $e^+e^-$ collider running at center-of-mass energies between 2.0 and 4.6 GeV and reached a peak luminosity of $1\times 10^{33}\,\mathrm{cm}^{-2}\mathrm{s}^{-1}$ at $\sqrt{s} = 3770$ MeV.

        As an experiment in the high precision frontier of hadron physics, since 2009, BESIII has collected the world's largest data samples of $J/\psi$, $\psi(3686)$,
        $\psi(3770)$, $\psi(4040)$ and $\psi(4260)$ decays. These data are being used to make a variety of interesting and unique studies of light hadron spectroscopy, charmonium spectroscopy and high-precision measurements of charmed hadron decays.

        The application of machine learning techniques at BESIII is discussed.
        By investigating typical use cases such as “event selection for rare signal”, “particle identification of lepton” and “track segment building for CGEM (Cylindrical Gas Electron Multiplier Inner Tracker)”, we map these data-processing tasks to classification problems in the machine learning field. A comparison of performance with traditional approaches will be given.

        Speaker: Dr Beijiang Liu (Institute of High Energy Physics, Chinese Academy of Sciences)
      • 434
        Physics Inspired Heavy Object Tagging

        We show how a novel network architecture based on Lorentz Invariance (and not much else) can be used to identify hadronically decaying top quarks. We compare its performance to alternative approaches, including convolutional neural networks, and find it to be very competitive.
        We also demonstrate how this architecture can be extended to include tracking information and show its application to a multi-class identification problem in Higgs physics.

        Speaker: Michael Russell (Heidelberg University)
    • T7 - Clouds, virtualization and containers: S5 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Convener: Martin Sevior (University of Melbourne (AU))
      • 435
        Quasi-online accounting and monitoring system for distributed clouds

        The HEP group at the University of Victoria operates a distributed cloud computing system for the ATLAS and Belle II experiments. The system uses private and commercial clouds in North America and Europe that run OpenStack, OpenNebula or commercial cloud software. It is critical that we record accounting information to give credit to cloud owners and to verify our use of commercial resources. We want to record the number of CPU-hours used by each virtual machine.
        To obtain the required information, we run a fast benchmark at boot time to give an estimate of the HEPSpec06 units of the node. Our first system writes the benchmark and CPU times (obtained from /proc/stat) to a log file every 15 minutes. The last entry for a VM was used to determine its CPU-hours. This system has worked well, but the information about a VM is only available after it is deleted and, in some cases, VMs can exist for many weeks. Hence, the final accounting information is delayed for some time.
        We have introduced a new system that continuously collects the information and uploads it into an Elasticsearch database. The information is processed and published as soon as it is available. The data are published in tables and plots in Kibana and ROOT. We have found the system to be useful beyond gathering accounting information; it can also be used for monitoring and diagnostic purposes. For example, we can use it to detect whether payload jobs are stuck waiting for external information.
        We will report on the design and performance of the system, and show how it provides important accounting and monitoring information on a large distributed system.
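        As a rough illustration of the sampling loop described above (not the Victoria implementation), the sketch below reads cumulative CPU time from /proc/stat every 15 minutes and scales it by a boot-time benchmark score; the benchmark value and record fields are assumptions.

            import os
            import time

            HEPSPEC06_PER_CORE = 11.2   # hypothetical fast-benchmark result measured at boot

            def cpu_seconds_from_proc_stat():
                """Return cumulative (user + nice + system) CPU seconds since boot."""
                with open("/proc/stat") as f:
                    fields = f.readline().split()   # first line holds the aggregate "cpu" counters
                user, nice, system = (int(x) for x in fields[1:4])
                return (user + nice + system) / os.sysconf("SC_CLK_TCK")

            while True:
                cpu_hours = cpu_seconds_from_proc_stat() / 3600.0
                record = {"timestamp": time.time(),
                          "cpu_hours": cpu_hours,
                          "hs06_hours": cpu_hours * HEPSPEC06_PER_CORE}
                print(record)          # the real system would append this to a log or push it to Elasticsearch
                time.sleep(15 * 60)    # sample every 15 minutes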

        Speaker: Randy Sobie (University of Victoria (CA))
      • 436
        Backfilling the Grid with Containerized BOINC in the ATLAS computing

        Virtualization is a commonly used solution for utilizing opportunistic computing resources in the HEP field, as it provides the unified software and OS layer that HEP computing tasks require over heterogeneous opportunistic computing resources. However, there is always a performance penalty with virtualization; for short jobs, which are the norm for volunteer computing tasks, the overhead of virtualization becomes a large fraction of the wall time and leads to low CPU efficiency of the jobs. With the wide usage of containers in HEP computing, we explore the possibility of adopting container technology in the ATLAS BOINC project, and have implemented a Native version in BOINC, which uses the Singularity container or direct usage of the target OS to replace VirtualBox. In this paper, we will discuss 1) the implementation and workflow of the Native version in ATLAS BOINC; 2) the performance measurement of the Native version compared to the previous virtualization version; 3) the limits and shortcomings of the Native version; 4) the use cases and benefits of the Native version, which include backfilling ATLAS Grid Tier-2 sites and other clusters, and utilizing idle computers from the CERN computing centre.

        Speaker: Wenjing Wu (Computer Center, IHEP, CAS)
      • 437
        Apache Spark usage and deployment models for scientific computing

        This talk is about sharing our recent experiences in providing a data analytics platform based on Apache Spark for High Energy Physics, the CERN accelerator logging system and infrastructure monitoring. The Hadoop Service has started to expand its user base to researchers who want to perform analysis with big data technologies. Among many frameworks, Apache Spark is currently getting the most traction from various user communities, and new ways to deploy Spark, such as Apache Mesos or Spark on Kubernetes, have started to evolve rapidly. Meanwhile, notebook web applications such as Jupyter offer the ability to perform interactive data analytics and visualizations without the need to install additional software. CERN already provides a web platform, called SWAN (Service for Web-based ANalysis), where users can write and run their analyses in the form of notebooks, seamlessly accessing the data and software they need.

        The first part of the presentation covers several recent integrations and optimizations of the Apache Spark computing platform to enable HEP data processing and CERN accelerator logging system analytics. The optimizations and integrations include, but are not limited to, access to Kerberized resources, an XRootD connector enabling remote access to EOS storage, and integration with SWAN for interactive data analysis, thus forming a truly Unified Analytics Platform.

        The second part of the talk touches upon the evolution of the Apache Spark data analytics platform, particularly the recent work done to run Spark on Kubernetes on the virtualized and container-based infrastructure in OpenStack. This deployment model allows for elastic scaling of data analytics workloads, enabling efficient, on-demand utilization of resources in private or public clouds.
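        For orientation, a minimal, hedged sketch of what such a deployment looks like from the user side is given below; the Kubernetes API server URL, namespace and container image are placeholders, not the actual CERN configuration.

            from pyspark.sql import SparkSession

            spark = (SparkSession.builder
                     .master("k8s://https://kubernetes.example.cern.ch:443")   # placeholder API server
                     .appName("hep-analytics-demo")
                     .config("spark.executor.instances", "4")
                     .config("spark.kubernetes.namespace", "analytics")        # placeholder namespace
                     .config("spark.kubernetes.container.image", "registry.example/spark:latest")
                     .getOrCreate())

            # Once the session exists, the analysis code is the same as on any other Spark deployment.
            df = spark.range(0, 1000000).selectExpr("id", "id % 10 as bucket")
            df.groupBy("bucket").count().show()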

        Speaker: Prasanth Kothuri (CERN)
      • 438
        Facilitating collaborative analysis in SWAN

        SWAN (Service for Web-based ANalysis) is a CERN service that allows users to perform interactive data analysis in the cloud, in a "software as a service" model. It is built upon the widely-used Jupyter notebooks, allowing users to write - and run - their data analysis using only a web browser. By connecting to SWAN, users have immediate access to storage, software and computing resources that CERN provides, and that they need to do their analyses.

        Besides providing an easier way of producing scientific code and results, SWAN is also a great tool for creating shareable content. From results that need to be reproducible to tutorials and demonstrations for outreach and teaching, Jupyter notebooks are the ideal way of distributing this content. In one single file, users can pack their code, the results of the calculations and all the relevant textual information. Sharing them allows others to visualise, modify, personalise or even re-run all the code.

        In that sense, this paper describes the efforts made to facilitate sharing in SWAN. Given the importance of collaboration in our scientific community, we have brought the sharing functionality from CERNBox, CERN’s cloud storage service, directly inside SWAN. SWAN users now have a redesigned interface where they can share “Projects”: a special kind of folder containing notebooks and other files, e.g. input datasets and images. When a user shares a Project with other users, the latter can immediately see and work with the content of that project from SWAN.

        The collaborative aim of SWAN goes beyond CERN borders: The European project “Up to University” (Up2U), whose goal is to create a bridge between high schools and higher education, has chosen SWAN as its main tool to produce, share, and reuse high-quality educational notebooks among students, teachers, and scientists. SWAN reaches out to the academic community and even secondary schools, helping in preparing younger students for their future career in science.

        Speaker: Enric Tejedor Saavedra (CERN)
      • 439
        Serverless endpoint with public container clusters for HEP experiments

        In recent years, public clouds have undergone a large transformation. Nowadays, cloud providers compete in delivering specialized, scalable and fault-tolerant services where resource management is completely on their side. Such a computing model, called serverless computing, is very attractive for users who do not want to worry about OS-level management, security patches and scaling resources.
        Our aim was to build a batch computing solution based on the serverless model, enriched with traditional Grid computing endpoint features such as X.509 certificate authorization and a job scheduler. As the job processing ground we use Docker container clusters orchestrated by the AWS Batch service, with AWS Lambda functions handling REST API requests. The API definition will be publicly available, so the proposed solution, built initially for Belle II, can be used by any HEP experiment.
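        A hedged sketch of the kind of endpoint described above follows: an AWS Lambda handler that receives a job description from a REST request and submits it to an AWS Batch queue. The queue and job-definition names are placeholders, and the certificate-based authorization layer is omitted here.

            import json
            import boto3

            batch = boto3.client("batch")

            def handler(event, context):
                """Lambda entry point: turn a REST request body into an AWS Batch job."""
                body = json.loads(event.get("body", "{}"))
                response = batch.submit_job(
                    jobName=body.get("name", "hep-job"),
                    jobQueue="belle2-docker-queue",        # placeholder Batch queue
                    jobDefinition="belle2-pilot:1",        # placeholder job definition
                    containerOverrides={
                        "command": body.get("command", ["./run_payload.sh"]),
                        "vcpus": body.get("cores", 1),
                        "memory": body.get("memory_mb", 2000),
                    },
                )
                return {"statusCode": 200,
                        "body": json.dumps({"jobId": response["jobId"]})}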

        Speaker: Rafal Grzymkowski (IFJ PAN)
      • 440
        #555 slot and BOF advertisement
    • T5 - Software development: S5.2 - Software Packaging, Build Systems - Part 1 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Maiken Pedersen (University of Oslo (NO))
      • 441
        Software Packaging and Deployment in HEP

        The process of building software for High Energy Physics is a problem that all experiments must face. It is also an aspect of the technical management of HEP software that is highly suited to sharing knowledge and tools. For this reason the HEP Software Foundation established a working group in 2015 to look at packaging and deployment solutions in the HEP community. The group has examined in some detail the experience and requirements of a number of experiments. From this input a set of use cases for building software, deploying it and setting up a runtime environment was established. This has captured some of the unique aspects of managing software in our community, such as the need to support multiple build types (e.g., optimised for different architectures or using different compilers) and parallel deployments (e.g., different production releases) that make our community different from other projects. In addition, the size of some of our software stacks poses problems when lower layers of the software may come from a common source. The necessity of reproducibility is also a challenge when any external packages are used. The group looked at a wide range of packaging tools, from the HEP community itself and from the wider open source world. We defined a small test software stack in order to evaluate how well they met our use cases, how easy they were to use, and how easy it was for multiple users to share build recipes. For deployment scenarios, CVMFS and installation into containers were considered critical to support.

        From all of these considerations the group has produced an HSF technical report that describes the work we have done and makes recommendations for the best tools to use and the best practices to adopt. Our ultimate goal is to provide simple packaging solutions that experiments can use that are as much as possible ‘off the shelf’ and in which the community can collectively contribute its considerable experience for everyone’s benefit.

        Speaker: Benjamin Morgan (University of Warwick (GB))
      • 442
        Spack-Based Packaging and Development for HEP Experiments

        Since its inception in 2010, the art event-based analysis framework and associated software have been delivered to client experiments using a Fermilab-originated system called UPS. Salient features valued by the community include installation without administration privileges, trivially-relocatable binary packages and the ability to use coherent sets of packages together (such as those compiled with a particular C++ standard or compiler, or a particular MPI implementation). These features have also been useful to a related system (Multi Repository Build, MRB) allowing simultaneous development of multiple related packages in a consistent environment.

        The field-wide shift towards leveraging flagship HPC systems, and the need to support experiments using modern macOS systems whose SIP security measures restrict the use of dynamic load paths to provide relocatability, have led us to search for other ways of providing experiments with a packaging and build system that meets their needs.

        One such candidate packaging system is Spack, in conjunction with an environment module system such as Lmod. We describe our efforts to develop a packaging system and protocol using these tools that provides the binary relocatability and package-set consistency that our clients require, and a development system ("Spackdev") to allow the straightforward development of HEP software packages in a consistent environment. These efforts include enhancements to the underlying tools where necessary, and the development of higher-level utilities to provide the required functionality and consistency safeguards.
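        To give a flavour of what a Spack-based recipe looks like, a hedged sketch follows; the package name, URL, checksum, variant and dependencies are invented for illustration and do not correspond to an actual recipe in the Spack repositories.

            from spack import *

            class HepExampleLib(CMakePackage):
                """Example HEP library packaged with Spack (illustrative only)."""

                homepage = "https://example.org/hep-example-lib"
                url      = "https://example.org/hep-example-lib-1.2.0.tar.gz"

                version("1.2.0", sha256="0" * 64)   # placeholder checksum

                variant("cxxstd", default="17", values=("14", "17"), multi=False,
                        description="C++ standard used for the build")

                depends_on("root")
                depends_on("boost")

                def cmake_args(self):
                    std = self.spec.variants["cxxstd"].value
                    return ["-DCMAKE_CXX_STANDARD={0}".format(std)]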

        Speaker: Kyle Knoepfel (Fermi National Accelerator Laboratory)
      • 443
        IceCube CVMFS Software and Spack

        IceCube is a cubic kilometer neutrino detector located at the south pole. CVMFS is a key component to IceCube’s Distributed High Throughput Computing analytics workflow for sharing 500GB of software across datacenters worldwide. Building the IceCube software suite on CVMFS has historically been accomplished first by a long bash script, then by a more complex set of python scripts. We document the migration to Spack, an open source package management tool that has gained popularity in HEP. One key advantage of Spack is the built-in package formulas, as well as other package formulas provided by the community. This has allowed us to add new compiler support and other user features that have been requested for several years. Our software package management is now standardized, simpler, and easier to maintain.

        Speaker: David Schultz (University of Wisconsin-Madison)
      • 444
        dCache as open-source project showcase for education

        For over a decade, dCache.org has provided software which is used at more than 80 sites around the world, providing reliable services for WLCG experiments and others. This can be achieved only with a well-established process starting from the whiteboard, where ideas are created, all the way through to packages installed on the production systems. Since early 2013 we have moved to git as our source code management system, as well as to GitHub as our source code hosting service.

        Since the summer of 2017, the dCache team has been involved in the education of students from HTW Berlin. The goal of the course is to teach students to contribute to ‘a standard open-source project’. dCache version 3.2 was released with multiple changes contributed during the course. As dCache is built around well-known tools and technologies, students should be able to apply the skills acquired during the seminar to most other open source projects as well.

        Although students only had a very short time to get familiar with the dCache project, our release process was stable enough to handle a flood of first-time contributors without compromising code quality and stability.

        This presentation will show how techniques and tools are used by dCache developers to manage code and releases, how to handle external contributions and still keep the quality and reliability of the provided packages.

        Speaker: Mr Tigran Mkrtchyan (DESY)
    • Posters: S2 & packed lunch. To see the poster list, click on 'View session details'.
    • Social events: Sofia walking tour
    • Plenary: S6 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Gang Chen (Chinese Academy of Sciences (CN))
      • 445
        Blockchain for Large Scale Scientific Computing

        The HL-LHC will present enormous storage and computational demands, creating a total dataset of up to 200 Exabytes and requiring commensurate computing power to record, reconstruct, calibrate, and analyze these data. Addressing these needs for the HL-LHC will require innovative approaches to deliver the necessary processing and storage resources. The "blockchain" is a recent technology for accessing very large pools of computing and storage space, while also ensuring accurate processing of that data. This is the technology underpinning cryptocurrency platforms such as Bitcoin and Ethereum. Blockchain processing networks are, by their nature, highly distributed systems; exploiting them will allow access to a so-far untapped high-quality, global computing and storage network. Ethereum in particular already makes provisions to ensure accurate execution of delivered code, making it a favorable testbed for applications in HEP. The implementation of blockchain technology in the context of particle physics experiments will be explained, defining relevant units of work, and how this work can be distributed safely across worldwide computing nodes. Furthermore, a proof of concept in a simplified analysis workflow where we have embedded the workflow and its results into the blockchain will be demonstrated.

        Speaker: Lindsey Gray (Fermi National Accelerator Lab. (US))
      • 446
        Quantum Computing
        Speaker: Elizabeth Sexton-Kennedy (Fermi National Accelerator Lab. (US))
      • 447
        Beyond X.509: Token-based Authentication and Authorization for HEP

        X.509 certificates and VOMS have proved to be a secure and reliable solution for authentication and authorization on the Grid, but also showed usability issues and required the development of ad-hoc services and libraries to support VO-based authorization schemes in Grid middleware and experiment computing frameworks. The need to move beyond X.509 certificates is recognized as an important objective in the HEP R&D roadmap for software and computing, to overcome the usability issues of the current AAI and embrace recent advancement in web technologies widely adopted in industry, but also to enable the secure composition of computing and storage resources provisioned across heterogeneous providers (e.g., Grid, private and commercial clouds, HPC centers) in order to meet the computing needs of HL-LHC.
        A flexible and usable AAI based on modern technologies (such as OpenID Connect, OAuth 2, JSON Web Tokens (JWTs)) is a key enabler of such secure composition, and has been a major topic of research of the recently concluded INDIGO-DataCloud project.
        In this contribution, we will present an integrated solution, based on the INDIGO-Datacloud Identity and Access Management (IAM) service and other software components, that demonstrates how a next generation, token-based VO-aware AAI can be built in support of HEP computing use cases while maintaining compatibility with the existing, VOMS-based AAI used by the Grid.
        We will describe and demonstrate:
        - The base technologies and standards used (OpenID Connect, OAuth, JWTs)
        - IAM support for multiple authentication mechanisms (SAML, social logins, X.509 certificates)
        - the IAM account linking functionality, which allows users to link multiple different credentials to their VO membership
        - the IAM registration service, which provides VO enrollment flows similar to the ones in use today but that do not impose X.509 certificate management on users
        - the IAM VO management service, used to define the VO structure, organize users in groups and grant privileges to users;
        - How IAM can directly be integrated with existing Grid software leveraging on-demand X.509 certificate and VOMS provisioning;
        - How token-based, VO-aware authentication and authorization can be implemented at relying services using off-the-shelf software compliant with the OAuth/OpenID Connect standards (a minimal sketch follows this abstract).
        Finally, we will discuss how this work is positioned with respect to other AAI solutions (e.g. the EGI CheckIn service) and standardization efforts being pursued in the context of relevant European and American projects (AARC, SciTokens).
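        The sketch below illustrates, in a hedged way, the relying-service side of the last point: validating a JWT access token and mapping a group claim to a VO-based authorization decision. The issuer, audience, claim names and group path are assumptions, not the IAM or WLCG token schema.

            import jwt   # PyJWT

            def authorize(token, issuer_public_key):
                """Validate a JWT access token and check a VO group claim."""
                claims = jwt.decode(token, issuer_public_key,
                                    algorithms=["RS256"],
                                    audience="https://wlcg.example.org",   # placeholder audience
                                    issuer="https://iam.example.org/")     # placeholder issuer
                groups = claims.get("groups", [])
                if "/myvo/production" not in groups:                       # placeholder VO group
                    raise PermissionError("caller lacks the required VO group")
                return claims["sub"]   # authenticated subject identifier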

        Speaker: Andrea Ceccanti
    • 10:30
      Break
    • T1 - Online computing: S6 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Ryosuke Itoh (KEK)
      • 448
        Readout software for the ALICE integrated Online-Offline (O2) system

        ALICE (A Large Ion Collider Experiment) is a heavy-ion detector studying the physics of
        strongly interacting matter and the quark-gluon plasma at the CERN LHC (Large Hadron
        Collider). During the second long shut-down of the LHC, the ALICE detector will be
        upgraded to cope with an interaction rate of 50 kHz in Pb-Pb collisions, producing in the
        online computing system (O2) a sustained input throughput of 3 TB/s.

        The readout software is in charge of the first step of data-acquisition, handling the
        data transferred from over 8000 detector links to PC memory by dedicated PCI boards,
        formatting and buffering incoming traffic until sent to the next components in the
        processing pipeline. On the 250 readout nodes where it runs, it has to sustain a
        throughput which can locally exceed 100 Gb/s.

        We present the modular design used to cope with various data sources (hardware devices
        and software emulators), integrated with the central O2 components (logging,
        configuration, monitoring, data sampling, transport) and initiating the online data flow
        using the standard O2 messaging system. Performance considerations and measurements are
        also discussed.

        Speaker: Filippo Costa (CERN)
      • 449
        FELIX based readout of the Single-Phase ProtoDUNE detector

        The liquid argon Time Projection Chamber technique has matured and is now in use by several short-baseline neutrino experiments. This technology will be used in the long-baseline DUNE experiment; however, this experiment represents a large increase in scale, which needs to be validated explicitly. To this end, both the single-phase and dual-phase technology are being tested at CERN, in two full-scale (6x6x6 m$^3$) ProtoDUNE setups. Besides the detector technology, these setups also allow for extensive tests of readout strategies.
        The Front-End LInk eXchange (FELIX) system was initially developed within the ATLAS collaboration and is based on custom FPGA-based PCIe I/O cards in combination with commodity servers. FELIX will be used in the single-phase ProtoDUNE setup to read the data coming from 2560 anode wires organised in a single Anode Plane Assembly structure. With a sampling rate of 2 MHz, the system must deal with an input rate of 96 Gb/s, and buffer data. Event building requests will arrive at a rate of at least 25 Hz, and lossless compression must reduce the data within the selected time windows by a factor of 4 before being sent to the experiment's event building farm.
        This presentation will discuss the design of the system as well as first operational experience.

        Speaker: Enrico Gamberini (CERN)
      • 450
        The CMS event-builder system for LHC run 3 (2021-23)

        The data acquisition system (DAQ) of the CMS experiment at the CERN Large Hadron Collider (LHC) assembles events of 2 MB at a rate of 100 kHz. The event builder collects event fragments from about 740 sources and assembles them into complete events which are then handed to the high-level trigger (HLT) processes running on O(1000) computers. The aging event-building hardware will be replaced during the long shutdown 2 of the LHC taking place in 2019/20. The future data networks will be based on 100 Gb/s interconnects using Ethernet and Infiniband technologies. More powerful computers may allow combining the currently separate functionality of the readout and builder units into a single I/O processor handling 100 Gb/s of input and output traffic simultaneously. It might be beneficial to preprocess data originating from specific detector parts or regions before handing it to generic HLT processors. Therefore, we will investigate how specialized coprocessors, e.g. GPUs, could be integrated into the event builder. We will present the envisioned changes to the event builder compared to today's system. Initial measurements of the performance of the data networks under the event-building traffic pattern will be shown. Implications of a folded network architecture for the event building and corresponding changes to the software implementation will be discussed.

        Speaker: Remi Mommsen (Fermi National Accelerator Lab. (US))
      • 451
        The Design of a Distributed Key-Value Store for Petascale Hot Storage in Data Acquisition Systems

        Data acquisition (DAQ) systems for high energy physics experiments read out data from a large number of electronic components, typically over thousands of point-to-point links. They are thus inherently distributed systems. Traditionally, an important stage in the data acquisition chain has always been the so-called event building: data fragments coming from different sensors are identified as belonging to the same physical event and are physically assembled in a single memory location on one host. The complete events cached on event-builder nodes are then served to farms of processors for analysis and filtering of data. In this work we propose a new approach - logical event building with hot storage. Data fragments are stored in a large distributed key-value store without any physical event building. Fragments belonging to one event can then be queried directly by the processes carrying out the data analysis and filtering when needed. We analyze the advantages of this approach. Among them are optimized usage of network and storage resources, and foremost increased CPU efficiency in the computing farm. The latter is possible thanks to the decoupling of the lifetime of the analysis/filtering processes from the changing event rate due to the duty cycle of the accelerator. We then present the design and initial performance evaluation of FogKV - a distributed key-value store for high-bandwidth data acquisition systems. We discuss key design choices, including a hybrid NVDIMM/SSD storage backend with buffering and RDMA transport. These are required in order to meet the bandwidth and storage requirements of the high-luminosity upgrades of the LHC at CERN, after which data will be produced at a rate of 6 TB/s. The storage required to be able to keep data for up to twenty-four hours is of the order of 500 PB. We present how single-node performance is expected to scale to meet these requirements.
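        To make the logical event building idea concrete, here is a deliberately simplified toy sketch in which a Python dict stands in for the distributed key-value store; it is not FogKV code, and the key layout is an assumption.

            from collections import defaultdict

            class FragmentStore:
                """Toy stand-in for a distributed key-value store of event fragments."""

                def __init__(self):
                    self._store = defaultdict(dict)   # event_id -> {source_id: payload}

                def put(self, event_id, source_id, payload):
                    self._store[event_id][source_id] = payload

                def get_event(self, event_id, expected_sources):
                    """Return all fragments of one event, or None if it is still incomplete."""
                    fragments = self._store.get(event_id, {})
                    if set(fragments) != set(expected_sources):
                        return None
                    return fragments

            store = FragmentStore()
            for source in range(4):
                store.put(event_id=42, source_id=source, payload=b"...")
            print(store.get_event(42, expected_sources=range(4)))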

        Speaker: Dr Grzegorz Jereczek (Intel Corporation)
      • 452
        Data distribution and load balancing for the ALICE Online-Offline (O2) system

        ALICE (A Large Ion Collider Experiment), one of the large LHC experiments, is undergoing a major upgrade during the next long shutdown. The increase in data rates planned for LHC Run 3 (3 TiB/s for Pb-Pb collisions), together with triggerless continuous readout operation, requires a paradigm shift in computing and networking infrastructure.
        The new ALICE O2 (online-offline) computing facility consists of two types of nodes: First Level Processors (FLP), containing the read-out PCI cards, and Event Processing Nodes (EPN), responsible for the online reconstruction. Each FLP node buffers detector data for a predefined time interval, called a SubTimeFrame (STF). The central task of the data distribution is to aggregate the corresponding STFs from all FLP nodes into an object called a Time Frame (TF). The FLP-EPN network must support the high aggregate data rate and sustain a large number of concurrent transfers. Application-level scheduling of data transfers, as well as selection of the receiving EPNs, will be necessary to maintain a high quality of service.
        We give an overview of the TF building process including FLP node synchronization, traffic shaping and balancing for even utilization of processing and network components of the O2 facility.

        Speaker: Gvozden Neskovic (Johann-Wolfgang-Goethe Univ. (DE))
      • 453
        The NaNet Project: Real-time Distributed Heterogeneous Stream Processing for the NA62 Low Level Trigger

        The NA62 experiment at CERN SPS is aimed at measuring the branching ratio of the ultra-rare K+→π+νν decay.
        This imposes very tight requirements on the particle identification capabilities of the apparatus in order to reject the considerable background.
        To this purpose a centralized level 0 hardware trigger system (L0TP) processes in real-time the streams of data primitives coming from the detectors readout boards.
        Our approach aims at improving the L0TP performance by distributing this processing over the whole chain, starting from the earliest stages, i.e. the readout boards, and operating on the data streams with an orchestrated combination of heterogeneous computing devices (CPUs, FPGAs and GPUs).
        The key element of this real-time distributed stream computing architecture is NaNet, an FPGA-based PCI Express Network Interface Card with processing, RDMA and GPUDirect capabilities, supporting multiple link technologies (1/10/40GbE and custom ones).
        We have demonstrated the effectiveness of our method by harvesting the computing power of latest-generation NVIDIA Pascal and Volta GPUs and of the FPGA hosted by NaNet to build in real time refined physics-related primitives for the RICH detector, as the knowledge of the Cherenkov ring parameters allows building more stringent conditions for data selection at low trigger level.
        Recent results collected during NA62 runs, along with a detailed description of the latest developments in the NaNet architecture and an insight into future project developments, are presented and discussed.

        Speaker: Alessandro Lonardo (Sapienza Universita e INFN, Roma I (IT))
    • T2 - Offline computing: S6 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Gene Van Buren (Brookhaven National Laboratory)
      • 454
        Conditions and Alignment extensions to the DD4hep Detector Description Toolkit

        The detector description is an essential component to analyse data resulting from particle collisions in high energy physics experiments.
        The interpretation of these data from particle collisions typically requires more long-lived data which describe in detail the state of the experiment itself. Such accompanying data include alignment parameters, the electronics calibration and its organization, the environmental parameters needed for the data processing, etc. We present a mechanism to manage these data simultaneously in multiple versions depending on their validity.
        The detector conditions data are made available to the physics algorithms through a number of transient objects grouped into collections. Such a collection represents a coherent slice of all conditions data necessary to process one or several events, the validity interval of the slice being the intersection of the validity intervals of the individual conditions.
        A multi-threaded application may hold several such collections in parallel depending on the time-stamps of the events currently processed. Once prepared, these collections are read-only and can easily be shared between threads with minimal requirements for locking and hence minimal overhead.
        We deliberately restrained ourselves from providing a persistent data solution, which in the past was a field of expertise of the experiments, but rather provide the necessary hooks to populate these conditions collections. We will present the use-cases that have driven the development, the main design choices and details of the implementation. Finally we describe how it is customized to meet the specific needs of an experiment.
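        As a small illustration of the validity bookkeeping described above (not DD4hep code), the interval of validity of a conditions slice can be computed as the intersection of the intervals of the individual conditions it contains:

            def slice_iov(condition_iovs):
                """condition_iovs: iterable of (first_valid, last_valid) run or time ranges."""
                starts, ends = zip(*condition_iovs)
                start, end = max(starts), min(ends)
                if start > end:
                    raise ValueError("conditions have no common validity interval")
                return (start, end)

            # e.g. alignment valid for runs 100-200, calibration for runs 150-300:
            print(slice_iov([(100, 200), (150, 300)]))   # -> (150, 200)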

        Speaker: Dr Markus Frank (CERN)
      • 455
        A vectorization approach for multifaceted solids in VecGeom

        VecGeom is a multi-purpose geometry library targeting the optimisation of the 3D solid algorithms used extensively in particle transport and tracking applications. As a particular feature, the implementations of these algorithms are templated on the input data type and are explicitly vectorised using the VecCore library in the case of SIMD vector inputs. This provides additional performance for applications supporting a multi-particle flow, such as the GeantV prototype. VecGeom has been a fully-supported option in Geant4 since release 10.2, replacing on demand the functionality provided by the native Geant4 solids. In this mode, VecGeom is called in scalar mode and SIMD benefits can be achieved by internal vectorization of the geometry algorithms that allow it. This approach has proven to bring very large benefits for tessellated solids represented in terms of triangular facets. To expose more vectorization in the scalar mode we have extended the approach used for the triangular tessellations to other multi-faceted shapes, such as extruded polygons, polyhedra and different trapezoids. The paper will present the strategy used to vectorise the different processing phases for tessellated solids, the performance improvements compared to the previous scalar implementations for the other solids using this approach, and how this is reflected in Geant4 simulations using VecGeom as the geometry engine.

        Speaker: Mihaela Gheata (Institute of Space Science (RO))
      • 456
        Experience of the Development of the Geometry Database for the CBM Experiment

        This paper is dedicated to the current state of the Geometry Database (Geometry DB) for the CBM experiment. The Geometry DB is an information system that supports the CBM geometry. Its main aims are to provide storage of the CBM geometry, convenient tools for managing the geometry modules, assembly of various versions of the CBM setup as a combination of geometry modules and additional files, and support of the various versions of the CBM setup. The development is based on the analyzed users’ requirements (formulated in a User Requirements Document) and takes into account the peculiarities of the workflow for the simulation of particle transport through the setup. Both a Graphical User Interface (GUI) and an Application Programming Interface (API) are available to the members of the CBM collaboration.

        Speaker: Irina Filozova (Joint Institute for Nuclear Research (RU))
      • 457
        Upgrade of ATLAS data quality monitoring for multithreaded reconstruction

        ATLAS is embarking on a project to multithread its reconstruction software in time for use in Run 3 of the LHC. One component that must be migrated is the histogramming infrastructure used for data quality monitoring of the reconstructed data. This poses unique challenges due to its large memory footprint which forms a bottleneck for parallelization and the need to accommodate relatively inexperienced developers. We discuss several possibilities evaluated for the upgraded framework.

        Speaker: Peter Onyisi (University of Texas at Austin (US))
      • 458
        The Data Quality Monitoring Software for the CMS experiment at the LHC: past, present and future

        The Data Quality Monitoring Software is a central tool in the CMS experiment. It is used in the following key environments: 1) Online, for real-time detector monitoring; 2) Offline, for prompt offline feedback and the final fine-grained data quality analysis and certification; 3) Validation of all reconstruction software production releases; 4) Validation of Monte Carlo productions. Though the basic structure of the Run 1 DQM system remains the same for Run 2, between the Run 1 and Run 2 periods the DQM system underwent substantial upgrades in many areas, not only to adapt to changes in the surrounding infrastructure, but also to provide improvements meeting the growing needs of the collaboration, with an emphasis on more sophisticated methods for evaluating data quality. We need to cope with the higher-energy and higher-luminosity proton-proton collision data, as well as with data from various special runs, such as Heavy Ion runs. In this contribution, we describe the current DQM software, structure and workflow in the different environments. We then discuss the performance and our experiences with the DQM system in Run 2. The main technical challenges encountered during Run 2 and the adopted solutions will also be discussed, including the efficient use of memory in a multithreaded environment. Finally, we present the prospects for a future DQM upgrade, with emphasis on functionality and long-term robustness for LHC Run 3.

        Speaker: Marcel Andre Schneider (CERN)
      • 459
        Testing and verification of the LHCb Simulation

        Monte-Carlo simulation is a fundamental tool for high-energy physics experiments, from the design phase to data analysis. In recent years its relevance has increased due to the ever-growing precision of measurements. Accuracy and reliability are essential features of simulation and are particularly important in the current phase of the LHCb experiment, where physics analysis and preparation for data taking with the upgraded detector need to be performed at the same time.
        In this talk we will give an overview of the full chain of tests and procedures implemented for the LHCb simulation software stack to ensure the quality of its results. The tests comprise simple checks to validate new software contributions in a nightlies system, as well as more elaborate checks probing simple physics and software quantities for performance and regression verification. Commissioning a new major version of the simulation software for production also implies checking its effects on a few physics analyses. A new system for Simulation Data Quality (SimDQ), which is being put in place to help in the first phase of commissioning and for fast verification of all samples produced, will also be discussed.

        Speaker: Dmitry Popov (Max-Planck-Gesellschaft (DE))
    • T3 - Distributed computing: Experiment Frameworks and Operational Experiences (2) Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: Julia Andreeva (CERN)
      • 460
        The XENON1T Data Distribution and Processing Scheme

        The XENON Dark Matter experiment is looking for non-baryonic particle Dark Matter in the universe. The demonstrator is a dual-phase time projection chamber (TPC), filled with a target mass of ~2000 kg of ultra-pure liquid xenon. The experimental setup is operated at the Laboratori Nazionali del Gran Sasso (LNGS).
        We present here a full overview of the computing scheme for data distribution and job management in XENON1T. The software package Rucio, which is developed by the ATLAS collaboration, handles the data on the connected grid storage elements, which are part of the Open Science Grid (OSG) and the European Grid Infrastructure (EGI). A copy on dedicated tape storage at the Centre for High Performance Computing (PDC) is handled by the Tivoli Storage Manager (TSM). Data reduction and Monte Carlo production are handled by CI Connect, which is integrated in the OSG network. This job submission system connects resources at EGI, OSG, SDSC's Comet and campus HPC resources for distributed computing.
        The success achieved with the XENON1T computing scheme is also the starting point for its successor experiment, XENONnT, which will start taking data in autumn 2019.

        Speaker: Boris Bauermeister (Stockholm University)
      • 461
        Harvester : an edge service harvesting heterogeneous resources for ATLAS

        The Production and Distributed Analysis (PanDA) system has been successfully used in the ATLAS experiment as a data-driven workload management system. The PanDA system has proven to be capable of operating at the Large Hadron Collider data-processing scale over the last decade, including the Run 1 and Run 2 data-taking periods. PanDA was originally designed to be weakly coupled with the WLCG processing resources. Lately the system has revealed difficulties in optimally integrating and exploiting new resource types, such as HPC and preemptable cloud resources with instant spin-up, and new workflows such as the event service, because their intrinsic nature and requirements are quite different from those of traditional grid resources. Therefore, a new component, Harvester, has been developed to mediate the control and information flow between PanDA and the resources, in order to enable more intelligent workload management and dynamic resource provisioning based on detailed knowledge of resource capabilities and their real-time state. Harvester has been designed around a modular structure to separate core functions and resource-specific plugins, simplifying the operation with heterogeneous resources and providing a uniform monitoring view. This talk will give an overview of the Harvester architecture, its advanced features, its current status with various resources, and future plans.

        Speaker: Tadashi Maeno (Brookhaven National Laboratory (US))
      • 462
        Using PanDA WMS for LSST Simulations on Distributed Infrastructure

        A goal of the LSST (Large Synoptic Survey Telescope) project is to conduct a 10-year survey of the sky that is expected to deliver 200 petabytes of data after it begins full science operations in 2022. The project will address some of the most pressing questions about the structure and evolution of the universe and the objects in it. It will require a large amount of simulation to understand the collected data.
        In this talk we will discuss the use of the PanDA Workload Management System (WMS) for LSST photon simulations, which model the atmosphere, optics and camera. The PanDA WMS was developed by the ATLAS Experiment at the LHC to manage data analysis and detector simulations on distributed and heterogeneous computing resources. It now runs ~2M jobs per day on hundreds of Grid sites and serves thousands of ATLAS users. PanDA processed about 1.5 exabytes of data in 2017. PanDA is also used for projects and experiments outside ATLAS and is expected to be suitable for the computational needs of the LSST Dark Energy Science Collaboration (DESC).
        For running LSST simulations with the PanDA WMS we have established a distributed testbed infrastructure that employs the resources of several sites on GridPP and the Open Science Grid as well as the Titan supercomputer at ORNL. In order to submit jobs to these sites we have used a PanDA server instance deployed on the Amazon AWS Cloud. We will present results of running realistic LSST simulations with this setup and discuss future plans for the integration of PanDA with the LSST computational infrastructure.

        Speaker: Pavlo Svirin (Brookhaven National Laboratory (US))
      • 463
        The Cherenkov Telescope Array production system for data-processing and Monte Carlo simulation

        The Cherenkov Telescope Array (CTA) is the next-generation instrument in the field of very high energy gamma-ray astronomy. It will be composed of two arrays of Imaging Atmospheric Cherenkov Telescopes, located at La Palma (Spain) and Paranal (Chile). The construction of CTA has just started with the installation of the first telescope on site at La Palma, and the first data are expected by the end of 2018. The scientific operations should begin in 2022 for a duration of about 30 years. The overall amount of data produced during these operations is about 27 PB per year. The associated computing power for data processing and Monte Carlo (MC) simulations is of the order of hundreds of millions of CPU HS06 hours per year.
        In order to cope with these high computing requirements, we have developed a production system prototype based on the DIRAC framework, which we have intensively exploited during the past 5 years to handle the massive MC simulations on the grid needed for the CTA design and prototyping phases.
        However, CTA workflows are quite complex and composed of several interdependent steps, which we used to handle separately within our production system. In order to fully automate the whole workflow execution, we have partially revised the production system, further enhancing its data-driven behaviour and extending the use of meta-data to link together the different steps of a workflow.
        In this contribution we present the application of the production system to the MC campaigns of the last years, as well as the recent production system evolution, intended to obtain a fully data-driven and automated workflow execution for an efficient processing of real telescope data.

        Speaker: Luisa Arrabito
      • 464
        Multicore workload scheduling in JUNO

        The Jiangmen Underground Neutrino Observatory (JUNO) is a multipurpose neutrino experiment which will start in 2020. To speed up JUNO data processing on multicore hardware, the JUNO software framework is introducing parallelization based on TBB. To support JUNO multicore simulation and reconstruction jobs in the near future, a new workload scheduling model has to be explored and implemented in the JUNO distributed computing system, which was built on DIRAC. Inside this model, the evolution of the pilot model from single-core to multi-core is the key issue. Two multicore pilot strategies are described and evaluated in this paper. One uses customized pilots whose size varies with the resource requirements of the payloads in the Task Queue. The other uses common pilots of equal size, which allow internal scheduling inside the allocated resources to accept more than one payload with various core requirements. With SLURM and cloud testbeds built, tests have been done to evaluate these two strategies and study their efficiency in JUNO use cases. The paper will also present an algorithm designed to avoid starvation of “big” jobs and to improve efficiency when a mix of jobs with various core counts is submitted.
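        As a hedged illustration of the second strategy (common, equal-size pilots with internal scheduling), the toy sketch below packs payloads with different core requirements into one pilot slot using a simple first-fit rule; all numbers and names are illustrative, not the DIRAC implementation.

            def schedule_payloads(pilot_cores, payload_queue):
                """Greedily fill one pilot of `pilot_cores` cores from a list of
                (payload_id, cores_needed) entries; return the accepted payload ids."""
                free = pilot_cores
                accepted, remaining = [], []
                for payload_id, cores in payload_queue:
                    if cores <= free:
                        accepted.append(payload_id)
                        free -= cores
                    else:
                        remaining.append((payload_id, cores))
                payload_queue[:] = remaining   # leave unscheduled payloads in the queue
                return accepted

            queue = [("sim-1", 8), ("rec-1", 4), ("sim-2", 8), ("rec-2", 1)]
            print(schedule_payloads(pilot_cores=16, payload_queue=queue))   # ['sim-1', 'rec-1', 'rec-2']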

        Speaker: Xiaomei Zhang (Chinese Academy of Sciences (CN))
      • 465
        Exploring GlideinWMS and HTCondor scalability frontiers for an expanding CMS Global Pool

        The CMS Submission Infrastructure Global Pool, built on GlideinWMS and HTCondor, is a worldwide distributed dynamic pool responsible for the allocation of resources for all CMS computing workloads. Matching the continuously increasing demand for computing resources by CMS requires an anticipated assessment of its scalability limitations. Extrapolating historical usage trends, by LHC Run III the CMS Global Pool must be able to manage, stably and efficiently, 0.5M CPU cores, about a factor of 2 above its current size. In addition, the Global Pool must be able to expand in a more heterogeneous environment, in terms of resource provisioning (combining Grid, HPC and Cloud) and workload submission. A dedicated testbed has been set up to simulate such conditions with the purpose of finding potential bottlenecks in the software or its configuration. This contribution will provide a thorough description of the various scalability dimensions in size and complexity that are being explored for the future Global Pool, along with the analysis of, and solutions to, the limitations found, proposed with the support of the GlideinWMS and HTCondor developer teams.

        Speaker: Antonio Perez-Calero Yzquierdo (Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas)
    • T4 - Data handling: S6 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Convener: Elizabeth Gallas (University of Oxford (GB))
      • 466
        Evolution of the open-source data management system Rucio for LHC Run-3 and beyond ATLAS

        Rucio, the distributed data management system of the ATLAS collaboration, already manages more than 330 Petabytes of physics data on the grid. Rucio has seen incremental improvements throughout LHC Run-2 and is currently being prepared for the HL-LHC era of the experiment. Alongside these improvements, the system is currently evolving into a full-scale generic data management system for applications beyond ATLAS, or even beyond high energy physics. This contribution focuses on the development roadmap of Rucio for LHC Run-3, covering event-level data management, generic meta-data support, and increased usage of networks and tapes. At the same time Rucio is evolving beyond the original ATLAS use-case. This includes authentication beyond the WLCG ecosystem, generic database compatibility, deployment and packaging of the software stack in containers, and a project paradigm shift to a full-scale open source project.

        Speaker: Martin Barisits (CERN)
      • 467
        Data management for the SoLid experiment

        The SoLid experiment is a short-baseline neutrino project located at the BR2 research reactor in Mol, Belgium. It started data taking in November 2017. Data management, including long term storage will be handled in close collaboration by VUB Brussels, Imperial College London and Rutherford Appleton Laboratory (RAL).
        The data management system makes the data available for analysis on the Grid and interactively via xrootd at VUB Brussels and Imperial College, and archives them at RAL for safe keeping. As an additional requirement, as the SoLid experiment uses the GridPP DIRAC instance as a resource broker and file catalogue, this also has to be integrated into the data management system.
        In this presentation we discuss the overall setup of the SoLid data management model, with an emphasis on the software developed for the file distribution on the Grid, the data archiving and the initial data transfer from the experiment to a Grid storage element. We present results from the first six months of data taking, showing how the system performed in a production setting. Finally we discuss how the software developed can be useful to other communities.

        Speaker: Janusz Martyniak
      • 468
        Architecture and prototype of a WLCG data lake for HL-LHC

        The computing strategy document for HL-LHC identifies storage as one of the main WLCG challenges one decade from now. Under the naive assumption of applying today’s computing model, the ATLAS and CMS experiments will need one order of magnitude more storage resources than what could realistically be provided by the funding agencies at the same cost as today. The evolution of the computing facilities and the way storage will be organized and consolidated will play a key role in how this possible shortage of resources will be addressed. In this contribution we will describe the architecture of a WLCG data lake, intended as a storage service geographically distributed across large data centers connected by fast networks with low latency. We will present the experience with our first prototype, showing how the concept, implemented at different scales, can serve different needs, from regional and national consolidation of storage to an international data provisioning service. We will highlight how the system leverages its distributed nature, the economy of scale and different classes of storage to optimise the hardware and operational cost, through a set of policy-driven decisions concerning data placement and data retention. We will discuss how the system leverages or interoperates with existing federated storage solutions. We will finally describe the possible data processing models in this environment and present our first benchmarks.

        Speaker: Simone Campana (CERN)
      • 469
        BESIII Data Management System

        The BESIII experiment has taken data for more than ten years, and about fifty thousand runs have been recorded. Managing such a large volume of data is a big challenge for us. Over the years, we have created an efficient and complete data management system, including a MySQL database, a C++ API, a BookKeeping system, monitoring applications, etc. I will focus on introducing the structure and workflow of the BESIII central database management system, the workflow of the BookKeeping system, and the importance of the monitoring applications.

        Speaker: Ms Qiumei Ma (IHEP)
      • 470
        The data management of heterogeneous resources in Belle II

        The Belle II experiment at the SuperKEKB collider in Tsukuba, Japan, will start taking physics data in early 2018 and aims to accumulate 50/ab, or approximately 50 times more data than the Belle experiment. The collaboration expects it will manage and process approximately 200 PB of data.

        Computing at this scale requires efficient and coordinated use of the compute grids in North America, Asia and Europe and will take advantage of high-speed global networks.

        We present the general Belle II distributed data management system and results from the first phase of data taking for e+e- collisions.

        Speaker: Dr Malachi Schram (Pacific Northwest National Laboratory)
      • 471
        IceCube File Catalog

        IceCube is a cubic kilometer neutrino detector located at the south pole. Metadata for files in IceCube has traditionally been handled on an application-by-application basis, with no user-facing access. There has been no unified view of data files, and users often just ls the filesystem to locate files. Recently, effort has been put into creating such a unified view. Going for a simple solution, we created a user-facing REST API backed by a NoSQL database. All major data producers add their metadata to this central catalog. Challenges that were overcome include schema generation, containerized deployment, and scale testing.
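        The sketch below shows, in a hedged way, how a producer and a user might interact with such a REST catalog; the endpoint URL, field names and query parameters are placeholders invented for illustration, not the actual IceCube file catalog API.

            import requests

            CATALOG = "https://file-catalog.example.org/api/files"   # placeholder endpoint

            # A data producer registers a file together with its metadata:
            record = {"logical_name": "/data/exp/IceCube/2018/run00131000.i3.zst",
                      "locations": ["gsiftp://gridftp.example.org/data/run00131000.i3.zst"],
                      "run_number": 131000}
            requests.post(CATALOG, json=record).raise_for_status()

            # A user queries the catalog instead of running `ls` on the filesystem:
            matches = requests.get(CATALOG, params={"run_number": 131000}).json()
            print(len(matches), "files found for run 131000")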

        Speaker: Patrick Meade (University of Wisconsin-Madison)
    • T5 - Software development: S6.1 - Software Packaging, Build Systems - Part 2 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Maiken Pedersen (University of Oslo (NO))
      • 472
        Building, testing and distributing common software for the LHC experiments

        Building, testing and deploying coherent large software stacks is very challenging, in particular when they consist of the diverse set of packages required by the LHC experiments, the CERN Beams department and data analysis services such as SWAN. These software stacks include many packages (Grid middleware, Monte Carlo generators, Machine Learning tools, Python modules), all required for a large number of compilers, operating systems and hardware architectures.

        To address this challenge, we developed an infrastructure around a tool, called ‘lcgcmake’. Dedicated modules are responsible for building the packages, controlling the dependencies in a reliable and scalable way. The distribution relies on a robust automatic system, responsible for building and testing the packages, installing them on CernVM-FS and packaging the binaries in RPMs and tarballs. This system is orchestrated through Jenkins on build machines provided by the CERN Openstack facility. The results are published through user-friendly web pages.

        In this talk we will present an overview of this infrastructure, its tools and policies. We will also discuss the role of this effort within the HEP Software Foundation (HSF). Finally, we will discuss the evolution of the infrastructure towards container (Docker) technologies, and the future directions and challenges of the project.

        Speaker: Patricia Mendez Lorenzo (CERN)
      • 473
        Testing of complex, large-scale distributed storage systems: a CERN disk storage case study

        Complex, large-scale distributed systems are increasingly used to solve
        extraordinary computing, storage and other problems. However, the development
        of these systems usually involves several software components, the maintenance
        and improvement of large codebases, and a relatively large number of developers
        working together, so faults are inevitably introduced into the system. At the
        same time, these systems often perform important if not crucial tasks, so
        critical bugs and performance-hindering algorithms are not acceptable in the
        production state of the software and the system. Moreover, a larger number of
        developers can work more freely and productively when they receive constant
        feedback that their changes are still in harmony with the system requirements
        and other people’s work; this also greatly helps scaling out manpower, meaning
        that adding more developers to a project can actually result in more work done.

        In this paper we will go through the case study of EOS, the CERN disk storage
        system, and introduce methods for achieving fully automatic regression,
        performance and robustness testing as well as continuous integration for such
        a large-scale, complex and critical system using container-based environments.
        We will also pay special attention to the details and challenges of testing
        distributed storage and file systems.
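
        As an illustration of the container-based approach discussed above, the following is a minimal sketch (not the actual EOS test suite) of a harness that starts a disposable service container, runs a check inside it and always cleans up; the image name and the command run inside the container are placeholders.

        import subprocess
        import uuid

        def run(cmd):
            # run a command, fail loudly on error and return its stdout
            return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

        name = "storage-test-" + uuid.uuid4().hex[:8]
        run(["docker", "run", "-d", "--name", name, "example/storage-under-test:latest"])
        try:
            # a functional check executed inside the running container
            listing = run(["docker", "exec", name, "ls", "/"])
            assert listing.strip(), "container filesystem listing is empty"
        finally:
            run(["docker", "rm", "-f", name])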

        Speaker: Andrea Manzi (CERN)
      • 474
        Robust Linux Binaries

        The processing of HEP data relies on rich software distributions, made of experiment-specific software and hundreds of other software products, developed both by our community and outside it.
        Such software stacks are traditionally distributed on shared file systems as a set of coherently built packages. This has the benefit of reducing as much as possible any coupling with the libraries present on the system where HEP applications are executed.
        Such large projects require an appropriate shell environment, for example, on Linux, an LD_LIBRARY_PATH that allows a consistent set of binaries and shared libraries to be found. This solution can have side effects, for instance causing incompatibilities when system binaries find themselves linked against libraries incompatible with the expected ones.
        This presentation proposes a solution to improve the current setup, relying heavily on the usage of RPATH and the Gentoo build system. Despite their usefulness, RPATH-based builds have so far seen little to no use on Linux in High Energy Physics. We show how build systems can be adjusted for RPATH-based builds, what the result is, and how this enables trivial multiple installations of binaries and libraries with otherwise conflicting versions.
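
        As a small illustration of the idea, the sketch below (not part of the build system itself, and assuming GNU binutils is available) checks that the shared libraries of an installed stack carry an RPATH/RUNPATH entry, so that they can resolve their dependencies without relying on LD_LIBRARY_PATH.

        import subprocess
        import sys
        from pathlib import Path

        def has_rpath(lib):
            # return True if readelf reports an RPATH or RUNPATH dynamic entry
            proc = subprocess.run(["readelf", "-d", str(lib)],
                                  capture_output=True, text=True)
            if proc.returncode != 0:        # not a readable ELF file: ignore it
                return True
            return "(RPATH)" in proc.stdout or "(RUNPATH)" in proc.stdout

        prefix = Path(sys.argv[1])          # install prefix of the stack to check
        for lib in prefix.rglob("*.so*"):
            if lib.is_file() and not has_rpath(lib):
                print("no RPATH/RUNPATH:", lib)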

        Speaker: Guilherme Amadio (CERN)
    • T6 - Machine learning and physics analysis: S6 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Convener: Sofia Vallecorsa (Gangneung-Wonju National University (KR))
      • 475
        Large-Scale Distributed Training of Deep Neural Net Models

        In recent years, several studies have demonstrated the benefit of using deep learning to solve typical tasks related to high energy physics data taking and analysis. Building on these proofs of principle, many HEP experiments are now working on integrating deep learning into their workflows. The computational cost of inference with an already trained model is rather modest and does not usually need specific treatment. On the other hand, the training of neural network models requires a lot of data, especially for deep models with numerous parameters: the amount of data scales with the number of model parameters, which can reach billions, and grows further with the number of classification categories or the width of the regression range. Training such models has been made tractable by improved optimization methods and by the advent of GP-GPUs, which are well adapted to the highly parallelizable task of training neural networks. Despite these advancements, training large models over large datasets can take days to weeks. To get the best out of this new technology, it is important to scale up the available network-training resources and, consequently, to provide tools for optimal large-scale training. Neural networks are typically trained using various stochastic methods based on gradient descent. One avenue to further accelerate the training is data parallelism, where the gradients are computed on multiple subsets of the data in parallel and used collectively to update the model towards the optimum parameters. Several frameworks exist for performing such distributed training, including a framework already developed by the authors, each with its strengths and limitations. In this context, we describe the development of a new training workflow that scales on multi-node/multi-GPU architectures, with an eye to deployment on high performance computing machines. Old and new frameworks are put to a benchmark test on a few HEP-specific examples and results are presented.
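
        The data-parallel scheme sketched above can be illustrated with a minimal MPI example; the sketch below assumes mpi4py and numpy, and the model, loss and data-loading helpers are hypothetical placeholders: each rank computes gradients on its own data shard and the gradients are averaged with an Allreduce before every parameter update.

        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        params = np.zeros(1000)                     # flattened model parameters
        batches = load_local_shard(rank)            # hypothetical helper: this rank's data shard

        for x, y in batches:
            local_grad = compute_gradient(params, x, y)    # hypothetical helper
            global_grad = np.empty_like(local_grad)
            comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
            params -= 0.01 * global_grad / size            # step along the averaged gradient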

        Speaker: Felice Pantaleo (CERN)
      • 476
        Training Generative Adversarial Models over Distributed Computing Systems

        In the field of High Energy Physics, the simulation of the interaction of particles with the material of calorimeters is a computing-intensive task, even more so with complex and fine-grained detectors. The complete and most accurate simulation of particle/matter interactions is essential when calibrating and understanding the detector at the very low level, but is seldom required at the physics analysis level, where several levels of detector effects smear and hide slight imperfections in the simulation. Some level of approximation in the simulation is therefore acceptable, and one can move on to implementing fast-simulation surrogate models, which have the advantage of being less computationally intensive. Going further than the parameterized simulations already implemented in HEP experiments, we present a fast simulation based on generative adversarial networks. The model is constructed from a conditional generative model of the calorimeter response and a discriminative model that distinguishes between generated and ground-truth energy depositions, trained in an adversarial manner. We use a dataset composed of the energy depositions from electrons, photons, and charged and neutral hadrons in a fine-grained digital calorimeter. The training of these deep neural network models is quite computing intensive, even with the help of GP-GPUs, and we propose a method to train them over multiple nodes and GP-GPUs using a standard message passing interface. We report on the scaling of the time-to-solution. Further tuning of the model hyper-parameters is rendered tractable, and we present the physics performance of the best model obtained via Bayesian optimization using Gaussian processes. We demonstrate how high-performance computing centers can be utilized to globally optimize these kinds of models.

        Speaker: Jean-Roch Vlimant (California Institute of Technology (US))
      • 477
        Scaling studies for deep learning in LArTPC event classification

        Measurements in LArTPC neutrino detectors feature high fidelity and result in large event images. Deep learning techniques have been extremely successful in classification tasks on photographs, but their application to LArTPC event images is challenging due to the large size of the events, two orders of magnitude larger than the images found in classical challenges like MNIST or ImageNet. This leads to extremely long training cycles, which slow down the exploration of new network architectures and the hyperparameter scans needed to improve the classification performance.
        We present studies of scaling an LArTPC classification problem on multiple architectures, spanning multiple nodes. The studies are carried out on simulated events in the MicroBooNE detector.

        Speaker: Jan Fridolf Strube
      • 478
        Fitting and Modeling in ROOT

        The ROOT Mathematical and Statistical libraries have been recently improved to facilitate the modelling of parametric functions that can be used for performing maximum likelihood fits to data sets to estimate parameters and their uncertainties.
        We report here on the new functionality of the ROOT TFormula and TF1 classes to build these models in a convenient way for the users. We show how function objects, represented in ROOT by TF1 classes, can be used as probability density functions and how they can be combined, via an addition operator, to perform an extended likelihood fit of several normalised components. We also describe how the functions can be used to perform convolutions, using the Fast Fourier transforms provided in ROOT by the FFTW3 package. Furthermore, we present new capabilities of the fitting algorithms for dealing with new data classes such as the ROOT 7 histograms.
        Finally, we report on the improvements in the performance of the ROOT fitting algorithm, by using SIMD vectorisation in evaluating the model function on large data sets and by exploiting multi-thread or multi-process parallelisation in computing the likelihood function.
        We present these performance results by fitting some standard benchmark data sets and we compare the obtained results with other fitting software tools provided by ROOT such as RooFit.
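
        A minimal PyROOT sketch, assuming a standard ROOT installation and using toy data, of the kind of model building described above: a TF1 built from a formula that adds a Gaussian signal and an exponential background, fitted to a histogram with a binned likelihood fit.

        import ROOT

        # toy data: fill a histogram from a Gaussian centred in the fit range
        gen = ROOT.TF1("gen", "gaus", 0.0, 10.0)
        gen.SetParameters(1.0, 5.0, 0.8)
        h = ROOT.TH1D("h", "toy data", 100, 0.0, 10.0)
        h.FillRandom("gen", 10000)

        # model: Gaussian signal (parameters 0-2) plus exponential background (parameters 3-4)
        model = ROOT.TF1("model", "gaus(0) + expo(3)", 0.0, 10.0)
        model.SetParameters(100.0, 5.0, 1.0, 1.0, -0.5)

        h.Fit(model, "L")                           # binned likelihood fit
        for i in range(model.GetNpar()):
            print(model.GetParName(i), model.GetParameter(i), model.GetParError(i))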

        Speaker: Lorenzo Moneta (CERN)
      • 479
        GPU analyses on the grid

        Analyses of multi-million event datasets are natural candidates to exploit the massive parallelisation available on GPUs. This contribution presents two such approaches to measure CP violation and the corresponding user experience.

        The first is the energy test, which is used to search for CP violation in the phase-space distribution of multi-body hadron decays. The method relies on a pairwise comparison of all events, hence introducing a quadratic dependence on the sample size, which parallelisation on the GPU can compensate for.
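
        To make the quadratic cost explicit, the snippet below gives a schematic CPU version of such a pairwise comparison in numpy; the Gaussian distance weight and the normalisation are illustrative rather than the exact definition used in the analysis, and the GPU implementation parallelises exactly this kind of all-pairs kernel sum.

        import numpy as np

        def kernel_sum(a, b, sigma):
            # sum of exp(-d^2 / 2 sigma^2) over all pairs of points from samples a and b
            d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
            return np.exp(-d2 / (2.0 * sigma ** 2)).sum()

        def test_statistic(sample1, sample2, sigma=0.2):
            n1, n2 = len(sample1), len(sample2)
            return (kernel_sum(sample1, sample1, sigma) / (2 * n1 * n1)
                    + kernel_sum(sample2, sample2, sigma) / (2 * n2 * n2)
                    - kernel_sum(sample1, sample2, sigma) / (n1 * n2))

        # toy usage with two samples of phase-space coordinates
        rng = np.random.default_rng(0)
        a = rng.normal(size=(1000, 2))
        b = rng.normal(size=(1000, 2))
        print(test_statistic(a, b))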

        The second method is the time-dependent amplitude analysis with the GooFit package. With the complex amplitude model the number of fit parameters is typically in excess of 50, making this a great computational challenge on very large datasets.

        Both methods have been submitted via Dirac to the Manchester Analysis Facility GPU cluster. The contribution will report on the performance of running GPU jobs on the grid as well as on the performance of the two algorithms. Both analyses have been performed on the large datasets of the LHCb experiment. The gain in computing time is up to about one order of magnitude, without which the analyses would not be feasible on CPUs.

        Speaker: Chris Burr (University of Manchester (GB))
      • 480
        Deployment of a Matrix Element Method code for the ttH channel analysis on GPU's platform

        In proton-proton collisions at the LHC, the associated production of the Higgs boson with two top quarks has not yet been observed. This ttH channel allows the coupling of the Higgs boson to the top quark to be probed directly. The observation of this process could be a highlight of the ongoing Run 2 data taking.

        Unlike supervised methods (neural networks, decision trees, support vector machines, …), the Matrix Element Method (MEM) classifies events by computing the probability that a given final state occurs according to the physics laws involved. This sophisticated method is however very CPU-time consuming, as it explores all possible final states, and it requires a very powerful computing platform to perform the CMS analyses carried out at our laboratory in a reasonable time.

        The Matrix Element Method is based on the computation of high-dimensional integrals. We will describe how we developed and deployed our MEM production code on GPU platforms, and in particular how we adapted the main components of these computations into OpenCL kernels: VEGAS, the MadGraph-generated code computing the matrix element terms, and the parton distribution function calculations from LHAPDF.
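
        As a schematic illustration of mapping integrand evaluations to OpenCL kernels (not the actual MEM code), the sketch below uses pyopencl to evaluate a toy weight on a large batch of phase-space points in parallel; the kernel body is a placeholder for the real matrix-element and PDF computations.

        import numpy as np
        import pyopencl as cl

        ctx = cl.create_some_context()
        queue = cl.CommandQueue(ctx)

        kernel_src = """
        __kernel void integrand(__global const float *x, __global float *out, const uint dim)
        {
            uint gid = get_global_id(0);
            float w = 1.0f;                  /* toy stand-in for a matrix-element weight */
            for (uint d = 0; d < dim; ++d) {
                float v = x[gid * dim + d];
                w *= exp(-v * v);
            }
            out[gid] = w;
        }
        """
        prg = cl.Program(ctx, kernel_src).build()

        n_points, dim = 1000000, 4           # one weight per phase-space point
        x = np.random.rand(n_points, dim).astype(np.float32)
        out = np.empty(n_points, dtype=np.float32)

        mf = cl.mem_flags
        x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
        out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

        prg.integrand(queue, (n_points,), None, x_buf, out_buf, np.uint32(dim))
        cl.enqueue_copy(queue, out, out_buf)
        print("toy integral estimate:", out.mean())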

        We will conclude by discussing the gain obtained on GPU platforms (> 20 GPUs) compared with classical CPU platforms.

        Speaker: Gilles Grasseau (Centre National de la Recherche Scientifique (FR))
    • T8 - Networks and facilities: S6 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Conveners: Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecnológicas), Oksana Shadura (University of Nebraska Lincoln (US))
      • 481
        INFN Tier-1: a distributed site

        The INFN Tier-1 center at CNAF was extended in 2016 and 2017 to include a small amount of resources (~24 kHS06, corresponding to ~10% of the CNAF pledges for LHC in 2017) physically located at the Bari-ReCas site (~600 km from CNAF).
        In 2018, a significant fraction of the CPU power (~170 kHS06, equivalent to ~50% of the total CNAF pledges) is going to be provided via a collaboration with the PRACE Tier-0 CINECA center (a few km from CNAF), thus building a truly geographically distributed center over the WAN.
        The two sites are going to be interconnected via a high-bandwidth link (400-1000 Gb/s) in order to ensure transparent access to data resident on storage at CNAF; the latency between the centers is low enough not to require particular caching strategies. In this contribution we describe the issues and the results of the production configuration, with a focus both on the management aspects and on the performance as seen by the users.

        Speaker: Luca dell'Agnello (INFN)
      • 482
        Dynamic Integration and Management of Opportunistic Resources for HEP

        Experience to date indicates that the demand for computing resources in high energy physics shows a highly dynamic behaviour, while the resources provided by the WLCG remain static over the year. It has become evident that opportunistic resources such as High Performance Computing (HPC) centers and commercial clouds are very well suited to cover peak loads. However, the utilization of these resources leads to new levels of complexity: resources need to be managed highly dynamically, HEP applications require a very specific software environment that is usually not provided at opportunistic resources, and network bandwidth limitations can cause I/O-intensive workflows to run inefficiently.

        The key component for successfully running HEP applications in a dynamic fashion on opportunistic resources is the utilization of modern container and virtualization technologies. Based on these technologies, the Karlsruhe Institute of Technology (KIT) has developed the resource scheduler ROCED to dynamically integrate and manage a variety of opportunistic resources. In combination with the HTCondor batch system, a powerful single entry point to all available computing resources has been created, leading to a seamless and transparent integration of opportunistic resources into HEP computing at KIT.

        KIT is currently further improving the resource management and job scheduling by taking into account I/O requirements of individual workflows as well as the available network bandwidth. In this talk an overview of the utilized technologies, the dynamic management and integration of resources as well as the status of the I/O-based resource and job scheduling is given.

        Speaker: Matthias Jochen Schnepf (KIT - Karlsruhe Institute of Technology (DE))
      • 483
        A Feasibility Study about Integrating HTCondor Cluster Workload with SLURM Cluster Workload

        Two production clusters co-exist at the Institute of High Energy Physics (IHEP). One is a High Throughput Computing (HTC) cluster with HTCondor as the workload manager; the other is a High Performance Computing (HPC) cluster with SLURM as the workload manager. The resources of the HTCondor cluster are provided by multiple experiments, and the resource utilization has reached more than 90% by adopting a dynamic resource-sharing mechanism. Nevertheless, there will be a bottleneck if more resources are requested by multiple experiments at the same moment. On the other hand, parallel jobs running on the SLURM cluster have some specific attributes, such as a high degree of parallelism, low job counts and long wall times. Such attributes make it easy to generate free resource slots that are suitable for jobs from the HTCondor cluster. As a result, a mechanism to schedule jobs from the HTCondor cluster to the SLURM cluster transparently would improve the resource utilization of both clusters. HTCondor provides HTCondor-C to schedule jobs to other clusters managed by different workload managers, for example SLURM. However, this is not sufficient if we want to decide which jobs are allowed to be scheduled by SLURM, and when and where. In addition, how to manage the re-scheduled jobs running on the SLURM cluster is a problem. Furthermore, the design philosophies and application scenarios of HTCondor and SLURM differ, and a large number of jobs submitted in a short period may bring extra scheduling load to SLURM. In this paper, after a brief background introduction, we describe the problems of integrating the two cluster workloads and present possible solutions to these problems.

        Speaker: Ran Du
      • 484
        LHCb High Level Trigger in remote IT datacentre

        While the LHCb experiment will be using a local data centre at the experiment site for its computing infrastructure in Run 3, LHCb is also evaluating the possibility of moving its High Level Trigger server farm into an IT data centre located a few kilometres away from the LHCb detector. If proven feasible, and if it could be replicated by other LHC experiments, the solution would allow the construction of a common HLT facility sharing IT hardware resources, with substantial cost savings.
        The main challenges are: the transmission of a very large amount of data over a scarce number of fibre connections; the relatively long distance to be covered; and the use of shared, general-purpose computer servers.
        This paper describes the set up of the experiment, the different technologies considered and used, the challenges encountered and how they have been tackled.
        The paper also tries to give guidelines and recommendations for the implementation of a solution that could be shared by all the LHC experiments.

        Speaker: Edoardo Martelli (CERN)
      • 485
        A prototype for the ALICE Analysis Facility at GSI

        The ALICE computing model for Run 3 foresees a few big centres, called Analysis Facilities, optimised for fast processing of large local sets of Analysis Object Data (AODs). Contrary to the current running of analysis trains on the Grid, this will allow for more efficient execution of inherently I/O-bound jobs. GSI will host one of these centres and has therefore finalised a first Analysis Facility prototype in order to prepare the conversion and expansion of the current ALICE Tier-2 centre. In order to meet the new and enhanced requirements, the facility will need, in particular, to allow for an aggregate throughput of up to 100 GB/s.
        We describe how the use of in-house developed XRootD plug-ins significantly improves the I/O performance of local file operations. Some solutions to automate and validate the creation of container images to be used in production are also presented. They constitute a reliable workflow to transparently run ALICE jobs within a controlled environment, independent of the underlying availability of software libraries, setting of environment variables or updating of software packages.

        Speaker: Kilian Schwarz (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 486
        CMS Site in a Box: Deploying a Tier-3 Site using Local Resources and Central Services via a Centrally Managed Server

        Even as grid middleware and analysis software have matured over the course of the LHC's lifetime, it is still challenging for non-specialized computing centers to contribute resources. Many U.S. CMS collaborators would like to set up Tier-3 sites to contribute campus resources for the use of their local CMS group as well as the collaboration at large, but find setting up and operating full OSG compute and storage elements too large a burden for their available effort. In order to reduce the operational cost and expertise overhead needed to provide a functioning Tier-3 access point to compute resources, the USCMS Tier-3 support team has sought to adapt and expand the successful Site-in-a-Box model utilized by the Pacific Research Platform into a more generic support structure. This model assumes the onsite presence of a centrally managed, single server serving as an XRootD server, redirector and cache, a platform for various HTCondor tasks and a Squid cache for CVMFS. This server is coupled with a dedicated GlideInWMS pool to facilitate job submission and optionally an OSG Hosted-CE service, depending on the needs of the site. We discuss the technical details of this approach as well as specific instances of deployed servers, highlighting commonalities and differences between current deployments.

        Speaker: Carl Lundstedt (University of Nebraska Lincoln (US))
    • T5 - Software development: S6.2 - Continuous Integration Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Maiken Pedersen (University of Oslo (NO))
      • 487
        LHCb continuous integration and deployment system: a message based approach

        The LHCb physics software has to support the analysis of data taken up to now and at the same time is under active development in preparation for the detector upgrade coming into operation in 2021. A continuous integration system is therefore crucial to maintain the quality of the ~6 million lines of C++ and Python, to ensure consistent builds of the software as well as to run the unit and integration tests. As already presented, the Jenkins automation server is used for this purpose. It builds and tests around 100 configurations and produces on the order of 1500 build artifacts per day, which are installed on the CVMFS file system or potentially on the developers' machines.

        Faced with a large and growing number of configurations built every day, and in order to ease interoperation between the continuous integration system and the developers we decided to put in place a flexible messaging system. As soon as the build artifacts have been produced, the distributed system allows their deployment based on the priority of the configurations.

        We will describe the architecture of the new system, which is based on the RabbitMQ messaging system (and the pika Python client library) and uses priority queues to start the LHCb software integration tests and to drive the installation of the nightly builds on the CVMFS file system. We will also show how the introduction of an event-based system can help with the communication of results to developers.
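
        The priority-queue pattern can be sketched in a few lines of pika; the queue name, host and message format below are illustrative rather than the production ones: the queue is declared with a maximum priority and each published message carries the priority that drives the order of processing.

        import json
        import pika

        connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        channel = connection.channel()

        # declare a durable queue supporting message priorities 0-10
        channel.queue_declare(queue="nightly-builds", durable=True,
                              arguments={"x-max-priority": 10})

        artifact = json.dumps({"slot": "lhcb-head", "platform": "x86_64-centos7-gcc8-opt"})
        channel.basic_publish(exchange="",
                              routing_key="nightly-builds",
                              body=artifact,
                              properties=pika.BasicProperties(priority=8, delivery_mode=2))
        connection.close()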

        Speaker: Stefan-Gabriel Chitic (CERN)
      • 488
        Monitoring LHCb Trigger developments using nightly integration tests and a new interactive web UI

        The LHCb Performance Regression (LHCbPR) framework allows for periodic software testing to be performed in a reproducible manner.
        LHCbPR provides a JavaScript based web front-end service, built atop industry standard tools such as AngularJS, Bootstrap and Django (https://lblhcbpr.cern.ch).
        This framework records the evolution of tests over time allowing for this data to be extracted for end-user analysis.

        The LHCbPR framework has been expanded to integrate the nightly testing and profiling.
        These developments allow for key performance metrics within the Trigger software to be monitored over time.
        Additionally, tests of the full physics reconstruction have been integrated into LHCbPR.
        These allow for tracking the effect that optimization work has on physics reconstruction performance.

        This presentation will describe the integration of these tests into LHCbPR as well as describing the structure and new features developed for the frontend web service.

        Speaker: Dr Robert Andrew Currie (The University of Edinburgh (GB))
      • 489
        Continuous Integration service at Fermilab

        This paper describes the current architecture of the Continuous Integration (CI) service developed at Fermilab, the successes and difficulties encountered, and future development plans. Current experiment code has hundreds of contributors who provide new features, bug fixes, and other improvements. Version control systems help developers collaborate in contributing software for their experiments, while the CI system helps developers keep their code healthy. The Fermilab CI service allows experiments and projects to test and validate their offline production and analysis code on the supported platforms. It is built on top of Jenkins, is designed to be set up from a configuration file that provides an implementation for each phase of the CI workflow, and is able to validate experiment code through grid jobs. The CI service provides a dashboard for easy access to logs and statistical graphs. Since its adoption by Fermilab experiments and projects, it has proved very useful for intercepting issues in their code early and getting them fixed before running in production. Currently the CI service is in use by the ArgoNeuT, DUNE, g-2, LArIAT, MINERvA, mu2e, NOvA, SBND and uBooNE experiments and by the following projects: the ART and LArSoft software suites, GENIE, and GlideinWMS. The CI service is under active development, with plans to support code profiling.

        Speaker: Dr Vito Di Benedetto (Fermi National Accelerator Lab. (US))
    • 12:30
      Lunch break
    • T1 - Online computing: S7 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Conveners: Adriana Telesca (CERN), Clara Gaspar (CERN)
      • 490
        ALICE Overwatch: Online monitoring and data quality assurance using HLT data

        ALICE Overwatch is a project started in late 2015 to provide augmented online monitoring and data quality assurance utilizing time-stamped QA histograms produced by the ALICE High Level Trigger (HLT). The system receives the data via ZeroMQ, storing it for later review, enriching it with detector-specific functionality, and visualizing it via a web application. These capabilities are complementary to the existing Data Quality Monitoring (DQM) system. In addition to basic visualization, additional processing options are accessible to the user, including requests for data within a particular time range or reprocessing of a particular run with different processing parameters. For example, the first ten minutes of a particular run could be investigated for a transient hot trigger channel.
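
        A minimal sketch of the kind of ZeroMQ subscriber involved, assuming pyzmq and with a hypothetical endpoint and processing step:

        import zmq

        context = zmq.Context()
        socket = context.socket(zmq.SUB)
        socket.connect("tcp://hlt-monitor.example.cern.ch:60201")   # hypothetical address
        socket.setsockopt_string(zmq.SUBSCRIBE, "")                 # subscribe to everything

        while True:
            message = socket.recv()          # e.g. a serialised, time-stamped QA histogram
            store_for_later_review(message)  # hypothetical storage/enrichment step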

        Due to similarities between the Overwatch architecture and that which will be used for Quality Control (QC) in LHC Run 3 and beyond, Overwatch will be utilized to develop and test various QC components during LHC Run 2. Some of the areas of QC development include testing possible database options, as well as the development of a trending and alarm framework. We report on the project's design, development, and status.

        Speaker: Raymond Ehlers (Yale University (US))
      • 491
        Highly extensible modular system for online monitoring of the ATLAS experiment

        The unprecedented size and complexity of the ATLAS experiment required the
        adoption of a new approach to online monitoring system development, as
        many requirements for this system were not known in advance due to the
        innovative nature of the project.

        The ATLAS online monitoring facility has been designed as a modular
        system consisting of a number of independent components, which can
        interact with one another via a set of well defined interfaces. The
        system has been developed using open source software and is based on
        the two in-house developed highly scalable distributed services for
        message passing and information exchange, which can deal with
        information of arbitrary types. The other monitoring components use
        these services to implement high-level facilities, like for example
        Monitoring Data Archiving and Data Quality Assessment, as well as end
        user interfaces like the Data Quality and Online Histogramming
        displays.

        This presentation will describe the online monitoring system design
        and evolution for the first two data taking periods showing how the
        chosen approach allowed the system to be gradually extended during
        operation in a non-disruptive way by adding more high level tools and
        frameworks as requirements evolved.

        Speaker: Serguei Kolos (University of California Irvine (US))
      • 492
        Presentation layer of CMS Online Monitoring System

        The Compact Muon Solenoid (CMS) is one of the experiments at the CERN Large Hadron Collider (LHC). The CMS Online Monitoring system (OMS) is an upgrade and successor to the CMS Web-Based Monitoring (WBM) system, which is an essential tool for shift crew members, detector subsystem experts, operations coordinators, and those performing physics analyses. CMS OMS is divided into aggregation and presentation layers. Communication between layers uses RESTful JSON:API compliant requests. The aggregation layer is responsible for collecting data from heterogeneous sources, storage of transformed and pre-calculated (aggregated) values and exposure of data via the RESTful API.
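
        An illustrative JSON:API style request of the kind exchanged between the layers; the endpoint, resource name and filter fields below are hypothetical:

        import requests

        base = "https://oms.example.cern.ch/api/v1"      # hypothetical base URL
        resp = requests.get(base + "/runs",
                            params={"filter[fill_number]": 6371,
                                    "page[limit]": 10,
                                    "sort": "-run_number"},
                            headers={"Accept": "application/vnd.api+json"})
        resp.raise_for_status()
        for item in resp.json()["data"]:                 # JSON:API wraps resources in "data"
            print(item["id"], item["attributes"])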

        The presentation layer displays detector information via a modern, user-friendly and customizable web interface. The CMS OMS user interface is composed of a set of cutting-edge software frameworks and tools to display non-event data to any authenticated CMS user worldwide. The web interface tree-like component structure comprises (top-down): workspaces, folders, pages, controllers and portlets. A clear hierarchy gives the required flexibility and control for content organization. Each bottom element instantiates a portlet and is a reusable component that displays a single aspect of data, like a table, a plot, an article, etc. Pages consist of multiple different portlets and can be customized at run-time by using a drag-and-drop technique. This is how a single page can easily include information from multiple online sources. Different pages give access to a summary of the current status of the experiment, as well as convenient access to historical data.

        This paper describes the CMS OMS architecture, core concepts and technologies of the presentation layer.

        Speaker: Mantas Stankevicius (Fermi National Accelerator Lab. (US))
      • 493
        Open Hardware and Application Specific Design for the monitoring system of the Belle II forward/backward electromagnetic calorimeter

        Control and monitoring of experimental facilities as well as laboratory equipment requires handling a blend of different tasks. Often in industrial or scientific fields there are standards or form factors to comply with and electronic interfaces or custom buses to adopt. With such tight boundary conditions, the integration of an off-the-shelf Single Board Computer (SBC) is not always a possible or viable alternative.
        The availability of electronic schematics and PCBs under Open-source Hardware licenses for various SBCs overcomes such integration problems, making it feasible to implement a custom architecture composed of a central core inherited from a vendor reference design (most likely the microprocessor, static RAM and flash memory), augmented with application-specific integrated circuits and hardware resources to handle the requirements of the specific environment. The user is then able to exploit most of the tools and software supported by the Open Source community, while fulfilling all the constraints enforced by the environment.

        We have used such an approach for the design and development of the monitoring system framework of the ECL endcap calorimeter of the Belle II experiment, presently under construction at the KEK laboratory (Tsukuba, Japan). In this work we present and discuss the main aspects of the hardware and software architectures, tailored to the needs of a detector designed around CsI scintillators.

        Speaker: Francesco Di Capua (Università di Napoli Federico II and INFN)
      • 494
        DAQBroker - A general purpose instrument monitoring framework

        The current scientific environment has experimentalists and system administrators allocating large amounts of time to data access, parsing and gathering
        as well as instrument management. This is a growing challenge with more large
        collaborations with significant amounts of instrument resources, remote instrumentation sites and continuously improved and upgraded scientific instruments.
        DAQBroker is a new software framework adopted by the CLOUD experiment at
        CERN. This framework was designed to monitor CLOUD's network of various
        architectures and operating systems and to collect data from any instrument while
        also providing simple data access to any user. Data can be stored in one or
        several local or remote databases running on any of the most popular RDBMS
        (MySQL, PostgreSQL, Oracle). It also provides the necessary tools for creating
        and editing the metadata associated with different instruments, performing data
        manipulation and generating events based on instrument measurements, regardless
        of the user's know-how of individual instruments. DAQBroker is a software
        package focused on three major components: storage, communication and user
        interface. This submission will present an overview of each of DAQBroker's components as well as preliminary performance results of the application
        running on high- and low-performance machines.

        Speaker: Antonio Dias (Universidade de Lisboa (PT))
      • 495
        A scalable online monitoring system based on Elasticsearch for distributed data acquisition in CMS

        The part of the CMS data acquisition (DAQ) system responsible for data readout and event building is a complex network of interdependent distributed programs. To ensure successful data taking, these programs have to be constantly monitored in order to facilitate the timeliness of necessary corrections in case of any deviation from specified behaviour. A large number of diverse monitoring data samples are periodically collected from multiple sources across the network. Monitoring data are kept in memory for online operations and optionally stored on disk for post-mortem analysis.

        We present a generic, reusable solution based on an open source NoSQL database, Elasticsearch, which is fully compatible and non-intrusive with respect to the existing system. The motivation is to benefit from off-the-shelf software to facilitate the development, maintenance and support efforts. Elasticsearch provides failover and data redundancy capabilities as well as a programming language independent JSON-over-HTTP interface. The possibility of horizontal scaling matches the requirements of a DAQ monitoring system. The data load from all sources is balanced by redistribution over an Elasticsearch cluster that can be hosted on a computer cloud.
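
        A minimal sketch, using the official Elasticsearch Python client (8.x-style calls) with an illustrative index name, document layout and host, of how a monitoring document could be indexed and queried back:

        from datetime import datetime, timezone
        from elasticsearch import Elasticsearch

        es = Elasticsearch("http://es-daq-monitoring.example.cern.ch:9200")   # hypothetical host

        doc = {"timestamp": datetime.now(timezone.utc).isoformat(),
               "source": "event-builder-42",
               "metric": "event_rate_khz",
               "value": 98.6}
        es.index(index="daq-monitoring", document=doc)

        # later, e.g. from a dashboard or for post-mortem analysis
        hits = es.search(index="daq-monitoring",
                         query={"term": {"source": "event-builder-42"}},
                         size=10)
        print(hits["hits"]["total"])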

        In order to achieve the necessary robustness and to validate the scalability of the approach the above monitoring solution currently runs in parallel with an existing in-house developed DAQ monitoring system. The effectiveness and reusability of such a distributed monitoring solution is demonstrated by the current usage of the same system within the CMS BRIL subsystem. Another Elasticsearch based system is used for the High-Level-Trigger (HLT) part of the DAQ system monitoring, which also benefits from this off-the-shelf solution facilitating data storing and load balancing.

        Speaker: Dr Dainius Simelevicius (Vilnius University (LT))
    • T2 - Offline computing: S7 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Lucia Grillo (University of Manchester (GB))
      • 496
        ALICE Track visualisation options for LHC Run 3

        Good quality track visualization is an important aspect of every High-Energy Physics experiment, where it can be used for quick assessment of recorded collisions. The event display, operated in the Control Room, is also important for visitors and increases public recognition of the experiment. Especially in the case of the ALICE detector at the Large Hadron Collider (LHC), which reconstructs up to 2000 tracks per rapidity unit for each Pb-Pb collision, a clear visualization of such collisions is a challenging task.

        A new track display algorithm may also replace the static 3D snapshots presented in Run 2 and visualize the dynamics of particle collisions.
        The result is still a snapshot, showing only a few events per second, but an animated one. Together with a careful selection of visually appealing event groups, it may increase the attractiveness of the visualization.

        Such animations of particles in the event, which may also appear and fade out, require a well-defined graphics framework.

        We present the results of comparative tests of different approaches to the topic, with their benefits and limitations. Special care was taken to ensure proper performance of the visualization and to avoid frame jams. Modern visualization technologies like Vulkan and shader-based OpenGL were investigated. Traditional ROOT geometry functions were also considered, as an option that does not require deeper changes to the underlying ROOT library.

        Speaker: Mr Julian Myrcha (Warsaw University of Technology)
      • 497
        Going standalone and platform-independent, an example from recent work on the ATLAS Detector Description and interactive data visualization

        Until recently, the direct visualization of the complete ATLAS experiment geometry and final analysis data was confined within the software framework of the experiment.
        To provide a detailed interactive data visualization capability to users, as well as easy access to geometry data, and to ensure platform independence and portability, great effort has been recently put into the modernization of both the core kernel of the detector description and the visualization tools. In this talk we will present the new tools, as well as the lessons learned while modernizing the experiment's code for an efficient use of the detector description and for user-friendly data visualization.

        Speaker: Sebastian Andreas Merkt (University of Pittsburgh (US))
      • 498
        Belle II Virtual Reality CAVE Implementation

        The Belle II experiment, based in Japan, is designed for the precise measurement of B and C meson decays as well as $\tau$ decays and is intended to play an important role in the search for physics beyond the Standard Model. To visualize the collected data, amongst other things, virtual reality (VR) applications are used within the collaboration. In addition to the already existing VR application which runs on a head-mounted display, an implementation for the cave automatic virtual environment (CAVE) has been created, where the CAVE is an immersive VR environment in which projectors are directed at up to six walls of a room-sized cube. The application allows for the inspection of the Belle II detector itself, as well as the illustration of GEANT4-simulated (and data) events of the electron-positron collisions occurring at the SuperKEKB collider. The CAVE VR implementation is not limited to use within the Belle II collaboration, but is also a helpful tool for outreach activities. The main concept, the implementation and the future plans for this application will be presented.

        Speaker: Michael Bender (University of Munich (LMU))
      • 499
        Virtual Reality and game engines for interactive data visualization and event displays in HEP, an example from the ATLAS experiment

        Interactive 3D data visualization plays a key role in HEP experiments, as it is used in many tasks at different levels of the data chain. Outside HEP, for interactive 3D graphics, the game industry makes heavy use of so-called “game engines”, modern software frameworks offering an extensive set of powerful graphics tools and cross-platform deployment. Recently, very strong support for Virtual Reality (VR) technology has been added to such engines. In this talk we explore the usage of game engines and VR for HEP data visualization, discussing the needs, the challenges and the issues of using such technologies. We will also make use of ATLASRift, a VR application developed by the ATLAS experiment, to discuss the lessons learned while developing it with the game engine "Unreal Engine", and the feedback on the use of Virtual Reality we received from users at many demonstrations and public events.

        Speaker: Claire Adam Bourdarios (Centre National de la Recherche Scientifique (FR))
      • 500
        ATLAS OpenData and OpenKey: using low tech computational tools for students training in High Energy Physics

        One of the big challenges in High Energy Physics development is the fact that many potential, and very valuable, students and young researchers live in countries where internet access and computational infrastructure are poor compared to those of the institutions already participating.

        In order to accelerate the process, the ATLAS Open Data project releases useful and meaningful data and tools using standard and easy-to-deploy computational means, such as custom and lightweight Linux Virtual Machines, open source technologies, web and desktop applications. The ATLAS Open Key, a simple USB pen drive, allows all those resources to be transported around the globe. As simple as it sounds, this approach is helping to train students that are now PhD candidates and to integrate HEP educational programs at Master's level in universities where they did not exist before.

        The software tools and resources used will be presented, as well as results and stories, ideas and next steps of the ATLAS Open Data project.

        Speaker: Arturo Sanchez Pineda (Abdus Salam Int. Cent. Theor. Phys. (IT))
      • 501
        Discussion
    • T3 - Distributed computing: Computing Models and Future Views Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: David Cameron (University of Oslo (NO))
      • 502
        CMS Computing Resources: Meeting the demands of the high-luminosity LHC physics program

        The HL-LHC program has seen numerous extrapolations of its needed computing resources that each indicate the need for substantial changes if the desired HL-LHC physics program is to be supported within the current level of computing resource budgets. Drivers include large increases in event complexity (leading to increased processing time and analysis data size) and trigger rates needed (5-10 fold increases) for the HL-LHC program. CMS has recently undertaken an effort to merge the ideas behind short-term and long-term resource models in order to make easier and more reliable extrapolations to future needs. Near term computing resource estimation requirements depend on numerous parameters: LHC uptime and beam intensities; detector and online trigger performance; software performance; analysis data requirements; data access, management, and retention policies; site characteristics; and network performance. Longer term modeling is affected by the same characteristics, but with much larger uncertainties that must be considered to understand the most interesting handles for increasing the "physics per computing dollar" of the HL-LHC. In this presentation, we discuss the current status of long term modeling of the CMS computing resource needs for HL-LHC with emphasis on techniques for extrapolations, uncertainty quantification, and model results. We illustrate potential ways that high-luminosity CMS could accomplish its desired physics program within today's computing budgets.

        Speaker: David Lange (Princeton University (US))
      • 503
        Towards a computing model for the LHCb Upgrade

        The LHCb experiment will be upgraded for data taking in the LHC Run 3. The foreseen trigger output bandwidth of a few GB/s will result in datasets of tens of PB per year, which need to be efficiently streamed and stored offline for low-latency data analysis. In addition, simulation samples up to two orders of magnitude larger than those currently produced are envisaged, with a big impact on offline computing and storage resources.
        This contribution discusses the offline computing model and the required offline resources for the LHCb Upgrade, as resulting from the above requirements.

        Speaker: Stefan Roiser (CERN)
      • 504
        The Future of Distributed Computing Systems in ATLAS: Boldly Venturing Beyond Grids

        The Production and Distributed Analysis system (PanDA) for the ATLAS experiment at the Large Hadron Collider has seen big changes over the past couple of years to accommodate new types of distributed computing resources: clouds, HPCs, volunteer computers and other external resources. While PanDA was originally designed for fairly homogeneous resources available through the Worldwide LHC Computing Grid, the new resources are heterogeneous, at diverse scales and with diverse interfaces. Up to a fifth of the resources available to ATLAS are of such new types and require special techniques for integration into PanDA. In this talk, we present the nature and scale of these resources. We provide an overview of the various challenges faced, spanning infrastructure, software distribution, workload requirements, scaling requirements, workflow management, data management, network provisioning, and associated software and computing facilities. We describe the strategies for integrating these heterogeneous resources into ATLAS, and the new software components being developed in PanDA to efficiently use them. Plans for software and computing evolution to meet the needs of LHC operations and upgrade in the long term future will be discussed.

        Speaker: Fernando Harald Barreiro Megino (University of Texas at Arlington)
      • 505
        JAliEn: the new ALICE high-performance and high-scalability Grid framework

        The ALICE experiment will undergo an extensive detector and readout upgrade for the LHC Run 3 and will collect a 10 times larger data volume than today. This will translate into an increase of the required CPU resources worldwide as well as higher data access and transfer rates. JAliEn (Java ALICE Environment) is the new Grid middleware designed to scale out horizontally and satisfy the ALICE production and analysis requirements after the upgrade.
        This contribution presents the architecture of JAliEn and the technologies and methods introduced to achieve high scalability and stability of all services. Performance measurements of key system services are shown and discussed. In addition, we describe the next-generation solution used to implement and operate one of the main components, the JAliEn File Catalogue. We describe the catalogue design and the database backend used to reach almost unlimited scalability in terms of file descriptors and thus the ability to handle near-exascale data volumes. The catalogue is further enhanced with powerful and fast search tools, an important aspect for reaching and surpassing the required performance of JAliEn. Lastly, we present how JAliEn handles the higher complexity of data processing tasks implemented in the new ALICE software framework ALFA and in multi-core environments.

        Speaker: Miguel Martinez Pedreira (Johann-Wolfgang-Goethe Univ. (DE))
      • 506
        System Performance and Cost Modelling in LHC computing

        The increase in the scale of LHC computing expected for Run 3 and even more so for Run 4 (HL-LHC) over the course of the next ten years will most certainly require radical changes to the computing models and the data processing of the LHC experiments. Translating the requirements of the physics programmes into computing resource needs is an extremely complicated process and subject to significant uncertainties. For this reason, WLCG has established a working group to develop methodologies and tools intended to characterize the LHC workloads, better understand their interaction with the computing infrastructure, calculate their cost in terms of resources and expenditure and assist experiments, sites and the WLCG project in the evaluation of their future choices.
        This working group started in November 2017 and has about 30 active participants representing experiments and sites. In this contribution we present the activity and the results obtained so far by the working group and the direction it is taking towards the fulfilment of its objectives.

        Speaker: Andrea Sciaba (CERN)
      • 507
        OSG and GPUs: A tale of two use cases

        With the increase in power and reduction in cost of GPU-accelerated processors, a corresponding interest in their use in the scientific domain has grown. OSG users are no different and have shown an interest in accessing GPU resources via their usual workload infrastructures. Grid sites that have these kinds of resources also want to make them available through the grid. In this talk, we discuss the software and infrastructure challenges and limitations of the OSG implementations to make GPUs widely accessible over the grid. Two use cases are considered. First, IceCube, a big VO with a well-curated software stack taking advantage of GPUs with OpenCL. Second, a more general approach to supporting the grid use of industry- and academia-maintained machine learning libraries like Tensorflow and keanos on the grid using Singularity.

        Speaker: Edgar Fajardo Hernandez (Univ. of California San Diego (US))
    • T4 - Data handling: S7 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Convener: Elizabeth Gallas (University of Oxford (GB))
      • 508
        Building a global file system for data access using Large Scale CVMFS and DynaFed

        CVMFS has proved an extremely effective mechanism for providing scalable, POSIX-like access to experiment software across the Grid. The normal method of file access is http downloads via squid caches from a small number of Stratum 1 servers. In the last couple of years this mechanism has been extended to allow access to files from any storage offering http access. This has been named Large Scale CVMFS. Large Scale CVMFS has been shown to work for experiments whose entire dataset can be stored at a single site; however, it has not been designed for cases where the data is distributed across many sites or where there is more than one copy of a file available.

        DynaFed can federate http storage endpoints and is able to present a huge distributed repository as if it were one. It is an ideal complement to Large Scale CVMFS as it provides a mechanism to select the most appropriate file when more than one copy exists. The dynamic nature of the federation also allows storage to be added and removed without requiring changes to CVMFS clients running on every worker node. This paper reports on the work done within GridPP to build a global file system for data access using Large Scale CVMFS and DynaFed. The data federation includes both traditional Grid storage endpoints such as DPM as well as cloud storage such as S3 and this paper also describes the differences in their setup and observed performance.

        Speaker: Alastair Dewhurst (STFC-Rutherford Appleton Laboratory (GB))
      • 509
        An http data-federation eco-system with caching functionality using DPM and Dynafed

        The implementation of cache systems in the computing model of HEP experiments makes it possible to accelerate access to hot data sets by scientists, opening new scenarios for data distribution and enabling the exploitation of the storage-less site paradigm.
        In this work, we present a study for the creation of an http data-federation eco-system with caching functionality. By exploiting the volatile-pool concept introduced in the latest version of the Disk Pool Manager, we created code plugged into the logic of a DPM storage element, able to simulate cache behaviour. We then used Dynafed as a lightweight federation system to aggregate a set of standard Grid storage elements with the cache system. With the designed setup, clients asking for a file that is present on the Data Grid but located closer to the cache than to the physical location of the file are automatically redirected to the cache, thanks to the action of the geo-plugin run by Dynafed.
        As a proof of concept, we tested the whole system in a controlled environment within the Belle II computing infrastructure, using a set of files located in production Storage Elements. Preliminary results demonstrate the proper functionality of the logic and encourage continuing the work.

        Speaker: Silvio Pardi (INFN)
      • 510
        The Software Defined Online Storage System at the GridKa WLCG Tier-1 Center

        The computing center GridKa is serving the ALICE, ATLAS, CMS and LHCb experiments as one of the biggest WLCG Tier-1 centers worldwide, with compute and storage resources. It is operated by the Steinbuch Centre for Computing at the Karlsruhe Institute of Technology in Germany. In April 2017 a new online storage system was put into operation. In its current stage of expansion it offers the HEP experiments a capacity of 23 Petabytes of online storage distributed over 16 redundant storage servers with 3900 disks and 50 TB of SSDs. The storage is connected via two redundant InfiniBand fabrics to 44 file servers, which in turn are each connected via 40 Gbit/s and several 100 Gbit/s Ethernet uplinks to the GridKa backbone network. The whole storage is partitioned into a few large file systems, one per experiment, using IBM Spectrum Scale as the software-defined-storage base layer. The system offers a combined read-write performance of 70 GB/s. It can be scaled transparently both in size and performance, allowing it to fulfill the growing needs, especially of the LHC experiments, for online storage in the coming years.
        In this presentation we discuss the general architecture of the storage system and report first experiences with its performance in production use. In addition we present the current plans for expanding the system.

        Speaker: Jan Erik Sundermann (Karlsruhe Institute of Technology (KIT))
      • 511
        Advancements in data management services for distributed e-infrastructures: the eXtreme-DataCloud project

        The development of data management services capable of coping with very large data resources is a key challenge if future e-infrastructures are to address the needs of the next generation of extreme-scale scientific experiments.
        To face this challenge, in November 2017 the H2020 “eXtreme DataCloud - XDC” project has been launched. Lasting for 27 months and combining the expertise of 8 large European research organisations, the project aims at developing scalable technologies for federating storage resources and managing data in highly distributed computing environments. The targeted platforms are the current and next generation e-Infrastructures deployed in Europe, such as the European Open Science Cloud (EOSC), the European Grid Infrastructure (EGI), and the Worldwide LHC Computing Grid (WLCG).
        The project is use-case driven with a multidisciplinary approach, addressing requirements from research communities belonging to a wide range of scientific domains: High Energy Physics, Astronomy, Photon and Life Science, Medical research.
        XDC will implement scalable data management services to address the following high-level topics: policy-driven data management based on Quality-of-Service, data life-cycle management, smart placement of data with caching mechanisms to reduce access latency, handling of metadata with no predefined schema, execution of pre-processing applications during ingestion, data management and protection of sensitive data in distributed e-infrastructures, and intelligent data placement based on access patterns.
        Experts from the project consortium will work on combining already established data management and orchestration tools to provide a highly scalable solution supporting the computing models of the current and next generation experiments. The XDC products will be based on tools such as ONEDATA, EOS, FTS, Indigo-Orchestrator, Indigo-CDMI server, Dynafed.
        This contribution will introduce the project and present the foreseen overall architecture and the developments that are being carried out to implement the requested functionalities.

        Speaker: Daniele Cesini (Universita e INFN, Bologna (IT))
      • 512
        Using a dynamic data federation for running Belle-II simulation applications in a distributed cloud environment

        The dynamic data federation software (Dynafed), developed by CERN IT, provides a federated storage cluster on demand using the HTTP protocol with WebDAV extensions. Traditional storage sites which support an experiment can be added to Dynafed without requiring any changes to the site. Dynafed also supports direct access to cloud storage such as S3 and Azure. We report on the usage of Dynafed to support Belle-II production jobs running on a distributed cloud system utilizing clouds across North America. Cloudscheduler, developed by the UVic HEP Research Computing group, federates Openstack, OpenNebula, Amazon, and Microsoft cloud compute resources and provides them as a unified grid site which on average runs about 3500 Belle-II production jobs in parallel. The input data for those jobs is accessible through a single endpoint, our Dynafed instance. This Dynafed instance unifies storage resources provided by Amazon S3, Ceph, and minio object stores as endpoints, as well as storage provided by traditional DPM and dCache sites. We report on our long-term experience with this setup and the implementation of a grid-mapfile-based X509 authentication/authorization for Belle-II access, and we show how a federated cluster can be used by Belle-II through gfalFS. We also report on the advantages and challenges we encountered in using it to support the Belle-II experiment in a distributed cloud environment.

        Speaker: Dr Marcus Ebert (University of Victoria)
      • 513
        Storage events: distributed users, federation and beyond

        Whatever the use case, for federated storage to work well, some knowledge from each storage system must exist outside that system. This is needed to allow coordinated activity; e.g., executing analysis jobs on worker nodes with good accessibility to the data.

        Currently, this is achieved by clients notifying central services of activity; e.g., a client notifies a replica catalogue after an upload. Unfortunately, this forces end users to use bespoke clients. It also forces clients to wait for asynchronous activities to finish.

        dCache is providing an alternative approach: storage events. In this approach the storage systems (rather than the clients) become the coordinating service, notifying interested parties of key events.

        At DESY, we are investigating storage events along with Apache OpenWhisk and Kubernetes to build a "serverless" cloud, similar to AWS Lambda or Google Cloud Functions, for photon science use-cases.

        Storage events are more generally useful: catalogues are notified whenever data is uploaded or deleted, tape becomes more efficient because analysis can start immediately after the data is on disk, and caches can be "smart", fetching new datasets pre-emptively.

        In this paper we will present work within dCache to support a new event-based interface, with which these and other use-cases become more efficient.
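
        As a rough illustration of what an event-driven consumer of such notifications could look like, the Python sketch below streams events over HTTP and reacts to uploads. The endpoint URL, the JSON event schema and the "upload" action are assumptions made for illustration; they do not reproduce the actual dCache interface presented in the paper.

        import json
        import requests  # assumed available; endpoint and event format below are illustrative

        def on_new_file(path):
            """Placeholder reaction: e.g. register the file in a catalogue or warm a cache."""
            print(f"new data available: {path}")

        def follow_events(url):
            """Stream server-sent events from the storage system and react to uploads,
            instead of polling the namespace or relying on the uploading client to notify us."""
            with requests.get(url, stream=True, timeout=None) as resp:
                for line in resp.iter_lines():
                    if not line or not line.startswith(b"data:"):
                        continue
                    event = json.loads(line[len(b"data:"):])
                    if event.get("action") == "upload":      # event schema is an assumption
                        on_new_file(event["path"])

        # follow_events("https://storage.example.org/api/v1/events")   # hypothetical endpoint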

        Speaker: Paul Millar (DESY)
    • T5 - Software development: S7 - Performance, Profiling, Code Optimization Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Patricia Mendez Lorenzo (CERN)
      • 514
        Retroactive sustainability improvements in the MERLIN particle tracking code

        MERLIN is a C++ particle tracking software package, originally developed at DESY for use in International Linear Collider (ILC) simulations. MERLIN has more recently been adapted for High-Luminosity Large Hadron Collider (HL-LHC) collimation studies, utilising more advanced scattering physics. However, as is all too common in existing high-energy physics software, recent developments have not focused on code design or sustainability. This has resulted in notable usability issues for users and developers alike. The following presents current developer endeavours to address these issues, adhering to modern sustainability practices. Quantifiable improvements in code complexity and usability are presented via profiling and test metrics. Moreover, changes in software architecture are analysed, and the developer experiences, as well as the pros and cons of retroactively applying sustainability methodology, are discussed.

        Speaker: Dr Scott Rowan (University of Huddersfield)
      • 515
        Data mining techniques for software quality prediction in open source software: an initial assessment

        Software quality monitoring and analysis is one of the most productive topics of software engineering research. Its results may be employed effectively by engineers during the software development life cycle. Software metrics, together with data mining techniques, can provide the basis for developing prediction models.

        Open source software constitutes a valid test case for the assessment of software characteristics. A large number of data mining techniques have been proposed in the literature over time for analysing complex relationships and extracting useful information.

        This paper aims at comparing diverse data mining techniques (e.g., derived from machine learning) for the development of effective software quality prediction models.
        In order to achieve this goal, we tackled various issues such as the collection of software metrics from open source repositories by employing automatic tools, the assessment of prediction models to detect software issues, and the adoption of statistical methods to evaluate data mining techniques.
        This study aims to identify the best data mining techniques, amongst all the ones used in this paper, for the development of software quality prediction models. Furthermore, we attempt to provide some guidelines to integrate these techniques in existing projects.
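
        As a concrete, if simplified, illustration of such a prediction model, the Python sketch below trains a random-forest classifier on synthetic module-level metrics and evaluates it with cross-validation. The synthetic data, the choice of classifier and the evaluation metric are assumptions made for illustration only, not the techniques actually compared in this paper.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        # Illustrative only: rows are software modules, columns stand in for mined code metrics
        # (e.g. lines of code, cyclomatic complexity, churn); the label marks defect-prone modules.
        rng = np.random.default_rng(42)
        X = rng.normal(size=(500, 3))
        y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

        model = RandomForestClassifier(n_estimators=200, random_state=0)
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"cross-validated AUC: {scores.mean():.2f} +/- {scores.std():.2f}")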

        Speaker: Marco Canaparo (INFN)
      • 516
        Porting the LHCb Stack from x86 (Intel) to aarch64 (ARM)

        LHCb is undergoing major changes in its data selection and processing chain for the upcoming LHC Run 3 starting in 2021. With this in view, several initiatives have been launched to optimise the software stack. This contribution discusses porting the LHCb stack from the x86 architecture to the aarch64 architecture, with the goal of evaluating the performance and the cost of the computing infrastructure, e.g. for the trigger selection process. The whole LHCb software stack contains around 6 million lines of code which need to be ported. In addition, this requires finding working versions of the external libraries which LHCb depends on and which are provided by LCG. Across all software packages the biggest challenge is the growing use of vectorisation, as many vectorisation libraries are specialised for the x86 architecture and do not have aarch64 support. In spite of these challenges we have successfully ported the LHCb reconstruction code to aarch64. This contribution discusses the status and plans for the porting of the software as well as the LHCb approach to tackling code vectorisation in a platform-independent way.

        Speaker: Laura Promberger (University of Applied Sciences (DE))
      • 517
        Implementation of Feldman-Cousins correction and oscillation calculations in the HPC environment for the NOvA and DUNE Experiments

        Analysis of neutrino oscillation data involves a combination of complex fitting procedures and statistical correction techniques that are used to determine the full three-flavor PMNS parameters and constraint contours. These techniques rely on computationally intensive “multi-universe” stochastic modeling. The process of calculating these contours and corrections can dominate the final stages of the data analysis and become a bottleneck for examining the effect of systematic variations on the final results.

        As part of the DOE SciDAC-4 sponsored research program, we present a new implementation of a neutrino oscillation fitting and Feldman-Cousins correction calculation framework. The implementation is based on decomposition and mapping of the parameter space onto MPI ranks by the DIY framework, which is specifically designed and optimized to operate on modern High Performance Computing (HPC) facilities. We present the performance of the system in calculating result contours for the NOvA experiment based on its 6E20 and 9E20 protons on target (PoT) neutrino datasets, and compare the performance of this new implementation, run at NERSC and the Argonne Leadership Computing Facility (ALCF), to the methods used previously by the NOvA collaboration running on grid computing facilities.
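
        A minimal sketch of this kind of parameter-space decomposition, written with mpi4py rather than the DIY framework used in the actual implementation, is shown below. The parameter grid, its ranges and the placeholder pseudo-experiment function are purely illustrative.

        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        # Hypothetical 2D grid over two oscillation parameters; each rank works on a slice of it.
        theta23 = np.linspace(0.35, 0.65, 60)      # sin^2(theta_23), illustrative range
        dm2     = np.linspace(2.2e-3, 2.7e-3, 60)  # |Delta m^2_32| in eV^2, illustrative range
        grid = [(t, d) for t in theta23 for d in dm2]
        my_points = grid[rank::size]               # simple round-robin decomposition

        def pseudo_experiments(point, n=1000):
            """Stand-in for the expensive multi-universe fits run at one grid point."""
            rng = np.random.default_rng(hash(point) % 2**32)
            return rng.chisquare(df=2, size=n)     # placeholder for Delta chi^2 distributions

        local = {p: np.percentile(pseudo_experiments(p), 90) for p in my_points}
        all_results = comm.gather(local, root=0)   # rank 0 assembles the FC critical values
        if rank == 0:
            merged = {k: v for part in all_results for k, v in part.items()}
            print(f"computed critical values at {len(merged)} grid points")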

        Speaker: Alexandre Sousa (University of Cincinnati)
      • 518
        The Event Buffer Management for MT-SNiPER

        SNiPER is a general-purpose offline software framework for high energy physics experiments. It provides some features that are attractive to neutrino experiments, such as the event buffer. More than one event is available in the buffer according to a customizable time window, so that it is easy for users to perform event correlation analysis.
        We also implemented MT-SNiPER to support multithreaded computing based on Intel TBB. In MT-SNiPER, the event loop is split into pieces, and each piece is dispatched to a task. The event buffer management becomes more complicated in this case.
        The global buffer, an extension and enhancement of the event buffer, is implemented for MT-SNiPER. The global buffer is available to all threads. It keeps all the events being processed in memory. When there is an available task, a subset of the buffered events is dispatched to that task. There can be overlaps between the subsets in different tasks due to the time window; however, the global buffer ensures that each event is processed only once. On the task side, the subsets of events are locally managed by a normal event buffer, so the global buffer can be transparent to most user algorithms. Moreover, the global buffer can be used to keep all events in the right order.
        With the global buffer, multithreaded computing in MT-SNiPER becomes more practical.
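
        The buffering scheme can be illustrated with a small, single-threaded Python sketch; the real implementation is C++ on top of Intel TBB. The Event and GlobalBuffer classes, the time values and the window size below are invented for illustration.

        from dataclasses import dataclass, field

        @dataclass
        class Event:
            time: float
            processed: bool = False  # ensures each event is the subject of exactly one task

        @dataclass
        class GlobalBuffer:
            """Toy model of the global buffer: it holds all in-flight events, hands out
            overlapping time-window subsets, but marks each event as processed only once."""
            window: float                      # half-width of the correlation time window
            events: list = field(default_factory=list)

            def add(self, event):
                self.events.append(event)
                self.events.sort(key=lambda e: e.time)

            def next_task(self):
                """Pick the earliest unprocessed event and return it with its neighbours."""
                subject = next((e for e in self.events if not e.processed), None)
                if subject is None:
                    return None
                subject.processed = True
                return [e for e in self.events if abs(e.time - subject.time) <= self.window]

        buf = GlobalBuffer(window=1.0)
        for t in [0.1, 0.4, 1.2, 3.0, 3.3]:
            buf.add(Event(t))

        while (subset := buf.next_task()) is not None:
            # In MT-SNiPER each subset would be dispatched to a task and managed locally
            # by a normal event buffer; here we just print the overlapping subsets.
            print([e.time for e in subset])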

        Speaker: Jiaheng Zou (IHEP)
      • 519
        A plugin-based approach to data analysis for the AMS experiment on the ISS

        In many HEP experiments a typical data analysis workflow requires each user to read the experiment data in order to extract meaningful information and produce relevant plots for the considered analysis. Multiple users accessing the same data results in redundant access to the data itself, which could be factorised, effectively improving the CPU efficiency of the analysis jobs and relieving stress on the storage infrastructure.
        To address this issue we present a modular and lightweight solution where the user code is embedded in different "analysis plugins" which are then collected and loaded at runtime for execution, so that the data is read only once and shared between all the different plugins. This solution was developed for one of the data analysis groups within the AMS collaboration but is easily extendable to all kinds of analyses and workloads that need I/O access to AMS data or custom data formats, and can even be adapted with little effort to the data of another HEP experiment.
        This framework could then be easily embedded into an "analysis train", and we will discuss a possible implementation and different ways to optimise CPU efficiency and execution time.
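
        A minimal Python sketch of the plugin pattern described above is given below. The plugin module names, the process/finalize interface and the event reader are assumptions made for illustration; the actual AMS framework is not reproduced here.

        import importlib

        # Hypothetical plugin modules; each is expected to expose process(event) and finalize().
        PLUGIN_NAMES = ["plugins.flux_analysis", "plugins.charge_ratio"]

        def run(events, plugin_names):
            """Read the data once and share every event with all loaded analysis plugins."""
            plugins = [importlib.import_module(name) for name in plugin_names]
            for event in events:          # single pass over the (expensive) input data
                for plugin in plugins:
                    plugin.process(event)
            for plugin in plugins:
                plugin.finalize()         # e.g. write out the histograms of that analysis

        # run(read_ams_events("input.root"), PLUGIN_NAMES)   # the reader is hypothetical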

        Speaker: Valerio Formato (Universita e INFN, Perugia (IT))
    • T6 - Machine learning and physics analysis: S7 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Convener: Andrea Rizzi (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P)
      • 520
        The HEP.TrkX project. Approaching Charged Particle Tracking at the HL-LHC with Deep Learning, for Online and Offline Processing

        In the horizon of the High Luminosity Large Hadron Collider (HL-LHC) phase, each proton bunch crossing will bring up to 200 simultaneous proton collisions. Performing the charged particle trajectory reconstruction in such a dense environment will be computationally challenging because of the nature of the traditional algorithms used. The common combinatorial Kalman Filter state-of-the-art approach is highly successful in reconstructing high quality tracks but scales worse than quadratically with event density, leading to unmanageable computation requirements for the HL-LHC. As an alternative to other approaches to overcome this challenge, we propose novel methods employing deep learning, with inspiration from image and sequence processing. While training such models is a computationally intensive task that can be performed offline, the inference is potentially fast and can be implemented in highly parallel processing units such as FPGAs or GPUs. We present several models for performing pattern recognition, end-to-end track parameter estimation and tree-search track following. The methods explore multiple ways of performing the association of hits into appropriate groups, with or without the assumption of a tracklet seed. Our methods utilize neural networks of various architectures, including image-based models exploiting the power of convolutional layers, recurrent cells such as long short-term memory to process sequences of hits, and graph-based models that overcome sparsity. We report on the introduction of physics priors into the formalism of the neural-net training, with the goal of accelerating and improving convergence. We present the performance, scaling and comparison of these methods applied to realistic simulation datasets.
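
        One of the sequence-processing ideas mentioned above can be sketched in a few lines of Keras: a recurrent model that consumes the hits of a track candidate and extrapolates to the next hit. The input shapes, the synthetic data and the single-LSTM architecture are illustrative assumptions, not the HEP.TrkX models themselves.

        import numpy as np
        from tensorflow.keras import layers, models

        # Toy shapes: sequences of 10 hits, each hit given by three coordinates; purely illustrative.
        n_hits, n_features = 10, 3

        model = models.Sequential([
            layers.LSTM(64, input_shape=(n_hits, n_features)),  # summarises the hit sequence
            layers.Dense(n_features)                            # predicts the next hit position
        ])
        model.compile(optimizer="adam", loss="mse")

        # Synthetic stand-in data: monotonically growing "tracks" plus noise on the target hit.
        X = np.cumsum(np.random.rand(1000, n_hits, n_features), axis=1)
        y = X[:, -1, :] + np.random.rand(1000, n_features)
        model.fit(X, y, epochs=2, batch_size=64, verbose=0)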

        Speaker: Dr Jean-Roch Vlimant (California Institute of Technology (US))
      • 521
        The Particle Track Reconstruction Based on Deep Learning Neural Networks

        Charged particle tracks registered in high energy and nuclear physics (HENP) experiments must be reconstructed in a very important stage of the physics analysis called tracking. Tracking consists of joining a great number of so-called hits, produced on sequential coordinate planes of tracking detectors, into clusters. Each of these clusters joins all hits belonging to the same track, one among many others, discarding noise and fake hits. The tracking procedure is especially difficult for modern HENP experiments with heavy ions, where detectors register events with very high multiplicity. Furthermore, the problem is seriously hampered by a well-known shortcoming of the widely used multiwire, strip and GEM detectors, where fake hits are caused by extra spurious crossings of wires or strips, and the number of those fakes exceeds the number of true hits by an order of magnitude. Here we propose a two-step technique based on hit preprocessing by a k-d tree search, followed by applying a deep learning neural network. Preliminary results of our approach for simulated events of the BM@N GEM detector are presented.
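
        The first step of the proposed technique, the k-d tree search, can be illustrated with SciPy. The hit coordinates, the search radius and the way neighbourhoods would be fed to the network are assumptions made for illustration.

        import numpy as np
        from scipy.spatial import cKDTree

        # Toy hits on one coordinate plane: (x, y) positions, many of them fake.
        rng = np.random.default_rng(0)
        hits = rng.uniform(-50.0, 50.0, size=(5000, 2))

        # The k-d tree gives fast access to the spatial neighbourhood of every hit,
        # so only nearby hits are combined into track-segment candidates.
        tree = cKDTree(hits)
        neighbour_lists = tree.query_ball_point(hits, r=2.0)   # all hits within r of each hit

        # Each (hit, neighbours) group would then be turned into a fixed-size input
        # for the deep neural network that separates true track segments from fakes.
        print("average neighbourhood size:", np.mean([len(n) for n in neighbour_lists]))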

        Speaker: Prof. Gennady Ososkov (Joint Institute for Nuclear Research (JINR), Russia)
      • 522
        Machine learning approaches in tracks pattern recognition for the SHiP Spectrometer Tracker

        SHiP is a newly proposed fixed-target experiment at the CERN SPS accelerator. The goal of the experiment is to search for hidden particles predicted by models of Hidden Sectors. Track pattern recognition is an early step of data processing at SHiP. It is used to reconstruct tracks of charged particles from the decay of neutral New Physics objects. Several artificial neural network and boosted decision tree models have been tested for clustering the SHiP Spectrometer Tracker hits into tracks. The models learn correlations between hits within tracks in an event. This allows hits to be formed into groups in which all hits most likely belong to one track. The groups are transformed into tracks by resolving hit confusions between them. The machine learning models reduce the number of hit combinations that need to be considered during track pattern recognition and increase the pattern recognition speed. The new approaches are compared with the conventional track pattern recognition methods used at SHiP on MC data.

        Speaker: Mikhail Hushchyn (Yandex School of Data Analysis (RU))
      • 523
        Track Seed Filtering using Convolutional Neural Network at the CMS High Level Trigger

        From Run II onwards, development projects for the Large Hadron Collider will steadily increase the nominal luminosity, with the ultimate goal of reaching a peak luminosity of $5 · 10^{34} cm^{−2} s^{−1}$ for the ATLAS and CMS experiments, planned for the High Luminosity LHC (HL-LHC) upgrade. This rise in luminosity will directly result in an increased number of simultaneous proton collisions (pileup), up to 200, which will pose new challenges for the CMS detector and, specifically, for track reconstruction in the Silicon Pixel Tracker.

        One of the first steps of the track finding workflow is the creation of track seeds, i.e. compatible pairs of hits from different detector layers, that are subsequently fed to higher-level pattern recognition steps. However, the set of compatible hit pairs is highly affected by combinatorial background, resulting in the next steps of the tracking algorithm having to process a significant fraction of fake doublets.

        A possible way of reducing this effect is to take into account the shape of the hit pixel clusters to check the compatibility between two hits. Each doublet is therefore associated with a collection of two images built from the ADC levels of the pixels forming the hit clusters. The task of fake rejection can thus be seen as an image classification problem, for which Convolutional Neural Networks (CNNs) have been widely proven to provide reliable results.

        In this work we present our studies on the application of CNNs to the filtering of track pixel seeds. We will show the results obtained for simulated events reconstructed in the CMS detector, focusing on the estimation of the efficiency and fake-rejection performance of our CNN classifier.
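
        A minimal Keras sketch of such a doublet classifier is shown below. The image size, the stacking of the two cluster images as channels and the layer choices are assumptions for illustration and do not reproduce the network used in this study.

        from tensorflow.keras import layers, models

        # Illustrative shapes: the two cluster images of a doublet are stacked as the two
        # channels of a small image of ADC counts (here 16x16 pixels).
        inputs = layers.Input(shape=(16, 16, 2))
        x = layers.Conv2D(32, (3, 3), activation="relu")(inputs)
        x = layers.MaxPooling2D()(x)
        x = layers.Conv2D(64, (3, 3), activation="relu")(x)
        x = layers.Flatten()(x)
        x = layers.Dense(64, activation="relu")(x)
        outputs = layers.Dense(1, activation="sigmoid")(x)     # P(doublet is genuine)

        model = models.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        model.summary()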

        Speaker: Mr Adriano Di Florio (Universita e INFN, Bari (IT))
      • 524
        trackML : the Kaggle HEP tracking challenge

        The High-Luminosity LHC will see pileup levels reaching 200, which will greatly increase the complexity of the tracking component of the event reconstruction.
        To reach out to Computer Science specialists, a Tracking Machine Learning challenge (trackML) is being set up on Kaggle during the first half of 2018 by a team of ATLAS, CMS and LHCb tracking experts and Computer Scientists, building on the experience of the successful Higgs Machine Learning challenge in 2014. A dataset consisting of an accurate simulation of an LHC experiment tracker has been created, listing for each event the measured 3D points and the list of 3D points associated to each true track. The dataset is large enough to allow the training of data-hungry Machine Learning methods; the orders of magnitude are 100,000 events, 1 billion tracks and 100 GigaBytes. The typical CPU time spent by traditional track reconstruction algorithms is 100 s per event. No limit on the training resources will be imposed. The participants in the challenge should find the tracks in an additional test dataset, which means building the list of 3D points belonging to each track (deriving the track parameters is not the topic of the challenge). The emphasis is on exposing innovative approaches, rather than hyper-optimising known approaches. The challenge will be run in two phases:
        1. During the Accuracy phase (March to June 2018), a metric reflecting the accuracy of the model at finding the proper point associations that matter most to physics analysis will be used to select the programs that could be good candidates for replacing the existing algorithms. The metric, based on the overall fraction of points associated to a good track (a good track being a track where more than 50% of the points come from the same ground truth track), has been shown to be well behaved and robust; a simplified sketch of this metric is given after the list below.
        2. The Throughput phase (July to October 2018) will focus on optimising the inference speed on one CPU core, starting from the collection of algorithms exposed in the first phase. The training speed will remain unconstrained. With this second phase we aim at finding new implementations of algorithms for faster execution, at the cost of minimal accuracy loss.
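
        The sketch below gives a simplified, unweighted Python reading of the Accuracy-phase metric: the fraction of hits that end up in "good" reconstructed tracks, counting for each good track only the hits of its majority ground-truth track. The official challenge score includes further refinements, such as per-hit weights, that are omitted here.

        from collections import Counter

        def simplified_score(truth_ids, reco_ids):
            """truth_ids[i] and reco_ids[i] are the true and reconstructed track ids of hit i.
            A reconstructed track is 'good' if more than 50% of its hits share one truth id;
            the score is the fraction of hits counted as majority hits of good tracks."""
            hits_per_reco = Counter(reco_ids)
            pair_counts = Counter(zip(reco_ids, truth_ids))
            good_hits = 0
            for (reco, truth), n in pair_counts.items():
                if n > 0.5 * hits_per_reco[reco]:
                    good_hits += n            # count only the majority hits of good tracks
            return good_hits / len(truth_ids)

        # Toy example: 6 hits, two true tracks; the second reconstructed track is 2/3 pure.
        truth = [1, 1, 1, 2, 2, 1]
        reco  = [10, 10, 10, 20, 20, 20]
        print(simplified_score(truth, reco))   # 5/6 ~ 0.83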

        This talk will summarize the findings of the Accuracy phase, in which multiple algorithms will have competed, and the new approaches from Machine Learning that have been exposed. The various merits of the different algorithms will be discussed: not only the accuracy, but also detailed performance aspects such as efficiency and fake rates as a function of the track parameters and the track density. The talk will also help advertise the second phase of the challenge.

        Speaker: Moritz Kiehn (Universite de Geneve (CH))
      • 525
        Novel Approaches to Track & Vertex Reconstruction in the Upgraded LHCb VELO

        The LHCb experiment will undergo a major upgrade for LHC Run-III, scheduled to
        start taking data in 2021. The upgrade of the LHCb detector introduces a
        radically new data-taking strategy: the current multi-level event filter will
        be replaced by a trigger-less readout system, feeding data into a software
        event filter at a rate of 40 MHz.

        In particular, a new Vertex Locator (VELO) will be installed. The current
        VELO, a silicon strip detector surrounding the interaction region, will be
        replaced by a hybrid pixel system. The electronics of the upgraded VELO will
        be capable of reading out data at 40 MHz.

        The particle tracks and decay vertices provided by the VELO comprise a large
        fraction of the background suppression power of the software event filter.
        Full event reconstruction at a rate of 40 MHz is a big challenge, given
        ever-limited computing resources. It has become clear that conventional approaches
        to track and vertex reconstruction would limit the physics potential of the
        upgraded LHCb experiment. Exploring novel approaches is therefore necessary.

        Recent developments in Deep Learning and parallel processing enable radically
        new approaches to event reconstruction. The performance of several of these new
        techniques, in terms of processing time as well as efficiency and purity, will
        be presented.

        Speaker: Dr Kurt Rinnert (University of Liverpool (GB))
    • T7 - Clouds, virtualization and containers: S7 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Convener: Dave Dykstra (Fermi National Accelerator Lab. (US))
      • 526
        Experience using Oracle OCI Cloud at CERN

        This contribution reports on the experience acquired from using the Oracle Cloud
        Infrastructure (OCI) as an Infrastructure as a Service (IaaS) within the distributed computing environments of the LHC experiments. The bare metal resources provided in the cloud were integrated using existing deployment and computer management tools. The model used in earlier cloud exercises was adapted to the bare metal requirements featuring the use of container technologies. We describe the architecture, the tools and the production experience obtained by the experiments.
        The activities have been carried out in the framework of CERN openlab.

        Speaker: Katarzyna Maria Dziedziniewicz-Wojcik (CERN)
      • 527
        FaaM: FPGA as a Microservice - A Case study for Data Compression

        Field-programmable gate arrays (FPGAs) have largely been used in communication and high-performance computing, and given the recent advances in big data and emerging trends in cloud computing (e.g., serverless [18]), FPGAs are increasingly being introduced into these domains (e.g., Microsoft’s datacenters [6] and Amazon Web Services [10]). To address these domains’ processing needs, recent research has focused on using FPGAs to accelerate workloads, ranging from analytics and machine learning to databases and network function virtualization. In this paper, we present an ongoing effort to realize a high-performance FPGA-as-a-microservice (FaaM) architecture for the cloud. We discuss some of the technical challenges and propose several solutions for efficiently integrating FPGAs into virtualized environments. Our case study, deploying multithreaded, multi-user compression as a microservice using the FaaM architecture, indicates that microservices-based FPGA acceleration can sustain high performance compared to a straightforward implementation, with minimal to no communication overhead despite the hardware abstraction.

        Speaker: Mr Roland Kunz (DELL)
      • 528
        Integrating HPC into an agile and cloud-focused environment at CERN

        CERN's batch and grid services are mainly focused on High Throughput Computing (HTC) for LHC data processing. However, part of the user community requires High Performance Computing (HPC) for massively parallel applications across many cores on MPI-enabled infrastructure. This contribution addresses the implementation of HPC infrastructure at CERN for Lattice QCD application development, as well as for different types of simulations for the accelerator and technology sector at CERN. Our approach has been to integrate the HPC facilities as far as possible with the HTC services in our data centre, and to take advantage of an agile infrastructure for updates, configuration and deployment. The HPC cluster has been orchestrated with the OpenStack Ironic component, and is hence managed with the same tools as the CERN internal cloud. Experience and benchmarks of MPI applications across InfiniBand with shared storage on CephFS are discussed, as well as the setup of the SLURM scheduler for HPC jobs with a provision for backfill of HTC workloads.

        Speaker: Pablo Llopis Sanmillan (CERN)
      • 529
        Container support in Vac

        During 2017 support for Docker and Singularity containers was added to
        the Vac system, in addition to its long standing support for virtual
        machines. All three types of "logical machine" can now be run in
        parallel on the same pool of hypervisors, using container or virtual
        machine definitions published by experiments. We explain how CernVM-FS
        is provided to containers by the hypervisors, to avoid any need to operate
        the containers in privileged modes.

        Speaker: Dr Andrew McNab (University of Manchester)
      • 530
        Grid services in a box - container management in ALICE

        Virtualization and containers have become the go-to solutions for simplified deployment, elasticity and workflow isolation. These benefits are especially pronounced for containers, which dispense with the resource overhead associated with VMs and are applicable in all cases where virtualization of the full hardware stack is not considered necessary. Containers are also simpler to set up and maintain in production systems, which should be minimally disrupted.
        This contribution focuses on container configuration and deployment to run central and site services on the ALICE Grid system, for example the VO-boxes at all sites. We describe the methods through which we minimize the manual interaction, while retaining the simplicity and scalability associated with container deployment, the so-called ’service in a box’. Furthermore, we explore ways to increase fault tolerance, aimed at reducing the risk of service downtime, and identify possible performance bottlenecks. We also describe the management schema allowing for these features and its future application within the context of Singularity, a container platform originally optimized for HPC use, and today rapidly gaining popularity within Grid systems.

        Speaker: Maksim Melnik Storetvedt (Western Norway University of Applied Sciences (NO))
      • 531
        The LHCb DIRAC containers

        During 2017, LHCb created Docker and Singularity container definitions which allow sites to run all LHCb DIRAC workloads in containers as "black boxes". This parallels LHCb's previous work to encapsulate the execution of DIRAC payload jobs in virtual machines, and we explain how these three types of "logical machine" are related in LHCb's case and how they differ, in terms of architecture, operations, and security model. We demonstrate that a simple contextualization script and access to filesystem hierarchies provided by CernVM-FS allows us to accommodate the full range of LHCb workloads.

        Speaker: Andrew McNab (University of Manchester)
    • 15:30
      Break
    • Plenary: S7 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Patrick Fuhrmann (Deutsches Elektronen-Synchrotron (DE))
      • 532
        Data analysis tools from within HEP and from industry

        High energy physics is no longer the main user or developer of data analysis tools. Open source tools developed primarily for data science, business intelligence, and finance are available for use in HEP, and adopting them would reduce the in-house maintenance burden and provide users with a wider set of training examples and career options. However, physicists have been analyzing data with computers for over 50 years and have sophisticated needs that are not entirely met by non-HEP tools. HEP tools are more advanced in some ways, non-HEP tools in others. I will discuss several categories of differences with specific examples, with an eye toward how their strengths can be combined.

        Speaker: Jim Pivarski (Princeton University)
      • 533
        ROOT: Back To The Future

        After 20 years of evolution, ROOT is currently undergoing a change of gears, bringing our vision of simplicity, robustness and speed closer to physicists' reality. ROOT is now offering a game-changing, fundamentally superior approach to writing analysis code. It is working on a rejuvenation of the graphics system and user interaction. It automatically leverages modern CPU vector and multi-core capabilities. It improves compilation and run time with the use of C++ modules. And last but not least, it offers a new, optimized interface to access the content of a TTree.
        In parallel to these major new development efforts, ROOT continues to build on its strengths and evolves, for instance with a speedup of the I/O subsystem thanks to the judicious use of multiple cores and the offering of alternative compression algorithms, enhancements of its machine learning capabilities and connections, and improved platform support.
        This presentation will introduce the motivation, describe the features and state the progress with these main development lines. It will provide insights on the impact on experiments' frameworks, benchmarks from the context of the experiments' frameworks and data formats.
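
        The "new, optimized interface to access the content of a TTree" mentioned above presumably refers to RDataFrame. A minimal PyROOT sketch of its declarative style is given below, assuming a recent ROOT release where RDataFrame is available as ROOT.RDataFrame; the synthetic columns and cuts are invented for illustration.

        import ROOT

        # Toy input: instead of reading an existing TTree, build 10000 synthetic events in memory.
        df = (ROOT.RDataFrame(10000)
                  .Define("pt",  "gRandom->Exp(20000.)")        # fake transverse momentum in MeV
                  .Define("eta", "gRandom->Uniform(-2.5, 2.5)")) # fake pseudorapidity

        # Declarative analysis: cuts, derived columns and histograms are only scheduled here...
        h = (df.Define("pt_gev", "pt / 1000.")
               .Filter("pt_gev > 20 && fabs(eta) < 2.4", "kinematic selection")
               .Histo1D(("pt_gev", ";p_{T} [GeV];entries", 100, 0., 200.), "pt_gev"))

        # ...and the single event loop runs lazily, only when a result is actually requested.
        print("selected entries:", int(h.GetEntries()))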

        Speaker: Axel Naumann (CERN)
      • 534
        EOS Open Storage - evolution of an ecosystem for scientific data repositories

        The EOS project started as a specialized disk-only storage software solution for physics analysis use-cases at CERN in 2010.
        Over the years EOS has evolved into an open storage platform, leveraging several open source building blocks from the community. The service at CERN manages around 250 PB, distributed across two data centers, and provides user- and project-spaces to all CERN experiments. Strategic collaborations with many WLCG sites and non-HEP community members such as JRC, AARnet and COMTRADE have helped to broaden and generalize its usability. Another recent important milestone was the commissioning of the CERNBOX and SWAN services, which enable the use of EOS as a sync-and-share platform. In 2017 the initial EOS architecture reached its design limitations at CERN, with a visible impact on service stability. In 2018 we will put a revised architecture into production: this includes a highly scalable namespace implementation, generalized POSIX-like filesystem access and the coalescence of EOS storage and the CERN tape archive. As a participating member in the eXtreme DataCloud project and the WLCG data lake R&D activity, a new important project focus is to enable cost-effective distributed storage management incorporating existing storage and dynamic cloud resources. To meet the scalability requirements expected for LHC Run 3, EOS is evolving to act as a hybrid unifying platform for file and object storage, which might have a significant impact on the way physics data can be handled in the future by experiment frameworks and applications.

        Speaker: Andreas Joachim Peters (CERN)
      • 535
        Towards a Serverless CernVM-FS

        The CernVM File System (CernVM-FS) provides a scalable and reliable software distribution and---to some extent---a data distribution service. It gives POSIX access to more than half a billion binary files of experiment application software stacks and operating system containers to end user devices, grids, clouds, and supercomputers. Increasingly, CernVM-FS also provides access to certain classes of data, such as detector conditions data, genomics reference sets, or gravitational wave detector experiment data. For most of the high-energy physics experiments, an underlying HTTP content distribution infrastructure is jointly provided by universities and research institutes around the world.
        In this contribution, we will present recent developments, such as the integration into HPC environments and into XRootD networks, as well as future plans. For future developments, we put a focus on evolving the content distribution infrastructure and on lowering the barrier for publishing into CernVM-FS. Through so-called serverless computing, we envision cloud-hosted CernVM-FS repositories without the need to operate dedicated servers or virtual machines. An S3-compatible service in conjunction with a content delivery network takes on data provisioning, replication, and caching. A chain of time-limited and resource-limited functions (so-called “lambda functions” or “functions-as-a-service”) operates on the repository and stages the updates. As a result, any CernVM-FS client should be able to turn into a writer, provided it possesses suitable keys. For repository owners, we aim at providing cost transparency and seamless scalability from very small to very large CernVM-FS installations.

        Speaker: Jakob Blomer (CERN)
      • 536
        Disaster recovery of the INFN Tier-1 data center: lesson learned

        The year 2017 was most likely a turning point for the INFN Tier-1. In fact, on November 9th 2017, early in the morning, a large pipe of the city aqueduct, located under the road next to CNAF, broke. As a consequence, a river of water and mud flowed towards the Tier-1 data center. The level of the water did not exceed the safety threshold of the waterproof doors but, due to the porosity of the external walls and the floor, it found a way into the data center. The flood compromised almost all the activities and represented a serious threat to the future of the Tier-1 itself. The most affected part of the data center was the electrical room, with all the switchboards for both power lines and for the continuity systems, but the damage also extended to all the IT systems, including all the storage devices and the tape library. After a careful assessment of the damage, an intense recovery activity was launched, aimed not only at restoring the services but also at securing the data stored on disks and tapes. After nearly two months, in January, we were able to start gradually reopening all the services, including part of the farm and the storage systems. The long tail of the recovery (tape recovery, second power line) lasted until the end of May. As a short-term consequence we have started a deep consolidation of the data center infrastructure to be able to cope also with this type of incident; for the medium and long term we are working to move to a new, larger location, also able to accommodate the foreseen increase of resources for HL-LHC.

        Speaker: Luca dell'Agnello (INFN)
    • Banquet Hotel Marinela, Hall EFE

      Hotel Marinela, Hall EFE

      Blvd "James Bourchier" 83, 1407 Lozenets, Sofia