CHEP 2018 Conference, Sofia, Bulgaria

National Culture Palace, Boulevard "Bulgaria", 1463 NDK, Sofia, Bulgaria
Peter Hristov (CERN) , Vasil Vasilev (Princeton University (US)) , Petya Vasileva (CERN)
Description

CHEP 2018 took place on 9-13 July 2018 at the National Palace of Culture, Sofia, Bulgaria.
Details about the conference can be found at CHEP2018.org

 

 

 


Participants
  • Adam Aurisano
  • Adam Morris
  • Adam Wegrzynek
  • Adrian Alan Pol
  • Adrian Coveney
  • Adriana Telesca
  • Adriano Di Florio
  • Alastair Dewhurst
  • Alastair Pidgeon
  • Alberto Aimar
  • Alessandra Doria
  • Alessandra Forti
  • Alessandro De Salvo
  • Alessandro Lonardo
  • Alex Kastanas
  • Alex Martyniuk
  • Alexander Undrus
  • Alexandre Sousa
  • Alexei Klimentov
  • Alexey Anisenkov
  • Alexey Rybalchenko
  • Alja Mrak Tadel
  • Alkis Papadopoulos
  • Alvaro Fernandez Casani
  • Amber Boehnlein
  • Andre Sailer
  • Andrea Ceccanti
  • Andrea Manzi
  • Andrea Perrotta
  • Andrea Rizzi
  • Andrea Sciaba
  • Andrea Valassi
  • Andrea Valassi
  • Andreas Heiss
  • Andreas Joachim Peters
  • Andreas Nowack
  • Andreas Petzold
  • Andreas Salzburger
  • Andrei Gheata
  • Andrei Gheata
  • Andrei Kazarov
  • Andrew Hanushevsky
  • Andrew John Washbrook
  • Andrew McNab
  • Andrew Melo
  • Andrew Wightman
  • Andrey Baginyan
  • Andrey Dolbilov
  • Andrey Lebedev
  • Andrey Nechaevskiy
  • Anna Fatkina
  • Anna Woodard
  • Anton Josef Gamel
  • Antoni Shtipliyski
  • Antoni Shtipliyski
  • Antonio Augusto Alves Junior
  • Antonio Dias
  • Antonio Dias
  • Antonio Perez-Calero Yzquierdo
  • Aresh Vedaee
  • Arif Bayirli
  • Aristotelis Kyriakis
  • Armenuhi Abramyan
  • Armin Nairz
  • Artem Petrosyan
  • Arturo Sanchez Pineda
  • Arun Kumar
  • Athanasios Kotisis
  • Attila Krasznahorkay
  • Axel Naumann
  • Baosong Shan
  • Barthelemy Von Haller
  • Bartlomiej Rachwal
  • Beijiang Liu
  • Belmiro Moreira
  • Ben Couturier
  • Ben Couturier
  • Benedetto Gianluca Siddi
  • Benjamin Fischer
  • Benjamin Fischer
  • Benjamin Krikler
  • Benjamin Krikler
  • Benjamin Morgan
  • Benjamin Richards
  • Beraldo Leal
  • Birgit Lewendel
  • Bjarte Kileng
  • Boretto Marco
  • Boris Bauermeister
  • Boris Bauermeister
  • Boris Wagner
  • Borja Aparicio Cotarelo
  • Bouhali Othmane
  • Bradley Spitzbart
  • Brian Paul Bockelman
  • Bruno Heinrich Hoeft
  • Carl Lundstedt
  • Catherine Biscarat
  • Catrin Bernius
  • Cesare Calabria
  • Charles Leggett
  • Chekalina Viktoriia
  • Chris Burr
  • Chris Burr
  • Chris Lee
  • Christian Caron
  • Christian Voss
  • Christoph Heidecker
  • Christoph Wissing
  • Christopher Jones
  • Christopher Tunnell
  • Christopher Tunnell
  • Chun-Yu Lin
  • Claire Adam Bourdarios
  • Clara Gaspar
  • Claudio Grandi
  • Claudio Grandi
  • Concezio Bozzi
  • Costin Grigoras
  • Dagmar Adamova
  • Dainius Simelevicius
  • Damian Dechev
  • Daniel Crawford
  • Daniel S. Katz
  • Daniela Bauer
  • Daniele Cesini
  • DANIELE SPIGA
  • Danilo Piparo
  • Dario Berzano
  • Dave Dykstra
  • David Cameron
  • David Colling
  • David Crooks
  • David Dossett
  • David Groep
  • David Groep
  • David Kelsey
  • David Lange
  • David Lange
  • David Lawrence
  • David Martin Clavo
  • David Rohr
  • david rousseau
  • David Schultz
  • David Smith
  • Dennis Klein
  • Diego Da Silva Gomes
  • Dimitri Bourilkov
  • Dimitri Bourilkov
  • Diogo Di Calafiori
  • Dirk Duellmann
  • Dirk Hufnagel
  • Dirk Hutter
  • Dmitry Popov
  • Dmytro Kresan
  • Domenico Giordano
  • Dominik Muller
  • Dorin-Daniel Lobontu
  • Doris Ressmann
  • Duncan Rand
  • Edgar Fajardo Hernandez
  • Edoardo Martelli
  • Eduardo Rodrigues
  • Eduardo Rodrigues
  • Edward Karavakis
  • Edward Moyse
  • Elizabeth Gallas
  • Elizabeth Sexton-Kennedy
  • Emanouil Atanassov
  • Emanuel Gouveia
  • Enric Tejedor Saavedra
  • Enric Tejedor Saavedra
  • Enrico Fasanelli
  • Enrico Fattibene
  • Enrico Gamberini
  • Enrico Guiraud
  • Enrico Guiraud
  • Eric Lancon
  • Eric Vaandering
  • Esteban Fullana Torregrosa
  • Fabian Lambert
  • Fabio Hernandez
  • Fabio Hernandez
  • Fabrizio Furano
  • FaHui Lin
  • Federica Legger
  • Federico Carminati
  • Federico Stagni
  • Felice Pantaleo
  • Fernando Abudinen
  • Fernando Harald Barreiro Megino
  • Filippo Costa
  • Fons Rademakers
  • Francesco Di Capua
  • Francesco Tenchini
  • Frank Berghaus
  • Frank Gaede
  • Frank Winklmeier
  • Frank Winklmeier
  • Gancho Dimitrov
  • Gang Chen
  • Gene Van Buren
  • Gennady Ososkov
  • Geonmo Ryu
  • Gerhard Raven
  • German Cancio Melia
  • Ghita Rahal
  • Ghita Rahal
  • Giacomo Cucciati
  • Gianfranco Sciacca
  • Gianpaolo Carlino
  • Gilles Grasseau
  • Gioacchino Vino
  • Giulio Eulisse
  • Giuseppe Avolio
  • Gleyzer Sergei
  • Gonzalo Martínez Lema
  • Gordon Watts
  • Graeme Stewart
  • Graeme Stewart
  • Grigori Rybkin
  • Grigorii Latyshev
  • Grigory Kozlov
  • Grzegorz Jereczek
  • Guilherme Amadio
  • Guy Barrand
  • Gvozden Neskovic
  • Hadrien Benjamin Grasland
  • Hadrien Benjamin Grasland
  • Hannah Short
  • Hannes Sakulin
  • Hasib Muhammad
  • Helge Meinhard
  • Henry Fredrick Schreiner
  • Henry Fredrick Schreiner
  • Herve Rousseau
  • Holger Schulz
  • Hristo Mohamed
  • Hugo Gonzalez Labrador
  • Håvard Helstrup
  • Ian Bird
  • Ian Collier
  • Ian Fisk
  • Ifan Williams
  • Ignacio Asensi Tortajada
  • Ignacio Heredia
  • Igor Pelevanyuk
  • Igor Soloviev
  • Ilaria Vai
  • Illya Shapoval
  • Imma Riu
  • Irina Filozova
  • Ivan Glushkov
  • Ivan Kadochnikov
  • Ivan Sokolov
  • Ivana Hrivnacova
  • Jack Cranshaw
  • Jakob Blomer
  • James Catmore
  • James Catmore
  • JAMES LETTS
  • Jan Erik Sundermann
  • Jan Knedlik
  • Jan Strube
  • Jana Schaarschmidt
  • Janusz Martyniak
  • Jaroslava Schovancova
  • Jaroslava Schovancova
  • Javier Montejo Berlingen
  • Jean-Roch Vlimant
  • Jean-Yves Le Meur
  • Jeff Dost
  • Jeff Porter
  • Jeff Templon
  • Jennifer Ngadiuba
  • Jerome LAURET
  • Jerome Odier
  • Jeronimo Bezerra
  • Jessica Stietzel
  • Jiaheng Zou
  • Jim Pivarski
  • Jim Pivarski
  • Joao Vitor Viana Barbosa
  • Joel Closier
  • Joel Closier
  • Johan Bregeon
  • Johannes Elmsheuser
  • Johannes Lehrbach
  • Jose Castro Leon
  • Jose Hernandez
  • Josep Flix Molina
  • João Fernandes
  • Julia Andreeva
  • Julian Wojciech Myrcha
  • Julie Kirk
  • Julien Leduc
  • Juraj Smiesko
  • Jurry de la Mar
  • Ka Vang Tsang
  • Kai Leffhalm
  • Karol Hennessy
  • Karol Hennessy
  • Katarzyna Maria Dziedziniewicz-Wojcik
  • Kenyi Hurtado Anampa
  • Kevin Patrick Lannon
  • Kevin Pedro
  • Khalil Bouaouda
  • Khalil Chawoshi
  • Kilian Schwarz
  • Kim Albertsson
  • Konstantin Gertsenberger
  • Konstantin Gertsenberger
  • Konstantin Skarbuskas
  • Kosuke Takeda
  • Kristin Fanebust Hetland
  • Kurt Rinnert
  • Kurt Rinnert
  • Kyle Knoepfel
  • Latchezar Betev
  • Laura Promberger
  • Laurent Aphecetche
  • LeManuel Bitsoi
  • Leo Piilonen
  • Li Haibo
  • Lindsey Gray
  • Lorenzo Moneta
  • Lorenzo Rinaldi
  • Lotta Mether
  • Luca dell'Agnello
  • Luca Perrozzi
  • Lucia Grillo
  • Luis Granado Cardoso
  • Luisa Arrabito
  • Lukas Heinrich
  • Lukasz Dutka
  • Lynn Wood
  • Maarten Litmaath
  • Maciej Maciejewski
  • Maciej Pawel Szymanski
  • Maciej Pawel Szymanski
  • Maiken Pedersen
  • Maksym Zyzak
  • Malachi Schram
  • Mantas Stankevicius
  • Manuel Giffels
  • Manuel Jesus Rodriguez
  • Marcel Andre Schneider
  • Marco Canaparo
  • Marco Cattaneo
  • Marco Clemencic
  • Marco Clemencic
  • Marco Mascheroni
  • Marcus Ebert
  • Maria Girone
  • Marica Antonacci
  • Marilena Bandieramonte
  • Marilena Bandieramonte
  • Marina Sahakyan
  • Mario Lassnig
  • Mark Neubauer
  • Mark Neubauer
  • Marko Petric
  • Markus Frank
  • Martin Adam
  • Martin Barisits
  • Martin Gasthuber
  • Martin Ritter
  • Martin Ritter
  • Martin Sevior
  • Mateusz Jacek Goncerz
  • Matevz Tadel
  • Matteo Concas
  • Matteo Cremonesi
  • Matteo Rama
  • Matthias Jochen Schnepf
  • Matthias Pink
  • Matthias Richter
  • Matthieu Marinangeli
  • Mauro Verzetti
  • Max Fischer
  • Maxim Borisyak
  • Maxim Potekhin
  • Mayank Sharma
  • Meade Patrick
  • Michael Andrews
  • Michael Bender
  • Michael Davis
  • Michael Papenbrock
  • Michael Papenbrock
  • Michael Russell
  • Michal Orzechowski
  • Michal Svatos
  • Michel Jouvin
  • Miguel Martinez Pedreira
  • Mihaela Gheata
  • Mihaela Gheata
  • Mikhail Hushchyn
  • Mikhail Titov
  • Milos Lokajicek
  • Mirena Paneva
  • Mohammad Al-Turany
  • Monica Rush
  • Moritz Kiehn
  • Narine Manukyan
  • Nico Madysa
  • Nico Madysa
  • Nicola Tarocco
  • Nicola Tarocco
  • Nicolas Liampotis
  • Nicolo Magini
  • Nikita Balashov
  • Nikita Balashov
  • Nikolay Voytishin
  • Nikos Kasioumis
  • Oksana Shadura
  • Oksana Shadura
  • Oleg Samoylov
  • Oliver Lantwin
  • Oliver Lantwin
  • Ondrej Subrt
  • Oxana Smirnova
  • Pablo Llopis Sanmillan
  • Pablo Martin Zamora
  • Panos Paparrigopoulos
  • Paolo Calafiura
  • Pardi Silvio
  • Patricia Mendez Lorenzo
  • Patrick Fuhrmann
  • Patrick Meade
  • Patrick Robbe
  • Paul James Laycock
  • Paul Millar
  • Paul Seyfert
  • Paul Seyfert
  • Pavel Goncharov
  • Pavlo Svirin
  • Pedro Ferreira
  • Pedro Ferreira
  • Pekko Metsä
  • Pere Mato
  • Pere Mato
  • Peter Clarke
  • Peter Elmer
  • Peter Hristov
  • Peter Hristov
  • Peter Love
  • Peter Onyisi
  • Peter Onyisi
  • Petr Fedchenkov
  • Petya Vasileva
  • Philipp Sitzmann
  • Philippe Canal
  • Philippe Charpentier
  • Pier Paolo Ricci
  • Placido Fernandez Declara
  • Prasanth Kothuri
  • Prasanth Kothuri
  • Predrag Buncic
  • Preslav Konstantinov
  • Qiumei Ma
  • Radu Popescu
  • Rafal Grzymkowski
  • Ralf Spiwoks
  • Ralf Vamosi
  • Ran Du
  • Randall Sobie
  • RAUL LOPES
  • Raymond Ehlers
  • Remi Ete
  • Remi Mommsen
  • Rene Caspart
  • Richard Cavanaugh
  • Richard Hughes-Jones
  • Rob Appleyard
  • Robert Andrew Currie
  • Rodney Walker
  • Roel Aaij
  • Roel Aaij
  • Rok Pestotnik
  • Roland Kunz
  • Rolf Seuster
  • Rosario Esposito
  • Rosen Matev
  • Rosie Bolton
  • Ryan Taylor
  • Ryosuke Itoh
  • Samoylov Oleg
  • Samuel Skipsey
  • Sandro Wenzel
  • Sang Un Ahn
  • Santiago Gonzalez De La Hoz
  • Scott Rowan
  • Scott Snyder
  • Scott Snyder
  • Sean Murray
  • Sebastian Merkt
  • Sebastien Binet
  • Sebastien Gadrat
  • Sebastien Ponce
  • Semen Lebedev
  • Sergey Gorbunov
  • Sergey Linev
  • Sergio Traldi
  • Serguei Kolos
  • Sharad Agarwal
  • Shawn Mc Kee
  • Shota Hayashida
  • Sibel Yasar
  • Simon Blyth
  • Simon C. Lin
  • Simon C. Lin
  • Simon Fayer
  • Simone Campana
  • Simone Sottocornola
  • Sioni Paris Summers
  • Slava Krutelyov
  • Sofia Vallecorsa
  • Stefan Roiser
  • Stefan Stancu
  • Stefan-Gabriel Chitic
  • Stefan-Gabriel Chitic
  • Stefano Bagnasco
  • Stefano Spataro
  • Steven Farrell
  • Steven Goldfarb
  • Stewart Martin-Haugh
  • Storetvedt Maksim
  • Sören Fleischer
  • Tadashi Maeno
  • Tadashi Murakami
  • Tadeas Bilka
  • Tadej Novak
  • Tai Sakuma
  • Tai Sakuma
  • Takahashi Yuka
  • Tao Lin
  • Tatiana Strizh
  • Templon Jeff
  • Teng LI
  • Teo Mrnjavac
  • Thomas Charman
  • Thomas Charman
  • Thomas Hartmann
  • Thomas Hauth
  • Thomas Hauth
  • Thomas Kress
  • Thomas Kuhr
  • Thomas Maier
  • Thomas Vuillaume
  • Thorsten Kollegger
  • Tian Yan
  • Tibor Simko
  • Tigran Mkrtchyan
  • Tim Hallyburton
  • Tobias Stockmanns
  • Tobias Wegner
  • Todd Seiss
  • Tomas Lindén
  • Tomasz Piotr Trzcinski
  • Tommaso Boccali
  • Tomoaki Nakamura
  • Tony Cass
  • Torre Wenaus
  • Toru Sugitate
  • Traynor daniel
  • Ulf Behrens
  • V Daniel Elvira
  • Valentin Kuznetsov
  • Valentina Akishina
  • Valerio Formato
  • Valery Mitsyn
  • Vardan Gyurjyan
  • Vasco Chibante Barroso
  • Vasil Georgiev Vasilev
  • Vassil Verguilov
  • Victor Estrade
  • Vidar Tonaas Fauske
  • Vincente Beisner
  • Vincenzo Capone
  • Virginia Azzolini
  • Vito Di Benedetto
  • Vladimir Ivantchenko
  • Vladimir Korenkov
  • Volker Friese
  • Volodimir Begy
  • Volodymyr Yurchenko
  • Wahid Bhimji
  • Walter Lampl
  • Waseem Kamleh
  • Wei Yang
  • Wenjing Wu
  • William Kalderon
  • William Leight
  • Wojciech Jan Krzemien
  • Wojciech Jan Krzemien
  • Xiaobin Ji
  • Xiaoguang Yue
  • Yao Zhang
  • Yaodong Cheng
  • Yuji Kato
  • Yuka Takahashi
  • Zach Marshall
  • Zbigniew Baranowski
  • Zbigniew Baranowski
  • Zhang Xiaomei
  • Zhechka Toteva
  • Ziyan Deng
    • 8:00 AM
      Registration

      National Palace of Culture

    • Plenary: S1 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      • 1
      • 2
        Integrated HPC Infrastructure for the Research Communities in South-East Europe

        The region of South-East Europe has a long history of successful collaboration in sharing resources and managing distributed electronic infrastructures for the needs of research communities. HPC resources such as supercomputers and large clusters with low-latency interconnects are especially valuable and scarce in the region. Building upon successfully tested operational and policy models, the VI-SEEM project is focusing on establishing an integrated Virtual Research Environment that comprises not only HPC but also Grid, Cloud and Data management resources. By providing a common Authentication, Authorisation and Accounting system, the VI-SEEM infrastructure simplifies access to resources for research groups with complex workflows. Because of the diversity of hardware technologies chosen by the different partners, the integrated infrastructure is able to fulfil widely different needs for HPC resources. Applications outside of the initial target user communities (Climatology, Life Sciences and Digital Cultural Heritage) are also accepted through open calls with a peer review process. Motivated by the challenges that the individual nations in the region face in providing HPC resources for their researchers, the integrated VI-SEEM infrastructure optimises the use of the available resources and helps retain the human capital of the region.

        Speaker: Emanouil Atanassov (Unknown)
      • 3
        HNSciCloud, a Hybrid Cloud for Science

        Helix Nebula Science Cloud (HNSciCloud) has developed a hybrid cloud platform that links together commercial cloud service providers and research organisations’ in-house IT resources via the GEANT network.
        The platform offers data management capabilities with transparent data access where applications can be deployed with no modifications on both sides of the hybrid cloud and compute services accessible via eduGAIN and ELIXIR federated identity and access management systems. In addition, it provides support services, account management facilities, full documentation and training.
        The cloud services are being tested by a group of 10 research organisations from across Europe (http://www.hnscicloud.eu/partner-type/buyers ) against the needs of use-cases (http://www.hnscicloud.eu/hnscicloud-user-groups ) from 7 ESFRI research infrastructures:

        • Euro-BioImaging: European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences
        • CTA: Cherenkov Telescope Array
        • BBMRI: Biobanking and BioMolecular resources Research Infrastructure
        • ELIXIR: A distributed infrastructure for life-science information
        • ESRF Upgrades: Extremely Brilliant Source
        • European-XFEL: European X-Ray Free-Electron Laser Facility
        • HL-LHC: High-Luminosity Large Hadron Collider

        The capacity procured by the 10 research organisations from the commercial cloud service providers to support these use-cases during 2018 exceeds 20,000 cores and 2 petabytes of storage with a network bandwidth of 40Gbps.
        All the services are based on open source implementations that do not require licenses in order to be deployed on the in-house IT resources of research organisations connected to the hybrid platform.
        An early adopter scheme has been put in place so that more research organisations can connect to the platform and procure additional capacity to support their research programmes.

        This contribution will give an overview of the status of the project, go through the current findings and provide an outlook on the future beyond HNSciCloud.

        Speaker: João Fernandes (CERN)
      • 4
        HNSciCloud – Large-scale data processing and HPC for science with T-Systems hybrid cloud

        As the result of joint R&D work with 10 of Europe’s leading public research organisations, led by CERN and funded by the EU, T-Systems provides a hybrid cloud solution, enabling science users to seamlessly extend their existing e-Infrastructures with one of the leading European public cloud services based on OpenStack – the Open Telekom Cloud.
        With this new approach large-scale data-intensive and HPC-type scientific use cases can now be run more dynamically, reaping the benefits of the on-demand availability of commercial cloud services at attractive costs.
        Over the course of the last year, prototyping and piloting have confirmed that science users can get seamless, performant, secure and fully automated access to cloud resources over the GÉANT network, simplified by identity federation with eduGAIN and Elixir AAI. Users can work in a cloud-native way, maintaining existing toolsets or choosing other OpenStack and S3-compatible tools from a large and fast-growing community, e.g. Ansible and Terraform, to run and manage applications. Users remain in full control and have access to all native functions of the cloud resources through a web browser, APIs or the CLI. Cloud Management Platforms or Broker solutions are not needed, but may be added if further abstraction is required.
        The extensive service menu of Open Telekom Cloud – based on OpenStack – is opening up new functionality and performance for scientific use cases with built-in support for e.g. Docker, Kubernetes, MapReduce, Data Management, Data Warehouse and Data Ingestion services. The services can be combined with a wide range of compute and storage options. Compute can consist of any combination of containers and virtual, dedicated or bare metal servers. Server types can be optimized for disk-intensive, large-memory, HPC or GPU applications. The extensive network and security functions enable users to maintain a private and secure environment, whereby access to services can make full use of 10G networking.
        The keynote will give an overview of the performance and scale of use cases that have been successfully deployed. It will address how large-scale data can be processed at new performance levels with hundreds of containers and how data can be processed in an intelligent way by pre-fetching the data or leaving the data remote at the existing infrastructure, making use of the state-of-the-art Onedata Data Management solution from Cyfronet. Furthermore, the results of the new high level of transparency and budget control developed will be demonstrated. Details will be available at the T-Systems booth.
        Ten of Europe’s leading public research organisations led by CERN launched the Helix Nebula Science Cloud (HNSciCloud) Pre-Commercial Procurement to establish a European hybrid cloud platform that will support the high-performance, data-intensive scientific use-cases of this “Buyers Group” and of the research sector at large. The intention is to pave the way for the European Open Science Cloud, enabling more and better science.

        Speaker: Jurry de la Mar (T-Systems International GmbH)
      • 5
        HNSciCloud Multi-Cloud Approach – An open hybrid cloud for science

        Ten of Europe’s leading public research organisations led by CERN launched the Helix Nebula Science Cloud (HNSciCloud) Pre-Commercial Procurement to establish a European hybrid cloud platform that will support the high-performance, data-intensive scientific use-cases of this “Buyers Group” and of the research sector at large. It calls for the design and implementation of innovative Infrastructure as a Service (IaaS) solutions for compute, storage, network connectivity, Federated Identity Management and Service Payment Models, to augment & enhance the science community’s existing systems.

        The RHEA Group HNSciCloud Nuvla multi-cloud solution provides the European science community a means to access, via their existing academic network infrastructure, large scale commercial cloud resources (computing and storage) from multiple cloud providers securely and transparently. The Nuvla platform allows users to authenticate with their academic on-line credentials to manage compute resources and to access local and cloud-based data locally. Users can deploy Virtual Machines (VMs) and/or Containers to multiple clouds and monitor their usage and cloud performance. Nuvla’s brokering system means users (or the procuring authorities) can choose which cloud to use based on price, performance, location or other factors which are important to them. The Nuvla service is completely neutral with respect to cloud providers, so by using Nuvla, users can deploy their VMs and containers to any supported cloud. Users may have their own orchestration tools and can optionally deploy directly to the clouds of their choice by using their native APIs.

        Users signing in via Nuvla can still access the underlying cloud resources directly; such users still benefit from the data management and the usage and performance monitoring offered by Nuvla. In this way users maintain control while Nuvla helps them get the best value for money and performance from the resources. In the future, almost any other public/commercial cloud could be integrated provided it is connected to GÉANT and meets the minimum performance and reliability requirements, as demonstrated with the inclusion of the Advania cloud.

        As well as delivering processing and resources at scale via the GÉANT network, the platform supports the eduGAIN and Elixir AAI (SAML 2.0) identity federations, allowing users to access cloud resources via a web browser, Application Programming Interface (API) or Command Line Interface (CLI), with access rights accorded by their unique identity. Onedata allows data to be shared across multiple clouds as well as with local infrastructures. Organisations with their own internal cloud can also access that cloud via Nuvla, allowing commercial and internal clouds to be managed from one interface. Nuvla is based on the SlipStream open-source software and Cyfronet’s Onedata, extended within the HNSciCloud project to support the diverse scientific requirements of the Buyers Group with three European commercial clouds – Open Telekom, Exoscale and Advania – that provide secure, high-performance cloud resources, including High-Performance Computing (HPC) as a service and access to Graphics Processing Units (GPUs).

        Speaker: Mr Alastair Pidgeon (RHEA System S.A.)
    • 10:30 AM
      Break
    • T1 - Online computing: S1 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Adriana Telesca (CERN)
      • 6
        The ToolDAQ DAQ software Framework and its use in the Hyper-K, E61 & ANNIE detectors

        Data Acquisition (DAQ) systems are a vital component of every experiment. The purpose of the underlying software of these systems is to coordinate all the hardware components and detector states, providing the means of data readout, triggering, online processing, persistence, user control and the routing of data. These tasks are made more challenging when also considering fault tolerance, scalability, computing and data distribution, tolerance to hardware changes and usability.
        We will present ToolDAQ, a C++ DAQ framework which has been designed to be easy and fast to adapt in a modular and simple way. Its features include built-in service discovery, dynamic reconfiguration, remote control/monitoring via web and terminal interfaces, and a highly scalable, fault-tolerant network communication infrastructure provided by ZMQ. It is also compatible with both newer and older hardware, being very lightweight and having few dependencies.
        The framework is currently in use on the ANNIE experiment at Fermilab and has been used to develop the DAQ for Hyper-Kamiokande and the E61 intermediate water Cherenkov detector at J-PARC. We will report on the experience gained in adapting and running the framework for the ANNIE experiment as well as the Hyper-Kamiokande and E61 developments and use cases.

        Speaker: Dr Benjamin Richards (Queen Mary University London)
      • 7
        The Continuously Running iFDAQ of the COMPASS Experiment

        Recently, the stability of the Data Acquisition System (DAQ) has become a vital precondition for successful data taking in high energy physics experiments. The intelligent, FPGA-based Data Acquisition System (iFDAQ) of the COMPASS experiment at CERN is designed to read out data at the maximum rate of the experiment and to run without any stops. DAQ systems fulfilling such requirements reach an efficiency of up to 99%. The iFDAQ runs nonstop 24/7, regardless of nights, weekends or bank holidays, for most of the calendar year, which puts stress on the reliability and robustness of the system. Every undesirable interruption of data taking results in a possible loss of physics data. To improve the iFDAQ stability, the communication library DIALOG has been implemented for inter-process communication, and the DAQ Debugger has been developed for error detection and incorporated into all iFDAQ processes. Moreover, the continuously running mode makes it possible to collect data in runs of 200 spills, 24/7, without any further user intervention. This mode affects all processes of the iFDAQ, with high emphasis on reliability and precise synchronization. It helped to collect more physics data in the 2017 run. In the paper, we present the continuously running mode in more detail and discuss the overall iFDAQ stability.

        Speaker: Ondrej Subrt (Czech Technical University (CZ))
      • 8
        The Data acquisition system of the NA62 experiment at CERN

        The NA62 experiment looks for the extremely rare kaon decay $K^+ \to \pi^+ \nu \bar{\nu}$ and aims at measuring its branching ratio with a 10% accuracy.
        In order to do so, a very high intensity secondary beam from the CERN SPS is used to produce charged kaons whose decay products are detected by many detectors installed along a 150 m decay region.

        The NA62 Data Acquisition system exploits a multilevel trigger system: following an L0 trigger decision, the 1 MHz data flux from about 60 sources is read by a PC farm, and the partial event is built and then passed through a series of L1 algorithms to further reduce the trigger rate. Events passing this level are completed with the missing, larger data sources (~400 units) at a rate of 100 kHz.

        The DAQ is built around a high-performance Ethernet network interconnecting the detectors to a farm of 30 PCs. After an overall description of the system design and the main implementation choices that made it possible to reach the required performance and functionality, this paper describes the overall behaviour of the DAQ in the 2017 data taking period, which was the first year of high-intensity and high-efficiency data taking for the experiment. It then concludes with an outlook on possible improvements and upgrades that may be applied to the system in the future.

        Speaker: Mr Marco Boretto (CERN)
      • 9
        Experience with Shifter Assistant: an intelligent tool to help operations of ATLAS TDAQ system in LHC Run 2

        The Trigger and DAQ (TDAQ) system of the ATLAS experiment is a complex
        distributed computing system, composed of O(30000) applications
        running on a farm of computers. The system is operated by a crew of
        operators on shift. An important aspect of operations is to minimize
        the downtime of the system caused by runtime failures, such as human
        errors, unawareness, miscommunication, etc.

        The paper describes recent developments in one of the “intelligent” TDAQ
        frameworks, the Shifter Assistant (SA) and summarizes the experience
        of its use in operations of ATLAS in the course of LHC Run 2.

        SA is a framework whose main aim is to automate routine system
        checks, error detection and diagnosis, event correlation etc. in
        order to help the operators react to runtime problems promptly and
        effectively. The tool is based on CEP (Complex Event Processing)
        technology. It constantly processes the stream of operational events
        (O(100kHz)) over a set of “directives” (or rules) in the knowledge
        base, producing human-oriented alerts and making shifters aware of
        operational issues.

        More than 200 directives were developed by TDAQ and detector experts
        for different domains. In this paper we also describe different types
        of directives and present examples of the most interesting ones,
        demonstrating the power of CEP for this type of application.
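
        As an illustration of what a CEP-style directive over an event stream can look like, here is a minimal sketch in Python. It is not the Shifter Assistant implementation: the `Event` fields, the `RepeatedErrorDirective` rule, its threshold and time window, and the application names are all hypothetical, chosen only to show a rule that correlates events within a sliding time window and emits a human-oriented alert.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One operational event; all fields are hypothetical."""
    time: float   # seconds since the start of the run
    source: str   # application that emitted the event
    kind: str     # e.g. "ERROR" or "INFO"


class RepeatedErrorDirective:
    """Alert when one source emits `threshold` errors within `window` seconds."""

    def __init__(self, threshold=3, window=10.0):
        self.threshold = threshold
        self.window = window
        self._recent = {}  # source -> deque of recent error timestamps

    def process(self, event):
        """Feed one event; return an alert string, or None if the rule is silent."""
        if event.kind != "ERROR":
            return None
        times = self._recent.setdefault(event.source, deque())
        times.append(event.time)
        # Drop timestamps that have slid out of the time window.
        while times and event.time - times[0] > self.window:
            times.popleft()
        if len(times) >= self.threshold:
            times.clear()  # avoid re-alerting on the same burst
            return f"{event.source}: {self.threshold} errors within {self.window:g} s"
        return None


def run(directive, events):
    """Apply one directive to an event stream and collect the alerts."""
    alerts = []
    for event in events:
        alert = directive.process(event)
        if alert is not None:
            alerts.append(alert)
    return alerts


# Hypothetical stream: "ros-1" produces a burst of errors, "ros-2" does not.
stream = [
    Event(0.0, "ros-1", "ERROR"),
    Event(1.0, "ros-2", "ERROR"),
    Event(2.0, "ros-1", "INFO"),
    Event(4.0, "ros-1", "ERROR"),
    Event(9.0, "ros-1", "ERROR"),
    Event(30.0, "ros-2", "ERROR"),
]
alerts = run(RepeatedErrorDirective(), stream)
print(alerts)
```

        In this sketch only the burst from the hypothetical source `ros-1` raises an alert; the two isolated errors from `ros-2` fall in different windows and are ignored. A production CEP engine would evaluate many such rules concurrently, typically expressed in a dedicated query language.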

        Speaker: Andrei Kazarov (Petersburg Nuclear Physics Institut (RU))
      • 10
        Operational experience with the new CMS DAQ expert

        The data acquisition (DAQ) system of the Compact Muon Solenoid (CMS) at CERN reads out the detector at the level-1 trigger accept rate of 100 kHz, assembles events with a bandwidth of 200 GB/s, provides these events to the high-level trigger running on a farm of 26000 cores and records the accepted events. Comprising custom-built and cutting-edge commercial hardware and several thousand instances of software applications, the DAQ system is complex in itself and failures cannot be completely excluded. Moreover, problems in the detectors, in the first-level trigger system or in the high-level trigger may provoke anomalous behaviour of the DAQ system which sometimes cannot easily be differentiated from a problem in the DAQ system itself. In order to achieve high data taking efficiency with operators from the entire collaboration and without relying too heavily on the on-call experts, an expert system, the DAQ Expert, has been developed that can pinpoint the source of most failures and give advice to the shift crew on how to recover in the quickest way. The DAQ Expert constantly analyzes monitoring data from the DAQ system and the high-level trigger by making use of logic modules written in Java that encapsulate the expert knowledge about potential operational problems. The results of the reasoning are presented to the operator in a web-based dashboard, may trigger sound alerts in the control room and are archived for post-mortem analysis, presented in a web-based timeline browser. We present the design of the DAQ Expert and report on the operational experience since 2017, when it was first put into production.
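
        The figures quoted above imply a rough scale for the event size and for the processing time available in the high-level trigger. The following back-of-the-envelope sketch makes that arithmetic explicit. It rests on assumptions not stated in the abstract: 1 GB is taken as 10^9 bytes, the 200 GB/s bandwidth is treated as fully used at the 100 kHz accept rate (so the event size is an upper bound on the mean), and each core is assumed to process one event at a time.

```python
def mean_event_size_bytes(bandwidth_bytes_per_s, event_rate_hz):
    """Largest mean event size the event builder can sustain at this rate."""
    return bandwidth_bytes_per_s / event_rate_hz


def time_budget_per_event_s(n_cores, event_rate_hz):
    """Average processing time per event if each core handles one event at a time."""
    return n_cores / event_rate_hz


# Figures quoted in the abstract; the unit conversion (1 GB = 1e9 bytes)
# is an assumption of this sketch.
L1_ACCEPT_RATE_HZ = 100e3
EVENT_BUILDER_BANDWIDTH_BYTES_PER_S = 200e9
HLT_CORES = 26000

event_size = mean_event_size_bytes(EVENT_BUILDER_BANDWIDTH_BYTES_PER_S, L1_ACCEPT_RATE_HZ)
budget = time_budget_per_event_s(HLT_CORES, L1_ACCEPT_RATE_HZ)

print(f"mean event size <= {event_size / 1e6:.1f} MB")
print(f"time budget per event ~ {budget * 1e3:.0f} ms")
```

        Under these assumptions the event builder can sustain events of up to about 2 MB on average, and the high-level trigger has roughly a quarter of a second per event.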

        Speaker: Dr Hannes Sakulin (CERN)
      • 11
        Software-based data acquisition system for Level-1 end-cap muon trigger in ATLAS Run-3

        In 2019, the ATLAS experiment at CERN is planning an upgrade
        in order to cope with the higher luminosity requirements. In this
        upgrade, the installation of the new muon chambers for the end-cap
        muon system will be carried out. Muon track reconstruction performance
        can be improved, and fake triggers can be reduced. It is also
        necessary to develop a readout system for the trigger data of the Level-1
        end-cap muon trigger.

        We have decided to develop a software-based data acquisition system.
        Therefore, we have implemented SiTCP technology, which connects an FPGA
        to the network, on the FPGAs of the new trigger processor boards.

        Due to this implementation, the new DAQ system can take advantage of
        the latest developments in the computing industry. The new readout system
        architecture is based on multi-process software, and can assemble
        events at a rate of 100 kHz. For data collection, a 10 Gbit Ethernet
        network switch is used. Moreover, we have optimized these processes to
        send data to the downstream system without any error. As a result, the
        built events can be sent with an average throughput of ~211 Mbps.

        Our newly developed readout system is generic, flexible with respect
        to modifications and extensions, and easy to debug. In this talk, we will
        present the details of the new software-based DAQ system and report
        the development status for ATLAS Run-3.

        Speaker: Kosuke Takeda (Kobe University (JP))
    • T2 - Offline computing: S1 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Victor Daniel Elvira (Fermi National Accelerator Lab. (US))
      • 12
        A fully parametric option in the LHCb simulation framework

        Faster alternatives to a full, GEANT4-based simulation are being pursued within the LHCb experiment. In this context the integration of the Delphes toolkit in the LHCb simulation framework is intended to provide a fully parameterized option.
        Delphes is modular software designed for general-purpose experiments such as ATLAS and CMS to quickly propagate stable particles using a parametric approach and to provide reconstructed physical objects as output. It includes propagation inside a magnetic field and parameterized response for tracking and muon systems as well as calorimeters. Particle energies are computed by smearing the initial visible particles' momenta according to detector resolution.
        The present contribution illustrates the current status of the Delphes toolkit in Gauss, the LHCb simulation framework. In this integration the particle transport performed by GEANT4 and subsequent mimicking of detector response and reconstruction has been replaced with a parametric response of the various detector elements.
        The implementation required significant changes to Delphes itself to constrain the particle transport inside the detector acceptance and to match the LHCb dipole magnetic field. The configuration of various parameterizations of resolution and efficiency is also a major aspect of the work to provide a fully functional simulation. The output of the resulting fast simulation is formatted so that it can be used in the LHCb physics analysis framework.
        An outline of the use of Delphes to evaluate future detector upgrade options is also given.
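
        The smearing step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a Gaussian response with a constant relative resolution, which is a simplification (real parameterizations typically depend on momentum and angle). It does not use the Delphes API; the function name, resolution value and momenta are hypothetical.

```python
import random


def smear_momentum(p_true: float, rel_resolution: float, rng: random.Random) -> float:
    """Return p_true smeared by a Gaussian of width rel_resolution * p_true."""
    if p_true < 0 or rel_resolution < 0:
        raise ValueError("p_true and rel_resolution must be non-negative")
    return rng.gauss(p_true, rel_resolution * p_true)


# Hypothetical inputs: three true momenta (GeV/c) and a 1% relative resolution.
rng = random.Random(12345)  # fixed seed so the sketch is reproducible
true_momenta = [5.0, 20.0, 80.0]
smeared = [smear_momentum(p, 0.01, rng) for p in true_momenta]
for p, s in zip(true_momenta, smeared):
    print(f"true = {p:6.2f}  smeared = {s:6.2f}")
```

        Passing the random generator explicitly keeps the function deterministic under a fixed seed, which makes such a parameterization easier to validate against a full simulation.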

        Speaker: Benedetto Gianluca Siddi (Universita di Ferrara & INFN (IT))
      • 13
        Improved Fast Calorimeter Simulation in ATLAS

        ATLAS relies on very large samples of simulated events for delivering high-quality
        and competitive physics results, but producing these samples takes much time and
        is very CPU intensive when using the full GEANT4 detector simulation.
        Fast simulation tools are a useful way of reducing CPU requirements when detailed
        detector simulations are not needed. During the LHC Runs 1 and 2, a fast
        calorimeter simulation (FastCaloSim) was successfully used in ATLAS.
        FastCaloSim provides a simulation of the particle energy response at the calorimeter
        read-out cell level, taking into account the detailed particle shower shapes and the
        correlations between the energy depositions in the various calorimeter layers. It is
        interfaced to the standard ATLAS digitization and reconstruction software, and it can
        be tuned to data more easily than GEANT4.
        Now an improved version of FastCaloSim is in development, incorporating the experience
        with the version used during Run-1. The new FastCaloSim makes use of machine
        learning techniques, such as principal component analysis and neural networks, to
        optimise the amount of information stored in the ATLAS simulation infrastructure. This
        allows for further performance improvement by reducing the I/O time and the memory
        usage during the simulation job.
        A prototype is being tested and validated, and it has shown significant improvements in the
        description of cluster level variables in electromagnetic and hadronic showers. ATLAS
        plans to use this new FastCaloSim parameterization to simulate several billion events in the
        upcoming LHC runs.
        It will be combined with other fast tools used in the ATLAS production chain. In this
        Fast Chain the simulation, digitisation and reconstruction of the events are handled by fast
        tools. In this talk, we will describe this new FastCaloSim parametrisation and the current status
        of the ATLAS Fast Chain.

        Speaker: Jana Schaarschmidt (University of Washington (US))
      • 14
        Fast calorimeter simulation in the LHCb Gauss framework

        In HEP experiments CPU resources required by MC simulations are constantly growing and becoming a very large fraction of the total computing power (greater than 75%). At the same time the pace of performance improvements given by technology is slowing down, so the only solution is a more efficient use of resources. Efforts are ongoing in the LHC experiment collaborations to provide multiple options for simulating events in a faster way when higher statistics are needed. A key to the success of this strategy is the possibility of enabling these fast simulation options in a common framework with minimal action by the final user.

        In this talk we will describe the solution adopted in Gauss, the LHCb simulation software framework, to selectively exclude particles from being simulated by the Geant4 toolkit and to insert the corresponding hits generated in a faster way. The approach, integrated within the Geant4 toolkit, has been applied to the LHCb calorimeter but it could also be used for other subdetectors. The hit generation can be carried out by any external tool, e.g. by a static library of showers or more complex machine-learning techniques. A first implementation of the calorimeter hit generation will be described. Detailed timing measurements and a comparison with the default simulation for reference physical quantities will also be presented.

        Speaker: Matteo Rama (Universita & INFN Pisa (IT))
      • 15
        Generative Models for Fast Calorimeter Simulation: LHCb Case

        The goal of obtaining more precise physics results in current collider experiments drives the plans to significantly increase the instantaneous luminosity collected by the experiments. The increasing complexity of the events due to the resulting increased pileup requires new approaches to triggering, reconstruction, analysis, and event simulation. The last task leads to a critical problem: generating the significantly larger amount of Monte Carlo (MC) data required for analysis of the data collected at higher collider luminosity, without a drastic increase in computing resources, requires a significant speed-up of the simulation algorithms.
        The largest part of computer resources in simulation is currently spent in the detailed GEANT modeling of particles interacting with the material of the experimental apparatus, in particular the shower development in electromagnetic and hadronic calorimeters.
        To accelerate these computations we use an approach based on sample generation with generative models, which are nowadays widely used for computer vision and image processing.
        These models are based on maximizing the likelihood between real data and samples produced by a generator. The two main approaches to this problem are Generative Adversarial Networks (GAN), which take into account an explicit description of the real data, and Variational Autoencoders (VAE), which use latent variables to describe it.
        In this contribution we present both approaches being applied to the calorimeter simulation for the LHCb experiment at LHC, discuss advantages and possible problems of these approaches, and compare the results.

        Speaker: Viktoriia Chekalina (Yandex School of Data Analysis (RU))
      • 16
        A Machine Learning tool for fast simulation

        Machine Learning techniques have been used in different applications by the HEP community: in this talk, we discuss the case of detector simulation. The number of simulated events expected in the future for LHC experiments and their High Luminosity upgrades is increasing dramatically and requires new fast simulation solutions. We will describe an R&D activity, aimed at providing a configurable tool capable of training a neural network to reproduce the detector response and replace standard Monte Carlo simulation. This represents a generic approach in the sense that such a network could be designed and trained to simulate any kind of detector response. Eventually, it could be extended to replace the whole data processing chain in order to get, directly in one step, the final reconstructed quantities, in just a small fraction of the time. We will present the first application of three-dimensional convolutional Generative Adversarial Networks to the simulation of high granularity electromagnetic calorimeters. We will describe detailed validation studies comparing our results to Geant4 Monte Carlo simulation, showing, in particular, the very good agreement we obtain for high level physics quantities (such as energy shower shapes) and detailed calorimeter response (single cell response). Finally we will show how this tool can easily be generalized to describe a larger class of calorimeters, opening the way to a generic machine learning based fast simulation approach.

        Speaker: Sofia Vallecorsa (Gangneung-Wonju National University (KR))
      • 17
        A scalable and asynchronous detector simulation system based on ALFA

        In the context of the common online-offline computing infrastructure for Run3 (ALICE-O2), ALICE is reorganizing its detector simulation software to be based on FairRoot, offering a common toolkit to implement simulation based on the Virtual-Monte-Carlo (VMC) scheme. Recently, FairRoot has been augmented by ALFA, a software framework developed in collaboration between ALICE and FAIR, offering portable building blocks to construct message-based and loosely-coupled multiprocessing systems.

        We will report here on the implementation of a scalable and asynchronous detector simulation system which is based on ALFA. The system offers parallelization at the primary-track level, going beyond the usual inter-event parallelism of Geant4-MT, and the possibility to asynchronously and simultaneously process simulation data for the purpose of digitization and clusterization. Core advantages of our implementation are an ideal reduction of the processing time per event as well as a reduction of the memory footprint, allowing us to make significantly better use of opportunistic resources, such as HPC backfills, than before. Moreover, the track-level parallelism opens up the interesting possibility to use different simulation engines (such as Geant4 and Fluka) concurrently, based on simple selection filters on the primary particles. The integration of fast MC processes, such as machine learning kernels running on a dedicated GPU, is a natural extension to the system.
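
        The idea of routing primaries to different engines through selection filters, and of splitting one event into sub-event work items, can be sketched as follows. This is not the ALFA or O2 API: the class, the function names, the neutron-to-Fluka rule and the chunk size are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Primary:
    pdg: int           # PDG particle code
    energy_gev: float


Filter = Callable[[Primary], bool]


def route_primaries(primaries: Sequence[Primary],
                    rules: List[Tuple[str, Filter]],
                    default: str) -> Dict[str, List[Primary]]:
    """Assign each primary to the first engine whose filter accepts it."""
    routed: Dict[str, List[Primary]] = {}
    for p in primaries:
        engine = next((name for name, accept in rules if accept(p)), default)
        routed.setdefault(engine, []).append(p)
    return routed


def chunk(items: Sequence[Primary], size: int) -> List[List[Primary]]:
    """Split the primaries of one event into independent work items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


# Hypothetical rule: send (anti)neutrons to "fluka", everything else to "geant4".
rules = [("fluka", lambda p: abs(p.pdg) == 2112)]
event = [Primary(2112, 1.2), Primary(211, 3.4), Primary(22, 0.5),
         Primary(-2112, 0.8), Primary(11, 2.0)]
routed = route_primaries(event, rules, default="geant4")
work_items = {engine: chunk(ps, 2) for engine, ps in routed.items()}
for engine, chunks in work_items.items():
    print(engine, [len(c) for c in chunks])
```

        Each chunk could then be handed to a separate worker process, which is what makes the parallelism finer-grained than one event per worker.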

        Speaker: Sandro Christian Wenzel (CERN)
    • T3 - Distributed computing: Experiment Frameworks and HPC Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: David Cameron (University of Oslo (NO))
      • 18
        Experience running IceCube simulation workloads on the Titan supercomputer

        The IceCube Neutrino Observatory is a neutrino detector located at the South Pole. Here we present experiences acquired when using HTCondor to run IceCube’s GPU simulation worksets on the Titan supercomputer. Titan is a large supercomputer geared for High Performance Computing (HPC). Several factors make it challenging to use Titan for IceCube’s High Throughput Computing (HTC) workloads: (1) Titan is designed for MPI applications, (2) Titan scheduling policies heavily favor very large resource reservations, (3) Titan compute nodes run a customized version of Linux, (4) Titan compute nodes cannot access outside networks. In contrast, IceCube’s simulation workloads consist of large numbers of relatively small independent jobs intended to run in standard Linux environments, and may require connectivity to public networks. Here we present how we leveraged the HTCondor batch scheduler within Singularity containers to provide an HTC-friendly interface to Titan suitable for IceCube’s GPU workloads.

        Speaker: David Schultz (University of Wisconsin-Madison)
      • 19
        Production experience and performance for ATLAS data processing on a Cray XC-50 at CSCS

        Predictions of the LHC computing requirements for Run 3 and for Run 4 (HL-LHC) over the course of the next 10 years show a considerable gap between required and available resources, assuming budgets will globally remain flat at best. This will require some radical changes to the computing models for the data processing of the LHC experiments. The use of large scale computational resources at HPC centres worldwide is expected to increase substantially the cost-efficiency of the processing. In order to pave the path towards the HL-LHC data processing, the Swiss Institute of Particle Physics (CHIPP) has taken the strategic decision to migrate the processing of all the Tier-2 workloads for ATLAS and other LHC experiments from a dedicated x86_64 cluster that has been in continuous operation and evolution since 2007, to Piz Daint, the current European flagship HPC, which ranks third in the TOP500 at the time of writing. We report on the technical challenges and solutions adopted to migrate to Piz Daint, and on the experience and measured performance for ATLAS in over one year of running in production.

        Speaker: Gianfranco Sciacca
      • 20
        Enabling production HEP workflows on Supercomputers at NERSC

        Many HEP experiments are moving beyond experimental studies to making large-scale production use of HPC resources at NERSC, including the Knights Landing architecture on the Cori supercomputer. These include ATLAS, ALICE, Belle2, CMS, LSST-DESC, and STAR among others. Achieving this has involved several different approaches and has required innovations both on NERSC and the experiments’ sides. We detail the approaches taken, comparing and contrasting the benefits and challenges. We also describe the innovations and improvements needed particularly in the areas of data transfer (via DTNs), containerization (via Shifter), I/O (via burst buffer, Lustre, or Shifter per-node-cache), scheduling (via developments in SLURM), workflow (via grid services or on-site engines), databases, external networking from compute nodes (via a new approach to networking on Cray systems), and software delivery (via a new approach to CVMFS on Cray systems).
        We also outline plans, and initial development, for future support of experimental science workloads at NERSC, via a ‘Superfacility API’ that will provide a more common, plug-and-play base for such workflows, building on best practises to provide a lower bar of entry to HPC for new experiments as well as consistency and performance.

        Speaker: Wahid Bhimji (Lawrence Berkeley National Lab. (US))
      • 21
        BigPanDA Workflow Management on Titan for HENP and extreme scale applications

        The Titan supercomputer at Oak Ridge National Laboratory prioritizes the scheduling of large leadership class jobs, but even when the supercomputer is fully loaded and large jobs are standing in the queue to run, 10 percent of the machine remains available for a mix of smaller jobs, essentially ‘filling in the cracks’ between the very large jobs. Such utilisation of the computer resources is called “backfill”.
        Smaller scientific groups and data science experiments are able to conduct their computations on Titan in backfill mode. For example, simulations in high-energy physics don’t require large-scale MPI jobs on a supercomputer. PanDA, acting as a universal job scheduler to optimize Titan utilization, was successfully used earlier to schedule and submit ATLAS MC simulation jobs. The current R&D project aims to enable the convergence of HPC and HTC paradigms for a range of application types and communities in particle physics and beyond.
        In March 2017, we implemented a new PanDA server instance within ORNL, operating under Red Hat OpenShift Origin (a powerful container cluster management and orchestration system), in order to serve various experiments on the Titan supercomputer. We implemented a set of demonstrations serving diverse scientific workflows including LQCD and IceCube, biology studies of the genes and human brain, and molecular dynamics studies.
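
        The "backfill" idea described in this abstract can be illustrated with a small greedy sketch: given a number of idle nodes and a time window before the next large job starts, pick small jobs that fit. This is not Titan's scheduling policy nor PanDA code. It assumes all selected jobs start at the beginning of the window, and the job names and sizes are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int
    hours: float


def backfill(jobs: List[Job], free_nodes: int,
             window_hours: float) -> Tuple[List[Job], int]:
    """Greedily pick jobs that fit in the idle nodes and finish within the window."""
    selected: List[Job] = []
    remaining = free_nodes
    # Try the largest jobs first so that big gaps are not wasted on tiny jobs.
    for job in sorted(jobs, key=lambda j: j.nodes, reverse=True):
        if job.hours <= window_hours and job.nodes <= remaining:
            selected.append(job)
            remaining -= job.nodes
    return selected, remaining


# Hypothetical gap: 100 idle nodes for 2 hours.
queue = [Job("a", 60, 1.5), Job("b", 50, 1.0), Job("c", 30, 2.0),
         Job("d", 20, 3.0), Job("e", 10, 0.5)]
selected, remaining = backfill(queue, free_nodes=100, window_hours=2.0)
print([j.name for j in selected], "idle nodes left:", remaining)
```

        A greedy choice is not optimal in general (this is a knapsack-like problem), but it shows why many small, independent jobs are well suited to filling such gaps.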

        Speaker: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 22
        PanDA and RADICAL-Pilot Integration: Enabling the Pilot Paradigm on HPC Resources

        PanDA executes millions of ATLAS jobs a month on Grid systems with more than
        300k cores. Currently, PanDA is compatible with only a few HPC resources due to
        different edge services and operational policies, does not implement the pilot
        paradigm on HPC, and does not dynamically optimize resource allocation among
        queues. We integrated the PanDA Harvester service and the RADICAL-Pilot (RP)
        system to overcome these disadvantages and enable the execution of ATLAS,
        Molecular Dynamics and other workflows on HPC resources.

        Harvester is a commonality layer which brings coherence to diverse HPC
        systems, providing integration with PanDA workflows at job and event level. RP
        is a pilot system capable of executing short/long-running single/many-core
        tasks on diverse HPC machines, supporting CPUs, GPUs, and multiple MPI
        implementations.

        We integrated Harvester and RP on Titan at ORNL, prototyping a Next Generation
        Executor (NGE) to expose RP capabilities and manage the execution of PanDA
        workflows. RP acquires Titan resources via queues and backfill capabilities
        and publishes the available resources to NGE. Harvester requests available
        resources and submits tasks for execution to NGE. NGE uses RP to execute those
        tasks, managing input and output staging, and holding the states of resources
        and tasks in a dedicated database.

        Speaker: Pavlo Svirin
      • 23
        #585 slot
    • T4 - Data handling: S1 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Conveners: Tigran Mkrtchyan (A.Alikhanyan National Science Laboratory (AM)) , Tigran Mkrtchyan (DESY)
      • 24
        JADE Long Term Archive

        IceCube is a cubic kilometer neutrino detector located at the South Pole. Every year, 29 TB of data are transmitted via satellite, and 365 TB of data are shipped on archival media, to the data warehouse in Madison, WI, USA. The JADE Long Term Archive (JADE-LTA) software indexes and bundles IceCube files and transfers the archive bundles for long term storage and preservation into tape silos at collaborating institutions DESY and NERSC. The software and its features are discussed, including integration with Globus Online, and later migration to FTS, to manage data transfers.

        Speaker: PATRICK MEADE (University of Wisconsin-Madison)
      • 25
        CERN Tape Archive – from development to production deployment

        The first production version of the CERN Tape Archive (CTA) software is planned for release at the end of 2018. CTA is designed to replace CASTOR as the CERN tape archive solution, in order to face the scalability and performance challenges arriving with LHC Run-3.

        This contribution will describe the main commonalities and differences of CTA with CASTOR. We outline the functional enhancements and integration steps required to add the CTA tape back-end to an EOS disk storage system. We present and discuss the different deployment and migration scenarios for replacing the five CASTOR instances at CERN, including a description of how FTS will interface with EOS and CTA.

        Speaker: Michael Davis (CERN)
      • 26
        dCache: storage for advanced scientific use-cases and beyond

        The dCache project provides open-source storage software deployed internationally to satisfy ever more demanding scientific storage requirements. Its multifaceted approach provides an integrated way of supporting different use-cases with the same storage, from high throughput data ingest, through wide access and easy integration with existing systems.

        In supporting new communities, such as photon science and microbiology, dCache is evolving to provide new features and access to new technologies.

        In this paper, we describe some of these recent features that facilitate the use of storage to maximise the gain from stored data, including quality-of-service management, support for distributed and federated systems, and improvements with support for parallel NFS (pNFS).

        Speaker: Tigran Mkrtchyan (DESY)
      • 27
        The GridKa Tape System: status and outlook

        Tape storage is still a cost-effective way to keep large amounts of data over a long period of time. It is expected that this will continue in the future. The GridKa tape environment is a complex system of many hardware components and software layers. Configuring this system for optimal performance for all use cases is a non-trivial task and requires a lot of experience. We present the current status of the GridKa tape environment, report on recent upgrades and improvements, and describe plans to further develop and enhance the system, especially with regard to the future requirements that the HEP experiments place on their large data centers. The short-term planning mainly includes the transition from TSM to HPSS as the backend and the effects on the connection of dCache and xrootd. Recent changes in the vendor situation of certain tape technologies require a precise analysis of the impact and eventual adaptation of the mid-term planning, in particular with respect to the scalability challenge that comes with HL-LHC on the horizon.

        Speaker: Dr Doris Ressmann (KIT)
      • 28
        The archive solution for distributed workflow management agents of the CMS experiment at LHC

        The CMS experiment at the CERN LHC developed the Workflow Management Archive system to persistently store unstructured framework job report documents produced by distributed workflow management agents. In this talk we present its architecture, implementation, deployment, and integration with the CMS and CERN computing infrastructures, such as central HDFS and Hadoop Spark cluster. The system leverages modern technologies such as a document oriented database and the Hadoop eco-system to provide the necessary flexibility in order to reliably process, store, and aggregate ~1M documents on a daily basis. We will discuss the data transformation, the short and long term storage layers, the query language, along with the aggregation pipeline developed to visualize various performance metrics to assist CMS data operators in assessing the performance of the CMS computing system.

        Speaker: Valentin Y Kuznetsov (Cornell University (US))
      • 29
        WLCG space accounting in the SRM-less world

        The WLCG computing infrastructure provides distributed storage capacity hosted at the geographically dispersed computing sites.
        In order to effectively organize storage and processing of the LHC data, the LHC experiments require a reliable and complete overview of the storage capacity in terms of the occupied and free space, the storage shares allocated to different computing activities, and the possibility to detect “dark” data that occupies space while being unknown to the experiment’s file catalog. The task of the WLCG space accounting activity is to provide such an overview and to assist LHC experiments and WLCG operations to manage storage space and to understand future requirements.
        Several space accounting solutions which have been developed by the LHC experiments are currently based on the Storage Resource Manager (SRM). In the coming years SRM will become an optional service for sites which do not provide tape storage. Moreover, some of the storage implementations already do not provide an SRM interface. Therefore, the next generation of space accounting systems should not be based on SRM. To make it possible to expose storage topology and space accounting information, the Storage Resource Reporting proposal has been agreed between the LHC experiments, sites and storage providers. This contribution describes the WLCG storage resource accounting system which is being developed based on the Storage Resource Reporting proposal.
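
        The notion of "dark" data used in this abstract (files that occupy space on storage but are unknown to the experiment's file catalog) amounts to a set difference. The sketch below assumes a storage dump is available as a mapping from path to size in bytes and the catalog as a collection of paths. Both formats, the function name and the paths are hypothetical and not part of the Storage Resource Reporting proposal.

```python
from typing import Dict, Iterable


def find_dark_data(storage_dump: Dict[str, int],
                   catalog: Iterable[str]) -> Dict[str, int]:
    """Return files present on storage but absent from the catalog, with sizes."""
    known = set(catalog)
    return {path: size for path, size in storage_dump.items() if path not in known}


# Hypothetical storage dump (path -> bytes) and catalog content.
storage_dump = {
    "/data/run1/a.root": 4_000_000_000,
    "/data/run1/b.root": 2_500_000_000,
    "/data/tmp/orphan.root": 1_000_000_000,
}
catalog = ["/data/run1/a.root", "/data/run1/b.root"]

dark = find_dark_data(storage_dump, catalog)
dark_bytes = sum(dark.values())
print(sorted(dark), f"{dark_bytes / 1e9:.1f} GB of dark data")
```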

        Speaker: Julia Andreeva (CERN)
    • T5 - Software development: S1 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Patricia Mendez Lorenzo (CERN)
      • 30
        ALFA: ALICE-FAIR new message queuing based framework

        ALFA is a message queuing based framework for online and offline data processing. It is a flexible framework that supports an actor based computational model and allows an experiment-defined data model to be implemented on top. The design of ALFA is modular with separate layers for data transport, process management and process deployment. Although still under development, ALFA is already being tested and used by different experiments in different stages of data processing as it offers an easy integration of heterogeneous hardware and software. Examples of ALFA usage in different stages of event processing will be presented: in detector read-out, in online reconstruction, and in the purely offline world of detector simulation.

        Speaker: Mohammad Al-Turany (CERN)
      • 31
        The ALICE Analysis Framework for LHC Run 3

        The ALICE experiment at the LHC (CERN) is currently developing a new software framework designed for Run 3: detector and software will have to cope with Pb-Pb collision rates 100 times higher than today, leading to the combination of core Online-Offline operations into a single framework called O².
        The analysis code is expected to run on a few large Analysis Facilities counting 20k cores and sustaining a 100 GB/s throughput: this requires a joint effort between the definition of the data format, the configuration of the Analysis Facilities and the development of the Analysis Framework.
        We present the prototype of a new Analysis Object Data format based on timeframes and optimized for continuous readout. This format is designed to be extensible and to be transported efficiently over the network. We also present the first iteration of the Analysis Framework, based on the O² Data Processing Layer and leveraging message passing across a topology of processes.
        We will also illustrate the implementation and benchmarking of a compatibility layer designed to ease the transition from the current event-oriented analysis model to the new time-oriented one.
        Finally, we will give a status report on the integration of the Analysis Framework and Analysis Facilities for Run 3 into our current organized analysis model.

        Speaker: Dr Dario Berzano (CERN)
      • 32
        From Catalogs to Cosmology: An Automated Weak Gravitational Lensing Analysis Pipeline

        Weak gravitational lensing is an extremely powerful probe for gaining insight into the nature of two of the greatest mysteries of the universe -- dark energy and dark matter. To help prepare for the massive amounts of data coming from next generation surveys like LSST that hope to advance our understanding of these mysteries, we have developed an automated and seamless weak lensing cosmic shear analysis pipeline that begins with survey catalogs and performs all the necessary analysis steps in order to derive the cosmological parameters at the end. This pipeline is designed to be generic enough to handle data from any galaxy survey and has been tested by re-analyzing existing datasets from four different surveys -- CFHTLens, DES-SV, KiDS-450, and DLS -- and comparing their results. It is flexible enough to allow running each pipeline step on distributed resources at different sites in order to optimally match the computational capabilities of each site to the unique requirements of each task. This talk will begin with a brief overview of the analysis performed by this pipeline and then delve into a discussion of the architecture and performance of an actual implementation using modern work-flow engines and description languages.

        Speaker: Eric Vaandering (Fermi National Accelerator Lab. (US))
      • 33
        Software development for the NICA experiments: MpdRoot & BmnRoot

        The software for detector simulation, reconstruction and analysis of physics data is an essential part of each high-energy physics experiment. A new generation of experiments in relativistic nuclear physics is expected to start up in the coming years at the Nuclotron-based Ion Collider facility (NICA), under construction at the Joint Institute for Nuclear Research in Dubna: the fixed target experiment BM@N (Baryonic Matter at Nuclotron), whose technical runs started in 2015, and the future MPD (Multi-Purpose Detector) experiment on ion collisions, which will operate at the storage rings of the NICA facility. The event data model of the experiments is shown. The status of the software frameworks MpdRoot and BmnRoot, developed for MPD and BM@N, is reviewed. For these tasks many additional systems, such as a raw data converter, monitoring systems, an event display, databases, parallelization tools and others have been developed.

        Speaker: Konstantin Gertsenberger (Joint Institute for Nuclear Research (RU))
      • 34
        ATLAS Technical Coordination Expert System

        Technical details of the directly manipulated systems and the impact on non-obviously connected systems are required knowledge when preparing an intervention in a complex experiment like ATLAS. In order to improve the understanding of the parties involved in an intervention a rule-based expert system has been developed. On the one hand this helps to recognize dependencies that are not always evident and on the other hand it facilitates communication between experts with different backgrounds by translating domain specific vocabularies. To simulate an event this tool combines information from diverse areas such as detector control (DCS) and safety systems (DSS), gas, cooling, ventilation, and electricity distribution. The inference engine provides a fast response of the impacted systems that are connected at a low level although they belong to different domains. It also predicts the probability of failure for each of the components affected by an intervention. Risk assessment models considered are fault tree analysis and principal component analysis. The user interface is a web-based application that uses graphics and text to provide different views of the detector system adapted to the different user needs and to interpret the data.
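
        Fault tree analysis, named in this abstract as one of the risk assessment models, combines basic-event probabilities through AND and OR gates. The sketch below shows the standard gate formulas under the assumption that the basic events are independent. The tree structure and the probabilities are invented for illustration and do not describe any ATLAS system.

```python
from math import prod
from typing import Iterable, List


def _checked(probs: Iterable[float]) -> List[float]:
    values = list(probs)
    if any(p < 0.0 or p > 1.0 for p in values):
        raise ValueError("probabilities must lie in [0, 1]")
    return values


def and_gate(probs: Iterable[float]) -> float:
    """Output fails only if all inputs fail (independent events)."""
    return prod(_checked(probs))


def or_gate(probs: Iterable[float]) -> float:
    """Output fails if at least one input fails (independent events)."""
    return 1.0 - prod(1.0 - p for p in _checked(probs))


# Hypothetical tree: cooling fails only if both redundant pumps fail;
# the top event occurs if cooling fails OR the power supply fails.
cooling = and_gate([0.1, 0.1])
top = or_gate([cooling, 0.02])
print(f"cooling failure: {cooling:.4f}, top event: {top:.4f}")
```

        Redundancy enters through the AND gate, which is why the cooling branch ends up less likely to fail than either pump alone.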

        Speaker: Ignacio Asensi Tortajada (Univ. of Valencia and CSIC (ES))
      • 35
        The AMI (ATLAS Metadata Interface) 2.0 metadata ecosystem: new design principles and features

        AMI (ATLAS Metadata Interface) is a generic ecosystem for metadata aggregation, transformation and cataloguing. Benefitting from more than 15 years of feedback in the LHC context, the second major version was recently released. We describe the design choices and their benefits for providing high-level metadata-dedicated features. In particular, we focus on the implementation of the Metadata Querying Language (MQL), a domain-specific language that allows databases to be queried without knowing the relations between entities. Finally, we emphasise the assets of the ecosystem for deploying an AMI service and extending its functionalities.

        Speaker: Jerome Odier (IN2P3/CNRS (FR))
    • T6 - Machine learning and physics analysis: S1 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Conveners: Andrea Rizzi (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P) , Mr Sergei Gleyser
      • 36
        Partnering with industry for machine learning at HL-LHC

        The High Luminosity LHC (HL-LHC) represents an unprecedented computing challenge. Current estimates from the LHC experiments put the processing and storage required for the program to succeed at roughly 50 times what is currently deployed. Although some of the increased capacity will be provided by technology improvements over time, the computing budget is expected to be flat, and closing the gap will require huge gains in the efficiency of processing and analyzing events. An area that has the potential for a significant breakthrough is machine learning. In recent years industry has invested heavily in both hardware and software to develop machine learning techniques to filter, process, analyze, and derive correlations from very large scale heterogeneous datasets. Through CERN openlab, with industry partners, and the R&D projects of the LHC experiments, we are attempting to build on the industry investments to utilize these techniques for science.

        In this presentation we will discuss the activities of the CERN openlab industry partnerships in machine learning and how they can be extended to science applications. Industry has shown the ability to monitor the health of complex systems and predict failures and maintenance needs. We will show how these techniques could be applied to detector health. We will discuss how industry developments in anomaly detection might be applicable to monitoring data quality and identifying new signals. Industry has demonstrated the feasibility of automated resource scheduling and optimization. We will show how these techniques could be used for data placement and workflow execution. Industry has advanced systems for high speed and high accuracy image recognition. We will discuss explorations of how these techniques could be applied to physics object identification. In recent years there have been advancements in the use of adversarial networks to improve the accuracy and robustness of training. These techniques may be applicable generally to some physics machine learning applications, but are potentially particularly interesting for tuning fast event simulation.

        We will present examples of industry activity and how all of these are being explored in LHC applications. At the end we will look at data processing techniques and speed requirements for physics processing and how those compare to similar real time industry processing applications.

        Speaker: Maria Girone (CERN)
      • 37
        New Machine Learning Developments in ROOT/TMVA

        In this talk, we will describe the latest additions to the Toolkit for Multivariate Analysis (TMVA), the machine learning package integrated into the ROOT framework. In particular, we will focus on the new deep learning module that contains robust fully-connected, convolutional and recurrent deep neural networks implemented on CPU and GPU architectures. We will present the performance of these new libraries on benchmark datasets from high-energy physics. Additionally, we will discuss new developments in parallelization, cross-validation, regression and unsupervised learning, and new interfaces to external machine learning frameworks such as TensorFlow and scikit-learn.

        Speaker: Kim Albertsson (Lulea University of Technology (SE))
      • 38
        The Scikit-HEP project

        The Scikit-HEP project is a community-driven and community-oriented effort with the aim of providing Particle Physics at large with a Python scientific toolset containing core and common tools. The project builds on five pillars that embrace the major topics involved in a physicist’s analysis work: datasets, data aggregations, modelling, simulation and visualisation. The vision is to build a user and developer community engaging collaboration across experiments, to emulate scikit-learn's unified interface with Astropy's embrace of third-party packages, and to improve discoverability of relevant tools. The Scikit-HEP toolset comprises the scikit-hep package and numerous third-party “affiliated” packages. These affiliated packages include well-known packages such as root_numpy and root_pandas, but also newer packages that add functionality such as reading ROOT files natively, and interfaces between NumPy and Pythia or FastJet. The project's website is http://scikit-hep.org/ and the GitHub repositories are at https://github.com/scikit-hep. An overview of the project, its toolset and its main functionality will be presented.

        Speaker: Eduardo Rodrigues (University of Cincinnati (US))
      • 39
        HIPSTER - a Python package for particle physics analyses

        HIPSTER (Heavily Ionising Particle Standard Toolkit for Event Recognition) is an open source Python package designed to facilitate the use of TensorFlow in a high energy physics analysis context. The core functionality of the software is presented, with images from the MoEDAL experiment Nuclear Track Detectors (NTDs) serving as an example dataset. Convolutional neural networks are selected as the classification algorithm for this dataset, and the process of training a variety of models with different hyperparameters is detailed. Next, results for the MoEDAL problem are shown, demonstrating the rich information output by HIPSTER, which enables users to probe the performance of their model in detail.

        Speaker: Thomas Paul Charman (University of London (GB))
      • 40
        Interactive, scalable, reproducible data analysis with containers, Jupyter, and Parsl

        In the traditional HEP analysis paradigm, code, documentation, and results are separate entities that require significant effort to keep synchronized, which hinders reproducibility. Jupyter notebooks allow these elements to be combined into a single, repeatable narrative. HEP analyses, however, commonly rely on complex software stacks and the use of distributed computing resources, requirements that have been barriers to notebook adoption. In this presentation we describe how Jupyter can be combined with Parsl (Parallel Scripting Library) and containers to enable intuitive and interactive high performance computing in Python.

        Parsl is a pure Python library for orchestrating the concurrent execution of multiple tasks. Parsl is remarkable for its simplicity. Its primary construct is an “app” decorator, which the programmer uses to indicate that certain functions (either pure Python or wrappers around shell programs) are to be treated as “apps.” Each app function call then creates a new “task” that runs concurrently with the main program and other tasks, subject to dataflow constraints defined by the availability of app function input data. Data dependencies can be in-memory objects or external files. App decorators can further specify which computation resources to use and the required software environment to run the decorated function. Parsl abstracts hardware details, allowing a single script to be executed efficiently on one or more laptops, clusters, clouds, and/or supercomputers. To manage complex execution environments on various resources and also to improve reproducibility, Parsl can use containers (lightweight, virtualized constructs for packaging software with its environment) to wrap tasks.
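
        The dataflow idea behind an “app” decorator can be illustrated with the Python standard library alone. The sketch below is not Parsl and does not use its API: the `app` decorator, the thread pool and the example functions are hypothetical stand-ins. Calling a decorated function returns a future immediately, and any future passed as an argument is resolved before the function body runs.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

# Shared pool for all tasks; a real system would pick resources per app.
_executor = ThreadPoolExecutor(max_workers=4)


def app(func):
    """Turn func into a task factory: calling it returns a Future."""

    @wraps(func)
    def submit(*args, **kwargs):
        def run():
            # Dataflow constraint: wait for every Future input before running.
            # Waiting inside a worker thread is a simplification; it is safe
            # here because inputs are always submitted before their consumers.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            resolved_kw = {
                key: (value.result() if isinstance(value, Future) else value)
                for key, value in kwargs.items()
            }
            return func(*resolved, **resolved_kw)

        return _executor.submit(run)

    return submit


@app
def square(x):
    return x * x


@app
def add(a, b):
    return a + b


# The two square() tasks can run concurrently; add() starts once both finish.
total = add(square(3), square(4))

if __name__ == "__main__":
    print(total.result())  # 25
```

        The main program never blocks until it asks for `total.result()`, which is the property that lets a notebook cell launch a large graph of tasks and remain interactive.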

        In this presentation we 1) show how a real-world complete HEP analysis workflow can be developed with Parsl and 2) demonstrate efficient and reproducible execution of such workflows on heterogeneous resources, including leadership-class computing facilities, using containers to wrap analysis code, Parsl to orchestrate the execution of these containers, and Jupyter as the interface for writing and executing the Parsl script.

        Speaker: Ms Anna Elizabeth Woodard (Computation Institute, University of Chicago)
      • 41
        #182 slot
    • T7 - Clouds, virtualization and containers: S1 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Convener: Andrew McNab (University of Manchester)
      • 42
        Lightweight WLCG Sites

        The WLCG unites resources from over 169 sites spread across the world, and the number is expected to grow in the coming years. However, setting up and configuring new sites to support WLCG workloads is still not a straightforward task and often requires significant assistance from WLCG experts. A survey presented at CHEP 2016 revealed a strong wish among site admins for a reduction of overheads through the use of prefab Docker containers or OpenStack VM images, along with the adoption of popular tools like Puppet for configuration. In 2017, the Lightweight Sites project was initiated to construct shared community repositories providing such building blocks. In this contribution, we present the first sets of components made available and discuss the technical design choices and the subsequent roadmap. We describe a modular and extensible core system that abstracts low-level details through a YAML-based site-wide configuration file, which is used to configure all distributed components through a single command. To accommodate the diverse scenarios at different sites, the system will enable site admins to cherry-pick their background technologies and methodologies for orchestration (Puppet, Ansible, ...), clustering (Docker Swarm, Kubernetes, ...) and networking (dedicated networks, custom overlay networks or a combination of both). As an alternative to setting up classic grid sites using the aforementioned technologies, we also discuss a different approach altogether to provide lightweight computing resources, featuring HTCondor and BOINC.
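
        To make the idea of a single site-wide configuration concrete, the sketch below expands one site-level description into per-component settings. It is an assumption-laden illustration, not the Lightweight Sites schema: a Python dictionary stands in for the parsed YAML file, and every key, host name and the `expand_site_config` helper are hypothetical.

```python
# Stand-in for the parsed YAML site file; all keys and values are hypothetical.
SITE_CONFIG = {
    "site": {"name": "EXAMPLE-SITE", "domain": "example.org"},
    "orchestration": "puppet",
    "components": [
        {"type": "compute_element", "host": "ce01"},
        {"type": "worker_node", "host": "wn01", "cores": 16},
        {"type": "worker_node", "host": "wn02", "cores": 32},
    ],
}


def expand_site_config(config):
    """Derive one self-contained settings dict per component."""
    site = config["site"]
    expanded = []
    for component in config["components"]:
        entry = dict(component)  # copy, so the site-level input is not modified
        entry["fqdn"] = f"{component['host']}.{site['domain']}"
        entry["site_name"] = site["name"]
        entry["orchestration"] = config["orchestration"]
        expanded.append(entry)
    return expanded


if __name__ == "__main__":
    for entry in expand_site_config(SITE_CONFIG):
        print(entry["type"], entry["fqdn"], entry["orchestration"])
```

        Keeping site-wide values in one place and deriving the per-component view means that a change such as a new domain name is made once and reaches every component on the next run.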

        Speaker: Mayank Sharma (CERN)
      • 43
        Evaluating Kubernetes as an orchestrator of the High Level Trigger computing farm of the Trigger and Data Acquisition system of the ATLAS experiment at the Large Hadron Collider

        The ATLAS experiment at the LHC relies on a complex and distributed Trigger and Data Acquisition (TDAQ) system to gather and select particle collision data. The High Level Trigger (HLT) component of the TDAQ system is responsible for executing advanced selection algorithms, reducing the data rate to a level suitable for recording to permanent storage. The HLT functionality is provided by a computing farm made up of thousands of commodity servers, each executing one or more processes.
        Moving the HLT farm management towards a containerized solution is one of the main themes of the ATLAS TDAQ Phase-II upgrades in the area of the online software; it would open new possibilities for fault tolerance, reliability and scalability.
        This paper presents the results of an evaluation of Kubernetes as a possible orchestrator of the ATLAS TDAQ HLT computing farm. Kubernetes is a system for advanced management of containerized applications in large clusters.
        We will first highlight some of the technical solutions adopted to run the offline version of today’s HLT software in a Docker container. Then we will focus on some scaling performance measurements executed with a cluster of 1000 CPU cores. In particular, we will:
        - Show the way Kubernetes scales in deploying containers as a function of the cluster size;
        - Show how a proper tuning of the Queries Per Second (QPS) Kubernetes parameter set can improve the scaling of applications.
        Finally, we will conclude with an assessment of the possibility of using Kubernetes as an orchestrator of the HLT computing farm in LHC’s Run IV.

        Speaker: Giuseppe Avolio (CERN)
      • 44
        Lightweight on-demand computing with Elasticluster and NorduGrid ARC

        The cloud computing paradigm allows scientists to elastically grow or shrink computing resources as requirements demand, so that resources only need to be paid for when necessary. The challenge of integrating cloud computing into distributed computing frameworks used by HEP experiments has led to many different solutions in recent years; however, none of these solutions offers a complete, fully integrated cloud resource out of the box. This paper describes how to offer such a resource using stripped-down minimal versions of existing distributed computing software components combined with off-the-shelf cloud tools. The basis of the cloud resource is Elasticluster, and the glue to join to the HEP computing infrastructure is provided by the NorduGrid ARC middleware and the ARC Control Tower. These latter two components are stripped down to bare minimum edge services, removing the need for administering complex grid middleware, yet still provide the complete job and data management required to fully exploit the cloud resource. The ARC architecture delegates all the grid-related interactions (e.g. data staging, communication with the workload management system) to the edge service so that no grid middleware or sensitive credentials are required on the rest of the cloud. This model can be applied not only to cloud computing, but also to restrictive environments such as High Performance Computing centres. We show how to configure Elasticluster to automatically create the cloud resource and run these services at the click of a button, and present results of the setup running real payloads for the ATLAS experiment.

        Speaker: Maiken Pedersen (University of Oslo (NO))
      • 45
        CVMFS: Stratum0 in Kubernetes

        IceCube is a cubic kilometer neutrino detector located at the South Pole. CVMFS is a key component of IceCube’s Distributed High Throughput Computing analytics workflow for sharing 500 GB of software across datacenters worldwide. Building the IceCube software suite across multiple platforms and deploying it into CVMFS has until recently been a manual, time-consuming task that doesn’t fit well within an agile continuous delivery framework.

        Within the last 2 years a plethora of tooling around microservices has created an opportunity to upgrade the IceCube software build and deploy pipeline. We present a framework using Kubernetes to deploy Buildbot. The Buildbot pipeline is a set of pods (Docker containers) in the Kubernetes cluster that builds the IceCube software across multiple platforms, tests the new software for critical errors, syncs the software to a containerized CVMFS server, and finally executes a publish. The time from code commit to CVMFS publish has been greatly reduced, which has made it possible to publish nightly builds to CVMFS.

        Speaker: David Schultz (University of Wisconsin-Madison)
      • 46
        Exploiting private and commercial clouds to generate on-demand CMS computing facilities with DODAS

        Reducing time and cost through more efficient setup and operation is key when exploiting private or commercial clouds. In turn, this means that reducing the learning curve, as well as the operational cost of managing community-specific services running on distributed environments, has become key to success and sustainability, even more so for communities seeking to exploit opportunistic cloud resources. DODAS (Dynamic On Demand Analysis Service) is a tool based on INDIGO-DataCloud services which allows on-demand generation of a container-based HTCondor batch system over cloud infrastructures, implementing a “Batch System as a Service” that can be easily customized to cover multiple scenarios. DODAS has been designed to completely automate the process of provisioning, creating, managing and accessing a pool of heterogeneous computing and storage resources over cloud infrastructures. As such, DODAS provides a complete abstraction of the underlying IaaS, which is only required to provide VMs or bare metal resources. It uses a combination of TOSCA templates and Ansible roles to orchestrate and configure cloud resources, and it relies on Apache Mesos to abstract CPU, RAM and storage. Marathon is adopted as the container orchestration platform on top of Mesos. Marathon takes care of setting up both HTCondor and any additional services (e.g. Squid proxy, proxy certificate cache) that experiments might require. Credential management is entirely based on JSON Web Tokens and relies on the Identity and Access Management (IAM) and Token Translation Services (TTS) provided by INDIGO. The DODAS architecture is highly flexible and customizable; as such, it has evolved to also provide the configuration and automatic setup of the services and dependencies required by a regular CMS computing facility, a Tier site. This means that DODAS allows CMS to utilize “any cloud provider” to generate sites on demand, with almost zero effort.
        We describe the overall architecture of DODAS, including the integration strategy adopted to connect it with the CMS HTCondor Global Pool. We also present performance, scalability and efficiency results of DODAS-generated Tiers used to process CMS jobs, showing how DODAS has been effectively used in different CMS sites in order to consolidate disparate computing resources being used by CMS scientists. We will present the different ways DODAS can be used to integrate different usage patterns, especially how it can fit the final stages of the CMS analysis.
        Finally, we present and compare results obtained on both private and commercial clouds, ranging from Microsoft Azure and Open Telekom Cloud to private OpenStack instances.

        Speaker: Daniele Spiga (Universita e INFN, Perugia (IT))
      • 47
        Digital archive as a service: automatic deployment of an Invenio-based repository using TOSCA orchestration and Apache Mesos

        In the framework of the H2020 INDIGO-DataCloud project we have implemented an advanced solution for the automatic deployment of digital data repositories based on Invenio, the digital library framework developed by CERN. Exploiting cutting-edge technologies like Docker and Apache Mesos, and standard interfaces like TOSCA, we are able to provide a service that simplifies the process of creating and managing repositories of various digital assets using cloud resources.
        An Invenio-based repository consists of a set of services (e.g. database, message queue, cache, worker and frontend) that need to be properly installed, configured and linked together. These operations along with the provisioning of the resources and their monitoring and maintenance can be challenging for individual researchers or small-to-moderate-sized research groups.
        To this end, the INDIGO-DataCloud platform provides advanced features for orchestrating the deployment of complex virtual infrastructures on distributed cloud environments: it is able to provision the required resources automatically over heterogeneous and/or hybrid cloud infrastructures and to configure them automatically, ensuring dynamic elasticity and resilience. This approach has been successfully adopted to support the “Digital Research Infrastructure for the Arts and Humanities” (DARIAH) use case.

        Speaker: Marica Antonacci
    • 12:30 PM
      Lunch break
    • T1 - Online computing: S2 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Clara Gaspar (CERN)
      • 48
        LHCb MiniDAQ Control System

        LHCb is one of the four experiments at the LHC accelerator at CERN, specialized in b-physics. During the next long shutdown period, the LHCb experiment will be upgraded to a trigger-less readout system with a full software trigger in order to be able to record data at a much higher instantaneous luminosity. To achieve this goal, the upgraded systems for trigger, timing and fast control (TFC) and data acquisition (DAQ) will have new electronic boards, and a new software stack will be introduced for data acquisition. In the development stage, all these components are housed in servers named MiniDAQ. These new components will need to be integrated in an upgraded Experiment Control System (ECS) based on the WinCC OA SCADA and the CERN JCOP framework. The ECS provides full monitoring and control for all these sub-systems and many important features like configuration recipes and automated actions. This paper will describe the implementation of the upgraded ECS and its component fwMiniDAQ, which integrates all the new developments and can be easily distributed to the sub-detector developers and configured for their various setups.

        Speaker: Luis Granado Cardoso (CERN)
      • 49
        A Web-based control and monitoring system for DAQ applications

        Data acquisition and control play an important role in science applications, especially in modern high energy physics (HEP) experiments. A comprehensive and efficient monitoring system is a vital part of any HEP experiment. In this paper we describe the web-based software framework which is currently used by the CMD-3 Collaboration during data taking with the CMD-3 detector at the VEPP-2000 collider, and partially by the Muon g-2 experiment at Fermilab, to monitor the status of data acquisition and control the quality of physics data taken in the experiments.
        The system is designed to meet typical requirements and cover various use cases of DAQ applications (MIDAS oriented), ranging from a central configuration view, slow control data monitoring and user-oriented visualization to full control of DAQ modules.
        Acting as middleware between front-end electronics and DAQ applications, the system focuses on providing a high-level coherent view for shifters and experts for robust operations and, in particular, is used to integrate various experiment-dependent monitoring modules and tools into a unified Web-oriented portal with an appropriate access control policy.

        The paper describes the design and overall architecture of the system, recent developments, and the most important aspects of the framework implementation, including features such as real-time interactive monitoring plots, trend histograms, run log details, remote script execution, a built-in widget concept (template tags), fine-grained authorization and others.

        Speaker: Alexey Anisenkov (Budker Institute of Nuclear Physics (RU))
      • 50
        The protoDUNE Single Phase Detector Control System

        This paper presents the Detector Control System (DCS) that is being designed and implemented for the NP04 experiment at CERN. NP04, also known as protoDUNE Single Phase (SP), aims at validating the engineering processes and detector performance of a large LAr Time Projection Chamber in view of the DUNE experiment. The detector is under construction and will be operated on a tertiary beam of the CERN SPS accelerator starting in 2018.
        After an overall description of the distributed control architecture that has been chosen for this experiment, the focus will be on describing the software system design, based on the CERN control frameworks UNICOS and JCOP (built on top of WinCC OA), the hardware implementation, the deployment of the system into production, and the integration of the detector subsystems and instrumentation into the overall NP04 DCS.
        The challenging specifications of the SP detector, such as temperature reading accuracy, sensitivity to any electromagnetic noise of the SP detector electronics and the measures applied at the hardware layout level are also described. The knowledge acquired during the operation of the Anode Plane Assemblies (APA) sub-detector components within the NP Coldbox test facility is discussed and improvements are presented.

        Speaker: Manuel Jesus Rodriguez Alonso (Centro de Investigaciones Energéticas Medioambientales y Tecno)
      • 51
        Software based control and monitoring of a hardware based track reconstruction system for the ATLAS experiment

        During Run 2 of the Large Hadron Collider (LHC) the instantaneous luminosity exceeds the nominal value of $10^{34}$ cm$^{-2}$ s$^{-1}$ with a 25 ns bunch crossing period, and the number of overlapping proton-proton interactions per bunch crossing increases up to about 80. These conditions pose a challenge to the trigger systems of the experiments, which have to control rates while keeping a good efficiency for interesting physics events.

        This document summarizes the software based control and monitoring of a hardware-based track reconstruction system for the ATLAS experiment, called Fast Tracker (FTK), composed of associative memories and FPGAs operating at the rate of 100 kHz and providing high quality track information within the available latency to the high-level trigger. In particular, we will detail the commissioning of the FTK within the ATLAS online software system presenting the solutions adopted for scaling up the system and ensuring robustness and redundancy. We will also describe the solutions to challenges such as controlling the occupancy of the buffers, managing the heterogeneous and large configuration, and providing monitoring information at sufficient rate.

        Speaker: Simone Sottocornola (Universita and INFN (IT))
      • 52
        Front-End Electronics Control and Monitoring for the LHCb Upgrade

        The LHCb experiment, one of the four operating at the LHC, will undergo a major upgrade of its electronics during the third long shutdown period of the particle accelerator. One of the main objectives of the upgrade effort is to implement a 40 MHz readout of collision data. For this purpose, the Front-End electronics will make extensive use of a radiation resistant chipset, the Gigabit Transceiver (GBT), for readout as well as for slow control, monitoring and synchronization. At LHCb, the tools to operate the front-end electronics are developed by a central team and distributed to the users. This contribution describes the architecture of the system that implements the slow control and monitoring of all Front-End electronics using the GBT chipset, namely the GBTx and GBT-SCA. The system is implemented in 3 layers, starting with an FPGA-based electronic board that interfaces the GBT chipset directly through optical fibers. The second layer is composed of a PCIe driver and a number of processes to operate these boards. The user operates the system in the third layer, which is the WinCC OA SCADA that is interfaced with the Front-Ends via a message broker called DIM. The requirements of the system as well as the design and integration of each layer are discussed in detail. The results of the firmware implementation in hardware and operational tests are shown and the overall performance of the system is discussed.

        Speaker: Joao Vitor Viana Barbosa (CERN)
      • 53
        Status report on the architecture and future upgrades of the CMS Electromagnetic Calorimeter Control and Safety Systems

        The Electromagnetic Calorimeter (ECAL) is one of the sub-detectors of the Compact Muon Solenoid (CMS) experiment of the Large Hadron Collider (LHC) at CERN. For more than 10 years, the ECAL Detector Control System (DCS) and the ECAL Safety System (ESS) have supported the experiment operation, contributing to its high availability and safety. The evolution of both systems to fulfill new requirements and constraints, in addition to optimizations towards improving usage and process automation, led to several changes to their original design. This paper presents the current software/hardware architecture of both CMS ECAL DCS and ESS and reviews the major changes applied to both systems during the past years. Furthermore, in view of the CMS Phase-II upgrade of this sub-detector, the corresponding plans for the control and safety systems are also discussed.

        Speaker: Mr Diogo Di Calafiori (Eidgenoessische Technische Hochschule Zuerich (ETHZ) (CH))
      • 54
        Towards the ALICE Online-Offline (O2) control system

        The ALICE Experiment at CERN LHC (Large Hadron Collider) is under preparation for a major upgrade that is scheduled to be deployed during Long Shutdown 2 in 2019-2020 and that includes new computing systems, called O2 (Online-Offline).
        To ensure the efficient operation of the upgraded experiment along with its newly designed computing system, a reliable, high performance and automated control system will be developed with the goal of managing the lifetime of all the O2 processes, and of handling the various phases of the data taking activity by interacting with the detectors, the trigger system and the LHC.
        The ALICE O2 control system will be a distributed system based on state of the art cluster management and microservices which have recently emerged in the distributed computing ecosystem. Such technologies weren’t available during the design and development of the original LHC computing systems, and their use will allow the ALICE collaboration to benefit from a vibrant and innovating open source community.
        This paper illustrates the O2 control system architecture. It evaluates several solutions that were considered during an initial prototyping phase and provides a rationale for the choices made. It also provides an in-depth overview of the components, features and design elements of the actual system.

        Speaker: Teo Mrnjavac (CERN)
      • 55
        The online DQM of BESIII

        The BESIII detector is a magnetic spectrometer operating at BEPCII, a double-ring e+e- collider with center-of-mass energies between 2.0 and 4.6 GeV and a peak luminosity of $10^{33}$ cm$^{-2}$ s$^{-1}$. The event rate is about 4 kHz after the online event filter (L3 trigger) at the J/$\psi$ peak.

        The BESIII online data quality monitoring (DQM) system is used to monitor the data and the detector in real time using fully reconstructed data. It is an essential supplement to the DAQ monitoring, which uses unprocessed data directly from the detector. I will give a review of the BESIII DQM system. The system is designed to be weakly coupled to the DAQ online system, so it can be developed freely in the offline environment. The system includes the data server, which gets data from the DAQ; the reconstruction processes; the histogram server, which collects the histograms from all reconstruction processes; its own error handling; and other components.

        Speaker: Xiaobin Ji (IHEP, Beijing, China)
    • T2 - Offline computing: S2 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Lucia Grillo (University of Manchester (GB))
      • 56
        Impact of Detector Simulation in Particle Physics Collider Experiments

        Detector simulation has become fundamental to the success of modern high-energy physics (HEP) experiments. For example, the Geant4-based simulation applications developed by the ATLAS and CMS experiments played a major role in enabling them to produce physics measurements of unprecedented quality and precision with faster turnaround, from data taking to journal submission, than any previous hadron collider experiment. The material presented here adds to a recent review on the impact of detector simulation in collider particle physics experiments published in Ref. [1]. Most recent examples illustrate the application of simulation to detector design and optimization, software development and testing of computing models, detector commissioning and data calibration studies, as well as physics signal and background estimations utilized in data analysis for physics measurements. The cost and economic impact of simulation in the CMS experiment are also discussed. Future experimental programs present new challenges for increasingly sophisticated detectors, events, and data samples. The presentation finishes with a short discussion on potential solutions to mitigate the impact of this new environment on computing performance, given that physics accuracy needs to be preserved or improved. They are based on re-engineering detector simulation software using modern software and computing techniques to run on emerging hardware architectures.

        [1] Physics Reports 695 (2017) 1-54, arXiv:1706.04293

        Speaker: Victor Daniel Elvira (Fermi National Accelerator Lab. (US))
      • 57
        Current and Future Performance of the CMS Simulation

        The CMS full simulation using Geant4 has delivered billions of simulated events for analysis during Runs 1 and 2 of the LHC. However, the HL-LHC dataset will be an order of magnitude larger, with a similar increase in occupancy per event. In addition, the upgraded CMS detector will be considerably more complex, with an extended silicon tracker and a high granularity calorimeter in the endcap region. Increases in conventional computing resources are subject to both technological and budgetary limitations, so novel approaches are needed to improve software efficiency and to take advantage of new architectures and heterogeneous resources. Several projects are in development to address these needs, including the vectorized geometry library VecGeom and the GeantV transport engine, which uses track-level parallelization. The current computing performance of the CMS simulation will be presented as a baseline, along with an overview of the various optimizations already available for Geant4. Finally, the progress and outlook for integrating VecGeom and GeantV in the CMS software framework will be discussed.

        Speaker: Kevin Pedro (Fermi National Accelerator Lab. (US))
      • 58
        New techniques for pile-up simulation in ATLAS

        The high-luminosity data produced by the LHC leads to many proton-proton interactions per beam
        crossing in ATLAS, known as pile-up. In order to understand the ATLAS data and extract the physics
        results it is important to model these effects accurately in the simulation. As the pile-up rate continues
        to grow towards an eventual rate of 200 for the HL-LHC, this puts increasing demands on computing resources required for the simulation and the current approach of simulating the pile-up interactions along with the hard-scatter for each Monte Carlo production is no longer feasible. The new ATLAS "overlay" approach to pile-up
        simulation is presented. Here a pre-simulated set of minimum bias interactions, either from simulation
        or from real data, is created once and events drawn from this are overlaid with the hard-scatter event
        being simulated. This leads to significant improvements in CPU time. The contribution will discuss the
        technical aspects of the implementation in the ATLAS simulation and production infrastructure and
        compare the performance, both in terms of computing and physics, to the previous approach.
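
        The overlay idea described above can be illustrated with a toy sketch. This is not the ATLAS implementation: the hit representation (channel, energy), the fixed number of overlaid events `mu`, and the function names `build_minbias_pool` and `overlay` are all assumptions made for illustration. The point it shows is that the minimum-bias pool is built once and then only sampled and summed for each hard-scatter event.

```python
import random


def build_minbias_pool(n_events, hits_per_event, seed=0):
    """Create the pre-simulated pool once.

    Each pool event is a list of (channel, energy) hits; this layout is a
    hypothetical stand-in for real simulated detector output.
    """
    rng = random.Random(seed)
    return [
        [(rng.randrange(1000), rng.uniform(0.0, 1.0)) for _ in range(hits_per_event)]
        for _ in range(n_events)
    ]


def overlay(hard_scatter_hits, pool, mu, rng):
    """Overlay `mu` events drawn from the pool onto one hard-scatter event.

    Energies landing in the same channel are summed. Using a fixed `mu`
    (rather than a sampled pile-up multiplicity) is a simplification.
    """
    merged = {}
    for channel, energy in hard_scatter_hits:
        merged[channel] = merged.get(channel, 0.0) + energy
    for event in rng.sample(pool, mu):  # draw without replacement from the pool
        for channel, energy in event:
            merged[channel] = merged.get(channel, 0.0) + energy
    return merged


if __name__ == "__main__":
    pool = build_minbias_pool(n_events=500, hits_per_event=20)  # built once
    rng = random.Random(42)
    for i in range(3):  # many hard-scatter events reuse the same pool
        hard_scatter = [(i, 50.0), (i + 1, 25.0)]
        merged = overlay(hard_scatter, pool, mu=200, rng=rng)
        print(i, len(merged), round(sum(merged.values()), 3))
```

        In this sketch the expensive step (building the pool) is paid once, while each hard-scatter event only costs a sampling and a merge, which is where a CPU saving of the kind described in the abstract would come from.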

        Speaker: Tadej Novak (Jozef Stefan Institute (SI))
      • 59
        FairRoot and ALICE O2 multithreading simulation

        To address the challenges of the major upgrade of the experiment, the ALICE simulations must be able to make efficient use of computing and opportunistic supercomputing resources available on the GRID. The Geant4 transport package, the performance of which has been demonstrated in a hybrid multithreading (MT) and multiprocessing (MPI) environment with up to ¼ million threads, is therefore of particular interest.
        The O2 simulation framework is based on FairRoot, which itself is based on the Virtual Monte Carlo (VMC). The integration of MT into the VMC design and its impact on the Geant4 VMC was presented at CHEP 2014. Geant4 VMC MT and the scaling behavior of the computing time with the number of cores were then tested using a simplified but realistic multithreaded simulation application.
        The focus was then put on the integration of MT in FairRoot classes as the necessary step towards MT in the FairRoot-based experimental frameworks. The new O2 framework is the first for which the migration to MT has been achieved for all currently included detectors. The planned integration of the Geant4 MT mode in the standard FairRoot and O2 testing suites should keep this mode operational during the period of further developments and enhancements in 2018.
        In this contribution we will present the progress with the integration of MT in FairRoot classes, the work for thread-safety in the O2 simulation classes and the experience with the integration of the MT mode in testing. We will also discuss plans for further performance studies with a realistic detector setup.

        Speaker: Ivana Hrivnacova (IPNO, Université Paris-Saclay, CNRS/IN2P3)
      • 60
        Status of parallelized JUNO simulation software

        The Jiangmen Underground Neutrino Observatory (JUNO) is a multi-purpose neutrino experiment. It consists of a central detector, a water pool and a top tracker. The central detector, which is used for neutrino detection, consists of 20 kt of liquid scintillator (LS) and about 18,000 20-inch photomultiplier tubes (PMTs) to collect light from the LS.
        Simulation software is one of the important parts of the JUNO offline software. To speed up simulation, a parallelized simulation framework has been developed based on the SNiPER framework and Geant4 version 10. The SNiPER task components are in charge of the event loop, which can run in sequential mode, Intel TBB mode and other modes. Based on SNiPER, the simulation framework and its underlying parallel libraries have been decoupled. However, parallelized simulation of correlated events is a challenge. In order to keep the correct event order, a component called the global buffer has been developed in SNiPER.
        In this presentation, an overview of the parallelized JUNO simulation framework will be presented first. The global buffer is used in the parallelized simulation of correlated events. An event generator produces events with timestamps in sequential mode. These events are put into the global buffer and simulated by detector simulation algorithms in different tasks. After simulation, these events are saved into ROOT files with I/O services running in extra threads. Finally, we will present the software performance.
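
        The ordering problem described above can be sketched as follows. This is a simplified illustration in Python with a thread pool, not SNiPER or Intel TBB code; the names `generate_events`, `simulate` and `run`, and the bounded buffer size, are assumptions. Events are generated sequentially with timestamps, simulated concurrently, and released from the buffer strictly in generation order.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def generate_events(n):
    """Sequential generator: yields (timestamp, payload) in increasing time order."""
    for i in range(n):
        yield (i * 10.0, f"event-{i}")


def simulate(event):
    """Hypothetical stand-in for the detector simulation of a single event."""
    timestamp, payload = event
    return (timestamp, payload.upper())


def run(n_events, n_workers=4, buffer_size=8):
    """Simulate events in parallel while emitting results in generation order."""
    output = []
    buffer = deque()  # pending results, kept in the order events were generated
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for event in generate_events(n_events):
            buffer.append(pool.submit(simulate, event))
            if len(buffer) >= buffer_size:
                # Block on the oldest event only, so output order is preserved
                # even if later events finish first.
                output.append(buffer.popleft().result())
        while buffer:
            output.append(buffer.popleft().result())
    return output


if __name__ == "__main__":
    for timestamp, payload in run(10):
        print(timestamp, payload)
```

        Waiting only on the oldest pending entry is the design choice that keeps time-correlated events in order; the bounded buffer limits how far the generator can run ahead of the writer.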

        Speaker: Dr Tao Lin (Institute of High Energy Physics, CAS)
      • 61
        Current status for simulation chain of the neutrino events in the NOvA experiment

        The NOvA experiment is a two-detector, long-baseline neutrino experiment operating since 2014 in the NuMI muon neutrino beam (FNAL, USA). NOvA has already collected about 25% of its expected statistics in both neutrino and antineutrino modes for electron-neutrino appearance and muon-neutrino disappearance analyses. Careful simulation of neutrino events and backgrounds is required for precise measurements of the neutrino oscillation parameters. We will describe updates in the NOvA simulation chain in comparison to the previous one presented at CHEP2015, including: a revision of the detector geometries, upgrading the GENIE neutrino event generator to access an array of 2p2h models, upgrading GEANT to improve neutron modeling, modifying our custom electronic simulation package to model energy dependent shaping changes, improvements to our fiber and scintillator light models, and modifications to the overlay procedure to allow for mixing cosmic ray data with simulated neutrino events as well as enhanced realism in the addition of beam-related activity from outside the detector to the detector simulation.

        Speaker: Oleg Samoylov (Joint Institute for Nuclear Research)
      • 62
        Adopting new technologies in the LHCb Gauss simulation framework

        The increase in luminosity foreseen in the future years of operation of the Large Hadron Collider (LHC) creates new challenges in computing efficiency for all participating experiments. These new challenges extend beyond the data-taking alone, because data analyses require more and more simulated events, whose creation already takes a large fraction of the overall computing resources. For Run 3 of the LHC, the LHCb collaboration needs to simulate about two orders of magnitude more Monte Carlo events to exploit the increased luminosity.
        Therefore, the LHCb simulation framework (Gauss) will go through a significant renovation, mostly driven by the upgraded core software framework (Gaudi) and the availability of a multithreaded version of Geant4. The upgraded Gaudi framework replaces single-threaded processing by a multithreaded approach, allowing concurrent execution of tasks within a single event as well as multiple events in parallel. A major task of the required overhaul of Gauss is the implementation of a new interface to the multithreaded version of Geant4. Furthermore, in order to implement fast simulation options for different sub-detectors, this interface needs to be designed to be as modular as possible to allow seamless customisation of the detector simulation and a different treatment based on particle type.
        In this talk, we will give an overview of the whole plan and discuss the status of the project, as well as the lessons learned facing the challenges above. Furthermore, we present the work on Gaussino, a new Gaudi-based core simulation framework which forms the basis of Gauss for Run 3. Gaussino is a cross-experiment simulation framework with no dependency on LHCb software and provides an ideal testbed for the implementation of future technologies such as GeantV.

        Speaker: Dominik Muller (CERN)
      • 63
        #573 slot
    • T3 - Distributed computing: Facilities Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: Julia Andreeva (CERN)
      • 64
        Advances and enhancements in the FabrIc for Frontier Experiments Project at Fermilab

        The FabrIc for Frontier Experiments (FIFE) project within the Fermilab Scientific Computing Division is charged with integrating offline computing components into a common computing stack for the non-LHC Fermilab experiments, supporting experiment offline computing, and consulting on new, novel workflows. We will discuss the general FIFE onboarding strategy, the upgrades and enhancements in the FIFE toolset, and plans for the coming year.
        These enhancements include: a refactoring of the Fermilab General Purpose Grid cluster for integration with the HEPCloud Project; expansion of opportunistic computing resources (including GPU and high-performance computing resources); assistance with commissioning computing resources at European sites for individual experiments; StashCache repositories for experiments; enhanced job monitoring tools; and a custom workflow management service. Additionally, we have completed the first phase of a Federated Identity Management system to make it easier for FIFE users to access Fermilab computing resources. There is also a new service, known as FERRY, that will provide a central repository for authorization and job management attributes.

        Speaker: Vito Di Benedetto (Fermi National Accelerator Lab. (US))
      • 65
        HEPCloud, an Elastic Hybrid HEP Facility using an Intelligent Decision Support System

        HEPCloud is rapidly becoming the primary system for provisioning compute resources for all Fermilab-affiliated experiments. In order to reliably meet peak demands of the next generation of High Energy Physics experiments, Fermilab must either plan to locally provision enough resources to cover the forecasted need, or find ways to elastically expand its computational capabilities. Commercial cloud and allocation-based High Performance Computing (HPC) resources both have explicit and implicit costs that must be considered when deciding when to provision these resources and at what scale. In order to support such provisioning in a manner consistent with organizational business rules and budget constraints, we have developed a modular intelligent decision support system (IDSS) to aid in the automatic provisioning of resources spanning multiple cloud providers, multiple HPC centers, and grid computing federations. In this paper, we discuss the goals and architecture of the HEPCloud Facility, the architecture of the IDSS, and our early experience in using the IDSS for automated facility expansion both at Fermilab and Brookhaven Lab.

        Speaker: Eric Vaandering (Fermi National Accelerator Lab. (US))
      • 66
        Modeling and Simulation of Load Balancing Strategies for Computing in High Energy Physics

        The amount of data to be processed by experiments in high energy physics will increase tremendously in the coming years. For the first time in history, the expected technology advance alone will not be sufficient to close the gap between required and available resources, assuming the current flat-budget hardware procurement strategy is maintained. This leads to unprecedented challenges for both HEP software development and computing models. A first step towards meeting these challenges is the development of load balancing strategies for the existing workflow management systems in order to ensure the most effective resource utilization during processing of the comprehensive, worldwide distributed datasets.

        We report on using Palladio [1], an existing simulator for the performance of abstract software architectural models, to model and simulate the performance of computing jobs executed at the GridKa Tier-1 center. We validate the model with real-world performance measurements. With this model, we will enable model-based evaluation of different load balancing strategies.

        [1] Becker, Steffen, Heiko Koziolek, and Ralf Reussner. "The Palladio component model for model-driven performance prediction." Journal of Systems and Software 82, no. 1 (2009): 3-22.

        Speaker: Manuel Giffels (KIT - Karlsruhe Institute of Technology (DE))
      • 67
        The LZ UK Data Centre

        LZ is a Dark Matter experiment based at the Sanford Underground Research Facility. It is currently under construction and aims to start data taking in 2020. Its computing model is based on two data centres, one in the USA (USDC) and one in the UK (UKDC), both holding a complete copy of its data. During stable periods of running both data centres plan to concentrate on different aspects of reconstruction and analysis. However, both data centres are expected to be able to handle all aspects of data processing, including user analysis, to provide a failover should one of the data centres be temporarily unavailable.
        Unlike the USDC where both storage and processing facilities are located at NERSC, the UKDC uses a distributed approach on GridPP-provided computing resources. The data themselves will be hosted at the Tier-2 at Imperial College London, while their processing and analysis will take place at various UK Tier 2s and possibly other collaborating European institutes.
        Considerable effort has gone into providing an intuitive interface to the UKDC, hiding much of the complexity of the underlying infrastructure from the end users.
        We report on the setup of the UK Data Centre with an emphasis on the user interface, in addition to the planned interface to the US Data Centre and the experience gained from the mock data challenges.

        Speaker: Daniela Bauer (Imperial College (GB))
      • 68
        THE JINR DISTRIBUTED COMPUTING ENVIRONMENT

        Computing in the field of high energy physics requires usage of heterogeneous computing resources and IT, such as grid, high performance computing, cloud computing and big data analytics for data processing and analysis. The core of the distributed computing environment at the Joint Institute for Nuclear Research is the Multifunctional Information and Computing Complex (MICC). It includes a Tier-1 site for the CMS experiment, a Tier-2 site for all LHC experiments and other non-LHC grid VOs, such as BIOMED, COMPASS, NICA/MPD, NOvA, STAR and BESIII, as well as cloud and HPC infrastructures. A brief status overview of each component is presented. Particular attention is given to the development of distributed computations performed in collaboration with CERN, BNL, FNAL, FAIR, China, and JINR Member States. One direction for the cloud infrastructure is the development of methods for integrating cloud resources of the JINR Member States in order to perform common tasks and to distribute load across the integrated resources. We performed integration of cloud resources of several institutes in Azerbaijan, Ukraine, Russia, Bulgaria and Belarus. Extension of the HPC component will be carried out through a specialized HPC engineering infrastructure being created at MICC, which makes use of the contact liquid cooling technology implemented by the Russian company JSC "RSC Technologies". We present our plans to further develop MICC as a center for scientific computing within the multidisciplinary research environment of JINR and JINR Member States, and particularly for megascience projects, such as NICA.

        Speaker: Vladimir Korenkov (Joint Institute for Nuclear Research (RU))
      • 69
        Extending CERN computing to volunteers - LHC@home consolidation and outlook

        LHC@home has provided computing capacity for simulations under BOINC since 2005. Following the introduction of virtualisation with BOINC to run HEP Linux software in a virtual machine on volunteer desktops, initially on test BOINC projects such as Test4Theory and ATLAS@home, all CERN applications distributed to volunteers have been consolidated under a single LHC@home BOINC project. As part of an effort to unite CERN's batch, cloud, grid and volunteer computing efforts, the BOINC service has been integrated with the other compute services at CERN, notably HTCondor, in terms of job submission and accounting. The paper will also address contributions to improve the BOINC software and community effort to evolve BOINC for a sustainable volunteer computing environment. Furthermore, we discuss future challenges to reduce the effort required by volunteers to run virtual machines for experiment simulations and improvements to BOINC to widen the appeal of volunteer computing.

        Speaker: David Cameron (University of Oslo (NO))
      • 70
        Many hands make light work: Experiences from a shared resource WLCG Tier-2 computing site

        The Edinburgh (UK) Tier-2 computing site has provided CPU and storage resources to the Worldwide LHC Computing Grid (WLCG) for close to 10 years. Unlike other sites, resources are shared amongst members of the hosting institute rather than being exclusively provisioned for Grid computing. Although this unconventional approach has posed challenges for troubleshooting and service delivery there are unique advantages, such as being able to run above baseline expectations through opportunistic use of the cluster during quiet periods. Furthermore, the economy of scale of a larger shared facility enables access to niche equipment without additional cost (e.g. large-scale GPU processing). This operational model has recently been expanded to incorporate access to national supercomputing and data storage facilities.

        This presentation will describe the steps taken to consolidate disparate computing and storage resources into a coherent Grid site endpoint. This effort includes: cloud provisioning and workload management using OpenStack and Cloud Scheduler; enabling container technologies to ease workload management on HPC systems; and connecting remote data facilities to grid storage middleware. We will reflect upon our experiences fostering links between stakeholders and consider how this model could be adopted at other institutes.

        Speaker: Andrew John Washbrook (The University of Edinburgh (GB))
      • 71
        Advances in ATLAS@Home towards a major ATLAS computing resource

        The volunteer computing project ATLAS@Home has been providing a stable computing resource for the ATLAS experiment since 2013. It has recently undergone some significant developments and as a result has become one of the largest resources contributing to ATLAS computing, by expanding its scope beyond traditional volunteers and into exploitation of idle computing power in ATLAS data centres. Removing the need for virtualization on Linux and instead using container technology has significantly lowered the entry barrier for data centre participation, and in this paper we describe the implementation and results of this change. We also present other recent changes and improvements in the project. In early 2017 the ATLAS@Home project was merged into a combined LHC@Home platform, providing a unified gateway to all CERN-related volunteer computing projects. The ATLAS Event Service shifts data processing from file-level to event-level and we describe how ATLAS@Home was incorporated into this new paradigm. The finishing time of long tasks was also greatly improved by implementing a reassignment scheduling algorithm to assign late jobs to “reliable” volunteer hosts. Finally, the steps taken to allow regular ATLAS grid sites to move completely to ATLAS@Home are shown.

        Speaker: David Cameron (University of Oslo (NO))
    • T4 - Data handling: S2 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Conveners: Tigran Mkrtchyan (DESY) , Tigran Mkrtchyan (A.Alikhanyan National Science Laboratory (AM))
      • 72
        Caching technologies for Tier-2 sites: a UK perspective.

        Pressures from both WLCG VOs and externalities have led to a desire to "simplify" data access and handling for Tier-2 resources across the Grid. This has mostly been imagined in terms of reducing book-keeping for VOs, and total replicas needed across sites. One common direction of motion is to increase the amount of remote access to data for jobs, which is also seen as enabling the development of administratively cheaper Tier-2 subcategories, reducing manpower and equipment costs.
        Caching technologies are often seen as a "cheap" way to ameliorate the increased latency (and decreased bandwidth) introduced by ubiquitous remote-access approaches, but the usefulness of caches is strongly dependent on the reuse of the data thus cached.
        We report on work done in the UK at four GridPP Tier-2 sites - ECDF, Glasgow, RALPP and Lancaster - to investigate the suitability of transparent caching via the recently-rebranded XCache (Xrootd Proxy Cache) for both ATLAS and CMS workloads. We discuss issues with authentication, and measure the actual hit rates against the caches and their effect on efficiency of jobs at sites.
        We also examine the use of Xroot Proxy Caches backed by solid state media as internal caches within a site, buffering popular files between the site Storage Element and the workers.

        Speaker: Samuel Cadellin Skipsey
      • 73
        A data caching model for Tier-2 WLCG computing centres using XCache

        The XCache (XRootD Proxy Cache) provides a disk-based caching proxy for data access via the XRootD protocol. This can be deployed at WLCG Tier-2 computing sites to provide a transparent cache service for the optimisation of data access, placement and replication.

        We will describe the steps to enable full read/write operations to storage endpoints consistent with the distributed data management systems provided by LHC experiments using an example Tier-2 computing site (Edinburgh, UK). Performance studies will also be outlined to assess the benefits and limitations of using the XCache service, leading to the generation of configuration templates and best practices for other grid computing sites to follow. We will also discuss how the XCache service can provide access to large regional storage pools for diskless or modest disk capacity sites as part of a regionally federated layout. The possibility of using a single XCache service for multiple VO support is also discussed.

        Speaker: Dr Teng LI (University of Edinburgh)
      • 74
        Advancing throughput of HEP analysis work-flows using caching concepts

        High throughput and short turnaround cycles are core requirements for the efficient processing of I/O-intense end-user analyses. Together with the tremendously increasing amount of data to be processed, this leads to enormous challenges for HEP storage systems, networks and the data distribution to end-users. This situation is further compounded when opportunistic resources without dedicated storage systems are considered as a possible extension of traditional HEP computing facilities for end-user analyses.

        Enabling data locality via local caches on the processing units is a very promising approach to solve throughput limitations and to ensure short turnaround cycles of end-user analyses. Therefore, two different caching concepts have been studied at the Karlsruhe Institute of Technology. Both are transparently integrated into the HTCondor batch system in order to avoid job-specific adaptations for end-users.

        The first concept relies on coordinated caches on SSDs in the worker nodes. Data locality is taken into account by custom-developed components around the HTCondor batch system, ensuring that jobs are assigned to nodes holding their input data.
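
        A data-locality-aware assignment of this kind can be sketched as below. This is an illustration only and does not reproduce the components developed around HTCondor: the function `assign_job`, the dictionaries describing cache contents and node load, and the tie-breaking rule are all assumptions.

```python
def assign_job(input_files, node_caches, node_load):
    """Pick the worker node that already caches the most input files.

    node_caches maps node name -> set of cached file names.
    node_load maps node name -> number of running jobs.
    Ties are broken by lower load, then by node name (both rules are
    assumptions made for this sketch).
    """
    wanted = set(input_files)

    def score(node):
        cached = len(wanted & node_caches[node])
        return (-cached, node_load[node], node)

    return min(node_caches, key=score)


if __name__ == "__main__":
    caches = {
        "wn1": {"a.root", "b.root"},
        "wn2": {"c.root"},
        "wn3": set(),
    }
    load = {"wn1": 5, "wn2": 0, "wn3": 0}
    print(assign_job(["a.root", "b.root"], caches, load))  # node holding both files
    print(assign_job(["c.root"], caches, load))
```

        A job whose inputs are cached nowhere falls back to the least loaded node in this sketch; a real system would presumably also decide where to populate the cache at that point.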

        The second concept utilizes CEPH as a distributed file system acting as a system-wide cache. In this case no data locality specific adjustments need to be applied to the HTCondor batch system. In combination with developed XRootD caching and data locality plug-ins, this approach is also very well suited to tackle bandwidth limitations on opportunistic resources like HPC centers offering parallel file systems.

        In this talk, an overview of the utilized technologies, the data locality concepts and the current status of the project will be presented.

        Speaker: Christoph Heidecker (KIT - Karlsruhe Institute of Technology (DE))
      • 75
        Evolution of the Hadoop platform for HEP

        The interest in using Big Data solutions based on the Hadoop ecosystem is constantly growing in the HEP community. This drives the need for increased reliability and availability of the central Hadoop service and underlying infrastructure provided to the community by the CERN IT department.
        This contribution will report on the overall status of the Hadoop platform and the recent enhancements and features introduced in many areas including the service configuration, availability, alerting, monitoring and data protection, in order to meet the new requirements posed by the user community.

        Speaker: Zbigniew Baranowski (CERN)
      • 76
        Disk failures in the EOS setup at CERN - A first systematic look at 1 year of collected data

        The EOS deployment at CERN is a core service used for scientific data
        processing and analysis, and as a back-end for general end-user storage (e.g. home directories/CERNBOX).
        The disk failure metrics collected over a period of 1 year from a deployment
        of some 70k disks allow a first systematic analysis of the behaviour
        of different hard disk types for the large CERN use-cases.

        In this presentation we will describe the data collection and analysis,
        summarise the measured rates and compare them with other large disk
        deployments. In a second part of the presentation we will present a first
        attempt to use the collected failure and SMART metrics to develop a machine
        learning model predicting imminent failures and hence avoid service degradation
        and repair costs.

        Speaker: Dirk Duellmann (CERN)
      • 77
        The challenges of mining logging data in ATLAS

        Processing ATLAS event data requires a wide variety of auxiliary information from geometry, trigger, and conditions database systems. This information is used to dictate the course of processing and refine the measurement of particle trajectories and energies to construct a complete and accurate picture of the remnants of particle collisions. Such processing occurs on a worldwide computing grid, necessitating wide-area access to this information.
        Event processing tasks may deploy thousands of jobs. Each job calls for a unique set of information from the databases via SQL queries to dedicated Squid servers in the ATLAS Frontier system, a system designed to pass queries to the database only if that result has not already been cached from another request. Many queries passing through Frontier are logged in an Elasticsearch cluster along with pointers to the associated tasks and jobs, various metrics, and states at the time of execution. PanDA, which deploys the jobs, stores various configuration files as well as many log files after each job completes. Information is stored at each stage, but no system contains all information needed to draw a complete picture.
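
        The cache-before-database behaviour described above can be sketched in a few lines. This is a generic read-through cache, not the Frontier or Squid implementation: the class `QueryCache`, the exact-text cache key, and the hit/miss counters are assumptions chosen for illustration.

```python
class QueryCache:
    """Read-through cache: forward a query to the backend only on a miss."""

    def __init__(self, backend):
        self._backend = backend  # callable taking the SQL text, returning a result
        self._store = {}         # cache keyed on the exact query text (an assumption)
        self.hits = 0
        self.misses = 0

    def query(self, sql):
        if sql in self._store:
            self.hits += 1
            return self._store[sql]
        self.misses += 1
        result = self._backend(sql)  # only reached if no earlier request cached it
        self._store[sql] = result
        return result


if __name__ == "__main__":
    backend_calls = []

    def fake_database(sql):
        """Hypothetical backend that records every query it actually receives."""
        backend_calls.append(sql)
        return f"rows for: {sql}"

    cache = QueryCache(fake_database)
    # Many jobs of one task typically ask for overlapping conditions data.
    for job in range(1000):
        cache.query("SELECT * FROM conditions WHERE run = 1")
    print(len(backend_calls), cache.hits, cache.misses)
```

        Counters like `hits` and `misses` are the kind of per-query metric that, once logged alongside task and job identifiers, would allow the usage picture discussed in the abstract to be assembled.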
        This presentation describes the challenges of mining information from these sources to compile a view of database usage by jobs and tasks as well as assemble a global picture of the coherence and competition of tasks in resource usage to identify inefficiencies and bottlenecks within the overall system.

        Speaker: Elizabeth Gallas (University of Oxford (GB))
      • 78
        Parallel Event Selection Performance on HPC Systems

        In their measurement of the neutrino oscillation parameters (PRL 118, 231801
        (2017)), NOvA uses a sample of approximately 27 million reconstructed spills to
        search for electron-neutrino appearance events. These events are stored in an
        n-tuple format, in 180 thousand ROOT files. File sizes range from a few hundred KiB to a
        few MiB; the full dataset is approximately 3 TiB. These millions of events are
        reduced to a few tens of events by the application of strict event selection
        criteria, and then summarized by a handful of numbers each, which are used in
        the extraction of the neutrino oscillation parameters.

        The NOvA event selection code is currently a serial C++ program that reads
        these n-tuples. The current table data format and organization and the
        selection/reduction processing involved provide us with an opportunity to
        explore alternate approaches to represent the data and implement the
        processing. We represent our n-tuple data in HDF5 format that is optimized for
        the HPC environment and which allows us to use the machine's high-performance
        parallel I/O capabilities. We use MPI, numpy and h5py to implement our approach
        and compare the performance with the existing approach. We study the
        performance implications of using thousands of small files of different sizes
        as compared with one large file using HPC resources. This work has been done as
        part of the SciDAC project, “HEP analytics on HPC” in collaboration with the
        ASCR teams at ANL and LBNL.
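        The idea of splitting one large event table across parallel workers can be sketched in plain Python. This is an illustration only, not the NOvA code: the function names, the energy window and the toy data are hypothetical, and the ranks are simulated in a loop. In a real MPI program the rank would typically come from mpi4py (`MPI.COMM_WORLD.Get_rank()`) and each rank would read only its own slice of the HDF5 dataset.

```python
from typing import List, Tuple


def rank_range(n_events: int, n_ranks: int, rank: int) -> Tuple[int, int]:
    """Half-open [start, stop) block of events owned by `rank`."""
    base, extra = divmod(n_events, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def select(energies: List[float], lo: float, hi: float) -> List[int]:
    """Indices of events whose energy passes a simple window cut."""
    return [i for i, e in enumerate(energies) if lo <= e <= hi]


# Toy data standing in for one column of the n-tuple (hypothetical values).
energies = [0.5, 1.2, 2.5, 3.1, 0.9, 1.8, 2.2, 4.0, 1.5, 2.9]
n_ranks = 3

serial = select(energies, 1.0, 3.0)

parallel: List[int] = []
for rank in range(n_ranks):  # simulated ranks; MPI would run these concurrently
    start, stop = rank_range(len(energies), n_ranks, rank)
    local = select(energies[start:stop], 1.0, 3.0)
    parallel.extend(start + i for i in local)  # convert back to global indices

print(serial == parallel)
```

        Because the blocks are contiguous and disjoint, concatenating the per-rank results in rank order reproduces the serial selection.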

        Speaker: Holger Schulz (Fermilab)
      • 79
        The ATLAS & Google "Data Ocean" Project for the HL-LHC era

        With the LHC High Luminosity upgrade the workload and data management systems are facing new major challenges. To address those challenges ATLAS and Google agreed to cooperate on a project to connect Google Cloud Storage and Compute Engine to the ATLAS computing environment. The idea is to allow ATLAS to explore the use of different computing models, to allow ATLAS user analysis to benefit from the Google infrastructure, and to give Google real science use cases to improve their cloud platform. Making the output of a distributed analysis from the grid quickly available to the analyst is a difficult problem. Redirecting the analysis output to Google Cloud Storage can provide an alternative, faster solution for the analyst. First, Google's Cloud Storage will be connected to the ATLAS Data Management System Rucio. The second part aims to let jobs run on Google Compute Engine, accessing data from either ATLAS storage or Google Cloud Storage. The third part involves Google implementing a global redirection between their regions to expose Google Cloud Storage as a single global entity. The last part will deal with the economic model necessary for sustainable cloud resource usage, including Google Cloud Storage costs, network costs, and peering costs with ESnet.

        Speaker: Mario Lassnig (CERN)
    • T5 - Software development: S2 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Mr Barthelemy Von Haller (CERN)
      • 80
        A tale of two programs: driving innovation in HEP computing

        In spite of the fact that HEP computing has evolved considerably over the years, the understanding of the evolution process seems to be still incomplete. There is no clear procedure to replace an established product with a new one, and most of the successful major transitions (e.g. PAW to ROOT or Geant3 to Geant4) have involved a large dose of serendipity and have caused splits in the community, while many other attempts have failed. In general, software innovation has proceeded with a relatively low efficiency, in an environment that is moreover already lacking sufficient resources for code development. The accelerating pace of evolution of the hardware and of the algorithms is certainly good news, since it offers the opportunity to address HEP's dire need of computing resources, but only if the innovation process is able to take advantage of it in a timely and efficient way. One case in point is the evolution of the simulation code, responsible for 50% of the total computing resource consumption. This talk will describe the history of the GEANTV project, aimed at designing the next generation of simulation codes. In particular, we will describe how the concept of “disruptive innovation” has been adapted to our environment and how it has successfully led to the development of a revolutionary simulation prototype while continuously feeding technological innovation into the existing product. This process, begun with a no-compromise extreme design, has now led to the seamless merging of the development activities into the main line, hopefully leading to a seamless and efficient transition.

        Speaker: Federico Carminati (CERN)
      • 81
        Software training for the next generation of Physicists: joint experience of LHCb and ALICE

        Good software training is essential in the HEP community. Unfortunately, current training is non-homogeneous and the definition of a common baseline is unclear, making it difficult for newcomers to proficiently join large collaborations such as ALICE or LHCb.
        In recent years, both collaborations have started separate efforts to tackle this issue through training workshops, via Analysis Tutorials (organized by the ALICE Juniors since 2014) and the StarterKit (organized by LHCb students since 2015).
        In 2017, ALICE and LHCb joined efforts for the first time to provide combined training by identifying common topics, such as version control systems (Git) and programming languages (e.g. Python). Given the positive experience and feedback, this collaboration will be repeated in the future.
        We will illustrate the teaching methods, experience and feedback from our first common training workshop. We will also discuss our efforts to extend our format to other HEP experiments for future iterations.

        Speakers: Dario Berzano (CERN) , Chris Burr (University of Manchester (GB))
      • 82
        Exploring polyglot software frameworks in ALICE with FairMQ and fer

        In order to meet the challenges of the Run-3 data rates and volumes, the ALICE collaboration is merging the online and offline infrastructures into a common framework: ALICE-O2.
        O2 is based on FairRoot and FairMQ, a message-based, multi-threaded and multi-process control framework.
        In FairMQ, processes (possibly on different machines) exchange data via message queues either through 0MQ or nanomsg.
        FairMQ allows building sophisticated topologies of processes: router/dealer, request/reply, publish/subscribe, client/server, etc.
        In turn, this enables developers to write their reconstruction or analysis process in whatever language they choose or deem appropriate for the task at hand, as long as that programming language can send and receive data through these message queues.

        This paper presents "fer", a Go-based library compatible and interoperable with FairMQ.
        We will start with a brief introduction of the built-in features that make Go a solid choice when dealing with I/O and concurrency.
        We will then describe the principal components of fer and how they interact with C++ FairMQ.
        Finally, we will report on the performance (CPU, VMem) of fer and conclude with the main figures of merit of fer, in the context of deployment in a distributed computing setup.

        Speaker: Dr Sebastien Binet (IN2P3/LPC)
      • 83
        HEPCon - A Cross-Platform Mobile Application for HEP Events

        Collaboration in research is essential because it saves time and money. The field of high-energy physics (HEP) is no different. The higher the level of collaboration, the stronger the community. The HEP field encourages organizing events of various formats and sizes, such as meetings, workshops and conferences. Making it easier to attend a HEP event fosters cooperation and dialogue, and this is what makes the Indico service a de facto community standard.

        We would like to present HEPCon, a cross-platform mobile application which collects all information available on Indico on a portable device. It keeps most of the data locally which speeds up the interaction. HEPCon uses a shared codebase which allows easy multiplatform development and support. There are iOS and Android implementations available for free download.

        The project is based on C# and we use the Xamarin mobile app technology for building native iOS and Android apps. An SQLite database is responsible for retrieving and storing conference data. The app can be used to preview data from past CHEP conferences, but the tool is generic enough to support other kinds of Indico events.

        Speaker: Mr Martin Vasilev (University of Plovdiv)
      • 84
        Application extensibility and customization - Indico's case

        Indico is a general-purpose event management system currently in use by more than 150 institutions worldwide. Despite having been born at CERN and primarily adopted by the High Energy Physics Community, it has recently gained adoption in other communities (for example, the United Nations and its agencies) and received the attention of commercial vendors worldwide. This growth in adoption implied finding better ways to customize the system to the needs of each individual organization. The application is written in Python and uses the Flask web framework, as well as SQLAlchemy as an ORM to PostgreSQL.

        Indico 2.0 brings improvements that make it an easily configurable application, starting from a simpler installation process and customizable user interface using CSS and a templating language; and ending with a plugins system providing easy solutions for developing custom plugins, which allow for functionality to be added without changes in the application's core. The project’s philosophy has always been that of “simplicity by default” and “optional complexity”.

        In this paper we will explain in detail the mechanisms we put in place in order to make Indico more extensible, configurable and easy to customize; we will describe some of the technical challenges that creating a feature-complete yet flexible web application involves and some of the technical solutions we have found and which can be reused by others. We will also detail the approaches that led to the development of flask-pluginengine and flask-multipass, two standalone modules that implement most of the application's extensibility and which can be reused by any Flask application. Finally, we will go over the parts of the tool suite which powers Indico, including the alembic-based migration utility and the asset generating pipeline, and analyze the implications this work had on them.

        Speaker: Pedro Ferreira (CERN)
      • 85
        Extending ROOT through Modules

        The ROOT software framework is foundational for the HEP ecosystem, providing capabilities such as IO, a C++ interpreter, GUI, and math libraries. It uses object-oriented concepts and build-time modules to layer between components. We believe additional layering formalisms will benefit ROOT and its users.

        We present the modularization strategy for ROOT which aims to formalize the description of existing source modules, making available the dependencies and other metadata externally from the build system, and allow post-install additions of functionality in the runtime environment. Modules can then be grouped into packages, installable from external repositories, so that missing packages can be delivered in a post-install step. This provides a mechanism for the wider software ecosystem to interact with a minimalistic install. Reducing intra-module dependencies improves maintainability and code hygiene. We believe helping maintain the smallest “base install” possible will help embedding use cases.

        The modularization effort draws inspiration from the Java, Python, and Swift ecosystems. Keeping aligned with modern C++, this strategy relies on forthcoming features such as C++ modules. We hope formalizing the module layer will provide simpler ROOT installs, improve extensibility, and decrease the complexity of embedding in other ecosystems.

        Speaker: Oksana Shadura (University of Nebraska Lincoln (US))
      • 86
        A Python upgrade to the GooFit package for parallel fitting

        The GooFit highly parallel fitting package for GPUs and CPUs has been substantially upgraded in the past year. Python bindings have been added to allow simple access to the fitting configuration, setup, and execution. A Python tool to write custom GooFit code given a (compact and elegant) MINT3/AmpGen amplitude description allows the corresponding C++ code to be written quickly and correctly. New PDFs have been added. The most recent release was built on top of the December 2017 2.0 release that added easier builds, new platforms, and a more robust and efficient underlying function evaluation engine.

        Speaker: Henry Fredrick Schreiner (University of Cincinnati (US))
      • 87
        Vectorization of ROOT Mathematical Libraries

        In order to take full advantage of new computer architectures and to satisfy the requirement of minimising CPU usage with the increasing amount of data to analyse, parallelisation and SIMD vectorisation have been introduced in the ROOT mathematical libraries. The VecCore library provides a very convenient solution to abstract SIMD vectorisation and it has been found extremely useful for integrating vectorisation into existing software. VecCore features a simple API to write SIMD-enabled algorithms, avoiding the direct use of lower level vectorisation libraries. Furthermore, VecCore supports several back-ends such as Vc and UME::SIMD.
        We first describe how VecCore is integrated into ROOT and how it has been applied to implement SIMD vectorisation in some of the most commonly used mathematical functions. We also report how SIMD vectorisation has been deployed for describing model functions, which can be used for fitting. In particular, we show how ROOT functions built with a string expression, such as those based on the ROOT TFormula class, or functions provided by users, can be vectorised and used for speeding up fitting on large data sets. We report performance figures from benchmarks on different architectures, such as Intel Xeon and Xeon Phi processors, showing the gains from using SIMD parallelism.

        Speaker: Lorenzo Moneta (CERN)
    • T6 - Machine learning and physics analysis: S2 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Convener: Sofia Vallecorsa (Gangneung-Wonju National University (KR))
      • 88
        Study of Neural Network Size Requirements for Approximating Functions Relevant to HEP

        Neural networks, and recently, specifically deep neural networks, are attractive candidates for machine learning problems in high energy physics because they can act as universal approximators. With a properly defined objective function and sufficient training data, neural networks are capable of approximating functions for which physicists lack sufficient insight to derive an analytic, closed-form solution. There are, however, a number of challenges that can prevent a neural network from achieving a sufficient approximation of the desired function. One of the chief challenges is that there is currently no fundamental understanding of the size (both in terms of number of layers and number of nodes per layer) necessary to approximate any given function. Networks that are too small are doomed to fail, and networks that are too large often encounter problems converging to an acceptable solution or develop issues with overtraining. In an attempt to gain some intuition, we performed a study of neural network approximations of functions known to be relevant to high energy physics, such as calculating the invariant mass from momentum four-vector components, or calculating the momentum four-vector of a parent particle from the four-vectors of its decay products. We report on the results of those studies and discuss possible future directions.
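        For reference, the two target functions named above have simple closed forms, which is what makes them convenient benchmarks. The sketch below is not taken from the study: it assumes natural units (c = 1) and an (E, px, py, pz) component ordering, and the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple

# Assumed layout: (E, px, py, pz) in natural units, c = 1.
FourVector = Tuple[float, float, float, float]


def parent_four_vector(products: Iterable[FourVector]) -> FourVector:
    """Four-vector of the parent: component-wise sum over the decay products."""
    e = px = py = pz = 0.0
    for pe, ppx, ppy, ppz in products:
        e += pe
        px += ppx
        py += ppy
        pz += ppz
    return (e, px, py, pz)


def invariant_mass(v: FourVector) -> float:
    """m = sqrt(E^2 - |p|^2); small negative rounding errors are clamped to zero."""
    e, px, py, pz = v
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))


# Two massless, back-to-back decay products with 45 units of energy each.
products = [(45.0, 0.0, 0.0, 45.0), (45.0, 0.0, 0.0, -45.0)]
parent = parent_four_vector(products)
print(parent, invariant_mass(parent))
```

        The parent reconstruction is linear in its inputs, while the invariant mass involves squares and a square root, so the two functions probe rather different approximation demands on a network.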

        Speaker: Jessica Stietzel (College of the Holy Cross)
      • 89
        Adversarial event generator tuning with Bayesian Optimization

        High Energy Physics experiments often rely on Monte-Carlo event generators. Such generators often contain a large number of parameters and need fine-tuning to closely match experimentally observed data. This task traditionally requires expert knowledge of the generator and the experimental setup as well as vast computing power. Generative Adversarial Networks (GANs) are a powerful method for matching the distribution of samples produced by a parametrized generator to a set of observations. Following the recently proposed study on adversarial variational optimization of non-differentiable generators, we adopt Bayesian Optimization as an efficient gradient-free optimization method for adversarial fine-tuning of event generators. The proposed method requires minimal prior knowledge; nevertheless, it allows expert insights to be straightforwardly incorporated. In this talk, we briefly describe a theoretical approach to the problem and show the results for parameter tuning of the PYTHIA event generator.

        Speaker: Mr Maxim Borisyak (National Research University Higher School of Economics)
      • 90
        Anomaly detection using Deep Autoencoders for the assessment of the quality of the data acquired by the CMS experiment

        The certification of the CMS data as usable for physics analysis is a crucial task to ensure the quality of all physics results published by the collaboration. Currently, the certification conducted by human experts is labor intensive and can only be segmented on a run by run basis. This contribution focuses on the design and prototype of an automated certification system assessing data quality on a per-luminosity section (i.e. 23 seconds of data taking) basis. Anomalies caused by detector malfunctions or sub-optimal reconstruction are unpredictable and occur rarely, making it difficult to use classical supervised classification methods such as feedforward neural networks. We base our prototype on a semi-supervised model which employs deep autoencoders. This approach has been qualified successfully on CMS data collected during the 2016 LHC run: we demonstrate its ability to detect anomalies with high accuracy and low fake rate, when compared against the outcome of the manual certification by experts. A key advantage of this approach over other ML technologies is having great interpretability of the results, which can be further used to ascribe the origin of the problems in the data to a specific sub-detector or physics objects.

        Speaker: Adrian Alan Pol (Université Paris-Saclay (FR))
      • 91
        Online detector monitoring using AI: challenges, prototypes and performance evaluation for automation of online quality monitoring of the CMS experiment exploiting machine learning algorithms.

        Online Data Quality Monitoring (DQM) in High Energy Physics experiments is a key task which, nowadays, is extremely expensive in terms of human resources and required expertise.

        We investigate machine learning as a solution for automatised DQM. The contribution focuses on the peculiar challenges posed by the requirement of setting up and evaluating the AI algorithms in the online environment; it also presents the successful application of modern machine learning techniques, in particular deep learning, to concrete examples of detector monitorables (e.g. based on the Muon Spectrometer) integrated in the production DQM infrastructure of CMS.

        This pioneering work paves the way to the automatisation of many of the tasks currently performed in the online DQM system, allowing large volumes of data to be checked in real time, improving the ability to detect unexpected failures and, at the same time, reducing the manpower requirements.

        Speaker: Adrian Alan Pol (Université Paris-Saclay (FR))
      • 92
        ATLAS Analytics and Machine Learning Platforms

        In 2015 ATLAS Distributed Computing started to migrate its monitoring systems away from Oracle DB and decided to adopt new big data platforms that are open source, horizontally scalable, and offer the flexibility of NoSQL systems. Three years later, the full software stack is in place, the system is considered in production and operating at near maximum capacity (in terms of storage capacity and tightly coupled analysis capability). The new model provides several tools for fast and easy to deploy monitoring and accounting. The main advantages are: ample ways to do complex analytics studies (using technologies such as Java, Pig, Spark, Python, Jupyter), flexibility in reorganization of data flows, near real time and inline processing. The analytics studies improve our understanding of different computing systems and their interplay, thus enabling whole-system debugging and optimization. In addition, the platform provides services to alarm or warn on anomalous conditions, and several services closing feedback loops with the Distributed Computing systems. Here we briefly describe the main system components and data flows, but concentrate on the hardware and software tools we use for in-depth analytics and simulations and on the support for machine learning algorithms, specifically artificial neural network training and reinforcement learning techniques. We describe several applications the platform enables, and discuss ways to scale up further.

        Speaker: James Catmore (University of Oslo (NO))
      • 93
        REANA: A System for Reusable Research Data Analyses

        The revalidation, reinterpretation and reuse of research data analyses requires having access to the original computing environment, the experimental datasets, the analysis software, and the computational workflow steps which were used by the researcher to produce the original scientific results in the first place.

        REANA (Reusable Analyses) is a nascent platform that enables researchers to structure their research data analyses with future reuse in mind. The analysis is described by means of a YAML file that captures sufficient information about the analysis assets, parameters and processes. The REANA platform offers a set of micro-services for launching and monitoring container-based computational workflow jobs on the cloud. The REANA user interface and the command-line client enable researchers to easily rerun analysis workflows with new input parameters. The REANA platform aims at supporting several container technologies (Docker), workflow engines (CWL, Yadage), shared storage systems (Ceph, EOS) and compute cloud infrastructures (Kubernetes/OpenStack, HTCondor) used by the community.

        REANA was developed with the particle physics use case in mind and profits from synergies with general reusable research data analysis patterns in other scientific disciplines, such as bioinformatics and life sciences.

        Speaker: Tibor Simko (CERN)
      • 94
        Continuous Analysis Preservation and Streamlining for the ATLAS Experiment

        We present recent work within the ATLAS collaboration to centrally provide tools to facilitate analysis management and highly automated container-based analysis execution in order to both enable non-experts to benefit from these best practices as well as the collaboration to track and re-execute analyses independently, e.g. during their review phase.

        Through integration with the ATLAS GLANCE system, users can request a pre-configured, but customizable version control setup, including continuous integration for automated build and testing as well as continuous Linux Container image building for software preservation purposes.

        As analyses typically require many individual steps, analysis workflow pipelines can then be defined using such images and the yadage workflow description language. The integration into the workflow execution service REANA allows the interactive or automated reproduction of the main analysis results by orchestrating a large number of container jobs using Kubernetes.

        For long-term archival, we present integration with the CERN Analysis Preservation Portal (CAP), where data, analysis repositories, software images and workflows can be stored and indexed for later re-use such as reinterpretations.

        Speaker: Lukas Alexander Heinrich (New York University (US))
      • 95
        Allocation Optimization for the ATLAS Rebalancing Data Service

        The distributed data management system Rucio manages all data of the ATLAS collaboration across the grid. Automation such as replication and rebalancing is an important part of ensuring minimum workflow execution times. In this paper, a new rebalancing algorithm based on machine learning is proposed. First, it can run independently of the existing rebalancing mechanism and can be modularised. It collects data from other services and learns optimality as it runs in the background. Periodically this learning agent takes a subset of the global datasets and proposes them for redistribution to reduce waiting times. The user can interact and choose to accept, decline, or override the dataset placement suggestions. The accepted items are shifted continuously between destination data centres as a background service while taking network and storage utilisation into account.

        Speaker: Ralf Vamosi (CERN)
    • T8 - Networks and facilities: S2 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Conveners: Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno) , Sang Un Ahn (Korea Institute of Science & Technology Information (KR))
      • 96
        Improving WLCG Networks Through Monitoring and Analytics

        WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, has focused on collecting, storing and making available all the network related metrics for further analysis and discovery of issues that might impact network performance and operations. This has been facilitated by the already existing network of perfSONAR instances that have been commissioned to operate in production. In order to help sites and experiments better understand and fix networking issues, the WLCG Network Throughput working group was formed, which works on the analysis and integration of the network-related monitoring data collected by the OSG/WLCG infrastructure and operates a support unit to help find and fix network performance issues.
        In this talk we'll describe the current state of the OSG network measurement platform and summarise the activities taken by the working group, focusing mainly on the throughput issues that have been reported and resolved during the recent period with the help of the perfSONAR network. We will also cover the updates on the few higher level services that were developed to help bring the platform to its full potential as well as describe the currently ongoing analytical activities related to networking and their results. Finally, we will discuss and propose possible R&D areas to help improve our networks in general as well as prepare the infrastructure for the foreseen major changes in the way networks will be provisioned and operated in the future.

        Speaker: Edoardo Martelli (CERN)
      • 97
        IPv6 in production: its deployment and usage in WLCG

        The fraction of general internet traffic carried over IPv6 continues to grow rapidly. The transition of WLCG central and storage services to dual-stack IPv4/IPv6 is progressing well, thus enabling the use of IPv6-only CPU resources as agreed by the WLCG Management Board and presented by us at CHEP2016. By April 2018, all WLCG Tier 1 data centres will provide access to their services over IPv6. The LHC experiments have requested all WLCG Tier 2 centres to provide dual-stack access to their storage by the end of LHC Run 2.

        For several years the HEPiX IPv6 Working Group has been testing WLCG services to ensure their IPv6 compliance. Many Tier 2s are now successfully running their services in dual-stack mode, but there is still a large number who are not. The working group, driven by the requirements of the LHC VOs to be able to use IPv6-only opportunistic resources, continues to encourage wider deployment of dual-stack services and has been monitoring the transition.

        We will present the progress of the transition to IPv6. This includes making experiment central services dual stack as well as the storage and data services. The monitoring, accounting and information services that are used by jobs are also being upgraded. Finally, we will assess the timetable for being able to turn off IPv4 completely at a WLCG data centre, this being the point at which the transition can be declared truly complete.

        Speaker: David Kelsey (STFC-Rutherford Appleton Laboratory (GB))
      • 98
        Using Kytos SDN platform to enhance international big data transfers

        Data-intensive science collaborations still face challenges when transferring large data sets between globally distributed endpoints. Many issues need to be addressed to orchestrate the network resources in order to better explore the available infrastructure. In multi-domain scenarios, the complexity increases because network operators rarely export the network topology to researchers and users, resulting in slow inter-domain circuit provisioning. The LHC at CERN and the LSST are two examples of such scientific initiatives.

        Kytos SDN Platform is a new solution, developed at the Sao Paulo Research and Analysis Center (SPRACE) from Unesp, that enables a fast way to deploy an SDN infrastructure. Kytos was designed to be easy to install, use, develop and share via Network Apps (NApps). A circuit provisioning application has been developed on top of the Kytos platform, known as the "Kytos MEF E-Line". It is a service type defined by the Metro Ethernet Forum for connecting exactly two User Network Interfaces (UNI), so they can communicate only with each other. This NApp can be used to schedule a circuit with a minimum bandwidth before the beginning of a specific data transfer between two endpoints. We present in detail the Kytos open source SDN platform and introduce the "Kytos MEF E-Line" NApp, developed in collaboration with AmLight. We also show how this can be integrated with FTS via GFAL plugin in order to allow the provisioning of end-to-end circuits over a Software-Defined Network.

        Speaker: Beraldo Costa Leal (UNESP - Universidade Estadual Paulista (BR))
      • 99
        Integration and evaluation of QUIC and TCP-BBR in long-haul WLCG data transfers

        Recent years have seen the mass adoption of streaming in mobile computing, an increase in size and frequency of bulk long-haul data transfers
        in science in general, and the usage of big data sets in job processing
        demanding real-time long-haul accesses that can be greatly affected by
        variations in latency. It has been shown in the Physics and climate research communities that the need to exchange petabytes of data with
        global collaborators can be seriously hampered by the TCP congestion
        control and latency. Demands for faster and lower latency transfers have
        been stressed also by the increasing need of encryption both in mobile
        computing and in computational science.

        Two recent and promising additions to the internet protocols are TCP-BBR
        and QUIC. BBR implements a congestion policy that promises
        better control of TCP bottlenecks on long haul transfers. TCP-BBR is
        implemented in the Linux kernels above 4.9. It has been shown, however,
        to demand some fine tuning in the interaction, for example, with the Linux
        Fair Queue. QUIC, on the other hand, replaces HTTP and TLS with a
        protocol on top of UDP and a thin layer to serve HTTP. It has been
        reported to account today for 7% of Google’s traffic. It hasn’t been used
        in server to server transfers even if its creators see that as a real possibility.
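        As an illustration of how an application can opt in to BBR on Linux, the sketch below requests a congestion control algorithm for a single TCP socket through the `TCP_CONGESTION` socket option, which Python exposes as `socket.TCP_CONGESTION` on Linux. It is not the integration described in this contribution: the helper name and the fallback strings are hypothetical, and the request succeeds only if the kernel offers the algorithm and permits its use.

```python
import socket


def request_congestion_control(sock: socket.socket, name: str = "bbr") -> str:
    """Ask for a congestion control algorithm and report the one in effect."""
    opt = getattr(socket, "TCP_CONGESTION", None)
    if opt is None:
        return "unsupported"  # this platform does not expose the option
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, name.encode())
    except OSError:
        pass  # algorithm unavailable or not permitted: keep the system default
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, opt, 16)
    except OSError:
        return "unknown"
    # The kernel returns a NUL-padded name such as b"cubic\x00...".
    return raw.split(b"\x00", 1)[0].decode() or "unknown"


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    active = request_congestion_control(s)
print(active)
```

        Reading the option back, rather than trusting the request, shows whether the socket is actually using BBR or has stayed on the system default.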

        Our work evaluates the applicability and tuning of TCP-BBR and QUIC for WLCG and data science transfers. We describe the integration of each of them into the transfer tool iperf and the xroot protocol. We present, possibly for the first time, a server-to-server deployment of QUIC, together with a performance evaluation of both QUIC and TCP-BBR on long-haul transfers involving WLCG servers.
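
        As an illustration of the per-connection tuning discussed above, the following minimal sketch requests BBR for a single TCP socket on Linux. It is not taken from the authors' iperf or xroot integration; the helper name `set_congestion_control` is hypothetical, and the sketch assumes a Linux kernel that exposes the `TCP_CONGESTION` socket option. If BBR is not loaded or the platform lacks the option, the socket simply keeps its default algorithm.

```python
import socket


def set_congestion_control(sock, algorithm="bbr"):
    """Try to select a TCP congestion control algorithm for one socket.

    Returns the name of the algorithm in effect afterwards, or "unknown"
    when the platform does not expose TCP_CONGESTION (e.g. non-Linux).
    Hypothetical helper, for illustration only.
    """
    opt = getattr(socket, "TCP_CONGESTION", None)
    if opt is None:
        return "unknown"
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, algorithm.encode())
    except OSError:
        # Requested algorithm not loaded or not permitted: keep the default.
        pass
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, opt, 16)
    except OSError:
        return "unknown"
    # The kernel returns a NUL-padded byte string such as b"bbr\x00...".
    return raw.split(b"\x00", 1)[0].decode() or "unknown"


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("congestion control in effect:", set_congestion_control(s, "bbr"))
```

        Reading the option back after setting it shows which algorithm the kernel actually applied, which is useful when comparing BBR against the system default on the same host.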

        Speaker: Raul Cardoso Lopes (Brunel University (GB))
      • 100
        Integrating Networking into ATLAS

        Networking is foundational to the ATLAS distributed infrastructure and there are many ongoing activities related to networking both within and outside of ATLAS. We will report on the progress in a number of areas exploring ATLAS's use of networking and our ability to monitor the network, analyze metrics from the network, and tune and optimize application and end-host parameters to make the most effective use of the network. Specific topics will include work on Open vSwitch for production systems, network analytics, FTS testing and tuning, and network problem alerting and alarming.

        Speaker: Shawn McKee (University of Michigan (US))
      • 101
        Entry Stage for the CBM First-level Event Selector

        The First-level Event Selector (FLES) is the main event selection
        system of the upcoming CBM experiment at the future FAIR facility in
        Germany. As the central element, a high-performance compute
        cluster analyses free-streaming, time-stamped data delivered from the
        detector systems at rates exceeding 1 TByte/s and selects data
        for permanent storage.
        While the detector systems are located in a dedicated CBM building, the
        FLES compute resources will be hosted in the central Green IT Cube
        data center approx. 700 meters away from the experiment. This demands
        single-mode optics and protocols not susceptible to delay. Previous
        concepts therefore included an FPGA-based concentrator stage and custom
        optical links to the FLES cluster.
        Exploiting newest network technologies allows for a more flexible, yet
        cost effective design using COTS equipment for the long-haul link. A
        FLES entry cluster, placed near the detector, will terminate detector
        links in custom PCIe cards. It aggregates processing components and
        delivers them via a specially configured InfiniBand network to the main
        compute cluster. The network design facilitates flexible staging and
        local processing for testing and detector commissioning purposes.
        An overview of the FLES entry cluster and network architecture as well
        as newest performance measurements will be presented.

        Speaker: Dirk Hutter (Johann-Wolfgang-Goethe Univ. (DE))
      • 102
        Netbench – large-scale network device testing with real-life traffic patterns

        Network performance is key to the correct operation of any modern datacentre infrastructure or data acquisition (DAQ) system. Hence, it is crucial to ensure the devices employed in the network are carefully selected to meet the required needs.

        The established benchmarking methodology [1,2] consists of various tests that create perfectly reproducible traffic patterns. This has the advantage of being able to consistently assess the performance differences between various devices, but comes at the disadvantage of always using known, pre-defined traffic patterns (frame sizes and traffic distribution) that do not stress the buffering capabilities of the devices to the same extent as real-life traffic would.
        Netbench is a network-testing framework, relying on commodity servers and NICs, that aims at overcoming the previously mentioned shortcoming. While not providing identical conditions for every test, netbench enables assessing the devices’ behaviour when handling multiple TCP flows, which closely resembles real-life usage, e.g.:
        - a full-mesh traffic distribution [3] is most likely the best benchmarking pattern for a network device to be employed in a multi-purpose data-centre;
        - a unidirectional partial-mesh traffic distribution [3] will closely mimic the event-building traffic pattern from a DAQ system.
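
        The two traffic distributions listed above can be sketched as lists of (source, destination) flow pairs. This is only an illustration and not netbench code: the function names and host names are hypothetical, and the partial mesh is assumed here to be a many-to-many pattern from a set of senders to a disjoint set of receivers.

```python
from itertools import permutations, product


def full_mesh(nodes):
    """Every node sends to every other node: n * (n - 1) directed flows."""
    return list(permutations(nodes, 2))


def partial_mesh_unidirectional(senders, receivers):
    """Each sender sends to each receiver; no traffic flows back.

    Resembles event building, where readout nodes push data to builders.
    """
    return [(s, r) for s, r in product(senders, receivers) if s != r]


if __name__ == "__main__":
    hosts = ["node1", "node2", "node3", "node4"]  # hypothetical host names
    print(len(full_mesh(hosts)), "flows in a full mesh of", len(hosts), "nodes")
    daq = partial_mesh_unidirectional(["ro1", "ro2"], ["eb1", "eb2", "eb3"])
    print(len(daq), "flows from 2 readout nodes to 3 event builders")
```

        Since a full mesh grows quadratically with the number of nodes, even a modest number of servers produces many competing flows through the device under test.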

        Due to the prohibitive cost of specialized hardware equipment that implements RFC tests [1,2], few companies/organisations can afford a large-scale test setup. The compromise that is often employed is to use two hardware tester ports and feed the same traffic to the device multiple times through loop-back cables (the so-called “snake-test”). This test fully exercises the per-port throughput capabilities, but barely stresses the switching engine of the device. The per-port price of a netbench test setup is significantly lower than that of a testbed made using specialized hardware, especially if we take into account the fact that generic servers can be time-shared between netbench and day-to-day usage. Thus, a large-scale multi-port netbench setup is affordable, and enables organisations/companies to complement the snake test with benchmarks that stress test the switching fabric of network devices.

        Netbench has a layered architecture and uses standard technologies. At its core, it relies on iperf3 [4] as an engine to drive TCP flows between servers, but it can easily be extended to support other traffic generators. The orchestration platform that sets up multiple iperf3 sessions is written in Python and relies on XML-RPC for fast provisioning of flows. Per-flow statistics are gathered into a PostgreSQL database, and the results visualisation is based on a Python REST API and a web page using JavaScript and the D3.js library for displaying graphs. Statistics are presented at different levels of detail, allowing the human tester to quickly assess the overall state of a test from both per-node and per-pair (source-destination) statistics.

        During its last call for tender for high-end routers, CERN has employed netbench for evaluating the behaviour of network devices when exposed to full and partial mesh TCP traffic. We will present sample results from the evaluation. Furthermore, during the evaluation it became apparent that, due to the temporary congestion caused by competing TCP flows, netbench provides a good estimation of the devices’ buffering capabilities.

        To summarize, we present netbench, a tool that allows provisioning TCP flows with various traffic distributions (pairs, partial and full-mesh). We consider netbench an essential complement to synthetic RFC tests [1,2], as it enables affordable, large-scale testing of network devices with traffic patterns that closely resemble real-life conditions.

        [1] RFC 2544, Bradner, S. and McQuaid J., "Benchmarking Methodology for Network Interconnect Devices"
        [2] RFC 2889, Mandeville, R. and Perser J., "Benchmarking Methodology for LAN Switching Devices"
        [3] RFC 2285, Mandeville, R., "Benchmarking Terminology for LAN Switching Devices"
        [4] iperf3 http://software.es.net/iperf/

        Speaker: Stefan Nicolae Stancu (CERN)
      • 103
        Long-term experiences in keeping balance between safety and usability in research activities in KEK

        We provide the KEK general-purpose network to support various kinds of research activities in the fields of high-energy physics, material physics, and accelerator physics. Since the end of the 20th century, cyber attacks on the network have occurred on an almost daily basis, and attack techniques change rapidly and drastically. In such circumstances, we constantly face a difficult tradeoff and are required to keep a balance between safety and usability in research activities. While restricting network services/applications and monitoring the network may improve safety, too much restriction must be avoided and network monitoring should be done with full consideration.

        In this talk, we present our long-term experiences in keeping a balance between safety and usability in KEK research activities. The key points are ensuring traceability and security management in a reasonable manner. We have been operating security devices such as firewalls, an intrusion detection system, and a vulnerability management system to achieve this balance. In addition, CSIRT (Computer Security Incident Response Team) activities and cooperative activities among research organizations are presented.

        Speaker: Tadashi Murakami (KEK)
    • 4:00 PM
      Break
    • Plenary: S2 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Marco Cattaneo (CERN)
    • Social events: Welcome drink Central Military Club

      Central Military Club

      Tsar Osvoboditel Blvd 7, 1000 Sofia Center
    • Plenary: S3 Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Maria Girone (CERN)
      • 108
        Next generation generative neural networks for HEP

        Initial studies have suggested generative adversarial networks (GANs) have promise as fast simulations within HEP. These studies, while promising, have been insufficiently precise and also, like GANs in general, suffer from stability issues. We apply GANs to generate full particle physics events (not individual physics objects), and to large weak lensing cosmology convergence maps. We explore recent developments in convergence, such as ‘optimal transport’ GANs; explore representations that match the structure of the detector better than images; and evaluate the precision and generalisability of the generated datasets.
        In addition, we demonstrate a framework making use of distributed computing on the Cori supercomputer at NERSC, launched via interactive Jupyter notebook sessions. This allows for tackling high-resolution detector data, model selection and hyper-parameter tuning in a productive yet scalable deep learning environment.

        Speaker: Steven Andrew Farrell (Lawrence Berkeley National Lab. (US))
      • 109
        Synthesizing Machine Learning algorithms on FPGA

        Machine learning methods are becoming ubiquitous across particle physics. However, the exploration of such techniques in low-latency environments like L1 trigger systems has only just begun. We present here a new software package, based on High Level Synthesis (HLS), to generically port several kinds of machine learning models (BDTs, DNNs, CNNs) into FPGA firmware. As a benchmark physics use case, we consider the task of tagging high-pT jets as H->bb candidates using jet substructure. We map out resource usage and latency versus types of machine learning algorithms and their hyper-parameters. We present a set of general practices to efficiently design low-latency machine-learning algorithms on FPGAs.

        Speaker: Jennifer Ngadiuba (INFN, Milano)
      • 110
        The obsolescence of Information and Information Systems

        The CERN Digital Memory project was started in 2016 with the main goal of preventing loss of historical content produced by the organisation. The first step of the project addressed the risk of deterioration of the most vulnerable materials, mostly the multimedia assets created in analogue formats from 1954 to the late 1990s, such as still and moving images on film or magnetic tape. In parallel, today's best practices for guaranteeing a long life to digital content, either born digital or resulting from a digitization process, were studied. While traditional archives and libraries have developed over centuries, establishing recognized standards for the preservation of printed content, the field of digital archiving is in its infancy.

        This talk will briefly present the most surprising challenges encountered when migrating hundreds of thousands of audio recordings, slides, negatives, videotapes or films from the analogue to the digital era. It will then describe how a Digital Memory platform is being built, conforming to the principles of the ISO-16363 digital object management norm that defines trustworthy digital repositories. Finally, as all information repository managers are faced with the necessary migration of underlying systems and the obsolescence of the information itself, the talk will explain how a digital archiving platform focusing only on content preservation could be of direct interest for most live systems.

        Speaker: Jean-Yves Le Meur (CERN)
    • 10:30 AM
      Break
    • T1 - Online computing: S3 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Ryosuke Itoh (KEK)
      • 111
        Simultaneous usage of the LHCb HLT farm for Online and Offline processing workflows

        LHCb is one of the four LHC experiments and continues to revolutionise data acquisition and analysis techniques. Two years ago the concepts of “online” and “offline” analysis were already unified: the calibration and alignment processes take place automatically in real time and are used in the triggering process, such that Online data are immediately available offline for physics analysis (Turbo analysis). The computing capacity of the HLT farm has been used simultaneously for different workflows: synchronous first level trigger, asynchronous second level trigger, and Monte-Carlo simulation. Thanks to the same software environment, LHCb can switch seamlessly and quickly from Online to Offline workflows, run them simultaneously and thus maximize the usage of the HLT Farm computing resources.

        Speaker: Joel Closier (CERN)
      • 112
        Alignment and Calibration of the Belle II Detector

        In spring 2018 the SuperKEKB electron-positron collider at the High Energy Accelerator Research Organization (KEK, Tsukuba, Japan) will deliver its first collisions to the Belle II experiment. The aim of Belle II is to collect a data sample 50 times larger than the previous generation of B-Factories, taking advantage of the unprecedented SuperKEKB design luminosity of 8x10^35 cm^-2 s^-1. The Belle II detector will allow precise measurements to be conducted in the harsh collider environment, probing for signs of physics beyond the standard model at the precision frontier.

        In order to deliver data suitable for physics analysis, the detector has to be properly calibrated on a regular basis. Among other calibrations, the detector alignment plays a key role. For example, precise measurements of time dependent CP-violation rely on the accurate alignment of the new vertex detector, as well as on the determination of the beamspot position and size. To automate the calibration procedures and manage the large amount of data and processing power needed for detector calibration, a software framework has been developed which allows the complete workflow to be defined and executed on a computing cluster. The framework integrates the Millepede II algorithm to solve the large minimization problem emerging in the track-based alignment and calibration of the pixel and strip detector, the central drift chamber, and the muon system.

        The first collision data will make it possible to test, further improve and tune the alignment and calibration procedures. Although the vertexing capabilities will be limited due to the installation of only a small slice of the full vertex detector, the commissioning phase will allow most of the alignment procedure features to be tested and will help prepare for full operation.
        We will present the results achieved during the first data taking, the experience gained and the plans for the first physics run with the full detector.

        Speaker: Tadeas Bilka (Charles University, Prague)
      • 113
        Spy-Agency service for monitoring of automated real-time calibration workflows and steering the injection of prompt reconstruction workflows at the CMS experiment

        The calibration of the detector in almost real time is a key to the exploitation of the large data volumes at the LHC experiments. For this purpose the CMS collaboration deployed a complex machinery involving several components of the processing infrastructure and of the condition DB system. Accurate reconstruction of data starts only once all the calibrations become available for consumption, and relies on continuous and detailed monitoring of the calibration machinery and the physics performance of its products. This monitoring task requires aggregating, digesting, and reacting upon information from all the different components, which are based on very heterogeneous technologies: the Tier0 processing farm, the Oracle based condition DB, the data quality monitoring framework and various other logging and bookkeeping services. A dedicated application has been designed and deployed that spies data from these various sources, presents them on a web-based interface, and communicates to the computing infrastructure the readiness for reconstruction of any chunk of data. The presentation reports on the design choices and operational experience of this new tool.

        Speaker: Arun Kumar (National Taiwan University (TW))
      • 114
        LHCb full-detector real-time alignment and calibration: latest developments and perspectives

        A key ingredient of the data taking strategy used by the LHCb experiment in Run-II is the novel real-time detector alignment and calibration. Data collected at the start of the fill are processed within minutes and used to update the alignment, while the calibration constants are evaluated hourly. This is one of the key elements which allow the reconstruction quality of the software trigger in Run-II to be as good as the offline quality of Run-I.
        The most recent developments of the real-time alignment and calibration paradigm enable the fully automated updates of the RICH detectors' mirror alignment and a novel calibration of the calorimeter systems. Both evolutions improve the particle identification performance stability resulting in higher purity selections. The latter leads also to an improvement in the energy measurement of neutral particles, resulting in a 15% better mass resolution of radiative b-hadron decays.
        A large variety of improvements has been explored for the last year of Run-II data taking and is under development for the LHCb detector upgrade foreseen in 2021. These range from the optimization of the data samples selection and strategy to the study of a more accurate magnetic field description. Technical and operational aspects as well as performance achievements are presented, focusing on the new developments for both the current and upgraded detector.

        Speaker: Chris Burr (University of Manchester (GB))
      • 115
        Data handling in the ALICE O2 event processing

        The ALICE experiment at the Large Hadron Collider (LHC) at CERN is planned to be operated in a continuous data-taking mode in Run 3. This will allow data from all collisions to be inspected at a rate of 50 kHz for Pb-Pb, giving access to rare physics signals embedded in a large background.

        Based on experience with real-time reconstruction of particle trajectories and event properties in the ALICE High Level Trigger, the ALICE O2 facility is currently being designed and developed to support processing of a continuous, trigger-less stream of data segmented into so-called time frames.

        ALICE O2 uses the ALFA software framework and message queue communication to distribute workload among many processes running on multiple compute nodes. After aggregation of all data forming a time frame on an Event Processing Node (EPN), algorithms can efficiently navigate through large data sets and add new data to the data stream. Both data format and software framework have been designed with emphasis on lightweight data organization and optimized in-memory data format, supporting an efficient pipelined data processing.

        In this contribution, the organization of data reconstruction in the O2 event processing together with prototype studies for detector reconstruction and performance measurements will be presented.

        Speaker: Matthias Richter (University of Oslo (NO))
      • 116
        Improving the use of data quality metadata via a partnership of technologies and resources between the CMS experiment at CERN and industry

        The CMS experiment dedicates a significant effort to supervising the quality of its data, online and offline. Real-time data quality (DQ) monitoring is in place to spot and diagnose problems as promptly as possible to avoid data loss. The a posteriori evaluation of processed data is designed to categorize the data in terms of their usability for physics analysis. These activities produce DQ metadata.
        The DQ evaluation relies on a visual inspection of monitoring features. This practice has a high cost in terms of human resources and is naturally subject to human arbitration. Potential limitations are linked to the ability to spot a problem within the overwhelming number of quantities to monitor, or to the understanding of evolving detector conditions.
        In view of Run III, CMS aims at integrating deep learning techniques in the online workflow to promptly recognize and identify anomalies and improve DQ metadata precision.
        The CMS experiment engaged in a partnership with IBM with the objective of supporting the online operations through automation and generating benchmarking technological results. The research goals, agreed within the CERN Openlab framework, how they matured into a demonstration application, and how they are achieved through a collaborative contribution of technologies and resources will be presented.

        Speaker: Virginia Azzolini (Massachusetts Inst. of Technology (US))
    • T2 - Offline computing: S3 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Victor Daniel Elvira (Fermi National Accelerator Lab. (US))
      • 117
        AREUS - a software framework for ATLAS Readout Electronics Upgrade Simulation

        The design of readout electronics for the LAr calorimeters of the ATLAS detector to be operated at the future High-Luminosity LHC (HL-LHC) requires a detailed simulation of the full readout chain in order to find optimal solutions for the analog and digital processing of the detector signals. Due to the long duration of the LAr calorimeter pulses relative to the LHC bunch crossing time, out-of-time signal pile-up needs to be taken into account and realistic pulse sequences must be simulated together with the response of the electronics. For this purpose, the ATLAS Readout Electronics Upgrade Simulation framework (AREUS) has been developed based on the Observer design pattern to provide a fast and flexible simulation tool. Energy deposits in the LAr calorimeters from fully simulated HL-LHC collision events are taken as input. Simulated and measured analog pulse shapes proportional to these energies are then combined in discrete time series with proper representation of electronics noise. Analog-to-digital conversion, gain selection and digital signal processing are modeled at bit precision, including digitization noise and detailed electronics effects. In this way signal processing techniques can be optimized with respect to physics parameters like reconstructed energy and signal time in each channel. Finally, trigger and object reconstruction algorithms are taken into account in the optimization process. The software implementation of AREUS, the concepts of its main functional blocks and examples of obtained simulation results will be presented.
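
        The out-of-time pile-up and digitization steps described above can be illustrated with a small sketch. It is not AREUS code: the function names, the three-sample bipolar pulse shape, the pedestal and the LSB value are all hypothetical, and electronics noise is left out for brevity. Each energy deposit launches a scaled copy of the pulse shape, overlapping pulses add up in a discrete time series, and the result is quantized to integer ADC codes.

```python
def pulse_train(deposits, pulse_shape):
    """Superimpose one scaled pulse per bunch crossing (discrete convolution).

    deposits[i] is the energy deposited in bunch crossing i; pulse_shape
    holds the unit-energy pulse sampled once per bunch crossing.
    """
    samples = [0.0] * (len(deposits) + len(pulse_shape) - 1)
    for i, energy in enumerate(deposits):
        for j, amplitude in enumerate(pulse_shape):
            samples[i + j] += energy * amplitude
    return samples


def digitize(samples, lsb, n_bits, pedestal=0):
    """Quantize analog samples to ADC codes, clipping at the ADC range."""
    max_code = (1 << n_bits) - 1
    codes = []
    for value in samples:
        code = int(round(value / lsb)) + pedestal
        codes.append(min(max(code, 0), max_code))
    return codes


if __name__ == "__main__":
    shape = [1.0, 0.5, -0.25]        # hypothetical bipolar pulse shape
    deposits = [1.0, 0.0, 2.0]       # energy per bunch crossing, arbitrary units
    analog = pulse_train(deposits, shape)
    print(analog)
    print(digitize(analog, lsb=0.25, n_bits=12, pedestal=10))
```

        In this toy input the third sample mixes the negative tail of the first pulse with the peak of the later one, which is the out-of-time pile-up effect that makes simulating realistic pulse sequences necessary.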

        Speaker: Nico Madysa (Technische Universitaet Dresden (DE))
      • 118
        Fitting fixed target and ion collisions in the LHCb Gauss simulation framework

        The LHCb experiment is a fully instrumented forward spectrometer designed for
        precision studies in the flavour sector of the standard model with proton-proton
        collisions at the LHC. As part of its expanding physics programme, LHCb collected data also during the LHC proton-nucleus collisions in 2013 and 2016 and
        during nucleus-nucleus collisions in 2015. All the collected datasets are unique,
        due to the peculiar pseudo-rapidity range not covered by any other LHC experiment. Furthermore, in 2015 LHCb commissioned the internal gas target SMOG,
        becoming the only LHC experiment with a programme of fixed target physics.
        Each of these particular collision conditions required a different operational setup,
        as well as dedicated simulation production based on heavy-ion Monte-Carlo event
        generators and interface extensions of the standard LHCb simulation framework.
        In this talk, we present the work done to implement such a variety of simulation
        productions for heavy-ion collisions, and to validate the produced samples. The
        future perspectives of the heavy-ion collision simulations at LHCb will also be
        discussed.

        Speaker: Patrick Robbe (Université Paris-Saclay (FR))
      • 119
        Opticks : GPU Optical Photon Simulation for Particle Physics with NVIDIA OptiX

        Opticks is an open source project that integrates the NVIDIA OptiX
        GPU ray tracing engine with Geant4 toolkit based simulations.
        Massive parallelism brings drastic performance improvements with optical photon simulation speedup expected to exceed 1000 times Geant4 with workstation GPUs.

        Optical physics processes of scattering, absorption, reemission and
        boundary processes are implemented as CUDA OptiX programs based on the Geant4
        implementations. Wavelength dependent material and surface properties as well as inverse cumulative distribution functions for reemission are interleaved into
        GPU textures providing fast interpolated property lookup or wavelength generation.
        OptiX handles the creation and application of a choice of acceleration structures
        such as bounding volume hierarchies and the transparent use of multiple GPUs.
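
        The idea of generating wavelengths from an inverse cumulative distribution function with interpolated lookup can be sketched on the CPU as follows. This is not Opticks code: the function names and the flat test spectrum are hypothetical, and a plain Python list stands in for the GPU texture. The table is built once from a tabulated spectrum; a uniform random number then selects a wavelength through a linearly interpolated lookup, similar in spirit to a filtered texture fetch.

```python
import random
from bisect import bisect_right


def build_inverse_cdf(wavelengths, intensities, n_bins=1024):
    """Tabulate wavelength as a function of cumulative probability u in [0, 1]."""
    cdf = [0.0]
    for k in range(1, len(wavelengths)):
        # Trapezoidal area of the spectrum between neighbouring points.
        width = wavelengths[k] - wavelengths[k - 1]
        cdf.append(cdf[-1] + 0.5 * (intensities[k] + intensities[k - 1]) * width)
    total = cdf[-1]
    cdf = [c / total for c in cdf]
    table = []
    for b in range(n_bins):
        u = b / (n_bins - 1)
        k = min(bisect_right(cdf, u), len(cdf) - 1)
        lo, hi = cdf[k - 1], cdf[k]
        t = 0.0 if hi == lo else (u - lo) / (hi - lo)
        table.append(wavelengths[k - 1] + t * (wavelengths[k] - wavelengths[k - 1]))
    return table


def lookup(table, u):
    """Linearly interpolated table lookup, standing in for a texture fetch."""
    x = u * (len(table) - 1)
    i = min(int(x), len(table) - 2)
    frac = x - i
    return table[i] * (1.0 - frac) + table[i + 1] * frac


if __name__ == "__main__":
    # Hypothetical flat reemission spectrum between 400 nm and 600 nm.
    table = build_inverse_cdf([400.0, 500.0, 600.0], [1.0, 1.0, 1.0])
    rng = random.Random(42)
    print([round(lookup(table, rng.random()), 1) for _ in range(5)])
```

        Precomputing the inverse moves the search over the spectrum out of the sampling loop, so each generated photon costs a single interpolated read.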

        A major recent advance is the implementation of GPU ray tracing of complex constructive solid geometry shapes, enabling automated translation of Geant4 geometries to the GPU without approximation. Using common initial photons and random number sequences allows the Opticks and Geant4 simulations to be run point-by-point aligned. Aligned running has reached near perfect equivalence with test geometries.

        Speaker: Dr Simon Blyth (National Taiwan University)
      • 120
        Progress of Geant4 electromagnetic physics developments and applications

        We report developments for the Geant4 electromagnetic (EM) physics sub-packages for Geant4 release 10.4 and beyond. Modifications are introduced to the models of photo-electric effect, bremsstrahlung, gamma conversion, and multiple scattering. Important developments for calorimetry applications were carried out for the modeling of single and multiple scattering of charged particles. Corrections to scattering of positrons and to sampling of displacement have been recently added to the Geant4 default Urban model. The fully theory-based Goudsmit-Saunderson (GS) model for electron/positron multiple scattering has been recently reviewed and a new improved version is available. For testing purposes for novel calorimeters we provide a configuration of electron scattering based on the GS model or on the single scattering model (SS) instead of the Urban model. In addition, the GS model with enabled Mott corrections is included in the EM physics constructor, which is recommended for accurate simulations. This EM configuration provides the most accurate results for scattering of electrons and positrons. To improve usability, a new set of User Interface (UI) commands and corresponding C++ interfaces have been added for easier configuration of EM physics. In particular, the photo-absorption ionisation model may be enabled per detector region using the corresponding UI command. Interfaces to material properties were improved, including density effect parameterization for compounds and optical parameters of materials. The last modification allows a significant speed-up of the simulation of optical photon transport. These new interfaces and models are already used in LHC applications and may be useful for any type of simulation. We discuss their effective usage for various applications including possible improvements for simulation of LHC and dark matter experiments.

        Speaker: Vladimir Ivantchenko (CERN)
      • 121
        Towards full electromagnetic physics vectorization in the GeantV transport framework

        The development of the GeantV Electromagnetic (EM) physics package has evolved following two necessary paths towards code modernization. A first phase required the revision of the main electromagnetic physics models and their implementation. The main objectives were to improve their accuracy, extend them to the new high-energy frontiers posed by the Future Circular Collider (FCC) programme and allow a better adaptation to a multi-particle flow. Most of the EM physics models in GeantV have been reviewed from a theoretical perspective and rewritten with vector-friendly implementations, and are now available in scalar mode in the alpha release. The second phase consists of a thorough investigation of the possibility of vectorising the most CPU-intensive physics code parts, such as final state sampling. We have shown the feasibility of implementing electromagnetic physics models that take advantage of SIMD/SIMT architectures, thus obtaining gains in performance. After this phase, the time has come for the GeantV project to take a step forward towards the final proof of concept. This takes shape through the testing of the full simulation chain (transport + physics + geometry) running in vectorised mode. In this paper we will present the first benchmark results obtained after vectorizing a set of electromagnetic physics models, starting from the photoelectric effect.

        Speaker: Dr Marilena Bandieramonte (CERN)
      • 122
        A top to bottom framework approach to vectorization

        SIMD acceleration can potentially boost application throughput by significant factors. However, achieving efficient SIMD vectorization for scalar code with complex data flow and branching logic goes way beyond breaking loop dependencies and relying on the compiler. Since the re-factoring effort scales with the number of lines of code, it is important to understand what kind of performance gains can be expected in such complex cases. A couple of years ago, the GeantV R&D started a top to bottom vectorization approach to particle transport simulation. Percolating multiple data to algorithms was mandatory since not all the components offer natural internal vectorization capability. Vectorizing low-level algorithms such as position/direction geometry classifiers or field propagators was certainly necessary, but not sufficient to achieve relevant SIMD gains. Overheads for maintaining the concurrent vector data flow and data copying had to be minimized. GeantV developed a framework to allow different categories of scalar and vectorized components to co-exist, dealing with data flow management and real-time heuristic optimizations. The paper will describe our approach to co-ordinating SIMD vectorization at framework level, making a detailed quantitative analysis of the SIMD gain versus overheads with a break-down by component in terms of geometry, physics and magnetic field propagation. The more general context of the GeantV work and goals for 2018 will also be presented.

        Speaker: Andrei Gheata (CERN)
    • T3 - Distributed computing: Testing, Monitoring and Accounting Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: Julia Andreeva (CERN)
      • 123
        EGI Dataset Accounting and the WLCG

        While the WLCG and EGI have both made significant progress towards solutions for storage space accounting, one area that is still quite exploratory is that of dataset accounting. This type of accounting would enable resource centre and research community administrators to report on dataset usage to the data owners, data providers, and funding agencies. Eventually decisions could be made about the location and storage of data sets to make more efficient use of the infrastructure. By giving insight to data usage, dataset accounting also assists scientists in assessing the impact of their work.

        This paper reviews the status of the prototype dataset accounting developed during EGI-Engage and how it could be used to complement the view that the WLCG has of its datasets. This is a new feature of the EGI resource accounting system that will enable storing information on dataset usage such as who has accessed a dataset and how often, the transfer volumes and end points etc. The design of this new feature has been led by the users' requirements collected in the first part of the project, from which a set of dataset accounting metrics were derived. In these trials, the EGI Accounting Repository was integrated with the data provider Onedata (the underlying technology powering the EGI Open Data Platform and EGI DataHub) as an example of a generic data provider.

        Speaker: Mr Adrian Coveney (STFC)
      • 124
        GRACC: GRid ACcounting Collector

        The OSG has long maintained a central accounting system called Gratia. It uses small probes on each computing and storage resource in order to measure usage. The probes report to a central collector which stores the usage in a database. The database is then queried to generate reports. As the OSG aged, the size of the database grew very large. It became too large for the database technology to query efficiently when generating detailed reports.

        The design of a replacement required data storage that could be queried efficiently to generate multi-year reports. Additionally, it required the flexibility to add new attributes to the collected data.

        In this paper we will describe the GRACC architecture. GRACC uses modern web technologies that were designed for large data storage, query, and visualization. That includes the open source database Elasticsearch, message broker software RabbitMQ, and Grafana and Kibana as data visualization platforms. It uses multiple agents that perform operations on the data to transform it for easier querying and summarization.
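
        The idea of agents that transform raw records for easier summarization can be illustrated with a minimal sketch. The record fields, values and the function name below are hypothetical and are not taken from the GRACC schema or code; the sketch only shows raw per-job usage records being rolled up into daily summaries per VO and site.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw usage records, as a probe might report them.
# Field names and values are illustrative assumptions, not the GRACC schema.
raw_records = [
    {"end_time": "2018-07-09T10:15:00", "vo": "cms", "site": "SiteA", "core_hours": 4.0},
    {"end_time": "2018-07-09T18:40:00", "vo": "cms", "site": "SiteA", "core_hours": 2.5},
    {"end_time": "2018-07-09T11:00:00", "vo": "atlas", "site": "SiteB", "core_hours": 8.0},
    {"end_time": "2018-07-10T01:05:00", "vo": "cms", "site": "SiteA", "core_hours": 1.0},
]


def summarize_daily(records):
    """Aggregate raw records into one summary per (day, VO, site)."""
    totals = defaultdict(lambda: {"core_hours": 0.0, "njobs": 0})
    for rec in records:
        day = datetime.fromisoformat(rec["end_time"]).date().isoformat()
        key = (day, rec["vo"], rec["site"])
        totals[key]["core_hours"] += rec["core_hours"]
        totals[key]["njobs"] += 1
    # Emit summaries in a stable order so that reports are reproducible.
    return [
        {"date": day, "vo": vo, "site": site, **values}
        for (day, vo, site), values in sorted(totals.items())
    ]


daily_summaries = summarize_daily(raw_records)
for row in daily_summaries:
    print(row)
```

        Multi-year reports can then be built from the much smaller summary records instead of scanning every raw record.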

        Speaker: Brian Paul Bockelman (University of Nebraska Lincoln (US))
      • 125
        Evolution of HammerCloud to commission CERN Compute resources

        HammerCloud is a testing service and framework to commission, run continuous tests or on-demand large-scale stress tests, and benchmark computing resources and components of various distributed systems with realistic full-chain experiment workflows.

        HammerCloud, used by the ATLAS and CMS experiments in production, has been a useful service to commission both compute resources and various components of the complex distributed systems of the LHC experiments, as well as an integral part of the monitoring suite essential for the computing operations of the experiments and their automation.

        In this contribution we review recent developments of the HammerCloud service that allow the HammerCloud infrastructure to be used to test Data Centre resources in the early phases of the infrastructure and services commissioning process. One of the benefits we believe HammerCloud can provide is the ability to tune the commissioning of the new infrastructure, through functional and stress testing as well as benchmarking with "standard candle" workflows, using realistic experiment workloads that can be heavy on CPU, I/O, IOPS, or everything together. This extension of HammerCloud has been successfully used in CERN IT during the prototype phase of the "BEER" Batch on EOS (Evaluation of Resources) project, and is being integrated with the continuous integration/continuous deployment suite for Batch service VMs.

        Speaker: Jaroslava Schovancova (CERN)
      • 126
        Monitoring system for the Belle II distributed computing

        Belle II is an asymmetric-energy e+e- collider experiment at KEK, Japan. Belle II aims to reveal physics beyond the Standard Model with a data set of about 5×10^10 BB^bar pairs and starts its physics run in 2018. In order to store such a huge amount of data, including simulation events, and analyze it in a timely manner, Belle II adopts a distributed computing model with DIRAC (Distributed Infrastructure with Remote Agent Control).

        The monitoring system for Belle II computing is developed as an extension of DIRAC. It collects and investigates various information related to job processing and data transfer stored in the DIRAC database, and then visualizes it. We have also developed a system which regularly performs several accessibility tests on various components such as computing/storage elements, the database storing calibration information, and DIRAC servers. Finally, the system makes an overall health check of the Belle II computing using all of the investigation and test results. Detected issues are displayed in a single place so that even a non-expert shifter can easily find the problem.
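
        The combination of individual test results into a single health status can be sketched as follows. This is only an illustration of the idea: the component names, status levels and function name are assumptions and do not come from the Belle II or DIRAC code.

```python
# Hypothetical severity ordering; the real system may use different levels.
SEVERITY = {"ok": 0, "warning": 1, "error": 2}

# Illustrative results of periodic accessibility tests (made-up components).
test_results = [
    {"component": "storage-element-A", "test": "upload", "status": "ok"},
    {"component": "conditions-db", "test": "query", "status": "warning"},
    {"component": "dirac-server-1", "test": "ping", "status": "error"},
    {"component": "computing-element-B", "test": "job-submit", "status": "ok"},
]


def overall_health(results):
    """Return the worst status seen and the non-ok results, most severe first."""
    worst = max((r["status"] for r in results), key=SEVERITY.get, default="ok")
    issues = sorted(
        (r for r in results if r["status"] != "ok"),
        key=lambda r: SEVERITY[r["status"]],
        reverse=True,
    )
    return worst, issues


status, issues = overall_health(test_results)
print("overall:", status)
for issue in issues:
    print(issue["component"], issue["test"], issue["status"])
```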

        In this contribution, we will present the details of the monitoring system as well as experience gained during the simulation data production campaign and the first few months of data taking.

        Speaker: Yuji Kato
      • 127
        Towards the integrated ALICE Online-Offline (O2) monitoring subsystem

        ALICE (A Large Ion Collider Experiment) is preparing for a major upgrade of the detector, readout system and computing for LHC Run 3. A new facility called O2 (Online-Offline) will play a major role in data compression and event processing. To efficiently operate the experiment, we are designing a monitoring subsystem, which will provide a complete overview of the O2 overall health, detect performance degradation and component failures. The monitoring subsystem will receive and collect up to 600 kHz of performance parameters. It consists of a custom monitoring library and a server-side, distributed software covering five main functional tasks: parameter collection and processing, storage, visualization and alarms.
        To select the most appropriate tools for these tasks, we evaluated three server-side systems: MonALISA, Zabbix and “Modular stack”. The latter one consists of a toolkit including collectd, Apache Flume, Apache Spark, InfluxDB, Grafana and Riemann.
        This paper describes the monitoring subsystem functional architecture. It goes through a complete evaluation of the three considered solutions, the selection processes, risk assessment and justification for the final decision. The in-depth comparison includes functional features, latency and throughput measurement to ensure the required processing and storage performance.

        Speaker: Adam Wegrzynek (CERN)
      • 128
        CRIC: a unified information system for WLCG and beyond

        The WLCG Information System (IS) is an important component of the huge heterogeneous distributed infrastructure. Considering the evolution of LHC computing towards the high-luminosity era, and analyzing the experience accumulated by the computing operations teams and the limitations of the current information system, the WLCG IS evolution task force proposed developing the Computing Resource Information Catalogue (CRIC), which aims to consolidate WLCG topology information and to complement it with the experiment-specific configuration required for the computing operations, workload management and data management systems of the LHC experiments. CRIC represents an evolution of the ATLAS Grid Information System (AGIS), which has been evolved in order to serve not just ATLAS but any other virtual organization relying on a large-scale distributed infrastructure, as well as the WLCG on the global scope.

        The contribution describes CRIC architecture, implementation of data models, collectors, UIs/APIs and advanced fine-granular authentication/authorization.

        Speaker: Alexey Anisenkov (Budker Institute of Nuclear Physics (RU))
    • T4 - Data handling: S3 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Convener: Costin Grigoras (CERN)
      • 129
        Conditions evolution of an experiment in mid-life, without the crisis (in ATLAS)

        The ATLAS experiment is approaching mid-life: the long shutdown period (LS2) between LHC Runs 1 and 2 (ending in 2018) and the future collision data-taking of Runs 3 and 4 (starting in 2021). In advance of LS2, we have been assessing the future viability of existing computing infrastructure systems. This will permit changes to be implemented in time for Run 3. In systems with broad impact such as the conditions database, making assessments now is critical as the full chain of operations from online data-taking to offline processing can be considered: evaluating capacity at peak times, looking for bottlenecks, identifying areas of high maintenance, and considering where new technology may serve to do more with less.

        We have been considering changes to the ATLAS conditions database related storage and distribution infrastructure based on similar systems of other experiments. We have also examined how new technologies may help and how we might provide more RESTful services to clients. In this presentation, we give an overview of the identified constraints and considerations, and our conclusions for the best way forward: balancing preservation of critical elements of the existing system with the deployment of the new technology in areas where the existing system falls short.

        Speaker: Lorenzo Rinaldi (Universita e INFN, Bologna (IT))
      • 130
        Performance of the Belle II Conditions Database

        The Belle II experiment at KEK is preparing for first collisions in early 2018. Processing the large amounts of data that will be produced requires conditions data to be readily available to systems worldwide in a fast and efficient manner that is straightforward for both the user and maintainer. This was accomplished by relying on industry-standard tools and methods: the conditions database is built as an HTTP REST service using tools such as Swagger for the API interface development, Payara for the Java EE application server, and Squid for the caching proxy. This talk will present the design of the Belle II conditions database production environment as well as go into detail about the capabilities and performance during both Monte Carlo campaigns and data reprocessing.

        Speaker: Lynn Wood (Pacific Northwest National Laboratory, USA)
      • 131
        The Open High Throughput Computing Content Delivery Network

        LHC experiments make extensive use of Web proxy caches, especially for software distribution via the CernVM File System and for conditions data via the Frontier Distributed Database Caching system. Since many jobs read the same data, cache hit rates are high and hence most of the traffic flows efficiently over Local Area Networks. However, it is not always possible to have local Web caches, particularly for opportunistic cases where experiments have little control over site services. The Open High Throughput Computing (HTC) Content Delivery Network (CDN), openhtc.io, aims to address this by using Web proxy caches from a commercial CDN provider. Cloudflare provides a simple interface for registering DNS aliases of any Web server and does reverse proxy Web caching on those aliases. The openhtc.io domain is hosted on Cloudflare's free tier CDN which has no bandwidth limit and makes use of data centers throughout the world, so the average performance for clients is much improved compared to reading from CERN or a Tier 1. The load on WLCG servers is also greatly reduced. WLCG Web Proxy Auto Discovery is used to select local Web caches when they are available and otherwise use openhtc.io caching. This paper describes the Open HTC CDN in detail and provides initial results from its use for USCMS opportunistic computing and LHC@Home.
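
        The selection rule described above (local Web caches when available, CDN caching otherwise) can be sketched in a few lines. The function name and all host names are placeholders invented for this example; this is not the WLCG Web Proxy Auto Discovery implementation.

```python
def choose_proxies(local_proxies, cdn_fallback):
    """Prefer site-local web caches; fall back to the CDN alias when none exist."""
    usable = [p["url"] for p in local_proxies if p.get("available")]
    if usable:
        return usable
    return [cdn_fallback]


# Placeholder host names, invented for illustration only.
CDN_ALIAS = "http://cdn-alias.example.org"

site_with_cache = [
    {"url": "http://squid1.example.org:3128", "available": True},
    {"url": "http://squid2.example.org:3128", "available": False},
]
opportunistic_site = []  # no local web cache is known for this site

print(choose_proxies(site_with_cache, CDN_ALIAS))
print(choose_proxies(opportunistic_site, CDN_ALIAS))
```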

        Speaker: Dave Dykstra (Fermi National Accelerator Lab. (US))
      • 132
        A new mechanism to use the Conditions Database REST API to serve the ATLAS detector description

        Efficient and fast access to the detector description of the ATLAS experiment is needed for many tasks, at different steps of the data chain: from detector development to reconstruction, from simulation to data visualization. Until now, the detector description was only accessible through dedicated services integrated into the experiment's software framework, or through external applications. In this work we explore the possibility of using a web access-based conditions database to store and serve the detector description, aiming at a simplification of the software architecture of the experiment and a reduction in the number of software packages to be maintained. We will also share the lessons learned while developing the new system to efficiently serve an experiment's geometry data to clients through a REST API.

        Speaker: Alessandro De Salvo (Sapienza Universita e INFN, Roma I (IT))
      • 133
        A Git-based Conditions Database backend for LHCb

        LHCb has been using the CERN/IT-developed Conditions Database library COOL for several years, during LHC Run 1 and Run 2. With the window of opportunity offered by the second long shutdown of the LHC, in preparation for Run 3 and the upgraded LHCb detector, we decided to investigate alternatives to COOL as Conditions Database backend. In particular, given our conditions and detector description data model, we investigated the possibility of reusing the internal Git repository database as conditions storage, and we have used it since the 2017 data taking. The adoption of Git gave us improved performance, smaller storage size and simplified maintenance and deployment.
        In this paper we describe the implementation of our Git Conditions Database and the way it simplified our detector description and conditions development workflow.

        Speaker: Marco Clemencic (CERN)
      • 134
        EventDB: an event indexer and caching system for BESIII experiment

        The Beijing Spectrometer (BESIII) experiment has produced hundreds of billions of events. It has collected the world's largest data samples of J/ψ, ψ(3686), ψ(3770) and ψ(4040) decays. The typical branching fractions for interesting physics channels are of the order of O(10^-3). The traditional event-wise access of BOSS (BES Offline Software System) is not effective for selective access at a rate of O(10^-3). We proposed and implemented an event indexer system, EventDB, which extracts the characteristics of all events and stores them in HBase. Currently, the characteristics contain 22 variables including BeamEnergy, NTracks, NShowers, BeamVx, BeamVy, BeamVz and so on. We adopted inverted index technologies to store the event index in HBase, and the index count is less than 1/1000 of the event count. We also put real event data into HBase as a cache for hot events to accelerate event access. With EventDB, the data analysis workflow of the BESIII experiment changes as follows: the analysis program first queries the corresponding event index from the database, then gets the event data from the database if the event is cached, or from the DST file using the ROOT framework if it is not cached. Finally, test results on a dataset of one billion events showed that the query speed was improved more than 10 times.
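
        The workflow above (query an inverted index, then read each event from the cache or fall back to the file) can be sketched in plain Python. The variable names NTracks and NShowers come from the abstract; the data values, function names and in-memory dictionaries standing in for HBase and DST files are assumptions made for illustration.

```python
from collections import defaultdict

# Made-up event characteristics keyed by event ID.
events = {
    1: {"NTracks": 2, "NShowers": 3},
    2: {"NTracks": 4, "NShowers": 3},
    3: {"NTracks": 2, "NShowers": 5},
    4: {"NTracks": 2, "NShowers": 3},
}


def build_inverted_index(events):
    """Map each (variable, value) pair to the set of event IDs that have it."""
    index = defaultdict(set)
    for event_id, attrs in events.items():
        for name, value in attrs.items():
            index[(name, value)].add(event_id)
    return index


def query(index, **conditions):
    """Return the sorted IDs of events matching all variable=value conditions."""
    sets = [index.get((name, value), set()) for name, value in conditions.items()]
    return sorted(set.intersection(*sets)) if sets else []


def fetch(event_id, cache, read_from_file):
    """Serve hot events from the cache, otherwise fall back to the file reader."""
    if event_id in cache:
        return cache[event_id], "cache"
    return read_from_file(event_id), "file"


index = build_inverted_index(events)
selected = query(index, NTracks=2, NShowers=3)

hot_cache = {1: "payload-1"}  # stands in for event data cached in the database


def read_dst(event_id):
    # Stands in for reading the event from a DST file.
    return f"payload-{event_id}"


fetched = [fetch(event_id, hot_cache, read_dst) for event_id in selected]
print(selected, fetched)
```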

        Speaker: Yaodong Cheng (Chinese Academy of Sciences (CN))
    • T5 - Software development: S3 - Concurrency Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Barthelemy Von Haller (CERN)
      • 135
        I/O in the ATLAS multithreaded framework

        Scalable multithreading poses challenges to I/O, and the performance of a thread-safe I/O strategy
        may depend upon many factors, including I/O latencies, whether tasks are CPU- or I/O-intensive, and thread count.
        In a multithreaded framework, an I/O infrastructure must efficiently supply event data to and collect it from many threads processing multiple events in flight.
        In particular, on-demand reading from multiple threads may challenge caching strategies that were developed for serial processing and may need to be enhanced.
        This I/O infrastructure must also address how to read, make available, and propagate in-file metadata and other non-event data needed as context for event processing.

        We describe the design and scheduling of I/O components in the ATLAS multithreaded control framework, AthenaMT, for both event and non-event I/O.
        We discuss issues associated with exploiting the multithreading capabilities of our underlying persistence technology, ROOT,
        in a manner harmonious with the ATLAS framework’s own approach to thread management.
        Finally, we discuss opportunities for evolution and simplification of I/O components that have successfully supported ATLAS event processing for many years
        from their serial incarnations to their thread-safe counterparts.

        Speaker: Jack Cranshaw (Argonne National Laboratory (US))
      • 136
        Hydra: A framework for data analysis in massively parallel platforms

        Hydra is a templatized, header-only, C++11-compliant library for data analysis on massively parallel platforms targeting, but not limited to, the field of High Energy Physics research.
        Hydra supports the description of particle decays via the generation of phase-space Monte Carlo, generic function evaluation, data fitting, multidimensional adaptive numerical integration and histogramming.
        Hydra is open source and the code is hosted on GitHub.

        The library deploys a series of techniques in order to achieve optimal
        performance in both computing and management of memory resources.
        The overall design heavily exploits C++ variadic templates in order to implement
        static polymorphism, kernel fusion and coalesced memory access patterns,
        completely avoiding the usage of function pointers,
        virtual method calls and other constructs known to degrade performance.

        Hydra is developed on top of the Thrust library, runs on Linux systems and can
        transparently use NVIDIA CUDA-enabled GPUs as well as multicore CPUs
        and accelerators.

        This contribution summarizes the main features of Hydra. A
        basic description of the user interface and some examples of applications
        are provided, along with measurements of performance in a variety of
        environments.

        Speaker: Antonio Augusto Alves Junior (University of Cincinnati (US))
      • 137
        Supporting Future HEP Data Processing with a Parallelised ROOT

        In the coming years, HEP data processing will need to exploit parallelism on present and future hardware resources to sustain the bandwidth requirements.
        As one of the cornerstones of the HEP software ecosystem, ROOT embraced an ambitious parallelisation plan which delivered compelling results.
        In this contribution the strategy is characterised as well as its evolution in the medium term.
        The units of the ROOT framework are discussed where task and data parallelism have been introduced, with runtime and scaling measurements. We will give an overview of concurrent operations in ROOT, for instance in the areas of I/O (reading and writing of data), fitting / minimization, and data analysis.
        The presentation introduces the programming model and use cases for explicit and implicit parallelism, where the former is explicit in user code and the latter is implicitly managed by ROOT internally.

        Speaker: Danilo Piparo (CERN)
      • 138
        Implementing Concurrent Non-Event Transitions in CMS

        Since the beginning of the LHC Run 2 in 2016 the CMS data processing framework, CMSSW, has been running with multiple threads during production of data and simulation via the use of Intel's Threading Building Blocks (TBB) library. The TBB library utilizes tasks as concurrent units of work. CMS used these tasks to allow both concurrent processing of events and concurrent running of modules processing the same event. This design has served CMS well and has allowed jobs to utilize less memory per core as well as reducing the number of jobs that must be tracked by CMS's workflow management system. As CMS has begun to utilize a greater number of threads in a job, the effect of serialization points in the framework has decreased the jobs' CPU efficiency. One major serialization point occurs when the framework processes a non-Event transition. These transitions occur when a new Run or LuminosityBlock is to be processed. In this talk we will discuss how the different transitions define the processing model for CMSSW and how we were able to successfully allow concurrent processing of those transitions using TBB via task queues. We will also show CPU efficiency comparisons between the same work being done with and without the additional concurrency.

        Speaker: Christopher Jones (Fermi National Accelerator Lab. (US))
      • 139
        Writing ROOT Data in Parallel with TBufferMerger

        The LHC experiments produce tens of petabytes of new data in ROOT format per year that need to be processed and analysed. In the next decade, following the planned upgrades of the LHC and the detectors, this rate is expected to increase at least ten-fold.
        Therefore, optimizing the ROOT I/O subsystem is of critical importance to the success of the LHC physics programme. This contribution presents ROOT's approach of writing data from multiple threads to the same file in a highly efficient way. Technical aspects of its implementation, the TBufferMerger class, as well as examples of its programming model are described. Runtime performance and writing rates measurements and associated improvements with respect to the single writer case are discussed in the context of standalone examples and production parallel data processing workflows such as CMS reconstruction.
        The interplay of parallel dataset writing with other parallelisation mechanisms in ROOT is discussed together with the future developments of the strategy aiming to maximise experiments’ data processing throughput.
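
        The general pattern of several threads filling private buffers that a single merger appends to one output can be sketched with the Python standard library. This is not the TBufferMerger API, which is a C++ class in ROOT; the function names, the flush threshold and the list standing in for the output file are assumptions made for illustration.

```python
import queue
import threading


def worker(worker_id, n_entries, merge_queue, flush_every=3):
    """Fill a private buffer and hand it to the merger when it is full."""
    buffer = []
    for i in range(n_entries):
        buffer.append((worker_id, i))  # stands in for one serialized entry
        if len(buffer) >= flush_every:
            merge_queue.put(buffer)
            buffer = []
    if buffer:
        merge_queue.put(buffer)  # flush the remainder


def merger(merge_queue, output):
    """Single consumer: the only thread that touches the output."""
    while True:
        chunk = merge_queue.get()
        if chunk is None:  # sentinel: all workers are done
            break
        output.extend(chunk)


merge_queue = queue.Queue()
output = []  # stands in for the single output file

merger_thread = threading.Thread(target=merger, args=(merge_queue, output))
merger_thread.start()

workers = [
    threading.Thread(target=worker, args=(w, 10, merge_queue)) for w in range(4)
]
for t in workers:
    t.start()
for t in workers:
    t.join()

merge_queue.put(None)
merger_thread.join()
print(len(output), "entries written")
```

        Because only the merger touches the output, the workers never contend for a lock on it; they synchronize only when handing over a full buffer.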

        Speaker: Guilherme Amadio (CERN)
      • 140
        A modular software framework for test-beam data analysis

        In high energy physics experiments, silicon detectors are often subjected to a harsh radiation environment, especially at hadronic colliders. Understanding the impact of radiation damage on the detector performance is an indispensable prerequisite for successful operation throughout the lifetime of the planned experiment.

        A dedicated irradiation programme followed by detailed studies with particle beams is essential for properly evaluating detector prototypes and predicting their performance after accumulating the design fluence.

        In order to perform precise measurements with the LHCb VELO detector prototypes, a dedicated high-resolution pixel beam telescope was developed based on 8 Timepix3 detector planes. This telescope has been taking data at CERN in the PS and SPS facilities since 2014. The Timepix3 can be read out in data-driven mode with very precise timestamps, which makes triggering unnecessary.
        At the centre of the telescope a Device Under Test (DUT) can be installed on a motion stage allowing angular rotations about the y axis, and x and y translations, where the z coordinate is the beam direction. The telescope provides precise measurements of particle trajectories with a pointing resolution of ∼2 µm and a time resolution of ∼1 ns allowing in-depth analysis of the DUT performance.

        The data produced by the telescope can easily incorporate the signals from the DUT, and in particular for Timepix3 devices the analysis is straightforward. The LHCb software embedded in the so-called Kepler project performs the decoding of raw data and produces track objects inside LHCb's official Gaudi framework. In this way it is simple to produce histograms and ntuples with track and cluster data information.

        Subsequently, the offline analysis compares the performance after irradiation with several fluences of protons or neutrons from 2 to 8 × 10^15 1 MeV n_eq/cm^2, for different silicon substrates (n-on-p or n-on-n), distances from the last pixel to the edge, guard ring designs and vendors. Charge collection efficiency (CCE), track resolution (TR) and Eta correction are of particular interest as a function of fluence.

        For the purposes described above, a modular software framework has been developed. It allows the test beam data for a set of runs, as well as for a set of different DUTs, to be handled in one go, and provides an integrated workflow to present the results.

        All code is written in C++, a general-purpose object-oriented programming language. A class-based design makes it flexible to add any new features of the device under investigation following a plug-in scheme. Currently, the toolkit handles different types of analysis such as CCE, TR and Eta correction, implemented for non-uniformly irradiated sensors.

        The presentation will describe the implemented analysis framework as a proposed prototype of a general framework design for test beam campaigns, which could be followed to quickly obtain a ROOT-based application for complex test beam data analyses performed at different facilities and with different irradiation sources or profiles. The interface is implemented for a series of algorithms in a user-friendly way. We believe it is a valuable complement to be used by different groups performing detector R&D programmes.

        Speaker: Bartlomiej Rachwal (AGH University of Science and Technology (PL))
    • T6 - Machine learning and physics analysis: S3 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Convener: Sofia Vallecorsa (Gangneung-Wonju National University (KR))
      • 141
        Columnar data processing for HEP analysis

        In the last stages of data analysis, only order-of-magnitude computing speedups translate into increased human productivity, and only if they're not difficult to set up. Producing a plot in a second instead of an hour is life-changing, but not if it takes two hours to write the analysis code. Fortunately, HPC-inspired techniques can result in such large speedups, but unfortunately, they can be difficult to use in a HEP setting.

        These techniques generally favor operating on columns (arrays representing a single attribute across events, rather than whole events individually), which allows data to stream predictably from disk media to main memory and finally to CPU/GPU/KNL onboard memory (e.g. L* cache) for prefetching, and sometimes allows for vectorization. However, the need to work with variable-length structures in HEP, such as different numbers of particles per event, makes it difficult to apply this technique to HEP problems.
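
        One common way to hold variable-length structures in columns is a flat content array plus an offsets array marking event boundaries. The sketch below uses plain Python lists rather than NumPy to stay dependency-free; the values and function names are made up, and it does not use the uproot or oamap APIs.

```python
# Three events with 2, 0 and 3 particles, stored as two flat columns.
content = [50.1, 32.4, 12.0, 45.3, 27.8]  # pT of every particle, all events
offsets = [0, 2, 2, 5]  # event i owns content[offsets[i]:offsets[i + 1]]


def counts(offsets):
    """Number of particles in each event."""
    return [stop - start for start, stop in zip(offsets[:-1], offsets[1:])]


def max_per_event(content, offsets, default=None):
    """Largest value in each event; `default` is returned for empty events."""
    return [
        max(content[start:stop], default=default)
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]


def count_passing(mask, offsets):
    """Per-event number of particles whose mask entry is True."""
    return [sum(mask[start:stop]) for start, stop in zip(offsets[:-1], offsets[1:])]


# The cut is applied to the flat column in one pass, with no per-event objects.
mask = [pt > 30.0 for pt in content]

print(counts(offsets))
print(max_per_event(content, offsets))
print(count_passing(mask, offsets))
```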

        We will describe several new software tools to make it easier to compute analysis functions with columnar arrays in HEP: array-at-a-time I/O in ROOT ("BulkIO") and Python/Numpy ("uproot"), compiling object-oriented analysis code into columnar operations ("oamap" for "object-array mapping"), and storage solutions with columnar granularity. We will show performance plots and usage examples.

        Speaker: Jim Pivarski (Princeton University)
      • 142
        Using Big Data Technologies for HEP analysis

        The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at a high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results in a timely and efficient manner. Recently, new technologies and new approaches have been developed in industry to answer the need to retrieve information as quickly as possible by analyzing PB and EB datasets. Providing scientists with more modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother.

        In this talk, we present the latest developments and the most recent results on the usage of Apache Spark for HEP analysis. The study aims at evaluating the efficiency of the application of the new tools both quantitatively, by measuring the performance, and qualitatively, focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data collected by the CMS experiment to 1 TB of data in a format suitable for physics analysis in 5 hours.
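
        The stated ambition implies a sustained input rate that is easy to work out. The short calculation below assumes decimal units (1 PB = 10^15 bytes, 1 TB = 10^12 bytes) and that the full petabyte is read once within the 5 hours; both are assumptions of this sketch, not statements from the abstract.

```python
PB = 10**15  # bytes, decimal convention (assumption)
TB = 10**12  # bytes, decimal convention (assumption)

input_bytes = 1 * PB
output_bytes = 1 * TB
wall_seconds = 5 * 3600

reduction_factor = input_bytes / output_bytes
read_rate_gb_per_s = input_bytes / wall_seconds / 10**9

print(f"reduction factor: {reduction_factor:.0f}")
print(f"sustained read rate: {read_rate_gb_per_s:.1f} GB/s")
```

        Under these assumptions the facility has to sustain an aggregate read rate of roughly 56 GB/s while shrinking the data by a factor of 1000.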

        The second goal is achieved by implementing multiple physics use-cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses up to the publication plots on different hardware, feasibility, usability and portability are compared to those of a traditional ROOT-based workflow.

        Speaker: Matteo Cremonesi (Fermi National Accelerator Lab. (US))
      • 143
        Pandas DataFrames for F.A.S.T. binned analysis at CMS

        Many analyses on CMS are based on the histogram, used throughout the workflow from data validation studies to fits for physics results. Binned data frames are a generalisation of multidimensional histograms, in a tabular representation where histogram bins are denoted by category labels. Pandas is an industry-standard tool, providing a data frame implementation that allows easy access to "big data" scientific libraries, including I/O, visualisation, and machine learning tools.
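
        The notion of a binned data frame, a table with one row per bin and label columns identifying the bin, can be illustrated without any dependency. The sketch below deliberately uses the Python standard library instead of Pandas, and the events, bin edges and column names are invented; it is not the F.A.S.T. toolkit.

```python
from bisect import bisect_right
from collections import Counter

# Made-up events with one categorical and two numerical attributes.
events = [
    {"dataset": "data", "njet": 2, "met": 120.0},
    {"dataset": "data", "njet": 2, "met": 150.0},
    {"dataset": "data", "njet": 3, "met": 40.0},
    {"dataset": "mc", "njet": 2, "met": 250.0},
    {"dataset": "mc", "njet": 2, "met": 199.9},
]

met_edges = [0.0, 100.0, 200.0]  # bins: [0, 100), [100, 200), [200, inf)


def met_bin(value):
    """Label of the bin containing `value` (assumes value >= met_edges[0])."""
    i = bisect_right(met_edges, value) - 1
    low = met_edges[i]
    high = met_edges[i + 1] if i + 1 < len(met_edges) else float("inf")
    return f"[{low:g}, {high:g})"


# One row per occupied bin: the key holds the bin labels, the value the count.
binned = Counter((ev["dataset"], ev["njet"], met_bin(ev["met"])) for ev in events)

print("dataset  njet  met_bin        count")
for (dataset, njet, label), count in sorted(binned.items()):
    print(f"{dataset:<8} {njet:<5} {label:<14} {count}")
```

        With Pandas, the same table would typically come from grouping a DataFrame by the label columns and counting the rows in each group.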

        F.A.S.T. (Faster Analysis Software Taskforce) runs regular hackdays and tutorials in the UK. We present a new toolkit where Pandas DataFrames are used as the basis for binned physics analysis. We demonstrate how this engenders faster, more robust, and more flexible development, using fewer lines of code, as well as improving accessibility for newcomers.

        The toolkit is presented in the context of a typical CMS search analysis, where we look for evidence of new physics signal in a multidimensional parameter space, but is not analysis specific. The code structure is simple and modular, with built-in bookkeeping and documentation. We envisage its adoption by other UK binned analyses, and plan associated tools, helping physicists to focus more on “what” their analysis should do rather than technically “how” it is done.

        Speaker: Dr Benjamin Krikler (University of Bristol (GB))
      • 144
        RDataFrame: Easy Parallel ROOT Analysis at 100 Threads

        The Physics programmes of LHC Run III and HL-LHC challenge the HEP community. The volume of data to be handled is unprecedented at every step of the data processing chain: analysis is no exception.
        Physicists need to be provided with first-class analysis tools that are easy to use, exploit bleeding-edge hardware technologies and allow parallelism to be expressed seamlessly.
        This contribution discusses the declarative analysis engine of ROOT, RDataFrame, and gives details about how it makes it possible to profitably exploit commodity hardware as well as high-end servers and manycore accelerators thanks to the synergy with the existing parallelised ROOT components.
        Real-life analyses of LHC experiments’ data expressed in terms of RDataFrame are presented, highlighting the programming model provided to express them in a concise and powerful way. The recent developments which make RDataFrame a lightweight data processing framework are described, for example callbacks and I/O capabilities.
        The flexibility of RDataFrame and its ability to read data formats other than ROOT’s are characterised. As an example, we discuss how RDataFrame can directly read and analyze LHCb's raw data format, MDF.

        Speaker: Enrico Guiraud (CERN, University of Oldenburg (DE))
      • 145
        A further reduction in CMS event data for analysis: the NANOAOD format

        A new event data format has been designed and prototyped by the CMS collaboration to satisfy the needs of a large fraction of physics analyses (at least 50%) with a per-event size of order 1 kB. This new format is more than a factor of 20 smaller than the MINIAOD format and contains only the top-level information typically used in the last steps of an analysis. The talk will review the current analysis strategy from the point of view of event formats in CMS (both skims and formats such as RECO, AOD, MINIAOD and NANOAOD) and will describe the design guidelines for the new NANO format.

        Speaker: Andrea Rizzi (INFN Sezione di Pisa, Universita' e Scuola Normale Superiore, P)
      • 146
        Developing a Declarative Analysis Language: LINQToROOT

        The HEP community is preparing for the LHC’s Runs 3 and 4. One of the big challenges for physics analysis will be developing tools that can express an analysis efficiently and efficiently process the ten times more data expected. Recently, interest has focused on declarative analysis languages: a way of specifying a physicist’s intent and leaving everything else to the underlying system. The underlying system takes care of finding the data, powering the event processing loop, and even deciding exactly how to apply a desired jet selection most efficiently. If this works, it would allow an analyzer to test their algorithm on a small amount of data on their GPU-less laptop and then run it on a large amount of data on a server with multiple large GPUs without having to alter their code. The LINQToROOT project, started almost seven years ago, fits this model. It has been used and tested in three published ATLAS analyses. LINQToROOT is based on the Language Integrated Query system built into the cross-platform C# language. It enables writing strongly-typed queries on a ROOT TTree’s data, and translates the queries into a C++ algorithm that can run in ROOT. Recent work on this system has had two goals: improving analysis efficiency and better understanding the requirements of a declarative analysis language. For example, a good analysis language should be able to abstract away the backend. Recent work has extended the possible backends from the former single Windows ROOT backend to ones that run on Linux and on the Windows Subsystem for Linux, plus an experimental one that allows for PROOF-like parallel processing, all with almost no change to the analysis code itself. Any analysis language must also be rich enough to support an experiment’s data model. To test this, some experiments with the full ATLAS xAOD data model have been performed.
        All of this has been done while attempting to keep the project close to its original goal: quick turnaround for real ATLAS physics analysis. This work will be discussed in some detail, along with thoughts and lessons that have helped shape our thinking about an analysis language and perhaps our approach to future physics analysis employing declarative analysis.

        Speaker: Gordon Watts (University of Washington (US))
    • T7 - Clouds, virtualization and containers: S3 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Convener: Fabio Hernandez (IN2P3/CNRS Computing Centre)
      • 147
        Optimizing OpenStack Nova for Scientific Workloads

        The CERN OpenStack Cloud provides over 200,000 CPU cores to run data processing analyses for the Large Hadron Collider (LHC) experiments. Delivering these services with high performance and reliable service levels, while at the same time ensuring continuously high resource utilization, has been one of the major challenges for the CERN Cloud engineering team.

        Several optimizations, such as NUMA-aware scheduling and huge pages, have been deployed to improve the performance of Scientific Workloads, but the CERN Cloud team continues to explore new possibilities like pre-emptible instances and containers on baremetal.

        In this presentation we will dive into the concept and implementation challenges of pre-emptible instances and containers on baremetal for Scientific Workloads. We will also explore how they can improve Scientific Workloads throughput and infrastructure resource utilization.

        We will present the ongoing collaboration with the Square Kilometer Array (SKA) community to develop the necessary upstream enhancements to further improve OpenStack Nova's support for large-scale scientific workloads.

        Speaker: Belmiro Moreira (CERN)
      • 148
        Advanced features of the CERN OpenStack Cloud

        The CERN OpenStack cloud has been delivering a wide variety of services to its 3000 customers since it entered production in 2013. Initially, standard resources such as Virtual Machines and Block Storage were offered. Today, the cloud offering includes advanced features such as Container Orchestration (for Kubernetes, Docker Swarm mode, Mesos/DCOS clusters), File Shares and Bare Metal, and the Cloud team is preparing the addition of Networking and Workflow-as-a-Service components.

        In this presentation, we will describe these advanced features, the OpenStack projects that provide them, and some of the main use cases that benefit from them. We will present the ongoing work on those services that will increase functionality, such as container orchestration upgrades and networking features such as private networks and floating IPs.

        Speaker: Jose Castro Leon (CERN)
      • 149
        Sim@P1: Using Cloudscheduler for offline processing on the ATLAS HLT farm

        The Simulation at Point1 (Sim@P1) project was built in 2013 to take advantage of the ATLAS Trigger and Data Acquisition High Level Trigger (HLT) farm. The HLT farm provides more than 2,000 compute nodes, which are critical to ATLAS during data taking. When ATLAS is not recording data, this large compute resource is used to generate and process simulation data for the experiment. The Sim@P1 system uses virtual machines, deployed by OpenStack, in order to isolate the resources from the ATLAS technical and control network. During the upcoming long shutdown in 2019 (LS2), the HLT farm including the Sim@P1 infrastructure will be upgraded. A previous paper on the project emphasized the need for “simple, reliable, and efficient tools” to quickly switch between data acquisition operation and offline processing. In this contribution we assess various options for updating and simplifying the provisional tools. Cloudscheduler is a tool for provisioning cloud resources for batch computing that has been managing cloud resources in HEP offline computing since 2012. We present the argument for choosing Cloudscheduler, and describe technical details regarding optimal utilization of the Sim@P1 resources.

        Speaker: Frank Berghaus (University of Victoria (CA))
      • 150
        Experience with dynamic resource provisioning of the CMS online cluster using a cloud overlay

        The primary goal of the online cluster of the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) is to build event data from the detector and to select interesting collisions in the High Level Trigger (HLT) farm for offline storage. With more than 1100 nodes and a capacity of about 600 kHEPSpec06, the HLT machines represent up to 40% of the combined Tier-0/Tier-1 capacity on the Worldwide LHC Computing Grid (WLCG). Moreover, the cluster is currently connected to the CERN IT datacenter via a dedicated 160 Gbps network connection and hence can access the remote EOS-based storage with a high bandwidth. In the last few years, a cloud overlay based on OpenStack has been commissioned to use these resources for the WLCG when they are not needed for data taking. This online cloud facility was designed for parasitic use of the HLT, which must never interfere with its primary function as part of the DAQ system. It also abstracts away the different types of machines and their underlying segmented networks. For the monitoring, an infrastructure based on Graphite for metric storage and aggregation, and Grafana for the visualisation, has been deployed. During the LHC technical stop periods, the HLT cloud is set to its static mode of operation, where it acts like other grid facilities. The online cloud was also extended to make dynamic use of resources during periods between LHC fills. These periods are a-priori unscheduled and of undetermined length, typically of several hours, once or more a day. For that, it dynamically follows LHC beam states and hibernates Virtual Machines (VMs) accordingly. Finally, this work presents the design and implementation of a mechanism to dynamically ramp up VMs when the DAQ load on the HLT reduces towards the end of the fill.

        Speaker: Diego Da Silva Gomes (CERN)
      • 151
        Elastic resource allocation for multi-queue HEP workloads in cloud based environments

        With the development of cloud computing, clouds are increasingly applied in high-energy physics. OpenStack is generally considered the future of cloud computing. However, the OpenStack resource allocation model assigns a fixed number of resources to each group. This is not well suited to scientific computing such as high-energy physics applications, whose resource demands vary and show strong peaks. In a traditional static cluster, a fixed number of virtual machines is pre-allocated to the job queues of the different experiments. It often happens that some queues are backlogged while others are idle. As a result, the overall efficiency of the virtual cluster is rather low. To solve this problem, we developed a resource allocation service that provides elastic scheduling for OpenStack. In this implementation, each experiment queue in OpenStack has a {Min, Max} quota pair, where Min represents the minimum number of resources for the experiment and Max the highest number of resources available to it. The final resource allocation for each experiment is determined by a fair resource scheduling algorithm.
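
        The abstract does not specify the fair scheduling algorithm, so the following is only a minimal sketch of one way a {Min, Max} quota scheme could work: every queue first receives its Min, and the surplus is then shared round-robin among queues that still have demand and are below their Max. The function name, the queue names and all numbers are hypothetical.

```python
def allocate(total, queues):
    """Share `total` resource units among queues.

    queues maps a queue name to (min_quota, max_quota, demand).
    Returns a dict mapping each queue name to its allocation.
    """
    # Step 1: every queue receives its minimum quota.
    alloc = {name: mn for name, (mn, mx, demand) in queues.items()}
    free = total - sum(alloc.values())
    if free < 0:
        raise ValueError("minimum quotas exceed the pool size")

    # Upper bound per queue: never above Max, and not above the demand
    # (but never below the Min that was already handed out).
    cap = {name: max(mn, min(mx, demand))
           for name, (mn, mx, demand) in queues.items()}

    # Step 2: share the surplus one unit at a time, round-robin, among
    # the queues that have not yet reached their upper bound.
    while free > 0:
        hungry = [name for name in queues if alloc[name] < cap[name]]
        if not hungry:
            break
        for name in hungry:
            if free == 0:
                break
            alloc[name] += 1
            free -= 1
    return alloc


# Hypothetical experiment queues: (Min, Max, current demand).
example = {"expA": (10, 60, 80), "expB": (10, 50, 5), "expC": (20, 100, 40)}
result = allocate(100, example)
print(result)
```

        In this toy case the idle queue keeps only its Min, while the busy queues absorb the surplus up to their demand or Max, which is the behaviour that the static pre-allocation described above cannot provide.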

        In the paper, we will discuss the development work on elastic resource allocation. A use case with VCondor (a virtual cluster with HTCondor implemented by IHEP) is given. The results show that it can greatly improve cloud resource efficiency.

        Speaker: Haibo Li (Institute of High Energy Physics Chinese Academy of Science)
      • 152
        Cyber security detection and monitoring at IHEP private cloud for web services

        To improve hardware utilization and save manpower in system management, over the last few years we have migrated most of the web services in our institute (Institute of High Energy Physics, IHEP) to a private cloud built upon OpenStack. However, cyber security attacks have become an increasingly serious threat to the cloud. Therefore, a detection and monitoring system for cyber security threats is necessary for such an important platform.

        This system collects network traffic data through the OpenStack Neutron API and processes the traffic with Bro IDS. Its logs and the web/system log data of the Virtual Machines (VMs) are collected by Logstash and Filebeat. All the log data are stored in a storage server as well as in Elasticsearch. The latter is used for quick searches during forensics. A group of analysis jobs runs to check the logs according to security policies; these policies are stored in databases and can be updated by security operators. The real-time analysis results are displayed in a Web UI. Email alerts are sent to security operators when an incident is detected.

        A prototype of this system has been developed and deployed at IHEP to enhance the security of the private cloud for web services.

        Speaker: Tian Yan (Institution of High Energy Physics, Chinese Academy of Science)
    • 12:30 PM
      Lunch break
    • T1 - Online computing: S4 Hall 3.1 (National Palace of Culture)

      Hall 3.1

      National Palace of Culture

      Convener: Catrin Bernius (SLAC National Accelerator Laboratory (US))
      • 153
        The design and performance of the ATLAS Inner Detector trigger in high pileup collisions at 13 TeV at the Large Hadron Collider

        The design and performance of the ATLAS Inner Detector (ID) trigger
        algorithms running online on the High Level Trigger (HLT) processor
        farm for 13 TeV LHC collision data with high pileup are discussed.
        The HLT ID tracking is a vital component in all physics signatures
        in the ATLAS trigger for the precise selection of the rare or
        interesting events necessary for physics analysis without
        overwhelming the offline data storage in terms of both size and
        rate. To cope with the high interaction rates expected in the 13
        TeV LHC collisions the ID trigger was redesigned during the 2013-15
        long shutdown. The performance of the ID trigger in both the 2016
        and 2017 data from 13 TeV LHC collisions has been excellent and
        exceeded expectations, even at the very high interaction
        multiplicities observed at the end of data taking in 2017. The
        detailed efficiencies and resolutions of the trigger in a wide range
        of physics signatures are presented for the Run 2 data. The superb
        performance of the ID trigger algorithms in these extreme pileup
        conditions demonstrates how the ID tracking continues to lie at the
        heart of the trigger performance to enable the ATLAS physics
        program, and will continue to do so in the future.

        Speaker: Julie Hart Kirk (STFC-Rutherford Appleton Laboratory (GB))
      • 154
        Kalman Filter track reconstruction on FPGAs for acceleration of the High Level Trigger of the CMS experiment at the HL-LHC

        Track reconstruction at the CMS experiment uses the Combinatorial Kalman Filter. The algorithm computation time scales exponentially with pile-up, which will pose a problem for the High Level Trigger at the High Luminosity LHC. FPGAs, which are already used extensively in hardware triggers, are becoming more widely used for compute acceleration. With a combination of high performance, energy efficiency, and predictable low latency, FPGA accelerators are an interesting technology for high energy physics. Here, progress towards porting of the CMS track reconstruction is shown, using Maxeler Technologies’ DataFlow Engines, programmed with their high level language MaxJ. The performance is compared to CPUs, and further steps to optimise for the architecture are presented.

        Speaker: Sioni Paris Summers (Imperial College Sci., Tech. & Med. (GB))
      • 155
        Fast Kalman Filtering: new approaches for the LHCb upgrade

        In order to profit from the largely increased instantaneous luminosity provided by the accelerator in Run III (2021-2023), the upgraded LHCb detector will make use of a fully software-based trigger, with real-time event reconstruction and selection performed at the bunch crossing rate of the LHC (~30 MHz). This implies much tighter timing constraints for the event reconstruction than in Run II (2015-2018). The reconstruction of particles’ trajectories passing through the LHCb detector uses, already at an early stage, the Kalman Filter, a linear quadratic estimator for the track parameters. In the present configuration, the Kalman Filter is one of the most time-consuming parts of the reconstruction. For this reason, different approaches are currently being explored in order to reduce its execution time while keeping high reconstruction performance. The latest developments and their impact will be presented. These include vectorization, use of different computer architectures, simplifications of the algorithm, use of a parametrized approach and changes in the formalism.
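
        For readers unfamiliar with the estimator named above, the following minimal sketch shows one predict/update cycle of a textbook Kalman filter for a single scalar state. It is not the LHCb implementation, which works on multi-parameter track states with material and field effects; the function name, the default model parameters and the toy measurements are assumptions for illustration.

```python
def kalman_step(x, P, z, R, F=1.0, Q=0.0, H=1.0):
    """One predict/update cycle for a scalar state.

    x, P : current state estimate and its variance
    z, R : new measurement and its variance
    F, Q : state transition factor and process noise variance
    H    : factor projecting the state onto the measurement
    """
    # Prediction: propagate the state and its variance to the next step.
    x_pred = F * x
    P_pred = F * P * F + Q

    # Update: weight the residual by the Kalman gain K.
    S = H * P_pred * H + R          # variance of the residual
    K = P_pred * H / S              # Kalman gain
    x_new = x_pred + K * (z - H * x_pred)
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new


# Toy example: three identical measurements of a constant quantity.
x, P = 0.0, 1.0
for z in [1.0, 1.0, 1.0]:
    x, P = kalman_step(x, P, z, R=1.0)
print(x, P)
```

        The same two steps are repeated for every hit on every track candidate, which is why the filter dominates the reconstruction time and is a natural target for vectorization.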

        Speaker: Placido Fernandez Declara (University Carlos III (ES))
      • 156
        Fast Boosted Decision Tree inference on FPGAs for triggering at the LHC

        Boosted Decision Trees are used extensively in offline analysis and reconstruction in high energy physics. The computation time of ensemble inference has previously prohibited their use in online reconstruction, whether at the software or hardware level. An implementation of BDT inference for FPGAs, targeting low latency by leveraging the platform’s enormous parallelism, is presented. Full pipelining provides high throughput, and the use of a high-level programming language adds flexibility to construction of new ensembles. Classification latencies of tens of nanoseconds are achieved within the FPGA, as well as a 600 times speedup over inference on a single threaded CPU with the FPGA as a coprocessor. A use case within the CMS Level 1 track trigger is presented, demonstrating use of the implementation on a real classification problem, including resource and latency aware hyperparameter tuning.
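
        To illustrate what ensemble inference involves, here is a minimal software sketch in which each tree is stored as flat arrays and the ensemble score is the sum of the leaf values reached in every tree. The data layout, names and toy trees are assumptions for the example and do not describe the FPGA implementation presented in this contribution.

```python
from typing import List, NamedTuple


class Tree(NamedTuple):
    feature: List[int]      # feature index tested at each node, -1 for a leaf
    threshold: List[float]  # cut value at each internal node
    left: List[int]         # child index taken when x[feature] <= threshold
    right: List[int]        # child index taken otherwise
    value: List[float]      # score stored at each leaf


def tree_score(tree: Tree, x: List[float]) -> float:
    """Walk from the root (node 0) down to a leaf and return its score."""
    node = 0
    while tree.feature[node] != -1:
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.left[node]
        else:
            node = tree.right[node]
    return tree.value[node]


def ensemble_score(trees: List[Tree], x: List[float]) -> float:
    """Sum the per-tree scores; the trees do not depend on one another."""
    return sum(tree_score(t, x) for t in trees)


# Two hypothetical one-cut trees (decision stumps).
stump_a = Tree([0, -1, -1], [0.5, 0.0, 0.0], [1, -1, -1], [2, -1, -1], [0.0, -1.0, 1.0])
stump_b = Tree([1, -1, -1], [2.0, 0.0, 0.0], [1, -1, -1], [2, -1, -1], [0.0, 0.25, -0.25])
score = ensemble_score([stump_a, stump_b], [0.7, 1.0])
print(score)
```

        Since no tree depends on the result of another, all trees can in principle be evaluated at the same time, which is the kind of parallelism a hardware implementation can exploit.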

        Speaker: Sioni Paris Summers (Imperial College Sci., Tech. & Med. (GB))
      • 157
        Comparison of Track Reconstruction Methods of the PANDA Experiment

        PANDA is one of the main experiments of the future FAIR accelerator facility at Darmstadt. It utilizes an anti-proton beam with a momentum up to 15 GeV/c on a fixed proton or nuclear target to investigate the features of strong QCD.

        The reconstruction of charged particle tracks is one of the most challenging aspects in the online and offline reconstruction of the data taken by PANDA. Several features of the experiment make the reconstruction special in comparison to others:

        • The anti-proton beam of PANDA is quasi continuous with a Poisson distribution of the time differences between two consecutive events with a mean time of 50 ns. This leads to a very high probability that events overlap and to an a-priori unknown event time.

        • PANDA uses two different magnetic fields, a solenoidal one around the interaction point and a dipole field in the forward direction. Therefore, different track models have to be used and a connection between the different regions is necessary.

        • The experiment is equipped with straw tube trackers which measure the drift time inside a straw. This leads to an isochrone ring for each hit where the track has passed. The radius of the ring depends on the unknown start time of the track.

        • PANDA reads out and reconstructs the complete detector data online. This imposes strong requirements onto the reconstruction speed of the online track finding and fitting algorithms.
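
        The first point above can be made quantitative with a short calculation. Assuming the events follow a Poisson process, the gaps between consecutive events are exponentially distributed, and the probability that the next event arrives within a time window T is 1 - exp(-T / 50 ns). The window lengths below are hypothetical and are not taken from the PANDA detector specifications.

```python
import math

MEAN_GAP_NS = 50.0  # mean time between consecutive events, from the abstract


def overlap_probability(window_ns, mean_gap_ns=MEAN_GAP_NS):
    """Probability that the next event arrives within `window_ns`,
    assuming exponentially distributed gaps (Poisson process)."""
    return 1.0 - math.exp(-window_ns / mean_gap_ns)


# Hypothetical detector time windows, for illustration only.
for window in (10.0, 50.0, 200.0):
    print(f"{window:6.1f} ns window: {overlap_probability(window):.3f}")
```

        Under this assumption, any detector signal that lasts about as long as the mean gap is more likely than not to be overlapped by the following event.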

        Many different track reconstruction algorithms have been developed and tested both for online and offline reconstruction and on different hardware platforms such as CPUs, GPUs and FPGAs. In this presentation, an overview and a comparison of the different methods will be given.

        Speaker: Tobias Stockmanns (Forschungszentrum Jülich GmbH)
      • 158
        ATLAS Hardware based track-finding: Present and Future

        The ATLAS Fast TracKer (FTK) is a hardware based track finder for the ATLAS trigger infrastructure currently under installation and commissioning. FTK sits between the two layers of the current ATLAS trigger system, the hardware-based Level 1 Trigger and the CPU-based High-Level Trigger (HLT). It will provide full-event tracking to the HLT with a design latency of 100 µs at a 100 kHz event rate, reducing the need for computationally intensive tracking in the HLT, which is several orders of magnitude slower. Having tracking information available at the trigger level will significantly improve trigger efficiencies and thresholds for many physics objects, including b-jets and tau leptons. FTK uses a two-stage tracking procedure implemented with massive parallelization in FPGAs and associative memory ASICs for fast pattern matching. In the first stage, hits are compared in parallel to 1 billion precomputed patterns shared across 8192 associative memory chips, and tracks in the matched patterns are then fit at a design speed of 1 fit per nanosecond per FPGA across 512 fitter FPGAs. The second stage tracks are extrapolations of the first stage tracks into additional layers of the tracking detector. The proposed Hardware Track Trigger (HTT) for the ATLAS Phase-II upgrade at the High Luminosity LHC (HL-LHC) uses a conceptually similar two-stage pattern matching tracking procedure. However, the HTT is instead a co-processor to the CPU-based Event Filter (EF) trigger, providing regional tracking at the full 1 MHz EF input event rate as well as full-event tracking on request at 100 kHz. This talk reviews the current design and performance of FTK, as well as the conceptual design and performance of the HTT co-processor for the HL-LHC.
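
        The design figures quoted above can be combined into a rough budget. The short calculation below only restates the numbers given in the abstract; it assumes, as a simplification, that patterns are spread evenly over the chips and that all fitter FPGAs run at their design speed all the time.

```python
# Figures quoted in the abstract.
N_PATTERNS = 1_000_000_000      # precomputed patterns
N_AM_CHIPS = 8192               # associative memory chips
N_FITTER_FPGAS = 512            # fitter FPGAs
FITS_PER_NS_PER_FPGA = 1        # design fit speed
EVENT_RATE_HZ = 100_000         # 100 kHz event rate

# Patterns stored per chip, assuming an even distribution.
patterns_per_chip = N_PATTERNS // N_AM_CHIPS

# Aggregate fit rate per second (1 s = 10**9 ns), assuming all FPGAs are busy.
fits_per_second = N_FITTER_FPGAS * FITS_PER_NS_PER_FPGA * 10**9

# Upper bound on the average number of fits available per event.
fits_per_event = fits_per_second // EVENT_RATE_HZ

print(patterns_per_chip, fits_per_second, fits_per_event)
```

        These are upper bounds under the stated assumptions, not measured performance.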

        Speaker: Todd Michael Seiss (University of Chicago (US))
      • 159
        GPU-based Online Track Reconstructions for the ALICE TPC in Run 3 with Continuous Read Out

        In LHC Run 3, ALICE will increase the data taking rate significantly to 50 kHz continuous read out of minimum bias Pb-Pb collisions.
        The reconstruction strategy of the online offline computing upgrade foresees a first synchronous online reconstruction stage during data taking enabling detector calibration, and a posterior calibrated asynchronous reconstruction stage.
        Many new challenges arise, among them continuous TPC read out, more overlapping collisions, no a priori knowledge of the primary vertex and of location-dependent calibration in the synchronous phase, identification of low-momentum looping tracks, and a distorted refit to improve track model entropy coding for a total TPC compression factor of 20.
        The tracking algorithm for the Time Projection Chamber (TPC) will be based on a Cellular automaton and the Kalman filter.
        The reconstruction shall run online, processing 50 times more collisions than today, while yielding results comparable to current offline reconstruction.
        Our TPC track finding leverages the potential of hardware accelerators via the OpenCL and CUDA APIs in a shared source code for CPUs and GPUs for both reconstruction stages.
        We give an overview of the status of Run 3 tracking including track finding efficiency, resolution, treatment of continuous read out data, and performance on processors and GPUs.

        Speaker: David Rohr (CERN)
      • 160
        Quantum Associative Memory in HEP Track Pattern Recognition

        We have entered the Noisy Intermediate-Scale Quantum Era. A plethora of quantum processor prototypes allows evaluation of the potential of the Quantum Computing paradigm in applications to pressing computational problems of the future. Growing data input rates and detector resolution foreseen in High-Energy LHC (2030s) experiments expose the often high time and/or space complexity of classical algorithms. Quantum algorithms can potentially become the lower-complexity alternatives in such cases. In this work we discuss the potential of Quantum Associative Memory (QuAM) in the context of LHC data triggering. We examine the practical limits of storage capacity, as well as store and recall efficiencies, from the viewpoints of state-of-the-art quantum hardware and LHC real-time charged track pattern recognition requirements. We present experimental tests of QuAM on the IBM 5Q chip, a cloud-based 5-qubit superconducting quantum processor. We further compare the results to QuAM simulations on LIQUi|>, Microsoft’s quantum simulator toolsuite, as well as to theoretical expectations of QuAM efficiency bounds. We also review several difficulties in integrating end-to-end quantum pattern recognition into a real-time production workflow, and discuss possible mitigations.

        Speaker: Illya Shapoval (Lawrence Berkeley National Laboratory)
    • T2 - Offline computing: S4 Hall 3.2 (National Palace of Culture)

      Hall 3.2

      National Palace of Culture

      Convener: Gene Van Buren (Brookhaven National Laboratory)
      • 161
        Synergy between the tracker layer layout and track reconstruction in collider environment with high pileup

        The majority of currently planned or considered hadron colliders are expected to deliver data in collisions with hundreds of simultaneous interactions per beam bunch crossing on average, including the high luminosity LHC upgrade currently in preparation and the possible high energy LHC upgrade or a future circular collider FCC-hh. Running charged particle track reconstruction for the general purpose detectors at these colliders is expected to be at least comparable in cost to the construction and operation of the tracking detectors. We show that the sensitive layer arrangement in the design of a tracking detector can play a vital role in reducing the complexity and cost of the charged particle track reconstruction. Our case study is based on a realistic simulation of a tracking detector similar to that of CMS to be operated in HL-LHC. We show that a customary layout with equal layer separation is inferior to one with grouped/paired layers with a smaller distance within a group, which allows for more performant track segment reconstruction. This synergy between the hardware and software choices can allow for a reduction of the total cost of a future collider experiment.

        Speaker: Slava Krutelyov (Univ. of California San Diego (US))
      • 162
        A High-Precision Tracking Algorithm for Mass Reconstruction of Heavy-Ion Fragments in the R3B Experiment at FAIR

        The multi-purpose R$^{3}$B (Reactions with Relativistic Radioactive Beams) detector at the future FAIR facility in Darmstadt will be used for various experiments with exotic beams in inverse kinematics. The two-fold setup will serve for particle identification and momentum measurement up- and downstream of the secondary reaction target. In order to perform a high-precision charge identification of heavy-ion fragments and achieve a momentum resolution of 10$^{-4}$, a ToF measurement with 100 ps accuracy, position determination on the order of less than 0.5 mm and a dedicated algorithm for heavy-ion tracking in a highly non-homogeneous dipole field are required. With these constraints, a tracking package is being developed and tested within the R$^{3}$B software framework; this package has to go into production in fall of 2018. An iterative approach has been chosen for simultaneous track finding and fitting. The design and concept of the package will be introduced, and the tests and efficiencies measured with simulated data will be presented.

        Speaker: Dmytro Kresan (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 163
        Parallelized and Vectorized Tracking Using Kalman Filters with CMS Detector Geometry and Events

        The High-Luminosity Large Hadron Collider (HL-LHC) at CERN will be characterized by higher event rate, greater pileup of events, and higher occupancy. Event reconstruction will therefore become far more computationally demanding, and given recent technology trends, the extra processing capacity will need to come from expanding the parallel capabilities in the tracking software. Existing algorithms at the LHC are based on Kalman filter techniques, which have proven themselves to be robust and offer good physics performance. We have therefore developed Kalman-filter-based methods for track finding and fitting that are adapted for many-core SIMD processors, since this type of hardware is increasingly dominant in high-performance systems.

        This effort has been underway for some time now, and our software has matured in several important ways. (1) The detector geometry now includes two endcaps as well as the barrel, and tracks can propagate through layers of both types, as well as the transition regions between them. (2) We are therefore able to reconstruct events in realistic detector geometries, including an accurate representation of the CMS-2017 tracker. (3) Hit data can be imported from CMSSW-generated events, including pileup, and is no longer restricted to artificial muon-only simulations. (4) The reconstructed tracks can be validated against either the CMSSW simulation that generated the hits, or the CMSSW reconstruction of the tracks. (5) Groups of track candidates can now be tracked and assessed all the way through the detector using a single, consistent set of vectorizable data structures. In general, the code's computational performance has continued to improve while the above capabilities were being added.

        The presentation summarizes the latest features of this software, beginning with the data structures and code constructs that facilitate vectorization, as well as the multiple levels of parallel tracking tasks that have been multithreaded with TBB. We demonstrate that the present Kalman filter implementation is able to reconstruct events with comparable physics performance to CMSSW, while providing generally better computational performance. Further plans for advancing the software are discussed.

        Speaker: Matevz Tadel (Univ. of California San Diego (US))
      • 164
        COMET multi turn track fitting

        One of the tasks of track reconstruction for the COMET Phase-I drift chamber is to fit multi-turn curling tracks. A method based on the Deterministic Annealing Filter, which implements a global competition between hits from different turns of a track, is introduced. This method assigns the detector measurements to the track assumption based on the weighted mean of the fitting quality on different turns. The method is studied on simulated tracks in the COMET drift chamber. We show that it can be a candidate for determining the turn number of a hit.

        Speaker: Ms Yao Zhang
      • 165
        Belle II Track Reconstruction and Results from first Collisions

        In early 2018, e+e- collisions of the SuperKEKB B-Factory will be recorded by the Belle II detector in Tsukuba (Japan) for the first time. The new accelerator and detector represent a major upgrade from the previous Belle experiment and will achieve a 40-times higher instantaneous luminosity. Special considerations and challenges arise for track reconstruction at Belle II due to multiple factors. This high luminosity configuration of the collider increases the beam-induced background by many factors compared to Belle and a new track reconstruction software has been developed from scratch to achieve an excellent physics performance in this busy environment.
        Even though on average only eleven signal tracks are present in one event, all of them need to be reconstructed down to a transverse momentum of 50 MeV and no fake tracks should be present in the event. Many analyses at Belle II rely on the advantage that the initial state in B-factories is well known and a clean event reconstruction is possible if no tracks are left after assigning all tracks to particle hypotheses.

        This contribution will introduce the concepts and algorithms of the Belle II tracking software. Special emphasis will be put on the mitigation techniques developed to perform track reconstruction in high-occupancy events. First results from the data taking with the Belle II detector will be presented.

        Speaker: Thomas Hauth (KIT - Karlsruhe Institute of Technology (DE))
      • 166
        Track Fitting for the Belle II experiment

        The Belle II experiment is ready to take data in 2018, studying e+e- collisions at the KEK facility in Tsukuba (Japan), in a center of mass energy range of the Bottomonium states. The tracking system includes a combination of hit measurements coming from the vertex detector, made of pixel detectors and double-sided silicon strip detectors, and a central drift chamber, inside a solenoid of 1.5 T magnetic field. Once the pattern recognition routines have identified the track candidates, hit measurements are fitted together taking into account the different information coming from different detectors, the energy loss in the materials and the inhomogeneity of the magnetic field. Track fitting is performed by the generic track-fitting software GENFIT, which includes a Kalman filter improved by a deterministic annealing filter, in order to reject outlier hits that the pattern recognition associated incorrectly. Several mass hypotheses are used in the fit in parallel, in order to achieve the best track parameter estimation for each particle kind.
        This talk will present the design of the track fitting in the Belle II software, showing results in terms of track parameter estimation as well as computing performance.

        Speaker: Stefano Spataro (University of Turin)
      • 167
        Performance in offline event reconstruction landscape at CMS

        CMS offline event reconstruction algorithms cover simulated and acquired data processing starting from the detector raw data on input and providing high level reconstructed objects suitable for analysis. The landscape of supported data types and detector configuration scenarios has been expanding and covers the past and expected future configurations including proton-proton collisions and collisions with heavy ions through the high luminosity LHC era. All of this landscape is handled in the same software release development cycle. Every year the landscape is dominated by the current year proton-proton collision data taking in terms of computational and output data volume. By 2018 the landscape is relatively well established and application performance for Run-2 and Run-3 of LHC can be understood. We provide an in-depth look at the performance of a typical production application running offline event reconstruction, focusing on specific areas of subdetector or higher level object reconstruction. We highlight the recent trends in developments and their impact on the resources required to run the application. An outlook for future challenges of running in the HL-LHC era is provided based on the latest state of the art event reconstruction application setup.

        Speaker: Slava Krutelyov (Univ. of California San Diego (US))
      • 168
        JANA2 Multi-threaded Event Reconstruction

        Development of the JANA multi-threaded event processing framework began in 2005. Its primary application has been for GlueX, a major Nuclear Physics experiment at Jefferson Lab. Production data taking began in 2016 and JANA has been highly successful in analyzing that data on the JLab computing farm. Work has now begun on JANA2, a near complete rewrite emphasizing features targeted for large offsite computing facilities. Successful and less-than-successful strategies employed in JANA and how they are being addressed in JANA2 will be presented as well as new features suited to modern and future trends in data analysis.

        Speaker: David Lawrence (Jefferson Lab)
    • T3 - Distributed computing: Performance Optimization, Security and Federated Identity Hall 7 (National Palace of Culture)

      Hall 7

      National Palace of Culture

      Convener: David Cameron (University of Oslo (NO))
      • 169
        Minimising wasted CPU time with interruptible LHCb Monte Carlo

        During 2017 LHCb developed the ability to interrupt Monte Carlo
        simulation jobs and cause them to finish cleanly with the events
        simulated so far correctly uploaded to grid storage. We explain
        how this functionality is supported in the Gaudi framework and handled
        by the LHCb simulation framework Gauss. By extending DIRAC, we have been
        able to trigger these interruptions when running simulation on
        unoccupied capacity of the LHCb High Level Trigger farm, and are able to
        reclaim this capacity when needed for online data taking tasks. This has
        increased the opportunities for running Monte Carlo simulation during
        data taking runs as well as interfill periods and technical stops. We
        have also applied this mechanism to grid and cloud resources at external
        sites, providing the ability to reclaim capacity for operational reasons
        without long draining periods. In addition, the mechanism is used to
        increase the efficiency of the "job masonry" of packing
        single and multiprocessor jobs into the time slots on a single worker node,
        without the need for draining periods when multiple free processors must be
        assembled for a multiprocessor job. We explain how the Machine/Job
        Features mechanism is instrumental in communicating the desired finish
        time to LHCb jobs and virtual machines.
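
        The idea of finishing cleanly at a communicated deadline can be sketched as an event loop that checks the remaining time before starting each event. This is a minimal illustration under stated assumptions, not the Gaudi or Gauss implementation: `run_simulation`, `simulate_event` and `finish_by` are hypothetical names, and in practice the finish time would be obtained from the Machine/Job Features mechanism rather than passed in by hand.

```python
import time


def run_simulation(n_events, simulate_event, finish_by,
                   clock=time.time, safety_margin=0.0):
    """Simulate up to n_events, stopping cleanly before the deadline.

    finish_by is the desired finish time in seconds since the epoch.
    Events completed so far are returned so they can still be uploaded.
    """
    completed = []
    for index in range(n_events):
        # Never start an event that could not finish before the deadline.
        if clock() + safety_margin >= finish_by:
            break
        completed.append(simulate_event(index))
    return completed


# Usage with a fake clock that advances one second per call.
ticks = iter(range(100))
events = run_simulation(10, lambda i: i * i, finish_by=3,
                        clock=lambda: next(ticks))
```

        Only three of the ten requested events are produced here, but all three are returned intact instead of being lost when the slot is reclaimed.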

        Speaker: Andrew McNab (University of Manchester)
      • 170
        ATLAS Grid Workflow Performance Optimization

        The CERN ATLAS experiment grid workflow system routinely manages 250 to
        500 thousand concurrently running production and analysis jobs
        to process simulation and detector data. In total more than 300 PB
        of data is distributed over more than 150 sites in the WLCG.
        At this scale small improvements in the software and computing
        performance and workflows can lead to significant resource usage gains.
        ATLAS is reviewing together with CERN IT experts several typical
        simulation and data processing workloads for potential performance
        improvements in terms of memory and CPU usage, disk and network I/O.
        All ATLAS production and analysis grid jobs are instrumented to collect
        many performance metrics for detailed statistical studies using
        modern data analytics tools like ElasticSearch and Kibana.
        This presentation will review and explain the performance gains of
        several ATLAS simulation and data processing workflows and present
        analytics studies of the ATLAS grid workflows.

        Speaker: Johannes Elmsheuser (Brookhaven National Laboratory (US))
      • 171
        Improving efficiency of analysis jobs in CMS

        Hundreds of physicists analyse data collected by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) using the CMS Remote Analysis Builder (CRAB) and the CMS GlideinWMS global pool to exploit the resources of the Worldwide LHC Computing Grid. Efficient use of such an extensive and expensive resource is crucial. At the same time the CMS collaboration is committed to minimizing time to insight for every scientist, by pushing for the fewest possible access restrictions to the full data sample and for freedom in choosing the application to run. Supporting such varied workflows while preserving efficient resource usage poses special challenges, such as: scheduling of jobs in a multicore/pilot model where several single core jobs with an undefined runtime run inside pilot jobs with a fixed lifetime; balancing usage of every available CPU vs. use of CPU close to the data; avoiding that too many concurrent reads from the same storage push jobs into I/O wait mode, making CPU cycles go idle; watching over user activity to detect low efficiency workflows and prod them into smarter usage of the resources.

        In this paper we report on two complementary approaches adopted in CMS to improve the scheduling efficiency of user analysis jobs: job automatic splitting, and job automatic estimated running time tuning. They both aim at finding an appropriate value for the scheduling runtime, a number that tells how much walltime the user job needs, and it is used during scheduling to fit user's jobs into pilots that have enough lifetime. With the automatic splitting mechanism, an estimation of the runtime of the jobs is performed upfront so that an appropriate value can be estimated for the scheduling runtime. With the automatic time tuning mechanism instead, the scheduling runtime is dynamically modified by analyzing the real runtime of jobs after they finish. We also report on how we used the flexibility of the global computing pool to tune the amount, kind and running locations of jobs allowed to run exploiting remote access to the input data.

        We discuss the strategies, concepts, details, and operational experiences, highlighting the pros and cons, and we show how such efforts helped improve the computing efficiency in CMS.
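
        The automatic time tuning idea can be sketched as deriving a scheduling runtime from the runtimes of jobs that have already finished, then checking whether a pilot has enough lifetime left. This is a hypothetical illustration, not the CRAB implementation: the function names, the 95th percentile, the 20% margin and the 8 hour default are assumptions chosen for the example.

```python
def tuned_runtime(finished_runtimes_s, percentile=95, margin_percent=120,
                  default_s=8 * 3600):
    """Estimate the scheduling runtime (seconds) from finished jobs."""
    if not finished_runtimes_s:
        # Nothing has finished yet: fall back to a conservative default.
        return default_s
    ordered = sorted(finished_runtimes_s)
    # Integer arithmetic keeps the percentile index exact.
    index = min(len(ordered) - 1, (percentile * len(ordered)) // 100)
    return ordered[index] * margin_percent // 100


def fits_in_pilot(scheduling_runtime_s, pilot_remaining_s):
    """A job is matched only to pilots with enough lifetime left."""
    return scheduling_runtime_s <= pilot_remaining_s


# Nineteen one-hour jobs and one two-hour job have finished so far.
estimate = tuned_runtime([3600] * 19 + [7200])
```

        A high percentile makes the estimate robust against a few short jobs, while the margin keeps slow outliers from being killed when the pilot expires; both values would need tuning against real workloads.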

        Speakers: Todor Trendafilov Ivanov (University of Sofia (BG)) , Jose Hernandez (CIEMAT)
      • 172
        Using AWS Athena analytics to monitor pilot job health on WLCG compute sites

        ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are neglected and not brought back into production because the problems are not visible to the WMS. In this paper we describe a method to analyse logs from the ADC resource provisioning system (AutoPyFactory) and provide monitoring views which target poorly performing resources and help diagnose the issues in good time. Central to this analysis is the use of Amazon Web Services (AWS) to provide an inexpensive and stable analytics platform. In particular we use the AWS Athena service as an SQL query interface for logging data stored in the AWS S3 service. We describe details of the data handling pipeline and services involved leading to a summary of key metrics suitable for ADC operations.
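
        Querying log data in S3 through Athena can be sketched with the boto3 `start_query_execution` call. The example is a minimal sketch, not the ADC pipeline: the table name, the columns `queue`, `status` and `day`, and the failure threshold are hypothetical, since the abstract does not describe the schema.

```python
import datetime


def build_failure_query(table, day, min_failures=10):
    """Return SQL listing the queues with many failed pilots on one day."""
    # Accept only simple identifiers, since the name is spliced into SQL.
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected table name: %r" % table)
    return (
        "SELECT queue, count(*) AS failures "
        "FROM {table} "
        "WHERE status = 'failed' AND day = DATE '{day}' "
        "GROUP BY queue "
        "HAVING count(*) >= {n} "
        "ORDER BY failures DESC"
    ).format(table=table, day=day.isoformat(), n=int(min_failures))


def submit_query(sql, database, output_location):
    """Submit the query to Athena and return its execution id."""
    import boto3  # imported lazily so the query builder works offline

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


sql = build_failure_query("pilot_logs", datetime.date(2018, 7, 9))
```

        Athena runs queries asynchronously, so a caller would poll the returned execution id and read the results from the S3 output location once the query has completed.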

        Speaker: Peter Love (Lancaster University (GB))
      • 173
        Federated Identity Management for Research

        Federated identity management (FIM) is an arrangement that can be made among multiple organisations that lets subscribers use the same identification data to obtain access to the secured resources of all organisations in the group. In many research communities there is an increasing interest in a common approach to FIM as there is obviously a large potential for synergies. FIM4R [1] provides a forum for communities to share challenges and ideas, and to shape the future of FIM for our researchers. Current participation covers high energy physics, life sciences and humanities, to mention but a few. In 2012 FIM4R converged on a common vision for FIM, enumerated a set of requirements and proposed a number of recommendations for ensuring a roadmap for the uptake of FIM is achieved [2]. FIM4R is currently working on an updated version of this paper, to be published in spring 2018.

        The HEP Community has been heavily involved in creating both the original whitepaper and this new version, which will document the progress made in FIM for Research, in addition to the current challenges. During this presentation we will share the conclusions of this second FIM4R whitepaper and present a summary of the identified requirements and recommendations. We shall focus particularly on the direction being taken by WLCG, through the WLCG Authorisation Working Group, and the requirements gathered from the HEP Community.

        [1] https://fim4r.org
        [2] https://fim4r.org/documents/
        [3] https://aarc-project.eu
        [4] https://www.geant.org/Projects/GEANT_Project_GN4

        Speaker: Hannah Short (CERN)
      • 174
        Operational security, threat intelligence & distributed computing: the WLCG Security Operations Center Working Group

        The modern security landscape for distributed computing in High Energy Physics (HEP) includes a wide range of threats employing different attack vectors. The nature of these threats is such that the most effective method for dealing with them is to work collaboratively, both within the HEP community and with partners further afield - these can, and should, include institutional and campus security teams. In parallel with this work, an appropriate technology stack is essential, incorporating current work on Big Data analytics.

        The work of the Worldwide LHC Computing Grid (WLCG) Security Operations Center (SOC) working group is to pursue these goals to form a reference design (or guidelines) for WLCG sites of different types. The strategy of the group is to identify necessary components - starting with threat intelligence (MISP) and network monitoring (Bro) - and build a working model over time.

        We report on the progress of the working group thus far, in particular on the programme of workshops now underway. These workshops give an opportunity to engage with sites to allow the development of advice and procedures for deployment, as well as facilitating wider discussions on how best to work with trust groups at different levels. These trust groups vary in scope but can include institutes, NGIs and the WLCG as a whole.

        Speaker: David Crooks (University of Glasgow (GB))
      • 175
        Macaroons: looking back and looking forward

        X.509 is the dominant security infrastructure used in WLCG. Although
        this technology has worked well, it has some issues. One is that,
        currently, a delegated proxy can do everything the parent credential
        can do. A stolen "production" proxy could be used from any machine in
        the world to delete all data owned by that VO on all storage systems
        in the grid.

        Generating a delegated X.509 credential is also computationally
        expensive. As a result, credentials tend to be cached and reused,
        potentially increasing the likelihood of theft.

        In October 2017, dCache v3.2 was released, which introduced support
        for macaroons: a pure authorisation, bearer token that supports
        delegation with autonomous attenuation.

        In contrast to X.509, minting a macaroon is much faster, so that
        creating a macaroon per request is feasible. Macaroons also support
        attenuation when delegating: the delegated macaroon can be limited to
        a specific task and to a specific machine.

        This paper reviews the current state of this support in dCache and
        presents the different use-cases and projects that are using or are
        evaluating macaroons. It also compares macaroons to other pure bearer
        token solutions, such as the OAuth2 approach in SciToken, highlighting
        the relative strengths and weaknesses of each.
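
        Why minting is cheap and why a holder can attenuate a macaroon without contacting the issuer follows from the chained-HMAC construction. The sketch below illustrates that general construction only: it is not dCache's implementation, and the dictionary layout and the `key=value` caveat syntax are hypothetical.

```python
import hashlib
import hmac


def _mac(key, message):
    return hmac.new(key, message, hashlib.sha256).digest()


def mint(root_key, identifier):
    """Issuer side: one HMAC over the identifier with the secret key."""
    return {"id": identifier, "caveats": [],
            "sig": _mac(root_key, identifier.encode())}


def attenuate(macaroon, caveat):
    """Holder side: add a restriction using only the current signature."""
    return {"id": macaroon["id"],
            "caveats": macaroon["caveats"] + [caveat],
            "sig": _mac(macaroon["sig"], caveat.encode())}


def verify(root_key, macaroon, context):
    """Issuer side: replay the HMAC chain, then check every caveat."""
    sig = _mac(root_key, macaroon["id"].encode())
    for caveat in macaroon["caveats"]:
        sig = _mac(sig, caveat.encode())
    if not hmac.compare_digest(sig, macaroon["sig"]):
        return False
    for caveat in macaroon["caveats"]:
        key, _, value = caveat.partition("=")
        if context.get(key) != value:
            return False
    return True


root = b"server-secret"
token = attenuate(attenuate(mint(root, "token-1"), "activity=DOWNLOAD"),
                  "ip=192.0.2.7")
```

        Each caveat replaces the signature with an HMAC keyed by the previous signature, so restrictions can be added by anyone holding the token but cannot be removed without the issuer's secret.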

        Speaker: Paul Millar (DESY)
      • 176
        EOSC-hub AAI: A federated authentication and authorisation infrastructure for international scientific collaboration at scale

        The European Open Science Cloud (EOSC) aims to enable trusted access to services and the re-use of shared scientific data across disciplinary, social and geographical borders. The EOSC-hub will realise the EOSC infrastructure as an ecosystem of research e-Infrastructures leveraging existing national and European investments in digital research infrastructures. EGI Check-in and EUDAT B2ACCESS are the initial basis of an integrated EOSC-hub AAI that will allow the use of federated identities for accessing services. The adoption of standards and open technologies, including SAML 2.0, OpenID Connect, and OAuth 2.0, facilitates integration with web-based services. Options to support non-web services, which traditionally relied on X509 certificates, are based around the concept of online authorities with attached credential stores, such as RCauth.eu with a tightly-coupled MyProxy server. Such techniques allow science gateways to obtain credentials on behalf of the end-user that can be used to directly authenticate to services. Another user-centric approach considers certificate proxies as opaque tokens that can be obtained from a credential store from the command-line using SSH authentication. The deployed RCauth.eu and MasterPortal service from AARC features both these capabilities and has been shown to work for the production EGI and WLCG environments.

        The presentation will provide an overview of the EOSC-hub AAI architecture and the various integration workflows in support of today’s federated access requirements, with an eye to the non-web use cases.

        Speaker: Mr Nicolas Liampotis (Greek Research and Technology Network - GRNET)
    • T4 - Data handling: S4 Hall 8 (National Palace of Culture)

      Hall 8

      National Palace of Culture

      Convener: Costin Grigoras (CERN)
      • 177
        Echo - Experiences running an erasure coded object store

        Since February 2017, the RAL Tier-1 has been storing production data from the LHC experiments on its new Ceph backed object store called Echo. Echo has been designed to meet the data demands of LHC Run 3 and should scale to meet the challenges of HL-LHC. Echo is already providing better overall throughput than the service it will replace (CASTOR) even with significantly less hardware deployed.

        Echo relies on erasure coding rather than hardware RAID to provide data resilience. Of the publicly known Ceph clusters around the world, Echo is the largest running erasure coding in production. This paper describes the erasure coding setup, its advantages over hardware RAID and our experience relying on it for data resilience.

        At the start of 2017, the LHC experiments had more than 14PB of data stored on disk in Castor. Migrating this to Echo is no small challenge and will take approximately 2 years. This paper describes the different approaches taken by the experiments as well as their current usage of Echo. This paper also describes the growing usage of the S3 and Swift APIs and lessons learnt.

        In the first year of operation there have been many pieces of scheduled work, including the addition of new hardware which resulted in significant data rebalancing, major software updates as well as security patching. There have also been operational problems such as a power cut and high disk failure rates. This paper describes how Echo has coped with these events and the higher level of data availability it is able to provide.
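
        The principle of erasure coding can be illustrated with the simplest possible code: k data chunks plus a single XOR parity chunk, which survives the loss of any one chunk. This is a teaching sketch only and an assumption of this example; it is not Echo's configuration, and production systems use codes that tolerate several simultaneous losses.

```python
def _xor(left, right):
    return bytes(a ^ b for a, b in zip(left, right))


def encode(data, k):
    """Split data into k equal chunks and append one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)
    for chunk in chunks:
        parity = _xor(parity, chunk)
    return chunks + [parity]


def recover(chunks):
    """Rebuild at most one missing chunk, marked as None."""
    missing = [i for i, chunk in enumerate(chunks) if chunk is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot repair two lost chunks")
    repaired = list(chunks)
    if missing:
        present = [chunk for chunk in chunks if chunk is not None]
        rebuilt = bytes(len(present[0]))
        for chunk in present:
            rebuilt = _xor(rebuilt, chunk)
        repaired[missing[0]] = rebuilt
    return repaired


stored = encode(b"CHEP2018", k=4)
stored[1] = None                      # simulate a failed disk
restored = b"".join(recover(stored)[:4])
```

        With k data chunks and m parity chunks the storage overhead is (k + m) / k, which is why erasure coding can be cheaper than replication for the same resilience.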

        Speaker: Rob Appleyard (STFC)
      • 178
        dCache - joining the noWORM storage club.

        The life cycle of the scientific data is well defined: data is collected, then processed,
        archived and finally deleted. Data is never modified. The original data is used or new,
        derived data is produced: Write Once Read Many times (WORM). With this model in
        mind, dCache was designed to handle immutable files as efficiently as possible. Currently,
        data replication, HSM connectivity and data-server independent operations are only
        possible due to the immutable nature of the stored data.

        dCache is seen increasingly as a general-purpose file system, helped by its support for
        the NFSv4.1 standard, especially by new communities, such as photon science and
        microbiology. Although many users are aware of the immutability of data stored in
        dCache, some applications and use cases still require in-place update of stored files.
        Satisfying these requires some fundamental changes to dCache's core design. However,
        those changes must not compromise any aspect of existing functionality.

        In this presentation we will show the new developments in dCache that will turn it
        into a regular file system. We will discuss the challenges of building a POSIX-compliant
        distributed storage system, one that can handle multiple replicas and that remains
        backward compatible by providing both WORM and non-WORM capabilities within
        the same system.

        Speaker: Mr Tigran Mkrtchyan (DESY)
      • 179
        A milestone for DPM (Disk Pool Manager)

        The DPM (Disk Pool Manager) system is a multiprotocol scalable technology for Grid storage that supports about 130 sites for a total of about 90 Petabytes online.

        The system has recently completed the development phase that had been announced in the past years, which consolidates its core component (DOME: Disk Operations Management Engine) as a full-featured high performance engine that can also be operated with standard Web clients and uses a fully documented REST-based protocol.

        Together with a general improvement on performance and with a comprehensive administration command-line interface, this milestone also brings back features like the automatic disk server status detection and the volatile pools for deploying experimental disk caches.

        In this contribution we also discuss the end of support for the historical DPM components (that also include a dependency on the Globus toolkit), whose deployment is now only linked to the usage of the SRM protocols, hence can be uninstalled when these are not needed anymore by the site.

        Speaker: Fabrizio Furano (CERN)
      • 180
        Providing large-scale disk storage at CERN

        The CERN IT Storage group operates multiple distributed storage systems and is responsible for the support of the infrastructure to accommodate all CERN storage requirements, from the physics data generated by LHC and non-LHC experiments to users' personal files.

        EOS is now the key component of the CERN Storage strategy. It allows operation at high incoming throughput for experiment data-taking while running concurrent complex production workloads. This high-performance distributed storage now provides more than 250 PB of raw disk space and is the key component behind the success of CERNBox, the CERN cloud synchronisation service which allows syncing and sharing files on all major mobile and desktop platforms to provide offline availability to any data stored in the EOS infrastructure.

        CERNBox has recorded exponential growth over the last couple of years in terms of files and data stored, thanks to its increasing popularity within the CERN user community and to its integration with a multitude of other CERN services (Batch, SWAN, Microsoft Office).

        In parallel, CASTOR is being simplified and is transitioning from an HSM into an archival system, focusing mainly on the long-term recording of the primary data from the detectors and preparing the road to the next-generation tape archival system, CTA.

        The storage services at CERN also cover the needs of the rest of our community: Ceph as data back-end for the CERN OpenStack infrastructure, NFS services and S3 functionality; AFS for legacy home directory filesystem services, and its ongoing phase-out; and CVMFS for software distribution.

        In this paper we will summarise our experience in supporting all our distributed storage systems and the ongoing work in evolving our infrastructure, testing very dense storage building blocks (nodes with more than 1 PB of raw space) for the challenges waiting ahead.

        Speaker: Herve Rousseau (CERN)
      • 181
        Scaling the EOS namespace

        The EOS namespace has outgrown its legacy in-memory implementation, presenting the need for an alternative solution. In response to this need we developed QuarkDB, a highly-available datastore capable of serving as the metadata backend for EOS. Even though the datastore was tailored to the needs of the namespace, its capabilities are generic.

        We will present the overall system design, and our efforts in providing comparable performance with the in-memory approach, both when reading, through the use of extensive caching on the MGM, and when writing through the use of latency-hiding techniques involving a persistent, back-pressured local queue for batching updates to the QuarkDB backend.

        We will also discuss the architectural decisions taken when designing our datastore, including the choice of consensus algorithm to maintain strong consistency between identical replicas (raft), the choice of underlying storage backend (rocksdb) and communication protocol (redis serialization protocol - RESP), as well as the overall testing strategy to ensure correctness and stability of this important infrastructure component.
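
        The redis serialization protocol mentioned above frames each command as an array of length-prefixed bulk strings. The encoder below is a minimal sketch of that framing; the function name and the `SET` example are illustrative assumptions and do not describe the commands EOS actually sends to QuarkDB.

```python
def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    # Array header: number of elements in the command.
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode("utf-8")
        # Bulk string: byte length, then the payload, each CRLF-terminated.
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


request = encode_command("SET", "key", "value")
```

        Because every element carries its byte length, payloads may contain arbitrary binary data, and several encoded commands can be concatenated and sent in one batch.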

        Speaker: Andrea Manzi (CERN)
      • 182
        CERNBox: the CERN cloud storage hub

        CERNBox is the CERN cloud storage hub. It allows synchronising and sharing files on all major desktop and mobile platforms (Linux, Windows, MacOSX, Android, iOS) aiming to provide universal access and offline availability to any data stored in the CERN EOS infrastructure.

        With more than 12000 users registered in the system, CERNBox has responded to the high demand in our diverse community for an easily accessible cloud storage solution that also provides integration with other CERN services for big science: visualisation tools, interactive data analysis and real-time collaborative editing.

        Collaborative authoring of documents is now becoming standard practice with public cloud services, and within CERNBox we are looking into several options: from the collaborative editing of shared office documents with different solutions (Microsoft, OnlyOffice, Collabora) to integrating mark-down as well as LaTeX editors, to exploring the evolution of Jupyter Notebooks towards collaborative editing, where the latter leverages on the existing SWAN Physics analysis service.

        We report on our experience managing this technology and applicable use-cases, also in a broader scientific and research context and its future evolution with highlights on the current development status and future roadmap. In particular we will highlight the future move to an architecture based on microservices to easily adapt and evolve the service to the technology and usage evolution, notably to unify CERN home directory services.

        Speaker: Hugo Gonzalez Labrador (CERN)
      • 183
        Cloud Storage for data-intensive sciences in science and industry

        In the last few years we have been seeing constant interest for technologies providing effective cloud storage for scientific use, matching the requirements of price, privacy and scientific usability. This interest is not limited to HEP and extends out to other scientific fields due to the fast data increase: for example, "big data" is a characteristic of modern genomics, energy and financial services to mention a few.

        The provision of cloud storage accessible via synchronisation and sharing interfaces became an essential element of the service portfolios offered by research laboratories and universities. "Dropbox-like" services were created and now support HEP and other communities in their day to day tasks. The scope for these systems is therefore much broader than HEP: we will describe the usage and the plans to adopt part of the tools originally conceived for our community in other areas. The adoption of cloud storage services in the main workflow for data analysis is the challenge we are now facing, extending the functionality of "traditional" cloud storage.

        What are the ingredients for these new classes of services? Is HEP nowadays proposing interesting solutions for other future projects on the timescale of the high-luminosity LHC?

        The authors believe that HEP-developed technologies will constitute the backend for a new generation of services. Namely, our solution for exascale geographically distributed storage (EOS), the access and the federation of cloud storage across different domains (CERNBox) and the possibility to offer effective heavy-duty interactive data analysis services (SWAN) growing from this novel data infrastructure are the three key enablers for future evolution.

        In this presentation we will describe the usage of these technologies to build large content delivery networks (e.g. AARNET Australia), the collaboration with other activities (e.g. handling of satellite images from the Copernicus programme at JRC) and different partnerships with companies active in this field.

        Speaker: Hugo Gonzalez Labrador (CERN)
      • 184
        Ceph File System for the CERN HPC Infrastructure

        The Ceph File System (CephFS) is a software-defined network filesystem built upon the RADOS object store. In the Jewel and Luminous releases, CephFS was labeled as production ready with horizontally scalable metadata performance. This paper seeks to evaluate that statement in relation to both the HPC and general IT infrastructure needs at CERN. We highlight the key metrics required by four users, including: POSIX compliance, single-user small-file latency, multi-user metadata throughput, and metadata horizontal scalability and failure tolerance. We will report on the experience so far and the future evolution of the service.

        In addition, we describe a new suite of micro-benchmarks which measure the small latencies of buffered and synchronous filesystem operations and can be used to quantify the evolving quality of a storage cluster over time. We also introduce a simple ping-like latency tool, fsping, which evaluates the time needed for two clients to notice file modifications in a shared filesystem. These tests, in combination with several classical HPC benchmarks run in single- and multi-user scenarios, paint a picture of CephFS which is emerging as a viable option for HPC storage and NFS-appliance replacements.

        Speaker: Herve Rousseau (CERN)
    • T5 - Software development: S4 - Frameworks Hall 3 (National Palace of Culture)

      Hall 3

      National Palace of Culture

      Convener: Patricia Mendez Lorenzo (CERN)
      • 185
        The core software framework for the LHCb Upgrade

        The LHCb detector will be upgraded for the LHC Run 3. The new, full software trigger must be able to sustain the 30MHz proton-proton inelastic collision rate. The Gaudi framework currently used in LHCb has been re-engineered in order to enable the efficient usage of vector registers and of multi- and many-core architectures. This contribution presents the critical points that had to be tackled, the current status of the core software framework and an outlook of the work program that will address the challenges of the software trigger.

        Speaker: Sebastien Ponce (CERN)
      • 186
        Evolution of the ALICE Software Framework for LHC Run 3

        ALICE is one of the four major LHC experiments at CERN. When the accelerator enters the Run 3 data-taking period, starting in 2021, ALICE expects almost 100 times more Pb-Pb central collisions than now, resulting in a large increase of data throughput. In order to cope with this new challenge, the collaboration had to extensively rethink the whole data processing chain, with a tighter integration between Online and Offline computing worlds. Such a system, codenamed ALICE O2, is being developed in collaboration with the FAIR experiments at GSI. It is based on the ALFA framework which provides a generalised implementation of the ALICE High Level Trigger approach, designed around distributed software entities coordinating and communicating via message passing.

        We will highlight our efforts to integrate ALFA within the ALICE O2 environment. We analyse the challenges arising from the different running environments for production and development, and conclude on requirements for a flexible and modular software framework. In particular we will present the ALICE O2 Data Processing Layer which exploits ALICE specific requirements in terms of Data Model. The main goal is to reduce the complexity of development of algorithms and managing a distributed system, and by that leading to a significant simplification for the large majority of the ALICE users. We will show examples for the usage of the Data Processing Layer in different contexts: local reconstruction after detector read-out, full reconstruction, simulation, and quality control. For each, we will provide a brief description, an evaluation of the performance and an assessment on the added value from the Data Processing Layer.

        Finally we will give an outlook on future developments and possible generalisations and reabsorption in the ALFA framework of some of the concepts and components we developed.

        Speaker: Giulio Eulisse (CERN)
      • 187
        The ATLAS multithreaded offline framework

        In preparation for Run 3 of the LHC, scheduled to start in 2021, the ATLAS
        experiment is revising its offline software so as to better take advantage
        of machines with many cores. A major part of this effort is migrating the
        software to run as a fully multithreaded application, as this has been
        shown to significantly improve the memory scaling behavior. This talk will
        outline changes made to the software framework to support this migration,
        as well as describe a static code checker used to assist in locating
        violations of thread-safety rules. Preliminary experiences in migrating
        reconstruction algorithms will also be discussed.

        Speaker: Scott Snyder (Brookhaven National Laboratory (US))
      • 188
        GNA: new framework for statistical data analysis

        Measurement of physical parameters is usually done by fitting a numerical model of the experiment to the data. High-precision experiments require detailed models with a large number of uncertain parameters. The models should be computationally efficient. At the same time they should be flexible enough, since the analysis preparation requires a lot of testing.
        We are solving these problems by introducing the GNA framework. The computational model in GNA is represented by a directed acyclic graph. Each node is an operation on an array (matrix multiplication, reweighting or even cross section calculation) implemented in C++ for efficiency. The model is then assembled by connecting the nodes, outputs to inputs. The assembly is done within Python. Such an approach enables the user to build complex but flexible and efficient lazily evaluated models. GNA offers tools for handling a large number of parameters with uncertainties and correlations, fitting and statistical analysis.
        GNA stands for “global neutrino analysis” and was originally developed for neutrino physics, for the data analysis of experiments such as Daya Bay, JUNO and NOvA.
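
        The graph-based, lazily evaluated model described above can be illustrated with a minimal pure-Python sketch. It is not GNA code: the class names Node and Parameter, the caching and invalidation scheme, and the toy model are assumptions made only for illustration.

```python
class Node:
    """One operation in the graph; its output is computed on demand and cached."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = list(inputs)
        self.dependants = []
        self._cache = None
        self._valid = False
        self.evaluations = 0  # counts how often func was actually run
        for node in self.inputs:
            node.dependants.append(self)

    def value(self):
        # Lazy evaluation: recompute only if an upstream change invalidated us.
        if not self._valid:
            args = [node.value() for node in self.inputs]
            self._cache = self.func(*args)
            self._valid = True
            self.evaluations += 1
        return self._cache

    def invalidate(self):
        # Propagate the "stale" flag downstream; stop at already stale nodes.
        if self._valid:
            self._valid = False
            for node in self.dependants:
                node.invalidate()


class Parameter(Node):
    """A leaf node holding a value that a fitter could vary."""

    def __init__(self, value):
        super().__init__(None)
        self._cache = value
        self._valid = True

    def set(self, value):
        self._cache = value
        for node in self.dependants:
            node.invalidate()


# Toy model (hypothetical): total = sum(norm * spectrum)
norm = Parameter(2.0)
spectrum = Parameter([1.0, 2.0, 3.0])
scaled = Node(lambda k, s: [k * x for x in s], norm, spectrum)
total = Node(sum, scaled)

first = total.value()    # triggers evaluation of scaled, then total
again = total.value()    # served from the cache, nothing is recomputed
norm.set(3.0)            # marks scaled and total as stale
updated = total.value()  # recomputed on demand
```

        In this sketch, changing a parameter only marks the downstream nodes as stale; nothing is recomputed until a value is requested, which is the property that makes repeated evaluation during a fit cheap.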

        Speaker: Ms Anna Fatkina (JINR)
      • 189
        Physics data processing with micro-services

        In this paper, we present a micro-services framework for developing data processing applications.
        We discuss functional decomposition strategies that help transition existing data processing applications into a micro-services environment. We will also demonstrate advantages and disadvantages of this framework in terms of operational elasticity, vertical and horizontal scalability, productivity and maintenance. The discussion will be accompanied by two real-world implementations of the presented framework: 1) the JLAB CLAS12 event reconstruction and analysis application (new design, using Java, C++ and Python micro-services within the same application), and 2) the NASA EOS satellite data fusion and analytics application (designed using legacy software components written in Fortran, presented as micro-services).

        Speaker: Vardan Gyurjyan (Jefferson Lab)
      • 190
        A generic data quality monitoring framework for HEP

        Data quality monitoring is the first step towards the certification of the recorded data for off-line physics analysis. Dedicated monitoring frameworks have been developed by many experiments in the past and usually rely on the event data model (EDM) of the experiment, leading to a strong dependency on the data format and storage. We present here a generic data quality monitoring system, DQM4HEP, that has been developed without any assumption on the EDM. This improves code maintainability, portability across different experiments and re-usability for future experiments.
        We present the framework architecture and the various tools provided by the software package, as well as performance figures such as memory usage, stability and network bandwidth. We give an overview of the different experiments using DQM4HEP and the foreseen integration in other future experiments. We finally present the ongoing and future software development for DQM4HEP and long-term prospects.

        Speaker: Remi Ete (DESY)
      • 191
        Performance optimization of the air shower simulation code for the Cherenkov Telescope Array

        The Cherenkov Telescope Array (CTA), currently under construction, is the next-generation instrument in the field of very high energy gamma-ray astronomy. The first data are expected by the end of 2018, while the scientific operations will start in 2022 for a duration of about 30 years. In order to characterise the instrument response to the Cherenkov light emitted by atmospheric cosmic ray showers, detailed Monte Carlo simulations will be regularly performed in parallel to CTA operations. The estimated CPU time associated with these simulations is very high, of the order of 200 million HS06 hours per year. Reducing the CPU time devoted to simulations would either reduce infrastructure cost or allow better coverage of the large phase space.
        In these proceedings, we focus on the main computing step (70% of the whole CPU time) implemented in the corsika program, and specifically on the module responsible for the propagation of Cherenkov photons in the atmosphere. We present our preliminary studies about different options of code optimization, with a particular focus on vectorization facilities (SIMD instructions) together with computing precision tuning. Our proposals take care, as automatically as possible, of the hardware portability constraints introduced by the grid computing environment that hosts these simulations. Performance evaluation in terms of running-time and accuracy is provided.

        Speaker: Johan Bregeon (Laboratoire Univers et Particules, Université de Montpellier Place Eugène Bataillon - CC 72, CNRS/IN2P3, F-34095 Montpellier, France )
      • 192
        Floating-point profiling of ACTS using Verrou

        Numerical stability is not only critical to the correctness of scientific computations, but also has a direct impact on their software efficiency as it affects the convergence of iterative methods and the available choices of floating-point precision.

        Verrou is a Valgrind-based tool which challenges the stability of floating-point code by injecting random rounding errors in computations (a restricted form of Monte Carlo Arithmetic). Through bisection techniques, this tool is able to locate which parts of a codebase are correlated with significant losses of numerical accuracy.

        The ACTS project aims to establish an experiment-agnostic track reconstruction toolkit. It originates from the ATLAS Run 2 tracking software, has already received strong adoption by FCC-hh, and is being evaluated for possible use by the CLICdp and Belle 2 experiments.

        We have used Verrou to evaluate the numerical stability of ACTS' Runge-Kutta track propagation. As this step is known to be a reconstruction bottleneck for many experiments, we expect significant performance gains to be achievable by introducing reduced-precision arithmetic in it. In this talk, we will present the results of this study and discuss the more general applicability of Monte Carlo arithmetic to High Energy Physics software.

        Speakers: Hadrien Benjamin Grasland (Université Paris-Saclay (FR)) , Bruno Lathuilière
    • T6 - Machine learning and physics analysis: S4 Hall 9 (National Palace of Culture)

      Hall 9

      National Palace of Culture

      Convener: Mr Sergei Gleyser
      • 193
        Machine learning at the Cosmic Frontier

        Within the field of dark matter direct detection, there has been very little penetration of machine learning. This is primarily due to the difficulty of modeling such low-energy detectors for training sets (the keV energies are $10^{-10}$ times those at the LHC). Xenon detectors have been leading the field of dark matter direct detection for the last decade. The current front runner is XENON1T, which has invested heavily in full-chain simulations and a flexible Python-stack reconstruction chain. Accordingly, we have been able to explore what opportunities exist at the Cosmic Frontier with machine learning. I will report on three things. First, a new type of energy and spatial reconstruction algorithm that may enable this technology to reduce a major source of background while opening up double-beta-decay measurements in later such TPCs. This regression is achieved using deep convolutional neural networks. Second, how machine learning has been used to reduce certain nonphysical backgrounds that we have encountered, primarily using random forests. Third, how machine learning can be used in signal processing for classifying the two signal types that we observe. These three machine-learning R&D topics are applicable to all such detectors in our field, and this work constitutes the first known robust exploration of such opportunities for dark matter experiments.

        Speaker: Dr Christopher Tunnell (University of Chicago)
      • 194
        Deep Learning applied to the Cherenkov Telescope Array data analysis

        The Cherenkov Telescope Array (CTA) is the next generation of ground-based gamma-ray telescopes for gamma-ray astronomy. Two arrays will be deployed composed of 19 telescopes in the Northern hemisphere and 99 telescopes in the Southern hemisphere. Observatory operations are planned to start in 2021 but first data from prototypes should be available already in 2019. Due to its very high sensitivity, CTA will record a tremendous amount of data that represent a computing challenge to the reconstruction software. Moreover, the vast majority of triggered events come from protons whereas only photons are of interest for the study of astrophysical sources.
        Deep learning developments in the last few years have shown tremendous improvements in the analysis of data in many domains. Thanks to the huge amount of simulated data, and later of real data, produced by CTA, these algorithms look particularly well suited and very promising. Moreover, the trained neural networks show very good computing performance during execution.
        Here we present a first study of deep learning architectures applied to CTA simulated data to perform the energy reconstruction of the particles as well as their discrimination.

        Speaker: Dr Thomas Vuillaume (LAPP, CNRS, Univ. Savoie Mont-Blanc)
      • 195
        Machine Learning Techniques in the CMS Higgs to Di-muons Search

        With the accumulation of large datasets at an energy of 13 TeV, the LHC experiments can search for rare processes, where the extraction of the signal from the copious and varying Standard Model backgrounds poses increasing challenges. Techniques based on machine learning promise to achieve optimal search sensitivity and signal-to-background ratios for such searches. Taking the search for the Higgs boson decay to two muons in the CMS experiment as an example, we highlight several techniques for event classification and introduce a new application for creating automatic event categories, ultimately used to achieve the highest sensitivity of the analysis. We conclude with insights about strategies for optimal utilization of machine learning algorithms with the data expected to be accumulated during Run 2 of the LHC.

        Speaker: Dimitri Bourilkov (University of Florida (US))
      • 196
        Using Generative Adversarial Networks for fast simulations in the ALICE Experiment

        Data Quality Assurance (QA) is an important aspect of every High-Energy Physics experiment, especially in the case of the ALICE Experiment at the Large Hadron Collider (LHC) whose detectors are extremely sophisticated and complex devices. To avoid processing low quality or redundant data, human experts are currently involved in assessing the detectors’ health during the collisions’ recording. However, for Run 3 where the amount of collected data is 100 times higher, manual detector health checks will not be feasible.

        To mitigate this problem, we use unsupervised machine learning methods, namely Generative Adversarial Networks (GANs), for fast simulations of trajectories reconstructed in the detector after particle collisions. We then compare the results of those simulations and physical parameters reconstructed in the detector to identify anomalies that might lead to the corruption of recorded data. Since GAN-based simulation is fast enough that it can be performed during runtime, our method allows for fully automatic online health checks of running conditions to be performed in real time. Furthermore, since the GAN’s parameters can be updated using the incoming stream of data, our method can automatically adjust itself to changing conditions resulting, for example, from detector aging.

        Speaker: Dr Tomasz Piotr Trzcinski (Warsaw University of Technology (PL))
      • 197
        The use of adversaries for optimal neural network configuration

        Data from B-physics experiments at the KEKB collider have a substantial background from $e^{+}e^{-}\to q \bar{q}$ events. To suppress this we employ deep neural network algorithms. These provide improved discrimination of signal from background. However, the neural network develops a substantial correlation with the $\Delta E$ kinematic variable used to distinguish signal from background in the final fit, due to its relationship with the input variables. The effect of this correlation is counteracted by deploying an adversarial neural network. Overall, the adversarial deep neural network performs better than an unoptimised commercial package, NeuroBayes.

        Speaker: Prof. Martin Sevior (University of Melbourne)
      • 198
        Systematics aware learning: a case study in High Energy Physics

        Experimental science often has to cope with systematic errors that coherently bias data. We study this issue in the analysis of data produced by experiments of the Large Hadron Collider at CERN, as a case of supervised domain adaptation. The dataset used is a representative Higgs to tau tau analysis from ATLAS, released as part of the Kaggle Higgs ML challenge. Perturbations have been introduced into this dataset to mimic systematic errors. A classifier is trained to separate the Higgs signal from the background. The goal is to reduce the sensitivity of the classifier with respect to systematic uncertainty. The figure of merit is the total uncertainty, including statistical and systematic uncertainty.

        Systematics-aware learning should create an efficient representation that is insensitive to perturbations induced by the systematic effects. Different techniques have been experimented with and will be reported: (i) Data Augmentation (training on a mix of data generated by varying the nuisance parameter), (ii) Adversarial Learning (using the Pivot technique, an adversarial network is trained simultaneously with the classifier to reduce the classifier sensitivity), (iii) Tangent Propagation (regularizing the partial derivative of the classifier score with respect to the nuisance parameter).

        Speaker: Mr Victor Estrade (LRI, UPSud, Université Paris-Saclay)
      • 199
        Binary classifier metrics for event selection optimization in HEP

        This presentation discusses some of the metrics used in HEP and other scientific domains for evaluating the relative quality of binary classifiers that are built using modern machine learning techniques. The use of the area under the ROC curve, which is common practice in the evaluation of diagnostic accuracy in the medical field and has now become widespread in many HEP applications, is critically reviewed and compared to other alternatives. In particular, the "precision-recall curve" that is routinely used in the information retrieval domain is pointed out as a more relevant tool for HEP applications, where it is equivalent to the use of signal selection efficiency and purity. Qualitative and quantitative arguments are presented to support this statement, including in particular the argument that the number of True Negatives (rejected background events) is irrelevant in HEP. Some specific metrics that are relevant to the optimization of various HEP analyses are also discussed. In particular, the relevance of the product of purity and efficiency is recalled for point estimation problems, where this metric has a simple interpretation as the fraction of Fisher information about the measured parameter which is retained after the selection of events (globally for counting measurements or locally in each histogram bin in the case of fits to differential distributions). While many of these concepts have been common knowledge since the 1990s, this presentation reviews them in the language of modern machine learning methodologies, also pointing out the many similarities and differences to other scientific domains where ML tools are used.
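
        The relation between these quantities can be made concrete with a short Python sketch. The function names and the event counts are illustrative assumptions, not taken from the presentation.

```python
def selection_metrics(tp, fp, fn):
    """Efficiency (recall), purity (precision) and their product.

    tp: selected signal events, fp: selected background events,
    fn: rejected signal events. True Negatives do not enter at all.
    """
    efficiency = tp / (tp + fn)
    purity = tp / (tp + fp)
    return efficiency, purity, efficiency * purity


def false_positive_rate(fp, tn):
    """Horizontal axis of the ROC curve; this one does depend on tn."""
    return fp / (fp + tn)


# Hypothetical selection: 100 signal events, 80 of them selected,
# together with 20 background events that pass the cuts.
eff, pur, product = selection_metrics(tp=80, fp=20, fn=20)

# The same selection embedded in a small or a huge rejected background:
fpr_small = false_positive_rate(fp=20, tn=80)
fpr_large = false_positive_rate(fp=20, tn=9980)
```

        With these numbers the efficiency and purity are both 0.8 and their product is about 0.64, whatever the number of rejected background events, whereas the false positive rate entering the ROC curve changes from 0.2 to 0.002.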

        Speaker: Andrea Valassi (CERN)
      • 200
        Application of a Convolutional Neural Network for image classification to the analysis of collisions in High Energy Physics

        The application of deep learning techniques using convolutional neural networks to the classification of particle collisions in High Energy Physics is explored. An intuitive approach to transform physical variables, like momenta of particles and jets, into a single image that captures the relevant information, is proposed. The idea is tested using a well known deep learning framework on a simulation dataset, including leptonic ttbar events and the corresponding background at 7 TeV from the CMS experiment at LHC, available as Open Data. This initial test shows competitive results when compared to more classical approaches, like those using feedforward neural networks.

        Speaker: Mr Ignacio Heredia Cacha (Instituto de Física de Cantabria)
    • T8 - Networks and facilities: S4 Hall 10 (National Palace of Culture)

      Hall 10

      National Palace of Culture

      Conveners: Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno) , Wei Yang (SLAC National Accelerator Laboratory (US))
      • 201
        Next Generation of HEP CPU Benchmarks

        Benchmarking is a consolidated activity in High Energy Physics (HEP) computing where large computing power is needed to support scientific workloads. In HEP, great attention is paid to the speed of the CPU in accomplishing high-throughput tasks characterised by a mixture of integer and floating point operations and a memory footprint of a few gigabytes.

        As of 2009, HEP-SPEC06 (HS06) is the benchmark adopted by the WLCG community to describe experiments’ computing requirements, assess data centres’ computing capacity and procure new hardware. It has been defined by the HEPiX Benchmarking Working Group and is based on a subset of the industry standard SPEC CPU2006 benchmark suite.

        In recent years, following the evolution of CPU architectures and the adoption of new programming paradigms, such as multi-threading and vectorization, it has turned out that HS06 is less representative of the relevant applications running on the WLCG infrastructure. In the meantime, in 2017 a new SPEC generation of benchmarks (SPEC CPU 2017) for CPU-intensive workloads has been delivered.

        This report summarises the findings of the HEPiX Benchmarking Working Group in comparing SPEC CPU 2017 as well as other HEP fast benchmarks with the typical WLCG workloads’ mixes.

        Speaker: Domenico Giordano (CERN)
      • 202
        Sharing server nodes for storage and compute

        Based on the observation of low average CPU utilisation of several hundred disk servers in the EOS storage system at CERN, the Batch on EOS Extra Resources (BEER) project developed an approach to utilise these resources for batch processing. After initial proof of concept tests, showing almost no interference between the batch and storage services, a model for production has been developed and implemented. This has been deployed on part of the CERN EOS preproduction instance. The implementation and test results will be presented. The potential for additional resources at the CERN T0 centre is of the order of ten thousand hardware threads, and the approach is also a step towards a hyper-converged infrastructure.

        Speaker: David Smith (CERN)
      • 203
        MONIT: Monitoring the CERN Data Centres and the WLCG Infrastructure

        The new unified monitoring (MONIT) for the CERN Data Centres and for the WLCG Infrastructure is now based on established open source technologies for collection, streaming and storage of monitoring data. The previous solutions, based on in-house development and commercial software, are being replaced with widely recognized technologies such as Collectd, Flume, Kafka, ElasticSearch, InfluxDB, Grafana and others. The monitoring infrastructure, fully based on CERN cloud resources, covers the whole workflow of the monitoring data: from collecting and validating metrics and logs to making them available for dashboards, reports and alarms.

        The deployment in production of this new DC and WLCG monitoring is well under way and this contribution provides a summary of the progress, hurdles met and lessons learned in using these open source technologies. It also focuses on the choices made to achieve the required levels of stability, scalability and performance of the MONIT monitoring service.

        Speaker: Alberto Aimar (CERN)
      • 204
        Challenges, solutions and lessons learnt in 7 years of Service Management at CERN

        CERN has been using ITIL Service Management methodologies and ServiceNow since early 2011. Initially a joint project between just the Information Technology and the General Services Departments, now most of CERN is using this common methodology and tool, and all departments are represented totally or partially in the CERN Service Catalogue.

        We will present a summary of the current situation of Service Management at CERN, as well as its recent evolution. Some topics will be explained in more detail, such as the incident versus request distinction, the Knowledge Base evolution, how to gather feedback about tickets from users, how to bring new services on board, how the recent renewals of the Service Desk and IT Helpdesk contracts were handled, and service privacy management.

        The various challenges, adopted solutions and lessons learnt in these topics will be presented. Finally, ongoing and future work will be presented, such as the upgrade of the CERN Service Portal, increased automation, new integrations and new services being brought on board.

        Speaker: David Martin Clavo (CERN)
      • 205
        Notifications workflows using the CERN IT central messaging infrastructure

        In the CERN IT agile infrastructure, Puppet, CERN IT central messaging infrastructure and the roger application are the key constituents handling the configuration of the machines of the computer centre. The machine configuration at any given moment depends on its declared state in roger and Puppet ensures the actual implementation of the desired configuration by running the puppet agent on the machine at regular intervals, typically every 90 minutes. Sometimes it is preferable that the configuration change is propagated immediately to the targeted machine, ahead of the next scheduled puppet agent run on this machine.
        The particular need of handling notifications in a highly scalable manner for a large scale infrastructure has been satisfied with the implementation of the CERN Megabus architecture, based on the ActiveMQ messaging system.
        The design and implementation of the CERN Megabus architecture are introduced, followed by the implementation of the roger notification workflow. The choice of ActiveMQ is analysed and the message flow between the roger notification producer and the CASTOR, EOS, BATCH and Load Balancing consumers is presented. The employment of predefined consumer modules in order to speed up the on-boarding of new CERN Megabus use cases is also described.

        Speaker: Zhechka Toteva (CERN)
      • 206
        Deploying a "push" model Prometheus

        Prometheus is a leading open source monitoring and alerting tool. Prometheus utilizes a pull model, in the sense that it pulls metrics from monitored entities rather than receiving them as a push. But sometimes this can be a major headache, even without security in mind, when performing network gymnastics to reach your monitored entities. Not only that, but sometimes system metrics might need to be consumed twice (for example, you want to graph them, but at the same time you want to feed them to your fancy Machine Learning Apache Spark).
        Luckily, Prometheus' main market competitor InfluxDB arrives on the scene to help the situation with the main InfluxDB accompanying product: Telegraf.
        Telegraf is a flexible, low-profile, easily extensible metrics collector with a big community.
        In this talk we will offer our insight on implementing a "push" model monitoring system, where metrics can be consumed multiple times, based on Telegraf, Kafka and Prometheus. We will cover some pitfalls we met during the implementation, scaling issues and how we overcame them.
        We will also cover monitoring our monitoring system.

        Speaker: Hristo Umaru Mohamed (CERN)
      • 207
        Detection of erratic behavior in load balanced clusters of servers using a machine learning based method

        With the explosion of the number of distributed applications, a new dynamic server environment emerged, grouping servers into clusters whose utilization depends on the current demand for the application.

        To provide reliable and smooth services it is crucial to detect and fix possible erratic behavior of individual servers in these clusters. Use of standard techniques for this purpose delivers suboptimal results.

        We have developed a method based on machine learning techniques which allows the detection of outliers indicating a possible problematic situation. The method inspects the performance of the rest of a cluster and provides system operators with additional information which allows them to quickly identify the failing nodes. We applied this method to develop a Spark application using the CERN MONIT architecture, and with this application we analyzed monitoring data from multiple clusters of dedicated servers in the CERN data center.

        In this contribution we present our results achieved with this new method and with the Spark application for analytics of CERN monitoring data.
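
        The principle of judging one server against the rest of its cluster can be sketched in a few lines of Python. This is a simple statistical baseline (a robust z-score built from the median and the median absolute deviation), not the machine learning method of this contribution; the function name, the threshold and the sample values are assumptions.

```python
from statistics import median


def flag_outliers(metric_by_server, threshold=3.5):
    """Return the servers whose metric deviates strongly from the cluster.

    Uses a robust z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation of the cluster.
    """
    values = list(metric_by_server.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # Degenerate cluster: every server that differs from the rest is flagged.
        return [name for name, v in metric_by_server.items() if v != centre]
    return [
        name
        for name, v in metric_by_server.items()
        if 0.6745 * abs(v - centre) / mad > threshold
    ]


# Hypothetical CPU load of five servers behind the same load balancer.
cpu_load = {
    "node1": 0.52,
    "node2": 0.48,
    "node3": 0.50,
    "node4": 0.51,
    "node5": 0.97,
}
suspects = flag_outliers(cpu_load)
```

        Here only node5 is reported, because its load differs from the cluster median by far more than the typical spread of the other nodes.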

        Speaker: Martin Adam (Acad. of Sciences of the Czech Rep. (CZ))
      • 208
        Monitoring System of the AMS Science Operation Centre

        The Alpha Magnetic Spectrometer (AMS) is a high energy physics experiment that has been installed and operating on board the International Space Station (ISS) since May 2011 and is expected to last through 2024 and beyond. The Science Operation Centre is in charge of the offline computing for the AMS experiment, including flight data production, Monte-Carlo simulation, data management, data backup, etc. This paper introduces the design and implementation of the new monitoring system of the AMS SOC, from the monitoring data collection (backend) to the visualisation (frontend). The system is designed to be distributed, collecting data from various sources in different AMS computing centres, and intelligent, attempting to analyse the causes of service incidents and to suggest resolutions.

        Speaker: Baosong Shan (Beihang University (CN))
    • Posters: S1
      • 209
        FELIX: the new detector interface for the ATLAS experiment

        During the next major shutdown from 2019-2021, the ATLAS experiment at the LHC at CERN will adopt the Front-End Link eXchange (FELIX) system as the interface between the data acquisition, detector control and TTC (Timing, Trigger and Control) systems and new or updated trigger and detector front-end electronics. FELIX will function as a router between custom serial links from front end ASICs and FPGAs to data collection and processing components via a commodity switched network. Links may aggregate many slower links or be a single high bandwidth link. FELIX will also forward the LHC bunch-crossing clock, fixed latency trigger accepts and resets received from the TTC system to front-end electronics. The FELIX system uses commodity server technology in combination with FPGA-based PCIe I/O cards. The FELIX servers will run a software routing platform serving data to network clients. Commodity servers connected to FELIX systems via the same network will run the new Software Readout Driver (SW ROD) infrastructure for event fragment building and buffering, with support for detector or trigger specific data processing, and will serve the data upon request to the ATLAS High Level Trigger for Event Building and Selection. This presentation will cover the design and status of FELIX, the SW ROD, and results of early performance testing.

        Speaker: Serguei Kolos (University of California Irvine (US))
      • 210
        A cluster finding algorithm for free-streaming input data

        In position-sensitive detectors with segmented readout (pixels or strips), charged particles in general activate several adjacent read-out channels. The first step in the reconstruction of the hit position is thus to identify clusters of active channels associated with one particle crossing the detector. In conventionally triggered systems, where the association of raw data to events is given by a hardware trigger, this is an easy-to-solve problem. It becomes more involved, however, in untriggered, free-streaming read-out systems like the one employed by the CBM experiment. Here, the time coordinate of the single-channel measurement must be taken into account to decide whether neighbouring active channels belong to a cluster. A simple extension of well-known cluster finding algorithms is not satisfactory because it involves increasing combinatorics, which are prohibitive for reconstruction in real time. In this talk, we present a cluster-finding solution for the Silicon Tracking System of the CBM experiment which avoids any combinatorics or loops over detector channels. Its execution time is thus independent of the size of the data packages (time slices) delivered by the data acquisition, making it suitable for use in online reconstruction.

        Speaker: Volker Friese (GSI - Helmholtzzentrum fur Schwerionenforschung GmbH (DE))
      • 211
        A fully modular framework for detector simulations in ROOT

        To study the performance of the Micro Vertex Detector (MVD), a fully modularized framework has been developed. The main goal of this framework is easy adaptability to new sensor specifications or changes in the geometry. This had to be provided under additional tight constraints on performance and memory usage.
        To achieve these goals, a framework has been built which decouples the functional model of the detector from the geometry model. This allows the framework to be adapted to any given functional model and geometry without any need for changes in the software.
        The active material of the detector is used as the smallest building block of this framework.
        The framework will be presented using the example of the MVD in the Cbm-Root-Software, and an overview of the performance will be given. Possibilities for the use of multithreading will also be shown.

        Speaker: Philipp Sitzmann
      • 212
        A new approach for ATLAS Athena job configuration

        The offline software framework of the ATLAS experiment (Athena) consists of many small components of various types like Algorithm, Tool or Service. To assemble these components into an executable application for event processing, a dedicated configuration step is necessary. The configuration of a particular job depends on the workflow (simulation, reconstruction, high-level trigger, overlay, calibration, analysis ...) and the input data (real or simulated data, beam-energy, ...) leading to a large number of possible configurations.

        The configuration step is done by executing python code. The resulting configuration depends on optionally pre-set flags as well as meta-data about the data to be processed that is found by peeking into the input file and even into databases. For the python configuration code, there is almost no structure enforced, leaving the full power of python to the user.

        While this approach did work, it also proved to be error-prone and complicated to use. It also leads to jobs containing more components than they actually need.

        For LHC Run 3 a more robust system is envisioned. It is still based on python but enforces some structure and emphasises modularity. Concepts of the current system that were found to be useful are maintained, in particular the auto-configuration based on the input file.

        This contribution will briefly report about the configuration system used during LHC Run 1 and Run 2 and detail the prototype of an improved system to be used in Run 3 and beyond.
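
        To illustrate what "enforcing some structure" on Python configuration code could look like, here is a minimal sketch under stated assumptions. It is not the Athena API: `Flags`, `ConfigFragment` and the component names are hypothetical. Each configuration function returns a self-contained fragment, and merging fragments deduplicates identical components and rejects conflicting ones.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Flags:
    """Hypothetical job-wide flags, e.g. filled from input-file metadata."""
    is_mc: bool


@dataclass
class ConfigFragment:
    """A set of configured components: name -> properties."""
    components: dict = field(default_factory=dict)

    def add(self, name, **props):
        return self.merge(ConfigFragment({name: props}))

    def merge(self, other):
        for name, props in other.components.items():
            # The same component may be requested twice, but only with
            # identical settings; anything else is a configuration error.
            if self.components.get(name, props) != props:
                raise ValueError(f"conflicting settings for {name}")
            self.components[name] = props
        return self


def calo_cfg(flags):
    return ConfigFragment().add("CaloCellMaker", use_mc_calibration=flags.is_mc)


def jet_cfg(flags):
    # A fragment pulls in exactly what it depends on, nothing more.
    return ConfigFragment().merge(calo_cfg(flags)).add("JetFinder", radius=0.4)


def tau_cfg(flags):
    return ConfigFragment().merge(calo_cfg(flags)).add("TauFinder", min_pt_gev=10)


flags = Flags(is_mc=True)
job = ConfigFragment().merge(jet_cfg(flags)).merge(tau_cfg(flags))
```

        Because every fragment declares its own dependencies, the final job contains only the components that were actually requested, and the shared calorimeter component appears once.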

        Speaker: Walter Lampl (University of Arizona (US))
      • 213
        A new cluster finder for ALICE MUON arm for Run3

        Muon reconstruction is currently all done offline for ALICE. In Run3 this is supposed to move online, with ALICE running in continuous readout with a minimum bias Pb-Pb interaction rate of 50kHz.

        There are numerous obstacles to getting the muon software to achieve the required performance; the muon cluster finder is being replaced and moved to run on a GPU inside the new O2 computing farm.
        We will give an overview of where the muon reconstruction currently spends most of its time, how the new cluster finder has been redone in line with the O2 performance requirements, and how it improves the performance of the current offline and future online muon reconstruction.

        Speaker: Mr Sean Murray (University of Cape Town (ZA))
      • 214
        Addressing scalability with message queues: architecture and use cases for DIRAC interware

        The Message Queue architecture is an asynchronous communication scheme that provides an attractive solution for certain scenarios in the distributed computing model. Introducing an intermediate component (the queue) between the interacting processes decouples the end-points, making the system more flexible and providing high scalability and redundancy. Message queue brokers such as RabbitMQ, ActiveMQ or Kafka are proven technologies widely used nowadays.

        DIRAC is a general-purpose Interware software for distributed computing systems, which offers a common interface to a number of heterogeneous providers and guarantees transparent and reliable usage of the resources. The DIRAC platform has been adopted by several scientific projects, including High Energy Physics communities like LHCb, the Linear Collider and Belle2.

        A generic Message Queue interface has been incorporated into the DIRAC framework to help solve the scalability challenges that must be addressed during LHC Run3 starting in 2021. It allows the MQ scheme to be used for message exchange among the DIRAC components, or for communication with third-party services. In this contribution we will describe the integration of MQ systems with DIRAC and show several use cases. The focus will be on the incorporation of MQ into the pilot logging system. Message Queues are also foreseen as a backbone of the DIRAC component logging system and of monitoring.

        The results of the first performance tests will be presented.
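
        The decoupling that a queue provides can be illustrated with a small in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for a real broker such as RabbitMQ and does not use the DIRAC MQ interface; the function names and the message format are assumptions. Producers (here, mock pilots) never talk to the consumer directly, so either side can be scaled or replaced independently.

```python
import queue
import threading


def pilot(q, pilot_id, n_lines):
    """Producer: publishes log lines without knowing who consumes them."""
    for i in range(n_lines):
        q.put(f"pilot-{pilot_id}: log line {i}")


def log_collector(q, sink):
    """Consumer: drains the queue until it sees the None sentinel."""
    while True:
        msg = q.get()
        if msg is None:
            break
        sink.append(msg)


def run_demo(n_pilots=3, n_lines=4):
    q = queue.Queue()
    sink = []
    collector = threading.Thread(target=log_collector, args=(q, sink))
    collector.start()
    pilots = [threading.Thread(target=pilot, args=(q, p, n_lines))
              for p in range(n_pilots)]
    for p in pilots:
        p.start()
    for p in pilots:
        p.join()
    q.put(None)  # all producers finished: tell the consumer to stop
    collector.join()
    return sink


messages = run_demo()
```

        The order in which messages from different pilots arrive is not fixed, which is typical for asynchronous messaging; only the order within one producer is preserved here.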

        Speaker: Wojciech Jan Krzemien (National Centre for Nuclear Research (PL))
      • 215
        Adoption of ARC-CE and HTCondor at GridKa Tier 1

        The GridKa Tier 1 data and computing center hosts a significant share of WLCG processing resources. Providing these resources to all major LHC and other VOs requires an efficient, scalable and reliable cluster management. To satisfy this, GridKa has recently migrated its batch resources from CREAM-CE and PBS to ARC-CE and HTCondor. This contribution discusses the key highlights of the adoption of this middleware at the scale of a European Tier 1 center:

        As the largest WLCG Tier 1 using the ARC-CE plus HTCondor stack, GridKa is exemplary for migrating more than 20,000 cores over the time span of just a few weeks. Supporting multiple VOs, we have extensively studied the constraints and possibilities of scheduling jobs with vastly different requirements. We present a robust and maintainable optimisation of resource utilisation which still respects constraints desired by VOs. Furthermore, we explore the dynamic extension of our batch system, integrating cloud resources with a lightweight configuration mechanism.

        Speaker: Matthias Jochen Schnepf (KIT - Karlsruhe Institute of Technology (DE))
      • 216
        Advanced Analytics service to enhance workflow control at the ATLAS Production System

        Modern workload management systems that are responsible for central data production and processing in High Energy and Nuclear Physics experiments have highly complicated architectures and require a specialized control service for balancing resources and processing components. Such a service represents a comprehensive set of analytical tools, management utilities and monitoring views aimed at providing a deep understanding of internal processes, and is considered an extension of the situational awareness analytic service. Its key points are analysis of task processing, e.g., selection and regulation of the key task features that affect its processing the most; modeling of processed data lifecycles for further analysis, e.g., generating guidelines for a particular stage of data processing; and forecasting processes with a focus on data and task states as well as on the management system itself, e.g., to detect the source of any potential malfunction.
        The prototype of the advanced analytics service will be an essential part of the analytical service of the ATLAS Production System (ProdSys2). The advanced analytics service uses tools such as Time-To-Complete estimation for processing units (i.e., tasks and chains of tasks) to control the processing state and to highlight abnormal operations and executions. The obtained metrics are used in decision-making processes to regulate the system behaviour and resource consumption.

        Speaker: Mikhail Titov (National Research Centre Kurchatov Institute (RU))
      • 217
        Advanced Scheduling in IaaS Clouds

        IaaS clouds brought us greater flexibility in managing computing infrastructures, enabling us to mix different computing environments (e.g. Grid systems, web servers and even personal desktop-like systems) in the form of virtual machines (VMs) on the same hardware. The new paradigm automatically increased efficiency by replacing single-task dedicated physical servers with a greater number of smaller virtual servers, each running its own computing tasks. In this work we describe an approach to further improve hardware utilization in IaaS clouds through dynamic reallocation of VMs and overcommitment. This approach introduces new risks and challenges in automated cloud management, but can help increase the utilization efficiency of cloud resources, which may become a serious problem given the constant growth of computing infrastructures and the variety of modern workloads. We introduce a meta-scheduling framework and an example scheduling strategy that is able to handle dynamic reallocation of virtual machines and currently supports the OpenNebula platform. The framework has a modular design and can be extended to support other platforms (e.g. OpenStack) which provide an API to communicate with the cloud.
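
        The effect of overcommitment on VM placement can be shown with a small sketch. It is not the framework's scheduling strategy and does not call any OpenNebula API; the function `plan_placement`, the first-fit-decreasing heuristic and the CPU-only resource model are assumptions chosen for brevity.

```python
def plan_placement(vms, hosts, overcommit=1.0):
    """Assign VMs to hosts with a first-fit-decreasing heuristic.

    vms:   dict of VM name -> requested cores
    hosts: dict of host name -> physical cores
    overcommit: factor by which requested cores may exceed physical cores
    Returns a dict of VM name -> host name (None if the VM does not fit).
    """
    free = {host: cores * overcommit for host, cores in hosts.items()}
    placement = {}
    # Largest VMs first; ties broken by name to keep the result deterministic.
    for vm, need in sorted(vms.items(), key=lambda kv: (-kv[1], kv[0])):
        placement[vm] = None
        for host in sorted(free):
            if free[host] >= need:
                free[host] -= need
                placement[vm] = host
                break
    return placement


vms = {"a": 4, "b": 4, "c": 4}
strict = plan_placement(vms, {"h1": 8}, overcommit=1.0)
relaxed = plan_placement(vms, {"h1": 8, "h2": 8}, overcommit=1.5)
```

        Without overcommitment the third VM does not fit on the single 8-core host. With a factor of 1.5 all three VMs are consolidated on `h1`, leaving `h2` idle, which is the kind of gain (and risk, if all VMs become busy at once) the abstract refers to.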

        Speaker: Nikita Balashov (JINR)
      • 218
        ALICE O2 setup for TPC read-out tests for the LHC run 3

        ALICE (A Large Ion Collider Experiment) is one of the four big experiments at the Large Hadron Collider (LHC). For ALICE Run 3 there will be a major upgrade for several detectors as well as the compute infrastructure with a combined Online-Offline computing system (O2) to support continuous readout at much higher data rates than before (3TB/s). The ALICE Time Projection Chamber (TPC) is the main contributor to the data volume which is read out via the Common Readout Unit (CRU).
        For the first read-out test of a complete TPC inner read-out chamber (IROC) at the beginning of 2018 and the following tests of a complete TPC sector at the end of Q2 2018, we built an O2 prototype system using the O2 software stack which is currently being developed for LHC run 3. The setup consists of 5 First-Level-Processing servers (FLPs), with two CRUs each, as input nodes and 2 Event-Processing-Nodes (EPNs). With this test setup we can evaluate the feasibility of the planned software tools on a small scale as well as the read-out with the O2 processing and analysis framework. During the long shutdown this setup will be extended for pre-commissioning tests of the upgraded TPC detector.
        We give a summary of the deployed test setup for the TPC as well as the experiences so far and an outlook of the future steps towards the final O2 system.

        Speaker: Johannes Lehrbach (Johann-Wolfgang-Goethe Univ. (DE))
      • 219
        AlphaTwirl: a python library for summarizing event data into multi-dimensional categorical data

        AlphaTwirl is a python library that loops over event data and summarizes them into multi-dimensional categorical (binned) data as data frames. Event data, input to AlphaTwirl, are data with one entry (or row) for one event: for example, data in ROOT TTree with one entry per collision event of an LHC experiment. Event data are often large -- too large to be loaded in memory -- because they have as many entries as events. Multi-dimensional categorical data, the output of AlphaTwirl, have one row for one category. They are usually small -- small enough to be loaded in memory -- because they only have as many rows as categories. Users can, for example, import them as data frames into R and pandas, which usually load all data in memory, and can perform categorical data analyses with a rich set of data operations available in R and pandas. In this presentation, I will show (a) an example workflow of data analysis using AlphaTwirl and data frames, (b) the user interface of AlphaTwirl, e.g., how to specify conditions of event selection, binning and categories, and methods to summarize data in each category, and (c) features of implementation, such as concurrency in looping over large event data. In addition, I will mention particular analyses in CMS using AlphaTwirl. I will also discuss possibilities for future development.
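
        The summarizing step described above can be illustrated without AlphaTwirl itself. The following sketch does not use the AlphaTwirl interface; the function `summarize`, the event fields `njets` and `met`, and the 50-unit bin width are assumptions. Events are consumed one at a time, so only the small table of categories is held in memory.

```python
import math
from collections import Counter


def summarize(events, keys, binnings):
    """Count events per category.

    events:   iterable of dicts, one per event (may be a generator)
    keys:     event fields that define the categories
    binnings: one function per key mapping a value to its bin label
    Returns one row (dict) per populated category, sorted by category.
    """
    counts = Counter()
    for event in events:
        category = tuple(b(event[k]) for k, b in zip(keys, binnings))
        counts[category] += 1
    return [dict(zip(keys, category), n=n)
            for category, n in sorted(counts.items())]


def event_stream():
    # Stand-in for reading a large file entry by entry.
    yield {"njets": 2, "met": 120.0}
    yield {"njets": 2, "met": 130.0}
    yield {"njets": 3, "met": 40.0}


rows = summarize(
    event_stream(),
    keys=["njets", "met"],
    binnings=[lambda n: n, lambda x: math.floor(x / 50) * 50],
)
```

        The resulting list of rows has one entry per category and could be handed to a data-frame library, for example with `pandas.DataFrame(rows)`.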

        Speaker: Dr Tai Sakuma (University of Bristol (GB))
      • 220
        An integrated system for data quality and conditions assessment for the ATLAS Tile Calorimeter

        The ATLAS experiment records data from the proton-proton collisions produced by the Large Hadron Collider (LHC). The Tile Calorimeter is the hadronic sampling calorimeter of ATLAS in the region |eta| < 1.7. It uses iron absorbers and scintillators as active material. Jointly with the other calorimeters it is designed for reconstruction of hadrons, jets, tau-particles and missing transverse energy. It also assists in muon identification. The TileCal is regularly monitored by several different systems, which were independently developed to meet distinct collaboration’s requirements. Any problems are reported and immediately investigated. The data quality (DQ) efficiency achieved was 100% in 2015, 98.9% in 2016 and 99.4% in 2017. Changes to detector status or calibrations are entered into a dedicated conditions database called COOL. Experts maintain the tools used by DQ shifters and the calibration teams during normal operation, and prepare new conditions for data reprocessing and MC production campaigns.

        Tile-in-ONE is a unique system which integrates all the web DQ monitoring and calibration systems and tools used by the TileCal, with a standard development technology and documentation. It also aims to spare the user from having to know where and how to get the wanted data by providing a user-friendly interface. It is based on a server containing a core, which represents the basic framework that loads the configuration, manages user settings and loads plug-ins at run-time; a set of services, which provide common features to be used by the plug-ins, such as connectors to different databases and resources; and the plug-ins themselves, which provide features at the top-level layer for the users. Moreover, a web environment is being designed to allow collaborators to develop their own plug-ins, test them and add them to the system. To make this possible, an API is used that allows any kind of application to be interpreted and displayed in a standard way.

        Speaker: Juraj Smiesko (Comenius University (SK))
      • 221
        Application of Deep Learning on Integrating Prediction, Provenance, and Optimization

        We investigate novel approaches using Deep Learning (DL) for efficient execution of workflows on distributed resources. Specifically, we studied the use of DL for job performance prediction, performance classification, and anomaly detection to improve the utilization of the computing resources.

        • Performance prediction:
          • capture performance of workflows on multiple resources
          • consider intra-node task assignment
        • Performance classification: prediction of job success/failure
          • Predict at regular intervals whether a job succeeds or fails - site reliability
          • Long short-term memory (LSTM) neural networks
        • Performance anomaly detection:
          • Example: functions that consume unexpectedly large/small amounts of time

        We used the Belle II distributed computing workflow and modifications to the DIRAC system for these studies.

        Speaker: Dr Malachi Schram (Pacific Northwest National Laboratory)
      • 222
        ATLAS Distributed Computing: Its Central Services core

        The ATLAS Distributed Computing (ADC) Project is responsible for the off-line processing of data produced by the ATLAS experiment at the Large Hadron Collider (LHC) at CERN. It facilitates data and workload management for ATLAS computing on the Worldwide LHC Computing Grid (WLCG).

        ADC Central Services operations (CSops) is a vital part of ADC, responsible for the deployment and configuration of services needed by ATLAS computing and operation of those services on CERN IT infrastructure, providing knowledge of CERN IT services to ATLAS service managers and developers, and supporting them in case of issues. Currently this entails the management of thirty-seven different OpenStack projects, with more than five thousand cores allocated for these virtual machines, as well as overseeing the distribution of twenty-nine petabytes of storage space in EOS for ATLAS.

        As the LHC begins to get ready for the next long shut-down, which will bring in many new upgrades to allow for more data to be captured by the on-line systems, CSops must not only continue to support the existing services, but plan ahead for the expected increase in data, users, and services that will be required. In this paper we explain the current state of CSops as well as the systems put in place to maintain the services provided.

        Speaker: Chris Lee (University of Cape Town (ZA))
      • 223
        ATLAS Global Shares Implementation in the PanDA Workload Management System

        PanDA (Production and Distributed Analysis) is the workload management system for ATLAS across the Worldwide LHC Computing Grid. While analysis tasks are submitted to PanDA by over a thousand users following personal schedules (e.g. PhD or conference deadlines), production campaigns are scheduled by a central Physics Coordination group based on the organization’s calendar. The Physics Coordination group needs to allocate the amount of Grid resources dedicated to each activity, in order to manage sharing of CPU resources among various parallel campaigns and to make sure that results can be achieved in time for important deadlines.
        While dynamic and static shares on batch systems have been around for a long time, we are trying to move away from local resource partitioning and manage shares at a global level in the PanDA system. The global solution is not straightforward, given different requirements of the activities (number of cores, memory, I/O and CPU intensity), the heterogeneity of Grid resources (site/HW capabilities, batch configuration and queue setup) and constraints on data locality. We have therefore started the Global Shares project that follows a requirements-driven multi-step execution plan, starting from definition of nestable shares, implementing share-aware job dispatch, aligning internal processes with global shares and finally implementing a pilot stream control for controlling the batch slots while keeping late binding.
        This contribution will explain the development work and architectural changes in PanDA to implement Global Shares, and provide an operational point of view with the difficulties we found along the way.
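
        The notion of nestable shares can be made concrete with a short sketch. It is not PanDA code: the share names, the percentages and the functions `flatten_shares` and `next_share` are hypothetical. Each share is given as a percentage of its parent, the tree is flattened into global target fractions, and a share-aware dispatcher would then favour the leaf that is furthest below its target.

```python
def flatten_shares(tree, parent_fraction=1.0, prefix=""):
    """Turn nested shares into global target fractions per leaf.

    tree: dict of name -> (percent_of_parent, children_dict)
    """
    targets = {}
    for name, (percent, children) in tree.items():
        fraction = parent_fraction * percent / 100.0
        path = f"{prefix}/{name}" if prefix else name
        if children:
            targets.update(flatten_shares(children, fraction, path))
        else:
            targets[path] = fraction
    return targets


def next_share(targets, running):
    """Pick the leaf share with the largest deficit relative to its target."""
    total = sum(running.values()) or 1
    return max(targets, key=lambda s: targets[s] - running.get(s, 0) / total)


# Hypothetical share tree and current slot usage.
tree = {
    "production": (80, {"simulation": (50, {}), "reprocessing": (50, {})}),
    "analysis": (20, {}),
}
targets = flatten_shares(tree)
running = {"production/simulation": 50, "production/reprocessing": 10, "analysis": 40}
chosen = next_share(targets, running)
```

        In this example reprocessing holds 10% of the running slots against a target of 40%, so it is the share that would be served next.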

        Speaker: Fernando Harald Barreiro Megino (University of Texas at Arlington)
      • 224
        ATLAS Software Installation on Supercomputers

        PowerPC and high performance computers (HPC) are important resources for computing in the ATLAS experiment. The future LHC data processing will require more resources than Grid computing, currently using approximately 100,000 cores at well over 100 sites, can provide. Supercomputers are extremely powerful as they use resources of hundreds of thousands of CPUs joined together. However, their architectures have different instruction sets. ATLAS binary software distributions for x86 chipsets do not fit these architectures, as emulation of these chipsets results in a huge performance loss. This presentation describes the methodology of ATLAS software installation from source code on supercomputers. The installation procedure includes downloading the ATLAS simulation release code, with 0.7 million lines of C++ and Python, as well as the source code of more than 50 external packages, such as ROOT and Geant4, followed by compilation and rigorous unit and integration testing. The presentation reports the application of this procedure at Titan HPC and Summit PowerPC at the Oak Ridge Leadership Computing Facility (OLCF).

        Speaker: Alexander Undrus (Brookhaven National Laboratory (US))
      • 225
        ATLAS utilisation of the Czech national HPC center

        The Czech national HPC center IT4Innovations, located in Ostrava, provides two HPC systems, Anselm and Salomon. The Salomon HPC has been among the hundred most powerful supercomputers on Earth since its commissioning in 2015. Both clusters were tested for usage by the ATLAS experiment for running simulation jobs. Several thousand core hours were allocated to the project for tests, but the main aim is to use free resources waiting for large parallel jobs of other users.

        Multiple strategies for ATLAS job execution were tested on the Salomon and Anselm HPCs. The solution described herein is based on the ATLAS experience with other HPC sites. ARC Compute Element (ARC-CE) installed at the grid site in Prague is used for job submission to Salomon. The ATLAS production system submits jobs to the ARC-CE via ARC Control Tower (aCT). The ARC-CE processes job requirements from aCT and creates a script for a batch system which is then executed via ssh. Sshfs is used to share scripts and input files between the site and the HPC cluster. The software used to run jobs is rsynced from the site's CVMFS installation to the HPC's scratch space every day to ensure availability of recent software.

        Speaker: Michal Svatos (Acad. of Sciences of the Czech Rep. (CZ))
      • 226
        Automating calibration at the Belle II detector

        In 2018 the Belle II detector will begin collecting data from $e^+e^-$ collisions at the SuperKEKB electron-positron collider at the High Energy Accelerator Research Organization (KEK, Tsukuba, Japan). Belle II aims to collect a data sample 50 times larger than the previous generation of B-Factories, taking advantage of the SuperKEKB design luminosity of $8\times10^{35}\,\mathrm{cm}^{-2}\,\mathrm{s}^{-1}$.

        It is crucial for the detector components at Belle II to be calibrated quickly and accurately to provide analysts with the best possible data promptly so that Belle II can remain competitive in key analyses. A calibration framework (CAF) has been developed in the Belle II Analysis Software Framework (basf2). This allows basf2 users to write and run calibration code themselves in an easy and common format, while taking advantage of parallelized submission to local batch systems.

        A system to perform the automatic submission and monitoring of CAF processes on newly acquired data is currently in development at Belle II. In recent years workflow management software (WMS) using Directed Acyclic Graphs (DAGs) has become increasingly popular as a way to specify complex and potentially dynamic workflows. A WMS + DAG system is used in the automated CAF submission system. The current status of this, and a brief review of the various WMS projects investigated, will be presented.
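
        How a DAG specifies the order of calibration steps can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is neither the CAF nor any of the WMS projects mentioned above; the calibration step names and the function `run_in_waves` are assumptions. Each node lists the steps it depends on, and steps whose dependencies are all done form a wave that could be submitted in parallel.

```python
from graphlib import TopologicalSorter


def run_in_waves(dag):
    """Return the steps grouped into waves that can run concurrently.

    dag: dict of step -> set of steps it depends on
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # a real system would mark steps done as jobs finish
    return waves


# Hypothetical calibration workflow.
dag = {
    "alignment": {"collect_data"},
    "calo_calib": {"collect_data"},
    "tracking_calib": {"alignment"},
    "upload_payloads": {"tracking_calib", "calo_calib"},
}
waves = run_in_waves(dag)
```

        Here the alignment and calorimeter calibrations do not depend on each other and therefore end up in the same wave.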

        Speaker: David Dossett (University of Melbourne)
      • 227
        Automation and Testing for Simplified Software Deployment

        Creating software releases is one of the more tedious occupations in the life of
        a software developer. For this purpose we have tried to automate as many of the
        repetitive tasks involved from getting the commits to running the software as
        possible. For this simplification we rely in large part on free collaborative
        services built around GitHub: issue tracking, code review (GitHub), continuous
        integration (Travis-CI), static code analysis (Coverity). The dependencies and
        compilers used in the continuous integration are obtained via CVMFS used inside
        Docker. This allows one to run any desired compiler version (e.g., gcc 6.2,
        llvm 3.9) or tool (e.g., clang-format, pylint). To create tags for the software
        package the powerful GitHub API is used. A script was developed that first
        collates the release notes from the description of each pull request, commits
        the release notes file, and finally makes a tag. This moves the burden of
        writing release notes to the individual developer and away from the package
        maintainer. The deployment of software releases to CVMFS is handled via
        GitLab-CI. When a tag is made the software is built and automatically
        deployed. In this presentation we will describe the software infrastructure used
        for the iLCSoft and iLCDirac projects, which are used by CLICdp and the ILC detector
        collaborations, and give many examples of automation which might be use