9–13 Jul 2018
Sofia, Bulgaria
Europe/Sofia timezone

Session

T3 - Distributed computing

9 Jul 2018, 11:00
Hall 7 (National Palace of Culture)

Conveners

T3 - Distributed computing: Experiment Frameworks and HPC

  • David Cameron (University of Oslo (NO))

T3 - Distributed computing: Facilities

  • Julia Andreeva (CERN)

T3 - Distributed computing: Testing, Monitoring and Accounting

  • Julia Andreeva (CERN)

T3 - Distributed computing: Performance Optimization, Security and Federated Identity

  • David Cameron (University of Oslo (NO))

T3 - Distributed computing: Experiment Frameworks and Operational Experiences (1)

  • Hannah Short (CERN)

T3 - Distributed computing: Experiment Frameworks and Operational Experiences (2)

  • Julia Andreeva (CERN)

T3 - Distributed computing: Computing Models and Future Views

  • David Cameron (University of Oslo (NO))

  1. David Schultz (University of Wisconsin-Madison)
    09/07/2018, 11:00
    Track 3 – Distributed computing
    presentation

    The IceCube Neutrino Observatory is a neutrino detector located at the South Pole. Here we present experiences acquired when using HTCondor to run IceCube’s GPU simulation worksets on the Titan supercomputer. Titan is a large supercomputer geared for High Performance Computing (HPC). Several factors make it challenging to use Titan for IceCube’s High Throughput Computing (HTC) workloads: (1) Titan...
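
    As a hedged illustration of how such HTC workloads are typically described to HTCondor, the sketch below queues independent GPU jobs through the HTCondor Python bindings; the executable name and resource values are hypothetical, not taken from the talk.

    ```python
    # Minimal sketch (hypothetical values): queueing many independent
    # GPU simulation jobs through the HTCondor Python bindings.
    import htcondor

    job = htcondor.Submit({
        "executable": "run_photon_sim.sh",   # hypothetical wrapper script
        "arguments": "$(ProcId)",
        "request_gpus": "1",                 # one GPU per job
        "request_cpus": "1",
        "request_memory": "4GB",
        "output": "sim.$(ProcId).out",
        "error": "sim.$(ProcId).err",
        "log": "sim.log",
    })

    schedd = htcondor.Schedd()               # local scheduler daemon
    result = schedd.submit(job, count=100)   # 100 independent jobs
    print("submitted cluster", result.cluster())
    ```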

  2. Gianfranco Sciacca
    09/07/2018, 11:15
    Track 3 – Distributed computing
    presentation

    Predictions of the requirements for LHC computing for Run 3 and Run 4 (HL-LHC) over the course of the next 10 years show a considerable gap between required and available resources, assuming budgets will globally remain flat at best. This will require some radical changes to the computing models for the data processing of the LHC experiments. The use of large-scale computational...

  3. Wahid Bhimji (Lawrence Berkeley National Lab. (US))
    09/07/2018, 11:30
    Track 3 – Distributed computing
    presentation

    Many HEP experiments are moving beyond experimental studies to making large-scale production use of HPC resources at NERSC, including the Knights Landing architecture on the Cori supercomputer. These include ATLAS, ALICE, Belle2, CMS, LSST-DESC, and STAR, among others. Achieving this has involved several different approaches and has required innovations on both the NERSC and the experiments’ sides....

  4. Alexei Klimentov (Brookhaven National Laboratory (US))
    09/07/2018, 11:45
    Track 3 – Distributed computing
    presentation

    The Titan supercomputer at Oak Ridge National Laboratory prioritizes the scheduling of large leadership class jobs, but even when the supercomputer is fully loaded and large jobs are standing in the queue to run, 10 percent of the machine remains available for a mix of smaller jobs, essentially ‘filling in the cracks’ between the very large jobs. Such utilisation of the computer resources is...
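
    The "filling in the cracks" utilisation described above is essentially backfill scheduling. Below is a toy Python sketch of the idea with invented job sizes; it is not ORNL's actual scheduler.

    ```python
    # Toy backfill selection: greedily pack queued small jobs into the
    # nodes left idle while large leadership-class jobs wait.
    def backfill(idle_nodes, queued_jobs):
        """queued_jobs: list of (job_id, nodes_needed), any order."""
        selected = []
        for job_id, need in sorted(queued_jobs, key=lambda j: j[1]):
            if need <= idle_nodes:
                selected.append(job_id)
                idle_nodes -= need
        return selected

    # Example: roughly 10% of Titan's 18,688 nodes sitting idle.
    print(backfill(1868, [("sim-a", 500), ("sim-b", 1200), ("big", 4000)]))
    # -> ['sim-a', 'sim-b']
    ```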

  5. Pavlo Svirin
    09/07/2018, 12:00
    Track 3 – Distributed computing
    presentation

    PanDA executes millions of ATLAS jobs a month on Grid systems with more than 300k cores. Currently, PanDA is compatible with only a few HPC resources due to differing edge services and operational policies; it does not implement the pilot paradigm on HPC, and does not dynamically optimize resource allocation among queues. We integrated the PanDA Harvester service and the RADICAL-Pilot (RP) system...
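
    For readers unfamiliar with the pilot paradigm mentioned above, here is a minimal conceptual sketch: a pilot process lands on a compute node first, then pulls real payloads until no work remains. The function names are illustrative, not PanDA's API.

    ```python
    # Conceptual pilot loop: fetch payloads from the workload manager
    # until the queue is empty, then exit and release the node.
    def fetch_payload(queue):
        return queue.pop() if queue else None   # stand-in for a WMS call

    def run_pilot(queue):
        while True:
            payload = fetch_payload(queue)
            if payload is None:
                break                           # no work left: pilot exits
            print("running", payload)           # placeholder for execution

    run_pilot(["job-1", "job-2", "job-3"])
    ```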

  6. 09/07/2018, 12:15
    Track 3 – Distributed computing
    presentation
  7. Vito Di Benedetto (Fermi National Accelerator Lab. (US))
    09/07/2018, 14:00
    Track 3 – Distributed computing
    presentation

    The FabrIc for Frontier Experiments (FIFE) project within the Fermilab Scientific Computing Division is charged with integrating offline computing components into a common computing stack for the non-LHC Fermilab experiments, supporting experiment offline computing, and consulting on novel workflows. We will discuss the general FIFE onboarding strategy, the upgrades and enhancements in...

  8. Eric Vaandering (Fermi National Accelerator Lab. (US))
    09/07/2018, 14:15
    Track 3 – Distributed computing
    presentation

    HEPCloud is rapidly becoming the primary system for provisioning compute resources for all Fermilab-affiliated experiments. In order to reliably meet peak demands of the next generation of High Energy Physics experiments, Fermilab must either plan to locally provision enough resources to cover the forecasted need, or find ways to elastically expand its computational capabilities. Commercial...

  9. Manuel Giffels (KIT - Karlsruhe Institute of Technology (DE))
    09/07/2018, 14:30
    Track 3 – Distributed computing
    presentation

    The amount of data to be processed by experiments in high energy physics will increase tremendously in the coming years. For the first time in history, the expected technology advance by itself will not be sufficient to close the arising gap between required and available resources, assuming the current flat-budget hardware procurement strategy is maintained. This leads to...

  10. Daniela Bauer (Imperial College (GB))
    09/07/2018, 14:45
    Track 3 – Distributed computing
    presentation

    LZ is a Dark Matter experiment based at the Sanford Underground Research Facility. It is currently under construction and aims to start data taking in 2020. Its computing model is based on two data centres, one in the USA (USDC) and one in the UK (UKDC), both holding a complete copy of its data. During stable periods of running both data centres plan to concentrate on different aspects of...

  11. Vladimir Korenkov (Joint Institute for Nuclear Research (RU))
    09/07/2018, 15:00
    Track 3 – Distributed computing
    presentation

    Computing in the field of high energy physics requires usage of heterogeneous computing resources and IT, such as grid, high performance computing, cloud computing and big data analytics for data processing and analysis. The core of the distributed computing environment at the Joint Institute for Nuclear Research is the Multifunctional Information and Computing Complex (MICC). It includes...

  12. David Cameron (University of Oslo (NO))
    09/07/2018, 15:15
    Track 3 – Distributed computing
    presentation

    LHC@home has provided computing capacity for simulations under BOINC since 2005. Following the introduction of virtualisation with BOINC to run HEP Linux software in a virtual machine on volunteer desktops, initially on test BOINC projects such as Test4Theory and ATLAS@home, all CERN applications distributed to volunteers have been consolidated under a single LHC@home BOINC project....

  13. Andrew John Washbrook (The University of Edinburgh (GB))
    09/07/2018, 15:30
    Track 3 – Distributed computing
    presentation

    The Edinburgh (UK) Tier-2 computing site has provided CPU and storage resources to the Worldwide LHC Computing Grid (WLCG) for close to 10 years. Unlike at other sites, resources are shared amongst members of the hosting institute rather than being exclusively provisioned for Grid computing. Although this unconventional approach has posed challenges for troubleshooting and service delivery, there...

  14. David Cameron (University of Oslo (NO))
    09/07/2018, 15:45
    Track 3 – Distributed computing
    presentation

    The volunteer computing project ATLAS@Home has been providing a stable computing resource for the ATLAS experiment since 2013. It has recently undergone some significant developments and as a result has become one of the largest resources contributing to ATLAS computing, by expanding its scope beyond traditional volunteers and into exploitation of idle computing power in ATLAS data centres....

  15. Mr Adrian Coveney (STFC)
    10/07/2018, 11:00
    Track 3 – Distributed computing
    presentation

    While the WLCG and EGI have both made significant progress towards solutions for storage space accounting, one area that is still quite exploratory is that of dataset accounting. This type of accounting would enable resource centre and research community administrators to report on dataset usage to the data owners, data providers, and funding agencies. Eventually decisions could be made about...

  16. Brian Paul Bockelman (University of Nebraska Lincoln (US))
    10/07/2018, 11:15
    Track 3 – Distributed computing
    presentation

    The OSG has long maintained a central accounting system called Gratia. It uses small probes on each computing and storage resource in order to collect usage. The probes report to a central collector, which stores the usage records in a database. The database is then queried to generate reports. As the OSG aged, the size of the database grew very large. It became too large for the database technology to...
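
    The probe-to-collector pattern described above can be sketched in a few lines of Python; the collector URL and record fields below are hypothetical, not Gratia's actual schema.

    ```python
    # Hedged sketch: a probe summarises local usage and POSTs a record
    # to a central collector, which stores it for later reporting.
    import json
    import urllib.request

    def report_usage(collector_url, site, user, wall_seconds, cores):
        record = {"site": site, "user": user,
                  "wall_seconds": wall_seconds, "cores": cores}
        req = urllib.request.Request(
            collector_url,
            data=json.dumps(record).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:   # HTTP POST
            return resp.status

    # report_usage("https://collector.example.org/records",
    #              "Nebraska", "cms", 3600, 8)
    ```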

  17. Jaroslava Schovancova (CERN)
    10/07/2018, 11:30
    Track 3 – Distributed computing
    presentation

    HammerCloud is a testing service and framework to commission, run continuous tests or on-demand large-scale stress tests, and benchmark computing resources and components of various distributed systems with realistic full-chain experiment workflows.

    HammerCloud, used by the ATLAS and CMS experiments in production, has been a useful service to commission both compute resources and various...

  18. Yuji Kato
    10/07/2018, 11:45
    Track 3 – Distributed computing
    presentation

    Belle II is an asymmetric-energy e+e- collider experiment at KEK, Japan. It aims to reveal physics beyond the Standard Model with a data set of about 5×10^10 BB̄ pairs, and starts its physics run in 2018. In order to store such a huge amount of data, including simulation events, and analyze it in a timely manner, Belle II adopts a distributed computing model with DIRAC...

  19. Adam Wegrzynek (CERN)
    10/07/2018, 12:00
    Track 3 – Distributed computing
    presentation

    ALICE (A Large Ion Collider Experiment) is preparing for a major upgrade of the detector, readout system and computing for LHC Run 3. A new facility called O2 (Online-Offline) will play a major role in data compression and event processing. To efficiently operate the experiment, we are designing a monitoring subsystem, which will provide a complete overview of the O2 overall health, detect...

  20. Alexey Anisenkov (Budker Institute of Nuclear Physics (RU))
    10/07/2018, 12:15
    Track 3 – Distributed computing
    presentation

    The WLCG Information System (IS) is an important component of the huge heterogeneous distributed infrastructure. Considering the evolution of LHC computing towards the high-luminosity era, and analyzing the experience accumulated by the computing operations teams and the limitations of the current information system, the WLCG IS evolution task force came up with the proposal to develop Computing Resource...

  21. Andrew McNab (University of Manchester)
    10/07/2018, 14:00
    Track 3 – Distributed computing
    presentation

    During 2017 LHCb developed the ability to interrupt Monte Carlo simulation jobs and cause them to finish cleanly, with the events simulated so far correctly uploaded to grid storage. We explain how this functionality is supported in the Gaudi framework and handled by the LHCb simulation framework Gauss. By extending DIRAC, we have been able to trigger these interruptions when running...
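
    The clean-interruption pattern can be illustrated in miniature: a signal handler sets a flag that the event loop checks between events, so the job finalises with the events completed so far. This is a conceptual Python sketch only; Gaudi and Gauss implement the real mechanism inside the C++ framework.

    ```python
    # Conceptual sketch of graceful interruption of an event loop.
    import signal

    stop_requested = False

    def on_interrupt(signum, frame):
        global stop_requested
        stop_requested = True      # finish the current event, then stop

    signal.signal(signal.SIGTERM, on_interrupt)

    def simulate_event(i):
        return {"event": i}        # stand-in for real per-event work

    events = []
    for i in range(10_000):
        if stop_requested:
            break                  # clean exit instead of a killed job
        events.append(simulate_event(i))

    # finalise: upload the events completed so far (stand-in print)
    print(f"uploading {len(events)} completed events to grid storage")
    ```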

  22. Johannes Elmsheuser (Brookhaven National Laboratory (US))
    10/07/2018, 14:15
    Track 3 – Distributed computing
    presentation

    The CERN ATLAS experiment grid workflow system routinely manages 250 to 500 thousand concurrently running production and analysis jobs to process simulation and detector data. In total more than 300 PB of data is distributed over more than 150 sites in the WLCG. At this scale, small improvements in the software and computing performance and workflows can lead to significant resource usage...

  23. Todor Trendafilov Ivanov (University of Sofia (BG)), Jose Hernandez (CIEMAT)
    10/07/2018, 14:30
    Track 3 – Distributed computing
    presentation

    Hundreds of physicists analyse data collected by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) using the CMS Remote Analysis Builder (CRAB) and the CMS GlideinWMS global pool to exploit the resources of the Worldwide LHC Computing Grid. Efficient use of such an extensive and expensive resource is crucial. At the same time, the CMS collaboration is committed to...

  24. Peter Love (Lancaster University (GB))
    10/07/2018, 14:45
    Track 3 – Distributed computing
    presentation

    ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are neglected and not brought back into production because the problems are not visible to the WMS. In...

  25. Hannah Short (CERN)
    10/07/2018, 15:00
    Track 3 – Distributed computing
    presentation

    Federated identity management (FIM) is an arrangement that can be made among multiple organisations that lets subscribers use the same identification data to obtain access to the secured resources of all organisations in the group. In many research communities there is an increasing interest in a common approach to FIM as there is obviously a large potential for synergies. FIM4R [1] provides a...

  26. David Crooks (University of Glasgow (GB))
    10/07/2018, 15:15
    Track 3 – Distributed computing
    presentation

    The modern security landscape for distributed computing in High Energy Physics (HEP) includes a wide range of threats employing different attack vectors. The nature of these threats is such that the most effective method for dealing with them is to work collaboratively, both within the HEP community and with partners further afield - these can, and should, include institutional and campus...

  27. Paul Millar (DESY)
    10/07/2018, 15:30
    Track 3 – Distributed computing
    presentation

    X.509 is the dominant security infrastructure used in WLCG. Although this technology has worked well, it has some issues. One is that, currently, a delegated proxy can do everything the parent credential can do. A stolen "production" proxy could be used from any machine in the world to delete all data owned by that VO on all storage systems in the grid.

    Generating a delegated X.509...

  28. Mr Nicolas Liampotis (Greek Research and Technology Network - GRNET)
    10/07/2018, 15:45
    Track 3 – Distributed computing
    presentation

    The European Open Science Cloud (EOSC) aims to enable trusted access to services and the re-use of shared scientific data across disciplinary, social and geographical borders. The EOSC-hub will realise the EOSC infrastructure as an ecosystem of research e-Infrastructures leveraging existing national and European investments in digital research infrastructures. EGI Check-in and EUDAT B2ACCESS...

  29. Artem Petrosyan (Joint Institute for Nuclear Research (RU))
    11/07/2018, 11:30
    Track 3 – Distributed computing
    presentation

    The LHC Computing Grid was a pioneering integration effort that managed to unite computing and storage resources all over the world, making them available to the experiments at the Large Hadron Collider. During a decade of LHC computing, Grid software has learned to effectively utilise different types of computing resources, such as classic computing clusters, clouds and high-performance computers. While the...

  30. Johannes Elmsheuser (Brookhaven National Laboratory (US))
    11/07/2018, 11:45
    Track 3 – Distributed computing
    presentation

    The CERN ATLAS experiment successfully uses a worldwide computing infrastructure to support the physics program during LHC Run 2. The grid workflow system PanDA routinely manages 250 to 500 thousand concurrently running production and analysis jobs to process simulation and detector data. In total more than 300 PB of data is distributed over more than 150 sites in the WLCG and handled by the...

  31. David Schultz (University of Wisconsin-Madison)
    11/07/2018, 12:00
    Track 3 – Distributed computing
    presentation

    IceCube is a cubic kilometer neutrino detector located at the South Pole. IceProd is IceCube’s internal dataset management system, keeping track of where, when, and how jobs run. It schedules jobs from submitted datasets to HTCondor, tracking them at every stage of the lifecycle. Many updates have been made in recent years to improve stability and scalability, as well as to increase...

  32. Federico Stagni (CERN)
    11/07/2018, 12:15
    Track 3 – Distributed computing
    presentation

    The DIRAC project is developing interware to build and operate distributed computing systems. It provides a development framework and a rich set of services for both Workload and Data Management tasks of large scientific communities. DIRAC is adopted by a growing number of collaborations, including LHCb, Belle2, the Linear Collider, and CTA.

    The LHCb experiment will be upgraded during the...

  33. Matteo Cremonesi (Fermi National Accelerator Lab. (US))
    11/07/2018, 12:30
    Track 3 – Distributed computing
    presentation

    In recent years the LHC delivered record-breaking luminosity to the CMS experiment, making it a challenge to successfully handle all the demands for efficient data and Monte Carlo processing. In this presentation we will review the major issues in managing such requests and how we were able to address them. Our main strategy relies on increased automation and dynamic workload and data...

  34. 11/07/2018, 12:45
    Track 3 – Distributed computing
    presentation
  35. Boris Bauermeister (Stockholm University)
    12/07/2018, 11:00
    Track 3 – Distributed computing
    presentation

    The Xenon Dark Matter experiment is looking for non-baryonic particle dark matter in the universe. The demonstrator is a dual-phase time projection chamber (TPC), filled with a target mass of ~2000 kg of ultra-pure liquid xenon. The experimental setup is operated at the Laboratori Nazionali del Gran Sasso (LNGS). We present here a full overview of the computing scheme for data distribution...

  36. Tadashi Maeno (Brookhaven National Laboratory (US))
    12/07/2018, 11:15
    Track 3 – Distributed computing
    presentation

    The Production and Distributed Analysis (PanDA) system has been successfully used in the ATLAS experiment as a data-driven workload management system. The PanDA system has proven to be capable of operating at the Large Hadron Collider data processing scale over the last decade including the Run 1 and Run 2 data taking periods. PanDA was originally designed to be weakly coupled with the WLCG...

  37. Pavlo Svirin (Brookhaven National Laboratory (US))
    12/07/2018, 11:30
    Track 3 – Distributed computing
    presentation

    A goal of the LSST (Large Synoptic Survey Telescope) project is to conduct a 10-year survey of the sky that is expected to deliver 200 petabytes of data after it begins full science operations in 2022. The project will address some of the most pressing questions about the structure and evolution of the universe and the objects in it. It will require a large number of simulations to understand the...

  38. Luisa Arrabito
    12/07/2018, 11:45
    Track 3 – Distributed computing
    presentation

    The Cherenkov Telescope Array (CTA) is the next-generation instrument in the field of very high energy gamma-ray astronomy. It will be composed of two arrays of Imaging Atmospheric Cherenkov Telescopes, located at La Palma (Spain) and Paranal (Chile). The construction of CTA has just started with the installation of the first telescope on site at La Palma and the first data expected by the end...

  39. Xiaomei Zhang (Chinese Academy of Sciences (CN))
    12/07/2018, 12:00
    Track 3 – Distributed computing
    presentation

    The Jiangmen Underground Neutrino Observatory (JUNO) is a multipurpose neutrino experiment which will start in 2020. To speed up JUNO data processing on multicore hardware, the JUNO software framework is introducing parallelization based on TBB. To support JUNO multicore simulation and reconstruction jobs in the near future, a new workload scheduling model has to be explored and implemented in...

  40. Antonio Perez-Calero Yzquierdo (Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas)
    12/07/2018, 12:15
    Track 3 – Distributed computing
    presentation

    The CMS Submission Infrastructure Global Pool, built on GlideinWMS and HTCondor, is a worldwide distributed dynamic pool responsible for the allocation of resources for all CMS computing workloads. Matching the continuously increasing demand for computing resources by CMS requires the anticipated assessment of its scalability limitations. Extrapolating historical usage trends, by LHC Run III...
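
    A toy sketch of the kind of trend extrapolation the abstract mentions; the years and pool sizes below are invented for illustration and are not CMS data.

    ```python
    # Least-squares line fit over (invented) historical pool sizes,
    # extrapolated forward to anticipate scalability limits.
    years = [2015, 2016, 2017, 2018]
    cores = [80_000, 120_000, 160_000, 200_000]   # hypothetical

    n = len(years)
    mx, my = sum(years) / n, sum(cores) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(years, cores))
             / sum((x - mx) ** 2 for x in years))
    intercept = my - slope * mx

    for year in (2021, 2024):   # toward Run 3 and HL-LHC
        print(year, int(slope * year + intercept), "cores (extrapolated)")
    ```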

  41. David Lange (Princeton University (US))
    12/07/2018, 14:00
    Track 3 – Distributed computing
    presentation

    The HL-LHC program has seen numerous extrapolations of its needed computing resources that each indicate the need for substantial changes if the desired HL-LHC physics program is to be supported within the current level of computing resource budgets. Drivers include large increases in event complexity (leading to increased processing time and analysis data size) and trigger rates needed (5-10...

  42. Stefan Roiser (CERN)
    12/07/2018, 14:15
    Track 3 – Distributed computing
    presentation

    The LHCb experiment will be upgraded for data taking in LHC Run 3. The foreseen trigger output bandwidth of a few GB/s will result in datasets of tens of PB per year, which need to be efficiently streamed and stored offline for low-latency data analysis. In addition, simulation samples up to two orders of magnitude larger than those currently simulated are envisaged, with big...

  43. Fernando Harald Barreiro Megino (University of Texas at Arlington)
    12/07/2018, 14:30
    Track 3 – Distributed computing
    presentation

    The Production and Distributed Analysis system (PanDA) for the ATLAS experiment at the Large Hadron Collider has seen big changes over the past couple of years to accommodate new types of distributed computing resources: clouds, HPCs, volunteer computers and other external resources. While PanDA was originally designed for fairly homogeneous resources available through the Worldwide LHC...

  44. Miguel Martinez Pedreira (Johann-Wolfgang-Goethe Univ. (DE))
    12/07/2018, 14:45
    Track 3 – Distributed computing
    presentation

    The ALICE experiment will undergo an extensive detector and readout upgrade for LHC Run 3 and will collect a data volume 10 times larger than today. This will translate into an increase in the required CPU resources worldwide, as well as higher data access and transfer rates. JAliEn (Java ALICE Environment) is the new Grid middleware designed to scale out horizontally and satisfy the ALICE...

  45. Andrea Sciaba (CERN)
    12/07/2018, 15:00
    Track 3 – Distributed computing
    presentation

    The increase in the scale of LHC computing expected for Run 3 and even more so for Run 4 (HL-LHC) over the course of the next ten years will most certainly require radical changes to the computing models and the data processing of the LHC experiments. Translating the requirements of the physics programmes into computing resource needs is an extremely complicated process and subject to...
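
    As a hedged illustration of translating a physics programme into resource needs, here is a back-of-the-envelope CPU estimate with placeholder numbers (not from the talk):

    ```python
    # Toy estimate: sustained cores = events/year * CPU-seconds/event
    #               / seconds/year. All inputs are placeholders.
    events_per_year = 1e10       # hypothetical trigger output
    cpu_sec_per_event = 30       # hypothetical reconstruction cost
    seconds_per_year = 3.15e7

    cores_needed = events_per_year * cpu_sec_per_event / seconds_per_year
    print(f"~{cores_needed:,.0f} cores of sustained capacity")
    ```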

  46. Edgar Fajardo Hernandez (Univ. of California San Diego (US))
    12/07/2018, 15:15
    Track 3 – Distributed computing
    presentation

    With the increase in power and reduction in cost of GPU-accelerated processors, interest in their use in the scientific domain has surged. OSG users are no different, and they have shown interest in accessing GPU resources via their usual workload infrastructures. Grid sites that have these kinds of resources also want to make them available on the grid. In this talk, we discuss...
