Conveners
T3 - Distributed computing: Experiment Frameworks and HPC
- David Cameron (University of Oslo (NO))
T3 - Distributed computing: Facilities
- Julia Andreeva (CERN)
T3 - Distributed computing: Testing, Monitoring and Accounting
- Julia Andreeva (CERN)
T3 - Distributed computing: Performance Optimization, Security and Federated Identity
- David Cameron (University of Oslo (NO))
T3 - Distributed computing: Experiment Frameworks and Operational Experiences (1)
- Hannah Short (CERN)
T3 - Distributed computing: Experiment Frameworks and Operational Experiences (2)
- Julia Andreeva (CERN)
T3 - Distributed computing: Computing Models and Future Views
- David Cameron (University of Oslo (NO))
The IceCube Neutrino Observatory is a neutrino detector located at the South Pole. Here we present experiences acquired when using HTCondor to run IceCube’s GPU simulation worksets on the Titan supercomputer. Titan is a large supercomputer geared for High Performance Computing (HPC). Several factors make it challenging to use Titan for IceCube’s High Throughput Computing (HTC) workloads: (1) Titan...
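As an aside from the contribution itself, the sketch below shows what a minimal GPU job submission through the HTCondor Python bindings can look like; the wrapper script name, resource requests and job count are purely illustrative assumptions.

    # Minimal sketch of submitting GPU jobs via the HTCondor Python bindings
    # (newer Schedd.submit API). Executable and resource values are
    # illustrative placeholders, not IceCube's actual configuration.
    import htcondor

    submit_description = htcondor.Submit({
        "executable": "run_gpu_simulation.sh",        # hypothetical wrapper script
        "arguments": "--dataset 12345 --job $(ProcId)",
        "request_gpus": "1",
        "request_cpus": "1",
        "request_memory": "4GB",
        "output": "sim_$(ProcId).out",
        "error": "sim_$(ProcId).err",
        "log": "sim.log",
    })

    schedd = htcondor.Schedd()
    schedd.submit(submit_description, count=100)  # queue 100 GPU jobs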
Predictions of the requirements for LHC computing in Run 3 and Run 4 (HL-LHC) over the course of the next 10 years show a considerable gap between required and available resources, assuming budgets will globally remain flat at best. This will require some radical changes to the computing models for the data processing of the LHC experiments. The use of large scale computational...
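For a rough sense of scale, the back-of-the-envelope comparison below contrasts capacity growth under a flat budget with faster-growing demand; both growth rates are illustrative assumptions, not figures from the contribution.

    # Illustrative flat-budget projection: capacity grows only through
    # technology improvement while demand grows faster. All numbers are
    # assumptions made for the sake of the example.
    years = 10
    tech_gain_per_year = 0.20      # assumed yearly performance gain at fixed cost
    demand_growth_per_year = 0.40  # assumed yearly growth in required resources

    capacity = (1 + tech_gain_per_year) ** years    # ~6.2x after 10 years
    demand = (1 + demand_growth_per_year) ** years  # ~28.9x after 10 years

    print(f"capacity growth: {capacity:.1f}x, demand growth: {demand:.1f}x, "
          f"shortfall factor: {demand / capacity:.1f}x")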
Many HEP experiments are moving beyond experimental studies to making large-scale production use of HPC resources at NERSC, including the Knights Landing architecture on the Cori supercomputer. These include ATLAS, ALICE, Belle II, CMS, LSST-DESC, and STAR, among others. Achieving this has involved several different approaches and has required innovations on both the NERSC and the experiments’ sides....
The Titan supercomputer at Oak Ridge National Laboratory prioritizes the scheduling of large leadership class jobs, but even when the supercomputer is fully loaded and large jobs are standing in the queue to run, 10 percent of the machine remains available for a mix of smaller jobs, essentially ‘filling in the cracks’ between the very large jobs. Such utilisation of the computer resources is...
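The "filling in the cracks" idea can be pictured as a greedy backfill: given a gap of idle nodes available for a limited time, select small jobs that fit. The sketch below is a generic illustration with made-up job sizes, not Titan's actual scheduling logic.

    # Greedy backfill sketch: pick small jobs that fit into the gap of idle
    # nodes left while large leadership-class jobs wait in the queue.
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        nodes: int         # nodes requested
        walltime_h: float  # requested walltime in hours

    def backfill(jobs, free_nodes, gap_hours):
        """Return the jobs that fit into the current gap."""
        selected = []
        for job in sorted(jobs, key=lambda j: j.nodes):
            if job.nodes <= free_nodes and job.walltime_h <= gap_hours:
                selected.append(job)
                free_nodes -= job.nodes
        return selected

    queue = [
        Job("small_sim_001", nodes=50, walltime_h=1.5),
        Job("small_sim_002", nodes=200, walltime_h=2.0),
        Job("leadership_run", nodes=5000, walltime_h=12.0),
    ]
    print([j.name for j in backfill(queue, free_nodes=1800, gap_hours=3.0)])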
PanDA executes millions of ATLAS jobs a month on Grid systems with more than 300k cores. Currently, PanDA is compatible with only a few HPC resources due to different edge services and operational policies, does not implement the pilot paradigm on HPC, and does not dynamically optimize resource allocation among queues. We integrated the PanDA Harvester service and the RADICAL-Pilot (RP) system...
The FabrIc for Frontier Experiments (FIFE) project within the Fermilab Scientific Computing Division is charged with integrating offline computing components into a common computing stack for the non-LHC Fermilab experiments, supporting experiment offline computing, and consulting on new, novel workflows. We will discuss the general FIFE onboarding strategy, the upgrades and enhancements in...
HEPCloud is rapidly becoming the primary system for provisioning compute resources for all Fermilab-affiliated experiments. In order to reliably meet peak demands of the next generation of High Energy Physics experiments, Fermilab must either plan to locally provision enough resources to cover the forecasted need, or find ways to elastically expand its computational capabilities. Commercial...
The amount of data to be processed by experiments in high energy physics will increase tremendously in the coming years. For the first time in history, the expected technology advance itself will not be sufficient to cover the arising gap between required and available resources, assuming the current flat-budget hardware procurement strategy is maintained. This leads to...
LZ is a Dark Matter experiment based at the Sanford Underground Research Facility. It is currently under construction and aims to start data taking in 2020. Its computing model is based on two data centres, one in the USA (USDC) and one in the UK (UKDC), both holding a complete copy of its data. During stable periods of running both data centres plan to concentrate on different aspects of...
Computing in the field of high energy physics requires the use of heterogeneous computing resources and technologies, such as grid, high performance computing, cloud computing and big data analytics, for data processing and analysis. The core of the distributed computing environment at the Joint Institute for Nuclear Research is the Multifunctional Information and Computing Complex (MICC). It includes...
LHC@home has provided computing capacity for simulations under BOINC since 2005. Following the introduction of virtualisation with BOINC to run HEP Linux software in a virtual machine on volunteer desktops, initially on test BOINC projects such as Test4Theory and ATLAS@home, all CERN applications distributed to volunteers have been consolidated under a single LHC@home BOINC project....
The Edinburgh (UK) Tier-2 computing site has provided CPU and storage resources to the Worldwide LHC Computing Grid (WLCG) for close to 10 years. Unlike other sites, resources are shared amongst members of the hosting institute rather than being exclusively provisioned for Grid computing. Although this unconventional approach has posed challenges for troubleshooting and service delivery, there...
The volunteer computing project ATLAS@Home has been providing a stable computing resource for the ATLAS experiment since 2013. It has recently undergone some significant developments and as a result has become one of the largest resources contributing to ATLAS computing, by expanding its scope beyond traditional volunteers and into exploitation of idle computing power in ATLAS data centres....
While the WLCG and EGI have both made significant progress towards solutions for storage space accounting, one area that is still quite exploratory is that of dataset accounting. This type of accounting would enable resource centre and research community administrators to report on dataset usage to the data owners, data providers, and funding agencies. Eventually decisions could be made about...
The OSG has long maintained a central accounting system called Gratia. It uses small probes on each computing and storage resource in order to collect usage. The probes report to a central collector which stores the usage in a database. The database is then queried to generate reports. As the OSG aged, the size of the database grew very large. It became too large for the database technology to...
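To picture the probe-to-collector flow described above, a minimal sketch follows: a probe posts a usage record to a collector endpoint, and the collector side aggregates records into a per-VO summary. The record fields, endpoint URL and aggregation are hypothetical illustrations, not the actual Gratia schema or API.

    # Hypothetical sketch of a usage probe reporting to a central collector
    # and of a simple per-VO aggregation; not the real Gratia interfaces.
    import json
    from collections import defaultdict
    from urllib import request

    COLLECTOR_URL = "https://collector.example.org/records"  # placeholder

    def report_usage(record):
        """Send one usage record to the collector as JSON."""
        req = request.Request(
            COLLECTOR_URL,
            data=json.dumps(record).encode(),
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)

    def summarise(records):
        """Aggregate wall hours per VO, the kind of query a report would run."""
        totals = defaultdict(float)
        for rec in records:
            totals[rec["vo"]] += rec["wall_hours"]
        return dict(totals)

    print(summarise([
        {"site": "SiteA", "vo": "cms", "wall_hours": 8.0},
        {"site": "SiteB", "vo": "atlas", "wall_hours": 4.0},
    ]))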
HammerCloud is a testing service and framework to commission, run continuous tests or on-demand large-scale stress tests, and benchmark computing resources and components of various distributed systems with realistic full-chain experiment workflows.
HammerCloud, used by the ATLAS and CMS experiments in production, has been a useful service to commission both compute resources and various...
Belle II is an asymmetric-energy e+e- collider experiment at KEK, Japan. Belle II aims to reveal physics beyond the Standard Model with a data set of about 5×10^10 BB̄ pairs and will start physics runs in 2018. In order to store such a huge amount of data, including simulation events, and analyze it in a timely manner, Belle II adopts a distributed computing model with DIRAC...
ALICE (A Large Ion Collider Experiment) is preparing for a major upgrade of the detector, readout system and computing for LHC Run 3. A new facility called O2 (Online-Offline) will play a major role in data compression and event processing. To efficiently operate the experiment, we are designing a monitoring subsystem, which will provide a complete overview of the O2 overall health, detect...
The WLCG Information System (IS) is an important component of this huge heterogeneous distributed infrastructure. Considering the evolution of LHC computing towards the high luminosity era, and analyzing the experience accumulated by the computing operations teams and the limitations of the current information system, the WLCG IS evolution task force came up with the proposal to develop Computing Resource...
During 2017 LHCb developed the ability to interrupt Monte Carlo simulation jobs and cause them to finish cleanly with the events simulated so far correctly uploaded to grid storage. We explain how this functionality is supported in the Gaudi framework and handled by the LHCb simulation framework Gauss. By extending DIRAC, we have been able to trigger these interruptions when running...
The CERN ATLAS experiment grid workflow system routinely manages 250 to 500 thousand concurrently running production and analysis jobs to process simulation and detector data. In total more than 300 PB of data is distributed over more than 150 sites in the WLCG. At this scale small improvements in the software and computing performance and workflows can lead to significant resource usage...
Hundreds of physicists analyse data collected by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) using the CMS Remote Analysis Builder (CRAB) and the CMS GlideinWMS global pool to exploit the resources of the Worldwide LHC Computing Grid. Efficient use of such an extensive and expensive resource is crucial. At the same time the CMS collaboration is committed to...
ATLAS Distributed Computing (ADC) uses the pilot model to submit jobs to Grid computing resources. This model isolates the resource from the workload management system (WMS) and helps to avoid running jobs on faulty resources. A minor side-effect of this isolation is that the faulty resources are neglected and not brought back into production because the problems are not visible to the WMS. In...
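One simple way to picture bringing such resources back is a periodic probe that sends a lightweight test job to each excluded queue and re-enables it once the test succeeds. The sketch below is a generic illustration of that idea with a stubbed submission hook, not the actual ADC implementation.

    # Generic sketch: probe excluded queues with a test job and report which
    # ones can be brought back into production. submit_test_job() is a
    # hypothetical stand-in for a real WMS submission.
    import random

    def submit_test_job(queue):
        """Placeholder returning whether a lightweight test job succeeded."""
        return random.random() > 0.5  # stand-in for a real job outcome

    def recheck(excluded_queues):
        """Return the queues whose test job succeeded."""
        return {q for q in excluded_queues if submit_test_job(q)}

    excluded = {"SITE_A_QUEUE", "SITE_B_QUEUE", "SITE_C_QUEUE"}
    print("can be re-enabled:", recheck(excluded))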
Federated identity management (FIM) is an arrangement that can be made among multiple organisations that lets subscribers use the same identification data to obtain access to the secured resources of all organisations in the group. In many research communities there is an increasing interest in a common approach to FIM as there is obviously a large potential for synergies. FIM4R [1] provides a...
The modern security landscape for distributed computing in High Energy Physics (HEP) includes a wide range of threats employing different attack vectors. The nature of these threats is such that the most effective method for dealing with them is to work collaboratively, both within the HEP community and with partners further afield - these can, and should, include institutional and campus...
X.509 is the dominant security infrastructure used in WLCG. Although this technology has worked well, it has some issues. One is that, currently, a delegated proxy can do everything the parent credential can do. A stolen "production" proxy could be used from any machine in the world to delete all data owned by that VO on all storage systems in the grid. Generating a delegated X.509...
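As a side illustration of working with such credentials (not part of the contribution), the sketch below reads a proxy file with the 'cryptography' package and prints its remaining lifetime, one of the few mitigations for a stolen proxy being its limited validity; the proxy path is an assumption.

    # Sketch: inspect the remaining lifetime of an X.509 proxy certificate.
    # The proxy path is a hypothetical example.
    import datetime
    from cryptography import x509

    PROXY_PATH = "/tmp/x509up_u1000"  # hypothetical proxy location

    with open(PROXY_PATH, "rb") as f:
        pem = f.read()

    # A proxy file bundles several PEM blocks; the first certificate is the
    # proxy certificate itself.
    begin = pem.index(b"-----BEGIN CERTIFICATE-----")
    end = pem.index(b"-----END CERTIFICATE-----") + len(b"-----END CERTIFICATE-----")
    cert = x509.load_pem_x509_certificate(pem[begin:end])

    remaining = cert.not_valid_after - datetime.datetime.utcnow()
    print("subject:", cert.subject.rfc4514_string())
    print("proxy valid for another", remaining)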
The European Open Science Cloud (EOSC) aims to enable trusted access to services and the re-use of shared scientific data across disciplinary, social and geographical borders. The EOSC-hub will realise the EOSC infrastructure as an ecosystem of research e-Infrastructures leveraging existing national and European investments in digital research infrastructures. EGI Check-in and EUDAT B2ACCESS...
The LHC Computing Grid was a pioneering integration effort that managed to unite computing and storage resources all over the world, thus making them available to the experiments at the Large Hadron Collider. During a decade of LHC computing, Grid software has learned to effectively utilise different types of computing resources, such as classic computing clusters, clouds and high performance computers. While the...
The CERN ATLAS experiment successfully uses a worldwide computing infrastructure to support the physics program during LHC Run 2. The grid workflow system PanDA routinely manages 250 to 500 thousand concurrently running production and analysis jobs to process simulation and detector data. In total more than 300 PB of data is distributed over more than 150 sites in the WLCG and handled by the...
IceCube is a cubic kilometer neutrino detector located at the South Pole. IceProd is IceCube’s internal dataset management system, keeping track of where, when, and how jobs run. It schedules jobs from submitted datasets to HTCondor, keeping track of them at every stage of the lifecycle. Many updates have happened in recent years to improve stability and scalability, as well as increase...
The DIRAC project is developing interware to build and operate distributed computing systems. It provides a development framework and a rich set of services for both Workload and Data Management tasks of large scientific communities. DIRAC is adopted by a growing number of collaborations, including LHCb, Belle II, the Linear Collider, and CTA.
The LHCb experiment will be upgraded during the...
In recent years the LHC delivered a record-breaking luminosity to the CMS experiment, making it a challenge to successfully handle all the demands for efficient data and Monte Carlo processing. In the presentation we will review the major issues in managing such requests and how we were able to address them. Our main strategy relies on increased automation and dynamic workload and data...
The Xenon Dark Matter experiment is looking for non-baryonic particle Dark Matter in the universe. The demonstrator is a dual-phase time projection chamber (TPC), filled with a target mass of ~2000 kg of ultra-pure liquid xenon. The experimental setup is operated at the Laboratori Nazionali del Gran Sasso (LNGS).
We present here a full overview of the computing scheme for data distribution...
The Production and Distributed Analysis (PanDA) system has been successfully used in the ATLAS experiment as a data-driven workload management system. The PanDA system has proven to be capable of operating at the Large Hadron Collider data processing scale over the last decade including the Run 1 and Run 2 data taking periods. PanDA was originally designed to be weakly coupled with the WLCG...
A goal of the LSST (Large Synoptic Survey Telescope) project is to conduct a 10-year survey of the sky that is expected to deliver 200 petabytes of data after it begins full science operations in 2022. The project will address some of the most pressing questions about the structure and evolution of the universe and the objects in it. It will require a large number of simulations to understand the...
The Cherenkov Telescope Array (CTA) is the next-generation instrument in the field of very high energy gamma-ray astronomy. It will be composed of two arrays of Imaging Atmospheric Cherenkov Telescopes, located at La Palma (Spain) and Paranal (Chile). The construction of CTA has just started with the installation of the first telescope on site at La Palma and the first data expected by the end...
The Jiangmen Underground Neutrino Observatory (JUNO) is a multipurpose neutrino experiment which will start in 2020. To speed up JUNO data processing on multicore hardware, the JUNO software framework is introducing parallelization based on TBB. To support JUNO multicore simulation and reconstruction jobs in the near future, a new workload scheduling model has to be explored and implemented in...
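A toy view of the scheduling question multicore jobs raise is packing payloads of different core counts onto a node, as in the sketch below; it is a generic illustration, not JUNO's actual scheduling model.

    # Generic sketch of greedily packing single-core and multicore payloads
    # onto one node with a fixed core count; payload sizes are illustrative.
    def pack_node(total_cores, payload_cores):
        """Return the payloads (core counts) that fit on the node."""
        placed, free = [], total_cores
        for cores in sorted(payload_cores, reverse=True):
            if cores <= free:
                placed.append(cores)
                free -= cores
        return placed

    print(pack_node(16, [8, 8, 4, 1, 1]))  # -> [8, 8] on a 16-core node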
The CMS Submission Infrastructure Global Pool, built on GlideinWMS and HTCondor, is a worldwide distributed dynamic pool responsible for the allocation of resources for all CMS computing workloads. Matching the continuously increasing demand for computing resources by CMS requires the anticipated assessment of its scalability limitations. Extrapolating historical usage trends, by LHC Run III...
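A minimal sketch of such an extrapolation of historical usage is shown below, with a simple linear fit over made-up yearly core counts; the numbers are purely illustrative, not actual Global Pool data.

    # Illustrative linear extrapolation of pool size from fictitious yearly
    # averages of running cores; not actual CMS Global Pool measurements.
    import numpy as np

    years = np.array([2015, 2016, 2017, 2018])
    avg_running_cores = np.array([90_000, 120_000, 160_000, 200_000])  # fictitious

    slope, intercept = np.polyfit(years, avg_running_cores, 1)
    for year in (2021, 2023):  # around the start of LHC Run III
        projected = slope * year + intercept
        print(f"{year}: ~{projected / 1000:.0f}k cores (naive linear trend)")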
The HL-LHC program has seen numerous extrapolations of its needed computing resources that each indicate the need for substantial changes if the desired HL-LHC physics program is to be supported within the current level of computing resource budgets. Drivers include large increases in event complexity (leading to increased processing time and analysis data size) and trigger rates needed (5-10...
The LHCb experiment will be upgraded for data taking in the LHC Run 3. The foreseen trigger output bandwidth of a few GB/s will result in datasets of tens of PB per year, which need to be efficiently streamed and stored offline for low-latency data analysis. In addition, simulation samples up to two orders of magnitude larger than those currently simulated are envisaged, with big...
The Production and Distributed Analysis system (PanDA) for the ATLAS experiment at the Large Hadron Collider has seen big changes over the past couple of years to accommodate new types of distributed computing resources: clouds, HPCs, volunteer computers and other external resources. While PanDA was originally designed for fairly homogeneous resources available through the Worldwide LHC...
The ALICE experiment will undergo an extensive detector and readout upgrade for the LHC Run 3 and will collect a 10 times larger data volume than today. This will translate into an increase in the required CPU resources worldwide as well as higher data access and transfer rates. JAliEn (Java ALICE Environment) is the new Grid middleware designed to scale out horizontally and satisfy the ALICE...
The increase in the scale of LHC computing expected for Run 3 and even more so for Run 4 (HL-LHC) over the course of the next ten years will most certainly require radical changes to the computing models and the data processing of the LHC experiments. Translating the requirements of the physics programmes into computing resource needs is an extremely complicated process and subject to...
With the increase in power and reduction in cost of GPU-accelerated processors, a corresponding interest in their use in the scientific domain has grown. OSG users are no different and they have shown an interest in accessing GPU resources via their usual workload infrastructures. Grid sites that have these kinds of resources also want to make them available on the grid. In this talk, we discuss...