Conveners
Track 3: Distributed Computing: 3.1 - Chair Mr. Latchezar Betev
- Tanya Levshina
- Latchezar Betev (CERN)
- Weidong Li (Institute of High Energy Physics, Chinese Academy of Sciences)
Track 3: Distributed Computing: 3.2 - Chair Mr. Miguel Pedreira
- Latchezar Betev (CERN)
- Weidong Li (Institute of High Energy Physics, Chinese Academy of Sciences)
- Tanya Levshina
Track 3: Distributed Computing: 3.3 - Chair Mr. Maarten Litmaath
- Weidong Li (Institute of High Energy Physics, Chinese Academy of Sciences)
- Tanya Levshina
- Latchezar Betev (CERN)
Track 3: Distributed Computing: 3.4 - Chair Mr. Andreas Joachim Peters
- Latchezar Betev (CERN)
- Tanya Levshina
- Weidong Li (Institute of High Energy Physics, Chinese Academy of Sciences)
Track 3: Distributed Computing: 3.5 - Chair Mr. Xavier Espinal Curull
- Weidong Li (Institute of High Energy Physics, Chinese Academy of Sciences)
- Latchezar Betev (CERN)
- Tanya Levshina
Track 3: Distributed Computing: 3.6 - Chair Mr. Barthelemy Von Haller
- Tanya Levshina
- Weidong Li (Institute of High Energy Physics, Chinese Academy of Sciences)
- Latchezar Betev (CERN)
The Production and Distributed Analysis (PanDA) system has been developed to meet ATLAS production
and analysis requirements for a data-driven workload management system capable of operating
at the Large Hadron Collider (LHC) data processing scale. Heterogeneous resources used by the ATLAS
experiment are distributed worldwide at hundreds of sites; thousands of physicists analyse the...
The ATLAS workload management system is a pilot-based system built on a late-binding philosophy that for many years avoided
passing fine-grained job requirements to the batch system. For memory in particular, most requirements were set to request
4 GB of vmem as defined in the EGI portal VO card, i.e. 2 GB of RAM plus 2 GB of swap. However, in the past few years several changes have happened
in the...
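As a back-of-the-envelope illustration of the coarse-grained request described above, the following sketch (illustrative values and helper name, not ATLAS production code) turns the 2 GB RAM + 2 GB swap convention into a single vmem figure:

    # Minimal sketch of the coarse-grained memory request described above:
    # the EGI VO-card convention of 2 GB RAM + 2 GB swap becomes a single
    # 4 GB vmem request. Values and the helper name are illustrative only.

    RAM_MB = 2048
    SWAP_MB = 2048

    def vmem_request_mb(ram_mb=RAM_MB, swap_mb=SWAP_MB):
        """Virtual-memory request passed to the batch system, in MB."""
        return ram_mb + swap_mb

    print("requesting", vmem_request_mb(), "MB vmem per job slot")  # 4096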
All four of the LHC experiments depend on web proxies (that is, squids) at each grid site in order to support software distribution by the CernVM File System (CVMFS). CMS and ATLAS also use web proxies for conditions data distributed through the Frontier Distributed Database caching system. ATLAS and CMS each have their own methods for their grid jobs to find out which web proxy to use for...
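As an illustration of the proxy configuration referred to above, the sketch below assembles CVMFS and Frontier client settings from a list of site squids; the host names are placeholders and discover_site_squids() is a hypothetical stand-in for the experiment-specific discovery methods mentioned in the abstract.

    # Sketch: turning a list of site squids into CVMFS and Frontier client
    # settings. Host names are placeholders; discover_site_squids() is
    # hypothetical and stands in for the experiment-specific methods.

    def discover_site_squids():
        # In reality this could come from site configuration, an information
        # system, or web proxy auto-discovery; here it is simply hard-coded.
        return ["http://squid1.example.org:3128", "http://squid2.example.org:3128"]

    def cvmfs_http_proxy(squids):
        # CVMFS joins proxies within a load-balancing group with '|';
        # failover groups would be separated by ';'.
        return "|".join(squids)

    def frontier_server(squids, server_url="http://conditions.example.org:8000/Frontier"):
        # The Frontier client takes server and proxy URLs in a single setting.
        proxies = "".join(f"(proxyurl={s})" for s in squids)
        return f"(serverurl={server_url}){proxies}"

    squids = discover_site_squids()
    print("CVMFS_HTTP_PROXY =", cvmfs_http_proxy(squids))
    print("FRONTIER_SERVER  =", frontier_server(squids))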
The ATLAS computing model was originally designed as static clouds (usually national or geographical groupings of sites) around the
Tier 1 centers, which confined tasks and most of the data traffic. Since those early days, the sites' network bandwidth has
increased by a factor of O(1000) and the difference in functionality between Tier 1s and Tier 2s has been reduced. After years of manual,
intermediate...
With ever-greater computing needs and fixed budgets, big scientific experiments are turning to opportunistic resources as a means
to add much-needed extra computing power. These resources can be very different in design from the resources that make up the Grid
computing of most experiments; exploiting them therefore requires a change in strategy for the experiment. The...
The ATLAS production system has provided the infrastructure to process tens of thousands of events during LHC Run 1 and the first year of LHC Run 2 using grid, cloud and high performance computing resources. In this contribution we address the strategies and improvements added to the production system to optimize its performance and get the maximum efficiency of the available resources from...
We previously described Lobster, a workflow management tool for exploiting volatile opportunistic computing resources for computation in HEP. We will discuss the various challenges that have been encountered while scaling up the simultaneous CPU core utilization and the software improvements required to overcome these challenges.
Categories: Workflows can now be divided into categories...
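To make the notion of workflow categories concrete, here is a minimal hedged sketch of grouping workflows into categories with shared resource limits; the class and attribute names are illustrative and do not reproduce the Lobster API.

    # Hedged sketch of grouping workflows into categories with shared resource
    # limits; class and attribute names are illustrative, not the Lobster API.

    from dataclasses import dataclass

    @dataclass
    class Category:
        name: str
        cores: int        # cores per task in this category
        memory_mb: int    # memory per task
        runtime_s: int    # expected task runtime

    @dataclass
    class Workflow:
        label: str
        category: Category
        dataset: str

    generation = Category("generation", cores=1, memory_mb=1000, runtime_s=3600)
    reco = Category("reconstruction", cores=4, memory_mb=4000, runtime_s=7200)

    workflows = [
        Workflow("ttbar_gen", generation, "/ttbar/gen"),
        Workflow("ttbar_reco", reco, "/ttbar/reco"),
    ]

    # Tasks inherit resource requests from their category, so limits can be
    # tuned per category rather than per workflow.
    for wf in workflows:
        print(wf.label, wf.category.cores, "cores,", wf.category.memory_mb, "MB")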
In the near future, many new experiments (JUNO, LHAASO, CEPC, etc.) with challenging data volumes are coming into operation or are being planned at IHEP, China. The Jiangmen Underground Neutrino Observatory (JUNO) is a multipurpose neutrino experiment to be operational in 2019. The Large High Altitude Air Shower Observatory (LHAASO) is oriented towards the study and observation of cosmic rays, which is...
In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run-2 events requires parallelization of the code in order to reduce the memory-per-core footprint constraining serial-execution programs, thus optimizing the exploitation of present multi-core...
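The memory argument above can be made explicit with a small back-of-the-envelope sketch (illustrative numbers only): a single multi-threaded process shares its fixed footprint across cores, whereas N independent single-threaded processes each pay it in full.

    # Back-of-the-envelope comparison of memory per core for N single-threaded
    # processes versus one N-threaded process. Numbers are illustrative only.

    BASE_MB = 1500      # fixed footprint (code, geometry, conditions) per process
    PER_EVENT_MB = 300  # additional memory per concurrently processed event

    def mem_per_core_single_threaded(n_cores):
        # Each core runs its own process and duplicates the fixed footprint.
        return BASE_MB + PER_EVENT_MB

    def mem_per_core_multi_threaded(n_cores):
        # One process: the fixed footprint is shared by all cores.
        return (BASE_MB + n_cores * PER_EVENT_MB) / n_cores

    for n in (1, 4, 8):
        print(n, "cores:",
              mem_per_core_single_threaded(n), "MB/core single-threaded vs",
              round(mem_per_core_multi_threaded(n)), "MB/core multi-threaded")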
The GridPP project in the UK has a long-standing policy of supporting non-LHC VOs with 10% of the provided resources. Up until recently this had only been taken up by a very limited set of VOs, mainly due to a combination of the (perceived) large overhead of getting started, the limited computing support within non-LHC VOs and their ability to fulfill their computing requirements on local batch...
CRAB3 is a workload management tool used by more than 500 CMS physicists every month to analyze data acquired by the Compact Muon Solenoid (CMS) detector at the CERN Large Hadron Collider (LHC). CRAB3 allows users to analyze a large collection of input files (datasets), splitting the input into multiple Grid jobs depending on parameters provided by users.
The process of manually specifying...
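For readers unfamiliar with the splitting parameters mentioned above, a minimal crabConfig.py fragment along these lines illustrates what a user provides; the dataset, request and site names are placeholders.

    # Illustrative crabConfig.py fragment showing user-provided splitting
    # parameters; dataset, request and site names are placeholders.

    from CRABClient.UserUtilities import config
    config = config()

    config.General.requestName = 'my_analysis_v1'
    config.JobType.pluginName = 'Analysis'
    config.JobType.psetName = 'analysis_cfg.py'

    config.Data.inputDataset = '/SomePrimaryDataset/SomeEra/MINIAOD'
    config.Data.splitting = 'FileBased'   # split the dataset by input files
    config.Data.unitsPerJob = 10          # files per Grid job

    config.Site.storageSite = 'T2_XX_Example'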
The CMS Computing and Offline group has introduced a number of enhancements into the main software packages and tools used for centrally managed processing and data transfers in order to cope with the challenges expected during LHC Run 2. In the presentation we will highlight these improvements that allow CMS to deal with the increased trigger output rate and the increased collision pileup in...
On a typical WLCG site providing batch access to computing resources according to a fair-share policy, the idle time after one job ends and before the next begins on a given slot is negligible compared to the duration of typical jobs. The overall amount of these intervals over a time window grows with the size of the cluster and with the inverse of the job duration, and can be considered...
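As a sketch of the quantity discussed above, the aggregate idle time over a window can be estimated from the per-slot gaps between consecutive jobs; the accounting-record format here is invented for illustration and does not correspond to any particular batch system.

    # Sketch: aggregate idle time on a cluster over a time window, computed
    # from per-slot gaps between consecutive jobs. The record format is
    # invented for illustration.

    from collections import defaultdict

    # (slot, job_start, job_end) in seconds since the start of the window
    records = [
        ("slot1", 0, 3600), ("slot1", 3650, 7200),
        ("slot2", 0, 1800), ("slot2", 2000, 5400),
    ]

    def total_idle(records):
        by_slot = defaultdict(list)
        for slot, start, end in records:
            by_slot[slot].append((start, end))
        idle = 0
        for intervals in by_slot.values():
            intervals.sort()
            for (s0, e0), (s1, e1) in zip(intervals, intervals[1:]):
                idle += max(0, s1 - e0)  # gap between consecutive jobs on the slot
        return idle

    print("aggregate idle slot time:", total_idle(records), "s")  # 250 s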
We present a system, deployed in the summer of 2015, for the automatic assignment of production and reprocessing workflows for simulation and detector data within the computing operations of the CMS experiment at the CERN LHC. Processing a request involves a number of steps in daily operation, including transferring input datasets where relevant and monitoring them, assigning work to...
The CMS Global Pool, based on HTCondor and glideinWMS, is the main computing resource provisioning system for all CMS workflows, including analysis, Monte Carlo production, and detector data reprocessing activities. Total resources at Tier-1 and Tier-2 sites pledged to CMS exceed 100,000 CPU cores, and another 50,000-100,000 CPU cores are available opportunistically, pushing the needs of the...
The need for computing in the HEP community follows cycles of peaks and valleys, driven mainly by conference dates, accelerator shutdowns, holiday schedules, and other factors. Because of this, the classical method of provisioning these resources at the providing facilities has drawbacks such as potential overprovisioning. As the appetite for computing increases, however, so does the need to...
The FabrIc for Frontier Experiments (FIFE) project is a major initiative within the Fermilab Scientific Computing Division charged with leading the computing model for Fermilab experiments. Work within the FIFE project creates close collaboration between experimenters and computing professionals to serve high-energy physics experiments of differing size, scope, and physics area. The FIFE...
The second generation of the ATLAS production system, called ProdSys2, is a
distributed workload manager that runs hundreds of thousands of jobs daily,
from dozens of different ATLAS-specific workflows, across more than a
hundred heterogeneous sites. It achieves high utilization by combining
dynamic job definition based on many criteria, such as input and output
size, memory requirements and...
Networks have played a critical role in high-energy physics
(HEP), enabling us to access and effectively utilize globally distributed
resources to meet the needs of our physicists.
Because of their importance in enabling our grid computing infrastructure,
many physicists have taken leading roles in research and education (R&E)
networking, participating in, and even convening, network...
With the increased load and pressure on required computing power brought by the higher luminosity in LHC during Run2, there is a need to utilize opportunistic resources not currently dedicated to the Compact Muon Solenoid (CMS) collaboration. Furthermore, these additional resources might be needed on demand. The Caltech group together with the Argonne Leadership Computing Facility (ALCF) are...
PanDA, the Production and Distributed Analysis workload management system, has been developed to address the data processing and analysis challenges of the ATLAS experiment at the LHC. Recently PanDA has been extended to run HEP scientific applications on Leadership Class Facilities and supercomputers. The success of the projects using PanDA beyond HEP and the Grid has drawn attention from other compute-intensive...
ATLAS Distributed Computing during LHC Run-1 was challenged by steadily increasing computing, storage and network
requirements. In addition, the complexity of processing task workflows and their associated data management requirements
led to a new paradigm in the ATLAS computing model for Run-2, accompanied by extensive evolution and redesign of the
workflow and data management systems. The...
ATLAS has been extensively exploring possibilities of using computing resources extending beyond conventional grid sites in the WLCG fabric to deliver as many computing cycles as possible and thereby enhance the significance of the Monte-Carlo samples to deliver better physics results.
The difficulties of using such opportunistic resources come from architectural differences such as...
Distributed data processing in High Energy and Nuclear Physics (HENP) is a prominent example of big data analysis. Having petabytes of data being processed at tens of computational sites with thousands of CPUs, standard job scheduling approaches either do not address well the problem complexity or are dedicated to one specific aspect of the problem only (CPU, network or storage). As a result, ...
Many experiments in the field of accelerator-based science are actively running at the High Energy Accelerator Research Organization (KEK) in Japan, using the SuperKEKB and J-PARC accelerators. The computing demand at KEK from the various experiments for data processing, analysis and MC simulation is steadily increasing. This is not only the case for high-energy...
The Cherenkov Telescope Array (CTA) – an array of many tens of Imaging Atmospheric Cherenkov Telescopes deployed on an unprecedented scale – is the next-generation instrument in the field of very high energy gamma-ray astronomy. An average data stream of about 0.9 GB/s for about 1300 hours of observation per year is expected, therefore resulting in 4 PB of raw data per year and a total of 27...
More than one thousand physicists analyse data collected by the ATLAS experiment at the Large Hadron Collider (LHC) at CERN through 150 computing facilities around the world. Efficient distributed analysis requires optimal resource usage and the interplay of several
factors: robust grid and software infrastructures, and system capability to adapt to different workloads. The continuous...
With the LHC Run2, end user analyses are increasingly challenging for both users and resource providers.
On the one hand, boosted data rates and more complex analyses favor and require larger data volumes to be processed.
On the other hand, efficient analyses and resource provisioning require fast turnaround cycles.
This puts the scalability of analysis infrastructures to new...
One of the challenges a scientific computing center has to face is to keep delivering a computational framework that is well consolidated within the community (i.e. the batch farm), while complying with modern computing paradigms. The aim is to ease system administration at all levels (from hardware to applications) and to provide a smooth end-user experience.
HTCondor is an LRMS widely used in the...
The Cloud Area Padovana has been running for almost two years. This is an OpenStack-based scientific cloud, spread across two different sites: the INFN Padova Unit and the INFN Legnaro National Labs.
The hardware resources have been scaled horizontally and vertically, by upgrading some hypervisors and by adding new ones: currently it provides about 1100 cores.
Some in-house developments were...
This contribution reports on solutions, experiences and recent developments with the dynamic, on-demand provisioning of remote computing resources for analysis and simulation workflows. Local resources of a physics institute are extended by private and commercial cloud sites, ranging from the inclusion of desktop clusters over institute clusters to HPC centers.
Rather than relying on...
The distributed cloud using the CloudScheduler VM provisioning service is one of the longest running systems for HEP workloads. It has run millions of jobs for ATLAS and Belle II over the past few years using private and commercial clouds around the world. Our goal is to scale the distributed cloud to the 10,000-core level, with the ability to run any type of application (low I/O, high I/O...
Historically high energy physics computing has been performed on large purpose-built computing systems. In the beginning there were single site computing facilities, which evolved into the Worldwide LHC Computing Grid (WLCG) used today. The vast majority of the WLCG resources are used for LHC computing and the resources are scheduled to be continuously used throughout the year. In the last...
HEP is only one of many sciences with sharply increasing compute requirements that cannot be met by profiting from Moore's law alone. Commercial clouds potentially allow for realising larger economies of scale. While some small-scale experience requiring dedicated effort has been collected, public cloud resources have not been integrated yet with the standard workflows of science organisations...
The HNSciCloud project (presented in general by another contribution) faces the challenge to accelerate developments performed by the selected commercial providers. In order to guarantee cost-efficient usage of IaaS resources across a wide range of scientific communities, the technical requirements had to be carefully constructed. With respect to current IaaS offerings, data-intensive science...
In the competitive 'market' for large-scale storage solutions, EOS has been showing its excellence in the multi-petabyte, high-concurrency regime. It has also shown a disruptive potential in powering the CERNBox service, in providing sync&share capabilities, and in supporting innovative analysis environments alongside the storage of LHC data. EOS has also generated interest as a generic storage...
ATLAS@Home is a volunteer computing project which allows the public to contribute to computing for the ATLAS experiment through
their home or office computers. The project has grown continuously since its creation in mid-2014 and now counts almost 100,000
volunteers. The combined volunteers' resources make up a sizable fraction of overall resources for ATLAS simulation. This paper
takes...
This talk will present the results of recent developments to support new users from the Large Synoptic Survey Telescope (LSST) group on the GridPP DIRAC instance. I will describe a workflow used for galaxy shape identification analyses, highlighting specific challenges as well as the solutions currently being explored. The result of this work allows this community to make best use of...