Conveners
Track 7: Middleware, Monitoring and Accounting: 7.1
- Randall Sobie (University of Victoria (CA))
Track 7: Middleware, Monitoring and Accounting: 7.2
- Ian Collier (STFC RAL)
Track 7: Middleware, Monitoring and Accounting: 7.3
- Randall Sobie (University of Victoria (CA))
Track 7: Middleware, Monitoring and Accounting: 7.4
- Rolf Seuster (University of Victoria (CA))
We present the Web-Based Monitoring project of the CMS experiment at the LHC at CERN. With the growth in size and complexity of High Energy Physics experiments and the accompanying increase in the number of collaborators spread across the globe, the importance of broadly accessible monitoring has grown. The same can be said about the increasing relevance of operation and reporting web tools...
The CMS experiment has collected an enormous volume of metadata about its computing operations in its monitoring systems, describing its experience in operating all of the CMS workflows on all of the Worldwide LHC Computing Grid Tiers. Data mining of all this information has rarely been attempted, but is of crucial importance for a better understanding of how CMS did successful...
This paper introduces the evolution of the monitoring system of the Alpha Magnetic Spectrometer (AMS) Science Operation Center (SOC) at CERN.
The AMS SOC monitoring system includes several independent tools: a Network Monitor to poll the health metrics of the AMS local computing farm, a Production Monitor to show the production status, a Frame Monitor to record the arrival status of the flight data, and...
For over a decade, LHC experiments have been relying on advanced and specialized WLCG dashboards for monitoring, visualizing and reporting the status and progress of job execution, data management transfers and site availability across the WLCG distributed grid resources.
In recent years, in order to cope with the increase in volume and variety of the grid resources, the WLCG...
The CERN Control and Monitoring Platform (C2MON) is a modular, clusterable framework designed to meet a wide range of monitoring, control, acquisition, scalability and availability requirements. It is based on modern Java technologies and has support for several industry-standard communication protocols. C2MON has been reliably utilised for several years as the basis of multiple monitoring...
In order to ensure an optimal performance of the LHCb Distributed Computing, based on LHCbDIRAC, it is necessary to be able to inspect the behavior over time of many components: first of all the agents and services on which the infrastructure is built, but also all the computing tasks and data transfers that are managed by this infrastructure. This consists of recording and then analyzing time...
One of the principal goals of the Department of Energy-funded SciDAC-Data project is to analyze the more than 410,000 high energy physics “datasets” that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics...
The LHC is the world's most powerful particle accelerator, colliding protons at a centre-of-mass energy of 13 TeV. As the energy and frequency of collisions have grown in the search for new physics, so too has the demand for computing resources needed for event reconstruction. We will report on the evolution of resource usage in terms of CPU and RAM in key ATLAS offline reconstruction workflows at...
Changes in the trigger menu, the online algorithmic event selection of the ATLAS experiment at the LHC, made in response to luminosity and detector changes, are followed by adjustments to its monitoring system. This is done to ensure that the collected data are useful and can be properly reconstructed at Tier-0, the first level of the computing grid. During Run 1, ATLAS deployed monitoring updates...
MonALISA, which stands for Monitoring Agents using a Large Integrated Services Architecture, has been developed over the last fourteen years by Caltech and its partners with the support of the CMS software and computing program. The framework is based on Dynamic Distributed Service Architecture and is able to provide complete monitoring, control and global optimization services for complex...
Physics analysis at the Compact Muon Solenoid (CMS) requires both a vast production of simulated events and an extensive processing of the data collected by the experiment.
Since the end of LHC Run 1 in 2012, CMS has produced over 20 billion simulated events, from 75 thousand processing requests organised in one hundred different campaigns, which emulate different configurations of...
Over the past two years, the operations at INFN-CNAF have undergone significant changes.
The adoption of configuration management tools such as Puppet, and the constant increase of dynamic and cloud infrastructures, have led us to investigate a new monitoring approach.
Our aim is the centralization of the monitoring service at CNAF through a scalable and highly configurable monitoring...
IceProd is a data processing and management framework developed by the IceCube Neutrino Observatory for processing of Monte Carlo simulations, detector data, and analysis levels. It runs as a separate layer on top of grid and batch systems. This is accomplished by a set of daemons which process job workflow, maintaining configuration and status information on the job before, during, and after...
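As a rough illustration of the daemon-driven workflow described above, the following minimal sketch shows a job record being advanced through lifecycle states by a periodic daemon pass; the state names, Job class and the submit/poll callables are hypothetical and are not IceProd's actual API:

    import time
    from dataclasses import dataclass, field

    # Hypothetical lifecycle states tracked before, during and after execution
    STATES = ["queued", "processing", "completed", "failed"]

    @dataclass
    class Job:
        job_id: str
        config: dict
        state: str = "queued"
        history: list = field(default_factory=list)

        def advance(self, new_state):
            # Record every transition so the status can be inspected later
            self.history.append((self.state, new_state, time.time()))
            self.state = new_state

    def daemon_cycle(jobs, submit, poll):
        """One pass of a hypothetical IceProd-like daemon: submit queued jobs
        to the underlying grid/batch layer and refresh the running ones."""
        for job in jobs:
            if job.state == "queued" and submit(job):
                job.advance("processing")
            elif job.state == "processing":
                outcome = poll(job)  # e.g. query the batch system for the job
                if outcome in ("completed", "failed"):
                    job.advance(outcome)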
Performing efficient resource provisioning is a fundamental aspect for any resource provider. Local Resource Management Systems (LRMS) have been used in data centers for decades in order to obtain the best usage of the resources, ensuring their fair usage and partitioning among users. In contrast, current cloud schedulers are normally based on the immediate allocation of resources on a...
As a new approach to resource management, virtualization technology is more and more widely applied in the high-energy physics field. A virtual computing cluster based on OpenStack was built at IHEP, with HTCondor as the job queue management system. An accounting system which can record the resource usage of the different experiment groups in detail was also developed. There are two types of the...
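The core of such per-group accounting is an aggregation over job records; the sketch below shows the general idea in Python, assuming a purely illustrative record format (the field names, group names and wall-time-based charging are assumptions, not the actual IHEP schema):

    from collections import defaultdict

    # Hypothetical job records as they might be exported from the batch system
    job_records = [
        {"group": "juno", "cpus": 8, "walltime_s": 3600},
        {"group": "lhaaso", "cpus": 4, "walltime_s": 7200},
        {"group": "juno", "cpus": 1, "walltime_s": 1800},
    ]

    def cpu_hours_by_group(records):
        """Aggregate consumed CPU-hours per experiment group."""
        usage = defaultdict(float)
        for rec in records:
            usage[rec["group"]] += rec["cpus"] * rec["walltime_s"] / 3600.0
        return dict(usage)

    print(cpu_hours_by_group(job_records))
    # {'juno': 8.5, 'lhaaso': 8.0}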
Over the past few years, Grid Computing technologies have reached a high level of maturity. One key aspect of this success has been the development and adoption of newer Compute Elements to interface the external Grid users with local batch systems. These new Compute Elements allow for better handling of job requirements and a more precise management of diverse local resources.
However,...
The HTCondor-CE is the primary Compute Element (CE) software for the Open Science Grid. While it offers many advantages for large sites, installing and configuring the HTCondor-CE can be a difficult task for smaller WLCG Tier-3 sites or opportunistic clusters. Installing a CE typically involves understanding several pieces of software, installing hundreds of packages on a dedicated node,...
Containers remain a hot topic in computing, with new use cases and tools appearing every day. Basic functionality such as spawning containers seems to have settled, but topics like volume support or networking are still evolving. Solutions like Docker Swarm, Kubernetes or Mesos provide similar functionality but target different use cases, exposing distinct interfaces and APIs.
The CERN...
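As noted above, basic container spawning is largely settled across tools; with the Docker SDK for Python, for instance, starting a short-lived container takes only a few lines (the image and command here are just examples):

    import docker  # Docker SDK for Python (pip install docker)

    client = docker.from_env()  # talk to the local Docker daemon
    output = client.containers.run("alpine:3", "echo hello from a container",
                                   remove=True)  # remove the container after it exits
    print(output.decode())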
The development of scientific computing is increasingly moving to web and mobile applications. All these clients need high-quality interfaces for accessing heterogeneous computing resources provided by clusters, grid computing or cloud computing. We present a web service called SCEAPI and describe how it can abstract away many details and complexities involved in the use of scientific...
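A web service of this kind would typically be driven through plain HTTP calls; the sketch below shows the general pattern with Python's requests library, using a purely hypothetical endpoint, token and payload (the real SCEAPI URLs and fields are not reproduced here):

    import requests

    BASE = "https://sceapi.example.org/api"      # hypothetical base URL
    headers = {"Authorization": "Bearer <token>"}  # placeholder credential

    # Submit a job description to the (hypothetical) job endpoint
    job = {"application": "demo", "args": ["input.dat"], "cores": 4}
    r = requests.post(f"{BASE}/jobs", json=job, headers=headers, timeout=30)
    r.raise_for_status()
    job_id = r.json()["id"]

    # Poll its status
    status = requests.get(f"{BASE}/jobs/{job_id}", headers=headers, timeout=30).json()
    print(status)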
The DIRAC project is developing interware to build and operate distributed computing systems. It provides a development framework and a rich set of services for both Workload and Data Management tasks of large scientific communities. A number of High Energy Physics and Astrophysics collaborations have adopted DIRAC as the base for their computing models. DIRAC was initially developed for...
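For context, user workloads are usually handed to DIRAC through its Python API; a minimal submission might look like the sketch below, based on the documented Job/Dirac interfaces (the exact calls available can depend on the DIRAC release in use):

    # Minimal DIRAC job submission sketch; API names follow the DIRAC
    # documentation, details may differ between releases.
    from DIRAC.Core.Base import Script
    Script.parseCommandLine()  # initialise the DIRAC environment

    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    job = Job()
    job.setName("hello_dirac")
    job.setExecutable("/bin/echo", arguments="Hello from DIRAC")

    result = Dirac().submitJob(job)
    print(result)  # S_OK/S_ERROR structure containing the job ID on success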
In the last few years, new types of computing models, such as IaaS (Infrastructure as a Service) and IaaC (Infrastructure as a Client), have gained popularity. New resources may come as part of pledged resources, while others come in the form of opportunistic ones. Most, but not all, of these new infrastructures are based on virtualization techniques. In addition, some of them present opportunities...
The Worldwide LHC Computing Grid infrastructure links about 200 participating computing centers affiliated with several partner projects. It is built by integrating heterogeneous computer and storage resources in diverse data centers all over the world and provides CPU and storage capacity to the LHC experiments to perform data processing and physics analysis. In order to be used by the...
The Belle II experiment will generate very large data samples. In order to reduce the time for data analyses, loose selection criteria will be used to create files rich in samples of particular interest for a specific data analysis (data skims). Even so, many of the resultant skims will be very large, particularly for highly inclusive analyses. The Belle II collaboration is investigating the...
The higher energy and luminosity from the LHC in Run 2 have put increased pressure on CMS computing resources. Extrapolating to even higher luminosities (and thus higher event complexities and trigger rates) in Run 3 and beyond, it becomes clear that the current model of CMS computing alone will not scale accordingly. High Performance Computing (HPC) facilities, widely used in scientific computing...
The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA-managed resources are distributed worldwide, on hundreds of computing sites, with thousands of physicists accessing hundreds of petabytes of data, and the rate of data processing already exceeds an exabyte per year.
While...