CHEP 2016 took place from Monday, October 10 to Friday, October 14, 2016 at the Marriott Marquis, San Francisco.
A WLCG Workshop was held on October 8 and 9 at the Marriott.
The SND detector takes data at the e+e- collider VEPP-2000 in Novosibirsk. We present here
recent upgrades of the SND DAQ system, which are mainly aimed at handling the increased event
rate after the collider modernization. To maintain acceptable event selection quality, the electronics
throughput and computational power have to be increased. These goals are achieved with the new fast (Flash ADC)
digitizing electronics and distributed data taking. The data flow for the most congested detector subsystems
is distributed and processed separately. We describe the new distributed SND DAQ software architecture,
its computational and network infrastructure.
The Cherenkov Telescope Array (CTA) will be the next generation ground-based gamma-ray observatory. It will be made up of approximately 100 telescopes of three different sizes, from 4 to 23 meters in diameter. The previously presented prototype of a high speed data acquisition (DAQ) system for CTA (CHEP 2012) has become concrete within the NectarCAM project, one of the most challenging camera projects with very demanding needs for bandwidth of data handling.
We designed a Linux-PC system able to concentrate and process without packet loss the 40 Gb/s average data rate coming from the 265 Front End Boards (FEB) through Gigabit Ethernet links, and to reduce the data, using external trigger decisions as well as custom-tailored compression algorithms, to fit within the two ten-Gigabit Ethernet downstream links. Within the given constraints, we implemented de-randomisation of the event fragments received as relatively small UDP packets emitted by the FEBs, using off-the-shelf equipment as required by the project, which foresees an operation period of at least 30 years.
We tested out-of-the-box interfaces and used original techniques to cope with these requirements, and set up a test bench with hundreds of synchronous Gigabit links in order to validate and tune the acquisition chain, including downstream data logging based on ZeroMQ and Google Protocol Buffers.
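As an illustration of the de-randomisation step described above, the following minimal Python sketch reassembles event fragments that arrive as small UDP packets from many front-end boards and pushes complete events downstream over ZeroMQ. The packet header layout, the port number and the downstream endpoint are assumptions made for the example; they do not reflect the actual NectarCAM implementation.

```python
# Illustrative sketch only: reassemble event fragments arriving as small UDP
# packets from many front-end boards, then forward complete events over ZeroMQ.
# The header layout (event_id, feb_id), port and endpoint are assumptions.
import socket
import struct
from collections import defaultdict

import zmq

N_FEB = 265                      # number of front-end boards (from the abstract)
HEADER = struct.Struct("!IH")    # assumed header: event id (uint32), FEB id (uint16)

udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.bind(("0.0.0.0", 50000))     # assumed port

push = zmq.Context().socket(zmq.PUSH)
push.connect("tcp://event-builder:5555")   # assumed downstream data logger

fragments = defaultdict(dict)    # event id -> {feb id: payload}

while True:
    packet, _ = udp.recvfrom(9000)
    event_id, feb_id = HEADER.unpack_from(packet)
    fragments[event_id][feb_id] = packet[HEADER.size:]
    if len(fragments[event_id]) == N_FEB:          # event is complete
        event = fragments.pop(event_id)
        push.send_multipart([HEADER.pack(event_id, 0)] +
                            [event[i] for i in sorted(event)])
```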
The LHC will collide protons in the ATLAS detector with increasing luminosity through 2016, placing stringent operational and physical requirements on the ATLAS trigger system, which must reduce the 40 MHz collision rate to a manageable event storage rate of about 1 kHz without rejecting interesting physics events. The Level-1 trigger is the first rate-reducing step in the ATLAS trigger system, with an output rate of up to 100 kHz and a decision latency smaller than 2.5 μs. It consists of a calorimeter trigger, a muon trigger and a central trigger processor.
During the LHC shutdown after Run 1 finished in 2013, the Level-1 trigger system was upgraded at the hardware, firmware and software levels. In particular, a new electronics sub-system was introduced in the real-time data processing path: the Topological Processor System (L1Topo). It consists of a single AdvancedTCA shelf equipped with two Level-1 topological processor blades. They receive real-time information from the Level-1 calorimeter and muon triggers, which is processed by four individual state-of-the-art FPGAs. The system has to deal with a large input bandwidth of up to 6 Tb/s, optical connectivity and low processing latency on the real-time data path. The L1Topo firmware includes measurements of angles between jets and/or leptons and the determination of kinematic variables based on lists of selected or sorted trigger objects. All these complex calculations are executed in hardware within 200 ns. Over one hundred VHDL algorithms produce trigger outputs that are incorporated into the logic of the central trigger processor, which is responsible for generating the Level-1 accept signal. The detailed additional information provided by L1Topo will improve the ATLAS physics reach in a harsher collision environment.
The system was installed and commissioning started during 2015 and continued during 2016. As part of the firmware commissioning, the physics output from individual algorithms needs to be simulated and compared with the hardware response. An overview of the design, the commissioning process and the early impact of the new L1Topo system on physics results will be presented.
ALICE HLT Run 2 performance overview
M. Krzewicki for the ALICE collaboration
The ALICE High Level Trigger (HLT) is an online reconstruction and data compression system used in the ALICE experiment at CERN. Unique among the LHC experiments, it extensively uses modern coprocessor technologies like general purpose graphic processing units (GPGPU) and field programmable gate arrays (FPGA) in the data flow.
Real-time data compression is performed using a cluster finder algorithm implemented on FPGA boards, followed by optimisation and Huffman encoding stages. These compressed data, rather than the raw clusters, are stored and used in the subsequent offline processing. For Run 2 and beyond, the compression scheme is being extended to provide higher compression ratios.
Track finding is performed using a cellular automaton and a Kalman filter algorithm on GPGPU hardware, where CUDA, OpenCL and OpenMP (for CPU support) technologies can be used interchangeably.
In the context of the upgrade of the readout system the HLT framework was optimised to fully handle the increased data and event rates due to the time projection chamber (TPC) readout upgrade and the increased LHC luminosity.
Online calibration of the TPC using the HLT's online tracking capabilities was deployed. To this end, offline calibration code was adapted to run both online and offline, and the HLT framework was extended accordingly. The performance of this scheme is important for Run 3 related developments. Besides being an important exercise for Run 3, online calibration can already reduce the computing workload during the offline calibration and reconstruction cycle in Run 2.
A new multi-part messaging approach was developed, which at the same time forms a test bed for the new data flow model of the O2 system, where further development of this concept is ongoing.
This messaging technology, here based on ZeroMQ, was used to implement the calibration feedback loop on top of the existing, graph oriented HLT transport framework.
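The following minimal Python sketch illustrates a ZeroMQ-based calibration feedback loop of the kind described above, assuming a simple PUB/SUB pattern: a calibration component publishes updated calibration objects, and reconstruction components pick up the newest one without blocking. The endpoints and the pickle serialisation are assumptions for illustration; the actual HLT transport framework is a C++ publisher-subscriber system.

```python
# Minimal sketch of a ZeroMQ-based feedback loop (PUB/SUB pattern assumed):
# a calibration process publishes updated calibration objects, and
# reconstruction processes subscribe and apply them to later events.
# Endpoints and serialisation are assumptions, not the actual HLT framework.
import pickle
import zmq

ctx = zmq.Context()

# --- calibration side (would run in its own process) ------------------------
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:6000")                      # assumed endpoint

def publish_calibration(drift_velocity):
    payload = pickle.dumps({"tpc_drift_velocity": drift_velocity})
    pub.send_multipart([b"calib", payload])

# --- reconstruction side (would run in other processes) ---------------------
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://calibration-node:6000")    # assumed endpoint
sub.setsockopt(zmq.SUBSCRIBE, b"calib")

def poll_calibration(current):
    """Return the newest calibration seen so far, without blocking reconstruction."""
    while sub.poll(timeout=0):
        _, payload = sub.recv_multipart()
        current = pickle.loads(payload)
    return current
```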
Utilising the online reconstruction of many detectors, a new asynchronous monitoring scheme was developed to allow real-time monitoring of the physics performance of the ALICE detector, again making use of the new messaging scheme for both internal and external communication.
The spare compute resource, the development cluster consisting of older HLT infrastructure, is run as a Tier-2 Grid site using an OpenStack-based setup, contributing as many resources as feasible depending on the data-taking conditions. In periods of inactivity during shutdowns, both the production and development clusters contribute significant computing power to the ALICE Grid.
The ALICE HLT uses a data transport framework based on the publisher-subscriber messaging principle, which transparently handles the communication between processing components over the network and, via shared memory with a zero-copy approach, between processing components on the same node.
We present an analysis of the performance in terms of maximum achievable data rates and event rates as well as processing capabilities during Run 1 and Run 2.
Based on this analysis, we present new optimizations we have developed for ALICE in Run 2.
These include support for asynchronous transport via ZeroMQ, which enables loops in the reconstruction chain graph and which is used to ship QA histograms to DQM.
We have added asynchronous processing capabilities in order to support long-running tasks besides the event-synchronous reconstruction tasks in normal HLT operation.
These asynchronous components run in an isolated process such that the HLT as a whole is resilient even to fatal errors in these asynchronous components.
In this way, we can ensure that new developments cannot break data taking.
On top of that, we have tuned the processing chain to cope with the higher event and data rates expected from the new TPC readout electronics (RCU2), and we have improved the configuration procedure and the startup time in order to increase the time during which ALICE can take physics data.
We present an analysis of the maximum achievable data processing rates taking into account processing capabilities of CPUs and GPUs, buffer sizes, network bandwidth, the incoming links from the detectors, and the outgoing links to data acquisition.
The LHCb software trigger underwent a paradigm shift before the start of Run-II. From being a system to select events for later offline reconstruction, it can now perform the event analysis in real-time, and subsequently decide which part of the event information is stored for later analysis.
The new strategy is only possible due to a major upgrade during LHC Long Shutdown 1 (2012-2015). The CPU farm was increased by almost a factor of two, and the software trigger was split into two stages. The first stage performs a partial event reconstruction and inclusive selections to reduce the 1 MHz input rate from the hardware trigger to an output rate of 150 kHz. The output is buffered on hard disks distributed across the trigger farm. This allows for asynchronous execution of the second stage, so that the CPU farm can be exploited also in between fills, and, as an integral part of the new strategy, for the real-time alignment and calibration of sub-detectors before further processing. The second stage performs a full event reconstruction which is identical to the configuration used offline. LHCb is the first high energy collider experiment to do this. Hence, event selections are based on the best-quality information, and physics analyses can be performed directly on the output of the trigger. This concept, called the "LHCb Turbo stream", where reduced event information is saved, increases the possible output rate while keeping the storage footprint small.
In 2017, around half of the 400 trigger selections send their output to the Turbo stream and, for the first time, the Turbo stream no longer keeps the raw sub-detector data banks that would be needed for a repeated offline event reconstruction.
This allows up to a factor of 10 decrease in the size of the events, and thus an equivalently higher rate of signals that can be exploited in physics analyses. Additionally, the event format has been made more flexible, which has allowed
the Turbo stream to be used in more physics analyses. We review the status of this real-time analysis and discuss our plans for its evolution during Run-II towards the upgraded LHCb experiment that will begin operation in Run-III.
ATLAS's current software framework, Gaudi/Athena, has been very successful for the experiment in LHC Runs 1 and 2. However, its single-threaded design has been recognised for some time to be increasingly problematic as CPUs have increased core counts and decreased available memory per core. Even the multi-process version of Athena, AthenaMP, will not scale to the range of architectures we expect to use beyond Run 2.
After concluding a rigorous requirements phase, in which many design components were examined in detail, ATLAS has begun the migration to a new data-flow-driven, multi-threaded framework, which enables the simultaneous processing of singleton, thread-unsafe legacy Algorithms, cloned Algorithms that execute concurrently in their own threads with different event contexts, and fully re-entrant, thread-safe Algorithms.
In this paper we will report on the process of modifying the framework to safely process multiple concurrent events in different threads, which entails significant changes in the underlying handling of features such as event and time dependent data, asynchronous callbacks, metadata, integration with the Online High Level Trigger for partial processing in certain regions of interest, concurrent I/O, as well as ensuring thread safety of core services. We will also report on the migration of user code to the new framework, including that of upgrading select Algorithms to be fully re-entrant.
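As a rough illustration of the distinction between re-entrant and thread-unsafe legacy algorithms, the following Python sketch (an analogy, not Gaudi/Athena code) processes several events concurrently: the stateless algorithm can run on many events at once, while the legacy one must serialise its calls. All class and field names are invented for the example.

```python
# Illustrative Python analogy (not Gaudi/Athena code) of re-entrant vs.
# thread-unsafe legacy algorithms inside a simple multi-threaded event loop.
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class ReentrantAlgorithm:
    """Stateless: one instance can safely run on many events at once."""
    def execute(self, event):
        return sum(event["hits"])


class LegacyAlgorithm:
    """Thread-unsafe singleton: calls must be serialised with a lock."""
    def __init__(self):
        self._lock = Lock()
        self._scratch = 0

    def execute(self, event):
        with self._lock:
            self._scratch = max(event["hits"], default=0)
            return self._scratch


reentrant = ReentrantAlgorithm()
legacy = LegacyAlgorithm()

def process(event):
    # cloned algorithms (not shown) would instead be instantiated once per thread
    return {"sum": reentrant.execute(event), "max": legacy.execute(event)}

events = [{"hits": [i, i + 1, i + 2]} for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, events))
print(results)
```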
In 2015, CMS was the first LHC experiment to begin using a multi-threaded framework for doing event processing. This new framework utilizes Intel's Threading Building Blocks (TBB) library to manage concurrency via a task-based processing model. During the 2015 LHC run period, CMS only ran reconstruction jobs using multiple threads, because only those jobs were sufficiently thread efficient. Recent work now allows simulation and digitization to be thread efficient as well. In addition, during 2015 the multi-threaded framework could run events in parallel but could only use one thread per event. Work done in 2016 now allows multiple threads to be used while processing one event. In this presentation we will show how these recent changes have improved CMS's overall threading and memory efficiency, and we will discuss work to be done to further increase those efficiencies.
The Future Circular Collider (FCC) software effort is supporting the different experiment design studies for the three future collider options, hadron-hadron, electron-electron or electron-hadron. The software framework used by data processing applications has to be independent of the detector layout and the collider configuration. The project starts from the premise of using existing software packages that are experiment independent and of leveraging other packages, such as the LHCb simulation framework or the ATLAS tracking software, that can be easily modified to factor out any experiment dependency. At the same time, new components are being developed with a view to allowing usage outside of the FCC software project; for example, the data analysis front-end, which is written in Python, is decoupled from the main software stack and is only dependent on the FCC event data model. The event data model itself is generated from configuration files, allowing customisation, and enables parallelisation by supporting a corresponding data layout. A concise overview of the FCC software project will be presented and developments that can be of use to the HEP community highlighted, including the experiment-independent event data model library, the integrated simulation framework that supports Fast and Full simulation and the Tracking Software package.
LArSoft is an integrated, experiment-agnostic set of software tools for liquid argon (LAr) neutrino experiments
to perform simulation, reconstruction and analysis within the Fermilab art framework.
Along with common algorithms, the toolkit provides generic interfaces and extensibility
that accommodate the needs of detectors of very different size and configuration.
To date, LArSoft has been successfully used
by ArgoNeuT, MicroBooNE, LArIAT, SBND, DUNE, and other prototype single-phase LAr TPCs.
Work is in progress to include support for dual-phase LAr TPCs,
such as some of the DUNE prototypes.
The LArSoft suite provides a wide selection of algorithms for event generation,
simulation of LAr TPCs including optical readouts,
and facilities for signal processing and the simulation of "auxiliary" detectors (e.g., scintillators).
Additionally, it offers algorithms for the full range of reconstruction
from hits on single channels, up to track trajectories,
identification of electromagnetic cascades,
estimation of particle momentum and energy,
and particle identification.
LArSoft provides data structures describing common physics objects and concepts,
which constitute a protocol connecting the algorithms.
It also includes the visualisation of generated and reconstructed objects,
which helps with algorithm development.
LArSoft content is contributed by the adopting experiments.
New experiments must provide a description of their detector geometry
and specific code for the treatment of the TPC electronic signals.
With that, the experiments gain instant access to the full set of algorithms within the suite.
The improvements which they achieve can be pushed back into LArSoft,
allowing the rest of the community to take advantage of them.
The sharing of algorithms enabled by LArSoft has been a major advantage for small experiments
that have little effort to devote to the creation of equivalent software infrastructure and tools.
LArSoft is also a collaboration of experiments, Fermilab and associated software projects
which cooperate in setting requirements, priorities and schedules.
A core project team provides support for infrastructure, architecture, software, documentation and coordination,
with oversight and input from the collaborating experiments and projects.
In this talk, we outline the general architecture of the software
and the interaction with external libraries and experiment specific code.
We also describe the dynamics of LArSoft development
between the contributing experiments,
the projects supporting the software infrastructure LArSoft relies on,
and the LArSoft support core project.
In 2012 CMS evaluated which underlying concurrency technology would be the best to use for its multi-threaded framework. The available technologies were evaluated on the high-throughput computing systems dominating the resources in use at that time. A skeleton framework benchmarking suite that emulates the tasks performed within a CMSSW application was used to select Intel's Threading Building Blocks (TBB) library, based on the measured memory and CPU overheads of the different technologies benchmarked. In 2016 CMS will get access to high performance computing resources that use new many-core architectures, such as Cori Phase 1 and 2, Theta, and Mira. Because of this we have revived the 2012 benchmark to test its performance and conclusions on these new architectures. This talk will discuss the results of this exercise.
The Production and Distributed Analysis (PanDA) system has been developed to meet ATLAS production
and analysis requirements for a data-driven workload management system capable of operating
at the Large Hadron Collider (LHC) data processing scale. Heterogeneous resources used by the ATLAS
experiment are distributed worldwide at hundreds of sites, thousands of physicists analyse the data
remotely, the volume of processed data is beyond the exabyte scale, dozens of scientific
applications are supported, while data processing requires more than a few billion hours of computing
usage per year. PanDA performed very well over the last decade including the LHC Run 1 data
taking period. However, it was decided to upgrade the whole system concurrently with the LHC's
first long shutdown in order to cope with rapidly changing computing infrastructure.
After two years of reengineering efforts, PanDA has embedded capabilities for fully dynamic
and flexible workload management. The static batch job paradigm was discarded in favor of a more
automated and scalable model. Workloads are dynamically tailored for optimal usage of resources,
with the brokerage taking network traffic and forecasts into account. Computing resources
are partitioned based on dynamic knowledge of their status and characteristics. The pilot has been
re-factored around a plugin structure for easier development and deployment. Bookkeeping is handled
with both coarse and fine granularities for efficient utilization of pledged or opportunistic resources.
Leveraging direct remote data access and federated storage relaxes the geographical coupling between
processing and data. An in-house security mechanism authenticates the pilot and data management
services in off-grid environments such as volunteer computing and private local clusters.
The PanDA monitor has been extensively optimized for performance and extended with analytics to provide
aggregated summaries of the system as well as drill-down to operational details. Many other
features have been planned or recently implemented, and PanDA has been adopted by non-LHC experiments,
for example bioinformatics groups successfully running Paleomix (microbial genome and metagenome)
payloads on supercomputers. In this talk we will focus on the new and planned features that are most
important to the next decade of distributed computing workload management.
The ATLAS workload management system is a pilot system based on a late-binding philosophy which for many years avoided
passing fine-grained job requirements to the batch system. In particular for memory, most of the requirements were set to request
4 GB vmem as defined in the EGI portal VO card, i.e. 2 GB RAM + 2 GB swap. However, in the past few years several changes
in the operating system kernel and in the applications have made such a definition of the memory to request for slots obsolete,
and ATLAS has introduced the new PRODSYS2 workload management system, which evaluates memory requirements more flexibly
and submits to appropriate queues. The work stemmed in particular from the introduction of 64-bit multi-core workloads and the
increased memory requirements of some of the single-core applications. This paper describes the overall review and changes of
memory handling, starting from the definition of tasks, the way task memory requirements are set using scout jobs and the new
memory tool produced in the process to do that, how the factories set these values, and finally how the jobs are treated by the
sites through the CEs, batch systems and ultimately the kernel.
All four of the LHC experiments depend on web proxies (that is, squids) at each grid site in order to support software distribution by the CernVM FileSystem (CVMFS). CMS and ATLAS also use web proxies for conditions data distributed through the Frontier Distributed Database caching system. ATLAS and CMS each have their own methods for their grid jobs to find out which web proxy to use for Frontier at each site, and CVMFS has a third method. Those diverse methods limit usability and flexibility, particularly for opportunistic use cases. This paper describes a new Worldwide LHC Computing Grid (WLCG) system for discovering the addresses of web proxies that is based on an internet standard called Web Proxy Auto Discovery (WPAD). WPAD is in turn based on another standard called Proxy Auto Configuration (PAC) files. Both the Frontier and CVMFS clients support this standard. The input into the WLCG system comes from squids registered by sites in the Grid Configuration Database (GOCDB) and the OSG Information Management (OIM) system, combined with some exceptions manually configured by people from ATLAS and CMS who participate in WLCG squid monitoring. Central WPAD servers at CERN respond to http requests from grid nodes all over the world with a PAC file describing how grid jobs can find their web proxies, based on IP addresses matched in a database that contains the IP address ranges registered to organizations. Large grid sites are encouraged to supply their own WPAD web servers for more flexibility, to avoid being affected by short-term long-distance network outages, and to offload the WPAD servers at CERN. The CERN WPAD servers additionally support requests from jobs running at non-grid sites (particularly for LHC@Home), which they direct to the nearest publicly accessible web proxy servers. The responses to those requests are based on a separate database that maps IP addresses to longitude and latitude.
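For illustration, the sketch below shows the client side of WPAD/PAC discovery in Python, assuming the pacparser library is available; the WPAD URL and the target host are placeholders, not necessarily the production WLCG endpoints.

```python
# Minimal sketch of the client side of WPAD/PAC proxy discovery, assuming the
# pacparser library is installed. The WPAD URL below is a placeholder, not
# necessarily the production WLCG endpoint.
import urllib.request
import pacparser

WPAD_URL = "http://wpad.example.org/wpad.dat"   # placeholder WPAD server

pac_text = urllib.request.urlopen(WPAD_URL, timeout=10).read().decode()

pacparser.init()
pacparser.parse_pac_string(pac_text)

# Ask which proxies to use for the server we want to reach; the answer looks
# like "PROXY squid1.site.org:3128; PROXY squid2.site.org:3128; DIRECT"
proxies = pacparser.find_proxy("http://frontier.example.org/", "frontier.example.org")
print(proxies)
pacparser.cleanup()
```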
The ATLAS computing model was originally designed as static clouds (usually national or geographical groupings of sites) around the
Tier 1 centers, which confined tasks and most of the data traffic. Since those early days, the sites' network bandwidth has
increased by a factor of O(1000) and the difference in functionality between Tier 1s and Tier 2s has been reduced. After years of manual,
intermediate solutions, we have now ramped up to full usage of World-cloud, the latest step in the PanDA Workload Management System
to increase resource utilization on the ATLAS Grid, for all workflows (MC production, data (re)processing, etc.).
We have based the development on two new site concepts. Nuclei sites are the Tier 1s and large Tier 2s, where tasks will be
assigned and the output aggregated, and satellites are the sites that will execute the jobs and send the output to their nucleus.
Nuclei and satellite sites are dynamically paired by PanDA for each task based on the input data availability, capability matching,
site load and network connectivity. This contribution will introduce the conceptual changes for World-cloud, the development
necessary in PanDA, an insight into the network model and the first half year of operational experience.
With ever-greater computing needs and fixed budgets, big scientific experiments are turning to opportunistic resources as a means
to add much-needed extra computing power. These resources can be very different in design from the resources that comprise the Grid
computing of most experiments, therefore exploiting these resources requires a change in strategy for the experiment. The resources
may be highly restrictive in what can be run or in connections to the outside world, or tolerate opportunistic usage only on
condition that tasks may be terminated without warning. The ARC CE with its non-intrusive architecture is designed to integrate
resources such as High Performance Computing (HPC) systems into a computing Grid. The ATLAS experiment developed the Event Service
primarily to address the issue of jobs that can be terminated at any point when opportunistic resources are needed by someone else.
This paper describes the integration of these two systems in order to exploit opportunistic resources for ATLAS in a restrictive
environment. In addition to the technical details, results from deployment of this solution in the SuperMUC HPC in Munich are
shown.
The ATLAS production system has provided the infrastructure to process tens of thousands of events during LHC Run 1 and the first year of LHC Run 2 using grid, cloud and high-performance computing resources. In this contribution we address the strategies and improvements added to the production system to optimize its performance and obtain the maximum efficiency of the available resources from an operational perspective, focusing in detail on the recent developments.
The DPM (Disk Pool Manager) project is the most widely deployed solution for storage of large data repositories on Grid sites, and is completing the most important upgrade in its history, with the aim of bringing important new features, improved performance and easier long-term maintainability.
Work has been done to make the so-called "legacy stack" optional, and to substitute it with an advanced implementation based on FastCGI and RESTful technologies.
Besides the obvious gain of making optional several legacy components that are difficult to maintain, this step brings important features together with performance enhancements. Among the most important features we can cite the simplification of the configuration, the possibility of working in a totally SRM-free mode, the implementation of quotas and of free/used space reporting on directories, and the implementation of volatile pools that can pull files from external sources, which can be used to deploy data caches.
Moreover, the communication with the new core, called DOME (Disk Operations Management Engine), now happens over secure HTTPS channels through an extensively documented,
industry-compliant protocol.
For this leap, referred to by the codename "DPM Evolution", the help of the DPM collaboration has been very important in the beta-testing phases, and here we report on the technical choices and the first site experiences.
CERN has been developing and operating EOS as a disk storage solution successfully for 5 years. The CERN deployment provides 135 PB and stores 1.2 billion replicas distributed over two computer centres. The deployment includes four LHC instances, a shared instance for smaller experiments and, since last year, an instance for individual user data as well. The user instance represents the backbone of the CERNBOX file-sharing service. New use cases like synchronisation and sharing, the planned migration to reduce AFS usage at CERN, and the continuous growth have brought EOS to new challenges.
Recent developments include the integration and evaluation of various technologies to do the transition from a single active in-memory namespace to a scale-out implementation distributed over many meta-data servers. The new architecture aims to separate the data from the application logic and user interface code, thus providing flexibility and scalability to the namespace component.
Another important goal is to provide EOS as a CERN-wide mounted filesystem with strong authentication, making it a single storage repository accessible via various services and front-ends (the /eos initiative). This required new developments in the security infrastructure of the EOS FUSE implementation. Furthermore, a series of improvements targeting the end-user experience was made, such as tighter consistency and latency optimisations.
In collaboration with Seagate as an openlab partner, EOS has a complete integration of the OpenKinetic object drive cluster as a high-throughput, high-availability, low-cost storage solution.
This contribution will discuss these three main development projects and present new performance metrics.
XRootD is a distributed, scalable system for low-latency file access. It is the primary data access framework for the high-energy physics community. One of the latest developments in the project has been to incorporate metalink and segmented file transfer technologies.
We report on the implementation of metalink metadata format support within the XRootD client. This includes both the CLI and the API semantics. Moreover, we give an overview of the employed segmented file transfer mechanism that exploits metalink-based data sources. Its aim is to provide multi-source (BitTorrent-like) file transmission, which results in increased transfer rates.
This contribution summarizes these two development projects and presents the outcomes.
Over the previous decade, high-performance, high-capacity Open Source storage systems have been designed and implemented to accommodate the demanding needs of the LHC experiments. However, with the general move away from the concept of local computer centres supporting their associated communities, towards large infrastructures providing Cloud-like solutions to a large variety of different scientific groups, storage systems have needed to adjust their capabilities in many areas, such as federated identities, non-authenticated delegation to portals or platforms, modern sharing, and user-defined quality of storage.
This presentation will give an overview on how dCache is keeping up with modern Cloud storage requirements by partnering with EU projects, which provide the necessary contact to a large set of Scientific Communities.
Regarding authentication, there is no longer a strict relationship between the individual scientist, the scientific community and the infrastructure providing the resources. Federated identity systems like SAML or OpenID Connect are becoming the method of choice for new scientific groups and are even finding their way into HEP.
Therefore, under the umbrella of the INDIGO-DataCloud project, dCache is implementing those authentication mechanisms in addition to the already established ones, like username/password, Kerberos and X.509 certificates.
To simplify the use of dCache as the back-end of scientific portals, dCache is experimenting with new anonymous delegation methods, like "Macaroons", which the dCache team would like to introduce in order to start a discussion targeting their broader acceptance in portals and at the level of service providers.
As the separation between managing scientific mass data and scientific semi-private data, like publications, is no longer strict, large data management systems are supposed to provide a simple interface to easily share data among individuals or groups. While some systems are offering that feature through web portals only, dCache will show that this can be provided uniquely for all protocols the system supports, including NFS and GridFTP.
Furthermore, in modern storage infrastructures, the storage media, and consequently the quality and price of the requested storage space, are no longer negotiated with the responsible system administrators but dynamically selected by the end user or by automated computing platforms. The same is true for data migration between different qualities of storage. To accommodate this conceptual change, dCache is exposing its entire data management interface through a RESTful service and a graphical user interface. The implemented mechanisms follow the recommendations of the corresponding working groups in RDA and SNIA and are agreed upon with the INDIGO-DataCloud project to be compatible with similar functionalities of other INDIGO-provided storage systems.
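As a purely hypothetical sketch of how such a RESTful data management interface might be used from a client, the Python snippet below queries and changes the quality of storage of a file; the endpoint paths, JSON fields and credentials are invented for illustration and do not describe the actual dCache API.

```python
# Hypothetical sketch of a client querying and changing the quality of storage
# (QoS) of a file through a RESTful interface such as the one described above.
# Endpoint paths, JSON fields and credentials are invented for illustration.
import requests

BASE = "https://storage.example.org:3880/api/v1"    # hypothetical endpoint
AUTH = ("alice", "secret")                           # or a bearer token / macaroon

# Inspect the current QoS of a file
info = requests.get(f"{BASE}/namespace/data/run2017/file.root",
                    params={"qos": "true"}, auth=AUTH, verify=True).json()
print(info.get("currentQos"), info.get("targetQos"))

# Request a migration to a different quality of storage (e.g. disk+tape)
requests.post(f"{BASE}/namespace/data/run2017/file.root",
              json={"action": "qos", "target": "disk+tape"}, auth=AUTH)
```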
ZFS is a combination of file system, logical volume manager, and software RAID system developed by Sun Microsystems for the Solaris OS. ZFS simplifies the administration of disk storage, and on Solaris it has been well regarded for its high performance, reliability, and stability for many years. It is used successfully for enterprise storage administration around the globe, but so far on such systems ZFS was mainly used to provide storage, such as users' home directories, through NFS and similar network protocols. Since ZFS has recently become available in a stable version on Linux, we present here the usage and benefits of ZFS as a backend for WLCG storage servers based on Linux and its advantages over current WLCG storage practices using hardware RAID systems.
We tested ZFS in comparison to hardware RAID configurations on WLCG DPM storage servers used to provide data storage to the LHC experiments. The tests investigated the performance as well as the reliability and behaviour in different failure scenarios, such as simulated failures of single disks and of whole storage devices. The test results comparing ZFS to other file systems based on a hardware RAID vdev will be presented, as well as recommendations for a ZFS-based setup for a WLCG data storage server based on our test results. Among others, we tested the performance under different vdev and redundancy configurations, the behaviour in failure situations, and the redundancy rebuild behaviour. We will also report on the importance of ZFS' own unique features and their benefits for WLCG storage. For example, initial tests using ZFS' built-in compression on sample data containing ROOT files indicated a reduction in space of 4% without any negative impact on the performance. We will report on the space reduction and how the compression performance scales to 1 PB of LHC experiment data. Scaled to the whole of the LHC experiments' data, that could provide a significant amount of additional storage at no extra cost to the sites. Since more sites provide data storage also to other non-LHC experiments, being able to use compression could be of even greater benefit to the overall disk capacity provided by a site.
After very promising first results on using ZFS on Linux at one of the NGI UK distributed Tier-2 ScotGrid sites, together with the much easier administration and better reliability compared to hardware RAID systems, we switched the whole storage on this site to ZFS and will also report on the longer-term experience of using it.
All ZFS tests are based on a Linux system (SL6) with the latest stable ZFS-on-Linux version instead of a traditional Solaris based system. To make the test results transferable to other WLCG sites, typical storage servers were used as client machines, managing 36 disks of different capacities that had previously been used in hardware RAID configurations based on typical hardware RAID controllers.
Based on GooFit, a GPU-friendly framework for doing maximum-likelihood fits, we have developed a tool for extracting model-independent S-wave amplitudes from three-body decays such as D+ --> h(')-,h+,h+. A full amplitude analysis is performed in which the magnitudes and phases of the S-wave amplitudes (or alternatively, the real and imaginary components) are anchored at a finite number of points in m^2(h(')-,h+), and a cubic spline is used to interpolate between these points. The amplitudes for P-wave and D-wave resonant states are modeled as spin-dependent Breit-Wigners. GooFit uses the Thrust library to launch all kernels, with a CUDA back-end for NVIDIA GPUs and an OpenMP back-end for compute nodes with conventional CPUs. Performance on a variety of these platforms is compared. Execution time on systems with GPUs is a few hundred times faster than running the same algorithm on a single CPU.
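The spline-based S-wave parameterisation can be illustrated with a short Python sketch (not GooFit code): complex amplitudes are anchored at a finite set of m^2 knots and the real and imaginary parts are interpolated with cubic splines. The knot positions and amplitude values below are invented for the example.

```python
# Illustrative sketch (not GooFit) of a model-independent S-wave parameterisation:
# amplitudes anchored at a finite set of m^2 knots, interpolated with cubic
# splines separately for the real and imaginary parts. Values are invented.
import numpy as np
from scipy.interpolate import CubicSpline

# anchor points in m^2(h'- h+) [GeV^2/c^4] and complex S-wave amplitudes (example values)
m2_knots = np.array([0.4, 0.8, 1.2, 1.6, 2.0, 2.4, 2.8])
amp_knots = np.array([1.0 + 0.2j, 0.8 + 0.5j, 0.3 + 0.9j,
                      -0.2 + 0.7j, -0.5 + 0.3j, -0.4 + 0.1j, -0.2 + 0.05j])

re_spline = CubicSpline(m2_knots, amp_knots.real)
im_spline = CubicSpline(m2_knots, amp_knots.imag)

def s_wave(m2):
    """Interpolated S-wave amplitude at invariant mass squared m2."""
    return re_spline(m2) + 1j * im_spline(m2)

print(s_wave(1.0), abs(s_wave(1.0)), np.angle(s_wave(1.0)))
```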
PODIO is a C++ library that supports the automatic creation and efficient handling of HEP event data, developed as a new EDM toolkit for future particle physics experiments in the context of the AIDA2020 EU programme. Event
data models (EDMs) are at the core of every HEP experiment's software framework, essential for providing a communication channel between different algorithms in the data processing chain as well as for efficient I/O. Experience from the LHC and the Linear Collider community shows that existing solutions partly suffer from overly complex data models with deep object hierarchies or unfavourable I/O performance. The PODIO project was created in order to address these problems. PODIO is based on the idea of employing plain-old-data (POD) structures wherever possible, while avoiding deep object hierarchies and virtual inheritance. At the same time it provides the necessary high-level interface towards the developer physicist, such as support for inter-object relations and automatic memory management, as well as a ROOT-assisted Python interface. To simplify the creation of efficient data models, PODIO employs code generation from a simple YAML-based markup language. In addition, it was developed with concurrency in mind in order to support the usage of modern CPU features, for example giving basic support for vectorisation techniques. This contribution presents the PODIO design, first experience in the context of the Future Circular Collider (FCC) and Linear Collider (LC) software projects, as well as performance figures when using ROOT as the storage backend.
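The plain-old-data idea behind PODIO can be illustrated with a short Python sketch (not the PODIO API itself): physics objects live in flat, column-oriented arrays, and relations are stored as integer indices rather than deep object hierarchies. All names and fields below are invented for the example.

```python
# Illustration of a POD-style, column-oriented event data layout (not PODIO code):
# flat arrays per collection, relations expressed as index ranges.
import numpy as np

# flat POD storage for hits: one contiguous array per collection
hit_dtype = np.dtype([("cellID", "u8"), ("energy", "f4"),
                      ("x", "f4"), ("y", "f4"), ("z", "f4")])
hits = np.zeros(5, dtype=hit_dtype)
hits["energy"] = [0.1, 0.3, 0.2, 0.8, 0.4]

# a cluster refers to its hits via an index range into the hit collection
cluster_dtype = np.dtype([("energy", "f4"), ("hit_begin", "u4"), ("hit_end", "u4")])
clusters = np.zeros(2, dtype=cluster_dtype)
clusters[0] = (0.6, 0, 3)   # cluster made of hits [0, 3)
clusters[1] = (1.2, 3, 5)   # cluster made of hits [3, 5)

# a thin, user-facing layer can still offer object-like access on top of the PODs
def cluster_hits(i):
    begin, end = clusters[i]["hit_begin"], clusters[i]["hit_end"]
    return hits[begin:end]

print(cluster_hits(1)["energy"].sum())
```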
The instantaneous luminosity of the LHC is expected to increase at the HL-LHC so that the amount of pile-up can reach a level of 200 interactions per bunch crossing, almost a factor of 10 beyond the levels reached at the end of Run 1. In addition, the experiments plan a 10-fold increase of the readout rate. This will be a challenge for the ATLAS and CMS experiments, in particular for the tracking, which will be performed with a new all-silicon tracker in both experiments. In terms of software, the increased combinatorial complexity will have to be dealt with within, at best, a flat budget.
Preliminary studies show that the CPU time to reconstruct the events explodes with the increased pileup level. The increase is dominated by the increase of the CPU time of the tracking, itself dominated by the increase of the CPU time of the pattern recognition stage. In addition to traditional CPU optimisation and better use of parallelism, exploration of completely new approaches to pattern recognition has been started.
To reach out to Computer Science specialists, a Tracking Machine Learning challenge (trackML) has been set up, building on the experience of the successful Higgs Machine Learning challenge in 2014 (see the talk by Glen Cowan at CHEP 2015). It brings ATLAS and CMS physicists together with Computer Scientists. A few relevant points:
The emphasis is on exposing innovative approaches, rather than hyper-optimising known approaches. Machine Learning specialists have shown a deep interest in participating in the challenge, with new approaches such as convolutional neural networks, deep neural networks, Monte Carlo tree search and others.
Radiotherapy is planned with the aim of delivering a lethal dose of radiation to a tumour, while keeping doses to nearby healthy organs at an acceptable level. Organ movements and shape changes, over a course of treatment typically lasting four to eight weeks, can result in actual doses being different from planned. The UK-based VoxTox project aims to compute actual doses, at the level of millimetre-scale volume elements (voxels), and to correlate with short- and long-term side effects (toxicity). The initial focuses are prostate cancer, and cancers of the head and neck. Results may suggest improved treatment strategies, personalised to individual patients.
The VoxTox studies require analysis of anonymised patient data. Production tasks include: calculations of actual dose, based on material distributions shown in computed-tomography (CT) scans recorded at treatment time to guide patient positioning; pattern recognition to locate organs of interest in these scans; mapping of toxicity data to standard scoring systems. User tasks include: understanding differences between planned and actual dose; evaluating the pattern recognition; searching for correlations between actual dose and toxicity scores. To provide for the range of production and user tasks, an analysis system has been developed that uses computing models and software tools from particle physics.
The VoxTox software framework is implemented in Python, but is inspired by the Gaudi C++ software framework of ATLAS and LHCb. Like Gaudi, it maintains a distinction between data objects, which are processed, and algorithm objects, which perform processing. It also provides services to simplify common operations. Applications are built as ordered sets of algorithm objects, which may be passed configuration parameters at run time. Analysis algorithms make use of ROOT. An application using Geant4 to simulate CT guidance scans is under development.
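The algorithm/data-object pattern described above can be sketched in a few lines of Python; this is an illustration of the Gaudi-inspired structure, not the actual VoxTox framework, and all class and parameter names are invented.

```python
# Minimal sketch of a Gaudi-inspired pattern: data objects processed by
# algorithm objects, assembled into an application with run-time configuration.
# Illustration only; not the actual VoxTox framework.
class Event(dict):
    """Simple data store passed from algorithm to algorithm."""


class Algorithm:
    def __init__(self, **config):
        self.config = config          # run-time configuration parameters

    def execute(self, event):
        raise NotImplementedError


class LoadGuidanceScan(Algorithm):
    def execute(self, event):
        event["ct_scan"] = f"scan for patient {self.config['patient_id']}"


class ComputeActualDose(Algorithm):
    def execute(self, event):
        event["dose"] = 2.0 * self.config.get("fraction_weight", 1.0)   # placeholder


class Application:
    def __init__(self, algorithms):
        self.algorithms = algorithms  # ordered set of algorithm objects

    def run(self, n_events=1):
        for _ in range(n_events):
            event = Event()
            for alg in self.algorithms:
                alg.execute(event)
            yield event


app = Application([LoadGuidanceScan(patient_id=42),
                   ComputeActualDose(fraction_weight=1.1)])
for event in app.run(2):
    print(event)
```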
Drawing again from ATLAS and LHCb, VoxTox computing jobs are created and managed within Ganga. This allows transparent switching between different processing platforms, provides cross-platform job monitoring, performs job splitting and output merging, and maintains a record of job definitions. For VoxTox, Ganga has been extended through the addition of components with built-in knowledge of the software framework and of patient data. Jobs can be split based on either patients or guidance scans per sub-job.
This presentation details use of computing models and software tools from particle physics to develop the data-analysis system for the VoxTox project, investigating dose-toxicity correlations in cancer radiotherapy. Experience of performing large-scale data processing on an HTCondor cluster is summarised, and example results are shown.
The use of up-to-date machine learning methods, including deep neural networks, running directly on raw data has significant potential in High Energy Physics for revealing patterns in detector signals and as a result improving reconstruction and the sensitivity of the final physics analyses. In this work, we describe a machine-learning analysis pipeline developed and operating at the National Energy Research Scientific Computing Center (NERSC), processing data from the Daya Bay Neutrino Experiment. We apply convolutional neural networks to raw data from Daya Bay in an unsupervised mode where no input physics knowledge or training labels are used.
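A minimal sketch of an unsupervised convolutional network of the kind described above is given below, using Keras to build a small convolutional autoencoder trained on unlabeled data. The input shape (a small 2D map of channel charges) and the layer sizes are assumptions for illustration, not the actual Daya Bay network.

```python
# Illustrative convolutional autoencoder trained without labels (unsupervised).
# Input shape and layer sizes are assumptions, not the actual Daya Bay network.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(8, 24, 1))           # assumed 8x24 charge map
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2, padding="same")(x)
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
encoded = layers.MaxPooling2D(2, padding="same")(x)

x = layers.Conv2DTranspose(8, 3, strides=2, activation="relu", padding="same")(encoded)
x = layers.Conv2DTranspose(16, 3, strides=2, activation="relu", padding="same")(x)
decoded = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# train on unlabeled data (random numbers stand in for detector images here)
raw_images = np.random.rand(256, 8, 24, 1).astype("float32")
autoencoder.fit(raw_images, raw_images, epochs=2, batch_size=32, verbose=0)
```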
The observation of neutrino oscillation provides evidence of physics beyond the standard model, and the precise measurement of those oscillations remains an important goal for the field of particle physics. Using two finely segmented liquid scintillator detectors located 14 mrad off-axis from the NuMI muon-neutrino beam, NOvA is in a prime position to contribute to precision measurements of the neutrino mass splitting, mass hierarchy, and CP violation.
A key part of that precise measurement is the accurate characterization of neutrino interactions in our detector. This presentation will describe a convolutional neural network based approach to neutrino interaction type identification in the NOvA detectors. The Convolutional Visual Network (CVN) algorithm is an innovative and powerful new approach to event identification which uses the technology of convolutional neural networks, developed in the computer vision community, to identify events in the detector without requiring detailed reconstruction. This approach has produced a 40% improvement in electron-neutrino efficiency without a loss in purity as compared to selectors previously used by NOvA. We will discuss the core concept of convolutional neural networks, modern innovations in convolutional neural network architecture related to the nascent field of deep learning, and the performance of our own novel network architecture in event selection for the NOvA oscillation analyses. This talk will also discuss the architecture and performance of two new variants of CVN. One variant classifies constituent particles of an interaction rather than the neutrino origin which will allow for detailed investigations into event topology and the separation of hadronic and lepton energy depositions in the interaction. The other variant focuses on classifying interactions in the Near Detector to improve cross-section analyses as well as making it possible to search for anomalous tau-neutrino appearance at short baselines.
Access and exploitation of large scale computing resources, such as those offered by general
purpose HPC centres, is one important measure for ATLAS and the other Large Hadron Collider experiments
in order to meet the challenge posed by the full exploitation of future data within the constraints of flat budgets.
We report on the effort moving the Swiss WLCG T2 computing,
serving ATLAS, CMS and LHCb, from a dedicated cluster to large CRAY systems
at the Swiss national supercomputing centre, CSCS. These systems not only offer
very efficient hardware, cooling and highly competent operators, but also have large
backfill potential due to their size and multidisciplinary usage, and potential gains due to economies of scale.
Technical solutions, performance, expected return and future plans are discussed.
Fifteen Chinese High Performance Computing sites, many of them on the TOP500 list of most powerful supercomputers, are integrated into a common infrastructure providing coherent access to users through a RESTful interface called SCEAPI. These resources have been integrated into the ATLAS Grid production system using a bridge between ATLAS and SCEAPI which translates the authorization and job submission protocols between the two environments. The ARC Computing Element (ARC CE) forms the bridge, using an extended batch system interface to allow job submission to SCEAPI. The ARC CE was set up at the Institute of High Energy Physics, Beijing, in order to be as close as possible to the SCEAPI front-end interface at the Computing Network Information Center, also in Beijing. This paper describes the technical details of the integration between ARC CE and SCEAPI and presents results obtained so far with two supercomputer centers, Tianhe-IA and ERA. These two centers have been the pilots for ATLAS Monte Carlo simulation in SCEAPI and have been providing CPU power since fall 2015.
Obtaining CPU cycles on an HPC cluster is nowadays relatively simple and sometimes even cheap for academic institutions. However, in most cases providers of HPC services do not allow changes to the configuration, the implementation of special features, or lower-level control of the computing infrastructure and networks, for example for testing new computing patterns or conducting research on HPC itself. The variety of use cases proposed by several departments of the University of Torino, including ones from solid-state chemistry, high-energy physics, computer science, big data analytics, computational biology, genomics and many others, called for different and sometimes conflicting configurations; furthermore, several R&D activities in the field of scientific computing, with topics ranging from GPU acceleration to Cloud Computing technologies, needed a platform to be carried out on.
The Open Computing Cluster for Advanced data Manipulation (OCCAM) is a multi-purpose flexible HPC cluster designed and operated by a collaboration between the University of Torino and the Torino branch of the Istituto Nazionale di Fisica Nucleare. It is aimed at providing a flexible, reconfigurable and extendable infrastructure to cater to a wide range of different scientific computing needs, as well as a platform for R&D activities on computational technologies themselves. Extending it with novel-architecture CPUs, accelerators or hybrid microarchitectures (such as the forthcoming Intel Xeon Phi Knights Landing) will be as simple as plugging a node into a rack.
The initial system counts slightly more than 1100 CPU cores and includes different types of computing nodes (standard dual-socket nodes, large quad-socket nodes with 768 GB RAM, and multi-GPU nodes) and two separate disk storage subsystems: a smaller high-performance scratch area, based on the Lustre file system and intended for direct computational I/O, and a larger one, of the order of 1 PB, for near-line data archival. All the components of the system are interconnected through a 10 Gb/s Ethernet layer with one-level topology and an InfiniBand FDR 56 Gb/s layer in fat-tree topology.
A system of this kind, heterogeneous and reconfigurable by design, poses a number of challenges related to the frequency at which heterogeneous hardware resources might change their availability and shareability status, which in turn affect methods and means to allocate, manage, optimize, bill, monitor VMs, virtual farms, jobs, interactive bare-metal sessions, etc.
This poster describes some of the use cases that prompted the design and construction of the HPC cluster, its architecture, and a first characterization of its performance using some synthetic benchmark tools and a few realistic use-case tests.
The Open Science Grid (OSG) is a large, robust computing grid that started primarily as a collection of sites associated with large HEP experiments such as ATLAS, CDF, CMS, and DZero, but has evolved in recent years to a much larger user and resource platform. In addition to meeting the US LHC community’s computational needs, the OSG continues to be one of the largest providers of distributed high-throughput computing (DHTC) to researchers from a wide variety of disciplines via the OSG Open Facility. The Open Facility consists of OSG resources that are available opportunistically to users other than resource owners and their collaborators. In the past two years, the Open Facility has doubled its annual throughput to over 200 million wall hours. More than half of these resources are used by over 100 individual researchers from over 60 institutions in fields such as biology, medicine, math, economics, and many others. Over 10% of these individual users utilized in excess of 1 million computational hours each in the past year. The largest source of these cycles is temporary unused capacity at institutions affiliated with US LHC computational sites. An increasing fraction, however, comes from university HPC clusters and large national infrastructure supercomputers offering unused capacity. Such expansions have allowed the OSG to provide ample computational resources to both individual researchers and small groups as well as sizeable international science collaborations such as LIGO, AMS, IceCube, and sPHENIX. Opening up access to the Fermilab FabrIc for Frontier Experiments (FIFE) project has also allowed experiments such as mu2e and NOvA to make substantial use of Open Facility resources, the former with over 40 million wall hours in a year. We present how this expansion was accomplished as well as future plans for keeping the OSG Open Facility at the forefront of enabling scientific research by way of DHTC.
ALICE HLT Cluster operation during ALICE Run 2
Johannes Lehrbach for the ALICE collaboration
ALICE (A Large Ion Collider Experiment) is one of the four major detectors located at the LHC at CERN, focusing on the study of heavy-ion collisions. The ALICE High Level Trigger (HLT) is a compute cluster which reconstructs events and compresses the data in real time. Data compression by the HLT is a vital part of data taking, especially during the heavy-ion runs, in order to be able to store the data, which makes the reliability of the whole cluster an important matter.
To guarantee a consistent state among all compute nodes of the HLT cluster we have automated the operation as much as possible. For automatic deployment of the nodes we use Foreman with locally mirrored repositories, and for configuration management of the nodes we use Puppet. Important parameters such as node temperatures are monitored with Zabbix.
During periods without beam, the HLT cluster is used for tests and as one of the WLCG Grid sites to compute offline jobs in order to maximize the usage of our cluster. To prevent interference with normal HLT operations we introduced a separation via virtual LANs between normal HLT operation and the grid jobs running inside virtual machines.
During the past years an increasing number of CMS computing resources have been offered as clouds, bringing the flexibility of virtualised compute resources and centralised management of the Virtual Machines (VMs). CMS has adapted its job submission infrastructure from a traditional Grid site to operation using a cloud service and meanwhile can run all types of offline workflows. The cloud service provided by the online cluster for the Data Acquisition (DAQ) and High Level Trigger (HLT) of the experiment was one of the first facilities to commission and deploy this submission infrastructure. The CMS HLT is a considerable compute resource. It currently consists of approximately 1000 dual-socket PC server nodes with a total of ~25k cores, corresponding to ~500 kHEPSpec06. This compares to a total Tier-0 / Tier-1 CMS resources request of 292 / 461 kHEPSpec06. The HLT has no local mass disk storage and is currently connected to the CERN IT datacenter via a dedicated 160 Gbps network connection.
One of the main requirements for the online cloud facility is the parasitic use of the HLT, which shall never interfere with its primary function as part of the data acquisition system. Hence a design has been chosen where an OpenStack infrastructure is overlaid over the HLT hardware resources. This overlay also abstracts the different hardware and networks that the cluster is composed of. The online cloud is meanwhile a well-established facility to substantially augment the CMS computing resources when the HLT is not needed for data acquisition, such as during technical stop periods of the LHC. In this static mode of operation, this facility acts as any other Tier-0 or Tier-1 facility. During high-workload periods it provided up to ~40% of the combined Tier-0/Tier-1 capacity, including workflows with demanding I/O requirements. Data needed by the running jobs was read from the remote EOS disk system at CERN and data produced was written back out to EOS. The achieved throughput from the remote EOS came close to the installed bandwidth of the 4x40 Gbps long-range links.
The next step is to extend the usage of the online cloud to opportunistic usage of the periods between LHC fills. These periods are a priori unscheduled and of undetermined length, typically at least 5 hours long and occurring once or more per day. This mode of operation, a dynamic usage of the cloud infrastructure, requires a fast turn-around for starting and stopping the VMs. A more advanced mode of operation where the VMs are hibernated and jobs are not killed is also being explored. Finally, one could envisage ramping up VMs as the load on the HLT decreases towards the end of a fill. We will discuss the optimisation of the cloud infrastructure for dynamic operation and the design and implementation of the mechanism in the DAQ system to gracefully switch from DAQ mode to providing cloud resources based on LHC state or server load.
The installation of Virtual Visit services by the LHC collaborations began shortly after the first high-energy collisions were provided by the CERN accelerator in 2010. The experiments ATLAS, CMS, LHCb, and ALICE have all joined in this popular and effective method to bring the excitement of scientific exploration and discovery into classrooms and other public venues around the world. Their programmes, which use a combination of video conference, webcast, and video recording to communicate with remote audiences, have already reached tens of thousands of viewers, and the demand only continues to grow. Other venues, such as the CERN Control Centre, are also considering similar permanent installations.
We present a summary of the development of the various systems in use around CERN today, including the technology deployed and a variety of use cases. We then lay down the arguments for the creation of a CERN-wide service that would support these programmes in a more coherent and effective manner. Potential services include a central booking system and operational management similar to what is currently provided for the common CERN video conference facilities. Key technological choices would provide additional functionality that could support communication and outreach programmes based on popular tools including (but not limited to) Skype, Google Hangouts, and Periscope. Successful implementation of the project, which relies on close partnership between the experiments, CERN IT CDA, and CERN IR ECO, has the potential to reach an even larger, global audience, more effectively than ever before.
CERN openlab is a unique public-private partnership between CERN and leading IT companies and research institutes. Having learned a great deal from close collaboration with industry in many different projects, we are now using this experience to transfer some of our knowledge to other scientific fields, specifically in the areas of code optimization for simulations of biological dynamics and the advanced usage of ROOT for the storage and processing of genomics data. In this presentation I will give an overview of the knowledge-transfer projects we are currently engaged in: how they are relevant and beneficial for all parties involved, the interesting technologies being developed, and the exciting results we expect.
Since the launch of HiggsHunters.org in November 2014, citizen science volunteers
have classified more than a million points of interest in images from the ATLAS experiment
at the LHC. Volunteers have been looking for displaced vertices and unusual features in images
recorded during LHC Run-1. We discuss the design of the project, its impact on the public,
and the surprising results of how the human volunteers performed relative to the computer
algorithms in identifying displaced secondary vertices.
The vast majority of high-energy physicists use and produce software every day. Software skills are usually acquired “on the go” and dedicated training courses are rare. The LHCb Starterkit is a new training format for getting LHCb collaborators started in effectively using software to perform their research. The course focuses on teaching basic skills for research computing. Unlike traditional tutorials, we focus on starting with the basics, presenting all the material live with a high degree of interactivity, and giving priority to understanding the tools as opposed to handing out recipes that work “as if by magic”. The LHCb Starterkit was started by two young members of the collaboration inspired by the principles of Software Carpentry (http://software-carpentry.org), and the material is created in a collaborative fashion using the tools we teach. Three successful entry-level workshops, as well as an advanced one, have taken place since the start of the initiative in 2015, and were taught largely by PhD students to other PhD students.
We present the new Invenio 3 digital library framework and demonstrate
its application in the field of open research data repositories. We
notably look at how the Invenio technology has been applied in two
research data services: (1) the CERN Open Data portal that provides
access to the approved open datasets and software of the ALICE, ATLAS,
CMS and LHCb collaborations; (2) the Zenodo service that offers an open
research data archiving solution to world-wide scientific communities in
any research discipline.
The Invenio digital library framework is composed of more than sixty
independently developed packages on top of the Flask web development
environment. The packages share a set of common patterns and communicate
together via well-established APIs. The packages come with extensive
test suites and example applications, and use Travis continuous
integration practices to ensure quality. The packages are often
developed by independent teams with special focus on topical use cases
(e.g. library circulation, multimedia, research data). The separation of
packages in the Invenio 3 ecosystem enables their independent
development, maintenance and rapid release cycle. This also allows the
prospective digital repository managers who are interested in deploying
an Invenio solution at their institutions to cherry-pick the individual
modules of interest with the aim of building a customised digital
repository solution targeting their particular needs and use cases.
We discuss the application of the Invenio package ecosystem in the
research data repository problem domain. We present how a researcher can
easily archive their data files as well as their analysis software code
or their Jupyter notebooks via GitHub <-> Zenodo integration. The
archived data and software are minted with persistent identifiers to
ensure their citability. We present how the JSON Schema
technology is used to define the data model describing all the data
managed by the repository. The conformance to versioned JSON schemas
ensures the coherence of the metadata structure across the managed assets.
The data is further indexed using Elasticsearch for the information
retrieval needs. We describe the role of the CERN EOS system used as the
underlying data storage via a Pythonic XRootD based protocol. Finally,
we discuss the role of virtual environments (CernVM) and container-based
solutions (Docker) used with the aim of reproducing the archived
research data and analyses software even many years after their
publication.
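To illustrate the schema-driven approach described above, the following is a minimal sketch of validating a record against a versioned JSON Schema with the Python jsonschema package. The schema and record are invented for this example; the actual Zenodo and CERN Open Data schemas are considerably richer.

```python
# Hypothetical example: validate a record against a versioned JSON Schema.
import jsonschema

RECORD_SCHEMA_V1 = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "required": ["title", "doi", "creators"],
    "properties": {
        "title": {"type": "string"},
        "doi": {"type": "string"},
        "creators": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name"],
                "properties": {"name": {"type": "string"}},
            },
        },
    },
}

record = {
    "title": "Example open dataset",
    "doi": "10.5281/zenodo.0000000",   # placeholder persistent identifier
    "creators": [{"name": "Doe, Jane"}],
}

# Raises jsonschema.ValidationError if the record does not conform.
jsonschema.validate(record, RECORD_SCHEMA_V1)
print("record conforms to schema version 1")
```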
A framework for performing a simplified particle physics data analysis has been created. The project analyses a pre-selected sample from the full 2011 LHCb data. The analysis aims to measure matter-antimatter asymmetries. It broadly follows the steps in a significant LHCb publication where large CP violation effects are observed in charged B meson three-body decays to charged pions and kaons. The project is a first-of-its-kind analysis on the CERN Open Data portal: students are guided through elements of a full particle physics analysis but use a simplified interface. The analysis has multiple stages culminating in the observation of matter-antimatter differences between the Dalitz plots of B+ and B- meson decays. The project uses the open source Jupyter Notebook project and the Docker open platform for distributed applications, and can be hosted through the open source Everware platform. The target audience includes advanced high school students, undergraduate societies and enthusiastic scientifically literate members of the general public. The public use of this data set has been approved by the LHCb collaboration. The project plans to launch for the public in summer 2016 through the CERN Open Data Portal. The project development has been supported by final year undergraduates at the University of Manchester, the Yandex School of Data Analysis and the CERN Open Data team.
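As a small illustration of the kind of computation behind the Dalitz plots mentioned above, the sketch below builds the pairwise invariant masses squared of a three-body decay with NumPy. The toy four-momenta and variable names are assumptions made for this example, not the actual columns of the LHCb sample.

```python
# Sketch: Dalitz-plot variables for a three-body decay B -> h1 h2 h3.
import numpy as np

def inv_mass_sq(p_a, p_b):
    """Invariant mass squared of a two-daughter combination.
    p_a, p_b: arrays of shape (N, 4) with columns (E, px, py, pz)."""
    p = p_a + p_b
    return p[:, 0]**2 - p[:, 1]**2 - p[:, 2]**2 - p[:, 3]**2

# Toy, on-shell four-momenta, only so that the snippet runs standalone.
rng = np.random.default_rng(42)
m_pi = 0.13957                                   # GeV, charged pion mass
mom = rng.normal(scale=1.0, size=(3, 1000, 3))   # toy 3-momenta in GeV
E = np.sqrt(m_pi**2 + np.sum(mom**2, axis=-1))   # corresponding energies
h1, h2, h3 = (np.column_stack([E[i], mom[i]]) for i in range(3))

m2_12 = inv_mass_sq(h1, h2)   # one Dalitz-plot axis
m2_13 = inv_mass_sq(h1, h3)   # the other Dalitz-plot axis
# A 2D histogram of (m2_12, m2_13), made separately for B+ and B- candidates,
# exposes local matter-antimatter asymmetries across the Dalitz plane.
```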
For a few years now, the artdaq data acquisition software toolkit has
provided numerous experiments with ready-to-use components which allow
for rapid development and deployment of DAQ systems. Developed within
the Fermilab Scientific Computing Division, artdaq provides data
transfer, event building, run control, and event analysis
functionality. This latter feature includes built-in support for the
art event analysis framework, allowing experiments to run art modules
for real-time filtering, compression, disk writing and online
monitoring; since art, also developed at Fermilab, is used for
offline analysis as well, a major advantage of artdaq is that it allows
developers to easily switch between developing online and offline
software.
artdaq continues to be improved. Support for an alternate mode of
running whereby data from some subdetector components are only
streamed if requested has been
added; this option will reduce unnecessary DAQ throughput. Real-time
reporting of DAQ metrics has been implemented, along with the
flexibility to choose the format through which experiments receive the
reports; these formats include the Ganglia, Graphite and syslog
software packages, along with flat ASCII files. Additionally, work has
been performed investigating more flexible modes of online monitoring,
including the capability to run multiple online monitoring
processes on different hosts, each running its own set of
art modules. Finally, a web-based GUI through which users
can configure details of their DAQ system has been implemented,
increasing the ease of use of the system.
Already successfully deployed on the LArIAT, DarkSide-50, DUNE 35ton
and Mu2e experiments, artdaq will be employed for SBND and is a strong candidate for use on ICARUS
and protoDUNE. With each experiment come new ideas for how artdaq can
be made more flexible and powerful; the above improvements will be described, along with potential ideas for the future.
The data acquisition system (DAQ) of the CMS experiment at the CERN Large Hadron Collider assembles events at a rate of 100 kHz, transporting event data at an aggregate throughput of 100 GByte/s to the high-level trigger (HLT) farm. The HLT farm selects and classifies interesting events for storage and offline analysis at a rate of around 1 kHz.
The DAQ system has been redesigned during the accelerator shutdown (LS1) in 2013/14. In order to handle higher LHC luminosities and event pileup, a number of sub-detectors were upgraded, increasing the number of readout channels and replacing the off-detector readout electronics with a μTCA implementation. The new DAQ system supports the read-out of the off-detector electronics over point-to-point links for both the legacy systems and the new μTCA-based systems, the latter with a fibre-based implementation running at up to 10 Gbps with a reliable protocol.
The new DAQ architecture takes advantage of the latest developments in the computing industry. For data concentration, 10/40 Gbit Ethernet technologies are used, as well as an implementation of a reduced TCP/IP in FPGA for reliable transport between DAQ custom electronics and commercial computing hardware. A 56 Gbps Infiniband FDR CLOS network has been chosen for the event builder with a throughput of ~4 Tbps. The HLT processing is entirely file-based. This allows the DAQ and HLT systems to be independent, and the HLT to use the same framework as the offline processing. The fully built events are sent to the HLT with 1/10/40 Gbit Ethernet via network file systems. A hierarchical collection of HLT-accepted events and monitoring metadata is stored into a global file system. The monitoring of the HLT farm is done with the Elasticsearch analytics tool.
This paper presents the requirements, implementation, and performance of the system. Experience is reported on the operation for the LHC pp runs as well as the heavy-ion Pb-Pb runs. The evolution of the DAQ system will be presented, including the expansion to accommodate new detectors.
Support for Online Calibration in the ALICE HLT Framework
Mikolaj Krzewicki, for the ALICE collaboration
ALICE (A Large Ion Collider Experiment) is one of the four major experiments at the Large Hadron Collider (LHC) at CERN. The High Level Trigger (HLT) is an online compute farm, which reconstructs events measured by the ALICE detector in real time. The HLT uses a custom online data-transport framework to distribute the data and the workload among the compute nodes. ALICE employs subdetectors sensitive to environmental conditions such as pressure and temperature, e.g. the Time Projection Chamber (TPC). A precise reconstruction of particle trajectories requires the calibration of these detectors. Performing the calibration in real time in the HLT improves the online reconstruction and renders certain offline calibration steps obsolete, speeding up offline physics analysis. For LHC Run 3, starting in 2020 when data reduction will rely on reconstructed data, online calibration becomes a necessity. In order to run the calibration online, the HLT now supports the processing of tasks that typically run offline. These tasks run massively parallel on all HLT compute nodes, and their output is gathered and merged periodically. The calibration results are both stored offline for later use and fed back into the HLT chain via a feedback loop in order to apply calibration information to the track reconstruction. Online calibration and the feedback loop are subject to certain time constraints in order to provide up-to-date calibration information, and they must not interfere with ALICE data taking. Our approach of running these tasks in asynchronous processes separates them from normal data taking in a way that makes the scheme resilient to failures. We performed a first test of online TPC drift time calibration under real conditions during the heavy-ion run in December 2015. We present an analysis and conclusions of this first test, new improvements and developments based on it, as well as our current scheme to commission this for production use.
LHCb has introduced a novel real-time detector alignment and calibration strategy for LHC Run 2. Data collected at the start of the fill are processed in a few minutes and used to update the alignment parameters, while the calibration constants are evaluated for each run. This procedure improves the quality of the online reconstruction. For example, the vertex locator is retracted and reinserted for stable beam conditions in each fill to be centred on the primary vertex position in the transverse plane. Consequently its position changes on a fill-by-fill basis. Critically, this new real-time alignment and calibration procedure allows identical constants to be used in the online and offline reconstruction, thus improving the correlation between triggered and offline selected events. This offers the opportunity to optimise the event selection in the trigger by applying stronger constraints. The required computing time constraints are met thanks to a new dedicated framework using the multi-core farm infrastructure for the trigger. The motivation for a real-time alignment and calibration of the LHCb detector is discussed from both the operational and physics performance points of view. Specific challenges of this novel configuration are discussed, as well as the working procedures of the framework and its performance.
The exploitation of the full physics potential of the LHC experiments requires fast and efficient processing of the largest possible dataset with the most refined understanding of the detector conditions. To face this challenge, the CMS collaboration has set up an infrastructure for the continuous unattended computation of the alignment and calibration constants, allowing for a refined knowledge of the most time-critical parameters already a few hours after the data have been saved to disk. This is the prompt calibration framework which, since the beginning of LHC Run 1, has enabled the analysis and the High Level Trigger of the experiment to consume the most up-to-date conditions, optimizing the performance of the physics objects. In Run 2 this setup has been further expanded to include even more complex calibration algorithms requiring higher statistics to reach the needed precision. This imposed the introduction of a new paradigm in the creation of the calibration datasets for unattended workflows and opened the door to a further step in performance.
The presentation reviews the design of these automated calibration workflows, the operational experience in Run 2 and the monitoring infrastructure developed to ensure the reliability of the service.
The SuperKEKB $\mathrm{e^{+}\mkern-9mu-\mkern-1mue^{-}}$collider
has now completed its first turns. The planned running luminosity
is 40 times higher than its previous record during the KEKB operation.
The Belle II detector placed at the interaction point will acquire
a data sample 50 times larger than its predecessor. The monetary and
time costs associated with storing and processing this quantity of
data mean that it is crucial for the detector components at Belle II
to be calibrated quickly and accurately. A fast and accurate calibration
allows the trigger to increase the efficiency of event selection,
and gives users analysis-quality reconstruction promptly. A flexible
framework for fast production of calibration constants is being developed
in the Belle II Analysis Software Framework (basf2). Detector experts
only need to create two components from C++ base classes. The first
collects data from Belle II datasets and passes it to the second
stage, which uses this much smaller set of data to run calibration
algorithms to produce calibration constants. A Python framework coordinates
the input files, order of processing, upload to the conditions database,
and monitoring of the output. Splitting the operation into collection
and algorithm processing stages allows the framework to optionally
parallelize the collection stage in a distributed environment. Additionally,
moving the workflow logic to a separate Python framework allows fast
development and easier integration with DIRAC, the grid middleware
system used at Belle II. The current status of this calibration and
alignment framework will be presented.
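The two-stage split described above (a collector reducing large datasets to small summaries, and an algorithm turning summaries into constants, coordinated from Python) can be sketched schematically as follows. This is not the actual basf2 API; all names (run_collector, run_algorithm, FakeConditionsDB) are invented for illustration.

```python
# Schematic collector/algorithm calibration flow (hypothetical names, not basf2).
class FakeConditionsDB:
    """Stand-in for the conditions-database upload step."""
    def upload(self, name, payload):
        print(f"uploaded {name}: {payload}")

def run_collector(events):
    """Stage 1: reduce a (possibly huge) event sample to a small summary."""
    return {"n": len(events), "sum_x": sum(e["x"] for e in events)}

def run_algorithm(summaries):
    """Stage 2: merge the per-dataset summaries and derive constants."""
    n = sum(s["n"] for s in summaries)
    mean_x = sum(s["sum_x"] for s in summaries) / max(n, 1)
    return {"offset": mean_x}

if __name__ == "__main__":
    # Toy "datasets": in reality these would be Belle II files processed by the
    # C++ collector; the collection stage is the part that can run distributed.
    datasets = [[{"x": 1.0}, {"x": 1.2}], [{"x": 0.9}], [{"x": 1.1}, {"x": 1.0}]]
    summaries = [run_collector(d) for d in datasets]
    constants = run_algorithm(summaries)
    FakeConditionsDB().upload("toy_alignment_offset", constants)
```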
The ATLAS High Level Trigger Farm consists of around 30,000 CPU cores which filter events at up to 100 kHz input rate.
A costing framework is built into the high level trigger; this enables detailed monitoring of the system and allows for data-driven predictions to be made
utilising specialist datasets. This talk will present an overview of how ATLAS collects in-situ monitoring data on both CPU usage and dataflow
over the data-acquisition network during the trigger execution, and how these data are processed to yield both low level monitoring of individual
selection-algorithms and high level data on the overall performance of the farm. For development and prediction purposes, ATLAS uses a special
`Enhanced Bias' event selection. This mechanism will be explained along with how it is used to profile expected resource usage and output event-rate of
new physics selections, before they are executed on the actual high level trigger farm.
The Run Control System of the Compact Muon Solenoid (CMS) experiment at CERN is a distributed Java web application running on Apache Tomcat servers. During Run-1 of the LHC, many operational procedures have been automated. When detector high voltages are ramped up or down or upon certain beam mode changes of the LHC, the DAQ system is automatically partially reconfigured with new parameters. Certain types of errors, such as those caused by single-event upsets, may trigger an automatic recovery procedure. Furthermore, the top-level control node continuously performs cross-checks to detect sub-system actions becoming necessary because of changes in configuration keys, changes in the set of included front-end drivers or because of potential clock instabilities. The operator is guided to perform the necessary actions through graphical indicators displayed next to the relevant command buttons in the user interface. Through these indicators, consistent configuration of CMS is ensured. However, manually following the indicators can still be inefficient at times. A new assistant to the operator has therefore been developed that can automatically perform all the necessary actions in a streamlined order. If additional problems arise, the new assistant tries to automatically recover from these. With the new assistant, a run can be started from any state of the subsystems with a single click. An ongoing run may be recovered with a single click, once the appropriate recovery action has been selected. We review the automation features of the CMS run control system and discuss the new assistant in detail, including first operational experience.
In preparation for the XENON1T Dark Matter data acquisition, we have
prototyped and implemented a new computing model. The XENON signal and data processing
software is developed fully in Python 3, and makes extensive use of generic scientific data
analysis libraries, such as the SciPy stack. A certain tension between modern “Big Data”
solutions and existing HEP frameworks is typically experienced in smaller particle physics
experiments. ROOT is still the “standard” data format in our field, defined by large experiments
(ATLAS, CMS). To ease the transition, our computing model caters to both analysis paradigms,
leaving the choice of using ROOT-specific C++ libraries, or alternatively, Python and its data
analytics tools, as a front-end choice for developing physics algorithms. We present our path toward
harmonizing these two ecosystems, which allowed us to use off-the-shelf software libraries (e.g.,
NumPy, SciPy, scikit-learn, matplotlib) and lower the cost of development and maintenance.
To analyse the data, our software allows researchers to easily create “mini-trees”: small, tabular
ROOT structures for Python analysis, which can be read directly into pandas DataFrame
structures. One of our goals was making ROOT available as a cross-platform binary for an
easy installation from the Anaconda Cloud (without going through the “dependency hell”). In
addition to helping us discover dark matter interactions, lowering this barrier helps shift
particle physics toward non-domain-specific code.
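One possible way to read such a mini-tree into a pandas DataFrame is sketched below, using the root_numpy package (root_pandas would be similar). The file, tree and branch names are placeholders, not the actual XENON1T layout.

```python
# Sketch: read a tabular ROOT "mini-tree" into pandas (placeholder names).
import pandas as pd
from root_numpy import root2array

arr = root2array("minitree.root", treename="events",
                 branches=["s1", "s2", "cs1", "cs2"])   # hypothetical branches
df = pd.DataFrame(arr)

# From here the full SciPy stack is available, e.g. a quick selection:
selected = df[(df["s1"] > 0) & (df["s2"] > 0)]
print(selected.describe())
```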
The Muon Ionization Cooling Experiment (MICE) is a proof-of-principle experiment designed to demonstrate muon ionization cooling for the first time. MICE is currently on Step IV of its data taking programme, where transverse emittance reduction will be demonstrated. The MICE Analysis User Software (MAUS) is the reconstruction, simulation and analysis framework for the MICE experiment. MAUS is used for both offline data analysis and fast online data reconstruction and visualisation to serve MICE data taking.
This paper provides an introduction to the MAUS framework, describing the central Python and C++ based framework, code management procedure, the current performance for detector reconstruction and results from real data analysis of the recent MICE Step IV data. The ongoing development goals will also be described, including introducing multithreaded processing for the online detector reconstruction.
The Belle II experiment at KEK is preparing for first collisions in 2017. Processing the large amounts of data that will be produced will require conditions data to be readily available to systems worldwide in a fast and efficient manner that is straightforward for both the user and maintainer.
The Belle II conditions database was designed with a straightforward goal: make it as easily maintainable as possible. To this end, HEP-specific software tools were avoided as much as possible and industry standard tools used instead. HTTP REST services were selected as the application interface, which provide a high-level interface to users through the use of standard libraries such as curl. The application interface itself is written in Java and runs in an embedded Payara-Micro Java EE application server. Scalability at the application interface is provided by use of Hazelcast, an open source In-Memory Data Grid (IMDG) providing distributed in-memory computing and supporting the creation and clustering of new application interface instances as demand increases. The IMDG provides fast and efficient access to conditions data, and is persisted or backed by OpenStack’s Trove. Trove manages MySQL databases in a multi-master configuration used to store and replicate data in support of the application cluster.
This talk will present the design of the conditions database environment at Belle II and its use as well as go into detail about the actual implementation of the conditions database, its capabilities, and its performance.
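Because the application interface is plain HTTP REST, a client needs nothing beyond standard libraries, as the abstract notes for curl. The sketch below shows the same idea from Python with the requests package; the host, resource path and JSON layout are hypothetical and do not reflect the actual Belle II endpoints.

```python
# Hypothetical REST query against a conditions-database service.
import requests

BASE = "https://conditions.example.org/rest/v1"   # placeholder URL

# Fetch the payloads valid for a given experiment/run (invented resource name).
resp = requests.get(f"{BASE}/iovPayloads",
                    params={"expNumber": 3, "runNumber": 42},
                    timeout=10)
resp.raise_for_status()
for payload in resp.json():
    print(payload.get("name"), payload.get("revision"))
```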
Since 2014 the ATLAS and CMS experiments have shared a common vision for the Condition Database infrastructure required to handle the non-event data for the forthcoming LHC runs. The large commonality in the use cases has allowed the two experiments to agree on a common overall design solution meeting the requirements of both. A first prototype implementing these solutions was completed in 2015 and made available to both experiments.
The prototype is based on a web service implementing a REST API with a set of functions for the management of conditions data. The choice of a REST API in the architecture has two main advantages: (1) the conditions data are exchanged in a neutral format (JSON or XML), allowing them to be processed by different languages and/or technologies in different frameworks; (2) the client is agnostic with respect to the underlying technology used for the persistency (allowing standard RDBMS and NoSQL back-ends).
The implementation of this prototype server uses standard technologies available in Java for server-based applications. This choice has the benefit of easing the integration with the existing Java-based applications in use by both experiments, notably the Frontier service in the distributed computing environment.
In this contribution, we describe the testing of this prototype performed within the CMS computing infrastructure, with the aim of validating the support of the main use cases and of suggesting future improvements. Since the data-model reflected in this prototype is very close to the layout of the current CMS Condition Database, the tests could be performed directly with the existing CMS condition data.
The strategy for the integration of the prototype into the experiments' frameworks consists of replacing the innermost software layer handling the conditions with a plugin. This plugin is capable of accessing the web service and of decoding the retrieved data into the appropriate object structures used in the CMS offline software. This strategy has been applied to run a test suite on specific physics data samples used at CMS for software release validation.
Conditions data (for example: alignment, calibration, data quality) are used extensively in the processing of real and simulated data in ATLAS. The volume and variety of the conditions data needed by different types of processing are quite diverse, so optimizing its access requires a careful understanding of conditions usage patterns. These patterns can be quantified by mining representative log files from each type of processing and gathering detailed information about conditions usage for that type of processing into a central repository.
In this presentation, we describe the systems developed to collect this conditions usage metadata per job type and describe a few specific (but very different) ways in which it has been used. For example, it can be used to cull specific conditions data into a much more compact package to be used by jobs doing similar types of processing: these customized collections can then be shipped with jobs to be executed on isolated worker nodes (such as HPC farms) that have no network access to conditions. Another usage is in the design of future ATLAS software: to provide Run 3 software developers essential information about the nature of current conditions accessed by software. This helps to optimize internal handling of conditions data to minimize its memory footprint while facilitating access to this data by the sub-processes that need it.
The ATLAS EventIndex System has amassed a set of key quantities for a large number of ATLAS events into a Hadoop based infrastructure for the purpose of providing the experiment with a number of event-wise services. Collecting this data in one place provides the opportunity to investigate various storage formats and technologies and assess which best serve the various use cases as well as consider what other benefits alternative storage systems provide.
In this presentation we describe how the data are imported into an Oracle RDBMS, the services we have built based on this architecture, and our experience with it. We have indexed about 26 billion real data events thus far and have designed the system to accommodate future data, expected at rates of 5 and 20 billion events per year. We have found this system offers outstanding performance for some fundamental use cases. In addition, profiting from the co-location of this data with other complementary metadata in ATLAS, the system has been easily extended to perform essential assessments of data integrity and completeness and to identify event duplication, including at what step in processing the duplication occurred.
AsyncStageOut (ASO) is the component of the CMS distributed data analysis system (CRAB3) that manages users’ transfers in a centrally controlled way using the File Transfer System (FTS3) at CERN. It addresses a major weakness of the previous, decentralized model, namely that the transfer of the user's output data to a single remote site was part of the job execution, resulting in inefficient use of job slots and an unacceptable failure rate.
Currently ASO manages up to 600k files of various sizes per day from more than 500 users per month, spread over more than 100 sites, and uses a NoSQL database (CouchDB) for internal bookkeeping and as a way to communicate with other CRAB3 components. Since ASO/CRAB3 were put in production in 2014, the number of transfers has constantly increased, up to a point where the pressure on the central CouchDB instance became critical, creating new challenges for the system scalability, performance, and monitoring. This forced a re-engineering of the ASO application to increase its scalability and lower its operational effort.
In this contribution we present a comparison of the performance of the current NoSQL implementation and a new SQL implementation, and discuss how their different strengths and features influenced the design choices and operational experience. We also discuss other architectural changes introduced in the system to handle the increasing load and latency in delivering the output to the user.
This work reports on the activities of integrating Oracle and Hadoop technologies for CERN database services, in particular the development of solutions for offloading data and queries from Oracle databases into Hadoop-based systems. This is of interest to increase the scalability and reduce the cost of some of our largest Oracle databases. These concepts have been applied, among others, to build offline copies of controls and logging databases, which allow reports to be run without affecting critical production and also reduce the storage cost. Other use cases include making data stored in Hadoop/Hive available from Oracle SQL, which opens the possibility of building applications that integrate data from both sources.
We previously described Lobster, a workflow management tool for exploiting volatile opportunistic computing resources for computation in HEP. We will discuss the various challenges that have been encountered while scaling up the simultaneous CPU core utilization and the software improvements required to overcome these challenges.
Categories: Workflows can now be divided into categories based on their required system resources. This allows the batch queueing system to optimize assignment of tasks to nodes with the appropriate capabilities. Within each category, limits can be specified for the number of running jobs to regulate the utilization of communication bandwidth. System resource specifications for a task category can now be modified while a project is running, avoiding the need to restart the project if resource requirements differ from the initial estimates. Lobster now implements time limits on each task category to voluntarily terminate tasks. This allows partially completed work to be recovered.
Workflow dependency specification: One workflow often requires data from other workflows as input. Rather than waiting for earlier workflows to be completed before beginning later ones, Lobster now allows dependent tasks to begin as soon as sufficient input data has accumulated.
Resource monitoring: Lobster utilizes a new capability in Work Queue to monitor the system resources each task requires in order to identify bottlenecks and optimally assign tasks.
The capability of the Lobster opportunistic workflow management system for HEP computation has been significantly increased. We have demonstrated efficient utilization of 25K non-dedicated cores and achieved a data input rate of 9 Gb/s and an output rate of 400 GB/h. This has required new capabilities in task categorization, workflow dependency specification, and resource monitoring.
In the near future, many new experiments (JUNO, LHAASO, CEPC, etc.) with challenging data volumes are coming into operation or are planned at IHEP, China. The Jiangmen Underground Neutrino Observatory (JUNO) is a multipurpose neutrino experiment to become operational in 2019. The Large High Altitude Air Shower Observatory (LHAASO) is oriented to the study and observation of cosmic rays and will start collecting data in 2019. The Circular Electron Positron Collider (CEPC) is planned to be a Higgs factory, to be upgraded to a proton-proton collider in a second phase. The DIRAC-based distributed computing system has been enabled to support multiple experiments. Developing a task submission and management system is the first step for new experiments wishing to try out or use distributed computing resources in their early stages. In this paper we present the design and development of a common framework to ease the process of building experiment-specific task submission and management systems. Object-oriented programming techniques have been used to make the infrastructure easy to extend for new experiments. The framework covers user interfaces, task creation and submission, run-time workflow control, task monitoring and management, and dataset management. The YAML description language is used to define tasks and is easily interpreted to obtain configurations from users. The run-time workflow control adopts the concept of the DIRAC workflow and allows applications to easily define several steps in one job and report their status separately. Common modules are provided, including a splitter to split tasks, back-ends to heterogeneous resources, and a job factory to generate the parameters and files needed for submission. A monitoring service with a web portal is provided to monitor the status of tasks and the related jobs. The dataset management module has been designed to communicate with the DIRAC File Catalog to implement querying and registration of datasets. Finally, the paper will show how the two experiments JUNO and CEPC use this infrastructure to build their own task submission and management systems and complete their first large-scale trials on distributed computing resources.
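To make the YAML-based task description concrete, the snippet below parses an invented task definition into a Python configuration. The keys (experiment, splitter, steps, ...) are assumptions for illustration only and do not reflect the framework's actual schema.

```python
# Hypothetical YAML task description parsed into a configuration dict.
import yaml

TASK_YAML = """
task: juno_mc_test
experiment: JUNO
splitter:
  type: by_events
  events_per_job: 1000
steps:
  - name: detsim
    executable: run_detsim.sh
  - name: reco
    executable: run_reco.sh
output:
  register_to: DiracFileCatalog
"""

config = yaml.safe_load(TASK_YAML)
print(config["experiment"], "->", [s["name"] for s in config["steps"]])
```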
In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run-2 events requires parallelization of the code in order to reduce the memory-per-core footprint constraining serial-execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks however becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable the scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This contribution will present this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015 will be described, along with the current phase of deployment to Tier-2 sites during 2016. The process of performance monitoring and optimization in order to achieve efficient and flexible use of the resources will also be described.
The GridPP project in the UK has a long-standing policy of supporting non-LHC VOs with 10% of the provided resources. Up until recently this had only been taken up by a very limited set of VOs, mainly due to a combination of the (perceived) large overhead of getting started, the limited computing support within non-LHC VOs and the ability to fulfill their computing requirements on local batch farms.
In the past year, increased computing requirements and a general tendency for more centralised
computing resources, including cloud technologies, has led a number of small VOs to re-evaluate their strategy.
In response to this, the GridPP project commissioned a multi-VO DIRAC server to act as a unified interface to all its grid and cloud resources. This was then offered as a service to a number of small VOs. Six VOs, four of which were completely new to the grid and two transitioning from a glite/WMS based model, have so far taken up the offer and have used the (mainly) UK grid/cloud infrastructure to complete a significant amount of work each in the last 6 months.
In this talk we present the varied approaches taken by each VO, the support issues arising from these and how these can be re-used by other new communities in the future.
CRAB3 is a workload management tool used by more than 500 CMS physicists every month to analyze data acquired by the Compact Muon Solenoid (CMS) detector at the CERN Large Hadron Collider (LHC). CRAB3 allows users to analyze a large collection of input files (datasets), splitting the input into multiple Grid jobs depending on parameters provided by users.
The process of manually specifying exactly how a large project is divided into jobs is tedious and often results in sub-optimal splitting due to its dependence on the performance of the user code and the content of the input dataset. This introduces two types of problems: jobs that are too big will have excessive runtimes and will not distribute the work across all of the available nodes. Conversely, splitting the project into a large number of very small jobs is also inefficient, as each job creates additional overhead which increases the load on scheduling infrastructure resources.
In this work we present a new feature called “automatic splitting” which removes the need for users to manually specify job splitting parameters. We discuss how HTCondor DAGMan can be used to build dynamic Directed Acyclic Graphs (DAGs) on the fly to optimize the performance of large CMS analysis jobs on the Grid.
We use DAGMan to dynamically generate interconnected DAGs that estimate the time per event of the user code, then run a set of jobs of preconfigured runtime to analyze the dataset. If some jobs have terminated before completion, the unfinished portions are assembled into smaller jobs and resubmitted to the worker nodes.
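The core idea of automatic splitting can be illustrated with a toy calculation: a short probe measures the time per event of the user code, and the dataset is then split into jobs of a preconfigured target runtime. The numbers and function below are illustrative assumptions; the real implementation builds HTCondor DAGMan DAGs rather than a simple list of jobs.

```python
# Toy planner for "automatic splitting" based on a probe job's time per event.
import math

def plan_jobs(total_events, probe_events, probe_seconds, target_job_seconds):
    time_per_event = probe_seconds / probe_events
    events_per_job = max(1, int(target_job_seconds / time_per_event))
    n_jobs = math.ceil(total_events / events_per_job)
    return events_per_job, n_jobs

events_per_job, n_jobs = plan_jobs(total_events=2_000_000,
                                   probe_events=5_000, probe_seconds=600,
                                   target_job_seconds=8 * 3600)
print(f"{n_jobs} jobs of about {events_per_job} events each")
# Jobs that hit the runtime limit before finishing would have their unprocessed
# events re-packed into smaller "tail" jobs and resubmitted, as described above.
```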
The CMS Computing and Offline group has introduced a number of enhancements into the main software packages and tools used for centrally managed processing and data transfers in order to cope with the challenges expected during LHC Run 2. In the presentation we will highlight these improvements, which allow CMS to deal with the increased trigger output rate and the increased collision pileup in the context of the evolution in computing technology. The overall system aims for higher usage efficiency through increased automation and enhanced operational flexibility in terms of (dynamic) data transfers and workflow handling. The tight coupling of workflow classes to types of sites has been drastically reduced. Reliable and high-performing networking between most of the computing sites and the successful deployment of a data federation allow the execution of workflows using remote data access. Another step towards flexibility has been the introduction of one large global HTCondor pool for all types of processing workflows and analysis jobs implementing the 'late binding' principle. Besides classical grid resources, some opportunistic resources as well as cloud resources have been integrated into that pool, which gives reach to more than 200k CPU cores.
On a typical WLCG site providing batch access to computing resources according to a fairshare policy, the idle interval after a job ends and before a new one begins on a given slot is negligible compared to the duration of typical jobs. The overall amount of these intervals over a time window increases with the size of the cluster and the inverse of the job duration, and can be considered equivalent to an average number of unavailable slots over that time window. This value has been investigated for the Tier-1 at CNAF, and observed to occasionally grow to more than 10% of the roughly 20,000 available computing slots. Analysis reveals that this happens when a sustained rate of short jobs is submitted to the cluster and dispatched by the batch system. Because of how the default fairshare policy works, it increases the dynamic priority of those users mostly submitting short jobs, since they are not accumulating runtime, and will dispatch more of their jobs at the next round, thus worsening the situation until the submission flow ends. To address this problem the default behaviour of the fairshare has been altered by adding a correcting term to the default formula for the dynamic priority. The LSF batch system, currently adopted at CNAF, provides a way to define this value by invoking a C function, which returns it for each user in the cluster. The correcting term works by rounding the runtime of the most recently finished jobs up to a defined minimum. In this way, each short job is accounted for almost like a regular one and the dynamic priority settles at a proper value. The net effect is a reduction of the dispatch rate of short jobs and, consequently, a marked improvement in the average number of available slots. Furthermore, a potential starvation problem, actually observed at least once, is also prevented. After describing short jobs and reporting on their impact on the cluster, possible workarounds are discussed and the selected solution is motivated. Details on the most critical aspects of the implementation are explained and the observed results are presented.
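The effect of the correcting term can be sketched with a simplified priority formula. This is NOT the real LSF dynamic-priority formula or the C plugin described above; it is a minimal illustration, assuming a toy priority of the form shares / (1 + accumulated runtime), of how rounding short jobs up to a minimum runtime removes their unfair advantage.

```python
# Toy illustration of the fairshare correcting term (not the LSF formula).
MIN_RUNTIME = 600.0   # seconds; illustrative floor for short jobs

def effective_runtime(recent_job_runtimes):
    """Accumulated runtime with short jobs rounded up to MIN_RUNTIME."""
    return sum(max(r, MIN_RUNTIME) for r in recent_job_runtimes)

def dynamic_priority(shares, recent_job_runtimes):
    """Simplified fairshare priority: more accumulated (effective) runtime
    means lower priority."""
    return shares / (1.0 + effective_runtime(recent_job_runtimes))

# A user submitting many 30 s jobs no longer outranks a user running normal jobs:
print(dynamic_priority(shares=100, recent_job_runtimes=[30] * 50))    # lower
print(dynamic_priority(shares=100, recent_job_runtimes=[3600] * 5))   # higher
```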
For over a decade, dCache.ORG has provided robust software that is used at more than 80 Universities and research institutes around the world, allowing these sites to provide reliable storage services for the WLCG experiments and many other scientific communities. The flexible architecture of dCache allows running it in a wide variety of configurations and platforms - from all-in-one Raspberry-Pi up to hundreds of nodes in multi-petabyte infrastructures.
Due to the lack of managed storage at the time, dCache implemented data placement, replication and data integrity directly. Today, many alternatives are available: S3, GlusterFS, CEPH and others. While such systems position themselves as scalable storage systems, they cannot be used by many scientific communities out of the box. The absence of specific authentication and authorization mechanisms, the use of product-specific protocols and the lack of a namespace are some of the reasons that prevent wide-scale adoption of these alternatives.
Most of these limitations are already solved by dCache. By delegating low level storage management functionality to the above mentioned new systems and providing the missing layer through dCache, we provide a system which combines the benefits of both worlds - industry standard storage building blocks with the access protocols and authentication required by scientific communities.
In this presentation, we focus on CEPH, a popular software for clustered storage that supports file, block and object interfaces. CEPH is often used in modern computing centres, for example as a backend to OpenStack services. We will show prototypes of dCache running with a CEPH backend and discuss the benefits and limitations of such an approach. We will also outline the roadmap for supporting ‘delegated storage’ within the dCache releases.
Understanding how cloud storage can be effectively used, either standalone or in support of its associated compute, is now an important consideration for WLCG.
We report on a suite of extensions to familiar tools targeted at enabling the integration of cloud object stores into traditional grid infrastructures and workflows. Notable updates include support for a number of object store flavours in FTS3, Davix and gfal2, including mitigations for lack of vector reads; the extension of Dynafed to operate as a bridge between grid and cloud domains; protocol translation in FTS3; the implementation of extensions to DPM (also implemented by the dCache project) to allow 3rd party transfers over HTTP.
The result is a toolkit which facilitates data movement and access between grid and cloud infrastructures, broadening the range of workflows suitable for cloud. We report on deployment scenarios and prototype experience, explaining how, for example, an Amazon S3 or Azure allocation can be exploited by grid workflows.
Since 2014, the RAL Tier 1 has been working on deploying a Ceph backed object store. The aim is to replace Castor for disk storage. This new service must be scalable to meet the data demands of the LHC to 2020 and beyond. As well as offering access protocols the LHC experiments currently use, it must also provide industry standard access protocols. In order to keep costs down the service must use erasure coding rather than replication to ensure data reliability. This paper will present details of the storage service setup, which has been named Echo, as well as the experience gained from running and upgrading it.
In October 2015 a pre-production service offering the S3 and Swift APIs was launched. This paper will present details of the setup as well as the testing that has been done. This includes the use of S3 as a backend for the CVMFS Stratum 1s, for writing ATLAS log files and for testing FTS transfers. Additionally throughput testing from local experiments based at RAL will be discussed.
While there is certainly interest from the LHC experiments regarding the S3 and Swift APIs, they are still currently dependent on the XrootD and GridFTP protocols. The RAL Tier 1 has therefore also been developing XrootD and GridFTP plugins for Ceph. Both plugins are built on top of the same libraries that write striped data into Ceph, and therefore data written by one protocol will be accessible by the other. In the long term we hope the LHC experiments will migrate to industry standard protocols, so these plugins will only provide the features needed by the LHC VOs. This paper will report on the development and testing of these plugins.
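Since the pre-production service exposes a standard S3 API, it can be exercised with off-the-shelf tooling such as boto3, as in the sketch below. The endpoint URL, bucket name, keys and object content are placeholders, not the actual Echo configuration.

```python
# Example S3 interaction with an Echo-like endpoint using boto3 (placeholders).
import boto3

s3 = boto3.client("s3",
                  endpoint_url="https://s3.echo.example.ac.uk",  # placeholder
                  aws_access_key_id="ACCESS_KEY",
                  aws_secret_access_key="SECRET_KEY")

s3.put_object(Bucket="atlas-logs", Key="job-1234.log", Body=b"log contents")
obj = s3.get_object(Bucket="atlas-logs", Key="job-1234.log")
print(obj["Body"].read())
```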
Dependability, resilience, adaptability, and efficiency: growing requirements call for tailored storage services and novel solutions. Unprecedented volumes of data coming from the detectors need to be quickly available in a highly scalable way for large-scale processing and data distribution, while in parallel they are routed to tape for long-term archival. These activities are critical for the success of HEP experiments. Nowadays we operate at high incoming throughput (14 GB/s during the 2015 LHC Pb-Pb run) and with concurrent complex production workloads. In parallel our systems provide the platform for continuous user- and experiment-driven workloads for large-scale data analysis, including end-user access and sharing. The storage services at CERN cover the needs of our community: EOS and CASTOR as large-scale storage; CERNBox for end-user access and sharing; Ceph as the data back-end for the CERN OpenStack infrastructure, NFS services and S3 functionality; AFS for legacy distributed-file-system services. In this paper we summarise the experience in supporting the LHC experiments and the transition of our infrastructure from static monolithic systems to flexible components providing a more coherent environment with pluggable protocols, tunable QoS, sharing capabilities and fine-grained ACL management, while continuing to guarantee dependable and robust services.
This work will present the status of Ceph-related operations and development within the CERN IT Storage Group: we summarise significant production experience at the petabyte scale as well as strategic developments to integrate with our core storage services. As our primary back-end for OpenStack Cinder and Glance, Ceph has provided reliable storage to thousands of VMs for more than 3 years; this functionality is used by the full range of IT services and experiment applications.
Ceph at the LHC scale (tens of PB and above) has required novel contributions on both the development and operational sides. For this reason, we have performed scale testing in cooperation with the core Ceph team. This work has been incorporated into the latest Ceph releases and enables Ceph to operate with at least 7,200 OSDs (totaling 30 PB in our tests). CASTOR has been evolved with the possibility to use a Ceph cluster as an extensible high-performance data pool. The main advantages of this solution are the drastic reduction of the operational load and the possibility to deliver high single-stream performance to efficiently drive the CASTOR tape infrastructure. Ceph is currently our laboratory to explore S3 usage in HEP and to evolve other infrastructure services.
In this paper, we will highlight our Ceph-based services, the NFS Filer and CVMFS, both of which use virtual machines and Ceph block devices at their core. We will then discuss the experience in running Ceph at LHC scale (most notably early results with Ceph-CASTOR).
CEPH is a cutting-edge, open source, self-healing distributed data storage technology which is exciting both the enterprise and academic worlds. CEPH delivers an object storage layer (RADOS), a block storage layer, and file system storage in a single unified system. CEPH object and block storage implementations are widely used in a broad spectrum of enterprise contexts, from dynamic provisioning of bare block storage to object storage back-ends for virtual machine images in cloud platforms. The High Energy Particle Physics community has also recognized its potential by deploying CEPH object storage clusters both at the Tier-0 (CERN) and in some Tier-1s, and by developing support for the GRIDFTP and XROOTD (a bespoke HEP protocol) transfer and access protocols. However, the CEPH file system (CEPHFS) has not been subject to the same level of interest. CEPHFS layers a distributed POSIX file system over CEPH's RADOS using a cluster of metadata servers that dynamically partition responsibility for the file system namespace and distribute the metadata workload based on client accesses. It is the least mature CEPH product and has long been waiting to be tagged as production-ready.
In this paper we present a CEPHFS use case implementation at the Center of Excellence for Particle Physics at the TeraScale (CoEPP). CoEPP operates the Australian Tier-2 for ATLAS and joins experimental and theoretical researchers from the Universities of Adelaide, Melbourne and Sydney and from Monash University. CEPHFS is used to provide a unique object storage system, deployed on commodity hardware and without single points of failure, used by Australian HEP researchers in the different CoEPP locations to store, process and share data, independently of their geographical location. CEPHFS also works in combination with an SRM and XROOTD implementation, integrated in ATLAS Data Management operations, and is used by HEP researchers for XROOTD and/or POSIX-like access to ATLAS Tier-2 user areas. We will provide details on the architecture, its implementation and tuning, and report performance I/O metrics as experienced by different clients deployed over the WAN. We will also explain our plan to collaborate with Red Hat Inc. on extending our current model so that the metadata cluster distribution becomes multi-site aware, such that regions of the namespace can be tied or migrated to metadata servers in different data centres.
In its current status, CoEPP's CEPHFS has already been in operation for almost a year (at the time of the conference). It has proven to be a service that follows the best industry standards at a significantly lower cost and that is fundamental in promoting data sharing and collaboration between Australian HEP researchers.
We will report on the first year of the OSiRIS project (NSF Award #1541335, UM, IU, MSU and WSU), which is targeting the creation of a distributed Ceph storage infrastructure coupled together with software-defined networking to provide high-performance access for well-connected locations on any participating campus. The project’s goal is to provide a single scalable, distributed storage infrastructure that allows researchers at each campus to read, write, manage and share data directly from their own computing locations. The NSF CC*DNI DIBBs program, which funded OSiRIS, is seeking solutions to the challenges of multi-institutional collaborations involving large amounts of data, and we are exploring the creative use of Ceph and networking to address those challenges.
While OSiRIS will eventually be serving a broad range of science domains, its first adopter will be ATLAS, via the ATLAS Great Lakes Tier-2 (AGLT2), jointly located at the University of Michigan and Michigan State University. Part of our presentation will cover how ATLAS is using the OSiRIS infrastructure and our experiences integrating our first user community. The presentation will also review the motivations for and goals of the project, cover the technical details of the OSiRIS infrastructure, the challenges in providing such an infrastructure, and the technical choices made to address those challenges. We will conclude with our plans for the remaining 4 years of the project and our vision for what we hope to deliver by the project’s end.
With ROOT 6 in production in most experiments, ROOT has changed gear during the past year: the development focus on the interpreter has been redirected into other areas.
This presentation will summarize the developments that have happened in all areas of ROOT, for instance concurrency mechanisms, the serialization of C++11 types, new graphics palettes, new "glue" packages for multivariate analyses, and the state of the Jupyter and JavaScript interfaces and language bindings.
It will lay out the short term plans for ROOT 5 and ROOT 6 and try to forecast the future evolution of ROOT, for instance with respect to more robust interfaces and a fundamental change in the graphics and GUI system.
ROOT is one of the core software tools for physicists. For more than a decade it has held a central position in physicists' analysis code and in the experiments' frameworks, thanks in part to its stability and simplicity of use. This has allowed software development for analyses and frameworks to use ROOT as a "common language" for HEP, across virtually all experiments.
Software development in general, and in HEP frameworks in particular, has become increasingly complex over the years. From straightforward code fitting in a single FORTRAN source file, HEP software has grown to span millions of lines of code spread amongst many, more or less collaborating, packages and libraries. To add to the complexity, in an effort to better exploit current and upcoming hardware, this code is being adapted to move from purely scalar and serial algorithms to complex multithreaded, multi-tasked and/or vectorized versions.
The C++ language itself, and the software development community's understanding of the best way to leverage its strengths, has evolved significantly. One of the best examples of this is the “C++ Core Guidelines”, which purport to get a “smaller, simpler and safer language” out of C++. At the same time, new tools and techniques are being developed to facilitate proving and testing the correctness of software programs, as exemplified by the C++ Guideline Support Library, but these require the tools to be able to understand the semantics of the interfaces. Design patterns and interface tricks that were appropriate in the early days of C++ are often no longer the best choices for API design. ROOT is at the heart of virtually all physics analyses and most HEP frameworks, and as such needs to lead the way and help demonstrate and facilitate the application of these modern paradigms.
This presentation will review what these lessons are and how they can be applied to an evolution of the ROOT C++ interfaces, striking a balance between conserving familiarity with the legacy interfaces (to facilitate both the transition of existing code and the learning of the new interfaces) and significantly improving the expressiveness, clarity, (re)usability, thread friendliness, and robustness of the user code.
ROOT version 6 comes with a C++-compliant interpreter, cling. Cling needs to know everything about the code in libraries in order to interact with them.
This translates into increased memory usage with respect to previous versions of
ROOT.
During the automatic library loading process at runtime, ROOT 6 re-parses the
set of header files which describe the library, and can enter "recursive" parsing.
The former has a noticeable effect on CPU and memory performance, whereas the
latter is fragile and can introduce correctness issues. An elegant solution to
these shortcomings is to feed in the necessary information only when required and in a
non-recursive way.
The LLVM community has been working on a powerful clang feature for reducing build
times and peak memory usage, called "C++ Modules".
The feature has matured and is on its way into the C++ standard. C++ Modules are
a flexible concept which can be employed to meet the requirement that CMS and other
experiments place on ROOT: to optimize both runtime memory usage and performance.
Implementing the missing concepts in cling and its underlying LLVM
libraries, and adopting the changes in ROOT, is a complex endeavor. I describe the
scope of the work, present a few techniques used to lower ROOT's runtime
memory footprint, discuss the status of C++ Modules in the context of ROOT,
and show some preliminary performance results.
The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments.
The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components).
For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.
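As an illustration of the multi-process execution model discussed above (and of the comparison with the multiprocessing module in Python), the following minimal Python sketch fills partial histograms in worker processes and merges them in the parent. It is a sketch of the map-and-merge pattern only, with no ROOT dependency; the data source and the crude binning are invented for the example.

# Minimal sketch of the map-and-merge pattern used by multi-process
# histogramming tools; plain Python only, no ROOT API assumed.
import random
from collections import Counter
from multiprocessing import Pool

def fill_partial(seed):
    """Worker: fill a partial 'histogram' (bin -> count) from pseudo-data."""
    rng = random.Random(seed)
    hist = Counter()
    for _ in range(100000):
        value = rng.gauss(0.0, 1.0)        # stand-in for an event quantity
        hist[round(value, 1)] += 1         # crude fixed-width binning
    return hist

if __name__ == "__main__":
    with Pool(processes=4) as pool:                  # 4 worker processes
        partials = pool.map(fill_partial, range(4))  # "map" step
    total = sum(partials, Counter())                 # "merge" step in the parent
    print(total.most_common(5))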
Notebooks represent an exciting new approach that will considerably facilitate collaborative physics analysis.
They are a modern and widely-adopted tool to express computational narratives comprising, among other elements, rich text, code and data visualisations. Several notebook flavours exist, although one of them has been particularly successful: the Jupyter open source project.
In this contribution we demonstrate how the ROOT framework is integrated with the Jupyter technology, reviewing features such as an unprecedented integration of Python and C++ languages and interactive data visualisation with JavaScript ROOT. In this context, we show the potential of the complete interoperability of ROOT with other analysis ecosystems such as SciPy.
We discuss through examples and use-cases how the notebook approach boosts the productivity of physicists, engineers and non-coding lab scientists. Opportunities in the field of outreach, education and open-data initiatives are also reviewed.
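To give a flavour of the integration described above, a minimal notebook cell using PyROOT might look like the sketch below. It assumes a ROOT installation with Jupyter support; the "%jsroot on" magic, assumed to be provided by ROOT's Jupyter integration for JavaScript rendering, is left as a comment so the snippet also runs as a plain script.

# Minimal PyROOT sketch of a notebook cell.
import ROOT

# %jsroot on   # notebook magic (assumed available), commented out here
h = ROOT.TH1D("h", "Gaussian toy;x;entries", 64, -4, 4)
h.FillRandom("gaus", 10000)      # fill from ROOT's built-in Gaussian
c = ROOT.TCanvas("c")
h.Draw()
c.Draw()                         # in a notebook this displays the canvas inline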
ROOT provides advanced statistical methods needed by the LHC experiments to analyze their data. These include machine learning tools for classification, regression and clustering. TMVA, a toolkit for multi-variate analysis in ROOT, provides these machine learning methods.
We will present new developments in TMVA, including parallelisation, deep-learning neural networks, new features and
additional interfaces to external machine learning packages.
We will show the modular design of the new version of TMVA, its cross-validation and hyperparameter tuning capabilities, feature engineering and deep learning.
We will further describe new parallelisation features, including multi-threading, multi-processing and cluster parallelisation, and present GPU support for intensive machine learning applications such as deep learning.
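As a rough sketch of the modular Factory/DataLoader workflow mentioned above, the Python snippet below books a BDT through PyROOT. The input files "signal.root"/"background.root", the tree name and the variable names are invented placeholders, and the option strings are only examples, not a recommended configuration.

# Hedged sketch of booking a BDT with the TMVA Factory + DataLoader design.
import ROOT

out = ROOT.TFile("tmva_output.root", "RECREATE")
factory = ROOT.TMVA.Factory("TMVAClassification", out,
                            "!V:!Silent:AnalysisType=Classification")
loader = ROOT.TMVA.DataLoader("dataset")

for var in ("var1", "var2", "var3"):         # hypothetical input variables
    loader.AddVariable(var, "F")

sig = ROOT.TFile.Open("signal.root")         # hypothetical input files
bkg = ROOT.TFile.Open("background.root")
loader.AddSignalTree(sig.Get("tree"), 1.0)
loader.AddBackgroundTree(bkg.Get("tree"), 1.0)
loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:NormMode=NumEvents")

factory.BookMethod(loader, ROOT.TMVA.Types.kBDT, "BDT",
                   "NTrees=400:MaxDepth=3:BoostType=Grad")
factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
out.Close()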
ROOT provides an extremely flexible format used throughout the HEP community. The number of use cases – from an archival data format to end-stage analysis – has required a number of tradeoffs to be exposed to the user. For example, a high “compression level” in the traditional DEFLATE algorithm will result in a smaller file (saving disk space) at the cost of slower decompression (costing CPU time when read). At the scale of an LHC experiment, poor design choices can result in terabytes of wasted space.
We explore and attempt to quantify some of these tradeoffs. Specifically, we explore: the use of alternate compression algorithms to optimize for read performance; an alternate method of compressing individual events to allow efficient random access; and a new approach to whole-file compression. Quantitative results are given, as well as guidance on how to make compression decisions for different use cases.
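To make the compression-level trade-off concrete, the short Python sketch below compresses the same buffer with zlib (a DEFLATE implementation) at several levels and reports size and time. It is an illustration only, not the ROOT I/O code itself, and the synthetic payload is invented.

# Illustration of the DEFLATE compression-level trade-off (size vs. CPU time).
import time
import zlib
import random

random.seed(42)
# Pseudo "event data": partly repetitive, partly random, a few MB in total.
payload = (b"energy=13TeV;" * 100000 +
           bytes(random.getrandbits(8) for _ in range(3_000_000)))

for level in (1, 3, 6, 9):
    t0 = time.perf_counter()
    blob = zlib.compress(payload, level)
    dt = time.perf_counter() - t0
    print(f"level {level}: {len(blob)/1e6:6.2f} MB in {dt*1e3:7.1f} ms")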
We present rootJS, an interface making it possible to seamlessly integrate ROOT 6 into applications written for Node.js, the JavaScript runtime platform increasingly used to create high-performance Web applications. ROOT features can be called both directly from Node.js code and by JIT-compiling C++ macros. All rootJS methods are invoked asynchronously and support callback functions, allowing non-blocking operation of Node.js applications using them. Last but not least, our bindings have been designed to be platform-independent and should therefore work on all systems supporting both ROOT 6 and Node.js.
Thanks to rootJS it is now possible to create ROOT-aware Web applications taking full advantage of the high performance and extensive capabilities of Node.js. Examples include platforms for the quality assurance of acquired, reconstructed or simulated data, book-keeping and e-log systems, and even Web browser-based data visualisation and analysis.
Brookhaven National Laboratory (BNL) anticipates significant growth in scientific programs with large computing and data storage needs in the near future and has recently re-organized support for scientific computing to meet these needs.
A key component is the enhanced role of the RHIC-ATLAS Computing Facility
(RACF) in support of high-throughput and high-performance computing (HTC and HPC) at BNL.
This presentation discusses the evolving role of the RACF at BNL, in
light of its growing portfolio of responsibilities and its increasing
integration with cloud (academic and for-profit) computing activities.
We also discuss BNL's plan to build a new computing center to support
the new responsibilities of the RACF and present a summary of the cost
benefit analysis done, including the types of computing activities
that benefit most from a local data center vs. cloud computing. This
analysis is partly based on an updated cost comparison of Amazon EC2
computing services and the RACF, which was originally conducted in 2012.
The Worldwide LHC Computing Grid (WLCG) infrastructure
allows the use of resources from more than 150 sites.
Until recently the setup of the resources and the middleware at a site
were typically dictated by the partner grid project (EGI, OSG, NorduGrid)
to which the site is affiliated.
In recent years, however, changes in hardware, software, funding and
experiment computing requirements have increasingly affected
the way resources are shared and supported. At the WLCG level this implies
a need for more flexible and lightweight methods of resource provisioning.
In the WLCG cost optimisation survey presented at CHEP 2015 the concept of
lightweight sites was introduced, viz. sites essentially providing only
computing resources and aggregating around core sites that provide also storage.
The efficient use of lightweight sites requires a fundamental reorganisation
not only in the way jobs run, but also in the topology of the infrastructure
and the consolidation or elimination of some established site services.
This contribution gives an overview of the solutions being investigated
through "demonstrators" of a variety of lightweight site setups,
either already in use or planned to be tested in experiment frameworks.
The INFN CNAF Tier-1 computing center is composed of two main rooms containing IT resources and four additional locations that host the technology infrastructures providing electrical power and refrigeration to the facility. Power supply and continuity are ensured by a dedicated room with three 15,000 V to 400 V transformers in a separate part of the main building and two redundant 1.4 MW diesel rotary uninterruptible power supplies. Cooling is provided by six free-cooling chillers of 320 kW each in an N+2 redundancy configuration. Clearly, considering the complex physical distribution of the technical plants, a detailed Building Management System (BMS) was designed and implemented as part of the original project in order to monitor and collect all the necessary information and to provide alarms in case of malfunctions or major failures. After almost 10 years of service, a revision of the BMS had become necessary. In addition, the increasing cost of electrical power is nowadays a strong motivation for improving the energy efficiency of the infrastructure. Therefore the exact calculation of the power usage effectiveness (PUE) metric has become one of the most important factors when aiming for the optimization of a modern data center. For these reasons, an evolution of the BMS was designed using the Schneider StruxureWare infrastructure hardware and software products. This solution has proven to be a natural and flexible development of the previous TAC Vista software, with advantages in ease of use and the possibility to customize the data collection and the graphical interface displays. Moreover, the addition of protocols such as open-standard Web services makes it possible to communicate with the BMS from custom user applications and permits the exchange of data and information over the Web between different third-party systems. Specific Web service SOAP requests have been implemented in our Tier-1 monitoring system in order to collect historical trends of power demand and calculate the partial PUE (pPUE) of specific areas of the infrastructure. This helps in the identification of “spots” that may need optimization of their power usage. The StruxureWare system maintains compatibility with standard protocols like Modbus as well as native LonWorks, making it possible to reuse the existing network between physical locations as well as a considerable number of programmable controllers and I/O modules that interact with the facility. The increased detail of statistical information on power consumption and the HVAC (heating, ventilation and air conditioning) parameters could prove to be a very valuable strategic choice for improving the overall PUE. This will bring remarkable benefits to the overall management costs, despite the limits of the current location of the facility, and it will help the process of building a more energy-efficient data center that embraces the concept of green IT.
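For reference, the (partial) PUE computed from such power readings is simply the ratio of total facility power to IT power. The short Python sketch below shows the arithmetic; the sample readings are illustrative placeholders, not CNAF measurements.

# Partial PUE (pPUE) arithmetic from power readings (illustrative numbers only).
def ppue(it_power_kw, cooling_kw, losses_kw):
    """pPUE = (IT + cooling + electrical losses) / IT for a given area."""
    total = it_power_kw + cooling_kw + losses_kw
    return total / it_power_kw

print(ppue(it_power_kw=1200.0, cooling_kw=450.0, losses_kw=90.0))  # -> 1.45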
1. Statement
OpenCloudMesh has a very simple goal: to be an open and vendor agnostic standard for private cloud interoperability.
To address the YetAnotherDataSilo problem, a working group under the umbrella of the GÉANT Association has been created with the goal of ensuring neutrality and a clear context for this project.
All leading partners of the OpenCloudMesh project - GÉANT, CERN and ownCloud Inc. - are fully committed to the open API design principle. This means that - from day one - the OCM sharing API should be discussed, designed and developed as a vendor-neutral protocol to be adopted by any on-premise sync&share product vendor or service provider. We acknowledge that the first working interface prototype was piloted in an ownCloud environment and has been in production since 2015 in “Sciebo”, a service with a design size of 500,000 users interconnecting dozens of private research clouds in Germany. Pydio adopting the standard in March 2016 underlines that this did not and will not affect the adoption of the open API by any other vendor or service/domain provider.
2. OpenCloudMesh talk at CHEP2016 titled “Interconnected Private Clouds for Universities and Researchers”
The content of the presentation is an overview of the project, currently managed by GÉANT (Peter Szegedi), CERN (Dr. Jakub Moscicki) and ownCloud (Christian Schmitz), outlining overall concepts, past and present achievements, and future milestones, with a clear call to participation.
The presentation will summarize the problem originally scoped, then move to past success milestones (e.g. demonstrated simultaneous interoperability between CERN, AARnet, Sciebo and UniVienna and, expected by the time of the talk, interoperability between clouds running software from different vendors), and then shift to future milestones, moonshot challenges and a call to participation.
3. OpenCloudMesh Moonshot scope
The problems, concepts and solution approaches involved are absolutely cutting edge as of 2016, hence they offer both practical and research challenges. Science and research, open and peer-reviewed by nature, have become a truly globalized endeavour, with OCM having the potential to be the “usability fabric” of a network of private clouds acting as one global research cloud.
4. Links
Project Wiki
https://wiki.geant.org/display/OCM/Open+Cloud+Mesh
Milestone Press Release, February 10th 2016 http://www.geant.org/News_and_Events/Pages/OpenCloudMesh.aspx
Sciebo https://www.sciebo.de/en/
GÉANT http://www.geant.org/
CERN http://home.cern/
ownCloud https://owncloud.com/
The Tier-1 at CNAF is the main INFN computing facility, offering computing and storage resources to more than 30 different scientific collaborations including the 4 experiments at the LHC. A huge increase in computing needs is also foreseen in the coming years, mainly driven by the experiments at the LHC (especially starting with Run 3 from 2021) but also by other upcoming experiments such as CTA.
While we are considering the upgrade of the infrastructure of our data center, we are also evaluating the possibility of using CPU resources available in other data centers or even leased from commercial cloud providers.
Hence, at the INFN Tier-1, besides participating in the EU project HNSciCloud, we have also pledged a small amount of computing resources (~2000 cores) located at the Bari ReCaS data center to the WLCG experiments for 2016, and we are testing the use of resources provided by a commercial cloud provider. While the Bari ReCaS data center is directly connected to the GARR network, with the obvious advantage of a low-latency and high-bandwidth connection, in the case of the commercial provider we rely only on the General Purpose Network.
In this paper we describe the setup phase and the first results of these installations started in the last quarter of 2015, focusing on the issues that we have had to cope with and discussing the measured results in terms of efficiency.
The WLCG Tier-1 center GridKa is developed and operated by the Steinbuch Centre for Computing (SCC)
at the Karlsruhe Institute of Technology (KIT). It was the origin of further Big Data research activities and
infrastructures at SCC, e.g. the Large Scale Data Facility (LSDF), providing petabyte scale data storage
for various non-HEP research communities.
Several ideas and plans exist to address the increasing demand for large scale data storage and data management services
and computing resources by more and more science communities within Germany and Europe, e.g. the
Helmholtz Data Federation (HDF) or the European Open Science Cloud (EOSC).
In the era of LHC run 3 and Belle-II, high energy physics will produce even more data than in the
past years, requiring improvements of the computing and data infrastructures as well as computing models.
We present our plans to further develop GridKa as a topical center within the multidisciplinary research environments at KIT, in Germany and in Europe, in the light of increasing requirements and advanced computing models, in order to provide the best possible services to high energy physics.
The KEK central computer system (KEKCC) supports various activities at KEK, such as the Belle / Belle II and J-PARC experiments. The system is now being replaced and will be put into production in September 2016. The computing resources, CPU and storage, in the next system are much enhanced to match the recent increase in computing resource demand. We will have 10,000 CPU cores, 13 PB of disk storage, and a maximum tape capacity of 70 PB.
Grid computing can help distribute large amounts of data to geographically dispersed sites and share data in an efficient way for worldwide collaborations. But the data centers of the host institutes of large HEP experiments have to give serious consideration to managing huge amounts of data. For example, the Belle II experiment expects that several hundred PB of data will have to be stored at the KEK site even if Grid computing is adopted as the analysis model. The challenge is not only storage capacity: I/O scalability, usability, power efficiency and so on should be considered in the design of the storage system. Our storage system is designed to meet the requirements for managing data on the order of 100 PB. We introduce IBM Elastic Storage Server (ESS) and DDN SFA12K as storage hardware, and adopt the GPFS parallel file system to achieve high I/O performance. The GPFS file system can span several tiers, recalling data from local SSD caches in the computing nodes through HDD to tape storage. We take full advantage of this hierarchical storage management in the next system. We have a long history of using HPSS as the HSM for the tape system. In the current system, we introduced GHI (GPFS HPSS Interface) as the layer between the disk and tape systems, which provides high I/O performance and the usability of a GPFS disk file system for tape-resident data.
In this talk, we mainly focus on the design and performance of our new storage system. In addition, issues concerning workload management, system monitoring, data migration and so on are described. Our knowledge, experience and challenges as a data-intensive computing facility for the next generation of HEP experiments can be usefully shared among HEP data centers.
At the RAL Tier-1 we have been deploying production services on both bare metal and a variety of virtualisation platforms for many years. Despite the significant simplification of the configuration and deployment of services due to the use of a configuration management system, maintaining services still requires a lot of effort. Also, the current approach of running services on static machines results in a lack of fault tolerance, which lowers availability and increases the amount of manual intervention required. In the current climate more and more non-LHC communities are becoming important, with the potential need to run additional instances of existing services as well as new services, while at the same time staff effort is more likely to decrease than increase. It is therefore important that we are able to reduce the effort required to maintain services whilst ideally improving availability, in addition to being able to maximise the utilisation of resources and become more adaptive to changing conditions.
These problems are not unique to RAL, and from looking at what is happening in the wider world it is clear that container orchestration has the possibility to provide a solution to many of these issues. Therefore last year we began investigating the migration of services to an Apache Mesos cluster running on bare metal. In this model the concept of individual machines is abstracted away and services are run on the cluster in Docker containers, managed by a scheduler. This means that any host or application failures, as well as procedures such as rolling starts or upgrades, can be handled automatically and no longer require any human intervention. Similarly, the number of instances of applications can be scaled automatically in response to changes in load. On top of this it also gives us the important benefit of being able to run a wide range of services on a single set of resources without involving virtualisation.
In this presentation we will describe the Mesos infrastructure that has been deployed at RAL, including how we deal with service discovery, the challenge of monitoring, logging and alerting in a dynamic environment and how it integrates with our existing traditional infrastructure. We will report on our experiences in migrating both stateless and stateful applications, the security issues surrounding running services in containers, and finally discuss some aspects of our internal process for making Mesos a platform for running production services.
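As a rough sketch of how a containerised service can be handed to a Mesos scheduler such as Marathon, the snippet below POSTs a Docker application definition to Marathon's REST API. The host name, image, resource figures and health check are illustrative assumptions and do not describe the RAL deployment.

# Hedged sketch: submitting a Docker-based service to a Marathon scheduler
# running on a Mesos cluster. Host, image and sizes are placeholders.
import requests

app = {
    "id": "/example/squid",                  # hypothetical service name
    "cpus": 1.0,
    "mem": 2048,
    "instances": 3,                          # the scheduler keeps 3 copies running
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/frontier-squid:latest", "network": "BRIDGE"},
    },
    "healthChecks": [{"protocol": "TCP", "portIndex": 0}],
}

resp = requests.post("http://marathon.example.org:8080/v2/apps", json=app)
resp.raise_for_status()
print("submitted", resp.json().get("id"))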
We present the Web-Based Monitoring project of the CMS experiment at the LHC at CERN. With the growth in size and complexity of High Energy Physics experiments and the accompanying increase in the number of collaborators spread across the globe, the importance of broadly accessible monitoring has grown. The same can be said about the increasing relevance of the operation and reporting web tools used by run coordinators, sub-system experts and managers. CMS Web-Based Monitoring has played a crucial role in providing this for the CMS experiment through the commissioning phase and the LHC Run I data taking period in 2010-2012. It has adapted to many CMS changes and new requirements during Long Shutdown 1 and even now, during the ongoing LHC Run II. We have developed a suite of web tools to present data to the users from many underlying heterogeneous sources, from real-time messaging systems to relational databases. The tools combine, correlate and visualize information in both graphical and tabular formats of interest to the experimentalist, with data such as beam conditions, luminosity, trigger rates, DAQ, detector conditions, operational efficiency and more, allowing for flexibility on the user side. In addition, we provide data aggregation, not only at the display level but also at the database level. An upgrade of the Web-Based Monitoring project involving major changes is being planned and is also discussed here.
The CMS experiment has collected an enormous volume of metadata about its computing operations in its monitoring systems, describing its experience in operating all of the CMS workflows on all of the Worldwide LHC Computing Grid tiers. Data mining efforts on all this information have rarely been undertaken, but they are of crucial importance for a better understanding of how CMS operates successfully and for reaching an adequate and adaptive model of CMS operations, in order to allow detailed optimizations and eventually a prediction of system behaviours. These data are now streamed into the CERN Hadoop data cluster for further analysis. Specific sets of information (e.g. data on how many replicas of datasets CMS wrote on disks at WLCG tiers, data on which datasets were primarily requested for analysis, etc.) were collected on Hadoop and processed with MapReduce applications profiting from the parallelization on the Hadoop cluster. We present the implementation of new monitoring applications on Hadoop, and discuss the new possibilities in CMS computing monitoring introduced by the ability to quickly process big data sets from multiple sources, looking forward to a predictive modeling of the system.
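As an example of the kind of MapReduce-style aggregation described, the PySpark sketch below counts accesses per dataset from a hypothetical CSV dump of popularity records; the HDFS path and the record layout (dataset,user,timestamp) are invented for illustration.

# Hedged PySpark sketch: count accesses per dataset from popularity records.
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="dataset-popularity")

records = sc.textFile("hdfs:///cms/monitoring/popularity/*.csv")  # assumed path
counts = (records
          .map(lambda line: line.split(","))
          .filter(lambda f: len(f) >= 1 and f[0])   # keep well-formed rows
          .map(lambda f: (f[0], 1))                 # (dataset, 1)
          .reduceByKey(add))                        # accesses per dataset

for dataset, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(dataset, n)

sc.stop()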
This paper introduces the evolution of the monitoring system of the Alpha Magnetic Spectrometer (AMS) Science Operation Center (SOC) at CERN.
The AMS SOC monitoring system includes several independent tools: a Network Monitor to poll the health metrics of the AMS local computing farm, a Production Monitor to show the production status, a Frame Monitor to record the arrival status of the flight data, and an SOC Monitor to check the production latency.
Currently CERN has adopted Metrics as the main monitoring platform, and we are working to integrate our monitoring tools into this platform to provide dashboard-like monitoring pages which will show the overall status of the SOC as well as more detailed information. A diagnostic tool, based on a set of expandable rules and capable of automatically locating possible issues and providing suggestions for fixes, is also being designed.
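A rule-based diagnostic layer of the kind mentioned can be sketched as a list of (check, suggestion) pairs evaluated against the collected metrics. In the Python sketch below, the metric names, thresholds and advice strings are all invented for illustration.

# Minimal sketch of an expandable rule set for automatic diagnostics.
RULES = [
    (lambda m: m.get("production_latency_min", 0) > 60,
     "Production latency above 1 h: check the reconstruction queue."),
    (lambda m: m.get("frames_missing", 0) > 0,
     "Missing flight-data frames: verify the data transfer chain."),
    (lambda m: m.get("farm_nodes_down", 0) > 2,
     "Several farm nodes down: inspect the local computing farm."),
]

def diagnose(metrics):
    """Return the suggestions whose rule fires for the given metrics."""
    return [advice for check, advice in RULES if check(metrics)]

print(diagnose({"production_latency_min": 75, "frames_missing": 0,
                "farm_nodes_down": 3}))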
For over a decade, LHC experiments have been relying on advanced and specialized WLCG dashboards for monitoring, visualizing and reporting the status and progress of job execution, data management transfers and site availability across the WLCG distributed grid resources.
In recent years, in order to cope with the increase in the volume and variety of the grid resources, WLCG monitoring started to evolve towards data analytics technologies such as ElasticSearch, Hadoop and Spark. Therefore, at the end of 2015, it was agreed to merge these WLCG monitoring services, resources and technologies with the internal CERN IT data centre monitoring services, which are based on the same solutions.
The overall mandate was to migrate, in consultation with representatives of the users from the LHC experiments, the WLCG monitoring to the same technologies used for the IT monitoring. It started by merging the two small IT and WLCG monitoring teams, in order to join forces to review, rethink and optimize the IT and WLCG monitoring and dashboards within a single common architecture, using the same technologies and workflows used by the CERN IT monitoring services.
This work, in early 2016, resulted in the definition and the development of a Unified Monitoring Architecture aiming at satisfying the requirements to collect, transport, store, search, process and visualize both IT and WLCG monitoring data. The newly-developed architecture, relying on state-of-the-art open source technologies and on open data formats, will provide solutions for visualization and reporting that can be extended or modified directly by the users according to their needs and their role. For instance it will be possible to create new dashboards for the shifters and new reports for the managers, or implement additional notifications and new data aggregations directly by the service managers, with the help of the monitoring support team but without any specific modification or development in the monitoring service.
This contribution provides an overview of the Unified Monitoring Architecture, currently based on technologies such as Flume, ElasticSearch, Hadoop, Spark, Kibana and Zeppelin, gives insight into and details of the lessons learned, and explains the work done to monitor both the CERN IT data centres and WLCG jobs, data transfers, sites and services.
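To give a concrete feel for the data flow into the storage and search layer of such an architecture, the sketch below indexes a single monitoring document into Elasticsearch with the standard Python client. The endpoint, index name and document fields are invented placeholders, and the doc_type argument reflects the client versions of that era (newer clients drop it).

# Hedged sketch: pushing one monitoring document into Elasticsearch.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])   # assumed local test instance

doc = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "producer": "wlcg_transfers",               # hypothetical data source name
    "src_site": "CERN-PROD",
    "dst_site": "FZK-LCG2",
    "throughput_mbps": 812.5,
    "status": "OK",
}

res = es.index(index="monit-transfers-2016", doc_type="doc", body=doc)
print(res.get("result", res))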
The CERN Control and Monitoring Platform (C2MON) is a modular, clusterable framework designed to meet a wide range of monitoring, control, acquisition, scalability and availability requirements. It is based on modern Java technologies and has support for several industry-standard communication protocols. C2MON has been reliably utilised for several years as the basis of multiple monitoring systems at CERN, including the Technical Infrastructure Monitoring (TIM) service and the DIAgnostics and MONitoring (DIAMON) service. The central Technical Infrastructure alarm service for the accelerator complex (LASER) is in the final migration phase. Furthermore, three more services at CERN are currently being prototyped with C2MON.
Until now, usage of C2MON has been limited to internal CERN projects. However, C2MON is trusted and mature enough to be made publicly available. Aiming to build a user community, encourage collaboration with external institutes and create industry partnerships, the C2MON platform will be distributed as an open-source package under the LGPLv3 licence within the context of the knowledge transfer initiative at CERN.
This paper gives an overview of the C2MON platform focusing on its ease of use, integration with modern technologies, and its other features such as standards-based web support and flexible archiving techniques. The challenges faced when preparing an in-house platform for general release to external users are also described.
In order to ensure an optimal performance of the LHCb Distributed Computing, based on LHCbDIRAC, it is necessary to be able to inspect the behavior over time of many components: firstly the agents and services on which the infrastructure is built, but also all the computing tasks and data transfers that are managed by this infrastructure. This consists of recording and then analyzing time series of a large number of observables, for which the usage of SQL relational databases is far from optimal. Therefore, within DIRAC, we have been studying novel possibilities based on NoSQL databases (ElasticSearch, OpenTSDB and InfluxDB); as a result of this study we developed a new monitoring system based on ElasticSearch. It has been deployed on the LHCb Distributed Computing infrastructure, where it collects data from all the components (agents, services, jobs) and allows reports to be created through Kibana and a web user interface based on the DIRAC web framework.
In this paper we describe this new implementation of the DIRAC monitoring system. We give details of the ElasticSearch implementation within the general DIRAC framework, as well as an overview of the advantages of the pipeline aggregation used for creating a dynamic bucketing of the time series. We present the advantages of using the ElasticSearch DSL high-level library for creating and running queries. Finally, we present the performance of the system.
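The dynamic time bucketing mentioned above can be expressed compactly with the ElasticSearch DSL library, as in the hedged sketch below; the index name, field names and filter values are invented and do not correspond to the real LHCbDIRAC schema (recent Elasticsearch servers also rename the interval parameter to calendar_interval).

# Hedged sketch of a time-bucketed aggregation with elasticsearch-dsl.
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch(["http://localhost:9200"])

s = Search(using=client, index="lhcb-wms-history") \
        .filter("term", Status="Running")
s.aggs.bucket("per_hour", "date_histogram",
              field="timestamp", interval="1h") \
      .metric("jobs", "sum", field="Jobs")

response = s.execute()
for b in response.aggregations.per_hour.buckets:
    print(b.key_as_string, b.jobs.value)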
One of the principal goals of the Dept. of Energy funded SciDAC-Data project is to analyze the more than 410,000 high energy physics “datasets” that have been collected, generated and defined over the past two decades by experiments using the Fermilab storage facilities. These datasets have been used as the input to over 5.6 million recorded analysis projects, for which detailed analytics have been gathered. The analytics and meta-information regarding these datasets and analysis projects are being combined with knowledge of their place in the HEP analysis chains of the major experiments, in order to understand how modern computing and data delivery is being used.
We present the first results of this project, which examine in detail how the CDF, DØ and NO𝜈A experiments have organized, classified and consumed petascale datasets to produce their physics results. The results include the analysis of the correlations in dataset/file overlap, data usage patterns, data popularity, dataset dependency and temporary dataset consumption. The results provide critical insight into how workflows and data delivery schemes can be combined with different caching strategies to more efficiently perform the work required to mine these large HEP data volumes and to understand the physics analysis requirements for the next generation of HEP computing facilities.
In particular we present detailed analysis of the NO𝜈A data organization and consumption model corresponding to their first and second oscillation results (2014-2016) and the first look at the analysis of the Tevatron Run II experiments. We present statistical distributions for the characterization of these data and data driven models describing their consumption.
In Long Shutdown 3 the CMS Detector will undergo a major upgrade to prepare for the second phase of the LHC physics program, starting around 2026. The HL-LHC upgrade will bring instantaneous luminosity up to 5x10^34 cm-2 s-1 (levelled), at the price of extreme pileup of 200 interactions per crossing. A new silicon tracker with trigger capabilities and extended coverage, and new high granularity forward calorimetry will enhance the CMS acceptance and selection power. This will enable precision measurements of the Higgs boson properties, as well as extend the discovery reach for physics beyond the standard model, while coping with conditions dictated by the HL-LHC parameters.
Following the tradition, the CMS Data Acquisition System will continue to feature two trigger levels.
The detector will be read out at an unprecedented data rate of up to 50 Tb/s, at a Level-1 rate of 750 kHz, from some 50k high-speed optical detector links, for an average expected event size of 5 MB. Complete events will be analysed by a software trigger (HLT) running on standard processing nodes, and selected events will be stored permanently at a rate of up to 10 kHz for offline processing and analysis.
In this paper we discuss the baseline design of the DAQ and HLT systems for Run 4, taking into account the projected evolution of high-speed network fabrics for event building and distribution, and the anticipated performance of many-core CPUs and their memory and I/O architectures. Assuming a modest improvement in processing power of 12.5% per year for standard Intel-architecture CPUs and the affordability, by 2026, of 100-200 Gb/s links, and scaling the current HLT CPU needs for the increased event size, pileup, and rate, the CMS DAQ will require about:
Implications on hardware and infrastructure requirements for the DAQ “data center” are analysed. Emerging technologies for data reduction, in particular CPU-FPGA hybrid systems, but also alternative CPU architectures, are considered. These technologies may in the future help contain the TCO of the system, while improving the energy performance and reducing the cooling requirements.
Novel possible approaches to event building and online processing are also examined, which are inspired by trending developments in other areas of computing dealing with large masses of data.
We conclude by discussing the opportunities offered by reading out and processing parts of the detector, wherever the front-end electronics allows, at the machine clock rate (40 MHz). While the full detector is being read out and processed at the Level-1 rate, a second, parallel DAQ system would run as an "opportunistic experiment” processing tracker trigger and calorimeter data at 40 MHz. This idea presents interesting challenges and its physics potential should be studied.
The ATLAS experiment at CERN is planning a second phase of upgrades to prepare for the "High Luminosity LHC", a 4th major run due to start in 2026. In order to deliver an order of magnitude more data than previous runs, 14 TeV protons will collide with an instantaneous luminosity of 7.5 × 1034 cm−2s−1, resulting in much higher pileup and data rates than the current experiment was designed to handle. While this extreme scenario is essential to realise the physics programme, it is a huge challenge for the detector, trigger, data acquisition and computing. The detector upgrades themselves also present new requirements and opportunities for the trigger and data acquisition system.
Initial upgrade designs for the trigger and data acquisition system are shown, including the real time low latency hardware trigger, hardware-based tracking, the high throughput data acquisition system and the commodity hardware and software-based data handling and event filtering. The motivation, overall architecture and expected performance are explained. Some details of the key components are given. Open issues and plans are discussed.
After the Phase-I upgrade and onward, the Front-End Link eXchange (FELIX) system will be the interface between the data handling system and the detector front-end electronics and trigger electronics at the ATLAS experiment. FELIX will function as a router between custom serial links and a commodity switch network which will use standard technologies (Ethernet or Infiniband) to communicate with data collecting and processing components. The system architecture of FELIX will be described and the results of the demonstrator program currently in progress will be presented.
ALICE, the general purpose, heavy ion collision detector at the CERN LHC is designed
to study the physics of strongly interacting matter using proton-proton, nucleus-nucleus and proton-nucleus collisions at high energies. The ALICE experiment will be
upgraded during the Long Shutdown 2 in order to exploit the full scientific potential of the future LHC. The requirements will then be significantly different from what they were during the original design of the experiment and will require major changes to the detector read-out.
The main physics topics addressed by the ALICE upgrade are characterized by rare
processes with a very small signal-to-background ratio, requiring very large statistics of fully reconstructed events. In order to keep up with the 50 kHz interaction rate, the upgraded detectors will be read out continuously. However, triggered read-out will be used by some detectors and for commissioning and some calibration runs.
The total data volume collected from the detectors will increase significantly reaching a sustained data throughput of up to 3 TB/s with the zero-suppression of the TPC data performed after the data transfer to the detector read-out system. A flexible mechanism of bandwidth throttling will allow the system to gracefully degrade the effective rate of recorded interactions in case of saturation of the computing system.
This paper includes a summary of these updated requirements and presents a refined
design of the detector read-out and of the interface with the detectors and the online systems. It also elaborates on the system behaviour in continuous and triggered readout and defines ways to throttle the data read-out in both cases.
The ALICE Collaboration and the ALICE O$^2$ project have carried out detailed studies for a new online computing facility planned to be deployed for Run 3 of the Large Hadron Collider (LHC) at CERN. Some of the main aspects of the data handling concept are the partial reconstruction of raw data organized in so-called time frames and, based on that information, a reduction of the data rate without significant loss of physics information.
A production solution for data compression has been running for the ALICE Time Projection Chamber (TPC) in the ALICE High Level Trigger online system since 2011. The solution is based on the reconstruction of space points from the raw data. These so-called clusters are the input for the reconstruction of particle trajectories by the tracking algorithm. Clusters are stored instead of raw data after a transformation of the required parameters into an optimized format and subsequent lossless data compression techniques. With this approach, an average reduction factor of 4.4 has been achieved.
For Run 3, a significantly higher reduction is required. Several options are under study for the cluster data to be stored. As a first group of options, alternative lossless techniques such as arithmetic coding have been investigated.
Furthermore, theoretical studies have shown significant potential in data formats storing clusters relative to the particle trajectory they belong to. In the present scheme, cluster parameters are stored in uncalibrated detector format while the track used as reference for the residual calculation is described in Cartesian space. This results in higher entropy of the parameter residuals and smaller data reduction. The track reconstruction scheme of the O$^2$ system will allow calibrated clusters to be stored. The distribution of residuals then has a smaller entropy and is better suited for data compression. A further contribution is expected from adaptive precision for individual cluster parameters based on reconstructed particle trajectories.
As one major difference in the mode of operation, the increase in the flux of particles leads to a larger accumulation of space charge in the detector volume and significant distortions of cluster positions relative to the physical particle trajectory. The influence of the space-charge distortions on the data compression is under study.
Though data compression is being studied with the TPC as the premier use case, the concept and code development are kept open so they can be applied to other detectors as well.
In this contribution we report on general concepts of data compression in ALICE O$^2$ and recent results for all different options under study.
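The entropy argument above can be illustrated with a few lines of Python: coordinates stored as residuals with respect to a fitted trajectory have a much narrower distribution, hence lower entropy and better compressibility. The toy straight-line "track" and the binning below are of course invented.

# Toy illustration: residuals w.r.t. a fitted trajectory have lower entropy
# than the raw coordinates, so they compress better. Entirely synthetic data.
import math
import random
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy (bits/symbol) of a sequence of discrete symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

random.seed(1)
# Toy straight-line "track" sampled at 200 pad rows, with small cluster noise.
raw = [round(5.0 + 0.8 * row + random.gauss(0, 0.3), 1) for row in range(200)]
fit = [round(5.0 + 0.8 * row, 1) for row in range(200)]          # fitted track
residuals = [round(r - f, 1) for r, f in zip(raw, fit)]

print(f"raw coordinates : {entropy_bits(raw):.2f} bits/symbol")
print(f"track residuals : {entropy_bits(residuals):.2f} bits/symbol")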
The LHCb experiment will undergo a major upgrade during the second long shutdown (2018-2019). The upgrade will concern both the detector and the Data Acquisition (DAQ) system, which will be rebuilt in order to optimally exploit the foreseen higher event rate. The Event Builder (EB) is the key component of the DAQ system, gathering data from the sub-detectors and building up the whole event. The EB network has to manage an incoming data flux of 32 Tb/s running at 40 MHz, with a cardinality of about 500 nodes. In this contribution we present an EB implementation based on the InfiniBand (IB) network technology. The EB software relies on IB verbs, which offer a user-space API to employ the Remote Direct Memory Access (RDMA) capabilities provided by the IB network devices. We will present the performance of the EB on different High Performance Computing (HPC) clusters.
The Geant4 Collaboration released a new generation of the Geant4 simulation toolkit (version 10) in December 2013 and reported its new features at CHEP 2015. Since then, the Collaboration has continued to improve its physics and computing performance and usability. This presentation will survey the major improvements made since version 10.0. On the physics side, these include fully revised multiple scattering models, new Auger atomic de-excitation cascade simulation, significant improvements in string models, and an extension of the low-energy neutron model to protons and light ions. Extensions and improvements of the unified solid library provide more functionality and better computing performance, while a review of the navigation algorithm improved code stability. The continued effort to reduce memory consumption per thread allows for massive parallelism of large applications in multithreaded mode. Toolkit usability was improved with evolved real-time visualization in multithreaded mode. Prospects for short- and long-term development will also be discussed.
A status report on recent developments of the DELPHES C++ fast detector simulation framework will be given. New detector cards for the LHCb detector and prototypes for future e+ e- (ILC, FCC-ee) and p-p colliders at 100 TeV (FCC-hh) have been designed. The particle-flow algorithm has been optimised for high-multiplicity environments such as high-luminosity and boosted regimes. In addition, several new features such as photon conversions/bremsstrahlung and vertex reconstruction including timing information have been included. State-of-the-art pile-up treatment and jet filtering/boosted techniques (such as PUPPI, SoftKiller, SoftDrop, Trimming, N-subjettiness, etc.) have been added. Finally, Delphes has been fully interfaced with the Pythia8 event generator, allowing for a complete event generation/detector simulation sequence within the framework.
Detector design studies, test beam analyses, or other small particle physics experiments require the simulation of more and more detector geometries and event types, while lacking the resources to build full scale Geant4 applications from
scratch. Therefore an easy-to-use yet flexible and powerful simulation program
that solves this common problem but can also be adapted to specific requirements
is needed.
The groups supporting studies of the linear collider detector concepts ILD, SiD and CLICdp as well as detector development collaborations CALICE and FCal
have chosen to use the DD4hep geometry framework and its DDG4 pathway to Geant4 for this purpose. DD4hep with DDG4 offers a powerful tool to create arbitrary detector geometries and gives access to all Geant4 action stages.
The DDG4 plugins suite includes the handling of a wide variety of
input formats; access to the Geant4 particle gun or general particle source;
and the handling of Monte Carlo truth information -- e.g., linking hits and the
primary particles that caused them -- indispensable for performance and
efficiency studies. An extendable array of segmentations and sensitive detectors
allows the simulation of a wide variety of detector technologies.
In this presentation we will show how our DD4hep based simulation program allows
one to perform complex Geant4 detector simulations without compiling a single
line of additional code by providing a palette of sub-detector components that
can be combined and configured via compact XML files, and steering the
simulation either completely via the command line or via
simple Python steering files interpreted by a Python executable. We will also show how additional plugins and extensions can be created to increase the functionality.
GeantV simulation is a complex system based on the interaction of different modules needed for detector simulation, including transportation (a heuristically managed mechanism of sets of predefined navigators), scheduling policies, physics models (cross-sections and reaction final states) and a geometrical modeler library with geometry algorithms. The GeantV project is recasting the simulation framework to get maximum benefit from SIMD/MIMD computational architectures and massively parallel systems. This involves finding the appropriate balance between several aspects influencing computational performance (floating-point performance, usage of off-chip memory bandwidth, specification of the cache hierarchy, etc.) and a large number of program parameters that have to be optimized to achieve the best simulation speedup. This optimisation task can be treated as a "black-box" optimization problem, which requires searching for the optimum set of parameters using only point-wise function evaluations. The goal of this study is to provide a mechanism for optimizing complex systems (high energy physics particle transport simulations) with the help of genetic algorithms and evolution strategies as a tuning process for massive coarse-grained parallel simulations. One of the described approaches is based on the introduction of a specific multivariate analysis operator that can be used in the case of resource-expensive or time-consuming evaluations of fitness functions, in order to speed up the convergence of the "black-box" optimization problem.
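A minimal genetic-algorithm loop of the kind used for such black-box tuning is sketched below in Python. The fitness function is a smooth stand-in for an expensive simulation benchmark, and the population sizes, mutation scale and selection rule are arbitrary choices for illustration.

# Minimal genetic-algorithm sketch for black-box parameter tuning.
import random

random.seed(0)
N_PARAMS, POP, GENERATIONS = 4, 20, 30

def fitness(params):
    # Placeholder for "run the simulation and measure the speedup":
    # here, just a smooth function with its optimum at all parameters = 0.5.
    return -sum((p - 0.5) ** 2 for p in params)

def mutate(params, scale=0.1):
    return [min(1.0, max(0.0, p + random.gauss(0, scale))) for p in params]

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

population = [[random.random() for _ in range(N_PARAMS)] for _ in range(POP)]
for gen in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 4]                  # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print("best parameters:", [round(p, 3) for p in best])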
Particle physics experiments make heavy use of the Geant4 simulation package to model interactions between subatomic particles and bulk matter. Geant4 itself employs a set of carefully validated physics models that span a wide range of interaction energies.
They rely on measured cross-sections and phenomenological models with physically motivated parameters that are tuned to cover many application domains.
The aggregated sum of these components is what experiments use to study their apparatus.
This raises a critical question of what uncertainties are associated with a particular tune of one or another Geant4 physics model, or a group of models, involved in modeling and optimization of a detector design.
In response to multiple requests from the simulation community, the Geant4 Collaboration has started an effort to address the challenge.
We have designed and implemented a comprehensive, modular, user-friendly software toolkit that allows the parameters of one or several Geant4 physics models involved in the simulation studies to be modified, and a collective analysis of multiple variants of the resulting physics observables of interest to be performed, in order to estimate the uncertainty on a measurement due to the simulation model choices.
Based on modern event-processing infrastructure software, the toolkit offers a variety of attractive features, e.g. a flexible run-time configurable workflow, comprehensive bookkeeping, and an easy-to-expand collection of analytical components.
The design, implementation technology, and key functionalities of the toolkit will be presented and highlighted with selected results.
Keywords: Geant4 model parameters perturbation, systematic uncertainty in detector simulation
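The overall workflow of perturbing a model parameter, rerunning the simulation and comparing the resulting observables can be caricatured in a few lines of Python. The 'simulate' function below is a mock stand-in, not Geant4, and the parameter name, variation range and observable are invented.

# Caricature of a parameter-perturbation workflow: vary one model parameter,
# rerun a mock simulation, and quote the spread of an observable as the
# model-related uncertainty.
import random
import statistics

def simulate(cross_section_scale, n_events=20000, seed=0):
    """Mock observable: mean 'energy deposit' depending on a model parameter."""
    rng = random.Random(seed)
    deposits = [rng.expovariate(1.0 / (10.0 * cross_section_scale))
                for _ in range(n_events)]
    return statistics.mean(deposits)

nominal = simulate(1.00)
variants = [simulate(s) for s in (0.90, 0.95, 1.05, 1.10)]   # perturbed tunes
spread = max(abs(v - nominal) for v in variants)

print(f"nominal observable = {nominal:.3f}")
print(f"model-variation envelope = +/- {spread:.3f}")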
Opticks is an open source project that integrates the NVIDIA OptiX
GPU ray tracing engine with Geant4 toolkit based simulations.
Massive parallelism brings drastic performance improvements with
optical photon simulation speedup expected to exceed 1000 times Geant4
when using workstation GPUs. Optical photon simulation time becomes
effectively zero compared to the rest of the simulation.
Optical photons from scintillation and Cherenkov processes
are allocated, generated and propagated entirely on the GPU, minimizing
transfer overheads and allowing CPU memory usage to be restricted to
optical photons that hit photomultiplier tubes or other photon detectors.
Collecting hits into standard Geant4 hit collections then allows the
rest of the simulation chain to proceed unmodified.
Optical physics processes of scattering, absorption, reemission and
boundary processes are implemented as CUDA OptiX programs based on the Geant4
implementations. Wavelength dependent material and surface properties as well as
inverse cumulative distribution functions for reemission are interleaved into
GPU textures providing fast interpolated property lookup or wavelength generation.
Geometry is provided to OptiX in the form of CUDA programs that return bounding boxes
for each primitive and single ray geometry intersection results. Some critical parts
of the geometry such as photomultiplier tubes have been implemented analytically
with the remainder being tessellated.
OptiX handles the creation and application of a choice of acceleration structures
such as bounding volume hierarchies and the transparent use of multiple GPUs.
OptiX interoperation with OpenGL and CUDA Thrust has enabled
unprecedented visualisations of photon propagations to be developed
using OpenGL geometry shaders to provide interactive time scrubbing and
CUDA Thrust photon indexing to provide interactive history selection.
Validation and performance results are shown for the photomultiplier based
Daya Bay and JUNO Neutrino detectors.
We present a system deployed in the summer of 2015 for the automatic assignment of production and reprocessing workflows for simulation and detector data in the frame of the Computing Operations of the CMS experiment at the CERN LHC. Processing requests involve a number of steps in the daily operation, including transferring input datasets where relevant and monitoring them, assigning work to computing resources available on the CMS grid, and delivering the output to the Physics groups. Automation is critical above a certain number of requests to be handled, especially with a view to using computing resources more efficiently and reducing latencies. An effort to automate the necessary steps for production and reprocessing recently started and a new system to handle workflows has been developed. The state-machine system described consists of a set of modules whose key feature is the automatic placement of input datasets, balancing the load across multiple sites. By reducing the operational overhead, these agents enable the utilization of more than double the amount of resources with a robust storage system. Additional functionalities were added after months of successful operation to further balance the load on the computing system using remote reads and additional resources. This system contributed to reducing the delivery time of datasets, a crucial aspect of the analysis of CMS data. We report on lessons learned from operation towards increased efficiency in using a largely heterogeneous distributed system of computing, storage and network elements.
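The core placement decision, picking a site with enough free storage and the lightest load for an input dataset, can be sketched as follows. The site names, capacities and the scoring rule are illustrative only and do not reflect the production logic.

# Toy sketch of automatic input-dataset placement balancing load across sites.
SITES = {
    "T1_X": {"free_tb": 350, "queued_jobs": 1200},
    "T2_Y": {"free_tb": 120, "queued_jobs": 300},
    "T2_Z": {"free_tb": 60,  "queued_jobs": 150},
}

def choose_site(dataset_size_tb, sites=SITES):
    """Pick the site with enough free space and the fewest queued jobs."""
    candidates = {name: s for name, s in sites.items()
                  if s["free_tb"] >= dataset_size_tb}
    if not candidates:
        raise RuntimeError("no site can host the dataset")
    return min(candidates, key=lambda n: candidates[n]["queued_jobs"])

site = choose_site(dataset_size_tb=80)
print("place dataset at", site)         # T2_Y in this toy configuration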
The CMS Global Pool, based on HTCondor and glideinWMS, is the main computing resource provisioning system for all CMS workflows, including analysis, Monte Carlo production, and detector data reprocessing activities. Total resources at Tier-1 and Tier-2 sites pledged to CMS exceed 100,000 CPU cores, and another 50,000-100,000 CPU cores are available opportunistically, pushing the needs of the Global Pool to higher scales each year. These resources are becoming more diverse in their accessibility and configuration over time. Furthermore, the challenge of stably running at higher and higher scales while introducing new modes of operation such as multi-core pilots, as well as the chaotic nature of physics analysis workflows, place huge strains on the submission infrastructure. This paper details some of the most important challenges to scalability and stability that the Global Pool has faced since the beginning of the LHC Run II and how they were overcome.
The need for computing in the HEP community follows cycles of peaks and valleys, mainly driven by conference dates, accelerator shutdowns, holiday schedules, and other factors. Because of this, the classical method of provisioning these resources at providing facilities has drawbacks such as potential overprovisioning. As the appetite for computing increases, however, so does the need to maximize cost efficiency by developing a model for dynamically provisioning resources only when needed.
To address this issue, the HEP Cloud project was launched by the Fermilab Scientific Computing Division in June 2015. Its goal is to develop a facility that provides a common interface to a variety of resources, including local clusters, grids, high performance computers, and community and commercial Clouds. Initially targeted experiments include CMS and NOvA, as well as other Fermilab stakeholders.
In its first phase, the project has demonstrated the use of the “elastic” provisioning model offered by commercial clouds, such as Amazon Web Services. In this model, resources are rented and provisioned automatically over the Internet upon request. In January 2016, the project demonstrated the ability to increase the total amount of global CMS resources by 58,000 cores from 150,000 cores - a 25 percent increase - in preparation for the Rencontres de Moriond. In March 2016, the NOvA experiment also demonstrated resource burst capabilities with an additional 7,300 cores, achieving a scale almost four times as large as the locally allocated resources and utilizing AWS S3 storage to optimize data handling operations and costs. NOvA was using the same familiar services used for local computations, such as data handling and job submission, in preparation for the Neutrino 2016 conference. In both cases, the cost was contained by the use of the Amazon Spot Instance Market and the Decision Engine, a HEP Cloud component that aims at minimizing cost and job interruption.
This paper describes the Fermilab HEP Cloud Facility and the challenges overcome for the CMS and NOvA communities.
The FabrIc for Frontier Experiments (FIFE) project is a major initiative within the Fermilab Scientific Computing Division charged with leading the computing model for Fermilab experiments. Work within the FIFE project creates close collaboration between experimenters and computing professionals to serve high-energy physics experiments of differing size, scope, and physics area. The FIFE project has worked to develop common tools for job submission, certificate management, software and reference data distribution through CVMFS repositories, robust data transfer, job monitoring, and databases for project tracking. Since the project's inception the experiments under the FIFE umbrella have significantly matured, and present an increasingly complex list of requirements to service providers. To meet these requirements, the FIFE project has been involved in transitioning the Fermilab General Purpose Grid cluster to support a partitionable slot model, expanding the resources available to experiments via the Open Science Grid, assisting with commissioning dedicated high-throughput computing resources for individual experiments, supporting the efforts of the HEP Cloud projects to provision a variety of back end resources, including public clouds and high performance computers, and developing rapid onboarding procedures for new experiments and collaborations. The larger demands also require enhanced job monitoring tools, which the project has developed using such tools as ElasticSearch and Grafana. FIFE has also worked closely with the Fermilab Scientific Computing Division's Offline Production Operations Support Group (OPOS) in helping experiments manage their large-scale production workflows. This group in turn requires a structured service to facilitate smooth management of experiment requests, which FIFE provides in the form of the Production Operations Management Service (POMS). POMS is designed to track and manage requests from the FIFE experiments to run particular workflows, and support troubleshooting and triage in case of problems. Recently we have started to work on a new certificate management infrastructure called Distributed Computing Access with Federated Identities (DCAFI) that will eliminate our dependence on a specific third-party Certificate Authority service and better accommodate FIFE collaborators without a Fermilab Kerberos account. DCAFI integrates the existing InCommon federated identity infrastructure, CILogon Basic CA, and a MyProxy service using a new general purpose open source tool. We will discuss the general FIFE onboarding strategy, progress in expanding FIFE experiments' presence on the Open Science Grid, new tools for job monitoring, the POMS service, and the DCAFI project. We will also discuss lessons learned from collaborating with the OPOS effort and how they can be applied to improve efficiency in current and future experiments' computational work.
The second generation of the ATLAS production system, called ProdSys2, is a
distributed workload manager that runs hundreds of thousands of jobs daily,
from dozens of different ATLAS-specific workflows, across more than a
hundred heterogeneous sites. It achieves high utilization by combining
dynamic job definition based on many criteria, such as input and output
size, memory requirements and CPU consumption, with manageable scheduling
policies and by supporting different kinds of computing resources, such
as the Grid, clouds, supercomputers and volunteer computing. The system
dynamically assigns a group of jobs (a task) to a group of geographically
distributed computing resources. Dynamic assignment and resource
utilization are among the major features of the system; they did not exist in the
earliest versions of the production system, where the Grid resource topology
was predefined using national and/or geographical patterns.
The production system has a sophisticated job fault-recovery mechanism, which
allows multi-terabyte tasks to run efficiently without human intervention.
We have implemented a train model and open-ended production, which allow
tasks to be submitted automatically as soon as a new set of data is available,
and physics-group data processing and analysis to be chained with the central
production run by the experiment.
ProdSys2 simplifies the work of ATLAS scientists by offering a flexible web
user interface, which implements a user-friendly environment for the main ATLAS
workflows, e.g. a simple way of combining different data flows, and real-time
monitoring optimised to present a large amount of information.
We present an overview of the ATLAS Production System and the features and
architecture of its major components: task definition, the web user interface
and monitoring. We describe the important design decisions and lessons
learned from operational experience during the first years of LHC Run 2.
We also report on the performance of the system and how various
workflows, such as data (re)processing, Monte Carlo and physics-group
production, and user analysis, are scheduled and executed within one
production system on heterogeneous computing resources.
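As a rough illustration of the dynamic job definition idea described above (this is not the real ProdSys2/JEDI interface; the thresholds and field names are hypothetical), a task over many input files can be cut into jobs according to accumulated input size and event count:

```python
# Illustrative sketch of dynamic job definition: a task over many input files
# is split into jobs so that each job stays below configurable input-size and
# event-count limits. Field names and thresholds are hypothetical.
def split_task(input_files, max_bytes=10 * 1024**3, max_events=50000):
    """input_files: list of dicts with 'name', 'bytes' and 'events' keys."""
    jobs, current, size, events = [], [], 0, 0
    for f in input_files:
        if current and (size + f["bytes"] > max_bytes or
                        events + f["events"] > max_events):
            jobs.append(current)
            current, size, events = [], 0, 0
        current.append(f["name"])
        size += f["bytes"]
        events += f["events"]
    if current:
        jobs.append(current)
    return jobs

# Example: six 4 GiB files are grouped into three two-file jobs.
files = [{"name": f"file{i}.root", "bytes": 4 * 1024**3, "events": 20000}
         for i in range(6)]
print(split_task(files))
```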
Networks have played a critical role in high-energy physics
(HEP), enabling us to access and effectively utilize globally distributed
resources to meet the needs of our physicists.
Because of their importance in enabling our grid computing infrastructure,
many physicists have taken leading roles in research and education (R&E)
networking, participating in, and even convening, network-related meetings
and research programs with the broader networking community worldwide. This
has led to HEP benefiting from excellent global networking capabilities for
little to no direct cost. However, as other science domains ramp up their
need for similar networking it becomes less clear that this situation will
continue unchanged.
What this means for ATLAS in particular needs to be understood. ATLAS has
evolved its computing model since the LHC started based upon its experience
with using globally distributed resources. The most significant theme of
those changes has been increased reliance upon, and use of, its networks.
We will report on a number of networking initiatives in ATLAS, including the
integration of network awareness into PanDA, the changes in our DDM system
to allow remote access to data, and participation in the global perfSONAR
network monitoring efforts of WLCG.
We will also discuss new efforts underway that are exploring the inclusion
and use of software-defined networks (SDN) and how ATLAS might benefit from:
orchestration and optimization of distributed data access and data movement;
better control of workflows, end to end;
prioritization of time-critical versus normal tasks;
and improvements in the efficiency of resource usage.
For the upcoming experiments at the European XFEL light source facility, a new online and offline data processing and storage infrastructure is currently being built and verified. Based on the experience with the system developed for the PETRA III light source at DESY, presented at the last CHEP conference, we are further developing the system to cope with much higher volumes and rates (~50 GB/s) together with more complex data analysis and infrastructure conditions (e.g. long-range InfiniBand connections). This work is carried out in a collaboration of DESY IT and European XFEL, with technology support from IBM Research.
This presentation will briefly summarise the experience from about one year of operation of the PETRA III system, continue with a short description of the challenges of the European XFEL experiments and, in the main section, present the proposed online and offline system together with initial results from the real implementation (hardware and software). This will cover the selected cluster filesystem GPFS, including Quality of Service (QoS), the extensive use of flash subsystems, and other new and unique features this architecture will benefit from.
When preparing the Data Management Plan for larger scientific endeavours, PIs have to balance the most appropriate qualities of storage space along the planned data lifecycle against their price and the available funding. Storage properties can be the media type, implicitly determining the access latency and durability of stored data, the number and locality of replicas, as well as the available access protocols or authentication mechanisms. Negotiations between the scientific community and the responsible infrastructures generally happen upfront, when the amount of storage space, the media types (e.g. disk, tape and SSD) and the foreseeable data lifecycles are negotiated.
With the introduction of cloud management platforms, both in computing and storage, resources can be brokered to achieve the best price per unit of a given quality. However, in order to allow the platform orchestrators to programmatically negotiate the most appropriate resources, a standard vocabulary for the different properties of resources and a commonly agreed protocol to communicate them have to be available. In order to agree on a basic vocabulary for storage space properties, the storage infrastructure group in INDIGO-DataCloud, together with INDIGO-associated and external scientific groups, created a working group under the umbrella of the “Research Data Alliance (RDA)”. As the communication protocol to query and negotiate storage qualities, the “Cloud Data Management Interface (CDMI)” has been selected. Necessary extensions to CDMI are defined in regular meetings between INDIGO and the “Storage Networking Industry Association (SNIA)”. Furthermore, INDIGO is contributing to the SNIA CDMI reference implementation as the basis for interfacing the various storage systems in INDIGO to the agreed protocol and to provide an official open-source skeleton for systems not maintained by INDIGO partners.
In a first step, INDIGO will equip its supported storage systems, like dCache, StoRM, IBM GPFS and HPSS, and possibly public cloud systems, with the developed interface to enable the INDIGO platform layer to programmatically auto-detect the available storage properties and select the most appropriate endpoints based on its own policies.
In a second step, INDIGO will provide means to change the quality of storage, mainly to support data life cycles, but also to make data available on low-latency media for demanding HPC applications before the requesting jobs are launched, which maps to the ‘bring online’ command in current HEP frameworks.
Our presentation will elaborate on the planned common agreements between the involved scientific communities and the supporting infrastructures, the available software stack, the integration into the general INDIGO framework and our plans for the remaining time of the INDIGO funding period.
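A minimal sketch of how an orchestrator might discover storage qualities through CDMI, as discussed above. The endpoint URL is a placeholder, and since the QoS-specific capability names are still being agreed, only the generic CDMI capability discovery defined in the published specification is shown.

```python
# Minimal sketch of CDMI capability discovery for storage-quality negotiation.
# The endpoint URL and any QoS-related capability names are placeholders; only
# the basic CDMI headers and /cdmi_capabilities/ path follow the specification.
import requests

ENDPOINT = "https://storage.example.org:8443"   # placeholder CDMI endpoint
HEADERS = {
    "X-CDMI-Specification-Version": "1.1.1",
    "Accept": "application/cdmi-capability",
}

resp = requests.get(f"{ENDPOINT}/cdmi_capabilities/", headers=HEADERS, timeout=10)
resp.raise_for_status()
caps = resp.json()

# A client could then inspect the advertised children (e.g. "disk", "tape")
# and their capability objects to pick the most appropriate storage class.
print(caps.get("children", []))
for child in caps.get("children", []):
    detail = requests.get(f"{ENDPOINT}/cdmi_capabilities/{child}/",
                          headers=HEADERS, timeout=10).json()
    print(child, detail.get("capabilities", {}))
```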
Nowadays users have a variety of options to get access to storage space, including private resources, commercial Cloud storage services as well as storage provided by e-Infrastructures. Unfortunately, all these services provide completely different interfaces for data management (REST, CDMI, command line) and different protocols for data transfer (FTP, GridFTP, HTTP). The goal of the INDIGO-DataCloud project is to give users a unified interface for managing and accessing storage resources provided by different storage providers and to enable them to treat all that space as a single virtual file system with standard interfaces for access and transfer, including CDMI and POSIX. This solution enables users to access and manage their data crossing the typical boundaries of federations, created by incompatible technologies and security domains. INDIGO provides ways for storage providers to create and connect trust domains, and allows users to access data across federations, independently of the actual underlying low-level storage technology or security mechanism. The basis of this solution is the Onedata platform (http://www.onedata.org). Onedata is a globally distributed virtual file system, built around the concept of “Spaces”. Each space can be seen as a virtual folder with an arbitrary directory tree structure. The actual storage space can be distributed among several storage providers around the world. Each provider supports a given space with a fixed amount of storage, and the actual capacity of the space is the sum of all declared provisions. Each space can be accessed and managed through a web user interface (Dropbox-like), REST and CDMI interfaces, the command line, as well as mounted directly through POSIX. This gives users several options, the most important of which is the ability to access large data sets on remote machines (e.g. worker nodes or Docker containers in the Cloud) without pre-staging, and thus to interface with existing filesystems. Moreover, Onedata allows for automatic replication and caching of data across different sites and allows cross-interface access (e.g. S3 via POSIX). Performance results covering selected scenarios will also be presented.
Besides Onedata, which is a complete monolithic middleware, INDIGO offers a data management toolbox allowing communities to provide their own data handling policy engines and to delegate the actual work to dedicated services. The INDIGO portfolio ranges from multi-tier storage systems with automated media transition based on access profiles and user policies, like StoRM and dCache, via a reliable and highly scalable file transfer service (FTS) with adaptive data rate management, to DynaFed, a lightweight WebDAV storage federation network. FTS has been in production for over a decade and is the workhorse of the Worldwide LHC Computing Grid. The DynaFed network federates WebDAV endpoints and lets them appear as a single overlay filesystem.
The SciDAC-Data project is a DOE funded initiative to analyze and exploit two decades of information and analytics that have been collected by the Fermilab Data Center on the organization, movement, and consumption of High Energy Physics data. The project is designed to analyze the analysis patterns and data organization used by the CDF, DØ, NOvA, MINOS, MINERvA and other experiments in order to develop realistic models of HEP analysis workflows and data processing. The SciDAC-Data project aims to provide both realistic input vectors and corresponding output data which can be used to optimize and validate simulations of HEP analysis in different high performance computing (HPC) environments. These simulations are designed to address questions of data handling, cache optimization and workflow structures that are the prerequisites for modern HEP analysis chains to be mapped and optimized to run on the next generation of leadership-class exascale computing facilities.
We will address the use of the SciDAC-Data distributions, acquired from over 5.6 million analysis workflows and corresponding to over 410,000 HEP datasets, as the input to detailed queuing simulations that model the expected data consumption and caching behaviors of the work running in HPC environments. In particular, we describe in detail how the SAM data handling system, in combination with the dCache/Enstore based data archive facilities, has been analyzed to develop the radically different models for the analysis of collider data and of neutrino datasets. We present how the data are being used for model output validation and tuning of these simulations. The paper will also address the next stages of the SciDAC-Data project, which will extend this work to more detailed modeling and optimization of the models for use in real HPC environments.
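A toy sketch of the kind of cache/queuing simulation such distributions can drive: replay a sequence of dataset accesses through an LRU cache of fixed capacity and measure the hit rate. The access trace below is synthetic; in the project the inputs would be the measured SAM/Enstore distributions.

```python
# Toy LRU-cache replay: synthetic dataset access trace, fixed cache capacity,
# hit rate as the figure of merit. Sizes and trace are placeholders.
from collections import OrderedDict
import random

def lru_hit_rate(accesses, sizes, cache_capacity):
    cache, used, hits = OrderedDict(), 0, 0
    for ds in accesses:
        if ds in cache:
            hits += 1
            cache.move_to_end(ds)
            continue
        cache[ds] = sizes[ds]
        used += sizes[ds]
        while used > cache_capacity:          # evict least recently used
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
    return hits / len(accesses)

random.seed(1)
datasets = [f"dataset_{i}" for i in range(1000)]
sizes = {d: random.randint(1, 100) for d in datasets}       # arbitrary units
trace = [random.choice(datasets) for _ in range(20000)]     # synthetic trace
print(lru_hit_rate(trace, sizes, cache_capacity=5000))
```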
High Energy Physics experiments have long had to deal with huge amounts of data. Other fields of study are now being faced with comparable volumes of experimental data and have similar requirements to organize access by a distributed community of researchers. Fermilab is partnering with the Simons Foundation Autism Research Initiative (SFARI) to adapt Fermilab’s custom HEP data management system (SAM) to catalog genome data. SFARI has petabyte scale datasets stored in the Fermilab Active Archive Facility and needs to catalog the data, organizing it according to metadata for processing and analysis by a diverse community of researchers. The SAM system is used for data management by multiple HEP experiments at Fermilab and is flexible enough to provide the basis for handling other types of data. This presentation describes both the similarities and the differences in requirements and the challenges in adapting an existing system to a new field.
HEP applications perform an excessive number of allocations/deallocations within short time intervals, which results in memory churn, poor locality and performance degradation. These issues have been known for a decade, but due to the complexity of software frameworks and the large number of allocations (of the order of billions for a single job), until recently no efficient mechanism has been available to correlate these issues with source code lines. However, with the advent of the Big Data era, many tools and platforms are available nowadays to do memory profiling at large scale. Therefore, a prototype program has been developed to track and identify each single de-/allocation. The CERN IT Hadoop cluster is used to compute key memory metrics, like locality, variation, lifetime and density of allocations. The prototype further provides a web-based visualization backend that allows the user to explore the results generated on the Hadoop cluster. Plotting these metrics for each single allocation over time gives new insight into an application's memory handling. For instance, it shows which algorithms cause which kind of memory allocation patterns, which function flow causes how many short-lived objects, what the most commonly allocated sizes are, etc. The paper will give an insight into the prototype and will show profiling examples for LHC reconstruction, digitization and simulation jobs.
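A minimal sketch, with an assumed trace format, of how per-callsite metrics such as allocation count, churn (total bytes) and mean lifetime could be derived from a de-/allocation log; the record layout and callsite names are illustrative only.

```python
# Sketch of computing per-callsite allocation metrics from a trace of
# (callsite, size, alloc_time, free_time) records. The record format is an
# assumption for illustration, not the prototype's actual schema.
from collections import defaultdict

def allocation_metrics(trace):
    stats = defaultdict(lambda: {"count": 0, "bytes": 0, "lifetime": 0.0})
    for callsite, size, t_alloc, t_free in trace:
        s = stats[callsite]
        s["count"] += 1
        s["bytes"] += size
        s["lifetime"] += (t_free - t_alloc)
    for s in stats.values():
        s["mean_lifetime"] = s["lifetime"] / s["count"]
    return dict(stats)

trace = [
    ("TrackFitter.cxx:42", 64, 0.001, 0.002),   # many short-lived small objects
    ("TrackFitter.cxx:42", 64, 0.002, 0.003),
    ("CaloCluster.cxx:17", 4096, 0.001, 1.500), # fewer long-lived buffers
]
for callsite, s in allocation_metrics(trace).items():
    print(callsite, s["count"], s["bytes"], round(s["mean_lifetime"], 4))
```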
The recent progress in parallel hardware architectures with deeper
vector pipelines or many-cores technologies brings opportunities for
HEP experiments to take advantage of SIMD and SIMT computing models.
Launched in 2013, the GeantV project studies performance gains in
propagating multiple particles in parallel, improving instruction
throughput and data locality in HEP event simulation.
One of the challenges in developing a highly parallel and efficient detector
simulation is the minimization of the number of conditional branches
or thread divergence during the particle transportation process.
Due to the complexity of geometry description and physics algorithms
of a typical HEP application, performance analysis is indispensable
in identifying factors limiting parallel execution.
In this report, we will present design considerations and computing
performance of GeantV physics models on coprocessors (Intel Xeon Phi
and NVIDIA GPUs) as well as on mainstream CPUs.
As the characteristics of these platforms are very different, it is
essential to collect profiling data with a variety of tools and to
analyze hardware specific metrics and their derivatives to be able
to evaluate and tune the performance.
We will also show how the performance of parallelized physics models
factorizes from the rest of GeantV event simulation.
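A toy illustration (in NumPy rather than the GeantV C++ code) of why basket-based, structure-of-arrays track processing improves data locality and vectorization: a whole basket of particles is advanced with a single vectorized statement instead of a per-particle loop with branches.

```python
# Structure-of-arrays propagation of a basket of tracks; numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# One contiguous array per coordinate/direction component (structure of arrays).
pos = rng.normal(size=(3, n))
direction = rng.normal(size=(3, n))
direction /= np.linalg.norm(direction, axis=0)     # unit direction vectors
step = rng.uniform(0.1, 1.0, size=n)               # per-particle step length

# A single vectorized statement advances the whole basket of tracks,
# keeping memory access contiguous and SIMD-friendly.
pos += direction * step

print(pos[:, :3])
```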
As the ATLAS Experiment prepares to move to a multi-threaded framework
(AthenaMT) for Run3, we are faced with the problem of how to migrate 4
million lines of C++ source code. This code has been written over the
past 15 years and has often been adapted, re-written or extended to meet
the changing requirements and circumstances of LHC data taking. The
code was developed by different authors, many of whom are no longer
active, and under the deep assumption that processing ATLAS data would
be done in a serial fashion.
In order to understand the scale of the problem faced by the ATLAS
software community, and to plan appropriately for the significant effort
posed by the new AthenaMT framework, ATLAS embarked on a wide-ranging
review of our offline code, covering all areas of activity: event
generation, simulation, trigger and reconstruction. We discuss the
difficulties in even logistically organising such reviews in an
already busy community, and how to examine areas in sufficient depth to
identify key areas in need of upgrade, yet also finish the reviews in
a timely fashion.
We show how the reviews were organised and how the outputs were
captured in a way that the sub-system communities could then tackle
the problems uncovered on a realistic timeline. Further, we discuss
how the review influenced overall planning for the ATLAS Run3 use of
AthenaMT and report on how progress is being made towards realistic
framework prototypes.
Some data analysis methods typically used in econometric studies and in ecology have been evaluated and applied in physics software environments. They concern the evolution of observables through objective identification of change points and trends, and measurements of inequality, diversity and evenness across a data set. Within each one of these analysis areas, several statistical tests and measures have been examined, often comparing multiple implementations of the same algorithm available in R or developed by us.
The presentation will introduce the analysis methods and the details of their statistical formulation, and will review their relation with information theory concepts, such as Shannon entropy. It will report the results of their use in two real-life scenarios, which pertain to diverse application domains: the validation of simulation models and the quantification of software quality. It will discuss the lessons learned, highlighting the capabilities and shortcomings identified in this pioneering study.
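Two of the measures mentioned above, the Shannon entropy and the Gini inequality index, are simple to sketch; the per-module defect counts used below are placeholder data, not results from the study.

```python
# Shannon entropy of a discrete distribution and the Gini inequality index,
# applied to a toy set of per-module defect counts (placeholder data).
import numpy as np

def shannon_entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))            # entropy in bits

def gini(values):
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    # Standard formula based on the cumulative sums of the ordered values.
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

defects_per_module = [0, 1, 1, 2, 3, 5, 8, 13]
print(shannon_entropy(defects_per_module), gini(defects_per_module))
```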
The IT Analysis Working Group (AWG) has been formed at CERN across individual computing units and the experiments to attempt a cross-cutting analysis of computing infrastructure and application metrics. In this presentation we will describe the first results obtained using medium/long-term data (1 month to 1 year) correlating box-level metrics, job-level metrics from LSF and HTCondor, I/O metrics from the physics analysis disk pools (EOS), and networking and application-level metrics from the experiment dashboards.
We will cover in particular the measurement of hardware performance and prediction of job durations, the latency sensitivity of different job types and a search for bottlenecks with the production job mix in the current infrastructure. The presentation will conclude with the proposal of a small set of metrics to simplify drawing conclusions also in the more constrained environment of public cloud deployments.
The HEP prototypical systems at the Supercomputing conferences each year have served to illustrate the ongoing state of the art developments in high throughput, software-defined networked systems important for future data operations at the LHC and for other data intensive programs. The Supercomputing 2015 SDN demonstration revolved around an OpenFlow ring connecting 7 different booths and the WAN connections. Some of the WAN connections were built using the Open Grid Forum's Network Service Interface (NSI) and then stitched together using a custom SDN application developed at Caltech. This helped create an intelligent network design, where large scientific data flows traverse various paths provisioned dynamically with guaranteed bandwidth, with the path selection based on either the shortest or fastest routes available, or through other conditions. An interesting aspect of the demonstrations at SC15 is that all the local and remote network switches were controlled by a single SDN controller in the Caltech booth on the show floor. The SDN controller used at SC 15 was built on top of the OpenDaylight (Lithium) software framework. The software library was written in Python and has been made publicly available at pypi and github: pypi.python.org/pypi/python-odl/.
At SC16 we plan to further improve and extend the SDN network design, enhancing the SDN controller by introducing a number of higher-level services, including the Application-Layer Traffic Optimization (ALTO) software and its path computation engine (PCE) in the OpenDaylight controller framework. In addition, we will use Open vSwitch at the network edges and incorporate its rate-limiting features in the SDN data transfer plane. The CMS data transfer applications PhEDEx and ASO will be used as high-level services to oversee large data transactions. Storage-to-storage operations will be scaled up further relative to past demonstrations, working at the leading edge of NVMe storage and switching fabric technologies.
In today's world of distributed scientific collaborations, there are many challenges to providing reliable inter-domain network infrastructure. Network operators use a combination of
active monitoring and trouble tickets to detect problems, but these are often ineffective at identifying issues that impact wide-area network users. Additionally, these approaches do not scale to wide area inter-domain networks due to unavailability of data from all the domains along typical network paths. The Pythia Network Diagnostic InfrasTructure (PuNDIT) project aims to create a scalable infrastructure for automating the detection and localization of problems across these networks.
The project goal is to gather and analyze metrics from existing perfSONAR monitoring infrastructures to identify the signatures of possible problems, locate affected network links, and report them to the user in an intuitive fashion. Simply put, PuNDIT seeks to convert complex network metrics into easily understood diagnoses in an automated manner.
At CHEP 2016, we plan to present our findings from deploying a first version of PuNDIT in one or more communities that are already using perfSONAR. We will report on the project progress to-date in working with the OSG and various WLCG communities, describe the current implementation architecture and demonstrate the various user interfaces it supports. We will also show examples of how PuNDIT is being used and where we see the project going in the future.
The Open Science Grid (OSG) relies upon the network as a critical part of the distributed infrastructures it enables. In 2012 OSG added a new focus area in networking with a goal of becoming the primary source of network information for its members and collaborators. This includes gathering, organizing and providing network metrics to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing.
In September of 2015 this service was deployed into the OSG production environment. We will report on the creation, implementation, testing and deployment of the OSG Networking Service. All aspects of the implementation will be reviewed, from organizing the deployment of perfSONAR toolkits within OSG and its partners, to the challenges of orchestrating regular testing between sites, to reliably gathering the resulting network metrics and making them available for users, virtual organizations and higher-level services. In particular, several higher-level services were developed to bring the OSG network service to its full potential. These include a web-based mesh configuration system, which allows central scheduling and management of all the network tests performed by the perfSONAR instances; a set of probes to continually gather metrics from the remote instances and publish them to different consumers; a central network datastore (Esmond), which provides interfaces to access the network monitoring information in close to real time and historically (up to a year), giving the state of the tests; and perfSONAR infrastructure monitoring, ensuring the current perfSONAR instances are correctly configured and operating as intended.
We will also describe the challenges we encountered in ongoing operations for the network service and how we have evolved our procedures to address those challenges. Finally we will describe our plans for future extensions and improvements to the service.
The fraction of internet traffic carried over IPv6 continues to grow rapidly. IPv6 support from network hardware vendors and carriers is pervasive and becoming mature. A network infrastructure upgrade often offers sites an excellent window of opportunity to configure and enable IPv6.
There is a significant overhead when setting up and maintaining dual stack machines, so where possible sites would like to upgrade their services directly to IPv6 only. In doing so, they are also expediting the transition process towards its desired completion. While the LHC experiments accept there is a need to move to IPv6, it is currently not directly affecting their work. Sites are unwilling to upgrade if they will be unable to run LHC experiment workflows. This has resulted in a very slow uptake of IPv6 from WLCG sites.
For several years the HEPiX IPv6 Working Group has been testing a range of WLCG services to ensure they are IPv6 compliant. Several sites are now running many of their services as dual stack. The working group, driven by the requirements of the LHC VOs to be able to use IPv6-only opportunistic resources, continues to encourage wider deployment of dual-stack services to make the use of such IPv6-only clients viable.
This paper will present the HEPiX plan and the progress so far towards allowing sites to deploy IPv6-only CPU resources. This includes making experiment central services dual stack, as well as a number of storage services. The monitoring, accounting and information services that are used by jobs also need to be upgraded. Finally, the VO testing that has taken place on hosts connected via IPv6 only will be reported.
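A small diagnostic sketch in the spirit of the dual-stack/IPv6-only testing described above: check whether a service publishes AAAA records and whether an IPv6 TCP connection to it succeeds. The host name is a placeholder.

```python
# Check A/AAAA resolution and IPv6 connectivity for a (placeholder) service.
import socket

HOST, PORT = "service.example.org", 443   # placeholder endpoint

def addresses(host, family):
    try:
        infos = socket.getaddrinfo(host, PORT, family, socket.SOCK_STREAM)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []

ipv4 = addresses(HOST, socket.AF_INET)
ipv6 = addresses(HOST, socket.AF_INET6)
print("A records:   ", ipv4)
print("AAAA records:", ipv6)

for addr in ipv6:
    try:
        with socket.create_connection((addr, PORT), timeout=5):
            print("IPv6 connect OK:", addr)
    except OSError as exc:
        print("IPv6 connect failed:", addr, exc)
```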
Over the last few years, the number of mobile devices connected to the CERN internal network has increased from a handful in 2006 to more than 10,000 in 2015. Wireless access is no longer a “nice to have” or just for conference and meeting rooms; support for mobility is now expected by most, if not all, of the CERN community. In this context, a full renewal of the CERN Wi-Fi network has been launched in order to provide a state-of-the-art campus-wide Wi-Fi infrastructure. Which technologies can provide an end-user experience comparable, for most applications, to a wired connection? Which solution can cover more than 200 office buildings, representing a total surface of more than 400,000 m2, while keeping a single, simple, flexible and open management platform? The presentation will focus on the studies and tests performed at CERN to address these issues, as well as some feedback about the global project organisation.
The ATLAS experiment successfully commissioned a software and computing infrastructure to support
the physics program during LHC Run 2. The next phases of the accelerator upgrade will present
new challenges in the offline area. In particular, at High Luminosity LHC (also known as Run 4)
the data taking conditions will be very demanding in terms of computing resources:
between 5 and 10 kHz of event rate from the HLT to be reconstructed (and possibly further reprocessed)
with an average pile-up of up to 200 events per collision and an equivalent number of simulated samples
to be produced. The same parameters for the current run are lower by up to an order of magnitude.
While processing and storage resources would need to scale accordingly, the funding situation
allows one at best to consider a flat budget over the next few years for offline computing needs.
In this paper we present a study quantifying the challenge in terms of computing resources for HL-LHC
and present ideas about the possible evolution of the ATLAS computing model, the distributed computing
tools, and the offline software to cope with such a challenge.
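A back-of-the-envelope sketch of the resource gap described above; every number below is an illustrative assumption (midpoint rates, a pile-up scaling exponent, a yearly technology gain), not an ATLAS estimate.

```python
# Toy scaling estimate: CPU need at HL-LHC vs. what a flat budget buys.
rate_factor = 7.5            # 5-10 kHz HLT output vs ~1 kHz today (assumed midpoint)
pileup_factor = 200 / 35     # <mu> ~200 vs ~35 in Run 2 (assumption)
cpu_per_event_growth = pileup_factor ** 1.5   # reconstruction scales worse than linearly (assumption)

needed = rate_factor * cpu_per_event_growth

years = 10                   # roughly now until HL-LHC (assumption)
tech_gain = 1.20 ** years    # flat budget, ~20%/year more capacity per unit cost (assumption)

print(f"CPU need grows by  ~{needed:5.0f}x")
print(f"Flat budget brings ~{tech_gain:5.0f}x")
print(f"Shortfall factor   ~{needed / tech_gain:4.1f}x to be recovered by software and model improvements")
```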
The LHCb detector will be upgraded for LHC Run 3 and will be read out at 40 MHz, with major implications for the software-only trigger and offline computing. If the current computing model is kept, the data storage capacity and computing power required to process data at this rate, and to generate and reconstruct equivalent samples of simulated events, will exceed the current capacity by a couple of orders of magnitude. A redesign of the software framework, including scheduling, the event model, the detector description and the conditions database, is needed to fully exploit the computing power of new architectures. Data processing and the analysis model will also change towards an early streaming of different data types, in order to limit storage resources, with further implications for the data analysis workflows. Fast simulation will make it possible to obtain a reasonable parameterization of the detector response in considerably less computing time. Finally, the upgrade of LHCb will be a good opportunity to review and implement changes in the domains of software design, testing and review, and analysis workflow and preservation.
In this contribution, activities and recent results in all the above areas are presented.
Belle II is the next-generation flavor factory experiment at the SuperKEKB accelerator in Tsukuba, Japan. The first physics run will take place in 2017, after which we plan to increase the luminosity gradually. We will reach the world's highest luminosity of L=8x10^35 cm-2s-1 after roughly five years of operation and will finally collect ~25 petabytes of raw data per year. Such a huge amount of data will allow us to explore new physics possibilities through a large variety of analyses in the quark sectors as well as tau physics, and to deepen our understanding of nature.
The Belle II computing system is expected to manage the processing of massive raw data, the production of copious simulation samples and many concurrent user analysis jobs. The required resource estimation for the Belle II computing system shows an evolution profile similar to that of the resource pledges in the LHC experiments. Because Belle II is a worldwide collaboration of about 700 scientists working in 23 countries and regions, we adopted a distributed computing model with DIRAC as the workload and data management system.
In 2015, we performed a successful large-scale MC production campaign with the largest CPU resources over the longest period so far. The biggest difference from past campaigns was the first practical version of the production system, which we introduced and tested. We also gathered as many computational resources as possible from inside and outside the collaboration, such as grid sites, commercial and academic clouds, and local batch computing clusters. In March of this year, the commissioning of the SuperKEKB accelerator started and the trans-Pacific network was upgraded. A full replacement of the KEK central computing system is planned for this summer.
In this report we will present highlights of recent achievements, output from the latest MC production campaign, and the current status of the Belle II computing system.
The Compressed Baryonic Matter experiment (CBM) is a next-generation heavy-ion experiment to be operated at the FAIR facility, currently under construction in Darmstadt, Germany. A key feature of CBM is its very high interaction rate, exceeding that of contemporary nuclear collision experiments by several orders of magnitude. Such interaction rates forbid a conventional, hardware-triggered readout; instead, experiment data will stream freely from self-triggered front-end electronics. In order to reduce the huge raw data volume to a recordable rate, data will be selected exclusively on CPU, which necessitates partial event reconstruction in real time. Consequently, the traditional segregation of online and offline software vanishes; an integrated on- and offline data processing concept is called for. In this paper, we will report on concepts and developments for computing for CBM as well as on the status of preparations for its first physics run.
The Electron-Ion Collider (EIC) is envisioned as the
next-generation U.S. facility to study quarks and gluons in
strongly interacting matter. Developing the physics program for
the EIC, and designing the detectors needed to realize it,
requires a plethora of software tools and multifaceted analysis
efforts. Many of these tools have yet to be developed or need to
be expanded and tuned for the physics reach of the EIC. Currently,
various groups use disparate sets of software tools to achieve the
same or similar analysis tasks such as Monte Carlo event
generation, detector simulations, track reconstruction, event
visualization, and data storage to name a few examples. With a
long-range goal of the successful execution of the EIC scientific
program in mind, it is clear that early investment in the
development of well-defined interfaces for communicating, sharing,
and collaborating, will facilitate a timely completion of not just
the planning and design of an EIC but ultimate delivery the
physics capable with an EIC. In this presentation, we give an
outline of forward-looking global objectives that we think will
help sustain a software community for more than a decade. We then
identify the high-priority projects for immediate development and
also those which will ensure an open-source development
environment for the future.
We present an implementation of the ATLAS High Level Trigger that provides parallel execution of trigger algorithms within the ATLAS multithreaded software framework, AthenaMT. This development will enable the ATLAS High Level Trigger to meet future challenges due to the evolution of computing hardware and upgrades of the Large Hadron Collider (LHC) and the ATLAS detector. During the LHC data-taking period starting in 2021, luminosity will reach up to three times the original design value. Luminosity will increase further, to up to 7.5 times the design value, in 2026 following LHC and ATLAS upgrades. This includes an upgrade of the ATLAS trigger architecture that will result in an increase in the High Level Trigger input rate by a factor of 4 to 10 compared to the current maximum rate of 100 kHz.
The current ATLAS multiprocess framework, AthenaMP, manages a number of processes that each process events independently, executing algorithms sequentially within each process. AthenaMT will provide a fully multithreaded environment that will enable concurrent execution of algorithms also within an event. This has the potential to significantly reduce the memory footprint on future many-core devices. An additional benefit of the High Level Trigger implementation within AthenaMT is that it facilitates the integration of offline code into the High Level Trigger. The trigger must retain high rejection in the face of increasing numbers of pileup collisions. This will be achieved by greater use of offline algorithms that are designed to maximize the discrimination of signal from background. Therefore a unification of the High Level Trigger and offline reconstruction software environments is required. This has been achieved while at the same time retaining important High Level Trigger-specific optimisations that minimize the computation performed to reach a trigger decision. Such optimizations include early event rejection and reconstruction within restricted geometrical regions.
We report on a High Level Trigger prototype in which the need for High Level Trigger-specific components has been reduced to a minimum. Promising results have been obtained with a prototype that includes the key elements of trigger functionality including regional reconstruction and early event rejection. We report on the first experience of migrating trigger selections to this new framework and present the next steps towards a full implementation of the ATLAS trigger within this framework.
The ATLAS experiment at the high-luminosity LHC will face a five-fold
increase in the number of interactions per collision relative to the ongoing
Run 2. This will require a proportional improvement in rejection power at
the earliest levels of the detector trigger system, while preserving good signal efficiency.
One critical aspect of this improvement will be the implementation of
precise track reconstruction, through which sharper turn-on curves,
b-tagging and tau-tagging techniques can in principle be achieved. The challenge of such a project comes in the development of a fast, precise custom electronic device integrated in the hardware-based first trigger level of the experiment, with repercussions propagating as far as the detector read-out philosophy.
This talk will
discuss the projected performance of the system in terms of tracking, timing
and physics.
The High Luminosity LHC (HL-LHC) will deliver luminosities of up to 5x10^34 cm^-2 s^-1, with an average of about 140-200 overlapping proton-proton collisions per bunch crossing. These extreme pileup conditions can significantly degrade the ability of trigger systems to cope with the resulting event rates. A key component of the HL-LHC upgrade of the CMS experiment is a Level-1 (L1) track finding system that will identify tracks with transverse momentum above 3 GeV within ~5 μs. Output tracks will be merged with information from other sub-detectors in the downstream L1 trigger to improve the identification and resolution of physics objects. The CMS collaboration is exploring several designs for an L1 tracking system that can confront the challenging latency, occupancy and bandwidth requirements associated with L1 tracking. This presentation will review the three state-of-the-art L1 tracking architectures proposed for the CMS HL-LHC upgrade. Two of these architectures (“Tracklet” and “TMT”) are fully FPGA-based, while a third (“AM+FPGA”) employs a combination of FPGAs and ASICs. The FPGA-based approaches employ a road-search algorithm (“Tracklet”) or a Hough transform (“TMT”), while the AM+FPGA approach uses content-addressable memories for pattern recognition. Each approach aims to perform the demanding data distribution, pattern recognition and track reconstruction tasks required of L1 tracking in real time.
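An illustrative sketch of the Hough-transform idea behind the “TMT” approach (this is not CMS firmware): each stub votes for the compatible cells in the (q/pT, phi0) track-parameter plane, and a genuine track appears as a local maximum of the accumulator. Binning, field value and stub coordinates are toy choices.

```python
# Toy Hough transform for L1 track finding in the transverse plane.
# Units: r in cm, pT in GeV, B in tesla; phi(r) ~ phi0 - K * r * q/pT.
import numpy as np

B = 3.8                                  # CMS solenoid field (tesla)
K = 0.003 * B / 2.0

def hough_accumulate(stubs, n_qpt=32, n_phi0=64, qpt_max=1 / 3.0):
    """stubs: list of (r, phi). qpt_max=1/3 matches the 3 GeV threshold."""
    acc = np.zeros((n_qpt, n_phi0), dtype=int)
    qpt_bins = np.linspace(-qpt_max, qpt_max, n_qpt)
    for r, phi in stubs:
        phi0 = phi + K * r * qpt_bins                       # one phi0 per q/pT bin
        cols = ((phi0 % (2 * np.pi)) / (2 * np.pi) * n_phi0).astype(int)
        acc[np.arange(n_qpt), cols] += 1                    # each stub votes once per row
    return acc

# Stubs from one toy track (pT = 5 GeV, q = +1, phi0 = 1.0 rad) at typical radii.
r_layers = np.array([25.0, 35.0, 50.0, 70.0, 90.0, 110.0])
qpt_true, phi0_true = 1 / 5.0, 1.0
stubs = [(r, phi0_true - K * r * qpt_true) for r in r_layers]
acc = hough_accumulate(stubs)
print("max votes:", acc.max(), "at cell", np.unravel_index(acc.argmax(), acc.shape))
```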
Micropattern gaseous detector (MPGD) technologies, such as GEMs or MicroMegas, are particularly suitable for precision tracking and triggering in high rate environments. Given their relatively low production costs, MPGDs are an exemplary candidate for the next generation of particle detectors. Having acknowledged these advantages, both the ATLAS and CMS collaborations at the LHC are exploiting these new technologies for their detector upgrade programs in the coming years. When MPGDs are utilized for triggering purposes, the measured signals need to be precisely reconstructed within less than 200 ns, which can be achieved by the usage of FPGAs.
In this work, we present a novel approach to identify reconstructed signals, their timing and the corresponding spatial position on the detector. In particular, we study the effect of noise and dead readout strips on the reconstruction performance. Our approach leverages the potential of convolutional neural networks (CNNs), which have recently demonstrated outstanding performance in a range of modeling tasks. The proposed CNN architecture is designed to be simple enough that it can be implemented directly on an FPGA and thus provide precise information on reconstructed signals already at trigger level.
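A minimal PyTorch sketch of the kind of CNN discussed above: a small 1-D convolutional network taking a window of strip charges and regressing a hit position. The layer sizes and the 64-strip window are arbitrary illustrative choices, not the architecture proposed for the FPGA.

```python
# Small 1-D CNN that maps a window of readout-strip charges to a hit position.
import torch
import torch.nn as nn

class StripCNN(nn.Module):
    def __init__(self, n_strips=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.head = nn.Linear(16 * (n_strips // 2), 1)   # regressed hit position

    def forward(self, x):                 # x: (batch, 1, n_strips) strip charges
        h = self.features(x)
        return self.head(h.flatten(start_dim=1))

model = StripCNN()
charges = torch.randn(4, 1, 64)           # toy batch of 4 strip-charge windows
print(model(charges).shape)               # -> torch.Size([4, 1])
```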
The Compressed Baryonic Matter (CBM) experiment is currently under construction at the upcoming FAIR accelerator facility in Darmstadt, Germany. Searching for rare probes, the experiment requires complex online event selection criteria at a high event rate.
To achieve this, all event selection is performed in a large online processing farm of several hundred nodes, the "First-level Event Selector" (FLES). This compute farm will consist primarily of standard PC components including GPGPUs and many-core architectures. The data rate at the input to this compute farm is expected to exceed 1 TByte/s of time-stamped signal messages from the detectors. The distributed input interface will be realized using custom FPGA-based PCIe add-on cards, which preprocess and index the incoming data streams.
At event rates of up to 10 MHz, data from several events overlap. Thus, there is no a priori assignment of data messages to events. Instead, event recognition is performed in combination with track reconstruction. Employing a new container data format to decontextualize the information from specific time intervals, data segments can be distributed on the farm and processed independently. This makes it possible to optimize the event reconstruction and analysis code without additional networking overhead and aids parallel computation in the online analysis task chain.
Time slice building, the continuous process of collecting the data of a time interval simultaneously from all detectors, places a high load on the network and requires careful scheduling and management. Using InfiniBand FDR hardware, this process has been demonstrated at rates of up to 6 GByte/s per node in a prototype system.
The design of the event selector system is optimized for modern computer architectures. This includes minimizing copy operations of data in memory, using DMA/RDMA wherever possible, reducing data interdependencies, and employing large memory buffers to limit the critical network transaction rate. A fault-tolerant control system will ensure high availability of the event selector.
This presentation will give an overview of the online event selection architecture of the upcoming CBM experiment and discuss the premises and benefits of the design. The presented material includes latest results from performance studies on different prototype systems.
The low flux of ultra-high energy cosmic rays (UHECR) at the highest energies makes it challenging to answer the long-standing question about their origin and nature. Even lower fluxes of neutrinos with energies above 10^22 eV are predicted in certain Grand Unified Theories (GUTs) and, for example, in models of super-heavy dark matter (SHDM). The significant increase in detector volume required to detect these particles can be achieved by searching, with current and future radio telescopes, for the nano-second radio pulses that are emitted when a particle interacts in the Earth's moon.
In this contribution we present the design of an online analysis and trigger pipeline for the detection of nano-second pulses with the LOFAR radio telescope. The most important steps of the processing pipeline are digital focusing of the antennas towards the Moon, correction of the signal for ionospheric dispersion, and synthesis of the time-domain signal from the polyphase-filtered signal in the frequency domain. The implementation of the pipeline on a GPU/CPU cluster will be discussed together with the computing performance of the prototype.
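A simplified sketch of the first pipeline step, digital focusing towards the Moon, using integer-sample delay-and-sum beamforming. The real pipeline works on polyphase-filtered data with sub-sample phase corrections; all values below (sampling rate, antenna layout, pointing) are toy numbers.

```python
# Simplified delay-and-sum beamforming towards a fixed direction.
import numpy as np

C = 299_792_458.0                 # speed of light (m/s)
FS = 200e6                        # toy sampling rate (Hz)

def beamform(signals, positions, direction):
    """signals: (n_ant, n_samples); positions: (n_ant, 3) in metres;
    direction: unit vector towards the source."""
    arrival = -(positions @ direction) / C        # relative arrival time per antenna (s)
    arrival -= arrival.min()
    shifts = np.round(arrival * FS).astype(int)   # nearest-sample alignment
    out = np.zeros(signals.shape[1])
    for sig, s in zip(signals, shifts):
        out += np.roll(sig, -s)                   # advance later-arriving signals
    return out / len(signals)

rng = np.random.default_rng(0)
positions = rng.uniform(-500, 500, size=(8, 3))   # toy antenna layout (m)
direction = np.array([0.0, 0.0, 1.0])             # toy pointing towards the Moon
signals = rng.normal(size=(8, 4096))              # noise-only example data
print(beamform(signals, positions, direction).shape)
```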
We report on the current status of the CMS full simulation. For Run 2, CMS is using Geant4 10.0p02 built in sequential mode; about 8 billion events were produced in 2015. In 2016, any extra production will be done using the same production version. For development, Geant4 10.0p03 with CMS private patches, built in multi-threaded mode, has been established. We plan to use the newest Geant4 10.2 for 2017 production. In this work we will present the CPU and memory performance of the CMS full simulation for various configurations and Geant4 versions, and will also discuss technical aspects of the migration to Geant4 10.2. CMS plans to install a new pixel detector for 2017; this allows a revision of the geometry of the other sub-detectors and the addition of necessary fixes. For 2016, the CMS digitization is capable of working in multi-threaded mode. For the simulation of pile-up events, a method of premixing QCD events in one file has been established. The performance of the CMS digitization will also be discussed in this report.
The ATLAS Simulation infrastructure has been used to produce upwards of 50 billion proton-proton collision events for analyses
ranging from detailed Standard Model measurements to searches for exotic new phenomena. In the last several years, the
infrastructure has been heavily revised to allow intuitive multithreading and significantly improved maintainability. Such a
massive update of a legacy code base requires careful choices about what pieces of code to completely rewrite and what to wrap or
revise. The initialization of the complex geometry was generalized to allow new tools and geometry description languages, popular
in some detector groups. The addition of multithreading requires Geant4 MT and GaudiHive, two frameworks with fundamentally
different approaches to multithreading, to work together. It also required enforcing thread safety throughout a large code base,
which required the redesign of several aspects of the simulation, including “truth,” the record of particle interactions with the
detector during the simulation. These advances were possible thanks to close interactions with the Geant4 developers.
Software for the next generation of experiments at the Future Circular Collider (FCC), should by design efficiently exploit the available computing resources and therefore support of parallel execution is a particular requirement. The simulation package of the FCC Common Software Framework (FCCSW) makes use of the Gaudi parallel data processing framework and external packages commonly used in HEP simulation, including the Geant4 simulation toolkit, the DD4HEP geometry toolkit and the Delphes framework for simulating detector response.
Using Geant4 for Full simulation implies taking into account all physics processes for transporting the particles through detector material and this is highly CPU-intensive. At the early stage of detector design and for some physics studies such accuracy is not needed. Therefore, the overall response of the detector may be simulated in a parametric way. Geant4 provides the tools to define a parametrisation, which for the tracking detectors is performed by smearing the particle space-momentum coordinates and for calorimeters by reproducing the particle showers.
The parametrisation may come either from external sources or from the Full simulation (being detector-dependent but also more accurate). The tracker resolutions may be derived from measurements of existing detectors or from external tools, for instance tkLayout, which is used in CMS tracker performance studies. Regarding the calorimeters, the longitudinal and radial shower profiles can be parametrised using the GFlash library. The Geant4 Fast simulation can be applied to any type of particle in any region of the detector. The possibility to run both Full and Fast simulation in Geant4 creates an opportunity for an interplay, performing the CPU-consuming Full simulation only for the regions and particles of interest.
FCCSW also incorporates the Delphes framework for Fast simulation studies in a multipurpose detector. Phenomenological studies may be performed in an idealised geometry model, simulating the overall response of the detector. Having Delphes inside FCCSW allows users to create the analysis tools that may be used for Full simulation studies as well.
This presentation will show the status of the simulation package of the FCC common software framework.
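A minimal sketch of the parametric (Fast) simulation idea described above: instead of transporting particles through material, the reconstructed transverse momentum is obtained by smearing the generated value with a parametrised resolution. The resolution function used here is an arbitrary placeholder, not an FCC detector parametrisation.

```python
# Smear generated pT with a toy resolution function (placeholder parametrisation).
import numpy as np

rng = np.random.default_rng(42)

def sigma_pt_over_pt(pt):
    """Toy momentum resolution: constant term plus a term growing with pT."""
    return np.sqrt(0.01**2 + (0.0002 * pt) ** 2)

def smear_pt(pt_true):
    return pt_true * (1.0 + rng.normal(0.0, sigma_pt_over_pt(pt_true)))

pt_true = rng.uniform(1.0, 100.0, size=5)      # GeV
pt_reco = smear_pt(pt_true)
print(np.column_stack([pt_true, pt_reco]))
```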
For some physics processes studied with the ATLAS detector, a more
accurate simulation in some respects can be achieved by including real
data into simulated events, with substantial potential improvements in the CPU,
disk space, and memory usage of the standard simulation configuration,
at the cost of significant database and networking challenges.
Real proton-proton background events can be overlaid (at the detector
digitization output stage) on a simulated hard-scatter process, to account for pileup
background (from nearby bunch crossings), cavern background, and
detector noise. A similar method is used to account for the large
underlying event from heavy ion collisions, rather than directly
simulating the full collision. Embedding replaces the muons found in
Z->mumu decays in data with simulated taus at the same 4-momenta, thus
preserving the underlying event and pileup from the original data
event. In all these cases, care must be taken to exactly match
detector conditions (beamspot, magnetic fields, alignments, dead sensors, etc.)
between the real data event and the simulation.
We will discuss the current status of these overlay and embedding techniques
within ATLAS software and computing.
The long-standing problem of reconciling the cosmological evidence for the existence of dark matter with the lack of any clear experimental observation of it has recently revived the idea that the new particles are not directly connected with the Standard Model gauge fields, but only through mediator fields or “portals” connecting our world with new “secluded” or “hidden” sectors. One of the simplest models just adds an additional U(1) symmetry, with its corresponding vector boson A'.
At the end of 2015, INFN formally approved a new experiment, PADME (Positron Annihilation into Dark Matter Experiment), to search for invisible decays of the A' at the DAFNE BTF in Frascati. The experiment is designed to detect dark photons produced in positron annihilations on a fixed target ($e^+e^-\to \gamma A'$) and decaying to dark matter, by measuring the final-state missing mass.
The collaboration aims to complete the design and construction of the experiment by the end of 2017 and to collect $\sim 10^{13}$ positrons on target by the end of 2018, thus allowing a sensitivity of $\epsilon \sim 10^{-3}$ to be reached up to a dark photon mass of $\sim 24$ MeV/c$^2$.
The experiment will be composed of a thin active diamond target, on which the positron beam from the DAFNE Linac will impinge to produce $e^+e^-$ annihilation events. The surviving beam will be deflected with a ${\cal O}$(0.5 Tesla) magnet, on loan from the CERN PS, while the photons produced in the annihilation will be measured by a calorimeter composed of BGO crystals recovered from the L3 experiment at LEP. To reject the background from bremsstrahlung gamma production, a set of segmented plastic scintillator vetoes will be used to detect positrons exiting the target with an energy below that of the beam, while a fast small-angle calorimeter will be used to reject the $e^+e^- \to \gamma\gamma(\gamma)$ background.
To optimize the experimental layout in terms of signal acceptance and background rejection, the full layout of the experiment has been modeled with the GEANT4 simulation package. In this talk we will describe the details of the simulation and report on the results obtained with the software.
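For reference, the missing-mass observable used above can be written, assuming the target electron at rest (this is a standard kinematic relation, not taken from the PADME software):

$M_{\mathrm{miss}}^2 = (P_{e^+} + P_{e^-} - P_\gamma)^2 = 2m_e^2 + 2\,m_e E_{\mathrm{beam}} - 2\,E_\gamma\,(E_{\mathrm{beam}} + m_e) + 2\,|\vec{p}_{\mathrm{beam}}|\,E_\gamma\cos\theta_\gamma$

where $E_{\mathrm{beam}}$ and $\vec{p}_{\mathrm{beam}}$ are the positron beam energy and momentum, $E_\gamma$ and $\theta_\gamma$ the measured photon energy and polar angle, and $m_e$ the electron mass; a signal would appear as a peak of $M_{\mathrm{miss}}^2$ at the squared A' mass.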
Many physics and performance studies with the ATLAS detector at the Large Hadron Collider require very large samples of simulated events, and producing these using the full GEANT4 detector simulation is highly CPU intensive.
Often, a very detailed detector simulation is not needed, and in these cases fast simulation tools can be used
to reduce the calorimeter simulation time by a few orders of magnitude.
The new ATLAS Fast Calorimeter Simulation (FastCaloSim) is an improved parametrisation compared to the one used in the LHC Run-1.
It provides a simulation of the particle energy response at the calorimeter read-out cell level, taking into account
the detailed particle shower shapes and the correlations between the energy depositions in the various calorimeter layers.
It is interfaced to the standard ATLAS digitization and reconstruction software, and can be tuned to data more easily
than with GEANT4. The new FastCaloSim incorporates developments in geometry and physics lists of the last five years
and benefits from knowledge acquired with the Run-1 data. It makes use of statistical techniques such as principal component
analysis, and a neural network parametrisation to optimise the amount of information to store in the ATLAS simulation
infrastructure. It is planned to use this new FastCaloSim parametrisation to simulate several billion events
in the upcoming LHC runs.
In this talk, we will describe this new FastCaloSim parametrisation.
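A sketch of the principal component analysis step mentioned above, decorrelating calorimeter layer-energy fractions before they are parametrised; the toy shower data generated here are placeholders, not ATLAS samples.

```python
# PCA on toy calorimeter layer-energy fractions: decorrelate, keep the leading
# components, and map reduced coordinates back to layer fractions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_showers, n_layers = 10000, 4

# Correlated toy layer-energy fractions driven by a common shower scale.
base = rng.gamma(shape=2.0, scale=1.0, size=(n_showers, 1))
fractions = base * np.array([0.1, 0.5, 0.3, 0.1]) \
            + rng.normal(0, 0.02, (n_showers, n_layers))
fractions /= fractions.sum(axis=1, keepdims=True)

pca = PCA(n_components=2)                    # keep the leading components only
coeffs = pca.fit_transform(fractions)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Showers can then be generated in the reduced space and mapped back.
regenerated = pca.inverse_transform(coeffs[:5])
print(regenerated)
```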
With the increased load and pressure on computing power brought by the higher LHC luminosity during Run 2, there is a need to utilize opportunistic resources not currently dedicated to the Compact Muon Solenoid (CMS) collaboration. Furthermore, these additional resources might be needed on demand. The Caltech group, together with the Argonne Leadership Computing Facility (ALCF), is collaborating to demonstrate the feasibility of using resources from one of the fastest supercomputers in the world, Mira (a 10-petaflops IBM Blue Gene/Q system). CMS uses the HTCondor/glideinWMS job submission infrastructure for all its batch processing. On the other hand, Mira only supports MPI applications submitted through Cobalt, which is not yet available through HTCondor. The majority of computing facilities utilized by the CMS experiment are powered by x86_64 processors, while Mira is Blue Gene/Q based (PowerPC architecture). CMS Monte Carlo and data production makes use of a bulk of pledged resources and other opportunistic resources. For efficient use, Mira's resources have to be transparently integrated into the CMS production infrastructure. We address the challenges posed by submitting MPI applications through the CMS infrastructure to the Argonne PowerPC (Mira) supercomputer. We will describe the design and implementation of the computing and networking systems for running CMS production jobs, with a first operational prototype running on PowerPC. We also demonstrate the state-of-the-art high-performance networking from the LHC Grid to ALCF required by CMS data-intensive computation.
PanDA, the Production and Distributed Analysis workload management system, has been developed to address the data processing and analysis challenges of the ATLAS experiment at the LHC. Recently PanDA has been extended to run HEP scientific applications on Leadership Class Facilities and supercomputers. The success of the projects using PanDA beyond HEP and the Grid has drawn attention from other compute-intensive sciences such as bioinformatics.
Modern biology uses complex algorithms and sophisticated software, which are impossible to run without access to significant computing resources. Recent advances in Next Generation Genome Sequencing (NGS) technology have led to increasing streams of sequencing data that need to be processed, analysed and made available to bioinformaticians worldwide. Analysis of ancient genome sequencing data using the popular PALEOMIX software pipeline can take a month even when running on a powerful computing resource. PALEOMIX includes a typical set of software used to process NGS data, including adapter trimming, read filtering, sequence alignment, genotyping and phylogenetic or metagenomic analysis. A sophisticated workload management system and efficient usage of supercomputers can greatly speed up this process.
In this paper we will describe the adaptation of the PALEOMIX pipeline to run in a distributed computing environment powered by PanDA. We used PanDA to manage computational tasks on a multi-node parallel supercomputer. To run the pipeline, we split the input files into chunks, which are processed separately on different nodes as separate inputs for PALEOMIX, and finally merge the output files; this is very similar to what ATLAS does to process and simulate data. We dramatically decreased the total wall time thanks to the automation of job (re)submission and brokering within PanDA, as demonstrated earlier for ATLAS applications on the Grid. Using software tools initially developed for HEP and the Grid can reduce the payload execution time for mammoth DNA samples from weeks to days.
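A sketch of the splitting step described above: a FASTQ input is divided into chunks of reads (four lines per record) that can be processed on different nodes as independent PALEOMIX inputs. File names are placeholders; the actual submission and output merging are handled by PanDA.

```python
# Split a (possibly gzipped) FASTQ file into fixed-size chunks of reads.
import gzip
import itertools

def split_fastq(path, reads_per_chunk=1_000_000):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fin:
        chunk_idx = 0
        while True:
            records = list(itertools.islice(fin, 4 * reads_per_chunk))
            if not records:
                break
            out_name = f"{path}.chunk{chunk_idx:04d}.fastq"
            with open(out_name, "w") as fout:
                fout.writelines(records)
            yield out_name
            chunk_idx += 1

if __name__ == "__main__":
    # "sample_reads.fastq.gz" is a placeholder input file name.
    for chunk in split_fastq("sample_reads.fastq.gz"):
        print("prepared", chunk)
```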
ATLAS Distributed Computing during LHC Run-1 was challenged by steadily increasing computing, storage and network
requirements. In addition, the complexity of processing task workflows and their associated data management requirements
led to a new paradigm in the ATLAS computing model for Run-2, accompanied by extensive evolution and redesign of the
workflow and data management systems. The new systems were put into production at the end of 2014, and gained robustness
and maturity during 2015 data taking. ProdSys2, the new request and task interface; JEDI, the dynamic job execution
engine developed as an extension to PanDA; and Rucio, the new data management system, form the core of the Run-2 ATLAS
distributed computing engine.
One of the big changes for Run-2 was the adoption of the Derivation Framework, which moves the chaotic CPU and data
intensive part of the user analysis into the centrally organized train production, delivering derived AOD datasets to
user groups for final analysis. The effectiveness of the new model was demonstrated through the delivery of analysis
datasets to users just one week after data taking, by completing the calibration loop, Tier-0 processing and train
production steps promptly. The great flexibility of the new system also makes it possible to execute part of the Tier-0
processing on the grid when Tier-0 resources experience a backlog during high data-taking periods.
The introduction of the data lifetime model, where each dataset is assigned a finite lifetime (with extensions possible for frequently accessed data), was made possible by Rucio. Thanks to this the storage crises experienced in Run-1 have
not reappeared during Run-2. In addition, the distinction between Tier-1 and Tier-2 disk storage, now largely artificial
given the quality of Tier-2 resources and their networking, has been removed through the introduction of dynamic ATLAS
clouds that group the storage endpoint nucleus and its close-by execution satellite sites. All stable ATLAS sites are now
able to store unique or primary copies of the datasets.
ATLAS Distributed Computing is further evolving to speed up request processing by introducing network awareness, using
machine learning and optimization of the latencies during the execution of the full chain of tasks. The Event Service, a
new workflow and job execution engine, is designed around check-pointing at the level of event processing to use
opportunistic resources more efficiently.
ATLAS has been extensively exploring possibilities of using computing resources beyond conventional grid sites in the WLCG fabric, in order to deliver as many computing cycles as possible and thereby enhance the significance of the Monte Carlo samples, leading to better physics results.
The difficulties of using such opportunistic resources come from architectural differences such as unavailability of grid services, the absence of network connectivity on worker nodes or inability to use standard authorization protocols. Nevertheless, ATLAS has been extremely successful in running production payloads on a variety of sites, thanks largely to the job execution workflow design in which the job assignment, input data provisioning and execution steps are clearly separated and can be offloaded to custom services. To transparently include the opportunistic sites in the ATLAS central production system, several models with supporting services have been developed to mimic the functionality of a full WLCG site. Some are extending Computing Element services to manage job submission to non-standard local resource management systems, some are incorporating pilot functionality on edge services managing the batch systems, while the others emulate a grid site inside a fully virtualized cloud environment.
The exploitation of opportunistic resources was at an early stage throughout 2015, at the level of 10% of the total ATLAS computing power, but in the next few years it is expected to deliver much more. In addition, demonstrating the ability to use an opportunistic resource can lead to securing ATLAS allocations on the facility, hence the importance of this work goes beyond merely the initial CPU cycles gained.
In this presentation, we give an overview and compare the performance, development effort, flexibility and robustness of the various approaches. Full descriptions of each of those models are given in other contributions to this conference.
Distributed data processing in High Energy and Nuclear Physics (HENP) is a prominent example of big data analysis. With petabytes of data being processed at tens of computational sites with thousands of CPUs, standard job scheduling approaches either do not address the problem complexity well or are dedicated to one specific aspect of the problem only (CPU, network or storage). As a result, the general orchestration of the system is left to the production managers and requires reconsideration each time new resources are added or withdrawn. In previous research we developed a new job scheduling approach dedicated to distributed data production, an essential part of data processing in HENP (pre-processing in big data terminology). In our approach the load balancing across sites is provided by forwarding data in a peer-to-peer manner, but guided by a centrally created (and periodically updated) plan aiming at global optimality. The planner considers network and CPU performance as well as available storage space at each site, and plans data movements between them in order to maximize the overall processing throughput. In this work we extend our approach to distributed data production where multiple input data sources are initially available. Multiple sources are common in the user analysis scenario, where the produced data may be immediately copied to several destinations. The initial input data set would hence already be partially replicated to multiple locations, and the task of the scheduler is to maximize the overall computational throughput considering possible data movements and CPU allocation. In particular, the planner should decide whether it makes sense to transfer files to other sites or whether they should be processed at the site where they are available (a toy illustration of this decision is sketched below). Reasoning about multiple data replicas broadens the planner's applicability beyond the scope of data production towards user analysis in HENP and other big data processing applications. In this contribution, we discuss load balancing with multiple data sources, present recent improvements made to our planner and provide results of simulations which demonstrate the advantage over standard scheduling policies for the new use case. The studies have shown that our approach can provide a significant gain in overall computational performance in a wide range of simulations considering a realistic size of the computational Grid, background network traffic and various input data distributions. The approach is scalable and adjusts itself to resource outage, addition or reconfiguration, which becomes even more important with the growing usage of cloud resources. The reasonable complexity of the underlying algorithm meets the requirements for online planning of computational networks as large as those of the currently largest HENP experiments.
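As a toy illustration of the per-file decision mentioned above (process where a replica resides versus transfer to a faster site), with purely illustrative rate numbers; the actual planner optimises globally rather than greedily per file:

```python
def plan_file(file_size_gb, replicas, cpu_rate, bandwidth):
    """
    Decide where to process a single file.
    replicas  : list of site names that already hold a replica
    cpu_rate  : dict site -> effective processing rate in GB/hour
    bandwidth : dict (src, dst) -> transfer rate in GB/hour
    Returns (source_site, processing_site, estimated_hours).
    """
    best = None
    for src in replicas:
        for dst in cpu_rate:
            transfer = 0.0 if dst == src else file_size_gb / bandwidth[(src, dst)]
            total = transfer + file_size_gb / cpu_rate[dst]
            if best is None or total < best[2]:
                best = (src, dst, total)
    return best

# Example with made-up numbers: the file is replicated at two sites.
print(plan_file(10.0, ["siteA", "siteB"],
                {"siteA": 2.0, "siteB": 8.0, "siteC": 20.0},
                {("siteA", "siteB"): 40.0, ("siteA", "siteC"): 5.0,
                 ("siteB", "siteA"): 40.0, ("siteB", "siteC"): 80.0}))
```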
Many experiments in the field of accelerator-based science are actively running at the High Energy Accelerator Research Organization (KEK) in Japan, using the SuperKEKB and J-PARC accelerators. The computing demand from the various experiments at KEK for data processing, analysis and MC simulation is monotonically increasing. This is not only the case for the high-energy experiments: the computing requirements of the hadron and neutrino experiments and of some astroparticle physics projects are also rapidly increasing due to very high precision measurements. In this situation, several projects supported by KEK, namely the Belle II, T2K, ILC and KAGRA experiments, are going to utilize the Grid computing infrastructure as their main computing resource. The Grid system and services at KEK, which are already in production, have been upgraded for further stable operation at the same time as the full-scale hardware replacement of the KEK Central Computer (KEKCC). The next-generation KEKCC system started operation at the beginning of September 2016. The basic Grid services, e.g. BDII, VOMS, LFC, the CREAM computing element and the StoRM storage element, are deployed on a more robust hardware configuration. Since raw data transfer is one of the most important tasks for the KEKCC, two redundant GridFTP servers are attached to the StoRM service instances with 40 Gbps network bandwidth on the LHCONE routing. These are dedicated to Belle II raw data transfer to the other sites, apart from the servers used for data transfer by the other VOs. Additionally, we prepared a redundant configuration for the database-oriented services such as LFC and AMGA by using LifeKeeper. The LFC service consists of two read/write servers and one read-only server for the Belle II experiment, each with its own database for load balancing. The FTS3 service is newly deployed for Belle II data distribution. A CVMFS stratum-0 service has been started for the Belle II software repository, and a stratum-1 service is provided for the other VOs. In this way, there are many upgrades to the production Grid services at the KEK Computing Research Center. In this presentation, we introduce the detailed hardware configuration of the Grid instances and several mechanisms used to construct a robust Grid system in the next-generation KEKCC.
The ATLAS EventIndex has been running in production since mid-2015,
reliably collecting information worldwide about all produced events and storing
them in a central Hadoop infrastructure at CERN. A subset of this information
is copied to an Oracle relational database for fast access.
The system design and its optimization serve event picking, from requests of
a few events up to scales of tens of thousands of events; in addition, data
consistency checks are performed for large production campaigns. Detecting
duplicate events within the scope of physics collections has recently arisen as an
important use case.
This paper describes the general architecture of the project and the data flow
and operation issues, which are addressed by recent developments to improve the
throughput of the overall system. In this direction, the data collection system
is reducing the usage of the messaging infrastructure to overcome the
performance shortcomings detected during production peaks; an object storage
approach is instead used to convey the event index information, and messages to
signal their location and status. Recent changes in the Producer/Consumer
architecture are also presented in detail, as well as the monitoring
infrastructure.
The LHCb experiment stores around 10^11 collision events per year. A typical physics analysis deals with a final sample of up to 10^7 events. Event preselection algorithms (lines) are used for data reduction. They are run centrally and check whether an event is useful for a particular physics analysis. The lines are grouped into streams. An event is copied to all the streams its lines belong to, possibly duplicating it. Since the storage format allows only sequential access, analysis jobs read every event and discard the ones they don't need.
The efficiency of this scheme heavily depends on the stream composition. By putting similar lines together and balancing the stream sizes it is possible to reduce the overhead. There is an additional constraint that some lines are meant to be used together, so they must go into one stream. The total number of streams is also limited by the file management infrastructure.
We developed a method for finding an optimal stream composition. It can be used with different cost functions, takes the number of streams as an input parameter and accommodates the grouping constraint. It has been implemented using Theano [1] and the results are being incorporated into the streaming [2] of the LHCb Turbo [3] output, with a projected decrease of 20-50% in analysis job IO time. A toy sketch of the underlying idea is given after the references below.
[1] Theano: A Python framework for fast computation of mathematical expressions, The Theano Development Team
[2] Separate file streams, Henry Schreiner et al., https://gitlab.cern.ch/hschrein/Hlt2StreamStudy
[3] The LHCb Turbo Stream, Sean Benson et al., CHEP-2015
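A toy sketch of the optimisation idea behind the method above, using a hypothetical cost function and a greedy assignment; the actual implementation optimises a configurable cost function with Theano:

```python
def stream_cost(streams, line_size, co_usage):
    """Toy cost: bytes a typical analysis job has to read.  A job using a pair
    of co-used lines reads the full size of every stream containing either
    line; co_usage[(a, b)] is the fraction of jobs using both a and b."""
    size = {s: sum(line_size[l] for l in lines) for s, lines in streams.items()}
    cost = 0.0
    for (a, b), frac in co_usage.items():
        for s, lines in streams.items():
            if a in lines or b in lines:
                cost += frac * size[s]
    return cost

def greedy_assign(groups, n_streams, line_size, co_usage):
    """Assign groups of lines (lines that must stay together) to n_streams,
    greedily picking the stream that increases the toy cost the least."""
    streams = {i: set() for i in range(n_streams)}
    for group in sorted(groups, key=lambda g: -sum(line_size[l] for l in g)):
        best = min(streams, key=lambda s: stream_cost(
            {**streams, s: streams[s] | set(group)}, line_size, co_usage))
        streams[best] |= set(group)
    return streams

# Example with hypothetical lines and sizes.
print(greedy_assign(groups=[["lineA", "lineB"], ["lineC"], ["lineD"]],
                    n_streams=2,
                    line_size={"lineA": 5, "lineB": 1, "lineC": 4, "lineD": 2},
                    co_usage={("lineA", "lineB"): 0.6, ("lineC", "lineD"): 0.3}))
```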
The Canadian Advanced Network For Astronomical Research (CANFAR)
is a digital infrastructure that has been operational for the last
six years.
The platform allows astronomers to store, collaborate, distribute and
analyze large astronomical datasets. We have implemented multi-site storage and
in collaboration with an HEP group at University of Victoria, multi-cloud processing.
CANFAR is deeply integrated with the Canadian Astronomy Data Centre
(CADC), one of the first public astronomy data delivery services,
initiated 30 years ago, which is now expanding its services beyond data
delivery. Individual astronomers, official telescope archives, and large astronomy survey
collaborations are the current CANFAR users.
This talk will describe some CANFAR use cases, the internal infrastructure, the
lessons learned and future directions.
The ATLAS Distributed Data Management (DDM) system has evolved drastically in the last two years with the Rucio software fully
replacing the previous system before the start of LHC Run-2. The ATLAS DDM system now manages more than 200 petabytes spread over 130
storage sites and can handle file transfer rates of up to 30 Hz. In this talk, we discuss our experience acquired in developing,
commissioning, running and maintaining such a large system. First, we describe the general architecture of the system, our
integration with external services like the WLCG File Transfer Service and the evolution of the system over its first year of
production. Then, we show the performance of the system, describe the integration of new technologies such as object stores, and
outline future developments which mainly focus on performance and automation. Finally we discuss the long term evolution of ATLAS
data management.
The ATLAS Event Service (ES) has been designed and implemented for efficient
running of ATLAS production workflows on a variety of computing platforms, ranging
from conventional Grid sites to opportunistic, often short-lived resources, such
as spot market commercial clouds, supercomputers and volunteer computing.
The Event Service architecture allows real time delivery of fine grained workloads to
running payload applications which process dispatched events or event ranges
and immediately stream the outputs to highly scalable Object Stores. Thanks to its agile
and flexible architecture the ES is currently being used by grid sites for assigning low
priority workloads to otherwise idle computing resources; similarly harvesting HPC resources
in an efficient back-fill mode; and massively scaling out to the 50-100k concurrent core
level on the Amazon spot market to efficiently utilize those transient resources for peak
production needs. Platform ports in development include ATLAS@Home (BOINC) and the
Google Compute Engine, and a growing number of HPC platforms.
After briefly reviewing the concept and the architecture of the Event Service, we will
report the status and experience gained in ES commissioning and production operations
on various computing platforms, and our plans for extending ES application beyond Geant4
simulation to other workflows, such as reconstruction and data analysis.
The goal of the comparison is to summarize the state-of-the-art techniques of deep learning which is boosted with modern GPUs. Deep learning, which is also known as deep structured learning or hierarchical learning, is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers composed of multiple non-linear transformations. Deep learning is part of a broader family of machine learning methods based on learning representations of data. The representations are inspired by advances in neuroscience and are loosely based on interpretation of information processing and communication patterns in a nervous system, such as neural coding which attempts to define a relationship between various stimuli and associated neuronal responses in the brain. In this paper, a brief history of deep learning research is discussed first. Then, different deep learning models such as deep neural networks, convolutional deep neural networks, deep belief networks and recurrent neural networks are analyzed to summarize major work reported in the deep learning literature. Then we discuss the general deep learning system architecture including hardware layer and software middleware. In this architecture GPU subsystem is widely used to accelerate computation and its architecture is especially discussed. To show the performance of the deep learning system with GPU acceleration, we choose various deep learning models, compare their performances with/without GPU and list various acceleration rates. Various deep learning models have been applied to fields like computer vision, automatic speech recognition, natural language processing, audio recognition and bioinformatics. Selected applications are reviewed to show state-of-the-art results on various tasks. Finally, future directions of deep learning are discussed.
In the midst of the multi- and many-core era, the computing models employed by
HEP experiments are evolving to embrace the trends of new hardware technologies.
As the computing needs of present and future HEP experiments -particularly those
at the Large Hadron Collider- grow, adoption of many-core architectures and
highly-parallel programming models is essential to prevent degradation in scientific capability.
Simulation of particle interactions is typically a major consumer of CPU
resources in HEP experiments. The recent release of a highly performant
multi-threaded version of Geant4 opens the door for experiments to fully take
advantage of highly-parallel technologies.
The Many Integrated Core (MIC) architecture of Intel, known as the Xeon Phi
family of products, provides a platform for highly-parallel applications. Their large
number of cores and Linux-based environment make them an attractive compromise
between conventional CPUs and general-purpose GPUs. Xeon Phi processors will be
appearing in next-generation supercomputers such as Cori Phase 2 at NERSC.
To prepare for using these next-generation supercomputers, a Geant4 application
has been developed to test and study HEP particle simulations on the Intel MIC architecture (HepExpMT).
This application serves as a demonstrator of the feasibility and
computing-opportunity of utilizing this advanced architecture with a complex
detector geometry.
We have measured the performance of the application on the first generation of Xeon Phi
coprocessors (code name Knights Corner, KNC). In this work we extend the scalability measurements to the second generation of the Xeon Phi architecture (code name Knights Landing, KNL) in
preparation for further testing on the Cori Phase 2 supercomputer at NERSC.
Around the year 2000, the convergence on Linux and commodity x86_64 processors provided a homogeneous scientific computing platform which enabled the construction of the Worldwide LHC Computing Grid (WLCG) for LHC data processing. In the last decade the size and density of computing infrastructure has grown significantly. Consequently, power availability and dissipation have become important limiting factors for modern data centres. The on-chip power density limitations, which have brought us into the multicore era, are driving the computing market towards a greater variety of solutions. This, in turn, necessitates a broader look at the future computing infrastructure.
Given the planned High Luminosity LHC and detector upgrades, changes are required to the computing models and infrastructure to enable data processing at increased rates through the early 2030s. Understanding how to maximize throughput for minimum initial cost and power consumption will be a critical aspect for computing in HEP in the coming years.
We present results from our work to compare performance and energy efficiency for different general purpose architectures, such as traditional x86_64 processors, ARMv8, PowerPC 64-bit, as well as specialized parallel architectures, including Xeon Phi and GPUs. In our tests we use a variety of HEP-related benchmarks, including HEP-SPEC 2006, GEANT4 ParFullCMS and the realistic production codes from the LHC experiments. Finally we conclude on the suitability of the architectures under test for the future computing needs of HEP data centres.
Exascale computing resources are roughly a decade away and will be capable of 100 times more computing than current supercomputers. In the last year, Energy Frontier experiments crossed a milestone of 100 million core-hours used at the Argonne Leadership Computing Facility, Oak Ridge Leadership Computing Facility, and NERSC. The Fortran-based leading-order parton generator called Alpgen was successfully scaled to millions of threads to achieve this level of usage on Mira. Sherpa and MadGraph are next-to-leading-order generators used heavily by LHC experiments for simulation. Integration times for high-multiplicity or rare NLO processes can take a week or more on standard Grid machines, even using all 16 cores. We will describe our work to scale these generators to millions of threads on leadership-class machines to reduce run times to less than a day. This work allows the experiments to leverage large-scale parallel supercomputers for event generation today, freeing tens of millions of grid hours for other work, and paving the way for future applications (simulation, reconstruction) on these and future supercomputers.
ALICE (A Large Ion Collider Experiment) is a heavy-ion detector studying the physics of strongly interacting matter and the quark-gluon plasma at the CERN LHC (Large Hadron Collider). After the second long shut-down of the LHC, the ALICE detector will be upgraded to cope with an interaction rate of 50 kHz in Pb-Pb collisions, producing in the online computing system (O2) a sustained throughput of 3 TB/s. This data will be processed on the fly so that the stream to permanent
storage does not exceed a peak of 80 GB/s, with the raw data being discarded.
In the context of assessing different computing platforms for the O2 system, we have developed a framework for the Intel Xeon Phi processors (MIC).
It provides the components to build a processing pipeline streaming the data from the PC memory to a pool of permanent threads running on the MIC, and back to the host after processing. It is based on explicit offloading mechanisms (data transfer, asynchronous tasks) and basic building blocks (FIFOs, memory pools, C++11 threads). The user only needs to implement the processing method to be
run on the MIC.
We present in this paper the architecture, implementation, and performance of this system.
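As a conceptual illustration of the pipeline structure described above (the actual framework is C++11 with explicit offload to the coprocessor; this Python sketch only mirrors the FIFO / persistent-thread / streaming arrangement, with all names hypothetical):

```python
import queue
import threading

def pipeline(blocks, process, n_workers=4, pool_size=8):
    """A bounded FIFO models the pool of transfer buffers, persistent worker
    threads model the threads kept alive on the device, and a second FIFO
    returns the processed results to the host side."""
    to_device = queue.Queue(maxsize=pool_size)   # host -> device FIFO
    from_device = queue.Queue()                  # device -> host FIFO

    def worker():
        while True:
            item = to_device.get()
            if item is None:                     # poison pill terminates the thread
                break
            from_device.put(process(item))

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for block in blocks:
        to_device.put(block)                     # blocks if the buffer pool is exhausted
    for _ in workers:
        to_device.put(None)
    for w in workers:
        w.join()
    results = []
    while not from_device.empty():
        results.append(from_device.get())
    return results

# "process" stands in for the user-provided method run on the coprocessor.
print(pipeline(range(10), lambda x: x * x))
```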
RapidIO (http://rapidio.org/) technology is a packet-switched high-performance fabric which has been under active development since 1997. Originally meant to be a front-side bus, it developed into a system-level interconnect which is today used in all 4G/LTE base stations worldwide. RapidIO is often used in embedded systems that require high reliability, low latency and scalability in a heterogeneous environment - features that are highly interesting for several use cases, such as data analytics and data acquisition networks.
We will present the results of evaluating RapidIO in a Data Analytics environment, from setup to benchmark. Specifically, we will share the experience of running ROOT and Hadoop on top of RapidIO.
To demonstrate the multi-purpose characteristics of RapidIO, we will also present the results of investigating RapidIO as a technology for high-speed Data Acquisition networks using a generic multi-protocol event-building emulation tool.
In addition we will present lessons learned from implementing native ports of CERN applications to RapidIO.
HPC network technologies like Infiniband, TrueScale or OmniPath provide low-
latency and high-throughput communication between hosts, which makes them
attractive options for data-acquisition systems in large-scale high-energy
physics experiments. Like HPC networks, data acquisition networks are local
and include a well specified number of systems. Unfortunately traditional network
communication APIs for HPC clusters like MPI or PGAS exclusively target the HPC
community and are not well suited for data acquisition applications. It is possible
to build distributed data acquisition applications using low-level system APIs like
Infiniband Verbs, but this requires non-negligible effort and expert knowledge.
On the other hand, message services like 0MQ have gained popularity in the HEP
community. Such APIs facilitate the building of distributed applications with a
high-level approach and provide good performance. Unfortunately their usage usually
limits developers to TCP/IP-based networks. While it is possible to operate a
TCP/IP stack on top of Infiniband and OmniPath, this approach may not be very
efficient compared to direct use of native APIs.
NetIO is a simple, novel asynchronous message service that can operate on
Ethernet, Infiniband and similar network fabrics. In our presentation we describe
the design and implementation of NetIO, evaluate its use in comparison to other
approaches and show performance studies.
NetIO supports different high-level programming models and typical workloads of
HEP applications. The ATLAS front end link exchange project successfully uses NetIO
as its central communication platform.
The NetIO architecture consists of two layers:
The outer layer provides users with a choice of several socket types for
different message-based communication patterns. At the moment NetIO features a
low-latency point-to-point send/receive socket pair, a high-throughput
point-to-point send/receive socket pair, and a high-throughput
publish/subscribe socket pair.
The inner layer is pluggable and provides a basic send/receive socket pair to
the upper layer to provide a consistent, uniform API across different network
technologies.
There are currently two working backends for NetIO:
The Ethernet backend is based on TCP/IP and POSIX sockets.
The Infiniband backend relies on libfabric with the Verbs provider from the
OpenFabrics Interfaces Working Group.
The libfabric package also supports other fabric technologies like iWarp, Cisco
usNic, Cray GNI, Mellanox MXM and others. Via PSM and PSM2 it also natively
supports Intel TrueScale and Intel OmniPath. Since libfabric is already used for
the Infiniband backend, we do not foresee major challenges for porting NetIO to
OmniPath, and a native OmniPath backend is currently under development.
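To make the two-layer design concrete, here is a conceptual sketch in Python (NetIO itself is a C++ library; the class names below are purely illustrative) of a pluggable inner-layer backend offering only send/receive, with an outer-layer publish/subscribe socket built on top of it:

```python
class Backend:
    """Inner layer: a pluggable transport offering only basic send/receive.
    NetIO implements this over POSIX/TCP sockets and over libfabric (Verbs)."""
    def send(self, peer, data: bytes):
        raise NotImplementedError
    def recv(self) -> bytes:
        raise NotImplementedError

class TCPBackend(Backend):
    """Would wrap POSIX TCP sockets; omitted here for brevity."""
    pass

class PublishSocket:
    """Outer layer: one of several socket types built on the backend API.
    Subscriptions are tracked per tag; publish fans a message out to all
    subscribers of that tag."""
    def __init__(self, backend: Backend):
        self.backend = backend
        self.subscribers = {}          # tag -> set of peers
    def subscribe(self, peer, tag):
        self.subscribers.setdefault(tag, set()).add(peer)
    def publish(self, tag, data: bytes):
        for peer in self.subscribers.get(tag, ()):
            self.backend.send(peer, data)
```

The point of the split is that the outer-layer socket types never touch the network technology directly, so adding a new fabric only requires a new backend.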
In recent years there has been increasing use of HPC facilities for HEP experiments. This has initially focussed on less I/O intensive workloads such as generator-level or detector simulation. We now demonstrate the efficient running of I/O-heavy ‘analysis’ workloads for the ATLAS and ALICE collaborations on HPC facilities at NERSC, as well as astronomical image analysis for DESI.
To do this we exploit a new 900 TB NVRAM-based storage system recently installed at NERSC, termed a ‘Burst Buffer’. This is a novel approach to HPC storage that builds on-demand filesystems on all-SSD hardware that is placed on the high-speed network of the new Cori supercomputer. The system provides over 900 GB/s bandwidth and 12.5 million I/O operations per second.
We describe the hardware and software involved in this system, and give an overview of its capabilities and use-cases beyond the HEP community before focussing in detail on how the ATLAS, ALICE and astronomical
workflows were adapted to work on this system. To achieve this, we have also made use of other novel techniques, such as use of docker-like container technology, and tuning of the I/O layer experiment software.
We describe these modifications and the resulting performance results, including comparisons to other approaches and filesystems. We provide detailed performance studies and results, demonstrating that we can meet the challenging I/O requirements of HEP experiments and scale to tens of thousands of cores accessing a single storage system.
Southeast University Science Operation Center (SEUSOC) is one of the computing centers of the Alpha Magnetic Spectrometer (AMS-02) experiment. It provides 2000 CPU cores for AMS scientific computing and a dedicated 1 Gbps Long Fat Network (LFN) for AMS data transmission between SEU and CERN. In this paper, the workflows of SEUSOC Monte Carlo (MC) production are discussed in detail, including the process of MC job request and execution, the data transmission strategy, the MC database and the MC production monitoring tool. Moreover, to speed up the data transmission over the LFN between SEU and CERN, an optimized transmission strategy at the TCP and application layers is further introduced.
With processor architecture evolution, the HPC market has undergone a paradigm shift. The adoption of low-cost, Linux-based clusters extended HPC’s reach from its roots in modeling and simulation of complex physical systems to a broad range of industries, from biotechnology, cloud computing, computer analytics and big data challenges to manufacturing sectors. In this perspective, the near future HPC systems will be composed of millions of low-power-consumption computing cores, tightly interconnected by a low latency high performance network, equipped with a new distributed storage architecture, densely packaged but cooled by an appropriate technology.
On the road towards Exascale-class systems, several additional challenges await a solution; the storage and interconnect subsystems, as well as a dense packaging technology, are three of them.
The ExaNeSt project, started in December 2015 and funded within the EU H2020 research framework (call H2020-FETHPC-2014, n. 671553), is a European initiative aiming to develop the system-level interconnect, the NVM (Non-Volatile Memory) storage and the cooling infrastructure for ARM-based Exascale-class supercomputers. The ExaNeSt Consortium combines industrial and academic research expertise, especially in the areas of system cooling and packaging, storage, interconnects, and the HPC applications that drive all of the above.
ExaNeSt will develop an in-node storage architecture, leveraging low-cost, low-power NVM devices. The distributed storage sub-system will be accessed through a unified low-latency interconnect, enabling scalability of storage size and I/O bandwidth with the compute capacity.
The unified, low-latency, RDMA-enhanced network will be designed and validated using a network test-bed based on FPGAs and passive copper and/or active optical channels, allowing the exploration of different interconnection topologies (from a low-radix n-dimensional torus mesh to a higher-radix DragonFly topology), routing functions minimizing data traffic congestion, and network support for system resiliency.
ExaNeSt also addresses packaging and advanced liquid cooling, which are of strategic importance for the design of realistic systems, and aims at an optimal integration that is dense, scalable and power efficient. In an early stage of the project an ExaNeSt system prototype, featuring 1000+ ARM cores, will be available, acting as a platform demonstrator and hardware emulator.
A set of relevant ambitious applications, including HPC codes for astrophysics, nuclear physics, neural network simulation and big data, will support the co-design of the ExaNeSt system providing specifications during design phase and application benchmarks for the prototype platform.
In this talk a general overview of the project motivations and objectives will be discussed and the status of the preliminary developments will be reported.
This contribution reports on the remote evaluation of pre-production Intel Omni-Path (OPA) interconnect hardware and software performed by the RHIC & ATLAS Computing Facility (RACF) at BNL in the Dec 2015 - Feb 2016 time period, using a 32-node "Diamond" cluster with a single Omni-Path Host Fabric Interface (HFI) installed on each node and a single 48-port Omni-Path switch with a non-blocking fabric (capable of carrying up to 9.4 Tbps of aggregate traffic if all ports are involved) provided by Intel. The main purpose of the tests was to assess the basic features and functionality of the control and diagnostic tools available for the pre-production version of the Intel Omni-Path low latency interconnect technology, as well as the Omni-Path interconnect performance in a realistic environment of a multi-node HPC cluster running the RedHat Enterprise Linux 7 x86_64 OS. The interconnect performance metering was performed using the low-level fabric layer and MPI communication layer benchmarking tools available in the OpenFabrics Enterprise Distribution (OFED), Intel Fabric Suite and OpenMPI v1.10.0 distributions with pre-production support of the Intel OPA interconnect technology, built with both GCC v4.9.2 and Intel Compiler v15.0.2 and provided in the existing test cluster setup. A subset of the tests was performed with benchmarking tools built with GCC and with the Intel Compiler, with and without explicit mapping of the test processes to the physical CPU cores of the compute nodes, in order to determine whether these changes result in a statistically significant difference in the observed performance. Despite the limited scale of the test cluster used, the test environment provided was sufficient to carry out a large variety of RDMA, native and Intel OpenMPI, and IP over Omni-Path performance measurements and functionality tests. In addition to presenting the results of the performance benchmarks, we also discuss the prospects for the future use of the Intel Omni-Path technology as an interconnect solution for both HPC and HTC scientific workloads.
The LHC is the world's most powerful particle accelerator, colliding protons at a centre-of-mass energy of 13 TeV. As the
energy and frequency of collisions have grown in the search for new physics, so too has the demand for computing resources needed for
event reconstruction. We will report on the evolution of resource usage in terms of CPU and RAM in key ATLAS offline
reconstruction workflows at the Tier0 at CERN and on the WLCG. Monitoring of workflows is achieved using the ATLAS PerfMon
package, which is the standard ATLAS performance monitoring system running inside Athena jobs. Systematic daily monitoring has
recently been expanded to include all workflows beginning at Monte Carlo generation through to end user physics analysis, beyond
that of event reconstruction. Moreover, the move to a multiprocessor mode in production jobs has facilitated the use of tools, such
as "MemoryMonitor", to measure the memory shared across processors in jobs. Resource consumption is broken down into software
domains and displayed in plots generated using Python visualization libraries and collected into pre-formatted auto-generated
Web pages, which allow ATLAS' developer community to track the performance of their algorithms. This information is however
preferentially filtered to domain leaders and developers through the use of JIRA and via reports given at ATLAS software meetings.
Finally, we take a glimpse of the future by reporting on the expected CPU and RAM usage in benchmark workflows associated with the
High Luminosity LHC and anticipate the ways performance monitoring will evolve to understand and benchmark future workflows.
Changes in the trigger menu, the online algorithmic event selection of the ATLAS experiment at the LHC, made in response to luminosity and detector changes, are followed by adjustments in its monitoring system. This is done to ensure that the collected data are useful and can be properly reconstructed at Tier-0, the first level of the computing grid. During Run 1, ATLAS deployed monitoring updates with the installation of new software releases at Tier-0. This created unnecessary overhead for developers and operators, and unavoidably led to different releases for the data-taking and the monitoring setup.
We present a "trigger menu-aware" monitoring system designed for the ATLAS Run 2 data-taking. The new monitoring system aims to simplify the ATLAS operational workflows, and allows for easy and flexible monitoring configuration changes at the Tier-0 site via an Oracle DB interface. We present the design and the implementation of the menu-aware monitoring, along with lessons from the operational experience of the new system with the 2016 collision data.
MonALISA, which stands for Monitoring Agents using a Large Integrated Services Architecture, has been developed over the last fourteen years by Caltech and its partners with the support of the CMS software and computing program. The framework is based on Dynamic Distributed Service Architecture and is able to provide complete monitoring, control and global optimization services for complex systems.
The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of information gathering and processing tasks. These agents can analyze and process the information, in a distributed way, to provide optimization decisions in large-scale distributed applications. An agent-based architecture provides the ability to invest the system with increasing degrees of intelligence, to reduce complexity and to make global systems manageable in real time. The scalability of the system derives from the use of a multithreaded execution engine to host a variety of loosely coupled, self-describing dynamic services or agents, and from the ability of each service to register itself and then to be discovered and used by any other services or clients that require such information. The system is designed to easily integrate existing monitoring tools and procedures and to provide this information in a dynamic, customized, self-describing way to any other services or clients.
A report of the present status of development in MonALISA as well as outlook on future developments will be given.
Physics analysis at the Compact Muon Solenoid (CMS) requires both a vast production of simulated events and an extensive processing of the data collected by the experiment.
Since the end of LHC Run 1 in 2012, CMS has produced over 20 billion simulated events, from 75 thousand processing requests organised in one hundred different campaigns, which emulate different configurations of collision events, the CMS detector and LHC running conditions. In the same time span, sixteen data processing campaigns have taken place to reconstruct different portions of the Run 1 and Run 2 data with ever-improving algorithms and calibrations.
The scale and complexity of the events simulation and processing and the requirement that multiple campaigns must proceed in parallel, demand that a comprehensive, frequently updated and easily accessible monitoring be made available to the CMS collaboration.
Such monitoring must serve both the analysts, who want to know which and when datasets will become available, and the central teams in charge of submitting, prioritizing and running the requests across the distributed computing infrastructure of CMS.
The web-based Production Monitoring Platform (pMp) service was developed in 2015 to address those needs. It aggregates information from the multiple services used to define, organize and run the processing requests; pMp updates a dedicated Elasticsearch database hourly and provides multiple configurable views to assess the status of single datasets as well as entire production campaigns.
This contribution will cover the pMp development, the evolution of its functionalities and one and a half years of operational experience.
Over the past two years, the operations at INFN-CNAF have undergone significant changes.
The adoption of configuration management tools such as Puppet, and the constant increase of dynamic and cloud infrastructures, have led us to investigate a new monitoring approach.
Our aim is the centralization of the monitoring service at CNAF through a scalable and highly configurable monitoring infrastructure.
The selection of tools has been made taking into account the following requirements given by our users: adaptability to dynamic infrastructures, ease of configuration and maintenance, capability to provide more flexibility, compatibility with existing monitoring system, re-usability and ease of access to information and data.
We are going to describe our monitoring infrastructure composed of the following components: Sensu as monitoring router, InfluxDB as time series database to store data gathered from sensors and Grafana as a tool to create dashboards and to visualize time series metrics.
IceProd is a data processing and management framework developed by the IceCube Neutrino Observatory for processing of Monte Carlo simulations, detector data, and analysis levels. It runs as a separate layer on top of grid and batch systems. This is accomplished by a set of daemons which process job workflow, maintaining configuration and status information on the job before, during, and after processing. IceProd can also manage complex workflow DAGs across distributed computing grids in order to optimize usage of resources.
IceProd has recently been rewritten to increase its scaling capabilities, handle user analysis workflows together with simulation production, and facilitate the integration with 3rd party scheduling tools. IceProd 2, the second generation of IceProd, has been running in production for several months now. We share our experience setting up the system and things we’ve learned along the way.
The Simulation at Point1 project is successfully running traditional ATLAS simulation jobs
on the trigger and data acquisition high level trigger resources.
The pool of available resources changes dynamically and quickly, therefore we need to be very
effective in exploiting the available computing cycles.
We will present our experience with using the Event Service that provides the event-level
granularity for computations. We will show the design decisions and overhead time related
to the usage of the Event Service. The improved utilization of the resources will
also be presented, along with recent developments in monitoring and automatic alerting,
as well as in deployment and the GUI.
The Scientific Computing Department of the STFC runs a cloud service for internal users and various user communities. The SCD Cloud is configured using a configuration management system called Aquilon. Many of the virtual machine images are also created and configured using Aquilon. This is not unusual; however, our integrations also allow Aquilon to be altered by the Cloud. For instance, the creation or destruction of a virtual machine can affect its configuration in Aquilon.
The current Tier-0 processing at CERN is done at two managed sites, the CERN computer centre and the Wigner computer centre. With the proliferation of public cloud resources at increasingly competitive prices, we have been investigating how to transparently increase our compute capacity to include these providers. The approach taken has been to integrate these resources using our existing deployment and computer management tools and to provide them in a way that exposes them to users as part of the same site. This paper will describe the architecture, the toolset and the current production experience of this model.
The Computing Center of the Institute of Physics (CC IoP) of the Czech Academy of Sciences serves a broad spectrum of users with various computing needs. It runs a WLCG Tier-2 center for the ALICE and ATLAS experiments; the same group of services is used by the astroparticle physics projects Pierre Auger Observatory (PAO) and Cherenkov Telescope Array (CTA). The OSG stack is installed for the NOvA experiment. Other groups of users use the local batch system directly. The storage capacity is distributed over several locations. The DPM servers used by ATLAS and the PAO are all in the same server room, but several xrootd servers for the ALICE experiment are operated in the Nuclear Physics Institute in Rez, about 10 km away. The storage capacity for ATLAS and the PAO is extended by resources of CESNET, the Czech National Grid Initiative representative. Those resources are located in Plzen and Jihlava, more than 100 km away from the CC IoP. Both distant sites use a hierarchical storage solution based on disks and tapes. They installed one common dCache instance, which is published in the CC IoP BDII. ATLAS users can use these resources with the standard ATLAS tools in the same way as the local storage, without noticing this geographical distribution.
The computing clusters LUNA and EXMAG, dedicated to users mostly from the Solid State Physics departments, offer resources for parallel computing. They are part of the Czech NGI infrastructure MetaCentrum, with a distributed batch system based on Torque with a custom scheduler. The clusters are installed remotely by the MetaCentrum team and a local contact helps only when needed. Users from the IoP have exclusive access only to a part of these two clusters and take advantage of higher priorities on the rest (1500 cores in total), which can also be used by any user of the MetaCentrum. IoP researchers can also use distant resources located in several towns of the Czech Republic with a capacity of more than 12000 cores in total.
This contribution will describe installation and maintenance procedures, the transition from CFEngine to Puppet, the monitoring infrastructure based on tools like Nagios, Munin and Ganglia, and the organization of user support via Request Tracker. We will share our experience with log file processing using the ELK stack. A description of the network infrastructure and its load will also be given.
The software suite required to support a modern high energy physics experiment is typically made up of many experiment-specific packages in addition to a large set of external packages. The developer-level build system has to deal with external package discovery, versioning, build variants, user environments, etc. We find that various systems for handling these requirements divide the problem in different ways, making simple substitution of one set of build tools for another impossible. Recently, there has been a growing interest in the HEP community in using Spack (https://github.com/llnl/spack) to handle various aspects of the external package portion of the build problem. We describe a new build system that utilizes Spack for external dependencies and emphasizes common open source software solutions for the rest of the build process.
The ALICE experiment at CERN was designed to study the properties of the strongly-interacting hot and dense matter created in heavy-ion collisions at LHC energies. The computing model of the experiment currently relies on a hierarchical Tier-based structure, with a top-level Grid site at CERN (Tier-0, also extended to Wigner) and several globally distributed datacenters at national and regional level (Tier-1 and Tier-2 sites). The Italian computing infrastructure is mainly composed of a Tier-1 site at CNAF (Bologna) and four Tier-2 sites (at Bari, Catania, Padova-Legnaro and Torino), with the addition of two small WLCG centers in Cagliari and Trieste. Globally, it contributes about 15% of the overall ALICE computing resources.
Currently the management of a Tier-2 site is based on a few complementary monitoring tools, each looking at the ALICE activity from a different point of view: for instance, MonALISA is used to extract information from the experiment side, the local batch system allows statistical data on the overall site activity to be stored, and the local monitoring system provides the status of the computing machines. This typical schema makes it somewhat difficult to figure out at a glance the status of the ALICE activity in the site and to compare information extracted from different sources for debugging purposes. In this contribution, a monitoring system able to gather information from all the available sources to improve the management of an ALICE Tier-2 site will be presented. A centralized site dashboard based on specific tools, selected to meet tight technical requirements such as the capability to manage a huge amount of data in a fast way and through an interactive and customizable Graphical User Interface, has been developed. The current version, running in the Bari Tier-2 site for more than one year, relies on an open source time-series database (InfluxDB): a dataset of about 20 M values is currently stored in 400 MB, with on-the-fly aggregation allowing downsampled series to be returned with a factor of 10 gain in retrieval time. A dashboard builder for visualizing time-series metrics (Grafana) has been identified as the best suited option, while dedicated code has been written to implement the gathering phase. Details of the dashboard performance as observed over the last year will also be provided.
The system is currently being exported to all the other sites, as a first step towards a single centralized dashboard for ALICE computing in Italy. Prospects for such an Italian dashboard and further developments in this direction will be discussed. They also include the design of a more general monitoring system for distributed datacenters, able to provide active support to site administrators in detecting critical events as well as in improving problem solving and debugging procedures.
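As an illustration of the on-the-fly aggregation mentioned above, here is a sketch of a downsampling query against the InfluxDB 1.x HTTP /query endpoint; the database, measurement and field names below are hypothetical:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical database/measurement/field names; GROUP BY time() performs the
# on-the-fly aggregation that returns a downsampled series.
INFLUX = "http://localhost:8086/query"
QUERY = ("SELECT mean(running_jobs) FROM site_activity "
         "WHERE time > now() - 7d GROUP BY time(1h)")

url = INFLUX + "?" + urllib.parse.urlencode({"db": "tier2_dashboard", "q": QUERY})
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)["results"][0]["series"][0]
    for timestamp, value in series["values"]:
        print(timestamp, value)
```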
In the ideal limit of infinite resources, multi-tenant applications are able to scale in/out on a Cloud driven only by their functional requirements. A large Public Cloud may be a reasonable approximation of this condition, where tenants are normally charged a posteriori for their resource consumption. On the other hand, small scientific computing centres usually work in a saturated regime and tenants are charged a priori for their computing needs by paying for a fraction of the computing/storage resources constituting the Cloud infrastructure. Within this context, an advanced resource allocation policy is needed in order to optimise the use of the data center. We consider a scenario in which a configurable fraction of the available resources is statically assigned and partitioned among projects according to fixed shares. Additional assets are partitioned dynamically following the effective requests per project; efficient and fair access to such resources must be granted to all projects.
The general topic of advanced resource scheduling is addressed by several components of the EU-funded INDIGO-DataCloud project. In this context, dedicated services for the OpenNebula and OpenStack cloud management systems are addressed separately, because of the different internal architectures of the systems.
In this contribution, we describe the FairShare Scheduler Service (FSS) for OpenNebula (ON). The service satisfies resource requests according to an algorithm which prioritises tasks according to an initial weight and to the historical resource usage of the project, irrespective of the number of tasks it has running on the system. The software was designed to be as unintrusive as possible in the ON code. By keeping minimal dependencies on the ON implementation details, we expect our code to be fairly independent of future changes and developments in the ON internals.
The scheduling service is structured as a self-contained module interacting only with the ON XML-RPC interface. Its core component is the Priority Manager (PM), whose main task is to calculate a set of priorities for queued jobs. The manager interacts with a set of pluggable algorithms to calculate priorities. The PM exposes an XML-RPC interface, independent from the ON core one, and uses an independent Priority Database as data back-end. The second fundamental building block of the FSS module is the scheduler itself. The default ON scheduler is responsible for the matching of pending requests to the most suited physical resources. The queue of pending jobs is retrieved through an XML-RPC call to the ON core and they are served in a first-in-first-out manner. We keep the original scheduler implementation, but the queue of pending jobs to be processed is the one ordered according to priorities as delivered by the PM.
After a description of the module’s architecture, internal data representation and APIs, we show the results of the tests performed on the first prototype.
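As an illustration of the kind of prioritisation the Priority Manager could perform, here is a toy sketch assuming one plausible fair-share formula; the real algorithms are pluggable and are not specified here:

```python
import math
import time

def fairshare_priority(share, usage_history, now=None, half_life_days=7.0):
    """One plausible fair-share formula (purely illustrative): priority grows
    with the project's static share and shrinks with its exponentially
    decayed historical resource usage.
    usage_history: list of (timestamp, core_hours) records for the project."""
    now = now or time.time()
    decay = math.log(2) / (half_life_days * 86400)
    decayed_usage = sum(u * math.exp(-decay * (now - t)) for t, u in usage_history)
    return share / (1.0 + decayed_usage)

def order_queue(pending, shares, usage):
    """Order queued requests by project priority; FIFO within a project."""
    return sorted(pending, key=lambda req: -fairshare_priority(
        shares[req["project"]], usage.get(req["project"], [])))
```

In the described architecture, an ordering like this would be produced by the Priority Manager and then consumed unchanged by the stock ON scheduler, which keeps its original matching logic.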
Application performance is often assessed using the Performance Monitoring Unit (PMU) capabilities present in modern processors. One popular tool that can read the PMU's performance counters is the Linux perf tool. pmu-tools is a toolkit built around Linux perf that provides a more powerful interface to the different PMU events and gives a more abstracted view of the events. Unfortunately, pmu-tools reports results only in text form or as simple static graphs, limiting its usability.
We report on our efforts in developing a web-based front-end for pmu-tools, allowing the application developer to more easily visualize, analyse and interpret performance monitoring results. Our contribution should boost programmer productivity and encourage continuous monitoring of the application's performance. Furthermore, we discuss our tool's capability to quickly construct and test new performance metrics for characterizing application performance. This allows users to experiment with new high-level metrics that reflect the performance requirements of their applications more accurately.
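As a toy illustration of composing high-level metrics from raw PMU events, in the spirit of what such a front-end lets the user do (the event names follow generic perf conventions; this is not the pmu-tools API, and the sample numbers are made up):

```python
def derived_metrics(counters):
    """Combine raw event counts into user-defined high-level metrics."""
    c = counters
    return {
        "IPC": c["instructions"] / c["cycles"],
        "L1 miss ratio": c["L1-dcache-load-misses"] / c["L1-dcache-loads"],
        "branch miss ratio": c["branch-misses"] / c["branches"],
    }

sample = {"instructions": 8.2e9, "cycles": 5.1e9,
          "L1-dcache-loads": 2.4e9, "L1-dcache-load-misses": 6.0e7,
          "branches": 1.1e9, "branch-misses": 2.2e7}
print(derived_metrics(sample))
```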
OpenStack is an open source cloud computing project that is enjoying wide popularity. More and more organizations and enterprises deploy it to provide their private cloud services. However, most organizations and enterprises cannot achieve unified user management access control to the cloud service, since the authentication and authorization systems of Cloud providers are generic and they cannot be easily adapted to the requirements of each individual organization or enterprise.
In this paper we present the design of a lightweight access control solution that overcomes this problem. In our solution, access control is offered as a service by a trusted third party, the Access Control Provider. Access control as a service enhances end-user privacy, eliminates the need for developing complex adaptation protocols, and offers users the flexibility to switch between the Cloud service and other, different services.
We have implemented and incorporated our solution into the popular open-source Cloud stack OpenStack. Moreover, we have designed and implemented a web application, based on Auth2.0, that enables the incorporation of our solution into the UMT of IHEP. The UMT of IHEP is a tool used to manage IHEP users and to record user information such as accounts and passwords; it also provides the unified authentication service.
In our access control solution for OpenStack, we create an information table to record all OpenStack accounts and their passwords, which is queried when these accounts are authenticated by the trusted third party. When a newly registered UMT user logs in to the Cloud service for the first time, our system creates the user's resources automatically via the OpenStack API and records the user information in the information table immediately. Moreover, we keep the original OpenStack login web page, so administrators and some special users can access OpenStack and perform background management. We have applied the solution to IHEPCloud, an IaaS cloud platform at IHEP. Besides UMT, it is easy to add other third-party authentication tools, for example the CERN account management system, Google, Sina, or Tencent.
Belle II experiment can take advantage from Data federation technologies to simplify access to distributed datasets and file replicas. The increasing adoption of http and webdav protocol by sites, enable to create lightweight solutions to give an aggregate view of the distributed storage.
In this work, we make a study on the possible usage of the software Dynafed developed by CERN for the creation of an on-the-fly data federation.
We created a first dynafed server, hosted in the datacentre in Napoli, and connected with about the 50% of the production storages of Belle II. Then we aggregated all the file systems under a unique http path. We implemented as well an additional view, in order to browse the single storage file system.
On this infrastructure, we performed a stress test in order to evaluate the impact of federation overall performances, the service resilience, and to study the capability of redirect clients properly to the file replica in case of fault, temporary unavailability of a server.
The results show the good potential of the service and motivate further investigation with additional setups.
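The following hedged Python sketch illustrates the kind of client-side check used during such tests: listing a federated directory over WebDAV and inspecting which replica a GET request is redirected to. The federation URL is a placeholder and authentication (X.509/VOMS) is omitted:

    # Illustrative federation client checks; not the actual test harness.
    import requests

    FEDERATION = "https://dynafed.example.org/belle2"   # placeholder base path

    def list_directory(path):
        """Issue a WebDAV PROPFIND (Depth: 1) and return the raw XML listing."""
        r = requests.request("PROPFIND", FEDERATION + path,
                             headers={"Depth": "1"}, timeout=30)
        r.raise_for_status()
        return r.text

    def resolve_replica(path):
        """Request a file without following the redirect, so the replica
        endpoint selected by the federation can be inspected."""
        r = requests.get(FEDERATION + path, allow_redirects=False, timeout=30)
        if r.status_code in (302, 307):
            return r.headers["Location"]       # URL of the selected replica
        return FEDERATION + path               # served directly

    if __name__ == "__main__":
        print(list_directory("/")[:500])
        print("replica:", resolve_replica("/some/dataset/file.root"))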
Virtual machines offer flexibility, easy control and customized system environments. More and more organizations and enterprises deploy virtualization technology and cloud computing to build their distributed systems, and cloud computing is widely used in the high energy physics field. In this presentation, we introduce an integration of virtual machines with HTCondor that supports resource management for multiple groups and a preemptive scheduling policy, making resource management more flexible and efficient. Firstly, computing resources belong to different experiments, and each experiment has one or more user groups. All users of the same experiment have access permission to all the resources owned by that experiment. Therefore, there are two types of groups, resource groups and user groups. In order to manage the mapping between user groups and resource groups, we designed a permission controlling component to ensure that jobs are delivered to suitable resource groups. Secondly, to elastically adjust the resource scale of a resource group, it is necessary to schedule resources in the same way as jobs are scheduled. We therefore designed a resource scheduler focused on virtual resources: it maintains a resource queue and matches an appropriate number of virtual machines from the requested resource group. Thirdly, when resources are occupied by a resource group for a long time, they may need to be preempted. This presentation adds a preemption feature to the resource scheduler based on group priority: a higher priority leads to a lower preemption probability, and a lower priority to a higher one. Virtual resources can be smoothly preempted, with running jobs held and re-matched later; the feature relies on HTCondor to store the held job, release it back to idle status and wait for a secondary matching. We built a distributed virtual computing system based on HTCondor and OpenStack, and this presentation also shows some use cases from the JUNO and LHAASO experiments. The results show that multi-group and preemptive resource scheduling perform well. Besides, the permission controlling component is used not only in the virtual cluster but also in the local cluster, and the number of experiments it supports is growing.
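A minimal sketch of the preemption step, assuming the HTCondor Python bindings and a hypothetical accounting-group name, is given below; it illustrates the hold/release mechanism rather than reproducing the actual scheduler code:

    # Hedged sketch: hold running jobs of a lower-priority group so their VM
    # slots can be reclaimed, then release them later for a secondary match.
    import htcondor

    schedd = htcondor.Schedd()

    def preempt_group(group):
        """Hold all running jobs of `group`; their slots become reclaimable."""
        constraint = 'AcctGroup == "%s" && JobStatus == 2' % group  # 2 = running
        return schedd.act(htcondor.JobAction.Hold, constraint)

    def release_group(group):
        """Put held jobs of `group` back to idle so they are matched again."""
        constraint = 'AcctGroup == "%s" && JobStatus == 5' % group  # 5 = held
        return schedd.act(htcondor.JobAction.Release, constraint)

    if __name__ == "__main__":
        preempt_group("lhaaso")   # hypothetical low-priority group
        release_group("lhaaso")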
With the era of big data emerging, Hadoop has become the de facto standard for big data processing. However, it is still difficult to run High Energy Physics (HEP) applications efficiently on an HDFS platform, for two reasons. Firstly, random access to event data is not supported by HDFS. Secondly, it is difficult to adapt HEP applications to the Hadoop data processing model. In order to address this problem, a new read and write mechanism for HDFS is proposed, in which data access is done on the local filesystem instead of through the HDFS streaming interface. For data writing, the first file replica is written to the local DataNode and the remaining replicas are produced by copying the first replica to other DataNodes. The first replica is written under the block storage directory and its data checksum is calculated after the write completes. For data reading, the DataNode daemon provides a data access interface for local blocks, and Map tasks can read the file replica directly on the local DataNode when running locally. To allow files to be modified by users, three attributes, namely permissions, owner and group, are imposed on block objects; blocks stored on a DataNode have the same attributes as the file they belong to. Users can modify blocks when the Map task runs locally, and HDFS is responsible for updating the remaining replicas after the data access is completed. To further improve the performance of the Hadoop system, two optimizations of the Hadoop scheduler were carried out. Firstly, a Hadoop task selection strategy based on disk I/O performance is presented: an appropriate Map task is selected according to disk workloads, so that a balanced disk workload is achieved across DataNodes. Secondly, a fully localized task execution mechanism is implemented for I/O-intensive jobs. Test results show that the average CPU utilization is improved by 10% with the new task selection strategy, and that data read and write performance is improved by about 10% and 40% respectively.
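The following is a conceptual Python sketch, not the modified HDFS code itself, of the local read path described above: if a block replica is found under the local DataNode data directory it is read directly from the local filesystem, otherwise the standard streaming client is used. Paths and the lookup helper are illustrative assumptions:

    # Conceptual illustration of "read locally if the block is here" logic.
    import glob
    import os
    import subprocess

    DATANODE_DATA_DIR = "/data/hadoop/dfs/data"   # placeholder dfs.datanode.data.dir

    def find_local_block(block_id):
        """Locate a block replica (blk_<id>) under the local DataNode directory."""
        pattern = os.path.join(DATANODE_DATA_DIR, "current", "**", "blk_%d" % block_id)
        matches = glob.glob(pattern, recursive=True)
        return matches[0] if matches else None

    def read_block(block_id, hdfs_path):
        local = find_local_block(block_id)
        if local:
            # data-local Map task: bypass the streaming interface entirely
            with open(local, "rb") as f:
                return f.read()
        # remote block: fall back to the regular HDFS client
        return subprocess.run(["hdfs", "dfs", "-cat", hdfs_path],
                              capture_output=True, check=True).stdout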
The complex geometry of the whole ATLAS detector at the LHC is currently stored only in custom online databases, from which it is built on-the-fly on request. Accessing the online geometry guarantees access to the latest version of the detector description, but requires the setup of the full ATLAS software framework "Athena", which provides the online services and the tools to retrieve the data from the database. This operation is cumbersome and slows down the applications that need to access the geometry. Moreover, all applications that access the detector geometry must be built and run on the same platform as the ATLAS framework, preventing the usage of the actual detector geometry in stand-alone applications.
Here we propose a new mechanism to persistify and serve the geometry of HEP experiments. The new mechanism is composed of a new file format and a REST API. The new file format allows the whole detector description to be stored locally in a flat file, and it is especially optimized to describe large complex detectors with the minimum file size, making use of shared instances and storing compressed representations of geometry transformations. In addition, the dedicated REST API is meant to serve the geometry in standard formats like JSON, letting users and applications download specific partial geometry information.
With this new geometry persistification, a new generation of applications can be developed that use the actual detector geometry while being platform-independent and experiment-agnostic.
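A minimal sketch of the kind of REST endpoint envisaged, assuming a Flask-based service and an illustrative JSON layout for the persistified geometry (the actual file format and routes may differ), is given below:

    # Hedged sketch: serve partial detector description in JSON from a local file.
    import json
    from flask import Flask, jsonify, abort

    app = Flask(__name__)

    # toy persistified geometry: {volume name: {"shape": ..., "transform": ...}}
    with open("geometry.json") as f:
        GEOMETRY = json.load(f)

    @app.route("/geometry/volumes")
    def list_volumes():
        """Return the names of all stored volumes."""
        return jsonify(sorted(GEOMETRY.keys()))

    @app.route("/geometry/volumes/<name>")
    def get_volume(name):
        """Return the full description of a single volume."""
        if name not in GEOMETRY:
            abort(404)
        return jsonify(GEOMETRY[name])

    if __name__ == "__main__":
        app.run(port=8080)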
The INFN Section of Turin hosts a medium-sized multi-tenant cloud infrastructure optimized for scientific computing.
A new approach exploiting the features of VMDIRAC has been designed and implemented and is now being tested; it aims to allow the dynamic, automatic instantiation and destruction of Virtual Machines from different tenants, in order to maximize the global computing efficiency of the infrastructure.
This approach addresses both the OpenNebula and OpenStack platforms through the standard EC2 API.
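A hedged sketch of what addressing both platforms through the common EC2 interface looks like from the client side is given below, using the boto3 library; endpoints, image identifiers and flavours are placeholders and this is not the VMDIRAC implementation itself:

    # Illustrative EC2-API client: the same calls work against the OpenStack
    # EC2 endpoint or OpenNebula's econe server, only the endpoint changes.
    import boto3

    def ec2_client(endpoint, key, secret):
        return boto3.client("ec2", endpoint_url=endpoint,
                            aws_access_key_id=key, aws_secret_access_key=secret,
                            region_name="RegionOne")

    def start_vm(client, image_id, flavour):
        resp = client.run_instances(ImageId=image_id, InstanceType=flavour,
                                    MinCount=1, MaxCount=1)
        return resp["Instances"][0]["InstanceId"]

    def stop_vm(client, instance_id):
        client.terminate_instances(InstanceIds=[instance_id])

    if __name__ == "__main__":
        openstack = ec2_client("https://openstack.example.org:8788/", "KEY", "SECRET")
        vm = start_vm(openstack, "ami-00000001", "m1.medium")
        stop_vm(openstack, vm)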
The use of the WebDAV protocol to access large storage areas is becoming popular in the High Energy Physics community. All the main Grid and Cloud storage solutions provide such an interface; in this scenario, tuning the storage systems and evaluating their performance become crucial aspects in promoting the adoption of these protocols within the Belle II community.
In this work, we present the results of a large-scale test activity carried out with the goal of evaluating the performance and reliability of the WebDAV protocol and of studying its possible adoption for user analysis, either alongside or as an alternative to the most commonly used protocols.
More specifically, we considered a pilot infrastructure composed of a set of storage elements configured with a WebDAV interface and hosted at the Belle II sites. The performance tests also include a comparison with XRootD, which is popular in the HEP community.
As reference tests, we used a set of analysis jobs running in the Belle II software framework and accessing the input data with the ROOT I/O library, in order to simulate realistic user activity as closely as possible.
The final analysis shows that promising performance can be achieved with WebDAV on different storage systems, and provides useful feedback for the Belle II community and for other high energy physics experiments.
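For illustration, such a reference job reduces to an access pattern like the following hedged PyROOT sketch, which opens the same input through the WebDAV and XRootD doors and times a simple read loop; the URLs and tree name are placeholders, and WebDAV access assumes a ROOT build with Davix support:

    # Illustrative protocol comparison with ROOT I/O; not the actual test jobs.
    import time
    import ROOT

    URLS = {
        "webdav": "davs://se.example.org:443/belle2/mdst/sample.root",    # placeholder
        "xrootd": "root://se.example.org:1094//belle2/mdst/sample.root",  # placeholder
    }

    def read_all(url, tree_name="tree"):
        f = ROOT.TFile.Open(url)
        if not f or f.IsZombie():
            raise IOError("cannot open %s" % url)
        tree = f.Get(tree_name)
        start = time.time()
        n = sum(1 for _ in tree)      # iterate over all entries, forcing reads
        return n, time.time() - start

    for proto, url in URLS.items():
        entries, dt = read_all(url)
        print("%s: %d entries in %.1f s" % (proto, entries, dt))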
The ATLAS software infrastructure facilitates the efforts of more than 1000 developers working on a code base of 2200 packages with 4 million C++ and 1.4 million Python lines. The ATLAS offline code management system is a powerful, flexible framework for processing new package version requests, probing code changes in the Nightly Build System, migrating to new platforms and compilers, deploying production releases for worldwide access and supporting physicists with tools and interfaces for efficient software use. It maintains a multi-stream, parallel development environment with about 70 multi-platform branches of nightly releases and provides vast opportunities for testing new packages, verifying patches to existing software and migrating to new platforms and compilers. The system's evolution is currently aimed at the adoption of modern continuous integration (CI) practices focused on building nightly releases early and often, with rigorous unit and integration testing. This presentation describes the CI incorporation program for the ATLAS software infrastructure. It brings modern open-source tools such as Jenkins and CTest into the ATLAS Nightly System, rationalizes hardware resource allocation and administrative operations, and provides developers with improved feedback and means to fix broken builds promptly. Once adopted, ATLAS CI practices will improve and accelerate innovation cycles and result in increased confidence in new software deployments. The presentation reports the status of the Jenkins integration with the ATLAS Nightly System as well as short- and long-term plans for the incorporation of CI practices.
ATLAS is a high energy physics experiment at the Large Hadron Collider located at CERN. During the so-called Long Shutdown 2 period scheduled for late 2018, ATLAS will undergo several modifications and upgrades of its data acquisition system in order to cope with the higher luminosity requirements. As part of these activities, a new read-out chain will be built for the New Small Wheel muon detector and the one of the Liquid Argon calorimeter will be upgraded. The subdetector-specific electronics boards will be replaced with new commodity-server-based systems and, instead of the custom SLINK-based communication, the new system will make use of a yet-to-be-chosen commercial network technology. The new network will be used as a data acquisition network and at the same time is intended to carry the communication for the control, calibration and monitoring of the subdetectors. Therefore several types of traffic with different bandwidth requirements and different criticality will be competing for the same underlying hardware. One possible way to address this problem is to use an SDN-based solution.
SDN stands for Software Defined Networking and is an innovative approach to network management. Instead of the classic network protocols used to build a network topology and to create traffic forwarding rules, SDN allows a centralized controller application to programmatically build the topology and create the rules that are loaded into the network devices. The controller can react very quickly to new conditions and new rules can be installed on the fly. A typical use case is a network topology change due to a device failure, which is handled promptly by the SDN controller. Dynamically assigning bandwidth to different traffic types based on different criteria is also possible. On the other hand, several difficulties can be anticipated, such as the connectivity to the controller when the network is booted and the scalability of the number of rules as the network grows.
This work summarizes the evaluation of SDN technology in the context of the research carried out for the ATLAS data acquisition system upgrade. The benefits and drawbacks of the new approach will be discussed and a deployment proposal will be made.
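As an illustration of the programmability offered by an SDN controller, the sketch below uses the open-source Ryu framework with OpenFlow 1.3 (one possible stack, not necessarily the one evaluated here) to install a higher-priority flow entry for data-acquisition traffic, identified by a placeholder UDP port, when a switch connects:

    # Hedged Ryu/OpenFlow 1.3 sketch: prioritize DAQ traffic over other flows.
    from ryu.base import app_manager
    from ryu.controller import ofp_event
    from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
    from ryu.ofproto import ofproto_v1_3

    class DaqPriorityApp(app_manager.RyuApp):
        OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

        @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
        def switch_features_handler(self, ev):
            dp = ev.msg.datapath
            ofp, parser = dp.ofproto, dp.ofproto_parser

            # default rule: send unmatched traffic to the controller (low priority)
            self._add_flow(dp, priority=0, match=parser.OFPMatch(),
                           actions=[parser.OFPActionOutput(ofp.OFPP_CONTROLLER,
                                                           ofp.OFPCML_NO_BUFFER)])

            # DAQ event-fragment traffic (placeholder UDP port 9000): forward with
            # high priority so it is never starved by control/monitoring flows
            match = parser.OFPMatch(eth_type=0x0800, ip_proto=17, udp_dst=9000)
            self._add_flow(dp, priority=100, match=match,
                           actions=[parser.OFPActionOutput(ofp.OFPP_FLOOD)])

        def _add_flow(self, dp, priority, match, actions):
            parser, ofp = dp.ofproto_parser, dp.ofproto
            inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
            dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=priority,
                                          match=match, instructions=inst))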
Distributed computing infrastructures require automatic tools to strengthen, monitor and analyze the security behavior of computing devices. These tools should inspect monitoring data such as resource usage, log entries, traces and even processes' system calls, and should detect anomalies that could indicate the presence of a cyber-attack. In addition, they should react to attacks without administrator intervention, depending on custom configuration parameters. We describe the development of a novel framework that implements these requirements for HEP systems. It is based on Linux container technologies: a previously unexplored deployment of Kubernetes on top of Mesos as a Grid-site, container-based batch system, together with Heapster as a monitoring solution, is being utilized. We show how we achieve a fully virtualized environment that improves security by isolating services and jobs without an appreciable performance impact. We also describe a novel benchmark dataset for Machine Learning based Intrusion Prevention and Detection Systems on Grid computing. This dataset is built upon resource consumption, logs, and system call data collected from jobs running in a test site developed for the ALICE Grid at CERN as a proof of concept of the described framework. Further, we will use this dataset to develop a Machine Learning module that will be integrated with the framework to perform the autonomous Intrusion Detection task.
This paper reports on the activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each consisting of ~100 bytes, all having the same probability of being searched or counted. Data formats represent one important area for optimizing the performance and storage footprint of applications based on Hadoop. This work reports on the production usage and on tests using several data formats, including Map Files, Apache Parquet, Avro, and various compression algorithms.
The query engine also plays a critical role in the architecture. This paper reports on the use of HBase for the EventIndex, focusing on the optimizations performed in production and on the scalability tests. Additional engines that have been tested include Cloudera Impala, in particular for its SQL interface and its optimizations for data warehouse workloads and reports.
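The kind of format and compression comparison performed for such records can be illustrated with the hedged sketch below, which writes toy ~100-byte EventIndex-like records to Parquet with different codecs using pyarrow; the record layout is a guess for illustration, not the real EventIndex schema:

    # Illustrative format/compression comparison on toy event records.
    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    # toy records: run/event numbers, lumi block, GUID-like payload
    records = {
        "run_number":   pa.array(range(1_000_000), type=pa.int32()),
        "event_number": pa.array(range(1_000_000), type=pa.int64()),
        "lumi_block":   pa.array([i % 2000 for i in range(1_000_000)], type=pa.int32()),
        "guid":         pa.array(["%032x" % i for i in range(1_000_000)]),
    }
    table = pa.table(records)

    for codec in ("none", "snappy", "gzip"):
        fname = "eventindex_%s.parquet" % codec
        pq.write_table(table, fname, compression=codec)
        print("%-8s %8.1f MB" % (codec, os.path.getsize(fname) / 1e6))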
The engineering design of a particle detector is usually performed in a Computer Aided Design (CAD) program, and simulation of the detector's performance can be done with a Geant4-based program. However, transferring the detector design from the CAD program to Geant4 can be laborious and error-prone. SW2GDML is a tool that reads a design in the popular SolidWorks CAD program and outputs Geometry Description Markup Language (GDML), used by Geant4 for importing and exporting detector geometries. SW2GDML utilizes the SolidWorks Application Programming Interface for direct access to the design and then converts the geometric shapes described in SolidWorks into standard GDML solids.
Other methods for outputting CAD designs are available, such as the STEP and STL formats, and tools exist to convert these formats into GDML. However, these conversion methods produce very large and unwieldy designs composed of tessellated solids that can reduce Geant4 performance. In contrast, SW2GDML produces compact, human-readable GDML that employs standard geometric shapes rather than tessellated solids.
This talk will describe the development and current capabilities of SW2GDML and plans for its enhancement. The aim of this tool is to automate importation of detector engineering models into Geant4-based simulation programs to support rapid, iterative cycles of detector design, simulation, and optimization.
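For illustration, the hedged sketch below emits the kind of compact GDML such a converter produces for a single box volume, using standard solids rather than tessellated ones; the names, dimensions and material are placeholders and this is not the SW2GDML code:

    # Illustrative generation of a minimal GDML document for one box volume.
    import xml.etree.ElementTree as ET

    def box_gdml(name, x, y, z, material="G4_AIR"):
        gdml = ET.Element("gdml")
        ET.SubElement(gdml, "define")
        ET.SubElement(gdml, "materials")

        solids = ET.SubElement(gdml, "solids")
        ET.SubElement(solids, "box", name=name + "_shape", lunit="mm",
                      x=str(x), y=str(y), z=str(z))

        structure = ET.SubElement(gdml, "structure")
        vol = ET.SubElement(structure, "volume", name=name)
        ET.SubElement(vol, "materialref", ref=material)
        ET.SubElement(vol, "solidref", ref=name + "_shape")

        setup = ET.SubElement(gdml, "setup", name="Default", version="1.0")
        ET.SubElement(setup, "world", ref=name)
        return ET.tostring(gdml, encoding="unicode")

    print(box_gdml("DetectorBox", 100, 100, 300))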
The Compact Muon Solenoid (CMS) experiment makes extensive use of alignment and calibration measurements in several data processing workflows: in the High Level Trigger, in the processing of the recorded collisions and in the production of simulated events for data analysis and studies of detector upgrades. A complete alignment and calibration scenario is factored into approximately three hundred records, which are updated independently and can have time-dependent content, to reflect the evolution of the detector and of the data-taking conditions. Given the complexity of the CMS condition scenarios and the large number (50) of experts who actively measure and release calibration data, in 2015 a novel web-based service was developed to structure and streamline their management: the cmsDbBrowser. cmsDbBrowser provides an intuitive and easily accessible entry point for the navigation of existing conditions by any CMS member, for the bookkeeping of record updates and for the actual composition of complete calibration scenarios. This paper describes the design, the choice of technologies and the first year of production usage of the cmsDbBrowser.
The Trigger and Data Acquisition system of the ATLAS detector at the Large Hadron Collider at CERN is composed of a large number of distributed hardware and software components (about 3000 machines and more than 25000 applications) which, in a coordinated manner, provide the data-taking functionality of the overall system.
During data taking runs, a huge flow of operational data is produced in order to constantly monitor the system and allow proper detection of anomalies or misbehaviors. In the ATLAS trigger and data acquisition system, operational data are archived and made available to applications by the P-Beast (Persistent Back-End for the Atlas Information System of TDAQ) service, implementing a custom time-series database.
The possibility to efficiently visualize both real-time and historical operational data is a great asset facilitating both online identification of problems and post-mortem analysis. This paper will present a web-based solution developed to achieve such a goal: the solution leverages the flexibility of the P-Beast archiver to retrieve data, and exploits the versatility of the Grafana dashboard builder to offer a very rich user experience. Additionally, particular attention will be given to the way some technical challenges (like the efficient visualization of a huge amount of data and the integration of the P-Beast data source in Grafana) have been faced and solved.
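A hedged sketch of how such a data source can be exposed to Grafana is given below; it follows the style of Grafana's SimpleJSON data source rather than the actual P-Beast plugin, and the metric names and values come from a toy in-memory stand-in for P-Beast queries:

    # Illustrative SimpleJSON-style shim serving time series to Grafana.
    import time
    import random
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    SERIES = ["daq.event_rate", "daq.buffer_occupancy"]   # placeholder metric names

    def fetch(name, start_ms, end_ms, points=200):
        """Stand-in for a P-Beast query: return [[value, timestamp_ms], ...]."""
        step = max((end_ms - start_ms) // points, 1)
        return [[random.uniform(0, 100), ts] for ts in range(start_ms, end_ms, step)]

    @app.route("/", methods=["GET"])
    def health():
        return "ok"                    # Grafana uses this to test the data source

    @app.route("/search", methods=["POST"])
    def search():
        return jsonify(SERIES)         # metric names offered in the query editor

    @app.route("/query", methods=["POST"])
    def query():
        req = request.get_json()
        end = int(time.time() * 1000)
        start = end - 3600 * 1000      # this toy ignores the requested time range
        return jsonify([{"target": t["target"],
                         "datapoints": fetch(t["target"], start, end)}
                        for t in req["targets"]])

    if __name__ == "__main__":
        app.run(port=3333)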
Volunteer computing has the potential to provide significant additional computing capacity for the LHC experiments.
One of the challenges with exploiting volunteer computing is to support a global community of volunteers that provides heterogeneous resources.
In addition, HEP applications require more data input and output than the CPU-intensive applications typically run by other volunteer computing projects.
While the so-called "databridge" has already been successfully proposed as a method to span the untrusted and trusted domains of volunteer computing and Grid computing respectively, globally transferring data between potentially poor-performing public networks at home and CERN can be fragile and lead to wasted resources.
The expectation is that by placing a storage endpoint that is part of a wider, flexible geographical databridge deployment closer to the volunteers, the transfer success rate and the overall performance can be improved.
This contribution investigates the provision of a globally distributed databridge implemented upon a commercial cloud provider.
Deploying a complex application on a Cloud-based infrastructure can be a challenging task. Among other things, the complexity can derive from the software components the application relies on, from requirements coming from the use cases (e.g. high availability of the components, autoscaling, disaster recovery), and from the skills of the users that have to run the application.
Using an orchestration service makes it possible to hide the complexity of deploying the application components, the order in which they must be instantiated and the relationships among them. In order to further simplify the application deployment for users not familiar with Cloud infrastructures and the layers above them, it can be worthwhile to provide an abstraction layer on top of the orchestration one.
In this contribution we present an approach for the Cloud-based deployment of applications and its implementation in the framework of several projects, such as "!CHAOS: a cloud of controls", a project funded by MIUR (the Italian Ministry of Research and Education) to create a Cloud-based deployment of a control system and data acquisition framework; "INDIGO-DataCloud", an EC H2020 project targeting, among other things, the high-level deployment of applications on hybrid Clouds; and "Open City Platform", an Italian project aiming to provide open Cloud solutions for Italian Public Administrations.
Through orchestration services, we prototyped a dynamic, on-demand, scalable platform of software components, based on OpenStack infrastructures. A set of ad-hoc Heat templates allows all the application components to be deployed automatically, minimizes faulty situations and guarantees the same configuration every time they run. The automatic orchestration is an example of Platform as a Service that can be instantiated either via the command line or via the OpenStack dashboard, assuming a certain level of knowledge of OpenStack usage.
On top of the orchestration services we developed a prototype of a web interface exploiting the Heat APIs, which can be tied to a specific application provided that ad-hoc Heat templates are available. The user can start an instance of the application without any knowledge of the underlying Cloud infrastructure and services. Moreover, the platform instance can be customized by choosing parameters related to the application, such as the size of a file system or the number of instances in a NoSQL DB cluster. As soon as the desired platform is running, the web interface offers the possibility to scale some of the infrastructure components.
By providing this abstraction layer, users have a simplified access to Cloud resources and data center administrators can limit the degrees of freedom granted to the users.
In this contribution we describe the design and implementation of the solution, driven by the application requirements, the details of the development of both the Heat templates and the web interface, together with possible exploitation strategies of this work in Cloud data centers.
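A minimal sketch of what such a web interface does behind the scenes, assuming the python-heatclient bindings, is given below; the endpoint, template file and parameter names are placeholders rather than the actual project templates:

    # Hedged sketch: load an ad-hoc Heat template, inject the user-chosen
    # parameters and create the stack through the Heat API.
    import yaml
    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from heatclient import client as heat_client

    # placeholder Keystone endpoint and credentials
    auth = v3.Password(auth_url="https://cloud.example.org:5000/v3",
                       username="demo", password="secret", project_name="demo",
                       user_domain_id="default", project_domain_id="default")
    heat = heat_client.Client("1", session=session.Session(auth=auth))

    def launch_platform(stack_name, fs_size_gb, db_nodes):
        """Create the application stack exposing only user-level parameters."""
        with open("application.yaml") as f:      # hypothetical ad-hoc Heat template
            template = yaml.safe_load(f)
        stack = heat.stacks.create(stack_name=stack_name,
                                   template=template,
                                   parameters={"volume_size": fs_size_gb,
                                               "db_cluster_size": db_nodes})
        return stack["stack"]["id"]              # stack id reported back to the user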