Conveners
Track 6: Infrastructures: 6.1
- Catherine Biscarat (LPSC Grenoble, IN2P3/CNRS)
Track 6: Infrastructures: 6.2
- Olof Barring (CERN)
Track 6: Infrastructures: 6.3
- Francesco Prelz (Università degli Studi e INFN Milano (IT))
Track 6: Infrastructures: 6.4
- Francesco Prelz (Università degli Studi e INFN Milano (IT))
Track 6: Infrastructures: 6.5
- Olof Barring (CERN)
Track 6: Infrastructures: 6.6
- Catherine Biscarat (LPSC Grenoble, IN2P3/CNRS)
Track 6: Infrastructures: 6.7
- Catherine Biscarat (LPSC Grenoble, IN2P3/CNRS)
Access to and exploitation of large-scale computing resources, such as those offered by general-purpose
HPC centres, is one important measure for ATLAS and the other Large Hadron Collider experiments
to meet the challenge posed by the full exploitation of future data within the constraints of flat budgets.
We report on the effort of moving the Swiss WLCG Tier-2 computing,
serving ATLAS, CMS...
Fifteen Chinese High Performance Computing sites, many of them on the TOP500 list of the most powerful supercomputers, are integrated into a common infrastructure that provides users coherent access through a RESTful interface called SCEAPI. These resources have been integrated into the ATLAS Grid production system using a bridge between ATLAS and SCEAPI which translates the...
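A minimal sketch of how such a bridge might talk to a RESTful gateway, assuming hypothetical endpoints, payload fields and token handling (the actual SCEAPI specification may differ):

```python
# Hypothetical sketch of a bridge submitting a job to a RESTful HPC gateway
# such as SCEAPI; the base URL, payload fields and response fields are
# assumptions for illustration, not the real SCEAPI specification.
import requests

GATEWAY = "https://sceapi.example.cn/api"   # placeholder base URL
TOKEN = "..."                                # placeholder auth token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def submit_job(executable, arguments, cores):
    """Translate a simplified job description into a REST submission."""
    payload = {"executable": executable, "arguments": arguments, "cores": cores}
    resp = requests.post(f"{GATEWAY}/jobs", json=payload,
                         headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["job_id"]          # assumed response field

def poll_status(job_id):
    """Ask the gateway for the current state of a submitted job."""
    resp = requests.get(f"{GATEWAY}/jobs/{job_id}", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["status"]          # assumed response field
```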
Obtaining CPU cycles on an HPC cluster is nowadays relatively simple and sometimes even cheap for academic institutions. However, in most cases providers of HPC services do not allow changes to the configuration, the implementation of special features, or lower-level control over the computing infrastructure and networks, for example for testing new computing patterns or conducting...
The Open Science Grid (OSG) is a large, robust computing grid that started primarily as a collection of sites associated with large HEP experiments such as ATLAS, CDF, CMS, and DZero, but has evolved in recent years to a much larger user and resource platform. In addition to meeting the US LHC community’s computational needs, the OSG continues to be one of the largest providers of distributed...
ALICE HLT Cluster operation during ALICE Run 2
(Johannes Lehrbach) for the ALICE collaboration
ALICE (A Large Ion Collider Experiment) is one of the four major detectors located at the LHC at CERN, focusing on the study of heavy-ion collisions. The ALICE High Level Trigger (HLT) is a compute cluster which reconstructs the events and compresses the data in real-time. The data compression...
Over the past years an increasing number of CMS computing resources have been offered as clouds, bringing the flexibility of having virtualised compute resources and centralised management of the Virtual Machines (VMs). CMS has adapted its job submission infrastructure from a traditional Grid site to operation using a cloud service and can meanwhile run all types of offline workflows. The cloud...
Brookhaven National Laboratory (BNL) anticipates significant growth in scientific programs with large computing and data storage needs in the near future and has recently re-organized support for scientific computing to meet these needs.
A key component is the enhanced role of the RHIC-ATLAS Computing Facility
(RACF) in support of high-throughput and high-performance computing (HTC and HPC) ...
The Worldwide LHC Computing Grid (WLCG) infrastructure
allows the use of resources from more than 150 sites.
Until recently the setup of the resources and the middleware at a site
were typically dictated by the partner grid project (EGI, OSG, NorduGrid)
to which the site is affiliated.
In recent years, however, changes in hardware, software, funding and
experiment computing requirements have...
The INFN CNAF Tier-1 computing center is composed of two main rooms containing IT resources and four additional locations hosting the technological infrastructure that provides electrical power and cooling to the facility. Power supply and continuity are ensured by a dedicated room with three 15,000 V to 400 V transformers in a separate part of the main building...
1. Statement
OpenCloudMesh has a very simple goal: to be an open and vendor agnostic standard for private cloud interoperability.
To address the YetAnotherDataSilo problem, a working group under the umbrella of the GÉANT Association has been created with the goal of ensuring neutrality and a clear context for this project.
All leading partners of the OpenCloudMesh project - GÉANT,...
The Tier-1 at CNAF is the main INFN computing facility, offering computing and storage resources to more than 30 different scientific collaborations including the 4 experiments at the LHC. A huge increase in computing needs is also foreseen in the following years, mainly driven by the experiments at the LHC (especially from the start of Run 3 in 2021) but also by other upcoming experiments...
The WLCG Tier-1 center GridKa is developed and operated by the Steinbuch Centre for Computing (SCC)
at the Karlsruhe Institute of Technology (KIT). It was the origin of further Big Data research activities and
infrastructures at SCC, e.g. the Large Scale Data Facility (LSDF), providing petabyte scale data storage
for various non-HEP research communities.
Several ideas and plans...
The KEK central computer system (KEKCC) supports various activities at KEK, such as the Belle / Belle II and J-PARC experiments. The system is currently being replaced and will be put into production in September 2016. The computing resources, CPU and storage, of the next system are much enhanced to match the recent increase in demand. We will have 10,000 CPU cores, 13 PB of disk storage,...
At the RAL Tier-1 we have been deploying production services on both bare metal and a variety of virtualisation platforms for many years. Despite the significant simplification of configuration and deployment of services due to the use of a configuration management system, maintaining services still requires a lot of effort. Also, the current approach of running services on static machines...
The HEP prototypical systems at the Supercomputing conferences each year have served to illustrate the ongoing state of the art developments in high throughput, software-defined networked systems important for future data operations at the LHC and for other data intensive programs. The Supercomputing 2015 SDN demonstration revolved around an OpenFlow ring connecting 7 different booths and the...
In today's world of distributed scientific collaborations, there are many challenges to providing reliable inter-domain network infrastructure. Network operators use a combination of
active monitoring and trouble tickets to detect problems, but these are often ineffective at identifying issues that impact wide-area network users. Additionally, these approaches do not scale to wide area...
The Open Science Grid (OSG) relies upon the network as a critical part of the distributed infrastructures it enables. In 2012 OSG added a new focus area in networking with a goal of becoming the primary source of network information for its members and collaborators. This includes gathering, organizing and providing network metrics to guarantee effective network usage and prompt detection and...
The fraction of internet traffic carried over IPv6 continues to grow rapidly. IPv6 support from network hardware vendors and carriers is pervasive and becoming mature. A network infrastructure upgrade often offers sites an excellent window of opportunity to configure and enable IPv6.
There is a significant overhead when setting up and maintaining dual stack machines, so where possible...
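As a concrete illustration of dual-stack behaviour, the following minimal Python sketch resolves a host with getaddrinfo and tries the returned IPv6 and IPv4 addresses in order; the hostname is a placeholder:

```python
# Minimal illustration of dual-stack client behaviour: on a dual-stack host,
# socket.getaddrinfo returns both IPv6 (AF_INET6) and IPv4 (AF_INET)
# addresses, and the client simply tries them in the order given.
import socket

def connect_dual_stack(host, port):
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            s = socket.socket(family, socktype, proto)
            s.connect(addr)
            return s                      # first working address wins
        except OSError as err:
            last_err = err
    if last_err is None:
        raise OSError("no addresses returned for host")
    raise last_err

# sock = connect_dual_stack("www.example.org", 80)   # placeholder host
```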
Over the last few years, the number of mobile devices connected to the CERN internal network has increased from a handful in 2006 to more than 10,000 in 2015. Wireless access is no longer a “nice to have” or just for conference and meeting rooms, now support for mobility is expected by most, if not all, of the CERN community. In this context, a full renewal of the CERN Wi-Fi network has been...
RapidIO (http://rapidio.org/) technology is a packet-switched high-performance fabric, which has been under active development since 1997. Originally meant to be a front side bus, it developed into a system-level interconnect which is today used in all 4G/LTE base stations worldwide. RapidIO is often used in embedded systems that require high reliability, low latency and scalability in a...
HPC network technologies like Infiniband, TrueScale or OmniPath provide
low-latency and high-throughput communication between hosts, which makes them
attractive options for data-acquisition systems in large-scale high-energy
physics experiments. Like HPC networks, data acquisition networks are local
and include a well specified number of systems. Unfortunately traditional...
In recent years there has been increasing use of HPC facilities for HEP experiments. This has initially focussed on less I/O intensive workloads such as generator-level or detector simulation. We now demonstrate the efficient running of I/O-heavy ‘analysis’ workloads for the ATLAS and ALICE collaborations on HPC facilities at NERSC, as well as astronomical image analysis for DESI.
To do...
Southeast University Science Operation Center (SEUSOC) is one of the computing centers of the Alpha Magnetic Spectrometer (AMS-02) experiment. It provides 2000 CPU cores for AMS scientific computing and a dedicated 1 Gbps Long Fat Network (LFN) for AMS data transmission between SEU and CERN. In this paper, the workflows of SEUSOC Monte Carlo (MC) production are discussed in...
With processor architecture evolution, the HPC market has undergone a paradigm shift. The adoption of low-cost, Linux-based clusters extended HPC’s reach from its roots in modeling and simulation of complex physical systems to a broad range of industries, from biotechnology, cloud computing, computer analytics and big data challenges to manufacturing sectors. In this perspective, the near...
This contribution reports on the remote evaluation of pre-production Intel Omni-Path (OPA) interconnect hardware and software performed by the RHIC & ATLAS Computing Facility (RACF) at BNL between December 2015 and February 2016, using a 32-node “Diamond” cluster with a single Omni-Path Host Fabric Interface (HFI) installed in each node and a single 48-port Omni-Path switch with the non-blocking...
Over the past several years, rapid growth of data has affected many fields of science. This has often resulted in the need for overhauling or exchanging the tools and approaches in the disciplines’ data life cycles, allowing the application of new data analysis methods and facilitating improved data sharing.
The project Large-Scale Data Management and Analysis (LSDMA) of the German Helmholtz...
SWAN is a novel service to perform interactive data analysis in the cloud. SWAN allows users to write and run their data analyses with only a web browser, leveraging the widely-adopted Jupyter notebook interface. The user code, executions and data live entirely in the cloud. SWAN makes it easier to produce and share results and scientific code, access scientific software, produce tutorials and...
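A minimal sketch of the kind of notebook cell a SWAN user might run, assuming a PyROOT-based analysis; which software stacks are actually available depends on the LCG release selected in SWAN:

```python
# The kind of cell a user might execute in a SWAN Jupyter notebook: a small
# PyROOT example that fills and draws a histogram. The available packages
# depend on the software stack chosen when the SWAN session is started.
import ROOT

h = ROOT.TH1F("h", "Gaussian toy;x;entries", 100, -4, 4)
h.FillRandom("gaus", 10000)   # fill with 10k values sampled from a Gaussian

c = ROOT.TCanvas("c")
h.Draw()
c.Draw()                      # in a notebook the canvas is rendered inline
```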
Open City Platform (OCP) is an industrial research project funded by the Italian Ministry of University and Research, started in 2014. It aims to research, develop and test new open, interoperable, on-demand technological solutions in the field of Cloud Computing, along with new sustainable organizational models for public administration, in order to innovate, with scientific results,...
Apache Mesos is a resource management system for large data centres, initially developed by UC Berkeley and now maintained under the Apache Foundation umbrella. It is widely used in industry by companies like Apple, Twitter, and Airbnb, and is known to scale to tens of thousands of nodes. Together with other tools of its ecosystem, like Mesosphere Marathon or Chronos, it provides an end-to-end...
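A hedged sketch of how a long-running service could be registered with Marathon through its REST API; the Marathon URL and the application definition below are illustrative placeholders, not a recommended production setup:

```python
# Illustrative sketch of registering a long-running service with Marathon
# via its REST API (POST /v2/apps); endpoint and app definition are
# placeholders chosen for the example.
import requests

MARATHON = "http://marathon.example.org:8080"   # placeholder endpoint

app = {
    "id": "/demo/web-service",                  # app id in Marathon's namespace
    "cmd": "python3 -m http.server $PORT0",     # command Marathon keeps running
    "cpus": 0.25,                               # fraction of a CPU per instance
    "mem": 128,                                 # MB per instance
    "instances": 2,                             # Marathon maintains two copies
}

resp = requests.post(f"{MARATHON}/v2/apps", json=app, timeout=30)
resp.raise_for_status()
print(resp.json().get("deployments"))
```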
Clouds and virtualization are typically used in computing centers to satisfy diverse needs: different operating systems, software releases or fast delivery of servers and services. On the other hand, solutions relying on Linux kernel capabilities such as Docker are well suited for application isolation and software development. In our previous work (Docker experience at INFN-Pisa Grid Data Center*) we...
Bringing HEP computing to HPC can be difficult. Software stacks are often very complicated with numerous dependencies that are difficult to get installed on an HPC system. To address this issue, amongst others, NERSC has created Shifter, a framework that delivers Docker-like functionality to HPC. It works by extracting images from native formats (such as a Docker image) and converting them to...
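A rough sketch of how a job might drive Shifter, assuming the shifterimg/shifter command-line tools; the exact options can differ between Shifter versions and site installations, so treat the invocations as illustrative:

```python
# Rough sketch of driving Shifter from a Python wrapper: pull a Docker image
# into Shifter's image gateway, then run a command inside it. The exact
# command-line interface may vary between Shifter versions and sites, so
# these invocations should be checked against local documentation.
import subprocess

IMAGE = "docker:python:3.6"   # placeholder image name

# Convert the Docker image into Shifter's flattened format (assumed CLI).
subprocess.run(["shifterimg", "pull", IMAGE], check=True)

# Run a command inside the converted image (assumed CLI).
subprocess.run(["shifter", f"--image={IMAGE}", "python", "--version"],
               check=True)
```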
COTS HPC has evolved for two decades to become an undeniable mainstream computing solution. It represents a major shift away from yesterday’s proprietary, vector-based processors and architectures to modern supercomputing clusters built on open industry standard hardware. This shift enabled the Industry with a cost-effective path to high-performance, scalable and flexible supercomputers (from...
With the imminent upgrades to the LHC and the consequent increase in the amount and complexity of data collected by the experiments, CERN's computing infrastructures will be facing a large and challenging demand for computing resources. Within this scope, the adoption of cloud computing at CERN has been evaluated and has opened the door to procuring external cloud services from providers,...
INDIGO-DataCloud (INDIGO for short, https://www.indigo-datacloud.eu) is a project started in April 2015, funded under the EC Horizon 2020 framework program. It includes 26 European partners located in 11 countries and addresses the challenge of developing open source software, deployable in the form of a data/computing platform, aimed to scientific communities and designed to be deployed on...
The INDIGO-DataCloud project's ultimate goal is to provide a sustainable European software infrastructure for science, spanning multiple computer centers and existing public clouds.
The participating sites form a set of heterogeneous infrastructures, some running OpenNebula, some running OpenStack. There was the need to find a common denominator for the deployment of both the required PaaS...
JUNO (Jiangmen Underground Neutrino Observatory) is a multi-purpose neutrino experiment designed to measure the neutrino mass hierarchy and mixing parameters. JUNO is expected to start operation in 2019 with a raw data rate of 2 PB/year. The IHEP computing center plans to build up a virtualization infrastructure to manage computing resources in the coming years, and JUNO has been selected to be one of the...
When first looking at converting a part of our site’s grid infrastructure into a cloud based system in late 2013 we needed to ensure the continued accessibility of all of our resources during a potentially lengthy transition period.
Moving a limited number of nodes to the cloud proved ineffective as users expected a significant number of cloud resources to be available to justify the effort...
Randomly restoring files from tape degrades read performance, primarily due to frequent tape mounts. The high latency of the time-consuming tape mounts and dismounts is a major issue when accessing massive amounts of data from tape storage. BNL's mass storage system currently holds more than 80 PB of data on tape, managed by HPSS. To restore files from HPSS, we make use of a scheduler...
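A simplified sketch of the ordering idea behind such a scheduler, with an illustrative data structure rather than the actual HPSS or scheduler interface:

```python
# Simplified illustration of ordering tape restores: group requested files
# by tape and sort them by position on the tape, so each tape is mounted
# once and read sequentially. The FileRequest structure is illustrative,
# not the actual HPSS or scheduler interface.
from collections import defaultdict
from typing import NamedTuple, List

class FileRequest(NamedTuple):
    path: str
    tape_id: str
    position: int     # file's ordinal/offset on the tape

def order_restores(requests: List[FileRequest]) -> List[FileRequest]:
    by_tape = defaultdict(list)
    for req in requests:
        by_tape[req.tape_id].append(req)
    ordered = []
    for tape_id in sorted(by_tape):              # one mount per tape
        ordered.extend(sorted(by_tape[tape_id],  # sequential reads per tape
                              key=lambda r: r.position))
    return ordered
```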
The Pacific Research Platform is an initiative to interconnect Science DMZs between campuses across the West Coast of the United States over a 100 Gbps network. The LHC @ UC is a proof-of-concept pilot project that focuses on interconnecting 6 University of California campuses. It is spearheaded by computing specialists from the UCSD Tier 2 Center in collaboration with the San Diego...
We describe the development and deployment of a distributed campus computing infrastructure consisting of a single job submission portal linked to multiple local campus resources, as well as the wider computational fabric of the Open Science Grid (OSG). Campus resources consist of existing OSG-enabled clusters and clusters with no previous interface to the OSG. Users accessing the single...
The Global Science experimental Data hub Center (GSDC) at the Korea Institute of Science and Technology Information (KISTI), located in Daejeon, South Korea, is the only data center in the country that provides computing resources to help fundamental research fields deal with large-scale data. For historical reasons it has run the Torque batch system, while recently it has started running HTCondor for...
We present the consolidated batch system at DESY. As one of the largest resource centres, DESY has to support differing workflows of HEP experiments in WLCG or Belle II as well as local users. By abandoning specific worker-node setups in favour of generic flat nodes with middleware resources provided via CVMFS, we gain the flexibility to subsume different use cases in a homogeneous environment. ...
Traditionally, the RHIC/ATLAS Computing Facility (RACF) at Brookhaven National Laboratory has only maintained High Throughput Computing (HTC) resources for our HEP/NP user community. We've been using HTCondor as our batch system for many years, as this software is particularly well suited for managing HTC processor farm resources. Recently, the RACF has also begun to design/administrate some...
In order to estimate the capabilities of a computing slot with limited processing time, it is necessary to know its “power” with rather good precision. This allows, for example, a pilot job to match a task for which the required CPU work is known, or to define the number of events to be processed knowing the CPU work per event. Otherwise one always runs the risk that the task is aborted because...
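A worked example of this arithmetic, with illustrative units and an assumed safety margin:

```python
# Worked example of the arithmetic described above: from the remaining time
# of a slot, its benchmarked power and the CPU work needed per event, derive
# how many events can safely be processed. Units and the safety margin are
# illustrative choices, not a prescription from the contribution.
def events_that_fit(time_left_s, slot_power, work_per_event, margin=0.8):
    """
    time_left_s    : remaining wall-clock time of the slot, in seconds
    slot_power     : benchmark units of CPU work delivered per second
    work_per_event : benchmark units of CPU work needed for one event
    margin         : safety margin so the job is not killed at the limit
    """
    total_work = time_left_s * slot_power * margin
    return int(total_work // work_per_event)

# e.g. a 24 h slot of power 10, with 250 work-units per event:
# events_that_fit(24 * 3600, 10, 250) -> 2764
```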