Through participation in the Community Cluster Program of Purdue University, our Tier-2 center has for many years been one of the most productive and reliable sites for CMS computing, providing both dedicated and opportunistic resources to the collaboration. In this report we will present an overview of the site, review the successes and challenges of the last year of operation, and outline...
We will present an update on our site since the Fall 2017 report, covering our changes in software, tools and operations.
Details to be covered include the enabling of IPv6 for all of our AGLT2 nodes, our migration to SL7, exploration of the use of Bro/MISP at the UM site, the use of Open vSwitch on our dCache storage, and information about our newest hardware purchases and deployed...
Updates from T2_US_Nebraska covering our experiences operating CentOS 7 + Docker/Singularity, dabbling with SDN to improve HEP transfers, involvement with the Open Science Grid, and trying to live the IPv6 dream.
PDSF, the Parallel Distributed Systems Facility, was moved to Lawrence Berkeley National Lab from Oakland, CA in 2016. The cluster has been in continuous operation since 1996 serving high energy physics research. The cluster is a Tier-1 site for STAR, a Tier-2 site for ALICE and a Tier-3 site for ATLAS.
This site report will describe lessons learned and challenges met when migrating from...
The computing center of IHEP maintains an HTC cluster with 10,000 CPU cores and a site comprising about 15,000 CPU cores and more than 10 PB of storage. The presentation will cover the site's recent progress and IHEP's next plans.
What do our users want?
One group wants the latest version of foo, but the stable version of bar.
The other group wants the latest version of bar, but the old version of foo.
What have we tried?
SCL
SCLs are great in theory. But in practice they are hard on the packagers. They also force developers to jump through several hoops. If something was developed in an SCL environment, it...
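To make the mechanism concrete, here is a minimal Python sketch of what an SCL-style `scl enable` wrapper effectively does: prepend a collection's root to the relevant environment paths before launching a command. The collection name and the `/opt/rh` layout are illustrative assumptions, not taken from the talk.

```python
import os
import subprocess

def run_in_collection(collection, command):
    """Run `command` with an SCL-style collection prepended to the
    environment, roughly what `scl enable <collection> <command>` does.
    The /opt/rh layout here is an assumption for illustration."""
    root = f"/opt/rh/{collection}/root"
    env = os.environ.copy()
    env["PATH"] = f"{root}/usr/bin:" + env.get("PATH", "")
    env["LD_LIBRARY_PATH"] = (f"{root}/usr/lib64:"
                              + env.get("LD_LIBRARY_PATH", ""))
    return subprocess.run(command, env=env)

# e.g. run a newer interpreter without touching the system default:
# run_in_collection("rh-python36", ["python", "--version"])
```

This side-by-side install is exactly what lets one group keep the stable version of foo while another gets the latest bar, at the cost of the packaging and environment hoops described above.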
Updates on the status of Scientific Linux
CC-IN2P3 is one of the largest academic data centers in France. Its main mission is to provide the particle, astroparticle and nuclear physics community with IT services, including large-scale compute and storage capacities. We are a partner for dozens of scientific experiments and hundreds of researchers who make daily use of these resources. The CC-User Portal project's goal is to develop...
Trident is a tool that uses low-level metrics derived from hardware counters to understand core, memory and I/O utilisation and bottlenecks. Collecting time series of these low-level counters does not induce significant overhead on the execution of the application. The Understanding Performance team is investigating a new node characterisation tool, Trident, that can look at various...
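As a rough illustration of how such counters can be sampled from outside an application, here is a sketch that wraps the Linux `perf stat` tool from Python; the event names are common examples, and this is not Trident's actual collection mechanism.

```python
import subprocess

def sample_counters(command, events=("cycles", "instructions", "cache-misses")):
    """Run `command` under Linux `perf stat` in CSV mode (-x,) and
    return the counter totals; perf writes its stats to stderr."""
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(events)] + list(command),
        capture_output=True, text=True,
    )
    counters = {}
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].isdigit():
            counters[fields[2]] = int(fields[0])   # value, unit, event, ...
    return counters

# e.g. sample_counters(["sleep", "1"])
```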
WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area is a partner of the WLCG effort and is focused on being the primary source of networking information for its partners and...
For several years the HEPiX IPv6 Working Group has been testing WLCG services to ensure their IPv6 compliance. The transition of WLCG central and storage services to dual-stack IPv4/IPv6 is progressing well, thus enabling the use of IPv6-only CPU resources as agreed by the WLCG Management Board and presented by us at previous HEPiX meetings.
By April 2018, all WLCG Tier 1 data centres have...
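A minimal example of the kind of check involved, assuming nothing about the working group's actual test suite: using Python's standard library to verify that a service resolves over both IPv4 and IPv6.

```python
import socket

def dual_stack_status(host, port=443):
    """Report whether `host` resolves to IPv4 and/or IPv6 addresses.
    A service must publish AAAA records (and listen on them) to be
    usable from IPv6-only CPU resources."""
    families = set()
    for family, *_ in socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP):
        families.add(family)
    return {
        "ipv4": socket.AF_INET in families,
        "ipv6": socket.AF_INET6 in families,
    }

# e.g. dual_stack_status("www.cern.ch")
```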
Recently, we've deployed IPv6 for the CMS dCache instance at KIT. We've run into a number of interesting problems with the IPv6 setup we had originally chosen. The presentation will detail the lessons we've learned and the resulting redesign of our IPv6 deployment strategy.
This presentation provides an update on the global security landscape since the last HEPiX meeting. It describes the main vectors of risks to and compromises in the academic community including lessons learnt, presents interesting recent attacks while providing recommendations on how to best protect ourselves. It also covers security risks management in general, as well as the security aspects...
News about what happened at DESY during the last few months
News from CERN since the HEPiX Fall 2017 workshop at KEK, Tsukuba, Japan.
A brief update on the INFN-T1 site: our current status and what remains to be done to reach 100% functionality
News from PIC since the HEPiX Fall 2017 workshop at KEK, Tsukuba, Japan.
Recently we deployed a new cluster with worker nodes with 10 Gbps network connections, and new disk servers for DPM and xrootd. I will also discuss the migration from Torque/Maui to the HTCondor batch system.
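For sites making a similar move, job submission translates fairly directly into HTCondor's submit language; a minimal sketch using the HTCondor Python bindings (the classic transaction API), assuming the `htcondor` module is installed, with a toy job description rather than this site's real configuration:

```python
import htcondor

# Describe a job much as a Torque script's #PBS directives would,
# but as HTCondor submit-language key/value pairs.
job = htcondor.Submit({
    "executable": "/bin/echo",
    "arguments": "hello from HTCondor",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
    "request_cpus": "1",
})

schedd = htcondor.Schedd()            # local scheduler daemon
with schedd.transaction() as txn:     # classic transaction API
    cluster_id = job.queue(txn)
print("submitted cluster", cluster_id)
```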
There are many ongoing activities related to the development and deployment of Federated Identities and AAI (Authentication and Authorisation Infrastructures) in research communities and cyber Infrastructures including WLCG and others. This talk will give a high-level overview of the status of at least some of the current activities in FIM4R, AARC, WLCG and elsewhere.
High Energy Physics (HEP) experiments have greatly benefited from a strong relationship with Research and Education (R&E) network providers and, thanks to projects such as LHCOPN/LHCONE and REN contributions, have enjoyed significant capacities and high performance networks for some time. RENs have been able to continually expand their capacities to over-provision the networks relative to...
The Belle II detector is already taking cosmic-ray test data and is about to record beam data. Network connectivity is becoming more important for this experiment than for any other at KEK: not only for data transfer, but also for researchers monitoring detector conditions from off-site.
We will report the present status of the campus network and the upgrade plan in...
Chinese Academy of Sciences has 104 research institutes, 12 branch academies, three universities and 11 supporting organizations in 23 provincial-level areas throughout the country. These institutions are home to more than 100 national key labs and engineering centers as well as nearly 200 CAS key labs and engineering centers. Altogether, CAS comprises 1,000 sites and stations across the...
A presentation of the network status at IHEP and of LHCONE progress in China
Scientific activities generate huge volumes of data that need to be transferred to other places for research. Traditional networking infrastructure has a fixed architecture and cannot satisfy such real-time, high-quality transfer requirements.
China Science and Technology Network (CSTNet) was constructed to meet the needs of the research institutes under the Chinese Academy of Sciences and...
A report from the OpenAFS Release Team on recent OpenAFS releases, including the OpenAFS 1.8.0 release, the first major release in several years. Topics include acknowledgement of contributors, descriptions of recently resolved issues, and a discussion of commits under review for post-1.8.0.
We would like to have one of the Board members of The OpenAFS Foundation, Inc., speak about this 501(c)(3), US-based, non-profit organization dedicated to fostering the stability and growth of OpenAFS, an open-source implementation of the AFS distributed network filesystem. The OpenAFS Foundation adopted a three-fold mission: to attract and increase the community of OpenAFS users, to foster...
The group has been formed to tackle two main themes:
- establish a knowledge-sharing community for those operating archival storage for WLCG
- understand how to monitor usage of archival systems and optimise their exploitation by experiments
I will report on the recent activities of this group.
The computing center GridKa serves the ALICE, ATLAS, CMS and LHCb experiments as one of the biggest WLCG Tier-1 centers worldwide, with compute and storage resources. It is operated by the Steinbuch Centre for Computing at Karlsruhe Institute of Technology in Germany. In April 2017 a new online storage system was put into operation. In its current stage of expansion it offers the HEP...
CERN IT Storage (IT/ST) group leads the development and operation of large-scale services based on EOS for the full spectrum of use cases at CERN and in the HEP community. The IT/ST group also provides storage for other internal services, such as OpenStack, using a solution based on Ceph. In this talk we present the current operational status, ongoing development work and a future architecture outlook...
Last May it was announced that "AFS" was awarded the [2016 ACM System Software Award][4]. This presentation will discuss the current state of the AFS file system family, including:
- IBM AFS 3.6
- [OpenAFS][1]
- [kAFS][2]
- [AuriStor File System][3]
IBM AFS 3.6 is a commercial product no longer publicly available.
OpenAFS is a fork of IBM AFS 3.6 available under the [IBM Public...
The ever-decreasing cost of high-capacity spinning media has resulted in a trend towards very large capacity storage ‘building blocks’. Large numbers of disks (up to 60 drives per enclosure is more-or-less standard) allow for dense solutions, maximizing storage capacity in terms of floor space, and can in theory be packed almost exclusively with disks. The result is building...
After the successful adoption of the CMS Federation, an opportunity arose to cache xrootd requests in Southern California. We present the operational challenges and lessons learned from scaling a federated cache (a cache composed of several independent nodes) first at UCSD, and then the scaling and network challenges of augmenting it to include the Caltech Tier 2 site, in what would be a first of...
One future model of software deployment and configuration is containerization.
AFS has been used for software distribution for many decades. Its global file namespace, the @sys path component substitution macro which permits file paths to be platform-agnostic, and the atomic publication model ("vos release") have proven to be critical components of successful software distribution systems...
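To illustrate the @sys mechanism (not OpenAFS's actual implementation), here is a small sketch that expands an @sys component into a per-platform sysname, the way the AFS client effectively does on path lookup; the naming scheme is simplified for illustration.

```python
import platform

def afs_sysname():
    """Build an AFS-style sysname such as 'amd64_linux'; real AFS
    sysnames are more specific, this is a simplified stand-in."""
    arch = {"x86_64": "amd64"}.get(platform.machine(), platform.machine())
    return f"{arch}_{platform.system().lower()}"

def resolve_at_sys(path, sysname=None):
    """Expand the @sys component so one platform-agnostic path, e.g.
    /afs/example.org/sw/@sys/bin, points at per-platform binaries."""
    return path.replace("@sys", sysname or afs_sysname())

# resolve_at_sys("/afs/example.org/sw/@sys/bin/tool")
#   -> "/afs/example.org/sw/amd64_linux/bin/tool" (on x86_64 Linux)
```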
The benchmarking working group holds biweekly meetings. We are focusing on the health of HS06, on fast benchmarks, and on the study of a new benchmark to replace HS06, since SPEC has moved to a new family of benchmarks.
The working group has been established and is now working towards a cost and performance model that allows quantitative estimates of the computing resources needed for HL-LHC and maps them to the costs at specific sites.
The group has defined a short- and medium-term plan and identified the main tasks. Around these tasks, teams with members from experiments and sites have formed and started...
Computing is changing at BNL: we will discuss how we are restructuring our Condor pools, integrating them with new tools like Jupyter notebooks, and connecting them with other resources such as HPC systems run with Slurm.
The batch facilities at DESY are currently being significantly enlarged while, at the same time, being partly migrated from SGE to HTCondor.
This is a short overview of what is going on on site in terms of GRID-, local- and HPC cluster development.
At the last HEPiX meeting we described the results of a proof-of-concept study to run batch jobs on EOS disk server nodes. We have since moved towards a production-level configuration, and the first pre-production nodes have been set up. Besides its relevance for CERN, this is also a more general step towards a hyper-converged infrastructure.
Techlab, a CERN IT project, is a hardware lab providing experimental systems and benchmarking data for the HEP community.
Techlab is constantly on the lookout for new trends in HPC, cutting-edge technologies and alternative architectures, in terms of CPUs and accelerators.
We believe that in the long run, a diverse offering and healthy competition in the HPC market will serve science in...
The goal of the HTCondor team is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources. Increasingly, the work performed by the HTCondor developers is being driven by its partnership with the High Energy Physics (HEP) community.
This talk will present recent changes...
PDSF, the Parallel Distributed Systems Facility, has been in continuous operation since 1996 serving high energy physics research. It is currently a Tier-1 site for STAR, a Tier-2 site for ALICE and a Tier-3 site for ATLAS. We are in the process of migrating the PDSF workload from a commodity cluster to Cori, a Cray XC40 system. The process will involve preparing containers that will allow PDSF...
For the past 10 years, CSCS has been providing computational resources for the ATLAS, CMS, and LHCb experiments on a standard commodity cluster.
The High Luminosity LHC upgrade (HL-LHC) presents new challenges and demands with a predicted 50x increase in computing needs over the next 8 to 10 years. High Performance Computing capabilities could help to equalize the computing demands due to...
HPL and HPCG benchmarks have been conducted on Brookhaven National Laboratory SDCC clusters and various generations of Linux Farm nodes and compared with HS06 results. While HPL results align more closely with CPU/GPU performance, HPCG results are affected by memory performance as well.
In this work, we present a fast implementation of analytical image reconstruction from projections, using the so-called "backprojection-slice theorem" (BST). BST is able to produce reliable image reconstructions in a reasonable amount of time, before further decisions are taken. The BST is easy to implement and can be used to make fast decisions about the quality of the measurement,...
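As background for this family of methods, here is a toy numpy sketch of the Fourier slice theorem on which backprojection-slice approaches build: the 1-D FFT of a projection at angle theta equals a radial slice of the image's 2-D FFT. The nearest-neighbour gridding and centring handling are deliberately crude; this is not the authors' BST implementation.

```python
import numpy as np

def fourier_slice_reconstruct(sinogram, thetas):
    """Toy direct-Fourier reconstruction: grid the 1-D FFT of each
    projection onto the 2-D Fourier plane along its angle, then
    invert. Crude nearest-neighbour gridding, illustration only."""
    n_angles, n = sinogram.shape
    freqs = np.fft.fftfreq(n)                  # radial frequencies
    plane = np.zeros((n, n), dtype=complex)
    counts = np.zeros((n, n))
    for theta, proj in zip(thetas, sinogram):
        slice_fft = np.fft.fft(proj)           # slice of the 2-D FFT
        ix = np.round(freqs * np.cos(theta) * n).astype(int) % n
        iy = np.round(freqs * np.sin(theta) * n).astype(int) % n
        plane[iy, ix] += slice_fft
        counts[iy, ix] += 1
    nz = counts > 0
    plane[nz] /= counts[nz]                    # average duplicate bins
    return np.fft.ifft2(plane).real

# e.g. for a sinogram of shape (180, 256) over angles 0..pi:
# img = fourier_slice_reconstruct(sino, np.linspace(0, np.pi, 180))
```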
When monitoring an increasing number of machines, infrastructure and tools need to be rethought. A new tool, ExDeMon, for detecting anomalies and raising actions, has been developed to perform well on this growing infrastructure. Considerations from the development and implementation will be shared.
Daniel has been working at CERN for more than 3 years as a Big Data developer; he has been...
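ExDeMon's internals are not described here, but the core idea of streaming anomaly detection over metrics can be sketched with a rolling z-score; everything below is a generic illustration, not ExDeMon's API.

```python
from collections import deque
import math

class RollingAnomalyDetector:
    """Flag metric values that deviate strongly from a rolling window,
    a generic stand-in for one kind of ExDeMon-style detector."""
    def __init__(self, window=60, threshold=4.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        flagged = False
        if len(self.values) >= 10:             # need some history first
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                flagged = True                 # raise an action here
        self.values.append(value)
        return flagged
```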
BNL is planning a new on-site data center for its growing portfolio of programs in need of scientific computing support. This presentation will provide an update on the status and plans for this new data center.
In the scope of the Wigner Datacenter cloud project, we are consolidating our network equipment. According to our plans, we would like to purchase 100 Gbps datacenter switches in order to anticipate our current and future needs. We need an automated, vendor-neutral and easily operable network. This presentation highlights our requirements and design goals, and the candidates we have tested in our lab. We take...
On November 9 2017, a major flood occurred in the computing rooms: this turned into a downtime of all services for a prolonged period of time.
In this talk we will go through all the issues we faced in order to recover the services in the quickest and most efficient way; we will analyze the incident in detail, along with all the steps taken to recover the computing rooms, electrical power,...
A short review of how technology and markets have evolved in areas relevant for HEP computing
Following up from abstract #117, a proposal to form a working group dedicated to technology watch
The BNL Scientific Data and Computing Center (SDCC) has begun to deploy a user analysis portal based on Jupyterhub. The Jupyter interfaces have back-end access to the Atlas compute farm via Condor for data analysis, and to the GP-GPU resources on the Institutional Cluster via Slurm, for machine learning applications. We will present the developing architecture of this system, current use...
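While the BNL architecture details are in the talk, the general pattern of pointing JupyterHub at a batch system can be illustrated with a configuration fragment like the following (JupyterHub configuration is itself Python); it assumes the third-party batchspawner package, and the partition and resource values are placeholders, not BNL's settings.

```python
# jupyterhub_config.py (fragment): spawn each user's notebook server
# as a Slurm batch job via the third-party batchspawner package.
# JupyterHub provides the `c` configuration object when loading this file.
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"
c.SlurmSpawner.req_partition = "gpu"    # placeholder partition name
c.SlurmSpawner.req_runtime = "8:00:00"  # session wall time
c.SlurmSpawner.req_memory = "4gb"
```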
As the complexity and scale of systems increase, so does the amount of system-level data recorded.
Managing the vast amounts of log data is a challenge that CSCS solved with the introduction of a centralized log and metrics infrastructure based on Elasticsearch, Graylog, Kibana, and Grafana.
This is a fundamental service at CSCS that provides easy correlation...
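For concreteness, shipping a structured log event into such a stack can be as simple as an HTTP POST to Elasticsearch's document API; the endpoint and index name below are illustrative assumptions, not CSCS's real endpoints, and production sites would normally go through a shipper such as Graylog or Logstash.

```python
import datetime
import requests

def ship_log(message, host, level="INFO",
             es_url="http://localhost:9200", index="syslog"):
    """Index one log event in Elasticsearch via its REST _doc API.
    URL and index name are placeholders for illustration."""
    doc = {
        "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "host": host,
        "level": level,
        "message": message,
    }
    r = requests.post(f"{es_url}/{index}/_doc", json=doc, timeout=5)
    r.raise_for_status()
    return r.json()["_id"]

# ship_log("disk /dev/sda predicted to fail", host="node042")
```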
Since early 2017, the MONIT infrastructure provides services for monitoring the CERN data centre, together with the WLCG grid resources, and progressively replaces in-house technologies, such as LEMON and SLS, using consolidated open source solutions for monitoring and alarms.
The infrastructure collects data from more than 30k data centre hosts in Meyrin and Wigner sites, with a total...
In the Autumn of 2016 the Nikhef data processing facility (NDPF) found itself at a junction on the road of configuration management. The NDPF was one of the early adopters of Quattor, which has served us well since the early days of the Grid. But where grid deployments were once uniquely complex enough to require the likes of Quattor, nowadays a plethora of configuration systems has cropped up to...
In the past, we have developed a number of smaller and larger tools to help with various aspects of Linux administration at DESY.
We present (some) of them in this talk.
An incomplete list is:
- Two-Factor-Authentication
- Timeline repositories
- Making Kernel upgrade notifications (more) audit safe
- Fail2ban (the idea behind this last item is sketched below)
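The core idea of the last item (scan an auth log for repeated failures, then ban the offending address) fits in a few lines; a generic sketch with the log pattern, threshold and ban action as placeholder assumptions, not DESY's tooling:

```python
import re
from collections import Counter

FAILED = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")

def find_offenders(log_path="/var/log/auth.log", threshold=5):
    """Count failed SSH logins per source IP and return those over
    the threshold; a real tool would then add a firewall rule."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            m = FAILED.search(line)
            if m:
                counts[m.group(1)] += 1
    return [ip for ip, n in counts.items() if n >= threshold]
```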
The interest in using Big Data solutions based on the Hadoop ecosystem is constantly growing in the HEP community. This drives the need for increased reliability and availability of the central Hadoop service and the underlying infrastructure provided to the community by the CERN IT department.
This contribution will report on the overall status of the Hadoop platform and the recent enhancements and...
CERN runs a private OpenStack cloud with ~300K cores, ~3,000 users and a number of OpenStack services. CERN users can build services from a pool of compute and storage resources using OpenStack APIs such as Ironic, Nova, Magnum, Cinder and Manila; on the other hand, CERN cloud operators face operational challenges at scale in order to offer them. In this talk, you will learn about the...
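As a flavour of the user-facing side, booting a server through those APIs takes only a few lines with the openstacksdk client; the cloud profile, image and flavor names below are placeholders, not CERN's.

```python
import openstack

# Credentials and region come from clouds.yaml or OS_* environment
# variables; "mycloud" is a placeholder profile name.
conn = openstack.connect(cloud="mycloud")

server = conn.create_server(
    name="demo-vm",
    image="CC7 - x86_64",     # placeholder image name
    flavor="m2.medium",       # placeholder flavor
    wait=True,                # block until the server is ACTIVE
)
print(server.name, server.status)
```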
The Helix Nebula Science Cloud (HNSciCloud) Horizon 2020 Pre-Commercial Procurement project (http://www.hnscicloud.eu/) brings together a group of 10 research organisations to procure innovative cloud services from commercial providers to establish a cloud platform for the European research community.
This 3-year project has recently entered its final phase, which will deploy two pilots with a...
Virtual machines are the technology that formed the modern clouds, private and public; however, physical machines are coming back in a more cloud-like way. Cloud providers are offering APIs for bare-metal server provisioning on demand, and users are leveraging containers for isolation and reproducible deployments. In this talk, I will present one of the newest services in the CERN cloud, Ironic,...
As our OpenStack cloud enters full production, we give an overview of the design and how it leverages the RAL Tier 1 infrastructure & support. We also present some of the new use cases and science being enabled by the cloud platform.
We are seeing an increasingly wide variety of uses being made of Hybrid Cloud (and Grid!) computing technologies at STFC. This talk will focus on the services being delivered to end users and novel integrations with existing local compute and data infrastructure.
Cloud computing enables flexible resource provisioning on demand. Through collaboration with the National Institute of Informatics (NII) in Japan, we have been integrating our local batch job system with clouds to expand its computing resources and provide heterogeneous clusters dynamically. In this talk, we will introduce our hybrid batch job system, which can dispatch jobs to provisioned...
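The dispatch logic of such a hybrid system can be sketched generically: watch the queue, provision cloud nodes when demand exceeds local capacity, and release them when idle. Everything below (object interfaces, function names, thresholds) is a hypothetical illustration, not the system described in the talk.

```python
import time

def elastic_loop(batch, cloud, max_cloud_nodes=50, poll=60):
    """Toy scale-out loop for a hybrid batch system: `batch` and
    `cloud` are stand-ins for a scheduler API and a cloud API."""
    while True:
        pending = batch.pending_jobs()          # jobs waiting for slots
        idle = cloud.idle_nodes()               # provisioned but unused
        if pending and len(cloud.nodes()) < max_cloud_nodes:
            node = cloud.provision()            # boot a worker VM
            batch.register_node(node)           # join it to the cluster
        elif not pending and idle:
            for node in idle:
                batch.drain_node(node)          # stop accepting jobs
                cloud.release(node)             # give the VM back
        time.sleep(poll)
```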
Distributed research organizations are faced with wide variation in the computing environments they must support. LIGO has historically resolved this problem by providing RPM/DEB packages for (pre-)production software and coordinating between clusters operated by LIGO-affiliated facilities and research groups. This has been largely successful, although it leaves a gap in operating system support and in...