PDSF, the Parallel Distributed Systems Facility, has been in continuous operation since 1996, serving high-energy physics research. The cluster is a Tier-1 site for STAR, a Tier-2 site for ALICE and a Tier-3 site for ATLAS.
We'll give a status report on the PDSF cluster and its migration to Cori, the primary computing resource at NERSC. We'll go into how we tried to ease the process by...
News and updates from BNL activities since the Barcelona meeting
I will present recent developments of the Canadian T1 and T2.
We will present an update on AGLT2, focusing on the changes since the Fall 2018 report.
The primary topics include an update on VMware, an update on dCache, the status of newly purchased hardware, and the problems encountered and solutions found while improving the CPU utilization of our HTCondor system.
As a major WLCG/OSG T2 site, the University of Wisconsin-Madison CMS T2 has consistently delivered highly reliable and productive services for large-scale CMS MC production/processing, data storage, and physics analysis for the last 13 years. The site utilizes high-throughput computing (HTCondor), a highly available storage system (Hadoop), scalable distributed software systems (CVMFS),...
Updates on the activities at T2_US_Nebraska over the past year. Topics will cover the site configuration and tools we use, troubles we face in daily operation, and contemplation of what the future might hold for sites like ours.
We would like to report an update on the computing research center at KEK, including the Grid system, covering developments since HEPiX Fall 2018 for the 2019 data-taking period of the SuperKEKB and J-PARC experiments. The network connectivity of the KEK site was improved by the replacement of network equipment and security devices in September 2018. The situation of the international network for Japan will...
The Tokyo Tier-2 center, which is located in the International Center for Elementary Particle Physics (ICEPP) at the University of Tokyo, provides computing resources for the ATLAS experiment in the WLCG. Almost all hardware at the center is leased and is upgraded every three years. This hardware upgrade was performed in December 2018. In this presentation,...
At CENPA at the University of Washington we have a heterogeneous Rocks 7 cluster of about 135 nodes containing 1250 cores. We will present the current status and issues.
Updates from JLab since the Autumn 2018 meeting at PIC in Barcelona.
Comet is SDSC’s newest supercomputer. The result of a $27M National Science Foundation (NSF) award, Comet delivers over 2.7 petaFLOPS of computing power to scientists, engineers, and researchers all around the world. In fact, within its first 18 months of operation, Comet served over 10,000 unique users across a range of scientific disciplines, becoming one of the most widely used...
Over the last few years, there have been a number of trends in how devices are provisioned and managed within organizations, such as BYOD ("Bring Your Own Device") and COPE ("Company Owned, Personally Enabled"). In response, a new category of products called "Enterprise Mobility Management Suites", which includes MDM ("Mobile Device Management") and MAM ("Mobile Application Management"),...
During the last two years, the computational systems group at NERSC, in partnership with Cray, has been developing SMWFlow, a tool that makes managing system state as simple as switching branches in git. This solution is the cornerstone of collaborative systems management at NERSC and enables code-review, automated testing and reproducibility.
Besides supercomputers, NERSC hosts Mendel, a...
A report from the OpenAFS Release Team on recent OpenAFS releases and development branch updates. Topics include acknowledgement of contributors, descriptions of issues fixed, updates for new versions of Linux and Solaris, changes currently under review, and an update on the new RXGK security class for improved security.
Logistical Storage (LStore) provides a flexible logistical networking storage framework for distributed and scalable access to data in both HPC and WAN environments. LStore uses commodity hard drives to provide virtually unlimited storage with user-controllable fault tolerance and reliability. In this talk, we will briefly review LStore's features and discuss the newly developed native LStore plugin...
RAL's Ceph-based Echo storage system is now the primary disk storage system running at the Tier 1, replacing a legacy CASTOR system that will be retained for tape. This talk will give an update on Echo's recent development, in particular the adaptations needed to support the ALICE experiment and the challenges of scaling an erasure-coded Ceph cluster past the 30PB mark. These include the...
News and updates from IHEP since the last HEPiX Workshop. In this talk we would like to present the status of the IHEP site, including the computing farm, HPC, IHEPCloud, Grid, data storage, network and so on.
News from CERN since the HEPiX Autumn/Fall 2018 workshop in Barcelona.
An update on what's going on at the Italian Tier1 center
Brief overview of activities at DESY since the last site report
We will give an overview of the site, including our recent network redesign. We will dedicate part of the talk to disk servers, reporting on the newest additions as well as upgraded old hardware. We will also share our experience with our distributed HTCondor batch system.
Ongoing developments at GSI/FAIR: diggers, Lustres, procurements, relocations, operating systems
Diamond Light Source is an X-ray synchrotron light source co-located with STFC RAL in the UK. This is the first site report from Diamond at HEPiX since 2015. The talk will discuss recent changes, current status and future plans as well as the odd disaster story thrown in for good measure.
Diamond has a new data centre, new storage and new compute as well as new staff and a few forays into...
This talk will discuss how we worked with Dr. Amy Apon, Brandon Posey, AWS and the Clemson DICE lab team to dynamically provision a large-scale computational cluster of more than one million cores using Amazon Web Services (AWS). We discuss the trade-offs, challenges, and solutions associated with creating such a large-scale cluster with commercial cloud resources. We utilize our large...
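As a rough illustration of the burst-provisioning pattern described above (not the actual tooling used for the million-core run), the following Python sketch requests a block of EC2 Spot instances with boto3; the AMI ID, instance type, key pair and security group are placeholders.

    # Minimal sketch: request a slice of EC2 Spot instances for a burst cluster.
    # All identifiers (AMI, key pair, security group) are hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.request_spot_instances(
        InstanceCount=100,                          # one small slice of a much larger campaign
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-0123456789abcdef0",     # placeholder worker-node image
            "InstanceType": "c5.18xlarge",
            "KeyName": "cluster-key",               # placeholder key pair
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
        },
    )

    request_ids = [r["SpotInstanceRequestId"]
                   for r in response["SpotInstanceRequests"]]
    print(f"Submitted {len(request_ids)} spot requests")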
I present recent developments of our cloudscheduler, which we use to run HEP workloads on various clouds in North America and Europe. We are working on a complete rewrite utilizing modern software technologies and practices.
The configuration of the CERN IT central DNS servers, based on ISC BIND, is generated automatically from scratch every 10 minutes using software developed at CERN several years ago. This in-house set of Perl scripts has evolved and is reaching its limits in terms of maintainability and architecture. CERN is in the process of reimplementing the software in a modern language and is taking...
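To illustrate the regenerate-from-scratch approach (a generic Python sketch, not the CERN implementation, which is a set of Perl scripts under rewrite), a zone file can be rendered directly from a host inventory; the hostnames, addresses and SOA parameters below are invented.

    # Hypothetical sketch: regenerate a BIND zone file from an inventory dump.
    # Names, addresses and SOA parameters are invented for illustration only.
    import time

    hosts = {"www": "192.0.2.10", "mail": "192.0.2.20", "ns1": "192.0.2.53"}

    def render_zone(origin, hosts):
        serial = int(time.time())          # simple monotonically increasing serial
        lines = [
            f"$ORIGIN {origin}.",
            "$TTL 3600",
            f"@ IN SOA ns1.{origin}. hostmaster.{origin}. "
            f"({serial} 1200 180 1209600 3600)",
            f"@ IN NS ns1.{origin}.",
        ]
        for name, ip in sorted(hosts.items()):
            lines.append(f"{name} IN A {ip}")
        return "\n".join(lines) + "\n"

    with open("example.org.zone", "w") as f:
        f.write(render_zone("example.org", hosts))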
This presentation provides an update on the global security landscape since the last HEPiX meeting. It describes the main vectors of risks to and compromises in the academic community, including lessons learnt, presents interesting recent attacks, and provides recommendations on how to best protect ourselves. It also covers security risk management in general, as well as the security aspects...
Various high-energy and nuclear physics experiments already benefit from using the different components of a federated architecture to access storage and infrastructure services. BNL moved to identity management with Red Hat IPA in late 2018, which will serve as the foundation for moving to federated authentication and authorization. IPA provides central authentication via Kerberos or LDAP, simplifies...
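As a minimal sketch of the central-authentication piece, the snippet below looks up a user in an IPA LDAP tree using the ldap3 library; the server name, base DN and bind credentials are placeholders rather than BNL's actual configuration.

    # Hypothetical example: query a FreeIPA LDAP directory with ldap3.
    # Server, DNs and credentials are placeholders.
    from ldap3 import Server, Connection, ALL

    server = Server("ldaps://ipa.example.org", get_info=ALL)
    conn = Connection(server,
                      user="uid=svc-lookup,cn=users,cn=accounts,dc=example,dc=org",
                      password="changeme",
                      auto_bind=True)

    conn.search(search_base="cn=users,cn=accounts,dc=example,dc=org",
                search_filter="(uid=jdoe)",
                attributes=["cn", "mail", "memberOf"])

    for entry in conn.entries:
        print(entry.entry_dn, entry.mail)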
The network market has changed a lot compared with a decade ago. Every hardware vendor sells its own switches and routers, but most of them are built on the same merchant silicon that is available on the market.
The number of real choices is therefore limited, because what is inside is largely the same for most of them.
This talk will cover the differences that are still there...
WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area is a partner of the WLCG effort and is focused on being the primary source of networking information for its partners and...
The transition of WLCG storage services to dual-stack IPv4/IPv6 is progressing well, aimed at enabling the use of IPv6-only CPU resources as agreed by the WLCG Management Board and presented by us at previous HEPiX meetings.
The working group, driven by the requirements of the LHC VOs to be able to use IPv6-only opportunistic resources, continues to encourage wider deployment of dual-stack...
High Energy Physics (HEP) experiments have greatly benefited from a strong relationship with Research and Education (R&E) network providers and, thanks to projects such as LHCOPN/LHCONE and REN contributions, have enjoyed significant capacities and high-performance networks for some time. RENs have been able to continually expand their capacities to over-provision the networks relative to...
We describe our experience with and use of the Dynafed data federator with cloud and traditional Grid computing resources as a substitute for a traditional Grid SE.
This is an update of the report given at the Fall HEPiX meeting of 2017 where we introduced our use case for such federation and described our initial experience with it.
We have used Dynafed in production for Belle II since late 2017 and...
OSiRIS is a pilot project funded by the NSF to evaluate a
software-defined storage infrastructure for our primary Michigan
research universities and beyond. In the HEP world OSiRIS is involved
with ATLAS as a provider of Event Service storage via the S3 protocol
as well as experimenting with dCache backend storage for AGLT2. We
are also in the very early stages of working with IceCube and...
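To make the S3 access pattern mentioned above concrete, here is a minimal, hypothetical sketch of writing and reading an object against an S3-compatible endpoint with boto3; the endpoint URL, bucket, keys and credentials are placeholders, not OSiRIS production values.

    # Hypothetical S3-protocol access against a Ceph RADOS Gateway style endpoint.
    # Endpoint, bucket and credentials are placeholders.
    import boto3

    s3 = boto3.client("s3",
                      endpoint_url="https://rgw.example.org",
                      aws_access_key_id="PLACEHOLDER_KEY",
                      aws_secret_access_key="PLACEHOLDER_SECRET")

    s3.put_object(Bucket="eventservice-test",
                  Key="events/run001/chunk-0001.dat",
                  Body=b"example payload")

    obj = s3.get_object(Bucket="eventservice-test",
                        Key="events/run001/chunk-0001.dat")
    print(len(obj["Body"].read()), "bytes read back")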
The computing center GridKa is serving the ALICE, ATLAS, CMS and
LHCb experiments as Tier-1 center with compute and storage resources.
It is operated by the Steinbuch Centre for Computing at Karlsruhe Institute
of Technology in Germany. In its current stage of expansion GridKa
offers the HEP experiments a capacity of 35 Petabytes of online storage.
The storage system is based on Spectrum...
DPM (Disk Pool Manager) is a multi-protocol distributed storage system that can be easily used within a grid environment and is still popular for medium-sized sites. Currently DPM can be configured to run in legacy or DOME mode, but official support for the legacy flavour ends this summer, and sites using DPM storage should think about their upgrade strategy or coordinate with the WLCG DPM Upgrade...
The Storage group of the CERN IT department is responsible for the development and operation of the petabyte-scale services needed to accommodate the diverse requirements for storing physics data generated by LHC and non-LHC experiments, as well as for supporting users of the laboratory in their day-to-day activities.
This contribution presents the current operational status of the main storage...
Brookhaven National Laboratory stores and processes large amounts of data from PHENIX, STAR, ATLAS, Belle II and Simons, as well as smaller local projects. This data is stored long term in tape libraries, while working data is stored in disk arrays. Hardware RAID devices from companies such as Hitachi Vantara are very convenient and require minimal administrative intervention....
Status, ongoing activities, and future directions at Fermilab.
The Benchmarking Working Group has been very active in the last months. The group observed that SPEC CPU 2017 is not very different from SPEC CPU 2006: on the worker nodes available, the two benchmarks are highly correlated. Analysis with Trident shows that the hardware-counter usage differs considerably from that of HEP applications, so the group started to investigate the usage of real applications...
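As a sketch of the kind of comparison behind the correlation statement (with made-up scores, not the working group's data), one can compute the Pearson correlation of the two benchmarks across a set of worker nodes:

    # Toy example: correlate SPEC CPU 2006 and 2017 scores across worker nodes.
    # The numbers below are invented and only illustrate the calculation.
    import numpy as np

    spec2006 = np.array([310.0, 280.0, 450.0, 390.0, 520.0, 610.0])
    spec2017 = np.array([ 38.0,  34.0,  55.0,  47.0,  64.0,  76.0])

    r = np.corrcoef(spec2006, spec2017)[0, 1]
    print(f"Pearson correlation: {r:.3f}")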
Monitoring and analyzing how a workload is processed by a job and resource management system is at the core of the operation of data centers. It allows operators to verify that the operational objectives are satisfied, detect any unexpected and unwanted behavior, and react accordingly to such events. However, the scale and complexity of large workloads composed of millions of jobs executed...
This talk focuses on recent experiences and developments in providing the SWAN data analytics platform, based on Apache Spark, for High Energy Physics at CERN.
The Hadoop Service is expanding its user base to analysts who want to perform analysis with big-data technologies, namely Apache Spark, with the main users coming from accelerator operations and infrastructure monitoring. Hadoop Service integration...
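For context, a minimal PySpark job of the kind such a platform runs, here aggregating hypothetical monitoring records stored as Parquet (the input path and column names are invented), could look like:

    # Minimal PySpark sketch: aggregate hypothetical monitoring data.
    # The input path and column names are invented for illustration.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("monitoring-aggregation").getOrCreate()

    df = spark.read.parquet("/data/monitoring/metrics.parquet")

    summary = (df.groupBy("host", "metric")
                 .agg(F.avg("value").alias("mean_value"),
                      F.max("value").alias("max_value")))

    summary.orderBy("host", "metric").show(20, truncate=False)
    spark.stop()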
We will briefly show the current onsite accelerator infrastructure and its resulting computing and storage usage and future requirements. The second section will discuss the plans and work done regarding the hardware infrastructure, the system-level middleware (i.e. containers, storage connections, networks) and the higher-level middleware (under development) covering low-latency data access...
For the past 10 years, CSCS has been providing WLCG Tier-2 compute capacity for ATLAS, CMS and LHCb on standard commodity hardware (a cluster named Phoenix). Three years ago, CSCS began providing this service on its flagship High Performance Computing (HPC) system, Piz Daint (a Cray XC40/50 system). Piz Daint is a world-class HPC system with over 1800 dual-processor multicore nodes and...
Deep Learning techniques are gaining interest in High Energy Physics, offering a new and efficient approach to solving different problems. These techniques leverage the specific features of GPU accelerators and rely on a set of software packages that allow users to compute on GPUs and program Deep Learning algorithms. However, the rapid pace at which both the hardware and the low and high...
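As a minimal illustration of the GPU-enabled software stack referred to above (a generic PyTorch sketch, not tied to any particular experiment workflow):

    # Generic sketch: define a small network and run one forward pass on a GPU
    # if one is available; otherwise fall back to the CPU.
    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = nn.Sequential(
        nn.Linear(20, 64),
        nn.ReLU(),
        nn.Linear(64, 2),
    ).to(device)

    batch = torch.randn(128, 20, device=device)   # dummy input features
    logits = model(batch)
    print(logits.shape, "computed on", device)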
The HSF/WLCG cost and performance modeling working group was established in November 2017 and has since then achieved considerable progress in our understanding of the performance factors of the LHC applications, the estimation of the computing and storage resources and the cost of the infrastructure and its evolution for the WLCG sites. This contribution provides an update on the recent...
The BNL Computing Facility Revitalization (CFR) project aims at repurposing the former National Synchrotron Light Source (NSLS-I) building (B725), located on the BNL site, as the new data center for the BNL Computational Science Initiative (CSI) and the RACF/SDCC Facility in particular. The CFR project is currently wrapping up the design phase and is expected to enter the construction phase in the first half...
I will present how we are using our data collection framework (Omni) to help facilitate the installation of N9 (our new system) and how this all ties together with the Superfacility concept mentioned in the fall.
A short report on what has happened, how we have organised ourselves, how we intend to present results etc. Note that the findings themselves will be discussed in other contributions - this is about how the group works.
We will report on the findings of the technology watch working group concerning CPUs, storage, networks and related fields
In November 2018, running on a mere half-rack of ordinary SuperMicro servers, WekaIO's Matrix Filesystem outperformed 40 racks of specialty hardware on Oak Ridge National Laboratory's Summit system, yielding the #1 ranked result in the IO-500 10-Node Challenge. How can that even be possible?
This level of performance becomes important for modern use cases whether they involve GPU-accelerated...
For several years, the GridKa Tier-1 center, the Large Scale Data Facility and other infrastructures at KIT have been using Puppet and Foreman for configuration management and machine deployment.
We will present our experiences, the workflows that are used and our current efforts to establish a completely integrated system for all our infrastructures based on Katello.
The token renewal service (TRS) has been used at SLAC National Accelerator Laboratory since the late 1990s. In 2018 it was found to be lacking in some critical areas (the encryption types used and other basic mechanisms would no longer be available on post-Red Hat 6 systems).
1-to-1 replacement areas:
  Running batch jobs:
    Our local solution for batch jobs (LSF): the need for TRS was already...
GlideinWMS is a workload management and provisioning system that lets
you share computing resources distributed over independent sites. A
dynamically sized pool of resources is created by GlideinWMS pilot
Factories, based on the requests made by GlideinWMS Frontends. More
than 400 computing elements are currently serving more than 10
virtual organizations through GlideinWMS. This contribution...
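As a small, hedged example of interacting with the resulting glidein pool (not part of GlideinWMS itself), the HTCondor Python bindings can query the pool collector for running glidein slots; the collector hostname is a placeholder.

    # Hypothetical example: query an HTCondor collector for glidein worker slots.
    # The collector address is a placeholder.
    import htcondor

    coll = htcondor.Collector("collector.example.org")

    ads = coll.query(htcondor.AdTypes.Startd,
                     projection=["Name", "State", "GLIDEIN_Site"])

    for ad in ads[:10]:
        print(ad.get("Name"), ad.get("State"), ad.get("GLIDEIN_Site"))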
The KEK Central Computer System (KEKCC) is a service that provides large-scale computer resources, including Grid and Cloud computing, as well as common IT services. The KEKCC is entirely replaced every four or five years, following the Japanese government procurement policy for computer systems. The current KEKCC has been in operation since September 2016, and decommissioning will start in early...
The Pacific Research Platform (PRP) operates a Kubernetes cluster that manages over 2.5k CPU cores and 250 GPUs. Most of the resources are used by local users working interactively, starting Kubernetes Pods directly.
To fully utilize the available resources, we have deployed an opportunistic HTCondor pool as a Kubernetes Deployment, with the worker-node environment being fully OSG compliant....
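To illustrate the pattern of running an opportunistic worker pool as a Kubernetes Deployment (a simplified sketch with a hypothetical image name, namespace and resource requests, not the PRP production configuration):

    # Sketch: create a Deployment of HTCondor worker pods with the official
    # Kubernetes Python client. Image, namespace and resources are placeholders.
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    container = client.V1Container(
        name="htcondor-worker",
        image="example/htcondor-worker:latest",   # hypothetical worker image
        resources=client.V1ResourceRequirements(
            requests={"cpu": "8", "memory": "16Gi"}),
    )

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="osg-htcondor-pool"),
        spec=client.V1DeploymentSpec(
            replicas=10,
            selector=client.V1LabelSelector(match_labels={"app": "htcondor-worker"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "htcondor-worker"}),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )

    apps.create_namespaced_deployment(namespace="opportunistic", body=deployment)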
The vast breadth and configuration possibilities of the public cloud offer intriguing opportunities for loosely coupled computing tasks. One such class of tasks is simply statistical in nature, requiring many independent trials over the targeted phase space in order to converge on robust, fault-tolerant and optimized designs. Our single-threaded target application (50-200 MB) solves a...
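The embarrassingly parallel structure described here can be sketched as follows (a generic Python skeleton with an invented objective function standing in for the real solver):

    # Generic skeleton: run many independent trials over a parameter grid
    # in parallel. The objective function is a stand-in for the real solver.
    import itertools
    import random
    from concurrent.futures import ProcessPoolExecutor

    def run_trial(params):
        # Stand-in for the single-threaded solver: returns a score for one design.
        random.seed(hash(params) & 0xFFFFFFFF)
        return params, random.random()

    grid = list(itertools.product(range(10), range(10), range(10)))  # 1000 trials

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(run_trial, grid))
        best_params, best_score = max(results, key=lambda r: r[1])
        print("best design:", best_params, "score:", round(best_score, 3))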
CERN, the European Laboratory for Particle Physics, runs OpenStack for its private Cloud Infrastructure, among other leading open-source tools, helping thousands of scientists around the world to uncover the mysteries of the Universe.
In 2012, CERN started the deployment of its private Cloud Infrastructure using OpenStack. Since then we have moved from a few hundred cores to a multi-cell...
Modern software development workflow patterns often involve the use of a developer’s local machine as the first platform for testing code. SLATE mimics this paradigm with a lightweight implementation, called MiniSLATE, that runs completely contained on the developer’s local machine (laptop, virtual machine, or another physical server). MiniSLATE resolves many development...
In the spring of 2018, central operations services were migrated out of the Grid Operations Center in Indiana to other participating Open Science Grid institutions. This talk summarizes how the migration has affected the services provided by the OSG and gives a summary of how central OSG services interface with US WLCG sites.