JLab high-performance and experimental physics computing environment updates since the spring 2016 meeting, including recent hardware installs of KNL and Broadwell compute clusters and Supermicro storage; our Intel Lustre upgrade status; 12 GeV computing updates; and Data Center modernization progress.
The site report contains the latest news and updates on
computing at BNL.
Updates on the status of the Canadian Tier-1 and other TRIUMF computing news will be presented.
We will present an update on our site since the Spring 2016 report, covering our changes in software, tools and operations.
We will also report on our significant hardware purchases during summer 2016 and the impact they are having on our site.
We conclude with a summary of what has worked and what problems we encountered and indicate directions for future work.
Updates from T2_US_Nebraska covering our experiences operating CentOS 7 + Docker/SL6 worker nodes, banishing SRM in favor of LVS balanced GridFTP, and some attempts at smashing OpenFlow + GridFTP + ONOS together to live the SDN dream.
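As a rough illustration of the CentOS 7 + Docker/SL6 worker-node idea (this is a generic sketch, not Nebraska's actual configuration), the Docker SDK for Python can launch a short-lived SL6 payload container on a CentOS 7 host; the image name and the bind-mounted CVMFS path below are assumptions.

```python
# Hypothetical sketch: run an SL6 payload container on a CentOS 7 host
# using the Docker SDK for Python. Image name and CVMFS mount are assumed.
import docker

client = docker.from_env()

# Run a short-lived SL6 container that reports its OS release, mounting
# CVMFS read-only so experiment software would be visible inside.
output = client.containers.run(
    "scientificlinux/sl:6",                       # assumed public SL6 image
    "cat /etc/redhat-release",
    volumes={"/cvmfs": {"bind": "/cvmfs", "mode": "ro"}},
    remove=True,                                  # clean up after exit
)
print(output.decode().strip())
```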
As a major WLCG/OSG T2 site, the University of Wisconsin-Madison CMS T2 has consistently been delivering highly reliable and productive services towards large-scale CMS MC production/processing, data storage, and physics analysis for the last 10 years. The site utilises high throughput computing (HTCondor), a highly available storage system (Hadoop), scalable distributed software systems (CVMFS),...
This talk will give a brief introduction to the status of the IHEP (CAS) computing center, including the local cluster, the Grid Tier-2 site for ATLAS and CMS, the file and storage systems, the cloud infrastructure, the planned HPC system, and Internet and domestic networking.
The new KEK Central Computer System started service on September 1st, 2016 after a renewal of all hardware. In this talk, we will introduce the performance of the new system and the improved network connectivity with LHCONE.
The Tokyo Tier-2 site, which is located at the International Center for Elementary Particle Physics (ICEPP)
at the University of Tokyo, provides resources for the ATLAS experiment in WLCG. In December 2015,
almost all hardware devices were replaced for the 4th system. Operational experience with the new system
and a migration plan from CREAM-CE + Torque/Maui to ARC-CE + HTCondor will be reported.
We will provide updates on technical and managerial changes at Australia's only HEP grid computing site.
Scientific Linux status and news.
Filtering e-mails for security reasons is a common procedure. At DESY, e-mails with suspicious content are quarantined; users are notified and may request delivery of those e-mails. DESY is in the process of shifting from a commercial product to a quarantine solution made of open source and self-developed software. This solution will be presented in the context of DESY's e-mail infrastructure.
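To illustrate the general quarantine workflow (this is a toy sketch, not DESY's filter: the risky-extension list, quarantine path, and helper names are invented), a message could be flagged by its attachments and moved aside for later release:

```python
# Illustrative sketch only: flag a message as suspicious if it carries an
# attachment with a risky extension, and move it to a quarantine directory.
# Extension list and quarantine path are hypothetical; notification is a stub.
import email
import shutil
from email import policy
from pathlib import Path

RISKY_EXTENSIONS = {".exe", ".js", ".vbs", ".scr", ".docm"}  # assumed list
QUARANTINE_DIR = Path("/var/quarantine")                     # assumed path

def is_suspicious(msg_path: Path) -> bool:
    msg = email.message_from_bytes(msg_path.read_bytes(), policy=policy.default)
    for part in msg.iter_attachments():
        name = (part.get_filename() or "").lower()
        if any(name.endswith(ext) for ext in RISKY_EXTENSIONS):
            return True
    return False

def quarantine(msg_path: Path) -> None:
    QUARANTINE_DIR.mkdir(parents=True, exist_ok=True)
    shutil.move(str(msg_path), QUARANTINE_DIR / msg_path.name)
    # In a real system the user would now be notified and could
    # request release of the quarantined message.
```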
With the change of the ATLAS computing model from hierarchical to dynamic, processing tasks are dispatched to sites based not only on availability of resources but also network conditions along the path between compute and storage, which may be topologically and/or geographically distant. We describe a system developed to collect, store, analyze and provide timely access to the network...
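The core idea of using network measurements in brokering can be illustrated with a toy sketch (all names and numbers below are hypothetical, not the system described in the talk): keep recent throughput samples per compute/storage site pair and rank candidate compute sites for a task whose input data sits at a given storage site.

```python
# Toy sketch: rank candidate compute sites by recent measured throughput
# to the storage site holding the input data. Data and site names are invented.
from collections import defaultdict
from statistics import mean

# measurements[(compute_site, storage_site)] -> recent throughputs in MB/s
measurements = defaultdict(list)

def record(compute, storage, mbps, keep=10):
    """Append a measurement, keeping only the most recent 'keep' samples."""
    samples = measurements[(compute, storage)]
    samples.append(mbps)
    del samples[:-keep]

def rank_sites(storage, candidates):
    """Order candidate compute sites by mean observed throughput to 'storage'."""
    scored = [(c, mean(measurements[(c, storage)]))
              for c in candidates if measurements[(c, storage)]]
    return sorted(scored, key=lambda x: x[1], reverse=True)

record("SiteA", "BNL", 850.0)
record("SiteB", "BNL", 320.0)
print(rank_sites("BNL", ["SiteA", "SiteB"]))   # SiteA ranks first
```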
Since April 1st, SINET, the NREN for universities in Japan, has been operating its 5th-generation infrastructure, SINET5. It accepts 100 Gbps connections to the backbone from each institute and newly provides a direct path from Japan to Europe. KEK is connected to SINET with a total bandwidth of 120 Gbps, and most of that bandwidth will be used for mass data transmission via LHCONE. We...
CERN networks are dealing with an ever-increasing volume of network traffic. The traffic leaving and entering CERN has to be precisely monitored and analysed in order to properly protect the networks from potential security breaches. To provide the required monitoring capabilities, the Computer Security team and the Networking team at CERN have joined efforts in designing and deploying a...
High energy physics experiments produce huge amounts of raw data, but because network resources are shared, there is no guarantee of the bandwidth available to each experiment, which may lead to link contention. On the other hand, with the development of cloud computing technologies, IHEP has established a cloud platform based on OpenStack which can ensure...
Update on SLAC Scientific Computing Service
SLAC’s Scientific Computing Services team provides long-term storage and
midrange compute capability for multiple science projects across the lab.
The team is also responsible for core enterprise (non-science) unix
infrastructure. Sustainable hardware lifecycle is a key part of the...
Caltech site report (USCMS Tier 2 site)
A report on facility deployment, recent activities, collaborations, and plans.
A short update on what's going on at the Italian T1 center.
News and interesting events from NDGF and NeIC.
News about GridKa Tier-1 and other KIT IT projects and infrastructure.
During the last few months, HPC @ GSI has moved servers and services to the new data center Green IT Cube. This included moving the users from the old compute cluster to the new one with a new scheduler, and moving several Petabytes of data from the old to the new Lustre cluster.
Critical to the success of ITER reaching its scientific goal (Q≥10) is a data system that supports the broad range of diagnostics, data analysis, and computational simulations required for this scientific mission. Such a data system, termed ITERDB in this document, will be the centralized data access point and data archival mechanism for all of ITER’s scientific data. ITERDB will provide a...
- hardware renewal
- dCache and OS upgrade
- Ansible
- Windows 10 migration
- network: IPv6
- infrastructure: monitoring
- new H2020 call: EOSF
We give an update on the infrastructure, Tier-0 hosting services, Cloud services and other recent developments at the Wigner Datacenter.
This report from the HEPiX IPv6 Working Group will present activities during the last 6-12 months. With IPv4 addresses running out and with some sites and Cloud providers now wishing to offer IPv6-only CPU, together with the fact that several WLCG sites are already successfully running production dual-stack storage services, we have a plan to support IPv6 CPU from April 2017 onwards. This plan...
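A minimal readiness check for the dual-stack/IPv6-only scenario described above might look like the sketch below (the hostname is a placeholder and this is not a working-group tool): verify that an endpoint publishes an AAAA record and that a TCP connection over IPv6 actually succeeds.

```python
# Minimal dual-stack readiness check (illustrative; hostname is a placeholder):
# does the endpoint have an AAAA record, and can we connect to it over IPv6?
import socket

def ipv6_reachable(host, port, timeout=5.0):
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False          # no AAAA record at all
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue
    return False

print(ipv6_reachable("storage.example.org", 443))   # placeholder endpoint
```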
What’s been happening in security for HEP? We will discuss the recent trends in the ever-changing threat landscape, and the new initiatives being put in place to protect our people, data and services. One such initiative to highlight is our focus on bootstrapping international collaboration within research and academia, encouraging communities to participate in intelligence sharing and incident...
Over the last few years, the number of mobile devices connected to the CERN internal network has increased from a handful in 2006 to more than 10,000 in 2015. Wireless access is no longer a “nice to have” or just for conference and meeting rooms, now support for mobility is expected by most, if not all, of the CERN community. In this context, a full renewal of the CERN Wi-Fi network has been...
HEP use of cloud services has brought to light various network issues that hamper the full integration of such services with WLCG resources. In this presentation we comment on the issues that have been encountered and present the ongoing actions of the international network community to facilitate the integration of cloud services into the research computing environment.
EduGAIN, the international identity federation, allows users from all over the world to access a globally distributed suite of academic resources. You are most likely already able to use your primary account, from CERN or your home organisation, to tap into these services! Federated Identity Management, the technology underpinning eduGAIN, brings many benefits for users and organisations...
The intent of this presentation is to give current (or potential) users of Spectrum Scale a deep dive into various key components and functions of the product and its usage in High Performance Computing. I will share performance data for problematic filesystem workloads such as shared directory or file access, as well as demonstrate some new capabilities that have been added in the 4.2.1 release. I...
The HEPiX Benchmarking Working Group was relaunched in spring 2016. Its first tasks are:
- the development and proposal of a fast benchmark to estimate the performance of the provided job slot (in traditional batch farms) or VM instance (in cloud environments) - see the sketch below;
- preliminary work on a successor to the HS06 benchmark.
This talk provides a status report of the work done so far.
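The sketch below illustrates only the general idea of such a fast benchmark (it is not the working group's actual benchmark): time a fixed, CPU-bound workload inside the slot or VM and report an operations-per-second score that can be compared across resources.

```python
# Not the working group's benchmark - just the general idea: time a fixed
# CPU-bound workload in the job slot or VM and report a comparable score.
import time

def fast_score(iterations=2_000_000):
    """Return a simple ops-per-second figure for this slot."""
    start = time.perf_counter()
    acc = 0.0
    for i in range(1, iterations):
        acc += 1.0 / (i * i)          # arbitrary floating-point work
    elapsed = time.perf_counter() - start
    return iterations / elapsed

if __name__ == "__main__":
    print(f"slot score: {fast_score():.0f} ops/s")
```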
Big data is typically characterized by only a few features, such as Volume, Velocity and Variety. This is a simplification that overlooks many factors that affect the way data is used and managed, factors that can have a profound effect on the computing systems needed to serve different communities.
I compare the computing and data-management needs of the genomics domain with those of big...
Jefferson Lab recently installed a 200-node Knights Landing cluster, becoming an Intel® Parallel Computing Center. This talk will give an overview of the cluster installation and configuration, including its Omni-Path fabric, benchmarking, and integration with Lustre and NFS over InfiniBand.
x86 processors have been the long-time leaders of the server market and x86_64 the uncontested target architecture for the development of High Energy Physics applications. Up until a few years ago, interest in alternative architectures targeting server environments that could compete with x86 in terms of performance, power efficiency and total cost of ownership could not find any concrete...
We aim to build a software service for provisioning cloud-based computing resources that can be used to augment users’ existing, fixed resources and meet their batch job demands. This service must be designed to automate the delivery of compute resources (HTCondor execute nodes) to match user job demand in such a way that cloud-based resource utilization is high and, thus, cost per cpu-hour is...
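A conceptual sketch of such a scaling loop is given below; it is not the service being described, and count_idle_jobs(), provision() and terminate() are hypothetical stand-ins for HTCondor queries and a cloud provider API.

```python
# Conceptual sketch of demand-driven cloud scaling for an HTCondor pool.
# The helper callables are hypothetical stand-ins, not a real implementation.
SLOTS_PER_NODE = 8            # assumed execute-node size

def scale_once(count_idle_jobs, count_cloud_nodes, provision, terminate):
    idle = count_idle_jobs()          # e.g. from a condor_q / Schedd query
    nodes = count_cloud_nodes()       # e.g. from the cloud provider API
    wanted = (idle + SLOTS_PER_NODE - 1) // SLOTS_PER_NODE
    if wanted > nodes:
        provision(wanted - nodes)     # grow toward current demand
    elif idle == 0 and nodes > 0:
        terminate(nodes)              # drain when the queue is empty
```

Run periodically, a loop like this keeps cloud-based utilization high, which is what drives the cost per CPU-hour down.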
The goal of the HTCondor team is to develop, implement, deploy, and evaluate mechanisms
and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources. Increasingly, the work performed by the HTCondor developers is being driven by its partnership with the High Energy Physics (HEP) community.
This talk will present recent changes...
NERSC is well known for its user friendly, large-scale computing environment. Along with the large Cray systems (Edison and Cori), NERSC also supports data intensive workflows of the Joint Genome Institute, HEP and material science community via its Genepool, PDSF and Matgen clusters. These clusters are all provisioned from a single backend cluster, Mendel. This talk will briefly outline the...
In this paper we present a CephFS use case implementation at the Center of Excellence for Particle Physics at the TeraScale (CoEPP). CoEPP operates the Australian Tier-2 for ATLAS and joins experimental and theoretical researchers from the Universities of Adelaide, Melbourne, Sydney and Monash. CephFS is used to provide a unique object storage system, deployed on commodity hardware and without...
Over the past two years, a new data storage system, Echo, has been developed as a replacement for CASTOR disk-only storage of LHC data at the RAL Tier-1. This presentation will share the RAL experience of developing and deploying a new, Ceph-based storage service at the 13 PB scale to the standard required for production use.
This is the first new service that we have developed at this scale...
We give a report on the status of Ceph-based storage systems deployed at the RHIC & ATLAS Computing Facility (RACF) that currently provide 1 PB of data storage capacity for the object store (with an Amazon S3 compliant Rados Gateway front end), block storage (RBD), and shared file system (CephFS with dCache/GridFTP front-ends) layers of the Ceph storage system. The hardware and software...
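Because the Rados Gateway speaks the S3 protocol, standard S3 tooling can be pointed at it; the snippet below is a hedged example using boto3, where the endpoint URL, credentials and bucket name are placeholders rather than RACF's real configuration.

```python
# Hedged example: talking to an S3-compatible Rados Gateway with boto3.
# Endpoint, credentials and bucket name are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.org:8080",   # RGW endpoint (placeholder)
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="test-bucket")
s3.put_object(Bucket="test-bucket", Key="hello.txt", Body=b"hello ceph")
obj = s3.get_object(Bucket="test-bucket", Key="hello.txt")
print(obj["Body"].read())
```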
New developments in dCache, in particular the resilience features of redundant head-node services, where we can now do automatic failover and rolling upgrades with little to no service impact. We will also cover recent developments in other areas, such as Ceph support.
Randomly restoring files from tapes degrades read performance, primarily due to frequent tape mounts. The high latency of time-consuming tape mounts and dismounts is a major issue when accessing massive amounts of data from tape storage. BNL's mass storage system currently holds more than 80 PB of data on tapes, managed by HPSS. To restore files from HPSS, we make use of a scheduler...
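The essence of such scheduling can be sketched as follows (a simplified illustration, not the BNL scheduler; field names are invented): group pending recall requests by tape volume and order them by position on tape, so each tape is mounted once and read sequentially.

```python
# Simplified illustration of recall ordering to minimize tape mounts.
# Request fields ("volume", "position") are invented for the example.
from collections import defaultdict

def order_recalls(requests):
    """requests: iterable of dicts like
    {"path": "/hpss/file1", "volume": "A00123", "position": 42}"""
    by_volume = defaultdict(list)
    for req in requests:
        by_volume[req["volume"]].append(req)

    ordered = []
    for volume in sorted(by_volume):                          # one mount per volume
        ordered.extend(sorted(by_volume[volume],
                              key=lambda r: r["position"]))   # sequential reads
    return ordered
```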
The CERN IT-ST Analytics and Development section is responsible for the development of Data Management solutions for Disk Storage and Data Transfer, namely EOS, DPM and FTS.
The talk will describe some recent developments in these three software solutions.
EOS
The integration and evaluation of various technologies to carry out the transition from a single active in-memory namespace to a scale-out...
ZFS is a combination of file system, logical volume manager, and software RAID system developed by SUN Microsystems for the Solaris OS. ZFS simplifies the administration of disk storage and on Solaris it has been well regarded for its high performance, reliability, and stability for many years. It is used successfully for enterprise storage administration around the globe, but so far on such...
The OSiRIS (Open Storage Research Infrastructure) project started in September 2015, funded under the NSF CC*DNI DIBBs program (NSF grant #1541335). This program seeks solutions to the challenges many scientific disciplines are facing with the rapidly increasing size,
variety and complexity of data they must work with. As the data grows, scientists are challenged to manage, share and...
With the terabytes of data stored in databases and Hadoop at CERN and a great number of critical applications relying on them, the database service is evolving and the Hadoop service is expanding to adapt to the changing needs and requirements of their users. The demand is high and the scope is broad. This presentation gives an overview of the current state of the database services and new technologies...
(Open)AFS has been used at CERN as a general-purpose filesystem for Linux home directories and project space for over 20 years. It has an excellent track record, but is showing its age. It is now slowly being phased out due to concerns about the project's long-term viability. The talk will briefly explain CERN's reasons for phasing out, give an overview of the process, introduce the migration...
Since the introduction of Transarc AFS in 1991, the AFS family of file systems has played a role in research computing around the globe.
This talk will discuss the resurgence in development of the AFS family of file systems. A summary of recent development for several family members will be presented including:
- [AuriStor File System][1] suite of clients and servers
- [kAFS][2],...
The [Open Compute Project][1], OCP, was launched by Facebook in 2011 with the objective of building efficient computing infrastructures at lowest possible cost. Specifications and design documents for Open Compute systems are released under open licenses following the model traditionally associated with open source software projects. In 2014 we presented our plans for a public procurement...
This talk will give an overview of current activities to expand CERN's computing facilities infrastructure. This will include a description of the 2nd Network Hub currently being constructed as well as its purpose. It will also cover the initial plans for a possible second Data Centre on the CERN site.
BNL anticipates significant growth in scientific programs with large
computing and data storage needs in the near future and has recently
re-organized support for scientific computing to meet these needs.
A key component is the enhanced role of the RHIC-ATLAS Computing
Facility (RACF) in support of HTC and HPC at BNL.
This presentation discusses the evolving role of the RACF at BNL, in
light...
The GreenITCube has now been in production for half a year. We want to present our experience so far and what we have learned about the system, and give an outlook for the next couple of months.
In the second part of the talk, we will give a detailed overview of the infrastructure monitoring. The focus will be on the different systems we operate and how we bring all the monitoring data together.
Grafana is a popular tool for data analytics, and HTCondor generates
large amounts of time-series data appropriate for the kinds of analysis
Grafana provides. We use a Graphite cluster, which will be described in
some detail, as a back-end for metric storage, and adapted some scripts
from Fermilab for metric gathering. This work is in the context of the
batch-monitoring working group.
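For illustration, the last hop of such a metric pipeline is simple: Graphite accepts "path value timestamp" lines over its plaintext listener. The sketch below is generic (host, port and metric names are placeholders; it is not one of the Fermilab scripts mentioned above).

```python
# Sketch of pushing a pool-level HTCondor figure into Graphite via its
# plaintext protocol. Host and metric names are placeholders.
import socket
import time

GRAPHITE_HOST = "graphite.example.org"   # placeholder
GRAPHITE_PORT = 2003                     # Graphite plaintext listener

def send_metric(path, value, timestamp=None):
    ts = timestamp or int(time.time())
    line = f"{path} {value} {ts}\n"
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        sock.sendall(line.encode())

# e.g. a value obtained from condor_status or the HTCondor Python bindings
send_metric("htcondor.pool.running_jobs", 1234)
```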
Historically at the RAL Tier-1 we have always directly exposed public-facing services to the internet via static DNS entries. This is far from ideal as it means that users will experience connection failures during server maintenance (both planned and unplanned) and any changes to the servers behind a particular service require DNS changes. Since April we have been using in production HAProxy...
Australia-ATLAS has been running Puppet for all infrastructure and Grid nodes since 2012. With the release of Puppet 4, and the move to Centos 7, we decided to rejig our Puppet configuration using what we've learnt in 4 years, and best practice methodologies. This talk will describe the problems we had with the old Puppet config, the decisions we made constructing the new system, and how the...
Kibana and ElasticSearch are used for monitoring in many places. However, by default they do not support authentication and authorization features. In the case of single Kibana and ElasticSearch services shared among many users, any user that can access Kibana can retrieve any information from ElasticSearch.
In this talk, we will report on our latest R&D experience in securing the Kibana...
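One possible approach, shown here only as a very reduced sketch (not necessarily the solution presented in the talk; the ACL source, header name and backend URL are assumptions), is a small authorizing proxy in front of ElasticSearch that forwards search requests only for indices the authenticated user may read.

```python
# Reduced sketch of an authorizing proxy in front of ElasticSearch.
# ACLs, the X-Remote-User header and the backend URL are assumptions.
import requests
from flask import Flask, Response, abort, request

app = Flask(__name__)
ES_URL = "http://localhost:9200"                          # placeholder backend
ALLOWED = {"alice": {"logs-alice"}, "bob": {"logs-bob"}}  # assumed ACL source

@app.route("/<index>/_search", methods=["GET", "POST"])
def search(index):
    user = request.headers.get("X-Remote-User")   # set by SSO / front web server
    if user is None or index not in ALLOWED.get(user, set()):
        abort(403)
    resp = requests.request(request.method, f"{ES_URL}/{index}/_search",
                            data=request.get_data(),
                            headers={"Content-Type": "application/json"})
    return Response(resp.content, status=resp.status_code,
                    content_type=resp.headers.get("Content-Type"))
```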
An overview of results and lessons learned from the Fermilab Scientific Linux and Architecture Management (SLAM) group's Satellite 6 Lifecycle Management Project. The SLAM team offers a portfolio of diverse system management service offerings with a small staff. Managing the risk of resource scarcity involves implementing tools and processes that will facilitate standardization, reduce...
Did you ever need hundreds of state-of-the-art nodes that you could use to scalably test new ideas on? Run experiments that are not disrupted by what other users are doing? A platform that allows you to reinstall the operating system, recompile the kernel, and gives you access to the console so that you can debug the system? A place where your research team can easily reproduce experiments...
The Tier-1 at CNAF is the main INFN computing facility offering computing and storage resources to more than 30 different scientific collaborations including the 4 experiments at the LHC. A huge increase in computing needs is foreseen in the next years, mainly driven by the experiments at the LHC (especially starting with Run 3 in 2021) but also by other upcoming experiments such as...
Providing effective and non-intrusive security within NERSC’s Open
Science HPC environment introduces a number of challenges for both
researchers and operational personnel. As what constitutes HPC expands
in scope and complexity, the need for timely and accurate decision
making about user activity remains unchanged. This growing complexity
is balanced against a backdrop of routine user...
This contribution reports on solutions, experiences and recent developments with the dynamic, on-demand provisioning of remote computing resources for analysis and simulation workflows. Local resources of a physics institute are extended by private and commercial cloud sites, ranging from the inclusion of desktop clusters over institute clusters to HPC centers.
We report on recent...
Overview of what has happened in HNSciCloud over the last five months
At IHEP, a growing number of large scientific facilities require more computing resources. Managing resources at this scale requires an efficient and flexible system architecture, and virtual computing based on cloud technologies is one approach. IHEPCloud is a private IaaS cloud which supports multiple users and projects for virtual computing. In this paper, we describe the infrastructure of virtual...
Running HEP workloads on a Cray system can be challenging since these systems typically don't look very much like a standard Linux system. This presentation will describe several tools NERSC has deployed to enhance HEP and other data-intensive computing: Shifter, a Docker-like container technology developed at NERSC; the Burst Buffer, a super-fast IO layer; and a software-defined network that...
We provide an update on our continued experiments with container orchestration at the RAL Tier 1.
OpenStack is open source software for creating private and public clouds. It controls large pools of compute, storage, and networking resources throughout a datacenter, managed through a dashboard or via the OpenStack API. Hundreds of the world’s largest brands rely on OpenStack to run their businesses every day, reducing costs and helping them move faster.
We are applying this computing...
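As a hedged illustration of driving OpenStack programmatically, the sketch below uses the official openstacksdk to boot a compute instance; the cloud name, image, flavor and network are placeholders from a clouds.yaml entry, not IHEP's actual values.

```python
# Minimal sketch with openstacksdk: boot one instance and wait for ACTIVE.
# Cloud name, image, flavor and network names are placeholders.
import openstack

conn = openstack.connect(cloud="mycloud")       # credentials from clouds.yaml

image = conn.compute.find_image("CentOS-7")
flavor = conn.compute.find_flavor("m1.medium")
network = conn.network.find_network("private")

server = conn.compute.create_server(
    name="worker-001",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)   # block until ACTIVE
print(server.status)
```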