HEPiX Fall 2016 Workshop

US/Pacific
Building 50 Auditorium, LBNL, Berkeley, CA 94720
Chairs: Helge Meinhard (CERN), Tony Wong (Brookhaven National Laboratory)
Description

HEPiX Fall 2016 at Lawrence Berkeley National Laboratory, Berkeley, CA, USA

The HEPiX forum brings together worldwide Information Technology staff, including system administrators, system engineers, and managers from High Energy Physics and Nuclear Physics laboratories and institutes, to foster a learning and sharing experience between sites facing scientific computing and data challenges.

Participating sites include BNL, CERN, DESY, FNAL, IHEP, IN2P3, INFN, IRFU, JLAB, KEK, LBNL, NDGF, NIKHEF, PIC, RAL, SLAC, TRIUMF, and many others.

HEPiX Fall 2016 is proudly sponsored by Seagate at the platinum level and Intel and Penguin Computing at the silver level.

    • Registration Building 50 Auditorium

    • Miscellaneous Building 50 Auditorium

      • 1
        Logistics & Safety Announcement
        Speakers: Helge Meinhard (CERN), Tony Wong (Brookhaven National Laboratory)
      • 2
        Welcome To NERSC/LBNL

        Speaker: Sudip Dosanjh (NERSC)

      • 3
        Plans to Support Data-Intensive Computing on the NERSC 8 System
    • Site Report Building 50 Auditorium

      • 4
        JLab Scientific and High Performance Computing

        An update on the JLab high-performance and experimental physics computing environment since the spring 2016 meeting, including recent hardware installations of KNL and Broadwell compute clusters and Supermicro storage; our Intel Lustre upgrade status; 12 GeV computing updates; and data center modernization progress.

        Speaker: Sandy Philpott
      • 5
        BNL Site Report

        The site report contains the latest news and updates on
        computing at BNL.

        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 6
        TRIUMF Site Report

        Updates on the status of the Canadian Tier-1 and other TRIUMF computing news will be presented.

        Speaker: Denice Deatrich
    • 10:40
      Coffee Break Building 50 Auditorium

    • Site Report Building 50 Auditorium

      • 7
        AGLT2 Site Update

        We will present an update on our site since the Spring 2016 report, covering our changes in software, tools and operations.

        We will also report on our recent significant hardware purchases during summer 2016 and the impact they are having on our site.

        We conclude with a summary of what has worked and what problems we encountered and indicate directions for future work.

        Speaker: Shawn Mc Kee (University of Michigan (US))
      • 8
        University of Nebraska CMS T2 Site Report

        Updates from T2_US_Nebraska covering our experiences operating CentOS 7 + Docker/SL6 worker nodes, banishing SRM in favor of LVS balanced GridFTP, and some attempts at smashing OpenFlow + GridFTP + ONOS together to live the SDN dream.

        Speaker: Garhan Attebury (University of Nebraska-Lincoln (US))
      • 9
        University of Wisconsin-Madison CMS T2 site report

        As a major WLCG/OSG T2 site, the University of Wisconsin-Madison CMS T2 has consistently been delivering highly reliable and productive services for large-scale CMS MC production/processing, data storage, and physics analysis for the last 10 years. The site utilises high-throughput computing (HTCondor), a highly available storage system (Hadoop), a scalable distributed software system (CVMFS), and provides efficient data access using xrootd/AAA. The site fully supports IPv6 networking and is a member of the LHCONE community with 100 Gb WAN connectivity. An update on the activities and developments at the T2 facility over the last year (since the BNL meeting) will be presented.

        Speaker: Ajit Mohapatra (University of Wisconsin-Madison (US))
      • 10
        Status of IHEP Site

        This talk will give a brief introduction to the status of the computing center at IHEP, CAS, including the local cluster, the Grid Tier-2 site for ATLAS and CMS, file and storage systems, cloud infrastructure, the planned HPC system, and Internet and domestic networking.

        Speaker: Yaodong Cheng (IHEP)
      • 11
        KEK Site Report

        The new KEK Central Computer system started service on September 1st, 2016, after a renewal of all hardware. In this talk, we introduce the performance of the new system and the improvement of network connectivity with LHCONE.

        Speaker: Tomoaki Nakamura (KEK)
      • 12
        Fermilab Site Report

        News and updates from Fermilab.

        Speaker: Rennie Scott (Fermilab)
    • 12:40
      Lunch Break Building 50 Auditorium

    • Site Report Building 50 Auditorium

      • 13
        Tokyo Tier-2 Site Report

        The Tokyo Tier-2 site, located at the International Center for Elementary Particle Physics (ICEPP)
        at the University of Tokyo, provides resources for the ATLAS experiment in WLCG. In December 2015,
        almost all hardware was replaced for the 4th system. Operational experience with the new system
        and a migration plan from CREAM-CE + Torque/Maui to ARC-CE + HTCondor will be reported.

        Speaker: Tomoe Kishimoto (University of Tokyo (JP))
      • 14
        Australia-ATLAS Site report

        We will provide updates on technical and managerial changes at Australia's only HEP grid computing site.

        Speaker: Lucien Philip Boland (University of Melbourne (AU))
    • End-User IT Services & Operating Systems Building 50 Auditorium

      • 15
        Scientific Linux Status Update

        Scientific Linux status and news.

        Speaker: Rennie Scott (Fermilab)
      • 16
        An e-mail quarantine with open source software

        Filtering e-mails for security reasons is a common procedure. At DESY, e-mails with suspicious content are quarantined; users are notified and may request delivery of those e-mails. DESY is in the process of shifting from a commercial product to a quarantine solution made of open source and in-house software. This solution will be presented in the context of DESY's e-mail infrastructure.

        Speaker: Mr Dirk Jahnke-Zumbusch (DESY)
    • 15:20
      Coffee Break Building 50 Auditorium

    • Security & Networking Building 50 Auditorium

      • 17
        Platform Providing Network Awareness to ATLAS and Beyond

        With the change of the ATLAS computing model from hierarchical to dynamic, processing tasks are dispatched to sites based not only on availability of resources but also network conditions along the path between compute and storage, which may be topologically and/or geographically distant. We describe a system developed to collect, store, analyze and provide timely access to the network conditions for ATLAS sites, which is also generalized for broader use. We describe the data we collect from four different sources giving orthogonal views of network performance and utilization. The pre-existing ATLAS Distributed Computing Analytics platform is used for data transport and storage. The platform provides interactive monitoring dashboards, and serves as a backend to an alarm and alert system which we have developed for site operators. A co-located Jupyter service is used to perform in-depth interactive data analysis, train different Machine Learning algorithms and test models on historical data. We discuss how the derived knowledge gets used by ATLAS for network anomaly detection, job scheduling and data brokering.

        Speaker: Ilija Vukotic (University of Chicago (US))
      • 18
        Upgrade of network connection between KEK and SINET

        Since April 1st, 2016, SINET, the NREN for universities in Japan, has been operating its 5th-generation infrastructure, SINET5. It accepts 100 Gbps connections to the backbone from each institution and newly provides a direct path from Japan to Europe. KEK is connected to SINET with 120 Gbps of bandwidth in total, most of which
        will be used for mass data transfers via LHCONE. We will report on how we upgraded the connection and changed the monitoring scheme to maintain the security level.

        Speaker: Soh Suzuki
      • 19
        SDN-enabled Intrusion Detection System

        CERN networks are dealing with an ever-increasing volume of network traffic. The traffic leaving and entering CERN has to be precisely monitored and analysed in order to properly protect the networks from potential security breaches. To provide the required monitoring capabilities, the Computer Security team and the Networking team at CERN have joined efforts in designing and deploying a scalable Intrusion Detection System (IDS) setup. The setup features symmetrical load-balancing of monitored traffic across a pool of IDS servers with optional OpenFlow-based traffic shunting (offloading) and selective packet capturing capabilities. An experimental instance has been deployed and the solution is currently under testing, with a promising perspective of putting it into production in the near future.

        Speaker: Adam Lukasz Krajewski (CERN)
      • 20
        SDN Implementation in IHEP

        High energy physics experiments produce huge amounts of raw data, but because network resources are shared, there is no guarantee of the bandwidth available to each experiment, which may cause link competition problems. On the other hand, with the development of cloud computing technologies, IHEP has established a cloud platform based on OpenStack which ensures the flexibility of computing and storage resources, and more and more computing applications have been moved to this platform. Under the traditional network architecture, however, network capability becomes the bottleneck restricting the flexible use of cloud computing.
        This report introduces the SDN implementation at IHEP to solve the above problems: we built a dedicated and elastic network platform based on data center SDN technologies and network virtualization technologies. The SDN@WAN solution at IHEP will also be introduced.
        Finally, the test results and future work will be shared and analyzed.

        Speaker: Mrs Shan Zeng (IHEP)
    • Site Report Building 50 Auditorium

      • 21
        SLAC Site Report

        Update on the SLAC Scientific Computing Services

        SLAC’s Scientific Computing Services team provides long-term storage and
        midrange compute capability for multiple science projects across the lab.
        The team is also responsible for core enterprise (non-science) Unix
        infrastructure. A sustainable hardware lifecycle is a key part of the central
        computing strategy. We continue to push the idea of business models for
        computing services as an alternative to one-time hardware investments.
        Seamless cloud bursting for high-throughput batch compute is under
        development using OpenStack and AWS with VPN.

        Speaker: Yemi Adesanya
      • 22
        Caltech Site Report

        Caltech site report (USCMS Tier 2 site)

        Speaker: Wayne Hendricks (California Institute of Technology (US))
    • Welcome Reception Lawrence Hall of Science

    • Registration Building 50 Auditorium

    • Site Report Building 50 Auditorium

    • 10:15
      Coffee Break Building 50 Auditorium

    • Site Report Building 50 Auditorium

      • 28
        NDGF Site Report

        News and interesting events from NDGF and NeIC.

        Speaker: Erik Mattias Wadenstein (University of Umeå (SE))
      • 29
        KIT Site Report

        News about GridKa Tier-1 and other KIT IT projects and infrastructure.

        Speaker: Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE))
      • 30
        GSI Site Report

        During the last few months, HPC @ GSI has moved servers and services to the new data center Green IT Cube. This included moving the users from the old compute cluster to the new one with a new scheduler, and moving several Petabytes of data from the old to the new Lustre cluster.

        Speaker: Dr Thomas Roth (GSI Darmstadt)
      • 31
        ITER Site Report

        Critical to the success of ITER reaching its scientific goal (Q≥10) is a data system that supports the broad range of diagnostics, data analysis, and computational simulations required for this scientific mission. Such a data system, termed ITERDB in this document, will be the centralized data access point and data archival mechanism for all of ITER’s scientific data. ITERDB will provide a unified interface for accessing all types of ITER scientific data regardless of the consumer (e.g., scientist, engineer, plant operations), including interfaces for data management, archiving system administration, and health monitoring capabilities.
        Due to the INB nature of ITER, there are two parts – one located in the POZ (Plant Operation Zone) to collect experimental data and another located in the XPOZ (outside the Plant Operation Zone) to allow offline analysis execution and storage. In this paper, we will focus on the ITERDB-POZ part, the other part still being under design.
        ITER is an international project consisting of seven DAs (Domestic Agencies), and its procurement model makes integration quite challenging. To smooth integration, we developed the CODAC Core System, a mini-platform based on RHEL and EPICS which simulates the functional CODAC behaviour. Since its first version (2010), it has been extended with new features and new APIs. ITER consists of roughly 200 systems (millions of variables). In this paper, we will focus on the Data Acquisition Network (DAN). Many systems will stream data over DAN at various rates, from a few hundred kB/s to 50 GB/s. We describe in this document the various components involved in the data acquisition and data storage chain.

        Speaker: Lana Abadie (ITER)
      • 32
        T2_FI_HIP Site Report
        • hardware renewal
        • dCache and OS upgrade
        • ansible
        Speaker: Johan Henrik Guldmyr (Helsinki Institute of Physics (FI))
      • 33
        Irfu site report
        • Windows 10 migration
        • network: IPv6
        • infrastructure: monitoring
        • new H2020 call EOSF
        Speaker: Sophie Ferry
      • 34
        Wigner Datacenter - Site report

        We give an update on the infrastructure, Tier-0 hosting services, Cloud services and other recent developments at the Wigner Datacenter.

        Speaker: Mr Domokos Szabo (Wigner Datacenter)
    • 12:30
      Lunch Break Building 50 Auditorium

    • Security & Networking Building 50 Auditorium

      • 35
        Plans to support IPv6-only CPU on WLCG - an update from the HEPiX IPv6 Working Group

        This report from the HEPiX IPv6 Working Group will present activities during the last 6-12 months. With IPv4 addresses running out and with some sites and Cloud providers now wishing to offer IPv6-only CPU, together with the fact that several WLCG sites are already successfully running production dual-stack storage services, we have a plan to support IPv6 CPU from April 2017 onwards. This plan will be presented.

        Speaker: Dave Kelsey (STFC - Rutherford Appleton Lab. (GB))
      • 36
        Security Update

        What’s been happening in security for HEP? We will discuss the recent trends in the ever-changing threat landscape, and the new initiatives being put in place to protect our people, data and services. One such initiative to highlight is our focus on bootstrapping international collaboration within research and academia, encouraging communities to participate in intelligence sharing and incident response. We will also discuss developments in the technologies being used to target us and the rest of the academic community.

        Speaker: Hannah Short (CERN)
      • 37
        Pre-Studies for Wi-Fi service enhancement at CERN

        Over the last few years, the number of mobile devices connected to the CERN internal network has increased from a handful in 2006 to more than 10,000 in 2015. Wireless access is no longer a “nice to have” or just for conference and meeting rooms; support for mobility is now expected by most, if not all, of the CERN community. In this context, a full renewal of the CERN Wi-Fi network has been launched in order to provide a state-of-the-art campus-wide Wi-Fi infrastructure. Which technologies can provide an end-user experience comparable, for most applications, to a wired connection? Which solution can cover more than 200 office buildings, representing a total surface of more than 400,000 m2, while keeping a single, simple, flexible and open management platform? The presentation will focus on the pre-studies done at CERN to review the full Wi-Fi infrastructure across the campus. Moreover, modern demands for Wi-Fi connectivity, as well as the design process for the new CERN Wi-Fi network (RF planning, simulation, site surveys), will be presented.

        Speaker: Adam Wojciech Sosnowski (AGH University of Science and Technology (PL))
      • 38
        Wi-Fi service enhancement at CERN

        Over the last few years, the number of mobile devices connected to the CERN internal network has increased from a handful in 2006 to more than 10,000 in 2015. Wireless access is no longer a “nice to have” or just for conference and meeting rooms; support for mobility is now expected by most, if not all, of the CERN community. In this context, a full renewal of the CERN Wi-Fi network has been launched in order to provide a state-of-the-art campus-wide Wi-Fi infrastructure. Which technologies can provide an end-user experience comparable, for most applications, to a wired connection? Which solution can cover more than 200 office buildings, representing a total surface of more than 400,000 m2, while keeping a single, simple, flexible and open management platform? The presentation will focus on the studies and tests performed at CERN to address these issues, as well as some feedback about the global project organisation.

        Speaker: Vincent Ducret (CERN)
    • 15:40
      Coffee Break Building 50 Auditorium

    • Security & Networking Building 50 Auditorium

      • 39
        Cloud Services – Network realities

        HEP use of cloud services has brought to light various network issues that hamper the full integration of such services with WLCG resources. In this presentation we comment on the issues that have been encountered and present the ongoing actions of the international network community to facilitate the integration of cloud services into the research computing environment.

        Speaker: Tony Cass (CERN)
      • 40
        Can we trust eduGAIN?

        EduGAIN, the international identity federation, allows users from all over the world to access a globally distributed suite of academic resources. You are most likely already able to use your primary account, from CERN or your home organisation, to tap into these services! Federated Identity Management, the technology underpinning eduGAIN, brings many benefits for users and organisations alike but… how can we trust these users with our HEP services? This is one of the questions that the AARC project (https://aarc-project.eu), in which CERN is a partner, is seeking to answer. We will discuss the measures being put in place to allow WLCG to reap the rewards of eduGAIN without exposing itself to increased risk.

        Speaker: Hannah Short (CERN)
    • Storage and Filesystems Building 50 Auditorium

      • 41
        Deep dive into Spectrum Scale (formerly known as GPFS)

        The intent of this presentation is to give current (or potential) users of Spectrum Scale a deep dive into various key components and functions of the product and its usage in High Performance Computing. I will share performance data for problematic filesystem workloads like shared directory or file access, as well as demonstrate some new capabilities that have been added in the 4.2.1 release. I will further explain some I/O optimization technologies, like LROC and HAWC, that allow the use of flash technologies of various sorts to accelerate workloads. If time permits, I will also show some of the advanced performance and problem determination capabilities that were recently added to the product, including a live real-time performance demo.

        Speaker: Sven Oehme
    • Board Meeting Building 59, room 4102

    • Registration Building 50 Auditorium

    • Computing and Batch Services Building 50 Auditorium

      • 42
        HEPiX Benchmarking Working Group - Status Report HEPiX Fall 2016

        The HEPiX Benchmarking Working Group was relaunched in spring 2016. Its first tasks are:

        • Development and proposal of a fast benchmark to estimate the performance of the provided job slot (in traditional batch farms) or VM instance (in cloud environments)

        • Preliminary work for a successor of the HS06 benchmark

        This talk provides a status report of the work done so far.

        Speaker: Manfred Alef (Karlsruhe Institute of Technology (KIT))
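
        As an illustration of what such a fast benchmark might look like, the following minimal Python sketch times a fixed synthetic workload in the current job slot and reports a throughput-style score. It is not the working group's benchmark; the workload and score normalisation are assumptions for illustration only.

            # Minimal sketch of a "fast benchmark" for a job slot or VM instance.
            # This is NOT the HEPiX Benchmarking Working Group's benchmark; it only
            # illustrates timing a short, fixed synthetic workload and reporting a
            # throughput-style score for the slot it happens to run in.
            import time

            ITERATIONS = 2000000  # fixed, arbitrary workload size (assumption)

            def synthetic_workload(iterations=ITERATIONS):
                # CPU-bound loop of simple floating-point arithmetic.
                x = 0.0
                for i in range(1, iterations):
                    x += (i % 7) * 0.5 - (i % 3) * 0.25
                return x

            def fast_benchmark(repeats=5):
                # Run the workload several times; the best run approximates the
                # unloaded speed of the slot.
                timings = []
                for _ in range(repeats):
                    start = time.time()
                    synthetic_workload()
                    timings.append(time.time() - start)
                return ITERATIONS / min(timings)

            if __name__ == "__main__":
                print("slot score: %.0f workload iterations/s" % fast_benchmark())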
      • 43
        Big Data: Genomics vs. Physics

        Big data is typically characterized by only a few features, such as Volume, Velocity and Variety. This is a simplification that overlooks many factors that affect the way data is used and managed, factors that can have a profound effect on the computing systems needed to serve different communities.

        I compare the computing and data-management needs of the genomics domain with those of big physics experiments, highlight the differences between them and discuss the implications of those differences.

        Speaker: Tony Wildish (Lawrence Berkeley National Laboratory)
      • 44
        JLab's SciPhi-XVI Knights Landing Cluster

        Jefferson Lab recently installed a 200-node Knights Landing cluster, becoming an Intel® Parallel Computing Center. This talk will give an overview of the cluster installation and configuration, including its Omni-Path fabric, benchmarking, and integration with Lustre and NFS over InfiniBand.

        Speaker: Sandy Philpott
    • 10:15
      Coffee Break Building 50 Auditorium

    • Computing and Batch Services Building 50 Auditorium

      • 45
        A Race for the Data Center: POWER8 and AArch64

        x86 processors have been the long-time leaders of the server market, and x86_64 the uncontested target architecture for the development of High Energy Physics applications. Until a few years ago, interest in alternative architectures targeting server environments that could compete with x86 in terms of performance, power efficiency and total cost of ownership found no concrete response. However, the past few years have seen the introduction of new processor architectures and initiatives aimed at challenging the leading position of x86. With the introduction in 2011 of the ARMv8 Instruction Set Architecture supporting 64-bit, ARM set the first milestone for its expansion into the server landscape. The OpenPOWER Foundation, founded in 2013, set as its main goal the development of the POWER ecosystem in the server market, initially embracing under this initiative the POWER8 processor family. In 2015 we presented performance and power consumption benchmarks of uni-socket platforms that showed a significant gap between x86 and its competitors (A look beyond x86: OpenPOWER8 & AArch64, HEPiX Spring 2015). The ecosystem has grown both in terms of availability of hardware platforms and software support. I will present new performance and power consumption results covering recent dual-socket ARMv8 and POWER8 platforms.

        Speaker: Marco Guerri (CERN)
      • 46
        Dynamical Provisioning of Cloud Computing Resources for Batch Processing

        We aim to build a software service for provisioning cloud-based computing resources that can be used to augment users’ existing, fixed resources and meet their batch job demands. This service must be designed to automate the delivery of compute resources (HTCondor execute nodes) to match user job demand in such a way that cloud-based resource utilization is high and, thus, cost per cpu-hour is low. In addition, since this provisioning service will acquire resources on behalf of its users, acting as a third-party buyer for them, it is also our fiduciary responsibility to ensure the system is stable or, at least, that stability can be maintained. In order to assess if stable resource utilization is possible, a dynamical systems approach is developed to provide a framework for understanding how the provisioning service will respond to user job demand. We will present our latest results on the project and give an overview of the development plan moving forward.

        Speaker: Dr Martin Kandes (Univ. of California San Diego (US))
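
        To make the idea of demand-driven provisioning concrete, here is a minimal Python sketch of a naive feedback loop that sizes the pool of cloud execute nodes from the number of idle jobs. The query and start/stop functions are hypothetical placeholders for batch-system queries and cloud-provider API calls; the service described above uses a proper dynamical-systems analysis rather than this simple controller.

            # Naive sketch of demand-driven provisioning of cloud execute nodes.
            # query_idle_jobs, count_cloud_nodes, start_nodes and stop_nodes are
            # hypothetical callables standing in for HTCondor queries and cloud
            # provider API calls.
            import time

            JOBS_PER_NODE = 8      # assumption: one execute node runs 8 single-core jobs
            MAX_CLOUD_NODES = 100  # assumed spending/stability cap
            POLL_SECONDS = 300

            def provisioning_loop(query_idle_jobs, count_cloud_nodes,
                                  start_nodes, stop_nodes):
                while True:
                    idle = query_idle_jobs()       # idle jobs waiting in the pool
                    running = count_cloud_nodes()  # cloud execute nodes currently up
                    wanted = min(MAX_CLOUD_NODES,
                                 (idle + JOBS_PER_NODE - 1) // JOBS_PER_NODE)
                    if wanted > running:
                        start_nodes(wanted - running)  # scale out towards demand
                    elif wanted < running:
                        stop_nodes(running - wanted)   # scale in when demand drops
                    time.sleep(POLL_SECONDS)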
      • 47
        What's new in HTCondor? What is upcoming?

        The goal of the HTCondor team is to develop, implement, deploy, and evaluate mechanisms
        and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources. Increasingly, the work performed by the HTCondor developers is being driven by its partnership with the High Energy Physics (HEP) community.

        This talk will present recent changes and enhancements to HTCondor, including details on some of the enhancements created for the imminent HTCondor v8.6.0 release, changes created on behalf of the HEP community, and advancements on interactions with Docker and public cloud services. It will also discuss the upcoming HTCondor development roadmap, and seek to solicit feedback on the roadmap from HEPiX attendees.

        Speaker: Todd Tannenbaum
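
        As a small, hedged illustration of the Docker integration mentioned above, the sketch below submits a Docker-universe job through the HTCondor Python bindings. The container image and payload are arbitrary examples, and the bindings API shown (htcondor.Submit with Schedd.transaction) is the one documented for HTCondor releases of this era; details may differ between versions.

            # Hedged sketch: submit a Docker-universe job via the HTCondor Python
            # bindings. Image and command are illustrative, not site requirements.
            import htcondor

            job = htcondor.Submit({
                "universe": "docker",
                "docker_image": "centos:7",     # example image
                "executable": "/bin/hostname",  # trivial payload for illustration
                "output": "docker_job.out",
                "error": "docker_job.err",
                "log": "docker_job.log",
                "request_cpus": "1",
                "request_memory": "1024",
            })

            schedd = htcondor.Schedd()          # local schedd
            with schedd.transaction() as txn:   # queue one job
                cluster_id = job.queue(txn)
            print("submitted cluster %d" % cluster_id)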
      • 48
        Profiling data-intensive workflows on the Genepool and PDSF clusters at NERSC

        NERSC is well known for its user-friendly, large-scale computing environment. Along with the large Cray systems (Edison and Cori), NERSC also supports data-intensive workflows of the Joint Genome Institute and the HEP and materials science communities via its Genepool, PDSF and Matgen clusters. These clusters are all provisioned from a single backend cluster, Mendel. This talk will briefly outline the workflows on Mendel and provide a comparative profile of its various applications. It will also summarize various user and system incidents over the last few years of its service. A deeper analysis of the bio-informatics workflow on the Genepool compute cluster, and a plan for testing workflows on a Mendel testbed with a Cori-like environment, will be discussed. Finally, a prospective plan for the future evolution of the Genepool part of Mendel will also be outlined.

        Speaker: Dr Bhupender Thakur (NERSC, Lawrence Berkeley National Lab)
    • 12:25
      Lunch Break Building 50 Auditorium

    • BOF session: HPC hardware acquisition practices, software and application porting experiences Building 50 Auditorium


    • Storage and Filesystems Building 50 Auditorium

      • 49
        CephFS: a new generation storage platform for Australian high energy physics

        In this paper we present a CephFS use case implementation at the Center of Excellence for Particle Physics at the TeraScale (CoEPP). CoEPP operates the Australia Tier-2 for ATLAS and joins experimental and theoretical researchers from the Universities of Adelaide, Melbourne, Sydney and Monash. CephFS is used to provide a unique object storage system, deployed on commodity hardware and without single points of failure, which Australian HEP researchers at the different CoEPP locations use to store, process and share data, independent of their geographical location. CephFS also works in combination with an SRM and XROOTD implementation, integrated in ATLAS Data Management operations, and used by HEP researchers for XROOTD and/or POSIX-like access to ATLAS Tier-2 user areas. We will provide details on the architecture, its implementation and tuning, and report performance I/O metrics as experienced by different clients deployed over the WAN. We will also explain our plan to collaborate with Red Hat Inc. on extending our current model so that the metadata cluster distribution becomes multi-site aware, such that regions of the namespace can be tied or migrated to metadata servers in different data centers.

        Speaker: Goncalo Borges (University of Sydney (AU))
      • 50
        Experience of Development and Deployment of a Large-Scale Ceph-Based Data Storage System at RAL

        A new data storage system, Echo, has been under development at the RAL Tier-1 for the past two years as a replacement for CASTOR disk-only storage of LHC data. This presentation will share the RAL experience of developing and deploying a new, Ceph-based storage service at the 13 PB scale to the standard required for production use.

        This is the first new service that we have developed at this scale for some time, and Ceph is a very different technology from our existing storage solution. This presentation will explore the changes required to accommodate such a service: from the location of servers in the data centre; development of the network topology and the effect this has on data placement; the design and construction of a system that is more manageable, maintainable and upgradable by a system administrator; the adaptation of existing software in order to support LHC VO workflows; and the implementation of new software to support industry-standard protocols for both LHC VOs and other user communities. I will also discuss the changes brought by the deployment of a new OS major version and the change from sysVinit to systemd for process management, the changes to monitoring and alerting required to support the continuous operation of the service, and the risks and impacts of transitioning to this technology.

        Speaker: Bruno Canning (RAL)
      • 51
        Ceph Based Storage Systems at the RACF

        We give a report on the status of Ceph based storage systems deployed at the RHIC & ATLAS Computing Facility (RACF) that are currently providing 1 PB of data storage capacity for the object store (with Amazon S3 compliant Rados Gateway front end), block storage (RBD), and shared file system (CephFS with dCache/GridFTP front-ends) layers of Ceph storage system. The hardware and software upgrades performed over the duration of the last year are reported, including the results of performance tuning for the Rados Gateway subsystem of the cluster in order to support the high concurrency (up to 24k simultaneous connections), high granularity (about 1-10 MB payloads per client session), and high bandwidth (up to 1 GB/s of aggregate bandwidth on the WAN) data transfers via Amazon S3 compatible API in order to match the growing requirements of the ATLAS Event Service. The results of boosting the performance of our Ceph clusters using the low latency PCIe NVMe SSD storage devices and the future plans for our Ceph based storage systems are also discussed.

        Speaker: Alexandr Zaytsev (Brookhaven National Laboratory (US))
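
        Because the object store is fronted by an S3-compatible Rados Gateway, any standard S3 client can talk to it. The hedged Python sketch below uses boto3 with a hypothetical endpoint, credentials and bucket name to write and read back a small object; it illustrates the access pattern only, not the RACF production configuration.

            # Hedged sketch: talk to a Ceph Rados Gateway through its S3-compatible
            # API with boto3. Endpoint, credentials and bucket are hypothetical.
            import boto3

            s3 = boto3.client(
                "s3",
                endpoint_url="https://rgw.example.org:8080",  # hypothetical RGW endpoint
                aws_access_key_id="ACCESS_KEY",
                aws_secret_access_key="SECRET_KEY",
            )

            bucket = "eventservice-test"  # hypothetical bucket name
            s3.create_bucket(Bucket=bucket)

            # Upload a small payload (real Event Service payloads are ~1-10 MB).
            s3.put_object(Bucket=bucket, Key="sample/object-0001", Body=b"hello ceph")

            # Read it back.
            obj = s3.get_object(Bucket=bucket, Key="sample/object-0001")
            print(obj["Body"].read())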
      • 52
        Resilient dCache and other news

        New developments in dCache, in particular the resilience features of redundant head-node services, where we can now do automatic failover and rolling upgrades with little to no service impact.

        Other recent developments in areas such as Ceph support will also be covered.

        Speaker: Erik Mattias Wadenstein (University of Umeå (SE))
    • 15:40
      Coffee Break Building 50 Auditorium

    • Storage and Filesystems Building 50 Auditorium

      • 53
        Effective Data Retrieval from Massive Amounts of Tape-Resident Data

        Randomly restoring files from tape degrades read performance, primarily due to frequent tape mounts. The high latency of time-consuming tape mounts and dismounts is a major issue when accessing massive amounts of data from tape storage. BNL's mass storage system currently holds more than 80 PB of data on tape, managed by HPSS. To restore files from HPSS, we use a scheduler called ERADAT. This scheduler was originally based on code from Oak Ridge National Lab, developed in the early 2000s. After some major modifications and enhancements, ERADAT now provides advanced HPSS resource management, priority queuing, resource sharing, web-browser visibility of real-time staging activities, and advanced real-time statistics and graphs. ERADAT is also integrated with ACSLS and HPSS for near real-time mount statistics and resource control in HPSS. ERADAT is also the interface between HPSS and other applications such as the locally developed Data Carousel, providing fair resource-sharing policies and related capabilities.
        ERADAT has demonstrated great performance at BNL and other scientific organizations.

        Speaker: David Yu (Brookhaven National Laboratory (US))
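
        To illustrate the general idea behind such staging schedulers (this is not ERADAT itself), the short Python sketch below groups pending restore requests by tape and orders them by position on the tape, so that each tape is mounted once per pass and read sequentially. The request fields (tape id, position) are hypothetical.

            # Illustrative sketch of tape-ordered recall: group pending restore
            # requests by tape and read each tape sequentially, so every tape is
            # mounted only once per pass. Generic illustration, not ERADAT.
            from collections import defaultdict

            def build_recall_plan(requests):
                # requests: iterable of dicts with 'path', 'tape', 'position' keys.
                by_tape = defaultdict(list)
                for req in requests:
                    by_tape[req["tape"]].append(req)
                plan = []
                for tape in sorted(by_tape):  # one mount per tape
                    ordered = sorted(by_tape[tape], key=lambda r: r["position"])
                    plan.append((tape, [r["path"] for r in ordered]))
                return plan

            if __name__ == "__main__":
                demo = [
                    {"path": "/hpss/run1/f2", "tape": "A00123", "position": 40},
                    {"path": "/hpss/run1/f1", "tape": "A00123", "position": 7},
                    {"path": "/hpss/run2/f9", "tape": "B00077", "position": 3},
                ]
                for tape, files in build_recall_plan(demo):
                    print(tape, "->", files)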
      • 54
        EOS, DPM and FTS developments and plans

        The CERN IT-ST Analytics and Development section is responsible for the development of data management solutions for disk storage and data transfer, namely EOS, DPM and FTS.

        The talk will describe some recent developments in these three software products:

        EOS

        The integration and evaluation of various technologies to do the transition from a single active in-memory namespace to a scale-out implementation distributed over many meta-data servers. The new architecture aims to separate the data from the application logic and user interface code, thus providing flexibility and scalability to the namespace component.

        DPM

        The implementation of a new core daemon (DOME) based on the fast-CGI and RESTful technologies. This brings the opportunity of working in a totally SRM-free mode, the implementation of quotas, free/used space on directories, and the implementation of volatile pools that can pull files from external sources, which can be used to deploy data caches.

        FTS

        The extension to better support data transfer workflows between Grid, Cloud and HPC systems. This includes FTS3 implementing protocol translations and performing efficient third-party transfers over HTTP. One of the core components (the Optimizer) has also been rewritten to allow ranges of active transfers and better exploitation of the network resources.

        Speaker: Andrea Manzi (CERN)
      • 55
        ZFS on Linux

        ZFS is a combination of file system, logical volume manager, and software RAID system developed by Sun Microsystems for the Solaris OS. ZFS simplifies the administration of disk storage, and on Solaris it has been well regarded for its high performance, reliability, and stability for many years. It is used successfully for enterprise storage administration around the globe, but so far on such systems ZFS was mainly used to provide storage, for example for user home directories, through NFS and similar network protocols.

        Within GridPP, ZFS was also used before for the management of user home directories through NFS. These systems were based on Solaris or similar systems like the ones provided by Nexenta. However, most of the Grid middleware runs on Linux systems and not on Solaris, and therefore ZFS was not used so far for Grid storage management or, in general, for Grid middleware servers.

        Since ZFS is now available in a stable version on Linux, I will present our experience with ZFS on Linux since we started, at the end of last year, to migrate all GridPP storage at our site (about 1 PB) to ZFS management using its current Linux version. Since with ever larger disk capacities RAID6 rebuild times soon become too long to be feasible, ZFS's built-in RAID functionality was tested as an alternative to hardware RAID systems, and the results will be presented. I will also report on other ZFS-specific properties like compression, NFS sharing, and snapshots, and how they work in the Linux port.
        ZFS on Linux could be an efficient and cost-effective alternative to hardware RAID and Solaris-based systems, with characteristics no other file system can provide, offering real data safety and reliability.

        Speaker: Marcus Ebert (University of Edinburgh (GB))
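
        For readers unfamiliar with the features mentioned above, the hedged Python sketch below drives the standard zpool/zfs command-line tools to create a raidz2 pool (a software alternative to hardware RAID6), enable lz4 compression and NFS sharing, and take a snapshot. Pool, dataset and device names are hypothetical; this is an illustration, not the deployment recipe used at the site.

            # Hedged sketch of the ZFS-on-Linux features discussed above, driven
            # through the standard zpool/zfs tools. Pool, dataset and device names
            # are hypothetical; do not run this on devices holding data you need.
            import subprocess

            def run(cmd):
                print("+ " + " ".join(cmd))
                subprocess.check_call(cmd)

            # raidz2: two-disk redundancy in software, comparable to RAID6.
            run(["zpool", "create", "gridstore", "raidz2",
                 "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg"])

            # A dataset for grid storage with transparent lz4 compression.
            run(["zfs", "create", "gridstore/dpm"])
            run(["zfs", "set", "compression=lz4", "gridstore/dpm"])

            # Built-in NFS export of the dataset.
            run(["zfs", "set", "sharenfs=on", "gridstore/dpm"])

            # A point-in-time snapshot, e.g. before an upgrade.
            run(["zfs", "snapshot", "gridstore/dpm@pre-upgrade"])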
      • 56
        OSiRIS: One Year Update

        The OSiRIS (Open Storage Research Infrastructure) project started in September 2015, funded under the NSF CC*DNI DIBBs program (NSF grant #1541335). This program seeks solutions to the challenges many scientific disciplines are facing with the rapidly increasing size,
        variety and complexity of data they must work with. As the data grows, scientists are challenged to manage, share and analyze that data and become diverted from a focus on their scientific research to data-access and data-management concerns. Even more problematic is determining how to support many scientists sharing and accessing this ever increasing amount of data across multiple institutions.

        We will describe the progress made during the OSiRIS project's first year. OSiRIS has fully deployed and benchmarked its initial multi-institutional Ceph deployment. Doing this involved developing, deploying and configuring a number of tools to support consistent provisioning, monitoring and management of the distributed OSiRIS infrastructure. We will cover those details and discuss our initial science engagements and near-term plans for our hardware, Ceph, Authentication/Authorization and Software Defined Networking, as well as the longer-term plans for this 5-year project.

        Speaker: Shawn Mc Kee (University of Michigan (US))
    • Conference Dinner UC Berkeley Faculty Club

    • Miscellaneous: Safety Announcement Building 50 Auditorium

    • Storage and Filesystems Building 50 Auditorium

      • 57
        Update from Database Services

        With terabytes of data stored in databases and Hadoop at CERN and a great number of critical applications relying on them, the database service is evolving and the Hadoop service is expanding to adapt to the changing needs and requirements of its users. The demand is high and the scope is broad. This presentation gives an overview of the current state of the database services and of new technologies being introduced in the Hadoop service to make better use of the latest hardware developments. An update on the Database-on-Demand management model and technologies (MySQL, PostgreSQL) will also be provided.

        Speaker: Katarzyna Maria Dziedziniewicz-Wojcik (CERN)
      • 58
        AFS phaseout at CERN

        (Open)AFS has been used at CERN as a general-purpose filesystem for Linux home directories and project space for over 20 years. It has an excellent track record, but is showing its age. It is now slowly being phased out due to concerns about the project's long-term viability. The talk will briefly explain CERN's reasons for phasing it out, give an overview of the process, introduce the migration targets for the various use cases (primarily EOS-FUSE), and highlight the challenges (and opportunities) of this migration.

        Speaker: Jan Iven (CERN)
      • 59
        The future of AFS family file systems in research computing

        Since the introduction of Transarc AFS in 1991, the AFS family of file systems have played a role in research computing around the globe.

        This talk will discuss the resurgence in development of the AFS family of file systems. A summary of recent development for several family members will be presented including:

        • AuriStor File System suite of clients and servers
        • kAFS, the Linux in-tree client and the associated AF_RXRPC socket interface
        • OpenAFS clients and servers

        The talk will describe the potential uses of the /afs file namespace as a persistent storage solution for Containers.

        Finally, the talk will discuss the Tennessee Open Research storage Cloud (TORC) proposal that was submitted to the U.S. National Science Foundation for funding as part of the Cyber Infrastructure initiative. If funded, TORC will provide a wide-area, high-performance and interoperable storage infrastructure designed for scalable, multi-level federation under cooperative management. TORC will combine the global, federated /afs file namespace and the multi-level security and privacy provided by the AuriStor File System with the high performance, scalability and reliability of L-Store and the Internet Backplane Protocol.

        Speaker: Mr Jeffrey Altman
    • 10:20
      Coffee Break & California earthquake drill Building 50 Auditorium

    • IT Facilities and Business Continuity Building 50 Auditorium

      • 60
        Deploying Open Compute hardware at CERN

        The Open Compute Project, OCP, was launched by Facebook in 2011 with the objective of building efficient computing infrastructures at the lowest possible cost. Specifications and design documents for Open Compute systems are released under open licenses, following the model traditionally associated with open source software projects. In 2014 we presented our plans for a public procurement activity for a small-size Open Compute hardware installation aimed at assessing the maturity of the OCP market and whether it could be identified as a possible competitor of "traditional" hardware (Open Compute at CERN, HEPiX Spring 2014). In September 2015 we finally deployed six Open Compute racks populated with CPU servers and storage enclosures in CERN's Meyrin data centre. We were presented with interesting challenges during all phases of the project and at all levels of the stack, from power distribution to hardware monitoring. I will outline some of the hurdles we had to overcome and the lessons we have learnt along the way, together with the results obtained during the evaluation of the systems.

        Speaker: Marco Guerri (CERN)
      • 61
        CERN Computing Facilities Evolution

        This talk will give an overview of current activities to expand CERN's computing facilities infrastructure. This will include a description of the 2nd Network Hub currently being constructed as well as its purpose. It will also cover the initial plans for a possible second Data Centre on the CERN site.

        Speaker: Wayne Salter (CERN)
      • 62
        The role of dedicated computing centers in the age of cloud computing

        BNL anticipates significant growth in scientific programs with large
        computing and data storage needs in the near future and has recently
        re-organized support for scientific computing to meet these needs.
        A key component is the enhanced role of the RHIC-ATLAS Computing
        Facility (RACF) in support of HTC and HPC at BNL.

        This presentation discusses the evolving role of the RACF at BNL, in
        light of its growing portfolio of responsibilities and its increasing
        integration with cloud (academic and for-profit) computing activities.
        We also discuss BNL's plan to build a new computing center to support
        the new responsibilities of the RACF and present a summary of the cost
        benefit analysis done, including the types of computing activities
        that benefit most from a local data center vs. cloud computing. This
        analysis is partly based on an updated cost comparison of Amazon EC2
        computing services and the RACF, which was originally conducted in 2012.

        Speaker: Tony Wong (Brookhaven National Laboratory)
      • 63
        GreenITCube - Status & Monitoring

        The GreenITCube has been in production for half a year now. We want to present our experience so far and what we have learned about the system, and give an outlook for the next couple of months.

        As the second part of the talk, we want to give a detailed overview of the infrastructure monitoring. The focus will be on the different systems we have in operation and how we bring all the monitoring data together.

        Speaker: Mr Jan Trautmann (GSI Darmstadt)
    • 12:25
      Lunch Break Building 50 Auditorium

    • Basic IT Services Building 50 Auditorium

      • 64
        Monitoring HTCondor with Clustered Graphite and Grafana

        Grafana is a popular tool for data analytics, and HTCondor generates
        large amounts of time-series data appropriate for the kinds of analysis
        Grafana provides. We use a Graphite cluster, which will be described in
        some detail, as a back-end for metric storage, and adapted some scripts
        from Fermilab for metric gathering. This work is in the context of the
        batch-monitoring working group.

        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
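
        As a minimal illustration of the metric-gathering step, the hedged Python sketch below counts job states through the HTCondor Python bindings and pushes the counts to Graphite's plaintext protocol (TCP port 2003). The Graphite host and metric prefix are hypothetical, and the production setup described above is based on adapted Fermilab scripts rather than this code.

            # Hedged sketch: gather a few HTCondor job-state counts and push them
            # to Graphite via the plaintext protocol ("metric value timestamp\n"
            # sent to TCP port 2003). Host and metric prefix are hypothetical.
            import socket
            import time

            import htcondor

            GRAPHITE_HOST = "graphite.example.org"  # hypothetical carbon endpoint
            GRAPHITE_PORT = 2003
            PREFIX = "htcondor.pool"

            # JobStatus codes: 1 = idle, 2 = running, 5 = held.
            counts = {"idle": 0, "running": 0, "held": 0}
            for ad in htcondor.Schedd().query("true", ["JobStatus"]):
                status = ad.get("JobStatus")
                if status == 1:
                    counts["idle"] += 1
                elif status == 2:
                    counts["running"] += 1
                elif status == 5:
                    counts["held"] += 1

            now = int(time.time())
            lines = ["%s.%s %d %d" % (PREFIX, name, value, now)
                     for name, value in counts.items()]

            sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT))
            sock.sendall(("\n".join(lines) + "\n").encode("ascii"))
            sock.close()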
      • 65
        Introduction of load balancers at a Tier-1 site

        Historically at the RAL Tier-1 we have always exposed public-facing services directly to the internet via static DNS entries. This is far from ideal, as it means that users experience connection failures during server maintenance (both planned and unplanned) and any changes to the servers behind a particular service require DNS changes. Since April we have been using HAProxy and Keepalived in production to provide a highly available load balancer in front of FTS3, in order to avoid the issues resulting from the use of DNS aliases. We are also making extensive use of HAProxy and Keepalived for our OpenStack cloud, which is under development. Here we will describe our setup, our experience with load balancers for FTS3 and OpenStack, as well as our progress and plans for other services.

        Speaker: Ian Collier (STFC - Rutherford Appleton Lab. (GB))
      • 66
        Renewal of Puppet for Australia-ATLAS

        Australia-ATLAS has been running Puppet for all infrastructure and Grid nodes since 2012. With the release of Puppet 4, and the move to CentOS 7, we decided to rejig our Puppet configuration using what we've learnt in 4 years and best-practice methodologies. This talk will describe the problems we had with the old Puppet config, the decisions we made constructing the new system, and how the new system makes configuration management much easier.

        Speaker: Mr Sean Crosby (University of Melbourne (AU))
    • 15:15
      Coffee Break Building 50 Auditorium

    • Basic IT Services Building 50 Auditorium

      • 67
        User/group based access control for ElasticSearch + Kibana

        Kibana and ElasticSearch are used for monitoring in many places. However, by default they do not support authentication and authorization features. In the case of a single Kibana and ElasticSearch service shared among many users, any user that can access Kibana can retrieve any information from ElasticSearch.

        In this talk, we will report on our latest R&D experience in securing the Kibana and ElasticSearch services. We will describe a Kibana plugin that allows Kibana dashboards to be separated based on user/group. We will also describe the effect on performance of using SearchGuard, an ElasticSearch plugin that enables user/group-based access control.

        Speaker: Wataru Takase (High Energy Accelerator Research Organization (JP))
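
        To make the access-control model concrete, the hedged Python sketch below performs a per-user query through the official Elasticsearch client over HTTPS with basic authentication, the kind of per-user access such an authorization plugin enforces. Host, credentials, CA file and index pattern are hypothetical, and the plugin configuration that maps users/groups to allowed indices is not shown.

            # Hedged sketch: query a secured Elasticsearch cluster as an individual
            # user. Host, credentials, CA bundle and index names are hypothetical;
            # the user-to-index authorization mapping lives in the plugin config.
            from elasticsearch import Elasticsearch

            es = Elasticsearch(
                ["https://es.example.org:9200"],
                http_auth=("alice", "alice-password"),  # per-user credentials
                verify_certs=True,
                ca_certs="/etc/pki/tls/certs/ca-bundle.crt",
            )

            # Alice may only be authorized for her group's indices, e.g. "groupa-*".
            result = es.search(
                index="groupa-logs-*",
                body={"query": {"match": {"status": "error"}}, "size": 5},
            )
            for hit in result["hits"]["hits"]:
                print(hit["_source"])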
      • 68
        Adopting Red Hat Satellite 6 for Lifecycle Management

        An overview of results and lessons learned from the Fermilab Scientific Linux and Architecture Management (SLAM) group's Satellite 6 Lifecycle Management Project. The SLAM team offers a portfolio of diverse system management service offerings with a small staff. Managing the risk of resource scarcity involves implementing tools and processes that facilitate standardization, reduce complexity, and increase efficiency whenever possible. This short talk will give a brief overview of our experience, the results, and the future of migrating to Satellite 6.1 as our new base for system management.

        Speaker: Rennie Scott (Fermilab)
    • Grid, Cloud and Virtualisation Building 50 Auditorium

      • 69
        Chameleon: A Computer Science Testbed as Application of Cloud Computing

        Did you ever need hundreds of state-of-the-art nodes that you could use to scalably test new ideas on? Run experiments that are not disrupted by what other users are doing? A platform that allows you to reinstall the operating system, recompile the kernel, and gives you access to the console so that you can debug the system? A place where your research team can easily reproduce experiments carried out weeks ago? A lab where your students can work with different hardware configurations, from Infiniband to GPUs, either as part of a class or homework?

        This talk will introduce Chameleon, a large-scale, deeply reconfigurable NSF-funded testbed for Computer Science research and education (www.chameleoncloud.org). The testbed consists of ~600 nodes (~14,000 cores) and a total of 5 PB of disk space hosted at the University of Chicago and TACC, and leverages a 100 Gbps connection between the sites. The hardware consists primarily of homogenous nodes to support large-scale experiments – but subgroups of those nodes are equipped with additional capabilities including Infiniband networking, high-bandwidth I/O storage nodes, GPUs, and storage hierarchies with a mix of HDDs, SSDs, NVRAM, and high memory. To support Computer Science experiments, ranging from operating system and virtualization research to security research, Chameleon provides a configuration system giving users exclusive access to bare metal nodes on an “as if it were in your lab” basis, i.e., full control of the software stack including root privileges, kernel customization, and console access. In addition, to facilitate educational and application exploratory projects, Chameleon also provides a KVM cloud.

        I will describe user-facing Chameleon capabilities, describe some of the projects that the testbed has supported in the past, and explain how the testbed was built and will continue to develop.

        Speaker: Kate Keahey (Argonne National Laboratory)
      • 70
        Extending the farm to external sites: the INFN Tier-1 experience

        The Tier-1 at CNAF is the main INFN computing facility, offering computing and storage resources to more than 30 different scientific collaborations including the 4 experiments at the LHC. A huge increase in computing needs is foreseen in the coming years, mainly driven by the experiments at the LHC (especially starting with Run 3 in 2021) but also by other upcoming experiments such as CTA.
        While we are considering an upgrade of the infrastructure of our data center, we are also evaluating the possibility of using CPU resources available in other data centers or even leased from commercial cloud providers.
        Hence, at the INFN Tier-1 we have pledged a small amount of computing resources (~2000 cores located at the Bari ReCaS data center) for the WLCG experiments for 2016, and we are testing the use of resources provided by a commercial cloud provider. While the Bari ReCaS data center is directly connected to the GARR network, with the obvious advantage of a low-latency and high-bandwidth connection, in the case of the commercial provider we rely only on the General Purpose Network.
        In this presentation we describe the setup phase and the first results of these installations, started in the last quarter of 2015, focusing on the issues that we had to deal with and discussing the measured results in terms of efficiency.

        Speaker: Andrea Chierici (INFN-CNAF)
    • Security & Networking Building 50 Auditorium

      • 71
        Effective and non-intrusive security within NERSC’s Open Science HPC environment

        Providing effective and non-intrusive security within NERSC’s Open
        Science HPC environment introduces a number of challenges for both
        researchers and operational personnel. As what constitutes HPC expands
        in scope and complexity, the need for timely and accurate decision
        making about user activity remains unchanged. This growing complexity
        is balanced against a backdrop of routine user and application
        attacks, which remain surprisingly effective over time.

        This presentation describes current efforts at NERSC to maintain
        system integrity without getting in the way of the science being done
        here. These efforts include network monitoring, two-factor
        authentication, and ssh and host-based data analysis.

        Speaker: Abe Singer (Lawrence Berkeley Lab)
    • Grid, Cloud and Virtualisation Building 50 Auditorium

      • 72
        On-demand provisioning of HEP compute resources on cloud sites and shared HPC centers

        This contribution reports on solutions, experiences and recent developments with the dynamic, on-demand provisioning of remote computing resources for analysis and simulation workflows. Local resources of a physics institute are extended by private and commercial cloud sites, ranging from desktop clusters and institute clusters to HPC centers.

        We report on recent experience from incorporating a remote HPC center (NEMO Cluster, Freiburg University) and resources dynamically requested from a commercial provider (1&1 Internet SE), which have been seamlessly tied together with the ROCED scheduler [1] such that, from the user perspective, local and remote resources form a uniform, virtual computing cluster with a single point-of-entry. On a local test system, the usage of Docker containers has been explored and shown to be a viable and light-weight alternative to full virtualization solutions in trusted environments.

        [1] O. Oberst et al. Dynamic Extension of a Virtualized Cluster by using Cloud
        Resources, J. Phys.: Conference Ser. 396(3)032081, 2012

        Speaker: Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE))
      • 73
        Update on HNSciCloud project

        Overview of what has happened in HNSciCloud over the last five months

        Speaker: Helge Meinhard (CERN)
      • 74
        The advances in IHEP Cloud facility

        At IHEP, more large scientific facilities are requesting more computing resources. Managing large-scale resources requires an efficient and flexible system architecture, and virtual computing through cloud technologies is one approach. IHEPCloud is a private IaaS cloud which supports multiple users and projects to provide virtual computing. In this paper, we describe the infrastructure of the virtual computing cluster at IHEP and discuss the work we have done. We also show performance testing for BES jobs. IHEPCloud has been online since November 2014 and works well. The performance penalty is acceptable.

        Speaker: Tao Cui (IHEP(Institute of High Energy Physics, CAS,China))
    • 10:15
      Coffee Break Building 50 Auditorium

    • Grid, Cloud and Virtualisation Building 50 Auditorium

      • 75
        Running HEP Workloads on the NERSC HPC Systems

        Running HEP workloads on a Cray system can be challenging since these systems typically don't look very much like a standard Linux system. This presentation will describe several tools NERSC has deployed to enhance HEP and other data-intensive computing: Shifter, a Docker-like container technology developed at NERSC; the Burst Buffer, a super-fast IO layer; and a software-defined network that allows high-speed connections to the outside world. We will give an overview of the software and hardware architecture, deployment, and performance of these services.

        Speaker: Tony Quan (LBL)
      • 76
        Further Adventures in Container Orchestration at RAL

        We provide an update on our continued experiments with container orchestration at the RAL Tier 1.

        Speaker: Ian Collier (STFC - Rutherford Appleton Lab. (GB))
      • 77
        CSNS Computing Environment Based on OpenStack

        OpenStack is open source software for creating private and public clouds. It controls large pools of compute, storage, and networking resources throughout a datacenter, managed through a dashboard or via the OpenStack API. Hundreds of the world’s largest brands rely on OpenStack to run their businesses every day, reducing costs and helping them move faster.
        We are applying this computing model to the China Spallation Neutron Source (CSNS) computing environment. From the research and practice perspectives, this paper first introduces the application status of cloud computing in High Energy Physics experiments and the specific requirements of CSNS. Secondly, our design and practice of a cloud computing platform based on OpenStack are demonstrated, covering the cloud computing system framework, some improvements to OpenStack networking, the storage architecture, and so on. Finally, some future prospects of the CSNS cloud computing environment are discussed at the end of this paper.

        Speaker: Yakang Li (IHEP)
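
        As a small illustration of driving such a platform programmatically rather than through the dashboard, the hedged Python sketch below boots a compute instance by calling the unified OpenStack command-line client. Image, flavor, network, keypair and server names are hypothetical, and credentials are assumed to come from the usual OS_* environment variables (a sourced openrc file).

            # Hedged sketch: boot an OpenStack instance by driving the unified
            # "openstack" CLI from Python. All names are hypothetical; credentials
            # are expected in the OS_* environment variables.
            import subprocess

            server_name = "csns-worker-001"  # hypothetical instance name

            subprocess.check_call([
                "openstack", "server", "create",
                "--image", "CentOS-7-x86_64",  # hypothetical image
                "--flavor", "m1.large",        # hypothetical flavor
                "--network", "physics-net",    # hypothetical tenant network
                "--key-name", "ops-key",       # hypothetical keypair
                "--wait",
                server_name,
            ])

            # Show the resulting server record.
            subprocess.check_call(["openstack", "server", "show", server_name])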
    • Closing and HEPiX Business Building 50 Auditorium

      Convener: Tony Wong (Brookhaven National Laboratory)