HEPiX Spring 2017 Workshop

Europe/Budapest
Hungarian Academy of Sciences

Hungarian Academy of Sciences

Széchenyi István tér 9, 1051 Budapest, Hungary
Helge Meinhard (CERN), Tony Wong (Brookhaven National Laboratory)
Description

HEPiX Spring 2017 in Budapest (Hungary)

The HEPiX forum brings together worldwide Information Technology staff, including system administrators, system engineers, and managers from High Energy Physics and Nuclear Physics laboratories and institutes, to foster a learning and sharing experience between sites facing scientific computing and data challenges.

Participating sites include BNL, CERN, DESY, FNAL, IHEP, IN2P3, INFN, IRFU, JLAB, KEK, LBNL, NDGF, NIKHEF, PIC, RAL, SLAC, TRIUMF, and many others.

The workshop is hosted by the Wigner Research Centre for Physics, part of the Hungarian Academy of Sciences.

 

Gold sponsor:

Silver sponsors:

    • 08:00 09:00
      Registration 1h
    • 09:00 09:30
      Welcome
      • 09:00
        Logistics and general announcements 10m
      • 09:10
        Welcome 20m
        Speakers: Balazs Bago (Hungarian Academy of Sciences (HU)), Szilvia Racz (Wigner Datacenter), Mr Gabor Peto (Wigner Datacenter)
    • 09:30 10:45
      Site reports
    • 10:45 11:15
      Coffee break 30m
    • 11:15 13:00
      Site reports
      • 11:15
        Swiss National Supercomputing Centre T2 Site report 15m

        Site report, news and ongoing activities at the Swiss National Supercomputing Centre T2 site (CSCS-LCG2) running ATLAS, CMS and LHCb.

        Speaker: Dino Conciatore (Eidgenoessische Technische Hochschule Zuerich (CH))
      • 11:30
        T2_FI_HIP Site Report 15m
        • More hardware issues with HPE SL4510 Gen9
        • Parsing HP ADU Reports
        • dCache upgrade
        • IPv6
        Speaker: Johan Henrik Guldmyr (Helsinki Institute of Physics (FI))
      • 11:45
        Site Report for Max Delbrück Center for Molecular Medicine 15m

        I will give a short overview of our institute and its IT capabilities.

        Speaker: Alf Wachsmann (Max Delbrück Center for Molecular Medicine (MDC))
      • 12:00
        NDGF Site Report 15m

        Report on new developments and insights from NDGF. The report will focus on half a year of experience with HA dCache and how this works for us in practice.

        Speaker: Erik Mattias Wadenstein (University of Umeå (SE))
      • 12:15
        BNL RACF Site Report 15m

        An overview of BNL's RHIC/ATLAS Computing Facility, highlighting significant developments since the last HEPiX meeting at LBNL.

        Speaker: Ofer Rind
      • 12:30
        AGLT2 Site Report Spring 2017 15m

        We will present an update on our site since the Fall 2016 report, covering our changes in software, tools and operations.

        Some of the details to cover include changes and updates to our networking, storage and deployed middleware.

        We conclude with a summary of what has worked and what problems we encountered and indicate directions for future work.

        Speaker: Shawn Mc Kee (University of Michigan (US))
      • 12:45
        Nebraska Site Report 15m

        In the last year, the Nebraska site has worked hard to reinvent the services offered to its user communities. The high-throughput-computing resources have successfully transitioned to Docker, offering more flexibility in terms of OS environments. We have upgraded and improved our CVMFS infrastructure, allowing local users to use it heavily for data distribution. Finally, we have adopted Singularity as a site-wide mechanism for users to run containers.

        In this report, we update the community on these and other activities at Nebraska.

        Speaker: Brian Paul Bockelman (University of Nebraska-Lincoln (US))
    • 13:00 14:30
      Lunch break 1h 30m
    • 14:30 15:20
      End-user services and operating systems
      • 14:30
        CERN Linux services status update 25m

        An update on the Linux distributions and services supported at CERN.

        An update on the CentOS community and CERN's involvement will be given. We will discuss the Software Collections, Virtualization and OpenStack SIGs and how we use them.

        We will present our new Puppet-based configuration tool and its future.

        A brief status of the community's work on alternative architectures (aarch64, ppc64le, etc.) will be given.

        Speaker: Ulrich Schwickerath (CERN)
      • 14:55
        SW & Computing for Big Science Journal 25m

        The initiative to create a journal on Software and Computing for Big Science was presented one year ago at HEPiX Berlin. The journal has now been launched. This talk will recall the goals of the journal and explain how to contribute.

        Speaker: Michel Jouvin (Universite de Paris-Sud 11 (FR))
    • 15:20 15:45
      Security and networking

      • 15:20
        IPv6 at the RAL Tier 1 25m

        After many months of work, the WLCG Tier 1 centre at RAL has begun to deploy IPv6 addresses to production hosts. This talk will detail the work that has been done and explain the strategy that has been adopted for managing addresses in a dual-stack environment.

        Speaker: James Adams (STFC RAL)
    • 15:45 16:15
      Coffee break 30m
    • 16:15 17:05
      Basic IT services
      • 16:15
        A Hard Puppexit from 3 to 4 25m

        During the first quarter of 2017 CERN IT migrated from a Puppet 3-based service to a Puppet 4 one. We highlight the steps we took, the methods we used and the problems we discovered along the way.

        Speaker: Jerome Belleman (CERN)
      • 16:40
        salt stack - Using, Extending, and programming 25m

        Saltstack is a relatively new configuration management tool that was first developed for remote execution. This talk will cover my experiences with Salt in two organizations, in two different roles.

        • Cleaning up an organization's use of Salt.
        • Writing Ceph execution modules in Python (see the sketch below).
        Speaker: Owen Synge
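
        A minimal, hypothetical sketch of what such a custom Salt execution module for Ceph could look like; the module and function names below are invented for illustration and are not taken from the talk.

        # _modules/ceph_health.py -- hypothetical module, synced to minions with `salt '*' saltutil.sync_modules`
        """Expose a simple Ceph query as a Salt execution module function."""
        import json
        import shutil  # assumes a Python 3 minion for shutil.which

        def __virtual__():
            # Only load this module on minions that actually have the ceph CLI installed.
            if shutil.which('ceph'):
                return 'ceph_health'
            return (False, 'ceph binary not found')

        def status():
            """Return `ceph status` parsed into a dict.

            Usage from the master:  salt 'mon*' ceph_health.status
            """
            # __salt__ is injected by Salt at load time and gives access to other execution modules.
            out = __salt__['cmd.run']('ceph status --format json')
            return json.loads(out)
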
    • 17:05 17:20
      Site reports
      • 17:05
        CEA Site Report 15m

        We present an update on the changes at our site since the 2016 report. Through the presentation we share the advances, roadblocks and achievements concerning different aspects of our facility (Unix, grid, projects, etc.).
        We conclude by summing up and outlining our goals.

        Speaker: Arnab Sinha (CEA/IRFU)
    • 17:45 19:45
      Welcome reception 2h Krúdy Hall

      Krúdy Hall

      Hungarian Academy of Sciences

    • 08:30 09:00
      Registration 30m
    • 09:00 10:00
      Site reports
      • 09:00
        Experience on the operations at new KEKCC 15m

        The KEK central computer system was upgraded in September 2016. In this talk, we report on our experience operating the hierarchical storage system and the Grid system, including their status and usage since the upgrade.

        Speaker: Tomoaki Nakamura (High Energy Accelerator Research Organization (JP))
      • 09:15
        Tokyo Tier-2 Site Report 15m

        The Tokyo Tier-2 site, located in the International Center for Elementary Particle Physics (ICEPP) at the University of Tokyo, provides computing resources for the ATLAS experiment in the WLCG.
        Updates on the site since the Fall 2016 meeting, including the status of the batch system migration and the implementation of redundancy in the database of the storage element, will be reported.

        Speaker: Tomoe Kishimoto (University of Tokyo (JP))
      • 09:30
        KR-KISTI-GSDC-01 Tier-1 Site Reports 15m

        We will present the latest status of the GSDC, as well as the migration plan for our administrative system.

        Speaker: Jeongheon Kim (Korea Institute of Science and Technology Information)
      • 09:45
        Status of IHEP site 15m

        The report covers the current status of the IHEP site, including the new physics experiments it supports, the migration to an HTCondor cluster, the EOS and Lustre file systems deployed at IHEP, and the network upgrades since October 2016.

        Speaker: Jingyan Shi (IHEP)
    • 10:00 10:25
      Security and networking

      • 10:00
        Computer Security Update 25m

        This presentation provides an update on the global security landscape since the last HEPiX meeting. It describes the main vectors of compromise in the academic community, including lessons learnt, and presents interesting recent attacks while providing recommendations on how best to protect ourselves. It also covers security risk management in general, as well as the security aspects of the current hot topics in computing. By showing how the attacks we are facing are both sophisticated and profitable, the presentation concludes that the only means of mounting an appropriate response is to build a tight international collaboration and trusted information sharing mechanisms within the community.

        This talk is based on contributions and input from the CERN Computer Security Team.

        Speaker: Liviu Valsan (CERN)
    • 10:25 11:00
      Coffee break 35m
    • 11:00 13:00
      Security and networking: How to make use of MISP to harness the power of threat intelligence inside the HEP community (Liviu Valsan and Romain Wartel)

      Cyber attacks we are faced with are not only sophisticated, but in many instances also highly profitable for the actors behind them. It takes organizations on average more than six months to detect a cyber attack. The more sophisticated the attack, the likelier it is to pass undetected for longer. And by the time the breach is detected, it is often already too late.

      The only means of mounting an appropriate response is to build a tight international collaboration and to implement trusted information sharing mechanisms within the community. The groundwork for this has already been laid in the WLCG Security Operations Centers Working Group by setting up a central HEP MISP instance and by investigating the sharing of threat intelligence data inside the community.

      Workshop attendees will be given an introduction to the MISP platform, will deploy their own MISP instance, and will set up sharing with other MISP instances. Access will be provided to the central HEP MISP instance, which currently contains more than 160,000 Indicators of Compromise, with new ones constantly added as new threats emerge. This is an excellent opportunity to learn how best to make use of the platform and of the threat intelligence already available inside the community, to protect both your organization and the community as a whole.
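
      As a flavour of what programmatic access to such a MISP instance looks like, here is a minimal sketch using the PyMISP client library; the URL and API key are placeholders, and the exact layout of the returned data depends on the PyMISP version in use.

      # Minimal sketch: pull network indicators from a MISP instance with PyMISP.
      # The URL and API key are placeholders, not the real HEP MISP endpoint.
      from pymisp import PyMISP

      misp = PyMISP('https://misp.example.org', 'YOUR_API_KEY', ssl=True)

      # Search for destination-IP attributes that are flagged for IDS export.
      result = misp.search(controller='attributes', type_attribute='ip-dst', to_ids=True)

      # Recent PyMISP versions return {'Attribute': [...]}; older ones wrap it in 'response'.
      for attribute in result.get('Attribute', []):
          print(attribute['value'], attribute.get('category'))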

    • 13:00 14:30
      Lunch break 1h 30m
    • 14:30 15:45
      Security and networking

      • 14:30
        WLCG/OSG Networking Update 25m

        WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The OSG Networking Area is a partner of the WLCG effort and is focused on being the primary source of networking information for its partners and constituents.

        The WLCG Throughput working group was established to ensure sites and experiments can better understand and fix networking issues. In addition, it aims to integrate and combine all network-related monitoring data collected by the OSG/WLCG infrastructure from both network and transfer systems. This has been facilitated by the already existing network of the perfSONAR instances that have been commissioned to operate in full production.

        We will cover the updates to a few higher-level services that were developed to help bring the perfSONAR network to its full potential. These include a new web-based mesh configuration system, which allows all the network tests performed by the instances to be centrally scheduled and managed; updates and changes to our network datastore (esmond), which collects, stores and provides interfaces to access all the network monitoring information from a single place; and perfSONAR infrastructure monitoring based on the Experiments Testing Framework (ETF), which ensures that the current perfSONAR instances are configured and operated correctly.

        In this presentation we will provide an update on how to use and benefit from perfSONAR, including information on changes included in the recent release of version 4.0 of the Toolkit. We will also cover the status of our WLCG/OSG deployment and provide some information on our future plans.

        Speaker: Shawn Mc Kee (University of Michigan (US))
      • 14:55
        ESnet Update 25m

        ESnet staff are in the early stages of planning the next generation of their network, ESnet6. ESnet provides network services to all of the large US LHC computing centers, and this community is the biggest user of the current ESnet5 network. ESnet6 is expected to be online during LHC Run 3 and Run 4. How the LHC community uses the network has a big impact on the ESnet6 project and the resulting network.

        This talk will provide a brief overview of the ESnet6 project process, the ESnet6 R&D efforts, WAN capacity planning, and how network usage patterns impact network operations and the capacity planning process.

        Speaker: Mr Joe Metzger (ESnet)
    • 15:45 16:15
      Coffee break 30m
    • 16:15 17:30
      Security and networking

      • 16:15
        Deployment of IPv6-only CPU on WLCG - an update from the HEPiX IPv6 Working Group 25m

        This update from the HEPiX IPv6 Working Group will present activities during the last 6-12 months. In September 2016, the WLCG Management Board approved the group's plan for the support of IPv6-only CPU, together with the linked requirement for the deployment of production Tier 1 dual-stack storage and other services. This talk will remind HEPiX of the requirements for the support of IPv6 and of the deployment timelines of the plan. The current status, including dual-stack storage deployment at the WLCG Tier 1s, will be presented, as will the strategy for deployment at Tier 2s.

        Speaker: Andrea Sciaba (CERN)
      • 16:40
        KEK Computer security update 25m

        We present an update on KEK computer security since HEPiX Spring 2016. Over the past year, several security incidents occurred at KEK and at other Japanese academic sites. Consequently, we have been forced to change our computer security strategy.
        In this presentation, we also report our experiences, practices, and future plans for KEK computer security.

        Speaker: Dr Tadashi Murakami (KEK)
      • 17:05
        Building and operating a large scale Security Operations Center 25m

        The HEP community is facing an ever-increasing wave of computer security threats, with more and more of the recent attacks showing a very high level of complexity. Having a Security Operations Center (SOC) in place is paramount for the early detection and remediation of such threats. Key components and recommendations for building an appropriate monitoring and detection Security Operations Center will be presented, as well as means to obtain and share relevant and accurate threat intelligence information. Various lessons learnt from building and operating the CERN SOC will be presented. This presentation also gives an update on the work performed in the WLCG Security Operations Center Working Group, which aims to provide a scalable reference design applicable to a range of HEP sites.

        Speaker: Liviu Valsan (CERN)
    • 08:30 09:00
      Registration 30m
    • 09:00 10:15
      Storage, file systems and data management
      • 09:00
        CERN IT-Storage Strategy Outlook 25m

        The IT-Storage group at CERN is responsible for the operation and development of the infrastructure that accommodates all storage requirements, from the physics data generated by LHC and non-LHC experiments to users' personal files.

        This presentation will give an overview of the solutions operated by the group and of current and future developments, highlighting the group's strategy to tackle future needs.

        Speakers: Luca Mascetti (CERN), Julien Leduc (CERN)
      • 09:25
        EOS and CERNBox Update 25m

        EOS, the high-performance CERN IT distributed storage system for High-Energy Physics, now provides more than 160 PB of disk and supports several workflows, from data taking and reconstruction to physics analysis. With the next storage delivery the system will grow beyond the 250 PB mark. EOS also provides "sync and share" capabilities to more than 9,000 users for administrative, scientific and engineering use cases thanks to its integration with CERNBox.
        The talk will give an update on the current status of both services, highlight their future evolution and, in particular, present the challenges ahead with the phase-out of AFS.

        Speaker: Luca Mascetti (CERN)
      • 09:50
        BNL Box 25m

        Network-attached online storage, a.k.a. cloud storage, is a very popular form of storage service provided by many commercial vendors, including Dropbox, Box, Google Drive, Microsoft OneDrive and Amazon Cloud Drive. All have similar capabilities, providing users with quota space and custom applications to transfer data between local sites and cloud storage. In addition, all have well-designed web GUIs that allow users to manage their data online. Although the HEP and nuclear physics communities with massive production data services have been slow to adopt such services, some users have been using them as a convenient medium to store selected data and documents and share them with their collaborators. BNL Box is an effort to provide local users with such online capabilities, with additional functions specific to large-scale data science such as tape archiving, restoration from tape archive, high-speed uploading/downloading and data streaming. This presentation will describe the current status of the BNL Box service.

        Speaker: Dr Hironori Ito (Brookhaven National Laboratory)
    • 10:15 10:45
      Coffee break 30m
    • 10:15 10:45
      Group Photo
    • 10:45 12:25
      Storage, file systems and data management
      • 10:45
        Federated data storage system prototype for LHC experiments and data intensive science 25m

        The rapid increase of the data volume from the experiments running at the Large Hadron Collider (LHC) has prompted the physics computing community to evaluate new data handling and processing solutions. Russian grid sites and university clusters scattered over a large area aim to unite their resources for future productive work, at the same time giving an opportunity to support large physics collaborations. In our talk we will cover the deployment and testing of a federated data storage prototype for WLCG centers of different levels and university clusters within one Russian National Cloud. The prototype is based on computing resources located in Moscow, Dubna, Saint Petersburg, Gatchina and Geneva. This project intends to implement a federated distributed storage for scientific applications with access from Grid centers, university clusters, supercomputers, academic and commercial clouds. The efficiency and performance of the system are demonstrated using synthetic and experiment-specific tests, including real data processing and analysis workflows from the ATLAS and ALICE experiments. We will present the topology and architecture of the designed system and show how it can be realised using different software solutions such as EOS and dCache. We will also describe how sharing data on a widely distributed storage system can lead to a new computing model and a rethinking of the classic computing style.

        Speaker: Andrey Kirianov (Petersburg Nuclear Physics Institut (RU))
      • 11:10
        RAL Tier-1 Evolution as a Global CernVM-FS Service Provider 25m

        The CernVM File System (CernVM-FS) was developed to assist WLCG High Energy Physics (HEP) collaborations in deploying software on the worldwide distributed computing infrastructure used to run data processing applications. The technology is now firmly established as the primary method for distributing WLCG experiment software, and its use by other HEP and non-HEP communities has increased significantly in recent years.

        This presentation will give an overview of the CernVM-FS infrastructure deployed at the RAL Tier-1, both as part of the WLCG Stratum-1 network and as the facility provided to set up a complete service - the Release Manager Machine, the Replica Server and a customized uploading mechanism - for the non-LHC communities within EGI.

        The latest developments to widen and consolidate the CernVM-FS infrastructure as a global facility (with main contributors in Europe, North America and Asia) are reviewed, such as the mechanism implemented to publish external repositories hosted by emerging regional infrastructures (e.g. the South Africa Grid). The presentation will also describe progress on implementing the new protected CernVM-FS repositories, a requirement for academic communities wishing to use CernVM-FS technology.

        Speaker: Catalin Condurache (STFC - Rutherford Appleton Lab. (GB))
      • 11:35
        An update to Ceph at RAL 25m

        The new Ceph based storage system, Echo, is now accepting production data from LHC VOs. This talk gives an update on the work done while reaching this milestone. It will also cover other non-Echo Ceph related work at RAL.

        Speaker: Tom Byrne (STFC)
      • 12:00
        Data-NG: A distributed Ceph infrastructure 25m

        Since 2012, eight physics labs in the Orsay/Saclay area have worked together to provide an efficient and resilient scientific computing infrastructure. After building shared hosting facilities, this year the IT teams of the eight labs submitted a project to build a distributed data infrastructure based on Ceph technology, which was funded at the end of 2016. The objective is to deploy, on three sites connected by a 100G network, 1 PB of usable storage for different use cases ranging from cloud to shared file systems. We will present this project, which involves ten people from the different labs, and the design choices made.

        Speaker: Guillaume PHILIPPON (CNRS - LAL)
    • 12:25 12:40
      Site reports
      • 12:25
        PIC Report 15m

        This is the PIC report to HEPiX Spring 2017.

        Speaker: Dr Jose Flix Molina (Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas)
    • 12:40 13:05
      Computing and batch systems
      • 12:40
        CosmoHub on Hadoop: a web portal to analyze and distribute massive cosmological data 25m

        We present CosmoHub, a web platform to perform interactive analysis of massive cosmological data without any SQL knowledge. CosmoHub is built on top of Apache Hive, a component of the Apache Hadoop ecosystem that facilitates reading, writing and managing large datasets.

        CosmoHub is hosted at the Port d'Informació Científica (PIC) and currently provides support to several international cosmology projects such as the ESA Euclid space mission, the Dark Energy Survey (DES), the Physics of the Accelerated Universe (PAU) and the MareNostrum Institut de Ciències de l'Espai Simulations (MICE). More than two billion objects, from public and private as well as observed and simulated data, are available across all projects. In the last three and a half years more than 400 users have produced about 1500 custom catalogs occupying 2 TB in compressed format.

        CosmoHub allows users to access value-added data, to load and explore pre-built datasets and to create their own custom datasets through a guided process. All these datasets can be interactively explored using an integrated visualization tool which includes 1D histogram and 2D heatmap plots. In our current implementation, online analysis of datasets of a billion objects can be done in less than 25 seconds. Finally, all these datasets can be downloaded in three different formats: CSV.BZ2, FITS and ASDF.

        The components, integration and performance of the system will be reviewed in this contribution.

        Speaker: Jordi Casals Hernandez (University of Barcelona (ES))
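
        For readers unfamiliar with Hive, the kind of interactive query that CosmoHub wraps in its web interface can also be issued directly from Python; the sketch below uses the PyHive client with made-up host, database, table and column names purely for illustration.

        # Sketch of a Hive query similar to what a CosmoHub 1D histogram request would run.
        # Host, database, table and column names are invented for this example.
        from pyhive import hive

        conn = hive.connect(host='hive.example.org', port=10000, database='cosmology')
        cursor = conn.cursor()

        # Bucket a magnitude column into 0.1-wide bins and count objects per bin.
        cursor.execute("""
            SELECT floor(mag_i / 0.1) * 0.1 AS bin, count(*) AS n
            FROM mock_catalog
            WHERE mag_i BETWEEN 18 AND 24
            GROUP BY floor(mag_i / 0.1) * 0.1
            ORDER BY bin
        """)

        for bin_edge, n in cursor.fetchall():
            print(bin_edge, n)
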
    • 13:05 14:30
      Lunch break 1h 25m
    • 14:30 15:45
      Computing and batch systems
      • 14:30
        HammerCloud extension for Data Centre commissioning 25m

        HammerCloud is a testing service to run continuous tests or on-demand large-scale stress tests of the WLCG resources with real-life experiment jobs.

        HammerCloud is used by the ATLAS and CMS experiments in production. It has been a useful service to commission both compute resources and various components of the complex distributed systems of the LHC experiments, as well as an integral part of the monitoring suite that is essential for the computing operations of the experiments and their automation.

        This contribution will review recent developments of the HammerCloud service that allow the HammerCloud infrastructure to be used to test Data Centre resources in the early phases of the infrastructure and services commissioning process. One of the benefits we believe HammerCloud can provide is the ability to tune the commissioning of new infrastructure through functional and stress testing, as well as benchmarking with "standard candle" jobs based on realistic experiment workloads that can be heavy on CPU, I/O, IOPS, or all of these together.

        Speaker: Jaroslava Schovancova (CERN)
      • 14:55
        Experiences With Intel Knights Landing, OmniPath and Slurm 25m

        Brookhaven Lab recently acquired an Intel Knights Landing (KNL) cluster consisting of 144 nodes connected with a dual-rail OmniPath (OPA) fabric. We will detail our experiences integrating this cluster into our environment, testing its performance and debugging issues relating to the fabric and hardware. Details about the integration with the batch system (Slurm) and performance issues found with different kernels will be discussed, as well as some results from scientific users of the system.

        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
      • 15:20
        JLab's SciPhi-XVI Knights Landing Cluster Update 25m

        An update to JLab's Fall 2016 SciPhi-XVI KNL talk, covering the addition of 64 nodes to our Knights Landing cluster, which reached #397 on the Top500 list at 429.5 TFlops and #10 on the Green500 list at 3836.6 MFLOPS/W. It will include an overview of our cluster configuration updates, the Omni-Path fabric, benchmarking, integration with Lustre and NFS over InfiniBand, as well as current open issues.

        Speaker: Sandy Philpott
    • 15:45 16:15
      Coffee break 30m
    • 16:15 17:55
      Computing and batch systems
      • 16:15
        Updates from HEPiX Benchmarking Working Group 25m

        An update of the activity of the HEPiX Benchmarking Working Group will be reported.

        Speaker: Domenico Giordano (CERN)
      • 16:40
        The scheduling strategy and experience of IHEP HTCondor Cluster 25m

        The IHEP cluster, with more than 10,000 job slots, was migrated from PBS to HTCondor by the end of 2016. The report describes the shared-pool scheduling policy deployed on the IHEP cluster to improve resource utilization, as well as our experience with HTCondor management.

        Speaker: Jingyan Shi (IHEP)
      • 17:05
        The search for new traceability and isolation approaches 25m

        The multi-user pilot job (MUPJ) model has become deeply embedded in the LHC computing ecosystem. In this model, the pilot job sent to a site's batch system dynamically pulls down one or more user payload jobs as it runs at the site.

        While providing the experiments with utmost flexibility, the MUPJ presents challenges in isolation (preventing payloads from interacting with the pilot) and traceability (helping the site understand who is utilizing their resources). The traditional approach to provide traceability and isolation has been to utilize glexec.

        In this talk, we will provide an overview of the search for new technologies to fulfill the traceability and isolation requirements. We will focus on work within the OSG to utilize Singularity for isolation and the HTCondor-CE for traceability.

        Speaker: Brian Paul Bockelman (University of Nebraska-Lincoln (US))
      • 17:30
        Understanding the performance of benchmark applications 25m

        The HEPiX Benchmarking Working group has been investigating fast benchmark applications with the objective of identifying candidates that can run quickly enough to avoid wasting compute resources, but still capable of accurately representing HEP workloads. Understanding how the underlying processor microarchitecture affects the results of these benchmarks is important to foresee scenarios where the scaling of HEP workloads and benchmark applications might diverge. In order to characterise the behaviour of these benchmarks, in-depth profiling activities have been carried out. In this talk we present an example of performance analysis, covering several layers of the stack, starting from Python source code down to the number of cycles required to execute assembly instructions.

        Speaker: Luca Atzori (CERN)
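
        As an illustration of the very top of that stack, the Python layer of a benchmark can be profiled with nothing more than the standard library; the function below is only a stand-in for a real benchmark kernel, and cycle-level detail would come from lower-level tools such as perf.

        # Profile the Python layer of a toy workload with the standard library.
        import cProfile
        import math
        import pstats

        def toy_kernel(n=200000):
            # Stand-in for a real benchmark kernel.
            acc = 0.0
            for i in range(1, n):
                acc += math.sin(i) * math.cos(i) / math.sqrt(i)
            return acc

        profiler = cProfile.Profile()
        profiler.enable()
        toy_kernel()
        profiler.disable()

        # Show the ten most expensive functions by cumulative time.
        pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)
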
    • 18:30 22:00
      Workshop banquet 3h 30m Gróf Széchenyi River Ship

      Gróf Széchenyi River Ship

    • 08:30 09:00
      Registration 30m
    • 09:00 10:40
      Grids, clouds, virtualisation
      • 09:00
        The Computing Resource Information Catalog 25m

        This contribution describes the CRIC (Computing Resource Information Catalog) framework, which is designed to describe the topology of the experiments' computing models, providing a unified description of the resources and services used by the experiments' applications.

        The Worldwide LHC Computing Grid infrastructure links about 200 participating computing centers affiliated with several partner projects. It is built by integrating heterogeneous computing and storage resources in diverse data centers all over the world and provides CPU and storage capacity to the LHC experiments to perform data processing and physics analysis. In order to be used by the LHC experiments, these distributed resources should be properly described, which implies easy service discovery, a detailed description of service configuration, and a description of how the experiments will use these services. Currently this information is not complete, and what is available is scattered over multiple generic information sources such as GOCDB, OIM, BDII and experiment-specific information systems. Such a model does not make it easy to validate topology and configuration information, nor to recognise that large parts of the infrastructure are actually shared between the various experiments, leading to severe operational problems when the resources are to be fully exploited. Moreover, the information in the various sources is not always consistent. Finally, the evolution of computing technologies introduces new challenges. Experiments rely more and more on opportunistic resources, which by their nature are more dynamic and not integrated in any existing WLCG framework: these resources also need to be described in CRIC to allow the experiments to exploit them effectively.

        CRIC integrates configuration and status information about resources, services and topology of the whole computing infrastructure used by LHC experiments.
        CRIC collects information from various information providers, performs validation and provides a consistent set of UIs and APIs to the LHC VOs for service discovery and usage configuration. The main requirements for CRIC are simplicity, agility and robustness. CRIC should be able to be quickly adapted to new types of computing resources, new information sources, and allow for new data structures to be implemented easily following the evolution of the computing models and operations of the experiments.

        The implementation of CRIC was inspired by the successful experience with the ATLAS Grid Information System (AGIS). The first prototype was put in place in a short time thanks to the fact that a substantial part of the AGIS code could be re-used, though some refactoring was required in order to cleanly decouple it into two parts:
        • A core which describes all physical service endpoints and provides a single entry point for experiments' service discovery.
        • Optional experiment-specific extensions, implemented as plugins. They describe how the physical resources are used by the experiments and contain additional attributes and configuration which are required by the experiments for operations and the organization of their data and workflows.

        CRIC not only provides a current view of the WLCG infrastructure, but also keeps track of performed changes and audit information. Its administration interface allows authorized users to make changes. Authentication and authorization are subject to experiment policies in terms of data access and update privileges.

        Speaker: Alessandro Di Girolamo (CERN)
      • 09:25
        ElastiCluster - automated deployment and scaling of computing and storage clusters on IaaS cloud infrastructures 25m

        ElastiCluster is a command-line application (and a Python API) to deploy, configure, and resize various types of computational clusters on Infrastructure-as-a-Service clouds. Currently supported is the deployment of SLURM/GridEngine/TORQUE batch clusters, Spark/Hadoop systems (with Hive and HDFS), and various types of distributed filesystems (GlusterFS, OrangeFS, Ceph) on OpenStack, Google Compute Engine, and Amazon EC2. ElastiCluster leverages Ansible for maximum extensibility and customizability.

        This talk will demo ElastiCluster features and give a short overview of how it is used in production at the University of Zurich and elsewhere.

        Speaker: Riccardo Murri
      • 09:50
        CERN Cloud service update: Containers, migrations, upgrades, etc. 25m

        We'll give an update on the status of our cloud, focusing on recently added features, with special attention to containers.

        Speaker: Luis Pigueiras (CERN)
      • 10:15
        Container Orchestration - Simplifying Use of Public Clouds 25m

        One obstacle to effective and efficient exploitation of public cloud resources is the work required to accommodate their different APIs. Observing that many public clouds offer varying degrees of support for container orchestration using Kubernetes, we present the results of practical experiments involving several large public cloud providers.
        We also present a brief update on container orchestration within STFC's Scientific Computing Department.

        Speaker: Ian Collier (STFC - Rutherford Appleton Lab. (GB))
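
        A hedged sketch of the kind of cross-provider check such experiments involve: with the official Kubernetes Python client, the same code can be pointed at clusters on different public clouds simply by switching kubeconfig contexts (the context names below are invented).

        # Count Ready nodes in Kubernetes clusters hosted on several providers.
        # The kubeconfig context names are invented placeholders.
        from kubernetes import client, config

        for context in ['gke-test-cluster', 'aks-test-cluster', 'eks-test-cluster']:
            config.load_kube_config(context=context)  # pick the cluster from ~/.kube/config
            v1 = client.CoreV1Api()
            nodes = v1.list_node().items
            ready = sum(
                1
                for node in nodes
                for cond in node.status.conditions
                if cond.type == 'Ready' and cond.status == 'True'
            )
            print('%s: %d/%d nodes Ready' % (context, ready, len(nodes)))
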
    • 10:40 11:15
      Coffee break 35m
    • 11:15 12:30
      Grids, clouds, virtualisation
      • 11:15
        System testing service developments using Docker and Kubernetes: EOS + CTA use case 25m

        The IT Storage group at CERN develops the software responsible for archiving to tape the custodial copy of the physics data generated by the LHC experiments. This software is code named CTA (the CERN Tape Archive).
        It needs to be seamlessly integrated with EOS, which has become the de facto disk storage system provided by the IT Storage group for physics data.

        CTA and EOS integration requires the parallel development of features in both software products that need to be synchronized and systematically tested on a specific distributed development infrastructure for each commit in the code base.

        This presentation describes the full continuous integration workflow that deploys and orchestrates all the needed services in Docker containers on our dedicated Kubernetes infrastructure.

        Speaker: Julien Leduc (CERN)
      • 11:40
        Distributed computing in IHEP 25m

        IHEP distributed computing was built on DIRAC in 2012 and started operations in 2014 to meet the peak needs of the BESIII experiment. As more new experiments (JUNO, LHAASO, CEPC, etc.) with challenging data volumes come into operation or are planned at IHEP, the system has been progressively developed into a common platform supporting multiple experiments in one instance. In this platform, the DIRAC-based WMS was extended to support multi-VO scheduling based on VOMS, with general task submission, management and site monitoring tools developed. The CVMFS service and the central SE were also reconfigured to support more than one experiment.
        To make full use of opportunistic resources, the platform has also integrated private clouds and explored commercial clouds in an elastic way. With the trend towards parallel processing in High Energy Physics experiment software, IHEP distributed computing has started to support multi-core jobs. With more and more HPC resources becoming available to speed up physics analysis, it will also be interesting in the near future to share heterogeneous and distributed HPC resources through this platform.

        Speaker: Xiaomei Zhang (Chinese Academy of Sciences (CN))
      • 12:05
        Understanding performance: optimisation activities in WLCG 25m

        The LHC Run4 phase, also known as the HL-LHC, is scheduled to start in mid-2026 and will impose formidable challenges on the capability to process and store data at the planned data acquisition rates. A tenfold increase in recorded event rates for ATLAS and CMS and a threefold increase in event pile-up will require an amount of computational power and storage far in excess of what can be estimated to be available, taking into account flat budgets and current technological trends. WLCG has started to work on closing this gap by initiating a series of activities which aim at understanding and reducing inefficiencies at both the software level and the computing workflow level. In addition, work has started to exploit previously unavailable computing resources. This contribution describes the achievements and the ongoing work of the team dedicated to these activities, which are conducted in close collaboration with the LHC experiments.

        Speakers: Andrey Kiryanov (Petersburg Nuclear Physics Institute, National Research Center "Kurchatov Institute"), Andrea Sciaba (CERN)
    • 12:30 12:55
      Storage, file systems and data management
      • 12:30
        Advances in storage technologies 25m

        This is a whistle-stop tour of some of the new approaches and technologies that enable companies to derive insight from their data, both today and into the future. It compares the progress of SSD and HDD, and maps out how HDD can stay on the areal density curve for the foreseeable future, up to 100 TB per device.

        Technologies touched on will be dual-actuator drives, helium, two-dimensional and multi-dimensional magnetic recording, HAMR (Heat-Assisted Magnetic Recording) and Bit-Patterned Media.

        Speaker: Mr Joe Fagan (Seagate)
    • 12:55 14:30
      Lunch break 1h 35m
    • 14:30 15:45
      Basic IT services
      • 14:30
        Centralising Elasticsearch 25m

        In January 2016 CERN launched a new project with the aim to provide a centralised Elasticsearch service. This presentation will summarise the status of the project,
        challenges, experiences from the pre-production phase, and methods applied to configure access control.

        Speaker: Ulrich Schwickerath (CERN)
      • 14:55
        The evolution of monitoring system: the INFN-CNAF case study 25m

        Over the past two years, the operations at CNAF, the ICT center of the Italian Institute for Nuclear Physics, have undergone significant changes. The adoption of configuration management tools, such as Puppet, and the constant increase of dynamic and cloud infrastructures have led us to investigate a new monitoring approach. The present work deals with the centralization of the monitoring service at CNAF through a scalable and highly configurable monitoring infrastructure.
        The selection of tools has been made taking into account the following requirements given by users: (I) adaptability to dynamic infrastructures, (II) ease of configuration and maintenance with the ability to provide more flexibility, (III) compatibility with the existing monitoring system, and (IV) re-usability and ease of access to information and data. The CNAF monitoring infrastructure and its related components are described: Sensu as the monitoring router, InfluxDB as the time series database storing the data gathered from sensors, Uchiwa as the monitoring dashboard and Grafana as the tool to create dashboards and visualize time series metrics.

        Speaker: Mr Stefano Bovina (INFN)
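
        To make the data flow concrete, here is a minimal sketch of writing a metric into InfluxDB with the influxdb Python client, much as a custom sensor or handler might do; the host, database, measurement and tag names are invented.

        # Write a single time-series point into InfluxDB (all names are placeholders).
        from influxdb import InfluxDBClient

        client = InfluxDBClient(host='influxdb.example.org', port=8086, database='monitoring')

        point = {
            'measurement': 'cpu_load',
            'tags': {'host': 'wn-001', 'cluster': 'farm'},
            'fields': {'load1': 0.42, 'load5': 0.37},
        }

        # write_points takes a list of points; Grafana can then plot the 'cpu_load' measurement.
        client.write_points([point])
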
      • 15:20
        Unified Monitoring Architecture for CERN IT and Grid Services 25m

        For over a decade, the CERN IT Data Centres have been using a centralized monitoring infrastructure collecting data from hardware, services and applications via in-house sensors, metrics and notifications. Meanwhile also the LHC experiments were relying on dedicated WLCG Dashboards visualizing and reporting the status and progress of the job execution, data transfers and sites availability across the WLCG grid resources.

        At the beginning of 2016 it was decided to merge services, resources and technologies of the two monitoring activities and move from in-house dedicated development toward open sources systems. This merge resulted in the definition and the development of a Unified Monitoring Architecture to collect, transport, store, search and visualize both IT Data Centres and WLCG Dashboard monitoring data. The newly developed architecture relies on state-of-the-art open source technologies and on open data formats, and provides solutions for easily collecting, processing and visualizing new monitoring data.

        This contribution provides an overview of the Unified Monitoring Architecture, currently based on technologies such as collectd, Elasticsearch, Spark and Hadoop, with details on the lessons learned and on the ongoing work to monitor the CERN IT Data Centres as well as the WLCG jobs, data transfers, sites and services. Given the move to established open source technologies, it should also become easier to share experience and common solutions within the HEPiX community.

        Speaker: Jaroslava Schovancova (CERN)
    • 15:45 16:15
      Coffee break 30m
    • 16:15 17:30
      Basic IT services
      • 16:15
        Data Collection and Monitoring update 25m

        We will touch on our ongoing data collection project and its progression to the next phase, monitoring. We will talk about a couple of the monitoring paths taken, the ones that look promising as well as the ones that failed.

        Building upon last year, I'll discuss how to create a small data collection and monitoring setup. Instructions will be placed on the HEPiX twiki.

        Speaker: Cary Whitney (LBNL)
      • 16:40
        Flexible, scalable and secure logging using syslog-ng 25m

        Event logging is a central source of information for IT.
        The syslog-ng application collects logs from many different
        sources, performs real-time log analysis by processing and filtering them,
        and finally it stores the logs or routes them for further analysis.

        In an ideal world, all log messages come in a structured format, ready to
        be used for log analysis, alerting or dashboards. But in a real world only
        part of the logs belong to this category. Traditionally, most of the log
        messages come as free format text messages. These are easy to be read by
        humans, which was the original use of log messages. However, today logs are
        rarely processed by the human eye. Fortunately syslog-ng has several tools
        to turn unstructured and many of the structured message formats into
        name-value pairs, and thus delivers the benefits of structured log messages.

        Once you have name-value pairs, log messages can be further enriched with
        additional information in real-time, which helps responding to security incidents
        in due time. One way is adding geo-location based on IP addresses.
        Another way is adding contextual data from external files, like the role of
        a server based on the IP address or the role of the user based on the name.
        Data from external files can also be used to filter messages, for example
        to check firewall logs to determine whether certain IP addresses are
        contained in various black lists for malware command centers, spammers, and
        so on.

        Logging is subject to an increasing number of compliance regulations.
        PCI-DSS or many European privacy laws require removing sensitive data from
        log messages. I will demonstrate how logs can be anonymized in a way that
        they are still useful for security analytics.

        At the end I would like to introduce you to the basics of syslog-ng
        configuration, and demonstrate how the collected logs can be used for
        alerting or visualized on a dashboard.

        Speaker: Mr Péter Czanik (Balabit)
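
        As a small complement to the talk, the sketch below shows how an application can hand syslog-ng structured data by logging JSON to the local syslog socket, which syslog-ng can then turn into name-value pairs (for example with its JSON parser); the socket path and field names are illustrative only.

        # Emit a JSON-formatted message to the local syslog socket, from where
        # syslog-ng can pick it up and parse it into name-value pairs.
        import json
        import logging
        import logging.handlers

        logger = logging.getLogger('myapp')
        logger.setLevel(logging.INFO)
        logger.addHandler(logging.handlers.SysLogHandler(address='/dev/log'))

        event = {'user': 'alice', 'action': 'login', 'src_ip': '192.0.2.10', 'result': 'success'}
        logger.info('myapp: %s', json.dumps(event))
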
      • 17:05
        Typical syslog-ng use-cases at our Tier-1 25m

        We present the log infrastructure at CCIN2P3 and illustrate how syslog-ng plays a central part in it.
        Following up on Balabit's talk on syslog-ng's features, we present several use-cases which are likely to be of interest to the HEPiX community.
        For instance, we present real-life examples on how to parse and correlate operating system and batch scheduler events.
        We present its integration with common alerting backends like Nagios, as well as modern indexing solutions like Elasticsearch, Kibana and Riemann.
        Moreover, in order to emphasize the software's high order of flexibility and upgradability, we provide some feedback from our interaction with the core developers.
        We finally present our past and present code contributions to the syslog-ng codebase, and our plans for the logging infrastructure's future.

        Speaker: Fabien Wernli (CCIN2P3)
    • 17:45 19:15
      HEPiX Board Hall of Paintings

      Hall of Paintings

      Hungarian Academy of Sciences

    • 08:30 09:00
      Registration 30m
    • 09:00 10:40
      Facilities and business continuity
      • 09:00
        Wigner Datacenter cooling system upgrade 25m

        A report on the development of the regulation system of the chillers: the current status, the target for this phase, and possible future plans.

        Speaker: Mr Gábor Szentiványi (Wigner Datacenter)
      • 09:25
        CERN Computing Facilities' Update 25m

        This talk will give the current status of two ongoing Data Centre projects and report on two recent incidents.

        Speaker: Mr Wayne Salter (CERN)
      • 09:50
        P2IO/LAL Datacenter Extension 25m

        P2IO, a group of laboratories of which LAL is a member, built the first phase of a shared datacenter a few years ago; it has been in production since October 2013. This datacenter was designed to achieve good energy efficiency in the context of scientific computing. An extension of this datacenter is in progress to increase its capacity from 30 to 50 racks. This talk will present the lessons learned from the first years of operations and the main choices made for the extension.

        Speaker: Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 10:15
        Manage your hardware failures in an (almost) automated workflow 25m

        Hardware maintenance can be time consuming, depending on your processes and those of your retailers. The goal of this talk is to depict how the end-to-end chain of handling hardware failures, from the event to the closing of the case, has been mostly automated in our machine rooms. It covers diagnostics, incident tracking, parts dispatching, statistics, processes, tools, bits of SOAP code and people management.

        In a word: save your sysadmins' time.

        Speaker: Mattieu Puel
    • 10:40 11:10
      Coffee break 30m
    • 11:10 11:30
      Workshop wrap-up