HEPiX Spring 2013 Workshop

Europe/Rome
CNAF Bologna (Italy)

Andrea Chierici (INFN-CNAF), Helge Meinhard (CERN), Sandy Philpott (JLAB)
Description

HEPiX meetings bring together IT system support engineers from the High Energy Physics (HEP) laboratories, institutes, and universities, such as BNL, CERN, DESY, FNAL, IN2P3, INFN, JLAB, NIKHEF, RAL, SLAC, TRIUMF and many others.

Meetings have been held regularly since 1991, and are an excellent source of information for IT specialists in scientific high-performance and data-intensive computing disciplines. We welcome participation from related scientific domains for the cross-fertilization of ideas.

The hepix.org website provides links to information from previous meetings.

    • 09:00 09:30
      Miscellaneous
      Convener: Dr Helge Meinhard (CERN)
      • 09:00
        Welcome address 20m
        Speaker: Mauro Morandin (INFN)
        Slides
      • 09:20
        Workshop logistics 10m
        Speaker: Andrea Chierici (INFN-CNAF)
        Slides
    • 09:30 10:30
      Site reports
      Convener: Michele Michelotto (Universita e INFN (IT))
      • 09:30
        INFN-Tier1 Status report 15m
        We will give a status update on the Italian Tier1 center, hosted at CNAF.
        Speaker: Dr Luca Dell'Agnello (INFN)
        Slides
      • 09:45
        GridKa Site Report 15m
        Brief discussion of current status at GridKa, e.g. the completed LRMS migration.
        Speaker: Manfred Alef (Karlsruhe Institute of Technology (KIT))
        Slides
      • 10:00
        RAL Site Report 15m
        Update on status and plans for the RAL Tier1 and other activities at RAL.
        Speaker: Mr Martin Bly (STFC/RAL)
        Slides
      • 10:15
        CC-IN2P3 Site Report 15m
        News and status of the CC-IN2P3 since the last Site Report.
        Speaker: Sebastien Gadrat (CC-IN2P3 - Centre de Calcul (FR))
        Slides
    • 10:30 11:00
      Coffee break 30m
    • 11:00 12:30
      Site reports
      Convener: Michele Michelotto (Universita e INFN (IT))
      • 11:00
        NDGF site report 15m
        Update on services, storage, computing, organisation, and other things at NDGF.
        Speaker: Erik Mattias Wadenstein (University of Umeå (SE))
        Slides
      • 11:15
        CERN Site report 15m
        News from CERN since the last HEPiX workshop
        Speaker: Arne Wiebalck (CERN)
        Slides
      • 11:30
        Fermilab Site Report 15m
        Spring 2013 HEPiX Fermilab site report.
        Speaker: Dr Keith Chadwick (Fermilab)
        Slides
      • 11:45
        BNL RHIC/ATLAS Computing Facility Site Report 15m
        Presentation of recent developments at Brookhaven National Laboratory's (BNL) RHIC/ATLAS Computing Facility (RACF).
        Speaker: Christopher Hollowell (Brookhaven National Laboratory)
        Slides
      • 12:00
        Beijing-LCG2 Site Report 15m
        The presentation describes the current status and latest news at IHEP (BEIJING-LCG2 site), e.g. the hardware status, storage systems, the IPv6 testbed, and the SDN (software-defined networking) plan and its current status.
        Speaker: Mr Qi Fazhi (IHEP)
        Slides
      • 12:15
        IRFU site report 15m
        What is new in IRFU Saclay since the last site report?
        Speaker: Pierrick Micout (Unknown)
        Slides
    • 12:30 14:00
      Lunch break 1h 30m
    • 14:00 15:30
      IT Infrastructure
      Conveners: Dr Helge Meinhard (CERN), Iwona Sakrejda, Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 14:00
        Wigner Data Centre 25m
        The CERN remote data centre in Budapest is now in operation: the first equipment deliveries have taken place and the network infrastructure, including the 1x100Gbps links, is operational. By the time of this talk the first systems should also be in operation. This talk will therefore summarise the project since the last talk at the HEPiX Spring meeting of 2012 and the experience gained so far.
        Speaker: Wayne Salter (CERN)
        Slides
        Video
      • 14:25
        An energy efficient datacenter in Orsay 25m
        Fundamental research labs in the southern part of Paris, known collectively as P2IO, have decided to build a common computing facility, shared between the labs and designed to be energy efficient. The first step is the creation of a new data centre. The project, started one year ago, has entered its construction phase and is expected to be available next fall. This presentation will detail the requirements and technical design and introduce the current ideas for the future computing platform.
        Speaker: Michel Jouvin (Universite de Paris-Sud 11 (FR))
        Slides
      • 14:50
        Review of the recent power incidents at RAL Tier1 25m
        In November 2012 the RAL Tier1 suffered two serious power failure incidents within two weeks. The first was a whole-site failure, exacerbated by the loss of the UPS power supply; the second was a power surge, caused during work on the UPS supply feed, which damaged a significant amount of equipment. In both cases the Tier1 facility and other Scientific Computing facilities were off-line for extended periods of time. This presentation runs through the incidents and the lessons learned while restoring the services.
        Speaker: Mr Martin Bly (STFC/RAL)
        Slides
      • 15:15
        Energy Optimisation 15m
        This talk will address energy-saving aspects of data centre operations and raise the question of whether HEPiX wishes to have a common effort on this topic, e.g. by setting up a working group or having a dedicated track at a future workshop.
        Speaker: Wayne Salter (CERN)
        Slides
    • 15:30 16:00
      Coffee break 30m
    • 16:00 17:15
      IT Infrastructure
      Conveners: Dr Helge Meinhard (CERN), Iwona Sakrejda, Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 16:00
        Version control service integration with Issue tracking service at CERN 25m
        The current efforts on the version control and issue tracking services at CERN will be presented, with special attention to the new central Git service, the integration between issue tracking and version control, and future service deployments.
        Speaker: Alvaro Gonzalez Alvarez (CERN)
        Slides
      • 16:25
        Define the requirements for the deployment of a CMDB system at IN2P3-CC 25m
        it will be specified
        Speaker: Dr Emmanouil Vamvakopoulos (CC-IN2P3 - Centre de Calcul (FR))
        Slides
        summary
      • 16:50
        CMDBuild: configuring a custom database for asset information to support Service Management 25m
        CMDBuild is an application designed to manage the configuration database (CMDB) of the objects and services used by the IT department of an organization, in compliance with ITIL best practices. During this talk we will give a brief presentation of the project and its most successful use cases. We will explain how the application can be used effectively and illustrate with examples the main mechanisms of CMDBuild: modelling an object-relational database and navigating through its data; configuring workflows (processes such as Service Desk, Service Catalogue and their correlation with technical services and CIs, Change Management and CI handling, etc.); using reports and dashboards; interoperating with other systems (automatic computer inventory, monitoring systems, LDAP, virtualization management, ...); and integrating with geographic information systems (GIS). We will then describe the new features of the upcoming version: views, permissions on rows and columns, and locking of data sheets. Finally we will discuss the bidirectional integration with TOGAF (enterprise architecture) modelling tools, which will be presented at a workshop in Florence on May 28.
        Speaker: Lisa Pedrazzi (T)
        Slides
    • 17:15 18:00
      CMDBuild BoF session 45m
    • 18:00 20:00
      Welcome reception 2h
    • 09:00 10:30
      Site reports
      Convener: Michele Michelotto (Universita e INFN (IT))
      • 09:00
        PDSF@NERSC Status Report 15m
        PDSF is a Linux cluster supporting high energy and nuclear physics workloads at NERSC. The cluster has been in continuous operation since 1998, and this lifespan brings interesting challenges to the procurement/retirement cycle. At this time the cluster covers the needs of the STAR Tier 1 center (mostly simulation and analysis), the ALICE Tier 2 (striving for Tier 1 status) and a Tier 3 for ATLAS. In this presentation we will cover changes since the last report, details of our latest procurements, and the approaches taken to minimize the maintenance burden and conform to center-wide standards in deployment and configuration management (xCAT, CFEngine 3). The anticipated move of the center into a new building in the spring of 2015 will also be discussed.
        Speaker: Iwona Sakrejda
        Slides
      • 09:15
        Improvements to the Manchester Tier2 infrastructure and the UK NGI VOMS service. 15m
        The year 2012 brought many changes and improvements to the Manchester Tier2. Hardware upgrades to new, more energy-efficient machines reduced the number of machines while keeping the available resources consistent. The whole network was upgraded to 10G, from 4x1G bonded connections for the storage elements and from 1G for all other machines; the external connection was already on 10G. The Grid services were upgraded from gLite 3.2 to EMI-2. Prior to the upgrade, all services were running on dedicated hardware. Most of the new servers were turned into KVM hypervisors and the upgraded services were installed as virtual machines. The introduction of virtualization allows the new servers to be used more efficiently and reduces the time and effort needed to set up, test and upgrade services. Like many other sites, Manchester is moving away from CFEngine towards Puppet. Some of the services were moved to Puppet during the upgrade, while others were kept on CFEngine due to the close deadline for the upgrade; the Puppet migration is still in progress and more machines will be moved gradually. The UK NGI VOMS service was hosted in Manchester and was administered by the NGS. There were two distinct sets of servers, the GridPP and the NGS servers. To move to a more unified service provision, the VOs hosted on the NGS service were migrated to the GridPP service and the NGS service was decommissioned. Administration of the GridPP service became the responsibility of the Manchester Tier2 team. We describe the changes to the Manchester Tier2 infrastructure: how we approached them, what challenges we encountered and how we solved them. We also describe the recent changes to the UK NGI VOMS service, detailing the transition process and the changes to the resilience mechanisms.
        Speaker: Mr Robert Frank (The University of Manchester)
        Slides
      • 09:30
        QMUL Site report 15m
        Status and updates from QMUL, a WLCG Tier-2 site.
        Speaker: Christopher John Walker (University of London (GB))
        Slides
      • 09:45
        ASGC site report 15m
        The ASGC IT/computing status report
        Speaker: Mr Felix Lee (ASGC)
        Slides
      • 10:00
        AGLT2 Site Update 15m
        We will present an update on our site since the last report and cover our work with dCache, perfSONAR-PS, VMWare and changes underway to our node provisioning system. In addition we will discuss our new denser storage system from Dell, recent networking changes (including plans for 100G) and describe how we are integrating these into our site. We will conclude with a summary of what has worked and what problems we encountered and indicate directions for future work.
        Speaker: Shawn Mc Kee (University of Michigan (US))
        Slides
      • 10:15
        KISTI-GSDC site report 15m
        Brief description of the status of GSDC computing and storage resources at KISTI.
        Speaker: Gianni Mario Ricciardi (KiSTi Korea Institute of Science & Technology Information (KR))
        Slides
    • 10:30 11:00
      Coffee break 30m
    • 11:00 11:15
      Site reports
      Convener: Michele Michelotto (Universita e INFN (IT))
      • 11:00
        GSI Site report 15m
        News from GSI: new people, status of the Green Cube, a new file server for Lustre, and the status of the GSI TeraLink.
        Speaker: Dr Walter Schoen (GSI)
        Slides
    • 11:15 12:30
      Storage and filesystems
      Conveners: Andrei Maslennikov (Universita e INFN, Roma I (IT)), Mr Peter van der Reest (DESY)
      • 11:15
        Long-Term Data Preservation in High Energy Physics (DPHEP) 25m
        2012 saw the publication of the Blueprint Document from the DPHEP study group (DPHEP-2012-001). A summary of this document was used as input to the Krakow workshop on the future European Strategy for Particle Physics and it is to be expected that Data Preservation will be retained in the updated strategy to be formally adopted by the CERN Council in May 2013. The same year also saw a number of other important events directly related to data preservation and data management in the wider sense. In July, the European Commission published a Recommendation on “access to and preservation of scientific information” and a workshop was held in October to investigate the level of European and International coordination that was desired in this area (with an obvious conclusion). In conjunction, various funding agencies are adding requirements on data preservation and open access plans to future funding calls. In parallel, an activity now named “The Research Data Alliance” (RDA) was established with support from the US, the EU and Australia (other countries in Asia-Pacific are expected to join) “to accelerate and facilitate research data sharing and exchange”, and a working group on data preservation is in the process of being established. There are very clear signs that the output of the RDA will have a direct influence on funding as part of the EU’s Horizon 2020 programme and presumably also elsewhere in the world. Activities related to these events also allowed us to strengthen our collaboration with numerous other disciplines: not only those with whom we have long had ties, such as Astronomy and Astrophysics, but also other scientific communities as well as arts and humanities, all of whom are deeply involved in data preservation activities. Basking in the scientific results of the LHC in 2012, there is a clear halo effect to be exploited. Following the proposal by the CERN Director for Research and Scientific Computing to support the requested position of a DPHEP project manager (for an initial period of 3 years), DPHEP is moving from being a study group to being an active international and interdisciplinary collaboration. An initial set of goals is listed in the DPHEP Blueprint and is currently being actively worked on. This presentation outlines the rapid progress that was made, particularly in the second half of 2012, as well as the exciting opportunities for the future. It is framed in terms of a simple “vision” based on the Open Archival Information System model.
        Speaker: Dr Jamie Shiers (CERN)
        2020 vision report
        DPHEP Blueprint
        Draft Collaboration Agreement
        ICFA statement
        Slides
      • 11:40
        Backup system migration at LAPP: from a local infrastructure to the IN2P3 centralized system 25m
        LAPP is one of the French IN2P3 labs, located in Annecy-le-Vieux. In this presentation we will explain how we moved our tape backup system from a local infrastructure to a centralized one. We had been using Networker (Legato) software for more than ten years with a small LTO2 library, and we switched to a shared infrastructure powered by CC-IN2P3 using Tivoli (IBM) software. This talk will describe the new infrastructure (based on information provided by our CC-IN2P3 colleagues) and then detail the different steps of this migration, including the integration with GPFS. We will point out the difficulties we met, the impacts on our organization and infrastructure, and our future plans. We will also report the CC-IN2P3 point of view as a service provider.
        Speaker: Mrs Murielle Gougerot (LAPP / IN2P3 / CNRS (FR))
        Slides
      • 12:05
        AFS at CERN: Growing with the users' needs 25m
        During the past year, CERN AFS users were offered up to 100 times more space and profited from major advances in reducing access latencies in server overload situations. This presentation will summarise the efforts that made this evolution possible. In addition, we will discuss the situation around IPv6 and OpenAFS and present potential options in this area which may require a common community effort.
        Speaker: Arne Wiebalck (CERN)
        Slides
    • 12:30 14:00
      Lunch break 1h 30m
    • 14:00 14:50
      IT Infrastructure
      Conveners: Dr Helge Meinhard (CERN), Iwona Sakrejda, Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 14:00
        SCCM 2012 SP1 : New features and implementation at CEA/Saclay IRFU 25m
        IRFU uses the Microsoft product SCCM (System Center Configuration Manager) to manage all of its Windows computers (1500 computers). SCCM has handled network installation, inventory, product installation and updates, and security patches since the 2003 version. The two latest releases of this product (SCCM 2012 in April 2012 and SCCM 2012 SP1 in January 2013) give us new possibilities. In particular, SP1 now supports Windows 8 and also includes clients for managing Mac and Linux machines.
        Speaker: Mr Joel Surget (CEA/Saclay DSM/IRFU)
        Slides
      • 14:25
        Exchange Exchange 25m
        DESY has been using Exchange for a long time and is currently still running version 2003 of Microsoft's groupware product. We are investigating the use of the Zimbra Collaboration Suite as a replacement for Exchange. The presentation will point out some interesting features as well as the arguments why DESY has not planned to upgrade Exchange; aspects of the surrounding mail environment will also be shown to complete the picture of DESY's central mail infrastructure.
        Speaker: Mr Dirk Jahnke-Zumbusch (DESY)
        Slides
    • 14:50 15:40
      Computing
      Conveners: Michele Michelotto (Universita e INFN (IT)), Wolfgang Friebel (Deutsches Elektronen-Synchrotron (DE))
      • 14:50
        HS06 on recent processors, virtual machines and commercial cloud 25m
        We measured HS06 on recent AMD and Intel processors and will comment on the results. As a side activity, together with colleagues from the E-Fiscal project, we compared the performance of the Amazon EC2 cloud with corresponding configurations on KVM-based virtual machines.
        Speaker: Michele Michelotto (Universita e INFN (IT))
        Slides
      • 15:15
        Who can beat X86? 25m
        The talk will provide detailed information on one of the most revolutionary platforms for low-power computing of recent years. Saving power while maintaining good time-to-solution performance is an open task of current research on computing. Our solution, based on an ARM Cortex-A9 and an NVIDIA Quadro GPU, can achieve 270 GFlops of single-precision peak performance with a power consumption of 50 W (a simple efficiency figure derived from these numbers is worked out after this entry). The real performance is demonstrated through a series of synthetic benchmarks and real applications. In detail, the first part will be mostly focused on pure hardware measurements, for example: (i) the bandwidth of transferring data across the PCIe bus; (ii) the bandwidth of reading data back from the device; (iii) the bandwidth of memory accesses to various types of device memory; (iv) the maximum achievable floating-point performance using a combination of auto-generated and hand-coded kernels. A second section will show results from algorithm tests such as: FFT (forward and reverse 1D FFT); MD (computation of the Lennard-Jones potential from molecular dynamics); Reduction (reduction operation on an array of single- or double-precision floating-point values); SGEMM (matrix-matrix multiply); Scan (also known as parallel prefix sum, on an array of single- or double-precision floating-point values); Sort (sorting an array of key-value pairs using a radix sort algorithm); and Spmv (sparse matrix-vector multiplication). In the last part, some results obtained with standard applications will demonstrate both the performance and the compatibility of the codes on this new hardware platform.
        Speaker: Mr Piero Altoe' (E4)
        Slides
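        For context, the numbers quoted above imply a simple energy-efficiency figure (my arithmetic on the quoted values, not a claim from the talk): $270\ \mathrm{GFlops} / 50\ \mathrm{W} = 5.4\ \mathrm{GFlops/W}$ (single precision, peak).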
    • 15:40 16:10
      Coffee break 30m
    • 16:10 17:30
      Computing
      Conveners: Michele Michelotto (Universita e INFN (IT)), Wolfgang Friebel (Deutsches Elektronen-Synchrotron (DE))
      • 16:10
        Monitoring and Reporting for Gridengine 25m
        Gridengine does not come with a graphical monitoring tool to watch the activity on the batch farm. Commercial add-ons like Arco and Unisight provide limited monitoring based on the contents of a log file, which however contains only a minimum of information on running jobs. Therefore, batch monitoring software was developed at DESY to fill the gap. The system is in place at DESY for several batch farms (including the National Analysis Farm running Gridengine). It allows users to see the history and status of their active and finished jobs and to look at plots characterizing the batch farm usage over time. Privileged users get access to more details and are able to see job details for users within their groups. This talk will present the features of the monitoring software and discuss possible bottlenecks and plans to enhance the system. (An illustrative accounting-file parsing sketch follows this entry.)
        Speaker: Wolfgang Friebel (Deutsches Elektronen-Synchrotron (DE))
        Slides
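        To illustrate the kind of accounting data such a monitoring tool builds on, here is a minimal sketch (not part of the talk) that aggregates per-user wallclock time from a Grid Engine accounting file; the file path and the field positions are assumptions and should be checked against the accounting(5) man page of the installed version.

        # Minimal sketch: per-user wallclock time from a Grid Engine accounting file.
        # The path and the colon-separated field positions below are assumptions;
        # check the accounting(5) man page of the installed Grid Engine version.
        from collections import defaultdict

        ACCOUNTING_FILE = "/opt/sge/default/common/accounting"  # hypothetical path
        OWNER_FIELD = 3         # job owner (assumed position)
        WALLCLOCK_FIELD = 13    # ru_wallclock in seconds (assumed position)

        def per_user_wallclock(path):
            usage = defaultdict(float)
            with open(path) as f:
                for line in f:
                    if line.startswith("#"):
                        continue                      # skip comment/header lines
                    fields = line.rstrip("\n").split(":")
                    if len(fields) <= WALLCLOCK_FIELD:
                        continue                      # truncated or malformed record
                    try:
                        usage[fields[OWNER_FIELD]] += float(fields[WALLCLOCK_FIELD])
                    except ValueError:
                        continue
            return usage

        if __name__ == "__main__":
            for user, seconds in sorted(per_user_wallclock(ACCOUNTING_FILE).items(),
                                        key=lambda kv: kv[1], reverse=True):
                print("%-12s %10.1f h" % (user, seconds / 3600.0))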
      • 16:35
        Evaluation of a new Grid Engine Monitoring and Reporting Setup 25m
        Splunk is a commercial software platform for collecting, searching, monitoring and analyzing machine data, providing interactive real-time dashboards that integrate multiple charts, reports and tables. We have been working on a Son of Grid Engine setup based on the free Splunk branch, supporting standard reporting and simple job and fairshare debugging with easy chart generation and smart event correlation features. On top of this, we are trying to understand the added value of the enterprise branch of Splunk, which supports integrated user authentication and role-based access controls. There is a plan to share our work in a publicly available Splunk Grid Engine app.
        Speaker: Thomas Finnern (DESY)
        Slides
      • 17:00
        Job packing: optimized configuration for job scheduling 25m
        The default behaviour of batch systems is to dispatch jobs to the least loaded host, thus spreading jobs almost uniformly across all the nodes in the farm. In some circumstances it is however desirable to have certain kinds of jobs running on the smallest possible set of nodes. Use cases and packing variants are discussed, and test results are presented. (A toy packing-versus-spreading sketch follows this entry.)
        Speaker: Dr Stefano Dal Pra (INFN)
        Slides
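        As a toy illustration of the two policies mentioned above (not from the talk; node counts and job sizes are invented), the sketch below contrasts a "spread" policy (dispatch to the least loaded node) with a "pack" policy (dispatch to the most loaded node that still has room):

        # Toy comparison of dispatch policies: "spread" picks the least loaded node,
        # "pack" picks the most loaded node that can still host the job.
        # Node counts and requested slots are invented for illustration.

        def dispatch(jobs, n_nodes, slots_per_node, pack=False):
            nodes = [0] * n_nodes                      # used slots per node
            for slots in jobs:
                candidates = [i for i, used in enumerate(nodes)
                              if used + slots <= slots_per_node]
                if not candidates:
                    raise RuntimeError("no node can host a %d-slot job" % slots)
                key = lambda i: nodes[i]
                target = max(candidates, key=key) if pack else min(candidates, key=key)
                nodes[target] += slots
            return nodes

        if __name__ == "__main__":
            jobs = [1, 2, 1, 4, 1, 2, 4, 1]            # requested slots per job
            for pack in (False, True):
                used = dispatch(jobs, n_nodes=4, slots_per_node=8, pack=pack)
                print("pack=%-5s slots per node=%s busy nodes=%d"
                      % (pack, used, sum(1 for u in used if u > 0)))

        With these invented numbers the packing variant leaves two of the four nodes completely idle, which is the behaviour the abstract describes as desirable in some circumstances.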
    • 17:30 19:00
      Storage and filesystems: BoF session on OpenAFS
      Conveners: Andrei Maslennikov (Universita e INFN, Roma I (IT)), Dr Arne Wiebalck (CERN), Mr Peter van der Reest (DESY)
      • 17:30
        OpenAFS BoF 30m
        Speakers: Dr Arne Wiebalck (CERN), Mr Peter van der Reest (DESY)
    • 09:00 10:40
      Security and networking
      Conveners: Dr David Kelsey (STFC - Science & Technology Facilities Council (GB)), Dr Shawn Mc Kee (University of Michigan (US))
      • 09:00
        WLCG Network Monitoring using perfSONAR-PS 25m
        The LHC experiments have significantly evolved their distributed data management architecture over time. This resulted in the underlying WLCG infrastructure moving from a very rigid network topology, based on the MONARC model, to a more relaxed system, where data movement between regions or countries does not necessarily need to involve T1 centers. While this evolution brought obvious advantages in terms of flexibility for the LHC experiments’ data management systems, it also opened the question of how to monitor the increasing number of possible network paths, in order to manage and maintain a global, reliable network service. The perfSONAR network monitoring framework has been evaluated and agreed as a proper solution to cover the WLCG network monitoring use cases: it allows WLCG to plan and execute latency and bandwidth tests between any instrumented endpoints through a central scheduling configuration, it allows archiving of the metrics in a local database, it provides a programmatic and a web-based interface exposing the test results, and it also provides a graphical interface for remote management operations. In this presentation we will report on our activities and plans for deploying a perfSONAR-PS based network monitoring infrastructure in the scope of the WLCG Operations Coordination initiative. We will motivate the main choices we made in terms of configuration and management, describe the additional tools we developed to complement the standard packages and present the status of the deployment, together with the possible future evolution.
        Speaker: Shawn Mc Kee (University of Michigan (US))
        Slides
      • 09:25
        Identity Federation in WLCG and HEP 25m
        Federated identity management (FIM) in general and federated identity management for research communities (FIM4R) is an arrangement that can be made among multiple organisations that lets subscribers use the same identification data to obtain access to the secured resources of all organisations in the group. Specifically in the various research communities there is an increased interest in a common approach to FIM as there is obviously a large potential for synergies. Several research communities have converged and this presentation or poster will present the FIM4R approach, including a common vision for FIM, a set of requirements and a number of recommendations for ensuring a roadmap for the uptake of FIM is achieved. In WLCG, a dedicated working group has been formed to investigate identity federation. The status of the discussions in the WLCG/HEP community, including issues, future and progress in the recent year, as well as work on a pilot service for WLCG, will also be presented.
        Speaker: Mr Romain Wartel (CERN)
        Slides
      • 09:50
        eXtreme Scale Identity Management (XSIM) for Scientific Collaborations 25m
        The eXtreme Scale Identity Management (XSIM) effort is seeking to capture the trust relationships between both large and small scientific collaborations and their resource providers so as to understand the evolution leading to the existing environment. Based on analysis of interviews with representatives from Virtual Organizations and Resource Providers, XSIM will later be proposing a model for the next evolutionary step in identity management that preserves the core trust relationships between HEP collaborations and resource providers. The presentation will describe the results of interviews to this point and ask for additional volunteers to add information and perspectives to our analysis base.
        Speaker: Bob Cowles (BrightLite Information Security)
        Slides
      • 10:15
        Security update 25m
        This presentation provides an update of the security landscape since the last meeting. It describes the main vectors of compromises in the academic community and presents interesting recent attacks. It also covers security risks management in general, as well as the security aspects of the current hot topics in computing, for example identity federation and virtualisation.
        Speaker: Mr Romain Wartel (CERN)
        Slides
    • 10:40 11:10
      Coffee break 30m
    • 11:10 12:30
      Security and networking
      Convener: Dr David Kelsey (STFC - Science & Technology Facilities Council (GB))
      • 11:10
        The HEPiX IPv6 Working Group 25m
        This talk will be an update on the activities of the IPv6 working group since the meeting in Beijing. CERN has now announced that it could run out of IPv4 addresses during 2014, which has resulted in increased pressure on our activities. The working group has more active participation from the LHC experiments; we are also seeking out more WLCG Tier 1s and Tier 2s. Data transfer tests continue on our own testbed, but we are now planning a phased set of tests on subsets of the WLCG production infrastructure.
        Speaker: Dave Kelsey (STFC - Science & Technology Facilities Council (GB))
        Slides
      • 11:35
        The last year of IPv6 HEPiX testbed operation 25m
        A brief summary of the operational and technical issues found on the IPv6 HEPiX testbed over the last year is presented. The status of the issues that were reported to various software providers is also shown.
        Speaker: Francesco Prelz (Università degli Studi e INFN Milano (IT))
        Slides
      • 12:00
        DNS multi master architecture for High Availability services 25m
        In order to build a national infrastructure for High Availability and Disaster Recovery, we are deploying a dedicated multi-master DNS architecture. The architecture is based on BIND DLZ (Dynamically Loadable Zones). The new zone (ha.infn.it) is defined on top of a MySQL backend. Two or more MySQL servers are configured as multi-master servers so that the ha.infn.it zone can always be updated even if one of the DNS servers in charge of ha.infn.it is unreachable. Critical INFN national services (web services, authentication services, etc.) are then replicated on at least two different sites. If the main server is down, a Nagios process modifies the ha.infn.it zone accordingly to point to the online service. (An illustrative failover-update sketch follows this entry.)
        Speaker: Mr Riccardo Veraldi (INFN)
        Slides
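        Purely as an illustration of the failover step described above (the real setup uses Nagios and a site-specific BIND DLZ schema; the table, column, host names and credentials below are assumptions), here is a sketch of repointing a record in the MySQL backend when the primary endpoint stops answering:

        # Illustrative sketch only: repoint a DNS record in the MySQL backend used by
        # BIND DLZ when the primary service endpoint is unreachable. The schema
        # (dns_records table and its columns), credentials and hosts are assumptions.
        import socket
        import mysql.connector

        PRIMARY = ("service-primary.example.org", 443)   # hypothetical primary endpoint
        BACKUP_IP = "192.0.2.20"                         # hypothetical backup address
        RECORD_HOST = "www"                              # record to repoint in the HA zone

        def is_alive(host, port, timeout=5):
            """Basic TCP health check; a Nagios plugin plays this role in reality."""
            try:
                socket.create_connection((host, port), timeout).close()
                return True
            except OSError:
                return False

        def repoint_record(new_ip):
            conn = mysql.connector.connect(host="localhost", user="dnsadmin",
                                           password="secret", database="dlz")
            try:
                cur = conn.cursor()
                cur.execute("UPDATE dns_records SET data = %s "
                            "WHERE zone = %s AND host = %s AND type = 'A'",
                            (new_ip, "ha.infn.it", RECORD_HOST))
                conn.commit()
            finally:
                conn.close()

        if __name__ == "__main__":
            if not is_alive(*PRIMARY):
                repoint_record(BACKUP_IP)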
    • 12:30 14:00
      Lunch break 1h 30m
    • 14:00 15:40
      IT Infrastructure
      Conveners: Dr Helge Meinhard (CERN), Iwona Sakrejda, Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 14:00
        Agile Infrastructure: an updated overview of IaaS at CERN 25m
        The Agile Infrastructure project being developed at CERN is growing, scaling and being adopted by more users every day. After several pre-production releases, the OpenStack private cloud has been refined and adapted to fit our requirements: scalability, resource flexibility and availability. The current version is close to a production state that will provide Infrastructure as a Service to the CERN and LHC communities, with the needed services for tracking, auditing, authorization, high availability, etc., ensuring the appropriate level of security, accounting and isolation. This presentation will give an overview of the current status of the project, what we have learned, and what the future plans are.
        Speaker: Luis Fernandez Alvarez
        Slides
      • 14:25
        Agile Infrastructure Monitoring 25m
        The Agile Infrastructure (AI) project is establishing new tools and procedures to deliver a more flexible and dynamic management of the CERN computer centres. From the infrastructure monitoring perspective the project is establishing a common monitoring architecture to simplify and improve access to information about all computer centre resources. The AI monitoring area is working to deliver new solutions for the operation of the computer centre (notifications, alarms, tickets, etc.) and for the analysis and visualisation of monitoring data (dashboards, correlation engine, etc.). This talk will explain the AI monitoring architecture building blocks, present the main areas of work, and provide more details about the status of the new operational tools developed for the General Notification Infrastructure (GNI), which is now being tested in a preproduction service.
        Speaker: Mr Pedro Manuel Rodrigues De Sousa Andrade (CERN)
        Slides
      • 14:50
        High Availability Load Balancing in the Agile Infrastructure (CERN) 25m
        This presentation describes several load-balancing paradigms for services within our Puppet-managed, CERN-private OpenStack cloud. Our primary load balancer is HAProxy, which is combined with the Corosync/Pacemaker cluster engine or a DNS load balancer. Specifically, we create small clusters of load balancers behind a service IP to guarantee front-end failover. The aforementioned load balancers redirect traffic to a cluster of backend servers, ensuring high availability of CERN services.
        Speaker: Mr Evangelos (Vaggelis) Atlidakis (CERN)
        Slides
      • 15:15
        Messaging Services @ CERN 25m
        CERN started to use messaging (as defined in MOM[1]) in 2008. Today, it is used in very different environments, from Grid monitoring to computer center management and even LHC controls. This presentation will briefly describe messaging and will detail how messaging services are provided at CERN, focusing on lessons learned inside CERN but also through European projects (EGEE, EMI and EGI). [1] http://en.wikipedia.org/wiki/Message-oriented_middleware
        Speaker: Lionel Cons (CERN)
        Slides
    • 15:40 16:10
      Coffee break 30m
    • 16:10 17:25
      IT Infrastructure
      Conveners: Dr Helge Meinhard (CERN), Iwona Sakrejda, Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 16:10
        Experiences running a production Puppet infrastructure at CERN 25m
        As part of the CERN Agile Infrastructure project, we have been running the puppet component as a production service. With the benefit of the experience of running at larger scale with more users, we will describe the lessons we have learnt, and how we've evolved our original design. Looking to the future, we will present how we plan to support a multi-admin environment with a distributed change management workflow.
        Speaker: Ben Jones (CERN)
      • 16:35
        Configuration management tools used in the EGI sites: Survey result 25m
        EGI.eu has recently conducted a survey among EGI sites on configuration management tools used for their Grid resources. We will present the results of this survey.
        Speaker: Yves Kemp (Deutsches Elektronen-Synchrotron (DE))
        Slides
      • 17:00
        Quattor/Aquilon update - Latest developments into integrating Aquilon with a Quattor managed site 25m
        Aquilon is a Quattor configuration database and management broker developed by an investment bank to meet the needs of their large worldwide grid. Providing much better relational integrity for the Quattor configuration database and a workflow that is both more agile and more disciplined, Aquilon can transform the use of Quattor to manage sites. This talk will discuss the latest updates on Quattor+Aquilon usage at the Tier1, as well as the experience of documenting and applying the Tier1 Aquilon experience to other Quattor sites. Moreover, developments in the Quattor release process, the addition of yum-based package management for Quattor, and current plans for Quattor evolution will be presented.
        Speaker: DIMITRIOS ZILASKOS (STFC)
        Slides
    • 17:30 19:00
      HEPiX board meeting (by invitation only) 1h 30m
    • 19:00 20:00
      Concert 1h
    • 20:00 23:00
      Social dinner 3h
    • 09:00 10:40
      Grids, clouds, virtualisation
      Conveners: Ian Collier (UK Tier1 Centre), Dr Keith Chadwick (Fermilab)
      • 09:00
        The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud) 25m
        The CMS online cluster consists of more than 3000 computers. During normal operation it is used for the data acquisition of the CMS experiment at CERN. The sophisticated design and the dedicated software run on the farm allow the collection of up to 20 TB of data per day. Considering the huge amount of computing resources under the control of the HLT farm, an OpenStack cloud layer has been deployed on part of the cluster (13000 cores), allowing opportunistic usage of the cluster. We will present the benefits of virtualization technology, taking the CMSooooCloud as an example. The architecture choices and the usage of the OpenStack cloud controller and Open vSwitch will be shown. The presentation will cover the design and performance aspects of the current installation.
        Speaker: Wojciech Andrzej Ozga (AGH University of Science and Technology (PL))
        Slides
      • 09:25
        Virtualisation & Cloud Developments at RAL Tier 1 25m
        The RAL Tier 1 has been making increasing use of virtualisation to provide a platform for development and, increasingly, production services. More recently we have established a private cloud platform using StratusLab, which is now being used extensively across the Tier 1 and the STFC Scientific Computing Department. This talk will discuss the current status and the roadmap for the immediate future.
        Speaker: Ian Collier (UK Tier1 Centre)
        Slides
      • 09:50
        Fermilab Grid and Cloud Computing Updates 25m
        Beginning in 2007, Fermilab has made extensive use of virtualization to deliver services in support of science. Many production services are deployed using virtualization and have realized significant cost savings while delivering improvements in performance and availability. In 2010, Fermilab initiated the FermiCloud project to deliver a dynamic and scalable Infrastructure-as-a-Service (IaaS) capability using open source cloud computing frameworks to support the needs of the Fermilab scientific communities. A collaboration of personnel from Fermilab and the Korea Institute of Science and Technology Information (KISTI) has focused significant work over the past 18 months to deliver improvements to the applicability and robustness of FermiCloud, together with specific accomplishments with respect to direct and indirect support of science. The General Physics Computing Facility (GPCF), also initiated in 2010, uses open source virtualization tools to provide computing server resources for scientific experiments at Fermilab. This allows us to provision new needs quickly and to make efficient use of available hardware resources. GPCF provides virtual servers that are expected to have medium to long lives and that are centrally managed by a team of system administrators. The Virtual Server Hosting Service provides hosting for virtual machines running Windows, Linux, or Solaris x86. The service is intended for customers wishing to provision new virtual machines, import virtual machines from other environments, and convert physical systems into virtual machines. While currently used mainly for Enterprise Business application hosting, we are exploring it as a hosting environment for enterprise level scientific services and those experiment processes that are not data intensive. Our plans in the future include developing a cost model against which to measure our future choices for procurements and activities. The opportunities, challenges and successes of virtualization and cloud computing at Fermilab will be presented.
        Speaker: Dr Keith Chadwick (Fermilab)
        Slides
      • 10:15
        vmcaster and vmcatcher 25m
        vmcaster and vmcatcher aim to meet the requirements set by the now-completed HEPiX virtualisation working group. The EGI federated cloud task force has recommended vmcatcher for installation at all cloud resource providers in their collaboration; the status of the integration with OpenStack and OpenNebula will also be covered. This talk will explain the developments that have occurred in the past year, including the new product vmcaster, which makes publishing virtual machine image lists easy to automate, and the developments in vmcatcher, and will ask for suggestions on the future roadmap of these tools.
        Speaker: Owen Synge (N)
        Slides
    • 10:40 11:10
      Coffee break 30m
    • 11:10 11:35
      Grids, clouds, virtualisation
      Conveners: Ian Collier (UK Tier1 Centre), Dr Keith Chadwick (Fermilab)
      • 11:10
        EGI Federated Cloud Infrastructure status 25m
        This presentation provides an update on the status of the EGI Federated Cloud Task to the world wide Grid and Cloud communities. Building on the technology and expertise aggregated in over 10 years of successful provisioning and operation of a pan-European Grid Infrastructure, the Task Force further pushes the frontiers of Cloud interoperability enabling user communities to scale their computing needs across multiple Cloud providers, both academic/publicly funded and commercial providers.
        Speaker: Ian Peter Collier (STFC - Science & Technology Facilities Council (GB))
        Slides
    • 11:35 12:25
      IT Infrastructure
      Conveners: Dr Helge Meinhard (CERN), Iwona Sakrejda, Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 11:35
        Business Continuity Experiences at BNL 25m
        This presentation discusses business continuity experiences at BNL over the past few years, including cyber-security incidents, operational accidents and natural disasters. Based on these experiences, BNL has continuously adjusted its protocols and responses in an effort to minimize degradation to the user experience at our data center.
        Speaker: Christopher Hollowell (Brookhaven National Laboratory)
        Slides
      • 12:00
        Experience at CERN T0 on continuous service improvement 25m
        This talk will present the experience at CERN T0 on continuous service improvement. During the last years, we have made an explicit effort to understand and improve all service management aspects in order to increase efficiency and effectiveness. We will present the requirements, how they were addressed and share our experiences, the positive ones and the mistakes, describing how we measure, report and use the data to improve not only the processes but to continually improve the services being provided. The focus is not the tool or the process but the results of the continuous improvement effort from a large team of service managers, sitting between the service providers and the users.
        Speaker: Maite Barroso Lopez (CERN)
        Slides
    • 12:25 13:45
      Lunch break 1h 20m
    • 13:45 15:25
      Storage and filesystems
      Conveners: Andrei Maslennikov (Universita e INFN, Roma I (IT)), Mr Peter van der Reest (DESY)
      • 13:45
        Experience operating multi-PB disk storage systems (CASTOR and EOS) 25m
        After the strategic decision in 2011 to separate Tier0 activity from analysis, CERN-IT is now operating two large-volume physics file stores: CASTOR and EOS. CASTOR is our historical storage system, in production for many years, which now mainly handles the Tier0 activities as well as all tape-backed data. EOS has been in production since 2011 for ATLAS and CMS; it supports the fast-growing need for high-performance and low-latency data access, mainly for user analysis. In 2012 two new EOS instances were created (ALICE and LHCb) and a big migration of disk-only pools took place. In summer 2013 we plan to set up a shared "non-LHC" instance for the other experiments. The presentation will focus on CASTOR and EOS disk pool operations, with a particular emphasis on efficiency, inefficiency, costs and future improvements.
        Speaker: Luca Mascetti (CERN)
        Slides
      • 14:10
        Monitoring multi-PB disk farms at CERN for the HEP experiments: a service manager view 25m
        In the last two years our team (operating CASTOR and EOS at CERN) has invested a lot in monitoring our large disk-server clusters. The CASTOR and EOS disk-server farms contain about 800 machines each and about 15 PB of usable disk space. Each of them produces logging information at a rate between 30 and 60 GB/day, which is vital for monitoring, accounting, troubleshooting and disaster recovery. The new system, named Cockpit, has been designed around the experience and the requirements of the operations team, in close contact with the Agile monitoring working group. In this presentation we will illustrate the present status of the project, its main components and usage, and the lessons learnt. The system has been in production for CASTOR for several months and is under deployment for EOS.
        Speaker: Massimo Lamanna (CERN)
        Slides
      • 14:35
        SAS expanders and software RAID optimizations applied to CERN CC backup service 25m
        In 2012, the CERN backup server infrastructure switched to HBA attached SAS expanders and software RAID. This new hardware gave us the opportunity to perform a deeper analysis of TSM disk usage and introduce newer optimizations to reach higher overall performance. This presentation gives an overview of the backup service storage infrastructure and usage. Then it explains the different RAID and filesystem optimizations that allowed us to improve the backup infrastructure performance by 40%.
        Speaker: Julien Leduc (CERN)
        Slides
        Videos
      • 15:00
        Optimizing Tier1 Storage infrastructure 25m
        The efficiency of a storage solution in a multi-petabyte data center depends on multiple factors, but the most important is the choice of the storage model. From the analysis of our storage model, which has been in production for several years, it becomes evident that building the storage system from PB-sized blocks permits higher performance at the cost of lower flexibility. It is also important to find the right balance between capacity and performance, price and availability, footprint and power consumption.
        Speaker: Dr Vladimir Sapunenko (INFN-CNAF)
        Slides
    • 15:25 21:00
      CNAF's 50th anniversary 5h 35m
    • 09:00 10:05
      Storage and filesystems
      Conveners: Andrei Maslennikov (Universita e INFN, Roma I (IT)), Mr Peter van der Reest (DESY)
      • 09:00
        Web object scaler 25m
        DDN's Web Object Scaler (WOS) offers an alternative approach to managing large-scale distributed data. WOS is designed without the overheads associated with traditional filesystems that normally underpin higher-level distributed middleware systems. File system checks, fragmentation, RAID sets and inode structures have been removed in WOS, and a simple, efficient placement of objects directly to disk is implemented instead. The approach moves metadata management to external software/middleware, which fits well with most current large-scale distributed data systems. We present some alternative implementations of WOS in real-world environments and discuss the performance of WOS for several use cases.
        Speaker: Toine Beckers (DataDirect Networks)
        Slides
      • 09:25
        Ceph: A scalable, organic option for Storage-as-a-Service at CERN 25m
        Recent storage requirements, such as the demand for VM block storage in the Agile Infrastructure project or the need for a scalable, reliable backend for file services such as AFS and NFS, motivate a generic consolidated storage system for CERN IT. With its native file, block, and object interfaces, the open-source system Ceph is an attractive vendor-independent candidate to fill this gap. In this presentation, we will introduce Ceph, discuss how it could fit the storage use cases at CERN, and report on the first results of our prototype investigation. (A minimal object-interface sketch follows this entry.)
        Speaker: Dr Arne Wiebalck (CERN)
        Slides
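        For readers unfamiliar with the object interface mentioned in the abstract, here is a minimal sketch (not from the talk) using the python-rados bindings; the configuration file path and the pool name are assumptions:

        # Minimal sketch of Ceph's native object interface via the python-rados
        # bindings: write an object into a pool and read it back.
        # The ceph.conf path and the pool name "data" are assumptions.
        import rados

        def demo():
            cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumed config path
            cluster.connect()
            try:
                ioctx = cluster.open_ioctx("data")                 # assumed pool name
                try:
                    ioctx.write_full("hello-object", b"hello from HEPiX")
                    print(ioctx.read("hello-object"))
                finally:
                    ioctx.close()
            finally:
                cluster.shutdown()

        if __name__ == "__main__":
            demo()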
      • 09:50
        Summary of the BoF session on IPv6 and OpenAFS 15m
        Summary of the BoF session on IPv6 and OpenAFS
        Speakers: Dr Arne Wiebalck (CERN), Mr Peter van der Reest (DESY)
        Slides
    • 10:05 10:45
      IT Infrastructure
      Conveners: Dr Helge Meinhard (CERN), Iwona Sakrejda, Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 10:05
        Report from CMDBuild BoF session 15m
        In the framework of the business continuity track at the HEPiX 2013 Spring workshop, we had the opportunity to talk with representatives of the Tecnoteca company, which supports the CMDBuild software. For this first face-to-face meeting we had a free discussion rather than an organized session, and the discussion was kept at the technical level. We will present an indicative list of our comments and the points of this discussion.
        Speaker: Dr Emmanouil Vamvakopoulos (CC-IN2P3 - Centre de Calcul (FR))
        Slides
        summary
      • 10:20
        Scientific Linux Status 25m
        The status of Scientific Linux.
        Speaker: connie sieh (Fermilab)
        Slides
    • 10:45 11:05
      Coffee break 20m
    • 11:05 11:55
      IT Infrastructure
      Conveners: Dr Helge Meinhard (CERN), Iwona Sakrejda, Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 11:05
        GSI monitoring amalgamation 25m
        GSI has combined several well-established - maybe old-fashioned - technologies into a framework that is capable of fulfilling diverse requirements and is able to keep up with the evolutionary changes in GSI's heterogeneous IT environment with relatively small effort.
        Speaker: Christopher Huhn (GSI Helmholtzzentrum fuer Schwerionenforschung)
        Slides
      • 11:30
        Log management with Logstash and Elasticsearch 25m
        As the number of servers, both bare metal and virtual machines, keeps increasing, we need tools that are not only able to store the logs efficiently but can also extract insightful information from them. Logstash, in combination with the indexing capabilities of Elasticsearch, can be used to collect logs from different kinds of sources and aggregate them in such a way that the resulting information can be used to spot a problem or find a trend in a huge amount of data. (An illustrative trend-query sketch follows this entry.)
        Speaker: Matteo Dessalvi (G)
        Slides
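        As a small illustration of the kind of trend query this setup enables (not part of the talk), the sketch below counts error events per daily Logstash index via the elasticsearch Python client; the endpoint, index pattern and field name follow common Logstash defaults but are assumptions here:

        # Illustrative sketch: count "error" events per day in Logstash-fed
        # Elasticsearch indices. The endpoint, the logstash-YYYY.MM.DD index pattern
        # and the "message" field are assumptions based on common Logstash defaults.
        import datetime
        from elasticsearch import Elasticsearch

        es = Elasticsearch(["http://localhost:9200"])    # assumed endpoint

        def errors_per_day(days=7):
            today = datetime.date.today()
            for offset in range(days):
                day = today - datetime.timedelta(days=offset)
                index = "logstash-%s" % day.strftime("%Y.%m.%d")
                try:
                    result = es.count(index=index,
                                      body={"query": {"match": {"message": "error"}}})
                    print("%s  %d error events" % (day, result["count"]))
                except Exception:
                    print("%s  (no index for this day or query failed)" % day)

        if __name__ == "__main__":
            errors_per_day()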
    • 11:55 12:15
      Miscellaneous
      Convener: Dr Helge Meinhard (CERN)
      • 11:55
        Workshop wrap-up 20m
        Closing remarks
        Speaker: Dr Helge Meinhard (CERN)
        Slides