Group Meeting

Europe/Zurich
513/1-024 (CERN)

Conveners: Alberto Pace (CERN), Oliver Keeble (CERN)
    • 14:00 14:15
      A milestone for DPM (Disk Pool Manager) 15m

      The DPM (Disk Pool Manager) system is a scalable, multiprotocol technology for Grid storage that supports about 130 sites
      for a total of about 90 Petabytes online.
      The system has recently completed the development phase announced in past years, which consolidates its core component,
      DOME (Disk Operations Management Engine), into a full-featured, high-performance engine that can also be operated with
      standard Web clients and exposes a fully documented REST-based protocol.
      Together with a general improvement in performance and a comprehensive administration command-line interface, this
      milestone also reintroduces features such as automatic disk server status detection and volatile pools for deploying
      experimental disk caches.
      In this contribution we also discuss the end of support for the historical DPM components (which include a dependency on
      the Globus toolkit); their deployment is now required only for the SRM protocols, so they can be uninstalled once a site
      no longer needs SRM.

      Speaker: Fabrizio Furano (CERN)
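
      As a minimal sketch of what "operated with standard Web clients" over a REST-based protocol can look like, the Python
      snippet below issues an HTTPS request to a DOME-style head node. The host name, command path, response fields and
      authentication handling are assumptions made for illustration only; the real command set is defined in the DPM/DOME
      documentation.

          # Hedged illustration only: URL, command name and JSON fields are hypothetical.
          import requests

          DOME_HEAD = "https://dpmhead.example.org:1094"  # placeholder head-node URL

          def get_space_info(session: requests.Session) -> dict:
              """Ask the head node for pool/space usage (illustrative command name)."""
              resp = session.get(f"{DOME_HEAD}/domehead/command/dome_getspaceinfo", timeout=30)
              resp.raise_for_status()
              return resp.json()

          if __name__ == "__main__":
              with requests.Session() as s:
                  # X.509 or token authentication would be configured on the session here.
                  print(get_space_info(s))
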
    • 14:15 14:30
      CERN Tape Archive (CTA): From Development to Production Deployment 15m

      The first production version of the CERN Tape Archive (CTA) software is planned for release by the end of 2018. CTA is designed to replace CASTOR as the CERN tape archive solution, in order to meet the scalability and performance challenges arriving with LHC Run 3.

      This contribution will describe the main commonalities and differences between CTA and CASTOR. We outline the functional enhancements and integration steps required to add the CTA tape back-end to an EOS disk storage system. We present and discuss the different deployment and migration scenarios for replacing the five CASTOR instances at CERN, including a description of how FTS will interface with EOS and CTA.

      Speaker: Michael Davis (CERN)
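
      To make the disk/tape split concrete, here is a minimal, purely conceptual sketch of the lifecycle of a file in a
      disk-fronted tape archive of the kind described above (a disk buffer in front, a tape back-end behind). All class,
      field and path names are invented for this illustration and are not EOS or CTA interfaces.

          # Toy model of a disk-fronted tape archive; names are illustrative only.
          from dataclasses import dataclass, field
          from enum import Enum, auto

          class CopyState(Enum):
              ON_DISK_ONLY = auto()      # file just written to the disk buffer
              QUEUED_FOR_TAPE = auto()   # archive request handed to the tape back-end
              ON_TAPE = auto()           # safely on tape; the disk replica may be evicted

          @dataclass
          class ArchiveFile:
              path: str
              size_bytes: int
              state: CopyState = CopyState.ON_DISK_ONLY

          @dataclass
          class TapeBackend:
              """Stand-in for the tape archive: it only tracks state transitions."""
              queue: list = field(default_factory=list)

              def archive(self, f: ArchiveFile) -> None:
                  f.state = CopyState.QUEUED_FOR_TAPE
                  self.queue.append(f)

              def flush_to_tape(self) -> None:
                  for f in self.queue:
                      f.state = CopyState.ON_TAPE
                  self.queue.clear()

          backend = TapeBackend()
          f = ArchiveFile("/eos/experiment/run3/raw/data.root", 4 * 1024**3)
          backend.archive(f)
          backend.flush_to_tape()
          print(f.path, f.state.name)  # -> ON_TAPE
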
    • 14:30 14:45
      Providing large-scale disk storage at CERN 15m

      The CERN IT Storage group operates multiple distributed storage systems and is responsible
      for supporting the infrastructure that accommodates all CERN storage requirements, from the
      physics data generated by LHC and non-LHC experiments to the personal files of CERN users.

      EOS is now the key component of the CERN storage strategy. It sustains high incoming
      throughput during experiment data taking while running complex concurrent production workloads.
      This high-performance distributed storage now provides more than 250 PB of raw disk and is the
      key component behind the success of CERNBox, the CERN cloud synchronisation service, which allows
      syncing and sharing files on all major mobile and desktop platforms and provides offline
      availability for any data stored in the EOS infrastructure.

      CERNBox has seen exponential growth in files and data stored over the last couple of years,
      thanks to its increasing popularity within the CERN user community and to its integration
      with a multitude of other CERN services (Batch, SWAN, Microsoft Office).

      In parallel, CASTOR is being simplified and is transitioning from an HSM into an archival system, focusing mainly
      on the long-term recording of primary data from the detectors and paving the way for the next-generation
      tape archival system, CTA.

      The storage services at CERN also cover the needs of the rest of our community: Ceph as the data back-end for
      the CERN OpenStack infrastructure, NFS services and S3 functionality; AFS for legacy home-directory filesystem
      services, together with its ongoing phase-out; and CVMFS for software distribution.

      In this paper we summarise our experience in supporting all of our distributed storage systems and the ongoing work
      to evolve our infrastructure, including the testing of very dense storage building blocks (nodes with more than
      1 PB of raw space) for the challenges ahead.

      Speaker: Herve Rousseau (CERN)
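
      As a minimal sketch of the S3 functionality mentioned above, the snippet below talks to an S3-compatible endpoint
      (such as a Ceph RADOS Gateway) through the standard boto3 client. The endpoint URL, bucket name and credentials are
      placeholders, not actual CERN service parameters.

          # Hedged illustration: endpoint and credentials are placeholders.
          import boto3

          s3 = boto3.client(
              "s3",
              endpoint_url="https://s3.example.org",      # placeholder S3-compatible gateway
              aws_access_key_id="ACCESS_KEY",
              aws_secret_access_key="SECRET_KEY",
          )

          # Create a bucket, upload a small object, then list what is stored.
          s3.create_bucket(Bucket="demo-bucket")
          s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello storage")
          for obj in s3.list_objects_v2(Bucket="demo-bucket").get("Contents", []):
              print(obj["Key"], obj["Size"])
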
    • 14:45 15:00
      Scaling the EOS namespace 15m

      The EOS namespace has outgrown its legacy in-memory implementation, presenting the need for an alternative solution. In response, we developed QuarkDB, a highly available datastore capable of serving as the metadata backend for EOS. Even though the datastore was tailored to the needs of the namespace, its capabilities are generic.

      We will present the overall system design and our efforts to provide performance comparable to the in-memory approach: for reads, through extensive caching on the MGM, and for writes, through latency-hiding techniques based on a persistent, back-pressured local queue that batches updates to the QuarkDB backend.

      We will also discuss the architectural decisions taken when designing our datastore, including the choice of consensus algorithm for maintaining strong consistency between identical replicas (Raft), the choice of underlying storage backend (RocksDB) and communication protocol (the Redis serialization protocol, RESP), as well as the overall testing strategy used to ensure the correctness and stability of this important infrastructure component.

      Speaker: Andrea Manzi (CERN)
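
      The write path described above can be illustrated with a minimal sketch of a bounded, batching local queue: writers
      enqueue metadata updates and a background worker flushes them to the backend in batches, while the bounded size
      provides back-pressure when the backend falls behind. This is a toy illustration in plain Python, not QuarkDB or EOS
      code; in particular, the real queue is persistent so that updates survive restarts.

          # Hedged illustration of latency hiding via a bounded, batching queue.
          import queue
          import threading
          import time

          class BatchingQueue:
              def __init__(self, flush_fn, max_pending=10_000, batch_size=500):
                  self._q = queue.Queue(maxsize=max_pending)  # bounded => back-pressure
                  self._flush_fn = flush_fn
                  self._batch_size = batch_size
                  threading.Thread(target=self._worker, daemon=True).start()

              def put(self, update):
                  # Blocks the writer when max_pending updates are already queued.
                  self._q.put(update)

              def _worker(self):
                  while True:
                      batch = [self._q.get()]              # wait for at least one update
                      while len(batch) < self._batch_size:
                          try:
                              batch.append(self._q.get_nowait())
                          except queue.Empty:
                              break
                      self._flush_fn(batch)                # one backend round-trip per batch

          # Example "flush" that just prints; a real system would write to the datastore.
          bq = BatchingQueue(flush_fn=lambda batch: print(f"flushed {len(batch)} updates"))
          for i in range(5):
              bq.put({"op": "set", "key": f"file{i}", "value": "metadata"})
          time.sleep(0.2)  # give the background worker time to flush (demo only)
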
    • 15:00 15:15
      Testing of complex, large-scale distributed storage systems: a CERN disk storage case study 15m

      Complex, large-scale distributed systems are increasingly used to solve
      extraordinary computing, storage and other problems. However, developing such
      systems usually means working with several software components, maintaining
      and improving large codebases, and coordinating a relatively large number of
      developers, so faults are inevitably introduced into the system. At the same
      time, these systems often perform important, if not crucial, tasks, so
      critical bugs and performance-hindering algorithms are not acceptable in the
      production state of the software. Moreover, a larger team of developers works
      more freely and productively when it receives constant feedback that its
      changes remain in harmony with the system requirements and with other
      people's work; this also helps scale out manpower, meaning that adding more
      developers to a project can actually result in more work done.

      In this paper we present the case study of EOS, the CERN disk storage
      system, and introduce methods for achieving fully automatic regression,
      performance and robustness testing, as well as continuous integration, for
      such a large-scale, complex and critical system using container-based
      environments. We also pay special attention to the details and challenges of
      testing distributed storage and file systems.

      Speaker: Andrea Manzi (CERN)
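
      A minimal sketch of the container-based approach is shown below: a pytest fixture starts a disposable instance of the
      system under test in a container, the tests run against it, and the container is removed afterwards. The image name
      and the readiness/validation steps are placeholders, not the actual EOS test harness.

          # Hedged illustration: image name and checks are placeholders.
          import subprocess
          import time
          import pytest

          IMAGE = "example/storage-under-test:latest"   # placeholder container image
          NAME = "storage-ci-instance"

          @pytest.fixture(scope="session")
          def storage_container():
              subprocess.run(["docker", "run", "-d", "--rm", "--name", NAME, IMAGE], check=True)
              time.sleep(5)  # crude readiness wait; a real CI job would poll a health check
              yield NAME
              subprocess.run(["docker", "rm", "-f", NAME], check=True)

          def test_service_responds(storage_container):
              # Placeholder for a real regression test, e.g. write a file into the
              # containerised instance, read it back and compare checksums.
              result = subprocess.run(["docker", "exec", storage_container, "true"], check=True)
              assert result.returncode == 0
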
    • 15:15 15:30
      CERNBox: the CERN Cloud Storage HUB 15m

      CERNBox is the CERN cloud storage hub. It allows synchronising and sharing files on all major desktop and mobile platforms (Linux, Windows, macOS, Android, iOS), aiming to provide universal access and offline availability for any data stored in the CERN EOS infrastructure.

      With more than 12000 users registered in the system, CERNBox has responded to the high demand in our diverse community for an easily accessible cloud storage solution that also integrates with other CERN services for big science: visualisation tools, interactive data analysis and real-time collaborative editing.

      Collaborative authoring of documents is now becoming standard practice with public cloud services, and within CERNBox we are looking into several options: from the collaborative editing of shared office documents with different solutions (Microsoft, OnlyOffice, Collabora), to integrating Markdown and LaTeX editors, to exploring the evolution of Jupyter Notebooks towards collaborative editing, where the latter leverages the existing SWAN physics analysis service.

      We report on our experience managing this technology, on applicable use cases (also in a broader scientific and research context) and on its future evolution, highlighting the current development status and roadmap. In particular, we will highlight the planned move to a microservice-based architecture, which will make it easier to adapt and evolve the service as technology and usage evolve, notably to unify the CERN home directory services.
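
      As a minimal sketch of programmatic sync-style access to a CERNBox-like service, the snippet below uploads and
      retrieves a file over WebDAV, the kind of interface sync-and-share platforms commonly expose. The base URL, path and
      credentials are placeholders and not the documented CERNBox endpoints; real usage goes through the official sync
      clients or the documented service APIs.

          # Hedged illustration: URL, path and credentials are placeholders.
          import requests

          WEBDAV_BASE = "https://cernbox.example.org/remote.php/webdav"  # placeholder endpoint
          AUTH = ("username", "app-password")                            # placeholder credentials

          # Upload a local file with HTTP PUT, then download it again with GET to verify.
          with open("analysis_notes.txt", "rb") as fh:
              r = requests.put(f"{WEBDAV_BASE}/notes/analysis_notes.txt", data=fh, auth=AUTH, timeout=60)
              r.raise_for_status()

          r = requests.get(f"{WEBDAV_BASE}/notes/analysis_notes.txt", auth=AUTH, timeout=60)
          r.raise_for_status()
          print(len(r.content), "bytes downloaded")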

    • 15:30 15:45
      [WIP] Cloud Storage for data-intensive sciences in science and industry 15m

      In the last few years we have seen constant interest in technologies that provide effective cloud storage for scientific use, matching the requirements of price, privacy and scientific usability. This interest is not limited to HEP and extends to other scientific fields driven by rapid data growth: "big data" is now characteristic of modern genomics, energy and financial services, to mention a few.

      The provision of cloud storage accessible via synchronisation and sharing interfaces has become an essential element of the service portfolios offered by research laboratories and universities. "Dropbox-like" services have been created and now support HEP and other communities in their day-to-day tasks. The scope of these systems is therefore much broader than HEP: we will describe the usage of, and the plans to adopt, tools originally conceived for our community in other areas. The challenge we now face is adopting cloud storage services in the main data analysis workflow, extending the functionality of "traditional" cloud storage.

      What are the ingredients of these new classes of services? Is HEP today proposing solutions of interest to other future projects on the timescale of the High-Luminosity LHC?

      The authors believe that HEP-developed technologies will constitute the back-end of a new generation of services: our solution for exascale geographically distributed storage (EOS), the access to and federation of cloud storage across different domains (CERNBox), and the ability to offer effective, heavy-duty interactive data analysis services (SWAN) on top of this novel data infrastructure are the three key enablers of future evolution.

      In this presentation we will describe the use of these technologies to build large content-delivery networks (e.g. AARNet in Australia), collaboration with other activities (e.g. handling of satellite images from the Copernicus programme at JRC) and partnerships with companies active in this field.

    • 15:45 16:00
      CHEP Presentation - Disk failures in the EOS setup at CERN: A first systematic look at 1 year of collected data 15m

      The EOS deployment at CERN is a core service used for scientific data
      processing and analysis, and as the back-end for general end-user storage (e.g. home directories/CERNBox).
      The disk failure metrics collected over a period of one year from a deployment
      of some 70k disks allow a first systematic analysis of the behaviour
      of different hard disk types for the large CERN use cases.

      In this presentation we will describe the data collection and analysis,
      summarise the measured failure rates and compare them with other large disk
      deployments. In the second part of the presentation we will present a first
      attempt to use the collected failure and SMART metrics to develop a machine
      learning model that predicts imminent failures, in order to avoid service
      degradation and repair costs.

      Speaker: Alfonso Juan Portabales Gonzalez (Universidad Politecnica de Madrid (ES))
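
      As a minimal sketch of the kind of failure-prediction model mentioned above, the snippet below trains a classifier on
      per-disk SMART-style features with a "failed soon" label. The feature set, the synthetic data and the model choice are
      placeholders standing in for the metrics actually collected from the EOS deployment; in practice the rarity of failures
      makes class imbalance handling and time-based validation at least as important as the specific model.

          # Hedged illustration: synthetic data stands in for real SMART metrics.
          import numpy as np
          from sklearn.ensemble import RandomForestClassifier
          from sklearn.model_selection import train_test_split
          from sklearn.metrics import classification_report

          # Placeholder dataset: rows = disks, columns = selected SMART-style attributes
          # (e.g. reallocated sectors, pending sectors, power-on hours, temperature).
          rng = np.random.default_rng(0)
          X = rng.normal(size=(5000, 4))
          y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 2).astype(int)

          X_train, X_test, y_train, y_test = train_test_split(
              X, y, test_size=0.3, stratify=y, random_state=0)

          model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
          model.fit(X_train, y_train)
          print(classification_report(y_test, model.predict(X_test)))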