WLCG Service Reliability Workshop

Europe/Zurich
IT Auditorium (CERN)


Description
NEW! - VRVS has been booked as follows:
  • Monday - Island
  • Tuesday - Island
  • Wednesday - Saturn
  • Thursday - Plane
  • Friday - Car

Tentative list of topics:

  • Critical Services - Experiments' Viewpoint
  • Reliability by design - follow-up on issues from WLCG Collaboration workshop in Victoria / CHEP
  • Monitoring & end-to-end Service Reliability
  • Middleware development - tips & techniques for making services reliable by design.
    (Hopefully including a session on developing DB apps).
  • WLCG Medium Term Requirements for Operations & Support

Target attendance: 30-50(?) people


Participants
  • Alberto Pace
  • Alexander Verkooijen
  • Andrea Valassi
  • Andreas Motzke
  • Andrew Wong
  • Angela Poschlad
  • Barbara Martelli
  • Carlos Gamboa
  • Cecile Barbier
  • Clemens Koerdt
  • Cyril Lorphelin
  • Dario Barberis
  • David Bouvet
  • Dirk Duellmann
  • Eric Fede
  • Eric Grancher
  • Etienne Urbah
  • Federico Bonifazi
  • Felice Rosso
  • Gang Qin
  • Gavin McCance
  • Georgios Papadimitriou
  • Gonzalo Merino
  • Gordon Brown
  • Harry Renshall
  • Hsu-Chang Wei
  • Ignacio Reguero
  • James Casey
  • Jamie Shiers
  • Jan van Eldik
  • Jean-Rene Rouet
  • John Gordon
  • Jonatan Piedra
  • Laura Perini
  • Luca Canali
  • Maarten Litmaath
  • Maria Dimou
  • Maria Girone
  • Mark van de Sanden
  • Max Böhm
  • Miguel Anjo
  • Miguel Coelho dos Santos
  • Mingchao Ma
  • Muriel Gougerot
  • Nicholas Thackray
  • Nicole Iribarnes
  • Nilo Segura Chinchilla
  • Olli Tourunen
  • Olof Barring
  • Paolo Veronesi
  • Patricia Mendez Lorenzo
  • Peter Kreuzer
  • Pierre-Emmanuel Brinette
  • Pierre-Etienne Macchi
  • Piotr Nyczyk
  • Rajesh Kalmady
  • Ricardo Silva
  • Rob Quick
  • Rolf Rumler
  • Ron Trompert
  • Sophie Lemaitre
  • Stephen Gowdy
  • Steve Fisher
  • Steve Traylen
  • Tony Wildish
  • Ulrich Schwickerath
  • Veronique Lefebure
  • Vincenzo Spinoso
    • 09:00 - 18:00
      Critical services - Requirements (IT Auditorium, CERN)

      VRVS details: Island

      • 09:00
        Introduction and Idea of the Workshop 20m
        N.B. A workshop summary will be given at the Overview Board and the GDB in the first week of December (next week).
      • 09:20
        Critical services - Requirements of the Experiments 1h
      • 10:20
        coffee break 20m
      • 10:40
        Techniques for implementing & running robust and reliable services 1h 20m
        • DNS Load Balancing 30m
          Speaker: Vlado Bahyl (CERN)
          Slides
        • Oracle Real Application Clusters (RAC) 30m
          Speaker: Miguel Anjo (CERN)
          Slides
        • LSF & VOMS (as examples of other less recommended techniques) 20m
          Other (less recommended) techniques - and their drawbacks
          • H/A Linux
          • Hot standbys
          • ...
          Speaker: Ulrich Schwickerath (CERN)
          Slides
      • 12:00
        lunch break 1h
      • 13:30
        Case Studies - WLCG Services (part 1) 1h 30m
        • The LCG File Catalogue - LFC 30m
          Speaker: Sophie Lemaitre (CERN)
          Slides
        • The File Transfer Service - FTS 30m
          Speaker: Gavin McCance (CERN)
          Slides
        • CASTOR and CASTOR SRM 30m
          Speaker: Jan van Eldik (CERN)
          Slides
        • FNAL techniques (from HEPiX)
      • 15:00
        coffee break 30m
      • 15:30
        Case Studies - WLCG Services (part 2) 30m
        • dCache 30m
          Speaker: Patrick Fuhrmann (DESY)
          Slides
      • 16:00
        Case Studies - Experiment Services 1h 30m
        Speaker: Veronique Lefebure
        Slides
    • 09:00 - 19:00
      WLCG Operations - What is Required to support LHC experiments? (IT Auditorium, CERN)

      VRVS details: Island

      • 09:00
        Inter-site Problem Determination 40m
        Slides
      • 09:40
        FTS transfers - debugging tools 1h
        • Prototype tools and procedures on T0-export - Alexander Uzhinskiy
        • Prototype tools and procedures at SARA - Ron Trompert
        • Prototype tools and procedures at IN2P3 - David Bouvet
        • Plans and direction (discussion) - Gavin McCance
      • 10:40
        coffee break 30m
      • 11:10
        Mind the Gap 30m
        What can we do to prevent cracks opening (or widening) in the services?

        Specific examples from recent times (i.e. during EGEE '07) include:

        • SAM unavailability
        • GridView - change of availability algorithm
        • LFC - affected by Oracle client bug in 'old' versions

        More communication and better planning would likely help. How (concretely) do we fix these problems before the deluge of data arrives?

      • 11:40
        Requirements for WLCG management dashboards 40m
        Speaker: John Shade (CERN)
        Slides
      • 12:20
        lunch break 1h 10m
      • 13:30
        WLCG / EGEE / OSG operations and evolution in the coming years 2h
        WLCG / EGEE / OSG operations are now well established, through:
        • Weekly joint operations meetings
        • Bi-annual (roughly) workshops
        • Sessions at WLCG collaboration workshops
        • A set of tools, procedures and documentation.
        This session summarises the current state of global operations as well as the possible evolution beyond (or even during) the next phase of EGEE (III) - in the light of the EGI Design Study.

        In particular, we need to establish a clear view of our current needs in terms of efficient operations and how this would map to a model where National Grid Initiatives (NGIs) play a significant role.

        The issue of 24x7 operations also needs to be discussed as a priority.

        • Operations - the current model 15m
          Speaker: Nick Thackray (CERN)
        • EGI - possible directions and scenarios - the EGI Design Study 15m
          Speaker: Laura Perini (INFN & EGI_DS/WP3)
          Slides
        • Operations Review - hot topics from recent months 15m
          Speaker: Steve Traylen (CERN)
          Slides
        • WLCG Requirements - what do we need for 2008 and beyond? 30m
        • Robust Services & 24x7 25m
          Speaker: Jamie Shiers
          Slides
        • Effective PPS - how to make the PPS more effective for the experiments 20m
          This includes providing an environment where the experiment-specific "Critical Services" can be tested
          Speaker: Nick Thackray
          Slides
      • 15:30
        coffee break 30m
      • 16:00
        Experiment Operations 1h 30m
        What is it that the experiments hate most about the current operations setup?

        What explicitly is missing from the point of view of the experiments?

        What can be done better? What (perhaps?) should not be done at all?

        Should we somehow integrate global / experiment operations? e.g. via repeat consoles in the various operations rooms?

        • CMS Centers for Control, Monitoring, Offline Operations and Analysis 25m
          The CMS experiment is about to embark on its first physics run at the LHC. To maximize the effectiveness of physicists and technical experts at CERN and worldwide and to facilitate their communications, CMS has established several dedicated and inter-connected operations and monitoring centers. These include a traditional “Control Room” at the CMS site in France, a “CMS Centre” for up to fifty people on the CERN main site in Switzerland, and remote operations centers, such as the “LHC@FNAL” center at Fermilab. We describe how this system of centers coherently supports the following activities: (1) CMS data quality monitoring, prompt sub-detector calibrations, and time-critical data analysis of express-line and calibration streams; and (2) operation of the CMS computing systems for processing, storage and distribution of real CMS data and simulated data, both at CERN and at offsite centers. We describe the physical infrastructure that has been established, the computing and software systems, the operations model, and the communications systems that are necessary to make such a distributed system coherent and effective.
          Speaker: Lucas Taylor (CMS)
          Slides
    • 09:00 - 18:00
      Monitoring - What is Required to run Reliable Services? (IT Auditorium, CERN)

      VRVS details: Saturn

      Morning: Outstanding requirements for current projects and discussion of where this might go - e.g. SAM/GridView, Nagios-based prototype, GOCDB, CIC Portal, Experiment Dashboards

      Afternoon: Discussion of the requirements identified during Tuesday's sessions, building a medium- to long-term plan.

      • 09:00
        Introduction 15m
        Speaker: Ian Neilson (CERN)
        Slides
      • 09:15
        Management reporting requirements 20m
        Speaker: Alberto Aimar (CERN)
        Slides
      • 09:35
        OSG Requirements 20m
        Speaker: Rob Quick (OSG - Indiana University)
        Slides
      • 09:55
        Requirements from CIC Portal users 20m
        Speaker: Gilles Mathieu (IN2P3/CNRS Computing Centre, Lyon, France)
        Slides
      • 10:15
        Coffee 15m
      • 10:30
        GOCDB Requirements 20m
        Slides
      • 10:50
        EGEE Metrics Implementation Group 20m
        Speaker: Etienne Urbah (LAL)
        Slides
      • 11:10
        Site Monitoring Prototype Feedback 20m
        Speaker: Emir Imamagic (Unknown)
        Slides
      • 11:30
        Discussion 30m
      • 12:00
        lunch break 1h
      • 13:00
        Service availability calculation methods - now and the future 20m
        Speaker: Piotr Nyczyk (CERN)
        Slides
      • 13:20
        Security for Grid Sites 20m
        Speaker: Louis Poncet (CERN)
      • 13:40
        Use cases for GridMap visualization 20m
        Speaker: Max Boehm (Unknown)
        Slides
      • 14:00
        Models for monitoring 20m
        Speaker: James Casey (CERN)
        Slides
      • 14:20
        coffee break 20m
      • 14:40
        Experiment Critical Services and Monitoring - What's Missing for CCRC'08 (and beyond)? 1h 30m
        Speaker: Julia Andreeva (CERN)
        • Critical Services and Monitoring - ALICE 15m
          Speaker: Catalin Cirstoiu (CERN)
          Slides
        • Critical Services and Monitoring - ATLAS 15m
          Slides
        • Critical Services and Monitoring - CMS 15m
          Speaker: Andrea Sciabà (CERN)
          Slides
        • Critical Services and Monitoring - LHCb 15m
          Speaker: Philippe Charpentier (CERN)
          Slides
      • 16:10
        Prioritization of requirements raised during the day 30m
      • 16:40
        LHC VO User Support evolution update 20m
        Speaker: Maria Dimou (CERN)
        Slides
      • 17:00
        Mid-workshop checkpoint 10m
        Speaker: Jamie Shiers (CERN)
        Slides
    • 09:00 - 14:00
      Robust Services - Middleware Developers' Techniques & Tips (IT Auditorium, CERN)

      VRVS details: Plane

      Key techniques from middleware / storage-ware developers for making services robust by design

      • 09:00
        CASTOR 30m
        Speaker: Dennis Waldron (CERN)
        Slides
      • 09:30
        LFC / DPM 30m
        Speaker: Jean-Philippe Baud (CERN)
        Slides
      • 10:00
        FTS 30m
        Speaker: Paolo Tedesco (Unknown)
        Slides
      • 10:30
        coffee break 30m
      • 11:00
        Other m/w sessions: BDII, WMS/LB, VOMS, R-GMA, Logging format 1h 30m
        • BDII 20m
          Speaker: Laurence Field (CERN)
          Slides
        • WMS / LB 30m
          Speakers: Ales Krenek, Francesco Giacomini
          Slides
        • VOMS 15m
          Speaker: Vincenzo Ciaschini (CNAF & INFN)
          Slides
        • R-GMA 20m
          Speaker: Dr Steve Fisher (RAL)
          Slides
        • Logging format 5m
          Speaker: Dr Steve Fisher (RAL)
          Slides
      • 12:30
        lunch break 1h 30m
    • 14:00 - 18:00
      DB application design issues (IT Auditorium, CERN)

      • 14:00
        DB design (Tips & tricks) 1h 30m
        • schema design,
        • columns and constraints,
        • indexes and partitioning,
        • proper use of PL/SQL (bind variables, static and dynamic SQL),
        • writing robust applications
        Speaker: Dawid Wojcik (CERN)
        Slides
      • 15:30
        coffee break 20m
      • 15:50
        DB development tools (with demo) 1h
        • Benthic
        • Application Express
        • SQL Developer
        Speaker: Balys Sulmanas (CERN)
        Slides
      • 16:50
        DB Monitoring tools (demo) 40m
        • session manager
        • OEM
        • 3D
        Speaker: Dawid Wojcik (CERN)
        Slides
      • 17:30
        CORAL 30m
        Speaker: Andrea Valassi (CERN)
        Slides
    • 09:00 - 13:30
      DB - performance and tuning issues (IT Auditorium, CERN)

      VRVS details: Car

      • 09:00
        DB performance/tuning in a RAC environment for developers 1h 15m
        • Advanced SQL
        • Optimization
        • Examples from production
        Speaker: Miguel Anjo (CERN)
        Slides
      • 10:15
        coffee break 25m
      • 10:40
        DB performance tuning in a RAC environment for administrators 1h 20m
        • Service architecture for performance
        • Monitoring and reporting performance issues
        • Reactive performance tuning, case studies
        • New HW sizing and capacity planning
        Speaker: Luca Canali (CERN)
        Slides
      • 12:00
        Data Guard for RAC Migration 30m
        Speaker: Jacek Wojcieszuk (CERN)
        Paper
        Slides
      • 12:30
        lunch break 1h
    • 13:30 - 17:30
      DB - service issues (IT Auditorium, CERN)

      • 13:30
        Service plans for 2008 15m
        Speaker: Maria Girone (CERN)
        Slides
      • 13:45
        Service Recommendations 1h
        • Security of machines and authentication techniques
        • How to manage your logs (listener.log, CRS logs, alert logs, etc.)
        • How to manage your Oracle environment (host environment)
        • A quick recap on the backup emails (from the talk at CNAF)
        • Managing your targets in Grid Control
        Speaker: Gordon Brown (CCLRC)
        Slides
      • 14:45
        Streams service review 1h
        Speaker: Eva Dafonte Perez (CERN)
        Slides
      • 15:45
        coffee break 20m
      • 16:05
        Adding/removing nodes (with demo) 1h
        Speaker: Jacek Wojcieszuk (CERN)
        Slides