WLCG Service Reliability Workshop

Europe/Zurich
IT Auditorium (CERN)

Description
NEW! - VRVS has been booked as follows:
  • Monday - Island
  • Tuesday - Island
  • Wednesday - Saturn
  • Thursday - Plane
  • Friday - Car

Tentative list of topics:

  • Critical Services - Experiments' Viewpoint
  • Reliability by design - follow-up on issues from WLCG Collaboration workshop in Victoria / CHEP
  • Monitoring & end-to-end Service Reliability
  • Middleware development - tips & techniques for making services reliable by design.
    (Hopefully including a session on developing DB apps).
  • WLCG Medium Term Requirements for Operations & Support

Target attendance: 30-50(?) people

Make your suggestions here

mailing list

Participants
  • Alberto Pace
  • Alexander Verkooijen
  • Andrea Valassi
  • Andreas Motzke
  • Andrew Wong
  • Angela Poschlad
  • Barbara Martelli
  • Carlos Gamboa
  • Cecile Barbier
  • Clemens Koerdt
  • Cyril Lorphelin
  • Dario Barberis
  • David Bouvet
  • Dirk Duellmann
  • Eric Fede
  • Eric Grancher
  • Etienne URBAH
  • Federico Bonifazi
  • Felice Rosso
  • Gang Qin
  • Gavin McCance
  • Georgios Papadimitriou
  • Gonzalo Merino
  • Gordon Brown
  • Harry Renshall
  • Hsu-Chang Wei
  • Ignacio Reguero
  • James Casey
  • Jamie Shiers
  • Jan van Eldik
  • Jean-Rene Rouet
  • John Gordon
  • Jonatan Piedra
  • Laura Perini
  • Luca Canali
  • Maarten Litmaath
  • Maria Dimou
  • Maria Girone
  • Mark van de Sanden
  • Max Böhm
  • Miguel Anjo
  • Miguel Coelho dos Santos
  • Mingchao Ma
  • Muriel Gougerot
  • Nicholas Thackray
  • Nicole Iribarnes
  • Nilo Segura Chinchilla
  • Olli Tourunen
  • Olof Barring
  • Paolo Veronesi
  • Patricia Mendez Lorenzo
  • Peter Kreuzer
  • Pierre-Emmanuel Brinette
  • Pierre-Etienne Macchi
  • Piotr Nyczyk
  • Rajesh Kalmady
  • Ricardo Silva
  • Rob Quick
  • Rolf Rumler
  • Ron Trompert
  • Sophie LEMAITRE
  • Stephen Gowdy
  • Steve Fisher
  • Steve Traylen
  • Tony Wildish
  • Ulrich Schwickerath
  • Veronique Lefebure
  • Vincenzo Spinoso
    • Critical services - Requirements IT Auditorium

      VRVS details: Island

    • WLCG Operations - What is Required to support LHC experiments? IT Auditorium

      VRVS details: Island

      • 7
        Inter-site Problem Determination
        Slides
      • 8
        FTS transfers - debugging tools
        • Prototype tools and procedures on T0-export - Alexander Uzhinskiy
        • Prototype tools and procedures at SARA - Ron Trompert
        • Prototype tools and procedures at IN2P3 - David Bouvet
        • Plans and direction (discussion) - Gavin McCance
      • 10:40
        coffee break
      • 9
        Mind the Gap
        What can we do to prevent cracks opening (or widening) in the services?

        Specific examples from recent times (i.e. during EGEE '07) include:

        • SAM unavailability
        • GridView - change of availability algorithm
        • LFC - affected by Oracle client bug in 'old' versions

        More communication and better planning would likely help. How (concretely) do we fix these problems before the deluge of data arrives?

      • 10
        Requirements for WLCG management dashboards
        Speaker: John Shade (CERN)
        Slides
      • 12:20
        lunch break
      • 11
        WLCG / EGEE / OSG operations and evolution in the coming years
        WLCG / EGEE / OSG operations are now well established, through:
        • Weekly joint operations meetings
        • Bi-annual (roughly) workshops
        • Sessions at WLCG collaboration workshops
        • A set of tools, procedures and documentation.
        This session summarises the current state of global operations as well as the possible evolution beyond (or even during) the next phase of EGEE (III) - in the light of the EGI Design Study.

        In particular, we need to establish a clear view of our current needs in terms of efficient operations and how this would map to a model where National Grid Initiatives (NGIs) play a significant role.

        The issue of 24x7 operations also needs to be discussed as a priority.

        • a) Operations - the current model
          Speaker: Nick Thackray (CERN)
        • b) EGI - possible directions and scenarios - the EGI Design Study
          Speaker: Laura Perini (INFN & EGI_DS/WP3)
          Slides
        • c) Operations Review - hot topics from recent months
          Speaker: Steve Traylen (CERN)
          Slides
        • d) WLCG Requirements - what do we need for 2008 and beyond?
        • e) Robust Services & 24x7
          Speaker: Jamie Shiers
          Slides
        • f) Effective PPS - how to make the PPS more effective for the experiments
          This includes providing an environment where the experiment-specific "Critical Services" can be tested
          Speaker: Nick Thackray
          Slides
      • 15:30
        coffee break
      • 12
        Experiment Operations
        What is it that the experiments hate most about the current operations setup?

        What explicitly is missing from the point of view of the experiments?

        What can be done better? What (perhaps?) should not be done at all?

        Should we somehow integrate global and experiment operations, e.g. via repeat consoles in the various operations rooms?

        • a) CMS Centers for Control, Monitoring, Offline Operations and Analysis
          The CMS experiment is about to embark on its first physics run at the LHC. To maximize the effectiveness of physicists and technical experts at CERN and worldwide and to facilitate their communications, CMS has established several dedicated and inter-connected operations and monitoring centers. These include a traditional “Control Room” at the CMS site in France, a “CMS Centre” for up to fifty people on the CERN main site in Switzerland, and remote operations centers, such as the “LHC@FNAL” center at Fermilab. We describe how this system of centers coherently supports the following activities: (1) CMS data quality monitoring, prompt sub-detector calibrations, and time-critical data analysis of express-line and calibration streams; and (2) operation of the CMS computing systems for processing, storage and distribution of real CMS data and simulated data, both at CERN and at offsite centers. We describe the physical infrastructure that has been established, the computing and software systems, the operations model, and the communications systems that are necessary to make such a distributed system coherent and effective.
          Speaker: Lucas Taylor (CMS)
          Slides
    • Monitoring - What is Required to run Reliable Services? IT Auditorium

      VRVS details: Saturn

      Morning: Outstanding requirements for current projects and discussion of where these might go, e.g. SAM / GridView, the Nagios-based prototype, GOCDB, CIC Portal, Experiment Dashboards

      Afternoon: Discussion of the requirements identified during Tuesday's sessions, building a medium- to long-term plan.

      • 13
        Introduction
        Speaker: Ian Neilson (CERN)
        Slides
      • 14
        Management reporting requirements
        Speaker: Alberto Aimar (CERN)
        Slides
      • 15
        OSG Requirements
        Speaker: Rob Quick (OSG - Indiana University)
        Slides
      • 16
        Requirements from CIC Portal users
        Speaker: Gilles Mathieu (IN2P3/CNRS Computing Centre, Lyon, France)
        Slides
      • 10:15
        Coffee
      • 17
        GOCDB Requirements
        Slides
      • 18
        EGEE Metrics Implementation Group
        Speaker: Etienne Urbah (LAL)
        Slides
      • 19
        Site Monitoring Prototype Feedback
        Speaker: Emir Imamagic (Unknown)
        Slides
      • 20
        Discussion
      • 12:00
        lunch break
      • 21
        Service availability calculation methods - now and the future
        Speaker: Piotr Nyczyk (CERN)
        Slides
      • 22
        Security for Grid Sites
        Speaker: Louis PONCET (CERN)
      • 23
        Use cases for GridMap visualization
        Speaker: Max Boehm (Unknown)
        Slides
      • 24
        Models for monitoring
        Speaker: James Casey (CERN)
        Slides
      • 14:20
        coffee break
      • 25
        Experiment Critical Services and Monitoring - What's Missing for CCRC'08 (and beyond)?
        Speaker: Julia Andreeva (CERN)
        • a) Critical Services and Monitoring - ALICE
          Speaker: Catalin Cirstoiu (CERN)
          Slides
        • b) Critical Services and Monitoring - ATLAS
          Slides
        • c) Critical Services and Monitoring - CMS
          Speaker: Andrea Sciabà (CERN)
          Slides
        • d) Critical Services and Monitoring - LHCb
          Speaker: Philippe Charpentier (CERN)
          Slides
      • 26
        Prioritization of requirements raised during the day
      • 27
        LHC VO User Support evolution update
        Speaker: Maria Dimou (CERN)
        Slides
      • 28
        Mid-workshop checkpoint
        Speaker: Jamie Shiers (CERN)
        Slides
    • Robust Services - Middleware Developers' Techniques & Tips IT Auditorium

      VRVS details: Plane

      Key techniques from middleware / storage-ware developers for making services robust by design

      • 29
        CASTOR
        Speaker: Dennis Waldron (CERN)
        Slides
      • 30
        LFC / DPM
        Speaker: Jean-Philippe Baud (CERN)
        Slides
      • 31
        FTS
        Speaker: Paolo Tedesco (Unknown)
        Slides
      • 10:30
        coffee break
      • 32
        Other m/w sessions: BDII, WMS/LB, VOMS, R-GMA, Logging format
        • a) BDII
          Speaker: Laurence Field (CERN)
          Slides
        • b) WMS / LB
          Speakers: Ales Krenek, Francesco Giacomini
          Slides
        • c) VOMS
          Speaker: Vincenzo Ciaschini (CNAF & INFN)
          Slides
        • d) R-GMA
          Speaker: Dr Steve Fisher (RAL)
          Slides
        • e) Logging format
          Speaker: Dr Steve Fisher (RAL)
          Slides
      • 12:30
        lunch break
    • DB application design issues IT Auditorium

      • 33
        DB design (Tips & tricks)
        • schema design,
        • columns and constraints,
        • indexes and partitioning,
        • proper use of PL/SQL (bind variables, static and dynamic SQL),
        • writing robust applications
        Speaker: Dawid Wojcik (CERN)
        Slides
      • 15:30
        coffee break
      • 34
        DB development tools (with demo)
        • Benthic
        • Application Express
        • SQL Developer
        Speaker: Balys Sulmanas (CERN)
        Slides
      • 35
        DB Monitoring tools (demo)
        • session manager
        • OEM
        • 3D
        Speaker: Dawid Wojcik (CERN)
        Slides
      • 36
        CORAL
        Speaker: Andrea Valassi (CERN)
        Slides
    • DB - performance and tuning issues IT Auditorium

      VRVS details: Car

      • 37
        DB performance/tuning in a RAC environment for developers
        • Advanced SQL
        • Optimization
        • Examples from production
        Speaker: Miguel Anjo (CERN)
        Slides
      • 10:15
        coffee break
      • 38
        DB performance tuning in a RAC environment for administrators
        • Service architecture for performance
        • Monitoring and reporting performance issues
        • Reactive performance tuning, case studies
        • New HW sizing and capacity planning
        Speaker: Luca Canali (CERN)
        Slides
      • 39
        Dataguard for RAC Migration
        Speaker: Jacek Wojcieszuk (CERN)
        Paper
        Slides
      • 12:30
        lunch break
    • DB - service issues IT Auditorium

      • 40
        Service plans for 2008
        Speaker: Maria Girone (CERN)
        Slides
      • 41
        Service Recommendations
        • Security of machines and authentication techniques
        • How to manage your logs (listener.log, crs logs, alert logs etc)
        • How to manage your Oracle environment (host environment)
        • A quick recap on the backup emails (from the talk at CNAF)
        • Managing your targets in Grid Control
        Speaker: Gordon Brown (CCLRC)
        Slides
      • 42
        Streams service review
        Speaker: Eva Dafonte Perez (CERN)
        Slides
      • 15:45
        coffee break
      • 43
        Adding/removing nodes (with demo)
        Speaker: Jacek Wojcieszuk (CERN)
        Slides