WLCG Service Reliability Workshop
IT Auditorium
CERN
- Monday - Island
- Tuesday - Island
- Wednesday - Saturn
- Thursday - Plane
- Friday - Car
Tentative list of topics:
- Critical Services - Experiments' Viewpoint
- Reliability by design - follow-up on issues from WLCG Collaboration workshop in Victoria / CHEP
- Monitoring & end-to-end Service Reliability
- Middleware development - tips & techniques related to reliability by design.
(Hopefully including a session on developing DB apps.)
- WLCG Medium Term Requirements for Operations & Support
Target attendance: 30-50(?) people
Make your suggestions here
-
-
09:00
→
18:00
Critical services - Requirements
IT Auditorium
CERN
VRVS details: Island
-
09:00
Introduction and Idea of the Workshop 20m
N.B. The workshop summary will be given at the Overview Board and GDB in the first week of December (next week).
-
Draft workshop summaries
-
-
09:20
Critical services - Requirements of the Experiments 1h
-
Critical Services - ALICE
-
10:20
coffee break 20m
-
10:40
Techniques for implementing & running robust and reliable services 1h 20m
-
12:00
lunch break 1h
-
13:30
Case Studies - WLCG Services (part 1) 1h 30m
-
15:00
coffee break 30m
- 15:30
- 16:00
-
-
09:00
→
19:00
WLCG Operations - What is Required to support LHC experiments?
IT Auditorium
CERN
VRVS details: Island
- 09:00
-
09:40
FTS transfers - debugging tools 1h
- Prototype tools and procedures on T0-export - Alexander Uzhinskiy
- Prototype tools and procedures at SARA - Ron Trompert
- Prototype tools and procedures at IN2P3 - David Bouvet
- Plans and direction (discussion) - Gavin McCance
-
10:40
coffee break 30m
-
11:10
Mind the Gap 30m
What can we do to prevent cracks opening (or widening) in the services?
Specific examples from recent times (i.e. during EGEE '07) include:
- SAM unavailability
- GridView - change of availability algorithm
- LFC - affected by Oracle client bug in 'old' versions
More communication and better planning would likely help. How (concretely) do we fix these problems before the deluge of data arrives?
- 11:40
-
12:20
lunch break 1h 10m
-
13:30
WLCG / EGEE / OSG operations and evolution in the coming years 2h
WLCG / EGEE / OSG operations are now well established, through:
- Weekly joint operations meetings
- Bi-annual (roughly) workshops
- Sessions at WLCG collaboration workshops
- A set of tools, procedures and documentation.
In particular, we need to establish a clear view of our current needs in terms of efficient operations and how this would map to a model where National Grid Initiatives (NGIs) play a significant role.
The issue of 24x7 operations also needs to be discussed as a priority.
-
Operations - the current model 15m
Speaker: Nick Thackray (CERN)
-
WLCG Requirements - what do we need for 2008 and beyond? 30m
-
15:30
coffee break 30m
-
16:00
Experiment Operations 1h 30m
What is it that the experiments hate most about the current operations setup?
What explicitly is missing from the point of view of the experiments?
What can be done better? What (perhaps?) should not be done at all?
Should we somehow integrate global / experiment operations? e.g. via repeat consoles in the various operations rooms?
-
CMS Centers for Control, Monitoring, Offline Operations and Analysis 25m
The CMS experiment is about to embark on its first physics run at the LHC. To maximize the effectiveness of physicists and technical experts at CERN and worldwide and to facilitate their communications, CMS has established several dedicated and inter-connected operations and monitoring centers. These include a traditional “Control Room” at the CMS site in France, a “CMS Centre” for up to fifty people on the CERN main site in Switzerland, and remote operations centers, such as the “LHC@FNAL” center at Fermilab. We describe how this system of centers coherently supports the following activities: (1) CMS data quality monitoring, prompt sub-detector calibrations, and time-critical data analysis of express-line and calibration streams; and (2) operation of the CMS computing systems for processing, storage and distribution of real CMS data and simulated data, both at CERN and at offsite centers. We describe the physical infrastructure that has been established, the computing and software systems, the operations model, and the communications systems that are necessary to make such a distributed system coherent and effective.
Speaker: Lucas Taylor (CMS)
-
-
-
09:00
→
18:00
Monitoring - What is Required to run Reliable Services?
IT Auditorium
CERN
VRVS details: Saturn
Morning: Outstanding requirements for current projects and discussion of where this might go - e.g. SAM / GridView, Nagios-based prototype, GOCDB, CIC Portal, Experiment Dashboards
Afternoon: discussion of the requirements identified during Tuesday's sessions, building a medium-long term plan.
- 09:00
- 09:15
- 09:35
- 09:55
-
10:15
Coffee 15m
- 10:30
- 10:50
- 11:10
-
11:30
Discussion 30m
-
12:00
lunch break 1h
- 13:00
-
13:20
Security for Grid Sites 20m
Speaker: Louis PONCET (CERN)
- 13:40
- 14:00
-
14:20
coffee break 20m
-
14:40
Experiment Critical Services and Monitoring - What's Missing for CCRC'08 (and beyond)? 1h 30m
Speaker: Julia Andreeva (CERN)
-
16:10
Prioritization of requirements raised during the day 30m
- 16:40
- 17:00
-
-
09:00
→
14:00
Robust Services - Middleware Developers' Techniques & Tips
IT Auditorium
CERN
VRVS details: Plane
Key techniques from middleware / storage-ware developers for making services robust by design
- 09:00
- 09:30
- 10:00
-
10:30
coffee break 30m
-
11:00
Other m/w sessions: BDII, WMS/LB, VOMS, R-GMA, Logging format 1h 30m
-
12:30
lunch break 1h 30m
-
14:00
→
18:00
DB application design issues
IT Auditorium
CERN
- 14:00
-
15:30
coffee break 20m
- 15:50
- 16:50
- 17:30
-
-
09:00
→
13:30
DB - performance and tuning issues
IT Auditorium
CERN
VRVS details: Car
- 09:00
-
10:15
coffee break 25m
- 10:40
- 12:00
-
12:30
lunch break 1h
-
13:30
→
17:30
DB - service issues
IT Auditorium
CERN
- 13:30
-
13:45
Service Recommendations 1h
- Security of machines and authentication techniques
- How to manage your logs (listener.log, CRS logs, alert logs, etc.)
- How to manage your Oracle environment (host environment)
- A quick recap on the backup emails (from the talk at CNAF)
- Managing your targets in Grid Control
Speaker: Gordon Brown (CCLRC)
- 14:45
-
15:45
coffee break 20m
- 16:05
-