WLCG Service Reliability Workshop
IT Auditorium
CERN
- Monday - Island
- Tuesday - Island
- Wednesday - Saturn
- Thursday - Plane
- Friday - Car
Tentative list of topics:
- Critical Services - Experiments' Viewpoint
- Reliability by design - follow-up on issues from WLCG Collaboration workshop in Victoria / CHEP
- Monitoring & end-to-end Service Reliability
- Middleware development - tips & techniques for making services reliable by design (hopefully including a session on developing DB apps).
- WLCG Medium Term Requirements for Operations & Support
Target attendance: 30-50(?) people
Critical services - Requirements
IT Auditorium
CERN
VRVS details: Island
-
1
Introduction and Idea of the Workshop
N.B. The workshop summary will be given at the Overview Board and GDB in the 1st week of December (next week).
-
a) Draft workshop summaries
-
-
2
Critical services - Requirements of the Experiments
-
c) Critical Services - ALICE
-
10:20
coffee break
-
3
Techniques for implementing & running robust and reliable services
-
12:00
lunch break
-
4
Case Studies - WLCG Services (part 1)
-
15:00
coffee break
- 5
- 6
WLCG Operations - What is Required to support LHC experiments?
IT Auditorium
CERN
VRVS details: Island
- 7
-
8
FTS transfers - debugging tools
- Prototype tools and procedures on T0-export - Alexander Uzhinskiy
- Prototype tools and procedures at SARA - Ron Trompert
- Prototype tools and procedures at IN2P3 - David Bouvet
- Plans and direction (discussion) - Gavin McCance
-
10:40
coffee break
-
9
Mind the Gap
What can we do to prevent cracks opening (or widening) in the services?
Specific examples from recent times (i.e. during EGEE '07) include:
- SAM unavailability
- GridView - change of availability algorithm
- LFC - affected by Oracle client bug in 'old' versions
More communication and better planning would likely help. How (concretely) do we fix these problems before the deluge of data arrives?
- 10
-
12:20
lunch break
-
11
WLCG / EGEE / OSG operations and evolution in the coming years
WLCG / EGEE / OSG operations are now well established, through:
- Weekly joint operations meetings
- Bi-annual (roughly) workshops
- Sessions at WLCG collaboration workshops
- A set of tools, procedures and documentation.
In particular, we need to establish a clear view of our current needs in terms of efficient operations and how this would map to a model where National Grid Initiatives (NGIs) play a significant role.
The issue of 24x7 operations also needs to be discussed as a priority.
-
a) Operations - the current model
Speaker: Nick Thackray (CERN)
-
d) WLCG Requirements - what do we need for 2008 and beyond?
-
15:30
coffee break
-
12
Experiment Operations
What is it that the experiments hate most about the current operations setup?
What explicitly is missing from the point of view of the experiments?
What can be done better? What (perhaps?) should not be done at all?
Should we somehow integrate global / experiment operations? e.g. via repeat consoles in the various operations rooms?
-
a) CMS Centers for Control, Monitoring, Offline Operations and Analysis
The CMS experiment is about to embark on its first physics run at the LHC. To maximize the effectiveness of physicists and technical experts at CERN and worldwide and to facilitate their communications, CMS has established several dedicated and inter-connected operations and monitoring centers. These include a traditional “Control Room” at the CMS site in France, a “CMS Centre” for up to fifty people on the CERN main site in Switzerland, and remote operations centers, such as the “LHC@FNAL” center at Fermilab. We describe how this system of centers coherently supports the following activities: (1) CMS data quality monitoring, prompt sub-detector calibrations, and time-critical data analysis of express-line and calibration streams; and (2) operation of the CMS computing systems for processing, storage and distribution of real CMS data and simulated data, both at CERN and at offsite centers. We describe the physical infrastructure that has been established, the computing and software systems, the operations model, and the communications systems that are necessary to make such a distributed system coherent and effective.
Speaker: Lucas Taylor (CMS)
Monitoring - What is Required to run Reliable Services?
IT Auditorium
CERN
VRVS details: Saturn
Morning: Outstanding requirements for current projects and discussion of where this might go - e.g. SAM / GridView, Nagios-based prototype, GOCDB, CIC Portal, Experiment Dashboards
Afternoon: Discussion of the requirements identified during Tuesday's sessions, building a medium- to long-term plan.
- 13
- 14
- 15
- 16
-
10:15
Coffee
- 17
- 18
- 19
-
20
Discussion
-
12:00
lunch break
- 21
-
22
Security for Grid Sites
Speaker: Louis PONCET (CERN)
- 23
- 24
-
14:20
coffee break
-
25
Experiment Critical Services and Monitoring - What's Missing for CCRC'08 (and beyond)?
Speaker: Julia Andreeva (CERN)
-
26
Prioritization of requirements raised during the day
- 27
- 28
Robust Services - Middleware Developers' Techniques & Tips
IT Auditorium
CERN
VRVS details: Plane
Key techniques from middleware / storage-ware developers for making services robust by design
- 29
- 30
- 31
-
10:30
coffee break
-
32
Other m/w sessions: BDII, WMS/LB, VOMS, R-GMA, Logging format
-
12:30
lunch break
-
DB application design issues
IT Auditorium
CERN
- 33
-
15:30
coffee break
- 34
- 35
- 36
DB - performance and tuning issues
IT Auditorium
CERN
VRVS details: Car
- 37
-
10:15
coffee break
- 38
- 39
-
12:30
lunch break
-
DB - service issues
IT Auditorium
CERN
- 40
-
41
Service Recommendations
- Security of machines and authentication techniques
- How to manage your logs (listener.log, CRS logs, alert logs, etc.)
- How to manage your Oracle environment (host environment)
- A quick recap on the backup emails (from the talk at CNAF)
- Managing your targets in Grid Control
Speaker: Gordon Brown (CCLRC)
- 42
-
15:45
coffee break
- 43
-