EGEE-II Operations Workshop

Name: EGEE-II Operations Workshop
Start: 2007-06-11T08:00:00+02:00
End: 2007-06-15T18:00:00+02:00
Location: CERN

11 Jun 2007, 08:00 → 15 Jun 2007, 18:00 Europe/Zurich

1/1-025 (CERN)

Monday 11 June
- 08:00 → 08:20
  Domenico Vicinanza's Minutes added (COD meeting + OPS workshop) 20m
  Slides
  
  SAMAlarmTriggering.pdf
  
  SAMAlarmTriggering.ppt
  
  SAMUpdatesDomenico.pdf
  
  SAMUpdatesDomenico.ppt
- 08:20 → 08:40
  
  Dusan Vudragovic's minutes from wLCG/ EGEE Grid Operations Workshop in Stockholm – 2007 20m
  
  WLCG/EGEE Grid Operations Workshop – 2007
  13-15 June 2007, KTH, Stockholm

  Wednesday, June 13, 2007

  *Opening plenary
  The introductory speech was given by Ian Bird.

  *Operation procedures
  The introductory speech was given by Maite Barroso Lopez.

  **Site reports
  All presentations in this session described the tools used in daily grid operations and the features missing from these tools that would make the work easier. Examples of the most frequent scheduled and unscheduled interventions were given, along with suggestions for improving the operations bodies (COD/ROC) and the operations meetings. Communication with users, other sites and the ROC, and how it can be improved, was also discussed. Presentations were given by the following sites: LIP-Lisbon, CY-01-KIMON, FZK-LCG2, NIKHEF-ELPROD, CYFRONET-LCG2, GRIF, CERN-PROD, JP-KEK-CRC-01, JP-KEK-CRC-02 and CNAF-T1.

  **User reports: WLCG
  This presentation covered WLCG service operation and MoU targets, service coordination roles, a S.W.O.T. analysis of the WLCG service and LHC startup challenges.

  **Round table discussion
  The conclusion was that SAM and GStat are the most used tools in daily grid operations, followed by GGUS, GOCDB, CIC, SAM admin and GridView. Besides these, Nagios, Ganglia and Quattor are used very often. The following features are missing and would make admins' work easier:
  - middleware logs should be in one place;
  - an operations manual for site admins should exist, as well as replicas of SAM admin;
  - firewall rules should be created, and the CIC portal should be improved with a search engine;
  - the interaction CIC-CIC Operator-SAM-GOCDB-GGUS, as well as support for scheduled upgrades, must be provided somewhere;
  - SAM alarms should be sent directly to the site;
  - a tool should be created to compute correlation statistics for difficult problems related to load;
  - a stamp from the local batch system should be added to a job's ID to improve the traceability of jobs from UI to WN.
Regarding deployment, it was emphasized that testing and validation in PPS or SA3 are important before deploying, and that updates should be less frequent.

Thursday, June 14, 2007

*Grid operations Service Level Agreements (SLA)
In this session the EGEE-II SLA progress report and initial proposal were presented. Part of the SLA working group's mandate is to collect relevant examples of SLAs and other documentation, make them available within the working group, review the example documents and extract a list of useful items from each one. The examples presented were the SEE-GRID SLA, WLCG MoU, INFN MoU, UK Tier-2 MoU, Oxford NGS Service Level Description, Service Level Description for NGS Helpdesk, BalticGrid SLA and EGEE-II SA2 SLA. Requirements on hardware and network connectivity, level of expertise and support, and VO support were then presented.

*Software release updates cycles
The presentation gave examples of problems in releases and in update 24. The conclusion was that RPMs are released into production with simple bugs that should have been noticed in PPS, and that a more clearly defined set of systematic tests in PPS or in SA3 should be created. Also, only one part of update 24 was of high priority. The same update contained many other things, in particular DPM 1.6.4, which cannot be installed on the fly without disrupting the SE service, so a downtime is needed, even if only a short one. If the DPM update was considered high priority because of the planned changes in SAM, this should have been stated clearly. A high-priority update should therefore contain only the components that the high-priority change actually affects. A service update with non-trivial procedures involving a temporary shutdown of the service should be released as a separate update. Non-critical updates can be bundled together in one update. Regarding documentation, the full set of services that should be running for each node type is not documented.
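The bundling rules proposed in this session can be illustrated with a short sketch. It is not part of the presentation or of any EGEE release tooling: the `Component` fields, the `plan_updates` function, the priority flag given to DPM 1.6.4 and the `component-*` names are all hypothetical, chosen only to show how the components of a release might be split into separate updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    high_priority: bool   # assumed flag: the fix is urgent for production
    needs_downtime: bool  # assumed flag: updating requires a service shutdown


def plan_updates(components):
    """Group components into updates following the rules proposed in the session."""
    plan = {"high_priority": [], "separate": [], "bundled": []}
    for c in components:
        if c.needs_downtime:
            # Non-trivial procedure: one update per component, so sites can plan downtime.
            plan["separate"].append([c.name])
        elif c.high_priority:
            # Only the components that are really urgent go into the high-priority update.
            plan["high_priority"].append(c.name)
        else:
            # Everything else can travel together in one non-critical update.
            plan["bundled"].append(c.name)
    return plan


example = [
    Component("DPM 1.6.4", high_priority=True, needs_downtime=True),      # priority assumed
    Component("component-a", high_priority=True, needs_downtime=False),   # hypothetical
    Component("component-b", high_priority=False, needs_downtime=False),  # hypothetical
    Component("component-c", high_priority=False, needs_downtime=False),  # hypothetical
]
plan = plan_updates(example)
```

In this sketch the downtime rule is checked first, so DPM 1.6.4 ends up in an update of its own even when it is flagged as high priority; that ordering is a choice of the example, not something the session decided.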
Major changes are not sufficiently highlighted in the release. When release notes are updated with workarounds, a broadcast should be sent. Even better, the release notes should contain a ChangeLog and all known issues of the current and previous releases; once a given issue is solved, only the ChangeLog will mention it. Releases should not contain updates that concern only CERN. Some new "features" do not appear in the release notes at all, and the release notes often fail to mention major changes that affect services in production, which should not happen. Regarding PPS, the weekly updates do not leave enough time for patch testing in PPS.

*Rollout of new VOMS features: fair shares, VOViews, FQANs
This presentation was more technical than the others. It described VOMS FQANs, how user mapping works, the format of the gridmap file, how to configure users, the Torque configuration, and so on. The same presentation explained the new structure and coordination of YAIM.

*So what can NPM do for you?
Network Performance Monitoring (NPM), formerly part of JRA4, is now part of SA1. The presentation explained why NPM is important for site and grid operations as well as for grid services and middleware. NPM data allows end users to see the performance they should expect from their Grid applications. All items in the presentation were illustrated with real-life examples. The NPM diagnostic tool, deployment issues and plans were also presented.

*User problems do not go to /dev/null
The conclusions from this session were that GGUS tickets take too long to be solved and that the solutions are not satisfactory. Tickets and procedures should contain clear information, and up-to-date, easy-to-find documentation should be supplied. Concrete proposals in the GGUS tickets aim to reduce the ticket turn-around time. Monitoring of ticket assignment and solution times, reported at the ROC managers' meeting, is already in practice. The GGUS interface is improved via monthly releases based on a shopping list.
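As an illustration of the ticket monitoring mentioned in this session, the sketch below computes assignment and solution times from a handful of ticket records. The record layout, the timestamps and the function names are assumptions made for the example; they do not reflect the actual GGUS data model or its reporting tools.

```python
from datetime import datetime
from statistics import median

# Hypothetical ticket records: (id, submitted, assigned, solved), ISO 8601 timestamps.
TICKETS = [
    ("T1", "2007-06-01T09:00", "2007-06-01T11:00", "2007-06-04T09:00"),
    ("T2", "2007-06-02T10:00", "2007-06-02T10:30", "2007-06-02T16:00"),
    ("T3", "2007-06-03T08:00", "2007-06-03T12:00", "2007-06-05T08:00"),
]


def hours_between(start, end):
    """Elapsed time between two ISO 8601 timestamps, in hours."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


def turnaround_report(tickets):
    """Median time from submission to assignment and from submission to solution."""
    assignment = [hours_between(sub, asg) for _, sub, asg, _ in tickets]
    solution = [hours_between(sub, sol) for _, sub, _, sol in tickets]
    return {
        "median_assignment_h": median(assignment),
        "median_solution_h": median(solution),
    }


report = turnaround_report(TICKETS)
```

The median is used here rather than the mean so that a single long-running ticket does not dominate the figure; a real report might equally track both, or the share of tickets that exceed a target.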
*Site/grid/application monitoring

**Sharing best practices and system management tools: www.sysadmin.hep.ac.uk
One of the problems observed (by EGEE and LCG) in providing a reliable grid service is the reliability of the local fabric services of participating sites. The SMWG should bring together existing expertise in different areas of fabric management to build a common repository of tools and knowledge for the benefit of the HEP system managers' community. The idea is neither to present all possible tools nor to create new ones, but to recommend specific tools for specific problems according to the best practices already in use at sites. Although this group was proposed in order to help improve the reliability of grid sites, the results should be useful to any site running similar local services. The group should improve two areas: tools and documentation. It has already created a mandate, a wiki and repositories, but still needs people to contribute.

**Grid monitoring WG: proposal to sites and timelines/roll out strategy
The Grid Service Monitoring Working Group should help improve the reliability of the grid infrastructure and provide stakeholders with views of the infrastructure that allow them to understand the current and historical status of the service. The stakeholders are site administrators, grid service managers and operations, VOs and Grid project management. The aim of this working group is not to provide yet another technical solution, but to improve the reliability of WLCG and to consolidate existing solutions. There are a lot of good ideas, but no volunteers.

**Application monitoring
The goal of this working group is to gain an understanding of application failures in the grid environment and to provide an application view of the state of the infrastructure. The aim is to summarize the experience gained by the LHC experiments in achieving this goal and to provide input to the Grid Service Monitoring Working Group and to WLCG management.
The application monitoring working group should pass information on the availability of the Grid infrastructure and services, as seen and measured by the LHC VOs, to ROC managers and to local fabric monitoring systems, so that any problems are fixed by people who can take action. The experiments' monitoring systems and the dashboard concept were demonstrated.

*CMS operations
The presentation covered CMS operation service and support, activity operations (data challenge planning), the current system for user support and monitoring tools.

*CMS usage of SAM in EGEE and OSG
The presentation covered SAM usage in CMS. CMS uses SAM in different ways: relying on OPS test results, running some standard tests under the CMS VO, and running custom CMS tests. The last of these is the most efficient in this case. For a long time, CMS has used some ops tests as critical tests, together with its custom tests (basic, swinst, Monte Carlo, squid, FroNtier). The tests are submitted to EGEE and OSG sites through the LCG resource broker and the gLite WMS. In future, CMS will develop tests for SRM v1 and v2, and plans to use SAM for automatic software installation and for visualization in the ARDA dashboard.

*OSG operations
The operations, interoperation and problems in OSG were described.

*NDGF operations
The Nordic DataGrid Facility (networking and computing) was described.

Friday, June 15, 2007

*ROC activities

**SEE-GRID operational tools and Grid services improvements
This talk described part of the WP3 activities in SEE-GRID: development of the next-generation SEE-GRID infrastructure (next generation of EGEE middleware and services), and support for the deployment and operation of the resource centres (monitoring, helpdesk, overall upgrade of the infrastructure). The monitoring tools used in SEE-GRID were presented in detail.
**InGRID: A Generic Autonomous Expert System for Grid Nodes Real Time
The project studies the problem of managing a Grid resource centre and how an expert system can improve this task. The architecture of the system was presented. The main objectives of InGRID are to minimize the large number of incidents that the operator must attend to, and to restore most services automatically, without operator intervention. It is being tested at PIC (South-West ROC).

**The Bazaar vision
Bazaar will be a web-based grid resource market platform available within the framework of EGEE. The main ideas were presented.

*Planning for EGEE-III
The main vision of EGEE-III is to make a strong move towards a sustainable, world-wide, production-quality Grid infrastructure through appropriate technical and organizational evolution. The e-Infrastructure operated by EGEE-III must be capable of providing services to a rapidly increasing number of application areas, and must make Grid technology easily accessible and usable for these communities. The main goal of EGEE-III is to enable the transition to EGI by evolving the existing technical and organizational structures. This will be complex, since EGEE-III must in parallel ensure the continuous availability of the production infrastructure to an ever-increasing number of diverse user communities. As with EGEE-II, there is little time to complete the negotiation (the exact dates of the call are not yet known), and it may be necessary to start the project before the contract is signed.

*Future and evolution of Grid operations in EGEE-III and EGI
There will be no major changes in Grid operations inside EGEE-III. Five major goals will remain: grid management, grid operations and support, user support, operational support, and general and admin tasks.
Operations:
- SA1.1: Grid management
- SA1.2: Operations and support
- SA1.3: User support
- SA1.4: Grid security
- SA1.5: Overhead tasks

Some changes are:
- all ROCs must do all key operational tasks (operator on duty, TPMs and GGUS support effort, security coordinator)
- no regional certification
- porting tasks and interoperation with other grid projects (effort for EGEE to work with other projects, but not to support other projects)

Specific areas to address are:
- monitoring and oversight evolve towards automation
- Service Level Agreements
- integration of operations with existing and embryonic National Grid Infrastructures
- partner reviews will be a formal part of the project
- QA integration in each activity and integrating new VOs into the infrastructure
