Weekly Operations' Meeting 2005-03-07

Agenda: http://agenda.cern.ch/fullAgenda.php?ida=a045857
Operations' Manual: http://lcgdeploy.cvs.cern.ch/cgi-bin/lcgdeploy.cgi/lcg-docs/EGEE-CIC-Operational-Manual/opMan.pdf
CA rpm instructions: http://cern.ch/grid-deployment/lcg2CAlist.html and https://cic.in2p3.fr/index.php?id=rc&subid=rc_config
Template for weekly report submission by the ROC: http://cern.ch/egee-docs/list.php?dir=.\operational_tools\&

Participants: Alessandro Cavalli (INFN), Maria Dimou (CERN, Chair this time and Secretary), Pierre Girard (IN2P3, FR), Kostas Koumantaros (SEE), Gilles Mathieu (IN2P3, FR), Gonzalo Merino (PIC, SWE), Ian Neilson (CERN), Piotr Nyczyk (CERN), Per Öster (NE), Ognjen Prnjat (SEE), Marcin Radecki (CE), Frederic Schaer (IN2P3, FR), Philippa Strange and Steve Traylen (RAL, UK), Ron Trompert (SARA, NE), Min Tsai (CERN & Taiwan).
Apologies: Sven Hermann, Holger Marten (FZK) and Nick Thackray (CERN).
Absences: There was a report but no participant from Russia.


Comments on the notes of the last meeting:
1. Alessandro suggested by email that the Action list be published earlier than the rest of the notes. Maria promised to publish the whole lot by Thursday noon at the latest. The meeting participants agreed that this leaves enough time to read and act.
2. Jeremy suggested by email restructuring the Action list so that everyone can see what is assigned to them. This takes effect as of today. Jeremy also asked why Ian Bird still appears in the agenda as the meeting chairman. The answer is that Nick can't make the change, because the CERN agenda system does not offer this option.

CIC & ROC reports:

CIC-on-duty report (Philippa from RAL):
A full report is in the CIC-on-duty log, starting as [2005-03-07 15:21] - philippa strange
To navigate from the entry page http://cic.in2p3.fr, click on CIC views, On Duty Dashboard, FollowUp, FollowUp using LogFiles. This is the last entry on page 01/2005.

ROC reports are now submitted by the CIC-on-duty and the ROCs according to this template, linked from the ROC views of the CIC web site and from this meeting's agenda.

Reports should be sent to the project-egee-roc-managers@cern.ch list by 11am CET. They are linked from the meeting agenda. CERN submitted no report again because it is not a ROC but a single site. Kostas and other ROC managers asked for CERN reports in the future.

Issues discussed in addition:
 

1. The outgoing CIC-on-duty (RAL) said there are 2 tickets in Savannah that reached the point of escalation to the GDB: #1631, about BNL, and #1016, about a Canadian site (Carleton). Ognjen and Kostas reminded the meeting again that the GDB is not the body for escalation of problems in the EGEE community. The ROC managers' meeting will discuss on March 15th which EGEE body is the right one for such escalations. The participants decided that Nick (who was absent) should ask Ian Bird to present these 2 specific American sites to the GDB, given that they are LCG and not EGEE sites.

2. The outgoing CIC-on-duty (RAL) apologises to the new one (INFN) for leaving a big backlog of old tickets for clean-up.

3. Kostas reported recurring BDII instabilities. Laurence Field should be invited next time for advice.

Security report (Ian Neilson):
- LCG2 2_3_1 is out as a security update. Sites complained that this release came out much too close to a major one, LCG2 2_4_0. Ian emphasised that sites must get used to deploying security updates quickly and at short notice. The participants requested that security updates stay clearly separate from any other add-ons in releases. This requirement was considered perfectly acceptable.

Next CIC-on-duty:

AOB:

Action List:

Number | Description | Assigned To | Status
2004-11-29--1 | Check with the other application managers the "Migration to SLC3 plan" of non-HEP VOs. LHC experiments are, up to now, mid February 2005, still not clear about the status of their software. IN2P3 reported at the 2005-02-21 meeting that the BIOMED VO hasn't yet started. | Vincent Breton | OPEN
2004-12-06--1 | Sites should accelerate migration to SLC3, at least on the service nodes, due to security considerations. | ROC mgrs | OPEN
2004-12-13--1 | Escalation procedures in the Operations' Manual should clarify that sites that run outdated LCG2 versions or don't respond to the CIC-on-duty action prompt will be disclosed in this meeting. | Piotr | OPEN
2005-01-03--1 | Grid Infrastructure Support Section to give generic (functional) DNS aliases to important service nodes, ensuring transparent service changes. Conclusion: This was discussed after the meeting with Markus Schulz. The network group can't propagate the new host alias as fast as needed when the physical machines change. The solution we adopted (without being a complete fail-over) is to put the RB files on a shared filesystem and bring another machine up, when necessary, with the same identity. | Markus | DONE Close at next meeting?
2005-01-10--3 | The Asian sites' support needs to be formalised. Requested by CERN at the relevant CIC-on-duty report. Assigned to Taiwan, the Asian sites' ROC. | Min | OPEN
2005-02-07--3 | GOC database manager David Kant to add an operational days field, in addition to operational hours, e.g. Operational hours: 0900 - 1700 (GMT), for every site. Sites to correct their timezone in GOC db. ROC managers should do the same, each on their page under http://cern.ch/egee-sa1/ROC-support.htm Progress: This is done so far by DE&CH ROC, SEE ROC and SWE ROC. | D.Kant & ROC mgrs | OPEN
2005-02-07--4 | John Gordon to define metrics on site performance, using some information circulated by Ian Bird. | J.Gordon | OPEN
2005-02-14--1 | Min Tsai (CERN) noticed that the CNAF LCG2 Release shown in the monitoring tools is 2_2_0. This is an editing mistake and it should be corrected by CNAF. Conclusion: This was not an error. As we have to get away from the LCG2 2.2 release, and as the site remains on this release due to LSF problems, CNAF is advised to consult lcg-deployment-support@cern.ch. | CNAF site admins | OPEN
2005-02-14--2 | ROCs to make sure that sites in their region which are in the GOC database also appear in the BDII configuration file, which is a prerequisite for 'certified' site status according to the Site Registration Requirements document. Conclusion: David Kant said that a GOC tool is available for the ROC managers with a list of sites in their region and an easy way to change the site status flag to 'certified'. | ROC mgrs | OPEN
2005-02-14--3 | Clarify the future release strategy. Ian Bird presented different plans at the EGEE review from the ones that Maarten Litmaath published by email. | Nick | OPEN
2005-02-14--4 | Change the colour code on the Regions' page according to criteria that will result from the action on metrics. | Min | OPEN
2005-02-14--5 | Provide a page with a flag indicating whether a site had been down over a given period and foresee use of this info in the reporting template. Details in the action list of the 2005-02-28 meeting notes. | Piotr & Nick | OPEN
2005-02-21--1 | Define and document in the Operations' Manual the penalty and the body to escalate to, for sites not responding to escalation, e.g. Toronto case of week 2005-02-14. | Nick, Ian Bird & the ROC mgrs' meeting | OPEN
2005-02-21--5 | A re-incarnation of ACTION 2005-02-07--1 (see notes of the relevant meeting). Write a policy and procedure covering the possibility of data migration for a site. | Who? | OPEN
2005-02-28--1 | Sites to set the value of max_running_jobs to the number of available CPUs instead of 9999 for those cases when there is no limit. Progress will be monitored via the ROC managers' meeting. This is an ATLAS request. | ROC mgrs | OPEN
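
Illustration for action 2005-02-28--1: a minimal sketch of how a ROC might list the sites that still publish 9999 and the value each should publish instead. The site names, the CPU counts and the dictionary layout are invented for this example; they are not taken from any monitoring tool or from the information system, and the function name is hypothetical.

```python
# Hypothetical sketch: site names, numbers and data layout are assumptions
# made for illustration, not output from any real grid tool.
SENTINEL = 9999  # value the action says some sites publish when there is no limit


def sites_to_fix(published):
    """Return {site: suggested max_running_jobs} for sites publishing the sentinel.

    `published` maps a site name to a dict holding the site's currently
    published `max_running_jobs` and its number of available `cpus`.
    """
    return {
        site: info["cpus"]
        for site, info in published.items()
        if info["max_running_jobs"] == SENTINEL
    }


example = {
    "site-a": {"max_running_jobs": 9999, "cpus": 120},  # no limit set: needs fixing
    "site-b": {"max_running_jobs": 64, "cpus": 64},     # already a real value
}

if __name__ == "__main__":
    for site, cpus in sites_to_fix(example).items():
        print(f"{site}: set max_running_jobs to {cpus}")
```

With the invented data above, only site-a would be reported, with 120 as the suggested value.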

Next meeting:

Maria Dimou, IT/GD, Grid Infrastructure Services