Weekly Operations' Meeting 2005-02-21

Agenda: http://agenda.cern.ch/fullAgenda.php?ida=a045855
Operations' Manual: http://lcgdeploy.cvs.cern.ch/cgi-bin/lcgdeploy.cgi/lcg-docs/EGEE-CIC-Operational-Manual/opMan.pdf
Template for weekly report submission by the ROC: http://cern.ch/egee-docs/list.php?dir=.\operational_tools\&

Participants: Steven Burke (RAL, UK), Alessandro Cavalli (INFN), Maria Dimou (CERN, Secretary), Pierre Girard (IN2P3), Kostas Koumantaros (SEE), Holger Marten (FZK), Gonzalo Merino (PIC, SWE), Piotr Nyczyk (CERN), Marcin Radecki (CE), Davide Salomoni (NIKHEF, NE), Frederic Schaer (IN2P3), Philippa Strange (RAL, UK), Nick Thackray (CERN, Chair), Steve Trailen (RAL, UK), Min Tsai (CERN & Taiwan).

CIC & ROC reports:

CIC-on-duty report (Piotr Nyczyk from CERN):
A full report is in the CIC-on-duty log, in the entry starting with [2005-02-20 23:56] - Frederic Schaer. To reach it from the entry page http://cic.in2p3.fr, click on CIC views, On Duty Dashboard, FollowUp, FollowUp using LogFiles. It is the last entry on page 01/2005: html

ROC reports are now submitted by the CIC-on-duty and the ROCs according to this template, linked from the ROC views of the CIC web site and from this meeting's agenda. There was no report or participant from Russia.

Reports should be sent to the project-egee-roc-managers@cern.ch list by 11am CET. They are linked from the meeting agenda. CERN submitted no report again because it is not a ROC but a single site. Kostas and other ROC managers asked for CERN reports in the future.

Issues discussed in addition:
1. The network problems experienced at CERN since Thursday, Feb. 17th (see the announcement email below) caused 30% of the sites to vanish from the BDII, despite the use of other RBs. Kostas said that this information was not broadcast widely enough. Steve suggested using alternative UIs. Frederic will give accounts on a public UI.
-----Original Message-----
From: LHC Computer Grid - Rollout [mailto:LCG-ROLLOUT@cclrclsv.RL.AC.UK] On Behalf Of Laurence
Sent: Friday, February 18, 2005 1:28 PM
To: LCG-ROLLOUT@cclrclsv.RL.AC.UK
Subject: [LCG-ROLLOUT] BDII problems at CERN

 

There is currently a strange problem with the BDIIs at CERN. About 30 sites are timing out when queried by the BDII. When an ldapsearch is tried on some sites they hang, however, it is possible to ping the machines and telnet to port 2170 on the remote machines. This problem seems to affect all production machines in the range 137.138.152.*

It looks like there are a few network problems here at CERN. Hopefully, the BDII will be back to normal once the network is okay again. http://tvscreen.cern.ch/

Laurence
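
The symptom described in the message above (a site answers ping and accepts connections on port 2170, yet an ldapsearch hangs) can be told apart from a plainly unreachable port with two separate checks. The following is only a minimal sketch under stated assumptions, not the procedure used at CERN: the function names, the timeouts, the search base "o=grid" and the host name in the usage note are illustrative, and it assumes the OpenLDAP ldapsearch client is installed.

```python
import socket
import subprocess

BDII_PORT = 2170  # port named in the message above


def tcp_port_open(host, port=BDII_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def ldap_query_state(host, port=BDII_PORT, timeout=15.0, base="o=grid"):
    """Classify an ldapsearch against host:port as 'ok', 'error' or 'hang'.

    The search base is an assumption; adjust it to the site being queried.
    """
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", f"ldap://{host}:{port}",
           "-b", base, "-s", "base"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "hang"   # the query did not return in time
    except FileNotFoundError:
        return "error"  # ldapsearch client is not installed
    return "ok" if result.returncode == 0 else "error"


def diagnose(host):
    """Check the port first, then the query, and describe the result."""
    if not tcp_port_open(host):
        return "port unreachable"
    return "query " + ldap_query_state(host)
```

For example, diagnose("bdii.example.org") (a hypothetical host name) would return "query hang" for a site showing the behaviour reported above, and "port unreachable" for a site that does not accept connections at all.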
 

2. The Toronto site did not respond to escalation, as reported by the previous CIC-on-duty (IN2P3, for the week of Feb. 14th). This revived the discussion of the 'non-functional site' criteria:

The maximum wall clock time needs to be raised to 60 minutes so that the functional site tests can run. The monitoring jobs cannot complete at sites that do not allow long jobs to run.
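
As an illustration of the 60-minute criterion, the sketch below lists the sites whose maximum wall clock time is too short for the monitoring jobs. It is a hedged example only: the function name, the input format (a mapping from site name to limit in minutes) and the sample values are hypothetical and are not taken from any monitoring tool or real site.

```python
REQUIRED_MINUTES = 60  # limit requested in the minutes above


def sites_with_short_limit(max_wallclock_by_site, required=REQUIRED_MINUTES):
    """Return the sorted names of sites whose limit (in minutes) is below 'required'."""
    return sorted(
        site
        for site, minutes in max_wallclock_by_site.items()
        if minutes < required
    )


# Hypothetical values, for illustration only (not real site data).
example = {"site-a": 30, "site-b": 60, "site-c": 2880, "site-d": 45}
print(sites_with_short_limit(example))  # prints ['site-a', 'site-d']
```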

3. Steve reminded the meeting that the general EGEE strategy for migration to SLC3 is still missing (an action open since 2004-11-29; see the action list below). New release dates for LCG2 should not be announced before all sites have completed the upgrade to LCG2 2_3_0.

4. Marcin reported that CEE sites agree to support non-EGEE VOs.

5. Pierre asked for the possibility to run functional tests on demand. For now, such runs should be requested from Piotr or the CIC-on-duty; on-demand testing will be made possible in the future.

 

Security report (Ian Neilson absent):
- No report.

Next CIC-on-duty:

Action List:

(*** ACTION 2004-11-29--1 ***) Vincent Breton to check with the other application managers the "Migration to SLC3 plan" of non-HEP VOs. As of mid-February 2005, the LHC experiments are still unclear about the status of their software. IN2P3 reported at the 2005-02-21 meeting that the BIOMED VO has not yet started. OPEN

(*** ACTION 2004-12-06--1 ***) Sites should accelerate migration to SLC3, at least on the service nodes, because of security considerations. OPEN

(*** ACTION 2004-12-13--1 ***) Escalation procedures in the Operations' Manual should clarify that sites that run outdated LCG2 versions or do not respond to CIC-on-duty action prompts will be disclosed at this meeting. OPEN

(*** ACTION 2005-01-03--1 ***) Grid Infrastructure Support Section to give generic (functional) DNS aliases to important service nodes, ensuring transparent service changes. OPEN

(*** ACTION 2005-01-10--3 ***) The Asian sites' support needs to be formalised. Requested by CERN in the relevant CIC-on-duty report. Assigned to Taiwan, the Asian sites' ROC. OPEN

(*** ACTION 2005-02-07--1 ***) Nick & Flavia to send Piotr, for publication in the Operations' Manual, the procedure to be used by sites when they need to inform a VO that the SE is filling up.
CONCLUSION:
The SE space is owned by the VO. The sites may inform the VO via the EIS team that the SE is filling up, but the VO might decide to keep the data. It is not clear whether there is a quota per VO.
DONE. Close at the next meeting.

(*** ACTION 2005-02-07--2 ***) Gilles Mathieu and Markus to put the web page with CA rpms, currently under http://cern.ch/markusw, on the CIC web site. OPEN

(*** ACTION 2005-02-07--3 ***) GOC database manager David Kant to add an operational days field, in addition to operational hours, e.g. Operational hours: 0900 - 1700 (GMT), for every site. Sites to correct their timezone in GOC db. ROC managers should do the same, each on their page under http://cern.ch/egee-sa1/ROC-support.htm. OPEN

(*** ACTION 2005-02-07--4 ***) John Gordon to define metrics on site performance, using some information circulated by Ian Bird. OPEN

(*** ACTION 2005-02-14--1 ***) Min Tsai (CERN) noticed that the CNAF LCG2 Release shown in the monitoring tools is 2_2_0. This is an editing mistake and it should be corrected by CNAF. 
CONCLUSION:
This was not an error. 
DONE. Close at the next meeting.

(*** ACTION 2005-02-14--2 ***) David Kant's comment is needed here: ROCs to make sure that sites in their region that are in the GOC database also appear in the BDII configuration file, which is a prerequisite for 'certified' site status according to the Site Registration Requirements document. OPEN

(*** ACTION 2005-02-14--3 ***) Nick to clarify the future release strategy. The plans that Ian Bird presented at the EGEE review differ from those that Maarten Litmaath published by email. OPEN

(*** ACTION 2005-02-14--4 ***) Min to add the colour code to the Regions' page as well. DONE but should remain OPEN because some additional enhancements suggested by Jeremy are being implemented now. Close at the next meeting?

(*** ACTION 2005-02-14--5 ***) Participants requested that a historical site view, i.e. one that shows how long a given site was out during the previous week, appear on the weekly report template. OPEN

(*** ACTION 2005-02-21--1 ***) Nick and Ian Bird to define and document in the Operations' Manual the penalty for sites not responding to escalation, e.g. the Toronto case of the week of 2005-02-14. OPEN

(*** ACTION 2005-02-21--2 ***) Steve to ask David Kant to adapt the GOC db so that sites cannot turn their monitoring flag off while in production. OPEN

(*** ACTION 2005-02-21--3 ***) Piotr to ask Judit to exclude 'non functional' sites when building the BDII from GOC db. OPEN

(*** ACTION 2005-02-21--4 ***) Piotr to make a list of sites that don't allow (long) monitoring jobs to complete and send it to the ROC managers. OPEN

(*** ACTION 2005-02-21--5 ***) A re-incarnation of (*** ACTION 2005-02-07--1 ***): Who will write a policy and procedure covering the possibility of data migration for a site? OPEN

AOB

Maria Dimou, IT/GD, Grid Infrastructure Services