Weekly Operations' Meeting 2005-01-10

Agenda: http://agenda.cern.ch/fullAgenda.php?ida=a045849
Contact:  project-lcg-gda@cern.ch
Operations' Manual: http://lcgdeploy.cvs.cern.ch/cgi-bin/lcgdeploy.cgi/lcg-docs/EGEE-CIC-Operational-Manual/opMan.pdf

NB! As of today Nick Thackray will be chairing this meeting.

CIC reports:

CIC-on-duty report (Piotr Nyczyk from CERN):
A full report is in the CIC-on-duty log, in the entry starting with [2005-01-10 11:54] - Piotr Nyczyk. To reach it from the entry page http://cic.in2p3.fr, click on CIC views, On Duty Dashboard, FollowUp, FollowUp using LogFiles.

- The week started with 65 problematic sites, out of which 45 were treated by the CIC-on-duty.
- Piotr tuned the test site script; it is no longer expected to hang.
- 18 savannah tasks were closed, but most of them were old and simply out of date.
- At the time of the meeting, 56 sites were still failing the tests.
- We still have no re-direction solution for the case where an SE is full. The Disk Pool Manager and dCache are supposed to handle this problem.
- The ROCs should be more involved in operations. It is very hard for the CIC-on-duty to identify and handle problems centrally.
- NB! The ROC managers should remember that when a problem at a site is solved, the site needs to go through a period of quarantine before joining production again. This is mentioned in the Operations' Manual.
- Support for the Asian sites needs to be formalised.

CNAF CIC report (Nobody joined):
- No report.

IN2P3.FR CIC report (Helene Cordier, Pierre Girard, Rolf Rumler on the phone and Frederic Schaer at CERN):
- Asynchronous updates of the CA list on various nodes at a site (RB, WNs etc.) cause credential verification failures. Frederic will enter a bug in savannah (see action list).
- All French sites commit to completing the migration to LCG2 2_3_0 by the end of January 2005.
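
To illustrate the CA list problem reported above, here is a minimal sketch of how a site might spot nodes whose CA list is out of step with the others. It is not something presented at the meeting. It assumes the CA files of each node have already been collected into a dictionary; the function names and the "majority wins" rule are hypothetical choices made for this example.

```python
import hashlib
from collections import Counter


def ca_fingerprint(ca_files):
    """Return one digest for a node's CA list.

    ca_files maps a file name to the file's bytes (assumed layout).
    Names are sorted so the result does not depend on dict order.
    """
    h = hashlib.sha256()
    for name in sorted(ca_files):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator between name and content digest
        h.update(hashlib.sha256(ca_files[name]).digest())
    return h.hexdigest()


def out_of_sync_nodes(node_ca_files):
    """Return the sorted names of nodes whose CA list differs from
    the most common CA list at the site."""
    fingerprints = {
        node: ca_fingerprint(files) for node, files in node_ca_files.items()
    }
    if not fingerprints:
        return []
    majority, _count = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(node for node, fp in fingerprints.items() if fp != majority)


if __name__ == "__main__":
    current = {"ca1.0": b"cert-A", "ca1.r0": b"crl-new"}
    stale = {"ca1.0": b"cert-A", "ca1.r0": b"crl-old"}
    site = {"rb": current, "wn1": current, "wn2": current, "wn3": stale}
    print(out_of_sync_nodes(site))
```

A check of this kind only reports the mismatch; it does not replace the site-level fix requested in the action list.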

RAL CIC report (Steve Traylen):
- There are 10 problems in savannah attributed to the UK; 3-4 are being handled and a clean-up will take place this week.
- Migration to LCG2 2_3_0 is still going on at various UK sites.
- Problems with the Portuguese and Israeli CRLs were reported to RAL by some sites.
- We should contact ATLAS and LHCb weekly and ask them to clean their SEs.

SW Europe ROC report (Gonzalo Merino):
- The last meeting of the SWE federation was before Christmas. The next meeting will be on Jan. 17th, after which there will be more news to report.
- All sites understand they are due to complete the migration to LCG2 2_3_0.

SEE ROC report (Kostas Koumantaros, Ognjen Prnjat):
- 5 sites have upgraded to LCG2 2_3_0; another 2 still remain.
- 1 site failed the test at the time of the meeting.
- It is desirable to use savannah as the place where ROCs can follow a problematic site's status, provided the relevant ticket is kept up to date. Updates there are probably less likely to be missed than additional information sent by email.

CE ROC report (Nobody joined):
Marcin Radecki emailed his apologies to Ian Bird.

NE ROC report (Nobody joined):
- No report.

Russia report (Nobody joined, holiday in Russia):
- No report.

FZK.DE report (Holger Marten, Sven Hermann):
- They need to write adaptation code for the 2_3_0 release to handle the fact that PBS at the site is on a server separate from the CE. They also have different IP addresses for each node's internal and external network, whereas the LCG2 release assumes only one IP address per node.
- The GSI site had no problems with the upgrade.
- Overall, they are moving to SLC3 first and then to LCG2 2_3_0. By the end of January 2005, all 150 machines they run will be ready.
- They will email Oliver Keeble to learn how yaim can be forced to re-install everything.

Security report (Ian Neilson absent):
No report.

Next CIC-on-duty:

Action List:

(*** ACTION 2004-11-29--1 ***) Vincent Breton to check with the other application managers the "Migration to SLC3 plan" of non-HEP VOs. OPEN

(*** ACTION 2004-11-29--3 ***) Steve Traylen to publish procedures for a site adding itself to R-GMA, including the configuration file of the R-GMA registry. DONE - REMOVE FROM THE LIST NEXT TIME

(*** ACTION 2004-11-29--5 ***) Steve Traylen, with help from Laurence Field, to document on the Wiki page for the sites how to block users when necessary. OPEN

(*** ACTION 2004-12-06--1 ***) Sites should accelerate migration to SLC3, at least on the service nodes due to security considerations. OPEN

(*** ACTION 2004-12-13--1 ***) Escalation procedures in the Operations' Manual should clarify that sites that run outdated LCG2 versions or do not respond to CIC-on-duty action prompts will be disclosed in this meeting. OPEN

(*** ACTION 2004-12-13--2 ***) Savannah accounts were created for all the ROCs. New users had to activate these accounts but as some failed to do it in time, their accounts have now expired. Now A. Kryukov has to re-create the Russian ROC account(s). OPEN

(*** ACTION 2004-12-13--3 ***) CNAF.IT CIC to submit their hand-over report to the ROC managers' list on Dec. 20th 2004. DONE - REMOVE FROM THE LIST NEXT TIME

(*** ACTION 2004-12-13--4 ***) Nick Thackray to send email to the 3 relevant ROCs about Helene Cordier's reminder to UK, IT and RU to submit their test suites as agreed in The Hague. OPEN

(*** ACTION 2005-01-03--1 ***) Grid Infrastructure Support Section to give generic (functional) DNS aliases to important service nodes, ensuring transparent service changes. OPEN

(*** ACTION 2005-01-10--1 ***) Frederic Schaer will enter a bug in savannah, project=lcgoperation, requesting a site-level proxy to protect against asynchronous updates of the CA list on various nodes at a site (RB, WNs etc.), which currently cause credential verification failures. OPEN

(*** ACTION 2005-01-10--2 ***) Laurence Field or Markus Schulz to prototype a dump of the GOC database every few hours and provide a read-only copy for the community. OPEN

AOB

Maria Dimou, IT/GD, Grid Infrastructure Services