Minutes for Regional Operations Meeting DECH (November 3rd, 2006)

Attendance:
Sven Hermann <chair>, Clemens Koerdt, Torsten Antoni (FZK)
Christoph Wissing (Uni Dortmund, DESY-HH)
Andreas Gellrich, Uwe Ensslin (DESY-HH)
Horst Schwichtenberg, Kläre Cassirer (SCAI)
Andreas Haupt (DESY-ZN)
Thomas Kress (RWTH-Aachen)
Yves Kemp, Volker Büge, Christopher Jung (Uni Karlsruhe)
Christian Peter (ITWM)

Apologies:
GSI
Uni Wuppertal

Missing (due to technical problems with the conferencing system):
MPPMU
CSCS
Uni Freiburg


1. Introduction

Announcements:
- new report cycle: Operations meeting still Monday
- new Site Uni-Siegen-HEP (ATLAS), registration started (FZK), SCAI also in contact with them
- Top-level BDIIs (with a common config) are needed to spread the BDII load currently on CERN; each ROC is to provide one, details t.b.d.
- torque security update: problems reported by ROC SEE
S.H.: have there been any problems in our region?
A.H.: Timing on Friday afternoon was unfortunate
K.C.: problems with RPM dependencies resulted in having to do the installation twice (took very long for MPI)
- feedback form for CIC portal: Please start using it!
- Last week's RM failures could have been caused by hardware changes to the LFC infrastructure at CERN
- Update on SLC4 migration
> Nick: The effort is now focused on porting the WN and UI. As a fast solution a test is being done in certification to run RPMs built for SL3 on SL4 machines.  In parallel software is being built on SL4.  A couple of weeks from now are likely needed to be ready, assuming that the middleware release is not late. In that eventuality the possibility to run old “SL3” RPMs is considered as a fall-back solution.
- PPS: Is it OK to enter GSI, ITWM and DESY in the GOC DB now? Who is the main administrator for PPS at your site?
  o DESY (send details !#)
  o GSI (asked offline !#)
  o ITWM (Christian Peter, same as production as first step)
(agreed, details offline, new action point !#)
- "No emails from WNs!": It has been decided at the last operations meeting that there will be no support for sending emails from WNs for a common user job


Action points
- RB garbage collection: A.G. has opened a ticket (thanks for this); unfortunately it was closed, to be reopened in a coordinated way (!# S.H.)
- network measurement survey (closed)
- communication channels: till next meeting
  o create new list with all admin lists on it (each partner will receive a mail to their admin list for *confirmation*)
  o !# S.H. create list
  o !# each partner: confirm
- Assessgrid: cooperation successful (closed)
- DECH VO:
  o Status?
  o !# SCAI: Tests -> Send results
  o !# COD DECH: Open tickets if site doesn't support VO DECH (!# C.K. to coordinate)
  o !# ROC On Duty: Follow up on these tickets as usual
- SFT-Server: Status unchanged? Time scale?
(installed, Apache, Tomcat and MySQL running, one servlet missing that should listen on port 8088 related to RGMA Server functionality...)
  C.K.: Should run within next month, set to high priority from now on. SAM server will come later (RPMs are currently being built).



2. Round the Sites

CSCS
---
(not present)

DESY
----
- business as usual
- switch to VOMS: the LDAP VO server at DESY will now be shut down (a broadcast was sent to remind the VOs concerned to use VOMS instead)

DESY-ZN
-------
- business as usual
- there was a ticket (#14093) concerning a missing dcache file that is not being acted upon. Escalation seems to have stopped.
What to do in such a case?
S.H.: Such tickets are to be escalated further to the weekly operations meeting. (!# S.H. + C.K.)
T.A.: Mode of escalation has changed recently, but has not stopped. 


FZK
----
MW:
- update to 3.0.6 (only servers)
- some smaller problems with the infosystem. Fixed.
- Job submission turned off awaiting pbs-pro security patch (bug track #20883) from 21/10/06 till 23/10/06. Fixed.
- Re-installation of ATLAS VObox on 20/10/06 following OSCT security advisory regarding weak password protection.
PPS:
- pbspro security patch caused problems with the batch system config on PPS WNs, leading to JS failures. To be fixed.
Storage:
- The upgrade to dCache version 1.6.6-5 on 24/10/06 did not bring all the expected improvements.
- The stability of the gridftp doors has not improved. gridftp doors hang after machines run out of memory.
- In an effort to work around the problem, more gridftp doors have been installed.
- For the past week we have also been experiencing instabilities of the dCache head node. Activity is focused on solving this problem first.

ITWM
----
Two downtimes registered recently: replacement of hardware; performance problems on the CE. WNs were also replaced with new hardware.
Seems to work fine now. Can now concentrate on PPS.

SCAI
----
Still heavily involved in biomed data challenge. 9000 jobs theoretically possible, but in reality rarely exceeds 2600.
Bottleneck is with the four available RBs, which cannot do more. WMS to be tested soon.
WMS no longer in PPS, will take LCG RB for the time being.
S.H.: Biomed could demand more WMS at the operations meeting in case they need them.
H.S.: They will probably see no need for it now.
K.C.: SAM PPS tests are quite old.
C.K.: suggests opening a ticket in such a case.
PPS almost updated to latest release 9 (CE and WN).


RWTH-Aachen
-----
- They report problems with the time of this meeting. They have a new postdoc who might soon be able to attend. Could we hold the meeting in English instead of German? No objections.
- Site now has over 100 CPUs. Next year much more new hardware will arrive.
- Does anybody have experience with redundant CEs so far?
- In connection with CMS data challenges, site reported problems with transfers to GridKa for over a week.


Uni Dortmund
--------
business as usual
LHCb Monte Carlo jobs transferring to GridKa still fail.


Uni Karlsruhe
------
Upgrading WNs to SL3.0.8 and latest glite version. Will attack storage element afterwards. Increasing to 30TB.
Undecided whether to use DPM or dCache to get VOMS functionality.
S.H.: DPM is the one officially supported by EGEE. However, dCache would be specifically interesting as a DECH regional solution, as it is developed in DECH.
Y.K.: maybe dCache too heavyweight for our scope.
T.K.: Aachen has dCache and is happy with it: advantage of connected Tivoli file system, can give support together with DESY and FZK.
..: CSCS uses DPM. SCAI currently setting up DPM server, but migration is more difficult.
S.H.: The DPM support unit can be used (open tickets in the ROC DECH Portal) to get help from experts at CERN.


3. COD


There are some more shifts in November: 6.-12.11. as lead team (Victor, Peter, Christoph)
and 13.-19.11. as backup team. There is also a COD meeting next week in Athens.

4. ROC-On-Duty

Handover from GSI to CSCS has to be done offline by the parties involved.

Discussion of problematic tickets:
Y.K. reported:
#1375 (opened without justification, SFTs failed only once or twice) -> closed
#1377 (RGMA opened tickets concerning mon-boxes, should not do this, since he is not a member of COD) -> S.H. trying to clarify soon.
H.S. reported:
#1295 open ports that were not documented -> ticket to be assigned to CERN support unit. (!# ROC-On-Duty)
T.K. reported:
#1178 and #14512 (GGUS) complicated situation, site RWTH cannot make progress here -> !# C.K. to take further care of the ticket, to be addressed by the appropriate people

AOB:
---

- Task 1.9.1.: gather input for regular reporting of issues to the TCG.
Suggestion: open a regular 5-minute slot in this meeting to gather and discuss what to include in such a report: massive MW issues, serious regional problems, typical problems. The SA1 wiki (https://twiki.cern.ch/twiki/bin/view/EGEE/SA1) with its list of ROC issues can give partners some orientation on the kind of information expected.