Minutes for Regional Operations Meeting DECH (June 1st, 2007)
Attendance:
Sven Hermann, Clemens Koerdt, Günter Grein (FZK)
Renate Dohmen (MPPMU)
Christian Peter (ITWM)
Stephan Nies (Uni Dortmund)
Andreas Haupt (DESY-ZN)
Alessandro Usai (CSCS)
Christoph Wissing, Uwe Ensslin, Yves Kemp (DESY-HH)
Horst Schwichtenberg (SCAI)
Apologies:
(Uni Wuppertal)
Missing:
(GSI)
(Uni Karlsruhe)
(Uni Siegen)
(RWTH-Aachen)
(Uni Freiburg)
1. Introduction
Last meeting's minutes: no comments
Announcements:
change of DECH Wiki URL: now available via http (instead of https):
http://twiki.cscs.ch/twiki/bin/view/DECH/WebHome
gstat statistics: Some info received, follow up in more detail (wrong numbers for some sites with more than one CE, #C.K.)
EGEE production sites email channel was disturbed, seems to be solved now. Apologies.
If you want to write to all EGEE production sites, you have to get yourself registered first with your personal Savannah account. (https://savannah.fzk.de/projects/egee-production/)
phasing out classic SE (see mail):
Are there any features you as site admins would miss? (#all)
software release cycle: Sven is collecting feedback on this (mail forwarded by Sven on 31.5. to [egee-production_sites]: “Poll: [Fwd: Software release updates cycles]”)
Current requirements for operating systems in our region -> added a link to the agenda page. Feedback welcome! Will be forwarded to the TCG soon! Send corrections if necessary.
using DECH VO for Gridka school (temporary certs / use with VOMS server)
there are some security considerations.
partners (DESY, SCAI, after the meeting also Wuppertal) strongly opted for use of Dech VO for training, instead of creating yet another VO for that purpose.
All Dech sites could agree to accept temporary certificates for a limited use case.
still issues to clarify. Discussions at FZK ongoing, esp. with security people
you can find links of the last three ops meetings on the agenda page
Reminder: voms cert has changed (still an issue..)
top level BDIIs stress test (see link to ops meeting 21.5.)
production release 24:
yaim, sl4comp.WN (Attention: difficult upgrade path to natively compiled package later and therefore not recommended for small sites)
production release 25
DGAS, lcg-CE, RGMA server bugfixes...
changes of pool account mapping introduced with new yaim (two tickets #22244,#22306 opened on that, as well as discussions in the rollout list)
OS flag (GlueVariable “OS” to specify OS for jobs)
PPS 29 – glite 3.1 WN on SL4
PPS 30 – canceled
PPS 31 – just been released. predeployment test report (see link on agenda page)
new release schedule proposed (every two weeks)
patches go out on Mondays, stay in for three-four weeks and then go to production if no issues were found
EGEE III preparations ongoing. All Partners sent their requests. Negotiations on European level to come.
Action Items:
PPS certification (get info system and SAM tests working: #! GSI, ITWM)
-> GSI partly passing the SAM tests. Next step would be to sustain positive testing for a week. ITWM delayed due to higher priorities at the moment:
replica of the GOC db web server up and running: goc2.gridops.org
after stable version of this is achieved, go back to PPS
Dech VO: Now only recently certified sites LRZ and Uni Siegen not supporting Dech VO. (#! Siegen, LRZ, C.K.)
-> Siegen started, though still JS problems, LRZ no progress visible
Create a criteria catalog to classify MW updates as "major" or "minor", to warn site admins regionally in advance and to better advise smaller sites
-> no feedback received. Rough classification exists on DECH wiki. Feedback still welcome. Agreed for now. CLOSED
2. Round the Sites
CSCS
updated dCache with recent patches
experiencing timeout problems in connection with replica management
rate of transfer not acceptable at the moment
might be that the source (Karlsruhe) has a problem...?
huge improvement last week, but then deterioration....
SH: might be useful to open a GGUS ticket on this problem.
recent TPM shift without notable problems
---
DESY-HH
PPS
running two WN in compatible mode
Quattor scripts for native WN package in progress
involved in SRM 2.2 testing
new PPS release not yet deployed. Hope it will not destroy our gLite CE again, as usual
Production
struggling with changes in deployment for pool accounts
fifth RB soon online, too much load requires manual interventions
Each RB handling dedicated VOs.
still having problems with VOMS mapping
anybody using VOMS mapping extensively? Feedback welcome!
opened a ticket on VOMS: see #21641
---
DESY-ZN
upgraded to dCache 1.7 with help of developers
Since Wednesday running again
---
FZK
first half of May without troubles.
second half of May more problematic:
first difficulties with SRM and hanging gridftp doors (all effort is consumed just maintaining the service; if effort is unavailable, the service degrades quickly)
(new SRM certificate test was testing on gridftp port and thus failed. Issue was known to SAM team since end of last year, but was ignored.)
The SE via the SRM became inoperative several times. Reason for failure: hanging processes with memory errors in combination with high load. Under investigation.
CEs started to lock up recently (I/O processes too slow for the high number of jobs). Procedure to detect and fix this particular failure class is being developed.
dCache 1.8beta (SRM 2.2) in PPS
Again problems with APEL RPMs after changing specInt value. Some manual hacks were necessary.... Still in contact with Dave Kant...
---
GSI
(No information received)
---
ITWM
involved in gocdb3 web portal replica
production: currently in maintenance
installing latest updates
LDAP server on CE became unstable recently, reason unknown..?
---
LRZ
(No information received)
---
MPPMU
glite-job-submit on SGE problem is now solved with help from LRZ
some changes to the job manager did the trick
VOMS server problem (see ticket DECH#1905, solved at time of writing)
---
RWTH
(No information received)
---
SCAI
newly employed person: Daniel Rubin(?)
TPM shift was normal
WMS: still problems with this service (processes are hanging, well known to developers...)
---
Uni Dortmund
new site admin: Stefan Nies, will represent Dortmund in this meeting
---
Uni Freiburg
(No information received)
---
Uni Karlsruhe
Yves: short status: still in SD
currently problems with information not being published (ticket opened on that)
Gregory Schott in charge of the cluster now.
---
Uni Siegen
(No information received)
---
Uni Wuppertal
(No information received)
---
3. Feedback to the TCG
HS: no feedback on questionnaire received yet
Anybody interested in interfaces to databases (Oracle database, OGSA-DAI, web service access to databases (earth science community))?
Interfaces to access databases like Oracle:
A first introduction to AMGA may be: http://amga.web.cern.ch/amga/downloads/amga-manual_1_2_3.pdf
See also links at TCG like: http://egee-intranet.web.cern.ch/egee-intranet/NA1/TCG/wgs/mdm.htm
Another well-known interface is OGSA-DAI: http://www.ogsadai.org.uk/
see also: http://omii-europe.org
There was also a task in TCG about the evaluation of OGSA-DAI: https://savannah.cern.ch/task/?2937
please do not hesitate to come forward with feedback to the TCG!
ask your user groups!
4. COD
Next COD shift will be during the Stockholm week, 11.6.-18.6. (three DECH COD people attending).
Will have an internal COD meeting the week after. Proposal is 21.6. (Thursday)
Currently huge problems with SAM portal: missing tests, LFC errors, etc., ...
Still lots of useless error messages. (expect discussions in forthcoming COD meeting)
New host-cert tests caused some controversial discussions about procedure to apply before making them critical.
Signal-to-noise ratio in official communication channels like EGEE broadcast seems too low now. Some improvements are currently being discussed in the CIC team.
Working group of TIP
Encouragement to site admins to improve/extend GOC Wiki, which is a community effort...
5. ROC-On-Duty
Handover GSI to CSCS
GSI not present in meeting. CSCS agrees to contact them offline
Any problematic tickets?
YK: #21600 -> UI installation on file system where you are not root (voms-proxy-init). Got no feedback from developers. Better to use rollout list?!?
SH: contacting developers is always problematic. But posting on rollout list is not enough. GGUS tickets on which the support unit does not become active should be reported in your weekly site reports. They can then be escalated to Monday's operations meetings.
6. AOB
Reminder: Operations workshop in Stockholm, 13-15th June, agenda available:
http://indico.cern.ch/conferenceTimeTable.py?confId=12807