Minutes for Regional Operations Meeting DECH (March 23rd, 2007)
Attendance:
Alessandro Usai (CSCS)
Uwe Ensslin, Andreas Gellrich, Christoph Wissing, Yves Kemp (DESY-HH)
Sven Hermann <chair>, Günter Grein (FZK)
Renate Dohmen (MPPMU)
Andreas Nowack (RWTH Aachen)
Horst Schwichtenberg, Kläre Cassirer (SCAI)
Christoph Wissing (Uni Dortmund)
Hans-Gunter Borrmann (Uni Freiburg)
Yves Kemp (Uni Karlsruhe)
Apologies:
(DESY-ZN, sent report)
(GSI)
(ITWM, sent report)
(Uni Wuppertal)
Missing:
(LRZ)
1. Introduction
Announcements:
- Last meeting's minutes: no comments
- WLCG/EGEE Operations Meeting 12.3. agenda
- technical problem with ROC report due to CIC Portal
- problems with last Torque update
- related to major/minor update discussion, see below
- gLite WMS/CE – LCG RB/CE deployment strategy
- catalogue with criteria to fulfil for these MW components
presented in Monday's meeting
- gLite flavoured CE/WMS still recommended default services
- sites decide which to use, but LCG-flavoured services
still much more reliable and usable
- WLCG/EGEE Operations Meeting 19.3. agenda
- Survey: Migration plan to SLC4 (see mail
"[egee-production_sites] SL4 plans" from March 21st)
- gLite update 16: major <-> minor updates (problems with
recent releases)
- CERN rolls out no more than one major release per year, this
year "gLite 3.1"
- PPS experience and test results go into the release notes
- ROCs get each release in advance in PPS for testing, use
experience in PPS! ROC should then give specific feedback from PPS to
their sites
- warning about torque2 was in release notes, there has been an
additional action point in Monday's Ops Meeting
- failure of ROC DECH management: didn't point out possible
difficulties with torque update
- auto-updating is still explicitly not recommended: don't
use apt-update directly for anything other than OS and CAs
- DESY e.g. smoothly ran torque update, no problem
- agreement to create a criteria catalogue to classify MW
updates as "major" or "minor", in order to warn site admins regionally
in advance and to better advise smaller sites (!# make draft available
for next meeting, discuss in specific slot)
- Production update 17 and 18 (see link!)
- 17:
- CA update - high priority
- dCache 1.7 update - admin intervention necessary
- 18:
- problem with MySQL version of LFC reported, ugly bug, difficult
to fix, almost lost DB (addition: meanwhile went into weekly report on
Monday)
- EGEE: intervention procedures (draft document)
- established procedures in LCG
- no objections/comments yet, please feel free to send feedback
until April 2nd
- EGEE: preparation EGEE3 -- internal ROC DECH note
- received very little information about EGEE3 preparation so far
- no draft from SA1 management available yet, difficult to
contribute
- feedback sent to SA1 management beginning of this week
- internal document available to DECH SA1 team for discussion and
feedback (see link)
Action Items:
- PPS certification (get info system and SAM tests working:
!# GSI, ITWM) - unchanged
- DECH VO: Supported by all sites in region? -> MPPMU
still working on it.
- Regional SFT-Server: no update, admin currently not
available
2. Round the Sites
CSCS
- Migrating DPM to dCache, new machines for storage planned, expand by
end of this year
- data throughput from Karlsruhe: Maybe more tampers needed?
- new WNs: SL4 on board, 32bit - next week
- need clarification about WLCG storage/transfers
S.H.: coordination meeting T1-T2 concerning storage/transfers to be
organised (!#)
DESY
- Production: recent update
- CE without problems
- LFC did not work: database was touched, restarted db, but
access tables were lost! Dirty hacking helped
A.U.: similar problems with DPM due to schema change, transfers with
old schema corrupted db -> testing with usage necessary!
C.W.: MySQL also seems to be little tested
S.H.: to check certification of MySQL components with OCC (!#)
- PPS
- ticket for CE, but seemed to be a central problem with WMS,
magically solved this morning without intervention
- consider handling problems with gLite-flavoured CE/WMS with
lower priority for the time being, as these services are quite
unstable
- VOs
- SL4/64bit issue: SA3 seems to discuss dropping Perl from MW. But
H1 + ZEUS use Perl API for LFC heavily. Concern!
S.H.: Please send details, and I'll forward this to the appropriate
meeting
C.W.: agree (done at the time of writing)
- SL4 survey: MW should not depend on flavour, standard OS *not*
CERN version used; depending on RPM version of MW
- APEL
- Dave Kant is getting APEL running
- R-GMA Mon-Box constant disaster
SCAI: same here, cron job restarts Tomcat server every hour to pass SAM
tests
DESY-ZN
(apologies)
- average number of running jobs decreased (now ~30)
- since Monday 13.5 TB more disc space in dCache (total 32 TB now)
- GGUS ticket 18520 not progressing, please escalate
FZK
- SL4 status
- new WNs run with SL3-build under SL4
- preparation for SL4 ongoing
- waiting for gLite 3.1 to migrate old WNs as well
- schedule and plan depending on gLite developers
- dCache update successful
- more stable
- recent problems:
- head node full -> moved
- gridftp door restarts
- batch system problems
- solved now
- CE was erroneously taken out of monitoring for days
GSI
- (no status report received)
ITWM
(apologies)
- production up and running
- security challenge finished successfully, feedback in preparation
- work on PPS to be continued ASAP
MPPMU
- DECH VO configuration doesn't work yet (to be fixed with K.C.
offline)
- APEL accounting not in place yet
- no plans for SL4 yet, high workload
LRZ
- (no information received)
RWTH
- dCache upgrade last week
- some problems
- 1.6 was unstable, improvement with new Java version 1.5.0.11
- dCache 1.7.31 not so stable, now running fine with dCache 1.7.29
- SL4
- waiting for MW
- need Quattor templates
- CMS
- "dcms" currently not accounted for "cms", should be changed
with VOMS in future
S.H.: to address this kind of accounting issue (map "dcms" to "cms/de"
e.g.), it's useful to open a GGUS ticket. Accounting experts could then
change this in the accounting db, if needed.
SCAI
- production and PPS running fine, business as usual
- issue with LFC: update 18 solved problem
- WMS problem: release notes said "no reconfig necessary", but not
true
- SL4 migration in PPS next week
- PPS gLite CE: "JS" errors due to central problem
S.H.: to be addressed at next Monday's ops meeting (done at the time of
writing)
- trying to address WNs with two different registry hosts (R-GMA),
two different torque servers (for different Grids)
A.G.: should use one batch system with different CEs instead, GIIS is
simply added to different Top Level BDIIs to publish site to different
grids
Dortmund
- installation with SL4 tarball
- D-Grid participation of another faculty in DO, planning to bring
these resources into EGEE as well
- thanks to Kläre for help with DECH VO
Freiburg
- 20 WNs with 8 GB RAM, 4 cores -> need 64-bit to fully use
hardware
- dCache installation from scratch
Karlsruhe
- in scheduled downtime
- lack of manpower
- currently fixing problems at site to go back to production mode
Wuppertal
- (no information received)
3. Feedback to the TCG
preparation of questionnaire ongoing, hopefully next week
4. COD
- no shifts for ROC DECH team in the last two weeks
- business as usual
5. ROC-On-Duty
Handover GSI to CSCS
Summary: 62 tickets created / 56 solved in the last two weeks.
currently no problematic tickets or other issues to discuss
DECH Portal update scheduled for Thursday, 29th
- new support units
- see grid news
- reminder procedure: 1 reminder per day from then on, to reduce mails
6. AOB
------
next meeting on April 20th due to Easter (wrong announcement in
meeting due to erroneous schedule), see
http://indico.cern.ch/conferenceDisplay.py?confId=11585