28-R-15 (CERN conferencing service (joining details below))
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
OR click HERE (Please specify your name & affiliation in the web-interface)
From: France and Italy
To: UKI and Russia
Report from France:
2 cases transferred to political instances
GGUS ticket #40782 APEL failure on gridce01.tier2-kol.res.in
Ticket submitted on 11/09/08
GGUS ticket #41152 SRM failure on gridse001.tier2-kol.res.in
Ticket submitted on 22/09/08
=> Suspension of IN-DAE-VECC-01 was already discussed at the Ops meeting, but the site has still not been suspended by the ROC. CODs have the rights in GOCDB to suspend a site, but are they allowed to do it?
RU-Phys-SPbSU: APEL failure on phys5.gridzone.ru
GGUS Ticket #40521
Ticket submitted on 05/09/08
=> ask for suspension
UKI-LT2-QMUL: RGMA failure on mon01.esc.qmul.ac.uk
GGUS Ticket #40945
Ticket submitted on 16/09/08
Answered on 04/10/08: site did not receive the ticket
=> ROC_UKI does not seem to be answering.
It seems ROC_UKI does not receive GGUS notifications. This should be fixed.
KR-KISTI-HEP: APEL failure on hep001.kisti.re.kr
GGUS ticket #40773
Answer on 03/10/08
srm.pps.cern.ch (CERN-PROD): in SD until 03/10/09
Is it a test node or a CERN-PPS node?
If so, it would be better to change the SD description to "Test node".
=> Still no news on the possibility of declaring a test node in GOCDB (see https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalUseCasesAndStatus)
What is the status on that 'test node' problem?
GGUS Ticket-ID: 40945
Affected Site: UKI-LT2-QMUL
Responsible Unit: ROC_UK/Ireland
Apologies received on 2008-10-04:
"The delay in responding was related to the fact that the QMUL site admin email list was left off the orginal list of assignees.
Anyway, mon01 has a problem which should be fixed early next week."
<big> PPS Report & Issues </big>
Please find Issues from EGEE ROCs and general info in:
France: What is the status of the SAM problem raised in GGUS ticket #40565?
It seems some nodes might not be taken into account by SAM after an SD.
<big> Comparison of BDII and GOCDB entries for LFC in GSTAT</big>10m
Some sites have noticed that GSTAT is now comparing LFC entries in the
GlueService of the BDII with the node names in GOCDB.
For example, it reports prod-lfc-atlas-local.cern.ch as being present in the BDII as a
GlueServiceType: lcg-local-file-catalog, but in GOCDB this host
is entered as node type LFC. Assuming it is a local LFC, it should be a
node type Local-LFC.
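The comparison described above can be sketched roughly as follows. This is only an illustration, not GSTAT's actual code: the function name and the data layout are hypothetical, and the mapping of lcg-file-catalog to node type LFC is an assumption (only the lcg-local-file-catalog to Local-LFC pairing comes from the example above).

```python
# Hypothetical mapping from a BDII GlueServiceType to the GOCDB node type
# that a host publishing it would be expected to have.
EXPECTED_NODE_TYPE = {
    "lcg-local-file-catalog": "Local-LFC",  # pairing taken from the example above
    "lcg-file-catalog": "LFC",              # assumption, for illustration only
}


def find_mismatches(bdii_services, gocdb_nodes):
    """Return (host, expected node type, registered node types) for each mismatch.

    bdii_services: dict mapping hostname -> GlueServiceType published in the BDII
    gocdb_nodes:   dict mapping hostname -> set of node types registered in GOCDB
    """
    mismatches = []
    for host, service_type in sorted(bdii_services.items()):
        expected = EXPECTED_NODE_TYPE.get(service_type)
        if expected is None:
            continue  # not an LFC service type, nothing to compare
        registered = gocdb_nodes.get(host, set())
        if expected not in registered:
            mismatches.append((host, expected, sorted(registered)))
    return mismatches


# The case reported above: published as a local catalog, registered as LFC.
bdii = {"prod-lfc-atlas-local.cern.ch": "lcg-local-file-catalog"}
gocdb = {"prod-lfc-atlas-local.cern.ch": {"LFC"}}
result = find_mismatches(bdii, gocdb)
```

With these inputs the host is flagged because Local-LFC is expected but only LFC is registered; changing the GOCDB node type to Local-LFC clears the mismatch.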
In the case of the top_bdii there is an existing bug that can make this
harder to resolve than it should be when you wish to publish a host alias as the
service endpoint. BUG:41361. A fix for this trivial bug will be pushed forward.
<big>New LFC SAM tests</big>5m
Later this week, two new services will be added to SAM production: LFC_L and LFC_C. The associated tests will be made critical so that history can be viewed in the SAM portal, but they will be ignored for availability calculations, and COD alarms will be suppressed. At some stage in the future, and after suitable notifications, they will replace the existing LFC service. The new tests avoid trying to write to read-only LFCs, and include an lfc-ping test on which the others are dependent.
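The dependency on lfc-ping mentioned above could work along these lines. This is a minimal sketch and not the SAM implementation: the function name, the status strings and the test name "lfc-read" are hypothetical; only lfc-ping is named in the announcement.

```python
def run_lfc_tests(ping, dependent_tests):
    """Run lfc-ping first; run the dependent tests only if the ping succeeds.

    ping:            callable returning True if the LFC answers
    dependent_tests: dict mapping test name -> callable returning True on success
    """
    results = {"lfc-ping": "ok" if ping() else "error"}
    for name, test in dependent_tests.items():
        if results["lfc-ping"] == "ok":
            results[name] = "ok" if test() else "error"
        else:
            # The service is unreachable, so the dependent test is not attempted.
            results[name] = "skipped"
    return results


# "lfc-read" is a hypothetical dependent test name used only for illustration.
down = run_lfc_tests(lambda: False, {"lfc-read": lambda: True})
up = run_lfc_tests(lambda: True, {"lfc-read": lambda: True})
```

Gating the other tests on the ping means an unreachable LFC produces one failure rather than a failure for every dependent test.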
<big>gLite 3.0 services to be obsoleted</big>5m
An announcement for this retirement is already on the gLite 3.0 page :
This corresponds to the procedure (until we have a new one) that was discussed in the ops meeting in Feb 08: https://twiki.cern.ch/twiki/bin/view/EGEE/WlcgOsgEgeeOpsMinutes2008x02x25#Support_for_gLite_3_0_services
PLEASE, LET US KNOW ANY OBJECTION BY NEXT WEEK!
<big>Changes in VO Cards, e.g. change in required OS Software</big>10m
Following recent requests from a VO member made directly to sites to install a particular extra piece of OS software, here is a recap of the policy.
VOs wishing to change what they need a site to support should of course use the VO cards as the definitive reference.
Any change to the VO card by any VO which would trigger site action should be discussed first at the weekly EGEE/WLCG operations meeting.
The purpose is to allow other VOs and sites to raise concerns. A sensible timeline can also be
decided for the sites to implement the changes.
<big>Job Storm for Last Friday's GridFest.</big>5m
For last Friday's LHC GridFest, several
hundred thousand jobs were submitted.
It is clear that sites and resource centres should have been notified about this. Thanks
to all sites who propped up services during this time. To my knowledge only one 3.0 lcg-CE
Apologies for not informing the sites, all jobs should now have exited and be clear of the
<big> WLCG issues coming from ROC reports </big>
France: Is there a procedure to notify sites and GGUS automatically about changes in the LHC alarm DN list? (cf. https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage)
Checking this list manually is not very user-friendly and could lead to an alarm from a newly authorized person being rejected if sites or GGUS are not up to date.
This kind of change could be notified to sites and GGUS via a GGUS ticket. This would ensure that everyone is aware of the changes and that they have been taken into account. The same should apply to possible changes of the alarm email addresses for a site/VO.
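An automatic notification of this kind would first need to detect what changed between two versions of the DN list. The sketch below shows one possible way to do that. It is an illustration only: the function names, the message format and the example DNs are hypothetical, and how the list is fetched or how the ticket is submitted is left out.

```python
def diff_alarm_dns(previous, current):
    """Return (added, removed) DNs between two versions of the alarm DN list."""
    previous, current = set(previous), set(current)
    return sorted(current - previous), sorted(previous - current)


def notification_text(added, removed):
    """Build a plain-text summary that could go into the body of a ticket."""
    lines = ["ADDED: " + dn for dn in added]
    lines += ["REMOVED: " + dn for dn in removed]
    return "\n".join(lines)


# Made-up DNs, for illustration only.
old = ["/DC=example/CN=Alice", "/DC=example/CN=Bob"]
new = ["/DC=example/CN=Bob", "/DC=example/CN=Carol"]
added, removed = diff_alarm_dns(old, new)
message = notification_text(added, removed)
```

An empty message means nothing changed, in which case no notification would need to be sent.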
<big>WLCG Service Interventions (with dates / times where known) </big>