WLCG-OSG-EGEE Operations meeting
28-R-15
CERN conferencing service (joining details below)
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
OR click HERE
(Please specify your name & affiliation in the web-interface)
-
-
16:00
→
16:01
Feedback on last meeting's minutes 1m
-
16:01
→
16:30
EGEE Items 29m
-
<big> Grid-Operator-on-Duty handover </big>
From: France and Italy
To: UKI and Russia
Report from France:-
2 cases transferred to political instances
- IN-DAE-VECC-01:
GGUS ticket #40782: APEL failure on gridce01.tier2-kol.res.in. Ticket submitted on 11/09/08. No answer.
GGUS ticket #41152: SRM failure on gridse001.tier2-kol.res.in. Ticket submitted on 22/09/08. No answer.
=> Suspension of IN-DAE-VECC-01 was already discussed at the Ops meeting, but the site has still not been suspended by the ROC. CODs have rights in GOCDB to suspend sites, but are they allowed to do so?
- RU-Phys-SPbSU: APEL failure on phys5.gridzone.ru. GGUS ticket #40521, submitted on 05/09/08. No answer.
=> Ask for suspension.
- UKI-LT2-QMUL: RGMA failure on mon01.esc.qmul.ac.uk. GGUS ticket #40945, submitted on 16/09/08. Answered on 04/10/08: the site did not receive the ticket.
=> ROC_UKI does not seem to be answering. It seems ROC_UKI does not receive GGUS notifications. This should be fixed.
- KR-KISTI-HEP: APEL failure on hep001.kisti.re.kr. GGUS ticket #40773. Answered on 03/10/08.
- srm.pps.cern.ch (CERN-PROD): in SD until 03/10/09. Is it a test node or a CERN-PPS node? If it is a test node, it would be better to change the SD description to "Test node".
=> There is still nothing about the possibility of declaring a test node in GOCDB (see https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalUseCasesAndStatus). What is the status of that 'test node' problem?
Report from Italy:-
2 cases transferred to political instances
- GGUS Ticket #40521 Affected site: RU-Phys-SPbSU Responsible Unit: ROC_Russia No replies
- GGUS Ticket-ID: 40945 Affected Site: UKI-LT2-QMUL Responsible Unit: ROC_UK/Ireland Apologies received on 2008-10-04: "The delay in responding was related to the fact that the QMUL site admin email list was left off the orginal list of assignees. Anyway, mon01 has a problem which should be fixed early next week."
-
<big> PPS Report & Issues </big>
Please find issues from EGEE ROCs and general info in:
https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
-
<big> gLite Release News</big>
Please find gLite release news in:
https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases
Now in Production:
Soon in Production:
-
<big> EGEE issues coming from ROC reports </big>
- France: What is the status of the SAM problem raised in GGUS ticket #40565? It seems that some nodes might not be taken into account by SAM after an SD.
-
<big> Comparison of BDII and GOCDB entries for LFC in GSTAT</big> 10m
Some sites have noticed that GSTAT is now comparing LFC entries in the GlueService of the BDII with the node names in GOCDB.
Example: http://gstat.gridops.org/gstat/CERN-PROD shows prod-lfc-atlas-local.cern.ch as being present in the BDII with GlueServiceType lcg-local-file-catalog, but in GOCDB this host is entered as node type LFC. Assuming it is a local LFC, it should be node type Local-LFC.
History: GGUS:38053
While this test does produce an error in GSTAT, it is not critical in the sense of availability or for the CODs, i.e. fix it in your own time.
To pass this test, the following comparison is made:
Service      BDII Service Type        GOCDB Node Type
Central LFC  lcg-file-catalog         LFC
Local LFC    lcg-local-file-catalog   Local-LFC
-
<big>Comparison of BDII and GOCDB Entries for bdii_site and bdii_top Services</big> 5m
Similar to the LFC test above, GSTAT also runs a test that compares the BDII entries for Site BDII and Top BDII endpoints. The conditions that pass are:
Service    GlueServiceType  GOCDB Node Type
Top BDII   bdii_top         Top-BDII
Site BDII  bdii_site        Site-BDII
History: GGUS:40475
In the case of the Top BDII there is an existing bug (BUG:41361) that can make this harder to resolve than it should be when you wish to publish a host alias as the service endpoint. A fix for this trivial bug will be pushed forward.
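The two comparisons above (LFC and BDII services) amount to a lookup from the GlueServiceType published in the BDII to the node type expected in GOCDB. The following is a minimal sketch of that rule for sites that want to check their own entries. It is not the GSTAT implementation; the names `EXPECTED_NODE_TYPE` and `check_entry` are hypothetical, and the way the published service type and the registered node types are obtained is left out.

```python
# Expected GOCDB node type for each GlueServiceType published in the BDII,
# copied from the two comparison tables in this agenda.
EXPECTED_NODE_TYPE = {
    "lcg-file-catalog": "LFC",
    "lcg-local-file-catalog": "Local-LFC",
    "bdii_top": "Top-BDII",
    "bdii_site": "Site-BDII",
}


def check_entry(glue_service_type, gocdb_node_types):
    """Return None if the entries agree, else a short mismatch message.

    glue_service_type: the GlueServiceType string published for the host.
    gocdb_node_types: collection of node types registered for the host in GOCDB.
    (Hypothetical helper; how these values are fetched is not shown.)
    """
    expected = EXPECTED_NODE_TYPE.get(glue_service_type)
    if expected is None:
        # Service type is not covered by these comparisons.
        return None
    if expected in gocdb_node_types:
        return None
    found = ", ".join(sorted(gocdb_node_types)) or "nothing"
    return (f"{glue_service_type} expects GOCDB node type {expected}, "
            f"found {found}")


# The CERN-PROD example: published as a local LFC, registered as LFC.
message = check_entry("lcg-local-file-catalog", {"LFC"})
if message:
    print(message)
```

A host registered with the matching node type returns `None`, so the same function can be run over every service endpoint of a site to list only the mismatches.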
-
<big>New LFC SAM tests</big> 5m
Later this week, two new services will be added to SAM production: LFC_L and LFC_C. The associated tests will be made critical so that history can be viewed in the SAM portal, but they will be ignored for availability calculations, and COD alarms will be suppressed. At some stage in the future, and after suitable notifications, they will replace the existing LFC service. The new tests avoid trying to write to read-only LFCs, and include an lfc-ping test on which the others are dependent.
Speaker: John Shade (CERN)
-
<big>gLite 3.0 services to be obsoleted</big> 5m
- glite-SE_classic
- glite-VOBOX
- glite-WMS
- glite-PX
- glite-MON
An announcement of this retirement is already on the gLite 3.0 page:
http://glite.web.cern.ch/glite/packages/R3.0/
This follows the procedure (until we have a new one) that was discussed in the ops meeting in Feb 08:
https://twiki.cern.ch/twiki/bin/view/EGEE/WlcgOsgEgeeOpsMinutes2008x02x25#Support_for_gLite_3_0_services
PLEASE, LET US KNOW ANY OBJECTION BY NEXT WEEK!
-
-
16:30
→
17:00
WLCG Items 30m
-
<big>Changes in VO Cards, e.g change in required OS Software</big> 10m
Following recent requests made by a VO member directly to sites to install a particular extra piece of OS software, here is a recap of the policy.
VOs wishing to change the requirements that a site must support should of course use the VO cards as the definitive reference.
Any change to the VO card by any VO which would trigger site action should be discussed first at the weekly EGEE/WLCG operations meeting.
The purpose is to allow other VOs and sites to raise concerns. A sensible timeline can also be decided for the sites to implement the changes.
-
<big>Job Storm for Last Friday's GridFest.</big> 5m
For last Friday's LHC GridFest several hundred thousand jobs were submitted.
It is clear that sites and resource centres should have been notified about this. Thanks to all sites who propped up services during this time. To my knowledge only one 3.0 lcg-CE actually died.
Apologies for not informing the sites. All jobs should now have exited and be clear of the system.
-
<big> WLCG issues coming from ROC reports </big>
- France: Is there a procedure to notify sites and GGUS automatically about changes in the LHC alarm DN list? (cf. https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage)
Checking this list manually is not very user-friendly and could lead to an alarm from a newly authorized person being rejected if sites or GGUS are not up to date.
Changes of this kind could be notified to sites and GGUS by a GGUS ticket. This would ensure that everyone is aware of the changes and that they have been taken into account. The same should apply to possible changes of the alarm email addresses for sites/VOs.
-
<big>WLCG Service Interventions (with dates / times where known) </big>
Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
Many interventions scheduled this week. Please consult the URLs above for details.
Time at WLCG T0 and T1 sites.
-
<big> WLCG Operational Review </big>
Speaker: Harry Renshall / Jamie Shiers
-
<big> Alice report </big>
-
<big> Atlas report </big>
-
<big> CMS report </big>
None.
Speaker: Daniele Bonacorsi
-
<big> Storage services: Recommended base versions </big>
The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions
-
<big> Storage services: this week's updates </big>
Refer to the wiki page here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08StorageStatus
-
-
17:00
→
17:30
OSG Items 30m
Speaker: Rob Quick (OSG - Indiana University)
-
Discussion of open tickets for OSG
-
-
17:30
→
17:35
Review of action items 5m
-
17:35
→
17:36
AOB 1m
-