28-R-15, CERN conferencing service (joining details below)
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
NB: Reports were not received in advance of the meeting from:
Recording of the meeting
Feedback on last meeting's minutes
<big> Grid-Operator-on-Duty handover </big>
From: CERN / AsiaPacific
To: SouthEast Europe / Italy
Issues from the AsiaPacific ROC:
The CIC dashboard seems to have been unstable this week; we encountered problems accessing it many times during Taiwan's working hours (UTC+8).
The dashboard showed the error message "Data source is unreachable".
23 Jan: An issue regarding whether to open tickets for nodes not in production was sent to the CIC mailing list, but there have been no responses so far. Details:
We are the lead team (Taiwan) this week, and there is one thing we need to consult with you about. We found that a node can have the following attribute in GOCDB:
Production status: Not in production.
Site admins can decide whether or not to put their nodes in production, but GStat does not check this attribute at present, and the situation seems to be the same for SAM. So we as COD would see a failure alarm for a node whose site admin has set it to "Not in production" but has left its monitoring switched on. How should we deal with this kind of situation?
Should we refrain from opening tickets for nodes that are not in production? In that case the CIC dashboard could provide us with this information in the alarm.
For details please see ticket #7169, in which the site admin suggested that we should not open tickets for nodes marked "Not in production".
Any comments about this?
In fact, every node at a production site is supposed to be in production. That is now the case in GOC DB: monitoring cannot be disabled for the nodes of a production site. I have sent a mail to the COD people to explain what to do with these nodes for the moment.
Since the last update of GOC DB, some old nodes whose monitoring was switched off in the previous version are now monitored, and it is no longer possible to switch off monitoring for the nodes of a production site.
Consequently, site admins must now delete from GOCDB all old nodes which are no longer used, and these nodes must also be removed from the information system so that they are not monitored by SAM. If there is a difference between GOC DB and the BDII (a node registered in the BDII but not in GOC DB, or a node in GOC DB but not in the BDII), you must open a ticket to inform the site about it.
Either the site admins register these nodes in both the BDII and GOC DB (possibly declaring a downtime), or they must delete the nodes from GOC DB.
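As an illustration of the cross-check described above, the following Python sketch compares node lists from the two registries and reports the entries that would warrant a ticket. The function name and node names are invented for this example; a real check would query the GOC DB interface and the BDII (e.g. via LDAP) rather than use hard-coded lists.

```python
# Illustrative sketch only: cross-check node lists from GOC DB and the BDII.
# The data and the function name are hypothetical, not part of any real tool.

def find_mismatches(gocdb_nodes, bdii_nodes):
    """Return the nodes present in only one of the two registries."""
    gocdb = set(gocdb_nodes)
    bdii = set(bdii_nodes)
    return {
        "in_bdii_only": sorted(bdii - gocdb),   # published in BDII, missing from GOC DB
        "in_gocdb_only": sorted(gocdb - bdii),  # in GOC DB, not published in BDII
    }

# Example with made-up node names; per the procedure above, a ticket
# would be opened with the site for each entry in either list.
mismatches = find_mismatches(
    gocdb_nodes=["ce01.example.org", "se01.example.org"],
    bdii_nodes=["ce01.example.org", "wms01.example.org"],
)
```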
<big> PPS Report & Issues </big>
PPS reports were not received from these ROCs:
AP, IT, SEE, SWE
Issues from EGEE ROCs:
Response to Issue 2 raised by CE ROC (known bugs released to Production, see below):
Almost all the patches released last week were deployed in production as soon as they were certified. The PPS phase was reduced to the pre-deployment testing (installation and configuration tests). The pre-deployment test is a PPS-internal facility, meant to protect the pre-production from the introduction of bugs; it does not yet cover all the scenarios existing in production (notably, the ClassicSE is not covered). The PPS stage of these patches (~2 weeks) was clearly not completed. This was done in order to provide the T1 sites with the software they urgently needed for CCRC08. Regarding the two bugs raised in pre-production, that was actually a communication flaw in the procedure, which was revealed by the high rate of releases. The flaw has been fixed, so the risk of this kind of glitch is reduced from now on.
There were three releases to PPS last week, mainly dealing with software needed for CCRC08; details are available in:
<big> How to handle broadcasting of <i>transparent</i> interventions?</big>
We have an automated tool (through GOC DB and the CIC Portal) for announcing downtimes of sites/services. However, we also need to announce "transparent" interventions: how should we do this?
<big> EGEE issues coming from ROC reports </big>
(ROC CERN): SE and SRM tests failed at FNAL with timeouts after 600 s. The FNAL admins propose increasing the timeout to 1200 s; a feature request has also been submitted to Savannah @ CERN for this. The SAM developers see no major issues with increasing the timeout. Other ROCs are requested to comment; otherwise the new timeout is approved. In parallel, the FNAL admins continue working with the SRM team on better handling of overloads.
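To illustrate the timeout change under discussion, here is a small hypothetical Python sketch of a probe wrapper with a configurable timeout. The name `run_probe` and the constants are inventions for this example, not part of SAM.

```python
# Hypothetical sketch of a SAM-style probe wrapper with a configurable
# timeout; the names and values merely illustrate the proposed change.
import subprocess

OLD_TIMEOUT = 600   # seconds: the current limit, reported as too short at FNAL
NEW_TIMEOUT = 1200  # seconds: the value proposed by the FNAL admins

def run_probe(cmd, timeout):
    """Run a probe command; return True on success, False on failure or timeout."""
    try:
        result = subprocess.run(cmd, timeout=timeout, capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # With timeout=OLD_TIMEOUT a slow-but-working SRM lands here;
        # raising the limit to NEW_TIMEOUT gives it time to finish.
        return False
```

A transfer that needs, say, 900 s fails under the 600 s limit but would pass under 1200 s, which is the motivation for the request.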
(ROC Central Europe): Three gLite releases in a row contain bugs resulting in SAM errors at sites just after deployment. Could we do something to avoid that?
update 10: https://gus.fzk.de/ws/ticket_info.php?ticket=31596
update 11: https://savannah.cern.ch/patch/?1654
update 12: https://gus.fzk.de/ws/ticket_info.php?ticket=31802
Two of these bugs were found in PPS but nevertheless got into the production release. Is there a problem with the release procedure between PPS and the release team?
(ROC North Europe): It is not clear how sites should proceed if they want to test services before going into production; see GGUS ticket 31311, also discussed in last week's operations meeting. My understanding has always been that services are not monitored if they are not in the GOCDB, but now it seems that the BDII information is the primary source. As far as I know, the only reason to raise alarms when services were in the BDII but not in the GOCDB was just to make one aware of this mismatch, not to raise tickets about the service itself.
So if services are monitored (and tickets raised) if they are published in BDII, then the workflow for a new service (under test) would be: 1) create entry in GOCDB, 2) put service in scheduled downtime (or monitoring off), 3) create BDII entry?
Reply to item 3 (OCC): The BDII is not the primary source of information for SAM; both GOCDB and the BDII are used. In the past, the difference between registering a site in the GOCDB and just publishing it in the BDII was exactly that - once the site was registered, a site admin could decide to turn the monitoring off for the service. The situation has recently changed: it was decided that the monitoring flag could not be used anymore by the site admins. Without going back to the reasons behind this decision, it is true that it creates a discontinuity with the usual way to operate when dealing with new services.
The suggested procedure
create entry in GOCDB
put service in scheduled downtime
create BDII entry
can work, but it is clearly a workaround to the problem. Apart from the semantics, there is a significant difference in terms of expected service level between a service 'emerging' from a scheduled downtime and a newly installed service. So it wouldn't be surprising if the site admins were unsatisfied with this solution. For the time being it seems to be the only one available though, so it is probably the one to apply.
A different, better approach has to be studied and discussed.
<big> gLite Release News</big>
There were three releases on gLite3.1 + one release on gLite3.0 last week. The released software mainly dealt with software needed for CCRC08. Details available in:
<big> WLCG issues coming from ROC reports </big>
<big>WLCG Service Interventions (with dates / times where known) </big>
GGUS ticket #31800 has been assigned to operations.
As requested at the daily CCRC08 meeting last Thursday, LHCb would like to ask each of the T1 sites involved in the CCRC exercise to set up an informative twiki (and provide the link!) giving a detailed description of their SRM-v2 implementation.
By "detailed" we mean:
the SRM endpoint setup (whether it is dedicated to LHCb or not, and what the hardware is)
which storage classes have been defined, and which resources (disk space) are allocated to each of them (the disk cache in the case of pure-tape T1D0)
the space tokens defined (and whether they are activated)
the overall status of the service (S2 test results if under test, and whether the information is correctly published in the BDII)
how to access pseudo-dynamic information about disk occupancy (for each service class if possible)
how many tape servers, how many disk servers, and so forth (are they all dedicated to LHCb, or are they shared?)
We agreed last week that this ticket should be handled by the WLCG Operations team, which will follow it up with highest priority.
Problem affects the whole VO: lhcb