SA1 coordination meeting

Name: SA1 coordination meeting
Start: 2008-12-16T10:00:00+01:00
End: 2008-12-16T12:50:00+01:00
Location: CERN

Tuesday 16 Dec 2008, 10:00 → 12:50 Europe/Zurich

28-R-06 (CERN)

Description

Minutes, https://edms.cern.ch/document/923879

Actions, https://edms.cern.ch/document/923881

This meeting is 10:00 to 11:30 UTC+1
Phone number is: +41 22 767 6000
Access code is: 0191881
Or click here: https://audioconf.cern.ch/call/0191881

The conference call opens 15 minutes before the meeting starts.

- Admin matters
  1. MSA1.7: https://edms.cern.ch/document/975596
  2. AA meeting 27-28th January: All ROC managers are invited (1 per ROC, as the max attendance is 40; if you need more places, please let me know and I'll check with the PO). Please register in advance: http://indico.cern.ch/conferenceTimeTable.py?confId=45819
  3. Dates for the EGEE-III review to be held at CERN on 24 & 25 June 2009.
  4. SA1 coordination meetings 2009: 13th January, 3rd February, 17th February, F2F meeting in Catania
- CREAM CE: replacement criteria and deployment plan
  1. Proposal to deploy 1 CREAM CE per region to get experience with it
  2. Criteria to replace the LCG-CE in production by the CREAM CE
  List of criteria
  
  slides
- Definition of core services
  
  We need a clear definition of core services so that they can be properly tagged in GOCDB, the associated downtime notifications can be cleaned up, etc.
  
  The main aim is to reduce the list of core services at "grid" and "VO" level. The "Federation" level is not that critical, because it is the ROC that gets notified, and if they want to stop this, they have the proper rights in GOCDB to change the flag themselves.
  
  A first idea would be to define simple rules to say what is NOT a core service (CEs, UIs, etc.) and systematically exclude any of these services from the "core" flag definition.
  Breakdown of core services per type of service:
  
  Services on nodes declared as GRID core:
  - 13 CE + APEL
  - 10 Site-BDII
  - 3 UI
  - 10 Classic-SE
  - 9 MON
  - 5 RB
  - 19 WMS
  - 8 VOMS
  - 3 VO-box
  - 3 FTS
  - 3 SRM
  - 6 LB
  - 1 Central-LFC
  - 2 Local-LFC
  - 3 MyProxy
  - 7 Top-BDII
  - 1 CREAM-CE
  
  Services on nodes declared as VO core:
  - 9 CE + APEL
  - 5 Site-BDII
  - 1 UI
  - 8 Classic-SE
  - 4 MON
  - 1 RB
  - 8 WMS
  - 6 VOMS
  - 8 VO-box
  - 1 FTS
  - 6 SRM
  - 4 LB
  - 10 Central-LFC
  - 12 Local-LFC
  - 3 MyProxy
  - 5 Top-BDII
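  As a rough illustration of the "exclude what is NOT a core service" idea, the sketch below applies an exclusion list to the Grid-level counts given above. The exclusion set `NON_CORE_TYPES` and the helper `apply_exclusion` are hypothetical; the proposal names only "CEs, UIs, etc.", so the exact list is an assumption and would need to be agreed.

```python
# Counts of services on nodes declared as GRID core, copied from the breakdown above.
GRID_CORE_COUNTS = {
    "CE + APEL": 13, "Site-BDII": 10, "UI": 3, "Classic-SE": 10, "MON": 9,
    "RB": 5, "WMS": 19, "VOMS": 8, "VO-box": 3, "FTS": 3, "SRM": 3, "LB": 6,
    "Central-LFC": 1, "Local-LFC": 2, "MyProxy": 3, "Top-BDII": 7, "CREAM-CE": 1,
}

# Assumption: illustrative exclusion list only. The proposal says "CEs, UIs, etc."
NON_CORE_TYPES = {"CE + APEL", "CREAM-CE", "UI"}


def apply_exclusion(counts, non_core):
    """Return only the service types that would be allowed to keep the core flag."""
    return {stype: n for stype, n in counts.items() if stype not in non_core}


remaining = apply_exclusion(GRID_CORE_COUNTS, NON_CORE_TYPES)
total = sum(GRID_CORE_COUNTS.values())
dropped = total - sum(remaining.values())
print(f"{dropped} of {total} flagged services would lose the core flag")
```

  The same function could be run on the VO-level counts to see how much each candidate rule shortens the list before it is applied in GOCDB.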
- Proposal for a process for retiring obsolete services and old versions of services
  
  document
- Last escalation step/Site suspension follow-up
  From: David Bouvet - COD-FR
  
  Context: The follow-up of the last escalation step by OCC and ROC is not done correctly. When the last step is reached, as stated in the Operational Manual, the ROC should normally discuss in private with its site, and then say at the next Weekly Operation meeting whether the site should be suspended or not. Most of the time, at the Weekly Operation meeting, the ROC says that it has to discuss, and then there is no more news. The site stays in the last escalation step for several weeks.
  
  In Operational Manual: "If no progress is made, COD make sure that OMC is informed of the situation, and the site status is set to “suspended” in GOCDB by COD unless OMC say differently."
  
  Proposed solution: As COD has rights to suspend a site, if the ROC is not present at the Weekly Operation meeting or has not sent a mail about the problem, COD suspends the site. If the ROC is present and asks for discussion with its site, OCC should put an action on the ROC in the list of actions of the Weekly Operation meeting so that it will be followed up at the next meeting. An answer or suspension by the ROC should be given within the next 3 days: as acknowledgement, a mail should be sent to both the OCC and COD mailing lists. If not, the site is suspended by COD after these 3 days.
  
  Some examples of a "long" last step:
```
* GGUS #40521: RU-Phys-SPbSU (1 month and a half)
      o 25/09/2008: last escalation step
      o 06/10/2008: raised at WLCG Ops meeting
      o 06/11/2008: still in last step and not suspended
      o 06/11/2008: Cyril L'Orphelin (COD-FR) sent mail to Maite, Steve and Nick
      o 06/11/2008: Maite sent mail to Russian ROC
      o 06/11/2008: site suspended by Russian ROC 
* GGUS #42015: ITPA-LCG2 (4 weeks)
      o 24/10/2008: last escalation step
      o 27/10/2008: raised at WLCG Ops meeting
      o 03/11/2008: raised again at WLCG Ops meeting
      o 07/11/2008: still in last step and not suspended
      o 10/11/2008: raised again at WLCG Ops meeting
      o 17/11/2008: still in last step and not suspended. ROC North is present at WLCG Ops meeting and will check with site.
      o 18/11/2008: finally fixed by site
```
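  The proposed rule can be read as a small decision procedure; the sketch below is one possible reading, not an agreed implementation. It assumes that COD suspends immediately only when the ROC is neither present nor has sent a mail, that a mail counts the same as asking for discussion at the meeting, and that "3 days" means calendar days counted from the meeting. The function name `next_step` and the returned labels are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: the 3 days are calendar days counted from the Weekly Operation meeting.
RESPONSE_WINDOW = timedelta(days=3)


def next_step(meeting_day: date, today: date, roc_present: bool,
              roc_mailed: bool, ack_received_on: Optional[date]) -> str:
    """Decide what happens to a site that is in the last escalation step."""
    # No sign of life from the ROC at all: COD suspends the site.
    if not roc_present and not roc_mailed:
        return "COD suspends site"
    deadline = meeting_day + RESPONSE_WINDOW
    # The ROC answered (or suspended the site itself) and mailed OCC and COD in time.
    if ack_received_on is not None and ack_received_on <= deadline:
        return "handled by ROC"
    # Deadline passed without acknowledgement: COD suspends the site.
    if today > deadline:
        return "COD suspends site"
    # Otherwise the action stays open in the Weekly Operation meeting action list.
    return "action open on ROC"


# Hypothetical dates, for illustration only.
meeting = date(2008, 11, 17)
print(next_step(meeting, date(2008, 11, 18), True, False, None))
print(next_step(meeting, date(2008, 11, 21), True, False, None))
```

  With a rule of this shape, a site could not remain in the last step for more than a few days after the meeting at which it was raised.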
- status/progress/plans of the RAG
  1. All agreements with sites on the amount of resources to be provided are done. The summary is as follows:
  GRIF: 39 cores, 4.5 TB
  STFC: 146 cores, 1 TB
  CYFRONET: 30 cores, 15 TB
  GR-01-AUTH: 42 cores, 6.5 TB
  
  which gives: 257 cores and 27 TB in total.
  2. The Project Office clarified that the money related to seed resources will be allocated by a contract amendment. I suppose this will affect the WBS, although there is an option that the money will be claimed by partners as effort (!). It is equally clear to the sites that they are obliged to start support earlier (on demand).
  3. The first VO to run on seed resources is na4.vo.eu-egee.org, as suggested by Cal. I have almost finished agreeing the level of support with the VO (it took some time because the VO was not fully prepared). If the required resources are confirmed, I will request the site to prepare the configuration.
  4. For other VOs, the possibility of requesting such resources is mentioned on the NA4 policies wiki. I asked the NA4 people responsible whether there were any earlier requests for resources that would still be valid, but there were none.
  5. We are ready to handle new requests submitted by GGUS ticket (according to NA4 policy). Additionally, I requested an extension to the VO cards in the CIC portal to make it possible to apply for seed resources directly while registering the VO (https://savannah.cern.ch/support/?106454). Helene promised me to handle this with high priority.
- AOB
  1. Issue from QR: Some production sites had issues in this reporting period caused by the deployment of certain gLite updates which were not properly verified but were nevertheless put into production. For clients, such problems are avoided by testing all updates locally before installing them in production at the site (regional/site certification). With other services (e.g. CE, SE, VOMS etc.) some small sites take the approach of waiting several days to see if problems are reported on common lists or through EGEE broadcasts, and only then proceed. This delays the deployment of newly released versions into production. SA1 will discuss in the next quarter how this could be solved, mainly by building on the informal process already started by some sites.
    A similar issue was raised for yesterday's ops meeting by SEE: In the light of the recent "urgent" update for the BDII / WMS released on Friday, the sites responded that they do not trust urgent m/w upgrades in general and prefer to hold back for others to try them out first. This is a rather serious issue that tends to come back every now and then, and I reckon we should try to solve it once and for all.
    Also, there are quite a few GGUS tickets related to the BDII issues:
    https://gus.fzk.de/ws/ticket_info.php?ticket=43230
    https://gus.fzk.de/ws/ticket_info.php?ticket=43578
    https://gus.fzk.de/ws/ticket_info.php?ticket=42684
  2. Next COD F2F meeting: 2 to 4 April 2009; change of format, with a training session on the regional dashboard before the meeting.
