EGEE/WLCG Operations Meeting, November 13th 2006

Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=7636

Attendees:

OSG ROCs: Absent
OSG GOC: Absent
US-ATLAS: Absent
US-CMS: Joe Kaiser and Lisa

+ EGEE ROCs
Asia-Pacific: Min Tsai
Central Europe: Malgorzata Krakowian
CERN: Alexandre Duarte, Steve Traylen, Judit Novak, David Collados, Maite Barroso, Nicholas Thackray
DECH: Clemens Koerdt, Sven Hermann, Bruno Hoeft
France: Absent
Italy: Absent
Northern Europe: Anders Selander
Russia: Lev Shamardin
South East Europe: Ioannis Liabotis
South West Europe: Gonzalo Merino
UK/I: Jeremy Coles

+ WLCG Tier-1 sites
ASGC: Min Tsai
BNL: Absent
CERN:
Fermilab: Lisa
GridKa: Sven Hermann
IN2P3: Absent
INFN: Absent
NDGF: Absent
PIC: Gonzalo Merino
RAL: Derek Cross, Matt Hodges
SARA/NIKHEF: Ron
TRIUMF: Reda

+ GGUS: Torsten Antoni

+ VOs
Alice: Patricia
ATLAS: Gilbert
BioMed: Absent
CMS: Ian Fisk
LHCb:

Feedback from last meeting

EGEE Items

+ Grid Operator on Duty (from ROC DECH (backup: ROC SWE) to ROC SWE (backup: ROC DECH))

New tickets: 24
Tickets modified: 131
- 1st email sent: 21
- 2nd email sent: 15
- Quarantined: 28
- Set to OK: 66
- Set to unsolvable: 1

# There are three tickets left from last week (2857, 3087 and 3292). We sent a mail to the CIC-on-Duty list to discuss how to proceed.
# We had some problems with the SAM tests for PPS sites (GGUS ticket 15297), and we observed that alarms were triggered for sites in downtime (e.g. AEGIS01-PHY-SCL).

No updates this week. Just one ticket remains open.

Cases to discuss:

SAM alarms raised on in-maintenance sites:
Judit: It was a bug and should now be fixed.

1. Can someone document and eradicate all the places where the host certificate is copied and chowned for use by some non-root service? This has caused problems on almost every host certificate renewal. I know of lfc, CE rgma-gin and fts. This should be done in an init.d script, NOT yaim. [TRIUMF]
Response: See https://uimon.cern.ch/twiki/bin/view/LCG/TheLCGTroubleshootingGuide#How_to_replace_host_certificates
As for the 'eradicate' part, this is harder; there are two solutions: the init script as suggested, or proxies generated regularly by a root crontab. This would be a service-specific solution in each case; we can update the yaim configuration when this is supported.
(A sketch of the init-script approach follows below.)
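Purely as an illustration of the init-script approach mentioned in the response (not an agreed implementation): a minimal sketch, in Python for concreteness, of the copy-and-chown step such a script could perform at service start-up. The service account name ('edguser') and target directory are placeholders; each service (LFC, R-GMA, FTS, ...) would need its own values.

    #!/usr/bin/env python
    # Hypothetical sketch of the "copy and chown the host certificate" step that
    # an init.d script could run at service start-up instead of relying on yaim.
    # SERVICE_USER and TARGET_DIR are placeholders and differ per service.
    import os
    import pwd
    import shutil

    HOST_CERT = "/etc/grid-security/hostcert.pem"
    HOST_KEY = "/etc/grid-security/hostkey.pem"
    SERVICE_USER = "edguser"                    # placeholder service account
    TARGET_DIR = "/etc/grid-security/edguser"   # placeholder per-service copy

    def refresh_service_cert():
        """Copy the (possibly renewed) host cert/key and chown them to the service user."""
        entry = pwd.getpwnam(SERVICE_USER)
        if not os.path.isdir(TARGET_DIR):
            os.makedirs(TARGET_DIR)
        for src, mode in ((HOST_CERT, 0o644), (HOST_KEY, 0o400)):
            dst = os.path.join(TARGET_DIR, os.path.basename(src))
            shutil.copy2(src, dst)
            os.chown(dst, entry.pw_uid, entry.pw_gid)
            os.chmod(dst, mode)

    if __name__ == "__main__":
        refresh_service_cert()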
---
2. We need to be able to connect the LCG SAM results to our FNAL monitoring in an automated way, i.e. without looking at the web pages. Is an API available so we can query the SAM test results? If so, is there documentation on how to access it? If not, do you plan on adding an API, or is there some other mechanism available for us to accomplish the same task? [FNAL]
An e-mail was sent about this issue.
David: The SAM team can provide a customized web service supplying the required information. David asked FNAL to send an e-mail with the specific query so the web service can be customized.
FNAL: Is there a generic web service to query the SAM database?
David: There is a command-line utility in the SAM client called same-query that can be used to query the database, but it is not well documented.
(A sketch of what an automated query could look like follows below.)
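Purely as an illustration (no such interface is confirmed in these minutes): assuming the SAM team exposes an HTTP query service as discussed above, an automated client on the FNAL side could look roughly like the sketch below. The endpoint URL, parameter names and site name are placeholders; the actual interface, or the same-query utility, would have to be agreed with the SAM team.

    #!/usr/bin/env python
    # Hypothetical sketch of pulling SAM test results into local monitoring over
    # HTTP instead of scraping the web pages. The endpoint and query parameters
    # are placeholders; the real interface must come from the SAM team.
    import urllib.parse
    import urllib.request

    SAM_ENDPOINT = "https://sam.example.org/sam-results"   # placeholder URL

    def fetch_sam_results(site, vo="ops"):
        """Fetch the latest SAM test results for one site and return the raw payload."""
        query = urllib.parse.urlencode({"site": site, "vo": vo})
        with urllib.request.urlopen(SAM_ENDPOINT + "?" + query) as response:
            return response.read().decode("utf-8")

    if __name__ == "__main__":
        # Feed the returned data into the local (e.g. FNAL) monitoring system.
        print(fetch_sam_results("EXAMPLE-SITE-01"))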
---
3. Ticket #14093 (DESY-ZN) has been escalating since mid-October without being worked on. Could it be updated? (central support unit) GGUS ticket #14093 [DECH]
We will contact the Data Management support unit.
---
4. PPS FZK has not been tested with SFTs for a few weeks; see GGUS ticket #14511. ROC SEE seems to be responsible for submitting PPS tests. [DECH]
DECH: Correction: SAM instead of SFTs. They should be submitted. Check with Petras, and if there is no solution assign it to the CERN ROC.
---
5. The SAM/SFT jobs are not working properly. Job submission timeout errors are cryptic and unhelpful, which complicates troubleshooting. Please test SAM/SFT properly before changing it. It also seems that SAM has switched to 100% glite-* commands for testing, which does not work for LCG 2.7.0 hosts (some are still running). [NE, SARA.nl]
Judit: It was not switched 100% to gLite. It uses edg-job-submit to submit jobs to non-gLite Computing Elements.
Nick asked SARA for more specific comments on the SAM output, to allow improvements to SAM.
Sven: Is there a GGUS ticket about this problem?
Ron will check.
---
6. Some inconsistencies have been noticed between the CIC daily reports and the scheduled downtimes declared in GOCDB. On 2006-10-08 the CIC reporting tool did not notice our scheduled downtime, which had been registered in GOCDB several days earlier. Due to this omission, several SFT failures appear in our CIC daily report. In addition, we note that for some of the SFTs (sent to our site before the start of the scheduled downtime) we pass all tests with OK, yet overall we have an error. The latest case (2006-11-09): although the scheduled downtime is now registered for today in the CIC daily report, SFT failures are still present in the report. This should not be the case. GGUS ticket created on this issue: https://gus.fzk.de/ws/ticket_info.php?ticket=15431 [SEE ROC]
It seems to be OK now.
---
7. Some sites have no SAM tests in Production for 8-9 November (and the same in PPS for 7-8 November). Is this a SAM central problem? It would be good to find a way for the SAM maintainers to "flag" periods of SAM unreliability, so that sites can see immediately in their reports that this is a SAM central problem. Summed over all the sites, this would save a large amount of time. [SWE ROC]
Judit: In Production this should not be the case. There were some problems due to the IP renumbering, but we do not know of any jobs being stuck for such long times. For PPS there were some problems with job submission, but they were corrected by Steve last week.
The ticket will be assigned to ROC CERN. The SAM team will try to find a way to notify the sites during SAM instability periods.
---
8. Several sites have seen high load on their CEs, leading to them dropping out of the information system. Dublin reports: "very high load on the CE is affecting the site's reliability; we may need to limit the maximum number of jobs from all VOs. We have been doing some local stress testing of the lcgcondor job manager, and this seems to be A cause of the problem." [UK/I]
Nick: Can you provide some suggestions to improve it?
Nick: Would taking the BDIIs out of the CE help?
Steve: Yes, it could help.
Jeremy: Would this problem disappear with the gLite CE?
Nick: We don't think so.
Sven: There was a topic at a previous meeting advising to separate the services. Jeremy will check.
---
9. RAL-LCG2 reports that "sometimes we have to wait for a SAM test's proxy to expire before another test is run". [UK/I]
Judit: It can happen if the job is stuck: you have to wait until the proxy expires and the job is aborted before another one can be submitted.
Jeremy: How will this affect the availability?
Judit: It will appear as a job abortion, which means a critical test failure.
Nick: The question is why the job is stuck.
Judit: It seems to be a problem with the RAL CE, since it is not occurring at other sites.
Sven: Is there a ticket related to this problem?
Jeremy: No. We will look to see whether it is a RAL problem; if not, we will raise a ticket.
Min: Is it possible to check whether the job failed because of an expired proxy and submit it again?
Judit: Will check with Piotr.
(A sketch of such a check follows below.)
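Purely as an illustration of the check Min asks about (not an existing SAM feature): a sketch of resubmission logic that inspects the failure reason of an aborted test job and retries only when it looks like a proxy expiry. The failure-reason strings and the submission command are assumptions; a real implementation would depend on what the WMS/Logging and Bookkeeping service actually reports.

    #!/usr/bin/env python
    # Hypothetical sketch: resubmit a SAM test job only when the previous attempt
    # appears to have been aborted because of an expired proxy. The hint strings
    # and the command line below are placeholders, not confirmed behaviour.
    import subprocess

    PROXY_HINTS = ("proxy expired", "credential expired")

    def failed_on_expired_proxy(failure_reason):
        """Heuristic check of the reported failure reason for a proxy expiry."""
        reason = failure_reason.lower()
        return any(hint in reason for hint in PROXY_HINTS)

    def maybe_resubmit(failure_reason, jdl_file, max_retries=1):
        """Resubmit the test job once if the previous attempt died on an expired proxy."""
        if not failed_on_expired_proxy(failure_reason):
            return False                 # genuine failure: leave it as a critical result
        for _ in range(max_retries):
            # Placeholder submission command; the real one depends on the CE type.
            if subprocess.call(["edg-job-submit", jdl_file]) == 0:
                return True
        return False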
---

OSG Items

OSG Handover
Some jobs were being submitted using CMS accounts/certificates, but they are not CMS jobs.
Nick: How would you know that they are not CMS jobs? ...
Discussion postponed to next week.

WLCG Items

+ WLCG Service Report (15')
Patricia: Poor efficiency for Alice in file transfers involving SARA. A ticket will be/was submitted.
Roberto: What is the status of each site with respect to the requirement to have different accounts in the VOs for production and for the experiments?

+ WLCG Service Commissioning report and upcoming activities (15') (files: document)
See new and updated information at https://twiki.cern.ch/twiki/bin/view/LCG/SC4ExperimentPlans

+ WLCG-related issues coming from experiment VOs and Tier-1/Tier-2 reports (15') (files: Tier-1 reports)
Reports were not received from Tier-1 sites: BNL; FNAL; IN2P3; NDGF; PIC; TRIUMF.
Report on INFN-T1 site problems last week.
Report on FZK-LCG2 site problems last week.
From 7 to 10 November there were no transfers for Alice to Tier-1 sites. The problem will be investigated further.
Sven: Is there a ticket about that?
Patricia: No, but it will be created.
All Tier-1s should announce their downtimes and also the end of their downtimes.

Review of Action Items

The updated list of action items can be found attached to the agenda (along with these minutes).

Action 2006-10-05-2: Decision from the TCG:
------------------------
The strategy is that we basically follow the plan that Ian outlined: we move to the (standard VDT) GT4 pre-web-service GRAM on the gLite CE; the lcg-CE will continue to be supported as is, and we hope to be able to cease that support by June. This depends on the quality of the gLite CE. People who want to configure all the standard VDT job managers on the gLite CE are free to do so, but for the moment we will not provide certification of that. We invite sites that do so to become part of the certification/pre-production service, but it is not part of the core SA3 responsibility. The practicalities of this will be discussed between the affected sites (in particular NIKHEF) and SA3, and the TCG will be kept informed. At the same time we ask CREAM whether it would be possible to expose a GT4 WS interface in addition to the CREAM one. The general policy of EGEE is to support multiple interfaces on the CE to the extent that this is feasible and required by the EGEE applications and/or EGEE sites.

AOB

Ian Neilson: There are still some sites that have not applied the security update.
Nick: The ROCs should contact their sites and strongly recommend the security update.
Sven suggests raising tickets about this against the sites.

Maria Dimou:

AOB1: Important: bug fixes in the new vomrs version 1.3.0 require ROC Managers' input.
vomrs-1.3.0 is approaching the end of its testing period. The changes it contains are listed at: https://twiki.cern.ch/twiki/bin/view/LCG/VomrsUpdateLog
Before we are able to upgrade we need your position on the following: https://savannah.cern.ch/bugs/?func=detailitem&item_id=14990 is fixed in this release. The Group/Role description is implemented as *mandatory*. If you want it to be optional, please write in the Savannah ticket or reply to all on this thread a.s.a.p. A code and database schema change might be required, depending on your answer.
In my opinion as DTEAM VO Admin, the Group name already says everything about the purpose of the Group. So, if other VOs (like CMS) wish the field to be mandatory, I would not object, but I would simply repeat a standard string like "This is a Grid site. For full info select the site name from the menu of the page https://goc.grid-support.ac.uk/gridsite/gocdb2/".

AOB2: Request to insert a link to https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatusAas in the CIC portal report for the DTEAM VO.