EGEE/WLCG Operations Meeting, November 13th 2006

Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=7636

Attendees:

OSG ROCs: Absent
OSG GOC: Absent
US-ATLAS: Absent
US-CMS: Joe Kaiser and Lisa

+ EGEE ROCs
Asia-Pacific: Min Tsai
Central Europe: Malgorzata Krakowian
CERN: Alexandre Duarte, Steve Traylen, Judit Novak, David Collados, Maite Barroso, Nicholas Thackray
DECH: Clemens Koerdt, Sven Hermann, Bruno Hoeft
France: Absent
Italy: Absent
Northern Europe: Anders Selander
Russia: Lev Shamardin
South East Europe: Ioannis Liabotis
South West Europe: Gonzalo Merino
UK/I: Jeremy Coles

+ WLCG Tier-1 sites
ASGC: Min Tsai
BNL: Absent
CERN:
Fermilab: Lisa
GridKa: Sven Hermann
IN2P3: Absent
INFN: Absent
NDGF: Absent
PIC: Gonzalo Merino
RAL: Derek Cross, Matt Hodges
SARA/NIKHEF: Ron
TRIUMF: Reda

+ GGUS: Torsten Antoni

+ VOs
Alice: Patricia
ATLAS: Gilbert
BioMed: Absent
CMS: Ian Fisk
LHCb:

Feedback from last meeting

EGEE Items

+ Grid Operator on Duty (from ROC DECH (backup: ROC SWE) to ROC SWE (backup: ROC DECH))

New tickets: 24
Tickets modified: 131
- 1st email sent: 21
- 2nd email sent: 15
- Quarantined: 28
- Set to OK: 66
- Set to unsolvable: 1

# There are three tickets left from last week (2857, 3087 and 3292). We sent a mail to the CIC-on-Duty list to discuss how to proceed.
# We had some problems with the SAM tests for PPS sites (GGUS ticket 15297), and we observed that alarms were triggered for sites in downtime (e.g. AEGIS01-PHY-SCL).

No updates this week. Just one ticket remains open.

Cases to discuss:

SAM alarms raised on in-maintenance sites:
Judit: It was a bug and should now be fixed.

1. Can someone document and eradicate all the places where the host certificate is copied and chowned for use by some non-root service? This has caused problems on almost every host certificate renewal. I know of lfc, CE rgma-gin and fts. This should be done in an init.d script, NOT yaim. [TRIUMF]
Response: See https://uimon.cern.ch/twiki/bin/view/LCG/TheLCGTroubleshootingGuide#How_to_replace_host_certificates
As for the 'eradicate' part, this is harder; there are two solutions: the init script as suggested, or proxies generated regularly by a root crontab. This would be a service-specific solution in each case; we can update the yaim configuration when this is supported.
(A sketch of the init-script approach follows below.)
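Purely as an illustration of the init-script approach mentioned in the response (not an agreed implementation): a minimal sketch, in Python for concreteness, of the copy-and-chown step such a script could perform at service start-up. The service account name ('edguser') and target directory are placeholders; each service (LFC, R-GMA, FTS, ...) would need its own values.

    #!/usr/bin/env python
    # Hypothetical sketch of the "copy and chown the host certificate" step that
    # an init.d script could run at service start-up instead of relying on yaim.
    # SERVICE_USER and TARGET_DIR are placeholders and differ per service.
    import os
    import pwd
    import shutil

    HOST_CERT = "/etc/grid-security/hostcert.pem"
    HOST_KEY = "/etc/grid-security/hostkey.pem"
    SERVICE_USER = "edguser"                    # placeholder service account
    TARGET_DIR = "/etc/grid-security/edguser"   # placeholder per-service copy

    def refresh_service_cert():
        """Copy the (possibly renewed) host cert/key and chown them to the service user."""
        entry = pwd.getpwnam(SERVICE_USER)
        if not os.path.isdir(TARGET_DIR):
            os.makedirs(TARGET_DIR)
        for src, mode in ((HOST_CERT, 0o644), (HOST_KEY, 0o400)):
            dst = os.path.join(TARGET_DIR, os.path.basename(src))
            shutil.copy2(src, dst)
            os.chown(dst, entry.pw_uid, entry.pw_gid)
            os.chmod(dst, mode)

    if __name__ == "__main__":
        refresh_service_cert()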
---
2. We need to be able to connect the LCG SAM results to our FNAL monitoring in an automated way, i.e. without looking at the web pages. Is an API available so we can query the SAM test results? If so, is there documentation on how to access it? If not, do you plan on adding an API, or is there some other mechanism available for us to accomplish the same task? [FNAL]
An e-mail was sent about this issue.
David: The SAM team can provide a customized web service supplying the required information. David asked FNAL to send an e-mail with the specific query so the web service can be customized.
FNAL: Is there a generic web service to query the SAM database?
David: There is a command-line utility in the SAM client called same-query that can be used to query the database, but it is not well documented.
(A sketch of what an automated query could look like follows below.)
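Purely as an illustration (no such interface is confirmed in these minutes): assuming the SAM team exposes an HTTP query service as discussed above, an automated client on the FNAL side could look roughly like the sketch below. The endpoint URL, parameter names and site name are placeholders; the actual interface, or the same-query utility, would have to be agreed with the SAM team.

    #!/usr/bin/env python
    # Hypothetical sketch of pulling SAM test results into local monitoring over
    # HTTP instead of scraping the web pages. The endpoint and query parameters
    # are placeholders; the real interface must come from the SAM team.
    import urllib.parse
    import urllib.request

    SAM_ENDPOINT = "https://sam.example.org/sam-results"   # placeholder URL

    def fetch_sam_results(site, vo="ops"):
        """Fetch the latest SAM test results for one site and return the raw payload."""
        query = urllib.parse.urlencode({"site": site, "vo": vo})
        with urllib.request.urlopen(SAM_ENDPOINT + "?" + query) as response:
            return response.read().decode("utf-8")

    if __name__ == "__main__":
        # Feed the returned data into the local (e.g. FNAL) monitoring system.
        print(fetch_sam_results("EXAMPLE-SITE-01"))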
---
3. Ticket #14093 (DESY-ZN) has been escalating since mid-October without being worked on. Could it be updated? (central support unit) GGUS ticket #14093 [DECH]
We will contact the Data Management support unit.
---
4. PPS FZK has not been tested with SFTs for a few weeks; see GGUS ticket #14511. ROC SEE seems to be responsible for submitting PPS tests. [DECH]
DECH: Correction: SAM instead of SFTs. They should be submitted. Check with Petras, and if there is no solution assign it to the CERN ROC.
---
5. The SAM/SFT jobs are not working properly. Job submission timeout errors are cryptic and unhelpful, which complicates troubleshooting. Please test SAM/SFT properly before changing it. It also seems that SAM has switched to 100% glite-* commands for testing, which does not work for LCG 2.7.0 hosts (some are still running). [NE, SARA.nl]
Judit: It was not switched 100% to gLite. It uses edg-job-submit to submit jobs to non-gLite Computing Elements.
Nick asked SARA for more specific comments on the SAM output, to allow improvements to SAM.
Sven: Is there a GGUS ticket about this problem?
Ron will check.
---
6. Some inconsistencies have been noticed between the CIC daily reports and the scheduled downtimes declared in GOCDB. On 2006-10-08 the CIC reporting tool did not notice our scheduled downtime, which had been registered in GOCDB several days earlier. Due to this omission, several SFT failures appear in our CIC daily report. In addition, we note that for some of the SFTs (sent to our site before the start of the scheduled downtime) we pass all tests with OK, yet overall we have an error. The latest case (2006-11-09): although the scheduled downtime is now registered for today in the CIC daily report, SFT failures are still present in the report. This should not be the case. GGUS ticket created on this issue: https://gus.fzk.de/ws/ticket_info.php?ticket=15431 [SEE ROC]
It seems to be OK now.
---
7. Some sites have no SAM tests in Production for 8-9 November (and the same in PPS for 7-8 November). Is this a SAM central problem? It would be good to find a way for the SAM maintainers to "flag" periods of SAM unreliability, so that sites can see immediately in their reports that this is a SAM central problem. Summed over all the sites, this would save a large amount of time. [SWE ROC]
Judit: In Production this should not be the case. There were some problems due to the IP renumbering, but we do not know of any jobs being stuck for such long times. For PPS there were some problems with job submission, but they were corrected by Steve last week.
The ticket will be assigned to ROC CERN. The SAM team will try to find a way to notify the sites during SAM instability periods.
---
8. Several sites have seen high load on their CEs, leading to them dropping out of the information system. Dublin reports: "very high load on the CE is affecting the site's reliability; we may need to limit the maximum number of jobs from all VOs. We have been doing some local stress testing of the lcgcondor job manager, and this seems to be A cause of the problem." [UK/I]
Nick: Can you provide some suggestions to improve it?
Nick: Would taking the BDIIs out of the CE help?
Steve: Yes, it could help.
Jeremy: Would this problem disappear with the gLite CE?
Nick: We don't think so.
Sven: There was a topic at a previous meeting advising to separate the services. Jeremy will check.
---
9. RAL-LCG2 reports that "sometimes we have to wait for a SAM test's proxy to expire before another test is run". [UK/I]
Judit: It can happen if the job is stuck: you have to wait until the proxy expires and the job is aborted before another one can be submitted.
Jeremy: How will this affect the availability?
Judit: It will appear as a job abortion, which means a critical test failure.
Nick: The question is why the job is stuck.
Judit: It seems to be a problem with the RAL CE, since it is not occurring at other sites.
Sven: Is there a ticket related to this problem?
Jeremy: No. We will look to see whether it is a RAL problem; if not, we will raise a ticket.
Min: Is it possible to check whether the job failed because of an expired proxy and submit it again?
Judit: Will check with Piotr.
(A sketch of such a check follows below.)
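Purely as an illustration of the check Min asks about (not an existing SAM feature): a sketch of resubmission logic that inspects the failure reason of an aborted test job and retries only when it looks like a proxy expiry. The failure-reason strings and the submission command are assumptions; a real implementation would depend on what the WMS/Logging and Bookkeeping service actually reports.

    #!/usr/bin/env python
    # Hypothetical sketch: resubmit a SAM test job only when the previous attempt
    # appears to have been aborted because of an expired proxy. The hint strings
    # and the command line below are placeholders, not confirmed behaviour.
    import subprocess

    PROXY_HINTS = ("proxy expired", "credential expired")

    def failed_on_expired_proxy(failure_reason):
        """Heuristic check of the reported failure reason for a proxy expiry."""
        reason = failure_reason.lower()
        return any(hint in reason for hint in PROXY_HINTS)

    def maybe_resubmit(failure_reason, jdl_file, max_retries=1):
        """Resubmit the test job once if the previous attempt died on an expired proxy."""
        if not failed_on_expired_proxy(failure_reason):
            return False                 # genuine failure: leave it as a critical result
        for _ in range(max_retries):
            # Placeholder submission command; the real one depends on the CE type.
            if subprocess.call(["edg-job-submit", jdl_file]) == 0:
                return True
        return False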
---

OSG Items

OSG Handover
Some jobs were being submitted using CMS accounts/certificates, but they are not CMS jobs.
Nick: How would you know that they are not CMS jobs? ...
Discussion postponed to next week.

WLCG Items

+ WLCG Service Report (15')
Patricia: Poor efficiency for Alice in file transfers involving SARA. A ticket will be/was submitted.
Roberto: What is the status of each site with respect to the requirement to have different accounts in the VOs for production and for the experiments?

+ WLCG Service Commissioning report and upcoming activities (15') (files: document)
See new and updated information at https://twiki.cern.ch/twiki/bin/view/LCG/SC4ExperimentPlans

+ WLCG-related issues coming from experiment VOs and Tier-1/Tier-2 reports (15') (files: Tier-1 reports)
Reports were not received from Tier-1 sites: BNL; FNAL; IN2P3; NDGF; PIC; TRIUMF.
Report on INFN-T1 site problems last week.
Report on FZK-LCG2 site problems last week.
From 7 to 10 November there were no transfers for Alice to Tier-1 sites. The problem will be investigated further.
Sven: Is there a ticket about that?
Patricia: No, but it will be created.
All Tier-1s should announce their downtimes and also the end of their downtimes.

Review of Action Items

The updated list of action items can be found attached to the agenda (along with these minutes).

Action 2006-10-05-2: Decision from the TCG:
------------------------
The strategy is that we basically follow the plan that Ian outlined: we move to the (standard VDT) GT4 pre-web-service GRAM on the gLite CE; the lcg-CE will continue to be supported as is, and we hope to be able to cease that support by June. This depends on the quality of the gLite CE. People who want to configure all the standard VDT job managers on the gLite CE are free to do so, but for the moment we will not provide certification of that. We invite sites that do so to become part of the certification/pre-production service, but it is not part of the core SA3 responsibility. The practicalities of this will be discussed between the affected sites (in particular NIKHEF) and SA3, and the TCG will be kept informed. At the same time we ask CREAM whether it would be possible to expose a GT4 WS interface in addition to the CREAM one. The general policy of EGEE is to support multiple interfaces on the CE to the extent that this is feasible and required by the EGEE applications and/or EGEE sites.

AOB

Ian Neilson: There are still some sites that have not applied the security update.
Nick: The ROCs should contact their sites and strongly recommend the security update.
Sven suggests raising tickets about this against the sites.

Maria Dimou:

AOB1: Important: bug fixes in the new vomrs version 1.3.0 require ROC Managers' input.
vomrs-1.3.0 is approaching the end of its testing period. The changes it contains are listed at: https://twiki.cern.ch/twiki/bin/view/LCG/VomrsUpdateLog
Before we are able to upgrade we need your position on the following: https://savannah.cern.ch/bugs/?func=detailitem&item_id=14990 is fixed in this release. The Group/Role description is implemented as *mandatory*. If you want it to be optional, please write in the Savannah ticket or reply to all on this thread a.s.a.p. A code and database schema change might be required, depending on your answer.
In my opinion as DTEAM VO Admin, the Group name already says everything about the purpose of the Group. So, if other VOs (like CMS) wish the field to be mandatory, I would not object, but I would simply repeat a standard string like "This is a Grid site. For full info select the site name from the menu of the page https://goc.grid-support.ac.uk/gridsite/gocdb2/".

AOB2: Request to insert a link to https://twiki.cern.ch/twiki/bin/view/LCG/LcgScmStatusAas in the CIC portal report for the DTEAM VO.