WLCG-OSG-EGEE Operations meeting

Name: WLCG-OSG-EGEE Operations meeting
Start: 2007-11-05T16:00:00+01:00
End: 2007-11-05T18:00:00+01:00
Location: CERN conferencing service (joining details below)

Monday 5 Nov 2007, 16:00 → 18:00 Europe/Zurich

28-R-15 (CERN conferencing service (joining details below))

Antonio Retico (CERN)

Description

grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:

OSG operations team

EGEE operations team

EGEE ROC managers

WLCG coordination representatives

WLCG Tier-1 representatives

other site representatives (optional)

GGUS representatives

VO representatives

To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768

NB: Reports were not received in advance of the meeting from:

ROCs: AP (apologies), IT, RU, SWE, SEE

VOs: ALICE, CMS, BioMed, LHCb

- 16:00 → 16:05
  
  Feedback on last meeting's minutes 5m
- 16:01 → 16:30
  EGEE Items 29m
  - <big> Grid-Operator-on-Duty handover </big> 5m
    
    From: UK/Ireland / SouthWesternEurope
    To: CentralEurope / Italy
    
    Issues: No particular issues to report for this hand-over
  - <big> PPS Report & Issues </big>
    
    PPS reports were not received from these ROCs:
    AP (apologies), IT, RU, SWE, SEE
    Issues from EGEE ROCs:
    
    Nothing to report
    Release News:
    
    gLite 3.1.0 PPS Update08: pre-deployment tests passed and the update was deployed to the remaining PPS sites. The release contains the following patches (patch numbers):
    
    1233 R3.1 FTS update (glite-data_R_3_1_35_1)
    1255 JobWrapper tests - new version with no R-GMA dependencies
    1381 New version of lcg-tags with better error reporting
    1382 New version of lcg-info with support for VOViews, sites and services
    1383 lcg-CE for glite 3.1
    1384 Updated Torque (2.1.9-4) and Maui (3.2.6p19-4)
    1393 gLite 3.1 TORQUE_utils (slc4/ia32)
    1394 gLite 3.1 TORQUE_server (slc4/ia32)
    1413 glite-yaim-core 4.0.1 for the 3.1 repository
    1415 glite-yaim-clients 4.0.1 for the 3.1 repository
    
    Item 2
  - <big> EGEE issues coming from ROC reports </big> 5h
    
    (NE - site RUG): a. I finally managed to solve the problem with missing accounting information. I found out that our site had not been added to the ACL of the central R-GMA repository. Apparently this ACL is not synchronised with the information in GOCDB. I think this is problematic; at the very least, it took me a lot of work to find out that this was the problem.
    Question from OCC to GOC: is registration of new sites in the R-GMA registry a step to be documented in the ROCs' registration procedures?
    b. Most of the work related to the issue above was caused by poor handling of the ticket. It was not possible for me to keep the problem in GGUS, which should have been possible. See the ticket for further details: https://gus.fzk.de/ws/ticket_info.php?ticket=27104
    Reply from GGUS: the current configuration of the GGUS system would indeed have allowed this particular ticket to be solved in a different way by the GGUS "standard" support units. Specifically, with reference to the action on 8 October, the ROC UK could have contacted the local support mailing list instead of asking the site admin to do it. On the other hand, GGUS is not supposed to replace existing support infrastructures, but rather to coordinate the support effort. Calls to local helpdesks are not forbidden by the process (and sometimes they may even accelerate the resolution of issues). Part of the added value of GGUS is to make sure that once a ticket is opened within GGUS it is actually followed up by supporters at all levels and brought to a resolution in a reasonable time, independently of any SLAs which may or may not be set up at the level of the local helpdesks.
    A nice overview of the GGUS process, with particular emphasis on the interaction with the local helpdesks (slides 4 and 8), can be found at
    http://indico.cern.ch/contributionDisplay.py?contribId=76&sessionId=24&confId=3580
    Any suggestions to improve the process are of course welcome. Suggestions and recommendations can be posted in
    https://savannah.cern.ch/support/?func=additem&group=esc
    
    (NE - site SARA) GGUS ticket 18826 was assigned to the SAM/SFT support team and confirmed on April 16, but nothing seems to be happening: from time to time the site is merely asked whether the problem still exists. See: https://gus.fzk.de/pages/ticket_details.php?ticket=18826
    
    (CERN - site CERN-PROD): the command
    glite-wms-job-status -all
    meant to retrieve all jobs of the user that are in a certain status, is not working at CERN-PROD. As a follow-up to a GGUS ticket (https://gus.fzk.de/ws/ticket_info.php?ticket=27455), the WMS service admins replied by opening a bug in Savannah (https://savannah.cern.ch/bugs/?30989 ), saying:
    
    ...
    In Chapter 6.3.2 'Job Operations', the 'gLite 3 User Guide' explains that for the -all option to work, it is necessary that an index by owner is created in the LB server; otherwise, the command will fail, since it will not be possible for the LB server to identify the user's jobs. Such an index can only be created by the LB server administrator, as explained in http://server11.infn.it/workload-grid/docs/DataGrid-01-TEN-0118-1_2.pdf . However, it seems that this feature is not enabled by yaim. We don't want to do it by hand. If we allow users to do this action, it should be configured automatically by yaim (for each user ?!?).
    ...
    I really think that the --all option is nasty since it queries the whole database, and it could slow down the performance of the LB node if several users use it at the same time. It would be nice if the developers could make some comments and decide what to do with the --all option.
    Question from OCC to site admins: Has anyone ever experienced an overload of the LB database due to concurrent use of the command glite-wms-job-status -all?
    Question from OCC to HEP VOs: Do any of the HEP VOs plan to make extensive use of this feature?
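
To illustrate why a per-owner query depends on an index by owner, and why it is expensive without one, here is a minimal Python sketch. It is not the LB server's implementation: the `Job` record, the `JobStore` class and its methods are hypothetical names chosen purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    owner: str
    status: str


class JobStore:
    """Toy job database (hypothetical, not the real LB schema)."""

    def __init__(self, index_by_owner=False):
        self.jobs = []
        # Optional secondary index: owner -> list of that owner's jobs.
        self.owner_index = defaultdict(list) if index_by_owner else None
        self.rows_examined = 0  # number of records touched by queries

    def add(self, job):
        self.jobs.append(job)
        if self.owner_index is not None:
            self.owner_index[job.owner].append(job)

    def jobs_of(self, owner):
        """Return all jobs belonging to `owner`."""
        if self.owner_index is not None:
            found = self.owner_index.get(owner, [])
            self.rows_examined += len(found)
            return list(found)
        # No index: every record in the database has to be examined.
        self.rows_examined += len(self.jobs)
        return [job for job in self.jobs if job.owner == owner]


def build(index_by_owner):
    """Fill a store with 1000 jobs spread evenly over 100 users."""
    store = JobStore(index_by_owner)
    for i in range(1000):
        store.add(Job(f"job-{i}", f"user{i % 100}", "Done"))
    return store


scan = build(index_by_owner=False)
indexed = build(index_by_owner=True)
assert scan.jobs_of("user7") == indexed.jobs_of("user7")
print(scan.rows_examined, indexed.rows_examined)
```

In this toy setup the unindexed store examines all 1000 records to answer one user's query, while the indexed store examines only that user's 10 jobs. Several users issuing the query at the same time would multiply the full scans, which is the kind of load the service admins are worried about.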
    (CE - IRB) ATLAS VO VOView problem. There are sites from CE listed as failing for which the reason is not clear, e.g. egee.irb.hr. Could the ATLAS VO provide the source code of the script which generates the list of failing sites? That would help us to understand what the problem with these sites is.
    Reply (Simone Campana from ATLAS):
    The code is in
    /afs/cern.ch/user/c/campanas/public/VOVIEW
    There is a .txt file with a query you should run on the BDII to gather the relevant info. In addition there is a Python script which fetches info from the output of the query and generates an output, where sites marked with ==> are the problematic ones.
    I did re-run the query and publish the results on
    http://voatlas01.cern.ch/atlas/data/VOViewProblem.log
    The site mentioned above is not in the list, so it looks fine. There are currently 65 queues with problems. It would be nice if some action could be taken: the situation has persisted for more than a month.
    (CE - CYFRONET) CIC SD notification. It seems that some downtime notifications are wrongly addressed, e.g. admins of the CYFRONET-IA64 site received a notification about the T2_Estonia downtime. It would be helpful if the tool stated why it considers the recipient "relevant" for the message. That would help to find out whether the notification was wrongly addressed by the SD submitter or whether this is a bug in the SD web form.
    Comment (OCC): Question moved to CIC Portal. Point addressed later in this agenda.
    (CE) Comment from CE sites about expiration of the SL3 version of a service after the SL4 version is released. SA3 proposed a period of 1 month, which seems reasonable provided there are no issues with the SL4 service that prevent migration. We suggest a procedure in which sites can raise issues that prevent them from migrating to the SL4 version of the service, and which extends the SL3 expiration date until the issues with SL4 are solved. The procedure could be as follows:
    
    SA3 announces expiration of SL3 service version
    sites have some time e.g. one week to raise issues with SL4
    if no issues the expiration is accepted and in 1 month time SL3 version expires
    if there were issues they need to be addressed by SA3 and expiration of SL3 is extended
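
The proposed steps can be sketched as a small decision function. This is only an illustration under stated assumptions: the proposal does not say from which date the month is counted, so the sketch counts it from the announcement, takes "1 month" as 30 days and "one week" as 7 days, and uses the hypothetical name `sl3_expiry`.

```python
from datetime import date, timedelta

OBJECTION_WINDOW = timedelta(days=7)  # "e.g. one week" in the proposal
GRACE_PERIOD = timedelta(days=30)     # assumption: "1 month" taken as 30 days


def sl3_expiry(announced, issue_dates):
    """Return the SL3 expiry date, or None while SL4 issues block it.

    announced   -- date on which SA3 announces the SL3 expiration
    issue_dates -- dates on which sites raised issues with the SL4 version
    """
    deadline = announced + OBJECTION_WINDOW
    # Only issues raised within the objection window block the expiration.
    blocking = [d for d in issue_dates if announced <= d <= deadline]
    if blocking:
        # Expiration is extended until SA3 has addressed the issues.
        return None
    # Assumption: the grace period is counted from the announcement date.
    return announced + GRACE_PERIOD


announcement = date(2007, 11, 5)
print(sl3_expiry(announcement, []))
print(sl3_expiry(announcement, [date(2007, 11, 8)]))
```

With no issues raised, the first call gives an expiry 30 days after the announcement. The second call returns `None`, meaning the expiration stays open until the reported issue is resolved.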
    
    Comment (OCC): Suggestion forwarded to SA3. SA3 accepts the suggestion and agrees with it in general.
    (UKI - RAL T1) There have been a few recent cases where the gstat page shows a site in maintenance while the GOCDB does not have any listed downtime (for one example see the RAL Tier-1 case in GGUS ticket 28520).
    Reply (gstat): There is a problem with the maintenance module for gstat. We have disabled it for now so we can fix it and put it back into production. This stems from a problem with the way we formulate the SQL query to GOCDB.
    Several sites have seen recent SAM sft-job failures relating to downloading from RBs. Errors like "Cannot download X from gsiftp://rb115.cern.ch" where X is usually .BrokerInfo or the tarball of SAM tests are being seen. Is this evident in other EGEE regions and what is behind it?
    (IT - CNAF) Error related to FTA_GLOBAL_DB_PASSWORD in site-info.def. (Very) likely reason: a bug in yaim, which does not change the password in /etc/tomcat5/Catalina/localhost/glite-data-transfer-fts.xml according to the variable FTA_GLOBAL_DB_PASSWORD.
    Question (OCC): Which version of FTS? Was a bug reported for that?
    (IT - CNAF) Answer: it was only reported to the FTS developers. A bug will be opened no later than tomorrow.
    
    (IT - CNAF) SAM tests: we had a configuration problem with a CE (ce05-lcg.cr.cnaf.infn.it), some SAM tests failed, and FCR (Freedom of Choice for Resources, used by CMS) removed the CE from the BDII. Even though we fixed the CE quickly, it was considered down for many hours. It seems that the "JS" result took too long to be correctly published after all the individual tests had succeeded. Because of this, the old JS ERROR status was still in effect, causing FCR to exclude our CE even though everything was OK. Is there a possible workaround for this? Has this already been raised by another site/ROC?
  - <big> gLite Release News</big> 15m
  - <big>Additional points and clarifications about the new SD procedure</big> 5m
    
    Speaker: Gilles Mathieu (IN2P3/CNRS Computing Centre, Lyon, France)
  - <big> SAM: new security test in Validation </big> 15m
    
    A new security test is in Validation. What the test does:
    
    reads all the environment variables in the WN account where the job runs.
    for each directory/file specified in any variable, and for each file inside any of those directories:
    
    the test checks whether the file/dir is writable by "other" (the position marked X in the mode string --------X-).
    if such a file/dir is in $PATH, it returns an ERROR
    if such a file/dir is not in $PATH, it returns a WARNING
    if no 'w' privilege was found on any of the files/dirs, the test returns OK.
    
    Validation instance of SAM portal is available here:
    https://lcg-sam-val.cern.ch:8443/sam-val/sam.py
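
The logic of the security test described above can be sketched in Python as follows. This is not the SAM test's source code: the function names are hypothetical, and the sketch assumes that "in $PATH" means the item is either a `PATH` entry itself or a file directly inside one, and that only absolute, existing paths are examined.

```python
import os
import stat


def world_writable(path):
    """True if `path` exists and has the write bit set for 'other'."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False
    return bool(mode & stat.S_IWOTH)


def candidate_paths(env):
    """Yield every existing file/dir named in any variable, plus the
    entries directly inside each such directory."""
    for value in env.values():
        for raw in value.split(os.pathsep):
            if not raw or not os.path.isabs(raw):
                continue  # assumption: only absolute paths are examined
            entry = os.path.normpath(raw)
            if not os.path.exists(entry):
                continue
            yield entry
            if os.path.isdir(entry):
                try:
                    children = os.listdir(entry)
                except OSError:
                    continue  # unreadable directory: skip its contents
                for child in children:
                    yield os.path.join(entry, child)


def check_environment(env):
    """Return 'ERROR', 'WARNING' or 'OK' following the rules listed above."""
    path_dirs = {
        os.path.normpath(p)
        for p in env.get("PATH", "").split(os.pathsep)
        if p
    }
    result = "OK"
    for path in candidate_paths(env):
        if not world_writable(path):
            continue
        if path in path_dirs or os.path.dirname(path) in path_dirs:
            return "ERROR"    # writable by 'other' and reachable via $PATH
        result = "WARNING"    # writable by 'other' but outside $PATH
    return result
```

Calling `check_environment(os.environ)` on a worker node would then report the worst finding across all variables. Note that under these assumptions a variable pointing at a shared scratch area such as `/tmp` would typically yield a WARNING.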
- 16:30 → 17:00
  WLCG Items 30m
  - <big> Tier 1 reports </big>
    
    Item 1
  - <big> WLCG issues coming from ROC reports </big>
  - <big>WLCG Service Interventions (with dates / times where known) </big>
    
    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
    
    [Announcement] FZK (Tier1 GridKa): Scheduled downtime for maintenance on November 6, 8:00-22:00 UTC (9:00-23:00 CET). Upgrade to dCache 1.8. - All VO's using the GridKa SE are affected. Data transfers are stopped during this period.
    
    Time at WLCG T0 and T1 sites.
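
As a quick cross-check of the UTC and CET times quoted in the GridKa announcement above, the conversion can be reproduced with a short Python sketch. It assumes CET can be modelled as a fixed UTC+1 offset, which holds in November when no daylight saving time applies.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: CET modelled as a fixed UTC+1 offset (valid in November, no DST).
CET = timezone(timedelta(hours=1), "CET")

# Downtime window as announced, in UTC.
start_utc = datetime(2007, 11, 6, 8, 0, tzinfo=UTC)
end_utc = datetime(2007, 11, 6, 22, 0, tzinfo=UTC)

# Convert both ends of the window to CET.
start_cet = start_utc.astimezone(CET)
end_cet = end_utc.astimezone(CET)

print(f"{start_cet:%H:%M}-{end_cet:%H:%M} {start_cet.tzname()}")  # 09:00-23:00 CET
print(end_utc - start_utc)  # 14:00:00
```

The result matches the announcement: the 14-hour window from 8:00 to 22:00 UTC corresponds to 9:00 to 23:00 CET.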
  - <big>FTS service review</big> 5m
    
    Please read the report linked to the agenda.
    In particular ?
    
    Speakers: Gavin McCance (CERN), Steve Traylen
    
    Paper
  - <big> ATLAS service </big> 5h
    
    See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.
  - <big>CMS service</big>
    
    No Report Given
    
    Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
  - <big> LHCb service </big>
    
    one
    
    Speaker: Dr Roberto Santinelli (CERN/IT/GD)
  - <big> ALICE service </big>
    
    No Report Given.
    
    Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
  - <big> WLCG Service Coordination </big>
    
    WLCG Service Reliability workshop, CERN, November 26 - 30 - agenda - wiki
    Common Computing Readiness Challenge - CCRC'08 - Meetings schedule
    CMS CSA07 has been extended till mid-November.
    The ATLAS M5 detector cosmics run has started and will run until 5 November. Data for reconstruction and export are not expected until later this week.
    
    Speaker: Harry Renshall / Jamie Shiers
- 16:55 → 17:00
  OSG Items 5m
  Speaker: Rob Quick (OSG - Indiana University)
- 17:00 → 17:05
  
  Review of action items 5m
  
  list of actions
- 17:10 → 17:15
  AOB 5m
  - .
