WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Antonio Retico (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs: AP (apologies), IT, RU, SWE, SEE
  • VOs: Alice, CMS, BioMed, LHCb
  • list of actions
    Minutes
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover 5m
          From: UK/Ireland / SouthWesternEurope
          To: CentralEurope / Italy


          Issues: No particular issues to report for this hand-over
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP (apologies), IT, RU, SWE, SEE

          Issues from EGEE ROCs:
          1. Nothing to report
          Release News:
          • gLite 3.1.0 PPS Update08: pre-deployment tests passed and the update was deployed to the remaining PPS sites. The release contains (patch numbers):
            • 1233 R3.1 FTS update (glite-data_R_3_1_35_1)
            • 1255 JobWrapper tests - new version with no R-GMA dependencies
            • 1381 New version of lcg-tags with better error reporting
            • 1382 New version of lcg-info with support for VOViews, sites and services
            • 1383 lcg-CE for glite 3.1
            • 1384 Updated Torque (2.1.9-4) and Maui (3.2.6p19-4)
            • 1393 gLite 3.1 TORQUE_utils (slc4/ia32)
            • 1394 gLite 3.1 TORQUE_server (slc4/ia32)
            • 1413 glite-yaim-core 4.0.1 for the 3.1 repository
            • 1415 glite-yaim-clients 4.0.1 for the 3.1 repository
        • EGEE issues coming from ROC reports 5h
          1. (NE - site RUG): a. I finally managed to solve the problem with missing accounting information. I found out that our site had not been added to the ACL of the central R-GMA repository. Apparently this ACL is not synchronised with the information in GOCDB. I think this is problematic; at the very least, it took me a lot of work to find out that this was the problem.
            Question from OCC to GOC: should registration in the R-GMA registry for new sites be documented as a step in the ROC registration procedures?
            b. Most of the work related to the issue above was caused by poor handling of the ticket. It was not possible for me to keep the problem in GGUS, which should have been possible. See the ticket for further details: https://gus.fzk.de/ws/ticket_info.php?ticket=27104
            Reply from GGUS: the current configuration of the GGUS system would indeed have allowed this particular ticket to be solved in a different way by the GGUS "standard" support units. Specifically, with reference to the action on 8 October, the ROC UK could have contacted the local support mailing list instead of asking the site admin to do it. On the other hand, GGUS is not supposed to replace existing support infrastructures, but rather to coordinate the support effort. Calls to local helpdesks are not forbidden by the process (and sometimes they may even accelerate the resolution of issues). Part of the added value of GGUS is to make sure that once a problem ticket is opened within GGUS it is actually followed up by supporters at all levels and brought to a resolution in a reasonable time, independently of any SLAs which may or may not be set up at the level of the local helpdesks.
            A nice overview of the GGUS process, with particular emphasis on the interaction with the local helpdesks (Slides 4, 8), can be found at
            http://indico.cern.ch/contributionDisplay.py?contribId=76&sessionId=24&confId=3580
            Any suggestions to improve the process are of course welcome. Suggestions and recommendations can be posted in
            https://savannah.cern.ch/support/?func=additem&group=esc
          2. (NE - site SARA) GGUS ticket 18826 was assigned to the SAM/SFT support team and confirmed on April 16, but nothing seems to happen; the site is only asked from time to time whether the problem still exists. See: https://gus.fzk.de/pages/ticket_details.php?ticket=18826
          3. (CERN - site CERN-PROD): the command
            glite-wms-job-status -all
            meant to retrieve all jobs of the user that are in a certain status, is not working at CERN-PROD. As a follow-up to a GGUS ticket (https://gus.fzk.de/ws/ticket_info.php?ticket=27455), the WMS service admins replied by opening a bug in Savannah (https://savannah.cern.ch/bugs/?30989), saying:

            ...
            In Chapter 6.3.2 'Job Operations', the 'gLite 3 User Guide' explains that for the -all option to work, it is necessary that an index by owner is created in the LB server; otherwise, the command will fail, since it will not be possible for the LB server to identify the user's jobs. Such an index can only be created by the LB server administrator, as explained in http://server11.infn.it/workload-grid/docs/DataGrid-01-TEN-0118-1_2.pdf However, it seems that this feature is not enabled by yaim. We don't want to do it by hand. If we allow users to do this action, it should be configured automatically by yaim (for each user ?!?).
            ...
            I really think that the --all option is nasty since it queries the whole database, and it could slow down the performance of the LB node if several users use it at the same time. It would be nice if the developers could make some comments and decide what to do with the --all option.
            Question from OCC to site admins: Has anyone ever experienced an overload of the LB database due to concurrent use of the command glite-wms-job-status -all?
            Question from OCC to HEP VOs: Do any of the HEP VOs plan to make extensive use of this feature?
          4. (CE - IRB) ATLAS VO VOView problem. Some sites from CE are listed as failing for reasons that are not clear, e.g. egee.irb.hr. Could the ATLAS VO provide the source code of the script which generates the list of failing sites? That would help us understand what the problem is with these sites.
            Reply (Simone Campana from Atlas):
            The code is in
            /afs/cern.ch/user/c/campanas/public/VOVIEW
            There is a .txt file with a query you should run on the BDII to gather the relevant info. In addition, there is a Python script which fetches info from the output of the query and generates an output in which sites marked with ==> are the problematic ones.
            I re-ran the query and published the results at
            http://voatlas01.cern.ch/atlas/data/VOViewProblem.log
            The site mentioned above is not in the list, so it looks fine. There are currently 65 queues with problems. It would be nice if some action could be taken; the situation has persisted for more than a month.
          5. (CE - CYFRONET) CIC SD notification. It seems that some downtime notifications are wrongly addressed, e.g. admins of the CYFRONET-IA64 site received a notification about the T2_Estonia downtime. It would be helpful if the tool stated why it considers the recipient "relevant" for the message. That would help to find out whether the notification was wrongly addressed by the SD submitter or whether this is a bug in the SD web form.
            Comment (OCC): Question moved to CIC Portal. Point addressed later in this agenda.
          6. (CE) Comment from CE sites about the expiration of an SL3 service after its SL4 version is released. SA3 proposed a period of 1 month, which seems reasonable provided there are no issues with the SL4 service that prevent migration. We suggest a procedure in which sites can raise issues that prevent them from migrating to the SL4 version of the service, and which extends the SL3 expiration date until the issues with SL4 are solved. The procedure could be as follows:
            1. SA3 announces the expiration of the SL3 service version
            2. sites have some time, e.g. one week, to raise issues with SL4
            3. if no issues are raised, the expiration is accepted and the SL3 version expires in 1 month's time
            4. if issues are raised, they need to be addressed by SA3 and the expiration of SL3 is extended
            Comment (OCC): Suggestion forwarded to SA3. SA3 accepts the suggestion and in general agrees with it.
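The procedure proposed in item 6 can be sketched as a small decision function. This is only an illustration under stated assumptions: the month is approximated as 30 days counted from the announcement, issues raised after the one-week window are treated as non-blocking, and the names `sl3_expiry`, `ISSUE_WINDOW` and `EXPIRY_DELAY` are hypothetical. None of this is an agreed SA3 policy.

```python
from datetime import date, timedelta

# Assumed values taken from the proposal text ("e.g. one week", "1 month").
ISSUE_WINDOW = timedelta(days=7)
EXPIRY_DELAY = timedelta(days=30)  # assumption: one month approximated as 30 days


def sl3_expiry(announced, issue_dates):
    """Return the SL3 expiry date, or None if the expiry is extended.

    announced   -- date on which SA3 announces the expiration (step 1)
    issue_dates -- dates on which sites raised issues with the SL4 version
    """
    deadline = announced + ISSUE_WINDOW
    # Step 2: only issues raised inside the window block the expiry (assumption).
    blocking = [d for d in issue_dates if announced <= d <= deadline]
    if blocking:
        # Step 4: SA3 must address the issues; no expiry date is fixed yet.
        return None
    # Step 3: no issues raised, so the expiry is accepted.
    return announced + EXPIRY_DELAY
```

For example, `sl3_expiry(date(2007, 11, 1), [])` yields a fixed expiry date, while an issue raised on 5 November returns `None`, meaning the expiry stays open until SA3 resolves it.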
          7. (UKI - RAL T1) There have been a few recent cases where the gstat page shows a site in maintenance while the GOCDB does not have any listed downtime (for one example see the RAL Tier-1 case in GGUS ticket 28520).
            Reply (gstat): There is a problem with the maintenance module for gstat. We have disabled it for now so we can fix it and put it back into production. This stems from a problem with the way we formulate the SQL query to GOCDB.
          8. Several sites have seen recent SAM sft-job failures relating to downloading from RBs. Errors like "Cannot download X from gsiftp://rb115.cern.ch" where X is usually .BrokerInfo or the tarball of SAM tests are being seen. Is this evident in other EGEE regions and what is behind it?
          9. (IT - CNAF) Error with FTA_GLOBAL_DB_PASSWORD in site-info.def. (Very) possible reason: a bug in yaim that does not change the password in /etc/tomcat5/Catalina/localhost/glite-data-transfer-fts.xml according to the variable FTA_GLOBAL_DB_PASSWORD.
            Question (OCC): Which version of FTS? Was a bug reported for that?
            (IT - CNAF) answer: only reported to the FTS developers. A bug will be opened no later than tomorrow.
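A site wanting to check whether it is affected by the suspected yaim problem in item 9 could compare the two files directly. The sketch below assumes that site-info.def contains a shell-style `FTA_GLOBAL_DB_PASSWORD=value` line and that a correctly propagated password appears verbatim (or XML-escaped) somewhere in the Tomcat context file. The function names are hypothetical and are not part of yaim or FTS, and trailing shell comments on the assignment line are not handled.

```python
import re
from xml.sax.saxutils import escape


def read_site_info_value(text, name):
    """Return the value of the last NAME=value assignment in shell-style text."""
    pattern = re.compile(
        r"^[ \t]*(?:export[ \t]+)?" + re.escape(name) + r"=(.*)$", re.MULTILINE
    )
    matches = pattern.findall(text)
    if not matches:
        return None
    value = matches[-1].strip()
    # Strip one pair of matching surrounding quotes, if present.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    return value


def password_propagated(site_info_text, context_xml_text,
                        name="FTA_GLOBAL_DB_PASSWORD"):
    """Return True if the configured password occurs in the context file text."""
    value = read_site_info_value(site_info_text, name)
    if not value:
        raise ValueError(name + " is not set in the site configuration text")
    # Accept the raw value or its XML-escaped form (&, <, > and double quotes).
    escaped = escape(value, {'"': "&quot;"})
    return value in context_xml_text or escaped in context_xml_text
```

In practice the two arguments would be the contents of site-info.def and of /etc/tomcat5/Catalina/localhost/glite-data-transfer-fts.xml, read on the FTS node. A `False` result would be consistent with the behaviour reported by CNAF, but it does not prove that yaim is the cause.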
          10. (IT - CNAF) SAM tests: we had a configuration problem with a CE (ce05-lcg.cr.cnaf.infn.it), some SAM tests failed, and FCR (Freedom of Choice for Resources, used by CMS) removed the CE from the BDII. Even though we fixed the CE quickly, it was considered down for many hours; it seems that the "JS" result took too long to be published correctly after all the individual tests had succeeded. Because of this, the old JS ERROR status remained in effect, causing FCR to exclude our CE even though everything was OK. Is there a possible workaround for this? Has this already been raised by another site or ROC?
        • gLite Release News 15m
        • Additional points and clarifications about the new SD procedure 5m
          Speaker: Gilles Mathieu (IN2P3/CNRS Computing Centre, Lyon, France)
        • SAM: new security test in Validation 15m
          A new security test is in Validation. The test does the following:
          1. reads all the environment variables in the WN account where the job runs.
          2. for each directory/file specified in any variable, and for each file inside any of those directories:
            • the test checks whether the file/dir is writable by "other" (the --------w- permission bit).
            • if a writable file/dir is in $PATH, the test returns an ERROR
            • if a writable file/dir is not in $PATH, the test returns a WARNING
            • if no 'w' permission was found on any of the files/dirs, the test returns OK.
          The Validation instance of the SAM portal is available here:
          https://lcg-sam-val.cern.ch:8443/sam-val/sam.py
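The logic of the security test described above can be sketched in Python as follows. This is not the SAM test itself but a minimal illustration under assumptions: "in $PATH" is read as "is a $PATH directory or sits directly inside one", directories are inspected one level deep only, and the function names are hypothetical.

```python
import os
import stat


def is_other_writable(path):
    """Return True if `path` exists and has the other-write bit set."""
    try:
        return bool(os.stat(path).st_mode & stat.S_IWOTH)
    except OSError:
        return False


def paths_from_environment(environ):
    """Yield each existing absolute path named in any variable, plus the
    entries directly inside every directory found (one level deep)."""
    for value in environ.values():
        for candidate in value.split(os.pathsep):
            if not os.path.isabs(candidate) or not os.path.exists(candidate):
                continue
            yield candidate
            if os.path.isdir(candidate):
                try:
                    names = os.listdir(candidate)
                except OSError:
                    continue  # unreadable directory: skip its contents
                for name in names:
                    yield os.path.join(candidate, name)


def check_environment(environ):
    """Return (status, findings); status is "OK", "WARNING" or "ERROR"."""
    path_dirs = {
        os.path.realpath(d)
        for d in environ.get("PATH", "").split(os.pathsep)
        if d
    }
    status = "OK"
    findings = []
    for path in paths_from_environment(environ):
        if not is_other_writable(path):
            continue
        real = os.path.realpath(path)
        # Assumption: "in $PATH" means a PATH directory or a file directly in one.
        in_path = real in path_dirs or os.path.dirname(real) in path_dirs
        findings.append(("ERROR" if in_path else "WARNING", path))
        if in_path:
            status = "ERROR"
        elif status == "OK":
            status = "WARNING"
    return status, findings
```

Calling `check_environment(dict(os.environ))` on a worker node would give the overall status together with the list of offending paths. The worst finding determines the status, so a single world-writable entry in $PATH is enough to produce ERROR.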
      • 16:30 17:00
        WLCG Items 30m
        • Tier 1 reports
        • WLCG issues coming from ROC reports
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
          1. [Announcement] FZK (Tier1 GridKa): Scheduled downtime for maintenance on November 6, 8:00-22:00 UTC (9:00-23:00 CET). Upgrade to dCache 1.8. All VOs using the GridKa SE are affected. Data transfers are stopped during this period.

          Time at WLCG T0 and T1 sites.

        • FTS service review 5m

          Please read the report linked to the agenda.
          In particular ?

          Speakers: Gavin McCance (CERN), Steve Traylen
          Paper
        • CMS service
          • No Report Given
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          • one
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          • No Report Given.
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • WLCG Service Coordination
          • WLCG Service Reliability workshop, CERN, November 26 - 30 - agenda - wiki
          • Common Computing Readiness Challenge - CCRC'08 - Meetings schedule
          • CMS CSA07 has been extended till mid-November.
          • The ATLAS M5 detector cosmics run has started and will run until 5 November. Data for reconstruction and export are not expected until later this week.
          Speaker: Harry Renshall / Jamie Shiers
      • 16:55 17:00
        OSG Items 5m
          1. Discussion of open tickets for OSG.
          2. https://gus.fzk.de/pages/download_escalation_reports_roc.php
          3. A change in VOMS (announced in several EGEE broadcasts) caused OSG some distress.
          4. Should goc@opensciencegrid.org be added to the EGEE broadcast list?
        Speaker: Rob Quick (OSG - Indiana University)
      • 17:00 17:05
        Review of action items 5m
        list of actions
      • 17:10 17:15
        AOB 5m
        • .