WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768

    NB: Reports were not received in advance of the meeting from:

  • ROCs: France, Russia, SWE
  • VOs: NO VO REPORTS WERE RECEIVED THIS WEEK
  • list of actions
    Minutes
      • 4:00 PM 4:00 PM
        Feedback on last meeting's minutes
        Minutes
      • 4:01 PM 4:30 PM
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: France / UK-I
          To: CERN / CE


          Issues:
          1. *Occasional difficulties reaching the SAM web interface* https://gus.fzk.de/pages/ticket_details.php?ticket=33045
            ANSWER: For the moment the SAM portal is quite functional. Judit is now working on using better SQL queries in the SAM portal.
          2. *Last escalation step reached for site YerPhI* https://gus.fzk.de/pages/ticket_details.php?ticket=26634
            Answer from site: "Hi, The SAM tests failures are caused by known issue in the SE software (https://gus.fzk.de/pages/ticket_details.php?ticket=30752). We have been advised to upgrade to the latest certified version. Currently we are trying to do that. Artem"
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, FR, IT, RU, SEE, SWE


          Re-organisation of the PPS:

          A re-organisation of the PPS is in progress with the aim of:
          • making the service more suitable for use by the HEP VO
          • extending the scope of the pre-deployment testing
          In this context, a spreadsheet containing a summary inventory of the services and activities currently run within PPS was prepared and distributed to the PPS sites.
          The Service Inventory, based on information available on the GOC DB, can be found in
          www.cern.ch/pps/index.php?dir=./site/
          PPS sites and EGEE ROCs are kindly invited to provide feedback and corrections to the spreadsheet within the next week. The information collected will be used later as an assessment point for the re-organisation.
          The contact point to be used for feedback is the list
          pps-support@cern.ch

          Issues from EGEE ROCs:
          1. ROC CE: There is a possible bug in the latest lcg_util (lcg_util-1.6.8-1.slc4). See https://gus.fzk.de/pages/ticket_details.php?ticket=33262
            All three PPS sites from ROC CE have problems with lcg-cr.

          Release News:
          1. gLite 3.1.0 PPS Update 19 was released to production and it is now in pre-deployment testing:
            • WN 3.1 for SL4 64-bit
            • glite-LSF_utils
            • lcg-vomscerts-4.8.0, which adds the next certificate for biomed + egeode
            • new version of lcg-ManageVOTag fixing bug #31848
            Release notes in:
            https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update19
          2. A new update, gLite 3.1.0 PPS Update 20, is in preparation.
            This update will introduce the MONBOX on the 3.1 baseline (for SLC4).
        • EGEE issues coming from ROC reports
          1. (ROC DECH): A GGUS ticket was opened because of JS failing SAM. The reason was that the test jobs hit the resource limit of the queue. SAM jobs need to be submitted with proper requirements. (DESY-HH)
          2. (ROC DECH): Problems with "larger" input sandboxes on the WMS, where larger means 10MB or more. The WMS is configured to accept up to 100MB. Jobs go into the running state but the input sandbox does not arrive on the WN. GGUS ticket: 33136 (DESY-HH)
          3. (ROC DECH): Problems with publishing accounting data: APEL claims that data have been missing for quite some time (since Oct 2007), although we have published since then, perhaps not everything. The resulting large number of accounting records cannot be published; publishing fails with: Exception in thread "main" java.lang.OutOfMemoryError. We are getting tired of spending a lot of effort on this every other week. (DESY-HH)
            ANSWER:
            ** Has this been reported through GGUS? **
            a) I was aware of a long-standing discussion between DESY and Dave but I cannot see an open GGUS ticket.
            b) DESY run multiple CEs and there have been problems with such sites but we have successfully got most of them running. We are also working with YAIM people to make configuring multiple CEs easier.
            c) The problem of catching up on a large number of job records is recognised and a solution is being researched (but see d). Meanwhile, Cristina has a manual method of helping sites catch up by inserting their data directly into the database. I am afraid it requires her manual intervention, so it will have to wait. Following CERN's example, I recommend that sites which run a lot of jobs publish more than once per day.
            d) The Gap Publisher allows a site to publish for a specified time interval. This is designed to help sites fill gaps when publishing failed but can also be used to reduce the number of records published at once and thus reduce memory problems.
          4. (ROC UKI): Why is MON not yet supported on SL4? This seems odd, as it is Java-based!
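The approach described in answer (d) to item 3 above, publishing a backlog in smaller time intervals so that fewer records are handled at once, can be illustrated with a minimal sketch. This is not the Gap Publisher's interface: the function name `split_interval`, the `max_days` parameter and the example dates are assumptions made purely for illustration. It only shows how a long gap could be cut into consecutive windows, each of which would then be published separately.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def split_interval(start: date, end: date, max_days: int) -> Iterator[Tuple[date, date]]:
    """Yield consecutive (first_day, last_day) windows covering start..end inclusive.

    Hypothetical helper: each window spans at most max_days days.
    """
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    if end < start:
        raise ValueError("end must not be before start")
    window_start = start
    while window_start <= end:
        # Clamp the last window so it never runs past the end of the gap.
        window_end = min(window_start + timedelta(days=max_days - 1), end)
        yield window_start, window_end
        window_start = window_end + timedelta(days=1)


if __name__ == "__main__":
    # Assumed example gap, cut into one-week windows.
    for first, last in split_interval(date(2007, 10, 1), date(2008, 2, 25), max_days=7):
        print(f"publish gap from {first} to {last}")
```

A smaller `max_days` means fewer records per publishing run, at the cost of more runs; the right window size depends on how many jobs the site runs per day.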
        • gLite Release News
          An update to gLite (3.1 Update 15) will be released very soon (today), containing the new certificate of the VOMS server for the VOs biomed and egeode.
        • Support for gLite 3.0 services 5m
          We plan to stop issuing updates for the following gLite 3.0 services:
          SERVICE - 3.1 RELEASE DATE
          glite-BDII - 21/11/07
          lcg-CE - 12/11/07
          glite-LFC_mysql - 14/12/07
          glite-LFC_oracle - 14/12/07
          glite-DPM_mysql - 14/12/07
          glite-DPM_disk - 14/12/07
          glite-TORQUE_server - 12/11/07
          Speaker: Oliver Keeble (CERN)
        • Pre-release version of the SAM web services 5m
          As announced last Friday Feb. 22 through the same-announce mailing list, there is a new pre-release version of the SAM web services (lcg-sam-server-ws-0.11.0) installed on the SAM Validation instance.
          All the information related to bugs fixed, configuration changes and validation portals available for testing is described at:
          https://twiki.cern.ch/twiki/bin/view/LCG/SamReleaseActivity#SAM_Validation
          and in particular for this new RPM at:
          https://twiki.cern.ch/twiki/bin/view/LCG/SamValidationWs
          People are encouraged to review these changes, adapt their code (if necessary) and test the new interfaces as soon as possible.
          Best regards,
          SAM Team
          Speaker: Mr David Collados (CERN)
        • Glue 2.0 Draft
          FOR INFORMATION:
          Please find at the following URL the initial draft of Glue version 2.0 which will be shown at OGF 22.
          http://forge.gridforum.org/sf/go/doc15023
          If anyone has any comments or suggestions, please email them directly to me and I will merge them together to form a response from EGEE.
          Thanks
          Laurence.Field@cern.ch
      • 4:30 PM 5:00 PM
        WLCG Items 30m
        • Where to get dCache updates during CCRC '08
          This is to clarify that during CCRC '08 WLCG sites should take dCache updates from the official dCache repositories (see http://www.dcache.org/ for details).
        • WLCG issues coming from ROC reports
          None this week
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          1. FZK (GridKa) SCHEDULED downtime: 26-02-2008 08:30 to 10:30 [UTC] AT RISK due to "After the dCache upgrade to patch 12p6 the pools have to be restarted which might take a while."
          2. FZK (GridKa) SCHEDULED downtime: 10-03-2008 06:00 to 08:00 [UTC] AT RISK due to "There is a network maintenance but it is unlikely that problems occur."

          Time at WLCG T0 and T1 sites.

        • CCRC'08 Operational Review
          Speaker: Harry Renshall / Jamie Shiers
          Minutes of CCRC08 meetings
      • 5:00 PM 5:30 PM
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          Escalation Reports
      • 5:30 PM 5:35 PM
        Review of action items 5m
        list of actions
      • 5:35 PM 5:35 PM
        AOB