WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768

    OR click HERE

    Click here for minutes of all meetings

    Click here for the List of Actions

      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: Central Europe / North Europe
          To: France / AsiaPacific


          Report from Central Europe COD:
          1. Request to ROCs to remind people not to put ticket comments in languages other than English.
          2. GGUS ticket 34338 is assigned to GStat, with last update 2008-07-02. Some COD tickets depend on it. Request to GStat for an update.
          Report from North Europe COD:
          1. The SAMAP tests put some sites in critical error for not yet having the new CA rpms. Normal SAM tests now give a warning for this, as noted on the LCG-ROLLOUT list.
        • PPS Report & Issues
          1. UK/I ROC: No UKI PPS sites appear in the site reports area.
        • gLite Release News
          Now in PPS
          • gLite 3.1 PPS Update 34 has successfully passed deployment testing and will be released to the PPS within the next few days. This update contains:
            • DPM and LFC 1.6.11 (see details in PATCH:1987)
            • dCache 1.8.0-15p5 with a new YAIM module for configuration


          Soon in Production
          • gLite 3.1 Update 28 is in preparation. This update has been delayed due to issues with the release process but will be released within the next few days. The release contains:
            • glite-CONDOR_utils for lcg-CE (PATCH:1856)
            • New version of the gsoap plugin with a vulnerability fix (affecting LB, WMS, UI, WN, VOBOX, CE) (PATCH:1846)
            • Several bug fixes on WMS and clients (PATCH:1780)
            • New Short Lived Credential Service (SLCS), allowing users to obtain a short-lived personal certificate based on a Shibboleth AAI identity (PATCH:1693)
            • MyProxy version 1.6.1-7 (fixes a build issue related to the globus flavour, already deployed in production) (PATCH:1978)
            • Various improvements on lcg-extra-jobmanagers (CE) (PATCH:1942)
            • GFAL and lcg_util update with the new function gfal_removedir and several bug fixes
            • FTS SL4 release (32 and 64 bit)
        • EGEE issues coming from ROC reports
          1. France: CA Update 1.24 has once again not been followed up properly: the repositories were not updated at the same time as the SAM tests, so sites have been in Warning state. Proposal:
             - Ask that the CA update procedure be followed.
             - When a delay occurs, the 7-day SAM countdown should be reset to 7 days.
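The countdown reset proposed by France can be illustrated with a minimal sketch. It assumes the countdown normally starts when the CA update is announced and restarts from the date the repositories are actually updated if that happens later. The function name and the dates in the usage example are hypothetical and are not taken from the SAM implementation.

```python
from datetime import date, timedelta
from typing import Optional

GRACE_DAYS = 7  # length of the SAM countdown mentioned in the proposal


def ca_deadline(announced: date, repo_updated: Optional[date] = None) -> date:
    """Return the last day of the countdown for installing new CA rpms.

    Assumption (not from the source): the countdown starts on the
    announcement date and restarts in full if the repositories were
    only updated later.
    """
    start = announced
    if repo_updated is not None and repo_updated > announced:
        start = repo_updated  # delay occurred: reset the countdown
    return start + timedelta(days=GRACE_DAYS)


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    print(ca_deadline(date(2008, 7, 28)))
    print(ca_deadline(date(2008, 7, 28), repo_updated=date(2008, 7, 31)))
```

With this rule a site always has the full seven days after the rpms become installable, regardless of how long the repository update was delayed.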
        • Experience of countries/regions with the WMS?
          Reminder to sites/ROCs: please send us one paragraph on your WMS experience for compilation.

          In the UK we are still trying to understand when to move to relying on the WMS and how many we require. What are the experiences of other countries/regions?
          Here is some background from a GridPP meeting today:

          "The RAL WMS lcgwms01 (SL3 host with gLite-WMS-2.4.9-0 and glite-LB-2.3.5-0) became heavily loaded on 22nd and user throughput suffered as a result. The underlying problem was not understood as the service returned to normal without a clear intervention required. This prompted SL to comment on WMS and RB availability in the UK. He noted 5 RBs (3 RAL; 1 Glasgow and 1 IC). He was only aware of the 1 WMS instance at RAL. As of today, the default server in Glasgow is a gLite 3.1 WMS instance (RB to be removed at the end of July and possibly replaced with another WMS). RAL maintains one test instance on SL4 – to be moved to production after further testing. IC has PPS-glite-WMS.i386 3.1.8-1. This WMS is stable with 20-30,000 jobs a day not causing a problem. NGS has an unadvertised WMS hosted at RAL. Grid Ireland run a WMS and has seen “quite a few issues” while working with users to get their apps working via it. Throughput performance of the WMS is good.

          Stephen recently noticed that YAIM will soon be configuring UIs to work with service discovery (WMS and LBs will be discoverable through the information system using appropriate UI commands): https://savannah.cern.ch/bugs/?31211."
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. None.
        • End points for FTM service at Tier-1 sites
          There is a request for the list of FTM endpoints at the Tier-1 sites.
          We can collect these manually now, but how should the list be kept up to date?

          The list of FTM endpoints we have so far is:
          • ASGC: http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/
          • FZK: http://ftm-fzk.gridka.de/transfer-monitor-report/
          • IN2P3: http://cclcgftmli01.in2p3.fr/transfer-monitor-report/
          • INFN: https://cmsfts3.fnal.gov:8443/transfer-monitor-report/
            https://cmsfts3.fnal.gov:8443/transfer-monitor-gridview/
          • PIC: http://ftm.pic.es/transfer-monitor-report/
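One possible answer to the question of keeping the list up to date is to hold it in a single machine-readable structure and probe it periodically. The sketch below is only an illustration under that assumption: the dictionary reproduces the endpoints listed above as they appear in the agenda, and the names `http_probe` and `stale_endpoints` are hypothetical.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Endpoints copied from the list above, exactly as given in the agenda.
FTM_ENDPOINTS = {
    "ASGC": ["http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/"],
    "FZK": ["http://ftm-fzk.gridka.de/transfer-monitor-report/"],
    "IN2P3": ["http://cclcgftmli01.in2p3.fr/transfer-monitor-report/"],
    "INFN": [
        "https://cmsfts3.fnal.gov:8443/transfer-monitor-report/",
        "https://cmsfts3.fnal.gov:8443/transfer-monitor-gridview/",
    ],
    "PIC": ["http://ftm.pic.es/transfer-monitor-report/"],
}


def http_probe(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (URLError, OSError, ValueError):
        return False


def stale_endpoints(endpoints, probe=http_probe):
    """Return sorted (site, url) pairs for which the probe fails."""
    return sorted(
        (site, url)
        for site, urls in endpoints.items()
        for url in urls
        if not probe(url)
    )


if __name__ == "__main__":
    for site, url in stale_endpoints(FTM_ENDPOINTS):
        print(f"{site}: {url} did not respond")
```

The probe is passed in as a parameter so that the check can be exercised without network access. A failing probe only flags an entry for manual follow-up with the site; it does not prove that the endpoint has moved.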
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          1. Reminder that tomorrow, 5-Aug-2008, PIC will have a scheduled downtime from 6:00 to 18:00 UTC. The main services (CE and SE) will be affected.

          Time at WLCG T0 and T1 sites.

        • WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • ALICE report
        • ATLAS report
          ATLAS events in August:
          1. Kors - We will organize a last Jamboree before LHC turn-on on Thursday August 28; a preliminary agenda can be found at: http://indico.cern.ch/conferenceDisplay.py?confId=38738 We would really appreciate it if representatives of at least all Tier-1s, and also of the major Tier-2s, could be there, but of course everybody is welcome. We will use the Friday for tutorials and training, but we can also organize some extra meetings if needed. From Monday through Wednesday of that same week there will be an Analysis workshop with a focus on tools and development. We have reserved the IT Aud. for that whole week and a video link will also be set up.

          2. Xavi (by email): We will organize a Tutorial and Training session on the 29th of August, the day after the ATLAS Tier-1&2&3 Jamboree. A preliminary agenda can be found at: http://indico.cern.ch/conferenceDisplay.py?confId=38864 I especially encourage potential future shifters, the current shift crew and site contacts to attend. We will have tutorials for the fundamental services and systems in ATLAS, and also a special monitoring training session based on the ATLAS dashboards. This is especially interesting for site contacts, as one can see whether a site is performing well, either in data management or in simulation production, which is very useful to spot, track and debug problems.

          3. Massimo & Johannes - As discussed over the last month and presented in the last two ADC weekly meetings, we are going to have an ATLAS analysis workshop on August 25-27 at CERN. The outline of the sessions is online on Indico ( http://indico.cern.ch/conferenceDisplay.py?confId=38560 ). We feel that the workshop will be a good opportunity to consolidate the successful experience in grid analysis with Ganga and pAthena and to continue to build on that. We insist on the "workshop" format because we feel that the three days of the event will be best used for technical discussion (with few formal presentations).

        • CMS report
          Speaker: Daniele Bonacorsi
        • LHCb report
          1. An EGEE broadcast was sent today about the new VOMS "pilot" role that must be configured on every site. This role is intended for running generic pilots and will be used only to submit through a CE and run glexec.
          2. Note the importance of Savannah bug http://savannah.cern.ch/bugs/?39641 (User proxy mixup for job submissions too close in time), which should be escalated to the EMT.
          3. SAM test results when the experiment framework changes: we migrated the SAM suite for the CE from DIRAC2 to DIRAC3, and we would like to advertise (a posteriori) that most of the bad results for this service are due to that. What is the recommended procedure to exclude these test results from the final site availability computation?
            Answer from SAM: It is better to discuss this offline with them, but there are two options. Either they set the test as 'non-critical' in advance (a priori), or they come to us and say that from day X to day Y they would like test Z to be treated as 'non-critical' (a posteriori), provided this happens before the end of the month (before the sites' availability is calculated). We are discussing how to deal with this particular situation; we have already implemented other mechanisms to deal with cases where tests are submitted and fail due to SAM problems.
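The a posteriori option in the SAM answer (treat test Z as 'non-critical' from day X to day Y) can be sketched as follows. This is a simplified model and not the SAM availability algorithm, which the agenda does not describe: it assumes one result per test per day and counts a day as available when every critical, non-waived test passed. The test names and dates are hypothetical.

```python
from datetime import date


def site_availability(results, waivers=()):
    """Fraction of days on which all critical tests passed.

    results: iterable of (day, test_name, passed, critical)
    waivers: iterable of (test_name, first_day, last_day); inside the
             window the test is treated as non-critical.
    Returns None if there are no results.
    """
    def waived(test, day):
        return any(
            name == test and first <= day <= last
            for name, first, last in waivers
        )

    day_ok = {}
    for day, test, passed, critical in results:
        ok = passed or not critical or waived(test, day)
        day_ok[day] = day_ok.get(day, True) and ok

    if not day_ok:
        return None
    return sum(day_ok.values()) / len(day_ok)


if __name__ == "__main__":
    # Hypothetical results: the CE test fails on two days of a migration.
    results = [
        (date(2008, 7, 1), "CE-job-submit", True, True),
        (date(2008, 7, 2), "CE-job-submit", False, True),
        (date(2008, 7, 3), "CE-job-submit", False, True),
        (date(2008, 7, 4), "CE-job-submit", True, True),
    ]
    waiver = [("CE-job-submit", date(2008, 7, 2), date(2008, 7, 3))]
    print(site_availability(results))
    print(site_availability(results, waiver))
```

Keeping the waivers as separate input means the raw test results are never altered, and the waiver can be applied at any point before the monthly availability is calculated.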
        • Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions
        • Storage services: this week's updates
          • dCache announced version 1.8.0-16. It will most probably be available in one month. It contains several improvements:
            1. New Information Providers in accordance with the decisions taken by the "Dynamic Megatable" working group
            2. Improved version of Pin Manager. It allows pins to be released per VO.
            3. Better performing srmLs
            4. New Pool System with no overcommitted space
            5. Improved srm clients with better handling of command line options
            The CCRC08 branch will continue to be supported.
          • New CASTOR information providers, compliant with the decisions taken by the "Dynamic Megatable" working group, are in validation.
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35
        AOB