WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610

      • 16:00 - 16:05
        Feedback on last meeting's minutes 5m
        Minutes
      • 16:05 - 16:45
        EGEE Items 40m
        • Grid-Operator-on-Duty handover 5h
          From ROC SEE (backup: ROC CE) to ROC SWE (backup: ROC DECH)

          Lead team handover.
          Tickets (backup team handover):
            Open: 37
            Site OK: 24
            Closed: 19
            2nd mail: 13
            Quarantine: 16

          Notes:

        • No sites to be considered for suspension from our shift.
  • PPS reports
    PPS reports were not received from these ROCs:
  • The PPS has been set to ''maintenance'' in the GOCDB. However, neither the pre-report nor the SAM pages reflect this. A ticket (#19257) was submitted (from ROC DECH).

  • Answer (from SAM support team): the ticket has been received and is currently under analysis.
    Speaker: Nicholas Thackray (CERN)
  • top-level BDIIs 5m
    The immediate problems at CERN are resolved:
    a few spurious hosts that were hammering the BDII there have been removed.
    Also, the large improvement in GFAL's queries that we are expecting will make a big difference when it arrives.

    The second problem, persuading VO users not to hard-code the CERN BDII, is not easy. We have discussed having BDIIs publish themselves together with a summary of what they contain, e.g. FCR-ENABLED-EGEE-CERTIFIED or similar. The other thing we could do is some analysis of the declared top-level BDIIs, but even for this we need to know the complete list of BDIIs, i.e. they must publish themselves. It is clear that all services should publish themselves. This needs a bit of discussion first about what to do: we can go either for the easy option, where BDIIs just publish themselves, or the harder one, where they also publish what they contain.
    Speaker: Steve Traylen (CERN)
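    For reference, a top-level BDII can be queried directly over LDAP to see which services (including BDIIs) publish themselves. A minimal sketch; the host name is illustrative, substitute any declared top-level BDII:

        # List all services published by a top-level BDII (GLUE 1.x schema).
        ldapsearch -x -LLL -H ldap://lcg-bdii.cern.ch:2170 -b "o=grid" \
            '(objectClass=GlueService)' GlueServiceType GlueServiceEndpoint

    A BDII that publishes itself as a GlueService would show up in exactly this kind of query.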
  • EGEE issues coming from ROC reports
    Reports were not received from these ROCs: France, Russia, SEE
    1. gLite WMS problematic in production (100k tmp-files, e.g. at DESY). The corresponding ticket (https://gus.fzk.de/ws/overview.php?ticket=18270) is still not in progress. Has the problem been forwarded to the EMT? Is it being tackled at all?
      (DECH ROC)

    2. Answer: tmpwatch can be configured to clean those files up more often, even once a day, if needed.
      The location and verbosity of those files were made configurable as of Condor version 6.8.3, released on 8 January 2007.
      To the best of my understanding this Condor version is being tested for further distribution, but this issue is closed as far as development goes.
      Condor version 6.8.3 is currently in certification. It was held back because it was meant to be delivered together with a bundle of various fixes for the gLite CE scheduled for gLite 3.1.
      In consideration of the issue reported on the WMS, however, the Deployment team is going to deploy it on gLite 3.0 in one of the next patches.
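      As an illustration of the suggested workaround, a site could run a daily cron job along these lines (a sketch; /var/glite/tmp is a placeholder, use the directory where the tmp-files actually accumulate at your site):

          #!/bin/sh
          # Hypothetical /etc/cron.daily/clean-wms-tmp (placeholder path).
          # Remove WMS/Condor tmp-files not accessed in the last 24 hours.
          tmpwatch --atime 24 /var/glite/tmp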



    3. Due to an error in the top-level BDII configuration file, lfc02.pic.es was occasionally published together with prod-lfc-*-central.cern.ch, which might have led to files being registered in the wrong catalogue for these VOs: atlas, cms, diligent, dteam, lhcb, magic, ops, picard. The symptom had been present since Monday. The issue is fixed now. (CERN ROC)
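      As a quick sanity check, a VO can verify which central LFC the information system currently publishes for it; a sketch using lcg-infosites (any of the affected VOs can be substituted):

          # Show the LFC endpoint(s) published for the lhcb VO.
          lcg-infosites --vo lhcb lfc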


  • 16:45 - 17:05
    WLCG Items 20m
    Reports were not received from these tier-1 sites: Site1, ...
    Reports were not received from these VOs:
    VO1

    Tier1 reports
    • Request for VO interventions 5m
      All significant interventions (those involving multiple sites, multiple services or significant work for a single service) requested by VOs should be announced at the operations meeting, in the WLCG section of the meeting. It will be the responsibility of the VO to find a coordinator for the intervention (who could be from the CERN EIS team, a service manager, or someone with sufficient knowledge from the VO). The coordinator will create an intervention plan (a template is available) which must be ratified by all parties involved. Once the intervention is requested through the operations meeting, planned and agreed, the proper broadcast should be sent. Examples of such interventions are the SRM endpoint changes. Once this procedure is agreed, it will be documented in the operations manual.
    • Upcoming WLCG Service Interventions (with dates / times where known)

      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      • None foreseen for current week

      Time at WLCG T0 and T1 sites.

    • FTS service review 5h
        Read the attached report. Main issues this week:
        • Not ticketed yet: CERN-PROD: problems on the Castor ATLAS pool this weekend caused large transfer failures; being investigated now.
        • 19088: BNL-LCG2 will be in downtime until 16 March 2007 (GOC DB). ATLAS are still transferring data here and 45% of transfers are still successful. Should we close the channel?
        • 19009: IN2P3-CC: many queuing PUT requests this week, possibly made worse by some behaviour in the current production FTS; investigating. This problem was addressed on Friday. The site currently has authorisation configuration problems (a different problem).
        • 19144: GRIDKA (solved now): an intermittent problem? It would be good if the ticket response could indicate this.
        • 19157: PIC (solved now): ATLAS running out of disk space; new disk is being installed. The problem is known to ATLAS, who have stopped transfers.
      • FTS report index - status by site and by VO
      • Transfer goals - status by site and VO
      • Transfer Operations Wiki
      Speaker: Gavin McCance (CERN)
      more information
    • The production FTS service prod-fts-ws.cern.ch has been split into two services. 5m
      The production FTS service prod-fts-ws.cern.ch has been split into two services:
      prod-fts-ws.cern.ch
      tiertwo-fts-ws.cern.ch
      The new tiertwo service will handle all CERN<->T2 traffic, whereas prod-fts-ws will have this portion removed to become strictly the T0<->T1 export service.
      The change to prod-fts-ws, with the removal of the existing Tier-2 traffic, will take place shortly after 1 April.
      Please move all Tier-2 traffic to the new FTS instance. (CERN ROC)
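      For Tier-2 channels this means pointing transfer clients at the new endpoint. A sketch with the gLite FTS CLI; the SURLs are placeholders, and the web-service path shown is the conventional one:

          # Submit a job to the new Tier-2 FTS instance instead of prod-fts-ws.
          glite-transfer-submit \
              -s https://tiertwo-fts-ws.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer \
              srm://se.example-t2.org/dpm/example-t2.org/home/dteam/file1 \
              srm://srm.cern.ch/castor/cern.ch/grid/dteam/file1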

    • ATLAS service / "challenge" issues & Tier-1/Tier-2 reports
      Speaker: Kors Bos (CERN / NIKHEF)
    • CMS service / "challenge" issues & Tier-1/Tier-2 reports
      See also CMS Computing Commissioning & Integration meetings (Indico) and https://twiki.cern.ch/twiki/bin/view/CMS/ComputingCommissioning
      -- Job processing: CMS MC production continues.
      -- Data transfers: last week was CMS week, and week 3 of the CMS LoadTest07 (see [*]) was a breathe-and-assess week. Some bugs were fixed in PhEDEx 2.5, and a new sub-release is foreseen imminently, this week. The LoadTest07 set-up and the communication model were reviewed to better accommodate the Tiers' needs and to better involve them in the testing loops.
      This week we will restart with T0-T1 transfers mainly, in preparation for multi-VO transfers.
      [*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
      Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
    • ALICE service / "challenge" issues & Tier-1/Tier-2 reports
    • LHCb service / "challenge" issues & Tier-1/Tier-2 reports 5h
      1. The gLite job wrapper (using rb112 and rb117, the gLite WMSes dedicated to LHCb) does not take into account the EDG_WL_SCRATCH variable, so at some sites jobs run in the home directory and fill it up. Please upgrade (read: patch) those two machines (used in production now by LHCb) to the latest available version of the gLite WMS middleware so that LHCb will benefit from it. Note that those machines are running a pre-Christmas version of the gLite middleware that is starting to be really inadequate to sustain their productions. We have put them temporarily offline until they are completely drained of the thousands of jobs backlogged on them.
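      For context, the scratch handling LHCb expects from the wrapper is roughly the following (a minimal sketch, not the actual wrapper code):

          # If the site defines EDG_WL_SCRATCH, create and use a per-job
          # working directory there instead of running in $HOME.
          if [ -n "$EDG_WL_SCRATCH" ]; then
              workdir=$(mktemp -d "$EDG_WL_SCRATCH/job.XXXXXX")
          else
              workdir=$(mktemp -d "$HOME/job.XXXXXX")
          fi
          cd "$workdir"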

      2. GGUS #19205: a tURL (returned by lcg-gt asking for gsidcap) turns out not to be staged in the disk pool. This is very strange. The Root application then fails to open the file (which is still only in the MSS). This is observable only on purely gsidcap sites (IN2P3 is one of them).

      3. Once again we have to report problems in moving data and/or accessing data from applications due to very poor storage performance. Transfers show a general slowness of the SE response, with many failures due to timeouts or other errors indicating that the SRM is not responding ("Failed to get the source file size"). (CERN and CNAF first of all)

      This week LHCb want to point out another problem: lcg-gt problems across many of the sites. Many jobs fail because the command takes a while to retrieve the tURLs of the files to be opened by the Root application.
      This is true even though LHCb is using a high-performance utility that allows bulk queries to the SRM endpoints and their optimisation, rather than lcg-gt (a utility created explicitly to cope with several limitations already pointed out to the developers).
      In this respect CNAF is the most problematic site: no jobs have run successfully there since 1 March.
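      For reference, lcg-gt is called once per file, asking the SRM for a transport URL for a given protocol; this per-file round trip is where the latency accumulates. A sketch (the SURL is a placeholder):

          # Ask the SRM for a gsidcap tURL for one replica.
          lcg-gt srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/lhcb/file1 gsidcap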
      Speaker: Dr roberto santinelli (CERN/IT/GD)
  • 17:05 - 17:10
    OSG Items 5m
    Item 1
  • 17:10 - 17:15
    Review of action items 5m
    list of actions
  • 17:20 - 17:25
    AOB 5m