WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs: Russia
  • Tier-1 sites: BNL; NDGF; SARA; TRIUMF
  • VOs: ATLAS; BioMed; LHCb
  • list of actions
    Minutes
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From ROC CERN (backup: ROC SW Europe) to ROC Russia (backup: ROC UK/I)

          Tickets:
          New : 38
          1st mail : 31
          2nd mail : 13
          close : 26
          Quarantine : 23
          Site OK : 47

          1. General gCE middleware problem:
            A general problem has been detected with gCE job submission. Jobs fail with: Got a job held event, reason: "The job attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE". The developers confirmed that a bug in the communication between a gCE and the WMS causes this error.


          2. PPS middleware problem:
            Some sites show the following problem:
            Time to Match History : http://goc02.grid-support.ac.uk/cgi-bin/rb.py?RB=lcg2rb2.ific.uv.es
            Publication Date (UTC) : Wed, 11 Apr 2007 06:35:02 +0000
            /opt/edg/bin/edg-job-submit output : JobID : None
            Selected Virtual Organisation name (from --config-vo option): ops
            **** Error: API_NATIVE_ERROR **** Error while calling the "NSClient::multi" native api IOException: Unable to connect to remote (lcg2rb2.ific.uv.es:7772)
            **** Error: UI_NO_NS_CONTACT **** Unable to contact any Network Server


          3. There was also a problem with work sharing between the main and backup teams because of a problem with the Dashboard filter, which has been fixed.
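The PeriodicHold expression quoted in item 1 can be read as "hold the job if it has not been matched within 900 seconds of entering the queue". Below is a minimal Python sketch of that logic, for illustration only. The function name and the use of `None` for an undefined `Matched` attribute are assumptions made here; the real expression is evaluated by the batch system, not by code like this.

```python
from typing import Optional

HOLD_AFTER_SECONDS = 900  # taken from the quoted expression: QDate + 900


def periodic_hold(matched: Optional[bool], current_time: int, qdate: int) -> bool:
    """Mimic 'Matched =!= TRUE && CurrentTime > QDate + 900'.

    `matched` is None when the attribute is undefined (an assumption of
    this sketch). The '=!=' operator means "is not identical to", so an
    undefined value also counts as "not TRUE".
    """
    not_matched = matched is not True
    timed_out = current_time > qdate + HOLD_AFTER_SECONDS
    return not_matched and timed_out
```

Under this reading, a job that is still unmatched 901 seconds after its queue date is held, while a matched job is never held by this expression.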
        • PPS reports
          PPS reports were not received from these ROCs: Italy, Northern Europe, Russia
          • No PPS-update deployed last week due to Easter holiday
          1. On the gLiteCE: 4444 Waiting job problem: static data cannot be replaced by dynamic data (fixed by adding "ADMIN3 edguser" in maui.cfg on the lcgCE). [AP ROC]

          2. Question: was this bad configuration due to an error in YAIM?
          Speaker: Nicholas Thackray (CERN)
        • R-GMA Report
        • Update on move of MW to SL4
        • Job Wrapper tests: status update and next steps
          The wiki page describing how to disable these tests (at the site) is here: How to disable CE JobWrapper tests
          Speaker: Piotr Nyczyk (CERN)
        • EGEE issues coming from ROC reports
          1. (ROC SE Europe): For Information: SL4 Worker Nodes have been installed in AEGIS01-PHY-SCL and are running with no problems so far. Installation notes have been published in the SEE wiki: http://wiki.egee-see.org/index.php/SL4_WN


          2. (ROC SE Europe): We would appreciate a clear and updated timeline regarding the availability of gLite components (per service, if possible) for SL4/64bit and SL4/32bit. Is it possible to set up a wiki page with such a timeline, updated regularly based on plans and changes according to the development/certification progress?


          3. (ROC SE Europe): It would be nice to have a wiki page with all the already available information on installing SL4 gLite services (even with workarounds). Would it be possible for everyone to coordinate and put all related links on a wiki page? SEE ROC has already published some information on SL4 WNs (see point 1).


          4. (ROC SW Europe): At PIC we observe intermittent SAM failures on the SRM tests with the error message "BDII Connection Timeout: sam-bdii.cern.ch:2170". The CE SAM tests running on the WNs already use the regional top-BDIIs, but the SRM, SE, etc. SAM tests (launched centrally from CERN) all seem to use the CERN top-BDII, which is highly loaded and often times out. Could these central SAM tests use non-CERN top-BDIIs to balance the load? Is the new lcg-utils with the 60 s timeout (GFAL>=1.8.1) being used for these SAM tests?
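The load-balancing idea raised in item 4 could look roughly like the Python sketch below. It is not how SAM selects a BDII; it only illustrates spreading queries over several top-BDIIs and skipping ones that do not answer. The function names are hypothetical, any endpoint other than `sam-bdii.cern.ch:2170` would be a placeholder, and a plain TCP connect is used as a stand-in reachability check (a real test would issue an LDAP query).

```python
import random
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(endpoint: str, timeout: float = 60.0) -> bool:
    """Return True if a TCP connection to 'host:port' succeeds in time."""
    host, port = endpoint.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, DNS failures
        return False


def pick_bdii(
    endpoints: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Try the endpoints in random order and return the first that responds."""
    candidates = list(endpoints)
    (rng or random).shuffle(candidates)  # random order spreads the load
    for endpoint in candidates:
        if probe(endpoint):
            return endpoint
    return None  # no top-BDII reachable
```

Passing the probe as a parameter keeps the selection logic testable without network access.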


      • 16:30 17:00
        WLCG Items 30m
        • T0 and T1 Site Reliability in the ROC Weekly Reports 5m
          Speaker: A.Aimar (CERN)
          Slides
        • WLCG issues coming from ROC reports
          Nothing this week.
        • Upcoming WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          • PIC will be adding memory to the node running their region's top-level BDII, next week.
          • See CERN intervention next week (Wednesday) on CMS CASTOR stager.

          Time at WLCG T0 and T1 sites.

        • FTS service review
          Speaker: Gavin McCance (CERN)
        • ATLAS service
          Speaker: Kors Bos (CERN / NIKHEF)
          ATLAS LFC status
        • CMS service
          -- Job processing: Left-overs of MC production data transfers to CERN are being finalized. The new CMSSW versions (123/13x) needed for the next MC production round have been installed CMS-wide and MC production is starting.
          -- Data transfers: last week was week 4 of Cycle 2 of the CMS LoadTest07 (see [*]). It was not an exciting LoadTest week for T0-T1 transfers (lack of central operations people due to the Easter effect, and main focus on production transfers). Good results in some regions, though, on the T1<->T2 routes.
          -- News: There is a CMS Offline/Computing workshop this week. Interventions to migrate from DBS to DBS-2 are foreseen this week, and PhEDEx will be down for at least 2 days (most probably Wednesday/Thursday). Following this request from the PhEDEx team, at Friday's LoadTest meeting it was agreed to essentially freeze LoadTest transfers this week, so as not to interfere (with systems and people). Planning for FTS 2.0 testing (in progress with Gavin) will plug in accordingly. This week we will also work on a workplan to test T1-T1 transfers in depth and to debug all permutations of T1-T2 non-regional routes, in connection with the CMS Network project (a subgroup of the Facilities Infrastructure Ops project).
          [*]http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • ALICE service
          We have been testing the number of running jobs vs. waiting jobs using 4 different information sources (batch system, RB, local GRIS and top BDII) at 3 sites: CERN, FZK and CNAF. We have seen large discrepancies between the RB and the other information sources. This should be solved with the new gLite WMS, so we will begin to test with that system. Because this test is of interest to the other VOs, we will also begin to run it for ATLAS and CMS.
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
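The comparison described in the ALICE report can be sketched in Python as below. This is an illustrative sketch only: the function name is hypothetical, treating the batch system as the reference source is an assumption made here, and the numbers are invented for the example, not measured values from the report.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int]  # (running jobs, waiting jobs)


def discrepancies(
    sources: Dict[str, Counts], reference: str = "batch system"
) -> Dict[str, Counts]:
    """Return (running, waiting) differences of each source vs. the reference."""
    ref_running, ref_waiting = sources[reference]
    return {
        name: (running - ref_running, waiting - ref_waiting)
        for name, (running, waiting) in sources.items()
        if name != reference
    }


# Invented numbers for one site, for illustration only.
sample = {
    "batch system": (100, 20),
    "RB": (60, 75),
    "local GRIS": (100, 20),
    "top BDII": (98, 21),
}

for source, (d_running, d_waiting) in discrepancies(sample).items():
    print(f"{source}: running {d_running:+d}, waiting {d_waiting:+d}")
```

A source whose differences are far from zero, such as the RB in the invented sample, is the kind of discrepancy the report refers to.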
        • LHCb service
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • WLCG Service Coordination Issues
          Speaker: Jamie Shiers / Harry Renshall
      • 16:55 17:00
        OSG Items 5m
        1. Item 1
      • 17:00 17:05
        Review of action items 5m
        list of actions
      • 17:10 17:15
        AOB 5m