WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (VRVS (Sky room))

Maite Barroso
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • VRVS "Sky" room will be available 15:30 until 18:00 CET

    actionlist
    minutes
      • 14:00 17:25
        • 16:00
          Feedback on last meeting's minutes 5m
          Minutes
        • 16:05
          Grid-Operator-on-Duty handover 5m
        • From France (backup: South East) to Italy (backup: Russia)

        • Tickets:
          - created: 35
          - quarantine: 32
          - closed: 51
          - updated: 20
          - 2nd mail: 9
          - reopened: 1
  • 16:10
    WLCG SC report and upcoming activities 15m
    Speaker: Harry Renshall
    more information
  • 16:25
    gLite 3.0 updates 5m
    We will start moving to per-service upgrades. First one, probably released today:
    - CE: lcg-info-dynamic-scheduler fix for host/queue name matching
    next ones in the queue:
    - FTS
    - DPM/LFC
    - UI and WN
  • 16:30
    Change of format of operations meeting 5m
    1) EGEE Items
       - Grid-Operator-on-Duty handover
       - Any other items/announcements specific to EGEE (e.g. updates to mw)
       - Issues coming from VO and ROC reports (ROC reports not received)
    2) OSG Items
       - Issues coming from OSG
    3) WLCG Items
       - Upcoming SC4 Activities
       - Any other general WLCG items
       - WLCG-related issues coming from experiment VOs and Tier-1/Tier-2 reports (VO reports + Tier-1 reports not received)
    4) Review of action items
    5) Feedback on last meeting's minutes
    6) AOB

  • 16:35
    REMINDER: update to the 1.8 IGTF CA package on every service (not only on the WNs, which are the only nodes checked with SFT) 20m
  • 16:40
    Bug 16625: 10-50 times speedup for lcg-info-generic. 5m
    http://savannah.cern.ch/bugs/?func=detailitem&item_id=16625
    10-50 times speedup for lcg-info-generic, tested at RAL and GridKa. Request to increase its priority so it is included in a release asap.
  • 16:45
    Issues to discuss from reports 25m

    Reports were not received from:
    ROCs: UKI (holiday)
    Tier-1s (reports attached): BNL
    VOs:

  • 1. CE ROC: Improvements to the gLite update release process needed.

    1.A) (4444 jobs problem. The bug affects all sites with a special character in the domain name or in a queue name.)
    gLite updates for production sites should not contain packages that are known to have bugs. The package lcg-info-dynamic-scheduler released with gLite 3.0.2 contained a well-known bug that affects CEs whose hostname contains the character '-' or whose queue name contains underscores, uppercase letters, and numbers. This bug is not listed as a known issue on any download page (e.g. http://glite.web.cern.ch/glite/packages/R3.0/deployment/lcg-CE/3.0.3/lcg-CE-3.0.3-update.html).
    Since the release date, this issue has generated at least three tickets:
    https://gus.fzk.de/pages/ticket_details.php?ticket=11681&from=allt
    https://gus.fzk.de/pages/ticket_details.php?ticket=11619&from=allt
    https://savannah.cern.ch/bugs/?func=detailitem&item_id=19233
    There are many such CEs in the central BDII, so there will probably be more tickets. The worst part is that the problem was already reported in May:
    https://savannah.cern.ch/bugs/?func=detailitem&item_id=17716
    and the patch has been available since July:
    https://savannah.cern.ch/patch/?func=detailitem&item_id=754
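
    As an aid for estimating how many CEs might be hit, the trigger conditions described in 1.A can be sketched as a small check. This is only an illustration based on the wording of the report (a '-' in the hostname, or an underscore, uppercase letter or digit in the queue name, each taken here as sufficient on its own); it is not the matching code of lcg-info-dynamic-scheduler, and the function names and host names are hypothetical.

```python
import re

# Assumption taken from the report: a queue name is problematic if it contains
# an underscore, an uppercase letter, or a digit. The plugin's real pattern is
# not shown in the report.
_QUEUE_TRIGGER = re.compile(r"[_A-Z0-9]")


def is_affected(ce_host: str, queue: str) -> bool:
    """Return True if the host/queue pair meets the reported trigger conditions."""
    return "-" in ce_host or bool(_QUEUE_TRIGGER.search(queue))


def affected_ces(pairs):
    """Keep only the (host, queue) pairs that are likely to hit the bug."""
    return [(host, queue) for host, queue in pairs if is_affected(host, queue)]
```

    Feeding such a filter with the host/queue pairs published in the central BDII would give a rough upper bound on the number of sites that could open further tickets.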

    1.B) Updates should be coordinated with central services (GSTAT is affected here). For example, MyProxy's ServiceType changed from 'myproxy' to 'MyProxy' in Yaim 3.0.0-17 (released in June), and Gstat still issues a warning on PROX nodes because it looks for 'myproxy'
    (https://gus.fzk.de/pages/ticket_details.php?ticket=11653&from=allt).
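
    The mismatch in 1.B comes down to a case-sensitive string comparison. A minimal sketch of a comparison that tolerates such a change is shown below; it is not Gstat's actual code, and the function name is hypothetical.

```python
def service_type_matches(published: str, expected: str) -> bool:
    """Compare a published ServiceType with the expected one, ignoring case."""
    return published.strip().casefold() == expected.strip().casefold()
```

    With this kind of check, both 'myproxy' and 'MyProxy' would be accepted, so a change in capitalisation on the publishing side would not produce a warning on the monitoring side.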
  • 2. DECH: "Is there a way to clean up (very) old entries in the RB's MySQL database? Database files have sizes of many GBs already." (DESY-HH)
  • 3. LHCb: I'd simply like to put more pressure on the GridKa site admins, whose site is failing reconstruction jobs for the ongoing DC06.
    There is a GGUS ticket (#11599) describing the problem, whose priority was severe.

    Problems with lcg-gt at GRIDKA in DC06
    Detailed description:
    Dear Site Manager,
    For several days now, when we (LHCb) try to run Reconstruction DC06 jobs at your site on data we have just transferred to your site, we get into the following situation: when the job issues lcg-gt commands to get the appropriate TURL for the dcap protocol to be used by the application, a large fraction of them time out (stopped by our own wrapper after 30 seconds), and thus the input data for the jobs cannot be resolved.
    This same logic has been working fine at your site in the past, and it is also working at other Tier-1s (PIC, RAL, IN2P3) and CERN. Please investigate the problem and let us know if we can help you to debug the issue.
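
    For readers unfamiliar with the setup, the 30-second wrapper mentioned in the ticket could look roughly like the sketch below. The real LHCb wrapper is not shown in the report, so the function name and return convention are assumptions; the sketch only illustrates running an external command with a hard time limit.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=30.0):
    """Run cmd and return (ok, stdout).

    ok is False if the command times out, cannot be started,
    or exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timeout (the child process is killed) or the command could not be started.
        return False, ""
    return result.returncode == 0, result.stdout.strip()


if __name__ == "__main__":
    # Stand-in command; a job would pass its own TURL lookup command here.
    ok, out = run_with_timeout([sys.executable, "-c", "print('ok')"])
    print(ok, out)
```

    In a job, the command list would hold the lcg-gt call and its arguments, and a False result would correspond to the unresolved input data described above.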
  • more information
  • 17:05
    Review of action items 10m
    actionlist
  • 17:20
    AOB 5m