WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

      • 1
        EGEE Items
        • a) Central Grid-Operator-on-Duty (c-COD) handover
          From Northern Europe (NE) to Italy (IT)

          The problems with Asia Pacific have now been resolved.
          Two sites (NE and AP) have overdue alarms, but I have informed both of them about this.
          No issues to report to the WLCG meeting, except to inform them that the AP problems are now resolved.

        • b) PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:
          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
          1. Nothing this week?
        • c) gLite Release News
            Please find gLite release news in:
            https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases

            1. Nothing yet.
        • d) EGEE issues coming from ROC reports
            Italy, France and UKI had not validated their ROC reports as of the 14:00 deadline.
            Reports show no major operational issues encountered during the reporting period, and no points to raise at this meeting.

            • FZK-LCG2 wishes to convey the following INFO: Planned downtime at FZK-LCG2 on 10-09-2009, 07:00 - 08:00 UTC. The LFC service lfc-fzk.gridka.de (not the LHCb LFC) will be down because it is being split into an ATLAS instance (atlas-lfc-fzk.gridka.de) and a non-ATLAS instance (lfc-fzk.gridka.de, as before).
            • SEE ROC: At the previous operations meeting the issue “WLCG MB agreed on 4th of August to ask for the SL5 migration at all Sites, including the Tier-2 Sites.” was briefly discussed. As far as we know, MPI is still not supported by gLite 3.2 (see https://gus.fzk.de/ws/ticket_info.php?ticket=47422).
              We understand that this affects only the WLCG sites (at the moment), but since many users/teams in our region depend on the MPI capability of the Grid, we think that this issue should be given higher priority by the developers.
            • SWE ROC: We'd like to certify a site that runs only central services (WMS, LFC, etc.); the site has no storage or computing backend. Is this possible from the point of view of OPS?
        • e) Grid Service Interventions
            Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
            Please consult the URLs above for details.

        • f) Miscellaneous
            • SAM default DPM upgrade

              Last reminder that the default DPM used for SAM tests will be upgraded to SL4 next Monday 7th of September, and that sites with obsolete client S/W will start failing tests.

            • SAM MPI tests will NOT be activated

              There are pending tickets for SL5

            • Notification of new gstat beta version (see attached material)

            • 7 sites are running legacy gLite releases. Those not upgraded by next week will be moved to suspended/uncertified status until they upgrade:
              Site               Host                        Version
              EENet              kriit.eenet.ee              3.0.2
              HK-HKU-CC-01       ce.grid.hku.hk              3.0.2
              JP-KEK-CRC-01      dg10.cc.kek.jp              3.0.2
              Taiwan-IPAS-LCG2   atlasce.phys.sinica.edu.tw  3.0.2
              Taiwan-NCUCC-LCG2  ce.cc.ncu.edu.tw            3.0.2
              TW-NTCU-HPC-01     host001.hpc.ntcu.edu.tw     3.0.2
              UKI-LT2-RHUL       ce1.pp.rhul.ac.uk           3.0.2
        • g) EGEE OAT Releases

            GStat 2.0 Beta Release

            The Beta release of GStat is now available. Installation and configuration instructions are available at http://goc.grid.sinica.edu.tw/gocwiki/GSInstallationGuide. For any questions or comments, please email the GStat support list: project-grid-info-support@cern.ch.

            Update to the EGEE SA1 OAT release

            An update to the EGEE SA1 OAT release has now been released and is available in the usual repositories.

            No changes to the YAIM configuration are required, but after the "yum update" of your packages it is necessary at least to rerun ncg.pl, e.g. via a YAIM rerun.

            Changes include:

            • Changes to grid-monitoring-probes-org.bdii probes with NCG providing configuration for them. Probe details: http://goc.grid.sinica.edu.tw/gocwiki/NagiosProbe
            • Addition of org.gstat.CE and org.gstat.SE probes. These are the gstat2 probes; they provide sanity checks similar to those that the gstat1 web interface provided. In particular, they check for closer compliance with the WLCG/EGEE glue schema usage documents.
            • Nagios probe results that are collected via the messaging system now have their status prefixed with the hostname from which the test was executed. For example, when a ROC submits a WN test to a site via a CE, the probe result, once transmitted to the site Nagios via the messaging service, appears as before as service "org.sam.WN-Bi-dteam-roc" on the CE node, but the status line now contains the WN name, e.g. "lxbra3908.cern.ch: OK: getCE: ce103.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam", indicating that lxbra3908 was the WN where the test was executed.
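
The hostname-prefixed status line described above could be split back into its parts roughly as follows. This is only a sketch for site admins who post-process probe output: it assumes the layout seen in the single example (`<host>: <state>: <detail>`) and the standard Nagios state names, and `ProbeStatus` and `parse_status_line` are hypothetical names, not part of the OAT release.

```python
from typing import NamedTuple, Optional

# Standard Nagios service state names.
NAGIOS_STATES = {"OK", "WARNING", "CRITICAL", "UNKNOWN"}


class ProbeStatus(NamedTuple):
    executed_on: Optional[str]  # host where the test ran, None if not prefixed
    state: str                  # one of NAGIOS_STATES
    detail: str                 # remainder of the status line


def parse_status_line(line: str) -> ProbeStatus:
    """Split a status line, with or without the new hostname prefix."""
    # Split on ": " at most twice so that colons inside the detail
    # (e.g. "ce103.cern.ch:2119/...") are left untouched.
    parts = [p.strip() for p in line.split(": ", 2)]
    if parts[0] in NAGIOS_STATES:
        # Old-style line: no hostname prefix.
        return ProbeStatus(None, parts[0], ": ".join(parts[1:]))
    if len(parts) >= 2 and parts[1] in NAGIOS_STATES:
        # New-style line: hostname, then state, then detail.
        detail = parts[2] if len(parts) > 2 else ""
        return ProbeStatus(parts[0], parts[1], detail)
    raise ValueError(f"unrecognised status line: {line!r}")


status = parse_status_line(
    "lxbra3908.cern.ch: OK: getCE: "
    "ce103.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam"
)
print(status.executed_on, status.state)
```

Accepting both the prefixed and the unprefixed form keeps such a script usable for results that do not arrive via the messaging system.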
            Bug Fixes:

            Installation instructions via YAIM: EGEE.GridMonitoringNcgYaim

            Bug reports: https://savannah.cern.ch/projects/sa1tools/

            Discussion mailing list, including pre-release announcements: join egee3-operations-automation-discuss@cern.ch via https://groups.cern.ch

            Description of the yum repositories, including repoview HTML pages and RSS feeds of package updates: EGEE.EGEESA1PackageRepository

            Known Problems: A bug in the production message brokers can at times cause consumers to fail to get messages. We plan to deploy a fix shortly.

        • 2
          OSG Items
          The OSG supporter wrote in the diary of GGUS ticket 49970 that the problem is solved, so the ticket should be closed. However, the corresponding OIM ticket 7148 is still in Status: Support Agency, so the GGUS ticket cannot be closed. Please adapt the ticket status and put a comprehensive text in the Solution field for the GGUS Knowledge Data Base. Thanks, Maria
          Speakers: Maria Dimou, Rob Quick
        • 3
          Review of Action Items
        • 4
          AOB