WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610

    NB: Reports were not received in advance of the meeting from:

  • ROCs: RUSSIA
  • VOs: CMS, ALICE
  • list of actions
    Minutes
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: ROC Italy / ROC Russia
          To: ROC GermanySwitzerland / ROC CentralEurope



          BACKUP TEAM: RUSSIA
          Tickets treated by backup team group:

          opened: 24
          closed: 29
          2nd mail: 19
          extended: 24

          total: 96

          Issues:
          1. Many nodes are either not registered in the GOC DB or have monitoring
            switched off there, yet are still tested by SAM:

            (snip) LONG LIST OF HOSTNAMES (snip)

          2. Our team had to handle 40 tickets that expired during the
            previous COD shift, plus 50 very old alarms.


          LEAD TEAM: ITALY
          Issues:
          1. Ticket statistics not available.
          2. Many nodes were found with monitoring off. We think that there should be a good reason to keep monitoring off on a node. The reason could be:
            • briefly stated in the node description (e.g. "will be dismissed soon")
            • notified in the ticket exchange, so that COD can put a note in the per-site dashboard note tool
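
          Both teams report nodes that SAM tests even though they are unregistered or have monitoring switched off in the GOC DB. The cross-check could be automated roughly as sketched below. This is only an illustration: it assumes the GOC DB content and the list of SAM-tested hosts have already been exported into plain Python structures. The hostnames, the `monitored` and `note` fields and the function name are hypothetical and do not correspond to any real GOC DB or SAM interface.

```python
from typing import Dict, Iterable, List, NamedTuple


class GocdbNode(NamedTuple):
    """Hypothetical, simplified view of one node entry in the GOC DB."""
    monitored: bool  # True if monitoring is switched on for the node
    note: str = ""   # free-text reason, e.g. "will be dismissed soon"


def classify_sam_hosts(sam_hosts: Iterable[str],
                       gocdb: Dict[str, GocdbNode]) -> Dict[str, List[str]]:
    """Sort SAM-tested hosts into the problem categories raised by the COD teams."""
    report: Dict[str, List[str]] = {
        "unregistered": [],                # tested by SAM, absent from the GOC DB
        "monitoring_off_no_reason": [],    # monitoring off, no explanation recorded
        "monitoring_off_with_reason": [],  # monitoring off, explanation recorded
    }
    for host in sorted(set(sam_hosts)):
        node = gocdb.get(host)
        if node is None:
            report["unregistered"].append(host)
        elif not node.monitored:
            key = ("monitoring_off_with_reason" if node.note.strip()
                   else "monitoring_off_no_reason")
            report[key].append(host)
    return report


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    example_gocdb = {
        "ce01.example.org": GocdbNode(monitored=True),
        "se01.example.org": GocdbNode(monitored=False),
        "ce02.example.org": GocdbNode(monitored=False, note="will be dismissed soon"),
    }
    example_sam_hosts = ["ce01.example.org", "se01.example.org",
                         "ce02.example.org", "wn99.example.org"]
    for category, hosts in classify_sam_hosts(example_sam_hosts, example_gocdb).items():
        print(f"{category}: {', '.join(hosts) or '-'}")
```

          Separating "monitoring off with a recorded reason" from "monitoring off without one" mirrors the lead team's suggestion that a reason should always accompany a node with monitoring disabled.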
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          UKI, SEE, RU, NE, IT, AP

          Issues from EGEE ROCs:
          1. From PPS-CYFRONET: There were many intermittent SAM test failures this week due to a bad configuration of the SAM UI client at RAL. The problem had existed for quite a long time.
            SAM job submission from RAL is suspended for the time being, so tests are submitted at half the usual rate (every two hours).
            Suggestions to fix the issue have been sent to PPS-RAL.
            [CE ROC]
          Release News:
          • On Thursday the 13th the new version of the WMS/LB was released to production.
            Release Notes: http://glite.web.cern.ch/glite/packages/R3.0/updates.asp
            WMS Specifics: https://twiki.cern.ch/twiki/bin/view/EGEE/Glite30WMSCHKPTProd
            This WMS/LB is built for SL3/VDT1.2 and therefore is being released as an update to gLite 3.0. The codebase is commonly referred to as '3.1', so you will hear this version of the WMS/LB referred to as '3.1'. It is the 3.1 WMS/LB for gLite 3.0.
            As there is no supported upgrade path from the old version of the WMS, you will have to reinstall the WMS from scratch. Please make sure to reserve a convenient slot for the upgrade of the service at your site.
        • EGEE issues coming from ROC reports
          1. Central Europe: Do we have any news about lcg-CE for SLC4?


          2. Northern Europe: Currently our SRM v2.2 SE fails the SAM tests because the ops VO is not supported on this SE. I don't think that non-production resources that are published in the production BDII because a VO wants this should be subject to SAM tests. Will GGUS tickets be raised about failing SRM v2.2 SEs in SAM, or will this be ignored?


          3. SouthEasternEurope:
            Update 34 to gLite-3.0 brought the new, so-called 3.1 WMS(LB) for gLite-3.0. However, the release notes reveal that this new WMS(LB) has no upgrade path from the previous version (so it is really not an update but a new release; even the repository has changed!). Moreover, the 3.1 WMS no longer contains the Network Server component, for the following reason: "it is no longer supported by JRA1". We believe that JRA1 should support user communities, not Network Server.

            Now, such a sudden decision is, in our opinion, quite dubious:

            1. This was NOT ANNOUNCED to users or to sites. Although the migration to WMProxy is fairly straightforward, such a change needs to be announced well in advance.
            2. Has anyone at least tried to estimate the NUMBER OF USERS relying on Network Server versus those relying on WMProxy? In our experience with various user communities, local and EGEE-wide, the majority of them have not even heard of WMProxy, and those who have believe that it does not work or is still in a test phase.
            To conclude, we propose that, for the users' sake, compatibility with Network Server is kept. This could be done at a purely superficial level: the old UI commands that use Network Server could be made to use WMProxy with automatic delegation.


          4. SouthWestern: A problem was discovered last week with the WMS jobwrapper: apparently it resets the user's umask. This prevents sites that use pool accounts (for SGM, for instance) from working in this mode. There is a Savannah bug (https://savannah.cern.ch/bugs/index.php?29604). For us this is a high-priority issue to be solved.


          5. From CERN_PROD: New AFS UI 3.1 available on lxplus
            Configuration instructions are available at
            https://twiki.cern.ch/twiki/bin/view/LCG/AfsUiUserSetup
            [CERN ROC]


          6. Sites complain about the lack of proper communication regarding the changes to glite-WMS in update 34: it requires installation from scratch, and the new WMS(LB) does not contain Network Server, which will affect the many users of ours who rely on this service. Two GGUS tickets were created regarding the last update: https://gus.fzk.de/pages/ticket_details.php?ticket=26830 https://gus.fzk.de/pages/ticket_details.php?ticket=26831
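
          The compatibility layer proposed in item 3 (old UI commands transparently using WMProxy with automatic delegation) could look roughly like the sketch below. It only rewrites a command line and does not execute anything. The command names and the `-a` (automatic delegation) and `-d` (delegation ID) options are quoted from memory of the gLite UI and should be treated as assumptions to be checked against the installed release. Whether the remaining options are compatible between the two command sets is not verified here.

```python
import shlex
from typing import List, Sequence

# Assumed mapping from Network Server era UI commands to their WMProxy
# counterparts; verify against the UI release actually installed.
NS_TO_WMPROXY = {
    "glite-job-submit": "glite-wms-job-submit",
    "glite-job-list-match": "glite-wms-job-list-match",
    "glite-job-cancel": "glite-wms-job-cancel",
    "glite-job-output": "glite-wms-job-output",
}

# WMProxy commands assumed to require a delegated proxy.
NEEDS_DELEGATION = {"glite-wms-job-submit", "glite-wms-job-list-match"}


def translate(argv: Sequence[str]) -> List[str]:
    """Rewrite an old-style command line into its assumed WMProxy equivalent."""
    if not argv:
        raise ValueError("empty command line")
    old_command, args = argv[0], list(argv[1:])
    if old_command not in NS_TO_WMPROXY:
        raise ValueError(f"no WMProxy equivalent known for {old_command!r}")
    new_command = NS_TO_WMPROXY[old_command]
    translated = [new_command]
    # Add automatic delegation unless the user already chose a delegation mode.
    if new_command in NEEDS_DELEGATION and "-a" not in args and "-d" not in args:
        translated.append("-a")
    return translated + args


if __name__ == "__main__":
    print(shlex.join(translate(["glite-job-submit", "job.jdl"])))
```

          A wrapper installed under the old command names could call `translate` on its own arguments and then run the result, so that existing user scripts keep working while the Network Server is phased out.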



      • 16:20 16:30
        Progress report of EGEE Middleware Nodes and SL4 10m
        Speaker: Oliver Keeble
      • 16:30 17:00
        WLCG Items 30m
        • Tier 1 reports
        • WLCG issues coming from ROC reports

        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
          1. Scheduled downtime of the CASTOR instance at INFN-T1 (CNAF): we would like to upgrade the CASTOR instance next Wednesday, 19 September. The downtime will begin at 07:00 UTC on 19 September. A new release of the CASTOR software will be installed, fixing several operational problems.

          Time at WLCG T0 and T1 sites.
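
          The downtime start above is given in UTC, so a short conversion to local time at the sites concerned may be useful. The sketch below is illustrative only. The year 2007 is inferred from the dated escalation report linked later in this agenda, and the time zone names chosen for CERN and CNAF are assumptions. It needs Python 3.9 or later and a system time zone database.

```python
from datetime import datetime, timezone
from typing import Dict
from zoneinfo import ZoneInfo

# Downtime start from the announcement: 07:00 UTC on 19 September.
# The year is an inference from the escalation report dated 2007-09-17.
DOWNTIME_START_UTC = datetime(2007, 9, 19, 7, 0, tzinfo=timezone.utc)

# Assumed IANA time zones for the sites of interest.
SITE_TIMEZONES = {
    "CERN": "Europe/Zurich",
    "INFN-T1 (CNAF)": "Europe/Rome",
}


def local_start_times(start_utc: datetime,
                      site_timezones: Dict[str, str]) -> Dict[str, datetime]:
    """Return the downtime start expressed in each site's local time."""
    return {site: start_utc.astimezone(ZoneInfo(tz_name))
            for site, tz_name in site_timezones.items()}


if __name__ == "__main__":
    for site, local in local_start_times(DOWNTIME_START_UTC, SITE_TIMEZONES).items():
        print(f"{site}: {local:%Y-%m-%d %H:%M %Z}")
```

          Further Tier-1 sites can be added to `SITE_TIMEZONES` as needed; the conversion takes daylight saving time into account through the time zone database.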

        • FTS service review

          Please read the report linked to the agenda.

          Speakers: Gavin McCance (CERN), Steve Traylen
          Paper
        • CMS service
          • No report.
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          • PIC: after some problems last week with the LHCb stager (since restarted), the SRM endpoint now returns a wrong rfio TURL (apparently the path is missing /shift) due to some change in their configuration. Joel followed up on this issue with Esther last week but has not received any further news. ROOT fails to open files for reading, such as rfio:///stage/cfs0162/lh/stage/00001368_00000206_5.digi.405705 as returned by lcg-gt, so reconstruction cannot run there. PIC admins, please look into this. Reference: GGUS 26900 R.
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          • No report.
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • Service Coordination
          The CMS CSA07 service challenge external phase is due to start on 24 September and run for 30 days. See https://twiki.cern.ch/twiki/bin/view/CMS/CSA07Plan
          Speaker: Harry Renshall / Jamie Shiers
      • 17:10 17:15
        OSG Items 5m
        1. Discussion of open tickets for OSG.
        2. https://gus.fzk.de/download/escalationreports/roc/html/20070917_EscalationReport_ROCs.html
      • 17:15 17:20
        Review of action items 5m
        list of actions
      • 17:25 17:30
        AOB 5m