WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs:
  • Tier-1 sites: INFN
  • VOs: Atlas
    Minutes
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From ROC Russia (backup: ROC UK/I) to ROC Italy (backup: ROC France)

          Tickets:
            1. The problem from last week is still present; some sites show the same failures (a simple connectivity probe for the affected endpoints is sketched after this ticket list):
              Time to Match History : http://goc02.grid-support.ac.uk/cgi-bin/rb.py?RB=lapp-rb01.in2p3.fr
              Publication Date (UTC) : Wed, 18 Apr 2007 09:35:01 +0000
              /opt/edg/bin/edg-job-submit output :
              JobID : None
              Selected Virtual Organisation name (from --config-vo option): ops
              **** Error: API_NATIVE_ERROR ****
              Error while calling the "NSClient::multi" native api IOException: Unable to connect to remote (lapp-rb01.in2p3.fr:7772)
              **** Error: UI_NO_NS_CONTACT ****
              Unable to contact any Network Server
              -------------------------------------------------
              Time to Match History : http://goc02.grid-support.ac.uk/cgi-bin/rb.py?RB=rb-fzk.gridka.de
              Publication Date (UTC) : Wed, 18 Apr 2007 13:35:07 +0000
              /opt/edg/bin/edg-job-submit output :
              JobID : None
              Selected Virtual Organisation name (from --config-vo option): ops
              Connecting to host rb-fzk.gridka.de, port 7772
              Logging to host rb-fzk.gridka.de, port 9002
              **** Error: API_NATIVE_ERROR ****
              Error while calling the "edg_wll_RegisterJobSync" native api Unable to Register the Job:
              https://rb-fzk.gridka.de:9000/bWPNXoGJ9qvNzOfjRk7o5w
              to the LB logger at: rb-fzk.gridka.de:9002
              Connection refused (edg_wll_ssl_connect())
              -------------------------------------------------


            2. GStat for some sites (RWTH-Aachen, BEgrid-ULB-VUB, GR-04-FORTH-ICS, CSCS-LCG2, UNI-FREIBURG) shows the warning:
              Service Entry Check: warn
              Service with incorrect versions found:
              ID: httpg://grid-srm.physik.rwth-aachen.de:8443/srm/managerv1
              Type: srm_v1
              Vers: 1.1.1
              Service with bad SRM service type found


            3. A disk (HDD) failure on pps-wms.cern.ch (a.k.a. lxb2092.cern.ch), the WMS at the CERN_PPS site, caused problems to appear at a number of PPS sites.
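            For illustration only, a minimal sketch of how the two failure modes in ticket 1 can be checked without submitting a job: it simply attempts a TCP connection to the RB Network Server (port 7772) and LB logger (port 9002) endpoints quoted in the errors above. The endpoint list is taken from the ticket and the timeout value is an arbitrary choice.

              import socket

              # Endpoints quoted in the ticket above; adjust as needed.
              ENDPOINTS = [
                  ('lapp-rb01.in2p3.fr', 7772),   # RB Network Server
                  ('rb-fzk.gridka.de', 7772),     # RB Network Server
                  ('rb-fzk.gridka.de', 9002),     # LB logger
              ]

              for host, port in ENDPOINTS:
                  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                  s.settimeout(10)  # arbitrary timeout in seconds
                  try:
                      s.connect((host, port))
                      print('%s:%d reachable' % (host, port))
                  except socket.error as err:
                      print('%s:%d FAILED: %s' % (host, port, err))
                  s.close()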
        • PPS reports
          PPS reports were not received from these ROCs: Italy, Asia Pacific
          • PPS-Update 27 released to the PPS. This contains:
            • patch #1118 lcg-vomscerts-4.4.1 has correct cert for biomed/egeode
            • patch #1115 New version of lcg-info with support for VOViews, sites and services
            • patch #1110 Dcache 1.7.0-34 upgrade with GridFTP bug fixes
            • patch #1108 glite-yaim 3.0.1-12 5 => This version of YAIM enables DGAS logging on the LCG CEs.
          • Significant issue found in the SL4 natively compiled WN (gridFTP ls causes a segmentation fault)
          • A meeting with all PPS sites (VRVS or phone conference) is being scheduled. The tentative date is:
            Thursday 03 May 2007
            from 15:00 to 16:30

            The preliminary agenda is available at http://indico.cern.ch/conferenceDisplay.py?confId=15191

          • SRM-2 testing in PPS
            Although SRM-2 is not certified yet, experiments are requesting that the PPS give them support for initial testing of SRM-2 capabilities.
            A number of new sites already running the new SRM-2 are going to join PPS.
            This will force a re-organization of the data management in PPS.
            • New sites joining and why
              A number of new sites already running the new SRM-2 in the context of the SRM-2 "pilot" are going to join PPS.
              This will force a re-organization of the data management in PPS (e.g. end-points to be published and updated in FTS).
              Sites willing to volunteer for a pre-installation of their SEs with SRM-2 are welcome.
              Sites will also be asked to volunteer to declare SRM-2 SEs as their "Close SE".
            • HEP VO specific testing: For the time being the SRM-2 testing concerns only the HEP VOs.
              Sites mainly dedicated to serving non-HEP VOs (e.g. Biomed, Diligent), although welcome to join the exercise, may find it preferable to opt out in order to avoid conflicts.
              In that case they would need to stop supporting the HEP VOs.
            • Installation of 'uncertified' software: There is no guarantee, so far, that SRM-2 will be certified before this test activity starts.
              As usual we will not ask sites to install uncertified software.
              However, if sites are willing to do so in this case, and if it is compatible with any other current use of the PPS storage resources, they are welcome to.
            • Data in the catalogs to be modified: conflicts? The migration of the catalogs is not reversible. The migration scripts are meant for use in production.
              Experiments will know that data created in the PPS catalogs during the exercise are going to be "lost" afterwards.
              We have to check whether there is any showstopper to the migration of the existing catalogs in PPS.
            • Configure CEs with SRM-2 SEs as 'Close SE'. Volunteers? Until the list of end-points is available, we ask here only for an expression of interest.
          • Issues coming from the ROCs
            1. A roadmap for gLite MW in general would help us plan ahead [SEE ROC]
          Speaker: Nicholas Thackray (CERN)
        • Decision needed on moving forward with SL4 WN
          There is a bug in the native SL4 WN, but the upgrade path from the 'interim' WN (SL3 made compatible with SL4) is very difficult. Given the circumstances, how should we move forward?
          RECOMMENDATION: Make the interim WN available to production sites, with a clear message regarding the upgrade problems and also the timelines for the native SL4 WN, and leave it up to each site to decide how they will handle it.
          Speaker: Nicholas Thackray (CERN)
        • EGEE issues coming from ROC reports
          1. (ROC CERN, TRIUMF): SAM still handles timezones incorrectly. Maintenance on Fri 20th was scheduled for 14:00 - 16:00 UTC, but SAM showed the maintenance incorrectly at 08:04 UTC and reported an error at 14:02 UTC, i.e. wrongly during our maintenance window.


          2. (ROC CERN, FNAL): 1. We set up a 2nd lcg gateway for redundancy. But if either goes down, SAM flags us as being down, thereby defeating the purpose of the 2nd gateway. Of course we are still operational, only SAM is marking us incorrectly. How can this be improved? I was told CERN runs multiple gateways, how do they handle this? 2. We need to split the cmswnNNN accounts on the 2 gateways since they operate independently.


          3. (ROC France): Within the relocatable distribution of WN/UI, check_crl script is not relocatable (GGUS #20970)
          4. ANSWER: ticket submitted 19/04 and assigned to the Installation and Configuration/New Release support unit. A bit more patience before we escalate it.

          5. (ROC France): Please note that the number of spurious SAM test failures is decreasing. Congratulations.


          6. (ROC France): Announcement: A regional top-level BDII has been put into production. For now it is used only by the T1 to check the load, but afterwards it will be proposed to all French sites.


          7. (ROC DECH): Announcement: There is a planned outage of site UNI-FREIBURG on 2nd May from 05:00 UTC until 11:00 UTC.


          8. (ROC SEE): The latest update 21 to gLite introduced a new version of YAIM. It has some new features, which is very positive, but during deployment we encountered excessive problems due to the introduction of special pool accounts for prd and sgm users (previously just one account for each of these) and of new groups for them. Although advertised at the very end of the new YAIM guide on the TWiki, this has profound effects: if these new accounts are introduced, this must be done on all nodes, otherwise people mapped to one of these accounts will have problems trying to access local resources on other nodes where such accounts do not exist.
            Specifically: the release notes stated that reconfiguration is needed just for lcg-CE, lcg-CE_torque and glite-CE, but in fact you need to introduce the new accounts on all WNs at the same time (a minimal per-node consistency check is sketched below). This is the list of GGUS tickets we created so far:
            https://gus.fzk.de/pages/ticket_details.php?ticket=20941
            https://gus.fzk.de/pages/ticket_details.php?ticket=20942
            https://gus.fzk.de/pages/ticket_details.php?ticket=21044
            To conclude, I would say that the release notes again failed to mention some important things, and that this can badly affect VOs that make heavy use of prd or sgm accounts.
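            As an illustration of the check mentioned above (not an official YAIM tool): a short Python sketch that verifies, on the node where it is run, that the new sgm/prd pool accounts exist locally. The VO list, role names and pool size below are assumptions and should be replaced with the values from the site's users.conf; running it on every WN and CE quickly shows which nodes are missing the accounts.

              import pwd
              import socket

              # Assumed account-name pattern, VO list and pool size; take the real
              # names from your site's users.conf. This only illustrates the check.
              EXPECTED = ['%s%s%02d' % (vo, role, i)
                          for vo in ('atlas', 'cms', 'lhcb', 'alice')
                          for role in ('sgm', 'prd')
                          for i in range(1, 6)]

              missing = []
              for account in EXPECTED:
                  try:
                      pwd.getpwnam(account)      # raises KeyError if the account does not exist
                  except KeyError:
                      missing.append(account)

              if missing:
                  print('%s is missing %d pool accounts: %s'
                        % (socket.gethostname(), len(missing), ', '.join(missing)))
              else:
                  print('%s: all expected pool accounts are present' % socket.gethostname())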


      • 16:30 17:00
        WLCG Items 30m
        • Tier 1 reports
          document
        • LHCb service
          New problem with dCache SEs.
          The problem was first discovered when trying to transfer DSTs from disk-only storage (d1t0) at our Tier-1s to CERN. It was observed that many files from SARA and PIC were terminally failing with the error:
          "Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed."
          Regardless of the number of retries attempted, the files failed with the same error. When checking the files on the SRM (using srm-get-metadata) the SRM showed that these files were not staged, i.e. isCached = false.
          This was clearly a problem and the relevant files were given to the sites for further investigation. At PIC and SARA these files were confirmed to reside in '/pnfs' (and as such visible to the SRM) but not on a disk pool. Since these files were not backed up, they can be considered lost.
          Further investigation is ongoing at PIC, where a '/pnfs' to disk pool consistency check is being performed (although this operation is extremely heavy and the scripts have now been running for more than a week).
          These files were d1t0.

          Over the weekend further files were discovered at GRIDKA, SARA and RAL (provided to the sites this morning) which look to be suffering from the same problem (i.e. they are registered in '/pnfs' but cannot be brought online).
          This time the files are supposed to be in backed-up storage (d0t1), so in principle they should be recoverable. However, after three days of attempting to stage these files over the weekend (using LHCb's central stager service), they have not become available. Attempts to stage the files the hard way (i.e. attempting to copy them out) have also failed.
          It is possible that the same problem affecting the disk-only files also affected these files in the disk cache before they were migrated to tape.
          With both d1t0 and d0t1 files seemingly affected by this problem, it is hard to assess (on the experiment side) which files are affected by this bug.
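          A minimal sketch of the kind of check described above, flagging files that the SRM reports as not staged (isCached = false). The input file name and the exact srm-get-metadata output format are assumptions; this is only an illustration, not the tool actually used by LHCb.

            import os

            # surls.txt is a hypothetical input file with one SURL per line.
            # Assumption: 'srm-get-metadata' is in the PATH and prints a line
            # containing 'isCached' for each file; adjust the parsing if your
            # client formats its output differently.
            for line in open('surls.txt'):
                surl = line.strip()
                if not surl:
                    continue
                output = os.popen('srm-get-metadata %s' % surl).read()
                cached = None
                for out_line in output.splitlines():
                    if 'isCached' in out_line:
                        cached = 'true' in out_line.lower()
                if cached is False:
                    print('NOT STAGED:  %s' % surl)
                elif cached is None:
                    print('NO METADATA: %s' % surl)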
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • WLCG issues coming from ROC reports


        • Upcoming WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          • Downtime announcement: The RAL-LCG2 Castor will be down for upgrade on 30th April and 1st May; this will affect the 6 SRMs ralsrm[a-f].rl.ac.uk.

          Time at WLCG T0 and T1 sites.

        • FTS service review
          Speaker: Gavin McCance (CERN)
        • ATLAS service
          Speaker: Kors Bos (CERN / NIKHEF)
        • CMS service
          -- General: last week there was a CMS Offline/Computing workshop, attracting most of the attention.
          -- Job processing: MC production in progress. Nothing to report, apart from some left-overs of transfers to CERN still to be finished (mostly site problems, not FTS problems).
          -- Data transfers: PhEDEx was off due to the DBS-1 -> DBS-2 migration. Last week was week 5 of Cycle 2 of the CMS LoadTest07 [*], and it was a suspension week. Activity will restart as soon as PhEDEx is back up (the updated plan says Monday); the focus will be on T1<->T2 regional and non-regional routes. Planning of PhEDEx/FTS 2.0 is in progress.

          [*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • ALICE service
          Nothing special to report for ALICE; attending just in case sites need anything or have any requests for ALICE.
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • WLCG Service Coordination Issues

          Multi-VO Tier0-Tier1 transfer tests. The results of the previous tests (week of March 26th) show good overall daily / weekly transfer rates for both ALICE and CMS.

          These have to be repeated including (at least) ATLAS, which has significantly higher rates (~1 GB/s out of CERN to all Tier-1s, not including the current increased event sizes).
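          For scale: a sustained 1 GB/s corresponds to 1 GB/s × 86 400 s ≈ 86 TB per day exported from CERN, before any allowance for the increased event sizes mentioned above.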

          The earliest that such a combined test can be organised is ~end May - more details will follow as they are established.

          Speaker: Jamie Shiers / Harry Renshall
          Multi-VO results from week of March 26th
      • 16:55 17:00
        OSG Items 5m
      • 17:00 17:05
        Review of action items 5m
        list of actions
      • 17:10 17:15
        AOB 5m