WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))


Steve Traylen
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768

    NB: Reports were not received in advance of the meeting from:

  • ROCs: South Eastern Europe
  • Minutes
      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: GermanySwitzerland and Taiwan
          To: Southwestern and UK/I

          Issues: Quite a few nodes appear in the alarm table although they have monitoring disabled in GOCDB. The change may have been made only recently:
          1. srm-v2.cr.cnaf.infn.it
          2. gridse2.pg.infn.it
          3. dcsrmv2.usatlas.bnl.gov
          - 11/28: The CIC dashboard could not be accessed because of a connection problem to GOCDB.
          - Ru-Trcitsk-INR-LCG2 and KTU-BG-GLITE did not update their CA rpm to the latest version. -> Sent mail to site managers.
          • New tickets:24
          • 2nd mail:8
          • Quarantine:14
          • Extend:11
          • Close:19
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, CERN, IT, SWE, SEE

          Issues from EGEE ROCs:
          1. No issues reported

          Release News:
          1. gLite3.1.0-PPS-UPDATE10 was released to PPS. This update introduces a number of new services to gLite 3.1 for SL4 (32 bit):
            • glite-AMGA_postgres
            • glite-LFC_mysql
            • glite-LFC_oracle
            • glite-PX
            • glite-SE_dpm_disk
            • glite-SE_dpm_mysql
            • glite-VOMS_mysql
            • glite-VOMS_oracle
            Records of the pre-deployment testing can be found in
          2. Release of gLite3.1 Update07 to production is in preparation:
            (To be announced early this week)
            This release will contain:
            • JobWrapper tests - new version with no R-GMA dependencies
            • glite-VOMS_mysql metapackage for gLite 3.1 and SL(C)4
            • glite-VOMS_oracle metapackage for gLite 3.1 and SL(C)4
            • Bug fixes for UI and WN

          Service developments :
          1. Last week we sent the following call for a volunteer site to join the pre-deployment testing team in order to test the newly released AMGA service

            AMGA consists of a set of drivers implementing a powerful interface layer towards databases, centralising in a single protocol many communication aspects, mainly those dealing with security.
            A suitable strategy for the deployment and use of the AMGA server in PPS (and then in production) is currently being studied together with the main users of the service, the VOs LHCb and Atlas.
            This analysis will hopefully lead us to find the best placed site(s) to host the service.
            For the time being, the pre-deployment testing team is looking for one volunteer site to run the pre-deployment test (installation + configuration) of the AMGA service.
            This is an invitation to interested sites to come forward and contact Mario, David as coordinator of the pre-deployment, who will gladly provide them with the technical information they need.

            We would be particularly happy to receive volunteers for this activity, in the framework of the "Special support to PPS Operations" among those certified PPS sites which still don't appear in the lists in http://www.cern.ch/pps/index.php?dir=./panel/ , namely:
            • DESY-PPS
            • FZK-PPS
            • GSI-LCG2-PPS
            • SCAI-PPS
            • PPS-SiGNET
            • PreGR-01-UoM
            • PreGR-01-UPATRAS
            Suggestions from the ROCs /PPS sites dealing with possible deployment scenarios of the AMGA service in PPS are also very welcome.
            We are actively looking for a user community interested in trying out the newly released postgres-based version of AMGA.
        • EGEE issues coming from ROC reports
          ------------------------------------------------- AsiaPacific -------------------------------------------------
          ->  Major Operational Issues Encountered During the Reporting Period
          == ROC Report ==
          <Site issues>
          Open GGUS tickets status:
          #29850- TW-FTT enable VOView in the bdii (f-ce01.grid.sinica.edu.tw).
          #29445-  MyProxy failure on dg15.cc.kek.jp
          -> need more information from site admin.
          #27112- SE failure on SE.pakgrid.org.pk
          #26941- CE failure on CE.pakgrid.org.pk
           ->Site is in SD till 2007-12-03
          5 Sites NOT publishing accounting data:
          Australia-ATLAS: 19 days
          HK-HKU-CC-01:  24
          INDIACMS-TIFR: 70 (SD till 2007-12-07)
          JP-KEK-CRC-02: 55
          KR-KISTI-GCRT-01: 33
          <Other issues>
          TWAREN network maintenance on Dec. 1st.
          Start time: 2007-12-01 09:01
          End time: 2007-12-01 14:01
          Sites in APROC affected: NCUCC, NCUHEP, TW-NIU-EECS-01
          == T1 Report ==
            * Established a new CERN 5Gbps link, which has been in production since Nov. 29th
            * Testing Tokyo ICEPP network performance over new 1G connection
            * Flapping with multiple outages on backup CHI-AMS link, requesting report from network provider.
            * Built up BDII load balancing and failover
            * Renamed lcg00126 to bdii01; the second BDII is named bdii02.
            * Based on keepalived (integrated VRRP and IPVS).
              * One-level HA, director routing load balancing.
              * Using a sample script to check the BDII service instead of a TCP port check.
            * Added 40TB into cms CSA tape pool
            * investigating Castor tape migration performance with rfcp testing
             * one bottleneck found is at the uplink at the network switch integrated in disk server blade chassis
             * additional links will be added to increase network bandwidth 
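The T1 report above mentions checking the BDII service with a script rather than a plain TCP port check. The sketch below illustrates why the two differ; it is not the script used at the site. The function names and the `query` callable are hypothetical: `query` stands for whatever application-level request the check performs (for a BDII, presumably an LDAP search) and is assumed to return the number of entries found.

```python
import socket
from typing import Callable


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    This only proves that something is listening, not that it answers queries.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_healthy(query: Callable[[], int], min_entries: int = 1) -> bool:
    """Return True only if the (hypothetical) query succeeds and returns
    at least `min_entries` results; any exception counts as unhealthy."""
    try:
        return query() >= min_entries
    except Exception:
        return False
```

A hung or empty service can still accept TCP connections, so `tcp_port_open` may report success while `service_healthy` reports failure. A failover setup would normally wrap the second check in a script whose exit status reflects the result.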
          ------------------------------------------------- CentralEurope -------------------------------------------------
          ->  Major Operational Issues Encountered During the Reporting Period
          Accounting for a site with the SGE batch system.
          A site installed the lcg-CE (gLite 3.1) on a host with the Sun Grid Engine batch system but ran into trouble while trying to publish APEL accounting data. While trying to solve the issue, the site was told that it should not use an uncertified (in terms of ETICS) batch system at all.
          -> Points to Raise at the Operations Meeting
          1. SAM APEL test: when is it scheduled to become a critical test?
          -> Availability report
          The CYFRONET-LCG2 Tier-2 site remarks that, when analyzing availability reports, it is hard to determine the reason for decreased availability, because the tools which affect (FCR) and compute (GridVIEW) availability are based on SAM results, which are available only for the last 7 days. We are aware that a longer history is a performance problem, but maybe it would be possible to provide an interface to show a short period of past SAM results?
          ------------------------------------------------- CERN -------------------------------------------------
          SFU-LCG: We have 400 queued atlasprd jobs for a 10-CPU cluster. Some SFT
          jobs fail because they could not be run for a long time.
          CERN-PROD: Scheduled intervention on LSF subsystem. Has been announced,
          and a downtime was scheduled in GOCDB.
          CERN-PROD: Soon after the release of GGUS we received a number of update
          e-mails from GGUS concerning the verification by users of
          (sometimes) very old tickets. As the corresponding tickets were
          already frozen in our internal TT system, this
          caused a lot of new tickets to be opened.
          The issue was not systematic, in the sense that it did not concern
          tickets in the whole history, but it was nevertheless significant.
          We are asking the GGUS team whether they are aware of possible causes. We
          think a post-mortem analysis would be worthwhile in
          order to correctly record and address the same issue in future.
          CERN-PROD: Submission storm due to a WMS bug affecting CMS. This
          started on Tuesday evening, went on until Thursday evening,
          and overloaded both the batch system and the CEs hosting the jobs. Due
          to this, CERN hosted more than 30k Grid jobs for quite
          some time, and we passed a limit on the maximum number of jobs allowed
          in the batch system. This limit was increased from 50k
          to 75k to allow new submissions.
          ------------------------------------------------- France -------------------------------------------------
          ------------------------------------------------- GermanySwitzerland -------------------------------------------------
          ->  Major Operational Issues Encountered During the Reporting Period
          Report for Tier1 GridKa (FZK):
          [ author : Jos van Wezel]
          ---T1 site report went missing in this ROC pre-report----
          Reconstructed here:
          Short SE service interruption for an emergency update of the dcache SRM
          on 29/11.
          Report for ROC DECH
          [author: Clemens Koerdt]
          o 15 German/Swiss sites in production running with gLite 3
          o Specific news by site
            * (none)
          o WN MW version
             o 5 sites gLite 3.1
             o all other sites: gLite 3.0
          o WN OS overview
             o SL version 4 (7 sites)
             o SL version 3 (5 sites)
             o Debian (1 site)
             o CENT OS (1 site)
             o SUSE 9 (1 site)
          -> Points to Raise at the Operations Meeting
          Issues compiled by ROC DECH
          [author: Clemens Koerdt]:
          1.) Some sites are unsure about the correct procedure for introducing new service nodes into the production environment. Now that GOCDB no longer allows sites to switch off monitoring, should sites initially put the nodes in 'maintenance'?
          Once they are in maintenance, can monitoring be switched off? What is the procedure if a node needs to be decommissioned? Should it first be set into maintenance and then deleted from GOCDB, knowing that SAM continues to test it for another three days?
          2.) Ticket  https://gus.fzk.de/ws/ticket_info.php?ticket=28099 has remained in status 'assigned' for two weeks now.
          3.) At least one site report went missing in this week's ROC pre-report.
          -> Availability report
          ---T1 site availability report went missing in this ROC pre-report----
          Reconstructed here:
          Short SE service interruption for an emergency update of the dcache SRM
          on 29/11.
          ------------------------------------------------- Italy -------------------------------------------------
          ------------------------------------------------- NorthernEurope -------------------------------------------------
          ------------------------------------------------- Russia -------------------------------------------------
          1. It seems that some users try to submit jobs to the sites bypassing the RB/WMS
          system, directly using CE job submission APIs or globus tools. What should we do
          about this (i.e. not care, encourage it, or prohibit it in some way)?
          2. One of the Russian Alice managers asked us to "install pbs client on the VOBox".
          We wonder whether this should be allowed at all. What is the common
          practice on other VOBoxes? We certainly would not like to allow any grid users
          to submit jobs directly to the CE from the VOBox, bypassing the grid layer.
          ------------------------------------------------- SouthEasternEurope -------------------------------------------------
          ------------------------------------------------- SouthWesternEurope -------------------------------------------------
          ------------------------------------------------- UKI -------------------------------------------------
        • gLite Release News
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
           Availability report : BNL-LCG2
           -> Remark[s] on 2007-11-25 
          Saturday Nov 24
          Problem: panda monitor on gridui03 was not available for 16 hours.
          Cause: high data movement and nfs problems caused the monitor to hang.
          Solution: we will move panda monitor machines to .54 subnet. 
           -> Remark[s] on 2007-11-27 
          Problem: User output files failed to write to dCache.
          Cause: dc008 ran out of inodes on the file system.
          Impact: The read pool was disabled, complaining of no space. No data could be written.
          Solution: Create a separate 4 GB partition mounted as
          /controldata with a default inode size of 1k.
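The dc008 incident above shows a file system reporting "no space" because it ran out of inodes rather than blocks. A minimal monitoring sketch for that condition follows, assuming a POSIX host; the function names and the 5% threshold are illustrative choices, not taken from the site's setup.

```python
import os
from typing import Optional


def free_inode_fraction(path: str) -> Optional[float]:
    """Fraction of inodes still free on the file system holding `path`.

    Returns None when the file system does not report an inode count.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    return st.f_ffree / st.f_files


def inode_alarm(path: str, threshold: float = 0.05) -> bool:
    """True when less than `threshold` of the inodes remain free."""
    fraction = free_inode_fraction(path)
    return fraction is not None and fraction < threshold
```

Running `inode_alarm` on the pool's data and control directories alongside the usual free-space check would flag this kind of failure before writes start to fail.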
          -> Remark[s] on 2007-11-28 
          Problem: Prestage requests in dc027 were stuck.   
          Cause: The NFS client on dc027 had a problem. Could not list
          Impact: Prestage requests could not be sent to HPSS
          Solution: Reboot dc027 
          -> Remark[s] on 2007-11-29 
          Wednesday Nov 28
          Problem: Panda monitor machine gridui01 crashed for 2 hours
          Cause: High memory/high load caused the machine to go down
          Solution: Machine rebooted. High memory usage must be addressed by developers 
           -> Remark[s] on 2007-11-30 
          Problem: Machine dbarch5 and the database on it are not available
          Cause: Machine taken down to move it to another subnet
          Solution: This is scheduled downtime, machine will come back when work is completed. 
           Availability report : CERN-PROD
           -> Remark[s] on 2007-11-30 
          Scheduled intervention on LSF subsystem. Has been announced, and a downtime was scheduled in GOCDB.
          Submission storm due to a WMS bug affecting CMS. This started on Tuesday evening, went on until Thursday evening, and overloaded both the batch system and the CEs hosting the jobs. Due to this, CERN hosted more than 30k Grid jobs for quite some time, and we passed a limit on the maximum number of jobs allowed in the batch system. This limit was increased from 50k to 75k to allow new submissions.
           Availability report : TRIUMF-LCG2
           -> Remark[s] on 2007-11-27 
          SRM trouble.
           -> Remark[s] on 2007-11-30 
          SAM tests fail everywhere(?)
           Availability report : SARA-MATRIX
           -> Remark[s] on 2007-11-25 
          Problem: GIIS old entries found in sitebdii
          Solution: One time error, went away by itself.
           -> Remark[s] on 2007-11-29 
          Maintenance due to a necessary immediate upgrade of dCache.
          The red is due to SAM problems.
           -> Remark[s] on 2007-11-30 
          Problem1: GIIS old entries found
          Solution1: One time error, went away by itself.
          Problem2: import_cred.c:160: gss_import_cred: Unable to read credential for import: Couldn't open the file: /opt/edg/var/spool/edg-wl-renewd/48548840f49ff0d9359531e927e61fd6.177
          Solution2: One time error, went away by itself.
          Problem3 and 5:lcg-rm test timed out after 600 seconds
          Solution3 and 5: went away by itself
          Problem4: srmAdvisoryDelete failed. The error message was: lcg_del: Communication error on send
          Solution4: went away by itself.
           Availability report : pic
           -> Remark[s] on 2007-11-27 
          Date: 26/11/2007 from 12:40 UTC until 15:40 UTC
          Problem: A failure in the internal pro-active monitoring system (Ingrid) caused site-bdii.pic.es to fail for some hours.
          Severity: Medium. lcg-utils commands failed, since SEs were not in the infosys.
          Solution: Restarting the site-bdii.
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
          1. [Announcement] FZK (Tier1 GridKa): Scheduled downtime for maintenance on November 6, 8:00-22:00 UTC (9:00-23:00 CET). Upgrade to dCache 1.8. - All VOs using the GridKa SE are affected. Data transfers are stopped during this period.

          Time at WLCG T0 and T1 sites.

        • FTS service review 5m

          Please read the report linked to the agenda.
          https://twiki.cern.ch/twiki/bin/view/LCG/TransferOperationsWeeklyReports
          (There does not appear to be a report this week?)

          Speakers: Gavin McCance (CERN), Steve Traylen
        • CMS service
          • Item 1
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          • Last Friday CNAF site admins carried out an extraordinary emergency intervention and, using the usual procedure described at http://cic.gridops.org/index.php?section=home&page=SDprocedure , put the CNAF batch farm in Scheduled Downtime (until today).

            The procedure foresees a broadcast message sent to the affected people (for LHCb this is the lhcb-production mailing list). We didn't receive any message, and it would be good to understand why. Since the procedure is very well defined (which minimizes the possibility of errors on the sysadmin side), I tend to believe that the broadcast tool didn't work properly this time, causing some perturbation in the daily activity of LHCb. Can the relevant people (maintaining these tools) look into that?

          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          • Item 1
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • WLCG Service Coordination
          • Item 1
          Speaker: Harry Renshall / Jamie Shiers
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35
        1. Item 1