WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768

  • NB: Reports were not received in advance of the meeting from the following ROCs: SouthWest Europe
  • Reports were received from the following VOs: ATLAS
      • 1
        Feedback on last meeting's minutes
        Minutes
      • 2
        EGEE Items
        • a) Grid-Operator-on-Duty handover
          From: CERN / CE
          To: Italy / SWE


          Issues from CERN ROC:
          1. The CEs at CERN-PROD were under heavy load and generated many alarms for the COD. However, the CEs are behaving as expected: when they are overloaded they disappear from the information system so that they do not receive more jobs. The monitoring tools should be able to detect this condition by correlating the SAM and gstat results. A warning about the overload could possibly be sent to the site, but not an error report.
          2. The next COD should not open tickets for alarms about "RGMA-host-cert-valid failed on LRZ-LMU". This is a bug in the SAM tests (https://savannah.cern.ch/bugs/?32497).
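The correlation asked for in item 1 could be sketched roughly as follows. This is only an illustration: the `CEStatus` fields and the `classify` function are hypothetical names, not part of SAM or gstat, and the sketch assumes that a CE which fails SAM while being absent from the information system is overloaded rather than genuinely broken, something a real tool would need to verify.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    ERROR = "error"


@dataclass
class CEStatus:
    """Hypothetical per-CE snapshot; the field names are illustrative only."""
    host: str
    sam_passed: bool       # outcome of the SAM test against this CE
    in_info_system: bool   # whether gstat currently sees the CE published


def classify(status: CEStatus) -> Severity:
    """Combine the SAM and gstat views of a CE into a single severity."""
    if status.sam_passed:
        return Severity.OK
    # SAM failed. Assumption: a CE that has withdrawn itself from the
    # information system is shedding load, so warn instead of raising an error.
    if not status.in_info_system:
        return Severity.WARNING
    # SAM failed while the CE is still published: treat as a real failure.
    return Severity.ERROR


if __name__ == "__main__":
    overloaded = CEStatus("ce.example.org", sam_passed=False, in_info_system=False)
    print(overloaded.host, classify(overloaded).value)
```

With a rule of this kind the COD would see a warning for the overloaded CEs and an alarm only when a published CE fails its tests.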
        • b) PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, IT, NE, SWE


          Issues from EGEE ROCs:
          None reported

          Closure of the PPS service inventory:

          Thanks to the sites and ROCs that provided feedback this week on the spreadsheet published last week.
          The latest version can be found at
          www.cern.ch/pps/index.php?dir=./site/
          under "Service Inventory".
          The contact point for further feedback is the list
          pps-support@cern.ch

          Test of 64-bit WNs in PPS:

          The natively compiled 64-bit WNs have reached the pre-production phase and are ready to be deployed.
          We want to deploy them in PPS in the way that is most convenient for the VOs to use.
          Last week we sent two messages:
          • To the VOs (through the EIS team), asking for expressions of interest and for suggestions and feedback on possible testing scenarios.
            We received a reply from LHCb and we are now working to address their requirements.
            Are there other VOs interested in being involved in a pre-production activity of 64-bit WNs?
          • To the PPS sites, asking for sites willing to dedicate 64-bit machines to this deployment and testing activity.
            So far only CERN and one site in Baltic grid have volunteered to pilot their 64-bit deployment through PPS.
            Are there other sites willing to support this testing platform?

          Release News:
          1. gLite 3.1.0 PPS Update 19 passed the pre-deployment testing and is now being deployed in PPS
            • WN 3.1 for SL4 64-bit
            • glite-LSF_utils
            • lcg-vomscerts-4.8.0, which adds the next certificate for biomed and egeode
            • new version of lcg-ManageVOTag fixing bug #31848
            Release notes in:
            https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update19
          2. gLite 3.1.0 PPS Update 20 was released to PPS and is going through the pre-deployment testing.
            The update introduces the MONBOX on the 3.1 baseline (for SLC4)
        • c) EGEE issues coming from ROC reports
          1. [Question] As the classic SE will be phased out soon, are there plans to continue development of that MW service? There are other developments built on the SE, like http://www.isgtw.org/?pid=1000820 (bridging the islands of SRM and SRB). What is the current status of the future SRM2 interface for SRB (ASGC)? (FhG SCAI)
        • d) gLite Release News
          1. An update to gLite (3.1 Update 15) was released containing the new certificate of the VOMS server for the VOs biomed and egeode.
            All sites supporting these VOs should upgrade their services, as the old certificate has already expired (1st of March).
          2. An update to gLite (3.0 Update 40) will be released very soon (today). The update will contain
            • The new certificate of the VOMS server for the VOs biomed and egeode.
              (expired since the 1st of March)
            • Fix for the uid limit bug in the gridftp server
          3. The release of gLite 3.1 Update 16 to production is in preparation.
            The update will contain:
            • A new index to speed the BDII up
            • UI: Bug fixes to the JDL API (bulk submission) and the gfal clients
            • dCache SE: Glue 1.3 clean-ups
            • DPM SE: version 1.6.7 (32-bit and 64-bit) fixing various configuration bugs; introducing new front-ends for Xroot and HTTP/HTTPS; upgrading gSOAP from version 2.6.2 to 2.7.6b
            • lcgCE: bug fixing
        • e) Operations Tools downtimes this week
          1. [SAM] Downtime will begin at 07:45 UTC on 4th March (08:45 Geneva time) and end at 10:45 UTC on 4th March (11:45 Geneva time).
            Affected services are: GRIDVIEW, SAM and FCR
          2. [GOC DB] GOCDB was down on 28/02 (announced by the CIC portal team). There was no announcement from GOCDB about this failure, nor about the return to service... (from ROC France)
        • f) Heinz Stockinger still blocked at some sites
          Heinz Stockinger is still blocked at some sites and has asked if these sites can grant him access again. The list of CEs is:
          • ce00.hep.ph.ic.ac.uk
          • ce01.marie.hellasgrid.gr
          • ce01.tier2.hep.manchester.ac.uk
          • ce02.esc.qmul.ac.uk
          • ce02.tier2.hep.manchester.ac.uk
          • ce05.pic.es
          • ce06.pic.es
          • ce07.pic.es
          • dgc-grid-40.brunel.ac.uk
          • dgc-grid-44.brunel.ac.uk
          • egee-ce1.gup.uni-linz.ac.at
          • grid002.jet.efda.org
          • gw-2.ccc.ucl.ac.uk
          • helmsley.dur.scotgrid.ac.uk
          • mars-ce2.mars.lesc.doc.ic.ac.uk
          • serv03.hep.phy.cam.ac.uk
          • svr016.gla.scotgrid.ac.uk
          • t2ce02.physics.ox.ac.uk
          • t2ce03.physics.ox.ac.uk
      • 3
        WLCG Items
        • a) WLCG recommendation: DPM and filesystem choice
          It has been shown that the ext3 filesystem performs far worse than the XFS filesystem for file deletion operations. In particular, deleting 2048 files of 1.5 GB each takes 5 seconds on XFS and 90 minutes on ext3. Therefore, I think we should recommend that sites running DPM migrate from ext3 to XFS, if possible. In fact, running XFS has no adverse effects, only benefits.
          Speaker: Dr Flavia Donno (CERN)
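The figures quoted above amount to ext3 being about 1000 times slower for this workload (5400 s versus 5 s). A site that wants to check the behaviour on its own disk servers could use something like the sketch below. It is not the procedure used to obtain the quoted numbers; `time_file_deletion` is a hypothetical helper, and the small sizes in the demo are placeholders chosen only so that it runs quickly.

```python
import os
import tempfile
import time


def time_file_deletion(directory, n_files, size_bytes):
    """Create n_files files of size_bytes each in directory, then
    return the wall-clock time in seconds taken to delete them all."""
    block = b"\0" * min(size_bytes, 1024 * 1024)
    paths = []
    for i in range(n_files):
        path = os.path.join(directory, f"delete-test-{i:05d}.dat")
        with open(path, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                n = min(remaining, len(block))
                f.write(block[:n])
                remaining -= n
            # Push the data to disk so the deletion has real blocks to free.
            f.flush()
            os.fsync(f.fileno())
        paths.append(path)

    start = time.perf_counter()
    for path in paths:
        os.remove(path)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Demo with tiny files in a temporary directory; point `directory` at the
    # filesystem under test and raise the sizes for a meaningful measurement.
    with tempfile.TemporaryDirectory() as scratch:
        elapsed = time_file_deletion(scratch, n_files=8, size_bytes=64 * 1024)
        print(f"deleted 8 files in {elapsed:.4f} s")
```

Reproducing the quoted case would mean n_files=2048 with 1.5 GB per file, which needs roughly 3 TB of free space on the filesystem being tested.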
        • b) WLCG issues coming from ROC reports
          None this week
        • c) WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          1. [CERN] There will be a SCHEDULED downtime for SRM at CERN on 06-03-2008 from 8:00 to 12:00 (UTC+1) for the 2.1.6 CASTORLHCB upgrade (the machines are: castorsrm, srm.cern.ch, srm-durable-lhcb, srm-lhcb.cern.ch)
          2. [CERN] There was an UNSCHEDULED downtime during the weekend: lcg-voms.cern.ch was down due to a hardware problem. It is now back in service. Although the problem is not fixed, we will do our best to prevent this happening again (requires a hardware change in the future).
            Note that voms.cern.ch wasn't affected, so voms proxy and gridmap file generation were fine during the weekend.
            This affected ALICE, ATLAS, CMS, LHCb, DTEAM, OPS, Sixt, Unosat, Geant4, Gear.

          Time at WLCG T0 and T1 sites.

        • d) ATLAS Issues
          ATLAS asks all T2s to implement SRM2.2 before the 2nd of April, to allow time to test the full system before CCRC08 phase 2. From the ATLAS T0/1/2 Jamboree of February 27 2008: "About 50% of all T2’s are using SRMv2.2 now, but we would like to have them implement this well before CCRC in May because we need time to test. Deadline 2nd of April"
          Speaker: Alessandro Di Girolamo
        • e) CCRC'08 Operational Review
          • Item 1
          Speaker: Harry Renshall / Jamie Shiers
      • 4
        OSG Items
        Speaker: Rob Quick (OSG - Indiana University)
        • a) Discussion of open tickets for OSG
          The only outstanding GGUS ticket is: https://gus.fzk.de/ws/ticket_info.php?ticket=31037
          more information
      • 5
        Review of action items
        list of actions
      • 6
        AOB
        1. Item 1