WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))



Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs: SEE
  • VOs:
  • Minutes
      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: SEE / AP
          To: France / NDGF

          Report from SEE COD: Please note the following about these sites:
          1. Australia-UNIMELB-LCG2: GGUS Ticket #34393 No update, problem still there.
          2. YerPHI: GGUS Ticket #26634 SAM tests are not stable, the problem is still there, no updates to the ticket.
          3. The new COD dashboard interface seems to be better.
          Report from AP COD:
          1. On 4/23-24 the CIC Alarms interface was not working.
          2. On 4/25 GOCDB was not accessible, which also affected the COD portal and COD work.
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP IT NE SEE

          Issues from EGEE ROCs:
          1. ROC France: First attempt to deploy the tarball version of WN gLite 3.1 x86_64. Apart from one missing package, it seems OK.
          2. ROC UKI: Due to problems with the hardware on which the UKI-SOUTHGRID-BHAM-PPS site is installed, the PPS site will no longer be maintained.
            Comment (PPS coordination): As the site was involved in pre-deployment testing, this info has to be forwarded to the test coordinator (Mario David)
        • gLite Release News

          Release News:

          Now in production

          gLite 3.1.0 Updates 20 and 21 were released to production with HIGH priority.
          Update 21 was an urgent fix for a compatibility issue, introduced by Update 20, affecting lcg-CEs still running at version 3.0.
          The main changes introduced by Update20 (relevant for CCRC08) are:
          • UI/WN/VOBOX
            • new feature: glite-data-gfal version (1.10.11-1) provides new functions gfal_abortrequest and gfal_abortfiles
            • new feature: glite-data-dm-util (lcg_util) version (1.6.11-1) now prints the SE type (SRMv1, SRMv2, Classic SE) in verbose mode (when relevant)
            • bug fix: lcg-ls does not work for the classic SE
            • bug fix: lcg-cr glibc memory corruption
            • bug fix: gfal_stat seg. fault with dummy LFN
            • bug fix: lcg-sd doesn't work with SRMv2 request token
            • bug fix: lcg-gt segmentation fault
            • fix globus-gass-cache problem on WN
          • DPM/LFC v1.6.10
            • fix problem of replication of a zero-length file
            • improve logging of updatefilestatus method
            • DICOM back-end service for DPM
            • producing re-buildable source RPMs
            • group writable directories when SRM started with umask 0
            • DPM-DSI: DPM's gridftp does not allow for ':' in SURL (GGUS ticket #32335)
            • support for CKSM (md5 only so far)
          • lcg-CE
            • Changes in Globus jobmanager and GASS cache. These modifications improve the performance of the lcg-CE by a factor of two to three
            Details in http://glite.web.cern.ch/glite/packages/R3.1/updates.asp

          Now in pre-production

          PPS sites are now upgrading to gLite 3.1.0 PPS Updates 25 and 26:
          • gLite-PX
            • dynamic service publisher, replacing the previous static configuration
          • dcache 1.8
            • Major dcache version change, adds support for SRM 2.2.
          • VOMS
            • new VOMS core 1.8.3-4 (affecting VOMS servers and clients on UI, WN, VOBOX, CE, SE_dpm, LFC, WMS, LB)
            • Many bug fixes. Fully backward compatible.
            • fix to trustmanager install script
          • MPI_Utils
            • wrapper scripts to compile Fortran MPI programs.
          • APEL (CE and MON and BATCH_utils)
            • APEL working with external log4j and BC
          • GFAL APEL (CE and MON and BATCH_utils)
            • APEL working with external log4j and BC
          • UI/WN/VOBOX
            • dcache client 1.8
            • glite-data-gfal v1.10.11-1 (bug fixes)
            • glite-data-dm-util v1.6.11-1 (bug fixes)
            • fix globus-gass-cache problem on WN
            Details in
            • https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update25
            • https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update26

          Soon in production

          No release to production scheduled for this week
        • EGEE issues coming from ROC reports
          1. CE, DECH ROCs: Admins are complaining that production updates are not tested sufficiently. It is much better to invest more effort from one tester in testing than to have hundreds of site administrators debugging problems. They complain that this time it was not even possible to run YAIM on the CE.
            Reply (Release managers and pre-production teams): We apologise for the disruption caused. Update 20 was accelerated due to requirements coming from CCRC08. The installation issue with the CE was due to a mistake in the release preparation: the dependency of the installation function on the new version of yaim-core was not correctly set. As the correct version of yaim-core was already deployed in pre-production (but not in production), this issue was not visible to the pre-deployment testers in PPS. This particular issue could only have been caught by a deployment test in production (currently not foreseen by the release procedure). Note that yaim-core was being held back in PPS because it forced a change in the permissions schema for site-info.def and its containing directory to be implemented at all sites, which was not considered acceptable for operations.
            The issue found later in production, affecting submission from CE3.0 to WN3.1, has another explanation. The CE at version 3.1 has been in production for more than two months, which means that regression tests are not being done in certification. Pre-production runs, by mandate, the latest version of the services.
          2. ROC CERN:
            • SAM Unavailability
              • From: 22-04-2008 (Tue) 07:45 UTC
              • To: 23-04-2008 (Wed) 13:30 UTC
              • Severity: Minor
              • Affected services: all
              • Symptoms: problems/fixes propagated to SAM possibly 1 hour later than normal (tests only in every odd hour)
              • Reason: upgrade of SAM UI (SLC4, gLite 3.1)
              • Solution: sorting out problems arising during the installation + testing
            • top-BDII config generator tool
              • From: 23-04-2008 (Wed) 16:15 UTC
              • To: 23-04-2008 (Wed) 21:15 UTC
              • Symptom: presence of OSG sites alternating
              • Reason: misconfiguration of the top-BDII config generator
              • Solution: configuration fixed
          3. AP ROC: The 1.6.7-4 and 1.6.10 DPM releases were not found for gLite 3.0. Are they only available for gLite 3.1?
            Answer (gLite Release team): DPM is no longer supported on 3.0. Be aware that this means that no regression testing is currently being done for services in this version.
          4. DECH ROC:
            • CSCS did not upgrade their dCache installation on Friday as originally scheduled. They expected a minor update as suggested by the release numbering. But because it turned out that the configuration and the installation scripts had changed, they decided not to take the risk of breaking their installation, which is running on Solaris machines, on a Friday afternoon. They encourage dCache developers to use proper numbering semantics, which would help distinguish between minor and major updates.
            • Could we get official information on the requirements for T2s to participate in the coming CCRC? (There had been unspecific complaints about a lack of reactivity on the side of T2s!?! We are not aware of any such problems with sites in our region, but would like to encourage VOs to let us know if there are any such concerns.)
            • Daily edited availability comments appear without dates in the weekly availability site report. Would it be possible for the CIC team to add timestamps for this part of the ROC report? This would help us to prepare the ROC summary.
          5. SWE ROC: There was a request to know how many sites need information on configuring the SGE batch system for short-deadline jobs. Are these short-deadline jobs obligatory? Which VOs request this feature?
          6. UKI ROC: Feedback to be passed to the CIC portal team from one site: the table layout of the weekly report is far from ideal. It is very easy to mix up a detail field belonging to one failure with the comment box belonging to the previous or next failure.
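
The yaim-core dependency problem described in the reply to item 1 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the version numbers, the `deploy` helper and the dictionaries are hypothetical and are not taken from the actual release metadata. It shows why an update with an undeclared dependency can pass in an environment where the newer package is already installed and still fail in production.

```python
# Hypothetical illustration: version numbers and names are assumptions.

def parse_version(text):
    """Turn a dotted version string such as '4.0.4' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def deploy(environment, declared, actual):
    """Simulate an update on one environment.

    environment: packages already installed (name -> version)
    declared:    minimum versions written in the package metadata
    actual:      minimum versions the configuration step really needs
    Returns True if the configuration step would succeed.
    """
    result = dict(environment)
    # The installer only upgrades what the metadata declares.
    for name, minimum in declared.items():
        if parse_version(result.get(name, "0")) < parse_version(minimum):
            result[name] = minimum
    # Configuration works only if the real requirements are satisfied.
    return all(
        parse_version(result.get(name, "0")) >= parse_version(minimum)
        for name, minimum in actual.items()
    )


ACTUAL = {"yaim-core": "4.0.4"}          # hypothetical real requirement
DECLARED_BROKEN = {}                      # dependency missing from the release
DECLARED_FIXED = {"yaim-core": "4.0.4"}  # dependency correctly declared

PPS = {"yaim-core": "4.0.4"}         # newer yaim-core already deployed
PRODUCTION = {"yaim-core": "4.0.3"}  # older yaim-core still in place

pps_ok = deploy(PPS, DECLARED_BROKEN, ACTUAL)
prod_ok = deploy(PRODUCTION, DECLARED_BROKEN, ACTUAL)
prod_fixed = deploy(PRODUCTION, DECLARED_FIXED, ACTUAL)

print("PPS with missing dependency:       ", pps_ok)
print("Production with missing dependency:", prod_ok)
print("Production with fixed dependency:  ", prod_fixed)
```

In this model the missing declaration is invisible in the first environment because the requirement happens to be met already, which is why only a deployment test against the production baseline would expose it.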
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. AP ROC: Site decommissioning
            • GOG-Singapore would like to decommission their site by June 2, 2008
              They support the following VOs: Alice, Atlas, CMS, LHCb
              Please migrate what is still needed by your VO before the site is disabled
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          1. The Classic SEs at IN2P3-LPC are planned to be removed from production on 15 May:
            - clrauvergridse01.in2p3.fr
            - clrlcgse02.in2p3.fr
            Please back up your data before that date.

          2. The old Edinburgh site, ce.epcc.ed.ac.uk, will be retired from use in one week's time (1 May 2008). Storage services, via srm.epcc.ed.ac.uk, will be accessible via the new Edinburgh site, ce.glite.ecdf.ed.ac.uk, for some time after this, although the intention is to slowly migrate to newer storage. This means that support for several VOs will be dropped by Edinburgh, as they are not part of UKI-SCOTGRID-ECDF's supported VO list. In particular, these VOs are:
            alice, babar, biomed, cdf, cms, dzero, esr, fusion, geant4, hone, magic, minos, na48, planck, sixt, t2k and zeus
          3. At the start of May, the site egee.man.poznan.pl will be removed from production and shut down. Please back up your data stored on storage elements belonging to this site.

          Time at WLCG T0 and T1 sites.

        • CCRC'08 Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
          nothing reported in CIC portal
        • Atlas report
          nothing reported in CIC portal
        • CMS report
          nothing reported in CIC portal
          • News on Development:
          • Data certification, Processing at the T0:
          • Re-processing:
          • MC production:
          • Data Transfers and Integrity, DDT-2/LT status:
          • LINKs:
          Speaker: Daniele Bonacorsi
        • LHCb report
          nothing reported in CIC portal
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
      • 17:30 17:35
        Review of action items 5m
        list of actions