WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))



Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs: Italy, Russia, SE Europe, UK/I
  • VOs: ALICE, ATLAS, LHCb, BioMed
  • Minutes
    Recording of the meeting
      • 4:00 PM 4:00 PM
        Feedback on last meeting's minutes
      • 4:01 PM 4:30 PM
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: France / NDGF
          To: CERN / SE Europe

          Report from France COD:
          1. A new wiki has been set up for the COD to follow up operational use cases and their status: https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalUseCasesAndStatus#Use_cases_status

            Two related issues in that list are a priority:
            1. node removal
            2. retention period and SD in SAM
            As COD, we would like the retention period to be reduced to 1 day, as decided at ARM 11 (February 2008). This is still not the case.
          2. I want to add the recurrent case of YerPhI to the handover, to be discussed at the WLCG meeting:
            This site is unable to reach production quality. It has serious network problems and its availability is very low.
            What is the benefit, for the site and for its users, of keeping the site in production under these conditions?
          Report from NDGF COD:
          1. The main issue was the large number of alarms generated after the cert for lcg-voms.cern.ch was changed. Most of these were likely because the sites had not updated their host cert. Konstantin Skaburskas replied to my query with some useful information. The situation made creating tickets from alarms unviable.
          2. As Friday was a holiday, tickets expiring on that day were postponed to this week. I think many Russian sites could also have been on holiday last week.
          3. It's still very noticeable that you have to update the alarm fields a few times before an OK alarm is properly switched off.
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, IT, RU, SEE, SWE, UKI

          Issues from EGEE ROCs:
          1. ROC France (IN2P3-CC-PPS): A new lcg-CE (cclcgvmli10.in2p3.fr) was set up to provide access to our x86_64 WNs installed with the WN_TAR-x86_64 3.1.5-0 distribution. For the time being, 4 WNs (~30 job slots) are available. All VOs are invited to test their 64-bit software through this PPS CE.
        • gLite Release News

          Release News:

          Now in production

          No releases to production last week.
          Last one: gLite3.1 Update21
          Details in http://glite.web.cern.ch/glite/packages/R3.1/updates.asp

          Now in pre-production

          No releases to pre-production last week.
          Last one: gLite3.1.0 PPS Update26
          Details in

          Soon in production

          Release of gLite 3.1 Update22 in preparation.
          The update, to be released next Wednesday, will contain:
          • lcg-CE
            • SGE Engine enabled on lcg-CE
            • fix for DENY tags to lcg-info-dynamic-scheduler
          • dCache
            • dCache (first dCache 1.8 release)
          • MPI_utils
            • Rebuild MPI_utils mpich RPM with Fortran wrappers
          • gLite-PX
            • first version of the dynamic service publisher, replacing the previous static configuration
          • VOMS core (affecting clients)
            • new VOMS core 1.8.3-4 (affecting VOMS servers and clients on UI, WN, VOBOX, CE, SE_dpm, LFC, WMS, LB)
            • Many bug fixes. Fully backward compatible.
            • fix to trustmanager install script
          • client tools
            • lcg-infosites: new option to query for the WMS and the LB associated with a given VO. The -f option to filter based on the site name is also available.
            • bug fixes for edg-gridftp-client
        • EGEE issues coming from ROC reports
          1. ROC France: IN2P3-SUBATECH: I would like to discuss the sec-fp test. I use a world-writable directory, /dlocal, on the worker nodes as EDG_WL_SCRATCH, and consequently the sec-fp test reports a warning. Can we exclude the directory referenced by EDG_WL_SCRATCH from the test? As this variable is site-wide and used by all VOs, I do not see a simple way to avoid the top directory being world-writable (apart from using the t bit, as on /tmp).

          2. ROC SW Europe: Comment from PIC site: Last Thursday and Friday there was a scheduled downtime at PIC. We scheduled the downtime in the GOCDB some days ago, and this generated some info e-mails from the CIC portal. We know this because we received a copy of them, but we do not know who exactly received this notification. We would like to have a way to ensure that the VO managers of the affected VOs did receive this e-mail. How can we do this now?
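
The sec-fp question from IN2P3-SUBATECH above turns on whether a world-writable directory carries the sticky (t) bit. As an illustration only, and not the actual sec-fp test (whose implementation is not shown in this agenda), a minimal Python sketch of that kind of permission check might look as follows. The function name is hypothetical and is not part of any SAM tool.

```python
import os
import stat


def is_unsafe_world_writable_dir(path):
    """Return True if path is a world-writable directory WITHOUT the sticky bit.

    Hypothetical helper for illustration; not the real sec-fp test.
    """
    mode = os.stat(path).st_mode
    if not stat.S_ISDIR(mode):
        # Only directories are considered in this sketch.
        return False
    world_writable = bool(mode & stat.S_IWOTH)  # "other" write permission
    sticky = bool(mode & stat.S_ISVTX)          # the t bit, as set on /tmp
    return world_writable and not sticky
```

Under this sketch, a scratch directory with mode 0777 would be flagged, while the same directory with mode 1777 (the usual mode of /tmp) would not, and every user could still create files in it.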
      • 4:30 PM 5:00 PM
        WLCG Items 30m
        • WLCG issues coming from ROC reports
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          1. The Classic SEs at IN2P3-LPC are scheduled to be removed from production on 15 May:
            - clrauvergridse01.in2p3.fr
            - clrlcgse02.in2p3.fr
            Please back up your data before that date.

          2. The old Edinburgh site, ce.epcc.ed.ac.uk, will be retired from use in one week's time (1 May 2008). Storage services, via srm.epcc.ed.ac.uk, will be accessible via the new Edinburgh site, ce.glite.ecdf.ed.ac.uk, for some time after this, although the intention is to slowly migrate to newer storage. This means that support for several VOs will be dropped by Edinburgh, as they are not part of UKI-SCOTGRID-ECDF's supported VO list. In particular, these VOs are:
            alice, babar, biomed, cdf, cms, dzero, esr, fusion, geant4, hone, magic, minos, na48, planck, sixt, t2k and zeus

          3. At the start of May, the site egee.man.poznan.pl will be removed from production and shut down. Please back up your data stored on storage elements belonging to this site.

          4. GOG-Singapore would like to decommission their site by June 2, 2008. The hardware and services at the site will be shut down permanently. Please migrate data that is still needed by your VO before the site is disabled.

            The site currently supports the following VOs: alice, atlas, lhcb, cms, biomed, dteam and ops

          Time at WLCG T0 and T1 sites.

        • CCRC'08 Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • ALICE report
        • ATLAS report
        • CMS report

          • Data certification, Processing at the T0:
            CERN CPUs busy mostly with CMSSW 205 RelVal production. Validated releases: CMSSW V2.0.5. On the Tier-0 side, we had the Castor upgrade of the CMS instance to 2.1.7, plus LSF and /afs interventions. The /afs problem for cmsprod volume seems fixed (moving to a fresh volume did the trick).
          • Re-processing:
            Still ~10 CSA07 "long tails" workflows running HLT step. Finished most of the requests with FastSim 1.8.4, running some large MadGraph workflows. Started the iCSA08 pre-production: in progress. These data need to get back to CERN for further manipulation and injection to T1 sites for the CSA08 exercise.
          • MC production:
            DPG requests with CMSSW_184: 10M cosmics done, 4M cosmic (4T) done (GEN-SIM-DIGI-RECO) (running AlcaReco), 6M BeamHalo done, 4M MinBias done, 1M Zmumu done; 4M cosmic (0T) will start soon. DPG requests with CMSSW_177: 1M TIF cosmics (all files at CERN, Reco can start). FastSim production with CMSSW_184: 8 QCD workflows (6 done, 2 running), 9 photonjets done, 10 photonjets_etgam done, 1 Bphys done; 7 QCD workflows showed substantial job crashes, so production of these was stopped and the situation is under investigation. Pre-CSA08 production with CMSSW_205 running.
          • Data Transfers and Integrity, DDT-2/LT status:
            Production transfers in the /Prod instance of the pre-CSA data suffer from the "FILE_EXISTS" problem on Castor at CERN: Castor experts suggest it is related not to Castor itself but to the SRM11->22 upgrade; SRM/storageware experts have already been contacted. --- Production subscriptions older than 2 months are being suspended and cleaned up to prepare for May. Change in the transfer priorities to accommodate the CSA/CCRC use-cases in May: done. Stop&start of PhEDEx agents to move PhEDEx names into a more consistent CMS naming convention: done (only a few sites in the tails still need fixes: no worries). --- DDT status: progress continues, now efforts go into the debugging of non-regional routes in the CCRC scope. Day by day details at https://twiki.cern.ch/twiki/bin/view/CMS/DDTLinkExercising, visual overview at http://magini.web.cern.ch/magini/ddt.html.
          • LINKs:
            Computing meetings of last week: http://indico.cern.ch/conferenceDisplay.py?confId=33110
          Speaker: Daniele Bonacorsi
        • LHCb report
      • 5:00 PM 5:30 PM
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          Ticket 33220 is the only ticket for discussion, but its Inactivity Index is relatively low.
      • 5:30 PM 5:35 PM
        Review of action items 5m
        list of actions