WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))



Maite Barroso Lopez (CERN)
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs:
  • VOs: ALICE, LHCb
  • list of actions
      • 4:00 PM 4:05 PM
        Feedback on last meeting's minutes 5m
      • 4:01 PM 4:30 PM
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From ROC Italy (backup: ROC Taiwan) to ROC CERN (backup: ROC UKI)

          NB: Please can the grid ops-on-duty teams submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          1st mail: 14
          2nd mail: 2
          Quarant : 13
          Site OK : 9
          Solved : 32
          Unsolv : 2

          1. No last escalation steps this week.

            Last week's sites:
            INFN-CAGLIARI: down for SE upgrade to DPM
            INFN-ROMA1 : solved
            INAF-TRIESTE : solved
        • PPS Report & Issues
          PPS reports were not received from these ROCs:


          • Issues from EGEE ROCs:
            1. AP ROC: Request from CERN PPS site to publish GStat test results to SAM. If approved, this will be enabled tomorrow.
            2. NE ROC: SARA would like to add a site to the PPS for the srmv2.2 tests. We (as a ROC) sent an email to pps-support@cern.ch but have received no reply.
          Speaker: Nicholas Thackray (CERN)
        • Status update: job wrapper tests
          See attached slides
          Speaker: Piotr Nyczyk (CERN)
        • Migration to SL4 WNs

        • EGEE issues coming from ROC reports
          • Germany-Switzerland ROC: DESY reported: After installation of gLite production upgrade 27, the user mapping on the dcache_SE changed when using the kpwd mechanism. The reason is the introduction of a new grid-mapfile-to-kpwd converter script in the RPM d-cache-lcg-6.2.0-1. sgm/prd users are now mapped to sgm/prd accounts, although they were previously mapped to the first VO pool account. Unfortunately we did not find anything about this in the release notes. Other sites should be warned.
          • Germany-Switzerland ROC: DESY reported: Having installed SL4 WNs with gLite 3.1, we see that some RPMs are missing. Unfortunately, these provide essential user utilities such as lcg-infosites. Is this planned by the Middleware group? Have they made an announcement? Are there plans to provide these RPMs for the gLite 3.1 version on the SL4 WNs?
          • Italy ROC: FOR INFORMATION:
            We are going to put WNs (gLite 3.0) for Windows and AIX into production (details: https://grid-it.cnaf.infn.it/checklist/modules/dokuwiki/doku.php?id=rel:aixwindows).
            Obviously these WNs can run only AIX/Windows code, and users must take care to require the correct platform explicitly for their submission by providing the proper parameter in their jdl.
            For the access to AIX WN:
            requirements = GlueHostOperatingSystemVersion == "AIX"
            For the access to Windows WN:
            requirements = GlueHostOperatingSystemVersion == "Microsoft WINDOWS"
            If the parameter is not set the job can be sent unpredictably to any system and if the execution is attempted on the wrong platform the job will fail.
            This means that users MUST ALWAYS provide this parameter in their jdl.
            Using the value of the parameter "GlueHostOperatingSystemVersion" as the selection criterion for the platform on which the job must run is not fully satisfactory, but we have not found a better solution in the existing set of available parameters.
            The criterion that the user is responsible for choosing the parameters which select the resources is the result of the discussions that we triggered after the last EGEE conference and that involved Claudio Grandi, Ian Bird, Charles Loomis and others. Claudio has recently reported to us that the criterion was discussed at the operations meeting and endorsed by EMT without objections from TCG.
            ACTION PLAN
            1) Register the two sites on GOCDB (ENEA-INFO is just registered);
            2) Send a broadcast to inform all users about the criterion that should be adopted to select the proper O.S. and platform in their job.
          • SEE ROC: The SEE ROC decided last week to stop supporting the current version of the gLite_CE due to its instability. We will gradually take all gLite_CEs out of production until a new stable version is released.
          • SEE ROC: We are still having problems due to the synchronization of GOCDB, GSTAT and SAM DB. A node from AEGIS01-PHY-SCL that was removed from GOCDB and the site-BDII more than 3 weeks ago is still tested by SAM. Since it has now also been removed from DNS, SAM tests have started to fail and our site risks getting a COD ticket for this:
          • UKI ROC: We still have sites reporting that failed SAM tests are included in their site report even though the site was in downtime. The site in question last week was UKI-SOUTHGRID-BRIS-HEP. One example is the rm test at 22-07-2007 12:03 (4hrs).
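The Italy ROC item above asks users always to state the target platform in their jdl. A minimal Python sketch of how a submission helper might enforce this follows. Only the two requirements expressions come from the item itself; the helper names (`platform_requirement`, `with_platform`), the sample job description and the trailing semicolon (added on the assumption that the line goes into a JDL file) are illustrative.

```python
# The two values below are quoted from the Italy ROC item; the rest is illustrative.
PLATFORM_VALUES = {
    "aix": "AIX",
    "windows": "Microsoft WINDOWS",
}


def platform_requirement(platform: str) -> str:
    """Return the JDL requirements line for a supported platform."""
    try:
        value = PLATFORM_VALUES[platform.lower()]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform!r}") from None
    return f'requirements = GlueHostOperatingSystemVersion == "{value}";'


def with_platform(jdl_body: str, platform: str) -> str:
    """Append the platform requirement to a JDL body that has none yet."""
    if "requirements" in jdl_body.lower():
        # Merging two expressions needs human judgement, so refuse here.
        raise ValueError("JDL already has a requirements attribute")
    return jdl_body.rstrip() + "\n" + platform_requirement(platform) + "\n"


if __name__ == "__main__":
    # Hypothetical job description, for illustration only.
    sample = 'Executable = "run.sh";\nStdOutput = "out.txt";'
    print(with_platform(sample, "aix"))
```

Refusing to touch a JDL that already carries a requirements attribute is a deliberate simplification: combining expressions automatically could silently change where a job is allowed to run.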
      • 4:30 PM 5:00 PM
        WLCG Items 30m
        • Tier 1 reports
        • WLCG issues coming from ROC reports
          1. Germany-Switzerland: Sites complained that communication from the VOs to the sites concerning their intended switch to SL4 was not clear enough. Were there any announcements by the VOs in this meeting that we might have missed? From the management board we hear that, officially, Tier2 sites must not yet switch to SL4; on the other hand, the VOs are putting pressure on the sites to migrate, because otherwise sites cannot participate in the challenges. What should sites do?

          2. Germany-Switzerland: Concerning PRD and SGM pool accounts: Looking at the minutes of the EGEE-WLCG-OSG operations meeting of 9th July 2007, one has the impression that the VOs can now decide what they want the sites to do. This seems to contradict previous statements that in the mid and long term VOs have to move to pool accounts, and that the short-term solution is a fix in YAIM that can handle both static and pool accounts. Sites supporting many VOs MUST insist on having one single way to go. (BTW: This is also what the VOs want.)

        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          See also this weekly summary of past / upcoming interventions at WLCG Tier0 and Tier1 sites (extracted manually from EGEE broadcasts and other sources).

          Time at WLCG T0 and T1 sites.

        • FTS service review
            Please read the report linked to the agenda.
          Speaker: Gavin McCance (CERN)
        • ATLAS service
          See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.

          • Problem with FTS trying to move files from/to US sites:
            There are US sites that do not publish their information in the BDII (lcg-info --vo atlas --list-se --query 'SE=*.edu') but are listed in TiersOfATLAS.
            If transfers are scheduled between one of these sites and any other ATLAS site, in either direction, FTS fails to transfer.
            The list of the sites is:

            How do you think it is best to proceed?
          Speaker: Kors Bos (CERN / NIKHEF)
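One way to produce the list of affected sites is to compare the storage elements named in TiersOfATLAS with those returned by the lcg-info query above. The Python sketch below assumes both lists have already been obtained as plain hostnames. The function name and the example hostnames are hypothetical placeholders, not real sites.

```python
def missing_from_bdii(toa_hosts, bdii_hosts):
    """Return hosts listed in TiersOfATLAS but absent from the BDII output.

    Hostnames are compared case-insensitively, since DNS names are.
    """
    published = {host.strip().lower() for host in bdii_hosts}
    wanted = {host.strip().lower() for host in toa_hosts}
    return sorted(wanted - published)


if __name__ == "__main__":
    # Placeholder hostnames for illustration only.
    toa = ["se1.example.edu", "SE2.example.edu", "se3.example.edu"]
    bdii = ["se2.example.edu"]
    for host in missing_from_bdii(toa, bdii):
        print(host)
```

Transfers involving any host this reports would be expected to hit the FTS failure described in the item, so the output could serve as a checklist for follow-up with the sites.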
        • CMS service
          • Job processing: CSA07 (GEN-SIM) pre-production so far: 78.5M events (Minbias: 24M events, PH soup: 45.9M events, DPG: 8.5M events), at a steady rate of ~47M events/month (w/o Minbias). CSA07 (GEN-SIM) left-overs: 37.6M events; i.e. assuming the same production rate as so far, < 1 month to finish the total pre-CSA07 production. Castor@CERN issues seen during T0 tests are being addressed.
          • Data Transfers: "production" transfers continue: among 47 GEN-SIM workflows, 28 are done but only 21 DIGI-RECO workflows are ready, due to data transfer tails to T1s, which are being debugged. "Test" transfers: the Debugging Data Transfers (DDT) program has already increased the number of commissioned links to be given to Data Operations from 11 (previous weekly report) to 35 (last week).
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
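The "< 1 month" estimate in the job-processing item follows from dividing the left-over events by the quoted production rate. A small Python sketch of that arithmetic is given here, using only the figures quoted in the report and assuming the ~47M events/month rate stays constant; the function name is illustrative.

```python
def months_to_finish(remaining_m_events: float, rate_m_events_per_month: float) -> float:
    """Months needed to produce the remaining events at a constant rate."""
    if rate_m_events_per_month <= 0:
        raise ValueError("production rate must be positive")
    return remaining_m_events / rate_m_events_per_month


if __name__ == "__main__":
    leftover = 37.6  # M events still to produce (from the report)
    rate = 47.0      # M events/month, excluding Minbias (from the report)
    print(f"{months_to_finish(leftover, rate):.1f} months")
```

With the reported figures this gives about 0.8 months, consistent with the "< 1 month" statement.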
        • LHCb service
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
      • 4:55 PM 5:00 PM
        OSG Items 5m
        1. Item 1
      • 5:00 PM 5:05 PM
        Review of action items 5m
        list of actions
      • 5:10 PM 5:15 PM
        AOB 5m