WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))


Maite Barroso Lopez (CERN)
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
    NB: Reports were not received in advance of the meeting from:

  • ROCs: NorthernEurope, Russia
  • Tier-1 sites: BNL
  • VOs: Alice, LHCb
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 16:30
        EGEE Items 29m
        • <big> Grid-Operator-on-Duty handover </big>
          From ROC DECH (backup: ROC France) to ROC Italy (backup: ROC Taiwan)

          NB: Please can the grid ops-on-duty teams submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          1. DECH: These three sites received a "final reminder" that the COD requests information about the progress on pending issues:
            If there is no reaction next week, the COD team should escalate the tickets to the operations meeting.
          2. France: 4 sites are requested to attend the Weekly Operations Meeting:
            - (SU-GRID) No entries except from COD people in the ticket.
            - (RU-Phys-SPbSU) No entries except from COD people in the ticket for more than one month.
            - (VGTU-gLite) No entries except from COD people in the ticket for at least 4 weeks. Monitoring of the node has been disabled for some time, with no explanation and no downtime declared.
            - (TECHNION-LCG2) No entries except from COD people in the ticket for 2 weeks.
            More generally, ROCs and sites must agree on a duty rota. I agree that lots of people are on vacation, but the ROCs must ensure a minimum presence. Most tickets have been in the same state for 2 weeks.
          3. France: There are also still problems of synchronization between GOC DB and SAM:
        • <big> PPS Report & Issues </big>
          PPS reports were not received from these ROCs:

          • New updates will be announced to PPS tomorrow, including a patch for FTS 2.0 (to be applied only by Tier-1s) and a new version of the gfal client.

          • Issues from EGEE ROCs:
            1. None reported
          Speaker: Nicholas Thackray (CERN)
        • <big> Phase out of LCG-2_7_0</big>
          There are still a few production sites publishing the OS version as LCG-2_7_0. After more than one year of running gLite releases, it is time to phase it out and officially stop support. We would therefore like to request that all remaining sites upgrade in the coming month, with a deadline of the end of August. The list of sites still publishing LCG-2_7_0 is available from GStat; there are 14 sites as of this morning.

          Min has kindly provided details of how this list is extracted, reproduced here to help interpret the results:

          GStat downloads all the GlueHostApplicationSoftwareRunTimeEnvironment attributes of all GlueSubClusterUniqueID entries present for the entire site, then tries to find the newest glite or lcg release among them. This method provides only one version result per site, which can make it difficult to find clusters within a site that have not upgraded. Some nodes do not show up because this information is aggregated for each site and the per-cluster detail is lost. So sites like RAL, with several CEs, will have only one version result from GStat even though the CEs have different versions.
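          The aggregation described above can be illustrated with a minimal sketch. It is not GStat's actual code: the tag format (`LCG-2_7_0`, `GLITE-3_0_2`), the rule that any gLite release counts as newer than any LCG release, and the host names are all assumptions made for illustration.

```python
import re

# Assumed tag format, e.g. "LCG-2_7_0" or "GLITE-3_0_2"; real published tags may differ.
RELEASE_TAG = re.compile(r"^(LCG|GLITE)-(\d+)_(\d+)_(\d+)$", re.IGNORECASE)


def parse_release(tag):
    """Return a sortable key for a release tag, or None for any other tag."""
    match = RELEASE_TAG.match(tag.strip())
    if match is None:
        return None
    family = match.group(1).upper()
    version = tuple(int(part) for part in match.group(2, 3, 4))
    # Assumption: any gLite release is newer than any LCG release.
    return (1 if family == "GLITE" else 0, version, tag.strip())


def newest_release(tags):
    """Pick the newest release among a list of runtime-environment tags."""
    parsed = [key for key in map(parse_release, tags) if key is not None]
    return max(parsed)[2] if parsed else None


def site_view(subclusters):
    """One result per site: tags of all subclusters are pooled first."""
    pooled = [tag for tags in subclusters.values() for tag in tags]
    return newest_release(pooled)


def subcluster_view(subclusters):
    """One result per subcluster: shows which clusters have not upgraded."""
    return {name: newest_release(tags) for name, tags in subclusters.items()}


# Hypothetical site with two subclusters, only one of which has been upgraded.
example_site = {
    "ce01.example.org": ["LCG-2_7_0", "VO-atlas-12.0.6"],
    "ce02.example.org": ["LCG-2_7_0", "GLITE-3_0_0", "GLITE-3_0_2"],
}

print(site_view(example_site))
print(subcluster_view(example_site))
```

          In this sketch the per-site view reports only the newest release found anywhere on the site, so the subcluster still at LCG-2_7_0 is visible only in the per-subcluster view.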
          Speaker: Steve Traylen (CERN)
        • <big> Migration to SL4 WNs </big>
          • AP ROC, ASGC T1: SLC4 migration for WNs in progress. 200 nodes have now been upgraded and another 300 nodes remain.
          • DECH ROC: SL version 4 (2 sites)
          • SWE ROC, PIC site: The most important issue at PIC this week is the migration to SLC4. We have installed a new CE from scratch which points to a PBS queue with WNs running SLC4. At the moment this CE is configured to support ops and dteam. We have also opened access to the sgm users of the CMS, ATLAS and LHCb VOs so that they can test the software installation. Deciding what to include in the software directory for each VO has been really tedious, since different VOs have different needs. In the end we decided to define a brand new software directory, and neither to publish any old tags from the other production CEs nor to copy old software. From the ATLAS point of view this approach is acceptable: they have already installed their latest software version on the ce-test and it seems to work fine. From the CMS point of view, however, it is not efficient at all. We are still trying to reach a solution that works for all the VOs, or at least for LHCb, CMS and ATLAS.
        • <big> EGEE issues coming from ROC reports </big>
          • UKI ROC: RAL has hit the limit of the hardware running the current RGMA registry. Unfortunately, moving it to new hardware will also require a change of IP address. We realise that this may require sites to change firewall rules. How much notice would be necessary to allow sites to prepare for this change?
      • 16:30 17:00
        WLCG Items 30m
        • <big> Tier 1 reports </big>
          more information
        • <big> WLCG issues coming from ROC reports </big>
          1. Germany-Switzerland: Tier1 Report: Propose to revert to a common reporting template. + Definition of severity is needed.

          2. PIC had an issue last week with newly installed WNs that were missing a library apparently needed by the ATLAS sw: /usr/lib/libg2c.a. Installing the rpm gcc-g77 solved the issue, but we believe that, to avoid such issues, it would be very useful for each VO to express its "base installation requirements" in some standard way. For instance, a meta-rpm like "atlas-requirements" that sets an rpm requirement on gcc-g77 would have been convenient. The same holds for the other VOs.
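
          As a rough illustration of what a standard declaration of "base installation requirements" could enable, here is a minimal sketch of a check a site might run on a new WN. It is not an existing tool: the `BASE_REQUIREMENTS` table and the `missing_requirements` function are hypothetical, and only the ATLAS entry (/usr/lib/libg2c.a, provided by gcc-g77) comes from the report above.

```python
import os

# Hypothetical declaration of per-VO base requirements: required file -> package
# that provides it. Only the ATLAS entry is taken from the PIC report.
BASE_REQUIREMENTS = {
    "atlas": {"/usr/lib/libg2c.a": "gcc-g77"},
}


def missing_requirements(vo, root="/"):
    """Return {path: package} for required files that are absent under root."""
    required = BASE_REQUIREMENTS.get(vo, {})
    return {
        path: package
        for path, package in required.items()
        if not os.path.exists(os.path.join(root, path.lstrip("/")))
    }


if __name__ == "__main__":
    for path, package in missing_requirements("atlas").items():
        print(f"missing {path}: install {package}")
```

          A meta-rpm as suggested in the report would let the package manager enforce the same dependency at installation time; the sketch only shows how such a declaration could be checked after the fact.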

        • <big>WLCG Service Interventions (with dates / times where known) </big>
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          See also this weekly summary of past / upcoming interventions at WLCG Tier0 and Tier1 sites (extracted manually from EGEE broadcasts and other sources).

          Time at WLCG T0 and T1 sites.

          1. RAL: RAL-LCG2 3D services will be at risk on the 26th July due to the application of Oracle patches.
        • <big>FTS service review</big>
          Speaker: Gavin McCance (CERN)
        • <big> SRM v2.2 testing </big> 5m
          The various client tools (FTS, GFAL, lcg_utils) have been enhanced to support SRM v2.2. During certification and testing, some bugs have been found. More details on the schedule and the feature list will be provided at future operations meetings. See also last week's LCG ECM.
        • <big> ATLAS service </big>
          See also and for more information.

          • About the OS publication (discussed at the last meeting, here): we would like a clarification of the differences between the 32-bit and 64-bit publication.
          Speaker: Kors Bos (CERN / NIKHEF)
        • <big>CMS service</big>
          • Job processing: Focus of the week was on 1) 'Spring07' tails clean-up: ongoing; 2) termination of the 1st half of the CSA07 requests (60.5 Mevts, of which 59.2 merged; production performance: an average production rate of ~42M evts/month (steady), <job-slots usage>: 4600 (+43% compared to Spring07 GEN-SIM) with regular values >5000 and best in 24h ~6500); 3) assignment of the 2nd half of the CSA07 requests (~51 Mevts will follow). In MC production, the merging steps encountered problems with a massive loss of produced (unmerged) events, which apparently cannot be recovered at CERN due to Castor issues: being followed up.
          • Data Transfers: continuing "production" transfers, i.e. GEN-SIM data shipping to T1s; "test" transfers: the LoadTest infrastructure is converging into the Debugging Data Transfers (DDT) program. The DDT Task Force gave its first report at the last Integration/CSA07 meeting; it identified the first set of already-commissioned links and is reviewing the 'LoadTest sample population' procedure to increase the number of T2->T1 links that can actually be tested within the program.
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • <big> LHCb service </big>
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • <big> ALICE service </big>
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
      • 16:55 17:00
        OSG Items 5m
        1. Item 1
      • 17:00 17:05
        Review of action items 5m
      • 17:10 17:15
        AOB 5m