WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service; joining details below)
Nick Thackray
grid-operations-meeting@cern.ch

Description
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs: France, Russia, SE Europe (problems with the report interface?)
  • VOs:
    Minutes
      • 16:00-16:05
        Feedback on last meeting's minutes 5m
      • 16:01-16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: ROC Central Europe / ROC DECH
          To: ROC France / ROC SW Europe


          NB: Please could the grid operator-on-duty teams submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Issues:
          1. Reminder to sites/ROCs that they are primarily responsible for providing a solution (the COD should of course help) and also for updating the ticket, including setting an appropriate ticket status.
          2. Some sites appear as production sites on the COD dashboard although their type in GOCDB is PPS. The next COD team should keep an eye on that.
          3. No case was transferred to 'political instances' this week.
          4. The site status 'maintenance' is sometimes not correctly propagated to the COD dashboard.
        • Use of the DAG repository
            Thanks for the responses on this issue. The (very abbreviated) feedback is as follows:
          • North Europe: Happy to use the DAG repository (with some reservations)
          • SE Europe: Preference is not to use DAG repo, but will use it with some reservations
          • Central Europe: Happy to use the DAG repository (with some reservations)
          • UK/Ireland: Many reservations about using DAG repository
          • DECH: No strong opposition to using the DAG repository (with some reservations)
          • SW Europe: Some concerns expressed but not clear if there is strong opposition.

          General consensus is that the DAG repository can be used.
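
          As an illustration of what using the DAG repository would mean in practice for a site, here is a minimal sketch of enabling it on an SL4 node through a yum repository file. This is a sketch only, not an SA3-endorsed recipe: the baseurl and gpgkey are the publicly documented DAG locations, assumed here, and sites should substitute whatever is finally recommended.

            # dag_repo.py - hypothetical sketch; not an official recipe
            import os
            import textwrap

            REPO_FILE = "/etc/yum.repos.d/dag.repo"

            # Assumed repository definition; verify baseurl/gpgkey before use.
            REPO_BODY = textwrap.dedent("""\
                [dag]
                name=DAG RPM Repository for EL4
                baseurl=http://apt.sw.be/redhat/el4/en/$basearch/dag
                gpgkey=http://dag.wieers.com/rpm/packages/RPM-GPG-KEY.dag.txt
                gpgcheck=1
                enabled=1
                """)

            if os.path.exists(REPO_FILE):
                print("%s already exists; not overwriting" % REPO_FILE)
            else:
                with open(REPO_FILE, "w") as f:
                    f.write(REPO_BODY)
                print("wrote %s" % REPO_FILE)
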
          Any questions for SA3?
          Detailed responses:
        • Announcing SAM updates?
            SAM updates are currently announced using the SAME-announce mailing list.
            Should other mailing lists or methods be used as well/instead (e.g. COD, news on the CIC portal)?
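
            If additional channels are adopted, the announcement step could easily be scripted. A minimal sketch follows, assuming a local MTA; the list addresses and sender are placeholders, not real endpoints taken from these minutes.

              # announce_sam_update.py - hypothetical sketch
              import smtplib
              from email.mime.text import MIMEText

              LISTS = [
                  "same-announce@example.org",  # placeholder for the real list
                  "cod-announce@example.org",   # hypothetical additional list
              ]

              def announce(subject, body, sender="sam-admin@example.org"):
                  msg = MIMEText(body)
                  msg["Subject"] = subject
                  msg["From"] = sender
                  msg["To"] = ", ".join(LISTS)
                  server = smtplib.SMTP("localhost")  # assumes a local MTA
                  server.sendmail(sender, LISTS, msg.as_string())
                  server.quit()

              announce("SAM update: new critical tests",
                       "Details of the SAM update would go here.")
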
        • Use of production resources for SRM2.2 testing
          Speaker: Flavia Donno
        • Migration to SL4 WNs
            The WLCG Management Board and the GDB have requested that all WLCG Tier-1 sites migrate to SL4/gLite 3.1 WNs by the end of August.
            The MB and GDB have also expressed the strong desire that all WLCG sites migrate to the SL4/gLite 3.1 WN as soon as possible. (A quick way for a site to verify a migrated node is sketched after the site updates below.)

          Updates from the Tier-1 sites:
          • ASGC: A new CE hosting 200 SL4 cores was brought online on 10 August 2007. The remaining 350 cores will be migrated to the new CE in phases.
            The next batch of SL3 WNs is being prepared for migration to SL4, but there were no changes this past week.

          • CERN: CERN is on track to fulfil its commitment to provide SL4-based WNs by the agreed date of end of August.

          • BNL: No report.

          • FermiLab: No report.

          • TRIUMF: All new resources will be installed with SL4 and will come online around 20 August. The old cluster will be moved and re-installed with SL4 shortly after.

          • IN2P3: No report.

          • GridKA: All WNs at GridKa have been on SL4 since 27-7-2007, running gLite WN package 3.021 ("compatible"). The upgrade to the 3.1 WN package is planned for early September, after allowing some time for testing in the PPS at the end of August.

          • INFN: No report.

          • SARA/NIKHEF: NIKHEF planned to upgrade their WNs to CentOS-4 in week 33; this has since been done.
            SARA will upgrade their WNs in September (no date fixed yet). It is not possible to do this earlier because of vacations of the people involved.


          • PIC: Nearly 90% of the WNs had already been migrated to SLC4; all Grid WNs at PIC are now running SLC4 and gLite 3.1.

          • RAL: A new CE, lcgce02.gridpp.rl.ac.uk, has been deployed for access to SL4 WNs, and 20% of our worker node capacity has been reinstalled with SL4. The test CE lcgce0371.gridpp.rl.ac.uk has been taken out of service. We are discussing further migration of capacity with the experiments.
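
          As referenced above, here is a quick spot check a site admin might run on a node to confirm the SL4/gLite 3.1 migration. It is a minimal sketch, assuming the node was installed via the glite-WN metapackage (the package name is an assumption about the installation method):

            # check_wn.py - hypothetical spot check for a migrated WN
            import subprocess

            def os_release():
                # /etc/redhat-release identifies the installed OS version
                with open("/etc/redhat-release") as f:
                    return f.read().strip()

            def wn_package():
                # "rpm -q" prints name-version-release of the installed package
                proc = subprocess.Popen(["rpm", "-q", "glite-WN"],
                                        stdout=subprocess.PIPE)
                out, _ = proc.communicate()
                return out.decode().strip()

            print(os_release())  # expect "Scientific Linux ... release 4.x"
            print(wn_package())  # expect a gLite 3.1 WN version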

        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, CERN, FR, IT, RU, SEE

          Issues from EGEE ROCs:
          1. Note from the SAM client admin: there was a delay in upgrading the RAL SAM UI to the newest sensors with the updated lcg-CA version (ticket #25724). Regarding action 60 from the ops meeting: at Cyfronet we use a dedicated certificate for the SAM UI. The cause of the mapping problem experienced by the CODs should be looked for on the other SAM client (PPS-RAL), where the administrator is using the same certificate for both ops and dteam. [CE ROC]

          2. Answer by PPS Coordination: the issue has been addressed via an e-mail exchange with the SAM client admin at PPS-RAL.
          Release News:
          • gLite3.1.0-PPS-Update05 has just been released to the PPS. This release is done to synchronise the PPS with the corresponding update already in production (service discovery packages for SLC4).
          • gLite3.0.2-PPS-Update38 is to be released soon (1-2 days) in the PPS. It contains the SLC3 version of the 3.1 WMS. The upgrade path is not supported for this release, so a campaign will be needed to re-install, in turn, those sites running a gLite WMS.
        • EGEE issues coming from ROC reports
          1. Complaint about the introduction of a new critical test (CE-sft-lcg-rm-free) without, as far as we know, any announcement (NIKHEF, Belgrid-UCL).
      • 16:30-17:00
        WLCG Items 30m
        • Tier 1 reports
        • WLCG issues coming from ROC reports
          1. ROC DECH: [For information] Some confusion about the Tier-2 accounting/renaming initiative. There has since been an update on this issue (see below): it turns out to be an initiative of the LCG-MB. In region DECH, Andreas Heiss (FZK, substituting for H. Marten) is coordinating this together with Peter Kunszt (CSCS).
            DESY-HH (Aug 12th): "We are not sure about the organization of the Tier-2 accounting. In a document (sent by Sue Foffano on June 20th to the project-wlcg-tier2 list) some points, including changes of the site names, are addressed. Is this the site name we presently use in the GOCDB? If yes, many changes would be ahead of us. Who is coordinating this? Are there instructions for institutes that have more than one site (e.g. DESY-HH (Hamburg) and DESY-ZN (Zeuthen, close to Berlin))? Note that DESY supports different experiments in different "Federations". We do not expect an immediate answer; we rather want to ask for some channels where the mentioned points can be brought up."


        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Known interventions affecting Tier-0 operations: networking interruptions from 04:00 to 06:00 UTC on 22 August; pilot gLite 3.1 WMS upgrades from 07:00 to 11:00 UTC on 22 August; FTS Tier-0 to Tier-1 and Tier-0 to Tier-2 upgrades from 06:30 to 10:00 UTC on 23 August.

          Time at WLCG T0 and T1 sites.

        • FTS service review

          Please read the report linked to the agenda.
          The report covers the INFN, IN2P3 and RAL sites.

          Speaker: Gavin McCance (CERN)
          Paper
        • FTS 2.0 services at CMS T1 sites for CSA07 - status & plans
          FTS 2.0 for the T1 sites is now released. The upgrade procedure is linked below.

          ...this is the proposed procedure for upgrading FTS 1.5 to FTS 2.0; it includes some suggested DB operations that have been discussed on this list, and others such as schema backups.

          Sites that have upgraded:

          • CERN
          • PIC
          • SARA(?)
          • Others?

          ASGC and FZK are in the pipeline (before CHEP for FZK). RAL: shortly. IN2P3: not before the second half of September. TRIUMF: this week; it will be done as a clean install rather than an upgrade.

          upgrade procedure
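
          To illustrate the schema-backup step mentioned above, here is a minimal sketch using Oracle Data Pump. The schema name, credentials and dump directory are placeholders, not values from the actual procedure linked to the agenda.

            # backup_fts_schema.py - hypothetical pre-upgrade helper
            import subprocess

            def backup_schema(schema="LCG_FTS", dumpfile="fts_pre20.dmp"):
                # expdp is the standard Oracle Data Pump export client
                subprocess.check_call([
                    "expdp", "system/CHANGE_ME",   # DBA credentials (placeholder)
                    "schemas=%s" % schema,
                    "directory=DATA_PUMP_DIR",     # server-side dump directory
                    "dumpfile=%s" % dumpfile,
                    "logfile=%s.log" % dumpfile,
                ])

            backup_schema()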
        • LHCb LFC issues
          Together with the LHCb team, we are repeating the tests done last May on the LHCb LFC replica at CNAF. We would like to test the behaviour of the system with more than one replica. In order to monitor the entries in the database, it would be useful to have a read-only account on each replica, and possibly the privileges necessary to see the performance of the database from the 3D OEM (performance tab).

          If your site has an LHCb LFC replica up and running, we would like to ask you for such an account (LCG_LFC_LHCB_R, if you followed the CERN naming conventions) for 2 weeks on your LHCb LFC replica.
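
          For the DBAs concerned, here is a minimal sketch of creating such a read-only account with cx_Oracle. The connect string, password and grants are assumptions; a site would typically prefer per-table SELECT grants on the LFC schema over the broad SELECT ANY TABLE used here.

            # create_lfc_ro_account.py - hypothetical DBA-side sketch
            import cx_Oracle

            DDL = [
                "CREATE USER LCG_LFC_LHCB_R IDENTIFIED BY change_me",
                "GRANT CREATE SESSION TO LCG_LFC_LHCB_R",
                "GRANT SELECT ANY TABLE TO LCG_LFC_LHCB_R",  # broad; see note
            ]

            conn = cx_Oracle.connect("system/CHANGE_ME@lfc_replica_db")
            cur = conn.cursor()
            for stmt in DDL:
                cur.execute(stmt)
            cur.close()
            conn.close()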

          From the Strmmon interface I can see at the moment

          • CNAF
          • IN2P3
          • RAL
          • GridKa
          • PIC
          • SARA ? (only for conditions DBs?)
          connected to the LHCb LFC replication.

          • As a DBA, I know of 4 T1s (beyond CNAF) running the LHCb LFC Streams replica: RAL, PIC, IN2P3 and GridKa. Do they already run the LFC frontend in a production environment?
          • In case the replica is in production, how many servers are used as frontends and (roughly) what hardware characteristics do they have?
          Barbara Martelli, INFN-CNAF
          checklist
        • ATLAS service
          Speaker: Kors Bos (CERN / NIKHEF)
        • CMS service
          • No report.
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          • No report.
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          • No report.
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • Service Coordination
          The ATLAS M4 cosmic ray run is scheduled from 23 August to 3 September; see https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperatonsPlanningM4. The CMS CSA07 service challenge is due to start on 10 September and run for 30 days; see https://twiki.cern.ch/twiki/bin/view/CMS/CSA07Plan.
          Speaker: Harry Renshall / Jamie Shiers
      • 16:55-17:00
        OSG Items 5m
        1. Item 1
      • 17:00-17:05
        Review of action items 5m
        list of actions
      • 17:10-17:15
        AOB 5m