WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
  a. Dial +41227676000
  b. Enter access code 0157610

NB: Reports were not received in advance of the meeting from:

  • ROCs: NorthernEurope, Russia
  • Tier-1 sites: in2p3, fnal, ndgf, triumf
  • VOs: Alice, Atlas, CMS, LHCb
  • list of actions
    Minutes
      • 16:00 - 16:05
        Feedback on last meeting's minutes 5m
        Minutes
      • 16:01 - 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From ROC SWE (backup: ROC SEE) to ROC DECH (backup: ROC France)

          NB: The grid ops-on-duty teams are asked to submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Tickets:

          Lead team:
            Opened new:
            Closed:
            2nd mails:
            1st mail:
            Site OK:

          Backup team (treated tickets):
            Opened new: 27
            Closed: 44
            2nd mails: 23
            Quarantine: 9
            All together: 103
          Issues:
          1. SWE: There were many alarms for nodes that were not yet registered in the GOCDB. Ticket resolution has been effective; there were no outstanding problems.
          2. SEE: Monitoring was switched off for many nodes in the GOCDB. The possibility to open a single ticket for a site (when many alarms arise simultaneously at the same site) reduced the workload for the COD team, but we have to be more careful to avoid opening new tickets for individual nodes of that site, because the alarms still appear on the alarm page.
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, IT, NE, RU, SWE


          Release:
          • New updates were announced to PPS last Thursday:
            • gLite3.0.2-UPDATE35
            • gLite3.1.0-UPDATE03
            These updates include the new version of YAIM.
          • The next update to PPS will be released out of schedule, possibly within this week (depending on the results of certification). It will contain the latest version of the GFAL/DPM clients, fully compliant with the StoRM implementation of SRMv2.

          Operations:
          • Diligent started running the first phase of its data challenge this week. The activity, involving all sites supporting Diligent, will be carried out according to the following schedule:
            • 1st Part (2 weeks)
              • Start: Monday 16th July
              • End: Friday 27th July
            • 2nd Part (2 weeks)
              • Start: Monday 20th August
              • End: Friday 31st August
            Sites willing to start supporting Diligent and to be involved in the next phase can find more information on the Diligent Data Challenge web page.

          Issues from EGEE ROCs:
          1. None reported
          Speaker: Nicholas Thackray (CERN)
        • SL4 (32/64 bit) OS publishing
          OS publishing: http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_OS_name -> this is very well deployed and consistent.
          Arch publishing: http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_my_machine_architecture -> progress here, but this is not published by default unless sites are using YAIM-3.1.0-1.
          Some figures on progress: there are 352 GlueSubClusters, of which 30 publish a GlueHostArchitecturePlatformType.
          Proposal: end users assume that an unpublished GlueHostArchitecturePlatformType means 32-bit. If they find an unpublished 64-bit site, they raise a ticket with the ROC. New sites should be publishing correctly, so the problem is finite.
          Speaker: Steve Traylen (CERN)
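          The proposed client-side default above can be sketched in a few lines of Python. The attribute names (GlueSubClusterName, GlueHostArchitecturePlatformType) come from the Glue schema; the dict-based record format, the sample data, and the "i386" value used for the 32-bit default are illustrative assumptions, not part of the proposal.

```python
# Sketch of the proposed fallback: a GlueSubCluster that publishes no
# GlueHostArchitecturePlatformType is assumed to be 32-bit.
# The "i386" default and the record format are assumptions for illustration.

def platform_of(subcluster):
    """Return the published platform type, falling back to 32-bit ("i386")
    when GlueHostArchitecturePlatformType is absent."""
    return subcluster.get("GlueHostArchitecturePlatformType", "i386")

# Invented example records, roughly as a BDII query might return them.
subclusters = [
    {"GlueSubClusterName": "sc-64bit",
     "GlueHostArchitecturePlatformType": "x86_64"},
    {"GlueSubClusterName": "sc-unpublished"},  # nothing published -> assumed 32-bit
]

for sc in subclusters:
    print(sc["GlueSubClusterName"], "->", platform_of(sc))
```

          Under this convention a 64-bit site that forgets to publish would be treated as 32-bit by users, which is exactly the case the proposal asks to be reported via a ROC ticket.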
        • Contact site -> VOs 5m
          The present situation is:
          - For urgent issues/problems: GGUS; the ticket will be assigned to the VO support units.
          - For questions, non-urgent issues, etc.: the operations meeting; sites need to write this in the weekly CIC site (RC) reports, in the "Points to Raise at the Operations Meeting" text box.
        • EGEE issues coming from ROC reports
          • NO ISSUES REPORTED THIS WEEK
      • 16:30 - 17:00
        WLCG Items 30m
        • Tier 1 reports
          more information
        • WLCG issues coming from ROC reports
          1. None this week


        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          See also this weekly summary of past / upcoming interventions at WLCG Tier0 and Tier1 sites (extracted manually from EGEE broadcasts and other sources).

          Time at WLCG T0 and T1 sites.

          1. DB interventions (as with all others) that do not follow the agreed (May 2006) WLCG procedure will be classified as *unscheduled*. They should be discussed with the sites through the well-established channel of the weekly joint operations meeting and properly broadcast.
        • FTS service review
          Speaker: Gavin McCance (CERN)
        • Preparations for LFC service at LHCb Tier1 sites 5m
          Following the mail from Eva to grid-services-databases@cern.ch last week (see below), LHCb Tier1 sites (other than CNAF - already done) are requested to allocate the hardware for the LFC middle tier.

          A single batch worker node per site is expected to be sufficient, both for load and for availability, the latter being addressed by the use of a read-only (R/O) replica at another site in case of problems.

          The target date for entering production for these LFC services is the end of September (so as not to conflict with other pressing issues, such as SL4 WNs for CMS CSA07, FTS 2.0 services, SRM 2.2, etc.).

          Dear All,

          I have included the following Tier1 sites in the LFC Streams environment: IN2P3, GridKA, PIC and RAL. The LFC data has been imported into the appropriate schemas in the LHCb databases at the Tier1 sites, and replication using Streams has been successfully enabled.

          Tier1 database administrators and LFC team should now validate the copy and open the LFC service to production on their side.

          Please let me know if you have any questions.

          Cheers, Eva

        • Upgrade to SL4 WN release 5m
          Speaker: Dr John Gordon (STFC-RAL)
        • Preparations for WLCG Collaboration workshop - operations session 5m
          agenda

          As per the draft agenda for the operations session (see above), sites are requested to send their top 5 operations issues to Nick by August 10th so that these can be consolidated into a single list.

          Suggestions for additional topics for this session should be sent by July 31st.

          Suggestion from Gonzalo Merino:

          Experiences from sites/experiments operating the FTS servers

          Some months ago, a channel configuration for the FTS servers at the T1s was suggested (essentially, T1s host the channels for which they are the destination). This configuration seems not to fit, for instance, the needs of CMS. It also seems problematic because sites have no control over the files being read from them; if many sites start requesting reads from a given T1, this could overload the storage service. If this is the situation, we should make sure that the different SRM/storage implementations provide sites with the tools to control it.

        • ATLAS service
          Speaker: Kors Bos (CERN / NIKHEF)
        • CMS service
          • Job processing: CSA07 production status is in general steady: a 39M evt/month rate on QCD/Photon Jets assignments; so far 24.5M + 24M (MinBias) PH soup produced; the next 50M will be assigned soon; the first 2 CSA07 DPG requests have started. Some sites did not join the MC production last week for different reasons (e.g. CNAF had 2 days of Castor upgrade - now done; ASGC still misses the CMSSW 1_4_4 deployment - in progress).
          • Data transfers: "production" transfers: GEN-SIM data shipping to T1s is on-going; "test" transfers: LoadTest continues, the Debugging Data Transfers program has been launched, with a first report due next Thursday.
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          1. .
          Speaker: Dr roberto santinelli (CERN/IT/GD)
        • ALICE service
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • SRM v2.2 preparations 5m
          The various client tools (FTS, GFAL, lcg_utils) have been enhanced to support SRM v2.2. During certification and testing, some bugs have been found. More details on the schedule and the feature list will be provided at future operations meetings. See also today's LCG ECM.
      • 16:55 - 17:00
        OSG Items 5m
        1. Item 1
      • 17:00 - 17:05
        Review of action items 5m
        list of actions
      • 17:10 - 17:15
        AOB 5m