EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Steve Traylen (CERN)
Description
grid-operations-meeting@cern.ch
Weekly EGEE infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • EGEE operations team
  • EGEE ROC managers
  • site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    AND click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

      • 4:00 PM 4:20 PM
        EGEE Items 20m
        • Central Grid-Operator-on-Duty (c-COD) handover
          From Northern Europe to Italy
          Handover Log:
          Currently there are 4 items in the C-COD dashboard. Two concern the UK: a very old unhandled alarm against UKI-NORTHGRID-LANCS-HEP and an expired ticket for UKI-SCOTGRID-ECDF. The UKI ROD (John Walsh) provided this explanation today: "The UK/I ROD-on-Duty did not receive any e-mail notification from Michaela with regards to the on-going alarm. The dashboard was difficult enough to use last week - often lost connections etc."

          Otherwise we have two APEL issues older than 30 days, both in AP. Not exactly the same ones as last week, though.

          For the older one, against TW-FTT (Taiwan), Jashon Shih provided an explanation after I asked why apel-support is not involved: "I believe the problem is not related to apel itself as the situation appear after the CE box migrate with creamCE. though we have problem to probe further but the rgma cfg seems not correctly define to the site mon box. i am sending remind if site admin can put more trouble shooting info there in the diary. sorry for not proactively checking the pending tickets while i am working on the pending site creamCE issue."

          For the new one, against IN-DAE-VECC-02: APEL support is not yet involved, but judging from the information in GGUS, the site admins have not been very active so far. More effort should be put into that one. One suggestion: it would be nice if the AP ROD could update the information in GGUS to reflect their internal communications, to make these tickets more transparent for everybody. :-)

          Other things that appeared during the week: one other unhandled alarm against a UKI site (see the explanation above), and one unhandled alarm against an NE site, where the ROD forgot to switch off the OK alarm (maybe also due to dashboard instabilities, as the alarm also started on Tuesday, when the dashboard problems were most prominent).

          Have a nice week,
          Michaela

        • Pilot Services Report & Issues
          • There are currently no active pilots
          Info about active pilot services at:
          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPilots
        • gLite Release News
        • EGEE issues coming from ROC reports
          France, ROC_Canada and Russia did not validate their reports this week.

          South West Europe ROC: There is a new value in gstat2.0, GlueCEPolicyAssignedJobSlots, which is not queried yet by SGE. Therefore, our SGE sites will have a critical error. According to a mail from Gonçalo Borges, the request to query this variable has not properly reached the SGE supporters. Is it possible to change the error to a warning until they have implemented it in SGE?

          Comment from chair: in fact, it is the same for torque as well: PATCH:3320

          A reasonable request, but it would be good to know the time scales for SGE. The torque patch is in status "Ready for rollout". Also, although it is currently an ERROR, it is not a critical test, so it "does not matter", but it does create a lot of noise in the test results.

        • Grid Service Interventions
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
          Please consult the URLs above for details.
        • CREAM-CE Deployment
          CREAM CE on SL5 still not released: https://savannah.cern.ch/patch/?3260
          SAM tests: available for viewing in Production (https://lcg-sam.cern.ch:8443/sam/sam.py). Only CEs that publish "production" tag are visible.
          Need to define when tests can be set to critical in order for alarms to be generated.
        • Miscellaneous 10m
          Various tidbits gleaned from the CIC portal:
          • New version of gstat portal available for download.

          Reminder:
          Interventions that are declared AT_RISK are not downtimes, and are completely ignored by SAM and GridView! Some comments are available within the Top-Tips Wiki that is linked from the monthly comments Wiki: https://twiki.cern.ch/twiki/bin/view/EGEE/SiteTopTips.

          HEPSPEC06:
          HEPSPEC06 has been validated by the ROC managers as the new benchmark for the infrastructure. A plan is being worked out to define the timeline for sites to publish it and for operations tools to do aggregations. All sites can publish as many benchmarks as they wish, according to the user communities they support; this is already possible in the deployed GLUE version (you can add the details on how to do it).

      • 4:30 PM 4:35 PM
        Review of Action Items 5m
      • 4:35 PM 4:40 PM
        AOB 5m