28-R-15 (CERN conferencing service (joining details below))
email@example.com Weekly EGEE infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
EGEE operations team
EGEE ROC managers
site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0148141
AND open the web interface (please specify your name and affiliation there)
From Northern Europe to Italy
Currently there are 4 items in the C-COD dashboard.
A very old unhandled alarm against UKI-NORTHGRID-LANCS-HEP
and an expired ticket for UKI-SCOTGRID-ECDF United Kingdom.
Explanation provided today by ROD UKI (John Walsh):
"The UK/I ROD-on-Duty did not receive any e-mail notification from
Michaela with regards to the on-going alarm. The dashboard was
difficult enough to use last week - often lost connections etc."
Otherwise we have two APEL issues older than 30 days, both in AP. Not
exactly the same ones as last week, though.
For the older one, against TW-FTT Taiwan, Jashon Shih provided an
explanation after my question why apel-support is not involved:
"I believe the problem is not related to apel itself as the situation
appear after the CE box migrate with creamCE. though we have problem to
probe further but the rgma cfg seems not correctly define to the site mon
box. i am sending remind if site admin can put more trouble shooting info
there in the diary. sorry for not proactively checking the pending tickets
while i am working on the pending site creamCE issue."
For the new one against IN-DAE-VECC-02: APEL support is not yet involved,
but the site admins have not been very active so far, at least as far as I
can see from the information in GGUS. More effort should be put into that
one. One suggestion here: it would be nice if AP ROD could update the
information in GGUS according to their internal communications to make
these tickets more transparent for everybody. :-)
Other things that appeared during the week:
one other unhandled alarm against one UKI site (see explanation above)
one unhandled alarm against one NE site: ROD forgot to switch off the ok
(maybe also due to dashboard instabilities, as the alarm also started on
Tuesday, when dashboard problems were most prominent)
France, ROC_Canada and Russia did not validate their reports this week
South West Europe ROC:
There is a new value in gstat2.0: GlueCEPolicyAssignedJobSlots, which is not yet queried by SGE. Therefore, our SGE sites will have a critical error. According to a mail from Gonçalo Borges, the request to query this variable has not properly reached the SGE supporters. Is it possible to change the error to a warning until they have implemented it in SGE?
Comment from chair: in fact it is the same for torque as well.
A reasonable request, but it would be good to know the time scales for SGE.
The patch is in status "Ready for rollout". Also, although it is currently
an ERROR, it is not a critical test, so it "does not matter", but it does
create a lot of noise in the test results.
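The request above amounts to reporting a missing attribute as a warning rather than an error until the batch-system information providers publish it. A minimal Python sketch of that idea follows. The function name, the severity labels, the example CE identifier and the dictionary representation of a CE entry are all assumptions made for illustration; this is not how gstat is actually implemented.

```python
# Hypothetical sketch, not gstat code. A CE entry is modelled as a plain
# dict mapping GLUE attribute names to lists of published values.

# Attributes whose absence is tolerated (reported as WARNING) for now.
GRACE_ATTRIBUTES = {"GlueCEPolicyAssignedJobSlots"}


def check_attribute(entry, attribute):
    """Return (severity, message) for one required attribute."""
    values = entry.get(attribute, [])
    if values:
        return "OK", f"{attribute} is published"
    if attribute in GRACE_ATTRIBUTES:
        # Downgraded until the information provider publishes the value.
        return "WARNING", f"{attribute} is not published yet"
    return "ERROR", f"{attribute} is missing"


# Made-up entry for an SGE-based CE that does not publish the new value.
sge_ce = {"GlueCEUniqueID": ["ce.example.org:2119/jobmanager-sge-short"]}

print(check_attribute(sge_ce, "GlueCEPolicyAssignedJobSlots"))
print(check_attribute(sge_ce, "GlueCEUniqueID"))
```

Once SGE and torque publish the value, the attribute would simply be removed from the grace set so that its absence is reported as an error again.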
CREAM CE on SL5 still not released: https://savannah.cern.ch/patch/?3260
SAM tests: available for viewing in Production (https://lcg-sam.cern.ch:8443/sam/sam.py). Only CEs that publish the "production" tag are visible. We need to define when the tests can be set to critical in order for alarms to be generated.
Various tidbits gleaned from the CIC portal:
New version of gstat portal available for download.
Interventions that are declared AT_RISK are not downtimes,
and are completely ignored by SAM and GridView! Some comments are
available within the Top-Tips Wiki that is linked from the monthly
comments Wiki: https://twiki.cern.ch/twiki/bin/view/EGEE/SiteTopTips.
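The distinction above can be illustrated with a small Python sketch: only interventions that are not AT_RISK count as downtime, so a test failure during an AT_RISK window still counts against the site. The record layout, the site name, the dates and the "OUTAGE" label are assumptions for illustration and do not represent the SAM or GridView data model.

```python
from datetime import datetime

# Hypothetical intervention records; site name, dates and the "OUTAGE"
# label are made up for illustration.
interventions = [
    {"site": "SITE-A", "severity": "AT_RISK",
     "start": datetime(2009, 1, 5, 8), "end": datetime(2009, 1, 5, 12)},
    {"site": "SITE-A", "severity": "OUTAGE",
     "start": datetime(2009, 1, 6, 8), "end": datetime(2009, 1, 6, 12)},
]


def in_downtime(site, when, records):
    """True only if a non-AT_RISK intervention covers `when` for `site`."""
    return any(
        r["site"] == site
        and r["severity"] != "AT_RISK"
        and r["start"] <= when < r["end"]
        for r in records
    )


# A failure at 09:00 during the AT_RISK window is not excused ...
print(in_downtime("SITE-A", datetime(2009, 1, 5, 9), interventions))
# ... whereas the same time during the outage is treated as downtime.
print(in_downtime("SITE-A", datetime(2009, 1, 6, 9), interventions))
```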
HEPSPEC06 has been validated by the ROC managers as the new benchmark for the infrastructure. A plan is being worked out to define the timeline for sites to publish it and for operation tools to do aggregations. All sites can publish as many benchmarks as they wish according to the user communities they support; this is already possible in the deployed GLUE version (you can add the details on how to do it).