Ops team minutes 2012 02 28 (Ewan M)
Apologies: Mingchao M & Mark M
Experiment problems/issues
===========================
LHCb:
------
All sites have now been migrated away from WMS-based submission of pilot
jobs to direct-to-CE submission, with the WMS route remaining available as
a backup.
There have been some issues with running at ECDF that are the subject of
an ongoing dialog with Vladimir. The problem has been traced to ECDF
having an /etc/redhat-release that claims to be RHEL rather than SL, but
a solution is still being discussed.
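To illustrate how a release string of this kind can mislead a platform
check, here is a minimal sketch that classifies a host from the first line
of redhat-release style text. The function name and the sample strings are
assumptions made for illustration; this is not the actual LHCb check,
which was not described in the meeting.

```python
def classify_release(text):
    """Return a coarse distribution label from redhat-release style text.

    Hypothetical helper: only the first non-empty line is inspected.
    """
    stripped = text.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    lowered = first_line.lower()
    if "scientific linux" in lowered:
        return "SL"
    if "red hat enterprise linux" in lowered:
        return "RHEL"
    return "unknown"


# Illustrative strings only; a real host would supply the contents of
# /etc/redhat-release.
sl_label = classify_release("Scientific Linux SL release 5.7 (Boron)\n")
rhel_label = classify_release(
    "Red Hat Enterprise Linux Server release 5.7 (Tikanga)\n"
)
```

A check keyed on this string reports an SL host as RHEL if the file has
been altered, which is the kind of mismatch described above.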
The 'zombie jobs' issue has been worked around rather than solved, by
using a script which catches jobs that have been idle for more than 24
hours and then kills them. This (mostly) deals with the symptoms but not
the cause, which remains unknown.
JC noted a discussion at the PMB meeting yesterday about reduced workload
from LHCb due to work going to Russia. Raja explained that this had been a
side-effect of the move to direct pilot submission - a large site (Yandex,
described as 'the Russian Google') was moved and started a large number of
simulation jobs, essentially taking all the available work. This should
now be resolved. Raja also noted that there is generally little LHCb work
around at the moment, in particular not much MC activity; Raja is making
efforts to make LHCb working groups aware of the situation.
ATLAS:
------
Elena reported that ATLAS are in the process of introducing functional test
jobs for production queues, in essentially the same way that they
currently have them for analysis queues. All UK sites are already running
the tests, but the results are not currently used to automatically mark
failing sites offline, though that is expected to happen later this week.
Elena has manually checked the results and it seems that UK sites are
passing well, with the exception of a known downtime at Cambridge for a
DPM upgrade. Elena will email tb-support with more information about the
details of this.
CMS:
----
Duncan noted that there was nothing to report from CMS this week.
Other VOs:
------------
Chris reported that t2k are still feeling some after-effects from partial
or incorrect updates to the GridPP VOMS server information; tickets have
been sent to Lancs, Sheffield (though Elena seems not to have seen it)
and a site in Spain.
Operations:
============
ROD team:
-----------
Daniela described an odd problem with an alarm persistently showing for
RAL-LCG2's wms03, despite the Steve Lloyd tests showing that it's actually
working fine. Kashif volunteered to have a look at this from a monitoring
perspective.
EGI:
-----
Stuart circulated an email yesterday covering the latest meeting (see copy
included on this meeting agenda). It was particularly noted that the EMI
update to deal with the recent EMI/EPEL/IGE Globus library problem is
expected to release on the 15th of March. It was pointed out that the UMD
release doesn't suffer from this particular problem, but this is
essentially believed to be happenstance rather than necessarily being
reflective of higher general quality in UMD. Sam advised that anyone
installing a DPM in the very short term should install from UMD until the
fixed EMI release is available. There was then a discussion on whether or
not the EPEL repository was required for UMD installs, and several people
were going to check (as it turns out, it is).
SAM Nagios:
-----------
Nothing to report.
Tier 1 update:
---------------
RAL is currently in the middle of a Castor upgrade process, having done
the CMS instance last Monday, ATLAS on Wednesday, LHCb yesterday, and with
the 'gen' instance for everyone else planned for tomorrow. That process
seems to be going well.
An Oracle update was applied to 3D databases, but this was a transparent
process since it was done on each node of the RAC clusters in turn.
Gareth gave some advance warning of forthcoming work, including an upgrade
to the MyProxy service (this will be a re-attempt of an update that didn't
go through a few weeks ago), and there will need to be one final
intervention of about 1-2 hours on the Castor databases, with a formal
announcement of the schedule planned for tomorrow.
Security:
==========
Mingchao is away, but Jeremy noted that we are behind schedule on a
security challenge (pending arrival of some scripts), so that may be
taking place some time in the next month or two.
Tickets:
==========
74675: The EMI tarball install ticket. Several UK sites have asked for the
priority to be raised on getting tarball releases of the EMI WN and UI
out, as has John Gordon on behalf of the UK as a whole. It was suggested
that any other sites interested in this add their names to the ticket.
It does appear that there will be a gap in support between the end of
gLite support and the release of the EMI tarballs.
Biomed tickets: There have been tickets filed by Biomed asking for sites
to add software tags for the Hydra client tools, despite them being a
standard component of the WN install. Several arguments for why we should
not be doing this have been collected in the RHUL ticket (). The biomed
person responsible (Frank) seems to be responding appropriately, and we're
now down to one outstanding 'real' problem at Lancaster. This is because
they actually don't have the Hydra components due to having an older gLite
release (Matt explained that they'd been waiting for the EMI tarballs to
do the upgrade with).
Sam went on to raise a general concern that the people managing software
deployment for Biomed appear to have a worryingly poor understanding of
the principles and operation of the system that they are using. Jeremy
will convey this concern in the general direction of EGI.
79571: Ticket to enable gLExec testing on the SAM Nagios in advance of
some of the LHC VOs starting (again) to use it. Ticket was wrongly
assigned to Manchester, re-assigned to Oxford, and then closed because we
already meet all of the requirements.
79545: The LHCb 'zombie job' ticket. Catalin described the RAL solution
which consists of a MySQL query against the CREAM CE database to identify
idle jobs more than 24 hours old, and to then run a script adapted from a
CREAM tutorial to purge those jobs from the CE. This is now running as a
daily cron job at RAL. It was asked whether this problem was seen at other
sites and if the RAL code would be useful for them. Kashif opined that
this is a very common and general problem, seen at our sites and described
on LCG-ROLLOUT. The cause of some of these problems is known to be expired
proxies, but it's not clear whether that's the case for the recent LHCb
ones. Sites that are interested in Catalin's solution should get in touch
with him.
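As a rough illustration of the selection step only, the sketch below picks
out jobs that are idle and were last updated more than 24 hours ago. The
record layout (id, status, last_update) and the "IDLE" status label are
hypothetical and do not reflect the real CREAM database schema; the RAL
solution uses a MySQL query and a separate purge script, neither of which
is reproduced here.

```python
from datetime import datetime, timedelta

# Age beyond which an idle job is treated as a zombie (24 hours, as above).
IDLE_LIMIT = timedelta(hours=24)


def find_stale_idle_jobs(jobs, now, limit=IDLE_LIMIT):
    """Return the ids of idle jobs whose last update is older than `limit`.

    `jobs` is a list of dicts with hypothetical keys: id, status, last_update.
    """
    return [
        job["id"]
        for job in jobs
        if job["status"] == "IDLE" and now - job["last_update"] > limit
    ]


# Made-up records for illustration; a real script would read these from
# the CE's job database.
now = datetime(2012, 2, 28, 12, 0)
jobs = [
    {"id": "job-a", "status": "IDLE", "last_update": datetime(2012, 2, 26, 9, 0)},
    {"id": "job-b", "status": "IDLE", "last_update": datetime(2012, 2, 28, 8, 0)},
    {"id": "job-c", "status": "RUNNING", "last_update": datetime(2012, 2, 25, 9, 0)},
]
stale = find_stale_idle_jobs(jobs, now)  # -> ['job-a']
```

Keeping the selection separate from the purge makes it possible to list
the candidate jobs and review them before anything is removed from the CE.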
74353: The long-running RAL WMS Pheno ticket. RAL have been waiting for a
reply from Pheno for some time and have had a short and not entirely
helpful one, so have asked for more detail and set the ticket back to the
'waiting for reply' state.
78835: Manchester's DPM problem; the ticket was filed by Biomed, but it's
just a specific instance of the same DPM publishing zero free space issue
that has been discussed as affecting the ops test jobs. Manchester have
added a dedicated small (~200GB) pool for ops, but have no short-term
plans to extend such a facility to other VOs. It was pointed out that the
Manchester SE is actually broken for many VOs at this point, but
Alessandra is depending on a fix from the DPM developers. The
possibilities of such a fix were discussed again at yesterday's DPM
community workshop (attended by Sam, Alessandra and Wahid) and the
developers are open to the possibility of changing things, but it's not
likely to happen
in the very short term.
Other stuff:
=============
Moving on from the disk space discussion, Jeremy restated the current
expected GridPP resource shares for minor VOs as being 10% of CPU and 3%
of disk space. There are concerns about how disk resources are funded, and
Steve Lloyd is looking into this with a view to a discussion at the
Manchester GridPP meeting.
Catalin reported a problem that t2k had been having at RAL; the VO had
started to use the production and lcgadmin roles, but these are not
described as required in the CIC portal VO ID cards, so had not been set
up. It was agreed that it was the VO's responsibility in principle to keep
their portal entries updated, and that in practice Catalin would point
Chris Walker to the tickets, and he would then work with the VOs to get
them sorted out.
EMI 1/ EMI 2 Migration and staged rollout
-------------------------------------------
Daniela has been collecting some status information on the current
advanced/staged rollout deployments of EMI components in the UK, with the
results visible at:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
Jeremy asked for some immediate feedback on the page, and it was generally
agreed that it was good. There was some discussion of specifics, including
the EMI WN release, which has been observed to be broken by RHUL, but
works elsewhere. Daniela pointed out that the page notes those
observations, and so gives site admins the information on which to base a
decision.
There was a brief discussion on whether the page should be a wiki page
editable by everyone, or stay in its current home in Daniela's web space.
It was decided to leave it where it is for the time being, and for
Daniela to take on the "editor's" role of maintaining it, and avoid the
risk of it becoming unmaintained as often happens with wiki pages.
There was a brief discussion of the possibilities of testing EMI2, and
whether sites that are staged rollout sites for EMI1 would simply be
expected to carry on, but it was noted that testing EMI2 on SL6 is a
rather more major job than a routine upgrade.
Storage versions
-----------------
Brian went through a list of sites with older storage systems that
fall below the current baseline recommendations:
- Durham: No detail.
- ECDF: Is only a test StoRM instance, Andrew will follow up with Wahid.
- Bristol: No definite plans, but discussions ongoing with Oxford.
- Oxford: Have been waiting for new hardware, new releases, etc. but have
a realistic plan to get the whole site updated soon.
- Brunel: Is a test DPM that's due to be simply retired from service.
GridPP4 / DRI funding
----------------------
No-one reported problems, but Chris Walker did note that Dell are
apparently now able to offer and actually deliver 3TB disks.
'Other' VO support
--------------------
Jeremy reported that based on quarterly report figures we've seen a drop
in usage from the 'minor' VOs, and queried what sites' understanding of
this is and the possible reasons for it. Several areas were covered:
- The change in both tone and funding to be strongly in favour of
running resources for each site's specific LHC VOs, and away from
a more 'generic' service.
- The increase in LHC work, leaving less 'spare' capacity for minor
VOs to occupy.
- It was noted that some LHC work is counted as 'other' VO work, and
likely dominates over the 'real' other VOs.
Core Ops work
--------------
Jeremy enquired of those people present who are not part of the core ops
team whether there were any areas that they felt should be a higher
priority for the core team. There were no suggestions, thus demonstrating
that everyone on the core team is completely awesome.
Minor VO disk usage
--------------------
Jeremy noted that the discussion that's beginning about minor VO disk
space usage is likely to lead to a requirement for improved accounting.
AOB
-----
Duncan highlighted a recent broadcast from Maarten Litmaath asking everyone
to get gLExec/ARGUS installed and running so that ATLAS can test using it
with 'Glide-in WMS'.
Chat window log
===================
[11:01:58] Jeremy Coles joined
[11:02:07] Mark Slater joined
[11:02:17] Mark Norman joined
[11:02:45] Elena Korolkova Could you put the link to the meeting in indico. please
[11:02:51] Andrew Washbrook joined
[11:02:54] Rob Fay joined
[11:03:06] Duncan Rand joined
[11:03:16] Andrew McNab left
[11:03:17] Jeremy Coles http://indico.cern.ch/conferenceDisplay.py?confId=179305
[11:03:19] RECORDING Ewan joined
[11:03:51] David Crooks joined
[11:03:53] Elena Korolkova thank you, Jeremy
[11:06:53] Alessandra Forti joined
[11:07:09] Duncan Rand nothing to report from CMS
[11:08:11] Gareth Smith joined
[11:08:32] Brian Davies joined
[11:08:33] Andrew McNab joined
[11:09:21] Govind Songara joined
[11:09:26] Matthew Doidge joined
[11:10:53] Brian Davies left
[11:11:00] Brian Davies joined
[11:11:17] Elena Korolkova we do not have ticket from t2k
[11:11:32] Elena Korolkova we have a ticket from biomed
[11:13:00] Pete Gronbech joined
[11:21:14] Queen Mary, U London London, U.K. Elena 79369 contains the details
[11:22:27] Catalin Condurache joined
[11:24:55] David Crooks Sorry, I need to drop out of the meeting, another meeting
[11:25:00] David Crooks left
[11:25:28] Gareth Roy Sorry, same meeting as Dave
[11:25:30] Gareth Roy left
[11:25:51] Andrew Washbrook sorry matt, which ticket?
[11:26:05] Matthew Doidge https://ggus.eu/tech/ticket_show.php?ticket=74675
[11:26:12] Andrew Washbrook thanks
[11:28:36] Daniela Bauer You don't need the bleeding edge: 3.2.9 is suffcient
[11:29:03] Daniela Bauer I can't hear you !!!
[11:29:54] Ewan Mac Mahon Matt is pretty quiet.
[11:34:51] Elena Korolkova we fixed that]
[11:35:02] Elena Korolkova this morning
[11:35:47] Mohammad kashif https://ggus.eu/tech/ticket_show.php?ticket=72506 Cream issue
[11:35:54] Queen Mary, U London London, U.K. pheno are likely to be hit by the voms certificate problem too.
[11:35:59] Mohammad kashif https://savannah.cern.ch/bugs/?86700
[11:36:58] Duncan Rand a new SE or a new file system?
[11:37:11] Sam Skipsey (a new pool or a new SE, rather?)
[11:37:56] Jeremy Coles https://ggus.eu/tech/ticket_show.php?ticket=54818
[11:40:28] Elena Korolkova If some vo give you finding for their resources can we use more storage in this case?
[11:40:38] Elena Korolkova funding
[11:43:10] Jeremy Coles http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
[11:46:58] Mark Slater looks good to me - especially the guides/docs
[11:47:22] Ewan Mac Mahon Well, this is the first I've seen it, but I agree, it looks like a very good thing to have,
[11:47:32] Matthew Doidge Agreed
[11:47:35] Ewan Mac Mahon Should help make the whole staged rollout effort more useful.
[11:49:11] Ewan Mac Mahon One query - this isn't a wiki page, it's on daniela's home space. Should it be a wiki page? (If no-one else is expected to update it, then probably not, but...)
[11:49:34] Daniela Bauer I hate that wiki, I find it impossible to get teh formatting worng.
[11:49:42] Daniela Bauer aeehh I mean right
[11:50:05] Queen Mary, U London London, U.K. A link fromn the wiki would be fine.
[11:51:20] Daniela Bauer I did put a link on the Wiki, somewhere at the end.
[11:56:30] Sam Skipsey If I recall correctly, the DPM guys were fairly confident that we'd get an EMI 1.8.3 DPM sometime around mid-March.
[11:58:11] Ewan Mac Mahon No, no-one wants a 50 minute open discussion.
[12:01:35] Ewan Mac Mahon The PMB rarely gives a strong clear statement of anything.
[12:01:43] Ewan Mac Mahon Especially anything 'negative'.
[12:02:19] Jeremy Coles 50?
[12:02:31] Ewan Mac Mahon Did you say 15?
[12:03:00] Jeremy Coles yes
[12:03:22] Ian Collier Sounded like 50 from here as well - I was worried for a moment
[12:03:45] Ewan Mac Mahon Ah. I don't mind a longer disucsiion, but from the minuting POV I'd rather everyone goes for lunch
[12:09:19] Mark Slater left
[12:09:19] Catalin Condurache left
[12:09:20] Ian Collier left
[12:09:20] Andrew McNab left
[12:09:21] Elena Korolkova left
[12:09:21] Queen Mary, U London London, U.K. left
[12:09:22] Raja Nandakumar left
[12:09:23] Mark Norman left
[12:09:23] Duncan Rand left
[12:09:24] Stephen Jones left
[12:09:25] Mohammad kashif left
[12:09:26] Sam Skipsey left
[12:09:28] Govind Songara left
[12:09:29] Matthew Doidge left
[12:09:30] Brian Davies left