Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 126540 with code: 4880.

Apologies: Mingchao, Mark M

Ops team minutes 2012 02 28 (Ewan M)

Apologies: Mingchao M & Mark M

Experiment problems/issues
===========================

LHCb:
------

All sites have now been migrated away from WMS-based submission of pilot
jobs to direct-to-CE submission instead, with the WMS route remaining
available as a backup.

There have been some issues with running at ECDF that are the subject of
an ongoing dialog with Vladimir. The problem has been traced to ECDF
having an /etc/redhat-release that claims to be RHEL rather than SL, but
a solution is still being discussed.
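
As a rough illustration of why the contents of /etc/redhat-release matter,
the following Python sketch classifies a node from the release string. The
function names and the matching rules are assumptions made for this
example; it is not the check that LHCb actually use, which was not
described in the meeting.

```python
def classify_release(text):
    """Guess a distribution label from the text of a redhat-release file.

    The prefixes tested here are illustrative assumptions only.
    """
    line = text.strip().lower()
    if line.startswith("scientific linux"):
        return "SL"
    if line.startswith("red hat enterprise linux"):
        return "RHEL"
    return "unknown"


def read_release(path="/etc/redhat-release"):
    """Return the contents of the release file, or "" if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    # Print the label for the node this script runs on.
    print(classify_release(read_release()))
```

A check of this kind would label a node whose release file claims to be
RHEL as "RHEL", whatever is actually installed, which is the sort of
mismatch described above.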

The 'zombie jobs' issue has been worked around rather than solved, by
using a script which catches jobs that have been idle for more than 24
hours and then kills them. This (mostly) deals with the symptoms but not
the cause, which remains unknown.

JC noted a discussion at the PMB meeting yesterday about reduced workload
from LHCb due to work going to Russia. Raja explained that this had been a
side-effect of the move to direct pilot submission - a large site (Yandex,
described as 'the Russian Google') was moved and started a large number of
simulation jobs, essentially taking all the available work. This should
now be resolved. Raja also noted that there is generally little LHCb work
around at the moment, in particular not much MC activity; Raja is making
efforts to make LHCb working groups aware of the situation.

ATLAS:
------

Elena reported that ATLAS are in the process of introducing functional test
jobs for production queues, in essentially the same way that they
currently have them for analysis queues. All UK sites are already running
the tests, but the results are not currently used to automatically mark
failing sites offline, though that is expected to happen later this week.
Elena has manually checked the results and it seems that UK sites are
passing well, with the exception of a known downtime at Cambridge for a
DPM upgrade. Elena will email tb-support with more information about the
details of this.

CMS:
----

Duncan noted that there was nothing to report from CMS this week.

Other VOs:
------------

Chris reported that t2k are still feeling some after-effects from partial
or incorrect updates to the GridPP VOMS server information; tickets have
been sent to Lancs, Sheffield (though Elena seems not to have seen it)
and a site in Spain.

Operations:
============

ROD team:
-----------

Daniela described an odd problem with an alarm persistently showing for
RAL-LCG2's wms03, despite the Steve Lloyd tests showing that it's actually
working fine. Kashif volunteered to have a look at this from a monitoring
perspective.

EGI:
-----

Stuart circulated an email yesterday covering the latest meeting (see copy
included on this meeting agenda). It was particularly noted that the EMI
update to deal with the recent EMI/EPEL/IGE Globus library problem is
expected to release on the 15th of March. It was pointed out that the UMD
release doesn't suffer from this particular problem, but this is
essentially believed to be happenstance rather than necessarily being
reflective of higher general quality in UMD. Sam advised that anyone
installing a DPM in the very short term should install from UMD until the
fixed EMI release is available. There was then a discussion on whether or
not the EPEL repository was required for UMD installs, and several people
were going to check (as it turns out, it is).

SAM Nagios:
-----------

Nothing to report.

Tier 1 update:
---------------

RAL is currently in the middle of a Castor upgrade process, having done
the CMS instance last Monday, ATLAS on Wednesday, LHCb yesterday, and with
the 'gen' instance for everyone else planned for tomorrow. That process
seems to be going well.

An Oracle update was applied to 3D databases, but this was a transparent
process since it was done on each node of the RAC clusters in turn.

Gareth gave some advance warning of forthcoming work, including an upgrade
to the MyProxy service (this will be a re-attempt of an update that didn't
go through a few weeks ago), and there will need to be one final
intervention of about 1-2 hours on the Castor databases, with a formal
announcement of the schedule planned for tomorrow.

Security:
==========

Mingchao is away, but Jeremy noted that we are behind schedule on a
security challenge (pending arrival of some scripts), so that may be
taking place some time in the next month or two.

Tickets:
==========

74675: The EMI tarball install ticket. Several UK sites have asked for the
priority to be raised on getting tarball releases of the EMI WN and UI
out, as had John Gordon on behalf of the UK as a whole. It was suggested
that any other sites interested in this add their names to the ticket.
It does appear that there will be a gap in support between the end of
gLite support and the release of the EMI tarballs.

Biomed tickets: There have been tickets filed by Biomed asking for sites
to add software tags for the Hydra client tools, despite them being a
standard component of the WN install. Several arguments for why we should
not be doing this have been collected in the RHUL ticket (). The biomed
person responsible (Frank) seems to be responding appropriately, and we're
now down to one outstanding 'real' problem at Lancaster. This is because
they actually don't have the Hydra components due to having an older gLite
release (Matt explained that they'd been waiting for the EMI tarballs to
do the upgrade with).

Sam went on to raise a general concern that the people managing software
deployment for Biomed appear to have a worryingly poor understanding of
the principles and operations of the system that they are using. Jeremy
will convey this concern in the general direction of EGI.

79571: Ticket to enable gLExec testing on the SAM Nagios in advance of
some of the LHC VOs starting (again) to use it. Ticket was wrongly
assigned to Manchester, re-assigned to Oxford, and then closed because we
already meet all of the requirements.

79545: The LHCb 'zombie job' ticket. Catalin described the RAL solution
which consists of a MySQL query against the CREAM CE database to identify
idle jobs more than 24 hours old, and to then run a script adapted from a
CREAM tutorial to purge those jobs from the CE. This is now running as a
daily cron job at RAL. It was asked whether this problem was seen at other
sites and if the RAL code would be useful for them. Kashif opined that
this is a very common and general problem, seen at our sites and described
on LCG-ROLLOUT. The cause of some of these problems is known to be expired
proxies, but it's not clear whether that's the case for the recent LHCb
ones. Sites that are interested in Catalin's solution should get in touch
with him.
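
The selection step of such a clean-up can be sketched in Python as below.
This is only an illustration of the "idle for more than 24 hours" rule: it
does not reproduce Catalin's MySQL query or the purge script, and the
record layout (job ID, status, time of last status change) and the name
find_zombies are hypothetical, not the CREAM database schema.

```python
from datetime import datetime, timedelta

# Threshold taken from the minutes: jobs idle for more than 24 hours.
IDLE_LIMIT = timedelta(hours=24)


def find_zombies(jobs, now, limit=IDLE_LIMIT):
    """Return the IDs of jobs that have been IDLE for longer than `limit`.

    `jobs` is an iterable of (job_id, status, last_change) tuples, where
    last_change is a datetime. This layout is a hypothetical stand-in for
    whatever the real CE database query returns.
    """
    return [job_id for job_id, status, last_change in jobs
            if status == "IDLE" and now - last_change > limit]


if __name__ == "__main__":
    now = datetime(2012, 2, 28, 11, 0)
    sample = [
        ("job-a", "IDLE", now - timedelta(hours=30)),     # idle too long
        ("job-b", "IDLE", now - timedelta(hours=2)),      # recently idle
        ("job-c", "RUNNING", now - timedelta(hours=48)),  # old but running
    ]
    print(find_zombies(sample, now))  # prints ['job-a']
```

The IDs returned would then be handed to whatever purge mechanism the site
uses; that step is deliberately left out here.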

74353: The long-running RAL WMS Pheno ticket. RAL have been waiting for a
reply from Pheno for some time and have had a short and not entirely
helpful one, so have asked for more detail and set the ticket back to the
'waiting for reply' state.

78835: Manchester's DPM problem; the ticket was filed by Biomed, but it's
just a specific instance of the same DPM publishing zero free space issue
that has been discussed as affecting the ops test jobs. Manchester have
added a dedicated small (~200GB) pool for ops, but have no short term
plans to extend such a facility to other VOs. It was pointed out that the
Manchester SE is actually broken for many VOs at this point, but
Alessandra is depending on a fix from the DPM developers. The
possibilities of such a fix were discussed again at yesterday's DPM
community workshop (attended by Sam, Alessandra and Wahid) and the
developers are open to the possibility of changing things, but it's not
likely to happen
in the very short term.

Other stuff:
=============

Moving on from the disk space discussion, Jeremy restated the current
expected GridPP resource shares for minor VOs as being 10% of CPU and 3%
of disk space. There are concerns about how disk resources are funded, and
Steve Lloyd is looking into this with a view to a discussion at the
Manchester GridPP meeting.

Catalin reported a problem that t2k had been having at RAL; the VO had
started to use the production and lcgadmin roles, but these are not
described as required in the CIC portal VO ID cards, so had not been set
up. It was agreed that it was the VO's responsibility in principle to keep
their portal entries updated, and that in practice Catalin would point
Chris Walker to the tickets, and he would then work with the VOs to get
them sorted out.

EMI 1 / EMI 2 Migration and staged rollout
-------------------------------------------

Daniela has been collecting some status information on the current
advanced/staged rollout deployments of EMI components in the UK, with the
results visible at:
 http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
Jeremy asked for some immediate feedback on the page, and it was generally
agreed that it was good. There was some discussion of specifics, including
the EMI WN release, which has been observed to be broken by RHUL, but
works elsewhere. Daniela pointed out that the page notes those
observations, and so gives site admins the information on which to base a
decision.

There was a brief discussion on whether the page should be a wiki page
editable by everyone, or stay in its current home in Daniela's web space.
It was decided to leave it where it is for the time being, and for
Daniela to take on the "editor's" role of maintaining it, and avoid the
risk of it becoming unmaintained as often happens with wiki pages.

There was a brief discussion of the possibilities of testing EMI2, and
whether sites that are staged rollout sites for EMI1 would simply be
expected to carry on, but it was noted that testing EMI2 on SL6 is a
rather more major job than a routine upgrade.

Storage versions
-----------------

Brian went through a list of sites with older storage systems that
fall below the current baseline recommendations:

- Durham: No detail.
- ECDF: Is only a test StoRM instance, Andrew will follow up with Wahid.
- Bristol: No definite plans, but discussions ongoing with Oxford.
- Oxford: Have been waiting for new hardware, new releases, etc. but have
  a realistic plan to get the whole site updated soon.
- Brunel: Is a test DPM that's due to be simply retired from service.

GridPP4 / DRI funding
----------------------

No-one reported problems, but Chris Walker did note that Dell are
apparently now able to offer and actually deliver 3TB disks.

'Other' VO support
--------------------

Jeremy reported that based on quarterly report figures we've seen a drop
in usage from the 'minor' VOs, and queried what sites' understanding of
this is and the possible reasons for it. Several areas were covered:

- The change in both tone and funding to be strongly in favour of
  running resources for each site's specific LHC VOs, and away from
  a more 'generic' service.
- The increase in LHC work, leaving less 'spare' capacity for minor
  VOs to occupy.
- It was noted that some LHC work is counted as 'other' VO work, and
  likely dominates over the 'real' other VOs.

Core Ops work
--------------

Jeremy enquired of those people present who are not part of the core ops
team whether there were any areas that they felt should be a higher
priority for the core team. There were no suggestions, thus demonstrating
that everyone on the core team is completely awesome.

Minor VO disk usage
--------------------

Jeremy noted that the discussion that's beginning about minor VO disk
space usage is likely to lead to a requirement for improved accounting.

AOB
-----

Duncan highlighted a recent broadcast from Maarten Litmaath asking everyone
to get gLExec/ARGUS installed and running so that ATLAS can test using it
with 'Glide-in WMS'.


Chat window log
===================


[11:01:58] Jeremy Coles joined
[11:02:07] Mark Slater joined
[11:02:17] Mark Norman joined
[11:02:45] Elena Korolkova Could you put the link to the meeting in indico. please
[11:02:51] Andrew Washbrook joined
[11:02:54] Rob Fay joined
[11:03:06] Duncan Rand joined
[11:03:16] Andrew McNab left
[11:03:17] Jeremy Coles http://indico.cern.ch/conferenceDisplay.py?confId=179305
[11:03:19] RECORDING Ewan joined
[11:03:51] David Crooks joined
[11:03:53] Elena Korolkova thank you, Jeremy
[11:06:53] Alessandra Forti joined
[11:07:09] Duncan Rand nothing to report from CMS
[11:08:11] Gareth Smith joined
[11:08:32] Brian Davies joined
[11:08:33] Andrew McNab joined
[11:09:21] Govind Songara joined
[11:09:26] Matthew Doidge joined
[11:10:53] Brian Davies left
[11:11:00] Brian Davies joined
[11:11:17] Elena Korolkova we do not have ticket from t2k
[11:11:32] Elena Korolkova we have a ticket from biomed
[11:13:00] Pete Gronbech joined
[11:21:14] Queen Mary, U London London, U.K. Elena 79369 contains the details
[11:22:27] Catalin Condurache joined
[11:24:55] David Crooks Sorry, I need to drop out of the meeting, another meeting
[11:25:00] David Crooks left
[11:25:28] Gareth Roy Sorry, same meeting as Dave
[11:25:30] Gareth Roy left
[11:25:51] Andrew Washbrook sorry matt, which ticket?
[11:26:05] Matthew Doidge https://ggus.eu/tech/ticket_show.php?ticket=74675
[11:26:12] Andrew Washbrook thanks
[11:28:36] Daniela Bauer You don't need the bleeding edge: 3.2.9 is suffcient
[11:29:03] Daniela Bauer I can't hear you !!!
[11:29:54] Ewan Mac Mahon Matt is pretty quiet.
[11:34:51] Elena Korolkova we fixed that]
[11:35:02] Elena Korolkova this morning
[11:35:47] Mohammad kashif https://ggus.eu/tech/ticket_show.php?ticket=72506 Cream issue
[11:35:54] Queen Mary, U London London, U.K. pheno are likely to be hit by the voms certificate problem too.
[11:35:59] Mohammad kashif https://savannah.cern.ch/bugs/?86700
[11:36:58] Duncan Rand a new SE or a new file system?
[11:37:11] Sam Skipsey (a new pool or a new SE, rather?)
[11:37:56] Jeremy Coles https://ggus.eu/tech/ticket_show.php?ticket=54818
[11:40:28] Elena Korolkova If some vo give you finding for their resources can we use more storage in this case?
[11:40:38] Elena Korolkova funding
[11:43:10] Jeremy Coles http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
[11:46:58] Mark Slater looks good to me - especially the guides/docs
[11:47:22] Ewan Mac Mahon Well, this is the first I've seen it, but I agree, it looks like a very good thing to have,
[11:47:32] Matthew Doidge Agreed
[11:47:35] Ewan Mac Mahon Should help make the whole staged rollout effort more useful.
[11:49:11] Ewan Mac Mahon One query - this isn't a wiki page, it's on daniela's home space. Should it be a wiki page? (If no-one else is expected to update it, then probably not, but...)
[11:49:34] Daniela Bauer I hate that wiki, I find it impossible to get teh formatting worng.
[11:49:42] Daniela Bauer aeehh I mean right  
[11:50:05] Queen Mary, U London London, U.K. A link fromn the wiki would be fine.
[11:51:20] Daniela Bauer I did put a link on the Wiki, somewhere at the end.
[11:56:30] Sam Skipsey If I recall correctly, the DPM guys were fairly confident that we'd get an EMI 1.8.3 DPM sometime around mid-March.
[11:58:11] Ewan Mac Mahon No, no-one wants a 50 minute open discussion.
[12:01:35] Ewan Mac Mahon The PMB rarely gives a strong clear statement of anything.
[12:01:43] Ewan Mac Mahon Especially anything 'negative'.
[12:02:19] Jeremy Coles 50?
[12:02:31] Ewan Mac Mahon Did you say 15?
[12:03:00] Jeremy Coles yes
[12:03:22] Ian Collier Sounded like 50 from here as well - I was worried for a moment  
[12:03:45] Ewan Mac Mahon Ah. I don't mind a longer disucsiion, but from the minuting POV I'd rather everyone goes for lunch  
[12:09:19] Mark Slater left
[12:09:19] Catalin Condurache left
[12:09:20] Ian Collier left
[12:09:20] Andrew McNab left
[12:09:21] Elena Korolkova left
[12:09:21] Queen Mary, U London London, U.K. left
[12:09:22] Raja Nandakumar left
[12:09:23] Mark Norman left
[12:09:23] Duncan Rand left
[12:09:24] Stephen Jones left
[12:09:25] Mohammad kashif left
[12:09:26] Sam Skipsey left
[12:09:28] Govind Songara left
[12:09:29] Matthew Doidge left
[12:09:30] Brian Davies left
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS
        REFERENCE LINK: https://ggus.eu/ws/ticket_info.php?ticket=78315
        SUBJECT: CMS software has issues running on EMI WN
      - ATLAS
      - Other
      - T2K
    • 11:20 11:40
      Meetings & updates 20m
      - ROD team update
      - EGI ops
        Thanks to Stuart: fixes for the Storm and DPM Globus libs in EPEL
        problem are on the way, and the problem can be avoided by repo
        priority settings with the UMD versions.

        EMI: forthcoming in update 14 (tentative 15th March)
        - BDII core - reduced disk and memory footprint and other minor things
        - DPM 1.8.3 - Many small enhancements and bug fixes.
        - Hydra (first release)
        - GFAL/lcg_util - Glue 2.0 support and minor enhancements
        - SToRM 1.8.2-2 - repackage with Globus 5.2 libraries.
        - VOMS - fixing - AC query fails when more than two fqans are
          explicitly requested (proxy renewal issues on WMS)
        - Should fix WMS problems with an update (so not an explicit WMS
          update to come for that)
        Some discussion on how the Globus lib update affected things, and
        strategies to prevent similar problems in the future.

        Staged Rollout
        - Some Unicore / IGE under SR.
        - Due soon: dCache 1.9.15, Site BDII 1.1.0, Storm 1.8.2 and more
          Unicore / IGE
        The Storm package is likely to be superseded by the EMI update one.

        Worth noting, about the EPEL Globus lib problem: sites using the UMD
        repositories with the correct priorities with respect to the EPEL
        repo are safe from this update to globus (gridftp in particular).
        Documented in the UMD installation notes:
        http://repository.egi.eu/category/umd_releases/distribution/umd_1/

        EMI-2: Early adopters still sought, initial release schedule
        expected soon.
      - Nagios status
      - Tier-1 update
      - Security update
      - T2 issues
      - General notes.
      - Tickets
        For those sites waiting on the EMI tarball there's a ticket you
        might want to add your voice to:
        https://ggus.eu/tech/ticket_show.php?ticket=74675
        (Thanks to Daniela for this).

        Last week Biomed ticketed a bunch of sites about publishing the
        "Hydra client" on their CEs. Brunel (79499), Birmingham (79505),
        Durham (79503) and Glasgow (79504) haven't replied yet. There seem
        to be good reasons for not simply complying (there's a good thread
        in the RHUL ticket engaging biomed on this:
        https://ggus.eu/ws/ticket_info.php?ticket=79500 ).
        -> ... just Lancaster then?

        NGI:
        https://ggus.eu/ws/ticket_info.php?ticket=79571
        "NGI_UK - SAM configuration for glexec monitoring" - assigned to
        "ops@tier2.hep.manchester.ac.uk"... (maybe because of ops in the
        e-mail address?) This seems to be assigned to the wrong place;
        unless I've misunderstood something, shouldn't it be going to
        Oxford?
        -> Should have been Oxford. Kashif already checks glexec. Ticket
           solved.

        https://ggus.eu/ws/ticket_info.php?ticket=78991
        The ticket on the GridPP Voms Certificate e-mail address field. This
        ticket is solved, so should now be closed. If a ticket is wanted to
        track the CA issues then a new one should be opened.
        -> Now solved

        RAL Tier-1:
        https://ggus.eu/ws/ticket_info.php?ticket=79545
        The lhcb "zombie job" problems have been split off into this ticket.
        Catalin reports that they've made good progress figuring out how to
        spot the zombies (jobs that are IDLE for 24 hours or more) - could
        this mechanism be applicable and useful to other sites/VOs?

        https://ggus.eu/ws/ticket_info.php?ticket=79369
        t2k.org problem probably caused by the gridpp voms certificate
        change. I set it to waiting for reply and poked t2k for a response.
        -> Lancs and Sheffield ticketed for errors in config. This ticket
           was solved.

        https://ggus.eu/ws/ticket_info.php?ticket=74353
        Pheno WMS problems. After a few weeks waiting for reply from pheno
        they've replied with positive news: 'some of the jobs I sent via
        "lcg0682.gridpp.rl.ac.uk" worked'. They then ask if we want a more
        detailed breakdown. I think the answer is "yes!".
        -> Asked for more feedback

        Manchester:
        https://ggus.eu/ws/ticket_info.php?ticket=78835
        Biomed have trouble using the SE due to Manchester's "usable space"
        problems.
        -> No site response.
        Linked to: https://ggus.eu/ws/ticket_info.php?ticket=78776
        Where Manchester are working around the problem by building a second
        SE. Put online this morning, hopefully it will ease their issues.
        Are Manchester planning on adding other VOs to this SE? That would
        ease Biomed's problems.
        -> Issue worked around with ops pool. The underlying problem pointed
           to in the ticket
           https://ggus.eu/tech/ticket_show.php?ticket=54818
           is marked solved! Is anyone in the storage group working with the
           DPM developers?
    • 11:40 11:55
      Transition to EMI-1 and volunteers for EMI-2 15m
    • 11:55 12:10
      Open discussion 15m
      - Perhaps hardware
      - Support for wider VOs
      - Operations problems that need work/investigation
    • 12:10 12:15
      Non-LHC VO disk usage 5m
      - The GridPP4 proposal indicates a 3% allocation of purchased disk for non-LHC VOs
      - There is going to be at least a quarterly need for these figures and possibly monthly
    • 12:15 12:16
      AOB 1m