GridPP Operations team (& sites) meeting 27/03/12

In attendance: Jeremy Coles (Chair), Catalin Condurache, Pete Gronbech, Sam Skipsey, Mark Norman, Rob Fay, Matthew Doidge, Brian Davies, Stephen Jones, Elena Korolkova, Santanu Das, Raul Lopes, Govind Songara, Duncan Rand, Wahid Bhimji, Rob Harper, Stuart Wakefield, Mark Slater, Ewan Mac Mahon, Gareth Smith

11:00         
Experiment problems/issues (20')     

Review of weekly issues by experiment/VO

- LHCb

Issues : 3 sites out of LHCb mask

Durham : https://ggus.eu/ws/ticket_info.php?ticket=79880
Glasgow : Does not seem to be starting LHCb SAM jobs. Some SAM jobs have been waiting for many days now.
Manchester : In downtime

Sam: network maintenance at Glasgow

- CMS
Stuart: nothing to report

- ATLAS

-- Manchester and Glasgow are in downtime for network upgrade.
-- UCL has solved its storage problem and the site was put back online.
-- Liverpool had a power failure over the weekend but it is now solved.
-- A few columns were added to the SSB, including the release status and the new PFT status, although the latter is not working correctly yet.

Brian: ATLAS might change plans for group disks at T2s (medium-term change on space tokens - Alastair working on it)

- ALICE
Pete: good use of farm at Oxford


- Non-LHC VOs


 11:20         
Meetings & updates (20')     

- ROD team update

- EGI ops

- Nagios status

- Tier-1 update

Last Tuesday (20th March) the disk array in the Oracle RACs hosting the Castor databases was replaced. The change was completed the following day. Since then the Castor databases, with the two systems kept in sync by Oracle Data Guard, have been working well.

On Wednesday (21st) changes were made to address some of the problems with FTS 2.2.8. Patches were applied both to the Oracle database and to the FTS code itself. So far so good; monitoring of the FTS has been re-enabled.

On Friday afternoon a cable on one of our key links became disconnected and a site outage was declared. The problem was found and fixed within 20 minutes, and systems recovered well.

Around 400TB of disk was deployed to LHCb yesterday.

- Security update

- T2 issues

- General notes.

Grid Engine sites should consider supporting the GESS: https://forge.in2p3.fr/projects/gess/wiki

- Tickets

Only 20 open tickets this week.

NGI:
https://ggus.eu/ws/ticket_info.php?ticket=80535
Chris ticketed the Manchester-based VOMS server admins to "fix" his sno+ membership on the VOMS webserver. Robert Frank applied a fix to Chris's cert and also sent a long post explaining the situation to TB-SUPPORT. Are things better now?

https://ggus.eu/ws/ticket_info.php?ticket=80259
Tracking the creation of the new neuroscience related VO. Do we have a name yet?

Mark: not sure about the name; will follow up

Oxford:
https://ggus.eu/ws/ticket_info.php?ticket=80554
Sno+ also need the bzip2-devel packages installed on WNs. They're having trouble updating their VO card with this information.

Glasgow:
https://ggus.eu/ws/ticket_info.php?ticket=80371
Following on from last week, Sno+ requested support on the Glasgow WMS. Glasgow are working on it, but things are currently "on fire". Sno+ support will be enabled once they're out of the current downtime.

Durham:
https://ggus.eu/ws/ticket_info.php?ticket=80407
enmr.eu users are reporting "disk full"-style errors on their jobs. As a thought, could these errors actually be on the WMS?

Sam: will follow

https://ggus.eu/ws/ticket_info.php?ticket=79880
LHCb were seeing Maradona errors. Do the problems persist? Durham is still banned on the LHCb production page.

UCL:
https://ggus.eu/ws/ticket_info.php?ticket=80331
UCL are having DPM problems affecting ATLAS transfers. Apparently the DPM service had died (similar to problems Lancaster & RHUL have been seeing over the last few weeks, discussed on the storage lists). Request from Ben for information about which pages to use to monitor the DDM status of his site.

Tier-1:
https://ggus.eu/ws/ticket_info.php?ticket=80119
Sno+ having trouble building ROOT at RAL. g++ versions seem the same at RAL as they are in Sussex, where ROOT builds fine. Has anyone seen a similar problem getting ROOT to build before?


Solved Cases from the last week:

A few tickets from hone (80572, 80482 & 80483, at RHUL, Birmingham & RALPP), all
attributed to hone being "drowned out" by LHC jobs and having comparatively
lower priorities.

https://ggus.eu/ws/ticket_info.php?ticket=80538
Lancaster got bitten by a weird DPM problem. If you ever notice your DPM
daemon has died, check dmesg & the syslog for DPM segfault error messages.
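A quick check along those lines can be scripted. This is only a sketch: the sample log line is made up for illustration, and the /var/log/messages path is a RHEL-style assumption.

```shell
# Sketch of the check described above: look for DPM segfault messages.
# On a real SE one would run something like:
#   dmesg | grep -i 'dpm.*segfault'
#   grep -i 'dpm.*segfault' /var/log/messages   # path is RHEL-style, an assumption
# Demonstrated here against an illustrative (made-up) log line:
sample='Mar 23 04:12:01 se01 kernel: dpm[12345]: segfault at 0 ip 00002b5f sp 00007fff error 4'
echo "$sample" | grep -ci 'dpm.*segfault'
```

If the count is non-zero, restarting the daemon and keeping the matching lines for the ticket is probably the next step.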

 11:40         
March GDB (15')     

- The meeting took place last Wednesday: https://indico.cern.ch/conferenceDisplay.py?confId=155066.

GridPP report will appear here: https://www.gridpp.ac.uk/wiki/GDB_reports.

Main topics:

Election of new GDB chair

- Michel Jouvin won the vote and the handover takes place in April.

 John's update

- Concluding with a few critical questions for WLCG to consider:

•          Can the computing infrastructure continue to scale with LHC needs?
–         Increasing experiment requirements
–         Reducing finances

•          Where will the middleware come from?
- As things stand, EGI finishes next year. There is an increasing focus on clouds.

A demonstration of Vidyo
- Some of the additional (expected) functionality that people want will arrive within the next 1-2 months, for example chat alongside the video.

Modernisation of the CERN Computing Infrastructure

- Looking increasingly at cloud options
- Monitoring remains a focus (Lemon remains) but exception and performance data are passed into Hadoop for data mining
- Also looking at open source and commercial options (partly because of the remote Tier-0). The remote Tier-0 will be in Budapest (Wigner Institute)
- Cloud options discussed, including Helix Nebula

GESS collaboration
- Encouraging collaboration of Grid Engine sites at the operation/management level in order to tighten requirements and influence developers. Being led by IN2P3. GE sites should join!

Operations updates:
ALICE
CMS

- Multicore provides a 20% memory gain compared to single-core jobs
-- Asynchronous merging very much reduced
-- Number of processing jobs very much reduced
- Dedicated queues at Tier-1 sites used for initial tests
-- Tier-1 sites would prefer not to move parts of their resources to multi-core usage
- Dynamic multi-core slots at Purdue are working and simple to use
-- ~5k jobs run with about 70 jobs in parallel (70x8 cores!)
-- Preferred solution for Tier-1 sites using multi-core jobs, but still questions about accounting (for example when draining a node to get enough cores for one job)
- T1_DE_KIT will very soon provide similar queues with 4 and 8 cores available per slot
- WLCG TEG recommendation: number of cores configurable at job submission, with sites providing dynamic access to multi-core slots

LHCb
- Critical on disk space (Tier-1s mainly)
- CVMFS has improved software distribution dramatically... roll-out to Tier-2s will start.

ATLAS

- Data taking is now starting
-- Trigger rate up to 400 Hz + 125 Hz (delayed stream)
- ATLAS is efficiently using all available resources
- Ongoing consolidation of tools to automate/improve operations (cvmfs, switcher)
- Better publication of site usability for ATLAS
- Learned a lot from last year

EMI news

- First EMI WN/UI tarball release: Beta testing GGUS #74675
- Link on webpage.
    


WLCG recommendations

- First sites supporting WLCG experiments moved to emi-1
- This page has been updated: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions.

- The new Globus version in EPEL breaks SEs that use some GridFTP APIs (a non-backward-compatible change) - affects StoRM, DPM etc.
-- Has been solved: https://ggus.eu/tech/ticket_show.php?ticket=79541
-- EMI and Globus have produced updated releases
- This highlights a more general problem with EPEL: components can change with little warning
- Recommendation is not to use auto-update!
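One common way to follow that recommendation (a sketch; the package glob is illustrative and should be matched to your SE's actual dependencies) is to exclude the Globus packages from yum updates in the EPEL repo definition:

```ini
# /etc/yum.repos.d/epel.repo (fragment)
[epel]
name=Extra Packages for Enterprise Linux
enabled=1
# Hold back Globus packages until the SE stack is confirmed to work
# with them (glob is illustrative; adjust for your storage element):
exclude=globus-*
```

Updates can then be applied deliberately, e.g. `yum --disableexcludes=epel update 'globus-*'`, once a compatible release is confirmed.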

The next meeting is on 18th April (clashes with GridPP28).

 11:55         
AOB (5')

Jeremy: Any issues with HW acceptance?
Ewan: Latest EMI DPM (+EPEL) release is fixed.

Chat transcript:

[10:56:42] RECORDING Catalin joined
[10:57:19] Catalin Condurache I'll be taking minutes
[10:57:46] Pete Gronbech joined
[10:58:37] Sam Skipsey joined
[10:59:19] Jeremy Coles joined
[10:59:45] Mark Norman joined
[10:59:45] Rob Fay joined
[10:59:55] Jeremy Coles Thanks Catalin.
[11:00:06] Matthew Doidge joined
[11:00:33] Stephen Jones joined
[11:00:41] Brian Davies joined
[11:00:55] Stephen Jones left
[11:01:10] Santanu Das left
[11:01:25] Elena Korolkova joined
[11:02:03] Santanu Das joined
[11:02:20] raul lopes joined
[11:02:47] Govind Songara,  joined
[11:03:38] Duncan Rand joined
[11:03:46] Wahid Bhimji joined
[11:03:53] Rob Harper joined
[11:07:49] Stuart Wakefield joined
[11:07:51] Mark Slater joined
[11:08:19] Ewan Mac Mahon joined
[11:11:15] Mark Slater not from me!
[11:14:01] Gareth Smith joined
[11:17:01] Ewan Mac Mahon Sussex
[11:32:15] Rob Harper Can't hear anything
[11:32:15] Pete Gronbech no sound
[11:32:16] Mark Slater Have we just lost Jeremy? or is it just me?
[11:32:19] Sam Skipsey No sound at all.
[11:32:22] Ewan Mac Mahon Not just you.
[11:32:29] Elena Korolkova indeed
[11:32:33] Brian Davies jeremy you have gone silent
[11:34:01] Catalin Condurache vidyo
[11:38:27] Wahid Bhimji Bristol
[11:40:49] Mark Slater Note that at Bham, we won't need tarball after a month or so
[11:41:32] Jeremy Coles https://twiki.cern.ch/twiki/bin/view/LCG/WLC GBaselineVersions
[11:41:54] Jeremy Coles https://twiki.cern.ch/twiki/bin/view/LCG/WLC GBaselineVersions
[11:42:15] Jeremy Coles https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
[11:46:13] Ewan Mac Mahon I think the storage group consensus was to use EMI. And that's certainly what I'm doing for DPM.
[11:46:21] Rob Harper fire alarm, so going
[11:46:30] Mark Slater I was going for UMD but can change....
[11:46:35] Stephen Jones I've been fixing on using EMI, but now I'm conflicted!
[11:47:24] Wahid Bhimji quite
[11:47:34] Stephen Jones It's giving me cognitive dissonance
[11:47:45] Ewan Mac Mahon There's something to be said in principle for not all picking the same one.
[11:48:50] Stephen Jones Yes, but there's the group-think-herd-instinct thingie, as well
[11:49:27] Ewan Mac Mahon You can always have a mixture too, the gLite, EMI and UMD DPM releases all interoperate.
[11:50:36] Ewan Mac Mahon For bonus insanity points, have several head nodes with MySQL database replication between them.
[11:50:50] Ewan Mac Mahon (don't actually do that)
[11:54:50] Sam Skipsey (I actually was considering doing that at one point   )
[11:55:52] Stuart Wakefield left
[11:56:25] Mark Slater left
[11:56:27] raul lopes left
[11:56:27] Brian Davies left
[11:56:30] Elena Korolkova left
[11:56:32] Gareth Smith left
[11:56:32] Govind Songara left
[11:56:33] Sam Skipsey left
[11:56:35] Matthew Doidge left
[11:56:36] Ewan Mac Mahon left
[11:56:38] Wahid Bhimji left
[11:56:39] Duncan Rand left
[11:56:41] Mark Norman left