Minutes: 
Chris=1 Ewan=0 Duncan=1 Alessandra=1 Stuart=1 David=1 Stephen=1 Catalin=1 Rob=1

Present:
-------
Alessandra Forti 
Andrew McNab 
Andrew Washbrook 
Brian Davies 
Christopher Walker 
Daniela Bauer 
David Colling 
David Crooks 
Dan Traynor
Elena Korolkova 
Govind Songara 
John Bland 
John Kelly 
Matthew Doidge 
Mingchao Ma 
Raja Nandakumar 
Raul Lopes 
Rob Fay 
Rob Harper 
Sam Skipsey 
Santanu Das 
Stephen Jones 
Stuart Purdie 

Apologies:
---------
Mark M, Kashif, Catalin

	
ROD team update
---------------

Nagios was affected for a couple of hours on Sunday night due to
site-wide network problems at Oxford's Begbroke site. This is not
related to the problems they are having with the Dell network switches.

Nagios has been updated to the latest release. 

Incident 2 weeks ago where the Asia Pacific ROC tested UK sites and
caused alarms. Should have no impact on availability and reliability
of UK sites.


Tier-1 update: John Kelly
-------------


A couple of disk server failures on Thursday and Friday
(pretty routine), both Atlas.

Garbage collection policies changed to give better performance. 

FTS - Atlas alarm ticket - very high load on SRMs. Stopping a machine
from actively draining reduced that load.


Security: Mingchao Ma
---------------------

Incident ongoing. No GridPP sites have reported issues, though a few
sites including the Tier-1 have reported scans. All info should be in
the e-mails sent to security contacts. Will update all sites if new
developments - including more details of the incident. 

Andy McNab: We have been asked to block an IP. Has this been fed back
to the ISP of the site in question? 

Full details of the compromise are not yet known. It is believed that
a user account was compromised, then used to log in to a development host
which has a root vulnerability. Whilst it isn't known if this was the
vulnerability used, sites should not assume internal systems are only
visible to themselves, and are reminded to keep internal systems patched. 


Tier-2 Issues
--------------

* Publishing
  - QMUL not publishing - but aware of this. 

* NGI 
  - In process of setting up NGI. Created in GOCDB. Will be some
    process of moving sites across to this.


* QMUL: SE issues - QMUL will upgrade to StoRM
1.7 - in the hope it might solve the problem.


* T2K space tokens
  - T2K are using a lot of space compared to other small VOs. 
  - Storage group has recommended T2K move to space tokens so
    that their usage can be accounted for and they do not take
    all the space available to other small VOs. 

  - Imperial don't currently have enough space free to allow them to
    transfer all their data into a space token.

* Proxy delegation at Imperial
  - Daniela has no idea what the ticket is about and said so in the ticket.

* Hone jobs - Brunel
  Waiting for update from EMI Cream CE. 

* Snoplus: Tier-1 - waiting for Catalin to return. 

* Pheno VO: high number of job failures. 
  David Crooks - all pheno jobs going to SARA seem to fail, but dteam
  jobs work. Perhaps ticket should be reassigned to SARA.


Experiment problems and issues
-------------------------------

LHCb - Raja: Running fine, no particular problems. 

Atlas:

Problem: Glasgow and QMUL were not receiving database releases, and
were put offline for quite a while. Solution to avoid these problems
is to move to the latest CVMFS release. This also reduces load on the
storage element. 

Glasgow and Manchester have installed CVMFS 

QMUL: storage problems under load; WAN network bandwidth also saturated.

Oxford: Passing storage problem, presumably due to network break. 

UCL: Software area full 
BHAM: downtime until yesterday. Still testing. 

Squid changes: Cambridge - ticket now closed. 

WMS:
	Steve Lloyd's tests: lots of failures due to RAL and Glasgow WMS
	being overloaded. RAL fixed the number of gridftp slots.

	Manchester: CVMFS upgrade accidentally unset DEFAULT_SE
	environment variable which caused Steve Lloyd's tests to fail.

	Discussion over whether WMS was useful for Atlas. 


Multi cloud operation:
      Roger Jones believed that all T2D sites are registered for multi cloud
      operation. Alessandra didn't think that was the case. 

CMS:  
[11:38:57] David Colling Nothing too much to report 
[11:39:16] David Colling Minor problems at Imperial meaning that we fell below 80% ... bad
[11:39:49] Jeremy Coles What was the underlying problem?
[11:39:54] David Colling On a positive note Bristol moved off 0% readiness! ... V. Good!
[11:40:34] David Colling Tape 0 isk 1 trials - make me nervous - as people know
[11:41:21] David Colling that should be Tape0 colon disk 1 ( not smiley face)


Events affecting job slot requirements:
------------------------------------
Summer conferences in August. 

Site performance and Accounting
-------------------------------

Metrics
-------

PMB: T2 accounting periods not discrete - no plan to stop monitoring metrics
- expectation is that it will be continuous - so don't hold off upgrades 
just because it is an accounting period.


EGI service operations security policy draft
--------------------------------------
https://wiki.egi.eu/wiki/Talk:SPG:Drafts:Operations_Policy 

Main point is on page 6 of the document. 

JC: Major changes are to terminology, and whilst there is no intention
to make substantive changes, sites are advised to check this document
while it is still in draft.


WLCG workshop
-------------
See Ian's summary slides and Jeremy's notes.

Things have gone well in the first year of running. Contention expected
next year. Some efficiency improvements possible at sites. Concern
that EGI and WLCG goals may not align. Concerns over use of EMI 1. How
quickly do we move to SL6 and SL7? Aligning computing models - focus
on improving commonality between different groups and
experiments. Tier-3s: how independent should they be?

Storage and Data: LHCb plan to start doing some reprocessing at
Tier-2s. Chaotic transfer of input files to Tier-2s. Plan is to
transfer data over the WAN direct to WN. Only Manchester involved at
present. Some concern was expressed over WAN link saturation. Raja:
LHCb are currently doing the throttling and plan to continue to do so.

LHCOne: Feeling that we are not as involved as we should be. Mark
Mitchell will help drive the discussion about where we should be
focussing in this area. perfSONAR network.

[12:03:09] David Colling I see this issue, but I think that it is very unlikely
[12:03:55] David Colling We made a policy decision not to be involved
[12:04:02] David Colling this was discussed at the PMB
[12:04:47] David Colling As this would take hardware money
[12:04:55] David Colling so may be monitoring
[12:06:08] David Colling We may end up needing to have a connection to the LHCONE backbone somewhere
[12:06:21] David Colling from Janet


Whole node scheduling
Memory usage 

Middleware: 

Opening talk gave lots of useful facts and figures. 

	Each fill now provides more data than taken in 2010!!!
	Trigger rates increased by experiments, so taking more data.

	Some issues with CREAM - divergence between WLCG, EMI and EGI. 

Pileup becoming a problem.  

Move from MONARC model to equal based architecture. 

How are batch systems holding up? Some concerns about whether
Torque/Maui are holding up. Some interest in SLURM, as it is a rewrite
from the ground up.

Users want more grid stability. General failure around 10%. Lots of
those failures in UK are IO failures.

FTS - monitoring and FTS3. 

CMS to replace jobrobot with their version of hammercloud. 

Cloud usage: Some concern about sending proxies to commercial clouds. 

Lots of storage talks mentioning http support. 

Moving away from the strict MONARC model.

Talk about injecting pilot jobs directly into batch system rather than
using grid submission (very ALICE like says Dave Colling).

SL5/SL6/SL7. What hardware support?

Dell discussion about new hardware that is coming out. Isn't clear
that HEPSPEC will accurately represent performance of HEP jobs on the
next generation of hardware.

CVMFS: Sites invited to install it - it solves many of the NFS
problems. On lxplus and lxbatch since autumn 2010.
       http://northgrid-tech.blogspot.com/2011/07/cvmfs-installation.html
       [12:28:19] Alessandra Forti for who's interested

       See also http://hepwww.rl.ac.uk/sysman/Nov2010/agenda.html 

AOB
---

Tier-2 reports - hopefully everything in this week. 

GridPP 27 meeting open for registration. The earlier you book the
better. Please be as economical as possible.

Let Jeremy know if you have any topics that need to be discussed in
the PMB/ops meeting.

EGI technical forum starting 19 September
https://www.egi.eu/indico/conferenceTimeTable.py?confId=452#all - the
week after the GridPP meeting.


Chat window
-----------
[11:01:28] Jeremy Coles Chris is taking minutes today.
[11:05:11] Mingchao Ma joined
[11:05:11] Elena Korolkova joined
[11:05:11] Sam Skipsey joined
[11:05:12] Alessandra Forti joined
[11:05:12] Stephen Jones joined
[11:05:12] Brian Davies joined
[11:05:12] Raja Nandakumar joined
[11:05:12] Stuart Purdie joined
[11:05:13] Rob Harper joined
[11:05:14] Andrew McNab joined
[11:05:14] Rob Fay joined
[11:05:14] Daniela Bauer joined
[11:05:16] John Kelly joined
[11:05:17] John Bland joined
[11:05:19] Matthew Doidge joined
[11:08:11] Christopher Walker joined
[11:08:40] RECORDING Christopher joined
[11:13:47] Stuart Purdie left
[11:14:25] Stuart Purdie joined
[11:18:23] Govind Songara joined
[11:20:28] Mingchao Ma CVE-2010-3847
[11:20:42] Mingchao Ma https://wiki.egi.eu/wiki/EGI_CSIRT:Alerts/liblinker-2010-10-18
[11:24:24] Santanu Das joined
[11:26:28] David Colling joined
[11:29:39] Elena Korolkova We have T2K space token in Sheffield.
[11:29:56] Elena Korolkova It was tested by T2K. It works
[11:29:57] Andrew Washbrook joined
[11:31:55] Elena Korolkova I think you can put 2 TB in spacetoken and then it's up to T2K to move data to the spacetoken
[11:32:33] David Colling sorry my microphone is not working
[11:32:41] David Colling why do they need space tokens
[11:32:42] David Colling ?
[11:33:00] Daniela Bauer So they don't suck up all the space they can get ....
[11:33:13] David Colling But they don't actually have that much data - surely?
[11:33:54] David Colling What is their request size at Resource Board?
[11:34:09] David Colling This is set by Glenn's 
[11:37:53] Jeremy Coles I will check.
[11:38:49] David Colling Sorry ... 
[11:38:57] David Colling Nothing too much to report 
[11:39:16] David Colling Minor problems at Imperial meaning that we fell below 80% ... bad
[11:39:49] Jeremy Coles What was the underlying problem?
[11:39:54] David Colling On a positive note Bristol moved off 0% readiness! ... V. Good!
[11:40:27] Andrew Washbrook left
[11:40:34] David Colling Tape 0 isk 1 trials - make me nervous - as people know
[11:41:21] David Colling that should be Tape0 colon disk 1 ( not smiley face)
[11:47:34] David Colling So does CMS !
[11:49:06] David Colling That is what Roger said quite clearly that this was the case
[11:49:33] David Colling He said this in Hamburg as well in another conversation
[11:49:49] David Colling Bristol!
[11:51:17] David Colling The next start date is the first day after the end of the current stop date
[11:54:18] Andrew McNab left
[11:59:00] Andrew McNab joined
[12:03:09] David Colling I see this issue, but I think that it is very unlikely
[12:03:55] David Colling We made a policy decision not to be involved
[12:04:02] David Colling this was discussed at the PMB
[12:04:47] David Colling As this would take hardware money
[12:04:55] David Colling so may be monitoring
[12:05:09] David Colling exactly
[12:06:08] David Colling We may end up needing to have a connection to the LHCONE backbone somewhere
[12:06:21] David Colling from Janet
[12:09:45] Raja Nandakumar Apologies - got to go.
[12:11:00] David Colling The comment about reserving machines for specific VOs was to do with whole node scheduling and somebody claimed that they were the same thing
[12:11:11] Raja Nandakumar left
[12:11:23] Jeremy Coles https://computing.llnl.gov/linux/slurm/
[12:17:22] David Colling Very ALICE like
[12:18:11] David Colling Certainly CMS will not!
[12:18:16] David Colling Certainly CMS will not!
[12:18:38] David Colling T3s yes
[12:19:13] David Colling Centos 6 is out (couple of weeks ago) and we are looking to move 
[12:25:07] Christopher Walker Can someone buy Alessandra a headset!!!
[12:26:35] Alessandra Forti  
[12:27:08] Alessandra Forti I had a loud speaker but suddenly it isn't working anymore. according to my mac it sucks too much power.
[12:28:10] Alessandra Forti http://northgrid-tech.blogspot.com/2011/07/cvmfs-installation.html
[12:28:19] Alessandra Forti for who's interested
[12:29:16] Jeremy Coles On monitoring overall (and the Site Status Board) https://indico.desy.de/materialDisplay.py?contribId=37&sessionId=3&materialId=slides&confId=4019
[12:29:25] Govind Songara RHUL having segfault with dpm 1.8.1
[12:30:01] David Colling I know that we are wrapping up but I have to go. Byee
[12:30:09] David Colling left
[12:31:39] Stephen Jones Do we get 25 pence per mile?
[12:31:56] Sam Skipsey Only if you cycle it, Stephen.
[12:32:19] Mingchao Ma https://www.egi.eu/indico/conferenceTimeTable.py?confId=452#all