Attending: Chris Brew, Andrew McNab, Raul Lopes, Dan Traynor, David Crooks, Elena Korolkova, Ewan Steele, Gang Qin, Ian Collier, John Bland, Gareth Smith, John Hill, Mark Mitchell, Daniela Bauer, Matt Raso-Barnett, Mohammad Kashif, Rob Fay, Robert Frank, Sam Skipsey, Steve Jones, Wahid Bhimji, Ewan MacMahon, Andrew Lahiff (over phone bridge), Matt Williams, Govind Songara, Pete Gronbech.
Chair: Jeremy
Minutes: Matt
Apologies: Chris W, Raja, Alessandra

Experiment problems/issues (20')

- LHCb
Nothing from Raja.
-- Update on ARC-DIRAC issues: Andrew has tweaked some of the job environment setup at the Tier 1; it should work for LHCb now.

- CMS
Daniela - Fermilab can't handle SHA-2 certificates, so users are advised to stick with SHA-1. Imperial is having problems with jobs being held; it looks to be a CMS problem (and they are almost admitting to it).
Jeremy - The Tier 1 CMS share is moving back to 5% for analysis work. On SHA-2, most issues seem to be resolved, with a few remaining problems at CERN. France has already moved.

- ATLAS
Elena - Not much to report; production work is reduced because large output files are filling the Tier 1 datadisks. ATLAS experts are working on it.

- Other
-- ILC is moving to CVMFS. Please see https://ggus.eu/ws/ticket_info.php?ticket=101502

Meetings & updates (20')
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest (Monday 24th February)

There is a test GridPP website for SHA-2. Jeremy - it might move mid-March; Jens has been doing testing.
The final WLCG Tier-2 availability/reliability reports for January 2014 are available.
Alessandra noted an FR cloud report on January's VO test results; the suggestion was to do something similar for UK sites.
We need to revisit our plans for RIPE Atlas probes.
- Likely to be scaled down from our previous grand plans for dozens of probes to maybe one at each site.
Janet is moving away from SeeVogh/EVO; support ends in August. Our meetings will migrate to Vidyo.
- Testing over the coming weeks; people are encouraged to try it out.

- Tier-1 status
From Gareth:
There were problems with the FTS3 service last Tuesday when difficulties were encountered moving the VMs around. Since then the service has run successfully and is being used for an extensive test by ATLAS and CMS.
- One VM was lost entirely in the process - under investigation.
The software server used by the small VOs will be withdrawn from service (aiming for June).
- Contacting the VOs and surveying the use cases still to be done.
A replacement MyProxy server is being put into production (to resolve the MyProxy issues raised in GGUS ticket 97025). VOs will need to make the appropriate reconfigurations to use it.
It is most likely that the Tier 1, along with some other non-GridPP services at RAL, will move to the new site firewall on Monday 17th March, and there may be some disruption around this change.
We do not have a date yet for the other significant network change: the installation of the new routing layer and changes to the way the Tier 1 connects to the RAL network.
- Date to be confirmed. The break should be short, but the operation is non-trivial. It will be declared in the GOCDB.
- Jeremy - A lot of traffic bypasses the firewall?
- Gareth - Yes, but there will be a drop in outside connectivity when the new firewall goes in (although everything will stay up internally). There is a risk of firewall rules being set incorrectly, so there will be an at-risk period for some time afterwards. The new firewall is a different make, so transferring the rules is non-trivial, but it has good debugging tools.
Jeremy - Expected throughput of the new firewall?
Gareth - Will find out. Data flows avoid the firewall, but control flows don't.

- Accounting
No updates.

- Documentation
Keydocs owners need to take some action! Just under half need updating - some just need a header fixing.
Naughty step: Mark, Pete, Jens, Rob, Alessandra, Wahid, David, Raul, Matt.
Jeremy reminds us of the percentage-complete column, which can be used to gauge accuracy if time is too constrained to do a "thorough" job.
- May need to review which key docs really are key docs.

- Interoperation [from David]
Meeting agenda: https://wiki.egi.eu/wiki/Agenda-24-02-2014
URT news: ARC, WMS, SAM probes. UMD 3.5 was released last week: StoRM 1.11.3 and other updates for OpenSSL.
SR: IGE.globus-rls v5.2.5, no EA.
DMSU: WMS-ARGUS connection errors, GGUS ticket https://ggus.eu/ws/ticket_info.php?ticket=101486
EMI-2 decommissioning: sites will start to receive alarms on Monday 3rd March; probes are deployed in midmon and ready to be checked by 27-02-2014; interest is requested for an EMI-3 tarball "informal" SR.
GLUE 2 validation: possible timeline is a broadcast to ROD and sites on 3rd March, the probe set OPERATIONAL on 10th March, and sites then having a further two weeks to fix their site BDII before receiving alarms.
- David advises reading the agenda for more information.
- EMI-3 WN tarball staged rollout to start next month.

- Monitoring
Next meeting of the consolidation group is this Friday; the agenda looks at HammerCloud functional tests.
- Looking at folding HammerCloud into site availability monitoring.

- On-duty
- The rota needs updating; Jeremy is looking into it.

- Rollout
Will start ticketing sites not meeting the baselines at some point in the near future.

- Security
Let Orlin know if you wish to try connections with the NGI ARGUS server. Tested working for EM, including national banning. There are some setup docs: http://wiki.nikhef.nl/grid/Argus_Global_Banning_Setup_Overview#NGI_Argus
Steve - I will review the documentation.
Ian C reminds us we don't really have a choice: universal banning is something all sites have to be able to implement somehow.
Ewan - There are alternatives to ARGUS. For those wanting to test the central ARGUS, Ewan has a special banned DN which he can use to test sites. Contact him if you want to try it out.

- Services
Reminder: perfSONAR is a production service!

- Tickets
- ILC-supporting sites (most of the UK) need to review the instructions to implement the ILC CVMFS and roll it out (or stop supporting ILC). The best way to track this is in the ticket itself: https://ggus.eu/ws/ticket_info.php?ticket=101502 The status will be reviewed in next week's meeting, after which sites may be ticketed individually.
- Durham asked for some help with their perfSONAR problems and were encouraged to post details to TB-SUPPORT.
- There was a sideline debate about the relative priority perfSONAR should be given, and even whether it should be classed as a production service. Matt: "Not having a working perfSONAR is kind of a black mark against your site." Wahid: "Surely it should be more of a tiny smudge." See the chat window for more exchanges.

- VOs
WMSs are now updated, so upgrading OpenSSL can go ahead if you haven't already.

Housekeeping! (20')
- Check of updates in different areas:

HEPSPEC06 - https://www.gridpp.ac.uk/wiki/HEPSPEC06
Jeremy is looking through the quarterly reports for this. Nothing for UCL. Durham has had trouble running HEPSPEC06 (having not done it before). ECDF only have SL5 and have jobs running to test HEPSPEC06. Sussex isn't actually on the page. JET is missing completely.
RALPP is onto it; it used to be Rob's job. Tier 1 - the work has been done, the page just needs updating.

perfSONAR - http://netmon02.grid.hep.ph.ic.ac.uk:8080/maddash-webui/index.cgi
Edinburgh - tried an upgrade; the backup failed, but Wahid wanted a backup. Ewan M reiterates that you really don't need a backup: Duncan's mesh configs do all the work for you. Nuke the box from orbit, then configure it from Duncan's mesh. Wahid asks which instructions to use; Jeremy will circulate them.
Sheffield - next week.
Brunel - almost done.
RALPP - when we have time.

IPv6 status - https://www.gridpp.ac.uk/wiki/IPv6_site_status
The table is looking pretty complete. Imperial needs to update - IC can actually accept IPv6 jobs now.

ARGUS deployment - https://www.gridpp.ac.uk/wiki/ARGUS_deployment
Birmingham TBC, as well as EFDA-JET and Sussex. Matt RB - Sussex is done, just needs the table updating.

Batch systems - https://www.gridpp.ac.uk/wiki/Batch_system_status
Quite a few holes in the table; they need to be filled in over the coming days. No comments.
Jeremy - Reminder of the multicore taskforce, meeting Tuesday afternoons.

Resilient VOMS
Leave this until next week now.

- 12:55 Further checks (10')

Sites with EMI-2 services (services to be decommissioned by April) - http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html
A number of sites are still on EMI-2 APEL.
Tier 1 - reasonably confident it will have migrated away from EMI-2 in time.
Brunel - EMI-2 free.
Imperial - APEL and a couple of CREAM CEs.
Liverpool - EMI-2 free.
QMUL - no EMI-2 here either.
Cambridge - ditto.
RHUL - BDII and APEL still to go.
Lancaster - need to make the EMI-3 tarball, which is coming along well; it will be tested at Lancaster first. The new VOMS tools are the biggest pain. Also asked how the DPM EMI generation is worked out - only YAIM comes from the EMI repositories, so check /etc/emi-version on the head node (a short sketch of such a check follows the chat log below).
Oxford - no EMI-2 in production; will need to check the test boxes.
Manchester - APEL, BDII and VOMS need upgrading.
Durham - have a few services left to go.
ECDF - waiting on the tarball; need to update a CE.
Sussex - ARGUS, CREAM, APEL and BDII need upgrading.
Glasgow - have a few things left, working on them.
Birmingham and Bristol need to be checked, but Ewan M reckons Bristol is looking good.
RALPP - BDII and ARGUS to go.

Sites below baseline (WLCG will start monitoring in the coming months) - https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
Sites are asked to review against this table. Jeremy - maybe we can improve our monitoring on this and automate the checking?

Sites with GLUE 2 issues
No time to go through it; there is a GLUE 2 validator and sites are encouraged to look at it. The probe becomes operational on 10th March, and sites with problems will then have two weeks to fix them. Daniela notes that IC has an issue by virtue of running an ARC CE, but it is a problem with the CE middleware itself that hasn't been addressed.

13:05 AOB (1')
No AOB.

Chat Window:
[11:01:20] Jeremy Coles Matt is taking minutes today.
[11:09:27] Steve Jones Instruction in resource table of approved VOs
[11:09:35] Steve Jones for CVMFS for ILC
[11:10:50] Steve Jones https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements
[11:22:31] Daniela Bauer mea culpa, will do it right now ...
[11:23:10] Mark Mitchell Core Services was my bad hadn't updated the date on the doc, sorry.
[11:23:17] Mark Mitchell Had added to it
[11:26:00] David Crooks Sorry if I was a bit noisy, we have a cleaning lorry right outside our window
[11:26:24] Jeremy Coles It was fine for me David.
[11:32:07] Elena Korolkova which ticket?
[11:32:25] Daniela Bauer we've already installed it (cvmfs for ilc)
[11:32:36] John Hill So have we
[11:34:45] Ewan Mac Mahon @Elena - this one: https://ggus.eu/ws/ticket_info.php?ticket=101502
[11:35:58] Chris Brew can't talk for some reason
[11:36:44] Chris Brew Only issue on the LHCb ticket was it's effectively for a new service just coming in as an urgent GGUS ticket with no warning/discussion with us
[11:37:22] Matt Raso-Barnett i've marked the sussex perfsonar ticket as solved now
[11:37:39] Wahid Bhimji "production" service - actually needed for any jobs or workflows?
[11:37:54] Ewan Steele anyone got any bright ideas for fixing mine at durham?
[11:39:39] Ewan Mac Mahon It's a production service as much as a CE is. One offers network monitoring, one runs jobs, both are things you're supposed to be doing.
[11:39:43] Matt Doidge None I'm afraid Ewan. Anyone else? https://ggus.eu/ws/ticket_info.php?ticket=100968
[11:41:19] Ewan Mac Mahon Clearly the CE is offering a more valuable service, but a site with a broken perfsonar is a site with one of its services down.
[11:41:35] Ewan Mac Mahon It's not fully working at that point.
[11:42:47] Sam Skipsey but a *production* service is one which is necessary for *production*, surely?
[11:43:02] Sam Skipsey (I agree that Perfsonar is a service.)
[11:43:22] Ewan Mac Mahon There's also a chicken/egg issue here that we can't use perfsonar to test the network if most of the failures are down to badly configured endpoints.
[11:44:09] Ewan Mac Mahon Or even just enough of the failures are that no-one can have confidence in the results.
[11:44:12] Ian Collier And the network is necessary. We'd better have a way of monitoring that. Perfsonar is the chosen mechanism.
[11:45:13] Wahid Bhimji As usual with this thing - any new supposedly "production" services to the list just distracts short manpower sites from actually making jobs work
[11:46:02] Sam Skipsey You haven't demonstrated that perfsonar is a production service, Ian, just that we need a working network. Which is not a function of perfsonar working.
[11:46:14] Wahid Bhimji if physics is being done then site is working.
[11:46:50] Ian Collier But it is the mechanism that the collaboration has chosen to monitor the network. That can only be done effectively if it is treated as a production service itself.
[11:47:18] Matt Raso-Barnett Sussex isn't actually on the HEPSPEC page -- I'll try to get updated figures for us on that page this week
[11:47:55] Sam Skipsey Of course, perfsonar doesn't really monitor "the network" - it does latency and bandwidth tests that give point-to-point measures of transfers. Strictly, a proper network monitor would look more like the RIPE probes that Ewan keeps talking about.
[11:48:28] Mark Mitchell However, we do need a reporting mechanism for network connectivity per tier-2. I wouldn't have said that this is a production service in a specific UK instance. It is a service which we need to monitor it. Ah the joys of semantics
[11:48:53] Sam Skipsey (In any case, I do accept the pragmatic point that the Collaboration has decided to declare Perfsonar a Production Service.)
[11:49:31] Ewan Mac Mahon Point-to-point measurements between all the points we care about monitors the network as we see it though, which on one level is what we directly care about.
[11:50:00] Ewan Mac Mahon And on a practical level, perfsonar nodes are really simple - you install off the disk image, point at Duncan's config files, and you're done.
[11:50:22] Ewan Mac Mahon And then you pretty much forget about it.
[11:50:24] Wahid Bhimji everything is supposedly "really simple" but it adds up
[11:50:37] Sam Skipsey Except when they break, Ewan, which apparently keeps happening for more than one site?
[11:50:39] Wahid Bhimji and I'm not aware of a real experiment issue fixed by perfsonar
[11:50:45] Wahid Bhimji e.g. our T2D issues
[11:51:09] Elena Korolkova I plan to do this first week of March
[11:51:18] Ewan Mac Mahon No-one's going to be able to fix any _real_ issues until we have a reliable test bed.
[11:51:50] Sam Skipsey So, the only *real* network issue we had over the last year was caused by a deep routing issue that perfsonar would never (and didn't) detect.
[11:52:26] Mark Mitchell Which was picked up by Chris as he was looking at an increase in latency with file transfers if I remember correctly.
[11:53:03] Mark Mitchell The interesting outcome of that was that the escalation process between CERN and GEANT was manual at that point.
[11:54:27] Raul Lopes Brunel pretty much done
[11:54:56] Mark Mitchell This may change with the RIPE probes but what is evident is that a carrier configuration issue went undetected
[11:55:04] Jeremy Coles https://www.gridpp.ac.uk/wiki/IPv6_site_status
[11:56:50] Jeremy Coles https://www.gridpp.ac.uk/wiki/ARGUS_deployment
[11:56:57] Ewan Mac Mahon perfSonar does do both routing and latency measurements. With a fully working perfSonar setup QMUL might have been able to diagnose their issue by noting that the route between the endpoints had changed, and changed to a silly route.
[11:57:36] Ewan Mac Mahon But that would require everyone to have their perfSonar boxes working, and available to the internet.
[11:58:18] Mark Mitchell I agree, the deployment is vital to this as the route change caused an increase which on the surface was minor, unless you were in the north west of Europe
[11:58:19] Wahid Bhimji yeah with all these network problems that's the reply - that if perfsonar was made up to the point it was useful then it would be useful
[11:58:31] Wahid Bhimji but it also has to be maintained
[11:58:41] Wahid Bhimji and you have to know that it's being maintained
[11:59:54] Wahid Bhimji etc...etc. no grid system should rely on the sites' setup and configuration that much.
[12:03:50] Elena Korolkova Was the deadline for moving to EMI3 changed to end of May?
[12:03:59] Daniela Bauer Imperial has a couple of EMI2 creamce + APEL, but will update before the deadline
[12:04:22] Mark Mitchell Also, the Perfsonar box doesn't really fix our network user status other than being able to supply a lot of information to JANET. Which is one of its advantages, I suppose.
[12:05:03] David Crooks Elena: So the end of support is end of April, but effectively there's an additional month before the decommissioning deadline.
[12:05:17] Raul Lopes No EMI-2 here
[12:05:43] Steve Jones Liverpool: No EMI2 here, either.
[12:05:56] Steve Jones Mostly EMI3 (some UMD3)
[12:05:58] Daniel Traynor qmul - no emi2
[12:06:18] John Hill Cambridge: no EMI-2
[12:06:27] Govind Songara May be apel and site bdii
[12:06:52] Chris Brew site bdii, apel (to be retired) CreamCEs (to be retired)
[12:07:13] Wahid Bhimji what is the difference (for the WN?)
[12:07:41] Elena Korolkova Sheffield: ce's, apel and bdii will be moved to VM and EMI3
[12:07:47] Wahid Bhimji um oh
[12:08:14] Elena Korolkova in March, I don't see a problem here
[12:08:31] Wahid Bhimji test it at Lancs first !
[12:08:49] Ewan Mac Mahon And for Oxford we think we're EMI3 everywhere, but we need to check some of the oddball test boxes just to be sure. So, nothing important is EMI2, probably nothing at all.
[12:12:07] Wahid Bhimji I have to go in a min. I can imagine ECDF might have EMI2 (the storage is EMI3) - certainly the CE is oldish which has to wait for a new (sl6) CE
[12:12:33] Chris Brew and argus
[12:12:48] Matt Raso-Barnett Sussex: argus, cream, apel, bdii all are still EMI2
[12:13:25] Ewan Mac Mahon I think Bristol are in pretty good shape.
[12:14:32] Jeremy Coles https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
[12:15:35] Daniela Bauer It's not doing anything special right now
[12:17:10] Matt Raso-Barnett i'm not familiar
[12:17:14] Wahid Bhimji well obviously I am unfamiliar with it
[12:17:47] Daniela Bauer we have an issue due to having an ARC-CE
[12:17:59] Daniela Bauer but this is not fixed in EMI3, so I don't consider it my problem
[12:18:32] Wahid Bhimji bye
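
Note on the /etc/emi-version check mentioned under the EMI-2 review: the sketch below simply reads that file to report a node's EMI generation, as a starting point for sites auditing which services still need upgrading. It is a minimal, illustrative example - the file path /etc/emi-version is the only detail taken from the discussion; the parsing and the messages printed are assumptions, not part of any official tool.

#!/usr/bin/env python
# Minimal sketch: report the EMI generation of the local node by reading
# /etc/emi-version (the file mentioned in the EMI-2 discussion above).
# Everything other than the file path is illustrative.

import os

EMI_VERSION_FILE = "/etc/emi-version"

def emi_generation(path=EMI_VERSION_FILE):
    """Return the major EMI generation (e.g. 2 or 3), or None if unreadable."""
    if not os.path.isfile(path):
        return None
    with open(path) as handle:
        version = handle.read().strip()  # e.g. "3.7.0-1"
    try:
        return int(version.split(".")[0])
    except ValueError:
        return None

if __name__ == "__main__":
    generation = emi_generation()
    if generation is None:
        print("No readable /etc/emi-version - probably not an EMI node")
    elif generation < 3:
        print("EMI-%d node - needs upgrading before the decommissioning deadline" % generation)
    else:
        print("EMI-%d node - OK" % generation)

Run on a service node (for example a DPM head node) it just prints the generation; a site could wrap it with ssh or its configuration management system to sweep all service hosts before the April deadline.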