Attending: Andrew L; Brian D; Dan T; David C; Duncan R; Elena K; Federico M; Gang Q; Gareth R; Gareth S; Govind; Ian L; Jeremy C; John B; John H; Kashif M; Lukasz K; Matt D; Matt RB; Oliver S; Raul; Rob F; Sam S; Steve J; Daniela B; Alessandra F
Apologies: Raja

11:00 - 11:20 Experiment problems/issues 20'
Review of weekly issues by experiment/VO

- LHCb
From LHCb nothing much to report except standard productions. We are still having problems with submitting jobs to ARC CEs, but I believe most of it has been alleviated now.
- No issues raised

- CMS
Matt D - Healthy number of CMS tickets coming through. Brunel had a subtle DPM problem, where nodes got slower without crashing; this caused transfers to fail. Raul is looking into it. Topic for tomorrow's storage meeting. Sam mentions that Raul is running some exciting plugins at the urging of the DPM devs, which might cause some interesting behaviour.

- ATLAS
- Liverpool and Sheffield up for promotion to T2D.
- Durham now online. GGUS 107536 can be closed.
- 107416 still happening: problem with communication to US Tier-2s. The last 5 days seem okay, though it is not clear what has changed.
- Problems with Panda lookup; the Rucio team made some cache optimisations and increased resources. Reported at the WLCG meeting that 2 of the 5 servers are out.
- Brokering algorithm changed to reduce pressure on Rucio. Problem with the Rucio stress test for RAL.
- 13 TeV simulation about to start - this will be multicore. Request for an increase of multicore resources.
- Release 19.1 built but not validated yet.
- Multicore queue problem at Liverpool. Alessandra is back now. Only one release was validated, so Liverpool wasn't receiving jobs.
- Update from Steve: Liverpool receiving jobs as of this morning.
Jeremy notes that the UK share of ATLAS work is up to ~16%; whilst this is good news, it is unclear where the increase is coming from.

- Other
- Steve is chasing hyperK to clean up their VO card to show their cvmfs use.

11:20 - 11:40 Meetings & updates 20'
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

- General updates
Bristol suggests it is seeing connection problems mostly, but not exclusively, to US sites.
The GOCDB test server has been updated to v5.3.
GOCDB has received a new service type request for 'egi.Perun'. Under the lightweight EGI review process we are required to respond with any suggestions/issues before 14th August. Perun is used by the EGI Fed Cloud to manage users' access rights to cloud services; every cloud VO needs to be supported by Perun, which is why it has been requested that the service be properly registered and then monitored.
LHCb have reported that the dCache problems they have seen recently do not seem to have any correlation with a particular dCache version; even different endpoints at the same site could fail or work OK. This is all related to xrootd endpoints, and some sites have solved issues that seem to have been caused by misconfigurations at their site.
GridPP will receive some RIPE probes for distribution - note there is now a waiting list for UK-based requests.
- Distribution can be decided at Ambleside.

- WLCG ops coordination
The next meeting is on 21st August. For the September MB we have been asked for ideas on improving operational efficiency within WLCG.

- Tier-1 status
Ongoing investigations into problems with draining disk servers in Castor 2.1.14. We have announced that we will shut down both the FTS2 service and the software server used by the small VOs on 2nd September.
- Nothing to add from Gareth

- Storage and data management
Pool nodes at RHUL have received test errors.
- Sam pointed Govind to the DPM devs after extensive debugging himself. Still working on it. "Very weird".

- Accounting
Accounting looks behind for UCL, Sheffield and Sussex.
- Elena fixed it last night, so Sheffield should be okay.
- Sussex are only just out of downtime, so they should come back okay.

- Documentation
The keydocs PHP scripts are working now, so we can restart our review process.

- Interoperation
David - Some confusion over the next meeting date. Summer hiatus? Last meeting was yesterday.
Agenda: https://wiki.egi.eu/wiki/Agenda-14-07-2014
Minutes:
URT: see agenda for details.
SR: In verification: gfal2 v2.5.5; active: globus-info-provider-service v0.2.1, cream v1.16.3; ready to be released: storm v1.11.4, lb v11.1, wms v3.6.5, dcache v2.6.28.
DMSU report: CREAM CLI/GridSite segfaults with long-lived proxies solved.
Migration of Central SAM services: note to make sure that, if services are being reinstalled, the patches are applied.
EMI-2/APEL-2 - Looks like UCL is still publishing with the APEL-2 publisher.
Hoped that the gr.net issues were resolved on Monday. Summary of the discussion to be in the minutes.
Next meeting placeholder 28th July, but may not happen (OMD depending).
Please fill out this UMD customer satisfaction survey in the next couple of weeks if you have a moment: https://www.surveymonkey.com/s/MQ6G8BZ
David - cvmfs monitoring picked up by the squid monitoring TF; more news after GridPP.

- Monitoring
Next consolidation meeting this Friday, Messaging and SAM 3 UI: https://indico.cern.ch/event/334354/
Kick-off meeting discussing cvmfs monitoring in the squid monitoring TF is being arranged now.

- On-duty
Last week was quiet. Still one or two responses needed for the next rota allocations.

- Rollout
Nothing to see here.

- Security
There is an issue at the moment in the evaluation of vulnerabilities causing everything rated 'High' by Pakiti to display as 'Critical' in the Dashboard.
Ewan - state of the dashboards is in review; Pakiti should be fixed now. Minor updates.
- DPM Argus integration should work now - Ewan's post on the storage blog: http://gridpp-storage.blogspot.co.uk/2014/08/argus-user-suspension-with-dpm.html
ARC CEs respond oddly to banned accounts due to fallback mapping. Getting different results at different sites: if the fallback account falls back to "nobody" and nobody exists, the job runs. Bad! Seems to be fragile. Ewan requests Unix map lines from people's ARC CEs to see what's up. The request will go to TB-SUPPORT.
- No other issues brought up.

- Services
A reminder to update site status information in the IPv6 pages. We will shortly review issues being picked up by perfSONAR and the steps to take when investigating. There is a new version (v3.4rc2) of perfSONAR being tested at QMUL [1]. Details here [2].
[1] http://perfsonar03.esc.qmul.ac.uk/toolkit/gui/services/
[2] http://psps.perfsonar.net/toolkit/releasenotes/pspt-3_4rc2.html
Duncan - new perfSONAR being tested by sites in the "testing" mesh. Not for production yet.

- Tickets
Tier 1 106324
- The Tier 2 at RAL *doesn't* see the problem, so the Tier 1 chaps are looking at what's different (if anything) between the two.

- Tools
Naught

- VOs
Nada

- Site updates
Nothing

11:40 - 11:45 Multi-core status and actions 5'
Material: document pdf file
- See the document
In summary, apart from Oxford (in downtime) the current multicore sites are running okay. RALPP, ECDF and Sheffield are next on the list. Elena will look into it.
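For the sites still to deploy multicore queues, a minimal sketch of the sort of HTCondor setup commonly used (a partitionable slot per worker node plus the condor_defrag daemon) is given below for reference. This is illustrative only: the parameter values are placeholders, not Oxford's or any other site's actual configuration, and other batch systems need their own equivalents.

    # Worker-node config: one partitionable slot so single-core and
    # 8-core (multicore) jobs can share the same hardware.
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%, mem=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

    # Central-manager config: run condor_defrag so whole-node gaps open up
    # for multicore jobs. Values below are placeholders - tune per farm.
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    DEFRAG_INTERVAL = 600
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
    DEFRAG_MAX_WHOLE_MACHINES = 4
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus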
Jeremy reminds us that we need to make progress across all ATLAS sites by October. Matt RB would like to look into this for Sussex.
Glasgow is running multicore on 24-core nodes; Oxford plan to run on 64-core nodes, and Sussex would run on 64-core nodes too. Kashif plans to try to fix the Oxford multicore Condor setup soon; something seems to be missing from the configuration. The largest nodes in the UK at the moment have 64-core slots. Glasgow is running three disks in RAID-0 on these nodes; Oxford is running a 3-way stripe with fast disks.
Jeremy reminds us of the Batch System Status page on the wiki: https://www.gridpp.ac.uk/wiki/Batch_system_status

11:45 - 11:55 Feedback for operational effort optimisation 10'
- Reference email sent on Friday
- Selection of sites to provide input
- Twiki: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationCosts
Request for volunteers from sites. Looking to get good ideas; useful to get any thoughts people have. Would at least like a few sites to provide input directly.
AF - Feedback is important. We are given a voice for once.
Ewan suggested some kind of "European Middleware Initiative" to coordinate middleware production and maintenance. Also suggested that people stop knocking holes in our infrastructure.
- The Middleware Readiness group is attempting to bring things together.
Sam - these guys are under-resourced though (e.g. Argus lacking support).
AF - the trouble is that this is a political funding problem, so (probably) outside the scope of this. Maybe we need pointers for future funding requests.
glexec mentioned as a possible source of wasted effort.
Sam - ATLAS seem to have good/sane opinions (CMS do too, but not as good/sane). Reducing layers seems to be the way to go, e.g. parameter passing to batch systems. glexec again brought up as an example of an unnecessary layer.
Jeremy asks how cloud might affect things.
Sam - for better or worse, you lose what batch systems do for you. (Sam also mentioned a paper he published recently on the efficiency lost with VMs.)
Gareth - it has been suggested that mature sites in the same country could be grouped together as a single point of contact/submission for VOs. Ticket review/ROD mentioned as points of effort.
AF - we are a model country for ATLAS, so this doesn't really apply to the UK.
AF - mentions the history of the HammerCloud tests, and how they've been useful for ATLAS and adopted by CMS.
AF - "We are what ATLAS are asking for."
JC - puts in the counterpoint that we're also better funded than some NGIs.

11:55 - 12:10 Networking 15'
- Review and update of https://www.gridpp.ac.uk/wiki/IPv6_site_status
All sites have discussed IPv6 with their network teams; a few sites have yet to ask for addresses. Perhaps stale entries?
Duncan - RAL have some IPv6 addresses. Gareth S - unsure who's taking over that work.
Gareth R - there is a difference between being given IPv6 addresses and *actually* being able to use them. Glasgow had trouble when they wanted to apply IPv6 on their production cluster, despite having no problems on their test cluster. Some core routers won't route IPv6 traffic, and the creative routing needed to get around this makes things more complicated and reduces redundancy.
Jeremy - exposing this to Janet could hopefully alleviate fears, and they (Janet) can engage with network teams.
Duncan - we're a large user of Janet, we have clout (paraphrased).
Gareth R - some of the Glasgow networking guys are "terrified" of IPv6 - greater training is needed; maybe we can ask for it.
Ewan - Oxford in a similar boat: core routers not really capable, but due for an upgrade.
Ewan - worth having a census of core routing kit at sites?
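On the point about the gap between being allocated IPv6 addresses and actually being able to use them, a quick sanity check on a host is sketched below. The interface name and target hosts are placeholders; any known dual-stack or IPv6-only host will do for the reachability test.

    ip -6 addr show dev eth0      # is a global address configured, not just a link-local fe80:: one?
    ip -6 route show              # is there a default route via the site's router?
    ping6 -c 3 ipv6.google.com    # does IPv6 traffic actually get off site?
    traceroute -6 ipv6.google.com # where does it leave the campus network?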
Ewan - if you get any IPv6 connectivity it's simple to set up perfSONAR on it and get going. It is important to remind network planners that IPv6 is something they need to plan for.
Ewan - do we need an IPv6 SIG? Janet have seemingly "solved" IPv6 at their end, so we need to remind them it's still an issue.
Jeremy - different allocations at different sites; QM for example got their own.
Gareth - has some questions from his network chaps for Janet next week. Best practices for management/allocation.

- Using perfSONAR (Bristol example)
Duncan - the Bristol problem does seem to be linked to the firewall somehow, although oddly enough Bristol are the only green site for transfers to FNAL!
Lukasz - waiting for more information from the networking team.
Duncan - firewalls introduce latency. Traceroutes are a good tool. Ideally, Bristol putting their perfSONAR box outside the firewall would be useful.
Brian - do we have a table somewhere recording whether sites have 10G or 1G interfaces on their perfSONAR boxes?
Duncan - we don't, but we should. We'll add it.
Jeremy - we'll pick this up next week.
- http://netmon02.grid.hep.ph.ic.ac.uk:8080/maddash-webui/index.cgi?grid=UK%20sites%20-%20intercloud%20OWAMP%20Mesh%20Test

- RIPE update
Become an ambassador and set up an anchor. In discussion with them; they'd like something in Scotland.
Ewan - be clear with them that this is an academic network that they're on; they might see routes being a little odd.
Sam - I think there are Janet gateways outside of London.
Ewan - yes, but due to the fast interconnects I'm not sure they're all used.
Brian - believes there's a route out to the US at Manchester.
Ewan - interesting to traceroute from your house to somewhere on JANET.
JC - would like to bring this up at Ambleside.

12:10 - 12:11 AOB 1'
No AOB.

CHAT WINDOW

Jeremy Coles: (12/08/2014 11:05:06) Please note the meeting is being recorded. Note suggestion to close https://ggus.eu/index.php?mode=ticket_info&ticket_id=107416
Present 11:15: Andrew L; Brian D; Dan T; David C; Duncan R; Elena K; Federico M; Gang Q; Gareth R; Gareth S; Govind; Ian L; Jeremy C; John B; John H; Kashif M; Lukasz K; Matt D; Matt RB; Oliver S; Raul; Rob F; Sam S; Steve J
Elena Korolkova: (11:12 AM) I'll close https://ggus.eu/index.php?mode=ticket_info&ticket_id=107416
Daniela Bauer: (11:12 AM) Sorry I am late.
raul: (11:14 AM) Not really. DPM lightly unstable. Yes.
Jeremy Coles: (11:15 AM) https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
Govind: (11:20 AM) Sorry I try to speak but Vidyo is too slow. One of the gridftp expert developers is on holiday, so waiting for him to come back.
Jeremy Coles: (11:26 AM) For minutes: + Daniela B; Alessandra F; David C.
Ewan Mac Mahon: (11:27 AM) It's not me.
Elena Korolkova: (11:29 AM) sorry
Ewan Mac Mahon: (11:30 AM) And that storage blog on setting up DPM ARGUS banning is here: http://gridpp-storage.blogspot.co.uk/2014/08/argus-user-suspension-with-dpm.html
Elena Korolkova: (11:33 AM) I decided to fully reconfigure ce1 today
Ewan Mac Mahon: (11:33 AM) WTF? Why is their perfsonar only on 1G?
Lukasz Kreczko: (11:34 AM) Because we had only those boxes available
Ewan Mac Mahon: (11:34 AM) We gave you the money FFS! Buy something decent.
Lukasz Kreczko: (11:34 AM) If that is so, I will chase it up. Might have been before my time
Samuel Cadellin Skipsey: (11:34 AM) Lukasz: on the actual link issue (rather than the Perfsonar one) - what are your network tuning settings like on the storage?
(although, since you actually see P/L, I am suspicious of your link firewall's ability to firewall at the line rates you are hitting)
Lukasz Kreczko: (11:35 AM) AFAIK we have not applied any network tuning. But I might be wrong. Winnie should know about it as the storage was set up by her and Bob
Samuel Cadellin Skipsey: (11:36 AM) Right: so, we generally find that long distance transfers benefit from increasing the default TCP window sizes on storage boxes.
Lukasz Kreczko: (11:37 AM) OK, thx. I will forward this to Winnie. But the network issues seem more general (glideins as well)
Ewan Mac Mahon: (11:37 AM) IIRC the money was earmarked in one of the grants (the DRI money, I think) but got mis-spent on something else at the time. Then Bristol were supposed to be getting perfSonar boxes after the event. However, there's just no point having a 1G perfsonar on a 10G site; it's going to see completely different things than your real services. It's actively misleading.
Samuel Cadellin Skipsey: (11:37 AM) Lukasz: sure, which is why I suspect your firewall.
Lukasz Kreczko: (11:38 AM) I see. Thanks for clarifying Ewan.
Duncan Rand: (11:43 AM) Note the problems are inbound whereas tuning affects outbound transfers.
Lukasz Kreczko: (11:43 AM) Are there specific hardware specs for perfsonar boxes? I could not find anything on the GridPP wiki (although I did not look for long). I assume one box with 2 NICs (1 x 10 Gbit [bandwidth] and 1 x 1 Gbit [latency]) would be enough, right?
Samuel Cadellin Skipsey: (11:43 AM) Duncan: ah, I'd thought they were in both directions. In that case, I'm even more suspicious of the firewall.
Duncan Rand: (11:44 AM) Me too.
Ewan Mac Mahon: (11:46 AM) There was a standard spec for the perfsonar nodes; I'll see if I can dig it up. It's supposedly somewhat important to have separate boxes for 'good' latency measurements, but one box is better than nothing. I'm not sure about running the two kinds of measurements on different interfaces; I think that's likely to be more trouble than it's worth.
Elena Korolkova: (11:46 AM) Yes, I will
Ewan Mac Mahon: (11:47 AM) In a nutshell though, the 'gridpp standard' units were base model Dell R610s with a 10Gbit card and a couple of nice-to-haves like dual PSU and RAID disks. They were somewhat overpowered, but they weren't all that expensive, so it seemed like the way to go.
John Hill: (11:48 AM) Lukasz - we at Cambridge run a single perfSonar box for throughput and latency over a single 10GbE NIC
Ewan Mac Mahon: (11:48 AM) Also, you do need something with a moderate degree of oomph to be sure of driving a 10Gbit card fast.
Matt Raso-Barnett: (11:49 AM) I'll be using 64 core machines if I get this setup. That's all we have, same as Ewan.
Lukasz Kreczko: (11:49 AM) ok, thanks for the info
Duncan Rand: (11:50 AM) I think (but am not sure) the latest version of perfsonar that I mentioned may be able to support throughput and latency properly on the same node: "A new common test scheduler is used to run all throughput, owamp, ping and traceroute tests. It initiates all tests except constant OWAMP tests (i.e. powstream) via BWCTL." http://psps.perfsonar.net/toolkit/releasenotes/pspt-3_4rc2.html
Lukasz Kreczko: (11:50 AM) same node but separate NICs?
Duncan Rand: (11:51 AM) You have two nodes though...
Jeremy Coles: (11:52 AM) https://www.gridpp.ac.uk/wiki/Batch_system_status
Ewan Mac Mahon: (11:52 AM) I think Lukasz is planning to replace the two old nodes with one decent one.
Lukasz Kreczko: (11:52 AM) Right, sorry for the confusion.
Since our two boxes are quite old I was thinking of reducing them to one. I am putting together an email so I am just making sure I mention all possibilities.
Ewan Mac Mahon: (11:53 AM) And I don't think it would need two NICs - presumably the point of using the common scheduler is that it avoids running a latency test and a bandwidth test at the same time.
Lukasz Kreczko: (11:53 AM) OK. Thanks
Duncan Rand: (11:53 AM) I'll see if I can find out.
Ewan Mac Mahon: (11:53 AM) Which avoids the problems of the bandwidth test load screwing up the latency measurement.
Duncan Rand: (11:54 AM) Indeed, except this is confusing: "It initiates all tests except constant OWAMP tests (i.e. powstream) via BWCTL"
Jeremy Coles: (11:54 AM) https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationCosts
Duncan Rand: (12:01 PM) See https://lists.internet2.edu/sympa/arc/perfsonar-user/2014-07/msg00112.html "tips for One way latency and bw test on separate NICs on perfSonar 3.4rc2"
Samuel Cadellin Skipsey: (12:06 PM) (Honestly, the most radical thing you could do to "improve efficiency" would be to change the way we do authentication etc, but that's probably too radical nowadays, and mostly would help smaller VOs.)
Alessandra Forti: (12:06 PM) I think we should put that in as well
Jeremy Coles: (12:11 PM) https://www.gridpp.ac.uk/wiki/IPv6_site_status
Gareth Smith: (12:11 PM) The efficiency stuff may well benefit from a chat over a beer at the GridPP meeting.
Govind: (12:15 PM) I am going to ask for an update from the network guy now.
Duncan Rand: (12:23 PM) https://www.ja.net/products-services/janet-futures/ipv6 https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=IPV6-USERS last updated Nov 2013
Ewan Mac Mahon: (12:24 PM) Well, resurrect rather than start then :-) I'm not surprised about that though, but I think Janet did their IPv6 stuff so early that they didn't have the sites ready to take it up; now they do (more).
John Hill: (12:25 PM) Cambridge policy is to allocate a /56 per department. We then got a single /64 within that.
Brian@RAL: (12:27 PM) http://netmon02.grid.hep.ph.ic.ac.uk:8080/maddash-webui/index.cgi?dashboard=UK%20sites
Duncan Rand: (12:27 PM) http://netmon02.grid.hep.ph.ic.ac.uk:8080/maddash-webui/index.cgi?grid=UK%20sites%20-%20intercloud%20OWAMP%20Mesh%20Test
Brian@RAL: (12:28 PM) https://maddash.aglt2.org/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhosts%26host%3Dac.uk
Duncan Rand: (12:37 PM) I'm at home. The route to ICL seems to go to Janet via telehouse.ukcore.bt.net
Samuel Cadellin Skipsey: (12:37 PM) I note that JANET's own documentation suggests that "Global Transit" is at Manchester and London, but there are other points where it connects to MANs etc.
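Note for reference: the TCP window tuning Sam mentions in the chat above usually amounts to raising the kernel socket buffer limits on the storage nodes. A minimal sketch is below; the values are illustrative placeholders only, and the right sizes depend on the bandwidth-delay product of the links in question (tcp_rmem governs the receive side, i.e. inbound transfers, and tcp_wmem the send side).

    # /etc/sysctl.conf - illustrative values only, not a recommendation
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864
    net.ipv4.tcp_rmem = 4096 87380 33554432
    net.ipv4.tcp_wmem = 4096 65536 33554432
    # apply with: sysctl -p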