Attending: Andrew Washbrook, Christopher Walker, Daniela Bauer, Daniel Traynor, Duncan Rand, Elena Korolkova, Emyr James, Ewan Mac Mahon, Gareth Roy, Gareth Smith, Jeremy Coles, John Bland, John Green, John Hill, Mark Mitchell, Mark Slater, Mohammad Kashif, Pete Gronbech, Raja Nandakumar, Matt (recording), Robert Frank, Rob Fay, Sam Skipsey, Steve Jones, Wahid Bhimji

Chair: Jeremy
Minutes: Matt

11:00 Experiment problems/issues (20')

ATLAS report (pdf file)

Review of weekly issues by experiment/VO:

- LHCb
Nothing much to report; attention is turning to MC. Not much on the horizon. No-one had anything to discuss.

- CMS
Nothing to report.

- ATLAS (see attached report)
Oxford and Birmingham are encouraged to try the Liverpool sysctl settings (from the Storage mailing list). AF will send a summary (one has already been e-mailed to the storage list). It is worth following up: sites should collect their sysctl settings on TB-SUPPORT and the wiki, and Wahid will start the wiki ball rolling on this.
SRM problems: Gareth wonders how we can exert pressure to get a fix for the SRM problems hitting QMUL. There is also some confusion over why the problem has periods of being worse. The DDM developers are not in a rush over this either. Ewan asks: if the problem is caused by robot certificates with ':' in their names, why can't we just issue robot certificates without colons?
There was some discussion over setting memory limits for sites in Panda; this has worked nicely for Manchester.
In the case of "out of hours" user-based trouble (like the multi-core jobs), the ATLAS AMOD shifter can be contacted to ban the user. Sites are generally happy with the handling of last week's incident. Ewan mentions enabling multi-core queues in the future to allow legitimate multi-core jobs; maybe we need to ramp up multi-core queue deployment? We will have to wait for a request from ATLAS. Ewan also mentions SLURM and the SLURM CE: SLURM has good "core locking" using control groups, and if we had been running SLURM we would not have had this problem. CW reports that SL6 plus Grid Engine can also do this. SLURM was supposed to be in EMI-3, but Jeremy couldn't find it. CW: users shouldn't be in a position to crash a node, and we should be able to prevent this using the batch system.
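For illustration, the cgroup-based confinement Ewan and CW are describing looks roughly like the following in a stock SLURM installation (the parameter names come from the standard slurm.conf and cgroup.conf; the values are only a sketch, not a configuration any site reported running):

  # slurm.conf: track and confine job processes with control groups
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup

  # cgroup.conf: pin each job to the cores (and memory) it was allocated,
  # so a job that spawns extra processes cannot take over the whole node
  ConstrainCores=yes
  ConstrainRAMSpace=yes

With ConstrainCores set, a nominally single-core job that forks several workers is still held to its one allocated core rather than crashing or monopolising the node.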
- Other VOs
From the bulletin: VomsSnooper (http://www.gridpp.ac.uk/news/?p=2695) is worth running; it found 4 errors. NGS VOMS server: 2 sites remaining, Glasgow and Durham, with progress on both (https://ggus.eu/ws/ticket_info.php?ticket=90356). Recommendation to everyone to use VomsSnooper at least once.

11:20 WLCG Ops Coordination meeting & Task Forces (15')

Next meeting on Thursday: https://indico.cern.ch/conferenceDisplay.py?confId=237012 (see attached slides)
ATLAS have OS-checking tools in cvmfs; they need documentation but are very useful. Greater involvement is needed from UK Tier-2s, and there is a call to arms for the UK to help out with SL6 testing - some sites are already ahead on this. Chris has asked for reminders for these meetings; there was some discussion over which list such reminders should be sent to (TB-SUPPORT or gridpp-ops), or they should at least be put into the Bulletin. Link to the full WLCG Ops meeting list: https://indico.cern.ch/categoryDisplay.py?categId=4372
Lots of love for the HEP_OSlibs rpm in the UK. A new home needs to be found for it (sysadmin hosting was suggested; the experiments are using github).

11:35 IPv6 next steps & perfSonar status (15')

Mark:
* IPv6 wiki page: https://www.gridpp.ac.uk/wiki/IPv6
CERN is running out of IPv4 addresses (in part due to deploying lots and lots of VMs). One big hold-back for IPv6 is AFS (JC: some may use this as an excuse to move away from AFS). RAL still needs to be IPv6-enabled, but work is ongoing. Glasgow is attempting to run a hybridised IPv6/IPv4 network. Mark encourages sites to "have a go" at IPv6 - but don't expect anything to work: very little can cope with the dynamic changes. Glasgow works around some issues by obtaining "life leases" for their boxen. DON'T try to deploy IPv6 on a production service - it won't end well. The experiments are working on it, with CMS ahead and ATLAS a little way behind them; the experiments need to tell us what they want/need tested. IPv6 is very, very different. Janet does support IPv6, but it takes some work to get the infrastructure set up in a timely manner, and lots of chasing up is needed. IPv4-to-IPv6 translation is NOT an option - too expensive. There is a call for a few more sites to get involved: Imperial is about to step up, and QM and Oxford are also seeing if/how they can get involved, though Oxford may have a showstopper in their central IT services. Security with IPv6 is also an unknown. This is a potential HEPSYSMAN/GridPP meeting topic. All sites should check their site/university IPv6 plan, even if they don't want to get involved in the current work.
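As a starting point for a site "having a go", a few quick checks of the sort implied above (the hostname is only a placeholder, not a recommendation):

  # Does the host publish an AAAA record? This lookup works fine over an
  # IPv4-only resolver, as noted later in the chat about Oxford's DNS servers.
  dig +short AAAA se01.example.ac.uk

  # Is the host actually reachable over IPv6 from this machine?
  ping6 -c 3 se01.example.ac.uk

  # Does this machine have a global (non link-local) IPv6 address at all?
  ip -6 addr show scope global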
Duncan: (see attached slides on the perfSonar mesh)
The slides contain instructions on how to update perfSonar to the perfSonar mesh. This uses a configuration file rather than manual input to set up your perfSonar instance and its tests. Duncan advises looking over the setup scripts before running them, to familiarise yourself with what is going on (always wise!). AF points out that, once it is set up, a site won't notice any change (due to the full inter-site testing the UK already has). PG asks if we understand the current perfSonar problems; Duncan reports that they are roughly understood. Duncan asks Liverpool if they have upgraded their link; John B says they did about a month ago but have seen no benefit on the perfSonar front. A few other sites have issues/oddities; Duncan will review and contact sites once again. A question raised at a recent GDB concerning the "ownership" of perfSonar revealed issues. Volunteers to try out the mesh would be appreciated - Oxford, Birmingham and Lancaster stepped forward.
* PS: http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=UK

11:50 Review of GDB topics (10')

Slides (powerpoint file, pdf file)
Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=197800
Minutes: https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20130213
GridPP Tier-2 notes: https://www.gridpp.ac.uk/wiki/GDB_reports

12:00 Meetings & updates (10')

With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

- Tier-1 status
Raja asks about the status of the Tier-1 batch system. From Tuesday 19th February: there have been ongoing problems with the batch farm not starting enough jobs over the past couple of weeks. AFS clients were removed from the worker nodes last week. A small number of nodes now form an SL6 batch queue behind its own CE (lcgce12). Testing of FTS version 3 is ongoing.
- Accounting
- Documentation
The status of documentation was brought up - please check the pages if you are responsible for any documentation.
- Interoperation
- Monitoring
- On-duty
- Rollout
- Security
John Green would like to remind people about a current RH security vulnerability (CVE-2013-0871).
- Services
- Tickets
- Tools
- VOs
- Site updates

12:10 Actions & AOB (2')

No AOB.

Chat Window:
[11:01:34] Wahid Bhimji morning
[11:02:21] Rob Fay no feedback here
[11:02:25] Sam Skipsey no-one else has feedback
[11:02:29] Jeremy Coles Humm.
[11:02:35] Rob Fay no humming either
[11:02:39] Jeremy Coles I already had one EVO session freeze.
[11:02:43] Sam Skipsey We can hear you perfectly, though.
[11:02:50] Sam Skipsey Well, when you were speaking, at least.
[11:03:15] Jeremy Coles It does not work when it feeds back quite loud.
[11:04:54] Jeremy Coles Now fixed - changed the Panda server to a UK one.
[11:06:25] Jeremy Coles https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=4&confId=220594
[11:10:45] Wahid Bhimji probably more likely DPM version rather than SL6... Glasgow has the same lower TCP setting as we used to have and trasnfers are fine for them
[11:11:01] Wahid Bhimji so still mysteries...
[11:11:33] Ewan Mac Mahon And Oxford's been bad for ages and we've had a variety of DPM versions in that time. All SL5 though.
[11:11:38] Wahid Bhimji all I did was
[11:11:38] Wahid Bhimji net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
---
> net.ipv4.tcp_rmem = 65536 1048576 16777216
> net.ipv4.tcp_wmem = 65536 1048576 16777216
[11:11:43] Ewan Mac Mahon It's still really, really wierd.
[11:11:56] Wahid Bhimji that was good enough for me...
[11:12:21] Emyr James sorry am late - had pc issues
[11:12:30] Sam Skipsey Yeah, I don't think we understand this well.
[11:13:16] Sam Skipsey So. Our settings are:
[11:13:24] Sam Skipsey # TCP buffer sizes
net.ipv4.tcp_rmem = 131072 1048576 2097152
net.ipv4.tcp_wmem = 131072 1048576 2097152
net.ipv4.tcp_mem = 131072 1048576 2097152
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
[11:13:34] Sam Skipsey and they've been that way for *ages*
[11:14:05] Steve Jones Thanks for these settings, but is there a more "sticky" way to document them?
[11:14:17] Christopher Walker # 32 MB might be needed for some very long end-to-end 10G or 40G paths
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# (only change the 3rd value, and make it 16 MB or more)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# recommended to increase this for 10G NICS
net.core.netdev_max_backlog = 30000
[11:14:22] Jeremy Coles Yes once we agree Steve.
[11:15:37] Ewan Mac Mahon So, tb-support and a wiki page?
[11:15:41] Christopher Walker Yes
[11:15:48] Ewan Mac Mahon Which settings do we want reported then?
[11:22:44] Ewan Mac Mahon Um - I'm sure I'm missing something here, but if the problem only affects Robot certs with colons in their DNs, can we not just issue some robot certs without colons in their DNs?
[11:26:14] Emyr James some of our local users are using an atlas package called lhapdf which grabs 2GB for itself as soon as the job starts....seems a tad excessive
[11:31:27] Wahid Bhimji theres nothing "organised" about atlas letting run anything they want. But yes multicore queues are good in general
[11:32:39] Wahid Bhimji well they can run them single core - it might be more efficient in the future to work that way but only if users are clever enough - I expect many are not
[11:34:32] Ewan Mac Mahon Of course, once we're all cloud based the VMs will contain the jobs.
[11:34:35] Ewan Mac Mahon :-P
[11:36:12] Jeremy Coles https://indico.cern.ch/conferenceDisplay.py?confId=237012
[11:36:39] Steve Jones https://www.gridpp.ac.uk/wiki/VomsSnooper_Tools
[11:37:15] Steve Jones See checkSite use case.
[11:38:24] Jeremy Coles Alessandra's slides: https://indico.cern.ch/getFile.py/access?contribId=5&resId=0&materialId=slides&confId=220594
[11:39:48] Christopher Walker Send reminders and I might go.
[11:39:57] Christopher Walker Particularly for the perfsonar meeting this afternoon.
[11:46:02] Jeremy Coles Raul's comment on HEPOS_libs (offline): atlas, cms, and lhcb, others have been using SL6 in production at Brunel for 6 months. Actually Brunel was used by C. Wissing for debugging CMS. What I did was to open the old HEPOS_libs for SL5, locate each i386 package and find equivalent for SL6 64 bits and install. IMPORTANT: in case of CMS, Atlas and LHCB, it only works because we use CVMFS. CMS, for example, brings even the compiler (gcc 4.6) from CVMFS. Hone, however, don't use CVMFS and rely on those libraries.
[11:49:06] Pete Gronbech joined
[11:50:13] Alessandra Forti https://indico.cern.ch/categoryDisplay.py?categId=4372
[11:51:00] Christopher Walker Agree completely.
[11:51:15] Christopher Walker HPOSLIB RPM is a good idea.
[11:51:20] Matt Doidge Agreed
[11:52:24] Christopher Walker EMI/UMD repo - use that anyway.
[11:54:03] Jeremy Coles IPv6 wiki page: https://www.gridpp.ac.uk/wiki/IPv6
[11:57:43] Christopher Walker Hepsysman/GridPP session on IPv6 deployment issues?
[12:01:55] Christopher Walker QMUL can have IPv6 - but StoRM isn't IPv6 compliant.
[12:03:52] Ewan Mac Mahon On AAAA records, I think the Oxford situation is that we can have the record, but the DNS servers aren't IPv6 reachable, so as long as your client can look up the DNS over IPv4, then you get your AAAA record.
[12:04:23] Ewan Mac Mahon Interesting times.
[12:05:15] Ewan Mac Mahon If nothing else, you can contribute a place for us to tunnel to.....
[12:05:24] Mark Mitchell Yeah we built out a second DNS server for this
[12:06:04] Ewan Mac Mahon OUCS jealously guard the DNS (preciousssss) so we can't easily do that, they'd have to.
[12:07:08] Ewan Mac Mahon This is their main page on IPv6: http://www.oucs.ox.ac.uk/network/addresses/ipv6/
[12:07:19] Gareth Smith Sorry I have to go now.
[12:07:24] Ewan Mac Mahon Last updated in 2011 by Oliver Gorwits, who's long gone.
[12:09:57] Jeremy Coles http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=UK
[12:11:40] John Hill Oxford seems to have had a problem for about a week
[12:11:53] John Bland our perfsonar systems have 10G cards
[12:11:53] Rob Fay he's typing
[12:12:03] Mark Mitchell Duncan will have a look at the Glasgow tests
[12:12:07] Duncan Rand ok
[12:12:17] Sam Skipsey There was a comma there in Mark's last sentence.
[12:12:23] Sam Skipsey We're not telling Duncan to look at them
[12:12:23] John Bland we see 2-3G no problem on our storage nodes on that link and we can get ~10G between nodes on our switches but perfsonar just doesn't show any change
[12:12:49] John Hill ECDF is very asymmetric
[12:13:02] John Green left
[12:13:13] Elena Korolkova We haven't finished network upgrade yet
[12:14:29] Wahid Bhimji haha depends what you say
[12:14:29] Jeremy Coles thanks
[12:15:30] Ewan Mac Mahon @John - what problem does Oxford seem to be having?
[12:15:52] John Bland which john?
[12:16:01] Alessandra Forti hill
[12:17:11] Ewan Mac Mahon Hill, but it's OK, he's been chatting with Pete - the problem looks like everyone else testing to us is failing, but we're testing outwards OK.
[12:17:20] Ewan Mac Mahon We'll have a look at it, anyway.
[12:17:35] Mark Slater I may get round it at Bham as well
[12:17:39] Ewan Mac Mahon Er - we're looking at the bustedness, not the new thing. Yet.
[12:17:50] John Hill My perfSonar plots show "no data" for Cambridge==>Oxford for the last week
[12:17:52] Ewan Mac Mahon Sorry; that was less than clear.
[12:18:00] Jeremy Coles Moving next to https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest#
[12:19:12] Jeremy Coles https://bugzilla.redhat.com/show_bug.cgi?id=911937
[12:21:04] Jeremy Coles https://www.gridpp.ac.uk/php/KeyDocs.php?sort=reviewed