Chair: Jeremy
Attending: Alessandra Forti, Andrew McNab, Andrew Washbrook, Daniela Bauer, Daniel Traynor, David Crooks, Duncan Rand, Elena Korolkova, Ewan Mac Mahon, Gareth Roy, Govind Songara, John Hill, Mark Norman, Mark Slater, Mohammad Kashif, Pete Gronbech, Raul Lopes, Rob Fay, Rob Harper, Sam Skipsey, Stephen Jones, Stuart Purdie, Stuart Wakefield, Wahid Bhimji, Ian Collier
Minutes: Matt Doidge.
Apologies from many at RAL, who were in the middle of a site-wide network intervention.

Tuesday, June 19, 2012

11:00 Experiment problems/issues (20')
Review of weekly issues by experiment/VO

- LHCb
We have mostly smooth running for LHCb in the UK. Issues:
1. Various CVMFS errors at different sites. Followed up through GGUS tickets.
2. Interesting 3-day oscillation in running jobs at RAL (Tier-1). Trying to understand its origins and implications.

- CMS
Some issues at IC caused by dCache. Generally busy; had to close gridftp doors on older nodes with low RAM due to memory issues.

- ATLAS
IC - cvmfs missing release problems.
Oxford - FTS timeout problems, in ticket https://ggus.eu/ws/ticket_info.php?ticket=83330
Sheffield had a power cut; Birmingham offline due to cooling installation.
- cvmfs timeout problems, two savannah tickets. New cvmfs due to fix these problems.
- Manchester seeing cvmfs hangs - yet another bug.
- Sheffield & RALPP saw transfer problems in FTS.

UKI-LT2-IC-HEP: Long-standing problem with a missing release needs some dedicated testing due to the different setup IC has. Waiting on AdS to supply some code.
UKI-SOUTHGRID-OX-HEP: problems with FTS timeout settings. Brian has now changed them to a longer time.

Downtimes
UKI-NORTHGRID-SHEF-HEP: uni power cut
UKI-SOUTHGRID-BHAM-HEP: disruptive installation of new aircon units
UKI-SOUTHGRID-RALPP: site routers maintenance
RAL-LCG2: site routers maintenance

CVMFS
* The timeout problem now has two tickets, one for atlas and one for cvmfs:
https://savannah.cern.ch/bugs/?95420
https://savannah.cern.ch/support/?129468
Jakob thinks he has found a solution and has a test version of cvmfs for it.
* Another bug we are looking at is cvmfs hanging every now and then. This affects lhcb too, so Raja might want to take a look.
https://savannah.cern.ch/bugs/?92112

Transfer errors
UKI-NORTHGRID-SHEF-HEP and UKI-SOUTHGRID-RALPP had some problems with jobs in the transferring state accumulating after the RAL downtime last week. This was due to FTS reporting the same error code for two different errors, which confused Site Services. The problem has been noted and reported to the WLCG meeting.

Pete G asks if the T2D stats site gets updated more than once a month. Alessandra says probably.
Ewan would like to know when Oxford will start to reap the rewards of their upgraded connection.

- Other

11:20 Meetings & updates (20')
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

General Updates
Monday 18th June
The CADist-Check test has been updated to a new version and now takes the CA distribution version directly from http://egi-igtf.ndpf.info/distribution/egi/current/release.xml so it is very unlikely that we will see that problem with SAM-Nagios again.

- Tier-1 status
Tuesday 19th June
The update of Castor on Wednesday 13th June to version 2.1.11-9 went well.
The update of the database behind the non-LHC VOs' LFC and FTS on Wednesday 13th June was problematic. The FTS service was restored using a clean database that afternoon. The LFC service was not restored until the Friday morning (15th).
Today (Tues. 19th) there is a site networking upgrade.
A problem with file transfers to/from the German Tier1 (FZK) has been investigated and worked around.
Castor databases will be updated to Oracle 11 on Wednesday 27th June.
Ian C - RAL is in the middle of a site-wide network intervention. The FTS upgrade had troubles due to network problems preventing a rollback. This extended the outage but there was no data loss or other badness.
Hopefully today's network outage will fix network problems at the Tier 1.

- Storage & Data Management
No news from Storage groups this week.

- Accounting
No news this week.

- Documentation
Stephen - good progress with VomsSnooper, currently using 2 test subjects. Needs a few more volunteers to try things out.

- Interoperation
Monday 18th June
A few updates: https://twiki.cern.ch/twiki/bin/view/EMI/Emi2EgiGOM#Status_18_06_2012
The EMI 1 updates are just minor revisions: Top BDII, BLAH and StoRM. A repackage for GFAL/lcg-utils to handle the globus lib dependency problems.
Further EMI-2 updates, probably of interest only for those doing EA of them.
Staged rollout: Lots of EMI-2 packages, working their way through the verification/SR process. Software that is just a repackage from EMI-1 to EMI-2 is skipping SR on SL5 - the SL6 versions will be tested.
Most of the products on SL5 support upgrade and reconfiguration from the EMI-1 versions. Note that CREAM is one of the products that can't do an in-place update - new DB schema, so it needs a drain/wipe/re-install.
Question from Tiziana - anyone using CREAM in Cluster Mode? Any feedback on that?
Stuart - If anyone is publishing without userDN but has a good reason for it, please let him know.
Ewan - thinking about maybe perhaps using CLUSTER in the near future, any reason we shouldn't? Stuart - none known.
Stuart - Re-emphasised that EMI-1 to EMI-2 CREAM needs a reinstall.

- Monitoring
- Ranking continues. Plan to have a meeting in July to discuss good approaches to the plethora of monitoring available. Current priority is ranking the tools available. Glasgow dashboard is now packaged and available for download.

- On-duty
Stuart - nothing exciting last week.

- Rollout
Daniela - both IC and Brunel have SL6 test CREAM CEs but no staged rollout ticket for these services. No EMI-2 WMS packages yet.

- Security
No issues at the moment.

- Services
Tuesday 19th June
Some of the volunteer sites may not have perfsonar by end of June.
Which other sites are close? Lancaster installed but at 1G. Manchester would like more information from Ian C; Ian suggests just following the net install instructions but Alessandra would prefer to use SL. QMUL & Oxford used the "standard" install method. Ewan reminds us that the kernel is important.
Ewan - the perfSONAR installs can be considered "disposable", and can be redone in the future without problems.
Alessandra - probably going to go with the CD image then.
GridPP will resume running VOMS. Current plan is for the master to remain at Manchester and to host backups at Oxford/Imperial.

- Tickets
Monday 18th of June, 13:00 BST
22 open UK tickets this week. Not much beyond bulletin contents. A number of cvmfs-related tickets.
- Kashif configuring WMS at Oxford to test Sussex.

- Tools
Jeremy - would like to discuss how nagios failover will work at a future Ops meeting.

- VOs
Nothing.

- Site updates
Nada.

11:40 GDB overview (10')
June meeting, Wednesday 13th June. WLCG meeting notes.

Welcome [MJ]
August meeting is cancelled. October meeting is in Annecy.
EGI Technical Forum 17-21st September: http://tf2012.egi.eu/
HEPiX Fall: 15th-19th October.
- would like to get a few more people to HEPiX

Post-TEG Working Groups [Ian Bird]
Large number of WGs proposed.
DM&S: Benchmarking. Federation. Networking
WLM: Extensions of CE (multi-core; whole node; pilot support). Information System.
Security: Proposals coming
Database: share experiences.
Operations: m/w sw process. Monitoring.
Teams approaches: Operations coordination team. Sharing experiences/tech watch (pre-GDB discussion)
Possibly missing? Cloud. SRM (but to be more generic in title!). Batch systems.

Storage Accounting (John Gordon)
StAR
Plan to publish to APEL but in EMI-3 for May 2013.
Interim possibility to use gstat.
Noted that information that is published is not precise.
Dave C - Could storage accounting be installed outside of EMI(3)? - this would probably be too much work.

Information System Status and Evolution (Maria Alandes Pradillo)
Caching BDII. In since February. Documentation improved. Use variable BDII_DELETE_DELAY. (https://tomtools.cern.ch/confluence/display/IS
Failover: LCG_GFAL_INFOSYS - use BDIIs 1 & 2 in region and 3 at CERN.
On data quality:
- caching BDII not in use as much as it should be
- glue-validator (in EMI-1 and 2)
- glue 2 still to be deployed widely
- Future work (EMIR; ginfo and IS monitoring/metadata).
Question if OSG fully engaged? - "For expert only"

AAI on WN update (Romain Wartel)
Security controls - central banning body required.
ARGUS locally needed (to pull banning lists from central ARGUS).
Ownership of traceability. VO-site collaboration needed to cover all cases.
Recommendations to fulfill logging and traceability policy on WN.
Not currently possible to use clouds (VMs) in a way that conforms with WLCG security policies.
Critical proxy extension (ALICE less limited).
Proxy lifetime - reduce back to 24hrs? Balanced compromise between complexity and risk. Proxy credentials cannot be revoked.
Pool account recycling - recycle only after 6 months.

EMI update (Cristina Aiftimiei)
EMI-1 at update 15 (23.04.2012).
EMI-1 full support & maintenance until 28.02.2012. Updates till 31.10.2012.
EMI-2 released 21.05.2012. Supports SL5 and SL6. Some Debian6.
New products: CANL, EMIR, EMI-Nagios, Pseudonymity, WNoDeS.
Hydra and WMS not released yet.
Some backward incompatibilities due to existing EPEL package names.
UI/WN tarballs in the next update.

Globus SW support at OSG
Discussions including use of CREAM/Glue2; this is to be investigated as it impacts use of the WMS.

EMI Sustainability Plans (Alberto Di Meglio)
The end of EMI is the end of the coordination between product teams - not the end of those product teams.
Ian Bird: the outcome of the above WLCG-EMI-EGI meeting needs to be how we manage software in the future; also to discuss: how we do certification, staged rollout and deployment in general.

Communicating Machine Features to Batch Jobs (Tony Cass)
Jeff will share a script for PBS to test implementation using /etc/machinefeatures.
- has implications for "real" machines as well as virtual ones.

MUPJ - gLExec update (Maarten Litmaath)
gLExec flag needs to be set in GOCDB.
- a few more UK sites published as running glexec.
Status at http://cern.ch/go/PX7
Deployment guide https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
CMS pushing because of July security challenge.

Federated Identity Vision (Romain Wartel)
Document presented at last GDB. Approved by MB on 5th June.
Pilot project for WLCG - any volunteers to be involved?
Stuart - any description of how this differs from what we currently do, as it sounds like what we currently do.
Dave - https://indico.cern.ch/getFile.py/access?contribId=18&resId=0&materialId=slides&confId=155069
Stuart - ah, it's a SARoNGS kind of thing.

11:50 glexec & ARGUS (10')
The only sites not looking to install gLExec are those that require a relocatable install.
- Daniela working on tarball, Oxford might have a look at it if they can, Lancaster will have a look at it to try to help.

12:00 AOB (1')
No other business.

VOMS discussion.
https://indico.cern.ch/getFile.py/access?contribId=5&resId=0&materialId=0&confId=192186
33 NGS VOs with only a few active.
6 VOs at Glasgow - might want to consolidate into new VOMS.
23 GridPP VOs, with ~12 active.
Oxford & Imperial secondary sites.
Timeline?
*EVO crashed so I missed a lot.
All will run the same VOMS version.
Alessandra - DNS does the failover currently.
Ewan - do we need to do that? VOMS client tools *should* be able to do this.
Need to chat to Robert about merging with NGS; if their heavy customisation makes it too much work, it won't be possible.
Ewan - we can use this to "pull" NGS to more standard ways of working.
Andrew - we're in a stronger position to do it how we want to do it.
Pete G - also an opportunity for a VO cull.
Jeremy - layout of initial plan over the next fortnight.
Daniela will be the Imperial primary contact.
Need to find ways to make the replication more secure. Looking at what other NGIs are doing, php solution.
Kashif has installed a VOMS server on a VM at Oxford; Ewan suggests installing a second and testing replication between them.
MySQL replication seems to be the best way of doing this within GridPP.
Would like to get things moving before Robert moves on, although Manchester will try to keep him on. Rough deadline is the end of the summer.
Glasgow willing to consolidate if VO creation is kept simple.
Ewan - who do we want to have full-blown admin web access?
- It should be someone at the sites, but the details need further discussion.
Ewan - any thoughts on hostnames/certificates? Stuart suggests gridpp.ac.uk names. Alessandra agrees.
Keep the primary as is, voms.gridpp.ac.uk. Oxford & Imperial VOMS will be voms02 & voms03.gridpp.ac.uk
Some discussion on the future of the NGI.

CHAT WINDOW:
[11:01:20] Stephen Jones joined
[11:01:22] Stuart Purdie joined
[11:01:22] Alessandra Forti joined
[11:01:23] Sam Skipsey joined
[11:01:24] Daniela Bauer joined
[11:01:24] Mark Norman joined
[11:01:25] Matthew Doidge joined
[11:01:28] Mohammad kashif joined
[11:01:28] Stuart Wakefield joined
[11:01:31] Gareth Roy joined
[11:01:31] David Crooks joined
[11:01:31] Wahid Bhimji joined
[11:02:03] Jeremy Coles Will start in 2 mins.
[11:02:38] Ewan Mac Mahon joined
[11:02:40] Mark Slater joined
[11:02:55] Andrew McNab joined
[11:02:56] Andrew Washbrook joined
[11:03:53] Jeremy Coles Matt is taking minutes today.
[11:04:06] RECORDING Matthew joined
[11:04:17] Rob Fay joined
[11:04:51] Phone Bridge joined
[11:05:10] David Crooks For Glasgow: https://ggus.eu/ws/ticket_info.php?ticket=83283
[11:05:50] Ewan Mac Mahon And the now closed and verified Oxford one was: https://ggus.eu/ws/ticket_info.php?ticket=83376
[11:06:46] Duncan Rand joined
[11:07:22] John Hill joined
[11:07:30] Wahid Bhimji left
[11:08:06] raul lopes joined
[11:08:09] raul lopes left
[11:08:17] raul lopes joined
[11:08:19] raul lopes left
[11:08:29] Pete Gronbech joined
[11:08:30] raul lopes joined
[11:08:38] Matthew Doidge The brunel lhcb cvmfs ticket was https://ggus.eu/ws/ticket_info.php?ticket=83326
[11:08:39] raul lopes left
[11:08:58] Govind Songara joined
[11:09:00] raul lopes joined
[11:09:01] raul lopes left
[11:09:12] raul lopes joined
[11:09:12] raul lopes left
[11:09:20] raul lopes joined
[11:09:21] raul lopes left
[11:09:33] raul lopes joined
[11:09:34] Ewan Mac Mahon All the figures are in this ticket: https://ggus.eu/ws/ticket_info.php?ticket=83330
[11:09:44] Matthew Doidge Ewan beat me to it
[11:10:01] Ewan Mac Mahon (and I've just closed that ticket)
[11:10:03] raul lopes left
[11:10:50] raul lopes joined
[11:11:57] Pete Gronbech Alessandra, Dose this page get updated more than once a month http://gnegri.web.cern.ch/gnegri/T2D/t2dStats.html
[11:12:18] Andrew McNab left
[11:12:55] Elena Korolkova joined
[11:13:21] Ewan Mac Mahon We've got a spiffy new internet connection, when are you going to notice
[11:13:32] Ewan Mac Mahon Sorry - there was supposed to be more in that line.
[11:13:55] Ewan Mac Mahon Turns out EVO strips comedy HTML tags.
[11:14:32] Andrew McNab joined
[11:18:07] Daniel Traynor joined
[11:22:16] Ewan Mac Mahon FWIW I have every intention of having a proper look at VOMSsnooper, it's just a round-to-it issue.
[11:29:51] Ewan Mac Mahon No, security's pretty quiet atm.
[11:31:24] Matthew Doidge Lancaster's boxes are online @ 1Gb, just need some testing to see if I got it right
[11:33:29] Matthew Doidge pygrid-sonar1.lancs.ac.uk & pygrid-sonar2.lancs.ac.uk are their names
[11:33:50] Daniel Traynor i don't know, have to ask chris
[11:35:00] Ewan Mac Mahon Matt - any plans to get them onto the 10G ?
[11:35:19] Matthew Doidge yes, but not by the end of June
[11:35:26] Ewan Mac Mahon (and I'm pretty sure QMUL has the same setup as us from the PS install image)
[11:35:53] Matthew Doidge We also used the PS install image
[11:37:29] Ewan Mac Mahon Sounds good to me
[11:37:48] raul lopes i will have it by the end of the week
[11:37:54] Jeremy Coles thanks
[11:39:09] Rob Harper joined
[11:39:43] Ewan Mac Mahon By 'Ewan has solved' we mean 'Brian has solved, Ewan has closed'
[11:39:54] Rob Harper Network outage at RAL -- OK at the moment, but I think they're doing more work later, so may drop out again.
[11:42:58] Govind Songara left
[11:44:16] Ewan Mac Mahon Kashif's about to say what I was I think.....
[11:44:41] Ewan Mac Mahon He did.
[11:54:54] David Crooks EMI News slides: https://indico.cern.ch/getFile.py/access?contribId=4&resId=1&materialId=slides&confId=155069
[11:55:23] Wahid Bhimji joined
[11:57:12] David Crooks https://indico.cern.ch/getFile.py/access?contribId=18&resId=0&materialId=slides&confId=155069
[11:58:16] Ewan Mac Mahon So it's Sarongs?
[11:58:22] Stuart Purdie That sort of idea, yes.
[11:58:28] Ewan Mac Mahon Also file under 'we did this already' then.
[11:59:30] Alessandra Forti federated identity assumes the UI is not the machine the user is using which is true most of the times but it's a failure of the UI concept
[12:00:10] Wahid Bhimji definately ECDF will not do within month
[12:00:40] David Crooks I think this is the status page
[12:01:05] David Crooks http://grid-monitoring.cern.ch/mywlcg/services/?facelist_values_regions=&facelist_values_Sites=&facelist_values_services=&vo=37&profile=25&monitored=2&status=1&status=2&status=3&status=4&status=5
[12:01:16] Wahid Bhimji There are more pressing computing operations issues in this critical year for LHC data analysis
[12:01:36] Matthew Doidge I'm not directly involved but I'm happy to get involved
[12:01:51] David Crooks Short link if it's more useful http://is.gd/KpvbSZ
[12:02:00] Phone Bridge left
[12:02:18] Wahid Bhimji our priority should be ensuring we provide resources for LHC this year - taking manpower away from that goal this year is bad
[12:02:34] Alessandra Forti sorry might have put the problem at the wrong level.
[12:03:19] Alessandra Forti what to do in a month?
[12:04:27] Alessandra Forti glexec....
[12:04:30] Wahid Bhimji (jeremy asks which sites would not deploy glexec within a month)
[12:04:45] raul lopes left
[12:04:48] Daniel Traynor left
[12:05:05] Alessandra Forti not even atlas wants to deploy it before ichep
[12:05:24] John Hill left
[12:05:25] Wahid Bhimji of course !
[12:05:31] Mark Slater left
[12:09:25] Andrew Washbrook left
[12:09:55] Wahid Bhimji left
[12:12:06] Ewan Mac Mahon Yup; that sounds fine.
[12:12:33] Ewan Mac Mahon Just want to be clear that we're choosing this - it's not a constraint.
[12:13:23] Stuart Wakefield left
[12:14:37] Elena Korolkova left
[12:14:55] Matthew Doidge left
[12:15:02] Matthew Doidge joined
[12:16:40] Matthew Doidge I'm afraid EVO crashed on me so I've missed most of the VOMS discussion to take minutes for it - sorry!
[12:17:29] Alessandra Forti yes
[12:24:20] Ewan Mac Mahon OK; how does this sound - as a technical decision we're only going to do it our way or not at all.
[12:24:47] Ewan Mac Mahon The choice of whether the NGS wants to take advantage of that offer or do their own thing is a political one.
[12:34:37] Rob Harper left
[12:36:06] Sam Skipsey Essentially, it's all about agility, without annoying the people managing the VOMS server
[12:40:05] Stuart Purdie I'd suggest, for usability, that girdP
[12:40:11] Stuart Purdie gridpp.ac.uk names are better
[12:42:20] Mark Norman left
[12:43:48] Ewan Mac Mahon Right, so voms01.gridpp.ac.uk = Manchester
[12:43:52] Ewan Mac Mahon 02 = Oxford
[12:43:56] Ewan Mac Mahon 03 = Imperial?
[12:44:26] Stuart Purdie I suggest leaving the primary as voms.gridpp.ac.uk - no change for endusers, and thus easier.
[12:44:37] Alessandra Forti indeed
[12:47:55] Ewan Mac Mahon Given that we have one fail-over mechanism already, I'd just stick with the one.
[12:48:09] Ewan Mac Mahon And have all the clients list all the voms servers.
[12:48:48] Ewan Mac Mahon Sp, shall we say voms.gridpp.ac.uk = Manchester, voms02.gridpp.ac.uk = Oxford, voms03.gridpp.ac.uk = Imperial
[12:49:01] Alessandra Forti yes
[12:49:10] Ewan Mac Mahon I think the DNS is at RAL isnt' it?
[12:50:23] Ewan Mac Mahon Right, so we know what we're trying to get certificates for, so we can get on with that.
[12:55:21] Ewan Mac Mahon NGS would continue to meet the EGI international tasks and fund the CE however there was no funding for training, outreach, or VOMS.
[12:55:29] Ewan Mac Mahon ^ From the just out PMB minutes.
[12:55:44] Ewan Mac Mahon (I assume CE=CA)
[12:57:14] Alessandra Forti I agree
[12:58:11] Ewan Mac Mahon At this point the NGS seems to have no sites,
[12:58:15] Ewan Mac Mahon hardly any services,
[12:58:21] Ewan Mac Mahon and dwindling staff
[12:58:38] Ewan Mac Mahon (and still no grid)
[12:58:45] Gareth Roy left
[12:58:45] Rob Fay left
[12:58:46] Andrew McNab left
[12:58:46] Duncan Rand left
[12:58:47] Alessandra Forti left
[12:58:48] Mohammad kashif left
[12:58:48] David Crooks left
[12:58:52] Daniela Bauer left
[12:58:54] Sam Skipsey left