UKI Monthly Operations Meeting (TB-SUPPORT)

Europe/London
EVO - GridPP Deployment team meeting


Jeremy Coles
Description
- This is the monthly UKI meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The UK phone bridge is on +44 (0)161 306 6802. The UK bridge has been known to cause problems so the CERN one might be preferred: +41 22 76 71400. For cheaper calls to the CERN phone bridge please consider using a service like Telediscount.co.uk which has the following number for Switzerland (2p per minute): 0844 462 95 95. The phone bridge ID is 1197332 with code: 4880.
- If the CERN phone connection does not work please try Caltech +1 626 395 2112 or DESY +49 40 8998 1346.
- For more information on the UK phone bridge: http://www.ja.net/services/video/agsc/services/evotelephonebridge.html
UKI Meeting minutes

UKI Meeting - 27th August 2009

Attendees
=========
Derek Ross
Ben Waugh
William Hay
Andreas Roussos
Jeremy Coles
Christopher Walker
Phil Roffe
David Ambrose-Griffith
John Bland
Govind Songara
Brian Davies
Matthew Doidge
Stuart Kenny
Elena Korolkova
Santanu Das
Rob Fay
Sam Skipsey
Peter Love
James Cullen
Winnie Lacesso
Stephen Burke
Mohammad Kashif
Graeme Stewart
Pete Gronbech
Stephen Jones
Dug McNab
Alastair Dewhurst
Duncan Rand
Mingchao Ma
Raul Lopes
Ewan MacMahon
Wahid Bhimji
Mike Kenyon

Experiment problems/issues/updates
==================================
No sites having issues with CMS or LHCb

Atlas
-----
- Fairly quiet
- MC09 still delayed for various reasons - new release of Athena due soon
- Hammercloud tests - covered later on
- Detector doing subsystem cosmic runs, ramping up detector operations in preparation for LHC startup
- Brian looking at disk space - revised table for data distribution

Other VOs
---------
- e-NMR wish access to resources in UK, formal request to be sent to PMB shortly
- T2K also want resources
No issues with other VOs

Atlas Site testing
==================
Hammercloud test progress
-------------------------
- Hammercloud running in file stager mode at sites.
- Examining throughput plots: product of efficiency * number of jobs running
General limit: throughput plateaus - decreases WN efficiency - useful product doesn't increase
Actual limit is very site specific
Plots are online; also a script, distributed by Sam Skipsey, for querying torque log files to extract the data
Major factor - running 2 - 4 jobs on a single host - local system disk is the bottleneck - downside to multi-core hosts
Liverpool have WNs with > 1 disk which may improve this
How many sites in HC tests have produced plots? 5 out of 12 sites.
Should help sites increase their efficiency
Plots can be sent to atlas-uk-comp-ops or written up as blog posts.
Suggestion is to create a wiki page to hold the results.
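A minimal sketch of the throughput figure discussed in the Hammercloud item, assuming throughput is taken as CPU efficiency (CPU time over wall time) multiplied by the number of jobs running on a host. The function names and the sample numbers are hypothetical; they are not taken from any site's plots or from the torque log script mentioned.

```python
def throughput(cpu_seconds, wall_seconds, running_jobs):
    """Return efficiency * number of running jobs (assumed definition)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    efficiency = cpu_seconds / wall_seconds
    return efficiency * running_jobs


def best_job_count(samples):
    """Given (running_jobs, efficiency) pairs, return the job count
    with the largest efficiency * jobs product."""
    return max(samples, key=lambda s: s[0] * s[1])[0]


# Illustrative numbers only: efficiency falls as more jobs share one disk.
samples = [(1, 0.9), (2, 0.85), (4, 0.5), (8, 0.2)]
print(best_job_count(samples))
```

With made-up numbers like these the product stops growing once efficiency drops faster than the job count rises, which is the plateau described in the minutes.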
ACTION Brian to create wiki page

Plan for next test - use local file access (rfio, file etc) using a small readahead buffer
Would like sites to configure a common buffer size; this would allow comparison of file access vs downloading the file to the WN
- Sites should set window size to 4KB on WN.
- Plan for next Tuesday
Sites should mail atlas uk support list if they don't want to participate
Oxford has 1 KB - does it need to be 4K? Anything less than 8KB is okay but 4KB is preferred for comparison
Once these tests have been run for a long time, Ganga will keep a record of the optimal access method for each site and automatically configure the method when jobs are submitted to the site - currently users are configuring by hand
Range of tests is increasing.
Is there any page where site performance in STEP 09 can be seen? Check post mortem, plus link in chat window

Security Update
===============
No patch yet for SL3
Increased vigilance recommended
Other vulnerabilities being used - such as web applications - recent incidents have used a twiki bug.
RHUL done
Most of Scotgrid done
QMUL awaiting Lustre module, have applied workarounds
Sheffield done
Oxford in progress, done by end of the day
General issue - third party modules have not yet been recompiled against the new kernel which may prevent sites from updating kernels
Turning services off is a site decision.
Currently there is no private mailing list for discussion of security incidents amongst sites - TB-Support is public.
Suggestion is to use HEPSYSMAN list or sysadmin chat area

WLCG SL5 update
===============
Most of Liverpool nodes are not 64bit capable
UCL-HEP will be done soon
UCL-Central not before January
Atlas needs a symlink creating for libblas - blas-devel rpm may work instead
Atlas need a new SW area, do other VOs?
Suggestion is that CMS do not expect a new area - not confirmed
Birmingham have been requested to install CREAM CE for Alice

Site/ROC news
=============
T1 news
-------
- Quattorising batch server and WNs
- Acceptance of new hardware
- Data reconciliation with Atlas

AOB
===
HEPSYSMAN @ QMUL Oct 7th, suggestions for talks welcomed

Sites round table
=================
Chat room discussion
--------------------
IRC suggested but some sites firewall IRC
Concerns about logging by external people

Chat window
[10:57:52] Jeremy Coles joined
[10:58:23] Christopher Walker joined
[10:58:44] Jeremy Coles UCL HEP please could you mute your session? Thanks.
[10:58:48] Phil Roffe joined
[10:58:53] Graeme Stewart joined
[10:59:02] David Ambrose-Griffith joined
[10:59:20] John Bland joined
[10:59:25] Govind Songara joined
[10:59:52] Brian Davies joined
[11:00:08] Matthew Doidge joined
[11:00:09] Stuart Kenny joined
[11:00:42] Elena Korolkova joined
[11:00:52] Rob Fay joined
[11:00:52] Rob Fay left
[11:00:56] Santanu Das joined
[11:01:14] Sam Skipsey joined
[11:01:20] Peter Love joined
[11:01:57] James Cullen joined
[11:02:05] Winnie Lacesso joined
[11:02:19] Stephen Burke joined
[11:02:28] Mohammad kashif joined
[11:03:35] Derek Ross Jeremy has been talking
[11:03:44] Graeme Stewart ok, i will rejoin...
[11:03:44] Brian Davies gr4aeme i think it is you
[11:03:45] Christopher Walker We can all hear something Graeme
[11:03:51] Graeme Stewart left
[11:03:57] Dug McNab joined
[11:03:58] Graeme Stewart joined
[11:04:59] Pete Gronbech joined
[11:05:10] Dug McNab left
[11:05:23] Stephen Jones joined
[11:05:47] Dug McNab joined
[11:07:00] Alastair Dewhurst joined
[11:07:14] Duncan Rand joined
[11:07:46] Mingchao Ma joined
[11:08:34] Elena Korolkova is the new vo on gridpp wiki?
[11:08:47] raul lopes joined
[11:10:12] Ewan Mac Mahon joined
[11:13:24] Elena Korolkova There are also individual atlas users. Some of their jobs are running with good efficiency. some with very very low efficiency.
Are we supposed to communicate with these users?
[11:15:34] Christopher Walker QmUL
[11:15:38] James Cullen We have not yet produced plots yet, but I intend to
[11:15:41] Christopher Walker QMUL has some per
[11:16:13] Christopher Walker QMUL has some preliminary plots - but is using sge which makes things more complicated.
[11:16:27] Ewan Mac Mahon I think this calls for a wiki page.
[11:17:34] Wahid Bhimji joined
[11:22:43] Ewan Mac Mahon I somewhat think that if we're going for small, we might as well go for really small.
[11:24:06] Elena Korolkova should the rfio be 1-4 K? I didn't hear well
[11:24:35] Pete Gronbech it is suggested that we should use 4k next week for comparason
[11:24:52] Elena Korolkova thanks, peter
[11:25:52] Stephen Burke Graeme, do you have someone talking loudly in the background?
[11:26:44] Graeme Stewart Yes, sorry.
[11:26:49] Graeme Stewart Shared office
[11:27:07] Graeme Stewart STEP09 HC Page: http://gangarobot.cern.ch/st/step09summary.html
[11:27:18] Elena Korolkova In Sheffield, we already updated
[11:27:28] Stephen Burke You could try asking them to be quiet!
[11:27:49] Graeme Stewart That's a little hard when they are having another meeting on the phone.
[11:28:17] Stephen Burke Well, OK, but they were as loud as you, it makes it difficult to understand
[11:29:43] Stephen Burke (whereas Mingchao is at least louder than the apparently large number of people taling in his office ...)
[11:32:09] Stephen Jones We're starting to upgrade our nodes to the new kernel today...
[11:32:33] Elena Korolkova we done this in Sheffield
[11:32:43] Ewan Mac Mahon Oxford has done all the service nodes, WNs are offline pending being updated, they'll be done and be back later today.
[11:32:45] Govind Songara RHUL also done last week
[11:33:03] James Cullen update planned for today/tomorrow in Manchester
[11:34:27] Ewan Mac Mahon What is the state of GPFS sites? Surely IBM should be giving you guys some support for a paid-for product?
[11:34:57] Mike Kenyon joined
[11:36:22] Dug McNab Wahid was there not a mention of an inifiband driver issue at Edinburgh?
[11:37:45] Wahid Bhimji Yes - sorry. I just looked up orlando's email - the two issues were:
[11:37:54] Wahid Bhimji 1. Infiniband drivers
[11:38:13] Ewan Mac Mahon I'm not sure the modules blacklisting is enough any more; Graeme pointed out another bug (iin the UDP support?). I haven't seen expoits for that yet, but that's not to say that it isn't exploitable.
[11:38:15] Wahid Bhimji 2. not validated for GPFS jumbo frames maybe not working
[11:40:16] Ewan Mac Mahon The risk is that if anyone develops an expoit for the UDP bug they can send you a grid job that then roots your WNs.
[11:40:58] Stephen Burke the basic question is how likely you think grid users are to be hackers
[11:41:15] Ewan Mac Mahon Or; how likely you think a hacker is to be able to steal a grid proxy.
[11:41:33] Derek Ross No
[11:41:46] Ewan Mac Mahon You root one WN, you can probably submit atlasprd jobs to everywhere.
[11:42:20] Stephen Burke As things stand we have no evidence that general hackers are aware of the grid as a vector, so I think that's probably less significant
[11:52:55] Ewan Mac Mahon I think it sounds like a new tb-support-private list is the way to go.
[11:58:04] Ewan Mac Mahon Is that link not part of the blas-devel package?
[11:58:07] Christopher Walker QMUL also has some old 32 bit nodes - though most of the resources are 64 bit. We have some test nodes - the timescale is challenging, but not impossible.
[11:58:30] Graeme Stewart No, it's not.
[11:59:24] Stephen Jones What is the URL to the SL5 migration twiki?
[11:59:35] Graeme Stewart https://twiki.cern.ch/twiki/bin/view/Atlas/SL5Migration
[11:59:43] Stephen Jones cheers
[12:01:32] Ewan Mac Mahon Graeme - do realy, it is.
I've just install blas-devel.i386 on an sl5 node and it does contain a symlink at /usr/lib/libblas.so that points to libblas.so.3.0.3
[12:01:32] Mingchao Ma cheers
[12:01:44] Mingchao Ma left
[12:01:44] Ewan Mac Mahon s/do/no/
[12:01:49] Graeme Stewart Ah, ok, let me 2x check...
[12:01:51] Stephen Jones We've made an SL5 build for worker nodes, but we havn't yet run any jobs...
[12:05:20] Santanu Das we r still running SL3 CE
[12:05:35] Winnie Lacesso I confess Bristol still has an SL3 MON.
[12:05:52] Stephen Burke mon shouldn't matter
[12:05:54] Graeme Stewart Ewan - you're right. I shall updated the twiki because this is surely easier than the soft link patch.
[12:06:22] Graeme Stewart "shall updated" is a time travel tense
[12:06:30] Ewan Mac Mahon
[12:08:10] Ewan Mac Mahon Long term it might be possible to get it added to the HEP_OSlibs_SL5 package (along with the other missing ones) which would make it nice and easy.
[12:12:45] Christopher Walker It's all gone quiet - is it just me?
[12:12:52] Ewan Mac Mahon Just you.
[12:12:54] Wahid Bhimji its just you
[12:12:57] Christopher Walker left
[12:13:06] Wahid Bhimji but the meeting is ending anyway
[12:13:09] Christopher Walker joined
[12:13:14] James Cullen left
[12:13:25] Jeremy Coles left
[12:13:40] Graeme Stewart left
[12:14:03] Mike Kenyon left
[12:14:05] Winnie Lacesso I would prefer EVO meetings myself not IRC chat...
[12:14:12] Ewan Mac Mahon I think it might be useful, but IIRC when it's come up before various sites firewall IRC pretty badly.
[12:14:33] Stephen Burke irc used to be banned at RAL, not sure about the current status
[12:14:35] Mohammad kashif left
[12:14:35] Sam Skipsey And, indeed, other chat systems - jabber, Skype chat, etc.
[12:14:51] Ewan Mac Mahon The external logging concern goes for EVO as well.
[12:15:22] Rob Fay it's not that hard to run an irc server, and 24/7 live chat can be useful...
if people use it
[12:15:37] Brian Davies left
[12:15:47] Dug McNab it needs to be something the tier1 can use, so whatever is allowed there would be the way to go
[12:16:22] Peter Love sign up for google-wave
[12:16:30] David Ambrose-Griffith left
[12:16:37] Phil Roffe left
[12:16:45] Wahid Bhimji left
[12:16:46] raul lopes left
[12:16:52] Matthew Doidge left
[12:16:53] Stuart Kenny left
[12:16:54] Dug McNab left
[12:16:54] Rob Fay google wave might be a solution, but not yet
[12:16:57] John Bland left
[12:17:02] Christopher Walker left
[12:17:09] Winnie Lacesso left
[12:17:10] Christopher Walker joined
[12:17:12] Peter Love left
[12:17:28] Stephen Burke left
[12:17:45] Ewan Mac Mahon I think if we;re setting up something ourselves I'd go for jabber rather than IRC,
[12:18:11] Alastair Dewhurst left
[12:18:13] Christopher Walker Audio had gone again for me - is anyone talking?
[12:18:16] Elena Korolkova left
[12:18:24] Ewan Mac Mahon I don't think anyone's saying anything.
[12:18:28] Govind Songara i can not hear as well
[12:18:30] Sam Skipsey No, Chris, we're all silent.
[12:18:34] Duncan Rand we're chatting!
[12:18:42] Sam Skipsey We are?
[12:18:44] Duncan Rand see
[12:18:51] Duncan Rand yes in the chat window
[12:18:55] Derek Ross problems is that may only fix the problem for one site - would other sites be happy about using a service hosted at another univeristy to discuss internal security issues?
[12:18:58] Sam Skipsey Oh, yes. But silently.
[12:19:19] Ewan Mac Mahon I think EVO is supposed to be able to do jabber-y things like presence notification etc. but I've never tried to use it like that.
[12:19:36] Rob Fay Derek, if they're not, nothing will work
[12:19:47] Duncan Rand well maybe a 'not for security issues' to start with
[12:19:56] Sam Skipsey Why use EVO, though, since Jabber already solves those problems?
[12:20:15] Christopher Walker I'm not familiar with jabber.
How does it differ
[12:20:23] Christopher Walker from IRC
[12:20:26] Duncan Rand is this a meta-chat?
[12:20:27] Rob Fay argument for evo: everyone already uses it, argument against evo: jabber can be run by us with the advantages that entails
[12:20:38] Rob Fay whereas evo can't AFAIK
[12:20:41] Derek Ross evo is quite resource intensive
[12:20:46] Ewan Mac Mahon Well, we do sort-of all use EVO already as well, and it does have the possibility of turning a chat into an audo conversation.
[12:21:09] Sam Skipsey EVO is used by use because people keep scheduling meetings on it, not because it necessarily solves the problem we want to solve.
[12:21:14] Rob Fay audio is overrated, at least with text you can't hear the other people in graeme's office typing
[12:21:19] Ewan Mac Mahon Plus, AIUI it's trickier to get jabber to do chat rooms, it's more one-to-one.
[12:21:43] Ewan Mac Mahon I know it /can/ do multi-user, I'm just not sure how to do it.
[12:21:45] Derek Ross can be done its not that hard - the Tier 1 has one
[12:21:51] Sam Skipsey Jabber does do chat rooms, it's not too hard.
[12:21:54] Derek Ross we use conference.jabber.org
[12:22:02] Ewan Mac Mahon I think it needs setting up on the server though, doesn't it.
[12:22:08] Sam Skipsey Yes.
[12:22:13] Derek Ross Pidgin (a jaber client) seems well set up to use them
[12:22:16] Christopher Walker What we want - or to be more precise, what I want is a way of saying - hmm I've got this problem anyone have the same problem, or have anyideas.
[12:22:22] Ewan Mac Mahon You can't just /join #newchannel like you can on IRC.
[12:22:33] Rob Fay http://www.jabber.org/index.php/faq/#chatrooms
[12:22:43] Sam Skipsey However, IRC is horribly insecure, Ewan.
[12:23:00] Derek Ross true its hard to split off for small sub discussions with jabber unles they are 1-1
[12:23:38] Ewan Mac Mahon Sam: It's unencrypted, but (unless you're using something like OTR) so is jabber. How's IRC any worse?
[12:23:39] Sam Skipsey And, according to Rob's link, it is quite possible to do IRC style "#newchannel" conference creation in RJabber.
[12:24:04] Derek Ross Can we move this discussion to TB-Support - some others in the Tier 1 would be interested in the solutions too (and I want to go for lunch and need to copy the chat log before the meeting shuts
[12:24:33] Sam Skipsey Fair enough. We could start the discussion with an email transcript of the chat log.
[12:24:35] Sam Skipsey
[12:24:45] Rob Fay jabber can use SSL + TLS I think
[12:24:54] Rob Fay and I'd like lunch too
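Note on the libblas point from the SL5 discussion: a minimal sketch of how a site might check whether the symlink is already present on a node, for example after installing blas-devel. The path /usr/lib/libblas.so is the one quoted in the chat log; the helper name check_link and its return values are hypothetical, and this is not part of any ATLAS or GridPP tooling.

```python
import os


def check_link(path):
    """Report whether path is a symlink and where it points.

    Returns (state, target), where state is one of
    "ok", "dangling", "not-a-symlink" or "missing".
    """
    if os.path.islink(path):
        target = os.readlink(path)
        # os.path.exists follows the link, so False means the target is gone.
        state = "ok" if os.path.exists(path) else "dangling"
        return state, target
    if os.path.exists(path):
        return "not-a-symlink", None
    return "missing", None


# Path taken from the chat log; the result depends on the node it runs on.
print(check_link("/usr/lib/libblas.so"))
```

A "missing" result would suggest the soft link (or the blas-devel package) still needs installing on that node.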
    • 11:00 11:10
      Experiment problems/issues/updates 10m
      CMS:
      LHCb:
      ATLAS:
      - Other VOs
      -- e-NMR enablement request expected shortly
      -- T2K enablement request expected soon
    • 11:10 11:20
      ATLAS/site tests 10m
      - Hammercloud test progress
      - Lessons learned
      - Future testing strategy
      - What else is needed before data taking?
    • 11:20 11:30
      Security update 10m
      - Review of current status
      - Check on site responses
      - Discussion on use of email lists and contact mechanisms
    • 11:30 11:40
      WLCG/SL5 update 10m
      - No recent GDB but the MB requested sites to move to SL5 "as soon as possible"
      - Discussion: What are the constraints and problems being encountered? What has been the experience of those sites already moving?
      - Situation for ALICE (Birmingham).
      ATLAS Migration Twiki
    • 11:40 11:45
      Site/ROC news 5m
      EGEE update
      ***************
      - https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
      - The default DPM used for SAM will now change to SL4. Old clients will start failing
      T1 news
      **********
      - General update on status and plans
    • 11:45 11:55
      Site roundtable 10m
      - A chance for each site admin to make known/discuss current work and concerns
    • 11:55 12:00
      AOB 5m
      - The GridPP23 agenda is here http://www.gridpp.ac.uk/gridpp23/
      - Next HEPSYSMAN meeting.