Attending: Jeremy (chair), Matt (minutes), Alessandra Forti, Andrew McNab, Andrew Washbrook, Brian Davies, Catalin Condurache, Chris Brew, Christopher Walker, Daniela Bauer, Ewan Mac Mahon, Gareth Smith, Govind Songara, Mark Slater, Mingchao Ma, Mohammad Kashif, Pete Gronbech, Peter Grandi, QMUL (Chris Walker et al), Raja Nandakumar, Raul Lopes, Rob Harper, Stuart Purdie, Wahid Bhimji.

11:00 Meetings & updates (20')

- EGI updates
  UPDATE 30 for gLite 3.2 is now ready for production use. Priority of the updates: Normal.
  Highlights of the update:
  - New version of glite-BDII_top
  - New version of glite-CREAM
  - New version of glite-LB
  - New version of glite-SGE_utils
  All details of the update can be found at: http://glite.cern.ch/R3.2/sl5_x86_64/updates/30/

- ROD team update
  - No updates.

- Nagios status
  - Note Steve Lloyd's email about his SAM pages. Who uses them? Steve needs to change how he does his queries - is it worth him doing this?
  - Me (Matt): they're very handy.

- Yesterday's SAM WN-cr problems
  - Kashif: a switch change at Oxford killed the BDII at Oxford, which caused problems despite some redundancy built into the system.
  - Shouldn't affect site statistics; the main metric did not fail.

- Tier-1 update (Gareth)
  - Database outage to Castor over the weekend, not fully understood; Oracle are being contacted about it. Short outages during the weekend.
  - ATLAS disk servers out after a spate of disk failures (statistical quirk?). Three disk servers currently out, all rebuilding at the moment.
  - Outage planned for a week today (5th July):
    - Upgrade to Castor to take advantage of larger tapes.
    - Coincides with work by the RAL networking team.
    - Everything declared down; the batch system will be drained.
    - The ATLAS LFC will also be down.
    - Timeslot agreed with ATLAS ahead of time.
    - Sites might want to schedule work to coincide with the RAL outage.
  - Question from Gareth asking for comments on how problems originating outside a site can flag failures there (the rep-del test couldn't clean up at Birmingham): should the CE test require another site to be up?
    - Kashif: during yesterday's problems the rep tests moved to Birmingham. The problem was probably transitory.
    - The problem is compounded if the on-call people are called in.
    - Ewan (in the chat window): "Usually if Oxford fails everyone fails, yesterday was weird."
    - Debate over the complexity/usefulness/use case of the SAM tests, particularly for a Tier-1. Should they be used to flag real problems?
    - Jeremy: how often do CE SAM tests trigger a callout? Gareth: not often.
    - Pete G: should the secondary SE be at a Tier-1? AlF: past Castor instability made this undesirable. Ewan: maybe a dedicated secondary SE should be set up somewhere else.

- Security update (Mingchao)
  - Nothing to report from the UK about the torque vulnerability; all sites are keeping firewalls tight.
  - HEPSYSMAN security workshop on Friday.
  - There is a DDoS vulnerability in torque (a separate vulnerability). Sites should consider patching torque.

- WLCG update
  A new WLCG Technology Evolution Work Group is being formed with Markus Schulz and Jeff Templon as chairs: "The overall goal is to ensure the long term support of the LHC community use cases, taking into account experiments, sites, and operational needs. Reducing where possible complexity and manpower needs for users, sites and developers. Improving functionality and performance where needed… to define the vision for evolution according to the WLCG collaboration, and secondly to coordinate work being done… The group will cover topics such as: Security Model, Job Management, Virtualization, Data Management, Data Access, Information and Service Discovery etc. To get started we ask the Computing Coordinators to nominate for their experiments a permanent member and deputy.
  We will try to identify suitable site delegates, security and operations watchdogs."
  - This group will have experiment and site representatives, supporting WLCG members. Jeremy and Pete G may be put forward as possible suggestions.
  - AlF: there should be more "technical people" nominated. Déjà vu with the old TiG.

- T2 issues
  Please check the site data under "Tier-2" here: http://wlcg-rebus.cern.ch/apps/topology/
  - Specific question for Peter/Durham: is 1920 logical CPUs correct? Peter Grandi: double publishing.
  - Several sites are still publishing "EGEE".

- General notes
  - Escalated tickets: https://gus.fzk.de/download/escalationreports/roc/html/2011mmdd_EscalationReport_ROCs.html
    10 red, 5 on hold.
    - Cambridge: no one's about, following up offline.
    - 71294, Glasgow. Stuart: the WMS is having trouble submitting to SARA. It looks like it's actually a problem at SARA, and it only affects a particular user/VO/WMS combination. The ticket might need to be reassigned to them.
    - Squid at Oxford. Ewan: progressing nicely, if a bit slowly.
    - 68077, RAL. Should it be closed? Gareth is prodding Jens about it.

11:20 Experiment problems/issues (20')
Review of weekly issues by experiment/VO.

- LHCb
  - Some problems with RAL over the last week or so: trouble with reconstruction and stripping, and with accessing files from tape. Being worked on. Files need to be staged 3-4 times to be successful.
  - Raja asked AlF whether CVMFS is installed at Manchester - not yet, sadly. Raja comments that CVMFS makes software installation a lot smoother. AlF: this is a priority for us.
  - LHCb are struggling to get Dirac to install patches properly, but are not desperate for the software area yet.
  - Jeremy: has anyone else installed CVMFS? *silence* Pete G: it's the next thing to do at Oxford.

- CMS
  - Having some troubles.

- ATLAS (AlF)
  - Nothing much going on.
  - Ticket for Glasgow: checksumming problems. Files declared lost.
  - ECDF was in broker-off, but is fixed. It had some release problems, and the pool file catalogue hadn't been installed.
  - UCL blacklisted, but no one about to comment.
  - Couple of short downtimes: Brunel for a network upgrade, Oxford today for similar.
  - No questions.

- Other
  - Experiment blacklisted sites: done.
  - Experiment known events affecting job slot requirements: done last week.
  - Site performance/accounting issues
  - Metrics review

11:40 Open discussion (15')
Some areas that could be covered:

- glexec issues
  - Jeremy: are people getting tests? Stuart: we are at Glasgow. Mark: don't you have to "volunteer" for glexec tests? Jeremy: yep. Mark: then that's what we need to do. Still testing.
  - Rob H: RALPP is up but not publishing.
  - Is everyone not yet installed waiting on the relocatable install? Manchester just need to roll out to more nodes and publish; testing done.
  - Jeremy will follow up offline during the week.

- perfSONAR work
  - Brian: transfers into Tier-2s are greater than transfers out of them - lots of asymmetries. RAL is the other way around. Oxford-Lancaster rates are low, despite no particular problems at either site.
  - Brian goes over the sonar results.
  - If a site wants to become a T2D, good rates to Tier-1s are more important than rates to other Tier-2s.
  - No other comments.

- Topics we want explored at the WLCG workshop in July
  - No takers.

11:55 Actions (05')
- http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items - wrong link, should be:
  http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

12:00 AOB (01')
- Gareth: people travelling can use GridPP project codes to stay at Cosener's House.

Chatlog:
[11:13:58] Jeremy Coles thanks
[11:14:00] Mingchao Ma the advisory was sent to sites on Friday 24th June with the subject "SVG Advisory for 'Moderate' Risk Torque Server Buffer Overflow Vulnerability - CVE-2011-2193"
[11:18:27] Ewan Mac Mahon Usually it just requires Oxford to be up, which /all/ the tests do since the test infrastructure is here.
[11:18:35] Ewan Mac Mahon This was slightly unusual circumstances.
[11:19:18] Ewan Mac Mahon Though I'm not quite sure why it's failing for an 'extended period' given what happened.
[11:23:31] Chris Brew how long can the list be? [11:31:16] Daniela Bauer sorry I am late - the latest glite update broke my cluster [11:34:07] Jeremy Coles https://ggus.org/ws/ticket_info.php?ticket=71294 [11:45:34] Wahid Bhimji QMUL have it obviously ... for atlas [11:45:43] Alessandra Forti and RALPP [11:45:59] Alessandra Forti I think [11:51:31] Peter Grandi that's double publishing, [11:51:37] Peter Grandi my local system is a bit iffy [11:51:51] Peter Grandi overloaded because of a growing process, trying to kill it [11:54:18] Peter Grandi I may have 'glexec' enabled in my new CREAM CE, if that's the default. But having usual trouble with the site firewall, so commissioning the CREAM CE is lagging. [11:55:48] Peter Grandi just checked my CREAM CE BDII and 'glexec' is not published, but it is installed. [11:58:32] Ewan Mac Mahon I think we're still waiting on getting an iPerf server up at Lancs for this, aren't we? [11:58:53] Matthew Doidge Yep, I dropped the ball on that one [11:58:58] Brian Davies http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview?view=Sonar# [12:00:47] Queen Mary, U London London, U.K. I did run one at QMUL - and will restart it on request [12:03:21] Queen Mary, U London London, U.K. An iperf server that is. [12:03:23] Ewan Mac Mahon Chris: Well, if Matt gets his going, please consider it requested, thanks. [12:04:37] Ewan Mac Mahon I'm sort-of hoping that Matt and I might be able to have a tinker with this at HepSysMan, so it would be useful to have the QMUL 'reference' one up and running. [12:05:04] Matthew Doidge I'll make sure to get my server up before Thursday [12:09:26] Queen Mary, U London London, U.K. gridftp02.esc.qmul.ac.uk iperf server [12:09:34] Jeremy Coles http://www.gridpp.ac.uk/wiki/Operations_Team_Action_items [12:09:38] Alessandra Forti I thought there was something wrong [12:14:36] Raja Nandakumar Apologies - got to go. [12:17:23] Wahid Bhimji bye [12:18:22] Jeremy Coles Thanks for taking minutes Matt.