DTeam Minutes, 05 October 2010

Attending: Sam Skipsey (minutes), Alessandra Forti, Brian Davies, Daniela Bauer, Duncan Rand, Gareth Smith, Graeme Stewart, Jeremy Coles (chair), Mark Mitchell, Mingchao Ma, Mohammed Kashif, Richard Hellier, Phone Bridge.
Apologies: Wahid Bhimji.

Meeting starts:

=ROD Team Status:
Andrew raised an issue last week: the problem with WMS failover for the Nagios tests. Kashif is using three WMSes, two at RAL and one at Imperial. Last week there was a problem with Imperial, so he removed it from the config. Nagios submits a job through a WMS; if the WMS isn't working, the job will fail, so if one WMS is failing then 1/n of jobs will have issues (where n is the number of WMSes). Kashif manually removes broken WMSes from the list when needed. The Nagios test submission interval is currently 55 minutes, but it depends on the failures: if a job aborts within an hour, a new job is sent; if a job is queued, a new job will be submitted within an hour; if the job is running, it will take 4.5 hours to submit a new job. Question: can Nagios be made to realise that when failures are linked to a particular WMS, it is the WMS that is failing? Graeme: is it not easier for the Nagios test to just… directly test the CEs? The endpoints are published. Is the WMS-CE path useful to pick up failures in the WMS-CE interaction? Graeme: ALICE and ATLAS already do direct submission to CEs, so they are more interested in whether the CE works by itself. Kashif: we can look at this, but how does Nagios know precisely which queue to use for each endpoint? (Sites tend to have "special" queues for SAM tests and the like.)
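As a side note on the manual pruning step above, a rough sketch of how one might probe each WMS from a gLite UI before dropping it from the Nagios configuration is below. The hostnames and the trivial hello.jdl file are illustrative assumptions, not the actual RAL/Imperial endpoints or the real Nagios test configuration.

  # Sketch only: check whether each WMS in the list will match a trivial job.
  # Hostnames and hello.jdl are placeholders, not the real RAL/Imperial endpoints.
  for WMS in wms01.example-ral.ac.uk wms02.example-ral.ac.uk wms.example-ic.ac.uk; do
      echo "== ${WMS} =="
      glite-wms-job-list-match -a \
          -e https://${WMS}:7443/glite_wms_wmproxy_server hello.jdl \
          || echo "${WMS} failed to match: candidate for removal from the config"
  done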
=T-1 update:
Gareth: the main thing is that the LHCb CASTOR instance was upgraded, successfully. One remaining issue: the 2.1.9 version checksums files on disk, and on some (32-bit) disk servers the checksum is miscalculated; this has been turned off while investigating. One or two new niggles are being followed up, but pleased overall. Moving on: towards the end of October there will be an upgrade of the Gen instance of CASTOR (ALICE and non-LHC). Other things: the top BDII has been declared at risk tomorrow, for reconfiguration. The scheduled power outage in the ATLAS building has been cancelled (it would have made the weekend at risk, as networking passes through this building); we have no date for the rescheduling of this. The 3D databases issue for LHCb seems to be at their end. Looking at rolling out kernel updates to the Oracle DB machines behind the LFC/FTS/3D services. Planning for 18th-20th October: at risk for electrical work at the site.

=Next GDB:
The next GDB is next Wednesday. The morning is operational: a security incident, middleware updates, and how to find out when/where network links are failing (OPN fault diagnostics). The afternoon covers the Data Access and Management Demonstrators and status reports. Graeme is the T2 rep for this month (and will also be giving a DAMD talk in any case).

=Escalated tickets:
NGS RAL BDII entry: Jeremy follows up.
Biomed at RALPP issue (dCache, ticket 58733): Sam to follow up.
ALICE issue at T1: ticket on hold pending the 2.1.9 upgrade (Gareth).

=Experiment problems and Issues:

==LHCb (update from Raja):
Jeremy: there don't seem to be any problems. Raja emailed this morning suggesting that Glasgow might have a problem again… For the last 7 days the success rate for transfers was 37.8% (mostly during the weekend); last week it was 100%, so it is not clear whether the problem has been solved or not. Mark and Sam at Glasgow will look into it. The upgrade to CASTOR 2.1.9 at the T1 went well.

==CMS:
The update for the PMB said that T2s for CMS are being asked to increase centrally managed space from 100 to 150 TB. This was agreed for Imperial but not Brunel (Brunel has reliability issues). Duncan: Brunel has SL5 XFS problems. CMS can also stage files, which we'd like to investigate (rfio direct access doesn't benefit from the driver readahead improvements). Imperial: a big increase in throughput after the addition of a NetApp box. Duncan: we did have an issue a while ago with home directories being pummelled by various globus machines; we now export via the NetApp box, which improved performance. The other problem was that files were being stalled with lcg-cp; the fix is in CRAB, which will also have improved things.

==ATLAS:
Some problems at QMUL in the last 24 hours (extremely high failure rate): Chris was trying to set up CVMFS, but this shouldn't be in production. Dark data is being dealt with at Durham. A review of CREAM CEs was done in ADC Operations (link: http://dl.dropbox.com/u/5159849/atlas/slides/uk-cream-status-2010-10.pdf). ATLAS is using them where available, and they are successful when used. We need more CREAM CEs; almost all of the Dutch and Italian clouds have them. What is the movement at other sites, especially the larger UK sites (Lancaster, RALPP, QMUL, etc.)? Any sites blacklisted by experiments? No.
Events affecting CPU: Brian, re CPU requirements: the Gstat pages are slightly confusing, as the number of running jobs for ATLAS seemed higher than the number of logical CPUs, but logical CPUs should equal job slots… Jeremy: of course, this depends on what the site is publishing. Graeme: from PanDA monitoring, we peaked at 11000 jobs on Wednesday, with 1000 analysis jobs running as well. Lots of jobs in the UK.
T2 availability figures for August: we were asked to follow up. No emails to Jeremy so far… so are there no problems? Kashif: the BDII was also in an error state for the period that was corrected for, and sites may gain even more reliability after this is taken into account. Alessandra: in the August report, has Liverpool been fixed? Jeremy: that was reported some time ago, but I don't think it was taken into account; the only fixes were for Nagios. Alessandra: Liverpool got a ticket complaining about their low availability, erroneously, due to the known problem. Jeremy: yes, I forwarded that to the people concerned.

=CREAM STATUS:
Not many sites have a CREAM CE, despite this being raised a year ago. Status at sites:
ScotGrid: Glasgow has one; ECDF has one (but it isn't authorised to talk to SGE yet); Durham will be looking into it.
NorthGrid: Manchester has a working one (ATLAS jobs fine); Lancaster and Sheffield don't, so Alessandra will check on their plans, as will Jeremy; Liverpool has an instance in production.
SouthGrid: Oxford has one; Birmingham has one; RALPP are planning on installing one soon; Cambridge: Santanu is not sure about Condor support in CREAM; Bristol: only 50% FTE.
LondonGrid: QMUL is getting one; RHUL has it on their list of things to do, within the month; UCL-Central were talking about installing CREAM as part of their new cluster upgrade; Imperial has one; Brunel will be a while, though Raul is looking at installing it on one of his clusters.
Jeremy will follow up on plans for sites that don't have anything yet. The T1 has one CREAM CE. Brian: Alastair has been talking with Derek about ATLAS' plans for using the CREAM CE; possibly something will be added for ATLAS during the next upgrade, when the CASTOR Gen instance is upgraded (this would be a second CREAM CE).

=Top-Level BDII failure:
Under EGEE, the recommendation was that sites with >600 CPUs should have their own top-level BDII. Now the suggestion is to reduce and consolidate top BDIIs (as they are now more performant). The suggestion was to pair up the top-level BDIIs; the pairing would be arranged by Flavia Donno. What happens when RAL's top-level BDII fails? Current site failovers are: Duncan: RAL first, then CERN, then whatever they feel like afterwards. Alessandra: Manchester, then RAL. Jeremy: the new suggestion was a failover to another T1 centre, as CERN has too much traffic. We therefore have a mixture of settings. Which site will RAL be paired with? We should have that as our secondary (failover). We also need to know what the NGI/NGS strategy is: they have their own instance, at Manchester.
Status of the RAL BDII: Richard: the RAL BDII has five individual hosts, DNS round-robined, and they are generally not stressed. The proposal for the top BDII (linked to in the chat) was approved by the MB last week.
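For context on what the site-side setting looks like, here is a minimal sketch, assuming the comma-separated failover form of LCG_GFAL_INFOSYS is supported by the site's client tools; the secondary hostname is a placeholder, since the actual pairing is still to be decided.

  # Sketch only: RAL as primary top BDII, with a placeholder secondary (pairing TBD).
  export LCG_GFAL_INFOSYS=lcgbdii.gridpp.rl.ac.uk:2170,top-bdii.example-t1.org:2170
  # Quick liveness check of a top BDII (the GLUE 1.3 tree is rooted at o=grid):
  ldapsearch -x -LLL -H ldap://lcgbdii.gridpp.rl.ac.uk:2170 -b o=grid -s base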
=EGI EMI Feedback:
This was sent to the DTeam list. EMI is now collecting requirements for Year One of their project, and they would like EGI operations to give requirements for the joint (gLite/UNICORE/OMII) stack. The Year One release will mostly be focussed on security and consolidation. Feedback? All: we like YAIM and would like to ensure it stays around (at present it is only in the gLite stack). Alessandra: YAIM is extensible. Sam: we'd like all of the storage middleware to support a common interface (this is currently the case, via SRM). Brian: the storage requirements of the NGS may also influence this. Jeremy: David Wallom is providing input for the NGS, via Pete Oliver. There was a question over the disposition of repositories for the EMI bundle, and (Sam) a question over how releases are done (this can be contentious). Jeremy will feed these queries back. Please email Jeremy within the week with any comments.

=AOB:
The GOCDB3 to GOCDB4 interface change is next week; if anyone has concerns, contact Jeremy. There was a potential plan to change site names in the GOCDB; this would break experiment systems, so it will no longer be done.

Meeting ends.

Chat log:
[11:04:17] Daniela Bauer My microphone doesn't work
[11:04:21] Duncan Rand david is abroad
[11:04:27] Daniela Bauer But I can hear you fine.
[11:05:00] Mohammad kashif Pete is on leave
[11:06:13] Phone Bridge joined
[11:06:33] Graeme Stewart joined
[11:12:31] Brian Davies joined
[11:15:16] Richard Hellier joined
[11:17:19] Mingchao Ma joined
[11:18:55] Alessandra Forti joined
[11:24:52] Alessandra Forti http://www.bbc.co.uk/news/world-11476301
[11:27:33] Graeme Stewart http://indico.cern.ch/getFile.py/access?subContId=0&contribId=0&resId=5&materialId=slides&confId=108099
[11:28:16] Alessandra Forti can't access the file
[11:30:48] Graeme Stewart Erm, it should not be protected, but try: http://dl.dropbox.com/u/5159849/atlas/slides/uk-cream-status-2010-10.pdf
[11:32:24] Alessandra Forti it says it doesn't exist.
[11:33:54] Gareth Smith left
[11:35:31] Graeme Stewart maybe takes a while to sync - seems to be there now
[11:35:37] Duncan Rand I managed to get it from indico
[11:37:05] Graeme Stewart OK, I need to leave now. Cheers.
[11:37:09] Graeme Stewart left
[11:41:01] Mohammad kashif Alessandra, Manchester creamce is not appearing in nagios, what is the name of ce
[11:42:16] Alessandra Forti it is not yet in production but it is in the atlas test syste.
[11:42:19] Alessandra Forti sytem
[11:42:32] Mohammad kashif Thanks
[11:42:44] Alessandra Forti I'm going to reinstall one of the current CEs in the next 2-3 weeks
[11:43:52] Daniela Bauer But mine works and I prefer to have it under my control rather than relying on somebody else !!!
[11:45:21] Alessandra Forti I missed why we want to change the strategy
[11:45:36] Duncan Rand http://pprc.qmul.ac.uk/~lloyd/gridpp/bdiitest.html
[11:45:49] Richard Hellier http://ganglia.gridpp.rl.ac.uk/ganglia/?c=Services_Grid&h=lcgbdii0631.gridpp.rl.ac.uk&m=&r=hour&s=by%20hostname&hc=4
[11:45:59] Sam Skipsey Because the Top BDIIs can cope with more load, and so the interest is in consolidating to a smaller set of more reliable instances, Alessandra.
[11:46:00] Alessandra Forti local and then ral
[11:49:35] Alessandra Forti I have to think about it
[11:49:58] Richard Hellier LCG_GFAL_INFOSYS=lcgbdii.gridpp.rl.ac.uk:2170
[11:53:09] Alessandra Forti why is it in doubt?
[11:53:19] Alessandra Forti I DO
[11:53:35] Richard Hellier date
[11:55:05] Alessandra Forti it can be extended easily
[11:56:40] Richard Hellier left
[11:56:44] Alessandra Forti bye
[11:56:46] Mingchao Ma left
[11:56:47] Alessandra Forti left
[11:56:51] Duncan Rand bye
[11:56:51] Daniela Bauer left
[11:56:52] Phone Bridge left
[11:56:53] Mark Mitchell left
[11:56:54] Mohammad kashif left
[11:56:56] Brian Davies left
[11:56:57] Duncan Rand left