GridPP Deployment Board Minutes 001 - 13th March 2008
=====================================================

Face-to-face meeting at GridPP20 - Dublin

Present: Steve Lloyd (Chair), Jeremy Coles, Graeme Stewart, Roger Jones, Alessandra Forti, Pete Gronbech, Dave Colling, Duncan Rand, Andrew Sansum, James Catmore, Dave Kelsey, John Walsh, Andy Richards, Tony Doyle, Dave Britton, Matt Viljoen

Apologies: Phil Clark, Pete Watkins, Stuart Wakefield, Raja Nandakumar, Glenn Patrick

SL commenced by convening the first official meeting of the newly-appointed Deployment Board. He noted that this Board takes over from the old Tier-2 Board, the Tier-1 Board, and the Deployment Board - it represented an attempt to unify all elements into one strategic Board. What was distinctive was its strategic focus and its timescale of months, in contrast to the DTeam response to immediate operational issues that were handled in days and weeks. SL advised that its Terms of Reference were that it would meet on a quarterly basis, with a face-to-face meeting immediately following each GridPP Collaboration Meeting, and by phone/video between Collaboration Meetings.

SL then asked the Board to identify themselves in turn. SL would Chair the meetings. GS would be present as ScotGrid Co-ordinator. Dave Kelsey was present to discuss security issues, and policy and operations issues. Matt Viljoen was representing NGS. RJ was present as NorthGrid Management. DB was now Project Leader. TD's new role was Technical Director, and today he was also standing-in for Phil Clark of ScotGrid. TD noted that he needed to update the Tier-1 webpage in relation to the old Tier-1 Management Board. SL advised at this point that the Tier-1 activities of the Deployment Board would be in a trial period to see if the issues would be satisfactorily covered. Alessandra Forti was present as NorthGrid Co-ordinator. Dave Colling was representing LondonGrid. James Catmore was representing ATLAS users.
Pete Gronbech was present as Tier-2 Co-ordinator for SouthGrid. SP was present today as GridPP3 Project Manager in order to discuss metrics, milestones, and the Project Map. Stephen Childs was present representing GridIreland. Duncan Rand was present as LondonGrid Technical Co-ordinator. Jeremy Coles was the Production Manager for GridPP and was present for strategic guidance. John Walsh was present from GridIreland. Andrew Sansum was present as Tier-1 Manager. SL noted that the first action was on AS to nominate a Tier-1 Technical representative. It was agreed that at present no roles were missing and the members were happy to proceed.

[After the meeting, following discussion with the experiments, it was agreed to change the roles from "XXX User" to "XXX Representative" leaving experiment XXX free to nominate who they wished.]

1. GridPP3 Milestones & Metrics
================================

SP advised that there would be a worksheet for each Tier-2 and at present 12 metrics were proposed - were these sensible to measure, did they cover all that GridPP needed to deliver, and was there any repetition? Would it be possible to use SL's tests on metrics? SP noted that she would need numbers/thresholds for what GridPP expected to achieve, and she would need quarterly reporting. PG asked whether SL's pages would be available on a quarterly basis. SL advised that these could be integrated over any period.

Metric 4.1 - percentage disk available
----------------------------------------

RJ asked what 'available' means. SL advised that available meant racked, powered, and ready to be brought on line when required - this would be one number to be updated quarterly, the same as was currently done in the quarterly reports. AS reported that we are already declaring available capacity to WLCG. TD noted that it has to be the same capacity. RJ noted that what we declare to WLCG is not the same as that available to GridPP.
TD suggested that we wanted a superset here - the issue of storage accounting to tape. AS reported that they do a disk server count of what disk servers are available. TD commented that this was however done through storage accounting. SL suggested that the percentage here should be high - either we deliver or we don't - it should be red if the MoU commitment is not met. It was agreed that the percentage should be set at 95%.

Metric 4.2 - CPU KSI2K measured
-------------------------------

AS asked how this would be attained? After some discussion, it was agreed to set the same percentage at 95%.

Metric 4.3 - SAM
----------------

SP asked if the SAM numbers were ok, re both availability & reliability as defined in the MoU? DB noted that the numbers in the MoU were written a considerable time ago. SL noted that below 80% we would be struggling. GS commented that below 80% would be red - the minimum to be attained. AS advised that re WLCG availability, the target was set at 93% at present. DB suggested that we be pragmatic about this - achieving 93% at the Tier-1, 80% at the Tier-2 - he suggested setting the target to between 80 and 90%, to be reviewed in one year. TD noted that the timescale has to be October in terms of responding to LCG re the MoU numbers. SP suggested 80%, to be reviewed. SL suggested 85% for availability and 90% for reliability. This was agreed.

Metric 4.5 - Steve's ATLAS tests
--------------------------------

SL reported that they need the ATLAS code installed and integrate the number over the ATLAS sites on the Tier-2. RJ noted that SL's tests are used to see things as a user would, therefore prioritising it was the wrong way to go. DB noted that reliability goes down (via SL's tests) if ATLAS is being run (GS brought-up this point previously). TD advised that there should be a user-led metric for the site - this is to monitor 'on average' facilities to the user. DK asked what the aim of the metric was?
SL responded that the aim would be to show when ATLAS user jobs don't go through. DB noted that they would appear red then and the reasons for it would be examined - this should be averaged over the 8 larger ATLAS sites, using this to present the project to other people. SL suggested setting the figure at 80%. DC noted that sites in downtime should not be counted. There was an action on SL to devise a new ATLAS test. It was agreed that this action should be similarly applied to CMS (possibly on the job robot page) and LHCb - actions on DC and RN respectively.

Metric 4.6 - SL disk test
-------------------------

This test would be left in - GS thought that the metric should be higher, at the level of 90%.

Metric 4.7 - CPU delivered
--------------------------

This should be taken separately from 4.8 which was TB of disk used. DR noted that hardware funding can depend on the results of this. RJ queried the meaning of 'available'. SL advised that we can measure the numbers but it was not clear what the numbers should be. DR thought that there was a secondary responsibility of the site to ensure that the site was well used. DB noted that the number for the sites was the number of CPU delivered on average across the UK. SL responded that the boxes can't then go green. DB noted we could take out the average usage, but spot the sites being used. RJ noted a difficulty in the difference between 'making available' and 'actually used'. SL summarised that we should keep this metric in, and SP/SL would look at it in order to determine a sensible level that will flag what is obviously wrong. SL noted that we do have historical data that we can use.

RJ asked what the number of TB used means (in relation to Metric 4.8). He advised that we don't expect 80% of disk usage, we want disk to be ready to receive data, we don't expect it to be full. RJ noted that it would be a crude measure, but should in any case be low.
SL suggested a figure of 70% noting that this was not a diagnostic tool, it was for the OC in order to show disk usage. RJ advised that this does help the VO to manage space.

Metric 4.9 - Technical meetings
-------------------------------

It was agreed that there should be 8 meetings. SL noted that this would show how the Tier-2 are operating.

Metric 4.10 - Management meetings
---------------------------------

It was agreed that these should be quarterly. TD advised that the DB should review the quarterly reports. SL considered that there were not enough management meetings generally. The DB agreed that there should be 4 management meetings per annum, and 8 technical meetings.

Metric 4.11 - Tier-2 delivering to LCG MoU
------------------------------------------

AS advised that the MoU doesn't imply GGUS ticket responses - service delivery was the key. SL suggested that we needed to action someone to look into this - we need a metric that reflects the MoUs. TD advised that there is a GridPP MoU - and some way of leaving ourselves open to becoming detached from the LCG MoU - all references should be to the GridPP MoU. SL noted that this metric should be more about ticket response and how responsive a site is to reported problems. JC noted that one can't track phone calls - an audit trail is needed. DR asked about time to resolution? The metric could be weighted by urgency. It was agreed that JC look into this issue and provide recommendations.

Metric 4.12 - Quarterly reporting
---------------------------------

SL noted that this metric was time-related - if the quarterly report was late, the box would show red. SP advised that the overall metric was one month post deadline for the quarter. This was agreed. TD noted that we were merely penalising ourselves by having red boxes for this issue.

SP asked overall if there was anything missing, and had we covered the most important issues?
SL suggested that the most important issues were: did we have the hardware, and was it working? Middleware was brought-up as a missing issue. GS noted that this should be covered in the responses re tickets. AS advised however that there was no requirement to close tickets within a reasonable time. Another issue was software upgrades. DK noted that this was an aspect of Tier-2 performance. SL suggested that this issue comes at a point where it is untenable not to upgrade. TD suggested that the DB should set targets. SL suggested that it doesn't have to be a metric - just part of 'normal business' - but noted that if something is so out-of-date as to be useless, or if it is a couple of upgrades behind, what is the metric then? TD suggested that the metric would be 'non-usable' software. DR suggested that the measurement should not necessarily use the latest release, but should be that the site is still full of jobs. TD advised that the DB should respond to users' timescales - a fraction of sites were not responding to user deadlines - should this be an issue for the User Board? SL summarised that a metric was needed, but it was unclear as to what. If no requests were received for the previous quarter, it was not measurable, therefore would the box go grey? GS suggested that sites should have been asked by GridPP to upgrade by a certain time. JC asked what the penalty was if they didn't? There was a discussion re the SL3/SL4 issue. SL advised that we should set a target for the next meeting - would SL4 upgrade be the target? SP suggested that the metric be set at 100% and sites should upgrade software to timetable. This was agreed. SP further advised that every site in the Tier-2 had to comply with this in order to go green. DB noted that red was not penalising someone, merely letting the DB know that there was an issue to be addressed. 
TD advised that JC was therefore the person who should come back to the DB after having discussed timescales with DTeam regarding feasibility of dates etc. SP summarised that the Metric would be set at 100%, and it would be reviewed. DK noted that there was also nothing about networking on the Project Map. SL advised that things degrade as the network degrades, and tests begin to fail. DK noted that this should be about monitoring the network at specific sites. DR advised that CMS do a bandwidth test. SP was concerned that the network needs to be measured in some way - data transfer rates for example. GS noted that these can fail at a higher level. RJ noted that Robin Tasker deals with global network issues. SL advised that the measurement does have to be automatic - some count of the network where it is known to be ok (GridPP network monitoring?). AS noted that GridMon was a useful tool. TD suggested that it could be the number of network problems arising during the period - what was acceptable and what was not. SL suggested that Robin Tasker might be able to give an idea of what we could measure. AS noted that GridMon was good but erratic - it would need aggregated with an average value. TD also noted that the Tier-1 could monitor Tier-1 to Tier-2 connection problems. AS advised that GridMon does not produce easily-processable numbers. There was a discussion on ways of measuring. SL summarised that a Metric was needed - this was agreed - showing the number of sites that were ok, but this 'ok' was still to be determined. There was an action on JC to speak to Mark Leese about this. AS suggested end-to-end throughput? DR suggested bandwidth? SL advised that this was not a diagnostic tool and was only measuring effectiveness and related to the underlying network. Another issue that the metrics had not covered was security. DK asked about measuring incidents at sites? SP asked whether this would be per Tier-2 or the security box? 
DK suggested Tier-2 by Tier-2, or site-by-site. SL noted that there was already a box for security on the Project Map. AS noted however that this does not measure the number of incidents reported at sites. DK suggested measuring sites meeting their obligations to operational security. SL suggested measuring the number of failures to react to security incidents. DK advised that this should be passed to Mingchao Ma as Security Officer. SP summarised that this should go through the security box then be reviewed to see if we need to go through a Tier-2 box as well. This was agreed.

The DB agreed that all issues had now been covered.

2. Current Deployment Issues
=============================

SL advised that the meeting should concentrate on issues that we need to think about over the next few months, rather than smaller issues that can be addressed immediately elsewhere.

TD noted the policy on killing jobs - there had been a draft written but this was not posted anywhere. There were no changes required to the draft policy. SL suggested it be adopted as 'official' GridPP Policy. TD noted that the draft was up to the end of 2007, and discussions had taken place with the experiments. SL asked if anything further needed discussed? It was agreed: no. The Policy was to be uploaded as formal GridPP Policy, to be reviewed annually by DTeam. TD noted that there should be something obvious on the site that points to it. It was agreed that TD would send-on the information to SL and he will make a webpage.

SL asked if there were any other deployment/strategic issues? JC noted that upgrades were a primary concern, many sites had not yet upgraded. SL asked how we make this happen. DK suggested that we set a timetable and inform people. It was accepted that there was a general inertia about this issue. GS noted that SRM2 tokens at dCache were a separate issue. SL advised that if we agree a list of things here, set a timetable, in future JC should feed-in issues to be dealt with.
JC asked about 64-bit worker nodes? SL asked for the CMS position - DC advised they were not ready. SL noted that there was no argument to rush to do this. DK commented that all VOs fed into the GDB agreement. SL suggested that it could be left as a request to do, but then why have a timetable? It was concluded and agreed that sites can upgrade if they wish, but there is no pressure at present. However, if they do upgrade, they will need compatibility libraries. TD asked about the move to gLite 3.1? JC noted that support was being stopped for gLite components. TD suggested that sites must upgrade to gLite 3.1 or violate the metric - this should apply from today. AS countered that this should be a planned phase-in. TD noted however that if the software was not being supported, then sites had no option but to comply immediately. SL suggested that a one-month timeframe was reasonable - this was agreed: sites should upgrade to gLite 3.1 within one month from today. It was understood that if sites had not upgraded their clusters then they would show red on the Project Map, but we would know why. SL asked for any other strategic issues. TD noted that we should remove the Tier-1 Board from the website. SL reported that a new website was currently underway. JC advised that another issue was the ticketing system, the UK no longer has automatic feed from GGUS. MV noted that this was having an impact on manpower, was labour-intensive and manual assignment was currently being done. GS advised that a site can't close a ticket when an issue is resolved. AF suggested that sites need to become recognised support. JC reported that we had changed the way we handle tickets and this has not been formally agreed. SL asked the DB if they were happy with this situation or not? MV suggested that an extra Helpdesk would not be useful. JC asked if we were happy to use GGUS directly and not have the ROC helpdesk in between? 
GS noted that he used to receive a weekly summary of tickets from ROC and this had been very useful. It was agreed to leave things as they stood at present - in future issues like this would arise and it would be possible to put them on the Agenda for prior discussion.

The meeting closed for lunch at 1:00 pm.

3. Strategic Issues from Experiments
=====================================

SL advised that one of the ideas of the new Board was to receive direct input from the experiments - were there any issues? In the short-term the DTeam would handle problems; for the longer term, strategy should be decided here.

JCatmore advised that the general Grid infrastructure seemed to work for analysis. The problems experienced related to datasets at the site not being there, and ATLAS software at the site not being there or not being installed properly - these were niggly things. Also, users needed to deal with Ganga and the tag system - they didn't achieve the tag system at FDR1 and the tag system had not yet been stress-tested. JCatmore was unsure how it would work, and for FDR2 they would need to ensure that it worked. JCatmore advised that Ganga currently had issues with datasets - they couldn't combine little datasets into one large job which is what they need (ATLAS-specific), but the infrastructure was generally ok. They also needed to know who exactly to turn to when they got error messages - error-reporting needed to be much clearer for the user.

JC noted that there was a WLCG wiki. SL suggested that we need to capture these obscure error messages and give a pointer as to what to do about them. JCatmore advised that if you look on the Ganga hypernews list, it is possible to compile a comprehensive list of error messages. SL noted that we do have a Documentation Officer - it was noted that he had already been requested to address this issue. SL noted that most of the errors were experiment-related and this was difficult to write centrally.
MV suggested that there was a lot of commonality between the experiments however. SL asked if the error messages in Ganga were any help? JCatmore suggested that a page based on the Ganga hypernews would be a good place to start. SL noted that Ganga was in TD's remit now as Technical Director. There was an action on TD to highlight to Ganga that they should document issues. JCatmore volunteered to begin to devise a comprehensive list of error messages that would assist users - this was agreed.

DC advised that one quarter of CMS analysis was done over the Grid and there were no central issues at present. DC would send a list of urls to Stephen Burke.

It was noted that RN was not available for LHCb, and GP was not available re 'other' experiments. There was an action on JC to report back from UKQCD in relation to what was technically feasible. GS noted that data management in EGEE was low-level and generally poor - this made things difficult for users. DC had asked for some of Janusz Martyniak's time to deal with this. There was an action on GP to report on the status of MICE. AS noted that MICE has a more realtime requirement of the Tier-1 but he was unsure of the extent of this as yet.

4. Strategic Issues at the Tiers
=================================

SL noted that everyone had already reported the status of the Tiers at the full meeting. Were there any longer-term issues that needed to be raised?

DC advised that for LondonGrid, QMUL was the biggest single issue - an action plan was in place, and UCL would hopefully be online soon, it was a significant resource. Royal Holloway would be on line soon too. Other issues: DR had not yet been replaced at Royal Holloway, and at Brunel overall they should have a reasonable level of manpower. DR reported that at Brunel and Royal Holloway it was a 0.5FTE at present, but this was set to go up to 1.25 FTE each. DC didn't anticipate staffing would be a problem.
AF reported that there were manpower problems at NorthGrid due to turnover and the dCache upgrade. SL asked if a recovery plan was in place? AF advised that Lancaster were migrating to DPM and she was monitoring the situation overall. DC noted that DPM was treated as a poor relative in CMS and problems were not generally solved with any high priority. DR asked if space tokens were an issue? DC reported that it was the biggest source of user complaints. RJ noted that other places also have problems with dCache. DB asked if there was a release schedule? DR was not sure. GS advised that dCache was complex because it was a scaleable tool for storage and would get worse - there was a discussion of dCache minutiae and the problems involved. SL noted that this issue affected RAL PPD, Liverpool, Manchester, and Edinburgh. DC commented that most of the time it worked well but the glitches were very difficult to solve. RJ advised that another NorthGrid issue was the plan to ship data to SARA rather than RAL - Kors would like to see this, but it had implications for the DTeam - the lightpath from Lancaster to CERN no longer exists. For ScotGrid, TD reported that at Glasgow one FTE (Mike Kenyon) was moving from being a developer to ScotGrid Systems Manager, and David Martin would continue in his role. The new EGEE Co-ordinator would be Will Bell and User Liaison would be handled by Morag Burgon-Lyon with WB taking-over in due course. Edinburgh still had Sam and Steve - 0.5FTE, with a quarter FTE at Durham (Phil). GS noted that things should begin to move with ECDF. TD noted the more general issue of funding. For SouthGrid, PG reported that the position of new Tier-2 Co-ordinator at Oxford would shortly be advertised. Cambridge currently had a problem with ATLAS and LHCb software, SL3, and accounting. RAL PPD were having trouble with dCache and space tokens. Birmingham was improving and the new e-Science cluster was now extant. 
Bristol was ok but SL pointed out that they won't install ATLAS software on the new cluster. At Oxford they had new kit running on SL4 ok, and the old kit on SL3 would be upgraded soon along with a move to the new computer room.

For the Tier-1 AS reported that he was completing the 'on-call' project and there was new hardware to deploy. Procurements were ongoing. Disaster planning was ongoing, and the Service Delivery Plan was also ongoing. He needed to reach a conclusion regarding the CASTOR tape performance, and CCRC'08 was to be prepared for in May.

SL asked if there was AOB relating to the Tiers? None. DB asked if the Tier-1 was now more integrated with the Tier-2s? Was the blog effective? DC felt that they were integrated with the experiments ok and that the CASTOR meetings also helped - there were no problems at present. There was consensus that the Tier-1 blog was making a difference.

SL asked about the status of the GridPP3 MoU? TD advised that this had been reviewed by both the Tier-1 and the Tier-2 Board and signed-off before Christmas, with minor changes. The remaining issues were mainly to do with the Annexes, manpower was considered to be ok. Regarding Tier-1 hardware, AS reported that as at April '08 it was on schedule, and the PMB had discussed the STFC funding issues. Regarding the Tier-2, the first tranche is already delayed and the second might have to be reduced. TD suggested removing the April timeline, it was more a question of can we or can we not meet LCG requirements in that year. DB suggested that the best guess was to slip the dates by three months - April had only been nominated as April was the LCG review date - but beyond this the date did not matter. DB noted however that having a date does help - it makes a point about the damage the cuts are doing. There was an action on TD to update the info to show 'July' - this was agreed, and it would help address the funding problem.
SL suggested that the other issue was in the 2nd paragraph - the 1st tranche was done - we should change the greyed-out one to nominally July, and get rid of 08Q2 and instead add 09Q2. This would need to be revisited at the next meeting, depending upon the funding change, from May '09 to August '09 - this would make it self-consistent. SL asked how we would get this signed-off? TD noted there was an action on all MBs to have their own MoUs by the next meeting. TD advised that the key point is that the coming quarter will not be an accounted-for period, feeding into the allocation. Version 4 was the release version as the MoU basis externally. SL noted that the experiments need to make the point that they are being harmed by the cuts and by funding not being released. There was an action on TD to send the docs to SL who would upload them to the website.

5. Security Policies
=====================

DK advised that the MoU stated that we agree to implement security policy (the security documents were on the web). They were identical to EGEE security policy. DK reported that there are three policies currently pending public comment, and others had already been replaced. DK noted that once EGEE approves a policy it applies to GridPP - we only have the option of disagreeing at the development stage, and there is a mechanism in place to negotiate. GridPP needs to ensure that all its issues are fed-in earlier, at the development stage.

Understanding was reached that:

1. DK would email out the draft security policy documents.
2. In future he will mail to the Deployment Board listing.
3. The website will be kept up-to-date.
4. During EGEEIII all documents will be revised again.

6. AOB
=======

JC to upload talks to the GridPP20 website.

The next meeting would take place sometime in June, it was likely to be a half-day meeting via EVO. SL would inform the DB.

The meeting closed at 3:30 pm.
ACTIONS AS AT 13.03.08
======================

001.1 AS to nominate a Tier-1 Technical representative for the Deployment Board.

001.2 SL to devise a new ATLAS test.

001.3 DC to devise a new CMS test.

001.4 RN to devise a new LHCb test.

001.5 SP/SL to look at Metric 4.7 regarding CPU delivered and determine a sensible level that will flag what is obviously wrong.

001.6 Re the Tier-2 delivering to the LCG MoU (Metric 4.11), JC to look into this issue and provide recommendations.

001.7 Re network monitoring and quarterly reporting, JC to speak to Mark Leese.

001.8 Draft Policy on killing jobs now adopted as formal - to be uploaded, and reviewed annually by DTeam. TD to send-on the information to SL and he will do a webpage.

001.9 TD as Technical Director to highlight to Ganga that they should document issues in relation to error messages to users.

001.10 JCatmore to begin to devise a comprehensive list of error messages that would assist users. This takes over from the former PMB action noted below:

[277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal with the issue of user experience and design of an easily-found lookup facility for grid error messages. SL reported that he had started the ATLAS wiki page and would circulate the url. SB was leading this with inputs from SP, SL and JC where needed. A new simple summary was required of all areas available plus a lookup/links facility, for the OC to review. This would include a list of most recent types of problems (possibly a 'top 12' for users - what the error means and the course of action to follow). SB to progress this. It was noted that James Catmore (via the DB) had volunteered to do this. This action is therefore transferred to SL for progression via the Deployment Board. Done, item closed.]

001.11 DC to send a list of urls to Stephen Burke relating to CMS analysis on the Grid.

001.12 JC to report-back from UKQCD in relation to what was technically feasible.
001.13 GP to report-back on the status of MICE.

001.14 TD to update the Tier-1 hardware status dates to July (rather than April) to highlight the damage that the funding shortfall is causing.

001.15 DC to provide updated LondonGrid MoU.

001.16 RJ to provide updated NorthGrid MoU.

001.17 PC to provide updated ScotGrid MoU.

001.18 PW to provide updated SouthGrid MoU.

001.19 TD to send documents relating to the MoU and Tier-1 hardware to SL for uploading to the website.

001.20 DK to email the DB with all draft security policy documents.

001.21 JC to upload talks to the GridPP20 website.