dteam minutes 10/04/07
======================

Present:
Greig Cowan (minutes)
Derek Ross
Jeremy Coles (chair)
Dave Colling
Barney Garrett
Frederic Brochu
Yingqin Zheng (joined late)

Apologies:
Jens Jensen
Raja Nadakumar

Experiment updates
------------------

LHCb: RN reported via email that there were no new issues and that there are many free job slots, but their production has finished.

CMS: DC gave a summary. Ongoing problems with CASTOR. Production is ongoing. There is a central CMS problem with software installs around the UK (Peter Elmer is in charge and debugging it). Only IC is up to date, not RAL. This is not an issue with PheDex but with generic software installation. Transfers are going well: a constant 40MB/s from RAL, which is twice the baseline rate. Stuart and Matt are trying various configurations (not high priority) to get better performance.

Atlas: FB reported that there was little to report after the Atlas software week and then the Easter holiday. Still waiting for the central Atlas data management people to update the Atlas cache in order to help them replicate the data to the right places. No data can currently be put onto disk at RAL because the Map files have not been updated. This means that they are trying to write to the old dCache endpoint (which is read only) and not the CASTOR endpoint. A ticket has been submitted within Atlas. It is unclear whether to add CASTOR endpoints or replace the dCache ones. Waiting for developers to come back from holiday before any further action is taken. The problem with this is that Atlas data is now accumulating on T2 disk without the ability to replicate it to the T1. This will eventually become a site problem. RHUL were concerned that disk was filling up but the data was not moving anywhere.

Magnet failure and its impact
-----------------------------

Dave Newbold sent round an email to state that we should not change the procurement plans, at Tier-1 or Tier-2. DC agreed that we should proceed as if there is no delay. It may eventually buy us some time if, for example, there are disk problems.
Stephen West (minos) has been in touch with Stephen Burke regarding the creation of a production system. He has been invited to come to dteam meetings.

DR: rb02 is not supporting phenogrid properly. Catalin is on holiday at the moment.

General matters
---------------

* Site Readiness Review
  London meetings are next week. Open day on the Monday, then 9 sites to review. All responses have been compiled into a single document, which was sent off to the review team before Easter. Strengths and weaknesses are already known.

* T2C quarterly reports should be updated.

* Attending experiment weeks.
  PMB endorsed sending up to 3 people to software weeks.
  Olivier - CMS
  Alessandra/Graeme - Atlas
  Greig - natural to go to LHCb

  LHCb agendas: http://indico.cern.ch/categoryDisplay.py?categId=2l70
  Atlas agendas: http://indico.cern.ch/categoryDisplay.py?categId=3l9

  DC cannot go to the CMS software week next week due to the LT2 SRR.

  ACTION: DC to send URL of CMS software week agendas to list.
  ACTION: GC to enter dates of software weeks into the dteam calendar.

* We should encourage the smaller VOs to get involved. Minos do not seem to have any major requirements on the MC production.

* Local testing in VOs.
  A few users are joining the gridpp VO in Glasgow. What testing have they been doing? Need to postpone until more people attend the next meeting. LT2 is a local VO which goes through bursts of activity (for demonstrations and teaching periods). How do you decommission a VO? There is currently no easy way to remove them. Andrew McNab thought about the nature of VOs, but it is not clear how much has been implemented.

Action review
-------------

Jeremy to talk to Alessandra about security audits.
Derek has now uploaded the nagios scripts by providing a link to the CVS repository.
Greig will add a note to the action about SRM2.2 storage spaces.
DC: setting of alarms in gridload system. Will talk to Olivier.
JC: Check megatable. Change in MB table (resources available, not allocated).
GC: Make a list of SE occupation problems in wiki.
Need to look back at minutes to understand this action.
JC: We should try to get the list of actions down to about 10. Greig will update the actions.

AOB
---

GC mentioned the data integrity issues on disk. Jeremy will forward DN's response to the list.

DC and JC mentioned the work going into the EGEE3 proposal within the UK. SA1 will be the UK's main activity. EGI will come after EGEE3 (not quite a direct follow on). EGEE3 is due to start in April 2008.

Ops meeting: Some of the other regions are complaining about RGMA. Steve Fisher has reported that the introduction of the job monitoring has led to an increase in the number of producers (and hence connections), which has overwhelmed the system. In contact with Piotr, and we can now alter the monitoring to prevent this happening. Alessandra raised a ticket about this.

DC reported on the WMS testing. Tests have been written but they are waiting for an update before re-running. The tests are being packaged into rpms and should be available soon. Latest release: job submission is much quicker (1/3s per job), but the problem is moving downstream. It is now taking about 5 hours for the last of the jobs to leave the broker. 30% are failing, but they are retried until they pass. DC has various plots showing what is going on, which he will publish soon after the latest release has been tested. John Walsh may be doing something similar in Ireland, but it is probably not an official test.

The SL4 32bit version of glite should be available in production within a couple of months. It is in PPS now. GC asked whether sites are going to deploy this, or whether they would rather wait for a 64bit version. Jeremy will probably arrange a TB meeting for next week in order to get information.