DTEAM minutes from meeting on 24th June 2008
*********************************************************

Attendees:
Jeremy (chair+scribe)
Derek
Raja
Brian
Pete
Alessandra
Duncan
David
Jens
Andrew

11:00 Experiment problems/issues (20')

Review of weekly issues by experiment/VO

- LHCb: Nothing major happening. Discussion of Dirac is taking place with regard to getting glexec testing. T1 problem getting jobs running via WMS - fine for RB.
JC: camont have seen the opposite problem - jobs run fine via RB but not via the WMS!
DR: Problems relate to what is used to match against: queued jobs (RB) or the VO object as queried by the WMS. RBs only match queues. This should be resolved shortly.
RN: Why is it happening now?
DR: Since the 50 job limit for CMS was changed, jobs queue much sooner.

- CMS:
DC: Some movement recently. Analysis groups are now defined for the T2s, but the UK underdeclared disk at Brunel so may get another group (electroweak) for Brunel. Allocated groups are exotica, Higgs, susy and egamma. These have been allocated for the UK sites in head mapping. There has been further discussion on stopping T1 non-production access. There is now a plan to have the T1 data published as unavailable.
DR: What is the schedule for getting it unmarked?
DC: Chris will do it today if there are no other comments.
DR: Problems for CMS go away once this is done. Need all users on the WMS, except ATLAS who still use condor glide-in jobs and who do not see this problem.
RN: Currently the RB sees 4 CEs at RAL for one job, which is confusing.

- ATLAS
No report

- Other
None

- Site:
-- Steve's new test results page: http://pprc.qmul.ac.uk/~lloyd/gridpp/uktest.html
We took a quick look through Steve's new test page. It looks to be useful since it gives a true user view - where jobs are flowing and the success once there.

11:20 ROC update (15')

(For more information, take a look at the agenda attached file for the Footprints tickets. It seems some site tickets are still appearing in Footprints.
)

ops update
***************
- Work to test and integrate the CREAM CE has begun: https://savannah.cern.ch/task/?group=sa1dep. JC noted that IC appear to be progressing with an installation.

- Recommended storage solutions to deploy have come up in several places. This is the version presented to the ops meeting yesterday:

"CASTOR
Core
* 2.1.7-10 will be released this week
  o Tier1s are recommended to upgrade; they will upgrade around mid-July
* 2.1.8 will be released the first week of August
  o Tier0 will upgrade within the end of August
  o Tier1 will follow
SRM
* Current recommended version is 1.3-27 on SLC3
* Recommended version is 2.7-1 on SLC4 as soon as released.
Support:
For Castor core support is granted for 2.1.n and 2.1.[n-1] where n is the version currently installed at Tier-1s. However, as soon as Tier-1s will move to 2.1.7, then 2.1.6 will not be supported any longer. For CASTOR SRM, 2.7-n and 1.3-27 will be supported till new announcement.

dCache
Current version is 1.8.0-15p6 which fixes an essential bug with caching credential produced through grid-proxy-init. Patch release 7 is about to come out. It fixes a problem with checksum verification when copy a file in push mode between 2 dCache sites. 1.8.0-15p7 is the recommended version as soon as it is out (in the next days).

StoRM
Recommended and supported version is 1.3.20 on SLC4.

DPM
Recommended and supported version is 1.6.10 on SLC4.
"

JC asked about the status across the UK sites. JJ mentioned that Greig had supplied a link to an automatically updated page (but did not have the link to hand).
[Action: JC to check and circulate the link again]
Link is http://wn3.epcc.ed.ac.uk/srm/xml/srm_version_table, found after some chat window link exchanges from BD and DR.

Ticket status
***************
https://gus.fzk.de/download/escalationreports/roc/html/20080623_EscalationReport_ROCs.html

JC: Good to see that UKI is down to just one ticket now in this list.
We still need to look at tickets that have not reached this stage.

The only ticket in escalation is 35089.

FP ticket 3022 against RAL T1 is in progress. It relates to CMS tape servers locking up - Tim was looking into it.

For the Footprints list see the attachment on the agenda.

11:35 Post-mortem workshop (10')

JC talked through the summary which is on the blog and stopped to ask questions on how issues affected the UK.

On dCache, AF noted that the replica manager was finally working. This is a minimum since spacetokens are also needed. But anyway, with the rm working ATLAS can increase the number of jobs running - some still run even without spacetokens. Manchester is also working on DPM but this is going slowly.
[Action: JC to check which ATLAS non-production jobs still run on sites without spacetokens and whether this will change soon]

DC asked if anyone goes to the EMT - he used to on WMS matters but never noted any discussion about TMB discussions and conclusions. Those attending tend to be middleware and testing people. We probably do not need any representation.

DR noted that the broadcast tool sometimes sent 5 notifications about the same event to different lists.

RN asked how many people can be on the GGUS alarm list.
[Action: JC to find out how many people can be on the alarm lists]

[Action: JC to check CERN statement about publishing CPUs vs cores and the impact this had on their site - what is correct?]
DR noted that this related to the discussion a few weeks ago about glue settings.

PG was not sure if the TPM schedule for August had changed.
[Action: JC to check August TPM schedule - impacts DR training too]
11:45 Topics to revisit (10')

- COD schedule and roles will be discussed next week
- gstat publishing
- Wiki/web page updates (see for example http://www.gridpp.ac.uk/deployment/contact.html)
- Completion of the GridPP-NGS site status information in http://www.gridpp.ac.uk/wiki/Working_with_NGS
- regional Nagios monitoring
- Collecting site queue/fairshare information
- Reminder for sites to add comments to http://www.gridpp.ac.uk/wiki/SAM_availability:_October_2007_-_May_2008. Will be mentioned on Thursday and one week later emails sent to sites.
- Look at the Site Readiness Review reports
- Comment on the EGEE SLDs (discussed at PMB yesterday. Tier-2 managers to review and sign).
- "We need to audit T2 sites to understand how many concurrent transfers each can cope. This requires details of how many servers are available and how the pools are allocated between the VOs."

11:55 Actions review (10')

Check here: http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
Items discussed and minor changes made.

12:05 AOB (05')

- Any issues/actions from HEPSYSMAN? http://hepwww.rl.ac.uk/sysman/june2008/agenda.html
PG thought the event went well and flowed smoothly. JC agreed.

- UKQCD progress
VO now established. Need to get the users into the VO. DC noted that Jan. will be away for the next 3 weeks and that he and DR might lend support. This led to a discussion about what mailing list should be used for general Jan. type support.
[Action: JC to check with SB about lists that are already in use or set up]

- Tier-2 status & issues: please send JC (by Friday am) comments for UK input to LHCC review talk. We do not have to provide anything but for core issues this is an opportunity to get something known!

- Core topics for Thursday's UKI meeting

- Registration now open for http://egee08.eu-egee.org/

Meeting finished at 12:30.

EVO chat window content dump:

[10:53:54] Brian Davies joined
[10:56:57] Brian Davies yes
[10:57:10] Brian Davies that's correct.
[10:57:21] Brian Davies yep!
[10:57:32] Brian Davies Dealing with it at the moment
[10:57:33] Brian Davies yep
[10:57:42] Derek Ross joined
[10:58:59] Duncan Rand joined
[11:00:16] Jeremy Coles Hi Duncan - can you hear me?
[11:00:30] Duncan Rand old on please
[11:00:34] Duncan Rand hold
[11:00:53] Jeremy Coles No worries - just checking audio
[11:05:01] Brian Davies atlas production should be restarting
[11:05:37] David Colling joined
[11:05:38] Raja Nandakumar joined
[11:05:40] Alessandra Forti joined
[11:05:55] Jens Jensen joined
[11:13:20] Pete Gronbech joined
[11:17:38] Brian Davies Banning Users From Tier 1 is also on ATLAS agenda..
[11:22:50] Andrew Elwell joined
[11:23:02] Andrew Elwell Evening all
[11:24:08] Brian Davies incorporating in fairshare allocations for each site would also be interesting
[11:25:50] Brian Davies What Plan for dCache DPM support for older versions?
[11:26:22] Brian Davies one sec..
[11:26:38] Brian Davies i get the link
[11:29:00] Brian Davies http://wn3.epcc.ed.ac.uk/srm/xml/srm_versions_pie?SRM_flavour=.*&endpoint=.uk&starttime=2008-06-23+10%3A25%3A12&endtime=2008-06-24+10%3A25%3A12
[11:29:13] Brian Davies This link gives current UK SRM deployments
[11:31:49] Duncan Rand http://wn3.epcc.ed.ac.uk/srm/xml/srm_version_table?endtime=2008-06-24+10%3A26%3A56&endpoint=uk&type=.*&version=.*&starttime=2008-06-23+10%3A26%3A56
[11:32:58] Brian Davies 10/17 UK DPM's are on 1.6.10-*, 7/17 are still on 1.6.7-*
[11:34:30] Brian Davies CMS issue with Users flooding production perhaps
[11:37:12] Brian Davies DPM is balanced by round robin. It is simple, but then DPM was designed to be simple (less things to configure/break)
[11:39:07] Brian Davies ATLAS will do
[11:57:40] Duncan Rand back in a moment
[12:01:54] Brian Davies Are seeing "Hot" files for Re-processing in atlas (conditions file) being needed by many jobs.
been investigated
[12:09:04] Brian Davies extension of sls work is probably appropriate for longer term
[12:09:49] Brian Davies for Tier 2s
[12:10:16] Brian Davies T1s are in it for ATLAS and LHCb
[12:19:22] Andrew Elwell glasgow publish fairshare via the ganglia / monami info
[12:21:41] Jens Jensen It is part of "usable" storage
[12:21:55] Brian Davies Am going to at the same time compile from the hepsysman plans from sites to get audit..
[12:22:50] Brian Davies and follow up on sites who gave no information on future plans
[12:27:29] Andrew Elwell remote was good on the Fri - I understand the probs on Thurs
[12:33:12] Raja Nandakumar left
[12:33:13] Andrew Elwell left
[12:33:15] Derek Ross left
[12:33:15] Brian Davies left
[12:33:15] Duncan Rand left
[12:33:17] David Colling left
[12:33:17] Alessandra Forti left
[12:33:17] Jens Jensen left
[12:33:24] Pete Gronbech left
[12:33:27]