Deployment team

Europe/London
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 780409 with code: 4880.

Tuesday 28 April 2009

Present: Brian Davies (minutes), Jeremy Coles, Daniela Bauer, Raja Nandakumar, Sam Skipsey, James Cullen, Doug McNab, Duncan Rand, Gareth Smith, Graeme Stewart, Mingchao Ma, Alessandra Forti, Dave Colling, Mohammed Kashif

• Review of weekly issues by experiment/VO
  - LHCb
    -- Enabling pilot roles: RAL have deployed. Some T2s have, but most have not. Manchester, QMUL and ECDF (on one CE) have. Pilots need new pool accounts. See the mail from Roberto Santinelli.
    -- New software application tests: the 10M event production will be pushed back until next week (jobs at T2s). It will take over 10 days and will go through all good WMS.
    UCL-CCC and Manchester were banned. Manchester is now passing SAM tests so will be un-banned. UCL-CCC has a VOMS cert issue.
    Timeouts are occurring within DIRAC at Liverpool and Birmingham. Still investigating.
    Site blacklisting is done via GGUS tickets. Historical information on blacklisting is in the DIRAC 3 database.
  - CMS
    The site readiness page shows which sites are ready for usage. If you fail any critical SAM (ops or CMS) test, you get blacklisted. If you start passing tests then you get un-banned. This can be an issue if SAM tests are badly written, i.e. a site's config confuses the tests and this takes time to be corrected.
  - ATLAS
    Production in the last week: QMUL had storage usage issues. RALPPD had storage issues. Liverpool efficiency was down. ECDF had storage issues. The plan is to continue at a high level of production for the foreseeable future.
    STEP09 pre-tests affected QMUL storage; now being worked on. The current setup is a DPM front end with a Lustre back end, and the bottleneck is through DPM. If they had StoRM (which understands Lustre), things would be better.
    STEP09: could sites please configure the pilot role (50% prod, 25% pilot, 25% regular).
    Ganga is I/O bound on job submission (from the HammerCloud test). Being worked on. GS will announce further HammerCloud tests on TB-SUPPORT.
    DR: is the pilot role mandatory for STEP09? GS: it is highly desirable.
    Not needed for non-ATLAS data sites (IC, Brunel, Bristol).
    MAUI fairshare is based on dedicated PS so most sites are configured correctly.
    Blacklist: ECDF were blacklisted, but are now working. UCL-CCC are offline, but the software install at UCL is going ahead so they should work in the next few weeks.
    GS to look at why QMUL is not running (they are not blacklisted). The 2G queue was offline but the 1G queue was online, hence no jobs. The 2G queue is now back on.
    Historical blacklisting could be recorded but is not in place at the moment.
    PG: are the almost-full Oxford space tokens an issue? GS: more info to come out of the CRRB meeting.
  - Other
    Fusion affected RHUL last week.
  - Experiment blacklisted sites: review
    -- Experiment descriptions of how blacklisting is handled
  - Site performance
    -- http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
    Culham issues. PG will look into it.

• ROC update
  - From the EGEE ops meeting: some gLite 3.0 services still remain (https://twiki.cern.ch/twiki/bin/view/EGEE/SitesPublishinggLite30): RALPPD, IC-HEP, ECDF.
    ECDF has one SE and a MON box. Since they have no staff and are on best effort, these will be upgraded when they can.
    AF pointed out that R-GMA is a pain to upgrade at the moment.
  - Running 64-bit WNs
    Liverpool have been the only vocal site about not being able to run 64-bit SL5.
    DC: issues with running 2 sets of RPMs for supporting 32-bit mode on 64-bit. The change of YUM to the groupinstall method is causing issues. Ewan sent a mail to TB-SUPPORT.
    Action to look at this further.
  - COD & ROD
    -- Any comments from the contributors of last week?
       Comment from Doug: ticket management involvement good. TPM monitoring by PG, DR and GS this week. The TPM handover ticket was not received.
    -- 15th June may be the switchover date. Several documents have been circulated (attached).

From the site reports:
  - SAM data was missing this week. Very few site comments; many indicated a good and quiet week.

WLCG update
*****************
  - No recent WLCG meetings.

Ticket status
***************
https://gus.fzk.de/download/escalationreports/roc/html/20090427_EscalationReport_ROCs.html
  45327 - gfal for RHUL: on hold. Will be put into un-solved.
  46024 - pheno using data at usage level DN encryption. Good progress.
  47073 - ATLAS STs at CAM: in progress.
  47342 - ILC T1 SE problem: in progress.
  47393 - LHCb stalled at Manchester. Now solved. Bug in NFS kernel server. Updated.
  47653 - VOMS host update.
  47759 - MyProxy at T1. To go on hold until further work is done.
***************

Two sets of slides regarding getting ready for regional ops.
  1st set: checklist for the regional model.
    One FTE to do the work.
    Provide a mailing list for RCOD contact. Needs to be set up. No preference as to the name and host of the list.
    Link to GGUS for the regional knowledge base.
    1st line supporter status: JC to follow up.
  2nd set of slides: Rcod Readiness v1.1.pdf
    Need to decide the rota. JC to return to this next week.

Site joining information is now out of date. What needs to be updated? Who should do it? Does ScotGrid have info on this? DM and GS have not seen it. S has been documenting info for users, which isn’t necessarily useful for EGEE, and what is specific to local users. JC to take it offline. Good idea to have it in one page which becomes part of ROC and NGI.

AOB: None.