Present: Raul Lopes, Gianfranco Sciacca, Chris Walker, Peter Hobson, Ben Waugh,
Govind Songara, William Hay, Duncan Rand, Daniela Bauer

=============================
Site status
=============================

**** UKI-LT2-Brunel (Raul/Peter) ****

=> Hardware
-> Tender for storage and CE replacements
 . out until Wednesday
 . 400 TB (to the end of GridPP3)
 . CE replacements for dgc-grid-40 and dgc-grid-44
 . monitoring system
-> Storage already acquired
 . still waiting for delivery of the 48 TB acquired from Streamline in DECEMBER! (any day now....)
-> The two hard disks replaced in dgc-grid-53 are now integrated in the RAID

=> Software
-> Cobbler running: worker nodes and storage pools already installed from Cobbler
-> BCFG2/Cfengine: under test
-> Nagios: should be deployed in May
-> Start testing SL5 next week
 . ready for CE, WN and storage pools?

=> Tests
-> Improved FTS reports
-> Errors in the IC squid cache are showing up in the CMS dashboard and causing
   problems at Brunel (Action on Imperial). The squid problem at Imperial causes
   a warning at Brunel; it needs sorting out at IC. See here about the squid problem:
   http://dashb-cms-sam.cern.ch/dashboard/request.py/latestresultssmry?siteSelect3=T2T1T0&serviceTypeSelect3=vo&sites=T2_UK_London_Brunel&exitStatus=all&tests=all&services=CE
   https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=dgc-grid-40.brunel.ac.uk&vo=CMS&testname=CE-cms-swinst&testtimestamp=1239958181
-> UK Grid Status page always "funny"
 . 17 Apr 2009 16:02:03 reporting 8444476
 . number of jobs: dgc-grid-40: 253, dgc-grid-44: 54, dgc-grid-35: 43

=> Small issues
-> Job efficiency
-> Performance affected by a huge number of job submissions (from Si, Maarten, Sciaba...)

Other points noted:
Duncan points out that the lcg-CE will not be ported to SL5.
Raul thinks Steve Lloyd's page is wrong for Brunel (at least it doesn't match
what he sees locally). Raul to talk to Steve Lloyd about this (Action on Brunel).

**** UKI-LT2-IC-HEP ****

300 TB of new disk has been delivered. Now we just need switches, space and
some time to install it.

**** UKI-LT2-IC-LESC ****

Still debugging accounting problems for mars-ce0. Site trundling along,
Daniela doing most of the support.

**** UKI-LT2-QMUL ****

116 Viglen nodes died under the load of 1 TB files on the Lustre file system;
this killed about a third of the machines. There is a problem with the network
cards on the motherboards, seen elsewhere at QMUL, but it is probably something
more than just network load. Lustre itself held up reasonably well.
LHCb locking was solved by increasing the number of locking threads. LHCb were
not very helpful with diagnosis, e.g. in saying which file they were trying to
lock.

**** UKI-LT2-RHUL **** (Duncan)

Working fine. CMS reserves space for data whether it is needed or not;
intervention by Monika. Duncan to help Govind install the new worker nodes.
SL5 is 64-bit only, but the worker nodes are 32-bit.

**** UKI-LT2-UCL-HEP ****

New CE stable (currently running mostly ATLAS user jobs at ~95% success rate).
New DPM head node online and stable.
New DPM pool online (SLC3 pool replaced).
Provisional ATLASPRODDISK space token deployed.
APEL/RGMA OK (closed ticket against it), but MON still on SLC3.
Panda pilots now on the new CE.
Next steps:
 . Add enough disk to have ATLASPRODDISK and ATLASDATADISK with ~2 TB each
   (possibly as early as next week?)
 . Get added to ToA and try to run production
 . Replace the SLC3 3.0 MON (a new SLC4 3.1 box is installed, but some services
   do not start after YAIM)
Unsure of:
 . Fixed the BDII CPU publishing to gStat according to the latest guidelines,
   but not sure this is correct. There was some agreement that the number of
   logical and physical CPUs should be set to the same value.
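(As an aside: a quick way of checking what a site BDII actually publishes for
these two numbers is an ldapsearch along the following lines. This is only a
sketch: the hostname is a placeholder for your own site BDII, and
GlueSubClusterPhysicalCPUs/GlueSubClusterLogicalCPUs are the GLUE 1.3
attributes in question.)

  # Query the site BDII (port 2170) for the published CPU counts.
  # Replace site-bdii.example.ac.uk with your own site BDII host.
  ldapsearch -x -h site-bdii.example.ac.uk -p 2170 -b o=grid \
      '(objectClass=GlueSubCluster)' \
      GlueSubClusterPhysicalCPUs GlueSubClusterLogicalCPUs

If the agreement above is followed, the two values should come back equal.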
**** UKI-LT2-UCL-CENTRAL ****

Problems with the scheduler: 66000 jobs in one go was too much....
LHCb problem: the site is not advertising CPU time, just wall time, and LHCb
looks for a CPU time other than zero (a quick way of checking what is
published is sketched at the end of these minutes).

==============================================================================
Experiments
==============================================================================

Currently lots of ATLAS jobs running - woo hoo.

Duncan's list of useful webpages: went through most of them, few surprises here.
http://dashb-siteview.cern.ch/generic/site-monitoring/test.html

STEP09: 25 May - 12 June 2009. QMUL yes; RHUL no (cluster being moved at the
end of May/beginning of June).

* ATLAS (15')
Copying user-requested data into QMUL.
Production: http://panda.cern.ch:25880/server/pandamon/query?dash=prod
Lots of jobs timing out in London - not sure why.
FTS: http://lcgwww.gridpp.rl.ac.uk/cgi-bin/fts-mon/fts-mon.pl?q=jobs&p=day&v=All&c=UKILT2QMUL
Analysis stress tests: http://gangarobot.cern.ch/st/
ATLAS expects all supporting sites to be available for STEP09.
Plans: http://tinyurl.com/czx6rm

* CMS (15')
Site status page: http://dashb-ssb.cern.ch/dashboard/request.py/siteview?
Site readiness: http://lhcweb.pic.es/cms/SiteReadinessReports/SiteReadinessReport.html#T2_UK_London_Brunel
CMS STEP09 plans: http://tinyurl.com/dd5bhq

* LHCb (15')
LHCb site status board: http://lhcb-project-dirac.web.cern.ch/lhcb-project-dirac/lhcbProdnMask.html
SAM test results: http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
Production: http://lhcbweb.pic.es/DIRAC/jobs/SiteSummary/display - some naming confusion

* Other VOs (15')
Camont (http://indico.cern.ch/conferenceDisplay.py?confId=56456)
Fusion - issue at RHUL:
https://gus.fzk.de/ws/ticket_info.php?ticket=47699
https://gus.fzk.de/ws/ticket_info.php?ticket=47814

**** GGUS tickets

https://gus.fzk.de/ws/ticket_search.php
GGUS Ticket-ID: 45327 (GFAL versions < 1.10.6). Govind is aware of it (a quick
way of checking the installed GFAL version is sketched at the end of these
minutes).

**** Publishing DNs with APEL

http://goc.grid.sinica.edu.tw/gocwiki/ApelFaq#head-69f1753f985897a37902df00734f2480220250b0

**** HEP-SPEC06

Eventually we will have to run the HEP-SPEC06 benchmark:
https://twiki.cern.ch/twiki/bin/view/FIOgroup/TsiBenchHEPSPECWlcg
The underlying SPEC software needs to be *bought*.

**** Storage issues

Duncan encourages everybody to attend the weekly storage meeting.
There is a storage workshop on 2-3 July (after the HEP-sysman meeting at RAL,
30 June - 1 July).

**** WLCG Nagios

https://sam-uki-roc.cern.ch/nagios/
Use the Nagios Firefox plugin.
Alex is setting up monitoring at QMUL.
Nagios: prune down the errors, at least deal with the ones that can't work:
https://gridppnagios.physics.ox.ac.uk/nagios needs some work so you only see
your own site.

*** Communication

Is it worth setting up a group chat? (Note: there is one now, #londongrid on
freenode IRC.)
Skype -> problems on Linux?
LondonGrid wiki: please somebody do something......
Blogging: http://planet.gridpp.ac.uk/ - not much enthusiasm.

*** Educating users on how to use the grid (data set storage)

E.g. QMUL: a local user wants to get data onto the grid - not via Chris, use
the official channels, but what are they? Which space token to use -> ATLAS
local?

**** Other business

Peter: Are we experiencing a denial-of-service attack from our own software?
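**** Quick check: published CPU/wall-time limits (re UCL-CENTRAL LHCb problem)

To see what a CE actually advertises for CPU and wall-clock time, something
like the following ldapsearch should work. A sketch only: the hostname is a
placeholder for the site BDII; GlueCEPolicyMaxCPUTime and
GlueCEPolicyMaxWallClockTime are the GLUE 1.3 attributes (values in minutes)
that LHCb matches against.

  # List each CE queue with its published max CPU and wall-clock times.
  # LHCb skips queues whose advertised CPU time is zero or unset.
  ldapsearch -x -h site-bdii.example.ac.uk -p 2170 -b o=grid \
      '(objectClass=GlueCE)' \
      GlueCEUniqueID GlueCEPolicyMaxCPUTime GlueCEPolicyMaxWallClockTime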
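**** Quick check: installed GFAL version (re GGUS 45327)

A quick way to see whether a node still carries an affected GFAL is to ask rpm
directly. The package name below is as in gLite 3.1; adjust it if your install
differs.

  # Anything older than 1.10.6 is affected by GGUS ticket 45327.
  rpm -q GFAL-client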