Present: Raul Lopes, Gianfranco Sciacca, Chris Walker, Peter Hobson, Ben Waugh,
Govind Songara, William Hay, Duncan Rand, Daniela Bauer

=============================
Site status
=============================

**** UKI-LT2-Brunel (Raul/Peter) ****

=> Hardware
-> Tender for storage and CE replacements
 . out until Wednesday
 . 400 TB (to the end of GridPP3)
 . CE replacements for dgc-grid-40 and dgc-grid-44
 . monitoring system
-> Storage already acquired
 . still waiting for delivery of the 48 TB acquired from Streamline in DECEMBER! (any day now....)
-> The two hard disks replaced in dgc-grid-53 are now integrated in the RAID

=> Software
-> Cobbler running: worker nodes and storage pools already installed from Cobbler
-> BCFG2/Cfengine: under test
-> Nagios: should be deployed in May
-> Start testing SL5 next week
 . ready for CE, WN and storage pools?

=> Tests
-> Improved FTS reports
-> Errors in the IC squid cache are showing up in the CMS dashboard and causing
   problems at Brunel (Action on Imperial). The squid problem at Imperial causes
   a warning at Brunel; it needs sorting out at IC. See here about the squid problem:
   http://dashb-cms-sam.cern.ch/dashboard/request.py/latestresultssmry?siteSelect3=T2T1T0&serviceTypeSelect3=vo&sites=T2_UK_London_Brunel&exitStatus=all&tests=all&services=CE
   https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=dgc-grid-40.brunel.ac.uk&vo=CMS&testname=CE-cms-swinst&testtimestamp=1239958181
-> UK Grid Status page always "funny"
 . 17 Apr 2009 16:02:03 reporting 8444476
 . number of jobs: dgc-grid-40: 253, dgc-grid-44: 54, dgc-grid-35: 43

=> Small issues
-> Job efficiency
-> Performance affected by a huge number of job submissions (from Si, Maarten, Sciaba...)

Other points noted:
Duncan points out that the lcg-CE will not be ported to SL5.
Raul thinks Steve Lloyd's page is wrong for Brunel (at least it doesn't match
what he sees locally). Raul to talk to Steve Lloyd about this (Action on Brunel).

**** UKI-LT2-IC-HEP ****

300 TB of new disk has been delivered. Now we just need switches, space and
some time to install it.

**** UKI-LT2-IC-LESC ****

Still debugging accounting problems for mars-ce0. Site trundling along,
Daniela doing most of the support.

**** UKI-LT2-QMUL ****

116 Viglen nodes died under the load of 1 TB files on the Lustre file system;
this killed about a third of the machines. There is a problem with the network
cards on the motherboards, seen elsewhere at QMUL, but it is probably something
more than just network load. Lustre itself held up reasonably well.
LHCb locking was solved by increasing the number of locking threads. LHCb were
not very helpful with diagnosis, e.g. in saying which file they were trying to
lock.

**** UKI-LT2-RHUL **** (Duncan)

Working fine. CMS reserves space for data whether it is needed or not;
intervention by Monika. Duncan to help Govind install the new worker nodes.
SL5 is 64-bit only, but the worker nodes are 32-bit.

**** UKI-LT2-UCL-HEP ****

New CE stable (currently running mostly ATLAS user jobs at ~95% success rate).
New DPM head node online and stable.
New DPM pool online (SLC3 pool replaced).
Provisional ATLASPRODDISK space token deployed.
APEL/RGMA OK (closed ticket against it), but MON still on SLC3.
Panda pilots now on the new CE.
Next steps:
 . Add enough disk to have ATLASPRODDISK and ATLASDATADISK with ~2 TB each
   (possibly as early as next week?)
 . Get added to ToA and try to run production
 . Replace the SLC3 3.0 MON (a new SLC4 3.1 box is installed, but some services
   do not start after YAIM)
Unsure of:
 . Fixed the BDII CPU publishing to gStat according to the latest guidelines,
   but not sure this is correct. There was some agreement that the number of
   logical and physical CPUs should be set to the same value.
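(As an aside: a quick way of checking what a site BDII actually publishes for
these two numbers is an ldapsearch along the following lines. This is only a
sketch: the hostname is a placeholder for your own site BDII, and
GlueSubClusterPhysicalCPUs/GlueSubClusterLogicalCPUs are the GLUE 1.3
attributes in question.)

  # Query the site BDII (port 2170) for the published CPU counts.
  # Replace site-bdii.example.ac.uk with your own site BDII host.
  ldapsearch -x -h site-bdii.example.ac.uk -p 2170 -b o=grid \
      '(objectClass=GlueSubCluster)' \
      GlueSubClusterPhysicalCPUs GlueSubClusterLogicalCPUs

If the agreement above is followed, the two values should come back equal.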
**** UKI-LT2-UCL-CENTRAL ****

Problems with the scheduler: 66000 jobs in one go was too much....
LHCb problem: the site is not advertising CPU time, just wall time, and LHCb
looks for a CPU time other than zero (a quick way of checking what is
published is sketched at the end of these minutes).

==============================================================================
Experiments
==============================================================================

Currently lots of ATLAS jobs running - woo hoo.

Duncan's list of useful webpages: went through most of them, few surprises here.
http://dashb-siteview.cern.ch/generic/site-monitoring/test.html

STEP09: 25 May - 12 June 2009. QMUL yes; RHUL no (cluster being moved at the
end of May/beginning of June).

* ATLAS (15')
Copying user-requested data into QMUL.
Production: http://panda.cern.ch:25880/server/pandamon/query?dash=prod
Lots of jobs timing out in London - not sure why.
FTS: http://lcgwww.gridpp.rl.ac.uk/cgi-bin/fts-mon/fts-mon.pl?q=jobs&p=day&v=All&c=UKILT2QMUL
Analysis stress tests: http://gangarobot.cern.ch/st/
ATLAS expects all supporting sites to be available for STEP09.
Plans: http://tinyurl.com/czx6rm

* CMS (15')
Site status page: http://dashb-ssb.cern.ch/dashboard/request.py/siteview?
Site readiness: http://lhcweb.pic.es/cms/SiteReadinessReports/SiteReadinessReport.html#T2_UK_London_Brunel
CMS STEP09 plans: http://tinyurl.com/dd5bhq

* LHCb (15')
LHCb site status board: http://lhcb-project-dirac.web.cern.ch/lhcb-project-dirac/lhcbProdnMask.html
SAM test results: http://dashb-lhcb-sam.cern.ch/dashboard/request.py/latestresultssmry
Production: http://lhcbweb.pic.es/DIRAC/jobs/SiteSummary/display - some naming confusion

* Other VOs (15')
Camont (http://indico.cern.ch/conferenceDisplay.py?confId=56456)
Fusion - issue at RHUL:
https://gus.fzk.de/ws/ticket_info.php?ticket=47699
https://gus.fzk.de/ws/ticket_info.php?ticket=47814

**** GGUS tickets

https://gus.fzk.de/ws/ticket_search.php
GGUS Ticket-ID: 45327 (GFAL versions < 1.10.6). Govind is aware of it (a quick
way of checking the installed GFAL version is sketched at the end of these
minutes).

**** Publishing DNs with APEL

http://goc.grid.sinica.edu.tw/gocwiki/ApelFaq#head-69f1753f985897a37902df00734f2480220250b0

**** HEP-SPEC06

Eventually we will have to run the HEP-SPEC06 benchmark:
https://twiki.cern.ch/twiki/bin/view/FIOgroup/TsiBenchHEPSPECWlcg
The underlying SPEC software needs to be *bought*.

**** Storage issues

Duncan encourages everybody to attend the weekly storage meeting.
There is a storage workshop on 2-3 July (after the HEP-sysman meeting at RAL,
30 June - 1 July).

**** WLCG Nagios

https://sam-uki-roc.cern.ch/nagios/
Use the Nagios Firefox plugin.
Alex is setting up monitoring at QMUL.
Nagios: prune down the errors, at least deal with the ones that can't work:
https://gridppnagios.physics.ox.ac.uk/nagios needs some work so you only see
your own site.

*** Communication

Is it worth setting up a group chat? (Note: there is one now, #londongrid on
freenode IRC.)
Skype -> problems on Linux?
LondonGrid wiki: please somebody do something......
Blogging: http://planet.gridpp.ac.uk/ - not much enthusiasm.

*** Educating users on how to use the grid (data set storage)

E.g. QMUL: a local user wants to get data onto the grid - not via Chris, use
the official channels, but what are they? Which space token to use -> ATLAS
local?

**** Other business

Peter: Are we experiencing a denial-of-service attack from our own software?
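**** Quick check: published CPU/wall-time limits (re UCL-CENTRAL LHCb problem)

To see what a CE actually advertises for CPU and wall-clock time, something
like the following ldapsearch should work. A sketch only: the hostname is a
placeholder for the site BDII; GlueCEPolicyMaxCPUTime and
GlueCEPolicyMaxWallClockTime are the GLUE 1.3 attributes (values in minutes)
that LHCb matches against.

  # List each CE queue with its published max CPU and wall-clock times.
  # LHCb skips queues whose advertised CPU time is zero or unset.
  ldapsearch -x -h site-bdii.example.ac.uk -p 2170 -b o=grid \
      '(objectClass=GlueCE)' \
      GlueCEUniqueID GlueCEPolicyMaxCPUTime GlueCEPolicyMaxWallClockTime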
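**** Quick check: installed GFAL version (re GGUS 45327)

A quick way to see whether a node still carries an affected GFAL is to ask rpm
directly. The package name below is as in gLite 3.1; adjust it if your install
differs.

  # Anything older than 1.10.6 is affected by GGUS ticket 45327.
  rpm -q GFAL-client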