NorthGrid Technical Board Meeting
11.00am, Monday 1st June 2009
Present: Alessandra Forti (chair), James Cullen (minutes), Sergey Dolgobrodov, Stuart Wild, John Bland, Rob Fay, Stephen Jones, Peter Love, Matt Doidge, Elena Korolkova
* Site reports
Lancaster
Matt Doidge - 2TB disks are available at a very competitive price (cf. storage group emails). Need to upgrade to 10 gigabit networking on campus. Has gone back to the vendor with some more questions; hopes to put in a purchase order in the next 2 weeks.
STEP09 preparation.
One CE has a publishing problem – it stops publishing dynamic values. Peter Love would like some help with the information system to diagnose the problem; possibly a MAUI problem causes the publisher to fail and fall back to default values? Alessandra – which parameters are missing? Please send more detailed information and she will have a look at it. It is the new CE which is having the problem.
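As a starting point for the diagnosis, one can query the CE's resource BDII directly and check whether the dynamic state attributes are still being refreshed. A minimal sketch follows; the hostname is a placeholder, and port 2170 with the mds-vo-name=resource,o=grid base are assumptions for a gLite 3.1-style CE, so adjust to the actual setup.

    import subprocess

    CE_HOST = "ce.example.ac.uk"  # hypothetical hostname -- substitute the affected CE
    ATTRS = ["GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs",
             "GlueCEStateRunningJobs", "GlueCEStateEstimatedResponseTime"]

    # Query the CE's resource BDII and print only the dynamic state attributes,
    # which should track the batch system rather than static defaults.
    subprocess.call(["ldapsearch", "-x", "-LLL",
                     "-H", "ldap://%s:2170" % CE_HOST,
                     "-b", "mds-vo-name=resource,o=grid"] + ATTRS)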
Liverpool
Power cut overnight; everything was brought back up without problems.
Software server with SAS drives: no problems normally, only when jobs compile software.
Manchester
Sergey has installed Ganglia to monitor the CE, the DPM and dCache head nodes, and the DPM pools. He is working on DPM admin and monitoring tools.
The drain-pool procedure from the admin tools is still not working – in conversation with the developer David Smith about this issue (sent him logs last week).
Hopefully we have now corrected our dCache problem, but we are still planning to decommission it.
James is installing a test instance of a CREAM CE.
SPEC2006 software arrived this morning so will perform benchmarking soon.
Alessandra has mirrored a new (SL4.7) repo with all the latest software releases and is installing a new CE, as we found that our repositories contained RPMs which didn't belong to that release (we currently use SL4.4).
Alessandra found the cause of our ATLAS and NGS problems to be an incorrect version of Globus installed on our CEs. She amended some files to correct the problem, and both VOs can now run jobs successfully. Alessandra – has anyone else had problems with the Globus libraries? Not at Liverpool.
STEP09 – more disk space for ATLAS.
Sheffield
Elena - Sheffield is doing well apart from the fact that the storage system was extremely loaded last week by analysis jobs. This led to a memory leak on the storage head node and disk pools. The total number of analysis jobs has been limited to 100 (50 HammerCloud plus 50 pilots), which corresponds to Sheffield's share in STEP09, and the storage services have been restarted. The storage system looks healthier now, but the situation led to failing SAM tests and 66% efficiency on Saturday.
* Atlas and LHCb news
* Other VOs
* AOB
UIs need to be updated while the RAL services are down later this month. Alessandra suggested these replacement services (a quick reachability check is sketched after Matt's comment below):
WMS=wms00.hep.ph.ic.ac.uk
LB_HOST=wmslb00.hep.ph.ic.ac.uk
BDII=top-bdii.tier2.hep.manchester.ac.uk
PX_HOST=myproxy.cern.ch
Matt – Lancaster have not discussed this yet.
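Sites wanting to sanity-check the suggested endpoints before the downtime could run a simple TCP reachability test along these lines. The hostnames are the ones listed above; the port numbers (WMProxy 7443, LB 9000, top-level BDII 2170, MyProxy 7512) are assumptions and should be adjusted to the actual service configuration.

    import socket

    # Hostnames from the list above; ports are assumptions -- adjust as needed.
    SERVICES = [
        ("WMS",  "wms00.hep.ph.ic.ac.uk",               7443),
        ("LB",   "wmslb00.hep.ph.ic.ac.uk",             9000),
        ("BDII", "top-bdii.tier2.hep.manchester.ac.uk", 2170),
        ("PX",   "myproxy.cern.ch",                     7512),
    ]

    for name, host, port in SERVICES:
        try:
            sock = socket.create_connection((host, port), timeout=5)
            sock.close()
            print("%-4s %s:%d reachable" % (name, host, port))
        except (socket.error, socket.timeout) as err:
            print("%-4s %s:%d NOT reachable (%s)" % (name, host, port, err))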
Accounting
Alessandra – Manchester has now recovered the lost accounting records for the beginning of the year.
STEP09
Peter – jobs will start tomorrow at 8am. The pilot factory will send pilot jobs to all participating sites. Sites will have to monitor the number of concurrent jobs because the LAN will get saturated. Peter has a web page with information about what's happening for STEP09. Everyone has enough space allocated. How are people monitoring which jobs are producing bandwidth load? Liverpool are very good at this.
John Bland – watches Ganglia a lot to find which nodes have high network traffic, then looks at the job output etc. Uses MonAMI on the CE (job efficiency) and on the SE and pools (DPM/MySQL state and number of RFIO/GridFTP connections).
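For sites without MonAMI in place, a rough sketch of the same kind of per-node check – counting established RFIO and GridFTP connections on a DPM pool node – is below. The port numbers (5001 for rfio, 2811 for the gsiftp control channel) are assumptions; it reads /proc/net/tcp, so Linux only.

    # Count established TCP connections whose local port is the rfio or
    # gsiftp control port (assumed 5001 and 2811 respectively).
    PORTS = {5001: "rfio", 2811: "gsiftp"}
    ESTABLISHED = "01"  # TCP state code in /proc/net/tcp

    counts = dict((name, 0) for name in PORTS.values())
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local_port = int(fields[1].split(":")[1], 16)
            state = fields[3]
            if state == ESTABLISHED and local_port in PORTS:
                counts[PORTS[local_port]] += 1

    for name, count in sorted(counts.items()):
        print("%s connections: %d" % (name, count))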
Elena - can we delete all jobs? Peter – yes, all ATLAS pilot jobs running now are probably stale jobs and will not be used. Also Johannes' jobs.
Peter to send email feedback about STEP09 to ATLAS UK ops.