dteam minutes 2010-10-12

Experiment problems/issues
==========================

LHCb - Jeremy is confused by the variability of the upload problem. Glasgow had a look and Raja said things were OK. Hopefully things will improve at Glasgow anyway after the upgrade.

CMS - Stuart: 'There isn't much news as things are humming along smoothly, though there may be an issue with data access at Brunel, which Raul is looking into.'

ATLAS - Reprocessing at RAL is going OK. Tier-2: problems at QMUL and ECDF - a bad SW release, and the repair job failed at QM/Ed. Should be fixable/fixed. Kashif: lots of analysis jobs failed at Oxford recently - is this our problem? Graeme: no. The Tier-1 will be affected by the reprocessing. September availability figures will be out soon.

Meetings & updates
==================

No ROD news (Daniela).

Tier-1 update: LHCb problems - access to the conditions database, largely fixed now. LHCb Castor upgrade - one outstanding problem: checksumming doesn't work on 32-bit disk servers. Plan B: put a 64-bit OS on them. Graeme: will this be fixed by the time the ATLAS Castor upgrade happens? Gareth: we believe so, but there is a bit more testing to do. Sometimes 'central services' tickets are not going to the right recipient - if you notice problems, let us know.

EGI ops meeting yesterday. One item: the current deadline for the BDII has implications for the LCG-CE.

GDB tomorrow. The agenda is here: http://indico.cern.ch/conferenceDisplay.py?confId=72063 . Graeme is the T2 rep this month.

Escalated tickets: https://gus.fzk.de/download/escalationreports/roc/html/20101011_EscalationReport_ROCs.html

Deployment priorities & timelines: https://twiki.cern.ch/twiki/bin/view/EGEE/LCGprioritiesgLite#07_10_2010

CREAM: UCL will install it as a front end to Legion. QMUL is installing at the moment. Sheffield - in testing. ECDF - in testing. Bristol - soonish. Lancaster - new kit will be behind CREAM.

APEL: any other sites that have installed it? Glasgow: next 2-3 weeks. This is getting more urgent since there is a deadline for RGMA support at the end of the year. RAL is deploying a new APEL node. How easy is it to back out? You copy the DB across, so it shouldn't be too hard, and revert the changes in GOCDB. Most sites have no fixed plans and were unaware of this (new) priority. Alessandra took an afternoon to install it. Hopefully the bugs will be fixed.

ARGUS: Oxford has ARGUS working, Birmingham also has one, and possibly RAL (ALICE). Central Nagios (https://samnag010.cern.ch/nagios/) checks ARGUS/Glexec - only two sites are green. Glasgow are helping to test Glexec for ATLAS.

AOB
===

New CA release: http://grid-deployment.web.cern.ch/grid-deployment/lcg2CAlist.html

Next HEPSYSMAN meeting Monday 22nd November in Birmingham - looking for speakers.

Top-level BDII decision in WLCG. The top-level BDII has become more stable. The intention is to pair Tier-1s and to fail over to the pair. Update to follow.

Tier-2 reports
==============

Any issues?

SouthGrid: the ongoing worry is manpower at sites. Sites are upgrading.

LondonGrid: sites need to meet pledges ASAP.

Northgrid: concern about storage units and the calculation of usage.

Scotgrid: not our best quarter wrt reliability. DPM is not consistent - it defaults to 1024-based (binary) units rather than SI, but publishes in SI. All mixed up - we should be using SI units. A ticket to DPM was suggested - the storage group will follow up. dpns-du has a --si option; there are many ways to get the information out. Also a concern about whether a single snapshot of disk usage is sufficient - should we integrate over time in some way? For ATLAS, DATADISK and MCDISK should be 80% full most of the time; other tokens are buffer spaces.
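As a purely illustrative aside (not DPM code, and the 150 TB figure is invented): the gap between binary (1024-based) and SI (1000-based) reporting is roughly 9% at the terabyte scale, which is enough to confuse capacity accounting if the two conventions get mixed. A minimal Python sketch of the discrepancy:

    # Sketch only: the same byte count expressed two ways, showing why a
    # 1024-based internal default and SI-based publishing disagree.

    def as_si_tb(nbytes):
        """Terabytes (SI, 1000-based), as used when publishing."""
        return nbytes / 1000.0**4

    def as_binary_tib(nbytes):
        """Tebibytes (binary, 1024-based), the reported DPM default."""
        return nbytes / 1024.0**4

    used = 150 * 1000**4                                # a nominal 150 TB used
    print("SI:     %.1f TB" % as_si_tb(used))           # 150.0 TB
    print("Binary: %.1f TiB" % as_binary_tib(used))     # ~136.4 TiB, ~9% lower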
GSTAT is not reporting the number of running jobs correctly if multiple CEs see the same batch system. Is this a sub-cluster issue? Jeremy will look into it.
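For illustration only (the CE names and counts below are invented, and this is not how GSTAT is implemented): if several CEs each publish the running-job count of the same underlying batch system, a naive sum over CEs counts those jobs once per CE, whereas grouping by the shared cluster before summing gives the expected figure. A minimal Python sketch of the suspected over-count:

    # Hypothetical CE records: ce01 and ce02 front the same batch system.
    ce_records = [
        {"ce": "ce01.example.ac.uk", "cluster": "batch.example.ac.uk", "running": 500},
        {"ce": "ce02.example.ac.uk", "cluster": "batch.example.ac.uk", "running": 500},
        {"ce": "ce03.example.ac.uk", "cluster": "other.example.ac.uk", "running": 120},
    ]

    # Summing per CE double-counts the shared batch system.
    naive_total = sum(r["running"] for r in ce_records)        # 1120

    # Keeping one figure per cluster avoids the double count.
    per_cluster = {}
    for r in ce_records:
        per_cluster[r["cluster"]] = r["running"]
    deduplicated_total = sum(per_cluster.values())              # 620

    print(naive_total, deduplicated_total)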