* new middleware should require less manpower
* T2 issues
  - middleware versions and configurations
  - information sources
* T1/T2 issues
  - is quiet time OK or sign of a problem? where?
  - keep needing precise storage requirements
* CMS issues/observations
  - dCache pool selection optimization
  - tape set definitions
  - dccp usage (PNFS mounted on all WN)
  - SRM calls expensive
  - file sizes OK
  - common tape writing not really tested
  - FZK, IN2P3 need some work
  - T1-T1 transfers need work, FNAL failed
  - T1-T2 some work needed
  - T2-T1 ~OK
  - CAF input data access
  - main concerns:
    * multi-activity, multi-VO
    * CSA overlap with CCRC'08-2
* ATLAS issues/observations
  - T1-T1 transfers
  - deleting data (1 Hz achieved at only 1/2 of the sites)
  - SRM v2.2 at T1 ~OK, fast fixes
  - ELOG vs. e-mail vs. GGUS (hotline needed)
  - user analysis not tested
  - no more time for big standalone tests, detector drives resource usage
* ALICE issues/observations
  - file sizes, repeated mounts
  - xrootd progress OK
  - data access CPU/wall-clock ratio
  - dCache: advanced features for xrootd?
  - GSI security for xrootd? May.
  - main concern:
    * storage efficiency
  - ~3 FTEs needed for operations
* LHCb issues/observations
  - DIRAC 3 expected in May
  - failover spaces
  - TURL format problem?
  - reco issues at T1:
    * long queue limits too low
    * stalled jobs
  - restarting gsidcap services
  - awaiting generic pilots
  - CPU time-left tool
  - data access alternative: download to WN, 7 GB/job
  - 6-9 FTEs needed for operations
* GGUS MoU field usage
  - who can set it when?
  - try it out and measure response times
  - critical services list
  - scaling factors
  - incident followup: not in GGUS
* where to look?
  - ELOG <--> GGUS
  - ELOG is extra, not main channel to sites
  - GridView for targets
  - try measuring, else change MoU
  - ELOG for quick experience, feedback
  - GGUS changes timescale ~2 months
  - hook Nagios into GGUS?
* emergency tickets
  - signed e-mail triggers actions
  - ROC must always be copied
  - ATLAS need direct routes to T2 sites
  - all (experienced) users should have that option
  - use CC field with menu listing sites
  - may need more interfacing with site helpdesk systems,
    contrary to ROC/NGI support model

ISSUES TO BE IMPROVED
---------------------
* getting information to sites (T2 in particular)
* more systematic followup on problems
* where to get help
* file sizes
* tape usage
* data removal
* space token definitions
* storage vs. networks
* storage in general
* monitoring
* stability
* user analysis
* requirements mismatch
* middleware versions
* experiments to inform sites about activities and problems
  - problem can be with a site or with the experiment
* daily and (bi)monthly face-to-face meetings are useful
  - faster minutes
  - daily updates, weekly summaries
  - RSS feed useful? no.
  - need better participation in daily calls
  - no new solutions to be invented during data taking
* T2 coordinators
* middleware updates
  - also needed during realistic operating conditions
  - staged rollout through volunteer sites --> "training"
  - need good communication to avoid site mismatches
  - sites need to know if patch successfully deployed elsewhere
  - just-in-time deployment can be seen as success
  - WLCG only needs certain components --> special treatment
  - not to be confused with EGEE/gLite
  - middleware table for T1/T2
  - T2 needs to be up to date w.r.t. gLite
  - collect statistics on update times