StoRM
-----
* blocksize mismatch between GridFTP (hardcoded) and GPFS
* SLC4 brings twice the performance of SLC3
* tuning the number of streams
* CNAF FTS SRM Get timeout 3600 --> 3000
* CMS farm activity saturated the number of slots on disk servers
* GPFS: high random-access latency for the software area
* GPFS problems due to limited hardware and misconfiguration
* better logs needed to distinguish StoRM problems from other problems
* better configuration/admin tools needed

RAL
---
* LHCb RFIO core dumps, not yet understood
* CMS tape mounts for skimming halted production
* tape servers flaky, probably due to an older CASTOR version

CASTOR SRM v2.2
---------------
* DB deadlocks
* too many DB connections; more machines and better configuration needed
* CGSI errors unclear
* SRM stuck in recv(), cured by timeouts in the latest version
* timeouts on stager calls needed
* pinning/GC problem fixed
* logging trail being improved
* Put/Get processing typically takes 1-5 s after authentication
* moving to SL4, 2.1.7 and the new MoU all urgent
* more tests needed to avoid problems in production
* test tool to come with the release

SARA
----
* GSIDCAP server only on the SRM node, due to a bug that will be fixed
* read/write/cache pools separated
* queues for GridFTP and GSIDCAP
* full pools due to orphaned files removed from PNFS
  --> FTS timeouts increased, cron job to clean up
* slow ATLASDATATAPE --> increased the number of movers, added an extra node
* slow staging for LHCb --> more hardware needed
* DIRAC staging small (150 MB) files, bad for the tape system
* SRM reports NEARLINE also for T0D1 when the file is only on a write pool
  --> T0D1 should be made read-write
* space token VOMS checking problem fixed
* GSIDCAP no longer listening on port 22128, not understood
* LFC crashes, fix coming
* ATLAS DDM bugs: failures seen as successes and vice versa
* LHCb: bringOnline not enough to make the status ONLINE, should be fixed
* D1 <--> D0 transition function or pinning?
  --> changeSpaceForFiles not on the roadmap, a PNFS admin command is available
* dCache releases should highlight configuration changes
* patches should not be mixed with new features (was an accident)
* stage tests:
  - 500 + 50 (different tape) 2 GB files
  - bringOnline crashes with 500 files --> use "dccp -P" for now
  - 100 MB/s with the pre-stager, else ~60 MB/s

DPM at GRIF
-----------
* 1.6.10, 64-bit, 100 TB
* 250 MB/s transfers without tuning
* ATLASGRPDISK needs multiple FQANs --> feature expected in September
* XROOTD plugin rpm coming
* advanced monitoring tools by Greig Cowan

Databases
---------
* most critical service
* RAC + DataGuard, downtime << 1%
* old hardware kept on standby during the transition period
* Streams replication: online --> offline, T0 --> T1, OpenLab collaboration
* CMS: Frontier, Squid
* 3D project for sharing policies and procedures
* 24x7, but still best effort
* with more memory, fewer physical reads
* DB usage increase should be at T1, not T0
* DB dashboard for easy monitoring, technology also available to T1
* most applications guided to 1 preferred node each: better cache utilization,
  less intracluster traffic
* locked owner accounts to avoid accidents (e.g. drop table)
* SAM/GridView biggest consumers
* T1: no problems, hardware upgrades foreseen
* power cut:
  - 1 Ethernet switch not on critical power
  - faulty OEM agent scripts prevented automatic startup
* completing migration to 64-bit and 10.2.0.4, first T0, then T1 (3D)
* Streams setup improvements
* reliable, manageable service
* close collaboration between application developers and DBAs

ATLAS DB
--------
* reprocessing launched at the end of May
* need ~1k concurrent Oracle sessions
* some sites not yet OK/tested
* 3D streaming to T1 OK
* DCS (slow control) has the largest volumes
* replication to calibration sites OK
* reprocessing: average DB load OK, bursts limited by capacity
* more tests foreseen
* T1 firewall issues --> use a proxy

CASTOR DB
---------
* background bulk query for sync.
  between stagers, disk servers and the name space:
  - slowed down the name server during backup
  - sync. suspended during backup, DB disks defragmented
* stager_rm slow in certain cases --> fixed by forcing index use via a hint
* deadlocks between concurrent requests --> fix coming
* too many concurrent connections:
  - lowered the number of connections
  - lowered the number of SRM threads
  - split the DB into several RACs
* increase during CCRC'08-2 not large compared to continuous activity

Middleware
----------
* software process operated as usual: updates, priorities
* longstanding fix for job priorities released
* dCache in gLite not at the cutting edge, slightly behind
* lcg-CE Globus marshal daemons security fix
* gLite 3.1 WMS released
* baseline versions defined for services and clients
* GFAL desired pin time bug affected ATLAS reprocessing:
  - fix entering certification
  - no EMT request for a fast track yet
* FTM to be deployed at T1
* VDT bugs:
  - MyProxy linked against the wrong version of Globus
  - Globus proxy chain length limit too low
  - fixes expected shortly
* CREAM: still functional problems, stress tests started
* gLExec security problems affecting CREAM and pilot jobs, fixes expected shortly
* more lcg-CE performance improvements in the pipeline
* VDT 1.10 for SL5 in September, driven by sites
* EMT for short-term planning, TMB for medium term; ALICE not present
* Application Area repository for early access to new client versions
* ATLAS, CMS: SL5 possible in winter
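
The "SRM stuck in recv(), cured by timeouts" item in the CASTOR SRM v2.2 section above is the classic unbounded blocking read: a server thread wedges forever on a silent peer. A minimal, generic Python sketch of the cure (illustrative only; `recv_with_timeout` is a hypothetical helper, not CASTOR or CGSI code):

```python
import socket

def recv_with_timeout(sock: socket.socket, nbytes: int, timeout: float = 5.0) -> bytes:
    """Read up to nbytes from sock, but give up after `timeout` seconds
    instead of blocking forever in recv()."""
    sock.settimeout(timeout)      # bounds every blocking call on this socket
    try:
        return sock.recv(nbytes)  # raises socket.timeout if the peer stays silent
    finally:
        sock.settimeout(None)     # restore blocking mode for other callers
```

A caller that catches `socket.timeout` can retry or fail the request cleanly, freeing the thread instead of leaving the daemon stuck.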