WLCG workshop notes

- Each LHC fill now gives roughly as much data as the whole of 2010. Refinements and adjustments to the computing models. 2011 brings a big increase in resources.
- The T0 is built for peaks, so utilisation averaged over a month looks low.
- LHCb has a 3 kHz trigger rate.
- T0-T1 event sizes and data volumes are larger, but fewer redundant copies means less stress.
- T1 utilisation was high at the end of 2010, when all data were reprocessed. Average utilisation in 2011 is higher than in 2010.
- CMS recorded 0.4M events at 8 TeV before the energy plan changed!
- Distributed analysis on the T2s is very successful and well utilised. Expect resource constraints by the end of the year. Roughly 10% failure rate for grid jobs. CMS runs about 25,000 jobs at a time (around 250,000 per day).
- Popularity project (ATLAS/CMS) to see what is being accessed most, and how; a transition to AOD is visible. (A toy sketch of the counting idea appears at the end of these notes.)
- A number of interesting optimisations: Xrootd, LHCONE, better disk management, cloud components and virtualisation.
- The biggest change in 2011 will be resource constraints; optimisation and choices will be needed.

Qs:
- CMS runs about 200 test jobs per site per day; small resource usage.
- When will whole-node reservation come? Virtual or physical access to all cores of a machine improves memory utilisation through shared data such as geometry (see the copy-on-write sketch at the end of these notes). Tracking nodes rather than cores makes things easier, and also reduces I/O because files are read in less often.
- From the middleware perspective, accounting needs to be sorted out for the transition between single-core and whole-node usage. Draining is inefficient; the switch-over needs to be controlled if not all experiments are ready to use whole nodes.
- T2 workloads (e.g. simulation) are not so memory-limited, and user applications have less obvious advantages, so do the T1s first.
- The transatlantic link was not saturating, so there is probably a bottleneck somewhere. The LHCONE objective is not just more bandwidth but making it more predictable.
- ROOT I/O optimisation for CMS has seen usage drop by a factor of 5 (see the TTreeCache sketch at the end of these notes).

Tier-1 experience
- Issues with CREAM. Concern about divergence between EGI/EMI and WLCG.

Tier-0 experience
- Pile-up is becoming a problem already.
- Preliminary cloud results for ATLAS give 51.1% mean CPU/walltime versus 59.1% on a physical machine. Encouraging.
- The distribution of services was based on the concern that the network would not cope; move away from the MONARC tier model. Why not use a different classification, such as sites doing analysis vs production?

SL6
- The aim is to have this in EMI-2; some packages are being developed now.
- Markus argued that EMI-1 on SL5 is not attractive to sites. SL6 would be more attractive and make more sense given the downtime required. SL5 backports end this year.
- Given the critical time for computing, and the SL6 lifetime versus EMI-2 availability, would it be better to skip SL6!? Partition off the WNs? They are more closely tied to experiment testing.
- New hardware issues. A priority plan for the new OS is needed.

- The batch system dates from 2001 (direct submission to Globus). Need to look at rising needs in areas like virtualisation and clouds.
- Clouds: easier for VOs to provide images (OS, libraries), and clouds improve security… but the resource-sharing part is still missing. Dealing with 64 cores etc. is difficult if an instantiation lasts as long as the job needs it.
- All the experiments are moving towards dynamic data placement (see the profiling of real ROOT jobs to look at actual access and optimise the I/O profile on storage). Access patterns for whole nodes are currently unknown.
- CMS will replace JobRobot with HC (HammerCloud) soon.
- Sending proxies to commercial clouds may be an issue since they are not bound by our policies.
- SLURM batch system.
- perfSONAR dashboard and architecture: what do we need here?
- ATLAS tracking broker-off jobs.
- Users want more grid stability.
- CMS file popularity information.
- CMS SAM test target: 80% for T2s.
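A toy sketch of the popularity-counting idea mentioned above. This is not the actual ATLAS/CMS popularity service, which collects access records from the experiments' data-management layers; the log name access.log and the one-dataset-per-line format are invented here purely for illustration.

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // Tally accesses per dataset name, one access per input line.
        std::map<std::string, long> hits;
        std::ifstream log("access.log");   // hypothetical access log
        std::string dataset;
        while (std::getline(log, dataset)) ++hits[dataset];

        // Rank datasets by access count, most popular first.
        std::vector<std::pair<std::string, long>> ranked(hits.begin(), hits.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const auto &a, const auto &b) { return a.second > b.second; });

        for (const auto &[name, n] : ranked)
            std::cout << n << "  " << name << "\n";
    }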
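A minimal sketch of why whole-node scheduling helps memory, assuming the usual copy-on-write fork model on Linux: a parent loads large read-only data once (standing in for detector geometry or conditions), then forks one worker per core, and the workers share the physical pages instead of each holding a private copy. The sizes and worker count here are illustrative, not experiment values.

    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Stand-in for large read-only geometry/conditions data (~512 MB here).
        std::vector<double> geometry(1 << 26, 1.0);

        const int nWorkers = 8;                      // e.g. one per core on the node
        for (int w = 0; w < nWorkers; ++w) {
            if (fork() == 0) {                       // child process = one worker
                double sum = 0;
                for (double x : geometry) sum += x;  // read-only: pages stay shared
                std::printf("worker %d done (checksum %g)\n", w, sum);
                _exit(0);                            // exit child without parent cleanup
            }
        }
        for (int w = 0; w < nWorkers; ++w) wait(nullptr);  // reap all workers
    }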
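The factor-of-5 ROOT I/O gain is consistent with the kind of read-ahead caching ROOT provides through TTreeCache, which coalesces many small branch reads into a few large ones. A minimal sketch follows; the file name data.root and tree name Events are assumed for illustration and are not CMS's actual configuration.

    #include "TFile.h"
    #include "TTree.h"

    void read_with_cache() {
        TFile *f = TFile::Open("data.root");      // hypothetical input file
        TTree *tree = nullptr;
        f->GetObject("Events", tree);             // hypothetical tree name

        tree->SetCacheSize(30 * 1024 * 1024);     // 30 MB TTreeCache
        tree->AddBranchToCache("*", true);        // read-ahead for all used branches

        const Long64_t n = tree->GetEntries();
        for (Long64_t i = 0; i < n; ++i)
            tree->GetEntry(i);                    // reads now served from the cache

        tree->PrintCacheStats();                  // reports reads saved by the cache
        f->Close();
    }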