UKI Monthly Operations Meeting (TB-SUPPORT)

Europe/London
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the monthly UKI meeting. The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The UK phone bridge is on +44 (0)161 306 6802. The CERN one is: +41 22 76 71400. The UK one tends to produce background noise and may have to be muted! The phone bridge ID is 1010033 with code: 4880.
- If the CERN phone connection does not work please try Caltech +1 626 395 2112 or DESY +49 40 8998 1346.
- For more information on the UK phone bridge: http://www.ja.net/services/video/agsc/services/evotelephonebridge.html
    • 10:30 10:50
      Experiment problems/issues & STEP09 20m
      - including the latest on preparations for STEP09
      CMS: Bad week; CEs and prod role accounting in Castor.
      LHCb: LSF issues over job slots. UK at 43% in DIRAC3 accounting (pledge is 12-14%). Job slot usage expected to increase next month after an application code fix.
      ATLAS:
      - User numbers increasing (see http://scotgrid.blogspot.com/2009/05/oh-my-gosh-its-users.html)
      - Some issues with MCDISK
      Resetting maui fairshare stats before STEP09?
      - Other VO issues
      Camont update - after the presentation last month there were several questions for the VO to answer, which they have....
    • 10:50 11:00
      Pilot roles 10m
      - This is basically a reminder! These roles are to be used shortly....
      - Per site: status for ATLAS; status for LHCb.
      - For the ScotGrid notes on doing this: https://www.scotgrid.ac.uk/wiki/index.php/Glasgow_Enabling_the_LHCb/ATLAS_pilot_roles (a quick proxy check is sketched below).
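      As a quick sanity check once the roles are enabled, the sketch below (Python, illustrative only) confirms that a test proxy actually carries a pilot role before site-side mapping is debugged. It assumes voms-proxy-info is on the PATH; the FQANs listed are examples, so adjust them to the VOs your site supports.

      #!/usr/bin/env python
      # Minimal sketch: check whether the current VOMS proxy carries a pilot role.
      # Assumes voms-proxy-info (VOMS clients) is on the PATH; the FQANs below are
      # illustrative - adjust for the VOs your site supports.
      import subprocess
      import sys

      PILOT_FQANS = ["/atlas/Role=pilot", "/lhcb/Role=pilot"]

      fqans = [line.strip() for line in
               subprocess.check_output(["voms-proxy-info", "--fqan"],
                                       universal_newlines=True).splitlines()
               if line.strip()]

      found = [f for f in PILOT_FQANS if any(x.startswith(f) for x in fqans)]
      if found:
          print("Proxy carries pilot role(s): %s" % ", ".join(found))
      else:
          print("No pilot role among FQANs: %s" % "; ".join(fqans))
          sys.exit(1)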
    • 11:00 11:05
      Publishing DNs 5m
      - What is the status of this across the sites?
      - Any new (or old unresolved) problems?
      For a useful overview of the procedure see the blog entry by Alessandra: http://northgrid-tech.blogspot.com/2009/05/howto-publish-users-dns-accounting.html (a quick local check is sketched after this item).
      Sites Publishing UserDNS 20/05/2009
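      One way for a site to check locally that DNs are actually attached to its records, rather than waiting on the portal, is to look at what the APEL publisher is about to ship. The sketch below is illustrative only: the table and column names (LcgRecords, GlobalUserName) and connection details are assumptions about the local APEL MySQL schema and may not match your version - Alessandra's howto above is the authoritative procedure.

      #!/usr/bin/env python
      # Illustrative sketch: count how many local accounting records carry a user
      # DN. LcgRecords/GlobalUserName and the connection details are assumptions -
      # verify them against your own APEL installation before trusting the result.
      import MySQLdb

      conn = MySQLdb.connect(host="localhost", user="accounting",
                             passwd="CHANGE_ME", db="accounting")
      cur = conn.cursor()
      cur.execute("SELECT COUNT(*), "
                  "SUM(GlobalUserName IS NOT NULL AND GlobalUserName != '') "
                  "FROM LcgRecords")
      total, with_dn = cur.fetchone()
      print("Records: %s, with a DN: %s" % (total, with_dn or 0))
      conn.close()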
    • 11:05 11:15
      GDB discussions 10m
      - There was a GDB last Wednesday: http://indico.cern.ch/conferenceDisplay.py?confId=45475

      Introduction
      **********
      - Subjects for future meetings: Batch systems; T2 storage; Virtualisation; Site management ...
      - Are there any suggestions for topics that should be visited at the GDB? SL5?
      - Events/Meetings: SRM workshop for developers this week; HEPiX 25th-29th May; STEP09 in June; STEP09 review 9-10th July (sites to give feedback).
      - Pre-GDB was on NGI progress.
      - Installed capacity: automated monitoring is about to become a reality, so please check what your site is publishing (logical/physical CPUs) in the GridMap view. A question was asked about how to publish the number of CPUs for sub-clusters; no real conclusion.
      - Possibility of a WLCG technical forum (SB).

      GGUS structure and workflow changes
      ******************************
      - Sites can now all be directly notified.
      - Interfaces with other ticketing systems (region/experiment) are being improved.

      Storage
      ******
      - LHCb tests running analysis jobs on large amounts of data, to understand current limitations for user job data access. Eg. 600 jobs, data in 100 files (200MB) with 500 events each; a ROOT-based application opens files and reads events from SEs. dCache tuning was required but an old version was used. Promising results, but either performance or file-open time problems were seen at most T1s.
      - SRM usage: measured SRM utilisation patterns at CERN and the RAL T1 - looked at the number of polling requests, failures etc. The main SRM client at CERN is FTS. ATLAS runs 5 requests/s, which is 5x more than CMS and LHCb.

      SL5
      ***
      Experiment requirements
      - Existing SL4 (gcc-3.4) binaries on SL5:
      -- Incompatibilities found with the SELinux module (impacts ROOT, Oracle client and CERNLIB).
      -- Identified compatibility libraries (but do the experiments distribute them, or LCG/GD via a meta-RPM?).
      -- Working to make gcc-3.4 available on SLC5 systems.
      - ATLAS: no production release yet compatible with SL5, but can run with SELinux partially disabled (with compatibility libs installed). 15.2.0 expected in production by September.
      - CMS: can run on SL5 but does not want sites fiddling with SELinux!
      - LHCb: can run on SL5 for analysis of existing MC data, but older releases will not run. Testing and distribution of libs TBC.
      - ALICE: no problem.
      Native builds for SL5 with gcc-4.3 (skipping 4.1): much effort in porting C++ code; external libraries are an issue. Needs a unified installation approach, or shipping the gcc-4.3 compiler or libraries.
      - ATLAS: native build soon, deployed around August. Will produce SL4 and SL5 binaries. Inclined to retain SLC4/gcc-3.4/32-bit as the primary platform until after the 09-10 physics run.
      - CMS: finished native port to 64-bit. Builds; doing validation. Want to switch binaries in one go.
      - LHCb: port not yet done. Plan to use SL5/gcc-4.3/64-bit for real data and corresponding MC.
      - ALICE: no problem.
      => Slow migration & calls for virtualization!

      Experiment data flows
      *****************
      - Presented to show rates for various tasks, with T1 breakdowns.
      - LHCb: gave rates for T1s.
      - CMS: gave detailed rates per T1.
      - ATLAS: detailed T1 and T2 requirements including pilot shares. Request: "feedback on how well the balance between activities works during STEP09: jobs run/queued, CPU efficiencies each day". Also gives a spacetoken summary (slide 25).
      - Memory use with 64-bit: Graeme pointed out that sites should not kill jobs based on vmem, as in 64-bit there is a large overhead - each process has a memory footprint of about 50MB vs 5MB for 32-bit. Believed that torque kills on the basis of the memory consumption of the process tree, not the payload (see the sketch below).
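      To make the vmem point concrete: the payload's real footprint is better approximated by summing resident memory over the job's process tree, while per-process virtual size carries the extra 64-bit runtime overhead. The sketch below (Python, illustrative only) walks /proc on a Linux worker node given the PID at the top of a job's process tree; it is not how torque or maui themselves account memory.

      #!/usr/bin/env python
      # Illustrative sketch: compare summed VmSize (virtual) with summed VmRSS
      # (resident) over a process tree, read from /proc on Linux. Assumes the PID
      # at the top of the job's tree is given on the command line; not how
      # torque/maui account memory, just shows why vmem-based limits overestimate
      # 64-bit payloads.
      import os
      import sys

      def children(pid):
          """Direct child PIDs of `pid`, found by scanning /proc/*/status PPid."""
          kids = []
          for entry in os.listdir("/proc"):
              if not entry.isdigit():
                  continue
              try:
                  with open("/proc/%s/status" % entry) as f:
                      for line in f:
                          if line.startswith("PPid:") and int(line.split()[1]) == pid:
                              kids.append(int(entry))
              except IOError:
                  pass  # process exited while we were scanning
          return kids

      def mem_kb(pid, field):
          """Read VmSize or VmRSS (kB) for one process, 0 if unavailable."""
          try:
              with open("/proc/%d/status" % pid) as f:
                  for line in f:
                      if line.startswith(field + ":"):
                          return int(line.split()[1])
          except IOError:
              pass
          return 0

      def tree(pid):
          """All PIDs in the process tree rooted at `pid`."""
          pids, todo = [], [pid]
          while todo:
              p = todo.pop()
              pids.append(p)
              todo.extend(children(p))
          return pids

      root = int(sys.argv[1])
      pids = tree(root)
      vsz = sum(mem_kb(p, "VmSize") for p in pids)
      rss = sum(mem_kb(p, "VmRSS") for p in pids)
      print("%d processes: VmSize %d kB total, VmRSS %d kB total" % (len(pids), vsz, rss))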
      Pilot jobs
      *******
      - SCAS/glexec pilots running (inc. Lancaster), but there are concerns and CREAM issues. The Lancaster deployment is ahead of others, running using the lcmaps plugins. The SCAS plugin is not available for 64-bit (ETICS issue). BNL is testing usage with the ATLAS pilot system this week. (A glexec hand-off sketch follows at the end of this item.)
      - The WLCG management board has asked T1s and larger T2s to deploy SCAS/glexec for testing by experiments. Not ready to deploy, so not asking T1s and large T2s to deploy yet.
      - Some progress on ATLAS framework issues.
      - VDT asked to provide a MyProxy server built with support for VOMS attributes.

      CREAM
      *****
      - Status vs transition criteria: https://twiki.cern.ch/twiki/bin/view/LCG/LCGCEtoCREAMCETransition
      - CMS just started testing.
      - Request for more sites to provide CREAM CEs for testing.
      - The MB plan has 50+ sites providing CREAM by 1st October (currently about 14 sites have it).
      [Dug recently wrote a useful summary about the Glasgow installation: http://scotgrid.blogspot.com/2009/05/cream-in-action-local-users-glexec.html]
      For more on installation see: https://www.scotgrid.ac.uk/wiki/index.php/Glasgow_GLite_Cream_CE_installation and http://igrelease.forge.cnaf.infn.it/doku.php?id=doc:guides:devel:install-cream31-devel

      A common file access protocol
      ***********************
      - Proposal to have XROOT as a common protocol. Consolidation would come after the 09-10 run.
      - The response was not great - it may work for CASTOR, and the security aspect there, but the experiments just want anything that does not introduce a large overhead!
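      For sites new to SCAS/glexec, the mechanics on the pilot side are small: the pilot re-executes the user payload through glexec, which maps the payload owner's proxy (authorised via SCAS/lcmaps) to a local account. The sketch below is a minimal illustration only; the glexec path and the GLEXEC_CLIENT_CERT / GLEXEC_SOURCE_PROXY environment variables are the commonly documented interface, but check your own glexec and SCAS configuration rather than relying on this.

      #!/usr/bin/env python
      # Minimal, illustrative sketch of a pilot handing a payload to glexec.
      # Assumes glexec is installed under GLEXEC_LOCATION, the pilot has already
      # fetched the payload owner's proxy, and site lcmaps/SCAS configuration is
      # in place. Not a drop-in for any experiment pilot framework.
      import os
      import subprocess

      def run_under_glexec(payload_cmd, payload_proxy):
          """Run payload_cmd under the identity glexec maps from payload_proxy."""
          glexec = os.path.join(os.environ.get("GLEXEC_LOCATION", "/opt/glite"),
                                "sbin", "glexec")
          env = dict(os.environ)
          env["GLEXEC_CLIENT_CERT"] = payload_proxy   # credential to map from
          env["GLEXEC_SOURCE_PROXY"] = payload_proxy  # proxy copied to the target
          return subprocess.call([glexec] + payload_cmd, env=env)

      rc = run_under_glexec(["/bin/sh", "-c", "id; echo payload ran"],
                            "/tmp/payload_proxy.pem")
      print("glexec returned %d" % rc)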
    • 11:15 11:20
      ROC/WLCG stuff 5m
      ROC update
      ***************
      - The move to regional monitoring and support happens on 15th June.
      - Implications... COD looks at UKI only; set up shifts/1st-line support.
      - Virtual discussion room via skype/jabber? The concept is of a virtual control/discussion room. How many sites still prohibit the use of Skype?

      T1 news
      **********
      - The T1 has now confirmed its timetable for the new building move (http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/).
      -- This means a period of about a week without the MyProxy server and WMSes at RAL. We need to get UIs pointing at the IC/Glasgow instances of the WMS and ensure additional VOs are enabled. There is a possibility to use the GridIreland MyProxy, but we also need to consider setting up another outside RAL. Pointing at the CERN services is another possibility. We need to review and disseminate decisions on this in the next few weeks.
      - T1 team update? Network outage on the 26th. The top-level BDII is seeing more than 14k connections per site (4 conn/s per site); MH will look further. (A sketch for querying the top-level BDII follows below.)
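      While on the subject of the top-level BDII, it is also the quickest place for a site to check what it is publishing (the installed-capacity point from the GDB above). The sketch below is illustrative only: lcg-bdii.cern.ch is used as an example top-level BDII and the filter should be narrowed to your own CE; the Glue 1.3 attributes queried are standard, but check what your information system actually exports.

      #!/usr/bin/env python
      # Illustrative sketch: ask a top-level BDII for the logical/physical CPU
      # numbers a sub-cluster publishes. Host and filter are examples only.
      import subprocess

      BDII = "ldap://lcg-bdii.cern.ch:2170"        # example top-level BDII
      FILTER = "(GlueSubClusterUniqueID=*)"        # narrow this to your own CE

      out = subprocess.check_output(
          ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid", FILTER,
           "GlueSubClusterUniqueID", "GlueSubClusterLogicalCPUs",
           "GlueSubClusterPhysicalCPUs"],
          universal_newlines=True)
      print(out)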
    • 11:20 11:25
      Site issues 5m
      The EGEE April availability and reliability report was published recently, showing UKI figures of 94% availability and 96% reliability. Many sites were at 100% during this period. The PMB would like to congratulate all site admins for the improvements seen.
      - Quick look at monitoring and accounting status.
      - Accounting problems noted for:
      -- IC-LeSC
      -- UCL-HEP
      -- MAN-HEP
    • 11:25 11:30
      AOB 5m
      - The next HEPSYSMAN is 30th June - 1st July.
      - The ssh incident is ongoing. Please always follow up on IP request checks. (Only some sites reply that they have checked, so the assumption is that the others checked and found nothing, but it helps to know that the message arrived.) Mingchao is away at the moment, so Alessandra and I will coordinate UKI responses.