UKI Monthly Operations Meeting (TB-SUPPORT)

Europe/Zurich
EVO

Description
Monthly review and discussion meeting for those involved with GridPP deployment and operations. To join via EVO go to http://evo.caltech.edu. To join by phone call +41 22 76 71400. The phone bridge ID is 357746 and the code is 4880.
TB-Support minutes 22nd May 2008

Attendees
=========
Derek Ross
Matthew Doidge
Brian Davies
Gianfranco Sciacca
Stephen Childs
Sam Skipsey
Peter Love
UCL-HEP - Ben Waugh, Clare Gryce
Jeremy Coles
Elena Korolkova
Raja Nandakumar
Durham - Phil Roffe, David Ambrose-Griffith
Graeme Stewart
Duncan Rand
Andrew Elwell
Mike Kenyon
Greig Cowan
Rob Fay
David Colling
Alessandra Forti
Pete Gronbech
Ewan Mcmahon
Chris Brew
Yves Coppens
Stephen Burke
John Bland

CCRC Progress and Plans
=======================

LHCb
----
CCRC progress - transferring from Pit to T0 to T1 - working fine. Reco and stripping also working fine, everything going smoothly.
Problems with LHCb code; rfio issue at RAL at the moment.
DIRAC 3 becomes baseline for production. Not yet clear when production will be restarted, but expected around mid June; little going on at T2s until then.
SAM tests are fine, UK sites are fine. The UK CA problem meant that the tests had to change to a different user certificate.
Glasgow failing software installation tests - not run since 16th - known issue, being followed up by Raja.

Atlas
-----
1st week - Throughput test between T0-T1, RAL did well.
2nd week - T1-T1 test, higher data rates than normal, again no problems with RAL.
3rd week - over w/e M7 cosmic run - data from T0-T1, no ESD to T2; srm 2 instance lost but have caught up.
Reprocessed FDR1 data being distributed to srmv1 T2 endpoints.
Functional test with random data T0-T1-T2, held up by CA problems; T1 slow to update CA on disk servers. Testing out to 8-10 T2s. Birmingham problems but now transferring data to Cambridge. Oxford and RHUL srms hanging internally.
4th week - not clear yet; was supposed to be contingency, may be a more aggressive throughput test to T2s, details circulated.
FDR2 week 1 June - need latest Atlas software installed, Athena Release 14. Physicists keen to analyse data - increased use of LAN (dcap, rfio) protocols, need to make sure these work.
GangaRobot running against old Atlas releases which have been removed from most sites.
Space Tokens are essential; only UK site without is Manchester, due to known dCache issues.
Trinity now in Atlas in Dutch cloud.
Release 14 not installed at 5 or so sites.

CMS
---
Transfers had been going well, everything on schedule. Believed to be connected to CA problems. May have just been fixed however.
Plans beyond May - full CSA challenge - data from T0, skimming, analysis. 70% of 2008 figures, aiming for 100%.

Alice
-----
Birmingham trying to get xrootd on DPM working.

SAM tests
---------
UK looks good on SAM.
QMUL now passing; Cambridge having problems on CE, possibly related to recent storage upgrade.
Steve Lloyd will generate plots to allow sites to comment; will be uploaded to wiki.

Steve Lloyd's tests
-------------------
In general looking better, but IC, RHUL, Manchester and Cambridge are having issues.

Storage Availability tests
--------------------------
Lancaster moved to DPM, unstable due to unstable disk server.
RHUL looking poor, but not understood yet; may be averaging between old and new SEs.

EGEE Accounting
---------------
Publishing behind for QMUL - new CE.
Two sites in Ireland - 1st not in EGEE, Trinity stopped publishing in middle of month.
RHUL, UCL-HEP

Project Updates and News
========================
3.0U23 for Production.
SL4 WMS expected on the 29th May.
TCG becomes Technical Management Board, largely the same roles.
T2 partners recruiting extra person to work alongside T2 coordinator; more regional monitoring.

WLCG Deployment Board
----------------------
SSC3
Providing sub-clusters
SAM tests
Storage
Tickets directly to sites
EGI for WLCG

Discussion
==========

Debian SSL problem
------------------
Mingchao will follow up tomorrow to confirm that sites are checking keys.

UIs at sites
------------
T2 coordinators to monitor set up of new UIs.

Use and support of new VOs
--------------------------
UK sites to support GridPP vo - for testing, intended to be used in upcoming Security Challenge.
camont and supernemo asking for support.

AOB
===

Grid Service Monitoring in Nagios
---------------------------------
Set up in ScotGrid.
Nagios configurator tool, to be incorporated into yaim.
http://www.physics.gla.ac.uk/~aelwell/tier2-nagios.pdf

WN Management
-------------
RHUL using rsync.
Glasgow not using image based, using cfengine; Oxford similar.

HEPSYSMAN
---------
2 day meeting included site reports + CCRC + Nagios regional monitoring.

AOB
---
Greig: DPM admin toolkit https://www.gridpp.ac.uk/wiki/DPM-admin-tools, rpm available at system managers repo, feedback welcome.

Chat window
===========
[10:29:57] Jeremy Coles Derek is taking minutes today.
[10:30:16] Jeremy Coles Who is present at UCL-HEP and IPPP1 Durham?
[10:32:16] UCL HEP UCL-HEP: GIanfranco and Ben; Clare said she would join us for UCL-CENTRAL
[10:33:01] Jeremy Coles Thanks
[10:33:02] IPPP1 Durham ippp1: Phil Roffe and David Ambrose-Griffith
[10:33:29] Jeremy Coles For those just joining we are starting with the second agenda item.
[10:37:57] Greig Cowan i can't get any audio
[10:38:08] Greig Cowan which panda server are you all on?
[10:38:13] UCL HEP Clare Gryce (for UCL-CENTRAL) has joined us at UCL-HEP
[10:38:23] Jeremy Coles Ukerna1_uk
[10:38:50] Andrew Elwell seems OK for audio on CERNext_CH
[10:39:04] Peter Love Jeremy, for AOB, can we get a site-survey about WN management? We're finding consistency problems are a headache, wondering if anyone has a single 'image-based' system.
[10:39:37] Greig Cowan nope, not working. will try logging out and back in again...
[10:40:26] Jeremy Coles Ok - WN man added to AOB
[10:44:55] Jeremy Coles Greig - if you see this you could try the phone bridge. See agenda for details.
[10:47:24] Andrew Elwell Greig is rebooting his laptop
[11:08:28] Derek Ross apologies, evo grey'd out
[11:11:06] David Colling I have to do other things, but will be connected and so if I am needed Duncan can find me quickly
[11:11:34] Jeremy Coles Thanks for joining
[11:24:19] Mike Kenyon left
[11:26:18] Duncan Rand left
[11:26:25] Duncan Rand joined
[11:27:37] Duncan Rand left
[11:29:34] Duncan Rand joined
[11:34:31] Andrew Elwell http://www.physics.gla.ac.uk/~aelwell/tier2-nagios.pdf
[11:38:49] Andrew Elwell well volteered for southgrid then Pete
[11:39:11] Greig Cowan https://www.gridpp.ac.uk/wiki/DPM-admin-tools
[11:43:46] Alessandra Forti http://www.sysadmin.hep.ac.uk/rpms
[11:43:59] Alessandra Forti http://www.sysadmin.hep.ac.uk/svn
[11:44:04] Alessandra Forti http://www.sysadmin.hep.ac.uk/wiki
[11:44:24] Alessandra Forti they have all you ask. it is really tiring to reinvent the wheel every year.
    • 10:30 10:40
      Site issues 10m
      - Look back at SAM/SL results for availability and reliability
      -- For SAM figures refer to http://pprc.qmul.ac.uk/~lloyd/gridpp/samtest.html
      -- For historical plots (will be moved to wiki for comment within the week) look at the last column here: http://pprc.qmul.ac.uk/~lloyd/gridpp/samplots.html
      -- For Steve's ATLAS tests: http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html
      -- Views of storage availability: http://www.gridpp.ac.uk/wiki/GridPP_storage_availability_monitoring
      - Accounting. To check whether site publishing is up-to-date visit: http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
        Select UKI from tree on left. Select from 2008 month 5. Select "Show data for site" and hit refresh. Scroll down for the publishing charts.
      -- Publishing appears to be behind for QMUL; UCL-CENTRAL (known); BHAM; giHECie.
      - General discussion on current site problems
      WLCG April T2 reliability report
    • 10:40 11:00
      CCRC progress and plans 20m
      Review of what has been happening and what happens next. For the high-level view from the daily WLCG ops meetings see https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08DailyMeetingsWeek080519
      LHCb:
      - Jobs review: http://lxarda05.cern.ch/dashboard/request.py/jobsummary?sortby=site
      CMS:
      - Site view for CMS: http://dashb-cms-sv.cern.ch/dashboard/request.py/siteview
      - Production PhEDEx transfer quality: http://cmsdoc.cern.ch/cms/aprom/phedex/prod/Activity::QualityPlots?view=global
      ATLAS:
      - Jobs running: http://gridinfo.triumf.ca/panglia/week-summary.php?SIZE=large
      - DDM: http://dashb-atlas-data.cern.ch/dashboard/request.py/site (click on RAL cloud on right hand side)
      ALICE:
      - Recently allocated more effort from the T1.
    • 11:05 11:15
      Project updates & news 10m
      EGEE ROC & ops (middleware)
      ********************************
      - See middleware note attached to agenda
      - EGEE is now in EGEEIII
      -- For the UKI this means some changes including more ROC activity being done within the T2s. Partners recruiting....
      -- UKI still using GGUS directly (any problems to note?)
      WLCG Deployment Board
      *****************************
      Last meeting was last Wednesday. The agenda can be found here: http://indico.cern.ch/conferenceDisplay.py?confId=20229
      Discussions covered:
      - SSC3 and changes in EGEE/OSG/WLCG policies
      - Developments for WNs (sub-clusters)
      - The SAM test framework and plans for the future, with focus on experiment SAM tests (what is being done and what will be done)
      - Storage (especially SRM2.2 status)
      - Routing tickets directly to sites
      - What does an EGI mean for WLCG
      - How things are going so far in the May CCRC
      gLite release news
    • 11:15 11:25
      Discussion 10m
      - Further discussion on topics of interest
      -- Perhaps Debian SSL security problem?
      -- UIs at sites
      -- Use of and support for new VOs (are the regional VOs now all approved and can we migrate people from gridpp?)
    • 11:25 11:30
      AOB 5m
      - Please could all GridPP sites enable the gridpp VO. This is needed for an upcoming challenge.
      - Grid Service monitoring in Nagios. Deployed regionally for ScotGrid. Planning for other regions starting. For site monitoring using the project tools see https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNcg
      - WN management (PL)
      - HEPSYSMAN