GridPP PMB
****************
Met last week (as I think you know!). The agenda covered most areas that have been of recent interest:
0) Tier-1 site review (TD)
1) GridPP3 project planning (SP)
2) EGI/NGI plans (RM)
3) Disaster Planning (SB) + network resilience (PC to circulate update)
4) GridPP3 MOU (TD)
5) CASTOR status / CSA07 summary (DN +)
6) Middleware issues - R-GMA status and Network area plans (RM)
7) Tier-2 Hardware Allocation result/process (SL)
8) Tier-3 (RJ)
GridPP DTEAM
******************
- Recent input from Graeme, who has been getting involved with ATLAS production work. The latest on Panda: "Would like UK to be moved across to complete Panda production by start of December. Rest of EGEE cloud will follow."
- Stephen has kept us informed on ATLAS use of CASTOR. The latest release has gone well. There is now a plan to close dCache in 6 months' time.
- GridPP Tier-2 hardware allocations
- The biomed user issue. The activity has stopped, but a new way of doing things properly is being sought. [Question to geant4-supporting sites]
- Occasional APEL issues. Are there any at the moment?
EGEE/OSG/WLCG ops & ROC managers'
************************************************
- SpecInt2000 "How To" from the HEPiX working group [http://tinyurl.com/2of7vm]
-- "The SPEC CPU2000 benchmark suite has been retired by SPEC and replaced by its successor CPU2006. The CPU2000 benchmarks are, however, still widely used within the HEP community... A benchmarking working group, launched at HEPiX Fall 2006 and run by Helge Meinhard (CERN), is currently working on a strategy how to move away from CPU2000 to a more recent benchmark." - useful reference page http://hepix.caspur.it/processors/.
- A response to our request 14A: "Provide administration tools such as add/remove/suspend a VO, a user, add/remove/close/drain a queue, close a site (on site BDII), close a storage area". Claudio Grandi remarked:
1. "Start/stop are available for all services. Misbehaving
commands are bugs: submit a bug to Savannah if you
think the start or stop of a service is not doing what
expected (e.g. processes left behind, etc…)"
2. "Missing features are better identified by clients. We
propose to form a group within SA1 with the aim of
developing a service management interface for all gLite
services."
- A ROC-Site Service Level Description (SLD) document is almost final: https://edms.cern.ch/document/860386/0.5. It is similar to the MoU but with fewer constraints, e.g. minimum site availability 70%; maximum time to acknowledge GGUS tickets 2 hours; maximum time to resolve GGUS incidents 5 working days.
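As a quick back-of-the-envelope check (my arithmetic, not the document's), 70% availability over a 30-day month still allows a substantial amount of downtime:

    # What a 70% minimum availability permits over a 30-day month.
    hours_in_month = 30 * 24                      # 720 hours
    max_downtime = hours_in_month * (1 - 0.70)    # 216 hours
    print("Allowed downtime: %.0f hours (~%.0f days)" % (max_downtime, max_downtime / 24))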
- ATLAS VO Views problems (any left?)
WLCG GDB/MB
********************
- Attempting to find a way out of the long-standing deadlock on pilot jobs/glexec. Take a look at John's summary from the MB: http://tinyurl.com/2wt3t6. "WLCG sites must allow job submission by the LHC VOs using pilot jobs that submit work on behalf of other users."
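For anyone who has not looked at glexec itself, the sketch below shows roughly how a pilot could hand a payload to glexec so it runs under the payload owner's identity rather than the pilot's. The glexec path and environment variable names are my assumptions from memory of the glexec documentation, so treat this as an illustration rather than a recipe.

    # Hypothetical pilot-side sketch: run a user payload via glexec so it
    # executes under the payload owner's identity rather than the pilot's.
    # The glexec path and environment variable names are assumptions.
    import os
    import subprocess

    GLEXEC = "/usr/sbin/glexec"                  # assumed install location
    user_proxy = "/tmp/payload_user_proxy.pem"   # proxy delegated by the payload owner

    env = os.environ.copy()
    env["GLEXEC_CLIENT_CERT"] = user_proxy       # identity glexec should switch to
    env["GLEXEC_SOURCE_PROXY"] = user_proxy      # proxy made available to the payload

    status = subprocess.call([GLEXEC, "/bin/sh", "payload_wrapper.sh"], env=env)
    print("payload exit status: %d" % status)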
- The short-term work to get job priorities working is "ongoing"
- The status of the middleware is best summarised in Markus's talk to last week's LHCC Comprehensive Review: http://tinyurl.com/2qytlm (the full meeting agenda may also be of interest: http://indico.cern.ch/conferenceDisplay.py?confId=22243). The talk also gives a good summary of the current build, configuration and test process.
32-bit:
-- LCG-CE now ported (with Torque) to SL4 + VDT 1.6 (released?)
-- CREAM-CE - expect certification to start January 2008
-- WMS/LB gLite 3.1 SL4 - in testing (IC)
-- BDII on SL4 PPS this week
-- DPM & LFC - internally tested but configuration work ongoing
-- gLite-SE dCache - ready for certification. Is the 32-bit version needed?
64-bit:
-- Priorities - WN (in runtime testing), Torque_client, DPM_disk & UI
- Migration to SL4 will be complete in early 2008. In parallel, porting to SL5 will start.