Minutes of storage phone conference 23 Aug 2006

Present:
  Edinburgh: Greig
  RAL Tier 1: Derek
  Glasgow: Jamie
  DESY: Owen
  RAL Storage: Jens (chair+mins)

Apologies:
  Durham: Mark
  Glasgow: Graeme
  RAL Storage: Jiri

0. Review of actions (see below)

1. Monitoring and Accounting update

1.0. Metrics

Metrics should ideally be, er, metered via an accounting system and
collected centrally because (a) we can watch the progress on a single
visualisation page, and (b) we don't have to ask each site for their
numbers every time we write a report.

We currently only do this via the GLUE schema, and the storage schema
does not account for all the things that we wish to measure. One option
is to use R-GMA to publish the data. Another is to implement an
extended schema for the BDII. R-GMA would probably be easier.

1.1. Accounting

At the moment, we are gathering the BDII storage used/available data
via Greig's R-GMA, and storing it in the GOC DB. Dave is back from hols
and working on the visualisation and should have something this week.
This is a high PHB-visibility thing, so it would be good to sort it out.

We discussed whether we could monitor all of LCG; at the moment ~200
BDII queries every hour would probably overload Greig's R-GMA server,
but perhaps they can be spread evenly at ~3-4 requests per minute.

Related to that, CASTOR is being counted but the data is currently, if
not duff, then at least not really accurate. CASTOR accounting work is
ongoing; it needs accurate tape accounting, and then accounting per
service class (see below).

It would be useful to use some monitoring data (e.g. uptime) as an
accounting metric.

1.2. Monitoring

Monitoring work ongoing, see Wiki pages:
http://www.gridpp.ac.uk/wiki/MonAMI_DPM_plugin
http://www.gridpp.ac.uk/wiki/MonAMI_dCache_plugin

Examples of what we could monitor:

A  Implementation specific things and alerts
   E.g., daemons dying, disks filling up, certificates expiring, pool
   nodes going AWOL, networks going down, int'l transfer rates.

B  Metric type stuff
   Space used/avail, number of requests processed, number of users,
   active users, files per VO, transfer rates on ext'l interfaces,
   overall service availability (ICMP ping/service ping/probe).

For the metric type stuff, if we are to make any use of it, we need to
(a) Publish it, presumably via R-GMA, to a central location,
(b) Then visualise the data so people can see we're making progress
    (or not),
(c) Ensure that data is collected consistently across the
    implementations.

2. Metrics and milestones update

See discussion above. Once we have decided on appropriate metrics that
we can monitor, we can decide whether to set milestones to achieve.

3. Round up of storage issues from the SC

Apart from CASTOR hiccups last week, including SRM's disk failing,
faulty network cables, etc, the service should (touch wood) be
available now; CMS and GridPP T2 tests ongoing. These are now
coordinated between Simon Metson (CMS Bristol) and Jamie.

Some concerns about sites publishing 0 in the canonical storage listing
page. And we're still well short of the 200 TB. Greig or Owen will
follow up with Liverpool, which was one of the sites.

4. CASTOR weather report and service classes

Discussions with CMS on how to meet their service class requirements.
They need Disk1Tape0 (durable), Disk0Tape1 (permanent), and Disk1Tape1
(permeable?). Implementation likely via three different SRMs, due to
limitations in the SRM1 implementation.

Reminder to people to
(a) File CASTOR bugs and problems via Tier 1 helpdesk
    lcg-support@gridpp.rl.ac.uk or support@gridpp.rl.ac.uk
(b) CASTOR-SUPPORT@jiscmail.ac.uk available for CASTOR discussions,
(c) GRIDPP-SC@jiscmail.ac.uk used for service challenge discussions and
    coordination.

Stuff like CASTOR downtime will currently be announced to both SUPPORT
and SC. This is the sort of medium term arrangement, to be possibly
replaced with something else, e.g. when the SCs are over, or non-LCG
people start using CASTOR.

5. AOB

Greig will be running the show next week.

------------------------------------------------------------------------

41   10/08/2005  Agree licence with DESY  Jens  Open
     No news.

53   12/10/2005  Find reasonable % for SE uptime for SC4  Jeremy  Open
     Jeremy sent a response, and it was agreed that the best thing is
     to get some measurements for the uptime. We have discussed this
     before, but we'll close this action and leave it to the monitoring
     to monitor uptime.

86   08/02/2006  Extend monitoring to do sites per VO and VOs per site  Greig  Open
     See discussion above.

105  03/05/2006  Re-poke DESY or FNAL about SRM (now 2.2) 2.1 for dCache  Owen  Open
     No news.

116  31/05/2006  Progress of Durham-MAN networking discussions.  Mark  Open
     No news - Mark had sent apols. But see summary from last week.

119  07/06/2006  Circulate next version of VO storage to list  Jens  Open
     Not done, but progress made with CMS and service classes.

121  05/07/2006  Get report from NGS on GPFS  Jens  Open
     No news.

127  19/07/2006  Test out dCache Nagios plugin  Greig  Open
     Greig has not been able to get quality time with Edinburgh Nagios
     admin. RAL could have tested but is short staffed.

129  09/08/2006  Produce GridPP response to dCache licence  Jens/ALL  Open
     No news. Nobody has had sufficient time to parse the licence at a
     sufficiently detailed level. If the licence is OK, we won't need
     action 41.

130  09/08/2006  Get legal input on dCache licence  Jens  Open
     No news.

133  16/08/2006  Summarise MonAMI monitoring for dCache/DPM on wiki  Graeme/Greig  Open
     Done; see URL above.

134  16/08/2006  Talk to CASTOR about MonAMI monitoring  Jens  Open
     Done. Won't necessarily be done now (for now we do plain Nagios).

136  16/08/2006  Locate/summarise progress metrics in wiki  Jens  Open
     Todo. Will put above summary on wiki in appropriate location.

------------------------------------------------------------------------
NEW ACTIONS

137  23/08/2006  Follow up with sites publishing 0 bytes  Greig (Owen)  Open

More bite sized (byte sized?) actions should come out of the monitoring
stuff but we need to agree on something first - which ones are doable
and worth doing.
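
Note on the query rate discussed under 1.1: the "~3-4 requests per
minute" figure follows from spreading ~200 BDII queries evenly over an
hour. The sketch below is only an illustration of that arithmetic, not
the actual collection script; the function names spread_queries and
per_minute_counts are hypothetical, and the figure of 200 sites is the
rough estimate from the discussion.

```python
from collections import Counter


def spread_queries(n_sites: int, period_s: int = 3600) -> list[float]:
    """Return evenly spaced start offsets (in seconds) for n_sites queries.

    Assumption: one query per site per period, with the first at offset 0.
    """
    interval = period_s / n_sites
    return [i * interval for i in range(n_sites)]


def per_minute_counts(offsets: list[float]) -> Counter:
    """Count how many queries start within each one-minute slot."""
    return Counter(int(offset // 60) for offset in offsets)


if __name__ == "__main__":
    offsets = spread_queries(200)  # ~200 BDIIs, as estimated in 1.1
    counts = per_minute_counts(offsets)
    print("seconds between queries:", offsets[1] - offsets[0])
    print("queries per minute (min, max):",
          min(counts.values()), max(counts.values()))
```

With 200 sites the queries come out 18 seconds apart, i.e. 3 or 4 per
minute, instead of all 200 arriving at the top of the hour.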