Minutes of storage phone conference 23 Aug 2006

Present:
	Edinburgh: Greig
	RAL Tier 1: Derek
	Glasgow: Jamie
	DESY: Owen
	RAL Storage: Jens (chair+mins)

Apologies:
	Durham: Mark
	Glasgow: Graeme
	RAL Storage: Jiri


0. Review of actions (see below)

1. Monitoring and Accounting update

  1.0. Metrics

    Metrics should ideally be, er, metered via an accounting system
    and collected centrally because (a) we can watch the progress on a
    single visualisation page, and (b) we don't have to ask each site for
    their numbers every time we write a report.

    We currently only do this via the GLUE schema, and the storage
    schema does not account for all things that we wish to measure.

    One option is to use R-GMA to publish the data.  Another is to
    implement an extended schema for the BDII.  Probably R-GMA would
    be easier.

  1.1. Accounting

    At the moment, we are gathering the BDII storage used/available
    data via Greig's R-GMA and storing it in the GOC DB.  Dave is back
    from hols and working on the visualisation, and should have
    something this week.

    This is a high PHB-visibility thing, so it would be good to sort
    it out.  We discussed whether we could monitor all of LCG.  At the
    moment that would probably overload Greig's R-GMA server with ~200
    BDII queries every hour, but perhaps they can be distributed
    evenly, at ~3-4 requests per minute.
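
    As a rough check on the figures above, here is a minimal sketch of
    spreading one query per site evenly over an hour.  The site names
    and the spread_queries helper are made up for illustration; this
    is not how the R-GMA server actually schedules anything.

```python
from datetime import timedelta


def spread_queries(sites, window=timedelta(hours=1)):
    """Return (offset, site) pairs, one query per site, evenly spaced over window."""
    if not sites:
        return []
    step = window / len(sites)  # gap between consecutive queries
    return [(step * i, site) for i, site in enumerate(sites)]


# Hypothetical site list: ~200 BDIIs, as in the estimate above.
sites = [f"site{i:03d}" for i in range(200)]
schedule = spread_queries(sites)

per_minute = len(sites) / 60
gap = schedule[1][0].total_seconds()
print(f"{per_minute:.2f} queries per minute, one every {gap:.0f} s")
# prints "3.33 queries per minute, one every 18 s"
```

    So 200 queries an hour works out at one query every 18 seconds,
    which matches the ~3-4 per minute mentioned above.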

    Related to that, CASTOR is being counted but the data is
    currently, if not duff, then at least not really accurate.  CASTOR
    accounting work is ongoing: it needs accurate tape accounting
    first, and then accounting per service class (see item 4 below).

    It would be useful to use some monitoring data (e.g. uptime) as an
    accounting metric.

  1.2. Monitoring

    Monitoring work ongoing, see Wiki pages:
    http://www.gridpp.ac.uk/wiki/MonAMI_DPM_plugin
    http://www.gridpp.ac.uk/wiki/MonAMI_dCache_plugin

    Examples of what we could monitor:

    A Implementation-specific things and alerts
      E.g., daemons dying, disks filling up, certificates expiring,
      pool nodes going AWOL, networks going down, internal transfer
      rates.

    B Metric type stuff
      Space used/available, number of requests processed, number of
      users, active users, files per VO, transfer rates on external
      interfaces, overall service availability (ICMP ping/service
      ping/probe).

    For the metric type stuff, if we are to make any use of it, we
    need to

    (a) Publish it, presumably via R-GMA, to a central location,
    (b) Then visualise the data so people can see we're making
	progress (or not),
    (c) Ensure that data is collected consistently across the
	implementations.

2. Metrics and milestones update

  See discussion above.  Once we have decided on appropriate metrics
  that we can monitor, we can decide whether to set milestones to
  achieve.

3. Round up of storage issues from the SC

  Apart from CASTOR hiccups last week, including the SRM's disk
  failing, faulty network cables, etc., the service should (touch
  wood) be available now; CMS and GridPP T2 tests are ongoing.  These
  are now coordinated between Simon Metson (CMS Bristol) and Jamie.

  Some concerns about sites publishing 0 in the canonical storage
  listing page.  And we're still well short of the 200 TB.  Greig or
  Owen will follow up with Liverpool, which was one of the sites.

4. CASTOR weather report and service classes
 
  Discussions with CMS on how to meet their service class requirements.
  They need Disk1Tape0 (durable), Disk0Tape1 (permanent), and
  Disk1Tape1 (permeable?).  Implementation likely via three different
  SRMs, due to limitations in the SRM1 implementation.

  Reminder to people to

    (a) File CASTOR bugs and problems via Tier 1 helpdesk
	lcg-support@gridpp.rl.ac.uk or support@gridpp.rl.ac.uk

    (b) CASTOR-SUPPORT@jiscmail.ac.uk available for CASTOR
	discussions,

    (c) GRIDPP-SC@jiscmail.ac.uk used for service challenge
	discussions and coordination.

    Stuff like CASTOR downtime will currently be announced to both
    the CASTOR-SUPPORT and GRIDPP-SC lists.

    This is a sort of medium-term arrangement, possibly to be
    replaced with something else, e.g. when the SCs are over, or
    non-LCG people start using CASTOR.

5. AOB

    Greig will be running the show next week.

------------------------------------------------------------------------

41	10/08/2005	Agree licence with DESY	Jens	Open

No news.

53	12/10/2005	Find reasonable % for SE uptime for SC4	Jeremy	Open

Jeremy sent a response, and it was agreed that the best thing is to
get some measurements for the uptime.  We have discussed this before,
but we'll close this action and leave it to the monitoring to measure
uptime.

86	08/02/2006	Extend monitoring to do sites per VO and VOs per site	Greig	Open

See discussion above.

105	03/05/2006	Re-poke DESY or FNAL about SRM 2.1 (now 2.2) for dCache	Owen	Open

No news.

116	31/05/2006	Progress of Durham-MAN networking discussions.	Mark	Open

No news - Mark had sent apologies.  But see summary from last week.

119	07/06/2006	Circulate next version of VO storage to list	Jens	Open

Not done, but progress made with CMS and service classes.

121	05/07/2006	Get report from NGS on GPFS	Jens	Open

No news.

127	19/07/2006	Test out dCache Nagios plugin	Greig	Open

Greig has not been able to get quality time with Edinburgh Nagios
admin.  RAL could have tested but is short staffed.

129	09/08/2006	Produce GridPP response to dCache licence	Jens/ALL	Open

No news.  Nobody has had sufficient time to parse the licence at a
sufficiently detailed level.  If the licence is OK, we won't need
action 41.

130	09/08/2006	Get legal input on dCache licence	Jens	Open

No news.

133	16/08/2006	Summarise MonAMI monitoring for dCache/DPM on wiki	Graeme/Greig	Open

Done; see URL above.

134	16/08/2006	Talk to CASTOR about MonAMI monitoring	Jens	Open

Done.  MonAMI monitoring for CASTOR won't necessarily be set up now
(for now we use plain Nagios).

136	16/08/2006	Locate/summarise progress metrics in wiki	Jens	Open

Todo.  Will put above summary on wiki in appropriate location.

------------------------------------------------------------------------
NEW ACTIONS

137	23/08/2006	Follow up with sites publishing 0 bytes	Greig (Owen)	Open

More bite sized (byte sized?) actions should come out of the
monitoring stuff but we need to agree on something first - which ones
are doable and worth doing.