Dramatis personæ

SB   Stephen Burke
FD   Flavia Donno
PF   Patrick Fuhrmann
JG   John Gordon
GMcC Gavin McCance
TP   Timur Perelmutov
JS   Jamie Shiers

Plus others whom I didn't recognise from where I was sitting. (Apologies for any omitted or incorrectly recorded comments.)

Storage Performance Issues, Monitoring etc.
-------------------------------------------

Abstract: This session will discuss how to improve the performance of tape-backed MSS systems (or what the experiments could/should do to give hints to the system): experiences, suggestions for dCache and Castor, etc.

See also:

{ One and the same monitoring tool is needed for tapes at all Tier-1s. Action on: Jamie Shiers }

We have another use case, but this is primarily for CERN: we still have users reading from tape and we didn't know. There are actually hundreds of tape mounts per day for reading by users. Had we had any monitoring we would have been able to spot this much earlier; we learned of it for the first time in a presentation by Tim Bell.

The initial discussion was to select a list of topics, and drew up the following list:

Topics:
- File size
- Repack
- Read/write performance
- Monitoring
- Batching
- Ordering of reads
- Clustering of writes
- LAN access
- Pinning
- Deletion policies

File sizes
----------

?> The average file size (on tape) at CERN is 170 MiB; the ideal file size should be much bigger, ~0.5 TiB. The discussion centred on whether this was an average over all files in Castor, which would include user analysis files.

JG> should see real data arriving.

?> This small-files "nonsense" is causing very slow tape operations.

An issue was mentioned about throwing away old (?? experiment) data. JG> This is something that WLCG [i.e., this forum] can't come to a decision about.

JS> Can we expect larger file sizes? Is this something we can set as a target?

CMS guy> we have our own in-house solution: there was a site-local solution before the experiment came up with a CMS framework solution. The site-local solution involved zip files. The zip files carry a small catalogue of their contents, so the required files can be extracted from a much larger zip file. This gives an improvement of ~20x in performance, but suffers from a 2 GiB (32-bit) limitation. The CMS framework solution overcomes these problems and can store files up to the 64-bit limit. (A sketch of the zip-bundling idea is given at the end of this section.)

The status for ATLAS is that production file sizes should increase soon, but this is not necessarily the final solution.

JS> can we set ourselves a target for May that the average file size will be greater than 1 GiB (for example)? For the data files that go onto the Tier-0, yes; for user jobs this is more difficult.

JG, relaying a comment from the GDB, said that experiments didn't want to run 10-day jobs. (This led to a discussion about why there is a link between job run-time and the file size produced: short files from multiple jobs could be merged to produce larger files that are suitable for storage.)
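As an aside on the zip-file approach described above: the following is a minimal, illustrative sketch (not CMS's actual framework code) of packing many small files into one large archive whose internal catalogue lets individual members be extracted without unpacking everything. The file names are invented, and ZIP64 extensions are used to avoid the 32-bit size limit mentioned in the discussion.

    # zip_bundle.py -- illustrative sketch only, not the CMS framework solution
    import zipfile

    SMALL_FILES = ["evts_0001.dat", "evts_0002.dat", "evts_0003.dat"]  # hypothetical names

    def bundle(archive_name, members):
        """Pack many small files into one large archive (ZIP64 allows >4 GiB archives)."""
        with zipfile.ZipFile(archive_name, "w", allowZip64=True) as zf:
            for path in members:
                zf.write(path)  # the zip central directory acts as the small catalogue

    def extract_one(archive_name, member, dest="."):
        """Pull a single member out without reading the rest of the archive."""
        with zipfile.ZipFile(archive_name, "r") as zf:
            zf.extract(member, path=dest)

    if __name__ == "__main__":
        for name in SMALL_FILES:          # create dummy input files so the example runs
            with open(name, "wb") as f:
                f.write(b"dummy event data\n")
        bundle("bundle.zip", SMALL_FILES)
        extract_one("bundle.zip", "evts_0002.dat")

The design point is the same as in the discussion: the tape system sees one large file, while users can still address the small files inside it by name.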
Monitoring
----------

The Tier-0 felt that it was "completely blind" on what the Tier-1s are doing: there is no indication of whether a file received was stored successfully. The Tier-0 sees the files go, but doesn't know what happened to them. Adding some simple monitoring that is possible for a Tier-1 would be a first step.

JG> what information do you want to see? Is there a similar request from the VOs (CMS/ATLAS/...)?

RAL has an (Apache-based) web-page monitoring solution that shows transfer bandwidth per space token. SARA also have a monitoring solution. There was a proposal to look at the CMS monitoring at RAL and the SARA monitoring. CMS also has PhEDEx monitoring of transfers; this includes queries to the tape back-end.

People felt it would be good if there were a standardisation on a (wire?) protocol for providing this monitoring information.

JG> There seems to be a tension between sites and experiments. Sites want to provide a reliable service, and experiments want to know everything about all files, which might have performance implications. There needs to be a balance. It was felt that this was difficult. It was said that we should use SRM as a standard and that PhEDEx is by-passing SRM.

?> points out the tie-in with GMcC's talk, which noted that it is very hard to build in monitoring after developing the software.

JS> can people who have a solution send a pointer?

[The following emails were received by Jamie and are included more-or-less verbatim.]

[[[
Email from: Vladimir Bahyl

[..] I would propose the following:

1/ Sites should split the monitoring of the tape activity by production and non-production users.

2/ Sites should measure the number of files read/written per mount (this should be much greater than 1).

3/ Sites should measure the amount of data transferred during each mount (I propose to add this 3rd point since, as someone suggested in the audience, 1 GB files barely make the drive stream).

I believe these are the basic metrics that each site should collect and then compare with the experiments - the reality versus the expectations. CERN will provide the proposed numbers. I provide this mail as CERN's contribution to the conclusions.
]]]

[[[
Email from: Miguel Marques Coelho Dos Santos

Example of basic stats: https://sls.cern.ch/sls/service.php?id=CASTORATLAS_T0ATLAS
]]]

[[[
Email from: JT

To be clear (it appeared I did not express myself clearly): the problem was "too few files read per tape mount". There were two possible reasons why this could happen:

1) users not asking for files in an 'intelligent' fashion. Ian F. pointed out that it was impossible to ask for files in exactly the same order as they were written; however, it should be possible to ask for a related group of files (TBD: define "related group of files") over a short period (TBD: define "short period"), so that it would be possible for the MSS underlying the SRM to organise the read requests to optimise the tape-mount and seek count.

2) the underlying MSS (or the SRM layer?) not doing something intelligent with the requests, i.e. there is no attempt to look at the "current pending bunch of read requests" and reorder them to group by tape and sort by position on the tape.

My proposal was, in order to disentangle these two, to look at statistics of requests to the SRM to read files. I don't know what the proper call is ... "bringonline" or "get" or "whatever". However, we can look at statistics -- do these calls come in bunches of "related files" or not? If not, then we can't really blame the MSS for not doing a good job.

Now someone more MSS-aware can put the right words in the above proposal :-)
]]]

[[[
Email from: Miguel Marques Coelho Dos Santos

- amount of data read from tape during May
- number of tape mounts
- number of files read
]]]

JS> We will try to make some time (Tuesday/Wednesday) to discuss this.
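To make the proposed metrics concrete, here is a minimal sketch of how a site might compute "files per mount" and "data per mount", split by production and non-production users, from a per-file transfer log. The record format, column names and production account names are assumptions made for the example, not any site's actual accounting schema.

    # tape_mount_metrics.py -- illustrative sketch; the record format is an assumption
    import csv
    import io
    from collections import defaultdict

    PRODUCTION_USERS = {"atlasprd", "cmsprod", "lhcbprod"}  # hypothetical account names

    def summarise(records):
        """Aggregate per-file transfer records (mount_id, user, bytes) into per-mount stats."""
        mounts = defaultdict(lambda: {"files": 0, "bytes": 0, "production": False})
        for rec in records:
            m = mounts[rec["mount_id"]]
            m["files"] += 1
            m["bytes"] += int(rec["bytes"])
            if rec["user"] in PRODUCTION_USERS:
                m["production"] = True
        return mounts

    def report(mounts):
        """Print files/mount and GB/mount, split by production and non-production mounts."""
        for label, flag in (("production", True), ("non-production", False)):
            group = [m for m in mounts.values() if m["production"] == flag]
            if not group:
                continue
            files_per_mount = sum(m["files"] for m in group) / len(group)
            gb_per_mount = sum(m["bytes"] for m in group) / len(group) / 1e9
            print(f"{label}: {len(group)} mounts, "
                  f"{files_per_mount:.1f} files/mount, {gb_per_mount:.2f} GB/mount")

    if __name__ == "__main__":
        # A tiny made-up log; a real site would read this from its tape-system accounting.
        log = io.StringIO(
            "mount_id,user,filename,bytes\n"
            "M1,atlasprd,f1,2000000000\n"
            "M1,atlasprd,f2,1500000000\n"
            "M2,jdoe,f3,200000000\n"
        )
        report(summarise(csv.DictReader(log)))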
Batching / Clustering
---------------------

?> reading back is problematic: too much time is spent seeking on tapes.

JG> We're not talking about (re-)ordering of requests for read: does Castor do this? It was unclear whether Castor does this, or under what circumstances. For example, ?? reported a test conducted using Castor at RAL: a large batch of requests was submitted and Castor reordered these requests to make better use of the tape. The conclusion seems to be that Castor needs enough information to work on. With Enstore at FNAL, the requests are queued by volume, which gives limited options for request reordering.

SB> If we believe SRM is the interface, this isn't supported. TP> Calling SRM BringOnline with all the files you need would allow reordering.

JS> is there some way of measuring what happens, to identify whether we have a problem? JG> Record the number of files read/written per tape mount as a metric. Also suggested was the number of tape-seek operations per tape mount. JG> You would see this in the performance.

ATLAS> no user analysis should trigger tape access; this only happens for reprocessing. For reprocessing it comes down to how big the disk buffer is: we want enough for several days. The buffer determines performance. A further optimisation is in writing: we (ATLAS) could say when a dataset is finished, if this would be useful.

JG> There is nothing to stop a user from accessing lots of small files. Map file/tape families to directories?

PF>
1. Experiments must have a good idea of which files go together;
2. this must be recorded in spaces or directories;
3. the site must know this ordering;
4. the SE providers must take this information and pass it to the back-end storage.

JS> this shouldn't be a problem. We can formulate a metric off-line to measure this and check how close we are to implementing it.

?> Currently, there is an average of 1.5 file-read requests satisfied per tape mount. It was asked whether these tape back-end statistics included all requests. To improve this statistic, it was felt necessary to identify which end-users were causing the problematic access. Currently the statistics cover all tape mounts: the read access cannot (easily) be split by which end-user (or end-user group) requested the data.

JG> There is no point shouting at ATLAS without suggesting which file requests are causing the problem. Production users should have a much larger files-read-per-mount figure. If end-user requests were filtered out, a much better picture should emerge.

Writes
------

JT> Can we look at the statistics of what is being asked? This is to analyse whether it is possible to do reordering. PF> The information is recorded, but it might not be easy to extract it or make it available.

One suggestion was to delay writing to tape in order to collect multiple requests (cf. TCP packets).

A question was whether we are going to measure these metrics during CCRC. JG> All sites are publishing these metrics. Measure the number of files read per mount, separating production from user requests. This metric should have values >> 1; if not, then something is wrong.

There is also an upper limit on file size: problems start to appear with file sizes greater than the tape size.
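As a concrete illustration of the reordering idea discussed under "Batching / Clustering" above (and of JT's point about related groups of files), the sketch below shows what an MSS could do with a sufficiently large batch of bring-online requests: group them by tape volume and sort each group by position on the tape, so every mounted tape is read in a single sequential pass. The request structure and the tape/position fields are invented for the example; in a real system that information lives in the MSS name server, not with the client.

    # reorder_requests.py -- illustrative sketch of batching and reordering tape recalls
    from collections import defaultdict

    # Hypothetical batch of recall requests; tape volume and position are assumed known.
    REQUESTS = [
        {"file": "f1", "tape": "VOL002", "position": 37},
        {"file": "f2", "tape": "VOL001", "position": 5},
        {"file": "f3", "tape": "VOL002", "position": 2},
        {"file": "f4", "tape": "VOL001", "position": 90},
    ]

    def plan_recalls(requests):
        """Group requests by tape and sort by position: one mount, one sequential pass."""
        by_tape = defaultdict(list)
        for req in requests:
            by_tape[req["tape"]].append(req)
        return {tape: sorted(batch, key=lambda r: r["position"])
                for tape, batch in by_tape.items()}

    if __name__ == "__main__":
        for tape, batch in plan_recalls(REQUESTS).items():
            print(tape, [r["file"] for r in batch])

The files-read-per-mount metric discussed above is essentially a measure of how often such a pass actually contains more than one file.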
SRM v1.1 de-commissioning plan and timetable
--------------------------------------------

FD> There are five experiment frameworks to consider. Question: are all experiments ready for this? FD> Yes, except LHCb, which should be ready for July.

The experiments may need to migrate legacy data, as the catalogue entries are recorded against the old SRM-v1.1 end-points. It is difficult to say how long this will take. The file name-space in the storage also needs migration, and the operations are different under SRM-v2.2 compared with SRM-v1. Do we migrate this information, or use explicit SRM operations to move the data? The issue is with Castor; with dCache and DPM the two end-points are hosted on the same machine.

FD> once the SAM tests have moved over, the SRM-1 entries for the Tier-2s could be removed quickly. There are still some Tier-2s that don't have SRM-2.2.

ALICE> we have no SRM information in our catalogue(s), so this transition should be transparent.

SB> we will need to keep the BDII entries whilst anyone is using them.

FTS now uses SRM-v2.2 by default for transfers involving a space token. During the last CCRC, the experiments tried to use SRM-v2.2 but ended up using SRM-v1 by accident.

JG> Can *we* tell, from an LFC, whether any SRM-1.1 endpoints are in use? A discussion ensued. The conclusion was that this should be possible and would provide a necessary, but not sufficient, test for decommissioning SRM-v1.1.
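On JG's question above, one way such a check could look in practice: a minimal sketch that scans a dump of replica SURLs (assumed to have been extracted from the LFC already, one SURL per entry) and counts how many still point at SRM-v1-style endpoints. The endpoint path patterns are heuristic assumptions for illustration, not an authoritative list.

    # find_srmv1_surls.py -- sketch; assumes a dump of replica SURLs is already available
    from collections import Counter
    from urllib.parse import urlparse

    # Heuristic endpoint-path patterns (assumptions; adjust to local conventions).
    SRMV1_MARKERS = ("/srm/managerv1",)
    SRMV2_MARKERS = ("/srm/managerv2",)

    def classify(surl):
        """Classify a SURL as srm-v1, srm-v2 or unknown from its endpoint path."""
        path = urlparse(surl).path
        if any(marker in path for marker in SRMV1_MARKERS):
            return "srm-v1"
        if any(marker in path for marker in SRMV2_MARKERS):
            return "srm-v2"
        return "unknown"

    def scan(surls):
        """Count SURLs per endpoint flavour."""
        return Counter(classify(s) for s in surls if s.strip())

    if __name__ == "__main__":
        # Made-up SURLs standing in for a real catalogue dump.
        sample = [
            "srm://se.example.org:8443/srm/managerv1?SFN=/data/file1",
            "srm://se.example.org:8443/srm/managerv2?SFN=/data/file2",
        ]
        print(scan(sample))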
(coffee break)

Flavia: report from SSWG

Discussion during the presentation:

Flavia: there is no standardisation of the 'default' space in the document; the 'default' space is implementation specific and may also be site specific. Sites need to agree with the experiments (e.g. CMS and ALICE) on the behaviour of calls which do not pass a space token.

Jeff: can't we use one special space token for those requests which don't pass one?

Flavia: not clear; it depends on what they need and how they access data.

Timur: you only need a token to specify non-default behaviour of a system; if you are happy with the defaults then just don't use a token.

Patrick: but the space token insulates against the different ways of configuring, e.g., Castor and dCache; if it is not used, experiments will be exposed to those differences.

Flavia: there is no primary file copy, so which retention policy should be attached if a file has copies with different retention types? There is also an issue with T0D1 and the last copy of a file being removed.

More discussion on: what does "implementation dependent" mean? Flavia re-states: areas which are not defined by the document will need experiments to agree with sites / storage developers directly.

Flavia/Jeff: semantics of secondary FQANs - should they be picked up automatically? There is a possibility of misuse: experiments should clarify what they would like to see. Clarification from Flavia: this is purely an issue of space use; it should not be confused with secondary information during file ACL handling.

Jeff: a new space gives all rights to the creator: can we transfer those rights to another person? Flavia: [[?? not sure I understood ??]] the site admin can do that.

Question and clarification by Flavia about the NFS ACL definition: it has nothing to do with any concrete file system at the site; the document just refers to ACL semantics as in NFS...

Jamie: we need to understand the timeline to move this forward. The f2f will be June 10th, so some iteration in May. The timescale for technical sign-off is short, to allow for discussion on implementation effort/resources after the technical sign-off.

Flavia/David S: some issues require the StoRM developers and DPM developers to come back. This will need another iteration...

Next presentation: Flavia on storage security issues

Alessandra: where is storage access control set up? In VOMS or in the SRM system? Flavia: in SRM. Alessandra: how can this be controlled by a site if people from the physics community directly request changes at sites? Flavia points out that there has been little input on the document from sites, and requests comments.

Jeff: if permissions at the file and space level disagree, which one wins? There was a request to document the expected behaviour at least for the main cases. Flavia: they should in principle be orthogonal. Ian Fisk: you need both permission to write the file and permission to add to the space, similar to quotas and directory access in file systems. Jeff: the answer [to his initial question] is then the 'and' of both permissions.

Flavia: we may still have a problem, as the ACLs for 'space' are now defined by the document while the ACLs for files are not, and are system dependent. This means the result is not well defined. Flavia: we may also get problems if some systems enforce the use of local uid/gid for name-space ACLs.

Alessandra: the large number of undefined cases looks scary.

Jamie: we should not enter into a re-discussion of the approach to administration now. We need to establish a timeline for moving this forward.

Jeff: proposed an action on all Tier-1 sites to have their storage experts comment on the draft document from the service point of view.

John: thanks to all the people who contributed - now we need to make sure the outcome does not diverge during the implementation phase. Flavia: we need validation and testing of the agreed behaviour up front, and not only afterwards as for the SRM 2 validation. John: maybe the teams could already exchange designs to spot differences.

Timur: do the experiments really need / like the new functionality described?

[someone]: is the document complete, so that divergence caused by guessing at undefined areas during implementation can be avoided?

Flavia: do we need a discussion on ACLs on the namespace as well? Jamie: [[ did not get an answer on this ]]
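A tiny sketch of the file-versus-space permission point made by Ian Fisk and Jeff above: the effective right to write is the 'and' of the file/namespace permission and the space permission. The data structures here are invented for illustration, and, as Flavia notes, the file-level side is implementation dependent, so this only shows the intended combination, not any agreed interface.

    # effective_write.py -- sketch of the 'and' of file and space permissions
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Acl:
        writers: frozenset  # users/FQANs allowed to write

        def allows(self, who):
            return who in self.writers

    def can_write(who, file_acl, space_acl):
        """Writing requires permission on the namespace entry AND on the space."""
        return file_acl.allows(who) and space_acl.allows(who)

    if __name__ == "__main__":
        file_acl = Acl(frozenset({"/atlas/Role=production"}))
        space_acl = Acl(frozenset({"/atlas/Role=production", "/atlas"}))
        print(can_write("/atlas", file_acl, space_acl))                  # False: file ACL blocks it
        print(can_write("/atlas/Role=production", file_acl, space_acl))  # True: both allow it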