Dramatis personæ

SB   Stephen Burke
FD   Flavia Donno
PF   Patrick Fuhrmann
JG   John Gordon
GMcC Gavin McCance
TP   Timur Perelmutov
JS   Jamie Shiers

Plus others whom I didn't recognise from where I was sitting. (Apologies for any omitted or incorrectly recorded comments.)

Storage Performance Issues, Monitoring etc.
-------------------------------------------

Abstract: This session will discuss how to improve the performance of tape-backed MSS systems (or what the experiments could/should do to give hints to the system): experiences, suggestions for dCache and Castor, etc.

See also:

{ One and the same monitoring tool is needed for tapes at all Tier-1s. Action on: Jamie Shiers }

We have another use case, but this is primarily for CERN: we still have users reading from tape and we didn't know. There are actually hundreds of tape mounts per day for reading by users. Had we had any monitoring we would have been able to spot this much earlier; we learned of it for the first time in a presentation by Tim Bell.

The initial discussion was to select a list of topics, and drew up the following list:

Topics:
- File size
- Repack
- Read/write performance
- Monitoring
- Batching
- Ordering of reads
- Clustering of writes
- LAN access
- Pinning
- Deletion policies

File sizes
----------

?> The average file size (on tape) at CERN is 170 MiB; the ideal file size should be much bigger, ~0.5 TiB. The discussion centred on whether this was an average over all files in Castor, which would include user analysis files.

JG> should see real data arriving.

?> This small-files "nonsense" is causing very slow tape operations.

An issue was mentioned about throwing away old (?? experiment) data. JG> This is something that WLCG [i.e., this forum] can't come to a decision about.

JS> Can we expect larger file sizes? Is this something we can set as a target?

CMS guy> we have our own in-house solution: there was a site-local solution before the experiment came up with a CMS framework solution. The site-local solution involved zip files. The zip files carry a small catalogue of their contents, so the required files can be extracted from a much larger zip file. This gives an improvement of ~20x in performance, but suffers from a 2 GiB (32-bit) limitation. The CMS framework solution overcomes these problems and can store files up to the 64-bit limit. (A sketch of the zip-bundling idea is given at the end of this section.)

The status for ATLAS is that production file sizes should increase soon, but this is not necessarily the final solution.

JS> can we set ourselves a target for May that the average file size will be greater than 1 GiB (for example)? For the data files that go onto the Tier-0, yes; for user jobs this is more difficult.

JG, relaying a comment from the GDB, said that experiments didn't want to run 10-day jobs. (This led to a discussion about why there is a link between job run-time and the file size produced: short files from multiple jobs could be merged to produce larger files that are suitable for storage.)
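As an aside on the zip-file approach described above: the following is a minimal, illustrative sketch (not CMS's actual framework code) of packing many small files into one large archive whose internal catalogue lets individual members be extracted without unpacking everything. The file names are invented, and ZIP64 extensions are used to avoid the 32-bit size limit mentioned in the discussion.

    # zip_bundle.py -- illustrative sketch only, not the CMS framework solution
    import zipfile

    SMALL_FILES = ["evts_0001.dat", "evts_0002.dat", "evts_0003.dat"]  # hypothetical names

    def bundle(archive_name, members):
        """Pack many small files into one large archive (ZIP64 allows >4 GiB archives)."""
        with zipfile.ZipFile(archive_name, "w", allowZip64=True) as zf:
            for path in members:
                zf.write(path)  # the zip central directory acts as the small catalogue

    def extract_one(archive_name, member, dest="."):
        """Pull a single member out without reading the rest of the archive."""
        with zipfile.ZipFile(archive_name, "r") as zf:
            zf.extract(member, path=dest)

    if __name__ == "__main__":
        for name in SMALL_FILES:          # create dummy input files so the example runs
            with open(name, "wb") as f:
                f.write(b"dummy event data\n")
        bundle("bundle.zip", SMALL_FILES)
        extract_one("bundle.zip", "evts_0002.dat")

The design point is the same as in the discussion: the tape system sees one large file, while users can still address the small files inside it by name.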
Monitoring
----------

The Tier-0 felt that it was "completely blind" on what the Tier-1s are doing: there is no indication of whether a file received was stored successfully. The Tier-0 sees the files go, but doesn't know what happened to them. Adding some simple monitoring that is possible for a Tier-1 would be a first step.

JG> what information do you want to see? Is there a similar request from the VOs (CMS/ATLAS/...)?

RAL has an (Apache-based) web-page monitoring solution that shows transfer bandwidth per space token. SARA also have a monitoring solution. There was a proposal to look at the CMS monitoring at RAL and the SARA monitoring. CMS also has PhEDEx monitoring of transfers; this includes queries to the tape back-end.

People felt it would be good if there were a standardisation on a (wire?) protocol for providing this monitoring information.

JG> There seems to be a tension between sites and experiments. Sites want to provide a reliable service, and experiments want to know everything about all files, which might have performance implications. There needs to be a balance. It was felt that this was difficult. It was said that we should use SRM as a standard and that PhEDEx is by-passing SRM.

?> points out the tie-in with GMcC's talk, which noted that it is very hard to build in monitoring after developing the software.

JS> can people who have a solution send a pointer?

[The following emails were received by Jamie and are included more-or-less verbatim.]

[[[
Email from: Vladimir Bahyl

[..] I would propose the following:

1/ Sites should split the monitoring of the tape activity by production and non-production users.

2/ Sites should measure the number of files read/written per mount (this should be much greater than 1).

3/ Sites should measure the amount of data transferred during each mount (I propose to add this 3rd point since, as someone suggested in the audience, 1 GB files barely make the drive stream).

I believe these are the basic metrics that each site should collect and then compare with the experiments - the reality versus the expectations. CERN will provide the proposed numbers. I provide this mail as CERN's contribution to the conclusions.
]]]

[[[
Email from: Miguel Marques Coelho Dos Santos

Example of basic stats: https://sls.cern.ch/sls/service.php?id=CASTORATLAS_T0ATLAS
]]]

[[[
Email from: JT

To be clear (it appeared I did not express myself clearly): the problem was "too few files read per tape mount". There were two possible reasons why this could happen:

1) users not asking for files in an 'intelligent' fashion. Ian F. pointed out that it was impossible to ask for files in exactly the same order as they were written; however, it should be possible to ask for a related group of files (TBD: define "related group of files") over a short period (TBD: define "short period"), so that it would be possible for the MSS underlying the SRM to organise the read requests to optimise the tape-mount and seek count.

2) the underlying MSS (or the SRM layer?) not doing something intelligent with the requests, i.e. there is no attempt to look at the "current pending bunch of read requests" and reorder them to group by tape and sort by position on the tape.

My proposal was, in order to disentangle these two, to look at statistics of requests to the SRM to read files. I don't know what the proper call is ... "bringonline" or "get" or "whatever". However, we can look at statistics -- do these calls come in bunches of "related files" or not? If not, then we can't really blame the MSS for not doing a good job.

Now someone more MSS-aware can put the right words in the above proposal :-)
]]]

[[[
Email from: Miguel Marques Coelho Dos Santos

- amount of data read from tape during May
- number of tape mounts
- number of files read
]]]

JS> We will try to make some time (Tuesday/Wednesday) to discuss this.
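To make the proposed metrics concrete, here is a minimal sketch of how a site might compute "files per mount" and "data per mount", split by production and non-production users, from a per-file transfer log. The record format, column names and production account names are assumptions made for the example, not any site's actual accounting schema.

    # tape_mount_metrics.py -- illustrative sketch; the record format is an assumption
    import csv
    import io
    from collections import defaultdict

    PRODUCTION_USERS = {"atlasprd", "cmsprod", "lhcbprod"}  # hypothetical account names

    def summarise(records):
        """Aggregate per-file transfer records (mount_id, user, bytes) into per-mount stats."""
        mounts = defaultdict(lambda: {"files": 0, "bytes": 0, "production": False})
        for rec in records:
            m = mounts[rec["mount_id"]]
            m["files"] += 1
            m["bytes"] += int(rec["bytes"])
            if rec["user"] in PRODUCTION_USERS:
                m["production"] = True
        return mounts

    def report(mounts):
        """Print files/mount and GB/mount, split by production and non-production mounts."""
        for label, flag in (("production", True), ("non-production", False)):
            group = [m for m in mounts.values() if m["production"] == flag]
            if not group:
                continue
            files_per_mount = sum(m["files"] for m in group) / len(group)
            gb_per_mount = sum(m["bytes"] for m in group) / len(group) / 1e9
            print(f"{label}: {len(group)} mounts, "
                  f"{files_per_mount:.1f} files/mount, {gb_per_mount:.2f} GB/mount")

    if __name__ == "__main__":
        # A tiny made-up log; a real site would read this from its tape-system accounting.
        log = io.StringIO(
            "mount_id,user,filename,bytes\n"
            "M1,atlasprd,f1,2000000000\n"
            "M1,atlasprd,f2,1500000000\n"
            "M2,jdoe,f3,200000000\n"
        )
        report(summarise(csv.DictReader(log)))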
Batching / Clustering
---------------------

?> reading back is problematic: too much time is spent seeking on tapes.

JG> We're not talking about (re-)ordering of requests for read: does Castor do this? It was unclear whether Castor does this, or under what circumstances. For example, ?? reported a test conducted using Castor at RAL: a large batch of requests was submitted and Castor reordered these requests to make better use of the tape. The conclusion seems to be that Castor needs enough information to work on. With Enstore at FNAL, the requests are queued by volume, which gives limited options for request reordering.

SB> If we believe SRM is the interface, this isn't supported. TP> Calling SRM BringOnline with all the files you need would allow reordering.

JS> is there some way of measuring what happens, to identify whether we have a problem? JG> Record the number of files read/written per tape mount as a metric. Also suggested was the number of tape-seek operations per tape mount. JG> You would see this in the performance.

ATLAS> no user analysis should trigger tape access; this only happens for reprocessing. For reprocessing it comes down to how big the disk buffer is: we want enough for several days. The buffer determines performance. A further optimisation is in writing: we (ATLAS) could say when a dataset is finished, if this would be useful.

JG> There is nothing to stop a user from accessing lots of small files. Map file/tape families to directories?

PF>
1. Experiments must have a good idea of which files go together;
2. this must be recorded in spaces or directories;
3. the site must know this ordering;
4. the SE providers must take this information and pass it to the back-end storage.

JS> this shouldn't be a problem. We can formulate a metric off-line to measure this and check how close we are to implementing it.

?> Currently, there is an average of 1.5 file-read requests satisfied per tape mount. It was asked whether these tape back-end statistics included all requests. To improve this statistic, it was felt necessary to identify which end-users were causing the problematic access. Currently the statistics cover all tape mounts: the read access cannot (easily) be split by which end-user (or end-user group) requested the data.

JG> There is no point shouting at ATLAS without suggesting which file requests are causing the problem. Production users should have a much larger files-read-per-mount figure. If end-user requests were filtered out, a much better picture should emerge.

Writes
------

JT> Can we look at the statistics of what is being asked? This is to analyse whether it is possible to do reordering. PF> The information is recorded, but it might not be easy to extract it or make it available.

One suggestion was to delay writing to tape in order to collect multiple requests (cf. TCP packets).

A question was whether we are going to measure these metrics during CCRC. JG> All sites are publishing these metrics. Measure the number of files read per mount, separating production from user requests. This metric should have values >> 1; if not, then something is wrong.

There is also an upper limit on file size: problems start to appear with file sizes greater than the tape size.
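As a concrete illustration of the reordering idea discussed under "Batching / Clustering" above (and of JT's point about related groups of files), the sketch below shows what an MSS could do with a sufficiently large batch of bring-online requests: group them by tape volume and sort each group by position on the tape, so every mounted tape is read in a single sequential pass. The request structure and the tape/position fields are invented for the example; in a real system that information lives in the MSS name server, not with the client.

    # reorder_requests.py -- illustrative sketch of batching and reordering tape recalls
    from collections import defaultdict

    # Hypothetical batch of recall requests; tape volume and position are assumed known.
    REQUESTS = [
        {"file": "f1", "tape": "VOL002", "position": 37},
        {"file": "f2", "tape": "VOL001", "position": 5},
        {"file": "f3", "tape": "VOL002", "position": 2},
        {"file": "f4", "tape": "VOL001", "position": 90},
    ]

    def plan_recalls(requests):
        """Group requests by tape and sort by position: one mount, one sequential pass."""
        by_tape = defaultdict(list)
        for req in requests:
            by_tape[req["tape"]].append(req)
        return {tape: sorted(batch, key=lambda r: r["position"])
                for tape, batch in by_tape.items()}

    if __name__ == "__main__":
        for tape, batch in plan_recalls(REQUESTS).items():
            print(tape, [r["file"] for r in batch])

The files-read-per-mount metric discussed above is essentially a measure of how often such a pass actually contains more than one file.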
SRM v1.1 de-commissioning plan and timetable
--------------------------------------------

FD> There are five experiment frameworks to consider. Question: are all experiments ready for this? FD> Yes, except LHCb, which should be ready for July.

The experiments may need to migrate legacy data, as the catalogue entries are recorded against the old SRM-v1.1 end-points. It is difficult to say how long this will take. The file name-space in the storage also needs migration, and the operations are different under SRM-v2.2 compared with SRM-v1. Do we migrate this information, or use explicit SRM operations to move the data? The issue is with Castor; with dCache and DPM the two end-points are hosted on the same machine.

FD> once the SAM tests have moved over, the SRM-1 entries for the Tier-2s could be removed quickly. There are still some Tier-2s that don't have SRM-2.2.

ALICE> we have no SRM information in our catalogue(s), so this transition should be transparent.

SB> we will need to keep the BDII entries whilst anyone is using them.

FTS now uses SRM-v2.2 by default for transfers involving a space token. During the last CCRC, the experiments tried to use SRM-v2.2 but ended up using SRM-v1 by accident.

JG> Can *we* tell, from an LFC, whether any SRM-1.1 endpoints are in use? A discussion ensued. The conclusion was that this should be possible and would provide a necessary, but not sufficient, test for decommissioning SRM-v1.1.
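On JG's question above, one way such a check could look in practice: a minimal sketch that scans a dump of replica SURLs (assumed to have been extracted from the LFC already, one SURL per entry) and counts how many still point at SRM-v1-style endpoints. The endpoint path patterns are heuristic assumptions for illustration, not an authoritative list.

    # find_srmv1_surls.py -- sketch; assumes a dump of replica SURLs is already available
    from collections import Counter
    from urllib.parse import urlparse

    # Heuristic endpoint-path patterns (assumptions; adjust to local conventions).
    SRMV1_MARKERS = ("/srm/managerv1",)
    SRMV2_MARKERS = ("/srm/managerv2",)

    def classify(surl):
        """Classify a SURL as srm-v1, srm-v2 or unknown from its endpoint path."""
        path = urlparse(surl).path
        if any(marker in path for marker in SRMV1_MARKERS):
            return "srm-v1"
        if any(marker in path for marker in SRMV2_MARKERS):
            return "srm-v2"
        return "unknown"

    def scan(surls):
        """Count SURLs per endpoint flavour."""
        return Counter(classify(s) for s in surls if s.strip())

    if __name__ == "__main__":
        # Made-up SURLs standing in for a real catalogue dump.
        sample = [
            "srm://se.example.org:8443/srm/managerv1?SFN=/data/file1",
            "srm://se.example.org:8443/srm/managerv2?SFN=/data/file2",
        ]
        print(scan(sample))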
(coffee break)

Flavia: report from SSWG

Discussion during the presentation:

Flavia: there is no standardisation of the 'default' space in the document; the 'default' space is implementation specific and may also be site specific. Sites need to agree with the experiments (e.g. CMS and ALICE) on the behaviour of calls which do not pass a space token.

Jeff: can't we use one special space token for those requests which don't pass one?

Flavia: not clear; it depends on what they need and how they access data.

Timur: you only need a token to specify non-default behaviour of a system; if you are happy with the defaults then just don't use a token.

Patrick: but the space token insulates against the different ways of configuring, e.g., Castor and dCache; if it is not used, experiments will be exposed to those differences.

Flavia: there is no primary file copy, so which retention policy should be attached if a file has copies with different retention types? There is also an issue with T0D1 and the last copy of a file being removed.

More discussion on: what does "implementation dependent" mean? Flavia re-states: areas which are not defined by the document will need experiments to agree with sites / storage developers directly.

Flavia/Jeff: semantics of secondary FQANs - should they be picked up automatically? There is a possibility of misuse: experiments should clarify what they would like to see. Clarification from Flavia: this is purely an issue of space use; it should not be confused with secondary information during file ACL handling.

Jeff: a new space gives all rights to the creator: can we transfer those rights to another person? Flavia: [[?? not sure I understood ??]] the site admin can do that.

Question and clarification by Flavia about the NFS ACL definition: it has nothing to do with any concrete file system at the site; the document just refers to ACL semantics as in NFS...

Jamie: we need to understand the timeline to move this forward. The f2f will be June 10th, so some iteration in May. The timescale for technical sign-off is short, to allow for discussion on implementation effort/resources after the technical sign-off.

Flavia/David S: some issues require the StoRM developers and DPM developers to come back. This will need another iteration...

Next presentation: Flavia on storage security issues

Alessandra: where is storage access control set up? In VOMS or in the SRM system? Flavia: in SRM. Alessandra: how can this be controlled by a site if people from the physics community directly request changes at sites? Flavia points out that there has been little input on the document from sites, and requests comments.

Jeff: if permissions at the file and space level disagree, which one wins? There was a request to document the expected behaviour at least for the main cases. Flavia: they should in principle be orthogonal. Ian Fisk: you need both permission to write the file and permission to add to the space, similar to quotas and directory access in file systems. Jeff: the answer [to his initial question] is then the 'and' of both permissions.

Flavia: we may still have a problem, as the ACLs for 'space' are now defined by the document while the ACLs for files are not, and are system dependent. This means the result is not well defined. Flavia: we may also get problems if some systems enforce the use of local uid/gid for name-space ACLs.

Alessandra: the large number of undefined cases looks scary.

Jamie: we should not enter into a re-discussion of the approach to administration now. We need to establish a timeline for moving this forward.

Jeff: proposed an action on all Tier-1 sites to have their storage experts comment on the draft document from the service point of view.

John: thanks to all the people who contributed - now we need to make sure the outcome does not diverge during the implementation phase. Flavia: we need validation and testing of the agreed behaviour up front, and not only afterwards as for the SRM 2 validation. John: maybe the teams could already exchange designs to spot differences.

Timur: do the experiments really need / like the new functionality described?

[someone]: is the document complete, so that divergence caused by guessing at undefined areas during implementation can be avoided?

Flavia: do we need a discussion on ACLs on the namespace as well? Jamie: [[ did not get an answer on this ]]
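A tiny sketch of the file-versus-space permission point made by Ian Fisk and Jeff above: the effective right to write is the 'and' of the file/namespace permission and the space permission. The data structures here are invented for illustration, and, as Flavia notes, the file-level side is implementation dependent, so this only shows the intended combination, not any agreed interface.

    # effective_write.py -- sketch of the 'and' of file and space permissions
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Acl:
        writers: frozenset  # users/FQANs allowed to write

        def allows(self, who):
            return who in self.writers

    def can_write(who, file_acl, space_acl):
        """Writing requires permission on the namespace entry AND on the space."""
        return file_acl.allows(who) and space_acl.allows(who)

    if __name__ == "__main__":
        file_acl = Acl(frozenset({"/atlas/Role=production"}))
        space_acl = Acl(frozenset({"/atlas/Role=production", "/atlas"}))
        print(can_write("/atlas", file_acl, space_acl))                  # False: file ACL blocks it
        print(can_write("/atlas/Role=production", file_acl, space_acl))  # True: both allow it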