Small investigation of EOSUSER namespace

Out of curiosity, using a dump from Yolanda, I checked the effect of file-level deduplication. Please note that deduplication is *not* a priority IMO.

Input: 396 M files (early October)

Consider only files > 10 MB

Use the AD32 checksum (as recorded in the catalogue) to identify identical files. With "large files", there are not too many AD32 collisions.

Dedup saving ~16% (188 TB out of 1184 TB). I am shocked, but I cannot find any flaw in the analysis.
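
For reference, the saving estimate can be sketched as below. This is only a minimal illustration, not the script actually used: the record layout (checksum, size in bytes), the function name dedup_saving, the decimal reading of "10 MB" and the extra grouping on size are all assumptions. Every copy beyond the first in a group counts as saved space, so 188 TB out of 1184 TB corresponds to a ratio of about 0.159.

```python
from collections import defaultdict

# Assumption: "10 MB" is read as 10 * 10^6 bytes.
MIN_SIZE = 10 * 1000**2


def dedup_saving(records, min_size=MIN_SIZE):
    """Estimate the file-level deduplication saving.

    records: iterable of (checksum, size_bytes) tuples (hypothetical layout).
    Returns (total_bytes, saved_bytes), counting only files larger than min_size.
    """
    counts = defaultdict(int)  # (checksum, size) -> number of copies
    total = 0
    for checksum, size in records:
        if size <= min_size:
            continue  # small files are ignored
        # Assumption: size is added to the key as an extra guard against
        # checksum collisions; the notes only mention the checksum.
        counts[(checksum, size)] += 1
        total += size
    # Keep one copy per group; every further copy is counted as saved.
    saved = sum(size * (n - 1) for (_, size), n in counts.items())
    return total, saved
```

Usage would be along the lines of total, saved = dedup_saving(rows), with saved / total giving the saving fraction.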

Anyway, the top "repeated" files:

- cernbox/smashbox testing (e.g. file "c1857a3c" has 20k replicas for a total of 1.1 TB). They are characterised by a flat time distribution (test executed every x hours).

- File 05f0347e is the output of a job (2.9 GB x 374 copies): sub-outputs of a single job (all equal...). Time distribution concentrated well within an hour.

- Similar cases exist where 1 file is in the user dir and all the others are in the trash (again, a rather peaked time distribution).
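
The flat versus peaked distinction in the cases above can be approximated with a crude per-checksum time span. The sketch below is an assumption-laden illustration: the record layout (checksum, timestamp in seconds), the names time_spread and is_peaked, and the one-hour window are all hypothetical, and a span says nothing about the shape of the distribution inside it.

```python
from collections import defaultdict


def time_spread(records):
    """Return {checksum: (copies, span_seconds)} for repeated checksums.

    records: iterable of (checksum, timestamp_seconds) tuples (hypothetical layout).
    """
    times = defaultdict(list)
    for checksum, ts in records:
        times[checksum].append(ts)
    # Only checksums seen more than once are of interest here.
    return {c: (len(t), max(t) - min(t)) for c, t in times.items() if len(t) > 1}


def is_peaked(span_seconds, window=3600):
    """Crude proxy: all copies were created within `window` seconds."""
    return span_seconds <= window
```

A group with many copies and a long span would be a candidate for the periodic-test pattern, while a short span points to a single job or a single user action.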

 

Q: effect of deduplication on the number of files? Not looked at.


Q: is SWAN now using the new unified principals (since will play with instances)? Will check (Enrico: looks OK)