Small investigation of EOSUSER namespace
Out of curiosity (using a dump from Yolanda) I checked the effect of file-level deduplication. Please note deduplication is *not* a priority IMO.
Input: 396 M files (early October)
Consider only files > 10 MB
Use the AD32 checksum (as recorded in the catalogue). With the restriction to "large files", AD32 collisions should be few.
Dedup saving ~16% (188 TB out of 1184 TB). I am shocked, but I cannot find any flaw in the analysis.
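For reference, a minimal sketch of how such a saving can be computed. The record layout (checksum, size in bytes) is hypothetical, not the actual dump format, and keying on the size as well as the checksum is an extra guard against collisions that is assumed here, not something the analysis above necessarily did.

```python
from collections import defaultdict

MIN_SIZE = 10 * 1000 * 1000  # 10 MB threshold (decimal MB is an assumption)


def dedup_saving(records, min_size=MIN_SIZE):
    """Return (total_bytes, saved_bytes) for files larger than min_size.

    records: iterable of (checksum, size_bytes) tuples (hypothetical layout).
    Files are treated as identical when checksum and size both match.
    """
    copies = defaultdict(int)  # (checksum, size) -> number of copies seen
    for checksum, size in records:
        if size > min_size:
            copies[(checksum, size)] += 1

    total = sum(size * n for (_, size), n in copies.items())
    # Keeping one copy per group frees the space of the other n - 1 copies.
    saved = sum(size * (n - 1) for (_, size), n in copies.items())
    return total, saved
```

The saving is then saved / total; with the figures quoted above, 188 / 1184 is about 0.16.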
Anyway, the top "repeated" files:
- cernbox/smashbox testing (e.g. file "c1857a3c" has 20k replicas for a total of 1.1 TB). They are characterised by a flat time distribution (test executed every x hours).
- File 05f0347e is the output of a job (2.9 GB x 374 copies): sub-outputs of a single job (all equal...). Time distribution concentrated well within an hour.
- Similar cases exist where 1 file is in the user dir and all the others are in the trash (again, a rather peaked time distribution).
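The "flat" versus "peaked" time distributions mentioned above could be told apart with something like the sketch below. It assumes each copy in a duplicate group comes with a creation time in seconds (a hypothetical field), and the one-hour window is only an illustrative choice.

```python
def peak_fraction(timestamps, window=3600):
    """Largest fraction of timestamps that fall inside any `window` seconds.

    Close to 1.0 means a peaked distribution (e.g. copies written by one job);
    a small value means copies spread out in time (e.g. periodic tests).
    """
    ts = sorted(timestamps)
    if not ts:
        return 0.0
    best = 0
    left = 0
    for right, t in enumerate(ts):
        # Shrink the window from the left until it spans at most `window` seconds.
        while t - ts[left] > window:
            left += 1
        best = max(best, right - left + 1)
    return best / len(ts)
```

A group written by a single job within minutes gives a value near 1.0, while copies created by a test running every few hours give a value near 1 / (number of copies).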
Q: de-duplication effect on number of files? Not looked at yet.
Q: is SWAN now using the new unified principals (since we will play with instances)? Will check (Enrico: looks OK)