(Georgios)
- Request: We need more space for PPS; we're at 97% capacity on eospps-ns* (already with compression).
- Could we add a second SSD, so that we can test RocksDB with multiple data directories?
- Luca: won't happen - complicated workflow to add HW.
- Dan: servers would already have 2 SSDs, perhaps put OS on spinning disk.
- Elvin: would like to verify that RocksDB can actually split over several devices
- Massimo: take offline, will try to get test machine
- Could we move away from VMs for the NS machines? Virtualization severely limits the number of IOPS we can get from SSDs. (Only a couple of thousand in a VM; a real SSD can do ~100k, a real spinning disk ~300. All seen with both QuarkDB and the synthetic "fio" test.)
- Luca: please benchmark the machine he prepared
- Dan: (->offline) please talk to Arne, he might be investigating such limits.
- Two weeks ago, PPS stopped working upon reaching 2^32 files. There was a type confusion (id_t) in the code, which caused file and container IDs, normally 64 bits wide, to be truncated to 32 bits.
- Safety checks detected there was something wrong, and the NS was refusing to boot.
- Elvin: also would affect aquamarine. IDs get increased, never re-used - will this affect EOSUSER? Only if somebody creates+deletes files in a loop.
- Jan: could this affect FUSE clients? Yes in principle, but Georgios checked and found nothing.
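To illustrate the 2^32 incident noted above, here is a minimal Python sketch of what happens when a 64-bit ID passes through a 32-bit field. It is not the EOS code: the helper names are hypothetical, and the narrow field is assumed to be unsigned, so values wrap modulo 2^32.

```python
import ctypes

def store_as_32bit(file_id: int) -> int:
    """Simulate assigning a 64-bit ID to a 32-bit unsigned field.

    ctypes integer types do no overflow checking, so the value
    silently wraps modulo 2**32 (assumed unsigned field).
    """
    return ctypes.c_uint32(file_id).value

def roundtrip_ok(file_id: int) -> bool:
    """Hypothetical safety check: does the ID survive the narrow field unchanged?"""
    return store_as_32bit(file_id) == file_id

last_safe = 2**32 - 1   # largest ID that still fits in 32 bits
first_bad = 2**32       # first ID that gets truncated

print(roundtrip_ok(last_safe))        # True
print(store_as_32bit(first_bad))      # 0
print(store_as_32bit(first_bad + 5))  # 5, collides with the genuine ID 5
```

Because IDs only ever increase, every ID past the boundary would alias an older, smaller ID, which is the kind of inconsistency a boot-time safety check can detect.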
- Experimenting with optimizing how metadata is laid out on disk, to reduce the high number of IOPS incurred when listing a directory.
- The goal is for file metadata within the same directory to be physically colocated on disk.
- Listings would thus benefit greatly from the kernel page cache and the RocksDB block cache.
- Jan: worried about intrusive changes close to production roll-out? Would probably be the last major change (it would need a conversion campaign on EOSPPS; prod instances will have this from the start).
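The colocation idea discussed above can be sketched as follows. The notes do not describe the actual QuarkDB on-disk layout, so this is only an illustration under assumptions: keys are a hypothetical (parent container ID, file ID) pair encoded big-endian, and a sorted Python list stands in for an ordered key-value store such as RocksDB.

```python
import struct
from bisect import bisect_left, bisect_right

MAX_U64 = 2**64 - 1

def colocated_key(parent_id: int, file_id: int) -> bytes:
    """Hypothetical key: parent ID first, then file ID.

    Big-endian packing makes byte order match numeric order, so all
    files of one directory sort next to each other.
    """
    return struct.pack(">QQ", parent_id, file_id)

def list_directory(sorted_keys, parent_id):
    """List a directory with one contiguous range scan over the sorted keys."""
    lo = bisect_left(sorted_keys, colocated_key(parent_id, 0))
    hi = bisect_right(sorted_keys, colocated_key(parent_id, MAX_U64))
    return [struct.unpack(">QQ", k)[1] for k in sorted_keys[lo:hi]]

# (file_id, parent_id) pairs, created in interleaved order across directories
files = [(1, 10), (2, 20), (3, 10), (4, 30), (5, 10)]
keys = sorted(colocated_key(parent, fid) for fid, parent in files)

print(list_directory(keys, 10))  # [1, 3, 5]
print(list_directory(keys, 20))  # [2]
```

With keys ordered by file ID alone, the three files of directory 10 would be interleaved with those of other directories; with the parent ID as prefix they occupy adjacent slots, so a listing touches few blocks and caches well.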
(Andreas)
- Changed the behaviour of atomic uploads to avoid any file-loss scenario caused by overlapping/replayed opens/uploads (EOS-2571)
- Fixed looping bug in FST (MgmSyncer.cc): a bogus mtime made a thread run in a tight loop (filling up the /var partition in EOSUSER/GENOME). Fix is in the CITRINE & AQUAMARINE branches.
(Massimo)
- Memleaks in (old) namespace, linked to file creation - seems to also affect new NS (QuarkDB starts paging). Other issues: every hour, on the hour, see mem increase - log truncation? Last issue is runaway mem consumption.
(Jan): new NS roll-out - status (EOSHOME has 1 filesystem)?
- have Foreman hostgroup, puppet config (MGM and QuarkDB on same box), have redirector, have NS for first instance (but will wait for new QuarkDB on-disk layout)