● EOS production instances (LHC, PUBLIC, USER)
LHC instances
ATLAS to be updated to the 4.2.2x version early next week - this needs the new -24 release (containing only the locking fix from the EOSATLAS incident), which is not out yet.
● EOS clients, FUSE(X)
(Dan) Integrating puppet eosclient fixes in eosclient_dev environment:
- fuse: include only legacy instances in EOS_FUSE_MOUNTS
- eos-cleanup: also clean dead eosxd mounts
Go ahead or wait for new eos-fuse release?
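For reference, a minimal sketch of what the dead-eosxd-mount cleanup could look like as a stand-alone script (the fuse type pattern and the 5-second timeout are assumptions, not the actual puppet eos-cleanup code):

```shell
#!/bin/bash
# Sketch: detect and lazily unmount dead eosd/eosxd FUSE mounts.
# The fuse type names and the 5-second timeout are assumptions.
awk '$3 ~ /^fuse\.eosx?d$/ {print $2}' /proc/mounts | while read -r mnt; do
    # A dead FUSE mount hangs or fails stat() with
    # "Transport endpoint is not connected"; bound it with a timeout.
    if ! timeout 5 stat "$mnt" >/dev/null 2>&1; then
        echo "cleaning dead mount: $mnt"
        fusermount -uz "$mnt" 2>/dev/null || umount -l "$mnt"
    fi
done
echo "dead-mount scan finished"
```

The lazy unmount (`-z` / `umount -l`) detaches the mount point even while processes still hold file handles on it, which is usually what you want for an already-dead endpoint.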
(Jan) top "eosd" segfaults on PLUS (several per day) - could these perhaps be fixed in the next release?
- EOS-2081 - sscanf() / filesystem::stat() / EosFuse::lookup - easy to fix?
- EOS-2080 - AuthIdManager::CleanupThread
(Andreas)
- important progress with eoshome-i00
- eosxd now seems stable under default conditions
  stable benchmarks on a 4-core VM against EOS-MGM / CEPH-MDS @0.3ms RTT
  (default mount) on idle instances (repeated many times on several days/instances)
producer tasks          eosxd(home)   ceph(ssd)   ceph(hdd)   ceph(k)+
----------------------------------------------------------------------
untar                   9-12 s        8-14 s      9-14 s      -
untar (overwrite)       14 s          20 s        21 s        -
fusex-benchmark         40 s          60 s        60 s
cmake ..                17 s          45 s        46 s
----------------------------------------------------------------------
compilation task -j4    120 s         155++ s     155++ s     -
CPU consumption -j4     57 s          233 s
context switches -j4    720k          3.5M
----------------------------------------------------------------------
rpm build eos/git       380 s*        990 s       1035 s      -
rpm build kernel        locks**       locks**     locks**     -
----------------------------------------------------------------------
 *  for comparison, on /tmp/: 200 s
 ** locks in the kernel-build step (L. Torvalds' tooling massaging
    executable symbols in the kernel object file)
----------------------------------------------------------------------
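For context, the producer-task timings above can be reproduced with ordinary shell timing; a minimal sketch, where the mount point and source tarball are assumptions, not the actual test setup:

```shell
#!/bin/bash
# Sketch: time the producer tasks on a mount point under test.
# MNT and TARBALL are assumptions, not the actual benchmark setup.
MNT=${MNT:-/eos/home-test}
TARBALL=${TARBALL:-/tmp/source-tree.tar}

TIMEFORMAT='%R s wall, %U+%S s CPU'   # format for bash's time keyword

if cd "$MNT" 2>/dev/null && [ -f "$TARBALL" ]; then
    time tar xf "$TARBALL"            # untar
    time tar xf "$TARBALL"            # untar (overwrite)
    time make -C source-tree -j4      # compilation task -j4
else
    echo "benchmark prerequisites missing; skipping"
fi

# Context switches for the -j4 build need GNU time, e.g.:
#   /usr/bin/time -v make -C source-tree -j4 2>&1 | grep -i switches
```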
FuseServer scalability test (home00 / 4 core VM)
---------------------------------------------------------------------------
ls, 100 clients, 1 dir = 1k files     150k entries/s *
ls, 100 clients, max listings/s       6k ls/s
 * after moving the getMD function from open/read/close to an fsctl call
   (3 times fewer TCP messages)
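A client-side sketch of such a listing test (the directory, the client count, and driving everything from one box instead of 100 real clients are all assumptions):

```shell
#!/bin/bash
# Sketch: NCLI concurrent clients listing one directory of ~1k files,
# reporting aggregate entries/s. DIR and NCLI are assumptions.
NCLI=${NCLI:-100}
DIR=${DIR:-/eos/home-test/lsbench}

start=$(date +%s.%N)
for i in $(seq "$NCLI"); do
    ls -f "$DIR" >/dev/null 2>&1 &    # -f: no sorting, raw listing
done
wait
end=$(date +%s.%N)

nfiles=$(ls -f "$DIR" 2>/dev/null | wc -l)
awk -v n="$NCLI" -v f="$nfiles" -v s="$start" -v e="$end" \
    'BEGIN { printf "aggregate: %.0f entries/s\n", n*f/(e-s) }'
```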
Streaming performance (dd if= of= bs=1M count=1000)
                 eoshome      ceph-hdd     ceph-ssd
---------------------------------------------------------------------------
WR 1M            285 MB/s     250 MB/s     320 MB/s
---------------------------------------------------------------------------
RD 4M
  uncached       200 MB/s     200 MB/s
  cached(server) 480 MB/s     310 MB/s
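The dd invocations behind these numbers, spelled out (the target path is an assumption; dropping caches needs root and only clears the client side):

```shell
#!/bin/bash
# Sketch: streaming write/read measurement with dd.
# TARGET is an assumption for the mount under test.
TARGET=${TARGET:-/eos/home-test/stream.dat}

# WR 1M: 1000 x 1 MB sequential write, flushed at the end
dd if=/dev/zero of="$TARGET" bs=1M count=1000 conv=fsync 2>&1 | tail -n1

# RD 4M, uncached: drop the client page cache first (root only)
sync && { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null

dd if="$TARGET" of=/dev/null bs=4M 2>&1 | tail -n1

# RD 4M, cached (server): immediate re-read, server caches are warm
dd if="$TARGET" of=/dev/null bs=4M 2>&1 | tail -n1
```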
(Massimo)
EOS-HOME tests (NB: in edit mode you can see attached graphics)
Stable and fast. The plot shows the input on an FST; the dip is me hitting the quota and restarting. Looks good. Suggestions for more tests are welcome.
What I did (for the moment): 2 nodes, one writing 100 MB files (two streams) and the second with another writing stream plus 4 reading streams (reading + checksumming). No exceptions were found during a 5-hour run.
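A per-node sketch of this load pattern (paths, file naming, and the bounded iteration count are assumptions; the real test ran for 5 hours):

```shell
#!/bin/bash
# Sketch: one writing stream of 100 MB files plus 4 reading streams
# that re-read and checksum them. DST and ITER are assumptions;
# point DST at the eoshome mount for a real run.
DST=${DST:-$(mktemp -d)}
ITER=${ITER:-2}

# writing stream: ITER files of 100 MB each
( for i in $(seq "$ITER"); do
      dd if=/dev/urandom of="$DST/file.$i" bs=1M count=100 2>/dev/null
  done ) &
WRITER=$!

# 4 reading streams: checksum whatever has been written so far,
# until the writer finishes
for r in 1 2 3 4; do
    ( while kill -0 "$WRITER" 2>/dev/null; do
          for f in "$DST"/file.*; do
              [ -f "$f" ] && sha1sum "$f" >/dev/null
          done
      done ) &
done
wait
echo "stress run complete: $(ls "$DST" | wc -l) files in $DST"
```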
I prepared the batch test, but I need the new version (23, I guess) of FUSEX.
The highest-concurrency test (8 nodes) largely exceeds 600 MB/s writing to EOSHOME (to do more I need batch...). Given some internals of eoshome this is almost saturated (1 node is hit with about 10 Gb/s).
The reading tests (8 clients x 4 streams) read and checksum at 1,000 MB/s (no bottleneck yet, thanks to the double copy allowing better source spreading).
Some monitoring: https://monit-grafana.cern.ch/dashboard/snapshot/Z38qw1LpO2JLZ8iROPwi721Boi8kNnwZ
Work priorities
Test/tune scalability n x 1k clients (Massimo)
Test typical failure scenarios (MGM or FST dead, disks full)
Harden stability (the biggest asset of ceph-fuse is its long-term stability!)
Samba native VFS plug-in (with Rainer)
UAT
I would like to stop doing anything with UAT (aquamarine branch). The server is no longer up to date, and stopping would save a lot of (pointless) backporting work.
Decisions:
- need a new release with the EOSATLAS lock fix + FUSE + FUSEX fixes (but not a full merge). Expected Wed; then deploy into "qa" on the client side and to EOSPPSLEGACY, later to EOSATLAS.
- EOSUAT will go to citrine - no longer need FUSEX backports on aquamarine.
- Unclear whether EOSPPSLEGACY is still needed (but it is empty right now anyway).
● AOB
Luca has very tight timelines for the EOSHOME rollout (as promised at IT PoW) and will ask for help.