EOS DevOps Meeting
● production instances
EOSALICE
Updated to 4.2.0 this morning
- EOS-2017 is still here (FSCK crashes the MGM - fixed but needs a release)
EOSCMS
Preparation for EOY closure:
- CMS plans to add 80k CPU from the HLT. This will likely require us either to whitelist their (additional) gateway nodes (which use GPN to talk to the MGM, but go directly to the FSTs), or to make the Auth Proxies more robust
- one issue with auth proxies has been fixed.
Batch on EOS
Preparing a bunch of nodes in QA/EOSPPS to serve as a testbed (managed by us, to see the operational implications)
● CERNBOX and EOSUSER
Running out of memory, expect a crash in the next days. Trying to get a new box into production (big-mem, but slower CPU, less disk and SSD only; apparently more spinning disks cannot be added), but Luca leaves for Friday..
- "slower CPU": booting the NS took 2200 sec (on an idle machine). The previous NS boot (70 min) suffered from THP (transparent huge pages) and was being hammered by clients.
- 4x 1TB SSD means RAID1 (no feeling/data for reliability), reduced space for logs
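Regarding the THP (transparent huge pages) effect on NS boot noted above, a minimal sketch for checking the active THP mode on a node before booting the namespace. It assumes the standard Linux sysfs file, whose content lists all modes with the active one in square brackets (for example "always madvise [never]"); some older Red Hat kernels expose the setting under a different path, so THP_ENABLED may need adjusting. The function names are hypothetical and not part of any EOS tooling.

```python
import re
from pathlib import Path

# Standard Linux sysfs location of the THP setting (assumption: the node
# uses this path; some older Red Hat kernels use a different one).
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active mode, i.e. the word in square brackets."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"unexpected THP setting: {text!r}")
    return match.group(1)


def current_thp_mode(path=THP_ENABLED):
    """Return the active THP mode, or None if the sysfs file is absent."""
    try:
        return parse_thp_mode(path.read_text())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```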
Hugo now has the website in read-only mode (instead of being unavailable), and there is a separate alias so that FUSE clients can stay connected.
● FUSE and client versions
eos-fuse-4.2.0-3 in "production" since Nov 1st, but still frequent crashes
- jemalloc: EOS-2054 has unreleased fix - ETA for new release? No, does not fix the crashes.
- AuthIdManager::CleanupThread : EOS-1920 / EOS-2080
- EOS-2081 "somewhere in libc" - waiting for backtraces
- (tried to create JIRA tickets for most other crashes observed, based on library+base address - these will be filled out once we have actual backtraces.)
Small eos-cleanup.sh change in qa. Prevents an lsof deadlock. CRM-2470
WIP puppet-eosclient support for eosxd can be seen here: https://gitlab.cern.ch/ai/it-puppet-module-eosclient/merge_requests/53
Deadlock on autofs triggered eosxd mount: EOS-2095
● Citrine rollout
EOSATLAS
Citrine migration planned for 9th January, will send announcement shortly. Foreseen downtime < ~4h
EOSCMS: no date yet.
● nextgen FUSE
- Dan starts working on puppet module.
- Suggest "test plan" ("who tests which area") and guidelines for getting as many automatic tests as possible from occasional testers.
- Massimo would like "a couple of machines" to play around
- Suggest to have Rainer look at the client cache.
- Dev Status
Fixed code in Aquamarine, started merge now into CITRINE, need to run certification script, then green light ...
====================================================================
--- ... working-dir = /eos/dev/fuse/certify/certify.10474
====================================================================
001 ... fusex-benchmark
real 0m20.818s
user 0m0.117s
sys 0m1.113s
====================================================================
002 ... rename-test
====================================================================
003 ... git-clone-test
real 0m37.038s
user 0m7.819s
sys 0m2.359s
====================================================================
004 ... xrootd-compilation
real 0m44.122s
user 1m59.250s
sys 0m16.458s
real 1m0.853s
user 2m0.487s
sys 0m17.337s
====================================================================
005 ... client-tests
005a... micro-tests
eos.clients.fuse.dev.microtests.touch_ms 7.185 1510060804
eos.clients.fuse.dev.microtests.rm_ms 5.350 1510060804
eos.clients.fuse.dev.microtests.sqlite_100_inserts_ms 856.418 1510060804
eos.clients.fuse.dev.microtests.touch100files_parallel_ms 5.619 1510060805
eos.clients.fuse.dev.microtests.rm_100_files_ms 5.339 1510060805
eos.clients.fuse.dev.microtests.untar_ms 799.089 1510060805
eos.clients.fuse.dev.microtests.dd_4m_ms 32.118 1510060806
eos.clients.fuse.dev.microtests.dd_4m_dsync_ms 875.729 1510060806
eos.clients.fuse.dev.microtests.dd_4m_read_ms 8.863 1510060807
eos.clients.fuse.dev.microtests.dd_4m_read_direct_ms 612.384 1510060807
eos.clients.fuse.dev.microtests.dd_4k_ms 8.887 1510060807
eos.clients.fuse.dev.microtests.dd_4k_dsync_ms 9.875 1510060807
eos.clients.fuse.dev.microtests.dd_4k_read_ms 5.852 1510060807
eos.clients.fuse.dev.microtests.dd_4k_read_direct_ms 9.086 1510060807
eos.clients.fuse.dev.microtests.rndmseekwrite_ms 138.519 1510060807
eos.clients.fuse.dev.microtests.fwseekwrite_ms 545.669 1510060807
eos.clients.fuse.dev.microtests.untar_940_files_ms 2388.796 1510060808
eos.clients.fuse.dev.microtests.f77uf_ms 3.585 1510060810
eos.clients.fuse.dev.microtests.multiopen_fortran_gf_ms 2588.663 1510060810
eos.clients.fuse.dev.microtests.multiopen_fortran_i_ms 151120.621 1510060813
eos.clients.fuse.dev.microtests.git_clone_ms 2353.397 1510060964
005b... zlib-compile
005c... git-clone
005d... rsync
005d... sqlite
Note: (Some of the tests run FSYNC but should not.)
- will merge to citrine, re-run these tests, then release (and that then is OK to run other tests on). ETA tomorrow.
EOSUAT runs a version with messed-up quota support; it needs to be updated to the latest Aquamarine build.
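To compare the micro-test timings between the Aquamarine run above and the re-run after the Citrine merge, a small parser may help. This is a sketch under the assumption that each metric line has three whitespace-separated fields (metric name, value in ms, epoch timestamp), as the "005a... micro-tests" output suggests; the function names are hypothetical. Lines that do not match (section headers, "real/user/sys" timings) are skipped.

```python
def parse_metrics(lines):
    """Parse 'name value timestamp' lines into (name, float, int) tuples."""
    metrics = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not a metric line (e.g. "005b... zlib-compile")
        name, value, stamp = parts
        try:
            metrics.append((name, float(value), int(stamp)))
        except ValueError:
            continue  # three fields, but not numeric (e.g. "001 ... name")
    return metrics


def slowest(metrics, n=3):
    """Return the n metrics with the largest value, slowest first."""
    return sorted(metrics, key=lambda m: m[1], reverse=True)[:n]


# A few lines copied from the certification output above.
SAMPLE = """\
eos.clients.fuse.dev.microtests.touch_ms 7.185 1510060804
eos.clients.fuse.dev.microtests.untar_940_files_ms 2388.796 1510060808
eos.clients.fuse.dev.microtests.multiopen_fortran_i_ms 151120.621 1510060813
005b... zlib-compile
"""

if __name__ == "__main__":
    for name, value, _ in slowest(parse_metrics(SAMPLE.splitlines())):
        print(f"{value:12.3f} ms  {name}")
```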
● new Namespace
Meeting last week to decide the best strategy for MGM rollout (AP, CC, HR, ML, LM)
Working decision: 1 (unsplit, non-HA) MGM for EOSUSER with QDB backend (3 or 5 nodes?)
Task-force style effort between ops (LM, HR, CC, ...) and dev (ES, GB, ...) to coordinate rollout.
Elvin is looking at the EOSBACKUP conversion tool (1h30), runs out of some resource waiting for ack (tuneable). Will do conversion for all production namespaces, as they are all different..
● BATCH integration
Usual test...
Task 26135 starts at Tue Nov 7 11:27:13 2017 and ends at Tue Nov 7 12:19:21 2017 (52.1 minutes)
Analysed jobs: 100
Correct jobs: 100
Maximum concurrency: 1
Execution hosts (top 5): b6c0fb38a7 [#28] b69586e854 [#23] b60c691f69 [#22] b678940021 [#17] b626536183 [#10]
Execution environments (top 5): eos-client-4.2.0-3.el6.x86_64, eos-fuse-core-4.2.0-3.el6.x86_64, xrootd-client-libs-4.7.0-1.el6.i686, xrootd-client-libs-4.7.0-1.el6.x86_64 [#100]
● AOB
Small investigation of EOSUSER namespace
Out of curiosity (using a dump from Yolanda) I checked the effect of deduplication (file level). Please note deduplication is *not* a priority IMO.
Input: 396 M files (Early October)
Consider only files >10MB
Use AD32 (as recorded in the catalogue). With "large files", there are not too many AD32 collisions.
Dedup saving ~15% (188 TB out of 1184). I am shocked but I do not find any loophole
Anyway: top files being "repeated":
- cernbox/smashbox testing (e.g. file "c1857a3c" has 20k replicas for a total of 1.1 TB). They are characterised by a flat time distribution (test executed every x hours).
- File 05f0347e is an output of a job (2.9 GB x 374 copies). Suboutputs of a single job (all equal...). Time distribution concentrated well within an hour.
- Similar cases exist where 1 file is in the user dir and all the others are in the trash (again, rather peaked time distribution).
Q: de-duplication effect on number of files? Not looked.
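The file-level estimate described above can be sketched roughly as follows. This is not the script that was actually run: the record layout (size in bytes, checksum string) is hypothetical, the 10 MB threshold is taken as decimal megabytes, and grouping on checksum plus size (rather than on AD32 alone) is an added assumption to reduce false matches. The sketch also returns the number of redundant files, which relates to the open question on the file count.

```python
from collections import Counter

MIN_SIZE = 10 * 1000**2  # 10 MB threshold; decimal MB is an assumption


def dedup_savings(records, min_size=MIN_SIZE):
    """Estimate file-level dedup savings.

    records: iterable of (size_bytes, checksum) pairs (hypothetical layout).
    Returns (total_bytes, saved_bytes, redundant_files), counting only
    files strictly larger than min_size.
    """
    groups = Counter()
    for size, checksum in records:
        if size > min_size:
            groups[(checksum, size)] += 1
    # Keeping one copy per group saves the size of every extra copy.
    total = sum(size * n for (_, size), n in groups.items())
    saved = sum(size * (n - 1) for (_, size), n in groups.items())
    redundant = sum(n - 1 for n in groups.values())
    return total, saved, redundant


if __name__ == "__main__":
    # Made-up records for illustration only, not taken from the dump.
    records = (
        [(20_000_000, "aaaa0001")] * 3   # three identical large files
        + [(50_000_000, "bbbb0002")]     # one unique large file
        + [(1_000, "cccc0003")] * 2      # small files, below the threshold
    )
    total, saved, redundant = dedup_savings(records)
    print(f"saved {saved} of {total} bytes, {redundant} redundant files")
```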
Q: is SWAN now using the new unified principals (since will play with instances)? Will check (Enrico: looks OK)