● EOS production instances (LHC, PUBLIC, USER)
LHC Instances
- ATLAS
- Updated to 4.2.24 yesterday morning
- ✓ Boot time divided by 2
- ✗ Crashed this morning in the TableFormatter - EOS-2662
- (Elvin: probably fixed by now; also seen when listing nodes/quotas/filesystems). Not linked to the update, just a long FST listing
- Data taking starts next Sat? - can we ask ATLAS for another update slot?
- Could this be behind the CMS instabilities (one crash per week, at 04:00 in the morning)? No, a similar thing was seen in LHCb
- ALICE
- Last week, we found that the AliceToken auth plugin wasn't loaded, effectively opening EOSALICE in Read/Write for everyone.
- Version 4.2.25 fixed this issue
- Slave updated; fix validated there
- Will failover later today.
- also needs to go to EOSALICEDAQ, Cristi will deploy there.
● EOS clients, FUSE(X)
(Jan)
- still waiting for a new FUSEX release; neither 4.2.24 (ATLAS/locking) nor 4.2.25 (ALICE/tokenauth) seems to have anything relevant for the "batch-scale" test.
(Andreas)
- debugged chunked upload on EOSHOME (a combination of 3 small differences to AQ) - working now
- example file was 2.3 GB in 2xx chunks, i.e. "very big".
- refactored eosxd read-ahead
- tagged versions perform very badly with high latency (such as CERN-Wigner, ~5 MB/s)
- reaches 850 MB/s with default values if the backend is fast enough and latency is at LAN level
- new version: a WIGNER node reading a 1 GB file from CERN via eosxd with dynamic read-ahead [125 MB/s] is faster than xrdcp
- static read-ahead: prefetch n blocks ahead
- dynamic read-ahead (new default): prefetch n blocks, increasing the prefetch block size with every hit, from 256k to 2M by default
- disabled on a miss (such as triggered by "md5sum"), re-enabled after 3 consecutive reads (might change)
- adapted read-ahead test suite
- e.g. some readers do not strictly behave like an aligned forward reader (now keep read-ahead blocks [x-1, x+n] to deal with that)
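The dynamic read-ahead policy described above can be sketched roughly as follows. This is only an illustration: the doubling growth rule and the exact bookkeeping are assumptions; only the 256k-2M range, the disable-on-miss behaviour, and the re-enable after 3 consecutive reads come from the notes.

```python
# Hypothetical sketch of the dynamic read-ahead policy (NOT eosxd's
# actual implementation). Assumed: block size doubles on each
# sequential hit; a miss resets it and disables prefetching.

MIN_BLOCK = 256 * 1024       # 256k default start (from the notes)
MAX_BLOCK = 2 * 1024 * 1024  # 2M default ceiling (from the notes)

class DynamicReadAhead:
    def __init__(self):
        self.block = MIN_BLOCK
        self.enabled = True
        self.next_offset = 0      # expected offset of the next sequential read
        self.sequential_hits = 0

    def on_read(self, offset, length):
        """Return how many bytes to prefetch after this read (0 = disabled)."""
        sequential = (offset == self.next_offset)
        self.next_offset = offset + length
        if sequential:
            self.sequential_hits += 1
            if not self.enabled and self.sequential_hits >= 3:
                self.enabled = True  # re-enable after 3 consecutive reads
            elif self.enabled:
                # grow the prefetch block with every hit, capped at 2M
                self.block = min(self.block * 2, MAX_BLOCK)
        else:
            # miss (random access): disable prefetching, reset state
            self.enabled = False
            self.sequential_hits = 0
            self.block = MIN_BLOCK
        return self.block if self.enabled else 0
```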
- not useful to tag 4.2.26 before these two issues are resolved. Massimo to test with a snapshot and report back. Will tag once OK, then roll out widely.
Q (Kuba): directory mtime is apparently not preserved, which affects the sync client (believes this was the result of a backup+restore?). Perhaps it reports ctime? -> ticket.
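The mtime/ctime distinction behind this question can be demonstrated with a short sketch (plain POSIX semantics, not CERNbox- or EOS-specific code): a restore can put a directory's mtime back with utime(), but ctime is kernel-managed and records the restore itself.

```python
import os
import tempfile
import time

# Sketch of why a client reporting ctime instead of mtime would see
# restored directories as freshly "modified" (plain POSIX behaviour,
# not CERNbox/EOS code).

d = tempfile.mkdtemp()
past = time.time() - 3600
os.utime(d, (past, past))           # a "restore" can set atime/mtime back...
st = os.stat(d)
assert abs(st.st_mtime - past) < 1  # ...so mtime is preserved,
# ...but ctime cannot be set back: it records the inode change made by
# the restore itself, so it reads as "now".
assert st.st_ctime > past + 1
```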
● Development issues
(Georgios)
- Now possible to secure a QDB cluster with a shared secret (version 0.2.6). Clients are authenticated by solving an HMAC signing challenge, thus proving they know the secret.
- In the interest of ease of use (e.g. through redis-cli), it is also possible to authenticate by providing the secret directly in plaintext (AUTH command), but this is discouraged.
- There is no encryption. If needed, we could use TLS; basic support in QClient and QDB is already implemented.
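The challenge-response idea can be sketched roughly as follows. This is only an illustration of the mechanism, not the actual QuarkDB wire protocol; SHA-256 and the 32-byte nonce are assumptions.

```python
import hashlib
import hmac
import os

# Minimal sketch of an HMAC challenge-response handshake, as described
# above. NOT the real QuarkDB protocol; hash choice and nonce size are
# illustrative assumptions.

def server_make_challenge() -> bytes:
    return os.urandom(32)  # fresh random nonce, never reused

def client_sign(secret: bytes, challenge: bytes) -> bytes:
    # The client proves knowledge of the secret without ever sending it.
    return hmac.new(secret, challenge, hashlib.sha256).digest()

def server_verify(secret: bytes, challenge: bytes, response: bytes) -> bool:
    expected = hmac.new(secret, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

Contrast with the AUTH path: there the secret itself crosses the wire in plaintext, which is why it is discouraged.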
- How to enable:
- On the QuarkDB side: add "redis.password_file /etc/path/to/secret" on all nodes. The file should contain a single line with the secret and have 0400 permissions.
- This still allows unauthenticated connections from localhost, which simplifies using redis-cli. Is that an issue?
- On EOS side: Add "mgmofs.qdbpassword_file /etc/path/to/secret" to xrd.cf.mgm, and "fstofs.qdbpassword_file /etc/path/to/secret" to xrd.cf.fst.
- Discussion
- should keep it simple; for now we only need authentication (not encryption). TLS might be overkill (need to generate certs at startup)?
- localhost is not authenticated
- "eos new-find" command - does this need keys? No, it goes to the MGM first for auth
- needs puppet magic.
Q (Georgios): EOSPPS consistently has faster boot times and higher rates (3x) - why? (But in line with Hervé's fast EOSATLAS boot time.)
● AOB
EOSHOME status:
- have EOSHOME redirector
- have first MGM (eoshome-1)
- issues found
- (chunked uploads)
- sharing in CERNbox needs some work (Kuba - see mtime())
- need to set up a dummy LXPLUS node (get config right)
- need released 4.2.26
- need puppet module support
- need backup mechanism for EOSHOME
- need EOSBACKUP on citrine+QDB (upgrade in place)
- saw a "too many files" storm in dumpmd - fixed in 4.2.25
- Elvin also still wants to stop EOSBACKUP to do some conversion tests - Decision: do it on Monday. Elvin to tell Georgios+Kuba to stop the backup scripts.
- need more hardware + puppet for actual QDB conversion: Wed/Thu.
- -2 and -3 are being set up (install-time problems: Mellanox driver)
- EOSHOME-1 will update to QDB-with-password
- data migration script is being prepared (based on Elvin's "eosbackup" script, to set attributes). For now can use "rsync".
- Andreas: there is a truly "parallel" rsync - use that?
- Luca: would like to use TPC, but it does not handle a full tree. Could run once with "eos cp", then sync up via rsync.
- strategy for move: ourselves, new users, and copy existing old data in parallel.
- Timeline: "new users on EOSHOME before Sept" is still realistic.