EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)
Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; anything else will be handled under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

LHC Instances

  • ATLAS
    • Updated to 4.2.24 yesterday morning
      • ✓ Boot time divided by 2
      • ✗ Crashed this morning in the TableFormatter - EOS-2662 
        • (Elvin: probably fixed by now; also seen when listing nodes/quotas/filesystems). Not linked to the update, just a long FST listing
        • Data taking starts next Sat? - can we ask ATLAS for another update slot?
        • Could this be behind the CMS instabilities (one crash/week, at 04:00 in the morning)? No, something similar has been seen in LHCb
  • ALICE
    • Last week, we found that the AliceToken auth plugin wasn't loaded, effectively opening EOSALICE in Read/Write for everyone.
    • Version 4.2.25 fixed this issue
    • The slave was updated and the fix validated there.
    • Will fail over later today.
    • Also needs to go to EOSALICEDAQ; Cristi will deploy there.

● EOS clients, FUSE(X)

(Jan)

  • still waiting for a new FUSEX release; neither 4.2.24 (ATLAS/locking) nor 4.2.25 (ALICE/tokenauth) seems to have anything relevant for the "batch-scale" test.

     

(Andreas)

  • debugged chunked upload on EOSHOME (a combination of 3 small differences to AQ) - working now
    • the example file was 2.3 GB in 2xx chunks, i.e. "very big"
  • refactored eosxd read-ahead
    • tagged versions perform very badly over high-latency links (such as CERN-Wigner, ~5 MB/s)
    • with default values, reaches 850 MB/s if the backend is fast enough and latency is LAN-like
    • new version: a Wigner node reading a 1 GB file from CERN via eosxd is faster than xrdcp with dynamic read-ahead [125 MB/s]
      • static read-ahead: prefetch n blocks ahead
      • dynamic read-ahead (new default): prefetch n blocks, increasing the prefetch block size with every hit from 256k up to 2M by default
      • disabled on a miss (such as triggered by "md5sum"), re-enabled after 3 consecutive reads (might change) - see the sketch after this list
    • adapted the read-ahead test suite
    • some readers (e.g. "md5sum") do not strictly behave like an aligned forward reader; read-ahead now keeps blocks [x-1, x+n] to deal with that
  • It is not useful to tag 4.2.26 without these two issues resolved. Massimo to test with a snapshot and report back; will tag once OK, then roll out widely.
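
A toy Python sketch of the dynamic read-ahead policy above (block sizes and thresholds are the defaults quoted in these minutes; the actual eosxd implementation is C++ and keeps the [x-1, x+n] block window rather than this simplified bookkeeping):

    class DynamicReadAhead:
        """Grow the prefetch size on sequential hits, back off on misses."""

        MIN_BLOCK = 256 * 1024        # initial prefetch block size (256k default)
        MAX_BLOCK = 2 * 1024 * 1024   # prefetch size grows up to 2M by default
        REENABLE_AFTER = 3            # consecutive sequential reads to re-arm

        def __init__(self) -> None:
            self.block = self.MIN_BLOCK
            self.enabled = True
            self.sequential = 0
            self.expected = 0          # offset an aligned forward reader would use next

        def on_read(self, offset: int, length: int) -> int:
            """Return the number of bytes to prefetch for this read (0 = disabled)."""
            if offset == self.expected:                  # hit: aligned forward read
                self.sequential += 1
                if not self.enabled and self.sequential >= self.REENABLE_AFTER:
                    self.enabled = True                  # re-enable after 3 reads in a row
                if self.enabled:
                    self.block = min(self.block * 2, self.MAX_BLOCK)  # grow on hit
            else:                                        # miss: non-sequential access
                self.enabled = False
                self.sequential = 0
                self.block = self.MIN_BLOCK
            self.expected = offset + length
            return self.block if self.enabled else 0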

 

Q (Kuba): directory mtime is apparently not preserved, which affects the sync client (believed to be the result of a backup+restore?). Perhaps the server reports ctime() instead? -> ticket.
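
For reference, the distinction at stake: mtime (content modification time) should survive a faithful backup+restore, while ctime (inode change time) is bumped by the restore itself, so a server returning ctime in place of mtime would produce exactly this symptom. A quick check (the path is hypothetical):

    import os, time

    st = os.stat("/eos/somedir")               # hypothetical directory
    print("mtime:", time.ctime(st.st_mtime))   # what the sync client relies on
    print("ctime:", time.ctime(st.st_ctime))   # changes on chmod/chown/restore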


● Development issues

(Georgios)

  • It is now possible to secure a QDB cluster with a shared secret (version 0.2.6). Clients are authenticated by solving an HMAC signing challenge, thus proving they know the secret (sketched after the discussion points below).
  • In the interest of ease of use (e.g. through redis-cli), it is also possible to provide the secret directly in plaintext to authenticate (AUTH command), but this is discouraged.
  • There is no encryption. If needed, we could use TLS; basic support in QClient and QDB is already implemented.
  • How to enable:
    • On QuarkDB side: Add "redis.password_file /etc/path/to/secret" on all nodes. The file should contain one line containing the secret, and have 0400 permissions.
    • This will still allow unauthenticated connections from localhost, which simplifies using redis-cli. Is that an issue?
    • On EOS side: Add "mgmofs.qdbpassword_file /etc/path/to/secret" to xrd.cf.mgm, and "fstofs.qdbpassword_file /etc/path/to/secret" to xrd.cf.fst.
  • Discussion
    • should keep it simple; for now we only need authentication (not encryption). TLS might be overkill (certificates would need to be generated at startup)?
    • localhost is not authenticated
    • "eos new-find" command - does this need keys? No, it goes to the MGM first for auth
    • needs Puppet magic.
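
For illustration, a minimal Python sketch of the challenge-response scheme described above; the hash, encoding, and message framing actually used by QuarkDB are not given in these minutes, so SHA-256 and hex digests here are assumptions:

    import hashlib
    import hmac
    import os

    def generate_challenge() -> bytes:
        """Server: random nonce sent to the connecting client."""
        return os.urandom(32)

    def sign_challenge(secret: bytes, challenge: bytes) -> str:
        """Client: prove knowledge of the shared secret without revealing it."""
        return hmac.new(secret, challenge, hashlib.sha256).hexdigest()

    def verify_response(secret: bytes, challenge: bytes, response: str) -> bool:
        """Server: recompute the signature and compare in constant time."""
        return hmac.compare_digest(sign_challenge(secret, challenge), response)

    # Round trip, reading the secret as from redis.password_file
    with open("/etc/path/to/secret", "rb") as f:
        secret = f.read().strip()
    challenge = generate_challenge()
    assert verify_response(secret, challenge, sign_challenge(secret, challenge))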

 

Q (Georgios): EOSPPS has consistently faster boot times and higher rates (3x) - why? (But this is in line with Hervé's fast EOSATLAS boot time.)


● AOB

EOSHOME status:

  • have the EOSHOME redirector
  • have the first MGM (eoshome-1)
  • have storage for the 2nd and 3rd instances
  • issues found
    • (chunked uploads)
    • sharing in CERNbox needs some work (Kuba - see the mtime() question above)
  • need to set up a dummy LXPLUS node (get config right)
    • need released 4.2.26
    • need puppet module support
  • need backup mechanism for EOSHOME
    • need EOSBACKUP on Citrine+QDB (upgrade in place)
      • saw a "too many files" storm in dumpmd - fixed in 4.2.25
      • Elvin also still wants to stop EOSBACKUP to do some conversion tests - Decision: do it on Monday. Elvin to tell Georgios+Kuba to stop the backup scripts.
      • need more hardware + Puppet for the actual QDB conversion: Wed/Thu.
  • eoshome-2 and -3 are being set up (install-time problems: Mellanox driver)
  • EOSHOME-1 will update to QDB-with-password
  • a data migration script is being prepared (based on Elvin's "eosbackup" script, to set attributes). For now "rsync" can be used.
    • Andreas: there is a truly "parallel" rsync - use that? (see the sketch after this list)
    • Luca: would like to use TPC, but it does not handle a full tree. Could run once with "eos cp", then sync up via rsync.
    • strategy for the move: ourselves first, then new users, while copying existing old data in parallel.
  • Timeline: "new users on EOSHOME before Sept" is still realistic.
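
One way to approximate the "parallel" rsync idea from the list above: run one rsync per top-level directory so several transfers proceed at once. Paths and worker count are purely illustrative; the actual migration will use the script based on Elvin's "eosbackup" (or a dedicated parallel-rsync tool):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    SRC = Path("/eos/olduser")   # hypothetical source tree (FUSE-mounted)
    DST = Path("/eos/home")      # hypothetical EOSHOME mount point

    def copy_subtree(subdir: Path) -> int:
        """One rsync per top-level directory; returns the rsync exit code."""
        target = DST / subdir.name
        return subprocess.call(["rsync", "-a", f"{subdir}/", f"{target}/"])

    with ThreadPoolExecutor(max_workers=8) as pool:
        failed = [rc for rc in pool.map(copy_subtree, sorted(SRC.iterdir())) if rc]
    print(f"{len(failed)} subtrees failed")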

 

Timetable:
    • 16:00 - 16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

    • 16:20 - 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)


    • 16:25 - 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)


    • 16:35 - 16:50
      AOB 15m
