EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be handled under "AOB".

● EOS production instances (LHC, PUBLIC, USER)

EOSBACKUP:

  • namespace converted to QuarkDB
  • daily backups restarted
  • issues observed due to a large number of files under one single directory (eosarchi's recycle bin, containing ~65M files), causing cache thrashing; a mechanism put in place to log threads that take too long to run was itself segfaulting, adding to the problem (the instance seemed locked)
  • increasing the file cache limit from 30M to 70M seems to have worked around the problem, and the instance ran smoothly during the night (see the toy sketch below)
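
A toy illustration of the thrashing described above (a Python sketch, not EOS code; the scaled-down numbers 65/30/70 stand in for 65M files vs. 30M/70M cache entries): when a sequential sweep over a single directory is larger than an LRU cache, every entry is evicted before it can be reused, so repeated sweeps get essentially no hits; once the capacity exceeds the working set, the second sweep is served entirely from cache.

    from collections import OrderedDict

    class LRUCache:
        """Toy LRU cache; stands in for the namespace file cache."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()
            self.hits = self.misses = 0

        def access(self, key):
            if key in self.entries:
                self.entries.move_to_end(key)          # refresh recency
                self.hits += 1
            else:
                self.misses += 1
                self.entries[key] = True
                if len(self.entries) > self.capacity:
                    self.entries.popitem(last=False)   # evict least recently used

    def hit_rate(capacity, n_files, passes=2):
        # Repeated sequential sweeps over one huge directory, as a backup run would do.
        cache = LRUCache(capacity)
        for _ in range(passes):
            for fid in range(n_files):
                cache.access(fid)
        return cache.hits / (cache.hits + cache.misses)

    print(hit_rate(capacity=30, n_files=65))   # 0.0 -- cache smaller than the directory: pure thrashing
    print(hit_rate(capacity=70, n_files=65))   # 0.5 -- second sweep is served entirely from cache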

● EOS clients, FUSE(X)

(Jan)

  • DNS aliases "eosuser-fuse.cern.ch" and "eosproject-fuse.cern.ch" in production since yesterday, slowly being picked up (e.g. 22 of 77 LXPLUS nodes run with this)
  • eos-client-4.3.5 is now in "qa" (until next Monday - Dan or Luca to "koji tag-build" to production iff OK)
    • saw spurious I/O errors on functional_tests/test_sqlite.py against EOSUSER (load-related?)

(Andreas)

  • Has taken over the "batch scale" test from Massimo. Trying batch jobs that create private mounts; these are core-dumping all the time, so the tests will be sent to UAT for the time being.

● Development issues

(Georgios)

  • Implemented a solution to the 256M directory ID limitation, using a new inode encoding scheme. Now both file IDs and directory IDs are capped at 2^63 (up from 2^35 files and 2^28 directories with the previous scheme); see the sketch after this list.
  • The compatibility situation is subtle, as older eosd versions will not work once the new scheme is activated.
  • The plan:
    • eosd 4.3.6 will support both encoding schemes, and query the MGM at startup on which to use.
    • MGM 4.3.6 has dormant support, but still uses old scheme.
    • Months from now, we flip the switch in a new release and MGM 4.x.y starts using the new scheme; eosd versions prior to 4.3.6 stop working.
    • This gives a long "window of compatibility" to phase out older eosd versions.
      • Q: what will happen to non-updated old "eosd" - can we make sure these stop working, or at least identify from the logs (and try to contact the owner)?
        • Guess: "will just crash". Hard to identify from logs. Worst-case: access some random other files/directories??
      • Q: do we want to support "eosd" and "eosxd" in parallel for long, or fully deprecate "eosd" once "eosxd" is stable?
        • To be seen; "eosd" is stateless and might be useful for some workloads.
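
To make the ID-space change above concrete, here is a minimal Python sketch of one possible 64-bit inode encoding. The bit layout (top bit tagging directories) and the helper names (encode_inode, decode_inode, pick_encoding) are illustrative assumptions, not the actual EOS implementation; the minutes only state the new 2^63 caps and the startup negotiation.

    # Hypothetical layout, for illustration only: the most significant bit of the
    # 64-bit inode tags a directory, leaving up to 2^63 IDs for files and 2^63 for
    # directories (vs. ~2^35 files / ~2^28 directories in the previous scheme).
    DIR_BIT = 1 << 63
    ID_MASK = DIR_BIT - 1

    def encode_inode(ident: int, is_directory: bool) -> int:
        """Pack a file or container ID into a 64-bit inode (assumed new scheme)."""
        assert 0 <= ident < (1 << 63), "IDs are capped at 2^63 in the new scheme"
        return (ident | DIR_BIT) if is_directory else ident

    def decode_inode(inode: int) -> tuple[int, bool]:
        """Recover (ID, is_directory) from a 64-bit inode."""
        return inode & ID_MASK, bool(inode & DIR_BIT)

    def pick_encoding(mgm_uses_new_scheme: bool) -> str:
        """Stand-in for the eosd >= 4.3.6 startup query: ask the MGM which
        scheme to use and keep that answer for the lifetime of the mount."""
        return "new" if mgm_uses_new_scheme else "old"

    ino = encode_inode(42, is_directory=True)
    assert decode_inode(ino) == (42, True)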

           

(Andreas)

  • refactored 'Commit' method
  • refactoring recycle bin hash policies
  • introducing gRPC server into MGM

● AOB

(Luca)

  • 2 out of 4 disk servers on the EOSHOME-01 instance got disk-wiped, more than 4 weeks after entering production. This was apparently due to a lingering install-time action (an explicit wipe) that had been left over because these machines needed manual intervention (Mellanox, no network link after installation?). The wipe was triggered by the operator resetting these machines after a NO_CONTACT.
    • the install script is being used successfully on the LHC instances (including with Mellanox network, no manual action needed), but EOSHOME has a different config (one FST process per disk)?
    • could ask the procurement team for workarounds / hardware parameters for the Mellanox issue?
    • will add some safety checks to the script - EOS-2750 (Roberto)
    • ~25% of the data on this instance was lost and will need to be re-imported.
  • as a consequence, massive draining was triggered; this exposed bugs in both the "autodrain" and the new "centralized drain". Also, QuarkDB has "critical" errors and does not boot
    • considered "critical" -> Luca, Georgios, Elvin looking into this

(Luca)

  • Went through the "Massimo" planning Excel sheet. Overall still mostly OK, but
    • EOSBACKUP migration to QuarkDB: done but one week late.
    • need to contact LHCb to get their OK to switch to FUSEX by the end of August. Herve will try to contact them (holidays...).
    • need to migrate ST users from EOSUSER to EOSHOME this week (data copy AND change-over to the new location). Might slip to Monday next week.
    • migration procedure ("written" this week): still split over several scripts and not fully documented
    • old "cernbox" clients: being contacted (Remy).
  • Should find a better way to track this than via Excel.

● Timetable

  • 16:00-16:20: EOS production instances (LHC, PUBLIC, USER) (20m)
    • major events last week
    • planned work this week
    Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

  • 16:20-16:25: EOS clients, FUSE(X) (5m)
    • (major) issues seen
    • Rollout of new versions and FUSEX
    Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

  • 16:25-16:35: Development issues (10m)
    • New namespace
    • Testing
    • Xrootd
    Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)

  • 16:35-16:50: AOB (15m)
