EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be handled under "AOB".

● EOS production instances (LHC, PUBLIC, USER)

EOSBACKUP:

  • namespace converted to QuarkDB
  • daily backups restarted
  • issues observed due to a large number of files under one single directory (eosarchi's recycle bin, containing ~65M files), causing cache thrashing; a mechanism put in place to log threads that take too long to run was itself segfaulting, adding to the problem (the instance seemed locked)
  • increasing the file cache limit from 30M to 70M seems to have worked around the problem, and the instance ran smoothly during the night (see the toy sketch below)
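
A toy illustration of the thrashing described above (a Python sketch, not EOS code; the scaled-down numbers 65/30/70 stand in for 65M files vs. 30M/70M cache entries): when a sequential sweep over a single directory is larger than an LRU cache, every entry is evicted before it can be reused, so repeated sweeps get essentially no hits; once the capacity exceeds the working set, the second sweep is served entirely from cache.

    from collections import OrderedDict

    class LRUCache:
        """Toy LRU cache; stands in for the namespace file cache."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()
            self.hits = self.misses = 0

        def access(self, key):
            if key in self.entries:
                self.entries.move_to_end(key)          # refresh recency
                self.hits += 1
            else:
                self.misses += 1
                self.entries[key] = True
                if len(self.entries) > self.capacity:
                    self.entries.popitem(last=False)   # evict least recently used

    def hit_rate(capacity, n_files, passes=2):
        # Repeated sequential sweeps over one huge directory, as a backup run would do.
        cache = LRUCache(capacity)
        for _ in range(passes):
            for fid in range(n_files):
                cache.access(fid)
        return cache.hits / (cache.hits + cache.misses)

    print(hit_rate(capacity=30, n_files=65))   # 0.0 -- cache smaller than the directory: pure thrashing
    print(hit_rate(capacity=70, n_files=65))   # 0.5 -- second sweep is served entirely from cache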

● EOS clients, FUSE(X)

(Jan)

  • DNS aliases "eosuser-fuse.cern.ch" and "eosproject-fuse.cern.ch" in production since yesterday, slowly being picked up (e.g. 22 of 77 LXPLUS nodes run with this)
  • eos-client-4.3.5 is now in "qa" (until next Monday - Dan or Luca to "koji tag-build" to production iff OK)
    • saw spurious I/O errors on functional_tests/test_sqlite.py against EOSUSER (load-related?)

(Andreas)

  • Has taken over the "batch scale" test from Massimo. Trying batch jobs that create private mounts; these are core-dumping all the time, so the tests will be sent to UAT for the time being.

● Development issues

(Georgios)

  • Implemented a solution to the 256M directory ID limitation, using a new inode encoding scheme. Now both file IDs and directory IDs are capped at 2^63 (up from 2^35 files and 2^28 directories with the previous scheme); see the sketch after this list.
  • The compatibility situation is subtle, as older eosd versions will not work once the new scheme is activated.
  • The plan:
    • eosd 4.3.6 will support both encoding schemes, and query the MGM at startup on which to use.
    • MGM 4.3.6 has dormant support, but still uses old scheme.
    • Months from now, we flip the switch in a new release and MGM 4.x.y starts using the new scheme; eosd versions prior to 4.3.6 stop working.
    • This gives a long "window of compatibility" to phase out older eosd versions.
      • Q: what will happen to non-updated old "eosd" - can we make sure these stop working, or at least identify from the logs (and try to contact the owner)?
        • Guess: "will just crash". Hard to identify from logs. Worst-case: access some random other files/directories??
      • Q: do we want to support "eosd" and "eosxd" in parallel for long, or fully deprecate "eosd" once "eosxd" is stable?
        • To be seen; "eosd" is stateless and might be useful for some workloads.
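
To make the ID-space change above concrete, here is a minimal Python sketch of one possible 64-bit inode encoding. The bit layout (top bit tagging directories) and the helper names (encode_inode, decode_inode, pick_encoding) are illustrative assumptions, not the actual EOS implementation; the minutes only state the new 2^63 caps and the startup negotiation.

    # Hypothetical layout, for illustration only: the most significant bit of the
    # 64-bit inode tags a directory, leaving up to 2^63 IDs for files and 2^63 for
    # directories (vs. ~2^35 files / ~2^28 directories in the previous scheme).
    DIR_BIT = 1 << 63
    ID_MASK = DIR_BIT - 1

    def encode_inode(ident: int, is_directory: bool) -> int:
        """Pack a file or container ID into a 64-bit inode (assumed new scheme)."""
        assert 0 <= ident < (1 << 63), "IDs are capped at 2^63 in the new scheme"
        return (ident | DIR_BIT) if is_directory else ident

    def decode_inode(inode: int) -> tuple[int, bool]:
        """Recover (ID, is_directory) from a 64-bit inode."""
        return inode & ID_MASK, bool(inode & DIR_BIT)

    def pick_encoding(mgm_uses_new_scheme: bool) -> str:
        """Stand-in for the eosd >= 4.3.6 startup query: ask the MGM which
        scheme to use and keep that answer for the lifetime of the mount."""
        return "new" if mgm_uses_new_scheme else "old"

    ino = encode_inode(42, is_directory=True)
    assert decode_inode(ino) == (42, True)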

           

(Andreas)

  • refactored 'Commit' method
  • refactoring recycle bin hash policies
  • introducing gRPC server into MGM

● AOB

(Luca)

  • 2 out of 4 disk servers on the EOSHOME-01 instance got disk-wiped, more than 4 weeks after entering production. This was apparently due to a lingering install-time action (an explicit wipe) that had been left over because these machines needed manual intervention (Mellanox, no network link after installation?). The wipe was triggered by the operator resetting these machines after a NO_CONTACT.
    • the install script is being used successfully on the LHC instances (including with Mellanox network, no manual action needed), but EOSHOME has a different config (one FST process per disk)?
    • could ask the procurement team for workarounds / hardware parameters for the Mellanox issue?
    • will add some safety checks to the script - EOS-2750 (Roberto)
    • ~25% of the data on this instance was lost and will need to be re-imported.
  • as a consequence, massive draining was triggered; this exposed bugs in both the "autodrain" and the new "centralized drain". Also, QuarkDB has "critical" errors and does not boot
    • considered "critical" -> Luca, Georgios, Elvin looking into this

(Luca)

  • Went through the "Massimo" planning Excel sheet. Overall still mostly OK, but
    • EOSBACKUP migration to QuarkDB: done but one week late.
    • need to contact LHCb to get their OK to switch to FUSEX by the end of August. Herve will try to contact them (holidays...).
    • need to migrate ST users from EOSUSER to EOSHOME this week (data copy AND change-over to the new location). Might slip to Monday next week.
    • migration procedure ("written" this week): still split over several scripts and not fully documented
    • old "cernbox" clients: being contacted (Remy).
  • Should find a better way to track this than via Excel.

● Timetable

  • 16:00-16:20: EOS production instances (LHC, PUBLIC, USER) (20m)
    • major events last week
    • planned work this week
    Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

  • 16:20-16:25: EOS clients, FUSE(X) (5m)
    • (major) issues seen
    • Rollout of new versions and FUSEX
    Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

  • 16:25-16:35: Development issues (10m)
    • New namespace
    • Testing
    • Xrootd
    Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)

  • 16:35-16:50: AOB (15m)
