EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Chair: Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, and explain the context
  • Last week: major issues, preferably with a ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be covered under "AOB".

 

● EOS production instances (LHC, PUBLIC, MEDIA)

(jan):


● CERNBox / EOSUSER / EOSHOME


● EOS clients, FUSE(X)

(jan):

  • Status: 4.4.0 in "production", same as "qa"
  • FUSEX on /eos/lhcb, /eos/home-X, /eos/pps (all in "production")
  • FUSEX rollout blocked since last week
  • Bugs - see Open FUSEX issues:
    • several LXPLUS "eosxd" abort() with corrupted stacktrace (comes in waves - possibly triggered by "eos fusex evict"? see the command sketch below). No JIRA.
    • several LXBATCH nodes with stuck "df" on FUSEX - mostly leftover /eos/ams mounts? No JIRA.
    • several FUSEX-as-homedir hangs - EOS-2988 (Massimo, random LXPLUS nodes), EOS-2983 (single file), EOS-2894 (xauth lockfile dance)
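
A minimal sketch (assuming the usual "eos fusex" admin subcommands; the client UUID is a placeholder) of how attached eosxd clients can be listed and evicted from the MGM side:

    # on the MGM: list attached eosxd clients with version and host
    eos fusex ls
    # evict a single client (UUID taken from the listing above) - the
    # operation suspected of triggering the abort() waves on LXPLUS
    eos fusex evict <client-uuid>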

(andreas):

  • eosxd problems in the production version
    • hanging "df"
    • problems writing large files (where t_write > 300 s)
    • new files show size 0 after a directory listing
    • many old mounts still around (evicted ~2000 mounts today - versions 4.3.9, 4.3.11, 4.3.13)
    • listing cache not properly activated
    • automount extremely slow because each per-letter mount directory holds O(100) users
       
  • fixes ready for the new release
    • all 'df' calls now run with a 2 s timeout (never block; a quick client-side check is sketched below)
    • during large-file writes the client periodically publishes the new file size to other clients (default every 5 s); a listing after the cache subscription has expired no longer wipes the size of an open file to 0
    • dentry cache is now properly populated and used
    • autofs mount no longer retrieves subscriptions for all directory children (e.g. the hundreds of users in a mount directory) - the initial autofs mount for letter "a" took ~5 s, now a few ms
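
A quick way to sanity-check the non-blocking "df" behaviour on a client node (the mount path is only an example):

    # should return within ~2 s even if the instance is slow or unresponsive
    time df -h /eos/home-a
    # confirm which eosxd/FUSE mounts the kernel currently knows about
    grep -E 'fuse|/eos/' /proc/mounts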
       
  • remaining issues
    • login lockups
    • non-scalable listing of large directories - new prototype done
    • to avoid mount lock-ups, should we reduce the timeouts from 1 d to a reasonably shorter value until we are confident we no longer block?
       

-> Need a new tag. Test it, then move it to "qa" -> Dan.

Suggestion: keep EOSWEB on the "old" FUSE client for now (it seems to work for them). Dan to suggest a configuration.

(Giuseppe: also an issue with "eosd" processes still running since June despite RPM updates - how to force a restart? Known open issue; it was a deliberate decision not to force restarts. Could "evict" such clients from the server side after a while, when no inodes are in use and the version is old; do this manually for now, later via a cronjob. A way to spot such stale processes is sketched below.)
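
A minimal sketch (standard Linux commands; the RPM names are those commonly used for the EOS clients) of how to spot an eosd/eosxd process still running an old, already-replaced binary:

    # installed client RPM versions
    rpm -q eos-fuse eos-fusex
    # running eosd/eosxd processes with their start times
    ps -o pid,lstart,cmd -C eosd -C eosxd
    # "(deleted)" here means the process still runs a binary replaced by an update
    ls -l /proc/$(pgrep -o eosd)/exe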

 


● AOB

(jan)

EOSPROJECT status: EOS-2965

  • 1st instance server-side setup OK.
  • no FSTs yet - waiting for the EOSPPS drain (see below)

EOSPPS:

  • generally: many errors (non-existing replicas); clearly has not been treated as a pre-production instance
  • consequence: very slow drain - EOS-2976
    • general impression: draining is a problem area in general and needs manual effort (retries, per-file actions; see the sketch below)
  • re-enabled "fsck" (in order to run auto-repair) - this now seems to trigger the Grafana "Namespace latency" alert - bug?

$HOME:

  • Looking for volunteers to get an EOS home directory (Massimo's mail)
  • the self-service page will allow users to change their home-directory entry to point to EOS - see the dev version of the CERN Resources portal; the current entry can be checked as sketched below. ETA: next Monday
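
For reference, how to check what a user's home-directory entry currently points to (standard commands; the username is a placeholder):

    # home directory as recorded in the account database
    getent passwd <username> | cut -d: -f6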

AFS phaseout

  • timeline communicated at ITUM - "restarting in 1Q2019, done before RUN3"

Kuba: EOS_MGM_URL on 'c3' points to root://eospps - do we want to change that (e.g. to point to EOSUSER)? Jan: would rather have the default removed entirely, with a clear error message if it is unset (and required), plus auto-guessing of the instance from the current directory/pathname (EOS-1397 - Andreas: could do, the information is available when a path is given).
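
For context, a minimal sketch of how the variable steers the CLI today (the EOSUSER endpoint is given only as an example):

    # current default on 'c3'
    export EOS_MGM_URL=root://eospps
    # what a user working on home directories would typically want instead
    export EOS_MGM_URL=root://eosuser.cern.ch
    eos whoami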

AP: "eos quota" now has some built-in assumptions that it should go to the correct EOSHOME instance (?)

 

Timetable
    • 16:00 - 16:10
      EOS production instances (LHC, PUBLIC, MEDIA) 10m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Luca Mascetti (CERN)

    • 16:10 - 16:20
      CERNBox / EOSUSER / EOSHOME 10m

      major issues for EOSUSER (non-sync-side of CERNBox)

      Speaker: Hugo Gonzalez Labrador (CERN)
    • 16:20 - 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

    • 16:25 - 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)
    • 16:35 - 16:50
      AOB 15m
