● EOS production instances (LHC, PUBLIC, MEDIA)
(jan):
● CERNBox / EOSUSER / EOSHOME
● EOS clients, FUSE(X)
(jan):
- Status: 4.4.0 in "production", same as "qa"
- FUSEX on /eos/lhcb, /eos/home-X, /eos/pps (all in "production")
- FUSEX rollout blocked since last week:
- neither mounting the non-AMS EOSPUBLIC areas via FUSEX in "production",
- nor starting to mount EOSATLAS, EOSCMS, EOSMEDIA in "qa" for now
- EOSPUBLIC also seems somewhat unstable (https://its.cern.ch/jira/browse/EOS-2950); other instances should perhaps wait before going to 4.4.X (required for FUSEX) until things stabilize.
- next steps (if any)?
- Bugs - see Open FUSEX issues
- several LXPLUS "eosxd" abort() with corrupted stack trace (coming in waves - due to "eos fusex evict"?). No JIRA yet.
- several LXBATCH stuck "df" for FUSEX - mostly leftover /eos/ams mounts? No JIRA yet. (see check below)
- several FUSEX-as-homedir stuck - EOS-2988 (Massimo, random LXPLUS), EOS-2983 (single-file), EOS-2894 (xauth lockfile dance)
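- A minimal check to spot such stuck mounts on a node (a sketch using coreutils 'timeout'; the 2s bound and the fuse//eos filters are our assumptions, not official tooling):

    # probe every fuse-mounted /eos path with a hard timeout;
    # a mount whose statfs() hangs (the stuck-"df" symptom) is reported
    awk '$3 ~ /^fuse/ && $2 ~ /^\/eos/ {print $2}' /proc/mounts |
    while read m; do
        timeout 2 stat -f "$m" >/dev/null 2>&1 || echo "STUCK: $m"
    done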
(andreas):
- eosxd problems in production version
- hanging df
- writing large files (where t_write > 300s)
- new files show size 0 after a listing
- many old mounts still around (evicted ~2000 mounts today - versions 4.3.9, 4.3.11, 4.3.13)
- listing cache not properly activated
- automount extremely slow, since each per-letter mount directory holds O(100) users
- fixes ready for new release
- all 'df' calls now run with a 2s timeout (they never block)
- while writing a large file, the client now publishes the updated file size to other clients at regular intervals during the write (default 5s); a listing after expiration of the cache subscription no longer wipes the size of an open file to 0
- the dentry cache is now properly populated and used
- an autofs mount no longer triggers retrieval of subscriptions for all directory children (e.g. the hundreds of users in a per-letter mount directory) - the initial autofs mount took ~5s for letter 'a', now a few ms (quick checks below)
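- Two quick client-side checks for the fixes above (sketches only; the paths are placeholders for illustration, not actual test locations):

    # autofs: the first access triggers the mount of a per-letter directory;
    # before the fix this took ~5s (subscriptions fetched for all children),
    # afterwards it should return in a few ms
    time stat /eos/home-a >/dev/null

    # size publishing: write a large file from one client ...
    dd if=/dev/zero of=/eos/home-a/auser/bigfile bs=1M count=10000
    # ... and watch from a second client: the reported size should grow
    # roughly every 5s and never fall back to 0 while the file is open
    watch -n 5 ls -l /eos/home-a/auser/bigfile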
- remaining issues
- login lockups
- non-scalable listing of large directories - new prototype done
- to avoid any mount locks: should we reduce timeouts from 1d to a reasonable, shorter value until we are confident we no longer block?
-> Need a new tag. Test, go for "qa" -> Dan.
Suggestion to keep EOSWEB on "old" FUSE for now (seems to work for them). Dan to suggest a configuration.
(Giuseppe: also an issue with "eosd" processes running since June despite RPM updates - how to force a restart? Known issue; deliberate decision not to force a restart. Could "evict" from the server side after a while, for clients with no inodes in use and old versions. Manual for now, later via cronjob - sketch below.)
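A sketch of what the manual procedure (later the cronjob) could look like; 'eos fusex ls' and 'eos fusex evict' exist, but the column layout parsed below is an assumption to be adapted:

    # list connected FUSEX clients, select those still on old 4.3.x
    # releases, and evict them so they reconnect with a current client
    eos fusex ls | awk '/4\.3\.(9|11|13)/ {print $1}' |
    while read uuid; do
        eos fusex evict "$uuid"
    done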
● AOB
(jan):
EOSPROJECT status: EOS-2965
- 1st instance server-side setup OK.
- no FST yet - waiting for the EOSPPS drain (see below)
EOSPPS:
- generally: many errors (non-existing replicas); clearly not being treated as a pre-production instance
- consequence: very slow drain - EOS-2976
- general impression: drain in general is a problem area, needs manual effort (retries, file-by-file actions)
- re-enabled "fsck" (in order to run auto-repair) - this now seems to trigger the Grafana "Namespace latency" alert - a bug?
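- For reference, the usual operator views for following drain and fsck (a sketch; exact output columns differ between versions):

    # drain view: per-filesystem drain status and progress
    eos fs ls -d

    # fsck: error counters and per-category report of inconsistencies
    eos fsck stat
    eos fsck report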
$HOME:
- Looking for volunteers to get an EOS home directory (Massimo's mail)
- self-service will allow changing the home directory entry to point to EOS - see the dev version of CERN Resources. ETA: next Monday
AFS phaseout
- timeline communicated at ITUM - "restarting in 1Q2019, done before RUN3"
Kuba: EOS_MGM_URL for 'c3' points to root://eospps - do we want to change that (e.g. to point to EOSUSER)? Jan: would rather have it removed entirely, with a clear error message if it is unset (and required), plus auto-guessing based on the current directory/pathname (EOS-1397 - Andreas: could do, the info is available when a path is given).
AP: "eos quota" now has some built-in assumptions that it should go to the correct EOSHOME instance (?)