EOS DevOps Meeting
Weekly meeting to discuss progress on EOS rollout.
- please keep content relevant to (most of) the audience, explain context
- Last week: major issues, preferably with ticket
- This week/planning: who, until when, needs what?
Add your input to the "contribution minutes" before the meeting; otherwise it will be handled under "AOB".
● EOS production instances (LHC, PUBLIC, USER)
EOSATLAS
Last Thursday's incident led to the creation of these two tickets:
- EOS-2600: A clean FST shutdown wrongly marks local LevelDB as dirty
- EOS-2601: GeoScheduler misbehaviour when cluster is degraded
● EOS clients, FUSE(X)
(Jan):
- still-pending "eosd" config change (EOS_FUSE_LAZYOPENRW=1 etc) - CRM-2669 - ETA tomorrow.
- "eosxd" also may need "cleanup" script to prevent stuck mountpoints - EOS-2614
- decision (input from Elvin): not deploying 4.2.23 on clients
- ETA for "eosxd" slow GIT checkout (W.Lampl, EOS-2589 has a commit but is "in progress")? [Andreas: I closed it => 4.2.24]
(Andreas):
- everything besides the exos branch [rados] has been merged into the 'dev' branch
- file inlining has been fixed in the 'dev' branch (it is off by default anyway)
- at least with the 'dev' branch I see an issue where the current working directory becomes inaccessible (has anybody observed this on lx*?) and a 'cd .' is required to recover - looking into it
- Luca will deploy the "dev" branch
- fixed a FUSEX SEGV issue related to inline repair (too small buffer) found by Rainer - needs to be ported to 4.2.24
(Enrico):
- SWAN has been running 4.2.22 for ~1 week; a series of crashes but no core dumps (abrtd setup?), will try to reproduce/provide more info.
● Development issues
(Georgios)
- Fixing the last few remaining places where synchronous requests to QDB could lock up the namespace and cause the MGM to become unresponsive for several seconds (a generic sketch of this pattern follows below).
- Mostly related to FilesystemView, used by MGM services like Balancer, etc.
- (EOS-2610) The PPS MGM hangs every day around 2pm for a couple of minutes but recovers quickly (no crash, the namespace just remains unavailable). Almost certainly related to the above; fixing that should also resolve this.
  - does not seem to be a cronjob (internal?); happens every 24h+30min.
- (have deleted some files on EOSPPS; will go below 3.5B to leave some room for operations)
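For context, a minimal, hypothetical Python illustration of the lock-up pattern described above (not EOS code; names and timings are invented): a background service holds the global namespace lock across a slow synchronous backend call, so every other namespace operation stalls for the duration of that call.

    # Hypothetical illustration (not EOS code): a service thread performs a
    # slow synchronous backend request while holding the namespace lock,
    # so an ordinary client request has to wait the full round-trip time.
    import threading
    import time

    ns_lock = threading.Lock()          # stands in for the global namespace lock

    def slow_backend_request():
        time.sleep(5)                   # stands in for a synchronous QDB round trip
        return {}

    def background_service():
        with ns_lock:                   # lock held across the remote call -> stalls everyone
            slow_backend_request()

    def client_stat():
        t0 = time.time()
        with ns_lock:                   # ordinary request now waits ~5 seconds
            pass
        print("stat took %.1fs" % (time.time() - t0))

    threading.Thread(target=background_service).start()
    time.sleep(0.1)
    client_stat()

The generic fix for such a pattern is to perform the backend fetch outside the lock (or asynchronously) and only take the lock to apply the result.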
EOSBACKUP - wants access to the MGM machine for 1 day (to boot the namespace), then tag a release, then deploy the new NS on that machine (already on "citrine"). Backup traffic will be stopped tomorrow during the day; the namespace needs compacting (Georgios/Elvin/Kuba/Luca to coordinate).
● AOB
Massimo - needs the new FUSEX soon (4.2.24 should be released within days); currently stuck with "massive parallel FUSEX" testing.
- multi-mountpoint - might be EOS-2603 (but marked as low-prio)
- need "eso-cleanup" script soon (EOS-2614) since seeing a lot of EOS-2603 and nodes are unusable afterwards
- saw some "corruption"? (unclear, might have been SEGV as found by Rainer - see above, no ticket?)
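The content of the EOS-2614 cleanup script is not defined in these minutes; purely as a rough sketch of what such a script might do (detection logic and commands are assumptions, not the actual script), it could look for FUSE mountpoints that no longer answer stat() and lazily detach them:

    #!/usr/bin/env python
    # Hypothetical sketch (not the EOS-2614 script): find FUSE mounts whose
    # mountpoint no longer answers stat() and lazy-unmount them so the node
    # becomes usable again. A real script would also need a timeout around
    # the stat() call, since a truly stuck mount can block instead of failing.
    import os
    import subprocess

    def fuse_mountpoints():
        with open("/proc/mounts") as f:
            for line in f:
                fields = line.split()
                if fields[2].startswith("fuse"):    # e.g. fuse.eosxd
                    yield fields[1]

    for mp in fuse_mountpoints():
        try:
            os.stat(mp)                             # a healthy mount answers
        except OSError as err:
            print("stale mount %s (%s), detaching" % (mp, err))
            subprocess.call(["umount", "-l", mp])   # lazy unmount of the dead mount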
Cristi - security team request to scan for world-writable directories. Operations needs to look into this (with high priority). Cristi will do this for EOSPUBLIC, but somebody should reply for all EOS instances.
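A minimal sketch of such a scan, assuming the instance is reachable through a FUSE mount (the root path is a placeholder; a server-side "eos find" based scan would be an alternative):

    #!/usr/bin/env python
    # Minimal sketch: walk a FUSE-mounted EOS tree and report directories
    # that are world-writable (write bit set for "others").
    import os
    import stat
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "/eos/public"   # placeholder path

    for dirpath, dirnames, filenames in os.walk(root):
        try:
            mode = os.stat(dirpath).st_mode
        except OSError:
            continue                                  # skip unreadable entries
        if mode & stat.S_IWOTH:                       # world-writable bit set
            print(oct(mode & 0o7777), dirpath)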
Kuba - has drafted a policy for the EOSUSER service (i.e. what to answer regarding "big files", >1TB quota, FTS access, Grid integration, etc. - all of these should preferably go to the experiment instance).
- Jan: OK from our side (clarify the 1TB/2TB limit), but suggests clarifying the message with the ATLAS and LHCb storage experts before sending it to users.
- Massimo: in particular ATLAS has group areas for heavy analysis.