EOS DevOps Meeting

Name: EOS DevOps Meeting
Start: 2018-05-29T16:00:00+02:00
End: 2018-05-29T17:50:00+02:00
Location: CERN

Tuesday 29 May 2018, 16:00 → 17:50 Europe/Zurich

513/R-068 (CERN)

Jan Iven (CERN)

Description

Weekly meeting to discuss progress on EOS rollout.

please keep content relevant to (most of) the audience, explain context
Last week: major issues, preferably with ticket
This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting. Otherwise it will be handled under "AOB".

● EOS production instances (LHC, PUBLIC, USER)

EOSATLAS

Last Thursday's incident led to the creation of these two tickets:

EOS-2600: A clean FST shutdown wrongly marks local LevelDB as dirty
EOS-2601: GeoScheduler misbehaviour when cluster is degraded

● EOS clients, FUSE(X)

(Jan):

still-pending "eosd" config change (EOS_FUSE_LAZYOPENRW=1 etc) - CRM-2669 - ETA tomorrow.
"eosxd" also may need "cleanup" script to prevent stuck mountpoints - EOS-2614
decision (input from Elvin): not deploying 4.2.23 on clients
ETA for the slow "eosxd" Git checkout (W. Lampl; EOS-2589 has a commit but is "in progress")? [Andreas: I closed it => 4.2.24]

(Andreas):

everything besides exos branch [rados] has been merged into 'dev' branch
file inlining has been fixed in 'dev' branch (it is off by default anyway)
at least with the 'dev' branch I see an issue where the current working directory becomes inaccessible and a 'cd .' is required (has anybody observed this on lx*?) - looking into it
- Luca will deploy the "dev" branch
fixed a FUSEX SEGV issue related to inline repair (too small buffer) found by Rainer - needs to be ported to 4.2.24

(Enrico):

SWAN has been running 4.2.22 for ~1 week: a series of crashes but no coredumps (abrtd setup?). Will try to reproduce/provide more info.

● Development issues

(Georgios)

Fixing the last few remaining places where doing synchronous requests to QDB could lock up the namespace, and cause the MGM to become unresponsive for several seconds.
- Mostly related to FilesystemView, used by MGM services like Balancer, etc.
(EOS-2610) PPS MGM hangs every day around 2pm for a couple of minutes, but recovers quickly. (No crash, just the NS remains unavailable) Almost sure it's related to the above, fixing that should also resolve this.
- does not seem to be a cronjob (something internal?). Happens every 24h+30min.
(have deleted some files on EOSPPS, will go below 3.5B to have some room for operations)
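
To make the "every 24h+30min" observation for EOS-2610 concrete, here is a minimal sketch that projects when the next hangs would be expected. The first timestamp is a hypothetical placeholder (the minutes only say "around 2pm"), and the function name is invented for illustration. A fixed daily cronjob would fire at the same wall-clock time each day, whereas a 24h+30min period drifts by 30 minutes per occurrence.

```python
from datetime import datetime, timedelta


def expected_hangs(first: datetime, period: timedelta, count: int) -> list:
    """Return `count` projected hang times, starting at `first`, spaced by `period`."""
    return [first + i * period for i in range(count)]


# Hypothetical first observation; the real timestamp is not given in the minutes.
first_seen = datetime(2018, 5, 29, 14, 0)
period = timedelta(hours=24, minutes=30)

for when in expected_hangs(first_seen, period, 5):
    # The time of day moves 30 minutes later on each occurrence.
    print(when.isoformat(sep=" ", timespec="minutes"))
```

Comparing such a projection with the actual hang timestamps in the MGM logs would help confirm or rule out the fixed-period hypothesis.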

EOSBACKUP - want access to MGM machine for 1 day (to boot namespace), then tag release, then deploy new NS on that machine (is already on "citrine"). Will stop backup traffic tomorrow during the day, needs compacting (Georgios/Elvin/Kuba/Luca to coordinate).

● AOB

Massimo - need new FUSEX soon (4.2.24 should be released within days), stuck with "massive parallel FUSEX" testing.

multi-mountpoint - might be EOS-2603 (but marked as low-prio)
need "eos-cleanup" script soon (EOS-2614) since seeing a lot of EOS-2603 and nodes are unusable afterwards
saw some "corruption"? (unclear, might have been SEGV as found by Rainer - see above, no ticket?)

Cristi - the security team has requested a scan for world-writable directories. Operations need to look into this (with high priority). Cristi will do this for EOSPUBLIC, but somebody should reply for all EOS instances.
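
As an illustration of what such a scan could look like, here is a minimal sketch that walks a directory tree and reports directories with the POSIX "other write" bit set. It assumes the instance is reachable as a local or FUSE-mounted path, and it looks only at mode bits, not at any EOS-specific ACLs. The function name is hypothetical, and a namespace-side query would likely be far more efficient on a large instance.

```python
import os
import stat
from typing import Iterator


def world_writable_dirs(root: str) -> Iterator[str]:
    """Yield `root` and any directory below it that is writable by 'other'."""

    def is_world_writable_dir(path: str) -> bool:
        try:
            mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
        except OSError:
            return False  # vanished or unreadable entry: skip it
        return stat.S_ISDIR(mode) and bool(mode & stat.S_IWOTH)

    if is_world_writable_dir(root):
        yield root
    for dirpath, dirnames, _filenames in os.walk(root, followlinks=False):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if is_world_writable_dir(path):
                yield path
```

For example, `for d in world_writable_dirs("/some/mounted/path"): print(d)` would list the matching directories under a placeholder path.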

Kuba - have drafted a policy for the EOSUSER service (i.e. what to answer for "big files", >1TB quota, FTS access, Grid integration etc. - all should try to go to the experiment instance).

Jan: OK from our side (clarify the 1TB/2TB limit), but suggest clarifying the message with ATLAS and LHCb storage experts before sending it to users.
Massimo: in particular ATLAS has group areas for heavy analysis.

- 16:00 → 16:20
  EOS production instances (LHC, PUBLIC, USER) 20m
  - major events last week
  - planned work this week
  Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)
- 16:20 → 16:25
  EOS clients, FUSE(X) 5m
  - (major) issues seen
  - Rollout of new versions and FUSEX
  Speakers: Dan van der Ster (CERN), Jan Iven (CERN)
- 16:25 → 16:35
  Development issues 10m
  - New namespace
  - Testing
  - Xrootd
  Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)
- 16:35 → 16:50
  AOB 15m