EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Chair: Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, and explain the context
  • Last week: major issues, preferably with a ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be covered under "AOB".

 

● EOS production instances (LHC, PUBLIC, MEDIA)

(jan):


● CERNBox / EOSUSER / EOSHOME


● EOS clients, FUSE(X)

(jan):

  • Status: 4.4.0 in "production", same as "qa"
  • FUSEX on /eos/lhcb, /eos/home-X, /eos/pps (all in "production")
  • FUSEX rollout blocked since last week
  • Bugs - see Open FUSEX issues:
    • several LXPLUS "eosxd" abort() with corrupted stacktrace (comes in waves - possibly triggered by "eos fusex evict"? see the command sketch below). No JIRA.
    • several LXBATCH nodes with stuck "df" on FUSEX - mostly leftover /eos/ams mounts? No JIRA.
    • several FUSEX-as-homedir hangs - EOS-2988 (Massimo, random LXPLUS nodes), EOS-2983 (single file), EOS-2894 (xauth lockfile dance)
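
A minimal sketch (assuming the usual "eos fusex" admin subcommands; the client UUID is a placeholder) of how attached eosxd clients can be listed and evicted from the MGM side:

    # on the MGM: list attached eosxd clients with version and host
    eos fusex ls
    # evict a single client (UUID taken from the listing above) - the
    # operation suspected of triggering the abort() waves on LXPLUS
    eos fusex evict <client-uuid>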

(andreas):

  • eosxd problems in the production version
    • hanging "df"
    • problems writing large files (where t_write > 300 s)
    • new files show size 0 after a directory listing
    • many old mounts still around (evicted ~2000 mounts today - versions 4.3.9, 4.3.11, 4.3.13)
    • listing cache not properly activated
    • automount extremely slow because each per-letter mount directory holds O(100) users
       
  • fixes ready for the new release
    • all 'df' calls now run with a 2 s timeout (never block; a quick client-side check is sketched below)
    • during large-file writes the client periodically publishes the new file size to other clients (default every 5 s); a listing after the cache subscription has expired no longer wipes the size of an open file to 0
    • dentry cache is now properly populated and used
    • autofs mount no longer retrieves subscriptions for all directory children (e.g. the hundreds of users in a mount directory) - the initial autofs mount for letter "a" took ~5 s, now a few ms
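
A quick way to sanity-check the non-blocking "df" behaviour on a client node (the mount path is only an example):

    # should return within ~2 s even if the instance is slow or unresponsive
    time df -h /eos/home-a
    # confirm which eosxd/FUSE mounts the kernel currently knows about
    grep -E 'fuse|/eos/' /proc/mounts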
       
  • remaining issues
    • login lockups
    • non-scalable listing of large directories - new prototype done
    • to avoid mount lock-ups, should we reduce the timeouts from 1 d to a reasonably shorter value until we are confident we no longer block?
       

-> Need a new tag. Test it, then move it to "qa" -> Dan.

Suggestion: keep EOSWEB on the "old" FUSE client for now (it seems to work for them). Dan to suggest a configuration.

(Giuseppe: also an issue with "eosd" processes still running since June despite RPM updates - how to force a restart? Known open issue; it was a deliberate decision not to force restarts. Could "evict" such clients from the server side after a while, when no inodes are in use and the version is old; do this manually for now, later via a cronjob. A way to spot such stale processes is sketched below.)
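
A minimal sketch (standard Linux commands; the RPM names are those commonly used for the EOS clients) of how to spot an eosd/eosxd process still running an old, already-replaced binary:

    # installed client RPM versions
    rpm -q eos-fuse eos-fusex
    # running eosd/eosxd processes with their start times
    ps -o pid,lstart,cmd -C eosd -C eosxd
    # "(deleted)" here means the process still runs a binary replaced by an update
    ls -l /proc/$(pgrep -o eosd)/exe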

 


● AOB

(jan)

EOSPROJECT status: EOS-2965

  • 1st instance server-side setup OK.
  • no FSTs yet - waiting for the EOSPPS drain (see below)

EOSPPS:

  • generally: many errors (non-existing replicas); clearly has not been treated as a pre-production instance
  • consequence: very slow drain - EOS-2976
    • general impression: draining is a problem area in general and needs manual effort (retries, per-file actions; see the sketch below)
  • re-enabled "fsck" (in order to run auto-repair) - this now seems to trigger the Grafana "Namespace latency" alert - bug?

$HOME:

  • Looking for volunteers to get an EOS home directory (Massimo's mail)
  • the self-service page will allow users to change their home-directory entry to point to EOS - see the dev version of the CERN Resources portal; the current entry can be checked as sketched below. ETA: next Monday
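
For reference, how to check what a user's home-directory entry currently points to (standard commands; the username is a placeholder):

    # home directory as recorded in the account database
    getent passwd <username> | cut -d: -f6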

AFS phaseout

  • timeline communicated at ITUM - "restarting in 1Q2019, done before RUN3"

Kuba: EOS_MGM_URL on 'c3' points to root://eospps - do we want to change that (e.g. to point to EOSUSER)? Jan: would rather have the default removed entirely, with a clear error message if it is unset (and required), plus auto-guessing of the instance from the current directory/pathname (EOS-1397 - Andreas: could do, the information is available when a path is given).
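
For context, a minimal sketch of how the variable steers the CLI today (the EOSUSER endpoint is given only as an example):

    # current default on 'c3'
    export EOS_MGM_URL=root://eospps
    # what a user working on home directories would typically want instead
    export EOS_MGM_URL=root://eosuser.cern.ch
    eos whoami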

AP: "eos quota" now has some built-in assumptions that it should go to the correct EOSHOME instance (?)

 

Timetable
    • 16:00 - 16:10
      EOS production instances (LHC, PUBLIC, MEDIA) 10m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Luca Mascetti (CERN)

    • 16:10 - 16:20
      CERNBox / EOSUSER / EOSHOME 10m

      major issues for EOSUSER (non-sync-side of CERNBox)

      Speaker: Hugo Gonzalez Labrador (CERN)
    • 16:20 - 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

    • 16:25 - 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)
    • 16:35 - 16:50
      AOB 15m
