EOS DevOps Meeting

Name: EOS DevOps Meeting
Start: 2017-12-05T16:00:00+01:00
End: 2017-12-05T17:45:00+01:00
Location: CERN

Tuesday 5 Dec 2017, 16:00 → 17:45 Europe/Zurich

513/R-068 (CERN)


Jan Iven (CERN)

Description

Weekly meeting to discuss progress on EOS rollout


● overall 2017^H8 planning

EOSFUSEx

(at ITUM): "stabilize until end of 2017" - is "January 2018" still realistic for release? Can we now invite IT testers?
can we already open up port 1100/tcp on all "citrine" instances (so far no server-side instability seen)?

New Namespace:

now on EOSPPS.
EOSBACKUP foreseen for this week
who is next? - or rather, when will EOSUSER (or EOSALICE) run out of memory?
full deployment in summer 2018 - still realistic?

● production instances

(from Jan):

SLS probe is unreliable: still overall "green" with two instances "red". Also, should we define a target availability (EOSATLAS: 97.8% year-to-date, EOSUSER 98.7%, EOSALICE 99.0%)?
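
To put the availability figures above in perspective, the sketch below converts an availability percentage into hours of accumulated downtime. It assumes the percentages cover 1 January 2017 up to the start of this meeting and that availability is a plain fraction of wall-clock time; neither is stated in the minutes. `downtime_hours` and the window constants are hypothetical names, not part of the SLS tooling.

```python
from datetime import datetime


def downtime_hours(availability_pct, start, end):
    """Hours of downtime implied by an availability percentage over [start, end]."""
    elapsed_hours = (end - start).total_seconds() / 3600.0
    return elapsed_hours * (100.0 - availability_pct) / 100.0


# Assumed measurement window: start of 2017 until the start of this meeting.
WINDOW_START = datetime(2017, 1, 1)
WINDOW_END = datetime(2017, 12, 5, 16, 0)

if __name__ == "__main__":
    for instance, pct in [("EOSATLAS", 97.8), ("EOSUSER", 98.7), ("EOSALICE", 99.0)]:
        hours = downtime_hours(pct, WINDOW_START, WINDOW_END)
        print(f"{instance}: {pct}% -> about {hours:.0f} h of downtime")
```

A target availability could be checked the same way, by comparing the implied downtime budget against the downtime already accumulated.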

EOSPUBLIC

A few crashes in the last few days:

out of memory triggered by auxiliary services running on the headnodes
known XRootD bugs: waiting for release with fix (workaround enabled)

EOSCMS

Agreed on Citrine upgrade date: 23rd January

● CERNBOX and EOSUSER

EOSUSER MGM deadlocked on Friday. Some aliases (e.g. for webservice) switched to the slave, but the "slave" machine has less RAM and is loaded by the backup process -> ran out of file descriptors. MGM restarted, aliases moved back, slave restarted. (A big-memory slave machine exists but is not in production yet.)

● FUSE and client versions

(From Dan)

eos-fuse 4.2.4 has been in qa since Thursday. Going to prod this Thursday: CRM-2489
hg_eostest/bagplus (e.g. eosplus604/704) updated to override $HOME to /eos/user. Puppet for this:

sssd::override_homedir: '/eos/user/%l/%u'
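
For readers unfamiliar with the template, sssd expands `%u` to the login name and `%l` to the first letter of the login name. The sketch below mimics only those two expansions (plus `%%`) to show the resulting path; `expand_homedir` is a hypothetical helper and `jdoe` a made-up account, and the real expansion is done by sssd itself.

```python
def expand_homedir(template, login):
    """Expand the %u, %l and %% sequences of an sssd-style homedir template."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            code = template[i + 1]
            if code == "u":
                out.append(login)        # full login name
            elif code == "l":
                out.append(login[0])     # first letter of the login name
            elif code == "%":
                out.append("%")          # literal percent sign
            else:
                # Other sssd sequences exist but are not handled in this sketch.
                raise ValueError(f"unsupported sequence %{code}")
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(expand_homedir("/eos/user/%l/%u", "jdoe"))
```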

● Citrine rollout

CentOS7 migration

Ops tools fixed by Roberto
New EOSServer module configures
- NS Frontend
- NS Backend
- Auth Proxies
- FST
EOSPPS headnodes migrated to CC7 and to the Puppet environment eos_next, preparing for BEER test(s)

● SWAN

Tests for the new fuse implementation on swan-qa003

Installation: manual installation of eos-fusex rpm (and deps), puppet disable, puppet manifest left untouched
Version: eos-fusex-4.2.4-1 + xrootd-client-4.7.0-1
Host OS: CC 7.3.1611

Setup with config file in `/etc/eos/fuse.conf` as in https://gitlab.cern.ch/dss/eos/tree/master/fusex
When a local cache folder is defined, eosxd fails with an error:

[root@swan-12c-01 entf]# eosxd
# fsname=''
# -o allow_other enabled on shared mount
# -o big_writes enabled
# JSON parsing successfull
# File descriptor limit: 65536 soft, 65536 hard
error: failed to make path= /var/eos/fusex/journal/eosuat RWX for root - errno=107

The error code is either 107 or 2, for no apparent reason.
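
As a reading aid, the two codes can be decoded with the Linux errno numbering: 2 is ENOENT and 107 is ENOTCONN, the latter being what Linux typically returns for a path under a FUSE mount point whose daemon has gone away. Whether that is what happens here is not established by these minutes. The table below is a small hand-written subset, and `describe_errno` is a hypothetical helper.

```python
# Linux errno values (asm-generic numbering); other platforms differ.
LINUX_ERRNO = {
    1: ("EPERM", "Operation not permitted"),
    2: ("ENOENT", "No such file or directory"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}


def describe_errno(code):
    """Return a readable description of a Linux errno value."""
    name, message = LINUX_ERRNO.get(code, ("unknown", "not in this table"))
    return f"errno={code} {name}: {message}"


if __name__ == "__main__":
    for code in (107, 2):
        print(describe_errno(code))
```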

Tests as fuse gateway for multiple users (as of today's eos-fuse in SWAN). Local cache disabled:

1. Jupyter (rarely) complains about "Notebook changed on disk". This usually indicates glitches with mtimes or inodes. See screenshot attached.
2. Python shutil.copytree(src, dst) fails with Errno 1 (Operation not permitted). Worked flawlessly with the old fuse.
3. Will an equivalent of `eosfusebind` be provided for fusex? It is used in production so that users gain permission to read/write their $HOME folder via fuse once they have a Kerberos ticket.

Discussed: please open JIRA tickets for the first two points (Jupyter, copytree). For the third, no "eosfusebind" is needed (suggestion: use an explicit KRB5CCNAME env variable to prevent accidental sharing).
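
When filing the copytree ticket, it may help to record which entries fail and why. The sketch below is one possible way to do that: `shutil.copytree` collects per-entry failures and raises them together as `shutil.Error`, so the list can be captured and attached to the ticket. `copytree_report` is a hypothetical helper and the demo paths are temporary directories, not the SWAN setup.

```python
import os
import shutil
import tempfile


def copytree_report(src, dst):
    """Copy src to dst; return a list of (source, destination, reason) failures."""
    try:
        shutil.copytree(src, dst)
    except shutil.Error as exc:
        # copytree gathers per-entry failures and raises them in one exception.
        return [(s, d, str(why)) for s, d, why in exc.args[0]]
    except OSError as exc:
        # Failures outside the per-entry loop (e.g. creating dst) arrive directly.
        return [(src, dst, str(exc))]
    return []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.mkdir(src)
        with open(os.path.join(src, "a.txt"), "w") as handle:
            handle.write("hello")
        # On a FUSE mount, dst would be a path below the mount point instead.
        for failure in copytree_report(src, os.path.join(tmp, "dst")):
            print(failure)
```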

● nextgen FUSE

(From Kuba)

eosfusex user mounts -> can run out of fds -> document/discourage?
quota update propagation -> 5 minutes (?) -> document?
rm propagation between two clients -> after 180 seconds stale entries (ENOENT)
smashbox nplusone: empty files read (under investigation)

(From Dan)

/eos/scratch & /eos/pps mounted in hg_eostest/bagplus (e.g. eosplus604/704) with eosxd.
New hg_eostest/bagplus/microtests to run microtests on PPS. (FYI, acron requires include ::afs on target hosts).
- See grafana for PPS results
Reminder: anybody who needs a personal/different HW "eosxd" test machine can create a new VM in "eostest/bagplus"

(From Andreas)

4.2.4 quite broken once remounted or after first 5 minutes ...

EOSxd runs now on MAC
- EOS-2147
- EOS-2148
Multi-client append bug
- EOS-2143
Server reply client truncate bug
Owner trapping on Chmod (Massimo via rsync)
- EOS-2159
symlink size / tar problem (Rainer)
- EOS-XXXX
rm -rf mess (Rainer)
- EOS-2161
- EOS-2168

New config options:

"rename-is-sync" : 1 // until server has atomic open function by inode

"free-md-asap" : 1 // keeps the memory footprint low e.g. after find /eos/

There is a big problem when eosxd is mounted by a user because of the low number of file descriptors. I have to add a barrier so that it stops after 250 open files until files get closed, to avoid weird problems.
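
The barrier idea can be illustrated with a counting semaphore: an open takes a slot, a close returns it, and further opens wait once all slots are taken. This is only a sketch of the concept under that assumption, not how eosxd implements it; `OpenFileBarrier` and `nofile_limits` are hypothetical names, and only the limit of 250 comes from the note above.

```python
import resource
import threading


class OpenFileBarrier:
    """Block new opens once `limit` files are open, until one is closed."""

    def __init__(self, limit=250):
        self._slots = threading.BoundedSemaphore(limit)

    def acquire(self, timeout=None):
        """Take a slot before opening a file; returns False on timeout."""
        return self._slots.acquire(timeout=timeout)

    def release(self):
        """Give the slot back after the file has been closed."""
        self._slots.release()


def nofile_limits():
    """Return the (soft, hard) file descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"File descriptor limit: {soft} soft, {hard} hard")
    barrier = OpenFileBarrier(limit=250)
    if barrier.acquire(timeout=1.0):
        try:
            pass  # open, use and close the file here
        finally:
            barrier.release()
```

Comparing the soft limit printed here with the limit a root-mounted eosxd reports (65536 in the SWAN log) shows how much headroom a user mount has.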

OSX-Mounter prototype https://cernbox.cern.ch/index.php/s/MXS8mJXddGZLBH0

- will transform this into simple Menu App for single mounts

From Jozsef:

POSIX test is run once/day from the pipeline (it takes too long for CI, and there are still too many (~100) known-to-be-broken things). The test assumes things like "chmod" by root, which might be disabled on production instances.

Actions:

New release (4.2.5+) will be tagged, deployed to test machines.
EOSUAT MGM needs updating (to what release?) to be able to create empty files via FUSEX
multi-client case needs more testing - probably easiest with Docker
all: please script & contribute your testcases

● new Namespace

Deployed on EOSPPS. Many things tested and OK, but (of course) discovering issues:

need to cache FS view
"fs boot * syncmgm" may kill MGM (slow, FST nodes retry after timeout. Looking at FST-level workaround.) - not sure whether this is a new bug, or already in current MGM (which is faster, less likely to hit the timeout+retry)
Elvin will keep an eye on SLS status, no need to report this
QuarkDB currently on 2 extra nodes, and on MGM. Will split out the latter to its own node.
Massimo will stress-test the namespace. ("eos find /" to flush cache etc)

EOSBACKUP: foreseen for this week but will come only in January

Later deployment on production instances: too early to tell, full rollout in summer 2018 is considered "un peu juste". Massimo's "namespace hard wall" (entries > mem) was July 2018 (Luca: "April") but now have 1TB RAM.

Deployment on EOSUSER - two possibilities:

"big bang" (downtime, migrate)
create new shadow instance, copy all data, point alias there (but fall through to old for missing entries) - no downtime but more complex (sync client etc).

● Xrootd

From Michal:

there's not much to report on XRootD side, only that we are in the process of releasing 4.8.0 and Gerri is working on the issue reported by Herve (#631).

Discussion:

Kuba: is there some "recommended" XRootd version (different ones available from EPEL, EOS repo, "upstream" XRootd repos)? Not really, no hard dependency from EOS. Xroot-4.7.1 is latest but has some known bugs, 4.7.0 is still on LXPLUS (and has other known bugs). Suggest to use whatever is easiest (i.e. does not need additional repos).


- 16:00 → 16:05
  overall 2017^H8 planning 5m
  
  Speaker: Jan Iven (CERN)
- 16:05 → 16:30
  operations: production
  - 16:05
    production instances 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:10
    
    CERNBOX and EOSUSER 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:15
    FUSE and client versions 5m
    
    Speaker: Dan van der Ster (CERN)
  - 16:20
    Citrine rollout 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:25
    SWAN 5m
    
    Speaker: Jakub Moscicki (CERN)
- 16:30 → 16:50
  development: near-term
  - 16:30
    nextgen FUSE 5m
    
    Speaker: Andreas Joachim Peters (CERN)
  - 16:35
    new Namespace 5m
    
    Speaker: Elvin Alin Sindrilaru (CERN)
- 16:50 → 17:45
  other: pilot services, long-term dev, external
  - 16:50
    
    Webservice 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:55
    
    Backup 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:00
    
    Samba 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:05
    
    $HOME structure 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:10
    
    BATCH integration 5m
    
    Speaker: Massimo Lamanna (CERN)
  - 17:15
    
    Xrootd 5m
    
    Speaker: Michal Kamil Simon (CERN)
  - 17:20
    
    AOB 5m
