● overall 2018 planning
EOSFUSEx
- (at ITUM): "stabilize until end of 2017" - is "January 2018" still realistic for release? Can we now invite IT testers?
- can we already open up port 1100/tcp on all "citrine" instances (so far no server-side instability seen)?
New Namespace:
- now on EOSPPS.
- EOSBACKUP foreseen for this week
- who is next? - or rather, when will EOSUSER (or EOSALICE) run out of memory?
- full deployment in summer 2018 - still realistic?
● production instances
(from Jan):
SLS probe is unreliable: still overall "green" with two instances "red". We should also define target availabilities (EOSATLAS: 97.8% year-to-date, EOSUSER: 98.7%, EOSALICE: 99.0%)?
EOSPUBLIC
A few crashes in the last days:
- out of memory triggered by auxiliary services running on the headnodes
- known XRootD bugs: waiting for release with fix (workaround enabled)
EOSCMS
Agreed on Citrine upgrade date: 23rd January
● CERNBOX and EOSUSER
EOSUSER MGM deadlocked on Friday. Some aliases (e.g. for the webservice) were switched to the slave, but the slave machine has less RAM and is loaded by the backup process, so it ran out of file descriptors. The MGM was restarted, aliases were moved back, and the slave was restarted. (A big-memory slave machine exists, but is not in production yet.)
● FUSE and client versions
(From Dan)
- eos-fuse 4.2.4 has been in QA since Thursday. Going to production this Thursday: CRM-2489
- hg_eostest/bagplus (e.g. eosplus604/704) updated to override $HOME to /eos/user. Puppet for this:
sssd::override_homedir: '/eos/user/%l/%u'
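A quick way to check that the override is active on a node (a sketch; assumes sssd has been restarted and its cache invalidated, e.g. with `sss_cache -E`):

```shell
# The resolved home directory for the current account should now be under /eos/user
getent passwd "$(id -un)" | cut -d: -f6
```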
● Citrine rollout
CentOS7 migration
- Ops tools fixed by Roberto
- New EOSServer module configures:
- NS Frontend
- NS Backend
- Auth Proxies
- FST
- EOSPPS headnodes migrated to CC7 and to the Puppet environment eos_next, preparing for BEER test(s)
● SWAN
Tests for the new fuse implementation on swan-qa003
Installation: manual installation of eos-fusex rpm (and deps), puppet disable, puppet manifest left untouched
Version: eos-fusex-4.2.4-1 + xrootd-client-4.7.0-1
Host OS: CC 7.3.1611
Setup with config file in `/etc/eos/fuse.conf` as per https://gitlab.cern.ch/dss/eos/tree/master/fusex
When a local cache folder is defined, eosxd fails with an error:
[root@swan-12c-01 entf]# eosxd
# fsname=''
# -o allow_other enabled on shared mount
# -o big_writes enabled
# JSON parsing successfull
# File descriptor limit: 65536 soft, 65536 hard
error: failed to make path= /var/eos/fusex/journal/eosuat RWX for root - errno=107
The error code can be either 107 or 2, with no apparent pattern.
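One workaround worth trying (a sketch, not a verified fix): pre-create the local journal directory with root-RWX permissions before starting eosxd, so the mount does not have to create it itself. The journal path is taken from the error message above.

```shell
# Pre-create the fusex journal tree for the "eosuat" mount and make it root-RWX
mkdir -p /var/eos/fusex/journal/eosuat
chmod -R 700 /var/eos/fusex
ls -ld /var/eos/fusex/journal/eosuat
```

If eosxd still fails with errno=107 (ENOTCONN) afterwards, the problem is more likely on the connection side than in the local cache setup.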
Tests as fuse gateway for multiple users (as of today's eos-fuse in SWAN). Local cache disabled:
- Jupyter (rarely) complains about "Notebook changed on disk". This is usually indicative of glitches with mtimes or inodes. See screenshot attached.
- Python shutil.copytree(src, dst) results in errno 1 (Operation not permitted). It worked flawlessly with the old fuse.
- Will an equivalent of `eosfusebind` be provided for fusex? It is used in production so that users gain permission to read/write their $HOME folder via fuse once they have obtained a Kerberos ticket.
Discussed: please open JIRA tickets for the first two items. For the third: no "eosfusebind" equivalent is needed (suggest using an explicit KRB5CCNAME env variable to prevent accidental ticket sharing).
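A sketch of the KRB5CCNAME pattern suggested above: give each session its own private credential cache so tickets cannot be shared accidentally between processes (the cache path and realm below are illustrative, not mandated):

```shell
# Point this session at its own private Kerberos credential cache (path is illustrative)
export KRB5CCNAME="FILE:/tmp/krb5cc_$(id -u)_$$"
echo "$KRB5CCNAME"
# Obtain a ticket into exactly this cache before touching the fusex mount:
#   kinit "$(id -un)@CERN.CH"
```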
● nextgen FUSE
(From Kuba)
- eosfusex user mounts -> can run out of fds -> document/discourage?
- quota update propagation -> 5 minutes (?) -> document?
- rm propagation between two clients -> after 180 seconds stale entries (ENOENT)
- smashbox nplusone: empty files read (under investigation)
(From Dan)
- /eos/scratch & /eos/pps mounted in hg_eostest/bagplus (e.g. eosplus604/704) with eosxd.
- New hg_eostest/bagplus/microtests to run microtests on PPS. (FYI, acron requires include ::afs on target hosts).
- Reminder: anybody who needs a personal/different HW "eosxd" test machine can create a new VM in "eostest/bagplus"
(From Andreas)
4.2.4 is quite broken once remounted or after the first 5 minutes ...
- EOSxd now runs on macOS
- Multi-client append bug
- Server reply client truncate bug
- Owner trapping on chmod (Massimo, via rsync)
- symlink size / tar problem (Rainer)
- rm -rf mess (Rainer)
New config options:
"rename-is-sync" : 1 // until server has atomic open function by inode
"free-md-asap" : 1 // keeps the memory foot print low e.g. after find /eos/
There is a big problem when eosxd is mounted by a user because of the low number of file descriptors. I have to add a barrier so that it stops after 250 open files, until files get closed, to avoid weird problems.
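For context (standard POSIX limits, not EOS-specific): the per-process file-descriptor ceiling a user-space mount runs under can be inspected with `ulimit`:

```shell
# Soft and hard fd limits for the current shell; a user-mounted eosxd process
# competes for these with every other descriptor it holds open
ulimit -Sn
ulimit -Hn
```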
OSX-Mounter prototype https://cernbox.cern.ch/index.php/s/MXS8mJXddGZLBH0
- will transform this into simple Menu App for single mounts
From Jozsef:
- The POSIX test is run once per day from the pipeline (it takes too long for CI, and there are still too many (~100) known-to-be-broken things: it assumes behaviour like "chmod" by root, which might be disabled on production instances)
Actions:
- New release (4.2.5+) will be tagged, deployed to test machines.
- EOSUAT MGM needs updating (to what release?) to be able to create empty files via FUSEX
- multi-client case needs more testing - probably easiest with Docker
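Until a Docker setup exists, the multi-client case can be probed with a small script along these lines (a sketch: MNT_A and MNT_B should point at the same path seen through two independent eosxd mounts; the temp-dir defaults below only make the script self-contained):

```shell
#!/bin/sh
# Cross-mount propagation probe (sketch)
MNT_A=${MNT_A:-$(mktemp -d)}   # in a real test: a path under the first mount
MNT_B=${MNT_B:-$MNT_A}         # in a real test: the same path via a second mount
f=probe.$$
echo hello > "$MNT_A/$f"       # "client A" creates the file
sleep 1                        # allow metadata propagation
cat "$MNT_B/$f"                # "client B" should print "hello"
rm -f "$MNT_A/$f"              # client A removes it
sleep 1
if cat "$MNT_B/$f" 2>/dev/null; then   # client B should now get ENOENT
  echo "STALE: entry still visible on B"
else
  echo "OK: removal propagated"
fi
```

With two real mounts, the 180-second stale-entry issue noted above should show up as the "STALE" branch.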
- all: please script & contribute your testcases
● new Namespace
Deployed on EOSPPS. Many things tested and OK, but (of course) discovering issues:
- need to cache FS view
- "fs boot * syncmgm" may kill MGM (slow, FST nodes retry after timeout. Looking at FST-level workaround.) - not sure whether this is a new bug, or already in current MGM (which is faster, less likely to hit the timeout+retry)
- Elvin will keep an eye on SLS status, no need to report this
- QuarkDB currently on 2 extra nodes, and on the MGM. Will split out the latter onto its own node.
- Massimo will stress-test the namespace. ("eos find /" to flush cache etc)
EOSBACKUP: foreseen for this week but will come only in January
Later deployment on production instances: too early to tell; full rollout in summer 2018 is considered "un peu juste" (a bit tight). Massimo's "namespace hard wall" (entries > memory) was July 2018 (Luca: "April"), but we now have 1 TB RAM.
Deployment on EOSUSER - two possibilities:
- "big bang" (downtime, migrate)
- create new shadow instance, copy all data, point alias there (but fall through to old for missing entries) - no downtime but more complex (sync client etc).
● Xrootd
From Michal:
there's not much to report on XRootD side, only that we are in the process of releasing 4.8.0 and Gerri is working on the issue reported by Herve (#631).
Discussion:
Kuba: is there a "recommended" XRootD version (different ones are available from EPEL, the EOS repo, and the "upstream" XRootD repos)? Not really; there is no hard dependency from EOS. XRootD 4.7.1 is the latest but has some known bugs; 4.7.0 is still on LXPLUS (and has other known bugs). Suggest using whatever is easiest (i.e. does not need additional repos).