● EOS production instances (LHC, PUBLIC, USER)
EOSPUBLIC, EOSALICEDAQ, EOSMEDIA:
4.2.21 has been released. Need to test against FUSE? No: it is already in production on some MGMs, and the 4.2.21 changes were quite specific.
Q (Massimo): time to get 4.3 out? Depends on testing; if happy, can we release tomorrow? Plan is to release this week (since it is already running). How to test the "truncation" part (perhaps in CASTOR repack; it was simple to reproduce: just freeze for 60 s)?
EOSPUBLIC:
LHC instances
- Upgrading to 4.2.21 and XRootD 4.8.3-rc1 (except ALICE = no FUSE)
- Minor fixes to be committed on CentOS7 / systemd
- EOSCMS not joining AAA fed.
- Coredump enabled (to be fixed via a SystemD override)
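A systemd drop-in override along these lines could enable core dumps; the unit name and path below are assumptions for illustration, not taken from the minutes:

```ini
# /etc/systemd/system/eos@.service.d/coredump.conf  (hypothetical path/unit)
[Service]
LimitCORE=infinity
```

Applied with `systemctl daemon-reload` plus a restart of the affected unit.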
- Batch on EOS
- Draining SSDs on ~half of the eligible ATLAS FSTs; these can be used as /pool for jobs
- Fixing minor issues with Docker
EOSALICE is on 4.2.17 - should perhaps get the fixed XRootD (might affect conditions files)?
● EOS clients, FUSE(X)
(Massimo)
- Inconsistent dates (close-but-not-quite UNIX "0") for files generated by HTCondor (stdout and stderr). Discussions + INC1650792
- BEER testing (Condor jobs on FST)
- Monitoring in place to see the effect (https://monit-grafana.cern.ch/dashboard/db/_user-laman-eos-beer?orgId=6 + beer.sh reading data from one FST via xrdcp to /dev/null (batch))
- Problem: Condor job draining not working (fix now in Koji: see BEER-12); will also get a way to hard-kill jobs.
- PPS sailing towards 2^32+1 files (almost 4.3B; at 4.0B now; suggested by and agreed with GeorgiosB)
- Still confused about draining interference on PPS (sorry). EOS-2517 (Thank you to A. Manzi)
- might have been in "draindead" state, not visible in some tools.
- No time to check newfind yet (sorry again)
- Prepare for multiple-mountpoint tests (e.g. 300 jobs x 100 mountpoints). OK on UAT
- FUSEX issues are at "residual state" now. Can we have FUSEX on, say, LHCb? I understand USER is "special".
- no recovery for deleted files yet -> add to list as showstopper. Luca to give list.
- enable on "qa" (aka on 10% of batch) afterwards.
(Jan)
- (FUSE / FST data loss: need postmortem at ASDF this week - due to "emergency" change of writeback cache setting).
- FUSE client status: 4.2.18 in production, 4.2.19 in qa (but known to be broken - EOS-2486, needs 4.2.22)
- FUSEX - created accounts for AFS phaseout people BUT not announced yet
- A. Lossent would like to use FUSEX for OpenShift (some oddities with credential access when containerized)
(Dan)
- Thomas is testing latest eosclient puppet module and rpm v4.2.18 with locmap for the CC7.5 desktop release.
- Q: can we push the new xrootd-4.8.3 on desktops? Yes, it is compatible (SSI is incompatible, but that is server-side only). We could "force" this by tagging in the "eos" repo.
- Thomas's proposal is to have desktops point to the Koji repo, so we keep control of what gets deployed.
(Andreas)
eosxd
- increased very short default timeouts from 15 s to 30/60 s; added internal eosxd logic to get rid of never-returned outstanding XrdCl asynchronous messages (hard timeout, returns EIO)
- limit the number of files in the file-start cache to 64k by default, i.e. added an inode limit to the local cache
- added 'xoff' functionality to throttle IOPS per file and avoid overrunning XrdCl, which fails once more than 64k requests are in flight (one can hit this via one-byte writes; now a single requester can only have 1k requests in flight)
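The 'xoff' throttling above could be sketched as a per-file cap on in-flight requests; a minimal illustration (class and method names are invented, not the eosxd implementation):

```python
import threading

class XoffThrottle:
    """Per-file throttle that blocks new requests once too many are in flight.

    Illustrative sketch of the 'xoff' idea, not the actual eosxd code.
    """
    def __init__(self, max_in_flight=1024):   # minutes: 1k per requester
        self._slots = threading.Semaphore(max_in_flight)

    def submit(self, request):
        self._slots.acquire()                  # blocks ("xoff") at the limit
        try:
            return request()                   # stand-in for an XrdCl operation
        finally:
            self._slots.release()              # a slot frees up again ("xon")

throttle = XoffThrottle(max_in_flight=2)
results = [throttle.submit(lambda i=i: i * i) for i in range(4)]
print(results)  # [0, 1, 4, 9]
```

With asynchronous submitters the acquire would happen before dispatch and the release in the response callback, so a flood of one-byte writes queues at the throttle instead of inside XrdCl.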
eosd
- allow private mounts for root users by setting EOS_FUSE_PRIVATE_ROOT_MOUNT (not released yet)
- stop writing (which triggered re-opening) when the FST open has failed
- remove deprecated authentication recovery code for XRootD 3.3.x
- avoid creating a new connection per request in certain circumstances (Georgios); ProtoDUNE triggered this, exhausting FDs and increasing MGM memory use
- fix segfault in shutdown (Georgios)
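The connection-reuse fix can be illustrated as caching one connection per endpoint and identity instead of opening one per request; a hypothetical sketch (none of these names come from the EOS code):

```python
import socket

class ConnectionPool:
    """Reuse one connection per (host, port, identity) rather than per request.

    Illustrative sketch of the fix described above, not the eosd code.
    """
    def __init__(self, connect=socket.create_connection):
        self._connect = connect      # injectable so the demo opens no sockets
        self._conns = {}

    def get(self, host, port, identity):
        key = (host, port, identity)
        if key not in self._conns:   # first request opens the connection...
            self._conns[key] = self._connect((host, port))
        return self._conns[key]      # ...later requests reuse it (no FD growth)

# Demo with a fake connector so no real network connection is made:
opened = []
pool = ConnectionPool(connect=lambda addr: opened.append(addr) or object())

c1 = pool.get("mgm.example", 1094, "user1")
c2 = pool.get("mgm.example", 1094, "user1")   # same key: reused, no new FD
print(c1 is c2, len(opened))  # True 1
```

Without such a cache, a client issuing many requests (as ProtoDUNE did) opens one FD per request and holds matching state on the MGM side.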
Andreas/eosxd: one PPS machine now has 100 mounts, running a sync test. Could we do this on all CC7 machines to get a "many clients" test? Yes. Q: can we do this in containers? Yes.
Luca: has a config change slipped into Puppet that forces KRB5 (seen by Jesus, who contacted Dan)? It comes from a renamed config tree and was supposed to work (affected CERNBox, discovered yesterday). Note: the change was pushed a month ago; one issue was seen immediately, which might have been a still-running mount. Should we "grep" across Puppet configs? No: first cross-check with Jesus, then (perhaps) fix in the module.
● Development issues
(Georgios)
- New NS: implementing asynchronous reads on all metadata operations, to prevent locking the namespace while waiting on network requests towards QDB. Expected to fix many issues.
- Asynchronous writes are already there (MetadataFlusher), no way we'd be getting 1kHz of file creations otherwise.
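The asynchronous-read idea (finish the network round-trip towards QDB before taking the namespace lock, rather than holding it across the wait) can be sketched like this; all names and timings are illustrative:

```python
import asyncio

async def demo():
    ns_lock = asyncio.Lock()        # stands in for the namespace lock
    in_flight, peak = 0, 0

    async def qdb_read(key):        # stand-in for a network read towards QDB
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)   # simulated network latency
        in_flight -= 1
        return {"id": key}

    async def get_md(key):
        md = await qdb_read(key)    # network wait happens OUTSIDE the lock
        async with ns_lock:         # lock held only for the in-memory update
            return md

    await asyncio.gather(*(get_md(k) for k in range(8)))
    return peak

peak = asyncio.run(demo())
print(peak)  # 8: all reads overlap instead of serializing behind the lock
```

Moving the await inside `async with ns_lock` would force `peak == 1`, i.e. every metadata operation would queue behind one network round-trip, which is the locking problem the new namespace work avoids.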
(Michal):
xrootd-4.8.3 expected this week.
(Jan):
want to restart the "qa" cycle with 4.2.22. Need to wait for 4.8.3; can be done this week. Careful: eos-4.2.22 (server) has a new shared dependency.