● EOS production instances (LHC, PUBLIC, USER)
EOSATLAS heavy slowdown
The instance had all threads busy starting this morning at ~06h30.
After an unsuccessful attempt to restart the MGM (i.e. it didn't fix the issue), some more log grepping led to this:
180320 12:05:43 1822 XrdProtocol: ?:10501@lxplus100 terminated handshake not received
180320 12:05:43 1173 XrdProtocol: ?:2223@lxplus100 terminated handshake not received
180320 12:05:43 23651 XrdProtocol: ?:8358@lxplus100 terminated handshake not received
180320 12:05:43 1173 XrdProtocol: ?:2223@lxplus079 terminated handshake not received
180320 12:05:43 31849 XrdProtocol: ?:6226@lxplus100 terminated handshake not received
180320 12:05:43 31849 XrdProtocol: ?:8358@lxplus079 terminated handshake not received
180320 12:05:43 22304 XrdProtocol: ?:11205@lxplus100 terminated handshake not received
This correlates nicely with the thread exhaustion: these messages were being logged at ~415 Hz.
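A quick way to measure that flood rate is to bucket the messages by second. A minimal sketch (sample lines are inlined here; on the real MGM you would grep the xrootd log, whose path is an assumption):

```shell
# Bucket "handshake not received" messages per second to spot the flood.
# On the MGM, replace the sample with the real log, e.g.
# /var/log/xrootd/mgm/xrdlog.mgm (path is an assumption).
cat <<'EOF' > /tmp/xrdlog.sample
180320 12:05:43 1822 XrdProtocol: ?:10501@lxplus100 terminated handshake not received
180320 12:05:43 1173 XrdProtocol: ?:2223@lxplus100 terminated handshake not received
180320 12:05:44 23651 XrdProtocol: ?:8358@lxplus079 terminated handshake not received
EOF
grep 'terminated handshake not received' /tmp/xrdlog.sample \
  | awk '{count[$1" "$2]++} END {for (t in count) print t, count[t], "msg/s"}' \
  | sort
```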
- Added a DROP rule in iptables for the 2 nodes until they're drained.
- As soon as the rule was added, the number of threads went back to normal (~1k)
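The temporary block was along these lines (a sketch: the node names come from the log above, but the `.cern.ch` suffix and the exact match used, hostname vs. source IP, are assumptions):

```shell
# Drop all traffic from the two misbehaving clients until they are drained.
# The real rule may have matched the nodes' IPs instead of hostnames.
iptables -I INPUT -s lxplus100.cern.ch -j DROP
iptables -I INPUT -s lxplus079.cern.ch -j DROP
# To lift the block once the nodes are drained:
#   iptables -D INPUT -s lxplus100.cern.ch -j DROP
#   iptables -D INPUT -s lxplus079.cern.ch -j DROP
```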
CentOS 7 migration of FSTs
Roberto has been doing a great job with EOSATLAS and wrote documentation covering the most frequent issues when updating FSTs to CentOS 7.
There's a Rundeck job that updates the machines' profiles and reinstalls them:
CMS and ALICE will be assigned to people on rota: the job works quietly in the background and sends an email if a machine takes more than 2 hours to come back.
New dashboard
EOS Versions
EOSMEDIA:
- updated to 4.2.17 in the shadow of the Sorenson and Micala patching campaign
- added more space, but as JBOD (instead of RAID1-optimized for video)
EOSALICEDAQ (WIP):
- more puppet-ization
- add SLS monitoring (to old SLS script, new is not ready yet)
EOSUSER
- issue with /var/eos getting full (>85%); we have no alarm for that (we do have one for /var, but not for the SSD under /var/eos)
- scheduled a morning compaction for directories at 5:45
- possible issue if new metadata files do not fit -> we have a space issue coming soon, need additional SSDs
- plan to schedule a file compaction tonight
- roll-out of new version (tested on EOSUAT) before Easter (tbd)
EOSHOME
- still some installation issues for FSTs: only 2 out of 3 worked (after many retries). Roberto will have a look (using his instructions)
● EOS clients, FUSE(X)
[ Andreas ]
Development
1) change to non-POSIX behaviour to enable public directories in non-public home directories
- stat on a directory evaluates ACLs on the directory itself, not the parent
- stat on a file evaluates ACLs of the parent, as before
2) by default, chmod, setxattr and delete are granted when 'w' is specified in an ACL; they can be revoked with '!d' and '!m'
(background from Luca: FUSEX does not coexist with "sharing", too POSIXy. Would need to share also the parent dir, at least for browsing)
1) := client side patch (4.2.18)
2) := server side patch (4.2.18 + AQ commit branch)
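As an illustration of (2), a sys.acl with the explicit revocations might look like this (a sketch following the EOS ACL scheme; the user name and path are illustrative, and the exact flag grammar should be checked against the 4.2.18 docs):

```shell
# Grant user 'alice' write access, but revoke delete (!d) and chmod (!m),
# which the new default would otherwise grant implicitly with 'w'.
# 'alice' and /eos/instance/somedir are hypothetical.
eos attr set sys.acl="u:alice:rwx!d!m" /eos/instance/somedir
```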
New issues:
EOS-2425
EOS-2444
[Jan]
- 4.2.16 is now the "production" client version
- /eos/scratch roll-out BI-1862 - puppet module issues (32GB FUSEX cache default is far too large)
- ATLAS T0 batch file creation via (old) FUSE, feedback - see https://indico.cern.ch/event/698870/
- 0.5% error rate (empty files on reading, via "xrdcp")
- might be retry-on-open. Eric (CTA) has come up with a scenario that causes overwritten/truncated files. Luca: were most "lost files" last year due to this?
- 0.5% errors on delete (non-empty directory?)
- .sys "junk", invisible via FUSE?
- see errors in FUSE log but not propagated to python/exception
● Development issues
Georgios
- Enabled compression in QuarkDB: LZ4 for the top compaction layers (very fast), and ZSTD for the bottom one. (slightly slower, but offers higher compression ratio)
- would having no compression at top layer make a difference? No idea.
- PPS namespace went from ~800GB to ~350GB, not bad.
- Might be improved further using ZSTD dictionary compression, but something weird is going on inside RocksDB: I saw no benefit at all from enabling dictionary compression - disabled for now.
- We're still missing an SSD in Wigner for PPS... we'll probably move the MGM to Meyrin.
- Benoit: currently in "burn-in" - is this being slowed down by "ironic" workflow? (auto-created a new "EOS" instance..)
Elvin:
- 4.2.18 released: addresses "corrupted config" on MGM restart, and 0-size TPC brokenness. Also lots of docs updates (FUSEX, BOX setup, QuarkDB setup etc..). Compiled against 4.8.1.
- should go into client "qa"
- Request from Aarnet: put "last modified" timestamp on docs?
● AOB
Massimo: the eosarchive scripts (CASTOR to EOS) need to be resurrected
Massimo: FUSEX write test, see PDF.
Should really have rate limitations everywhere (already applied on EOSPUBLIC); there might be side effects with the HTTP gateways.