outside communication: ITUM-22 (AFS+) EOSFUSEx status - missed the "end of September for beta test" deadline, but promise to "stabilize until end of 2017".
Crashed on Sunday: apparently a memory-allocation issue that is pretty hard to track down.
Re-enabling intra-group balancing; will re-enable inter-group balancing in the coming days (nicer load distribution across nodes/scheduling groups).
namespace compacted.
Sat: overload (self-inflicted? a Grafana cron job ran "wc -l" on a 66 GB log file.. every 5 min). MGM unresponsive, MQ lost connection for some FSTs. Occurred again. Need to do something about this "wc" (run only one instance at a time, e.g. via a lockfile).. and also drive down the error rate (currently one log line per write attempt after a failure?).
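A minimal sketch of the "only run once" lockfile guard for the cron job. Paths and the wrapped command are illustrative stand-ins (the real job counts lines in the MGM log); `flock -n` is the mechanism.

```shell
#!/bin/bash
# Cron-wrapper sketch: flock -n makes an overlapping run exit immediately
# instead of queuing behind the running one, so a slow "wc -l" over a
# huge log cannot pile up when cron fires again 5 minutes later.
# LOGFILE/LOCKFILE are illustrative, not the production paths.
LOGFILE=${LOGFILE:-/tmp/demo.log}
LOCKFILE=${LOCKFILE:-/tmp/demo.lock}
touch "$LOGFILE"                      # stand-in for the 66 GB log
(
    flock -n 9 || { echo "previous run still active, skipping"; exit 0; }
    wc -l < "$LOGFILE"                # the expensive operation
) 9>"$LOCKFILE"
```

Holding the lock on a dedicated file descriptor (9) means the lock is released automatically when the subshell exits, even on error.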
Log is rotated daily (copytruncate).. need to look at xrootd's internal logrotate again (was the DST bug fixed in 4.x?).
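The daily copytruncate rotation might look like the following logrotate stanza (file path, retention, and compression options are assumptions, not the actual production config):

```
/var/log/eos/mgm/xrdlog.mgm {
    daily
    rotate 14
    compress
    delaycompress
    copytruncate
}
```

copytruncate avoids restarting the MGM, but lines written between the copy and the truncate can be lost - one reason to revisit xrootd's built-in rotation instead.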
Also heavy user activity (10k auth requests in 60 sec) - cause or effect?
Also Kuba's account ACLs got reset because "eos stat" returned an unexpected error code (a timeout was interpreted as "not found")?
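A hedged sketch of the safer pattern: retry before concluding "not found", so a transient timeout cannot trigger a destructive ACL reset. The retry helper is generic; the `eos stat` invocation in the comment (with a placeholder path) is the intended use.

```shell
# Generic retry helper: a check only counts as a real failure after
# several attempts, so a one-off timeout is not mistaken for ENOENT.
retry() {
    local attempts=$1; shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        sleep 1    # back off; a timeout may clear on the next try
    done
    return 1
}

# Intended use (hypothetical path):
#   retry 3 eos stat /eos/user/k/... || echo "really missing - now safe to act"
```

This does not distinguish the two error causes (that needs distinct exit codes from "eos stat"), but it makes the reset path robust against transient failures.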
Owncloud (web) update on Oct 10th
(ticket from an EOSATLAS user) - the 0.3.267 MGM reports a "full" filesystem differently; EOSFUSE reacts to this and needs to be updated -> 4.1.32
desktop 4.1.30 is still pending (needs packaged protobuf3). Q: should we wait for 4.1.32 instead? (No - they need to get to some version newer than 4.1.14 first; they can update again later.)
EOSPPS is being updated to CentOS 7; one diskserver is also using the new Puppet module and hostgroup layout (still WIP while the manifests are adapted/rewritten)
Very interested in having the storage nodes migrated to CentOS 7 (and thus Citrine), so that they can experiment with BEER (Batch on EOS Evaluation of Resources)
Discussed during last meeting, will most likely upgrade to Citrine after the run (or early 2018) depending on the experiment's schedule.
Sat incident: EOSUSER saw transient impact - stalled; SWAN resumed without further action
Upcoming MGM renames - please check whether SWAN is impacted. EOSPUBLIC is using the "new" eos principal, EOSUSER hasn't been restarted yet (would pick up new principals on restart). Massimo will still discuss with procurement.
(mail from Andreas, 2017-09-29) - assign these items to people, add tentative timestamp?
STATUS
- server-side code merged into production version and first (beta) version of new client RPM in gitlab (last week)
- deployed on EOSUAT (last week)
WORKPLAN
CLIENTS: from beta to production version
remaining development items
- finalizing integration of new strong security module (deprecating eosfusebind)
- client support for HA deployments (master-slave failover)
- fix reported issues from qa tests
quality assurance
- CI integration for all supported platforms, running core functional tests, is on the way
- internal validation (IT_ST)
- validate many-client deployment at larger scale
- validate use cases known not to work on current FUSE implementation
- validate in SAMBA and NFS gateways
- external validation (others)
- invite beta testers on QA platforms
- iterate on feedback
performance tuning
- towards AFS performance
SERVER
quality assurance
- validate standard production use cases
- scale test with many new FUSE clients
- verify no interference between old/new clients
migration
- migrate instances to new production release
- migrate instances to new CITRINE namespace backend (2018)
mid-term
- evolve MD & DATA HA & scalability model (2018)
TIMESCALE
achieve a stable production version by the end of the year
Current Status Update
"auth" : {
    "shared-mount" : 1,
    "krb5" : 1
},
"options" : {
    "md-kernelcache.enoent.timeout" : 5
}
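For context, the two fragments above live in the eosxd JSON configuration file; a fuller file might look roughly like this (instance name, host, and mount path are placeholders - only the "auth" and "options" values come from the notes):

```json
{
  "name" : "example",
  "hostport" : "eosexample.cern.ch",
  "remotemountdir" : "/eos/example/",
  "auth" : {
    "shared-mount" : 1,
    "krb5" : 1
  },
  "options" : {
    "md-kernelcache.enoent.timeout" : 5
  }
}
```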
autoconf, cmake, and git clone work (on EL7) (still tracking an MD invalidation bug)
RPM: eos-fusex-core (can co-exist with old eos-fuse-core). Available from build system. Can start to rebuild on KOJI (Dan's special area). Will be released as next tagged release.
EOSPPS - update to citrine+new API
want bagplus VMs that mount the updated instances (starting with EOSUAT, also EOSBACKUP, which has scratch spaces). This also allows validating the new Puppet modules. Can mount the same instance twice, using old and new FUSE.
refactoring
Can set ACLs recursively+atomically server-side.
Some "cosmetic" stuff pending, also still need to merge Giorgio's "queue".
Deployment: go to EOSBACKUP (on citrine in 2 weeks). New namespace before end of year (puppet "eosserver")?
Usual routine test. Note that (for several weeks) we see systematic job failures/retries:

=== Task 184433 starts at Mon Oct 2 16:58:18 2017 and ends at Mon Oct 2 17:43:35 2017 (45.3 minutes)
Analysed jobs: 100
Correct jobs: 100
Maximum concurrency: 7
Execution hosts (top 5): b6c5080bfb [#24], b64972dff9 [#22], b681081309 [#14], b6de650151 [#10], b66d5427ce [#9]
Execution environments (top 5): eos-client-4.1.30-1.el6.x86_64, eos-fuse-core-4.1.30-1.el6.x86_64, xrootd-client-libs-4.6.1-1.el6.i686, xrootd-client-libs-4.6.1-1.el6.x86_64 [#100]
Error analysis:
### Jobs analysis comparing status and error messages
Checking task 181800 (100 jobs)
Duplicated out file from outputs/out_0_10_*_181800.0.txt: (2 copies)
  ... (same "(2 copies)" message for out_1_10 through out_9_10) ...
Problems in matching data file for task 181800 (job 0) with template outputs/out_9_10_*_181800.0.txt
outputfile: outputs/out_0_10_b68e5a1754.cern.ch_181800.0.txt
outputfile: outputs/out_0_10_b6f23a4dd2.cern.ch_181800.0.txt
  ... (the same pair of hosts, b68e5a1754 and b6f23a4dd2, repeated for out_1_10 through out_9_10) ...
Wrong filesize for outputs/out_0_10_b68e5a1754.cern.ch_181800.3.txt (0!=1024000)
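The duplicated outputs above can be spotted mechanically. A small sketch: the filename pattern and hostnames follow the log excerpt, but the directory layout is assumed, and the demo data is created inline.

```shell
# Demo data mimicking the log excerpt: job 0 produced output on two
# different execution hosts, job 1 on only one.
mkdir -p outputs
touch outputs/out_0_10_b68e5a1754.cern.ch_181800.0.txt
touch outputs/out_0_10_b6f23a4dd2.cern.ch_181800.0.txt
touch outputs/out_1_10_b68e5a1754.cern.ch_181800.0.txt

# Group files on the job index (2nd "_"-separated field of
# out_<idx>_10_<host>_<task>.txt) and flag any index with more than one
# copy, i.e. a job whose output was produced by several hosts.
ls outputs/out_*_10_*_181800.0.txt \
  | awk -F_ '{ count[$2]++ }
             END { for (i in count) if (count[i] > 1)
                     print "job " i ": " count[i] " copies" }'
# prints: job 0: 2 copies
```

Running this over the real outputs/ directory would list exactly the jobs that Condor executed twice, without touching EOS at all.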
(linked to the NA62 complaint about mixed failures of EOS + CONDOR. Indeed we see many jobs being run twice..)
Need to make sure EOS does not become the default scapegoat for all computing trouble..
Massimo to take this up with the CONDOR people (look at job logs? "started".."finished"). "Job resubmission" is switched off for Grid (behind the CE), but not for normal users..