EOS DevOps Meeting

Name: EOS DevOps Meeting
Start: 2017-10-03T16:00:00+02:00
End: 2017-10-03T17:45:00+02:00
Location: CERN

Tuesday 3 Oct 2017, 16:00 → 17:45 Europe/Zurich

513/R-068 (CERN)


Jan Iven (CERN)

Description

Weekly meeting to discuss progress on EOS rollout


● overall 2017 planning

outside communication: ITUM-22 (AFS+) EOSFUSEx status - missed "end of September for beta test" deadline, but promise to "stabilize until end of 2017".

● production instances

EOSPUBLIC

Crashed on Sunday: apparently a memory allocation issue that is pretty hard to track down.

LHC instances

Re-enabling intra-group balancing, will re-enable inter-group balancing in the coming days (nicer load distribution across nodes/scheduling groups)

● CERNBOX and EOSUSER

namespace compacted.

Sat: overload (self-inflicted? grafana cron job did "wc -l" on 66GB log file.. every 5 min). MGM unresponsive, MQ lost connection for some FSTs. Occurred again. Need to do something about this "wc" (only run once/lockfile) .. (and also drive down the error rate - one line per write attempt after failure?)

Log is rotated daily (copytruncate.. need to look at xrootd internal logrotate again (DST bug fixed in 4.x?)).
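One way to take the load out of the "wc" job is to count only the lines appended since the previous run, and to skip a run while an earlier one is still active. The following is a minimal sketch under assumptions, not the actual grafana cron job: the function name, the state file and the lock file are hypothetical, and the truncation check is there only because the log is rotated with copytruncate.

```python
import fcntl
import json
import os


def count_new_lines(log_path, state_path):
    """Count only the lines appended since the previous run.

    Returns None when another instance still holds the lock.
    """
    lock_file = open(state_path + ".lock", "w")
    try:
        # Non-blocking exclusive lock: a run that overlaps a slow one gives up at once.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    try:
        offset = 0
        if os.path.exists(state_path):
            with open(state_path) as fh:
                offset = json.load(fh)["offset"]
        # copytruncate shrinks the file in place, so restart from the beginning.
        if os.path.getsize(log_path) < offset:
            offset = 0
        new_lines = 0
        with open(log_path, "rb") as log:
            log.seek(offset)
            # Read in 1 MiB chunks so memory use stays flat on very large logs.
            for chunk in iter(lambda: log.read(1 << 20), b""):
                new_lines += chunk.count(b"\n")
            offset = log.tell()
        with open(state_path, "w") as fh:
            json.dump({"offset": offset}, fh)
        return new_lines
    finally:
        lock_file.close()  # closing the descriptor releases the flock
```

With this shape each 5-minute run reads only the new tail of the log instead of all 66GB, and the caller keeps a running total if the absolute line count is needed.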

Also heavy user activity (10k auth req in 60sec) - cause/effect?

Also Kuba's account ACLs got reset due to "eos stat" returning unexpected error code (time out == not found..)?
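The suspected "time out == not found" confusion can be avoided by treating the lookup as a three-state answer, so that only a definite "missing" triggers a reset. The sketch below is an illustration only: the function names are hypothetical, and the assumption that the stat call reports errno-style codes (0, ENOENT, anything else) is not confirmed by these notes.

```python
import errno
from enum import Enum


class Lookup(Enum):
    EXISTS = "exists"
    MISSING = "missing"
    UNKNOWN = "unknown"


def classify_stat(returncode):
    """Map a stat-like return code to a tri-state result (codes are assumptions)."""
    if returncode == 0:
        return Lookup.EXISTS
    if returncode == errno.ENOENT:
        return Lookup.MISSING
    # A timeout or any other failure says nothing about whether the entry exists.
    return Lookup.UNKNOWN


def should_reset_acls(returncode):
    """Only a definite 'missing' answer justifies re-creating default ACLs."""
    return classify_stat(returncode) is Lookup.MISSING
```

Under this scheme a timed-out call leaves the account untouched and can simply be retried later.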

Owncloud (web) update on Oct 10th

● FUSE and client versions

(ticket from EOSATLAS user) - 0.3.267 MGM reports "full" filesystem differently - EOSFUSE reacts to this, and needs to be updated -> 4.1.32

desktop 4.1.30 is still pending (needs packaged protobuf3). Q: should we wait for 4.1.32 instead? (No, they need to get some newer version beyond 4.1.14, then can update later.)

● Citrine rollout

OS Upgrade and new eosserver Puppet module

EOSPPS is being updated to CentOS 7, also one diskserver is using the new Puppet module and hostgroup layout (still WIP while adapting/rewriting manifests)

EOSATLAS

Very interested in having the storage nodes migrated to CentOS 7 (thus Citrine), so that they can experiment with BEER (Batch on Eos Evaluation of Resources)

Discussed during last meeting, will most likely upgrade to Citrine after the run (or early 2018) depending on the experiment's schedule.

● SWAN

Sat incident: EOSUSER had transient impact - stalled, SWAN resumed without further action

Upcoming MGM renames - please check whether SWAN is impacted. EOSPUBLIC is using the "new" eos principal, EOSUSER hasn't been restarted yet (would pick up new principals on restart). Massimo will still discuss with procurement.

● nextgen FUSE

(mail from Andreas, 2017-09-29) - assign these items to people, add tentative timestamp?

STATUS

- server-side code merged into production version and first (beta) version of new client RPM in gitlab (last week)

- deployed on EOSUAT (last week)

WORKPLAN

CLIENTS: from beta to production version

remaining development items
- finalizing integration of new strong security module (deprecating eosfusebind)

- client support for HA deployments (master-slave failover)

- fix reported issues from qa tests

quality assurance
- CI integration for all supported platforms doing core functional tests on the way

- internal validation (IT_ST)

- validate many-client deployment at larger scale

- validate use cases known not to work on current FUSE implementation

- validate in SAMBA and NFS gateways

- external validation (others)

- invite beta tester on QA platforms

- iterate on feedback

performance tuning

- towards AFS performance

SERVER
quality assurance

- validate usual standard production use cases

- scale test with many new FUSE clients

- verify no interference between old/new clients

migration

- migrate instances to new production release

- migrate instances to new CITRINE namespace backend (2018)

mid-term

- evolve MD & DATA HA & scalability model (2018)

TIMESCALE

achieve stable production version by end of the year

Current Status Update

Joszef/Elvin several fixes to compile on Ubuntu & OSX
Giorgios imported auth security code for strong security from legacy fuse:

"auth" : { "shared-mount" : 1, "krb5" : 1 }

Elvin merged server part into CITRINE (master)
server-side fix allows clients to work without quota when quota is disabled in a space
added cache-leveler thread keeping the local disk cache at the configured size
added negative cache hooks

"options" : { "md-kernelcache.enoent.timeout" : 5, },

we are tracking bugs/tasks now in the /eosxd epic in JIRA
Giorgios already found several issues with Clang/GCC and code instrumentation
still missing '.' and '..' directory in 'ls' output

autoconf, cmake, git clone works (on EL7) (still tracking an MD invalidation bug)

RPM: eos-fusex-core (can co-exist with old eos-fuse-core). Available from build system. Can start to rebuild on KOJI (Dan's special area). Will be released as next tagged release.

EOSPPS - update to citrine+new API

want bagplus VMs that mount the updated instances (starting with EOSUAT, also EOSBACKUP, which has scratch spaces). Also allows to validate the new puppet modules. Can mount same instance twice, using old and new FUSE.
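For reference, the two configuration fragments quoted in the status update can be read as parts of one JSON document. The sketch below is an illustration only: the enclosing object and the variable names are assumptions, the trailing commas of the quoted excerpt are dropped so that Python's strict parser accepts it, and the unit of the timeout value is not stated in the mail.

```python
import json

# Both fragments from the status update, wrapped in one (assumed) top-level object.
config_text = """
{
  "auth": { "shared-mount": 1, "krb5": 1 },
  "options": { "md-kernelcache.enoent.timeout": 5 }
}
"""

config = json.loads(config_text)

# Negative-cache lifetime for "no such entry" lookups, as configured above.
enoent_timeout = config["options"]["md-kernelcache.enoent.timeout"]
krb5_enabled = bool(config["auth"]["krb5"])
```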

● new Namespace

refactored

Can set ACLs recursively+atomically server-side.

Some "cosmetic" stuff pending, also still need to merge Giorgios' "queue".

Deployment: go to EOSBACKUP (on citrine in 2 weeks). New namespace before end of year (puppet "eosserver")?

● BATCH integration

Usual routine test. Note that (for several weeks) we see systematic job failures/retry.

===

Task 184433 starts at Mon Oct  2 16:58:18 2017 and ends at Mon Oct  2 17:43:35 2017 (45.3 minutes)
Analysed jobs: 100
Correct jobs: 100
Maximum concurrency: 7
Execution hosts (top 5):  b6c5080bfb [#24]  b64972dff9 [#22]  b681081309 [#14]  b6de650151 [#10]  b66d5427ce [#9] 
Execution environments (top 5): eos-client-4.1.30-1.el6.x86_64, eos-fuse-core-4.1.30-1.el6.x86_64, 
xrootd-client-libs-4.6.1-1.el6.i686, xrootd-client-libs-4.6.1-1.el6.x86_64 [#100] 


Error analysis:
### Jobs analysis comparing status and error messages

Checking task 181800 (100 jobs)
Duplicated out file from outputs/out_0_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_1_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_2_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_3_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_4_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_5_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_6_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_7_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_8_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_9_10_*_181800.0.txt: (2 copies)
Problems in matching data file for task 181800 (job 0) with template outputs/out_9_10_*_181800.0.txt
	outputfile: outputs/out_0_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_0_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_1_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_1_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_2_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_2_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_3_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_3_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_4_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_4_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_5_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_5_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_6_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_6_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_7_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_7_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_8_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_8_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_9_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_9_10_b6f23a4dd2.cern.ch_181800.0.txt
Wrong filesize for outputs/out_0_10_b68e5a1754.cern.ch_181800.3.txt (0!=1024000)

(linked to NA62 complaint about mixed failures of EOS + CONDOR. Indeed see many jobs being run twice..)

Need to make sure EOS does not become the default scapegoat for all computer trouble.

Massimo to take this up with CONDOR people (look at job logs? "started".."finished"). "job resubmission" is switched off for Grid (behind CE), but not for normal users..
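The "Duplicated out file" lines in the report above amount to grouping output files by job and flagging those written from more than one execution host. A minimal sketch of that check, assuming the file name layout seen in the listing (out_<index>_<total>_<host>_<task>.<suffix>.txt); the function name is hypothetical and the meaning of the numeric suffix is not stated in the report.

```python
import re
from collections import defaultdict

# outputs/out_<index>_<total>_<host>_<task>.<suffix>.txt, as in the listing above
NAME = re.compile(r"outputs/out_(\d+)_(\d+)_(.+)_(\d+)\.(\d+)\.txt$")


def find_duplicates(paths):
    """Return {(index, task, suffix): [hosts]} for outputs written by more than one host."""
    hosts = defaultdict(set)
    for path in paths:
        match = NAME.match(path)
        if match is None:
            continue  # ignore names that do not follow the assumed layout
        index, _total, host, task, suffix = match.groups()
        hosts[(int(index), int(task), int(suffix))].add(host)
    return {key: sorted(h) for key, h in hosts.items() if len(h) > 1}
```

Two hosts under the same key would be consistent with the same job having run twice, which is the point to raise with the CONDOR side.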


- 16:00 → 16:05
  
  overall 2017 planning 5m
  
  Speaker: Jan Iven (CERN)
- 16:05 → 16:30
  operations: production
  - 16:05
    
    production instances 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:10
    
    CERNBOX and EOSUSER 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:15
    
    FUSE and client versions 5m
    
    Speaker: Dan van der Ster (CERN)
  - 16:20
    
    Citrine rollout 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:25
    
    SWAN 5m
    
    Speaker: Jakub Moscicki (CERN)
- 16:30 → 16:50
  development: near-term
  - 16:30
    nextgen FUSE 5m
    
    Speaker: Andreas Joachim Peters (CERN)
  - 16:35
    
    new Namespace 5m
    
    Speaker: Elvin Alin Sindrilaru (CERN)
- 16:50 → 17:45
  other: pilot services, long-term dev, external
  - 16:50
    
    Webservice 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:55
    
    Backup 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:00
    
    Samba 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:05
    
    $HOME structure 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:10
    BATCH integration 5m
    
    Speaker: Massimo Lamanna (CERN)
  - 17:15
    
    Xrootd 5m
    
    Speaker: Michal Kamil Simon (CERN)
  - 17:20
    
    AOB 5m
