EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description
Weekly meeting to discuss progress on EOS rollout

● overall 2017 planning

 outside communication: ITUM-22 (AFS+) EOSFUSEx status - missed "end of September for beta test" deadline, but promise to "stabilize until end of 2017".


● production instances

EOSPUBLIC

Crashed on Sunday: apparently a memory allocation issue that is hard to track down.

LHC instances

Re-enabling intra-group balancing, will re-enable inter-group balancing in the coming days (nicer load distribution across nodes/scheduling groups)


● CERNBOX and EOSUSER

namespace compacted.

Sat: overload (self-inflicted? A Grafana cron job ran "wc -l" on a 66 GB log file every 5 min). MGM unresponsive, MQ lost connection to some FSTs. Occurred again. Need to do something about this "wc" (run only one instance at a time, via lockfile), and also drive down the error rate (one log line per write attempt after failure?).
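A minimal sketch of the "only run once/lockfile" idea, assuming the monitoring job only needs the number of lines added since its previous run. The function name and the three paths are hypothetical; this is not the actual Grafana cron job. It takes a non-blocking exclusive lock so that overlapping runs give up immediately, and it remembers the byte offset of the previous run so that the whole file is not re-read every 5 minutes.

```python
import fcntl
import os


def count_new_lines(log_path, state_path, lock_path):
    """Count newlines appended to log_path since the previous run.

    Returns None if another instance already holds the lock.
    All three paths are hypothetical and chosen by the caller.
    """
    lock_file = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: a second instance gives up at once.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None

        # Byte offset reached by the previous run (0 on the first run).
        try:
            with open(state_path) as state:
                offset = int(state.read().strip() or 0)
        except FileNotFoundError:
            offset = 0

        # A file smaller than the saved offset was truncated (e.g. copytruncate).
        if os.path.getsize(log_path) < offset:
            offset = 0

        new_lines = 0
        with open(log_path, "rb") as log:
            log.seek(offset)
            # Read in 1 MiB chunks and count newline bytes, like "wc -l".
            for chunk in iter(lambda: log.read(1 << 20), b""):
                new_lines += chunk.count(b"\n")
            offset = log.tell()

        with open(state_path, "w") as state:
            state.write(str(offset))
        return new_lines
    finally:
        lock_file.close()  # closing the file releases the lock
```

With this approach each run reads only the bytes written since the last one, so the cost no longer grows with the size of the log.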

Log is rotated daily (copytruncate); need to look at xrootd internal logrotate again (DST bug fixed in 4.x?).

Also heavy user activity (10k auth req in 60sec) - cause/effect?

Also, Kuba's account ACLs got reset because "eos stat" returned an unexpected error code (timeout treated as "not found"?).
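One way to guard against this class of problem is to treat only an explicit "not found" result as grounds for a destructive action, and everything else (including timeouts) as "unknown". The sketch below assumes, purely for illustration, that the stat call reports errno-style return codes; the minutes do not say which codes "eos stat" actually returns, and all names here are hypothetical.

```python
import errno
from enum import Enum


class StatOutcome(Enum):
    EXISTS = "exists"
    NOT_FOUND = "not found"
    UNKNOWN = "unknown"


def classify_stat(return_code):
    """Map a return code to an outcome (errno-style codes are an assumption)."""
    if return_code == 0:
        return StatOutcome.EXISTS
    if return_code == errno.ENOENT:
        return StatOutcome.NOT_FOUND
    # Timeouts and any other error say nothing about whether the path exists.
    return StatOutcome.UNKNOWN


def should_reinitialise_acls(return_code):
    """Only a definite 'not found' may trigger re-creating ACLs."""
    return classify_stat(return_code) is StatOutcome.NOT_FOUND
```

A caller that gets `UNKNOWN` would retry or skip the account rather than reset anything.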

Owncloud (web) update on Oct 10th

 


● FUSE and client versions

(ticket from EOSATLAS user) - 0.3.267 MGM reports "full" filesystem differently - EOSFUSE reacts to this, and needs to be updated -> 4.1.32

desktop 4.1.30 is still pending (needs packaged protobuf3). Q: should we wait for 4.1.32 instead? (No: they need to get some newer version beyond 4.1.14 first, and can update again later.)

 

 


● Citrine rollout

OS Upgrade and new eosserver Puppet module

EOSPPS is being updated to CentOS 7, also one diskserver is using the new Puppet module and hostgroup layout (still WIP while adapting/rewriting manifests)

EOSATLAS

Very interested in having the storage nodes migrated to CentOS 7 (thus Citrine), so that they can experiment with BEER (Batch on Eos Evaluation of Resources)

Discussed during last meeting, will most likely upgrade to Citrine after the run (or early 2018) depending on the experiment's schedule.


● SWAN

Sat incident: EOSUSER had transient impact  - stalled, SWAN resumed without further action

Upcoming MGM renames - please check whether SWAN is impacted. EOSPUBLIC is using the "new" eos principal, EOSUSER hasn't been restarted yet (would pick up new principals on restart). Massimo will still discuss with procurement.

 


● nextgen FUSE

(mail from Andreas, 2017-09-29) - assign these items to people, add tentative timestamp?


STATUS

- server-side code merged into production version and first (beta) version of new client RPM in gitlab (last week)

- deployed on EOSUAT (last week)

 

WORKPLAN

CLIENTS: from beta to production version

remaining development items
- finalizing integration of new strong security module (deprecating eosfusebind)

- client support for HA deployments (master-slave failover)

- fix reported issues from qa tests

quality assurance
- CI integration for all supported platforms, running core functional tests (on the way)

- internal validation (IT_ST)

 - validate many-client deployment at larger scale

 - validate use cases known not to work on current FUSE implementation

 - validate in SAMBA and NFS gateways

- external validation (others)

 - invite beta testers on QA platforms

 - iterate on feedback

performance tuning

- towards AFS  performance


SERVER
quality assurance

- validate usual standard production use cases

- scale test with many new FUSE clients

- verify no interference between old/new clients

migration

- migrate instances to new production release

- migrate instances to new CITRINE namespace backend (2018)

mid-term

- evolve MD & DATA HA & scalability model (2018)

TIMESCALE

achieve stable production version by end of the year


 

Current Status Update

  • Joszef/Elvin several fixes to compile on Ubuntu & OSX
     
  • Giorgios imported auth security code for strong security from legacy fuse:

  "auth" : {
    "shared-mount" : 1,
    "krb5" : 1
  }

  • Elvin merged server part into CITRINE (master)
     
  • server-side fix allows clients to work without quota when quota is disabled in a space
     
  • added cache-leveler thread keeping the local disk cache at the configured size
     
  • added negative cache hooks


  "options" : {
    "md-kernelcache.enoent.timeout" : 5
  },

 

  • we are tracking bugs/tasks now in the /eosxd epic in JIRA
     
  • Giorgios has already found several issues with Clang/GCC and code instrumentation
     
  • still missing '.' and '..' directory in 'ls' output
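The negative cache hooks and the "md-kernelcache.enoent.timeout" : 5 option mentioned above can be illustrated with a small time-limited cache of "does not exist" answers. This is a sketch of the general technique only, assuming the timeout is in seconds; the class and method names are hypothetical and do not reflect the eosxd implementation.

```python
import time


class NegativeCache:
    """Remember 'no such entry' lookups for a limited time."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._expires = {}   # path -> time at which the entry expires

    def remember_missing(self, path):
        """Record that a lookup of path just failed with 'not found'."""
        self._expires[path] = self._clock() + self._timeout

    def is_known_missing(self, path):
        """True while a recent 'not found' answer for path is still valid."""
        expiry = self._expires.get(path)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[path]  # expired: ask the server again
            return False
        return True

    def forget(self, path):
        """Drop the entry, e.g. when the path has just been created."""
        self._expires.pop(path, None)
```

The trade-off is the usual one: a longer timeout saves more repeated lookups for missing files, but a file created by another client stays invisible for up to that long.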

autoconf, cmake, git clone works (on EL7) (still tracking an MD invalidation bug)

 

RPM: eos-fusex-core (can co-exist with old eos-fuse-core). Available from build system. Can start to rebuild on KOJI (Dan's special area). Will be released as next tagged release.

EOSPPS - update to citrine+new API

want bagplus VMs that mount the updated instances (starting with EOSUAT, also EOSBACKUP, which has scratch spaces). This also allows validating the new puppet modules. Can mount the same instance twice, using old and new FUSE.

● new Namespace

refactored

Can set ACLs recursively+atomically server-side.

Some "cosmetic" stuff pending, also still need to merge Giorgios' "queue".

Deployment: go to EOSBACKUP  (on citrine in 2 weeks). New namespace before end of year (puppet "eosserver")?


● BATCH integration

Usual routine test. Note that (for several weeks) we see systematic job failures/retry.

===

Task 184433 starts at Mon Oct  2 16:58:18 2017 and ends at Mon Oct  2 17:43:35 2017 (45.3 minutes)
Analysed jobs: 100
Correct jobs: 100
Maximum concurrency: 7
Execution hosts (top 5):  b6c5080bfb [#24]  b64972dff9 [#22]  b681081309 [#14]  b6de650151 [#10]  b66d5427ce [#9] 
Execution environments (top 5): eos-client-4.1.30-1.el6.x86_64, eos-fuse-core-4.1.30-1.el6.x86_64, 
xrootd-client-libs-4.6.1-1.el6.i686, xrootd-client-libs-4.6.1-1.el6.x86_64 [#100] 


Error analysis:
### Jobs analysis comparing status and error messages

Checking task 181800 (100 jobs)
Duplicated out file from outputs/out_0_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_1_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_2_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_3_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_4_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_5_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_6_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_7_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_8_10_*_181800.0.txt: (2 copies)
Duplicated out file from outputs/out_9_10_*_181800.0.txt: (2 copies)
Problems in matching data file for task 181800 (job 0) with template outputs/out_9_10_*_181800.0.txt
	outputfile: outputs/out_0_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_0_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_1_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_1_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_2_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_2_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_3_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_3_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_4_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_4_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_5_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_5_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_6_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_6_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_7_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_7_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_8_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_8_10_b6f23a4dd2.cern.ch_181800.0.txt
	outputfile: outputs/out_9_10_b68e5a1754.cern.ch_181800.0.txt
	outputfile: outputs/out_9_10_b6f23a4dd2.cern.ch_181800.0.txt
Wrong filesize for outputs/out_0_10_b68e5a1754.cern.ch_181800.3.txt (0!=1024000)

(linked to NA62 complaint about mixed failures of EOS + CONDOR. Indeed see many jobs being run twice..)

Need to make sure EOS does not become the default scapegoat for all computer trouble.

Massimo to take this up with CONDOR people (look at job logs? "started".."finished"). "job resubmission" is switched off for Grid (behind CE), but not for normal users..
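The "jobs being run twice" pattern in the error analysis above can be spotted mechanically from the output file names alone: the same name with only the host part differing means the same job wrote output from two execution hosts. A minimal sketch, assuming the host is the field that the report replaces with "*" in its templates; the regular expression and function name are hypothetical and not part of the actual test suite.

```python
import re
from collections import defaultdict

# out_<n>_<n>_<host>_<task>.<n>.txt ; the host is the part shown as "*"
# in the report's templates.
NAME_RE = re.compile(
    r"^(?P<prefix>.*out_\d+_\d+_)(?P<host>.+)(?P<suffix>_\d+\.\d+\.txt)$"
)


def find_duplicates(paths):
    """Return {template: [hosts]} for outputs written from more than one host."""
    hosts_by_template = defaultdict(set)
    for path in paths:
        match = NAME_RE.match(path)
        if match is None:
            continue  # not an output file in the expected naming scheme
        template = match["prefix"] + "*" + match["suffix"]
        hosts_by_template[template].add(match["host"])
    return {
        template: sorted(hosts)
        for template, hosts in hosts_by_template.items()
        if len(hosts) > 1
    }
```

Cross-checking the reported hosts against the CONDOR job logs ("started".."finished") would then show whether the second copy comes from a resubmission.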

 

    • 16:00 16:05
      overall 2017 planning 5m
      Speaker: Jan Iven (CERN)

    • 16:05 16:30
      operations: production
      • 16:05
        production instances 5m
        Speaker: Herve Rousseau (CERN)

      • 16:10
        CERNBOX and EOSUSER 5m
        Speaker: Luca Mascetti (CERN)

      • 16:15
        FUSE and client versions 5m
        Speaker: Dan van der Ster (CERN)

      • 16:20
        Citrine rollout 5m
        Speaker: Herve Rousseau (CERN)

      • 16:25
        SWAN 5m
        Speaker: Jakub Moscicki (CERN)

    • 16:30 16:50
      development: near-term
      • 16:30
        nextgen FUSE 5m
        Speaker: Andreas Joachim Peters (CERN)

      • 16:35
        new Namespace 5m
        Speaker: Elvin Alin Sindrilaru (CERN)

    • 16:50 17:45
      other: pilot services, long-term dev, external
      • 16:50
        Webservice 5m
        Speaker: Luca Mascetti (CERN)
      • 16:55
        Backup 5m
        Speaker: Luca Mascetti (CERN)
      • 17:00
        Samba 5m
        Speaker: Luca Mascetti (CERN)
      • 17:05
        $HOME structure 5m
        Speaker: Luca Mascetti (CERN)
      • 17:10
        BATCH integration 5m
        Speaker: Massimo Lamanna (CERN)
      • 17:15
        Xrootd 5m
        Speaker: Michal Kamil Simon (CERN)
      • 17:20
        AOB 5m