EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description
Weekly meeting to discuss progress on EOS rollout

● production instances

EOSALICE

Multiple crashes on Saturday, caused by (per Andreas):

  • Incredibly high connection rate - fixed
  • Malformed authentication information - upstream
  • EOS Stacktrace
  • (and something with draining - memory corruption, elsewhere)

Working on putting the authentication proxies in front of the MGM; hit a very exotic issue with iptables (the setup works for EOSCMS, but somehow not for EOSALICE).

Better data distribution

Roberto is working on the EOSALICE scheduling group imbalance. After random FST crashes this is now going slowly (25 filesystems/day), but O(3000) filesystems in 30 groups need to be done -> 160 days ETA at the current rate. The procedure is scripted but launched manually. Andreas might have a suggestion on how to speed this up (or take 1 FS/node in parallel?).
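The ETA arithmetic can be sketched as below. This is only a back-of-the-envelope helper; the function name is hypothetical. Taken at face value, 3000 filesystems at 25/day give 120 days, while the quoted 160 days would correspond to roughly 4000 filesystems, so "O(3000)" is presumably a rough figure.

```python
import math

def drain_eta_days(filesystems: int, per_day: float) -> int:
    """Days needed to process `filesystems` at `per_day` filesystems/day."""
    if per_day <= 0:
        raise ValueError("per_day must be positive")
    return math.ceil(filesystems / per_day)

print(drain_eta_days(3000, 25))   # 120
print(drain_eta_days(4000, 25))   # 160
print(drain_eta_days(3000, 100))  # 30 (if the daily rate could be quadrupled)
```

Any parallelisation (e.g. 1 FS/node) only changes the per_day input; the achievable rate itself is not known from these notes.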


Started to work on EOSCMS (which reached critical levels, up to 99% full) and EOSPUBLIC.

Note: filesystems come in different sizes (2TB..6TB); this should be taken into account when forming groups.

EOSATLAS (Cristi): similar - drain filesystems, add them to groups based on fullness (add until <90% full). 
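One way to read "add until <90% full" while respecting the different filesystem sizes is the greedy sketch below. It is an illustration under assumptions, not the procedure actually used: the group names, the input layout (used/capacity in TB) and the "largest spare to the fullest group" policy are all hypothetical.

```python
def plan_additions(groups, spare_sizes_tb, target=0.90):
    """Assign drained (empty) filesystems to groups until all are below `target`.

    groups: dict name -> (used_tb, capacity_tb)
    spare_sizes_tb: capacities of the empty filesystems available
    Returns dict name -> list of filesystem sizes added to that group.
    """
    state = {name: [used, cap] for name, (used, cap) in groups.items()}
    spares = sorted(spare_sizes_tb)      # ascending, so pop() yields the largest
    added = {name: [] for name in state}
    while spares:
        # Pick the group with the highest fill fraction.
        name = max(state, key=lambda g: state[g][0] / state[g][1])
        used, cap = state[name]
        if used / cap < target:
            break                        # every group is already below target
        size = spares.pop()
        state[name][1] = cap + size      # an empty filesystem adds capacity only
        added[name].append(size)
    return added

print(plan_additions({"group.0": (95.0, 100.0), "group.1": (50.0, 100.0)}, [2, 4, 6]))
```

In this example group.0 starts at 95% and drops to 95/106 (about 89.6%) after receiving the 6TB filesystem, so the 2TB and 4TB spares stay unused.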

 


● CERNBOX and EOSUSER

Investigating "strange" 1-minute delays (also seen by the probe and by WOPI - stat() takes a minute or more). Could have been Backup launching in a "storm", but that is unlikely.

Andreas suggests a better probe: mkdir() on an established connection (vs "mkdir" on a new/separate connection)? The results might differ; the delay seems to come from "xrdcp -f" waiting for redirection.
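The difference between the two probe styles can be sketched generically as below. This is a minimal sketch: make_conn and op are hypothetical callables supplied by the caller (e.g. opening a client session and issuing a mkdir), and nothing here is tied to a specific client library.

```python
import time

def probe(make_conn, op, repeats=5, reuse=True):
    """Return per-call latencies in seconds.

    make_conn: hypothetical factory returning a connection object
    op:        callable taking the connection and performing the operation
    reuse=True  -> connect once, time only the operation
    reuse=False -> connect for every call and include the setup in the timing
    """
    latencies = []
    conn = make_conn() if reuse else None
    for _ in range(repeats):
        start = time.monotonic()
        current = conn if reuse else make_conn()
        op(current)
        latencies.append(time.monotonic() - start)
    return latencies
```

Running both modes side by side would show whether the delay sits in connection setup/redirection or in the operation itself.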

Not reaching the max number of threads (4k).

Could capture the latency in the MGM: this already exists, but it would need to be reset every hour.


● FUSE and client versions

Compiled 4.2.0-3 on el7/el6 for koji. el6 repo has a new dependency, hiredis. 

el7 testing: http://linuxsoft.cern.ch/internal/repos/eos7-testing/x86_64/os/Packages/

el6 testing: http://linuxsoft.cern.ch/internal/repos/eos6-testing/x86_64/os/Packages/

Dan's basic tests are passing, but these have *not* been pushed to qa.

Also, eos-fusex 4.2.0-3 can be found in the above repos, but the puppet eosclient integration is incomplete.

Q: what needs to be done - should not be blocked for 3 weeks.

Q: who can push this to "qa", since it fixes the 4.1.30 session binding crash? See the brand-new EOSops procedure.


● Citrine rollout

EOSCMS

They confirmed the preferred slot for migrating to Citrine would be after the Christmas shutdown

EOSATLAS

Meeting on Friday about Batch on EOS, hence also about CentOS 7 and Citrine migration


● SWAN

SWAN had a "spontaneous" update to EOSFUSE 4.1.30 (which crashes on LXPLUS when used with per-session bindings; this might not affect SWAN).


● nextgen FUSE

new FUSE

  • discovered that XrdCl does not disable the Nagle algorithm (a write(1b)-sync-write(1b)-sync ... sequence takes 25ms for the write and 25ms for the disk sync = 50ms/b)
    • Michal added XRD_NODELAY to XrdCl to disable Nagle
  • file start cache and journal directories can now be overlaid in the same directory
  • Georgios ported RocksDB as a KV backend to replace REDIS (used for SMB/NFS gateways, where stable inodes are needed)
  • FUSEX client now creates all (missing) local cache directories according to the configuration
  • Georgios fixed a few more race conditions with the thread sanitizer
  • a few fixes for the NFS4 gateway ("." and ".." directories, special FUSE flags)
  • strong auth now works: you can change your credentials and the permissions change as expected
  • Georgios fixed the wrong standard deviation computation of the rate counters
  • FUSEX client sends statistics to the server (memory usage, inodes cached ...) - these would need to be extracted into logs if required, can also be triggered on demand.
  • kernel cache invalidation now works
  • Georgios provides a source RPM for hiredis, which was used for compiling 4.2.0-3
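For reference, the Nagle point in the list above can be illustrated at plain socket level. This is a generic sketch, not how XrdCl implements XRD_NODELAY; both function names are hypothetical. The second helper just restates the arithmetic: 25ms write + 25ms sync per 1-byte cycle caps a strictly serial client at 20 cycles per second.

```python
import socket

def set_nodelay(sock: socket.socket) -> None:
    """Disable Nagle's algorithm on a TCP socket so small writes are sent immediately."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

def sync_writes_per_second(write_ms: float, sync_ms: float) -> float:
    """Upper bound on strictly serial write+sync cycles per second."""
    return 1000.0 / (write_ms + sync_ms)

print(sync_writes_per_second(25, 25))  # 20.0
```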

todo

  • identified an update bug when RocksDB is enabled, which also affects compilation via the NFS4 gateway (0-size file seen)
    • fix in progress
  • refine the recovery behaviour of the client when it was unresponsive and didn't receive MGM callbacks (test: SIGSTOP/SIGCONT)
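The SIGSTOP/SIGCONT test mentioned above could be driven by a small helper like this sketch (POSIX only). The function name and the idea of freezing the client for a fixed interval are assumptions; what to check after resuming (e.g. that the mount still answers) is left to the caller.

```python
import os
import signal
import time

def freeze_process(pid: int, seconds: float) -> None:
    """Suspend process `pid` with SIGSTOP for `seconds`, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)
```

While stopped, the client cannot process any callbacks, which is the situation whose recovery needs refining.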

 


● new Namespace

numeric UIDs: done; clients resolve them, the converter handles them

protobuf 

Have 2 old ALICE headnodes, now doing EOSBACKUP namespace conversion tests - found issues with orphans and name conflicts (done on-the-fly during boot). To be fixed today, will then convert+validate.

Rollout: EOSBACKUP. Does it need CC7? "Yes, only on MGM and QuarkDB". Luca: "mhmmmh.."

 


● BATCH integration

Task 263925 starts at Mon Oct 23 16:39:20 2017 and ends at Mon Oct 23 17:07:32 2017 (28.2 minutes)
Analysed jobs: 100
Correct jobs: 100
Maximum concurrency: 3
Execution hosts (top 5):  b69586e854 [#43]  b64972dff9 [#28]  b674d8742c [#19]  b6163cf2d6 [#10]
Execution environments (top 5): eos-client-4.1.30-1.el6.x86_64, eos-fuse-core-4.1.30-1.el6.x86_64, xrootd-client-libs-4.6.1-1.el6.i686, xrootd-client-libs-4.6.1-1.el6.x86_64 [#100] 

 

Q: why still running xrootd-4.6 (has "empty buffer retry" issue) - should be 4.7 - where is this version coming from?
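A check for this could be automated from the "Execution environments" line of the report, roughly as sketched below. The regular expression assumes the usual name-version-release.arch package naming and purely numeric version components; the function names and the 4.7 threshold as a tuple are illustrative.

```python
import re

# name-version-release.arch, e.g. xrootd-client-libs-4.6.1-1.el6.x86_64
PKG = re.compile(r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)\.(?P<arch>[^.]+)$")

def parse_package(pkg: str):
    """Return (name, version tuple) for a package string, or None if it does not match."""
    m = PKG.match(pkg.strip())
    if m is None:
        return None
    version = tuple(int(part) for part in m.group("version").split("."))
    return m.group("name"), version

def outdated(packages, prefix="xrootd", minimum=(4, 7)):
    """List packages whose name starts with `prefix` and whose version is below `minimum`."""
    hits = []
    for pkg in packages:
        parsed = parse_package(pkg)
        if parsed and parsed[0].startswith(prefix) and parsed[1] < minimum:
            hits.append(pkg.strip())
    return hits

env = ("eos-client-4.1.30-1.el6.x86_64, eos-fuse-core-4.1.30-1.el6.x86_64, "
       "xrootd-client-libs-4.6.1-1.el6.i686, xrootd-client-libs-4.6.1-1.el6.x86_64")
print(outdated(env.split(",")))
```

With the environment reported for task 263925, both xrootd-client-libs 4.6.1 packages are flagged.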


● AOB

  • SAMBA - need an "expert"? Could do some automatic behaviour change when re-exporting as NFS or SMB (Luca has a series of steps).
  • Fermi reports a crash on LRU (auto-cleanup of scratch directories); a trace is on JIRA.
    • 16:00 16:05
      overall 2017 planning 5m
      Speaker: Jan Iven (CERN)
    • 16:05 16:30
      operations: production
      • 16:05
        production instances 5m
        Speaker: Herve Rousseau (CERN)


      • 16:10
        CERNBOX and EOSUSER 5m
        Speaker: Luca Mascetti (CERN)


      • 16:15
        FUSE and client versions 5m
        Speaker: Dan van der Ster (CERN)


      • 16:20
        Citrine rollout 5m
        Speaker: Herve Rousseau (CERN)


      • 16:25
        SWAN 5m
        Speaker: Jakub Moscicki (CERN)


    • 16:30 16:50
      development: near-term
      • 16:30
        nextgen FUSE 5m
        Speaker: Andreas Joachim Peters (CERN)


      • 16:35
        new Namespace 5m
        Speaker: Elvin Alin Sindrilaru (CERN)


    • 16:50 17:45
      other: pilot services, long-term dev, external
      • 16:50
        Webservice 5m
        Speaker: Luca Mascetti (CERN)
      • 16:55
        Backup 5m
        Speaker: Luca Mascetti (CERN)
      • 17:00
        Samba 5m
        Speaker: Luca Mascetti (CERN)
      • 17:05
        $HOME structure 5m
        Speaker: Luca Mascetti (CERN)
      • 17:10
        BATCH integration 5m
        Speaker: Massimo Lamanna (CERN)


      • 17:15
        Xrootd 5m
        Speaker: Michal Kamil Simon (CERN)
      • 17:20
        AOB 5m