EOS DevOps Meeting

Name: EOS DevOps Meeting
Start: 2017-10-24T16:00:00+02:00
End: 2017-10-24T17:45:00+02:00
Location: CERN

Tuesday 24 Oct 2017, 16:00 → 17:45 Europe/Zurich

513/R-068 (CERN)

Jan Iven (CERN)

Description

Weekly meeting to discuss progress on EOS rollout

● production instances

EOSALICE

Multiple crashes on Saturday, caused by (per Andreas):

- Incredibly high connection rate - fixed
- Malformed authentication information - upstream
- EOS Stacktrace
- (and something with draining - memory corruption, elsewhere)

Working on putting the authentication proxies in front of the MGM: very exotic issue with iptables... (was working for EOSCMS, but somehow not for EOSALICE).

Better data distribution

Roberto is working on the EOSALICE scheduling group imbalance. After random FST crashes the work is now going slowly: 25 filesystems/day, but O(3000) filesystems in 30 groups need to be done, which gives an ETA of 160 days at the current rate. The procedure is scripted but launched manually. Andreas might have a suggestion on how to speed this up (or take 1 FS/node in parallel?).
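
The ETA above is simple arithmetic and can be re-derived as a sanity check. A minimal sketch, assuming a constant rate; the function name and the parallelism factor are illustrative and not part of any EOS tooling. Note that exactly 3000 filesystems at 25/day comes out at 120 days, so the 160 days quoted presumably reflects a larger count behind the O(3000) estimate or a lower effective rate.

```python
import math


def eta_days(n_filesystems, per_day, parallel_factor=1):
    """Days needed to process n_filesystems at a constant rate.

    parallel_factor is a hypothetical speed-up, e.g. from handling
    one filesystem per node in parallel.
    """
    return math.ceil(n_filesystems / (per_day * parallel_factor))


print(eta_days(3000, 25))     # 120
print(eta_days(3000, 25, 4))  # 30
```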

Started to work on EOSCMS (which reached critical levels, up to 99% full) and EOSPUBLIC.

Note: the filesystems have different sizes (2TB..6TB), which should be taken into account when forming groups.

EOSATLAS (Cristi): similar - drain filesystems, add them to groups based on fullness (add until <90% full).
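
One possible reading of "add until <90% full" is: hand drained (empty) filesystems to the fullest group until no group is at or above 90%. The sketch below illustrates that reading only; the `Group` class, the TB units and the largest-first order are assumptions, not the procedure actually used on EOSATLAS. Using capacities rather than filesystem counts is what accounts for the mixed 2TB..6TB sizes.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    used_tb: float
    capacity_tb: float

    @property
    def fullness(self):
        return self.used_tb / self.capacity_tb


def assign_drained(groups, drained_sizes_tb, threshold=0.90):
    """Assign drained filesystems to the fullest group until all groups are below threshold.

    Returns a list of (size_tb, group_name) pairs in assignment order.
    """
    plan = []
    for size in sorted(drained_sizes_tb, reverse=True):
        fullest = max(groups, key=lambda g: g.fullness)
        if fullest.fullness < threshold:
            break  # every group is already below the threshold
        fullest.capacity_tb += size  # an empty filesystem adds capacity only
        plan.append((size, fullest.name))
    return plan


groups = [Group("g1", 95, 100), Group("g2", 50, 100)]
plan = assign_drained(groups, [6, 2, 4])
print(plan)
```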

● CERNBOX and EOSUSER

Investigating "strange" 1min-delays (also seen by probe, also by WOPI - stat() takes minute(s)). Could have been Backup launching in a "storm", but unlikely.

Andreas suggests a better probe: mkdir() on an established connection (vs. "mkdir" on a new/separate connection)? The two might behave differently; the delay seems to come from "xrdcp -f" waiting for redirection.

Not reaching the max number of threads (4k).

Might capture the latency in the MGM: this exists already, but would need to be reset every hour.
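
The "reset every hour" idea can be sketched as a worst-case latency tracker with a fixed window. This is a generic illustration, not the MGM's own bookkeeping; the class name, the `timed` helper and the injectable clock are assumptions made for the example.

```python
import time


class HourlyMaxLatency:
    """Keeps the worst latency seen in the current window and resets when it expires."""

    def __init__(self, window_s=3600, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self.window_start = clock()
        self.max_latency = 0.0

    def record(self, latency_s):
        now = self.clock()
        if now - self.window_start >= self.window_s:
            # A new window starts: forget the previous maximum.
            self.window_start = now
            self.max_latency = 0.0
        self.max_latency = max(self.max_latency, latency_s)
        return self.max_latency


def timed(op):
    """Run op() and return its wall-clock duration in seconds."""
    start = time.monotonic()
    op()
    return time.monotonic() - start
```

A probe would call something like `tracker.record(timed(do_mkdir))`, where `do_mkdir` stands for whatever operation is being measured.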

● FUSE and client versions

Compiled 4.2.0-3 on el7/el6 for koji. el6 repo has a new dependency, hiredis.

el7 testing: http://linuxsoft.cern.ch/internal/repos/eos7-testing/x86_64/os/Packages/

el6 testing: http://linuxsoft.cern.ch/internal/repos/eos6-testing/x86_64/os/Packages/

Dan's basic tests are passing, but these have *not* been pushed to qa.

Also, eos-fusex 4.2.0-3 can be found in the above repos, but the puppet eosclient integration is incomplete.

Q: what needs to be done - should not be blocked for 3 weeks.

Q: who can push this to "qa", since it fixes the 4.1.30 session binding crash? See the brand-new EOSops procedure.

● Citrine rollout

EOSCMS

They confirmed the preferred slot for migrating to Citrine would be after the Christmas shutdown

EOSATLAS

Meeting on Friday about Batch on EOS, hence also about CentOS 7 and Citrine migration

● SWAN

SWAN had a "spontaneous" update to EOSFUSE 4.1.30 (which crashes on LXPLUS when used with per-session bindings; this might not affect SWAN).

● nextgen FUSE

new FUSE

discovered that XrdCl does not disable the Nagle algorithm (write(1b)-sync-write(1b)-sync ... takes 25ms for the write and 25ms for the disk sync = 50ms/b)
- Michal added XRD_NODELAY to XrdCl to disable Nagle
file start cache and journal directories can now be overlaid in the same directory
Georgios ported RocksDB as KV backend as REDIS replacement (used for SMB/NFS gateways, where stable inodes are needed)
FUSEX client now creates all (missing) local cache directories according to configuration
Georgios fixed a few more race conditions with thread sanitizer
a few fixes for NFS4 gateway (. .. dir, special FUSE flags)
strong auth now works: you can change your credentials and permissions change as expected
Georgios fixed wrong standard deviation computation of rate counters
FUSEX client sends statistics to server (memory usage, inodes cached ...) - would need to extract into logs if required, can also trigger on demand.
kernel cache invalidation now works
Georgios provides a source RPM for hiredis, which was used for compiling 4.2.0-3
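
For the Nagle item in the list above: at the socket level, disabling Nagle means setting the standard TCP_NODELAY option. How XrdCl applies XRD_NODELAY internally is not described in these minutes, so the sketch below uses plain Python sockets only to illustrate the underlying option, together with the 25ms + 25ms per-byte cost quoted above. The function names are made up for the example.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled via TCP_NODELAY."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def slow_write_seconds(n_bytes, write_ms=25, sync_ms=25):
    """Total time for n_bytes written one byte at a time, each followed by a sync."""
    return n_bytes * (write_ms + sync_ms) / 1000


sock = make_nodelay_socket()
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()

print(nodelay_enabled)
print(slow_write_seconds(1000))  # 1000 one-byte write+sync cycles at 50 ms each
```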

todo

identified an update bug when RocksDB is enabled, which also affects compilation via the NFS4 gateway (0 size file seen)
- fix in progress
refine recovery behaviour of client when it was unresponsive and didn't receive MGM callbacks (test: SIGSTOP/SIGCONT)
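
The SIGSTOP/SIGCONT test mentioned above can be driven by a few lines of script. A minimal sketch, assuming a POSIX system; the sleeping child process is only a stand-in for the FUSE client, and `freeze_process` is a hypothetical helper name.

```python
import os
import signal
import subprocess
import sys
import time


def freeze_process(pid, seconds):
    """Suspend a process with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Stand-in for the client under test: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    freeze_process(child.pid, 1)
    print("child still alive:", child.poll() is None)
    child.terminate()
    child.wait()
```

In a real test the interesting part is what happens after SIGCONT: whether the client notices the callbacks it missed while stopped and recovers a consistent state.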

● new Namespace

numeric UIDs: done, clients resolve, converter handles

protobuf

Have 2 old ALICE headnodes, now doing EOSBACKUP namespace conversion tests - found issues with orphans and name conflicts (done on the fly during boot). To be fixed today, will then convert+validate.

Rollout: EOSBACKUP. "Does it need CC7?" "Yes, only on MGM and QuarkDB". Luca: "mhmmmh.."

● BATCH integration

Task 263925 starts at Mon Oct 23 16:39:20 2017 and ends at Mon Oct 23 17:07:32 2017 (28.2 minutes)
Analysed jobs: 100
Correct jobs: 100
Maximum concurrency: 3
Execution hosts (top 5): b69586e854 [#43] b64972dff9 [#28] b674d8742c [#19] b6163cf2d6 [#10]
Execution environments (top 5): eos-client-4.1.30-1.el6.x86_64, eos-fuse-core-4.1.30-1.el6.x86_64, xrootd-client-libs-4.6.1-1.el6.i686, xrootd-client-libs-4.6.1-1.el6.x86_64 [#100]

Q: why still running xrootd-4.6 (has "empty buffer retry" issue) - should be 4.7 - where is this version coming from?

● AOB

SAMBA - need "expert"? Could do some automatic behaviour change when re-exporting as NFS or SMB (Luca has a series of steps).
Fermi reports a crash on LRU (auto-cleanup of scratch directories); a trace is available on JIRA.

- 16:00 → 16:05
  
  overall 2017 planning 5m
  
  Speaker: Jan Iven (CERN)
- 16:05 → 16:30
  operations: production
  - 16:05
    production instances 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:10
    
    CERNBOX and EOSUSER 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:15
    
    FUSE and client versions 5m
    
    Speaker: Dan van der Ster (CERN)
  - 16:20
    
    Citrine rollout 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:25
    
    SWAN 5m
    
    Speaker: Jakub Moscicki (CERN)
- 16:30 → 16:50
  development: near-term
  - 16:30
    nextgen FUSE 5m
    
    Speaker: Andreas Joachim Peters (CERN)
  - 16:35
    
    new Namespace 5m
    
    Speaker: Elvin Alin Sindrilaru (CERN)
- 16:50 → 17:45
  other: pilot services, long-term dev, external
  - 16:50
    
    Webservice 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:55
    
    Backup 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:00
    
    Samba 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:05
    
    $HOME structure 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:10
    
    BATCH integration 5m
    
    Speaker: Massimo Lamanna (CERN)
  - 17:15
    
    Xrootd 5m
    
    Speaker: Michal Kamil Simon (CERN)
  - 17:20
    AOB 5m
    
