EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be handled under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

EOSATLAS heavy slowdown

The instance had all threads busy starting this morning at ~06h30

After an unsuccessful attempt to restart the MGM (i.e. it didn't fix the issue), some more log grepping led to this:

180320 12:05:43 1822 XrdProtocol: ?:10501@lxplus100 terminated handshake not received
180320 12:05:43 1173 XrdProtocol: ?:2223@lxplus100 terminated handshake not received
180320 12:05:43 23651 XrdProtocol: ?:8358@lxplus100 terminated handshake not received
180320 12:05:43 1173 XrdProtocol: ?:2223@lxplus079 terminated handshake not received
180320 12:05:43 31849 XrdProtocol: ?:6226@lxplus100 terminated handshake not received
180320 12:05:43 31849 XrdProtocol: ?:8358@lxplus079 terminated handshake not received
180320 12:05:43 22304 XrdProtocol: ?:11205@lxplus100 terminated handshake not received

 

This correlates nicely with the thread exhaustion: these messages were reported in the logs at ~415 Hz.

  • Added DROP rule in iptables for the 2 nodes until they're drained.
    • As soon as the rule was added, number of threads went back to normal (~1k)
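
The per-host breakdown behind this diagnosis can be sketched in a few lines of Python. This is only an illustration: the regular expression is inferred from the seven excerpted log lines, and `summarize` and `SAMPLE` are names made up here, not part of any EOS or XRootD tooling.

```python
import re
from collections import Counter

# Expected line shape (taken from the excerpt above):
# 180320 12:05:43 1822 XrdProtocol: ?:10501@lxplus100 terminated handshake not received
LINE_RE = re.compile(
    r"^(?P<ts>\d{6} \d{2}:\d{2}:\d{2}) \d+ XrdProtocol: "
    r"\S+@(?P<host>\S+) terminated handshake not received"
)


def summarize(lines):
    """Count matching lines per client host and per one-second timestamp."""
    per_host = Counter()
    per_second = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a "handshake not received" line
        per_host[match.group("host")] += 1
        per_second[match.group("ts")] += 1
    # Mean rate over the seconds that logged at least one such message.
    rate_hz = sum(per_second.values()) / len(per_second) if per_second else 0.0
    return per_host, rate_hz


# The seven lines quoted in the minutes.
SAMPLE = """\
180320 12:05:43 1822 XrdProtocol: ?:10501@lxplus100 terminated handshake not received
180320 12:05:43 1173 XrdProtocol: ?:2223@lxplus100 terminated handshake not received
180320 12:05:43 23651 XrdProtocol: ?:8358@lxplus100 terminated handshake not received
180320 12:05:43 1173 XrdProtocol: ?:2223@lxplus079 terminated handshake not received
180320 12:05:43 31849 XrdProtocol: ?:6226@lxplus100 terminated handshake not received
180320 12:05:43 31849 XrdProtocol: ?:8358@lxplus079 terminated handshake not received
180320 12:05:43 22304 XrdProtocol: ?:11205@lxplus100 terminated handshake not received
""".splitlines()

per_host, rate_hz = summarize(SAMPLE)
for host, count in per_host.most_common():
    print(host, count)
print(f"{rate_hz:.1f} messages/s in the sample")
```

On the excerpt this attributes 5 messages to lxplus100 and 2 to lxplus079. Run over the full MGM log, the same counting would be one way to arrive at a rate figure such as the ~415 Hz quoted above.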

CentOS 7 migration of FSTs

Roberto has been doing a great job with EOSATLAS and wrote documentation covering the most frequent issues when updating FSTs to CentOS 7.

There's a Rundeck job that updates the machines' profiles and reinstalls them.

The CMS and ALICE migrations will be assigned to the people on rota: the job works quietly in the background and sends an email if a machine takes more than 2 hours to come back.

New dashboard

EOS Versions

EOSMEDIA:

  • updated to 4.2.17 in the shadow of the Sorenson and Micala patching campaign
  • added more space, but as JBOD (instead of RAID1-optimized for video)

 

EOSALICEDAQ (WIP):

  • more puppet-ization
  • add SLS monitoring (to old SLS script, new is not ready yet)

 

EOSUSER

  • /var/eos is filling up (>85%) and we have no alarm for that (we do have one for /var, but not for the SSD under /var/eos)
    • scheduled a morning compaction for directories at 5:45
    • possible issue if new md files do not fit -> a space problem is coming soon, we need additional SSDs
  • plan: schedule a file compaction tonight
  • roll-out of the new version (tested on EOSUAT) before Easter (tbd)
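
A minimal sketch of the missing check, assuming the SSD is a separate filesystem mounted at /var/eos and that a plain used/total fraction is good enough. `usage_fraction` and `check` are hypothetical helpers, and how the result would be fed into the existing alarming is not covered here.

```python
import shutil

THRESHOLD = 0.85  # the ">85%" level mentioned in the minutes


def usage_fraction(path):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check(path, threshold=THRESHOLD):
    """Return (fraction used, True if the fraction is above the threshold)."""
    fraction = usage_fraction(path)
    return fraction, fraction > threshold


# "/" is used so the sketch runs anywhere; on the MGM the path would be
# "/var/eos" (an assumption: it must be the SSD's own mount point, because a
# check on /var alone does not see the SSD mounted underneath it).
fraction, alarm = check("/")
print(f"{fraction:.0%} used, alarm={alarm}")
```

Note that this fraction can differ slightly from the percentage `df` prints, because `df` accounts for blocks reserved for root.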

EOSHOME

  • still some installation issues for FSTs: only 2 out of 3 worked (after many tries and retries). Roberto will have a look (using his instructions)

 

 


● EOS clients, FUSE(X)

[ Andreas ]

Development

1) change to non-POSIX behaviour to enable public directories in non-public home directories

- stat on a directory evaluates the ACLs of the directory itself, not of the parent

- stat on a file evaluates the ACLs of the parent, as before

2) add chmod, setxattr and delete by default when 'w' is specified in an ACL; can be revoked with '!d' and '!m'

(background from Luca: FUSEX does not coexist with "sharing", too POSIXy. Would need to share also the parent dir, at least for browsing)

 

1) := client side patch (4.2.18)

2) := server side patch (4.2.18 + AQ commit branch)
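
The two changes can be illustrated with a toy model. This is a sketch under assumptions, not the EOS implementation: the permission-string parsing is simplified, `acl_source` and `effective_rights` are hypothetical names, the example paths are invented, and it is assumed that '!d' revokes delete and '!m' revokes chmod.

```python
import posixpath


def acl_source(path, is_dir):
    """Which directory's ACLs apply to a stat on `path` (change 1)."""
    if is_dir:
        return path  # a directory is checked against its own ACLs
    return posixpath.dirname(path)  # a file is checked against its parent


def effective_rights(perms):
    """Expand a toy permission string such as 'rw!d' into rights (change 2)."""
    granted, denied = set(), set()
    i = 0
    while i < len(perms):
        if perms[i] == "!" and i + 1 < len(perms):
            denied.add(perms[i + 1])  # "!x" revokes the flag x
            i += 2
        else:
            granted.add(perms[i])
            i += 1
    rights = set()
    if "r" in granted:
        rights.add("read")
    if "w" in granted:
        # 'w' now implies setxattr, delete and chmod unless revoked.
        rights |= {"write", "setxattr"}
        if "d" not in denied:
            rights.add("delete")
        if "m" not in denied:
            rights.add("chmod")
    return rights


print(acl_source("/eos/user/j/jdoe/public", is_dir=True))
print(acl_source("/eos/user/j/jdoe/public/notes.txt", is_dir=False))
print(sorted(effective_rights("rw")))
print(sorted(effective_rights("rw!d!m")))
```

In this model a public directory inside a non-public home stays reachable by stat because only its own ACLs are consulted, which is the non-POSIX part of change 1.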

New issues:

EOS-2425

EOS-2444

 

[Jan]

  • 4.2.16 is now the "production" client version
  • /eos/scratch roll-out BI-1862 - puppet module issues (32GB FUSEX cache default is far too large)
  • ATLAS T0 batch file creation via (old) FUSE, feedback - see https://indico.cern.ch/event/698870/
    • 0.5% error rate (empty files on reading, via "xrdcp")
      • might be retry-on-open. Eric (CTA) has come up with a scenario that causes overwritten/truncated files. Luca: were most "lost files" last year due to this?
    • 0.5% errors on delete (non-empty directory?)
      • .sys "junk", invisible via FUSE?
    • errors are visible in the FUSE log but are not propagated to Python as exceptions

● Development issues

Georgios

  • Enabled compression in QuarkDB: LZ4 for the top compaction layers (very fast), and ZSTD for the bottom one (slightly slower, but offers a higher compression ratio).
    • would having no compression at top layer make a difference? No idea.
  • PPS namespace went from ~800GB to ~350GB, not bad.
  • Might be improved further using ZSTD dictionary compression, but something weird is going on inside RocksDB: I saw no benefit at all from enabling dictionary compression, so it is disabled for now.
  • We're still missing an SSD in Wigner for PPS... we'll probably move the MGM to Meyrin.
    • Benoit: currently in "burn-in" - is this being slowed down by "ironic" workflow? (auto-created a new "EOS" instance..)
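
For reference, the PPS namespace figures above translate into a compression ratio and a space saving as follows. The inputs are the approximate sizes from the minutes, so the results are equally approximate; `compression_stats` is just an illustrative helper name.

```python
def compression_stats(before_gb, after_gb):
    """Return (compression ratio, fraction of space saved)."""
    ratio = before_gb / after_gb
    saved = 1 - after_gb / before_gb
    return ratio, saved


# ~800GB before and ~350GB after enabling LZ4/ZSTD, as quoted above.
ratio, saved = compression_stats(800, 350)
print(f"ratio ~{ratio:.2f}x, space saved ~{saved:.0%}")
```

That is a ratio of roughly 2.3x, or a bit more than half of the original footprint saved.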

Elvin:

  • 4.2.18 released: addresses "corrupted config" on MGM restart and 0-size TPC brokenness. Also lots of docs updates (FUSEX, BOX setup, QuarkDB setup, etc.). Compiled against 4.8.1.
    • should go into client "qa"
  • Request from Aarnet: put "last modified" timestamp on docs?

● AOB

AOB:

Massimo: the eosarchive scripts (CASTOR to EOS) need to be resurrected

Massimo: FUSEX write test, see PDF.

We should really have rate limits everywhere (already applied on EOSPUBLIC); there might be side effects from HTTP gateways.

    • 16:00 16:02
      (new meeting agenda) 2m
    • 16:00 16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

    • 16:20 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

    • 16:25 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)

    • 16:35 16:50
      AOB 15m
