EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting. Otherwise it will be handled under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

 

(Massimo)

  • Background BEER tests
    • Fixed the issues that prevented /etc/nologin from triggering node draining. Fast emptying works. Asked Ben to send a few real jobs (a couple of nodes on PPS0).
    • I can saturate the IO capability of the FST node, with or without background jobs (reaching its network capacity when serving files to 100-200 clients).
  • Ticket rota discussion
    • Need to review some procedures
    • Need to understand the "fsck repair" part (goal: make it automatic). This should drastically reduce single-replica files, which aggravate procedures for us and cause data loss for the users.

(Cristi)

  • EOSMEDIA: MGM upgrade to 4.2.22 this morning: OTG0044014
  • EOSALICEDAQ: MGM update to 4.2.22: OTG0044041

● EOS clients, FUSE(X)

(Andreas)

Test Walter Lampl:
Environment          | Git operations | cmake  | make    |
---------------------+----------------+--------+---------+
lxplus + eos/scratch | 11 min         | 1 min  | 9 min*  | *CPU bound
lxplus + AFS         | 3 min          | 1 min  | 4.5 min |
desktop + SSD        | 2.5 min        | 15 sec | 1.5 min |
 

=> Found an 'unlucky' sleep(25ms) implementation that is hit when selecting a branch with GIT.

=> GIT unlinks the file from the master branch and creates the branch version. The create has to wait for the 'unlink' operation to be executed server side; if that had not happened yet, it waited 25ms (so roughly 25ms extra per file).

See EOS-2589
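
The unlink-then-create pattern described above can be timed with a small script. This is only a sketch: the function names, file names and file count are made up for illustration, and it mimics the access pattern rather than what GIT does internally. Point the directory at the mount under test to compare file systems. The second helper simply multiplies the 25ms wait quoted above by the number of files.

```python
import os
import tempfile
import time


def recreate_files(directory, count, payload=b"branch version\n"):
    """Unlink and immediately recreate `count` files; return elapsed seconds."""
    paths = [os.path.join(directory, f"file_{i:04d}.txt") for i in range(count)]
    # Set up the "master" versions first; this part is not timed.
    for path in paths:
        with open(path, "wb") as handle:
            handle.write(b"master version\n")
    start = time.perf_counter()
    for path in paths:
        os.unlink(path)                  # remove the old version
        with open(path, "wb") as handle:  # recreate under the same name
            handle.write(payload)
    return time.perf_counter() - start


def estimated_extra_delay(count, wait_seconds=0.025):
    """Extra time if every recreated file hits one wait of `wait_seconds`."""
    return count * wait_seconds


if __name__ == "__main__":
    n_files = 200  # illustrative value only
    with tempfile.TemporaryDirectory() as workdir:
        elapsed = recreate_files(workdir, n_files)
    print(f"recreated {n_files} files in {elapsed:.3f} s")
    print(f"worst-case extra wait at 25 ms per file: {estimated_extra_delay(n_files):.1f} s")
```

With one 25ms wait per file, a branch switch touching 10000 files would spend about 250 s waiting, which is the kind of overhead the measurement above is meant to expose.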
 

For effective GIT usage we need to add/configure other optimizations.

 

Updated eosclient tests (Jozsef and Kuba):

https://gitlab.cern.ch/dss/eosclient-tests/blob/master/run.py

 

 

 


● Development issues

FUSEX:

  • (Jan) invited AFS phaseout coordinators (~50 people) to test /eos/scratch
    • B.Jones / CONDOR - tried "git clone https://github.com/torvalds/linux.git" (6m objects), got spurious errors: "fatal: Out of memory, calloc failed" (GIT-internal?), "fatal: Could not get current working directory: No such file or directory"

      (comment Andreas: I can check out the kernel source; it is close to 16 GB. I can also check out a given branch afterwards, with the latest patch,
      in a finite time. To use GIT effectively one needs an optimization for
      the file recreate sequence, and one should pin the object files in the local cache.)
       

(Georgios)

  • Now have a non-virtualized SSD for PPS, thanks to Luca & Massimo for the quick response.
  • Latest EOS commits use the new layout for metadata by default - this will reduce by a factor the number of random IO operations when listing directories.
    • New instances use the new layout exclusively, by default.
    • Old instances will create new files and directories with the new layout, but can fall back to the old one for reads. Live migration is thus possible (which is what we're doing on PPS). If you have instances you care about, run the "eos-ns-convert-to-locality-hashes" tool to do the conversion.
    • I'll take care of the PPS migration, please don't run the tool there.
  • Starting from commit 9339be0ca30a093b8a467eeaf4fbe194d47aaeca
  • Have added step-by-step docs on how to back up a running QuarkDB

(Massimo)

  • Restarted working on PPS
    • Cycle of (non-empty) file creation and checking (counting files and dirs, checksumming). This is to re-test general behaviour with the new disk layout.
      • It looks sluggish (but Georgios is trying to push deletions and is changing the disk layout)
    • Later this week: check the max #clients using multiple mounts per batch
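
A create-and-check cycle of the kind described above could look roughly like this. It is a sketch under stated assumptions, not the actual test used on PPS: the directory and file naming, the file size and the use of SHA-256 as a client-side checksum are all illustrative choices.

```python
import hashlib
import os
import tempfile


def create_tree(root, n_dirs, files_per_dir, size=1024):
    """Create non-empty files under `root`; return {relative path: sha256 hex}."""
    expected = {}
    for d in range(n_dirs):
        dirpath = os.path.join(root, f"dir_{d:03d}")
        os.makedirs(dirpath, exist_ok=True)
        for f in range(files_per_dir):
            data = os.urandom(size)
            path = os.path.join(dirpath, f"file_{f:03d}.bin")
            with open(path, "wb") as handle:
                handle.write(data)
            expected[os.path.relpath(path, root)] = hashlib.sha256(data).hexdigest()
    return expected


def check_tree(root, expected):
    """Walk `root`; return (dir count, file count, mismatched paths, missing paths)."""
    n_dirs = 0
    n_files = 0
    mismatched = []
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        for name in filenames:
            n_files += 1
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            seen.add(rel)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            # Unexpected files and changed contents both count as mismatches.
            if expected.get(rel) != digest:
                mismatched.append(rel)
    missing = sorted(set(expected) - seen)
    return n_dirs, n_files, sorted(mismatched), missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        expected = create_tree(root, n_dirs=5, files_per_dir=20)
        dirs, files, mismatched, missing = check_tree(root, expected)
        print(f"{dirs} dirs, {files} files, {len(mismatched)} mismatched, {len(missing)} missing")
```

Running the check repeatedly against a directory on the instance under test, instead of a temporary directory, gives both a correctness check and a rough feel for how sluggish listing and reading are.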

 

 

 

 

 

 


● AOB

next steps for FUSEX rollout?

  • Massimo: massive user-mount test against EOSUAT?
  • wait for feedback ~ 3 weeks.
  • more testers? - who? (e.g W.Bialas)
  • enable on prod instances?
  • more tests with writing to shares, ownership.

EOSHOME timeline?

  • have redirector, have MGM
  • special FST config (Georgios: will be in next release, 4.2.23+AP patches merged, in test branch)
  • 2-3 weeks until functional testing. Will start with one MGM (for letter "a").

FUSE client needs config change (Luca to send magic parameters, Dan or Jan to deploy).

  • recovery for FST unavailable, fsync
  • also turn back on LAZY_OPEN.
    • 16:00 16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

       

    • 16:20 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)


    • 16:25 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)


    • 16:35 16:50
      AOB 15m
