EOS DevOps Meeting

Name: EOS DevOps Meeting
Start: 2018-05-22T16:00:00+02:00
End: 2018-05-22T17:50:00+02:00
Location: CERN

Tuesday 22 May 2018, 16:00 → 17:50 Europe/Zurich

513/R-068 (CERN)

Jan Iven (CERN)

Description

Weekly meeting to discuss progress on EOS rollout.

please keep content relevant to (most of) the audience, explain context
Last week: major issues, preferably with ticket
This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting. Otherwise it will be handled under "AOB".


● EOS production instances (LHC, PUBLIC, USER)

(Massimo)

Background BEER tests
- Fixed issues preventing /etc/nologin from triggering node draining. Fast emptying OK. Asked Ben to send a few real jobs (a couple of nodes on PPS0).
- I can saturate the IO capability of the FST node, with or without background jobs (reaching its network capacity when serving files to 100-200 clients).
Ticket rota discussion
- Need to review some procedures
- Need to understand the "fsck repair" part (goal: make it automatic). Drastically reduce single-replica files, which are responsible for aggravating procedures for us and data loss for the users.

(Cristi)

EOSMEDIA: MGM upgrade to 4.2.22 this morning: OTG0044014
EOSALICEDAQ: MGM update to 4.2.22: OTG0044041

● EOS clients, FUSE(X)

(Andreas)

Test Walter Lampl:

Environment          | Git operations | cmake  | make
lxplus + eos/scratch | 11 min         | 1 min  | 9 min*
lxplus + AFS         | 3 min          | 1 min  | 4.5 min
desktop + SSD        | 2.5 min        | 15 sec | 1.5 min
* CPU bound
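
For a rough overall comparison, the three phases in the table can be summed per environment. This is only a sketch: it assumes the phases run back to back, and the dictionary layout and function names are illustrative, not part of the test itself.

```python
# Timings in minutes, copied from the table (15 sec = 0.25 min).
TIMINGS_MIN = {
    "lxplus + eos/scratch": {"git": 11.0, "cmake": 1.0, "make": 9.0},
    "lxplus + AFS": {"git": 3.0, "cmake": 1.0, "make": 4.5},
    "desktop + SSD": {"git": 2.5, "cmake": 0.25, "make": 1.5},
}


def total_minutes(env):
    """Sum the git, cmake and make phases for one environment."""
    return sum(TIMINGS_MIN[env].values())


def slowdown(env, baseline="desktop + SSD"):
    """Ratio of an environment's total time to the baseline's total."""
    return total_minutes(env) / total_minutes(baseline)


if __name__ == "__main__":
    for env in TIMINGS_MIN:
        print(f"{env}: {total_minutes(env):.2f} min, {slowdown(env):.1f}x desktop")
```

With these numbers eos/scratch totals 21 min, against 8.5 min on AFS and 4.25 min on the desktop; the Git operations account for the larger share of the gap to AFS.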

=> found 'unlucky' sleep(25ms) implementation when selecting a branch with GIT

=> GIT unlinks the file from the master branch and creates the branch version. The create has to wait for the 'unlink' operation to be executed server-side; if that had not happened yet, it waited 25 ms (so roughly 25 ms extra for each file).

See EOS-2589
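
To see why a fixed 25 ms wait per file matters, the extra latency can be estimated by multiplying it by the number of files a branch switch recreates. The file count below is a hypothetical example, not a measurement from the test, and the function name is illustrative.

```python
SLEEP_MS = 25  # fixed wait per recreated file, as described in the minutes


def extra_wait_seconds(files_recreated, sleep_ms=SLEEP_MS):
    """Worst-case added latency if every recreated file hits the sleep."""
    return files_recreated * sleep_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical branch switch that recreates 10,000 files.
    print(extra_wait_seconds(10_000), "seconds")
```

Under that assumption the sleep alone adds roughly four minutes of pure waiting, independent of any actual data transfer.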

For effective GIT usage we need to add/configure other optimizations.

Updated eosclient tests (Jozsef and Kuba):

https://gitlab.cern.ch/dss/eosclient-tests/blob/master/run.py

● Development issues

FUSEX:

(Jan) invited AFS phaseout coordinators (~50 people) to test /eos/scratch
- B. Jones / CONDOR - tried "git clone https://github.com/torvalds/linux.git" (6M objects), got spurious errors: "fatal: Out of memory, calloc failed" (GIT-internal?), "fatal: Could not get current working directory: No such file or directory"
  
  (comment Andreas: I can check out the kernel source; it is close to 16 GB. I can also check out a given branch afterwards (with the latest patch) in a finite time. To use GIT effectively, one needs an optimization for the file-recreate sequence, and one should pin the object files in the local cache.)

(Georgios)

Now have a non-virtualized SSD for PPS, thanks to Luca & Massimo for the quick response.
Latest EOS commits use the new layout for metadata by default - this will reduce the number of random IO operations when listing directories by a factor.
- New instances use the new layout exclusively, by default.
- Old instances will create new files and directories with the new layout, but can fall back to the old one for reads. Live migration is thus possible (which is what we're doing on PPS); run the "eos-ns-convert-to-locality-hashes" tool to do the conversion if you have instances you care about.
- I'll take care of the PPS migration, please don't run the tool there.
Starting from commit 9339be0ca30a093b8a467eeaf4fbe194d47aaeca
Have added step-by-step docs on how to back up a running QuarkDB.

(Massimo)

Resumed work on PPS
- Cycle of creating (non-empty) files and checking them (counting files and dirs, checksumming). This is to re-test general behaviour with the new disk layout.
  - It looks sluggish (but Georgios is trying to push deletions and is changing the disk layout)
- Later this week: check the max #clients using multiple mounts per batch
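
The create-and-check cycle described above could look roughly like the following. This is a minimal sketch, not the script actually used on PPS: the directory and file names, the file size, and the choice of SHA-256 are all assumptions, and a real run would point `root` at an EOS mount rather than a temporary directory.

```python
import hashlib
import os
import tempfile


def create_files(root, n_dirs, files_per_dir, size=1024):
    """Create non-empty files and return {path: sha256 hex digest}."""
    expected = {}
    for d in range(n_dirs):
        dirpath = os.path.join(root, f"dir{d:04d}")
        os.makedirs(dirpath, exist_ok=True)
        for f in range(files_per_dir):
            path = os.path.join(dirpath, f"file{f:04d}.dat")
            data = os.urandom(size)
            with open(path, "wb") as fh:
                fh.write(data)
            expected[path] = hashlib.sha256(data).hexdigest()
    return expected


def check_tree(root, expected):
    """Count dirs and files under root and re-checksum every file."""
    n_dirs = n_files = 0
    mismatches = []
    for dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        for name in filenames:
            n_files += 1
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            if expected.get(path) != digest:
                mismatches.append(path)
    return n_dirs, n_files, mismatches


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        expected = create_files(tmp, n_dirs=3, files_per_dir=5)
        print(check_tree(tmp, expected))
```

Comparing the returned counts and the mismatch list against what was written gives a simple pass/fail signal per cycle.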

● AOB

next steps for FUSEX rollout?

Massimo: massive user-mount test against EOSUAT?
wait for feedback ~ 3 weeks.
more testers? - who? (e.g. W. Bialas)
enable on prod instances?
more tests with writing to shares, ownership.

EOSHOME timeline?

have redirector, have MGM
special FST config (Georgios: will be in next release, 4.2.23+AP patches merged, in test branch)
2-3 weeks until functional testing. Will start with one MGM (for letter "a").

FUSE client needs config change (Luca to send magic parameters, Dan or Jan to deploy).

recovery for FST unavailable, fsync
also turn back on LAZY_OPEN.


- 16:00 → 16:20
  EOS production instances (LHC, PUBLIC, USER) 20m
  - major events last week
  - planned work this week
  Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)
- 16:20 → 16:25
  EOS clients, FUSE(X) 5m
  - (major) issues seen
  - Rollout of new versions and FUSEX
  Speakers: Dan van der Ster (CERN), Jan Iven (CERN)
  
- 16:25 → 16:35
  Development issues 10m
  - New namespace
  - Testing
  - Xrootd
  Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)
- 16:35 → 16:50
  AOB 15m
