EOS DevOps Meeting
Weekly meeting to discuss progress on EOS rollout.
- please keep content relevant to (most of) the audience, explain context
- Last week: major issues, preferably with ticket
- This week/planning: who, until when, needs what?
Add your input to the "contribution minutes" before the meeting. Otherwise it will be handled under "AOB".
● EOS production instances (LHC, PUBLIC, USER)
(Massimo)
- Background BEER tests
- Fixed issues preventing /etc/nologin from triggering node draining. Fast emptying OK. Asked Ben to send a few real jobs (a couple of nodes on PPS).
- I can saturate the IO capability of the FST node, with or without background jobs (reaching its network capacity when serving files to 100-200 clients).
- Ticket rota discussion
- Need to review some procedures
- Need to understand the "fsck repair" part (goal: make it automatic, and drastically reduce single-replica files, which aggravate procedures for us and cause data loss for the users).
(Cristi)
- EOSMEDIA: MGM upgrade to 4.2.22 this morning: OTG0044014
- EOSALICEDAQ: MGM update to 4.2.22: OTG0044041
● EOS clients, FUSE(X)
(Andreas)
=> found 'unlucky' sleep(25ms) implementation when selecting a branch with GIT
=> GIT unlinks the file from the master branch and creates the branch version. The create has to wait for the 'unlink' operation to be executed server side; if that had not happened yet, it waited 25ms (so roughly 25ms extra per file).
See EOS-2589
For effective GIT usage we need to add/configure other optimizations.
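A minimal sketch of why a fixed sleep costs about one interval per file, compared with a wake-up on completion. This is not the FUSEX client code: the function names, the simulated server acknowledgement (a timer) and the latency value are assumptions made for illustration; only the 25ms interval comes from the notes above.

```python
import threading
import time

POLL_INTERVAL = 0.025  # 25 ms, the sleep interval mentioned in the minutes


def wait_by_polling(is_done, interval=POLL_INTERVAL):
    """Poll until is_done() is true; return the number of sleeps taken."""
    sleeps = 0
    while not is_done():
        time.sleep(interval)
        sleeps += 1
    return sleeps


def wait_by_event(done_event, timeout=1.0):
    """Block until the event is set; returns as soon as set() is called."""
    return done_event.wait(timeout)


def simulate(n_files, unlink_latency=0.005):
    """Time n_files unlink+create pairs for both waiting strategies."""
    results = {}
    for name in ("polling", "event"):
        start = time.perf_counter()
        for _ in range(n_files):
            done = threading.Event()
            # Hypothetical server: acknowledges the unlink after unlink_latency.
            timer = threading.Timer(unlink_latency, done.set)
            timer.start()
            if name == "polling":
                wait_by_polling(done.is_set)
            else:
                wait_by_event(done)
            timer.join()
        results[name] = time.perf_counter() - start
    return results


if __name__ == "__main__":
    for strategy, seconds in simulate(20).items():
        print(f"{strategy}: {seconds:.3f}s for 20 files")
```

With polling, every file whose unlink is still pending pays a full sleep interval even if the acknowledgement arrives much earlier; with the event, the wait ends when the acknowledgement arrives.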
Updated eosclient tests (Jozsef and Kuba):
https://gitlab.cern.ch/dss/eosclient-tests/blob/master/run.py
● Development issues
FUSEX:
- (Jan) invited AFS phaseout coordinators (~50 people) to test /eos/scratch
- B.Jones / CONDOR - tried "git clone https://github.com/torvalds/linux.git" (6m objects), got spurious errors: "fatal: Out of memory, calloc failed" (GIT-internal?), "fatal: Could not get current working directory: No such file or directory"
(comment Andreas: I can checkout the kernel source. It is close to 16Gb. I can also checkout a given branch afterwards (with latest patch)
in a finite time. To use GIT effectively one needs an optimization for
a file recreate sequence, and one should pin the object files in the local cache.)
(Georgios)
- Now have a non-virtualized SSD for PPS, thanks to Luca & Massimo for the quick response.
- Latest EOS commits use the new layout for metadata by default; this will reduce the number of random IO operations by a factor when listing directories.
- New instances use the new layout exclusively, by default.
- Old instances will create new files and directories with the new layout, but can fall back to the old one for reads. Live migration is thus possible (which is what we're doing on PPS): run the "eos-ns-convert-to-locality-hashes" tool to do the conversion if you have instances you care about.
- I'll take care of the PPS migration, please don't run the tool there
- Starting from commit 9339be0ca30a093b8a467eeaf4fbe194d47aaeca
- have added docs (step-by-step) on how to backup a running QuarkDB
(Massimo)
- Restarted working on PPS
- Cycle of (non-empty) file creation and checking (counting files and dirs, checksumming). This is to re-test general behaviour with the new disk layout
- It looks sluggish (but Georgios is trying to push deletions and is changing the disk layout)
- Later this week: check the max #clients using multiple mounts per batch
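The create-and-check cycle described above could be sketched roughly as follows. This is not the actual PPS test script: the directory and file naming, the file size and the use of SHA-256 are assumptions for illustration, and the example runs against a temporary local directory rather than an EOS mount.

```python
import hashlib
import os
import tempfile


def create_tree(root, n_dirs, files_per_dir, size=4096):
    """Create non-empty files and record the expected checksum of each."""
    expected = {}
    for d in range(n_dirs):
        dir_path = os.path.join(root, f"dir{d:04d}")
        os.makedirs(dir_path, exist_ok=True)
        for f in range(files_per_dir):
            path = os.path.join(dir_path, f"file{f:04d}")
            data = os.urandom(size)
            with open(path, "wb") as fh:
                fh.write(data)
            expected[path] = hashlib.sha256(data).hexdigest()
    return expected


def verify_tree(root, expected):
    """Walk the tree, count dirs and files, and re-checksum every file."""
    n_dirs = n_files = 0
    mismatches = []
    for dir_path, dir_names, file_names in os.walk(root):
        n_dirs += len(dir_names)
        for name in file_names:
            n_files += 1
            path = os.path.join(dir_path, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            if expected.get(path) != digest:
                mismatches.append(path)
    return n_dirs, n_files, sorted(mismatches)


if __name__ == "__main__":
    # A temporary local directory stands in for the mount under test.
    with tempfile.TemporaryDirectory() as root:
        expected = create_tree(root, n_dirs=10, files_per_dir=100)
        n_dirs, n_files, mismatches = verify_tree(root, expected)
        print(f"dirs={n_dirs} files={n_files} "
              f"expected={len(expected)} mismatches={len(mismatches)}")
```

Comparing the file count with the number of recorded checksums catches missing entries, while the mismatch list catches files whose content changed between creation and checking.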
● AOB
next steps for FUSEX rollout?
- Massimo: massive user-mount test against EOSUAT?
- wait for feedback ~ 3 weeks.
- more testers? - who? (e.g W.Bialas)
- enable on prod instances?
- more tests with writing to shares, ownership.
EOSHOME timeline?
- have redirector, have MGM
- special FST config (Georgios: will be in next release, 4.2.23+AP patches merged, in test branch)
- 2..3 weeks time until functional testing. Will start with one MGM (for letter "a").
FUSE client needs config change (Luca to send magic parameters, Dan or Jan to deploy).
- recovery for FST unavailable, fsync
- also turn back on LAZY_OPEN.