EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description
Weekly meeting to discuss progress on EOS rollout

● production instances

Meltdown/Spectre

SRM and HTTP gateways are being rebooted as part of the cloud restart campaign, generating false SMS alarms.

 

False alarms

An authentication issue was identified yesterday when replicating between FSTs; it looks like the earlier "Georgios bug" and is under investigation. It is probably fixed in Xrootd, but some PLUS/BATCH machines still run an old version: a bad opaque tag is sent to the FSTs and affects connections between diskservers, which might allow a DoS attack.

Q: Could everybody (clients) go to 4.8 (which no longer sends the wrong tag)?

Yes, but: 4.8 on the client (probably) fixes 2 bugs, yet already has 1 known issue (ALICE: "xrdcp" hangs on checksumming). 4.8.1 is to be released "soon" (1-2 weeks).

Elvin: will investigate FST-side fixes (i.e. filtering the tag). An FST upgrade campaign could then follow (again; this was just done for EOSPUBLIC, but no real downtime is needed).
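
As an illustration of the FST-side filtering idea (a minimal sketch only, not EOS code; the tag name "xrd.badtag" and the other keys are placeholders, not the real offending key):

    # Sketch: drop a problematic key from an XRootD-style opaque/CGI string
    # before it is forwarded to another FST. Key names here are placeholders.
    def filter_opaque(opaque: str, bad_keys=("xrd.badtag",)) -> str:
        """Return the opaque string with the offending key=value pairs removed."""
        kept = [kv for kv in opaque.split("&")
                if kv.split("=", 1)[0] not in bad_keys]
        return "&".join(kept)

    # filter_opaque("cap.sym=abc&xrd.badtag=stale&mgm.path=/eos/f")
    # -> "cap.sym=abc&mgm.path=/eos/f"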

EOSUSER: not (yet) affected - bug is in xrootd-4.7+ server side.


● FUSE and client versions

4.2.8 has just been released and is being built on KOJI; it will be deployed. It contains no changes to the old FUSE client but fixes many of the FUSEx bugs.

Note: one old EOS client bug ("file is not visible") is apparently back - no more details?

puppet "eosclient" module - one minor config change in the pipeline ("eos.select" script, for SHIP users).


● Citrine rollout

EOSCMS

Next Tuesday (23 Jan) the instance will be updated to Citrine. This may need a workaround or a new release (newer than 4.2.8): the EOSATLAS update to Citrine had the MGM slave crashing repeatedly while reading the configuration.

Discussion: any impact/overlap with the hypervisor "Spectre" reboot campaign for "cern-geneva-b"? It might slow down Puppet; nothing major is expected.

Instance robustness (GeoTreeScheduler)

This is an issue that causes the MGM to crash at boot time because FSTs broadcast wrong/incomplete information (or a corrupted configuration file).

The faulty forceRefresh() method has been moved elsewhere, to prevent interference from FSTs broadcasting invalid data.
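
A minimal sketch of the guarding idea, assuming hypothetical field names (this is not the actual GeoTreeEngine code):

    # Sketch: validate an FST broadcast before it may trigger a refresh, so an
    # incomplete or corrupted snapshot cannot crash the consumer at boot time.
    REQUIRED_KEYS = {"host", "port", "geotag", "configstatus"}  # assumed field set

    def safe_refresh(broadcast: dict, refresh) -> bool:
        """Call refresh(broadcast) only if the snapshot looks complete."""
        if not REQUIRED_KEYS.issubset(broadcast):
            return False              # drop incomplete broadcasts
        if not broadcast.get("geotag"):
            return False              # an empty geotag would corrupt tree placement
        refresh(broadcast)
        return True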

EOSBACKUP

Being updated to Citrine (started before the meeting; it will finish either tonight or tomorrow).


● nextgen FUSE

- Georgios did another round of fixing lock-order violations introduced since the end of December

- microtest 020 has to be disabled (2 million IOPS of write(1 byte, offset += 2); slow because the old kernel does not batch them - the access pattern is sketched after this list)

- beryl_aquamarine & master should be now in sync for tagging

Tickets from last week:

EOS-2231: ioflush thread serializes file closing and leads to memory aggregation (Massimo's mpop3.py tests)

EOS-2232: track and recycle all IO buffers

ALL        threads             := 30
ALL        visze               := 1.18 Gb
All        rss                 := 691.04 Mb
All        wr-buf-inflight     := 0 b
All        wr-buf-queued       := 16.78 Mb
All        ra-buf-inflight     := 0 b
All        ra-buf-queued       := 0 b
All        rd-buf-inflight     := 0 b
All        rd-buf-queued       := 524.29 kb

EOS-2233: Implement all possible FST error recovery scenarios

          - core work done - still WIP

EOS-2241: FUSEX has to point ZMQ connection to active master

- the memory behavior is still not 100% understood: when profiling with JEMALLOC no leak is visible, nor with valgrind; however, one of the thread pools in use (XrdCl?) must be responsible for the delayed release of memory, because when a test is repeated the excess memory disappears completely.

 

Discussion: is "4.2.8" the version to deploy into production to allow /eos/scratch testing? Probably not.

Dan also found a new issue: "eosxd" processes accumulate whenever a FUSEX mountpoint is unmounted, which drives up the load average.
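
A trivial way to spot the accumulation (an illustrative sketch, not an official tool):

    # Sketch: count leftover eosxd processes after an unmount.
    import subprocess

    def count_eosxd() -> int:
        result = subprocess.run(["pgrep", "-c", "-x", "eosxd"],
                                capture_output=True, text=True)
        return int(result.stdout.strip() or 0)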

 

From Kuba

Implemented reproducer test for EOS-2229:

https://gitlab.cern.ch/cernbox/eosfusex_tests

(Joszef is aware, will integrate with GITLAB-CI)


● new Namespace

PPS tests (Massimo)

  • New catalogue at ~1.7 B entries
  • Verified last week's changes (new version deployed)
  • Defined 4 levels of 'temperature' for catalogue data:
    • Cold: QuarkDB cluster rebooted
    • Cool: restart all QuarkDB servers (kernel cache potentially warm)
    • Warm: restart only MGM
    • Hot: issue the query again (eos ls). In the production MGM we have only the HOT situation
    • As it stands the results are a bit erratic, but COLD = COOL. An "eos ls -l" of a directory containing about 2.1k directories takes about 100 s in COLD and COOL mode; the WARM case takes about 6 s and HOT 0.2 s (a minimal timing harness is sketched after this list).
  • Similar philosophy for FUSEx, but the results (ls -l of the same directories) do not make any sense (to me): in the present system I tried to study the HOT case with/without the ClientCache and the data are erratic (between 4 minutes and 1 s without the cache, and more or less consistent with the ClientCache full: 1 s)
  • Miscellanea
    • It looks like restarting (rebooting) the QuarkDB cluster without an MGM reboot (or with the MGM reboot finishing before the QuarkDB reboot) gives you a non-functional MGM (eos ls takes forever)
    • It looks like (to be confirmed) the MGM returns its cached values even if QuarkDB is still booting
    • The eospps-ns1/2 machines have HDDs (not SSDs). In COLD mode I see ~3% WA (I/O wait)
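
For illustration, a minimal timing harness in the spirit of these measurements (a sketch, not the actual procedure; the directory path below is a placeholder):

    # Sketch: time "eos ls -l" on the same directory after each cold/cool/warm
    # preparation step; repeated runs give the HOT case.
    import subprocess
    import time

    def time_eos_ls(directory: str, repetitions: int = 3) -> list:
        timings = []
        for _ in range(repetitions):
            start = time.monotonic()
            subprocess.run(["eos", "ls", "-l", directory],
                           check=True, stdout=subprocess.DEVNULL)
            timings.append(time.monotonic() - start)
        return timings

    # time_eos_ls("/eos/pps/testdir")   # placeholder path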


● Xrootd

(see above): an xrootd-4.8.1 release is needed relatively soon for ALICE (xrdcp checksum hanging). It will also bring some protocol enhancements (some more vectorized functions).


● AOB

Dan: odd result for one microtest (sqlite 100x flush): on /eos/scratch it used to take ~3 s, now 10 ms. Any EOSUAT changes to explain this? No. Understood: the /eos/ mount on the testbox was broken and the test did not return an error. EOSUAT will get updated tomorrow; we will look again.
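
A possible pre-flight check for the microtests (a sketch, assuming a statvfs()/ismount() probe is an acceptable way to detect a dead FUSE mount), so a broken /eos mount fails loudly instead of producing bogus timings:

    # Sketch: refuse to run the microtest if the /eos mount looks dead.
    import os

    def mount_is_usable(path: str = "/eos/scratch") -> bool:
        try:
            os.statvfs(path)     # fails (e.g. ENOTCONN) if the FUSE endpoint is dead
        except OSError:
            return False
        return os.path.ismount(path) or os.path.ismount("/eos")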


● overall planning

ITUM-23: see the slides (sent by mail); comments are welcome. The talk will get its own slot, so it can (slightly) expand.
