EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description
Weekly meeting to discuss progress on EOS rollout

● production instances

Meltdown/Spectre

SRM and HTTP gateways are being rebooted as part of the cloud restart campaign, generating false SMS alarms.

 

False alarms

An authentication issue was identified yesterday when replicating between FSTs; it looks like the earlier "Georgios bug" and is under investigation. It is probably fixed in Xrootd, but some PLUS/BATCH machines still run an old version: a bad opaque tag is sent to the FSTs and affects connections between diskservers, which might allow a DoS attack.

Q: Could everybody (clients) go to 4.8 (which no longer sends the wrong tag)?

Yes, but: 4.8 on the client (probably) fixes 2 bugs, yet already has 1 known issue (ALICE: "xrdcp" hangs on checksumming). 4.8.1 is to be released "soon" (1-2 weeks).

Elvin: will investigate FST-side fixes (i.e. filtering the tag). An FST upgrade campaign could then follow (again; this was just done for EOSPUBLIC, but no real downtime is needed).
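
As an illustration of the FST-side filtering idea (a minimal sketch only, not EOS code; the tag name "xrd.badtag" and the other keys are placeholders, not the real offending key):

    # Sketch: drop a problematic key from an XRootD-style opaque/CGI string
    # before it is forwarded to another FST. Key names here are placeholders.
    def filter_opaque(opaque: str, bad_keys=("xrd.badtag",)) -> str:
        """Return the opaque string with the offending key=value pairs removed."""
        kept = [kv for kv in opaque.split("&")
                if kv.split("=", 1)[0] not in bad_keys]
        return "&".join(kept)

    # filter_opaque("cap.sym=abc&xrd.badtag=stale&mgm.path=/eos/f")
    # -> "cap.sym=abc&mgm.path=/eos/f"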

EOSUSER: not (yet) affected - bug is in xrootd-4.7+ server side.


● FUSE and client versions

4.2.8 has just been released and is being built on KOJI; it will be deployed. It contains no changes to the old FUSE client but fixes many of the FUSEx bugs.

Note: one old EOS client bug ("file is not visible") is apparently back - no more details?

puppet "eosclient" module - one minor config change in the pipeline ("eos.select" script, for SHIP users).


● Citrine rollout

EOSCMS

Next Tuesday (23 Jan) the instance will be updated to Citrine. This may need a workaround or a new release (newer than 4.2.8): the EOSATLAS update to Citrine had the MGM slave crashing repeatedly while reading the configuration.

Discussion: any impact/overlap with the hypervisor "Spectre" reboot campaign for "cern-geneva-b"? It might slow down Puppet; nothing major is expected.

Instance robustness (GeoTreeScheduler)

This is an issue that causes the MGM to crash at boot time because FSTs broadcast wrong/incomplete information (or a corrupted configuration file).

The faulty forceRefresh() method has been moved elsewhere, to prevent interference from FSTs broadcasting invalid data.
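
A minimal sketch of the guarding idea, assuming hypothetical field names (this is not the actual GeoTreeEngine code):

    # Sketch: validate an FST broadcast before it may trigger a refresh, so an
    # incomplete or corrupted snapshot cannot crash the consumer at boot time.
    REQUIRED_KEYS = {"host", "port", "geotag", "configstatus"}  # assumed field set

    def safe_refresh(broadcast: dict, refresh) -> bool:
        """Call refresh(broadcast) only if the snapshot looks complete."""
        if not REQUIRED_KEYS.issubset(broadcast):
            return False              # drop incomplete broadcasts
        if not broadcast.get("geotag"):
            return False              # an empty geotag would corrupt tree placement
        refresh(broadcast)
        return True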

EOSBACKUP

Being updated to Citrine (started before the meeting; it will finish either tonight or tomorrow).


● nextgen FUSE

- Georgios did another round of fixing lock-order violations introduced since the end of December

- microtest 020 has to be disabled (2 million IOPS of write(1 byte, offset += 2); slow because the old kernel does not batch them - the access pattern is sketched after this list)

- beryl_aquamarine & master should be now in sync for tagging

Tickets from last week:

EOS-2231: ioflush thread serializes file closing and leads to memory aggregation (Massimo's mpop3.py tests)

EOS-2232: track and recycle all IO buffers

ALL        threads             := 30
ALL        visze               := 1.18 Gb
All        rss                 := 691.04 Mb
All        wr-buf-inflight     := 0 b
All        wr-buf-queued       := 16.78 Mb
All        ra-buf-inflight     := 0 b
All        ra-buf-queued       := 0 b
All        rd-buf-inflight     := 0 b
All        rd-buf-queued       := 524.29 kb

EOS-2233: Implement all possible FST error recovery scenarios

          - core work done - still WIP

EOS-2241: FUSEX has to point ZMQ connection to active master

- the memory behavior is still not 100% understood: when profiling with JEMALLOC no leak is visible, nor with valgrind; however, one of the thread pools in use (XrdCl?) must be responsible for the delayed release of memory, because when a test is repeated the excess memory disappears completely.

 

Discussion: is "4.2.8" the version to deploy into production to allow /eos/scratch testing? Probably not.

Dan also found a new issue: "eosxd" processes accumulate whenever a FUSEX mountpoint is unmounted, which drives up the load average.
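
A trivial way to spot the accumulation (an illustrative sketch, not an official tool):

    # Sketch: count leftover eosxd processes after an unmount.
    import subprocess

    def count_eosxd() -> int:
        result = subprocess.run(["pgrep", "-c", "-x", "eosxd"],
                                capture_output=True, text=True)
        return int(result.stdout.strip() or 0)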

 

From Kuba

Implemented reproducer test for EOS-2229:

https://gitlab.cern.ch/cernbox/eosfusex_tests

(Joszef is aware, will integrate with GITLAB-CI)


● new Namespace

PPS tests (Massimo)

  • New catalogue at ~1.7 B entries
  • Verified last week's changes (new version deployed)
  • Defined 4 levels of 'temperature' for catalogue data:
    • Cold: QuarkDB cluster rebooted
    • Cool: restart all QuarkDB servers (kernel cache potentially warm)
    • Warm: restart only MGM
    • Hot: issue the query again (eos ls). In the production MGM we have only the HOT situation
    • As it stands the results are a bit erratic, but COLD = COOL. An "eos ls -l" of a directory containing about 2.1k directories takes about 100 s in COLD and COOL mode; the WARM case takes about 6 s and HOT 0.2 s (a minimal timing harness is sketched after this list).
  • Similar philosophy for FUSEx, but the results (ls -l of the same directories) do not make any sense (to me): in the present system I tried to study the HOT case with/without the ClientCache and the data are erratic (between 4 minutes and 1 s without the cache, and more or less consistent with the ClientCache full: 1 s)
  • Miscellanea
    • It looks like restarting (rebooting) the QuarkDB cluster without an MGM reboot (or with the MGM reboot finishing before the QuarkDB reboot) gives you a non-functional MGM (eos ls takes forever)
    • It looks like (to be confirmed) the MGM returns its cached values even if QuarkDB is still booting
    • The eospps-ns1/2 machines have HDDs (not SSDs). In COLD mode I see ~3% WA (I/O wait)
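
For illustration, a minimal timing harness in the spirit of these measurements (a sketch, not the actual procedure; the directory path below is a placeholder):

    # Sketch: time "eos ls -l" on the same directory after each cold/cool/warm
    # preparation step; repeated runs give the HOT case.
    import subprocess
    import time

    def time_eos_ls(directory: str, repetitions: int = 3) -> list:
        timings = []
        for _ in range(repetitions):
            start = time.monotonic()
            subprocess.run(["eos", "ls", "-l", directory],
                           check=True, stdout=subprocess.DEVNULL)
            timings.append(time.monotonic() - start)
        return timings

    # time_eos_ls("/eos/pps/testdir")   # placeholder path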


● Xrootd

(see above): an xrootd-4.8.1 release is needed relatively soon for ALICE (xrdcp checksum hanging). It will also bring some protocol enhancements (some more vectorized functions).


● AOB

Dan: odd result for one microtest (sqlite 100x flush): on /eos/scratch it used to take ~3 s, now 10 ms. Any EOSUAT changes to explain this? No. Understood: the /eos/ mount on the testbox was broken and the test did not return an error. EOSUAT will get updated tomorrow; we will look again.
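
A possible pre-flight check for the microtests (a sketch, assuming a statvfs()/ismount() probe is an acceptable way to detect a dead FUSE mount), so a broken /eos mount fails loudly instead of producing bogus timings:

    # Sketch: refuse to run the microtest if the /eos mount looks dead.
    import os

    def mount_is_usable(path: str = "/eos/scratch") -> bool:
        try:
            os.statvfs(path)     # fails (e.g. ENOTCONN) if the FUSE endpoint is dead
        except OSError:
            return False
        return os.path.ismount(path) or os.path.ismount("/eos")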


● overall planning

ITUM-23: see the slides (sent by mail); comments are welcome. The talk will get its own slot, so it can (slightly) expand.
