EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be taken under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

EOSATLAS

Last Thursday's incident led to the creation of these two tickets:

  • EOS-2600: A clean FST shutdown wrongly marks local LevelDB as dirty
  • EOS-2601: GeoScheduler misbehaviour when cluster is degraded

● EOS clients, FUSE(X)

(Jan):

  • still-pending "eosd" config change (EOS_FUSE_LAZYOPENRW=1 etc) - CRM-2669 - ETA tomorrow.
  • "eosxd" also may need "cleanup" script to prevent stuck mountpoints - EOS-2614
  • decision (input from Elvin): not deploying 4.2.23 on clients
  • ETA for "eosxd" slow GIT checkout (W.Lampl, EOS-2589 has a commit but is "in progress")? [Andreas: I closed it => 4.2.24]

 

(Andreas):

  • everything besides exos branch [rados] has been merged into 'dev' branch
  • file inlining has been fixed in 'dev' branch (it is off by default anyway)
  • at least with the 'dev' branch I see an issue where the current working directory becomes inaccessible and a 'cd .' is required (has anybody observed this on lx*?) - looking into it
    • Luca will deploy the "dev" branch
  • fixed a FUSEX SEGV issue related to inline repair (too small buffer) found by Rainer - needs to be ported to 4.2.24

(Enrico):

  • SWAN has been running 4.2.22 for ~1 week: series of crashes but no coredumps (abrtd setup?), will try to reproduce/provide more info.

● Development issues

(Georgios)

  • Fixing the last few remaining places where synchronous requests to QDB could lock up the namespace and cause the MGM to become unresponsive for several seconds.
    • Mostly related to FilesystemView, used by MGM services like Balancer, etc.
  • (EOS-2610) PPS MGM hangs every day around 2pm for a couple of minutes, but recovers quickly. (No crash, the NS just remains unavailable.) Almost certainly related to the above; fixing that should also resolve this.
    • does not seem to be a cronjob (internal?). Happens every 24h+30min.
  • (have deleted some files on EOSPPS, will go below 3.5B to have some room for operations)
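
The "not a cronjob" reasoning for EOS-2610 can be illustrated with a small sketch: a daily cron job fires at the same wall-clock time every day, while a 24h+30min period drifts 30 minutes later each day. The start date and the function name below are hypothetical, chosen only for illustration; only the "around 2pm" start and the 24h+30min period come from the notes above.

```python
from datetime import datetime, timedelta


def projected_hang_times(first_seen, count, period=timedelta(hours=24, minutes=30)):
    """Return `count` projected occurrences, spaced `period` apart, starting at `first_seen`."""
    return [first_seen + i * period for i in range(count)]


if __name__ == "__main__":
    # Hypothetical first observation at 14:00; the date itself is made up.
    start = datetime(2018, 1, 1, 14, 0)
    for t in projected_hang_times(start, 4):
        # A daily cron job would print the same HH:MM on every line.
        print(t.strftime("%Y-%m-%d %H:%M"))
```

With this period the wall-clock time of the hang moves by half an hour per day, which is consistent with an internal timer rather than a fixed daily schedule.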

 

EOSBACKUP - want access to MGM machine for 1 day (to boot namespace), then tag release, then deploy new NS on that machine (is already on "citrine"). Will stop backup traffic tomorrow during the day, needs compacting (Georgios/Elvin/Kuba/Luca to coordinate).


● AOB

Massimo - need new FUSEX soon (4.2.24 should be released within days), stuck with "massive parallel FUSEX" testing.

  • multi-mountpoint - might be EOS-2603 (but marked as low-prio)
  • need "eos-cleanup" script soon (EOS-2614), since seeing a lot of EOS-2603 and nodes are unusable afterwards
  • saw some "corruption"? (unclear, might have been SEGV as found by Rainer - see above, no ticket?)

Cristi - security team request to scan for world-writable directories. Operations need to look into this (with high priority). Cristi will do this for EOSPUBLIC, but somebody should reply for all EOS instances.
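
A minimal sketch of such a scan, assuming the instance is reachable as a POSIX path (for example through a FUSE mount) and that plain mode bits are what the security team cares about. The function name is hypothetical, EOS-side ACLs are not considered, and for a large namespace a server-side approach would likely be needed instead.

```python
import os
import stat


def find_world_writable_dirs(root):
    """Yield directories below `root` whose mode has the 'other write' bit set.

    The root directory itself is not checked, and symlinks are not followed.
    """
    for dirpath, dirnames, _filenames in os.walk(root, followlinks=False):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if stat.S_ISDIR(mode) and mode & stat.S_IWOTH:
                yield path
```

Usage would be along the lines of `for d in find_world_writable_dirs("/some/mount/point"): print(d)`, where the path is a placeholder.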

Kuba - have drafted policy for EOSUSER service (i.e. what to answer for "big files", >1TB quota, FTS access, Grid integration etc. - all should try to go to the experiment instance).

  • Jan: OK from our side (clarify the 1TB/2TB limit), but suggest clarifying the message with ATLAS, LHCb storage experts before sending to users.
  • Massimo: in particular ATLAS has group areas for heavy analysis.
    • 16:00 16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

    • 16:20 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

    • 16:25 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)


    • 16:35 16:50
      AOB 15m
