EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be handled under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

EOSPUBLIC

  • Crashed this morning, deadlock in FuseServer::Client::BroadcastConfig (EOS-2954)
  • Failed over to the slave, which luckily was in sync
  • FSTs almost all crashed at the same time (possibly EOS-1934)
    • crashed in 4.4.0..4.4.3, fixed in 4.4.4. Herve will update FSTs.
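A minimal sketch of how the FSTs still needing the update could be picked out, using only the range stated above (crash in 4.4.0..4.4.3, fixed in 4.4.4). The host names and the `versions` mapping are hypothetical; how the installed version is collected from each FST is not covered in these minutes.

```python
def parse_version(text):
    """Turn '4.4.3' into (4, 4, 3) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


FIRST_AFFECTED = parse_version("4.4.0")  # first release with the crash
FIRST_FIXED = parse_version("4.4.4")     # first release with the fix


def needs_update(version):
    """True if the version lies in the affected range 4.4.0..4.4.3."""
    return FIRST_AFFECTED <= parse_version(version) < FIRST_FIXED


# Hypothetical inventory: host name -> installed EOS version.
versions = {
    "fst-a.example": "4.4.2",
    "fst-b.example": "4.4.4",
    "fst-c.example": "4.3.12",
}

to_update = sorted(host for host, v in versions.items() if needs_update(v))
print(to_update)
```

Comparing integer tuples rather than strings keeps releases such as 4.3.12 correctly ordered below 4.4.0.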

 


● EOS clients, FUSE(X)

(jan):

  • eos-fuse(x)-4.4.0 in "production" since today. Nothing (newer) in "qa"
  • Notable bugs:
    • EOS-2894: "ssh -X" (actually "xauth") hangs FUSEX. Fix required for the "$HOME" use case. A test exists.
  • AMS is waiting for green light to repeat their MGM-crashing test (was on hold until EOSPUBLIC would be on 4.4.X; unfortunately EOSPUBLIC crashed this morning - unrelated?) OK to go ahead?
    • Luca: please test against EOSPPS, if possible for AMS
  • Next instance(s) to go FUSEX, at least in "qa"?
    • tentative planning was to be on FUSEX everywhere this week. See FUSEX roll-out timetable proposal (2018-09-25)
    • MGM should already be on 4.4.x. CMS, ATLAS: 4.3.12, MEDIA: 4.3.14 - can these get updated this week?
      • Massim to decide.
  • EOSPROJECT status?
    • have HW, have no time, Jan to set up, Luca to give name+hostgroup.

● Development issues

(Georgios)

  • Just a few hours ago, eoshome-i03 was DoS'ed by a user running "recycle ls". It looks like the command took too long to complete, timed out, and was automatically retried, again and again (EOS-2955).
    • is due to the core xrootd-client timeout + command resend (same as for dumpmd, find). The command would need to be rewritten "using new mechanism".
    • "eos console" could turn off retries, but that would also turn off read recovery, for all commands.
    • Hugo will set an env variable for the web interface to turn off retries (to be provided by Elvin).
  • Fix for a namespace bug where an mkdir was able to "shadow overwrite" a broken symlink (this left both a file and a dir entry in the namespace, which caused "find" to crash the MGM).
    • in case of a clash, the directory would need to be renamed (which causes the file/link entry to become visible). Georgios will write a tool to check existing namespace(s).
  • QuarkDB 0.3.4 has been released:
    • Fix for a bug which does not affect EOS.
    • Updated rocksdb to latest release.
      • (is linked statically, no new RPMs)
    • Full release notes here
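
The EOS-2955 retry loop noted above (a command exceeds the client timeout, is resent, and the earlier copies keep running on the server) can be illustrated with a toy model. This is only a sketch under assumptions: the function name and all numbers are invented for illustration, and it neither uses nor describes the real xrootd-client API or its settings.

```python
import math


def peak_concurrent_copies(server_seconds, client_timeout, max_retries):
    """Peak number of copies of one command the server works on at once.

    Toy model: the client resends the command every `client_timeout`
    seconds for as long as no answer has arrived and retries remain.
    The server does not cancel earlier copies; each runs for
    `server_seconds`.
    """
    if server_seconds <= client_timeout:
        return 1  # the answer arrives in time, nothing is resent
    total_sends = 1 + max_retries
    # Copies sent within one server_seconds window overlap each other.
    overlapping = math.ceil(server_seconds / client_timeout)
    return min(total_sends, overlapping)


# Invented figures: the command needs 300 s, the client gives up after 60 s.
print(peak_concurrent_copies(300, 60, max_retries=0))    # retries off
print(peak_concurrent_copies(300, 60, max_retries=100))  # generous retries
```

In this model, turning retries off keeps the load at a single copy per invocation, whereas a generous retry budget multiplies it by roughly the ratio of run time to timeout, and the resends continue until the budget is used up.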

● AOB

Reminder:

  • Andy Hanushevsky at CERN this week - discuss high-prio Xrootd bugs/features with him.
    • 16:00 16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)


    • 16:20 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

    • 16:25 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Michal Kamil Simon (CERN)

    • 16:35 16:50
      AOB 15m
