EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting. Otherwise it will be handled under "AOB".

 

    • 16:00 16:02
      (new meeting agenda) 2m
    • 16:00 16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

      Network issue (March 31st)

      A misbehaving card in one of the core routers caused problems (interrupted TCP connections, timeouts, etc.) from 03h23 to 11h07.

      See OTG0043178 for more information.

      • Generating a list of files written during the incident window and contacting experiments to warn them that files may be corrupted.
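
      As a rough illustration of the file-list step above, the following is a minimal sketch that keeps only entries whose modification time falls inside the incident window. It assumes a namespace dump is already available as (path, mtime) pairs; the function name, the record layout and the example paths are hypothetical. The year 2018 is inferred from the log timestamps further down, and the window is assumed to be in local Zurich time (CEST, UTC+2 on that date).

```python
from datetime import datetime, timedelta, timezone

# Assumption: the incident times are local Zurich time, which was CEST (UTC+2)
# on 31 March 2018. The year is inferred from the MGM log timestamps.
CEST = timezone(timedelta(hours=2))
WINDOW_START = datetime(2018, 3, 31, 3, 23, tzinfo=CEST)
WINDOW_END = datetime(2018, 3, 31, 11, 7, tzinfo=CEST)


def written_during_incident(records, start=WINDOW_START, end=WINDOW_END):
    """Return the paths whose mtime lies inside [start, end].

    `records` is an iterable of (path, mtime_epoch_seconds) pairs; this
    layout is a hypothetical stand-in for whatever the namespace dump provides.
    """
    lo, hi = start.timestamp(), end.timestamp()
    return [path for path, mtime in records if lo <= mtime <= hi]


# Hypothetical example records: one before, one inside, one after the window.
example_records = [
    ("/eos/example/before.root", datetime(2018, 3, 31, 3, 0, tzinfo=CEST).timestamp()),
    ("/eos/example/inside.root", datetime(2018, 3, 31, 5, 0, tzinfo=CEST).timestamp()),
    ("/eos/example/after.root", datetime(2018, 3, 31, 12, 0, tzinfo=CEST).timestamp()),
]
affected = written_during_incident(example_records)
```

      Files that were opened before the window and closed inside it would also be caught by an mtime-based filter, but files only read during the window would not; whether that matters depends on what the experiments need to re-verify.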

      EOSPUBLIC

      • slave MGM crashed during a quota operation on the master: EOS-2465
      • slave MGM cannot be restarted, even after an online compaction, moving the files out of lost+found, and a second online compaction:
        • 180403 09:22:56 time=1522740176.764038 func=InitializeFileView       level=CRIT  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eospublic-srv-m1.cern.ch:1094 tid=00007fde2abff700 source=XrdMgmOfsConfigure:305         tident=<single-exec> sec=      uid=0 gid=0 name= geo="" namespace file loading initialization failed after 495 seconds
          180403 09:22:56 time=1522740176.764110 func=InitializeFileView       level=CRIT  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eospublic-srv-m1.cern.ch:1094 tid=00007fde2abff700 source=XrdMgmOfsConfigure:307         tident=<single-exec> sec=      uid=0 gid=0 name= geo="" initialization returnd ec=2 Container #42208784 not found

        • more investigation needed

      • master MGM crashed on Thu, Mar 29th at 23:00 during the investigation of the slave MGM not booting (caused by an attempt to remove a file with a missing container):
        • SSB entry: OTG0043168
        • ticket for adding exception handling for the above event: EOS-2466
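
      For anyone filtering the MGM logs for similar failures, here is a minimal sketch that splits a log line like the two CRIT lines quoted above into its key=value header fields and the free-text message. The layout is inferred only from those two lines (in particular, that geo= is the last header field), so treat it as an assumption rather than a description of the EOS log format; the function and pattern names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumption: the header is a run of key=value tokens ending with geo=...,
# and everything after that is the free-text message.
FIELD_RE = re.compile(r"(\w+)=(\S*)")
HEADER_END_RE = re.compile(r"\bgeo=\S*\s*")


def parse_mgm_line(line):
    """Split one log line into a dict of header fields plus 'message'."""
    match = HEADER_END_RE.search(line)
    if match is None:
        return None  # line does not follow the assumed layout
    header = line[:match.end()]
    fields = dict(FIELD_RE.findall(header))
    fields["message"] = line[match.end():].strip()
    if "time" in fields:
        # time= is an epoch timestamp; convert it to an aware UTC datetime.
        fields["timestamp"] = datetime.fromtimestamp(
            float(fields["time"]), tz=timezone.utc
        )
    return fields


# Second CRIT line from the minutes, reproduced verbatim.
sample = (
    "180403 09:22:56 time=1522740176.764110 func=InitializeFileView       "
    "level=CRIT  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx "
    "unit=mgm@eospublic-srv-m1.cern.ch:1094 tid=00007fde2abff700 "
    "source=XrdMgmOfsConfigure:307         tident=<single-exec> sec=      "
    'uid=0 gid=0 name= geo="" initialization returnd ec=2 '
    "Container #42208784 not found"
)
parsed = parse_mgm_line(sample)
```

      Cutting the header at geo= keeps tokens such as ec=2 inside the message instead of turning them into header fields.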
    • 16:20 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

      it-puppet-module-eosclient updated to better support FUSEX, merged to production this morning: CRM-2607


      v4.2.18 is in QA and will go into production tomorrow: CRM-2624

      v4.2.18 has a known issue with the FUSEX cache cleanup; it is fixed in the next release.

    • 16:25 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)
    • 16:35 16:50
      AOB 15m