EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be handled under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

(Cristi): EOSPUBLIC got stuck after compaction (during the meeting) and is now being updated to 4.3.12.

(Luca): EOSHOME instances are on 4.3.12 (except for -4, which is waiting for a break in the EOSHOME migration/OK from Eddy).

 


● EOS clients, FUSE(X)

(jan) FUSEX deployment:

  • FUSEX still (only) in "qa" for /eos/ams, /eos/lhcb - since Aug 24 (CRM-2790)
    • general - what are the criteria to go forward?
    • here: push to "production"? add other instances to "qa"?
    • AP: Need 4.3.12 client+server for FUSEX, and 4.3.11 on the FSTs for FUSEX+FUSE, i.e. the ATLAS MGM would need to be updated.
  • versions:
    • 4.3.11 + xrootd-4.8.4 -> "production" (and will go to SLC6/CC7 desktops)
    • 4.3.12 -> "qa"
      • both might be stuck in KOJI, jan will check.

 


● Development issues

(Georgios)

  • QDB 0.3.3 released: Addition of required commands for running EOS config engine on QDB.
    • can run on single node while still in "raft" mode, can then add more nodes (unlike "standalone" mode)
    • Q: is backup working? Please check with Paul and try out a restore. (Is Georgios actually testing this? How can we "test" that the backup is valid? Write a new dummy key in REDIS and check for that?)
  • eosd private mounts would lose authentication after 20 minutes of inactivity (EOS-2892 - fixed in 4.3.12), caused by two bugs (almost) cancelling each other out.

● AOB

(jan): collecting criteria for restarting the AFS migration, will summarize and send by mail ($HOME will probably be last)

    • 16:00 - 16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)


    • 16:20 - 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)


    • 16:25 - 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)

    • 16:35 - 16:50
      AOB 15m
