EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting. Otherwise it will be handled under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

LHC instances

  • All instances updated to EOS 4.3.12 and XRootD 4.8.4

EOSPUBLIC

  • deadlocked last Tuesday after a compaction -> upgraded to 4.3.12 - EOS-2895
  • crashed Saturday night - EOS-2908

(EOSHOME: all on 4.3.13)


● EOS clients, FUSE(X)

Andreas

  • coming fixes in version 4.3.14
     
    • eosxd not behaving correctly when out of inode quota
      • FUSEX: enable has_quota to check only for volume quota if size==0, properly translate XRootD over-quota errors to EDQUOT
         
    • chmod after fresh mount hanging forever
      • FUSEX: before waiting for flush, verify that there is actually something to be flushed
         
    • allow binary contents as values for extended attributes
      • FUSEX: make xattr values binary blobs instead of strings in the protobuf definition
         
  • still investigating single write operation staying in XOFF state ( EOS-2896 )

(jan)

  • 4.3.13 is in production; also FUSEX auto-clean
  • /eos/lhcb and /eos/ams are mounted via FUSEX for production machines

● Development issues

(Georgios)

  • Luca found a file on EOSHOME, created by the fusex microtests, which has a non-existent directory as parent. I found a bug in the FuseServer class which might have caused this, but maybe not.
  • I'm making an external tool which scans the entire namespace and detects such problems.
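
A rough illustration of the kind of check such a scan performs, not the actual tool: the sketch below assumes the namespace has already been dumped into simple in-memory records. `FileEntry`, `find_orphans` and the sample ids are hypothetical names made up for this example; the real tool works against the EOS namespace directly.

```python
from typing import Iterable, List, NamedTuple, Set


class FileEntry(NamedTuple):
    """Hypothetical, simplified view of one file record in the namespace."""
    file_id: int
    name: str
    parent_id: int  # id of the directory that should contain this file


def find_orphans(files: Iterable[FileEntry], directory_ids: Set[int]) -> List[FileEntry]:
    """Return every file whose parent id does not match any known directory."""
    return [f for f in files if f.parent_id not in directory_ids]


if __name__ == "__main__":
    # Made-up sample data: directory 99 does not exist.
    directories = {1, 2}
    files = [
        FileEntry(10, "a.txt", 1),
        FileEntry(11, "b.txt", 99),
        FileEntry(12, "c.txt", 2),
    ]
    for orphan in find_orphans(files, directories):
        print(f"file {orphan.file_id} ({orphan.name}) has missing parent {orphan.parent_id}")
```

Keeping the directory ids in a set makes each lookup constant time, which matters when the scan has to cover the entire namespace.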

● AFS migration restart - criteria

AFS migration restart - what is required?

(sent by mail on 2018-09-20)

Are these criteria ok for everybody, sufficient?

  1. FUSEX must be in use against all instances (i.e. puppet "production" machines use eosxd).
    1. do not want to tell somebody again that their particular use case will only be solved on FUSEX
    2. NOT required for all instances - EOSATLAS FUSEX
  2. FUSEX must be "sufficiently stable", in particular at protocol level
    1. sync'ed software updates of client+server are nearly impossible, once we have desktop machines using FUSEX
  3. FUSEX must be "sufficiently fast"
    1. performance (=latency) was a major reason for the rewrite
    2. microtests need to be "nearly all" at AFS speed (we can probably make an argument for one or two being acceptable at <2x slower)
  4. EOSHOME migration completed
    1. cannot move significant amounts of files into old EOSUSER
    2. Q - also need to migrate EOSPROJECT?
  5. explicit "OK to restart" from within IT-ST:
    1. EOS developers
      1. migration had been put on hold at their request.
    2. EOS operations
      1. server-side stability and durability need to be “good enough” (would prefer to have explicit criteria - #crashes/week, uptime..).
    3. user support
      1. happy with existing docs/KBs/procedures?
  6. explicit OK from at least some AFS phaseout coordinators
    1. to validate that their already-known use cases have been addressed

Discussion:

4.2: Luca "yes" - need to migrate EOSPROJECT - need hardware. Do we need multiple instances? (~100M files; what is the growth rate? higher)

Other instances will go to QuarkDB after the run, early 2019

Functional equivalence: is needed for $HOME, but assume OK for project spaces.

* unauthenticated access (.ssh/; .forward) is required for $HOME; hardlinks too - make a list of such use cases. This is EOS policy, not code.

Dan suggests a full portfolio table: when should a user use which service? Big table: I/O ops, sharing - which bucket for which use case (AFS/Manila/CEPHFS/EOS). But only for users that either already suffer on AFS, or cannot fit into EOS.

EOSFUSEX is catch-almost-all, with the exception of home directories!

AFS is also a "cheap free backup".

Migration: copy over everything, then project-by-project switchover. May need tools (e.g. for ACLs).

 


FUSEX roll-out timetable proposal:

2018-09-25 (this week):

  • AMS, LHCB: production
  • rest of EOSPUBLIC: qa

2018-10-03 (next week):

  • rest of EOSPUBLIC: production
  • EOSCMS, EOSATLAS, EOSMEDIA: qa (EOSCMS and EOSATLAS are both on 4.3.12 - Q: is that good enough?)

2018-10-10 (2 weeks):

  • EOSCMS, EOSATLAS: production
  • Q: leftover stuff: EOSUSER, EOSPROJECT (all on aquamarine?), EOSCTA, EOSMEDIA, EOSGENOME, EOSUAT (gone?)

Discussion:

Is dropping old EOSFUSE a goal? No, wide-area access is better on (old) EOSFUSE.

Server-side: want latest (4.3.14); MEDIA needs to update (currently on 4.2.X).

 


Microtests - which of these need to become "faster"?
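
One possible way to answer this mechanically, given criterion 3.2 above (nearly all microtests at AFS speed, one or two acceptable at <2x slower), is to compare per-test timings. The sketch below is only an illustration: the function names, the test names and all timings are made up, and the 2.0 limit is simply the "<2x" figure from the criteria.

```python
from typing import Dict, List


def slowdown_ratios(afs_seconds: Dict[str, float],
                    fusex_seconds: Dict[str, float]) -> Dict[str, float]:
    """Return FUSEX time divided by AFS time for every test measured on both."""
    ratios = {}
    for name, afs_time in afs_seconds.items():
        if name not in fusex_seconds:
            continue  # test was not run on FUSEX, nothing to compare
        if afs_time <= 0:
            raise ValueError(f"non-positive AFS timing for test {name!r}")
        ratios[name] = fusex_seconds[name] / afs_time
    return ratios


def needs_speedup(ratios: Dict[str, float], limit: float = 2.0) -> List[str]:
    """List tests at or above the slowdown limit, worst first."""
    slow = [name for name, ratio in ratios.items() if ratio >= limit]
    return sorted(slow, key=lambda name: ratios[name], reverse=True)


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    afs = {"untar": 2.0, "stat": 1.0, "compile": 10.0}
    fusex = {"untar": 3.0, "stat": 2.5, "compile": 9.0}
    ratios = slowdown_ratios(afs, fusex)
    print(needs_speedup(ratios))
```

Tests with a ratio between 1 and the limit would still need a case-by-case argument, since the criteria allow only one or two of them.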

    • 16:00 16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)


    • 16:20 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

    • 16:25 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)

    • 16:35 16:55
      AFS migration restart - criteria 20m
      Speaker: Jan Iven (CERN)


    • 16:55 17:10
      AOB 15m