EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description
Weekly meeting to discuss progress on EOS rollout

● production instances

E-groups restructuring

Goal: reduce the number of people that have root access on the boxes, and separate roles from eos-admins

The proposal is to have 3 e-groups:

  • eos-(servicemanagers|operations|root): root access to nodes, owner of machines, LanDB
  • eos-tickets: Base SNOW team feeding eos-snow-3rdline
  • eos-admins: Communications, discussions (also external people involved)

GIT workflow evolution proposal

The idea is to prevent a spaghetti-like git history (figure not reproduced here), which is hard to follow, difficult to "merge", error-prone and time-consuming on top of that.

Desired: (figure not reproduced here)

Why

A clean history is important for understanding what happened and what changed in a repo. Fixing merge problems and conflicts is very time-consuming (it took me ~2 hours to fix the first spaghetti plate).

How: 

=> Mimicking the AI workflow

Stop pushing directly to master (except for emergency changes)

  • Use feature branches for new features
  • Merge request from feature branch to QA
  • Merge request from QA to Master (with X approvals)
    • Allows for code review, advice and improvements
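
The branch rules above can be written down as a small policy check. This is only a sketch of the proposal, not an existing tool: the branch names "qa" and "master", the MergeRequest class, both function names and the value chosen for "X approvals" are assumptions made for illustration. Whether direct pushes to QA are allowed is not stated in the proposal, so only master is treated as protected here.

```python
from dataclasses import dataclass

# Hypothetical value for the "X approvals" placeholder in the proposal.
REQUIRED_APPROVALS = 2


@dataclass
class MergeRequest:
    source: str         # branch the changes come from
    target: str         # branch the changes go into
    approvals: int = 0  # number of reviewers who approved


def is_allowed(mr: MergeRequest) -> bool:
    """Return True if the merge request follows the proposed flow."""
    if mr.target == "qa":
        # Feature branches go to QA first; QA and master are not valid sources.
        return mr.source not in ("qa", "master")
    if mr.target == "master":
        # Only QA may be merged into master, and only with enough approvals.
        return mr.source == "qa" and mr.approvals >= REQUIRED_APPROVALS
    # Any other target is outside the proposed flow.
    return False


def direct_push_allowed(branch: str, emergency: bool = False) -> bool:
    """Direct pushes to master are reserved for emergency changes."""
    if branch == "master":
        return emergency
    return True
```

For example, is_allowed(MergeRequest("feature/x", "master", approvals=5)) is False, because a feature branch has to pass through QA first.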

Additional resources:

http://kensheedlo.com/essays/why-you-should-use-a-rebase-workflow/

https://randyfay.com/content/rebase-workflow-git

 

Retirement: several EOSPPS (and some "gateway") machines need to be retired by mid-November. Need to get some (12?) non-diskserver 10Gb physical machines with a dedicated uplink as gateways (FTP, SMB); under discussion with Bernd.


● CERNBOX and EOSUSER

"Marek" incident: incrementally copied logfile. Needs an update and a compaction, to be scheduled in 2 weeks? (Good, this also brings several bugfixes.)


● FUSE and client versions

v4.1.30 is in QA, production on Monday: CRM-2426

If OK, this version should go to desktops (tickets or "magic" KOJI tag?)


● Citrine rollout

EOSALICE

Upgraded yesterday to 4.1.30, (almost) on time.

Uncovered some issues, fixed in master:

EOSLHCB is already running 4.1.30, EOSPUBLIC will likely be updated to the next 4.1.31 once released.

Further planning: 4.1.31 is getting serious "testing"; the timeframe for updating EOSATLAS (affected by the imbalance between Meyrin and Wigner) and EOSCMS is January. EOSUSER is still unclear and needs more testing (EOSBACKUP will go first, then get the new namespace; EOSUAT is being set up to test for EOSUSER).

 

One "big memory" headnode has gone missing. We have 2 (with spinning disk) that can be recycled, but would like bigmem+SSD for EOSALICE.

 


● SWAN

EOSUSER is still using the per-machine Kerberos principals (until the next restart); SWAN had dropped them (now back).


● nextgen FUSE

- incorporate local cache cleaner

- enable global byte-range locks

- enable global fsync coordination (open is delayed until any client has finished its fsync)

- add max file-size support returning EFBIG

- add rudimentary quota honoring on client side
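
As an illustration of the last two items, a client-side check before accepting a write could look roughly like the sketch below. It is not taken from the EOS FUSE code: the function name check_write and all of its parameters are hypothetical, and returning EDQUOT when the quota is exceeded is an assumption (the notes only name EFBIG for the file-size limit).

```python
import errno
import os


def check_write(offset: int, length: int, current_size: int,
                max_file_size: int, quota_bytes_left: int) -> None:
    """Raise OSError if the write must be refused, otherwise return None."""
    end = offset + length

    # The write would extend the file beyond the configured maximum size.
    if end > max_file_size:
        raise OSError(errno.EFBIG, os.strerror(errno.EFBIG))

    # Only the bytes that grow the file count against the remaining quota;
    # overwriting existing data is free in this simplified model.
    growth = max(0, end - current_size)
    if growth > quota_bytes_left:
        # Assumption: quota violations are reported as EDQUOT.
        raise OSError(errno.EDQUOT, os.strerror(errno.EDQUOT))
```

Checking on the client avoids a round trip to the server for writes that would be rejected anyway, but the server still has to enforce the limits, since the client's view of the quota can be stale.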

Progressing as planned.


Server-side is assumed to be ready on Monday (Luca would like pre-/post tags ..)

Do we set up some "experimental" area on some instance (on EOSUAT?)

"eosfusebind": need to keep compat, even if no longer needed.


● Samba

Samba update this morning (tons of RPMs from the 7.4 distrosync) left the service in an inconsistent state; OK after restart (also affected SWAN).

Webcast uses this and had a "VIP" HR webcast shortly afterwards, which resulted in a ticket. SMB service status needs to be sorted out (Massimo?); also need to add (more) monitoring.

    • 16:00 16:05
      overall 2017 planning 5m
      Speaker: Jan Iven (CERN)
    • 16:05 16:30
      operations: production
      • 16:05
        production instances 5m
        Speaker: Herve Rousseau (CERN)

      • 16:10
        CERNBOX and EOSUSER 5m
        Speaker: Luca Mascetti (CERN)
      • 16:15
        FUSE and client versions 5m
        Speaker: Dan van der Ster (CERN)

      • 16:20
        Citrine rollout 5m
        Speaker: Herve Rousseau (CERN)

      • 16:25
        SWAN 5m
        Speaker: Jakub Moscicki (CERN)

    • 16:30 16:50
      development: near-term
      • 16:30
        nextgen FUSE 5m
        Speaker: Andreas Joachim Peters (CERN)

      • 16:35
        new Namespace 5m
        Speaker: Elvin Alin Sindrilaru (CERN)
    • 16:50 17:45
      other: pilot services, long-term dev, external
      • 16:50
        Webservice 5m
        Speaker: Luca Mascetti (CERN)
      • 16:55
        Backup 5m
        Speaker: Luca Mascetti (CERN)
      • 17:00
        Samba 5m
        Speaker: Luca Mascetti (CERN)

      • 17:05
        $HOME structure 5m
        Speaker: Luca Mascetti (CERN)
      • 17:10
        BATCH integration 5m
        Speaker: Massimo Lamanna (CERN)
      • 17:15
        Xrootd 5m
        Speaker: Michal Kamil Simon (CERN)
      • 17:20
        AOB 5m