EOS DevOps Meeting

Name: EOS DevOps Meeting
Start: 2017-09-19T16:00:00+02:00
End: 2017-09-19T17:45:00+02:00
Location: CERN

Tuesday 19 Sept 2017, 16:00 → 17:45 Europe/Zurich

513/R-068 (CERN)

Jan Iven (CERN)

Description

Weekly meeting to discuss progress on EOS rollout

● production instances

E-Groups restructuring

Goal: reduce the number of people who have root access on the boxes, and separate roles from eos-admins.

The proposal is to have 3 e-groups:

eos-(servicemanagers|operations|root): root access to nodes, owner of machines, LanDB
eos-tickets: Base SNOW team feeding eos-snow-3rdline
eos-admins: Communications, discussions (also external people involved)

GIT workflow evolution proposal

The idea is to prevent a spaghetti-like git history, which is very hard to follow, difficult to "merge", error-prone, and time-consuming on top of that.

Desired: a clean history that is easy to follow.

Why:

Having a clean history is important in order to understand what happened and what changed in a repo. Fixing merge problems and conflicts is very time-consuming (it took me ~2 hours to fix the first spaghetti plate).

How:

=> Mimicking the AI (Agile Infrastructure) workflow:

- Stop pushing to master (except for emergency changes).
- Use feature branches for new features, then open a merge request for QA.
- Merge request from feature branch to QA.
- Merge request from QA to master (with X approvals).
- This allows for code review, advice, and improvements.
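
The branch rules above can be sketched as a small policy check. This is only an illustration of the proposal, not an existing tool: the `feature/` prefix, the function names, and the `MergeRequest` type are assumptions, and the number of approvals (the "X" above) is left as a required parameter because the minutes do not fix it.

```python
from dataclasses import dataclass

# Hypothetical naming convention for feature branches (not from the minutes).
FEATURE_PREFIX = "feature/"


@dataclass(frozen=True)
class MergeRequest:
    source: str
    target: str
    approvals: int = 0


def may_push_directly(branch: str, emergency: bool = False) -> bool:
    """Direct pushes to master are reserved for emergency changes.

    The minutes only restrict master, so other branches are left open here.
    """
    if branch == "master":
        return emergency
    return True


def may_merge(mr: MergeRequest, required_approvals: int) -> bool:
    """Allow only feature -> qa, and qa -> master with enough approvals."""
    if mr.target == "qa":
        return mr.source.startswith(FEATURE_PREFIX)
    if mr.target == "master":
        return mr.source == "qa" and mr.approvals >= required_approvals
    return False
```

For example, `may_merge(MergeRequest("feature/x", "master", approvals=5), required_approvals=2)` is rejected because a feature branch has to pass through qa first.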

Additional resources:

http://kensheedlo.com/essays/why-you-should-use-a-rebase-workflow/

https://randyfay.com/content/rebase-workflow-git

Retirement: Several EOSPPS (and some "gateway") machines need to go by mid-November. Need to get some (12?) non-diskserver 10Gb physical machines with a dedicated uplink as gateways (FTP, SMB); under discussion with Bernd.

● CERNBOX and EOSUSER

"Marek" incident: incrementally copied logfile. Needs an update and a compaction; schedule in 2 weeks? (Good: would also get several bugfixes.)

● FUSE and client versions

v4.1.30 is in qa, production on Monday: CRM-2426

If OK, this version should go to desktops (tickets or "magic" KOJI tag?)

● Citrine rollout

EOSALICE

Upgraded yesterday to 4.1.30 (almost) in time.

Uncovered some issues, fixed in master:

EOS-2016: FST crashes when removing ghost entries
EOS-2017: MGM crash on FSCK

EOSLHCB is already running 4.1.30, EOSPUBLIC will likely be updated to the next 4.1.31 once released.

Further planning: 4.1.31 is getting serious "testing", timeframe for updating EOSATLAS (affected by imbalance between Meyrin/Wigner) and EOSCMS is January. EOSUSER still unclear, needs more testing (EOSBACKUP will go first, then get the new namespace; EOSUAT being set up to test for EOSUSER).

One "big memory" headnode has gone missing. We have 2 (with spinning disk) that can get recycled, but would like bigmem+SSD for EOSALICE.

● SWAN

EOSUSER is still using the per-machine Kerberos principal (until next restart), SWAN dropped them (now back)

● nextgen FUSE

- incorporate local cache cleaner

- enable global byte-range locks

- enable global fsync coordination (open is delayed until any client has finished its fsync)

- add max file-size support returning EFBIG

- add rudimentary quota honoring on the client side
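
As an illustration of the max file-size item, here is a minimal sketch of a size check that follows the usual POSIX `write` convention: a write that partly fits is shortened, and EFBIG is raised only when no byte can be written. The function name and the `max_file_size` parameter are assumptions for the example; the actual client code is not described in these minutes.

```python
import errno
import os


def writable_bytes(offset: int, length: int, max_file_size: int) -> int:
    """Return how many of `length` bytes may be written at `offset`.

    Raises OSError with errno EFBIG when the write starts at or beyond
    the limit, so that not even one byte would fit.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    if length == 0:
        return 0  # zero-length writes never exceed the limit
    room = max_file_size - offset
    if room <= 0:
        raise OSError(errno.EFBIG, os.strerror(errno.EFBIG))
    return min(length, room)
```

A caller would check this before sending data to the server and pass the resulting error code back to the application unchanged.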

Progressing as planned.

Server-side is assumed to be ready on Monday (Luca would like pre-/post tags ..)

Do we set up some "experimental" area on some instance (on EOSUAT?)

"eosfusebind": need to keep compat, even if no longer needed.

● Samba

Samba update this morning (tons of RPMs from 7.4 distrosync) left the service in an inconsistent state, OK after restart (also affected SWAN).

Webcast uses this, and had "VIP" HR webcast shortly afterwards = ticket. SMB service status needs to be sorted out (Massimo?), also need to add (more) monitoring.

- 16:00 → 16:05
  
  overall 2017 planning 5m
  
  Speaker: Jan Iven (CERN)
- 16:05 → 16:30
  operations: production
  - 16:05
    production instances 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:10
    
    CERNBOX and EOSUSER 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:15
    
    FUSE and client versions 5m
    
    Speaker: Dan van der Ster (CERN)
  - 16:20
    Citrine rollout 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:25
    
    SWAN 5m
    
    Speaker: Jakub Moscicki (CERN)
- 16:30 → 16:50
  development: near-term
  - 16:30
    
    nextgen FUSE 5m
    
    Speaker: Andreas Joachim Peters (CERN)
  - 16:35
    
    new Namespace 5m
    
    Speaker: Elvin Alin Sindrilaru (CERN)
- 16:50 → 17:45
  other: pilot services, long-term dev, external
  - 16:50
    
    Webservice 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:55
    
    Backup 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:00
    
    Samba 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:05
    
    $HOME structure 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:10
    
    BATCH integration 5m
    
    Speaker: Massimo Lamanna (CERN)
  - 17:15
    
    Xrootd 5m
    
    Speaker: Michal Kamil Simon (CERN)
  - 17:20
    
    AOB 5m
