● production instances

E-groups restructuring

Goal: reduce the number of people who have root access to the boxes, and separate roles out of eos-admins.

The proposal is to have 3 e-groups:

eos-(servicemanagers|operations|root): root access to nodes, owner of machines, LanDB
eos-tickets: Base SNOW team feeding eos-snow-3rdline
eos-admins: Communications, discussions (also external people involved)

Git workflow evolution proposal

The idea is to avoid the current spaghetti-like git history, which is very hard to follow, difficult to merge, error-prone and time-consuming.

Desired: a clean history.

Why:

A clean history is important for understanding what happened and what changed in a repo. Fixing merge problems and conflicts is very time-consuming (it took me ~2 hours to fix the first spaghetti plate).

How:

=> Mimicking the AI workflow

Stop pushing directly to master (except for emergency changes).

Use feature branches for new features.
Merge request from feature branch to QA.
Merge request from QA to master (with X approvals).
- Allows for code review, advice and improvements.
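
The rules above can be written down as a small policy check. This is only a sketch of the proposal, not an existing hook or GitLab setting: the `Change` type, the `is_allowed` function and the branch names `master` and `qa` are illustrative, the number of approvals ("X") is left as a parameter because it has not been decided, and forbidding direct pushes to QA is an assumption the notes do not state.

```python
from dataclasses import dataclass

MASTER = "master"  # assumed branch name
QA = "qa"          # assumed branch name


@dataclass(frozen=True)
class Change:
    """A proposed update to a branch (hypothetical type, for illustration)."""
    source: str                      # branch the change comes from
    target: str                      # branch the change goes to
    approvals: int = 0               # reviewers who approved the merge request
    via_merge_request: bool = True   # False means a direct push
    emergency: bool = False          # flagged as an emergency change


def is_allowed(change: Change, required_approvals: int) -> bool:
    """Return True if the change follows the proposed workflow.

    required_approvals is the "X" from the notes; no value is fixed there.
    """
    if not change.via_merge_request:
        if change.target == MASTER:
            # Direct pushes to master only for emergency changes.
            return change.emergency
        # Assumption: QA is only fed by merge requests; feature
        # branches can be pushed to freely.
        return change.target != QA
    if change.target == QA:
        # Feature branch -> QA.
        return change.source not in (MASTER, QA)
    if change.target == MASTER:
        # QA -> master, with enough approvals.
        return change.source == QA and change.approvals >= required_approvals
    # Merge requests between feature branches are not restricted here.
    return True
```

For example, `is_allowed(Change("feature/x", "qa"), required_approvals=2)` is accepted, while a merge request from the same feature branch straight to master is rejected regardless of approvals.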

Additional resources:

http://kensheedlo.com/essays/why-you-should-use-a-rebase-workflow/

https://randyfay.com/content/rebase-workflow-git

Retirement: Several EOSPPS (and some "gateway") machines need to go by mid-November. We need some (12?) non-diskserver 10Gb physical machines with a dedicated uplink to act as gateways (FTP, SMB); this is under discussion with Bernd.

● CERNBOX and EOSUSER

"Marek" incident: incrementally copied logfile. Needs an update and a compact; schedule in 2 weeks? (Good, we get several bugfixes.)

● FUSE and client versions

v4.1.30 is in QA, going to production on Monday: CRM-2426

If OK, this version should go to desktops (tickets or "magic" KOJI tag?)

● Citrine rollout

EOSALICE

Upgraded yesterday to 4.1.30, (almost) on time.

Uncovered some issues, fixed in master:

EOS-2016: FST crashes when removing ghost entries
EOS-2017: MGM crash on FSCK

EOSLHCB is already running 4.1.30, EOSPUBLIC will likely be updated to the next 4.1.31 once released.

Further planning: 4.1.31 is getting serious "testing". The timeframe for updating EOSATLAS (affected by the imbalance between Meyrin and Wigner) and EOSCMS is January. EOSUSER is still unclear and needs more testing (EOSBACKUP will go first, then get the new namespace; EOSUAT is being set up to test for EOSUSER).

One "big memory" headnode has gone missing. We have 2 (with spinning disks) that can be recycled, but would like bigmem+SSD for EOSALICE.

● SWAN

EOSUSER is still using the per-machine Kerberos principals (until the next restart); SWAN dropped them (now back).

● nextgen FUSE

- incorporate local cache cleaner

- enable global byte-range locks

- enable global fsync coordination (open is delayed until any client has finished its fsync)

- add max file-size support returning EFBIG

- add rudimentary quota honoring on client side

Progressing as planned.
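
As an illustration of the max file-size item in the list above, a client-side check could look roughly like the sketch below. This is not the EOS FUSE client code: the function name `check_write` and its parameters are hypothetical, and the only thing taken from the notes is that a write past the limit should fail with EFBIG.

```python
import errno
import os


def check_write(offset: int, length: int, max_file_size: int) -> None:
    """Raise OSError(EFBIG) if a write would grow the file past the limit.

    Hypothetical helper: offset and length describe the requested write,
    max_file_size is the configured limit in bytes.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    if offset + length > max_file_size:
        # Same error a POSIX write() reports when exceeding the file size limit.
        raise OSError(errno.EFBIG, os.strerror(errno.EFBIG))
```

A write that ends exactly at the limit passes; one byte more raises `OSError` with `errno` set to `errno.EFBIG`, which is what an application would see from a failed `write()`.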

Server-side is assumed to be ready on Monday (Luca would like pre-/post tags ..)

Do we set up some "experimental" area on some instance (on EOSUAT?)

"eosfusebind": need to keep compatibility, even if it is no longer needed.

● Samba

Samba update this morning (tons of RPMs from 7.4 distrosync) left the service in an inconsistent state; OK after a restart (also affected SWAN).

Webcast uses this and had a "VIP" HR webcast shortly afterwards, which resulted in a ticket. SMB service status needs to be sorted out (Massimo?); we also need to add (more) monitoring.