EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be covered under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

EOSCMS

  • updated to 4.3.8
  • The slave now interferes with normal operations when central draining is enabled (fixed in git)
  • Boot-syncing (gently) the filesystems of ~40 nodes to fix an issue retrieving the metadata of some file replicas (the replicas themselves are healthy on the disk servers). (Roberto)
    • Elvin: this comes from not doing the full disk scan+sync at every boot. Might stem from last week's crashes; would FSCK report it?
    • What can be done "automatically"? MGM and disk agree; only the local FMD is missing the information, which causes errors on open.
    • Would FSCK move such files to "orphan"? Probably.
    • Roberto to open a JIRA ticket with some examples, to check whether granting access is safe.
    • Mid-term: drop FMD completely and just use extended attributes?

EOSATLAS

  • Re-enabled IPv6 (green light from E. Martelli)
  • Also saw user-induced overload (but did not restart). Also saw many stalled draining filesystems (set to off, restarted).
  • Might also still have the slave-failover bug where limitations get forgotten.
  • Also still see user-level FUSE mounts from LXBATCH.

QuarkDB repo URL

  • The default value for the "eos" hostgroup (next branch) has been updated to use the new linuxsoft location instead of storage-ci
    • not changed for EOSHOME -> Luca

"next" environment promotion

Would like to "promote" the "next" branch to master and move machines using the "old" configuration into a "legacy" environment/branch; this is expected to be transparent (affects ~200 machines).

It is running on ~1150 nodes. Breakdown of nodes per hostgroup still reporting SLC6:

        eos/atlas/srm                            found 2 times
        eos/backup/servers                       found 1 times
        eos/backup/storage                       found 33 times
        eos/cms/srm                              found 2 times
        eos/cms/storage                          found 7 times
        eos/dev                                  found 4 times
        eos/genome/servers                       found 2 times
        eos/genome/storage                       found 6 times
        eos/kinetic                              found 1 times
        eos/lhcb/srm                             found 3 times
        eos/media/servers                        found 2 times
        eos/media/storage                        found 13 times
        eos/pps/srm                              found 1 times
        eos/public/gridftp                       found 7 times
        eos/public/http                          found 3 times
        eos/public/servers                       found 2 times
        eos/public/srm                           found 2 times
        eos/public/storage                       found 179 times
        eos/spare                                found 2 times
        eos/uat/servers                          found 1 times
        eos/uat/storage                          found 8 times
        eos/up2u/servers                         found 1 times
        eos/up2u/storage                         found 2 times
        eos/user/gridftp                         found 2 times
        eos/user/servers                         found 2 times
        eos/user/storage                         found 30 times
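For context, a breakdown like the one above can be produced from a flat list of hostgroup assignments with standard tools. This is a minimal sketch; the input file name and its one-hostgroup-per-line format are assumptions (the real data presumably comes from the configuration database):

```shell
# Hypothetical input: one line per SLC6 node, containing its hostgroup.
cat > /tmp/slc6_hostgroups.txt <<'EOF'
eos/public/storage
eos/backup/storage
eos/public/storage
eos/user/storage
EOF

# Aggregate into the "found N times" summary shown above.
sort /tmp/slc6_hostgroups.txt | uniq -c \
  | awk '{printf "%-40s found %d times\n", $2, $1}' \
  | tee /tmp/slc6_summary.txt
```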

 

Note: I wasn't able to get SRM to work with CentOS 7; all the other components are OK.


● EOS clients, FUSE(X)

(Jan):

  • eos-4.3.9 was tagged into production (less than the usual 7-day "qa" period, i.e. "emergency", but so far nobody has noticed problems)
  • EOSTEST, PLUS (and probably BATCH) on SLC6 are stuck on xrootd-4.8.3 (intentional) and cause YUM errors (not intentional; a 32-bit library issue, see mail from Jan)
    • https://its.cern.ch/jira/browse/CRM-2799 - will deprecate 32-bit on SLC6 unless really needed.
    • Discussion: do we need to push this as "emergency", or tag 4.8.4? Not really; it seems to block neither /eos/home nor the 4.3.9 rollout.
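If the YUM errors are purely due to 32-bit packages, one possible stop-gap (an assumption on my part, not the agreed fix) is to exclude i686/i386 packages from yum transactions on the affected SLC6 nodes. Sketched here on a scratch file so it can be tried without root:

```shell
# Sketch on a scratch copy; on a real node this would be /etc/yum.conf.
conf=/tmp/yum.conf.sketch
printf '[main]\ngpgcheck=1\n' > "$conf"

# Exclude 32-bit packages from all yum transactions.
echo 'exclude=*.i686 *.i386' >> "$conf"

cat "$conf"
# Alternatively, per invocation: yum --exclude='*.i686' update
```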

● Development issues

(Georgios)

  • QDB koji repos are now public, available even to external people (https://linuxsoft.cern.ch/repos/quarkdb7-stable/x86_64/).
  • Please use these instead of storage-ci for production. Example repofile:

[quarkdb-stable]
name=QuarkDB repository [stable]
baseurl=http://linuxsoft.cern.ch/repos/quarkdb7-stable/x86_64/os
enabled=1
gpgcheck=0

[quarkdb-stable-debug]
name=QuarkDB repository [debug]
baseurl=http://linuxsoft.cern.ch/repos/quarkdb7-stable/x86_64/debug
enabled=1
gpgcheck=0
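As a quick sanity check before rolling this out via configuration management, the repo file can be dropped in place by hand and inspected. A sketch (written to /tmp here so it runs unprivileged; on a real node the file belongs in /etc/yum.repos.d/):

```shell
# Write the stable repo definition (same content as the example above).
repofile=/tmp/quarkdb-stable.repo
cat > "$repofile" <<'EOF'
[quarkdb-stable]
name=QuarkDB repository [stable]
baseurl=http://linuxsoft.cern.ch/repos/quarkdb7-stable/x86_64/os
enabled=1
gpgcheck=0
EOF

# Minimal syntax check: section header and baseurl must be present.
grep -q '^\[quarkdb-stable\]' "$repofile" && grep -q '^baseurl=http' "$repofile" && echo 'repofile OK'
# On a real node: sudo cp "$repofile" /etc/yum.repos.d/ && yum makecache
```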


● AOB

(Jan)

  • hardlinks have been implemented in FUSEX but need a non-default config option to work; please make this the default in 4.3.10. Elvin: OK

(Massimo)

  • 4.3.9 indeed fixes the error at login (thanks). Cannot yet submit batch jobs, since the "bigbird" machines have not yet been upgraded (submission will be possible once logs go to AFS); will take this up with Ben.

Deadlines+Plans:

Massimo: the xrootd redirector cannot "redirect" FUSE; a new alias needs to be deployed before the DNS alias can be changed - announced for Sep 17. Will try to contact all users in advance, but they are hard to identify from logs (<0.1%). Also, EOSHOME becomes the default for new users on that date.

(Eddy) Migration status: migrating IT-CM today. Full IT is next.

(Luca): please review KB0005691, "known issues on EOSHOME".

(Dirk): one "analytics" box is stuck on 4.2.2 - can we see this version from the server side? (Not really; a too-old client will fail, the owner will contact us, and it will get updated. FUSEX reports the client version and can drop too-old clients.)
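On spotting too-old clients: since FUSEX reports the client version, a server-side summary could look like the sketch below. The log format and field layout here are purely illustrative assumptions, not the actual MGM log format:

```shell
# Illustrative sample of per-mount log lines carrying a client version field.
cat > /tmp/fusex_clients.log <<'EOF'
host=lxplus001 app=fusex version=4.3.9
host=analytix01 app=fusex version=4.2.2
host=lxplus002 app=fusex version=4.3.9
EOF

# Count mounts per reported version to flag stragglers such as 4.2.2.
grep -o 'version=[0-9.]*' /tmp/fusex_clients.log \
  | sort | uniq -c | sort -rn | tee /tmp/fusex_versions.txt
```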

 

 

    • 16:00–16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

    • 16:20–16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

    • 16:25–16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)

    • 16:35–16:50
      AOB 15m
