EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)
Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be handled under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

EOSPUBLIC, EOSALICEDAQ, EOSMEDIA:

  • XRootD upgrade on the FSTs to 4.8.1-2.CERN (fixes the "no more free SIDs" / lost connections between FSTs)

  • MGM update to 4.2.21 pending + XrootD 4.8.3-rc (ready for deployment?)

4.2.21 has been released. Does it need testing against FUSE? No: it is already in production on some MGMs, and the 4.2.21 changes were quite specific.

 

Q (Massimo): time to get 4.3 out? Depends on testing - if happy, can it be released tomorrow? Plan is to release this week (since it is already running). How to test the "truncation" part? Perhaps in CASTOR repack; it was simple to reproduce - just freeze for 60s (see the sketch below).
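
A minimal sketch of that "freeze" reproduction, assuming it is done by suspending the xrootd process with signals (process name and method are assumptions; only the 60s figure is from the discussion):

    # freeze the daemon for 60 seconds, then let it continue
    kill -STOP $(pidof xrootd)
    sleep 60
    kill -CONT $(pidof xrootd)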

EOSPUBLIC:

  • controlled restart of the MGM (due to high memory usage) + upgraded to 4.2.19 last Tuesday

LHC instances

  • Upgrading to 4.2.21 and XRootD 4.8.3-rc1 (except ALICE, which has no FUSE)
    • 2/3 of FSTs done
  • Minor fixes to be committed on CentOS7 / systemd
    • EOSCMS not joining the AAA federation.
    • Core dumps enabled (to be fixed via a systemd override; see the sketch after this list)
  • Batch on EOS
    • Draining SSDs on ~half of the eligible ATLAS FSTs; these can be used as /pool for jobs
    • Fixing minor issues with Docker
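
A minimal sketch of such a systemd override, assuming the FST runs as a templated "eos@fst" unit and that the intent is to disable core dumps (both are assumptions):

    # create a drop-in override for the (assumed) eos@fst unit
    mkdir -p /etc/systemd/system/eos@fst.service.d
    printf '[Service]\nLimitCORE=0\n' > /etc/systemd/system/eos@fst.service.d/coredump.conf
    systemctl daemon-reload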

EOSALICE is on 4.2.17 - should it perhaps get the fixed XRootD (this might affect conditions files)?

 

 


● EOS clients, FUSE(X)

(Massimo)

  • Inconsistent dates (close-but-not-quite UNIX epoch "0") on files generated by HTCondor (stdout and stderr). Discussions + INC1650792
    • Not critical?
  • BEER testing (Condor jobs on the FSTs)
    • Monitoring dashboard to see the effect (https://monit-grafana.cern.ch/dashboard/db/_user-laman-eos-beer?orgId=6 + beer.sh reading data from one FST via xrdcp to /dev/null (batch); see the sketch after this list)
      • Rationale behind...
    • Problem: Condor job draining not working (fix now in Koji: see BEER-12); there will now also be a way to hard-kill jobs.
  • PPS sailing towards 2^32+1 (almost 4.3B; at 4.0B now; suggested by and agreed with GeorgiosB)
  • Still confused about drain interference on PPS (sorry). EOS-2517 (thanks to A. Manzi)
    • might have been in "draindead" state, not visible in some tools.
  • No time to check newfind yet (sorry again)
  • Preparing for multiple-mountpoint tests (e.g. 300 jobs x 100 mountpoints). OK on UAT
  • FUSEX issues are at a "residual" state now. Can we have FUSEX on, say, LHCb? I understand USER is "special".
    • no recovery for deleted files yet -> add to the list as a showstopper. Luca to provide the list.
    • enable on "qa" (i.e. on 10% of batch) afterwards.
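
A minimal sketch of the beer.sh-style read load mentioned above, assuming it uses xrdcp against a single FST (host and file path are placeholders):

    # stream a file from one FST to /dev/null to generate read-only load
    xrdcp -f root://eosfst001.cern.ch//eos/pps/somefile /dev/null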

 

(Jan)

  • (FUSE / FST data loss: need a postmortem at ASDF this week - due to the "emergency" change of the writeback cache setting).
  • FUSE client status: 4.2.18 in production, 4.2.19 in qa (but known to be broken - EOS-2486, needs 4.2.22)
  • FUSEX: created accounts for the AFS-phaseout people, BUT not announced yet
  • A. Lossent would like to use FUSEX for OpenShift (some oddities with credential access when containerized)

(Dan)

  • Thomas is testing the latest eosclient Puppet module and RPM v4.2.18 with locmap for the CC7.5 desktop release.
    • Q: can we push the new xrootd-4.8.3 to desktops? Yes, it is compatible (SSI is incompatible, but that is server-side only). We could "force" this by tagging it into the "eos" repo (see the sketch after this list).
    • Thomas' proposal is to have desktops point at the Koji repo, so that we keep control of what gets deployed.
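
A hypothetical sketch of that "forcing by tagging", assuming the standard koji CLI; the tag and build names are placeholders:

    # tag the xrootd build into the tag that feeds the "eos" repo
    koji tag-build eos7-stable xrootd-4.8.3-1.el7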

(Andreas)
eosxd

  • increased the very short default timeouts from 15s to 30/60s ... added internal eosxd logic to get rid of never-returned outstanding XrdCl asynchronous messages (hard timeout, gives EIO)
  • limit the number of files in the file-start cache to 64k by default, i.e. added an inode limit to the local cache
  • added 'xoff' functionality to throttle IOPS per file and not overrun XrdCl, which fails once more than 64k requests are in flight (one can hit this via one-byte writes; now a single requester can only have 1k in flight)

eosd

  • allow private mounts for root users by setting EOS_FUSE_PRIVATE_ROOT_MOUNT (not released yet; see the sketch after this list)
  • stop writing (which triggered re-opening) when the FST open has failed
  • remove deprecated authentication recovery code for XRootD 3.3.X
  • avoid creating a new connection per request in certain circumstances (Georgios) (ProtoDUNE did this, which exhausted FDs and increased MGM memory use)
  • fix a segfault on shutdown (Georgios)
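
A hypothetical usage sketch once this is released; only the variable name comes from the item above, the eosd invocation itself is an assumption:

    # allow a private eosd mount while running as root
    export EOS_FUSE_PRIVATE_ROOT_MOUNT=1
    eosd /eos/private-mnt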

Andreas/eosxd: one PPS machine now has 100 mounts; run the sync test there. Could we do this on all CC7 machines to get a "many clients" test? Yes. Q: can we do this in containers? Yes. (A sketch of such a setup is below.)
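
A minimal sketch of such a many-mounts setup; the eosxd options and the target instance are assumptions:

    # create 100 eosxd mountpoints on one node
    for i in $(seq 1 100); do
        mkdir -p /eos_test/mnt$i
        eosxd -ofsname=eospps.cern.ch /eos_test/mnt$i
    done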

Luca: has a config change slipped into Puppet that forces KRB5 (seen by Jesus, who contacted Dan)? It comes from a renamed config tree and was supposed to keep working (affected CERNBox, discovered yesterday). Note: the change was pushed a month ago and one issue was seen immediately - that might have been a still-running mount. Should we "grep" across the Puppet configs? No: first cross-check with Jesus, then (perhaps) fix it in the module.

 


● Development issues

(Georgios)

  • New NS: implementing asynchronous reads on all metadata operations, to prevent locking the namespace while waiting on network requests towards QDB. Expected to fix many issues.
    • Asynchronous writes are already there (MetadataFlusher), no way we'd be getting 1kHz of file creations otherwise.

(Michal):

xrootd-4.8.3 expected this week.

 

(Jan):

Want to restart the "qa" cycle with 4.2.22; need to wait for 4.8.3. Can be done this week. Careful: eos-4.2.22 (server) has a new shared dependency.

 

Timetable:

  • 16:00-16:10  (go through last week's meeting) (10m)
  • 16:10-16:20  EOS production instances (LHC, PUBLIC, USER) (10m)
    • major events last week
    • planned work this week
    Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)
  • 16:20-16:25  EOS clients, FUSE(X) (5m)
    • (major) issues seen
    • rollout of new versions and FUSEX
    Speakers: Dan van der Ster (CERN), Jan Iven (CERN)
  • 16:25-16:35  Development issues (10m)
    • new namespace
    • testing
    • XRootD
    Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)
  • 16:35-16:50  AOB (15m)