EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)
Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be handled under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

EOSPUBLIC, EOSALICEDAQ, EOSMEDIA:

  • XRootD upgrade on the FSTs to 4.8.1-2.CERN (fixes the "no more free SIDs" / lost connections between FSTs)

  • MGM update to 4.2.21 pending + XrootD 4.8.3-rc (ready for deployment?)

4.2.21 has been released. Does it need testing against FUSE? No: it is already in production on some MGMs, and the 4.2.21 changes were quite specific.

 

Q (Massimo): time to get 4.3 out? Depends on testing - if happy, can it be released tomorrow? Plan is to release this week (since it is already running). How to test the "truncation" part? Perhaps in CASTOR repack; it was simple to reproduce - just freeze for 60s (see the sketch below).
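
A minimal sketch of that "freeze" reproduction, assuming it is done by suspending the xrootd process with signals (process name and method are assumptions; only the 60s figure is from the discussion):

    # freeze the daemon for 60 seconds, then let it continue
    kill -STOP $(pidof xrootd)
    sleep 60
    kill -CONT $(pidof xrootd)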

EOSPUBLIC:

  • controlled restart of the MGM (due to high memory usage) + upgraded to 4.2.19 last Tuesday

LHC instances

  • Upgrading to 4.2.21 and XRootD 4.8.3-rc1 (except ALICE, which has no FUSE)
    • 2/3 of FSTs done
  • Minor fixes to be committed on CentOS7 / systemd
    • EOSCMS not joining the AAA federation.
    • Core dumps enabled (to be fixed via a systemd override; see the sketch after this list)
  • Batch on EOS
    • Draining SSDs on ~half of the eligible ATLAS FSTs; these can be used as /pool for jobs
    • Fixing minor issues with Docker
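
A minimal sketch of such a systemd override, assuming the FST runs as a templated "eos@fst" unit and that the intent is to disable core dumps (both are assumptions):

    # create a drop-in override for the (assumed) eos@fst unit
    mkdir -p /etc/systemd/system/eos@fst.service.d
    printf '[Service]\nLimitCORE=0\n' > /etc/systemd/system/eos@fst.service.d/coredump.conf
    systemctl daemon-reload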

EOSALICE is on 4.2.17 - should it perhaps get the fixed XRootD (this might affect conditions files)?

 

 


● EOS clients, FUSE(X)

(Massimo)

  • Inconsistent dates (close-but-not-quite UNIX epoch "0") on files generated by HTCondor (stdout and stderr). Discussions + INC1650792
    • Not critical?
  • BEER testing (Condor jobs on the FSTs)
    • Monitoring dashboard to see the effect (https://monit-grafana.cern.ch/dashboard/db/_user-laman-eos-beer?orgId=6 + beer.sh reading data from one FST via xrdcp to /dev/null (batch); see the sketch after this list)
      • Rationale behind...
    • Problem: Condor job draining not working (fix now in Koji: see BEER-12); there will now also be a way to hard-kill jobs.
  • PPS sailing towards 2^32+1 (almost 4.3B; at 4.0B now; suggested by and agreed with GeorgiosB)
  • Still confused about drain interference on PPS (sorry). EOS-2517 (thanks to A. Manzi)
    • might have been in "draindead" state, not visible in some tools.
  • No time to check newfind yet (sorry again)
  • Preparing for multiple-mountpoint tests (e.g. 300 jobs x 100 mountpoints). OK on UAT
  • FUSEX issues are at a "residual" state now. Can we have FUSEX on, say, LHCb? I understand USER is "special".
    • no recovery for deleted files yet -> add to the list as a showstopper. Luca to provide the list.
    • enable on "qa" (i.e. on 10% of batch) afterwards.
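
A minimal sketch of the beer.sh-style read load mentioned above, assuming it uses xrdcp against a single FST (host and file path are placeholders):

    # stream a file from one FST to /dev/null to generate read-only load
    xrdcp -f root://eosfst001.cern.ch//eos/pps/somefile /dev/null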

 

(Jan)

  • (FUSE / FST data loss: need a postmortem at ASDF this week - due to the "emergency" change of the writeback cache setting).
  • FUSE client status: 4.2.18 in production, 4.2.19 in qa (but known to be broken - EOS-2486, needs 4.2.22)
  • FUSEX: created accounts for the AFS-phaseout people, BUT not announced yet
  • A. Lossent would like to use FUSEX for OpenShift (some oddities with credential access when containerized)

(Dan)

  • Thomas is testing the latest eosclient Puppet module and RPM v4.2.18 with locmap for the CC7.5 desktop release.
    • Q: can we push the new xrootd-4.8.3 to desktops? Yes, it is compatible (SSI is incompatible, but that is server-side only). We could "force" this by tagging it into the "eos" repo (see the sketch after this list).
    • Thomas' proposal is to have desktops point at the Koji repo, so that we keep control of what gets deployed.
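
A hypothetical sketch of that "forcing by tagging", assuming the standard koji CLI; the tag and build names are placeholders:

    # tag the xrootd build into the tag that feeds the "eos" repo
    koji tag-build eos7-stable xrootd-4.8.3-1.el7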

(Andreas)
eosxd

  • increased the very short default timeouts from 15s to 30/60s ... added internal eosxd logic to get rid of never-returned outstanding XrdCl asynchronous messages (hard timeout, gives EIO)
  • limit the number of files in the file-start cache to 64k by default, i.e. added an inode limit to the local cache
  • added 'xoff' functionality to throttle IOPS per file and not overrun XrdCl, which fails once more than 64k requests are in flight (one can hit this via one-byte writes; now a single requester can only have 1k in flight)

eosd

  • allow private mounts for root users by setting EOS_FUSE_PRIVATE_ROOT_MOUNT (not released yet; see the sketch after this list)
  • stop writing (which triggered re-opening) when the FST open has failed
  • remove deprecated authentication recovery code for XRootD 3.3.X
  • avoid creating a new connection per request in certain circumstances (Georgios) (ProtoDUNE did this, which exhausted FDs and increased MGM memory use)
  • fix a segfault on shutdown (Georgios)
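
A hypothetical usage sketch once this is released; only the variable name comes from the item above, the eosd invocation itself is an assumption:

    # allow a private eosd mount while running as root
    export EOS_FUSE_PRIVATE_ROOT_MOUNT=1
    eosd /eos/private-mnt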

Andreas/eosxd: one PPS machine now has 100 mounts; run the sync test there. Could we do this on all CC7 machines to get a "many clients" test? Yes. Q: can we do this in containers? Yes. (A sketch of such a setup is below.)
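
A minimal sketch of such a many-mounts setup; the eosxd options and the target instance are assumptions:

    # create 100 eosxd mountpoints on one node
    for i in $(seq 1 100); do
        mkdir -p /eos_test/mnt$i
        eosxd -ofsname=eospps.cern.ch /eos_test/mnt$i
    done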

Luca: has a config change slipped into Puppet that forces KRB5 (seen by Jesus, who contacted Dan)? It comes from a renamed config tree and was supposed to keep working (affected CERNBox, discovered yesterday). Note: the change was pushed a month ago and one issue was seen immediately - that might have been a still-running mount. Should we "grep" across the Puppet configs? No: first cross-check with Jesus, then (perhaps) fix it in the module.

 


● Development issues

(Georgios)

  • New NS: implementing asynchronous reads on all metadata operations, to prevent locking the namespace while waiting on network requests towards QDB. Expected to fix many issues.
    • Asynchronous writes are already there (MetadataFlusher), no way we'd be getting 1kHz of file creations otherwise.

(Michal):

xrootd-4.8.3 expected this week.

 

(Jan):

Want to restart the "qa" cycle with 4.2.22; need to wait for 4.8.3. Can be done this week. Careful: eos-4.2.22 (server) has a new shared dependency.

 

Timetable:

  • 16:00-16:10  (go through last week's meeting) (10m)
  • 16:10-16:20  EOS production instances (LHC, PUBLIC, USER) (10m)
    • major events last week
    • planned work this week
    Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)
  • 16:20-16:25  EOS clients, FUSE(X) (5m)
    • (major) issues seen
    • rollout of new versions and FUSEX
    Speakers: Dan van der Ster (CERN), Jan Iven (CERN)
  • 16:25-16:35  Development issues (10m)
    • new namespace
    • testing
    • XRootD
    Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)
  • 16:35-16:50  AOB (15m)