EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be taken under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

EOSHOME

  • all FSTs upgraded to 4.3.11 (since last Thursday evening)
  • -i00 MGM is on 4.3.11, also the eoshome redirector. Plan to update the other MGMs as well.

EOSCMS

  • Another occurrence of 'rm' overload (the first was last week); updated the namespace to 4.3.11, which fixes the bug
    • triggered compaction (CMS complained about "unable to delete")
    • might be linked to the /var full alarm (does EOS look at the right place? The MGMs have a separate partition for the metadata)
  • Files "masked/not readable" after compaction (either no replica, or not even in the namespace)
    • Full list of files (150) received from CMS, can attempt manual recovery
      • these are the files already seen as missing by CMS; the list might be incomplete.
    • Would it be easier/less time-consuming to "extract" the data from the previous MD files?
    • Q: could it be booted from scratch? No - the files would still be gone (they are missing from the compacted MD).
    • AP: would recovering the 150 files be fastest via "grep" in the logs?
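
The "grep in logs" action point above could be sketched roughly as follows. This is only an illustration: the log line layout, the sample paths and the helper name find_paths_in_logs are assumptions, not the actual MGM log format or any existing tool.

```python
from collections import defaultdict


def find_paths_in_logs(paths, log_lines):
    """Return {path: [log lines mentioning it]} for every path found.

    Plain substring matching, like "grep -F": a path that is a prefix of
    another path will also match the longer one.
    """
    wanted = set(paths)
    hits = defaultdict(list)
    for line in log_lines:
        for path in wanted:
            if path in line:
                hits[path].append(line.rstrip("\n"))
    return dict(hits)


# Hypothetical inputs: two "missing" files and two made-up log lines.
missing = ["/eos/cms/store/a.root", "/eos/cms/store/b.root"]
log = [
    "ts=1 op=rm path=/eos/cms/store/a.root fid=00000001\n",
    "ts=2 op=open path=/eos/cms/store/c.root fid=00000002\n",
]

found = find_paths_in_logs(missing, log)
not_found = sorted(set(missing) - set(found))
```

Files that end up in not_found would still need another recovery route, e.g. the previous MD files mentioned above.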

Heavy-Ions data-challenge

  • Starting tomorrow with ALICE
    • synthetic tests are more than 2x faster (EOSPUBLIC -> EOSALICEDAQ: 12 GB/s, iperf P2 -> EOSALICEDAQ: 12 GB/s) than P2 storage -> EOS (4.8 GB/s). The difference is not understood; will see tomorrow.
    • note: CMS-side client was CPU-bound
  • ATLAS and CMS joining on Thursday
    • LHCB continues with normal production

 

AOB/Cristi: slow FST update script now works on "virtual" servers, is in GIT repo.


● EOS clients, FUSE(X)

(jan)

  • eos-4.3.11 is now in "qa" (btw: no release announcement?) - CRM-2823
  • FUSEX / microtests: reduced coverage from "microbench" to "ci"
    • reverted since then, because no fresh Grafana data arrived - need to kick out too-slow tests, or find out why no data got sent.

 

(andreas) - things below are not yet tagged.

  • tracked the LHCB QA machine problems with EOSXD down to an 'out of file descriptors' case
    • prevents reading the /proc entries needed to assign the kerberos token (falls back to unix in this case, gives "permission denied" to the client)
    • moving to 512k as the default FD limit (there was a use case merging 10-100k files)
       
  • observed memory accumulation on HOME migration machines (4-6 GB RSS)
    • improved the memory release functionality, now back to initial RSS after 8 min (helps on non-automount, which otherwise would not release memory back; would explain the observed OOMs of EOS FUSEX daemons)
       
  • disabled a config value that was a main performance bottleneck: it requires create/open(w) to wait for creation on both FSTs
    • this had been enabled because recovery wasn't working reliably (yet)
      • data is still in local journal, but might be lost after a client crash (missing credentials)
    • was triggered by FORTRAN microtest
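
For the 'out of file descriptors' item above, a minimal sketch of how a Linux process could check its own headroom against the soft limit. This is a generic illustration, not how eosxd does it; the helper names fd_headroom and current_fd_usage are made up here.

```python
import os
import resource


def fd_headroom(open_fds, soft_limit):
    """Descriptors still available before the soft limit is reached."""
    return max(soft_limit - open_fds, 0)


def current_fd_usage():
    """Return (open_fds, soft_limit) for the calling process.

    Linux-only: counts the entries in /proc/self/fd. The soft limit may be
    resource.RLIM_INFINITY, which callers should treat as "no limit".
    """
    open_fds = len(os.listdir("/proc/self/fd"))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


# With a 512k limit, a job holding 100k files open still has ample room.
room_at_512k = fd_headroom(100_000, 512 * 1024)
# With a typical 1024 default the same job is far past the limit.
room_at_1024 = fd_headroom(100_000, 1024)
```
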

 

Note: an AIADM-EOS test cluster is being set up by CDA ($HOME dir on /eos) and is discovering known issues (slow GIT commands?, incl. "git status"). Is "EOS migration" being misunderstood?


● Development issues

(Georgios)

  • It was possible to make a directory disappear by moving it into a subdirectory of itself. EOS already contained a protection against this, but it could be defeated by using symlinks (EOS-2850). Affects both namespaces.
  • Files containing a question mark in the path cannot be read by the cernbox web preview (EOS-2869).
  • A few thread-safety fixes for eosxd (two data races, one double-unlock).
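
To illustrate the EOS-2850 class of problem on an ordinary local filesystem (not the EOS namespace code): a purely textual "is the destination below the source?" check misses a destination reached via a symlink, whereas resolving symlinks first catches it. The function name would_move_into_itself is hypothetical.

```python
import os


def would_move_into_itself(src, dst):
    """True if moving directory src to dst would place it inside itself.

    Both paths are resolved with os.path.realpath first, so a destination
    that goes through a symlink pointing back into src is also detected.
    dst may not exist yet; realpath resolves as much of it as it can.
    """
    src_real = os.path.realpath(src)
    dst_real = os.path.realpath(dst)
    return os.path.commonpath([src_real, dst_real]) == src_real
```

Example: with a directory a/b and a symlink "link" pointing to a/b, moving "a" to "link/a" looks harmless when comparing path strings, but the check above reports it as a move into itself.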

Discussion: to tag, or not to tag? Luca would like the FD limit to be increased (could be done via config, but the new default is higher). Please tag.

Do we go for "qa" in ATLAS, or production on "PUBLIC"? Massimo would prefer to have more/wider/longer "qa".

PUBLIC needs to be on 4.3.11 server-side ("rm" protection), before this can go to "production".


● AOB

(Jan)

  • xrootd client - suggest making clients upgrade to xrootd-4.8.4 by removing the (higher-prio) xrootd-4.8.3 from the "eos" YUM repo. Does this need a CRM?
    • 16:00 16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

    • 16:20 16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

    • 16:25 16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)

    • 16:35 16:50
      AOB 15m
