EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description

Weekly meeting to discuss progress on EOS rollout.

  • please keep content relevant to (most of) the audience, explain context
  • Last week: major issues, preferably with ticket
  • This week/planning: who, until when, needs what?

Add your input to the "contribution minutes" before the meeting; otherwise it will be covered under "AOB".

 

● EOS production instances (LHC, PUBLIC, USER)

EOSCMS

  • updated to 4.3.8
  • The slave now interferes with normal operations when central draining is enabled (fixed in git)
  • Boot-syncing (gently) the filesystems of ~40 nodes to fix an issue retrieving the metadata of some file replicas (the replicas themselves are healthy on the disk servers). (Roberto)
    • Elvin: this comes from not doing the full disk scan+sync at every boot. Might stem from last week's crashes; would FSCK report it?
    • What can be done "automatically"? MGM and disk agree; only the local FMD is missing the information, which causes errors on open.
    • Would FSCK move such files to "orphan"? Probably.
    • Roberto to open a JIRA ticket with some examples, to check whether granting access is safe.
    • Mid-term: drop FMD completely and just use extended attributes?

EOSATLAS

  • Re-enabled IPv6 (green light from E. Martelli)
  • Also saw user-induced overload (but did not restart). Also saw many stalled draining filesystems (set to off, restarted).
  • Might also still have the slave-failover bug where limitations get forgotten.
  • Also still see user-level FUSE mounts from LXBATCH.

QuarkDB repo URL

  • The default value for the "eos" hostgroup (next branch) has been updated to use the new linuxsoft location instead of storage-ci
    • not changed for EOSHOME -> Luca

"next" environment promotion

Would like to "promote" the "next" branch to master and move machines using the "old" configuration into a "legacy" environment/branch; this is expected to be transparent (affects ~200 machines).

It is running on ~1150 nodes. Breakdown of nodes per hostgroup still reporting SLC6:

        eos/atlas/srm                            found 2 times
        eos/backup/servers                       found 1 times
        eos/backup/storage                       found 33 times
        eos/cms/srm                              found 2 times
        eos/cms/storage                          found 7 times
        eos/dev                                  found 4 times
        eos/genome/servers                       found 2 times
        eos/genome/storage                       found 6 times
        eos/kinetic                              found 1 times
        eos/lhcb/srm                             found 3 times
        eos/media/servers                        found 2 times
        eos/media/storage                        found 13 times
        eos/pps/srm                              found 1 times
        eos/public/gridftp                       found 7 times
        eos/public/http                          found 3 times
        eos/public/servers                       found 2 times
        eos/public/srm                           found 2 times
        eos/public/storage                       found 179 times
        eos/spare                                found 2 times
        eos/uat/servers                          found 1 times
        eos/uat/storage                          found 8 times
        eos/up2u/servers                         found 1 times
        eos/up2u/storage                         found 2 times
        eos/user/gridftp                         found 2 times
        eos/user/servers                         found 2 times
        eos/user/storage                         found 30 times
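For context, a breakdown like the one above can be produced from a flat list of hostgroup assignments with standard tools. This is a minimal sketch; the input file name and its one-hostgroup-per-line format are assumptions (the real data presumably comes from the configuration database):

```shell
# Hypothetical input: one line per SLC6 node, containing its hostgroup.
cat > /tmp/slc6_hostgroups.txt <<'EOF'
eos/public/storage
eos/backup/storage
eos/public/storage
eos/user/storage
EOF

# Aggregate into the "found N times" summary shown above.
sort /tmp/slc6_hostgroups.txt | uniq -c \
  | awk '{printf "%-40s found %d times\n", $2, $1}' \
  | tee /tmp/slc6_summary.txt
```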

 

Note: I wasn't able to get SRM to work with CentOS 7; all the other components are OK.


● EOS clients, FUSE(X)

(Jan):

  • eos-4.3.9 was tagged into production (less than the usual 7-day "qa" period, i.e. "emergency", but so far nobody has noticed problems)
  • EOSTEST, PLUS (and probably BATCH) on SLC6 are stuck on xrootd-4.8.3 (intentional) and cause YUM errors (not intentional; a 32-bit library issue, see mail from Jan)
    • https://its.cern.ch/jira/browse/CRM-2799 - will deprecate 32-bit on SLC6 unless really needed.
    • Discussion: do we need to push this as "emergency", or tag 4.8.4? Not really; it seems to block neither /eos/home nor the 4.3.9 rollout.
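If the YUM errors are purely due to 32-bit packages, one possible stop-gap (an assumption on my part, not the agreed fix) is to exclude i686/i386 packages from yum transactions on the affected SLC6 nodes. Sketched here on a scratch file so it can be tried without root:

```shell
# Sketch on a scratch copy; on a real node this would be /etc/yum.conf.
conf=/tmp/yum.conf.sketch
printf '[main]\ngpgcheck=1\n' > "$conf"

# Exclude 32-bit packages from all yum transactions.
echo 'exclude=*.i686 *.i386' >> "$conf"

cat "$conf"
# Alternatively, per invocation: yum --exclude='*.i686' update
```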

● Development issues

(Georgios)

  • QDB koji repos are now public, available even to external people (https://linuxsoft.cern.ch/repos/quarkdb7-stable/x86_64/).
  • Please use these instead of storage-ci for production. Example repofile:

[quarkdb-stable]
name=QuarkDB repository [stable]
baseurl=http://linuxsoft.cern.ch/repos/quarkdb7-stable/x86_64/os
enabled=1
gpgcheck=0

[quarkdb-stable-debug]
name=QuarkDB repository [debug]
baseurl=http://linuxsoft.cern.ch/repos/quarkdb7-stable/x86_64/debug
enabled=1
gpgcheck=0
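As a quick sanity check before rolling this out via configuration management, the repo file can be dropped in place by hand and inspected. A sketch (written to /tmp here so it runs unprivileged; on a real node the file belongs in /etc/yum.repos.d/):

```shell
# Write the stable repo definition (same content as the example above).
repofile=/tmp/quarkdb-stable.repo
cat > "$repofile" <<'EOF'
[quarkdb-stable]
name=QuarkDB repository [stable]
baseurl=http://linuxsoft.cern.ch/repos/quarkdb7-stable/x86_64/os
enabled=1
gpgcheck=0
EOF

# Minimal syntax check: section header and baseurl must be present.
grep -q '^\[quarkdb-stable\]' "$repofile" && grep -q '^baseurl=http' "$repofile" && echo 'repofile OK'
# On a real node: sudo cp "$repofile" /etc/yum.repos.d/ && yum makecache
```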


● AOB

(Jan)

  • hardlinks have been implemented in FUSEX but need a non-default config option to work; please make this the default in 4.3.10. Elvin: OK

(Massimo)

  • 4.3.9 indeed fixes the error at login (thanks). Cannot yet submit batch jobs, since the "bigbird" machines have not yet been upgraded (submission will be possible once logs go to AFS); will take this up with Ben.

Deadlines+Plans:

Massimo: the xrootd redirector cannot "redirect" FUSE; a new alias needs to be deployed before the DNS alias can be changed - announced for Sep 17. Will try to contact all users in advance, but they are hard to identify from logs (<0.1%). Also, EOSHOME becomes the default for new users on that date.

(Eddy) Migration status: migrating IT-CM today. Full IT is next.

(Luca): please review KB0005691, "known issues on EOSHOME".

(Dirk): one "analytics" box is stuck on 4.2.2 - can we see this version from the server side? (Not really; a too-old client will fail, the owner will contact us, and it will get updated. FUSEX reports the client version and can drop too-old clients.)
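On spotting too-old clients: since FUSEX reports the client version, a server-side summary could look like the sketch below. The log format and field layout here are purely illustrative assumptions, not the actual MGM log format:

```shell
# Illustrative sample of per-mount log lines carrying a client version field.
cat > /tmp/fusex_clients.log <<'EOF'
host=lxplus001 app=fusex version=4.3.9
host=analytix01 app=fusex version=4.2.2
host=lxplus002 app=fusex version=4.3.9
EOF

# Count mounts per reported version to flag stragglers such as 4.2.2.
grep -o 'version=[0-9.]*' /tmp/fusex_clients.log \
  | sort | uniq -c | sort -rn | tee /tmp/fusex_versions.txt
```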

 

 

    • 16:00–16:20
      EOS production instances (LHC, PUBLIC, USER) 20m
      • major events last week
      • planned work this week
      Speakers: Cristian Contescu (CERN), Herve Rousseau (CERN), Hugo Gonzalez Labrador (CERN), Luca Mascetti (CERN)

    • 16:20–16:25
      EOS clients, FUSE(X) 5m
      • (major) issues seen
      • Rollout of new versions and FUSEX
      Speakers: Dan van der Ster (CERN), Jan Iven (CERN)

    • 16:25–16:35
      Development issues 10m
      • New namespace
      • Testing
      • Xrootd
      Speakers: Andreas Joachim Peters (CERN), Elvin Alin Sindrilaru (CERN), Georgios Bitzes (CERN), Jozsef Makai (CERN), Michal Kamil Simon (CERN)

    • 16:35–16:50
      AOB 15m
