EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description
Weekly meeting to discuss progress on EOS rollout

(no meeting, just collected notes)

 

● production instances

EOSATLAS crashes

What we thought were crashes caused by sssd and SYN cookies now (probably) turn out to come from a user doing nasty things, such as hammering the namespace with illegal filenames (among other things).

Could we put an emphasis on validating user input
(to prevent requests like stat('/../') from crashing the namespace)?
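
As an illustration of the kind of check meant here, a minimal sketch (not EOS code; the function name, the exception class and the length limit are all assumptions made for the example) that normalises a client-supplied path and rejects anything climbing above the root before it reaches the namespace:

```python
class InvalidPath(ValueError):
    """Raised when a client-supplied path is rejected (hypothetical name)."""


MAX_PATH_LEN = 4096  # assumed limit for the sketch, not an EOS constant


def validate_path(raw: str) -> str:
    """Return a normalised absolute path, or raise InvalidPath."""
    if not raw or len(raw) > MAX_PATH_LEN:
        raise InvalidPath("empty or over-long path")
    if not raw.startswith("/"):
        raise InvalidPath("path must be absolute")
    # Reject control characters (including NUL and DEL) outright.
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in raw):
        raise InvalidPath("control character in path")

    parts = []
    for comp in raw.split("/"):
        if comp in ("", "."):
            continue  # skip empty components and "."
        if comp == "..":
            if not parts:
                # "/../" would step above the root: refuse instead of guessing
                raise InvalidPath("path escapes the root")
            parts.pop()
        else:
            parts.append(comp)
    return "/" + "/".join(parts)
```

With this sketch, validate_path('/../') raises InvalidPath, while validate_path('/eos/lhcb/./user//m/') returns '/eos/lhcb/user/m'. Refusing the request rather than silently clamping to '/' is a design choice of the example only.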

FST Upgrade

A new version of EOS will be released soon to fix a bug causing files to disappear (as seen on EOSATLAS)

XRootD DSI

New release from Andrea that makes the GridFTP server report an error on xrootd file handle close

 


● FUSE and client versions

Q: 4.1.26 fixes an issue for both ATLAS and CMS environments (C++ ABI). Can we build and deploy it (perhaps even to desktops)?

Mail from Andreas:

> we have some real problem with the FUSE mount. I experienced that myself
> yesterday. The mount gives arbitrarily long delays on mv or file upload. I
> checked one of the hangs this morning and it does not look like a 
> problem with the server side.

> Elvin could you try to investigate that? It is really making the thing
> more or less unusable and I can really understand why people get frustrated.

> We also should send out (again) the message, that this is not working 
> for any kind of software compilation, distribution and people should not
> move their activity out of AFS because we know upfront it does not work. 
> I would still seriously consider to remove the generic mounts or to make 
> them read-only. Better less functionality than a broken one.

● Citrine rollout

Seeing more and more of these (on EOSLHCB for instance)

$ eos file info fid:85870879
  File: '/eos/lhcb/user/m/mwilkins/Testbeam/June2017/N_9/output_220/channels/channel_426.root'  Flags: 0644
  Size: 10147
Modify: Mon Jun 26 12:40:50 2017 Timestamp: 1498473650.289299785
Change: Mon Jun 26 12:40:50 2017 Timestamp: 1498473650.289299785
  CUid: 66834 CGid: 1470  Fxid: 051e491f Fid: 85870879    Pid: 12223520   Pxid: 00ba8420
XStype: adler    XS: 1e d3 49 28     ETAG: 23050788561485824:1ed34928
replica Stripes: 2 Blocksize: 4k LayoutId: 00600112
  #Rep: 48
 #   fs-id  #...................................................................................................................................
            #                   host  #     schedgroup #           path #     boot # configstatus #      drain # active #                 geotag
            #...................................................................................................................................
  0    3891  p05496589g79192.cern.ch         default.5          /data08     booted             rw      nodrain   online      0513::R::0050::RE41
  1    5024  p05151207152278.cern.ch         default.5          /data14     booted             rw      nodrain   online      9918::R::0002::WH04
  2    6976  p05614920x40969.cern.ch         default.5          /data19     booted             rw      nodrain   online      0513::R::0050::RL53
  3    3663  p05153074053268.cern.ch         default.5          /data23     booted             rw      nodrain   online      0513::R::0050::RD03
  4    4079  p05153074552515.cern.ch         default.5          /data04     booted             rw      nodrain   online      0513::R::0050::RE59
  5    9523  p06253967e31431.cern.ch         default.5          /data39     booted             rw      nodrain   online      9918::R::0001::WF07
  6    8494  p05799459g38700.cern.ch         default.5          /data49     booted             rw      nodrain   online      9918::R::0001::WF02
  7    7054  p05614920s30873.cern.ch         default.5          /data31     booted             rw      nodrain   online      0513::R::0050::RL57
  8    8374  p05799459t78824.cern.ch         default.5          /data31     booted             rw      nodrain   online      9918::R::0001::WD14
  9    3940  p05496644y10496.cern.ch         default.5          /data08     booted             rw      nodrain   online      9918::R::0002::WH07
 10    8862  p06253937q69424.cern.ch         default.5          /data47     booted             rw      nodrain   online      0513::R::0050::RL58
 11    6580      lxfsrd08c02.cern.ch         default.5          /data17     booted             rw      nodrain   online      0513::R::0050::RD08
 12    5913      lxfsrf09c05.cern.ch         default.5          /data28     booted             rw      nodrain   online      0513::R::0050::RF09
 13    4731      lxfsrf16b06.cern.ch         default.5          /data29     booted             rw      nodrain   online      0513::R::0050::RF16
 14    9607  p06253967r10312.cern.ch         default.5          /data47     booted             rw      nodrain   online      9918::R::0001::WF06
 15    5113  p05151207630515.cern.ch         default.5          /data14     booted             rw      nodrain   online      9918::R::0002::WH02
 16    4868      lxfsrf12b07.cern.ch         default.5          /data35     booted             rw      nodrain   online      0513::R::0050::RF12
 17    9521  p06253967q55782.cern.ch         default.5          /data44     booted             rw      nodrain   online      9918::R::0001::WF06
 18    8071  p05614923q22502.cern.ch         default.5          /data31     booted             rw      nodrain   online      9918::R::0002::WB09
 19    7985  p05614923c48223.cern.ch         default.5          /data37     booted             rw      nodrain   online      9918::R::0001::WA10
 20   10541  p06636710p40569.cern.ch         default.5          /data21     booted             rw      nodrain   online      0513::R::0050::CN03
 21    5189      lxfsrf12b04.cern.ch         default.5          /data07     booted             rw      nodrain   online      0513::R::0050::RF12
 22    3788  p05496644y19645.cern.ch         default.5          /data11     booted             rw      nodrain   online      9918::R::0002::WH07
 23    3831  p05496644h55020.cern.ch         default.5          /data04     booted             rw      nodrain   online      9918::R::0002::WH10
 24    9528  p06253967e37850.cern.ch         default.5          /data32     booted             rw      nodrain   online      9918::R::0001::WF07
 25    8613  p05799459s52099.cern.ch         default.5          /data04     booted             rw      nodrain   online      9918::R::0001::WD13
 26    6380      lxfsre16c04.cern.ch         default.5          /data14     booted             rw      nodrain   online      0513::R::0050::RE16
 27    5898      lxfsre10c05.cern.ch         default.5          /data18     booted             rw      nodrain   online      0513::R::0050::RE10
 28    6438      lxfsre12c04.cern.ch         default.5          /data09     booted             rw      nodrain   online      0513::R::0050::RE12
 29    6545      lxfsrd06c02.cern.ch         default.5          /data23     booted             rw      nodrain   online      0513::R::0050::RD06
 30    4011  p05153074580988.cern.ch         default.5          /data14     booted             rw      nodrain   online      0513::R::0050::RE59
 31    5342      lxfsrf14b03.cern.ch         default.5          /data01     booted             rw      nodrain   online      0513::R::0050::RF14
 32    5676      lxfsrf16b04.cern.ch         default.5          /data31     booted             rw      nodrain   online      0513::R::0050::RF16
 33    6359      lxfsre12c03.cern.ch         default.5          /data16     booted             rw      nodrain   online      0513::R::0050::RE12
 34    6417      lxfsre14c03.cern.ch         default.5          /data09     booted             rw      nodrain   online      0513::R::0050::RE14
 35   10049  p06636710d42351.cern.ch         default.5          /data20     booted             rw      nodrain   online      0513::R::0050::CN01
 36    5455      lxfsrf16b01.cern.ch         default.5          /data08     booted             rw      nodrain   online      0513::R::0050::RF16
 37   10757  p06636710f94075.cern.ch         default.5          /data22     booted             rw      nodrain   online      0513::R::0050::CH07
 38    3009  p05153074966793.cern.ch         default.5          /data05     booted             rw      nodrain   online      0513::R::0050::RD21
 39    4630      lxfsrf16b08.cern.ch         default.5          /data03     booted             rw      nodrain   online      0513::R::0050::RF16
 40    5998      lxfsrf11c05.cern.ch         default.5          /data02     booted             rw      nodrain   online      0513::R::0050::RF11
 41    6571      lxfsrd04c02.cern.ch         default.5          /data12     booted             rw      nodrain   online      0513::R::0050::RD04
 42    6620      lxfsrd02c04.cern.ch         default.5          /data02     booted             rw      nodrain   online      0513::R::0050::RD02
 43    3973  p05153074552024.cern.ch         default.5          /data02     booted             rw      nodrain   online      0513::R::0050::RE59
 44    6197      lxfsrf15c03.cern.ch         default.5          /data02     booted             rw      nodrain   online      0513::R::0050::RF15
 45   10615  p06636710n28975.cern.ch         default.5          /data37     booted             rw      nodrain   online      0513::R::0050::CH03
 46    7135  p05614920w85109.cern.ch         default.5          /data41     booted             rw      nodrain   online      0513::R::0050::RL45
 47    5251      lxfsrf14b01.cern.ch         default.5          /data01     booted             rw      nodrain   online      0513::R::0050::RF14
*******

Haven't had the time to investigate yet
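
To help spot further cases, a rough sketch follows. It assumes that the anomaly is the gap between "Stripes: 2" in the layout line and "#Rep: 48" in the output above, which the notes do not state explicitly. The function name and the regular expressions are made up for the example and only match the text format shown here:

```python
import re


def replica_counts(info_text: str):
    """Return (stripes, reported, listed) parsed from `eos file info` text.

    stripes  - value after "Stripes:" in the layout line
    reported - value after "#Rep:"
    listed   - number of replica rows (index, fs-id, host) actually printed
    """
    stripes = re.search(r"Stripes:\s*(\d+)", info_text)
    reported = re.search(r"#Rep:\s*(\d+)", info_text)
    if stripes is None or reported is None:
        raise ValueError("no Stripes:/#Rep: fields found in input")
    # A replica row starts with two integers followed by a dotted host name.
    rows = re.findall(r"^[ \t]*\d+[ \t]+\d+[ \t]+\S+\.\S+", info_text, flags=re.M)
    return int(stripes.group(1)), int(reported.group(1)), len(rows)


def is_over_replicated(info_text: str) -> bool:
    """True if more replicas are reported than the layout asks for."""
    stripes, reported, _listed = replica_counts(info_text)
    return reported > stripes
```

Fed with the output above, replica_counts would give (2, 48, 48), so is_over_replicated returns True. The text could come from a saved copy of the command output; how to enumerate candidate file ids is left open.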


● BATCH integration

(from Massimo): Batch tests ran OK this week (last week they did not, but that was more a problem on the batch system). Will expand the tests a bit to be closer to ABP/Sixtrack.


● Xrootd

  • Fixed "gsi bug reported by Hervé", backported as 3.3.6-6.CERN (-> roll-out?)
    • 16:00 16:05
      overall 2017 planning 5m
      Speaker: Jan Iven (CERN)
    • 16:05 16:30
      operations: production
      • 16:05
        production instances 5m
        Speaker: Herve Rousseau (CERN)


      • 16:10
        CERNBOX and EOSUSER 5m
        Speaker: Luca Mascetti (CERN)
      • 16:15
        FUSE and client versions 5m
        Speaker: Dan van der Ster (CERN)

      • 16:20
        Citrine rollout 5m
        Speaker: Herve Rousseau (CERN)


      • 16:25
        SWAN 5m
        Speaker: Jakub Moscicki (CERN)
    • 16:30 16:50
      development: near-term
      • 16:30
        nextgen FUSE 5m
        Speaker: Andreas Joachim Peters (CERN)
      • 16:35
        new Namespace 5m
        Speaker: Elvin Alin Sindrilaru (CERN)
    • 16:50 17:45
      other: pilot services, long-term dev, external
      • 16:50
        Webservice 5m
        Speaker: Luca Mascetti (CERN)
      • 16:55
        Backup 5m
        Speaker: Luca Mascetti (CERN)
      • 17:00
        Samba 5m
        Speaker: Luca Mascetti (CERN)
      • 17:05
        $HOME structure 5m
        Speaker: Luca Mascetti (CERN)
      • 17:10
        BATCH integration 5m
        Speaker: Massimo Lamanna (CERN)


      • 17:15
        Xrootd 5m
        Speaker: Michal Kamil Simon (CERN)
      • 17:20
        AOB 5m