EOS DevOps Meeting

Europe/Zurich
513/R-068 (CERN)

Jan Iven (CERN)
Description
Weekly meeting to discuss progress on EOS rollout
    • 16:00 - 16:05
      overall 2017 planning 5m
      Speaker: Jan Iven (CERN)

      Things to announce/update for ITUM-22 (Mon Oct 2; input before Sept 11)?

      • FUSEng "pilot"
      • AFS phaseout status (i.e. EOS new namespace rollout plan)

    • 16:05 - 16:30
      operations: production
      • 16:05
        production instances 5m
        Speaker: Herve Rousseau (CERN)

        Still see ".xsmap" error (in "critical EOS error"-mail) - OK to run cleanup campaigns, or still waiting for FST updates?

        • EOSATLAS - 2017-08-29
        • EOSUSER, EOSCMS - 2017-08-26

        Needs MGM 0.3.267 (which is not deployed anywhere in production)


        "dedicated Kerberos5 principal" (EOS-1234): rolled out on EOSPUBLIC (and EOSPPS) - other instances?

        • add to "qa" for all instances

      • 16:10
        CERNBOX and EOSUSER 5m
        Speaker: Luca Mascetti (CERN)

        Waiting for the new 0.3.267+1 tagged release.

        The master+slave probe found that the file changelog can be followed, but not the directory changelog (probe-internal error - maybe remove recycle bin).

        Still see various TPC "xrdcp" errors (6 different ones?). Follow-up: will go through the logs, then create tickets. Also, EOS "tpc" works differently from standard Xrootd TPC - it apparently has a compiled-in "tpc" command and does not use the config file directive.

      • 16:15
        FUSE and client versions 5m
        Speaker: Dan van der Ster (CERN)

        4.1.27 got tagged (fixes a memory leak), but fails to build on KOJI for SLC6/i386 (in some unrelated area). Builds OK on Jenkins, though.

      • 16:20
        Citrine rollout 5m
        Speaker: Herve Rousseau (CERN)

        EOSPUBLIC & EOSLHCB

        Updated to 4.1.27 to fix various issues (file deletion, compaction crash and built-in HTTP server IPv6 support)

        EOSALICE

        Proposed next TS to upgrade to Citrine, expected downtime ~4h

      • 16:25
        SWAN 5m
        Speaker: Jakub Moscicki (CERN)

        Reported three more "eosd" crashes on SWAN with 4.1.18 (these might be duplicates).

    • 16:30 - 16:50
      development: near-term
      • 16:30
        nextgen FUSE 5m
        Speaker: Andreas Joachim Peters (CERN)
      • 16:35
        new Namespace 5m
        Speaker: Elvin Alin Sindrilaru (CERN)
    • 16:50 - 17:45
      other: pilot services, long-term dev, external
      • 16:50
        Webservice 5m
        Speaker: Luca Mascetti (CERN)
      • 16:55
        Backup 5m
        Speaker: Luca Mascetti (CERN)
      • 17:00
        Samba 5m
        Speaker: Luca Mascetti (CERN)
      • 17:05
        $HOME structure 5m
        Speaker: Luca Mascetti (CERN)

        AFS removal (Massimo's test)

        Test account ml001 had a home directory (~) and a work directory. Test: remove AFS as a service.

        • As agreed, PWD becomes "/"
        • ~ seems to be unreachable, while the work directory seems to be gone
        • The ubackup directory is also gone immediately
        • AFS restore seems to work (both ~ and work)

        ===

        Login after AFS opt-out

        -bash-4.1$ ssh -l ml001 lxplus
        Warning: Permanently added the RSA host key for IP address '188.184.93.168' to the list of known hosts.
        Password:
        * ********************************************************************
        * Welcome to lxplus069.cern.ch, SLC, 6.9
        * Archive of news is available in /etc/motd-archive
        * Reminder: You have agreed to comply with the CERN computing rules
        * https://cern.ch/ComputingRules
        * Puppet environment: production, Roger state: production
        * Foreman hostgroup: lxplus/nodes/login
        * LXPLUS Public Login Service
        * ********************************************************************
        -bash-4.1$ df
        Filesystem         1K-blocks          Used     Available Use% Mounted on
        /dev/vda1           82533532      55440472      22903508  71% /
        tmpfs               15041868            64      15041804   1% /dev/shm
        /dev/vdb           154687468     101795652      45027496  70% /tmp
        /dev/vdb           154687468     101795652      45027496  70% /var/tmp
        sssd                  307200         82352        224848  27% /var/lib/sss/db
        AFS                  9000000             0       9000000   0% /afs
        cvmfs2              30720000      28997235       1722766  95% /cvmfs/atlas.cern.ch
        cvmfs2              30720000      28997235       1722766  95% /cvmfs/cms.cern.ch
        eosatlas        128000000000             0  128000000000   0% /eos/atlas
        cvmfs2              30720000      28997235       1722766  95% /cvmfs/lhcb.cern.ch
        eosuser        3475066307484 1954339025428 1520727282056  57% /eos/user
        cvmfs2              30720000      28997235       1722766  95% /cvmfs/sft.cern.ch
        cvmfs2              30720000      28997235       1722766  95% /cvmfs/ganga.cern.ch
        cvmfs2              30720000      28997235       1722766  95% /cvmfs/clicdp.cern.ch
        cvmfs2              30720000      28997235       1722766  95% /cvmfs/atlas-nightlies.cern.ch
        eoscms          128000000000             0  128000000000   0% /eos/cms
        cvmfs2              30720000      28997235       1722766  95% /cvmfs/alice.cern.ch
        -bash-4.1$ pwd
        /
        -bash-4.1$ ls -l /afs/cern.ch/user/m/ml001
        ls: cannot open directory /afs/cern.ch/user/m/ml001: Permission denied
        -bash-4.1$ ls -l /afs/cern.ch/work/m/ml001
        ls: cannot access /afs/cern.ch/work/m/ml001: No such file or directory
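
The manual `ls` checks above could be scripted for repeated opt-out tests. The following is only a minimal sketch: the function names `probe_dir` and `report` are made up for illustration, and the only thing it relies on is that Python maps "Permission denied" and "No such file or directory" to `PermissionError` and `FileNotFoundError`.

```python
import os


def probe_dir(path):
    """Classify a directory as 'ok', 'denied', 'missing' or 'error'.

    'denied' corresponds to the "Permission denied" seen for the home
    directory, 'missing' to the "No such file or directory" seen for
    the work directory in the transcript above.
    """
    try:
        os.listdir(path)
    except PermissionError:
        return "denied"
    except FileNotFoundError:
        return "missing"
    except OSError:
        # Any other failure (I/O error, not a directory, ...)
        return "error"
    return "ok"


def report(paths):
    """Return a {path: state} mapping for a list of directories."""
    return {path: probe_dir(path) for path in paths}
```

Calling `report(["/afs/cern.ch/user/m/ml001", "/afs/cern.ch/work/m/ml001"])` on a node in the state shown above would presumably give "denied" for the first path and "missing" for the second.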

        ===

        Backup-restore output (incomplete)

        -bash-4.1$  afs_admin recover /afs/cern.ch/user/m/ml001
        RECOVERING VOLUME:Y.user.ml001
        2017-08-29 15:17:01,872 INFO    : Starting restore session... logfile /var/ABS/log/abs-restore-session.2017.08.29-151701/abs-restore.log
           0:    2017-03-06 18:48:10 (f)
           1:    2017-04-20 18:53:34 (f)
           2:    2017-06-04 18:43:50 (f)
           3:    2017-07-19 19:08:17 (f)
           4:    2017-08-07 19:00:03
           5:    2017-08-16 18:30:14
        choose dump (number); or 'm' for more, possibly older dumps (if any); or '^C' to interrupt >5
        2017-08-29 15:17:06,403 INFO    : Restoring volume 1934841534 at 2017-08-16 18:30:14, recalling 3 dumps
        2017-08-29 15:17:06,509 WARNING : stderr:     **** : trace level set to 3
            stager: stage_prepareToGet Usertag=NULL
            stager: Looking up RH host - Using castorpublic
            stager: Looking up RH port - Using 9002
            stager: Looking up service class - Using backup
            stager: stage_prepareToGet file=/castor/cern.ch/afsback/DUMPS/44/24/1934841534/voldump:v.1934841534:n.user.ml001:b.2017.07.19-190817:p.0:e.gzip.aes-v2:i.2017.07.19-190817:.gz.aes-v2/ proto=rfio
            stager: Setting euid: 49089
            stager: Setting egid: 2766
            stager: Localhost is: afs901.cern.ch
            stager: Creating socket for castor callback - Using port 30042
            stager: Aug 29 15:17:06 (1504012626) Sending request
            stager: 4b07de86-4484-45b4-939b-8a734d8e9572 SND 0.01 s to send the request
            stager: Waiting for callback from castor
            stager: 4b07de86-4484-45b4-939b-8a734d8e9572 CBK 0.02 s before callback from 188.184.38.4 was received
        Received 1 responses
        /castor/cern.ch/afsback/DUMPS/44/24/1934841534/voldump:v.1934841534:n.user.ml001:b.2017.07.19-190817:p.0:e.gzip.aes-v2:i.2017.07.19-190817:.gz.aes-v2/ SUBREQUEST_READY

        2017-08-29 15:17:06,613 WARNING : stderr:     **** : trace level set to 3
            stager: stage_prepareToGet Usertag=NULL
            stager: Looking up RH host - Using castorpublic
            stager: Looking up RH port - Using 9002
            stager: Looking up service class - Using backup
            stager: stage_prepareToGet file=/castor/cern.ch/afsback/DUMPS/44/24/1934841534/voldump:v.1934841534:n.user.ml001:b.2017.08.07-190003:p.2017.07.19-190817:e.gzip.aes-v2:i.2017.07.19-190817:.gz.aes-v2/ proto=rfio
            stager: Setting euid: 49089
            stager: Setting egid: 2766
            stager: Localhost is: afs901.cern.ch
            stager: Creating socket for castor callback - Using port 30927
            stager: Aug 29 15:17:06 (1504012626) Sending request
            stager: 35ea97f1-5ca7-4551-9748-0959ea22e00e SND 0.01 s to send the request
            stager: Waiting for callback from castor
            stager: 35ea97f1-5ca7-4551-9748-0959ea22e00e CBK 0.02 s before callback from 188.184.38.4 was received
        Received 1 responses
        /castor/cern.ch/afsback/DUMPS/44/24/1934841534/voldump:v.1934841534:n.user.ml001:b.2017.08.07-190003:p.2017.07.19-190817:e.gzip.aes-v2:i.2017.07.19-190817:.gz.aes-v2/ SUBREQUEST_READY

        2017-08-29 15:17:06,718 WARNING : stderr:     **** : trace level set to 3
            stager: stage_prepareToGet Usertag=NULL
            stager: Looking up RH host - Using castorpublic
            stager: Looking up RH port - Using 9002
            stager: Looking up service class - Using backup
            stager: stage_prepareToGet file=/castor/cern.ch/afsback/DUMPS/44/24/1934841534/voldump:v.1934841534:n.user.ml001:b.2017.08.16-183014:p.2017.08.07-190003:e.gzip.aes-v2:i.2017.07.19-190817:.gz.aes-v2/ proto=rfio
            stager: Setting euid: 49089
            stager: Setting egid: 2766
            stager: Localhost is: afs901.cern.ch
            stager: Creating socket for castor callback - Using port 30890
            stager: Aug 29 15:17:06 (1504012626) Sending request
            stager: c374db3c-e340-4ccf-b9c1-d1547780029d SND 0.02 s to send the request
            stager: Waiting for callback from castor
            stager: c374db3c-e340-4ccf-b9c1-d1547780029d CBK 0.03 s before callback from 128.142.37.40 was received
        Received 1 responses
        /castor/cern.ch/afsback/DUMPS/44/24/1934841534/voldump:v.1934841534:n.user.ml001:b.2017.08.16-183014:p.2017.08.07-190003:e.gzip.aes-v2:i.2017.07.19-190817:.gz.aes-v2/ SUBREQUEST_READY
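
The three recalled dumps appear to form a chain: one full dump plus two incrementals, each naming its predecessor. As an illustration only, the sketch below rebuilds that chain from the file names in the log. It assumes (this is not confirmed in the minutes) that the `b.` field is the dump's own timestamp, the `p.` field is the parent dump's timestamp, and `p.0` marks a full dump. `parse_dump` and `restore_chain` are hypothetical helpers, not part of `afs_admin` or ABS.

```python
import re

# Assumed layout: voldump:v.<volume id>:n.<volume name>:b.<backup>:p.<parent>:...
DUMP_RE = re.compile(
    r"voldump:v\.(?P<vol>\d+):n\.(?P<name>[^:]+)"
    r":b\.(?P<backup>[^:]+):p\.(?P<parent>[^:]+):"
)


def parse_dump(path):
    """Extract volume id, volume name, backup and parent timestamps."""
    match = DUMP_RE.search(path)
    if match is None:
        raise ValueError("not a voldump path: %r" % path)
    return match.groupdict()


def restore_chain(paths, target):
    """Return backup timestamps, full dump first, needed to restore `target`."""
    by_backup = {}
    for path in paths:
        dump = parse_dump(path)
        by_backup[dump["backup"]] = dump

    chain = []
    current = target
    while current != "0":  # "0" is assumed to mean "no parent" (full dump)
        if current in chain:
            raise ValueError("cycle in dump chain at %s" % current)
        if current not in by_backup:
            raise KeyError("missing dump %s" % current)
        chain.append(current)
        current = by_backup[current]["parent"]
    chain.reverse()
    return chain
```

Under these assumptions, feeding the three paths from the log with target `2017.08.16-183014` yields the order full dump of 2017-07-19, then the incrementals of 2017-08-07 and 2017-08-16, which is consistent with the "recalling 3 dumps" message.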

      • 17:10
        BATCH integration 5m
        Speaker: Massimo Lamanna (CERN)
        Test with version 4.1.26-1
        
        Analysing dir: outputs
        Expected data size 1024000 (10 datafiles per job)
        Checking task 177588 (100 jobs)
        Analysing task 177588 (100 jobs)
        
        Task 177588 starts at Tue Aug 29 14:56:43 2017 and ends at Tue Aug 29 15:00:42 2017 (4.0 minutes)
        Analysed jobs: 100
        Correct jobs: 100
        Maximum concurrency: 13
        Execution hosts (top 5):  b62a5971fc [#4]  b6e6f35cfa [#4]  b64dda3533 [#4]  b6c077d1d1 [#4]  b6d86ab924 [#4] 
        Execution environments (top 5): eos-client-4.1.26-1.el6.x86_64, eos-fuse-core-4.1.26-1.el6.x86_64, xrootd-client-libs-4.6.1-1.el6.i686, xrootd-client-libs-4.6.1-1.el6.x86_64 [#100]
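
For reference, the kind of check summarised above could look roughly like the sketch below. This is not the actual analysis script. It assumes that "Expected data size 1024000" is the total per job across its 10 data files, and that job start/end times are available as numeric timestamps. The names `check_job`, `summarise` and `max_concurrency` are hypothetical.

```python
def check_job(file_sizes, expected_files=10, expected_bytes=1024000):
    """A job counts as correct if it wrote the expected number of files
    and their sizes add up to the expected total (assumed per job)."""
    return len(file_sizes) == expected_files and sum(file_sizes) == expected_bytes


def summarise(jobs):
    """jobs maps a job id to the list of its output file sizes in bytes."""
    correct = sum(1 for sizes in jobs.values() if check_job(sizes))
    return {"analysed": len(jobs), "correct": correct}


def max_concurrency(intervals):
    """Peak number of jobs running at once, given (start, end) pairs."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps an end (-1) sorts before a start (+1), so a job
    # finishing exactly when another starts is not counted as overlapping.
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak
```

With 100 jobs that each wrote 10 files of 102400 bytes, `summarise` would report 100 analysed and 100 correct jobs, matching the figures above.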
      • 17:15
        Xrootd 5m
        Speaker: Michal Kamil Simon (CERN)

        xrootd-4.7 has been released. Will come via standard repos.

        Q: does this include the fix on client "maximum number of internal redirect" error reporting (reported to Elvin; might hide "real" error?). Probably not. Perhaps could get "backtrace" with all statuses (upstream ticket exists)?

      • 17:20
        AOB 5m

        Next week - somebody else to chair (or skip)?