Rucio Development Meeting

Europe/Zurich
Martin Barisits (CERN)
Description

Zoom: https://cern.zoom.us/j/413496641

Meeting ID: 413 496 641
Find your local number: https://cern.zoom.us/u/aT2QQfXAo

    • 15:00 15:10
      News 10m
    • 15:10 15:20
      News from the experiments 10m
      • ATLAS
        • Issue with load balancer
          • Is it possible to report to the client which Rucio server was used in the backend?
          • Could the backend server be part of the header?
          • Each server can put it into the header anyway, whether or not a load balancer is used
          • Thomas will have a look
        • Potential issue with submitter?
          • Some requests stay stuck in the Q state and are not submitted to FTS
          • Also seen before on the CMS instance
          • Needs further investigation
            • Martin will look into it
        • "Lost" files on e.g. read-only file systems cannot be declared LOST in the current workflow
          • Dimitrios will create an issue
        • Dumps: could the replicas dump flag a replica as being "the last"?
          • Thomas is currently rewriting the dump scripts in Spark; this can be added
      • CMS
        • Auth issue in JS
          • Thomas is investigating, no news yet
      • Belle II
        • Monitoring
        • New daemon, or possibly put into hermes?
      • MultiVO
        • Upgrade the version of the RAL Rucio instance
      • DUNE/ICARUS/LSST/*
        • Getting a Rucio cluster to run
    • 15:20 15:30
      Hot topics 10m
    • 15:30 15:55
      Developers roundtable 25m
      • Burn chart and progress
      • 1.23.0 LTS "The Incredible Donkey" priority followup
        • In Progress
          • Documentation overhaul [Martin, Dimitrios]
            • Page Listing config table and RSE Attribute Parameters #2631 [Martin]
            • Operators Documentation and recipe repository #2636 [Martin]
            • Early phase of picking tools/deciding structure/content
              • Separation between generic / VO specific content
            • Possible discussion in 2 weeks for everyone to comment
          • Expand Kubernetes Usage [Thomas]
            • Waiting for Ricardo for node investigation
            • Reaper2 memory usage increases constantly (until the limit is hit), then it restarts
              • Confirmed by CMS too
                • ~50 RSEs processed in reaper
              • ATLAS made big jump to 300+ RSEs
              • Being investigated
              • Check memory usage
              • ATLAS sees this on reaper1 now as well
                • Related to gfal?
                • Needs followup
            • Debug features with attachable containers coming soon
            • MultiZone cluster available now
            • Increasing cluster size next week
          • AAI/OIDC Testing and Improvements [Jaroslav]
            • Test of propagation of account to transfertool
            • New patch release to deploy the recent developments on WLCG DOMA cluster
          • MultiVO Functionality #2635 [Eli]
            • Bringing work up to date
            • Meeting later on to specify next steps
            • Discussion: Administration of different VOs
              • Securing VOs, Accounts etc.
            • List of code parts which need specific changes to enable Multi-VO
            • Issue with the migration script under Python 3.6 and Oracle
          • Unification of metadata interfaces #3096 [Aris]
            • PR submitted, waiting for comments
          • New Code management Model #3417 [Martin, Ben]
            • Tested GitHub Actions to automate testing of cherry-picks against release branches
            • Tests for PRs would still run in Travis; cherry-picks would be tested in GitHub Actions
            • Might make sense to move everything to GitHub Actions
            • Looked into different branching models
          • Python 3 #3420 [Martin]
            • rucio-clients setup.py fixed
            • Starting to test the py3 server again with Travis
        • To do
          • rucio.cfg vs config table #2630 [Mario]
          • Handling of Archives in the Reaper #1431 [Thomas, Cedric]
          • Log the Parameters used in all POST/PUT requests #2686 [Thomas]
          • RSEmgr version 2.0 #3147 [Tomas, Tobi]
          • QoS #3419 [Aris, Mario, Martin]
            • Some open conceptual decisions
            • Dedicated meeting for this
        • Done
      • GitLab vs GitHub
        • Is it worth moving (back) to GitLab?
          • At the moment no strong benefit, but might change in the future?
      • Auditor #3437 [Dimitrios]
        • Comparison with old auditor
        • Would be useful if CMS colleagues can test/compare the functionality as well
        • Unit tests missing, but should come soon
      • 2020-04-09
        • Auditor #3437 [Dimitrios]
          • Went through the code
          • Started to work on core function
          • Test cases are missing
          • Side-effects of only taking a dump with AVAILABLE replicas?
          • Object stores
            • Possible to get file lists from object stores (list buckets)
            • Still two lists to compare
            • Possible extra intelligence needed to handle corner cases
        • Monitoring [Cedric, Thomas]
          • For ATLAS, monitoring aggregations are done in the monitoring infrastructure
          • A light version of this would be useful for other communities too
          • Tool/Daemon which does this aggregation
        • Traces [Thomas]
          • Trace infrastructure for CMS
          • Not easy to do, since there is no documentation or schema
          • Only the Kronos daemon expects certain fields in the traces
          • Set up (and enforce) a base schema on the server
            • Reject and/or monitor the traces failing schema validation
          • Kronos daemon has lots of ATLAS specifics
            • Kronos 2.0 makes the experiment-specific parts pluggable
      • 2020-04-02
        • Handling of lost files in archives in the necromancer [Cedric, Tomas]
          • Tomas can look into it
          • Will require additional queries to check for archives
        • Auditor discussion [Dimitrios, Tomas]
          • Input is 2 files: DB dump, storage dump
          • Can Auditor not directly get DB information from Rucio (instead of relying on DB Dump)?
            • Possible to do both ways?
              • Difficult, since not all information is available in the db for past replica states
          • Auditor compares the 2 states (DB, Storage)
            • Auditor might as well work on DB dump (without generated PFNs) and generate the PFNs during processing
          • pre, common, post actions
            • Directories for the DB and storage dumps are filled (externally)
            • Auditor runs and fetches data from the directories
            • Auditor produces output
          • Dimitrios will create a ticket to collect ideas/workflows and we move forward from there
            • Collect use cases there, verify that it works (compared to old auditor)
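
For reference, the core comparison described in the Auditor discussion (two input lists, one from the DB dump and one from the storage dump) can be sketched as a plain set difference. This is only an illustration, not the auditor code: the names `AuditResult` and `compare_dumps` are hypothetical, and it assumes each dump has already been reduced to one path per line.

```python
from typing import Iterable, NamedTuple, Set


class AuditResult(NamedTuple):
    missing_on_storage: Set[str]  # listed in the DB dump, absent from the storage dump
    unknown_to_db: Set[str]       # listed in the storage dump, absent from the DB dump


def compare_dumps(db_paths: Iterable[str], storage_paths: Iterable[str]) -> AuditResult:
    """Compare two path listings (hypothetical format: one path per line)."""
    # Normalise by stripping whitespace and dropping empty lines.
    db = {p.strip() for p in db_paths if p.strip()}
    storage = {p.strip() for p in storage_paths if p.strip()}
    return AuditResult(missing_on_storage=db - storage, unknown_to_db=storage - db)
```

Because both arguments are plain iterables of strings, open file handles on the two dumps can be passed in directly. Handling of past replica states and object-store corner cases, which the notes flag as open points, is deliberately left out.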
    • 15:55 16:00
      AOB 5m
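
As a note on the Traces item from the roundtable, enforcing a base schema on the server could look roughly like the sketch below. It is a minimal illustration under assumptions: the field names in `BASE_TRACE_SCHEMA` and the function `validate_trace` are hypothetical and are not taken from Rucio or from what Kronos expects.

```python
from typing import Any, Dict, List

# Hypothetical base schema: field name -> expected Python type.
BASE_TRACE_SCHEMA: Dict[str, type] = {
    "event_type": str,
    "scope": str,
    "name": str,
    "rse": str,
}


def validate_trace(trace: Dict[str, Any],
                   schema: Dict[str, type] = BASE_TRACE_SCHEMA) -> List[str]:
    """Return a list of validation errors; an empty list means the trace passes."""
    errors = []
    for field, expected in schema.items():
        if field not in trace:
            errors.append(f"missing field: {field}")
        elif not isinstance(trace[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors
```

Returning the errors instead of raising lets the caller choose between rejecting the trace and only counting failures for monitoring, which are the two options mentioned in the notes.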