Rucio Development Meeting

Europe/Zurich
Martin Barisits (CERN)
Description

Zoom: https://cern.zoom.us/j/413496641

Meeting ID: 413 496 641
Find your local number: https://cern.zoom.us/u/aT2QQfXAo

    • 15:00 15:10
      News 10m
      • Original Photo of 2019 coding camp
      • Google Season of Docs
      • Meeting schedule in May
        • 05-07
        • 05-14 CANCELLED
        • 05-21 CERN Holiday
        • 05-28
    • 15:10 15:20
      News from the experiments 10m
      • ATLAS
        • Issue with conveyor-poller
          • FTS returns an internal server error for one job and the poller gets stuck there
          • Possibly cancel and resubmit?
          • Need a solution which does not block the entire grid due to a failure of one (or a handful) of jobs
          • --> Ticket
      • CMS
        • Getting closer to going into production for a small data tier
        • Issues with containers in version 1.22.3
          • Auth server does not work
          • No issue for ATLAS / DOMA instance
          • Eric will submit PRs for the fixes
            • Possibly an issue with ssl.conf in apache
      • Belle II
        • Working on monitoring (first prototype should be there soon)
          • Will rely on new aggregator
          • Commit next week
        • Migration
          • Almost everything is ready
          • Certification phase now; migration hopefully starts by end of June
      • RAL (MultiVO)
        • No news
      • DUNE
        • Designed a plan for policy packages in Multi-VO
      • LDMX
        • Disk ran full and caused some issues
        • Some configuration improvements
    • 15:20 15:30
      Hot topics 10m
    • 15:30 15:55
      Developers roundtable 25m
      • Burn chart and progress
      • 1.23.0 LTS "The Incredible Donkey" priority followup
        • In Progress
          • Documentation overhaul [Martin, Dimitrios]
            • Some tests with static page builders (mkdocs, jekyll)
              • https://github.com/bari12/documentation_test
              • https://github.com/bari12/documentation_test_jekyll
            • Automated API documentation building still open
          • Expand Kubernetes Usage [Thomas]
            • Waiting for Ricardo for node investigation
            • Reaper2 memory usage increases constantly until the limit is hit, then it restarts
              • Confirmed by CMS too
                • ~50 RSEs processed in reaper
              • ATLAS made big jump to 300+ RSEs
              • Being investigated
              • Check memory usage
              • ATLAS sees this on reaper1 now as well
                • Related to gfal?
                • Needs followup
            • Debug features with attachable containers coming soon
            • MultiZone cluster available now
            • Increasing cluster size next week
            • Added more configuration parameters to chart (google secrets)
            • Account switcher @ webui
              • Should be easy to fix
              • Eric will try - patch will follow
            • Activating more daemons on k8s
            • Started streamlining configuration (for ATLAS) based on Flux
            • Python 3 containers not working at the moment; continue with Python 2.7 containers
          • AAI/OIDC Testing and Improvements [Jaroslav]
            • Test of propagation of account to transfertool
            • New patch release to deploy the recent developments on WLCG DOMA cluster
            • Testing transfers with FTS & dCache and OIDC auth
              • Auth flow with rucio-admin token does not work at the moment
              • Second mode: user token
                • Needs a fix
            • WebUI fix
            • Should plan continuous test efforts with multiple storages
          • MultiVO Functionality #2635 [Eli, Patrick]
            • Bringing work up to date
            • Meeting later on to specify next steps
            • Discussion: Administration of different VOs
              • Securing VOs, Accounts etc.
            • List of code parts which need specific changes to enable Multi-VO
            • Issue with migration script under py3.6 and Oracle
              • Py3 server container would be very useful to test this
                • (Thomas will prepare)
            • Policy packages adaptation for MultiVO
            • Making good progress on tests
            • Python 3 issue:
              • Containers for py3 needed for testing
              • Possibly try to test from a venv
          • Unification of metadata interfaces #3096 [Aris]
            • PR submitted, incorporating comments now
          • New Code management Model #3417 [Martin, Ben]
            • Ben's presentation
            • Following up on the build system (Travis/Docker) to migrate to GitHub Actions
          • Python 3 #3420 [Martin]
            • rucio setup.py fixed
            • Starting to test py3 server again with travis
          • QoS #3419 [Aris, Mario, Martin]
            • Some open conceptual decisions
            • Dedicated meeting for this
            • Conceptual design somewhat fixed
            • Development has now started for Mario, Aris, Martin
            • Will do a presentation about this next week
          • Changing gfal protocol (adding protocol) [Mario]
            • Instead of using the gfal API, use the gfal CLI (which can be cancelled)
            • Already works, but needs a lot of testing
            • Renaming protocols (but leave symlinks)
        • To do
          • Operators Documentation and recipe repository #2636 [Martin]
          • Page Listing config table and RSE Attribute Parameters #2631 [Martin]
          • rucio.cfg vs config table #2630 [Mario]
          • Handling of Archives in the Reaper #1431 [Thomas, Cedric]
          • Log the Parameters used in all POST/PUT requests #2686 [Thomas]
          • RSEmgr version 2.0 #3147 [Tomas, Tobi]
        • Done
      • Reaper 
        • Current reaper relies on data populated by probes (difficult)
        • Can we default this to Rucio-internal values?
          • Should work for used
          • Total/Threshold more difficult
            • (Rucio cannot guess)
        • Thresholds?
        • --> Ticket
      • Client
        • 1.22.3
        • Doesn't find the generic package for some reason
      • Auditor
        • How to best help Dimitrios
        • Igor prototyped a function which is much faster, Dimitrios is testing
        • Second prototype from Igor should use less memory 
        • Where to store the experiment-specific scripts/tools?
          • Experiment repos, if possible public so we can link them
          • Some overlap with policy packages
        • Some testing with actual data would be helpful (feedback)
          • Interpretation of the actions taken by the tool
        • Followup offline meeting
      • 2020-04-23
          • Presentation about Code Management Model from Ben
          • Q: How to collaborate on a single development?
            • Pull/Merge from personal branches. Then PR to rucio repository
        • Auditor
          • Decide on interface, development is mostly in the "policies"
      • 2020-04-16
        • GitLab vs GitHub
          • Worth moving (back) to GitLab?
            • At the moment no strong benefit, but might change in the future?
        • Auditor #3437 [Dimitrios]
          • Comparison with old auditor
          • Would be useful if CMS colleagues can test/compare the functionality as well
          • Unit tests missing, but should come soon
      • 2020-04-09
        • Auditor #3437 [Dimitrios]
          • Went through code 
          • Started to work on core function
          • Test cases are missing
          • Side-effects of only taking a dump with AVAILABLE replicas?
          • Object stores
            • Possible to get file lists from object stores (list buckets)
            • Still two lists to compare
            • Possible extra intelligence needed to handle corner cases
        • Monitoring [Cedric, Thomas]
          • For ATLAS, monitoring aggregations are done in the monitoring infrastructure
          • A light version of this would be useful for other communities too
          • Tool/Daemon which does this aggregation
        • Traces [Thomas]
          • Trace infrastructure for CMS
          • Actually not easy to do, since there is no documentation or schema
          • Only the Kronos daemon expects certain fields in the traces
          • Set up (and enforce) a base schema on the server
            • Reject and/or monitor traces that fail schema validation
          • Kronos daemon has lots of ATLAS specifics
            • Kronos 2.0 makes the experiment-specific parts pluggable
      • 2020-04-02
        • Handling of lost files in archives in the necromancer [Cedric, Tomas]
          • Tomas can look into it
          • Will require additional queries to check for archives
        • Auditor discussion [Dimitrios, Tomas]
          • Input 2 files: DB Dump, Storage Dump
          • Can Auditor not directly get DB information from Rucio (instead of relying on DB Dump)?
            • Possible to do both ways?
              • Difficult, since not all information is available in the db for past replica states
          • Auditor compares the 2 states (DB, Storage)
            • Auditor might as well work on DB dump (without generated PFNs) and generate the PFNs during processing
          • pre, common, post actions
            • Directories for DB and storage dumps are filled (externally)
            • Auditor runs and fetches data from the directories
            • Auditor produces output
          • Dimitrios will create a ticket to collect ideas/workflows and we move forward from there
            • Collect use cases there, verify that it works (compared to old auditor)
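
For reference, the core comparison step from the 2020-04-02 Auditor discussion (two inputs, DB dump and storage dump) could be sketched roughly as below. This is only a minimal illustration and not the actual auditor code: the function name compare_dumps, the "lost"/"dark" labels, and the one-path-per-line input format are assumptions made for the example. It also ignores timing effects, such as replicas created or deleted between the two dumps, which a real tool would have to handle.

```python
def compare_dumps(db_dump, storage_dump):
    """Compare two iterables of file paths (one path per entry).

    Hypothetical helper for illustration only, not part of Rucio.
    Returns paths known to the DB but missing on storage ("lost")
    and paths present on storage but unknown to the DB ("dark").
    """
    # Normalise entries and drop empty lines before comparing.
    db_files = {line.strip() for line in db_dump if line.strip()}
    storage_files = {line.strip() for line in storage_dump if line.strip()}
    return {
        "lost": sorted(db_files - storage_files),
        "dark": sorted(storage_files - db_files),
    }


if __name__ == "__main__":
    # Made-up example paths; real dumps would be read from files.
    db = ["/data/a.root", "/data/b.root", "/data/c.root"]
    storage = ["/data/b.root", "/data/c.root", "/data/d.root"]
    print(compare_dumps(db, storage))
```

Because the function accepts any iterable of strings, open file handles for the two dump files could be passed in directly, which would fit the "directories being filled externally" workflow noted above.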
    • 15:55 16:00
      AOB 5m