DOMA / TPC Meeting

Europe/Zurich
Description

Topic: WLCG DOMA TPC Meeting

Join Zoom Meeting
https://cern.zoom.us/j/99836057922?pwd=ZFhWN3NpYi9oZmwvM3pIRE9zdzFnZz09

Meeting ID: 998 3605 7922
Passcode: 733660
One tap mobile
+41315280988,,99836057922# Switzerland
+41432107042,,99836057922# Switzerland

Dial by your location
        +41 31 528 09 88 Switzerland
        +41 43 210 70 42 Switzerland
        +41 43 210 71 08 Switzerland
        +33 1 7037 2246 France
        +33 1 7037 9729 France
        +33 1 8699 5831 France
Find your local number: https://cern.zoom.us/u/aeB4ArMgmT

    • 16:00 16:10
      Experiments production 10m
      Speakers: Alessandra Forti (University of Manchester (GB)), Diego Davila Foyo (Univ. of California San Diego (US)), Petr Vokac (Czech Technical University (CZ))

      ATLAS

      • dCache
        • Swiss CA & dCache restarts due to CaNL bug (dCache < 7.x)
          • GGUS tickets to remaining 6 sites on 12 April
            • done: FZK, DESY-ZN, SARA-MATRIX, RRC-KI-T1, RU-Protvino-IHEP
            • scheduled: DESY-HH (GGUS:151317)
          • usually restarted only WebDAV doors or WebDAV + gPlazma in the past, not pools
          • this issue is currently hidden for CMS, because their FTS is configured with pull -> push fallback
        • dCache WLCG SRR issue with missing storageshares
          • dCache developers waiting for CERN/WLCG response since November(?)
            • several other related, really old dCache tickets are still open
            • what is the current status? what is minimal dCache version with working SRR?
          • WLCG SRR is a bit too flexible: usually generated by a cron script (huge potential for failures - see notes for LCGDM-2744), no standard location, and we need to clean up old/invalid locations in different CRICs (WLCG, ATLAS, DUNE, ...) ... not really an improvement compared to the SRM used in the past
        • dCache documentation recommends periodic CRL updates, but once you run fetch-crl then updates are mandatory
          • different commands to enable periodic updates on SLC6, CentOS7, CentOS8
      • StoRM
        • official documentation should provide details on how to update space occupancy for non-GPFS backends
          • it would be nice to include examples for the most common filesystems (e.g. Lustre, CEPH(?))
          • what happened to the related STOR-1356?
      • DPM
        • Observing production transfer failures
          • mainly affects long distance transfers (EU <-> US / Asia / Australia)
        • security model implemented by DPM for storage issued tokens (macaroons) is different compared to dCache, StoRM and XRootD (LCGDM-2972)
        • new fixed DPM release necessary - in progress
          • scheduled for the end of May
      • ECHO RAL/Glasgow status?
        • would it be possible / easier to use CephFS + XRootD without any special plugin?
          • ATLAS 30day transfer average with RAL destination 4Gb/s (hourly peaks < 30Gb/s)
          • ATLAS 30day transfer average with RAL source 4Gb/s (hourly peaks < 30Gb/s)
      • XRootD status of production release for US sites?
      • 100k small files stress tests for DPM (FTS, Kibana) and XRootD (FTS, Kibana)
        • race creating (duplicate) parent directories
        • DPM (449 failed transfers)
          • 3x DESTINATION MAKE_PARENT HTTP 403 : Permission refused
          • 19x DESTINATION OVERWRITE HTTP 403 : Permission refused
          • 404x problem with TRIUMF source
        • XRootD (5952 failed transfers)
          • 32x DESTINATION MAKE_PARENT (Neon): Could not read status line: Connection reset by peer
          • quite a lot of "Connection issues" - destination XRootD overloaded(?)
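The parent directory race noted in the stress tests can be illustrated with a minimal sketch. It assumes a WebDAV-like endpoint where MKCOL returns 201 on creation, 405 if the collection already exists and 409 if an ancestor is missing (RFC 4918 semantics); the HTTP 403 responses seen at DPM are not modelled. `FakeStorage`, `ensure_dir` and `run_concurrent` are hypothetical names for illustration only, not the FTS, gfal2, DPM or XRootD implementation.

```python
import threading


class FakeStorage:
    """In-memory stand-in for a WebDAV endpoint (hypothetical, illustration only)."""

    def __init__(self):
        self._dirs = {"/"}
        self._lock = threading.Lock()

    def mkcol(self, path):
        """Return an HTTP-like status: 201 created, 405 exists, 409 parent missing."""
        parent = path.rsplit("/", 1)[0] or "/"
        with self._lock:
            if path in self._dirs:
                return 405
            if parent not in self._dirs:
                return 409
            self._dirs.add(path)
            return 201

    def exists(self, path):
        with self._lock:
            return path in self._dirs


def ensure_dir(storage, path):
    """Create path and all missing ancestors; 'already exists' counts as success."""
    current = ""
    for part in path.strip("/").split("/"):
        current += "/" + part
        status = storage.mkcol(current)
        # Another transfer may have created the directory in the meantime,
        # so 405 is treated the same as 201 instead of failing the transfer.
        if status not in (201, 405):
            raise RuntimeError(f"MKCOL {current} failed with status {status}")


def run_concurrent(n_workers=16, path="/atlas/stress/run1"):
    """Let several workers create the same parent directory at once."""
    storage = FakeStorage()
    errors = []

    def worker():
        try:
            ensure_dir(storage, path)
        except RuntimeError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return storage, errors
```

The design choice is to make directory creation idempotent on the client side rather than checking for existence first, since a check followed by a create leaves a window in which a second transfer can create the same directory.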

      FTS related

      • Creating concurrently parent directories causes HTTP-TPC transfer failures at least for DPM and XRootD
      • Change key size for delegated proxy to 2048 (FTS-1700)
        • CentOS8 by default doesn't support 1024-bit keys (observed at SARA & dCache)
      • Logging Authorization headers (FTS-1663)
        • FTS log level for HTTPS/DAVS/... should not be set to 3 for any production storage
        • XRootD storage (UNL) even advertises Bearer for failed transfers with log level 0
          • does this come from a bad XRootD configuration or is it necessary to fix XRootD sources?
      • GridFTP issues (we don't really care)
        • Protocol translation from SRM+GridFTP to HTTP doesn't work for files > 4MB
        • GridFTP transfer succeeds but gfal incorrectly closes the connection, which leads to "Aborting transfer due to session termination" in logs
      • Davix+libneon vs. Davix+libcurl future plans

       

      CMS

      Current status

      Total sites (T1_Disk + T2s)          56
      Notified sites                       56 (100%)
      Reporting a davs endpoint            47 (83%)
      Passed manual tests                  41 (73%)
      Passed loadTests (davs)              13 (23%)
      Passed loadTests (srm)               12 (21%)
      Enabled to fetch changes in TFC       0 (0%)
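The percentages in the status list are relative to the 56 sites in total. The short sketch below recomputes them; it assumes integer truncation, which is what reproduces the quoted figures (47/56 is 83.9%, reported as 83%). The dictionary keys simply repeat the labels above.

```python
TOTAL_SITES = 56

# Counts copied from the status list above.
COUNTS = {
    "Notified sites": 56,
    "Reporting a davs endpoint": 47,
    "Passed manual tests": 41,
    "Passed loadTests (davs)": 13,
    "Passed loadTests (srm)": 12,
    "Enabled to fetch changes in TFC": 0,
}


def percent(count, total=TOTAL_SITES):
    """Percentage of sites, truncated to an integer."""
    return count * 100 // total


def report(counts=COUNTS):
    """Return one formatted line per status entry."""
    return [f"{label}: {n} ({percent(n)}%)" for label, n in counts.items()]
```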

       

      Current issues

      1. Corner cases in the TFC parser

      The TFC parser uses regular expressions to parse all the possible formats of the URLs used within the LFN to PFN rules. We have found cases not covered by the set of regexps, and pieces of the code with missing regexps. I'm currently refactoring these pieces of code.
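As a rough illustration of what applying an LFN to PFN rule involves, here is a minimal sketch. It assumes each rule carries a protocol, a path-match regular expression and a result template with `$1`-style placeholders; the rule contents, hostnames and the `lfn_to_pfn` helper are hypothetical and far simpler than the actual parser being refactored.

```python
import re

# Hypothetical rules; hostnames, ports and paths are placeholders.
RULES = [
    {
        "protocol": "davs",
        "path_match": r"/+store/(.*)",
        "result": "davs://webdav.example.org:2880/cms/store/$1",
    },
    {
        "protocol": "srmv2",
        "path_match": r"/+store/(.*)",
        "result": "srm://srm.example.org:8443/srm/managerv2?SFN=/cms/store/$1",
    },
]


def lfn_to_pfn(lfn, protocol, rules=RULES):
    """Return the PFN from the first matching rule, or None if no rule applies."""
    for rule in rules:
        if rule["protocol"] != protocol:
            continue
        match = re.fullmatch(rule["path_match"], lfn)
        if match is None:
            continue
        # Substitute $1, $2, ... in the template with the captured groups.
        return re.sub(
            r"\$(\d+)",
            lambda m: match.group(int(m.group(1))),
            rule["result"],
        )
    return None
```

Returning `None` for an unmatched LFN makes corner cases visible to the caller instead of silently producing a malformed URL.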

       

      2. WebDAV in the TFC

      Site admins either do not define their WebDAV endpoint in their TFC or define it incorrectly.

       

      3. Missing permission

      We have found many sites missing permissions for Rucio to write the LoadTest file

       

      4. The ASO problem.

      We found out that ASO relies on the Rucio configuration of the RSEs to schedule its TPC transfers. It always fetches the preferred protocols for read and write from the sites involved but doesn't check whether they're compatible or not. We had to change our TFC parser to keep read/write preferred protocols fixed to srm.
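The missing compatibility check can be sketched as follows. This is only an illustration under simplifying assumptions: protocols are compared by name, and `pick_transfer_protocol` is a hypothetical helper, not part of ASO or Rucio.

```python
def pick_transfer_protocol(src_read_prefs, dst_write_prefs):
    """Return the first protocol the source prefers for reading that the
    destination also accepts for writing, or None if there is no overlap."""
    accepted = set(dst_write_prefs)
    for protocol in src_read_prefs:
        if protocol in accepted:
            return protocol
    return None
```

For example, a source preferring `["davs", "srm"]` and a destination accepting `["srm", "gsiftp"]` would end up with `srm`, while a source offering only `davs` against a destination offering only `srm` would be reported as incompatible instead of being scheduled.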

       

    • 16:10 16:20
      XRootD/EOS libcurl and other libs 10m
      Speakers: Wei Yang (SLAC National Accelerator Laboratory (US)), Luca Mascetti (CERN), Brian Paul Bockelman (University of Wisconsin Madison (US))

      Luca M:

      Sorry, unfortunately I will not be able to join this meeting; here is a short update from my side.
      On 12 April we organised a dedicated meeting between experts (EOS, XRootD, DM clients) to follow up on the current libcurl memory leak issue in XRootD on CentOS7.

      During the meeting we agreed that a patch will be provided for XRootD 5, targeting release 5.2 (thanks a lot to Brian B. for his rapid development of the patch). This patch will enable XRootD to automatically build and load a certificate bundle, leading to an expected leak of only 350 bytes per transfer.

      On the EOS operational side we are currently targeting the rollout of EOS version 5 with the latest XRootD 5 release. This will allow us to roll out the patch in our production cluster with the next EOS release.

      We are also currently evaluating if it will be necessary to port the patch back to XRootD version 4.
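To put the expected residual leak of 350 bytes per transfer into perspective, a small calculation is sketched below. The transfer count is a purely hypothetical input, not a measured figure.

```python
LEAK_PER_TRANSFER_BYTES = 350  # figure quoted in the update above


def leaked_mib(transfers, leak_bytes=LEAK_PER_TRANSFER_BYTES):
    """Total leaked memory in MiB after the given number of transfers."""
    return transfers * leak_bytes / (1024 * 1024)
```

With these numbers, one million transfers served by a single process would leak roughly 334 MiB.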

    • 16:20 16:30
      SRM+HTTP tape access 10m
      Speakers: Mihai Patrascoiu (CERN), Petr Vokac (Czech Technical University (CZ))
      • It was necessary to add WriteToken for TRIUMF tape even for SRM+HTTP(?)
      • TAPE Testbed
        • Functional Tests TAPE (transfer matrix)
          • now using fts3-devel with gfal2 2.20 (integrated SciToken support)
          • configuration updated to prefer SRM+HTTP on Monday evening
          • SRM URL parameters not passed to TURL (e.g. copy_mode=pull)
            • rely on global FTS configuration
          • removed RAL & SLAC from matrix (unreliable)
        • Transfers from dCache tape source (BNL, TRIUMF) have high failure rate
          • StoRM tape source works fine
          • any tape destination works fine
          • dCache transfer fails after 5s with StoRM destination with error "failure: SocketTimeoutException while fetching URL: Read timed out" or after 120s for DPM destination (0 bytes transferred)
            • StoRM has 5s timeout for not receiving any data
            • DPM by default uses a 120s timeout to detect a low transfer rate
            • at least in the case of BNL this can be a side effect of their internal dCache topology with external doors - this will be discussed with developers in a mail thread that we already started with BNL
              • NDGF, on the other hand, works fine
          • trying to involve more dCache tape sites (e.g. NDGF with dCache 7.x, FNAL)
      • Rucio implemented support for "new" protocol SRM+HTTPS (rucio#4506)
        • keeping SRM+GridFTP in addition to SRM+HTTP is too complicated and not really necessary
          • tape SRM endpoints can support just one protocol (either HTTP or GridFTP)
            • no direct transfers between SRM+GridFTP <-> SRM+HTTP tape, but there were 0 direct TAPE to TAPE production ATLAS transfers during the last 30 days
            • fine assuming most of disk sites still support both GridFTP and HTTP
          • for different protocols => Rucio automatically uses multihop
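The multihop idea for tape endpoints that speak different protocols can be sketched as a shortest-path search over endpoints sharing at least one protocol. This is an illustration only, not Rucio's actual algorithm; the endpoint names, the `PROTOCOLS` table and `find_route` are hypothetical.

```python
from collections import deque

# Hypothetical endpoints and the transfer protocols each one supports.
PROTOCOLS = {
    "TAPE_HTTP": {"https"},
    "TAPE_GRIDFTP": {"gsiftp"},
    "DISK_BOTH": {"https", "gsiftp"},
}


def find_route(src, dst, protocols=PROTOCOLS):
    """Breadth-first search for the shortest chain of endpoints in which
    every consecutive pair shares at least one protocol."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        last = path[-1]
        if last == dst:
            return path
        for site, supported in protocols.items():
            # A hop is possible only if both ends share a protocol.
            if site not in seen and supported & protocols[last]:
                seen.add(site)
                queue.append(path + [site])
    return None
```

In this toy setup a transfer between the two tape endpoints is routed through the disk endpoint that supports both protocols, which matches the assumption above that most disk sites still support both GridFTP and HTTP.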
    • 16:30 16:40
      Network data challenges 10m
      Speakers: Dr Riccardo Di Maria (CERN), Rizart Dona (CERN)

      Steps towards a minimal unified monitoring solution (Rizart):

      • Getting access to the experiments data sources via the CERN Grafana instance
        • If experiments visualize data with a different technology stack -> they need to start pushing data to an ES data source that can be set up by Monit
      • Inspecting the schemas
        • Transfers
        • Network metrics
      • Inspecting current dashboards
      • Finding intersection of schemas -> allows for creating cross-experiment filters on the new dashboards
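Finding the intersection of schemas can be sketched with plain set operations. The experiment names and field names below are hypothetical placeholders, not the real Monit or experiment schemas.

```python
# Hypothetical per-experiment transfer schemas (field names only).
SCHEMAS = {
    "atlas": {"src_site", "dst_site", "bytes", "activity", "timestamp"},
    "cms": {"src_site", "dst_site", "bytes", "timestamp", "workflow"},
}


def common_fields(schemas):
    """Fields present in every schema; candidates for cross-experiment filters."""
    if not schemas:
        return set()
    return set.intersection(*schemas.values())
```

Only the fields returned by `common_fields` can safely be used as filters on a dashboard that shows all experiments at once; fields specific to one experiment would need a mapping or a per-experiment panel.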

       

      Towards Networking DC 2021 (Riccardo):

      • Data Challenge Monitoring Mini Workshop for April 27th (Tuesday) 3:00 PM - 6:30 PM CEST
      • Identified resources from which tests can be initiated (WLCG-DOMA)
      • Identified simple tools to use for initial and simple tests 
        • to be generalised for all experiments
      • Gathering info from experiments, and from sites wrt accessible network monitoring
      • Evolving documentation
    • 16:40 16:50
      Future uniform tape access 10m
      Speakers: Cedric Caffy (CERN), Mihai Patrascoiu (CERN), Oliver Keeble (CERN), Paul Millar
    • 16:50 17:00
      Token Authorization testbed 10m
      Speaker: Andrea Ceccanti (Universita e INFN, Bologna (IT))
    • 17:00 17:05
      AOB 5m