DOMA / TPC Meeting

Europe/Zurich
Description

Topic: WLCG DOMA TPC Meeting

Join Zoom Meeting
https://cern.zoom.us/j/99836057922?pwd=ZFhWN3NpYi9oZmwvM3pIRE9zdzFnZz09

Meeting ID: 998 3605 7922
Passcode: 733660
One tap mobile
+41315280988,,99836057922# Switzerland
+41432107042,,99836057922# Switzerland

Dial by your location
        +41 31 528 09 88 Switzerland
        +41 43 210 70 42 Switzerland
        +41 43 210 71 08 Switzerland
        +33 1 7037 2246 France
        +33 1 7037 9729 France
        +33 1 8699 5831 France
Find your local number: https://cern.zoom.us/u/aeB4ArMgmT

    • 16:00 16:10
      SRM+HTTP tape access 10m
      Speakers: Mihai Patrascoiu (CERN), Petr Vokac (Czech Technical University (CZ))

      Only minor progress on SRM+HTTPS

      • new test matrix with disk servers (dCache, DPM, EOS, Echo, StoRM, XRootD)
      • fixed a Rucio issue with srm+https (rucio#4650)
      • SH_* RSE with protocols: srm+https, https, davs
      • 100 MB file transfers every ~20 minutes (initial tests used 1 MB files; test bigger 5 GB files?)
      • no issue related directly to srm+https
        • BNL srm+https destination failures caused by wrong RSE spacetoken configuration
          • which spacetoken can be used for dteam VO with srm://dcsrm.usatlas.bnl.gov:8443/srm/managerv2?SFN=/pnfs/usatlas.bnl.gov/users/hiroito/testtpc
        • TRIUMF: most probably the same firewall issue that we also found with production transfers
        • RAL Echo still problematic
        • EOS firewall/configuration(?) on p06636710r84969.cern.ch
        • dCache prometheus - occasional MAKE_PARENT failure (most probably related to deploying a new release)
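
The test matrix described above can be sketched as a simple enumeration. This is only an illustration, not the actual test configuration: it assumes that every ordered source/destination pair of the listed storage implementations is exercised with each listed protocol, which the notes do not state. `build_test_matrix` is a hypothetical helper name.

```python
from itertools import permutations

# Storage implementations and protocols as listed in the notes.
STORAGES = ["dCache", "DPM", "EOS", "Echo", "StoRM", "XRootD"]
PROTOCOLS = ["srm+https", "https", "davs"]


def build_test_matrix(storages, protocols):
    """Return (source, destination, protocol) tuples.

    Assumption: every ordered pair of distinct storages is tested with
    every protocol; the real matrix may skip some combinations.
    """
    return [
        (src, dst, proto)
        for src, dst in permutations(storages, 2)
        for proto in protocols
    ]


matrix = build_test_matrix(STORAGES, PROTOCOLS)
print(len(matrix))  # 90 = 6 * 5 ordered pairs * 3 protocols
```

Under this assumption the full matrix has 90 entries, so one 100 MB transfer per entry stays at about 9 GB per test round.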

      SRM+HTTP tape test - plan:

      0) periodic tests only for SRM+HTTP and DISK (e.g. to understand occasional timeouts seen for transfer from DISK) - DONE
      1) upload 100TB dataset (1GB, 10GB files) to several dCache/StoRM/EOS/DPM DISK storages (Rucio RSEs)
      2) (Rucio) FTS transfer dataset with SRM+HTTP to StoRM TAPE endpoint (INFN-T1) and dCache TAPE endpoint (FNAL or BNL?)
      3) ask tape admins to verify that the uploaded files are in the right state, that they reached tape, and that everything generally looks fine (or use SFO?)
      4) ask tape admins to clean up dataset files from the disk buffer (to let bringonline in the next step do real work)
      5) (Rucio) FTS transfer with SRM+HTTP from TAPE storage to DISK endpoints dCache/StoRM/EOS/DPM
      6) repeat only if we find an issue / from the point where we found issue

      It will be necessary to synchronize with the tape storage admins: site ready for step 2, local verification in steps 3-4, site ready for step 4. I can imagine that all steps could take a few weeks to complete, but first I would like to understand step 0 (I'll replace tape with disk for [1]); meanwhile I'll prepare the test dataset.
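
To get a feeling for the size of the test dataset in step 1, the number of files can be estimated as below. This is a rough sketch under two assumptions that are not in the plan: the 100TB volume is split evenly between the two file sizes, and decimal units are used (1 TB = 10^12 bytes). `file_counts` is a hypothetical helper.

```python
TB = 10**12  # decimal terabyte (assumption)
GB = 10**9   # decimal gigabyte (assumption)


def file_counts(total_bytes, file_sizes):
    """Split total_bytes evenly by volume across the given file sizes
    and return the number of files needed for each size."""
    share = total_bytes // len(file_sizes)
    return {label: share // size for label, size in file_sizes.items()}


counts = file_counts(100 * TB, {"1GB": 1 * GB, "10GB": 10 * GB})
print(counts)  # {'1GB': 50000, '10GB': 5000}
```

With an even split this gives 55,000 files per uploaded copy of the dataset, which matters for the cleanup and verification steps that tape admins are asked to do.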

       

    • 16:10 16:20
      Future uniform tape access 10m
      Speakers: Oliver Keeble (CERN), Paul Millar
    • 16:20 16:30
      XrootD 5 news 10m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:30 16:40
      Experiments production 10m
      Speakers: Alessandra Forti (University of Manchester (GB)), Diego Davila Foyo (Univ. of California San Diego (US)), Petr Vokac (Czech Technical University (CZ))

      ATLAS

      • HTTP-TPC migration status after May 31 deadline for disk storages
      • Problem with FTS transfer limits INC2803669
        • not possible to set limit for site, but only for protocol
          • max transfers is SUM(gsiftp + davs limit)
          • this will be solved by dropping other TPC protocols
            • we need tapes on HTTP-TPC first
        • transfer limits don't seem to be enforced correctly
          • we saw more active transfers than configured storage limit
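
The effect of per-protocol limits described above can be illustrated with a short calculation. This is a sketch of the arithmetic only, not of how FTS implements its limits; the limit values and the helper names `effective_site_limit` and `exceeds_limit` are made up for the example.

```python
def effective_site_limit(per_protocol_limits):
    """With limits configurable only per protocol, the maximum number of
    concurrent transfers a site can see is the sum of those limits."""
    return sum(per_protocol_limits.values())


def exceeds_limit(active_transfers, limit):
    """True if more transfers are active than the configured limit allows."""
    return active_transfers > limit


# Hypothetical limits for one storage endpoint.
limits = {"gsiftp": 100, "davs": 100}

site_limit = effective_site_limit(limits)
print(site_limit)  # 200, although the intended storage limit may be 100

# After dropping the other TPC protocols only the davs limit remains.
print(effective_site_limit({"davs": 100}))  # 100
```

This shows why dropping the other TPC protocols resolves the first problem: with a single protocol left, the protocol limit and the site limit coincide.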

      CMS

      Total sites: 53
      In production: 30 (57%)
      Passed manual tests: 8/23
      Have a WebDAV endpoint: 12/23
      Do NOT have a WebDAV endpoint: 3/23

      This week we might be able to move 6 more sites to Production
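
As a quick consistency check of the CMS numbers, the snippet below recomputes the percentages. The figures are taken from the list above; the projection assumes that all 6 candidate sites actually move to production this week, and the category labels are shortened for the example.

```python
total_sites = 53
in_production = 30

# Breakdown of the sites that are not yet in production.
remaining = {
    "passed manual tests": 8,
    "have a WebDAV endpoint": 12,
    "no WebDAV endpoint": 3,
}

# The three categories should add up to the sites not yet in production.
assert sum(remaining.values()) == total_sites - in_production  # 23


def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


current = percent(in_production, total_sites)
projected = percent(in_production + 6, total_sites)
print(current, projected)  # 57 68
```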

      FTS / Gfal2

      • introduced LOG_SENSITIVE=<true|false> configuration option FTS-1663
    • 16:40 16:50
      Network data challenges 10m
      Speakers: Dr Riccardo Di Maria (CERN), Rizart Dona (CERN)

      Monitoring Updates (markdown)

      • Limitations in measuring accurate throughput from the FTS aggregated data (ref.)

      • Lack of metadata fields in the FTS aggregated data; these are needed to separate testing traffic from normal traffic (ref.) (correction: this metadata does exist in the aggregated data, thanks to Nick Smith for pointing it out)

      • CMS Rucio data → retention policy of 1 month (hosted in the short term Elasticsearch cluster)

        • In contrast, ATLAS Rucio data → retention policy of 1+ year (hosted in InfluxDB)

        • A longer retention policy for CMS might be needed if we need long-term monitoring → alternatively, just use the FTS aggregated data and skip the Rucio data sources

    • 16:50 17:00
      Token Authorization testbed 10m
      Speaker: Andrea Ceccanti (Universita e INFN, Bologna (IT))
    • 17:00 17:05
      AOB 5m