ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

    • 10:00–10:20
      Status 20m
      • Outstanding tickets 10m
        • 147770 UKI-NORTHGRID-LANCS-HEP (very urgent, in progress, 2020-07-08 14:29): UKI-NORTHGRID-LANCS-HEP stage-out failures
          • Disk server for DPM overloaded; work to be done on improvements.
        • 147744 UKI-LT2-QMUL (urgent, in progress, 2020-07-08 11:59): Inaccessible files at UKI-LT2-QMUL_DATADISK
          • Work ongoing; the next version of StoRM should improve this.
        • 146651 RAL-LCG2 (urgent, in progress, 2020-05-27 10:43): singularity and user NS setup at RAL
          • JW to get the ticket updated.
        • 146374 UKI-NORTHGRID-SHEF-HEP (urgent, on hold, 2020-06-24 16:18): ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
          • On hold.
        • 144759 UKI-SCOTGRID-GLASGOW (less urgent, on hold, 2020-06-09 07:59): High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
          • On hold.
        • 142329 UKI-SOUTHGRID-SUSX (top priority, on hold, 2020-06-04 14:05): CentOS7 migration UKI-SOUTHGRID-SUSX
          • On hold.
      • CPU 5m
        • RAL

          • Change to the ATLAS sub-group quota: CMS took extra slots and ATLAS dropped below pledge; now reverted, and ATLAS is recovering slots.
        • Northgrid

        • London

          • QMUL: ATLAS job failures on certain sets of worker nodes (details under Other new issues below).
        • SouthGrid

        • Scotgrid

          • Running XRootD with different versions of voms-xroot to avoid both sets of problems.

      • Other new issues 5m
        • Problem with HC / ATLAS this morning caused mass HC blacklisting.
          All sites should have been manually set back online.

        • Site blacklisting reference: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/SitesSetupAndConfiguration#Site_blacklisting
          • Site status can be found in ATLAS SAM monitoring or AGIS.
          • Based on site downtime: Switcher.
          • Storage downtime: site admins are requested to declare downtime for ALL published access protocols at the site. If not all protocols are declared 'stopped', data access will be attempted through the allowed protocols.
          • Based on site validation with HammerCloud jobs: Monitoring; HammerCloudTutorialATLASsiteAdmins.

        • OX Arc6 upgrade:
          http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UKI-SOUTHGRID-OX-HEP&startTime=2020-07-02&endTime=2020-07-09&templateType=isGolden

        • Status: GLASGOW - CEPH
          https://its.cern.ch/jira/browse/ADCINFR-152

        • QMUL:
          https://bigpanda.cern.ch/wns/UKI-LT2-QMUL/?hours=12
          The job failures I believe are all related to the error "generate got a SIGKILL signal (exit code 137)", which I think is an out-of-memory error resulting in the OS killing the job.

        They have up to 4 GB RAM per job slot.

        The difference I see between the nodes is that on the non-Intel Gold CPUs I see lots of errors in dmesg of the type:

        arcproxy[27406]: segfault at fffffffffffffff8 ip 00002b97c69f9d4c sp 00007ffec46e2ac0 error 5 in libstdc++.so.6.0.19[2b97c693b000+e9000]

        I don't see this error on the Intel Gold CPUs; instead I see messages like:

          ...
          memory: usage 8191968kB, limit 8192000kB, failcnt 22036
          memory+swap: usage 8396800kB, limit 8396800kB, failcnt 7
          ...
          Memory cgroup out of memory: Kill process 175817 (arcproxy) score 994 or sacrifice child
          Killed process 175817 (arcproxy) total-vm:8567952kB, anon-rss:8174060kB, file-rss:4976kB, shmem-rss:0kB
          arcproxy[217655]: segfault at 402e03 ip 00002acbdd02edce sp 00007ffd58970510 error 7 in libstdc++.so.6.0.19[2acbdcf70000+e9000]
          ...

        It's always arcproxy that gets killed.

        Example good nodes:

          cn306-308: HPE DL385, AMD EPYC 7351, 128 job slots, 4 GB RAM per job slot
          cn311-315: Dell R710, Intel X5650, 24 job slots, up to 4 GB RAM per job slot

        Problem nodes:

          cn321-344: Dell R440, Intel Gold 5118, 48 job slots, 4 GB RAM per job slot
          cn501: Lenovo SR570, Intel Gold 6252, 96 job slots, 4 GB RAM per job slot

        • Storage downtimes: sites are reminded to declare downtime for ALL published access protocols (see the blacklisting notes above).
        • OX Arc6 upgrade
          • Discussion on current status and current issues
        • Status: GLASGOW - CEPH
          • https://its.cern.ch/jira/browse/ADCINFR-152

          • Work needed on GFAL2 and/or the xrootd-ceph plugin was discussed; this would also affect RAL.

          • Update on the current status; plans for GridFTP, XRootD and redirection were brought up.

          • GridFTP, the external protocol, works OK but will be deprecated (on a roughly 12-month timescale).

            • Can be good enough for production work for now.

          • Will aim to get XRootD write-back working; caching is in development.

          • If necessary, internal access could be switched to use xrdcp rather than a Rucio-mediated copy (a sketch follows below).
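
          • As an illustration of that option only (an assumption, not the actual Glasgow or RAL configuration; the endpoint, port and paths are placeholders), a minimal Python wrapper around xrdcp might look like:

                import subprocess

                def xrdcp_fetch(xrootd_url, local_path):
                    """Copy one file from an XRootD endpoint to local disk with xrdcp; -f overwrites."""
                    result = subprocess.run(
                        ["xrdcp", "-f", xrootd_url, local_path],
                        capture_output=True, text=True,
                    )
                    if result.returncode != 0:
                        raise RuntimeError(f"xrdcp failed ({result.returncode}): {result.stderr.strip()}")
                    return local_path

                # Hypothetical endpoint and file, for illustration only:
                # xrdcp_fetch("root://ceph-gw.example.ac.uk:1094//atlas/somescope/somefile", "/tmp/somefile")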

        • QMUL: the job failures all appear to be out-of-memory kills reported as "generate got a SIGKILL signal (exit code 137)"; it is always arcproxy that gets killed (see the details above and the sketch below).
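
        A small diagnostic sketch related to the QMUL failures above (an assumption, not the QMUL pilot or batch-system code; the cgroup path is the cgroup-v1 default and may differ on the actual worker nodes). It shows why "exit code 137" means the process was SIGKILLed, and how to read the memory-cgroup limits reported in the dmesg excerpt:

            import os
            import signal
            import subprocess

            # Exit code 137 is the shell convention 128 + signal number, i.e. 128 + 9 (SIGKILL).
            proc = subprocess.run(["sh", "-c", "kill -KILL $$"])
            print("Python returncode:", proc.returncode)           # -9: killed by SIGKILL
            print("shell-style exit code:", 128 + signal.SIGKILL)  # 137

            def show_memory_limits(cgroup_path="/sys/fs/cgroup/memory"):
                # These files hold the limits shown as "limit 8192000kB" and
                # "memory+swap ... 8396800kB" in the dmesg lines above (cgroup v1 layout).
                for name in ("memory.limit_in_bytes", "memory.memsw.limit_in_bytes"):
                    path = os.path.join(cgroup_path, name)
                    try:
                        with open(path) as f:
                            print(name, int(f.read()) // 1024, "kB")
                    except OSError:
                        print(name, "not readable here (different cgroup layout?)")

            show_memory_limits()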
    • 10:20–10:40
      Ongoing issues 20m
      • CentOS7 - Sussex
        • on-hold
      • Grand Unified queues
        • on-hold
    • 10:40–10:50
      News round-table 10m
      • Vip

        • Discussion on ARC-CE 6 and possible mapping errors; followed up on TB-support.
      • Dan

        • NTR
      • Matt

        • ARC-6 being updated
      • Peter

        • Will update AGIS so that the LANCS test queue targets the CE.
      • Alessandra

        • NTR
      • Sam

        • NTR
      • Gareth

        • NTR
      • Tim

        • NTR
      • JW

        • Working on HTTP TPC, with small updates to work around the current configuration setup (see the sketch below).
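
        A rough sketch of what such an HTTP-TPC request looks like (a generic illustration of the WLCG third-party-copy conventions, not JW's actual setup; the endpoints, paths, token and CA path are placeholders):

            import requests

            SOURCE = "https://source-se.example.ac.uk:443/atlas/rucio/path/to/file"
            DEST = "https://dest-se.example.ac.uk:443/atlas/rucio/path/to/file"
            TOKEN_FOR_SOURCE = "PLACEHOLDER_TOKEN"  # credential the destination presents to the source

            # "Pull"-mode third-party copy: ask the destination endpoint to fetch the file itself.
            resp = requests.request(
                "COPY",
                DEST,
                headers={
                    "Source": SOURCE,   # pull mode: destination reads from Source
                    "Overwrite": "T",
                    # By the WLCG TPC convention, "TransferHeader<X>" headers are forwarded
                    # to the remote end as "<X>", delegating the credential for the source:
                    "TransferHeaderAuthorization": "Bearer " + TOKEN_FOR_SOURCE,
                },
                verify="/etc/grid-security/certificates",  # CA directory as typically found on grid nodes
                stream=True,
            )
            # A client certificate (cert=...) may also be required, depending on the endpoint.
            # The active endpoint streams performance markers until the copy completes.
            for line in resp.iter_lines():
                print(line.decode(errors="replace"))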

      Comments from the chat window

      OX
        /usr/sbin/slapd -f /var/run/arc/infosys/bdii-slapd.conf -h ldap://*:2135 -u ldap
        so the file is /var/run/arched-arex.cfg
      
    • 10:50–11:00
      AOB 10m