ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-07-09T10:00:00+01:00
End: 2020-07-09T11:00:00+01:00
Location: Vidyo

Thursday 9 Jul 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

147770 UKI-NORTHGRID-LANCS-HEP very urgent in progress 2020-07-08 14:29:00 UKI-NORTHGRID-LANCS-HEP: stage-out failures
- Disk server for DPM overloaded; work to be done on improvements
147744 UKI-LT2-QMUL urgent in progress 2020-07-08 11:59:00 Inaccessible files at UKI-LT2-QMUL_DATADISK
- Work ongoing; the next version of StoRM should behave better here
146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL
- JW to get ticket updated.
146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-24 16:18:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- on hold
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- on hold
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- on hold

New issues

CPU

RAL
- Change to ATLAS sub-group quota; CMS took extra slots, ATLAS pledge down; now reverted and recovering slots
Northgrid
London
- QMUL: ATLAS job failures on certain node sets.
SouthGrid
Scotgrid
- Xrootd with different versions of voms-xroot to avoid both sets of problems

Other new issues

Problem with HC / ATLAS this morning; mass HC-blacklisting.
- All sites should have been manually set back online.
Storage downtimes
- Site admins are requested to declare downtime for ALL published access protocols at the site. If not all protocols are declared ‘stopped’, data access will be attempted through allowed protocols
- https://twiki.cern.ch/twiki/bin/view/AtlasComputing/SitesSetupAndConfiguration#Site_blacklisting
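
As a rough illustration of the storage-downtime rule above: any published protocol that is not covered by the downtime declaration remains a candidate for data access. The sketch below is only a model of that rule; the helper name and the protocol list are hypothetical and are not taken from AGIS or any ATLAS tool.

```python
def protocols_still_used(published, declared_down):
    """Return the published protocols NOT covered by the downtime, in published order."""
    down = {p.lower() for p in declared_down}
    return [p for p in published if p.lower() not in down]


# Hypothetical site publishing four access protocols.
published = ["srm", "gsiftp", "root", "davs"]

# Downtime declared for only two of them: the other two would still be tried.
partial = protocols_still_used(published, ["srm", "gsiftp"])

# Downtime declared for all of them: nothing is left to try.
complete = protocols_still_used(published, published)

print(partial)
print(complete)
```

An empty result is the state site admins are being asked to reach before storage work starts.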
OX Arc6 upgrade
- Discussion on current status and current issues
Status: GLASGOW - CEPH
- https://its.cern.ch/jira/browse/ADCINFR-152
- Work needed on gfal2 and/or the xrootd-ceph plugin was discussed; this would also affect RAL
- Update on current status; plans for GridFTP, xroot and redirection were brought up
- GridFTP works OK, but will be deprecated (for 12 months) (external)
  - Can be good enough for production work for now
- GridFTP is the external protocol
- Will aim to get the xrootd write-back working
- Caching is in development
- If necessary, can consider setting internal access to use xrdcp (rather than rucio copy)
QMUL:
- The job failures, I believe, are all related to the error “generate got a SIGKILL signal (exit code 137)”, which I think is an out-of-memory error resulting in the OS killing the job.
- It is always arcproxy that gets killed.
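
For reference, exit code 137 in the QMUL error follows the usual shell convention of 128 plus the signal number, and signal 9 is SIGKILL. A minimal sketch of that decoding is below; the function name is illustrative, and the signal name lookup assumes a Linux (POSIX) host. A SIGKILL on its own does not prove an out-of-memory kill, so the kernel log still needs to be checked.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status, where 128 + N means 'killed by signal N'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with status {code}"


print(describe_exit_code(137))
```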

Ongoing issues

CentOS7 - Sussex
- on-hold
Grand Unified queues
- on-hold

News round-table

(NTR)

Vip
- Discussion on ARC-CE 6 and possible mapping errors; followed up on TB-SUPPORT
Dan
- NTR
Matt
- ARC-6 being updated
Peter
- Will update AGIS so that the LANCS test queue targets the CE
Alessandra
- NTR
Sam
- NTR
Gareth
- NTR
Tim
- NTR
JW
- Working on HTTP TPC, with small updates to work around the current configuration setup

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    UK Cloud jobs over last week
    
  - Other new issues 5m
    
    Problem with HC / ATLAS this morning; mass HC-blacklisting.
    All sites should have been manually set back online.
    
    https://twiki.cern.ch/twiki/bin/view/AtlasComputing/SitesSetupAndConfiguration#Site_blacklisting
    Site status can be found in ATLAS Sam monitoring or AGIS
    Based on site downtime : Switcher
    Storage downtime : Site admins are requested to declare downtime for ALL published access protocols at the site. If not all protocols are declared 'stopped', data access will be attempted through allowed protocols.
    Based on site validation with HammerCloud jobs :
    Monitoring
    HammerCloudTutorialATLASsiteAdmins
    
    OX Arc6 upgrade:
    http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UKI-SOUTHGRID-OX-HEP&startTime=2020-07-02&endTime=2020-07-09&templateType=isGolden
    
    Status: GLASGOW - CEPH
    https://its.cern.ch/jira/browse/ADCINFR-152
    
    QMUL:
    https://bigpanda.cern.ch/wns/UKI-LT2-QMUL/?hours=12
    The job failures, I believe, are all related to the error "generate got a SIGKILL signal (exit code 137)", which I think is an out-of-memory error resulting in the OS killing the job.
    
    They have up to 4 GB RAM per job slot.
    
    The difference I do see between the nodes is that on the non-Intel Gold CPUs I see lots of errors in dmesg of the type
    
    arcproxy[27406]: segfault at fffffffffffffff8 ip 00002b97c69f9d4c sp 00007ffec46e2ac0 error 5 in libstdc++.so.6.0.19[2b97c693b000+e9000]
    
    I don't see this error on the Intel Gold CPUs.
    I see messages like
    "
    ...
    memory: usage 8191968kB, limit 8192000kB, failcnt 22036
    memory+swap: usage 8396800kB, limit 8396800kB, failcnt 7
    ...
    Memory cgroup out of memory: Kill process 175817 (arcproxy) score 994 or sacrifice child
    Killed process 175817 (arcproxy) total-vm:8567952kB, anon-rss:8174060kB, file-rss:4976kB, shmem-rss:0kB
    arcproxy[217655]: segfault at 402e03 ip 00002acbdd02edce sp 00007ffd58970510 error 7 in libstdc++.so.6.0.19[2acbdcf70000+e9000]
    ...
    "
    It's always arcproxy that gets killed.
    
    Example good nodes
    
    cn306 - 308 are HPE DL385 AMD EPYC 7351 with 128 job slots and 4 GB RAM per job slot
    
    cn311-315 are Dell R710 Intel X5650 with 24 job slots and up to 4 GB RAM per job slot
    
    Problem nodes
    
    cn321 – 344 are Dell R440 Intel Gold 5118 with 48 job slots and 4 GB RAM per job slot
    
    cn501 is a Lenovo SR570 Intel Gold 6252 CPU with 96 job slots and 4 GB RAM per job slot
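
The cgroup counters in the kernel messages quoted above can be compared directly to see how close the job was to its limit. The sketch below parses only the two counter lines copied from those messages; the regular expression and the function name are assumptions based on that sample, not a general dmesg parser.

```python
import re

# Matches lines such as "memory: usage 8191968kB, limit 8192000kB, failcnt 22036".
COUNTER_RE = re.compile(
    r"(memory(?:\+swap)?): usage (\d+)kB, limit (\d+)kB, failcnt (\d+)"
)


def parse_cgroup_counters(text):
    """Return usage, limit, failcnt and remaining headroom (kB) per counter name."""
    counters = {}
    for match in COUNTER_RE.finditer(text):
        name = match.group(1)
        usage, limit, failcnt = (int(match.group(i)) for i in (2, 3, 4))
        counters[name] = {
            "usage_kb": usage,
            "limit_kb": limit,
            "failcnt": failcnt,
            "headroom_kb": limit - usage,
        }
    return counters


# The two counter lines from the dmesg excerpt in these minutes.
sample = """\
memory: usage 8191968kB, limit 8192000kB, failcnt 22036
memory+swap: usage 8396800kB, limit 8396800kB, failcnt 7
"""

counters = parse_cgroup_counters(sample)
for name, values in counters.items():
    print(name, values["headroom_kb"], "kB headroom")
```

For the sample, memory has 32 kB of headroom left and memory+swap has none, which is consistent with the cgroup out-of-memory kill reported in the same excerpt.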
    
- 10:20 → 10:40
  Ongoing issues 20m
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
- 10:40 → 10:50
  News round-table 10m
  
  Comments from the chat window
  
  OX
  
  specific log: https://aipanda024.cern.ch/condor_logs_2/20-07-09_04/grid.19096170.19.log
  
  https://monit-grafana.cern.ch/d/3naRcbRZz/harvester?panelId=22&fullscreen&orgId=17&var-bin=1h&var-cloud=All&var-site=UKI-SOUTHGRID-OX-HEP&var-computingsite=All&var-pqstatus=online&var-instance=All&var-status=All&var-resourcetype=All&var-computingelement=All&var-jobtype=All
  
  /usr/sbin/slapd -f /var/run/arc/infosys/bdii-slapd.conf -h ldap://*:2135 -u ldap so file is /var/run/arched-arex.cfg
- 10:50 → 11:00
  
  AOB 10m