ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as OPs Mtg, but repeated)

Outstanding tickets

  • 149842 UKI-SCOTGRID-ECDF less urgent in progress 2020-12-15 02:10:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • Rob investigating with DPM. Interface to DPM is not facilitating progress.
  • 149750 UKI-SOUTHGRID-RALPP less urgent in progress 2020-12-11 13:51:00 UKI-SOUTHGRID-RALPP: unable to connect to host
    • Resolved the initial problem by switching transfers to IPv6; underlying firewall/network issues still to be resolved.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-12-04 10:14:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Possibly related to the IPv6 issues; needs following up in Jira.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-12-10 14:42:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Files declared as lost; should be OK to close if transfers now look OK.
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • On hold; awaiting underlying changes
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Dev Harvester instance issues fixed for the Condor submitter.

CPU

New link for the site-oriented dashboard

  • RAL

    • OK; JW to ensure corepower is updated in the new year.
  • Northgrid

    • 60k HS06 running; some issue started recently, perhaps with gridFTP
  • London

    • QMUL
      • PDU replacements; issues with cabling from the PDUs to the racks, which also needs replacing, leading to increased downtime.
      • Many improvements (and some disruption) next year, leading to improved cooling and capacity.
      • Work on CEs next week.

  • SouthGrid

    • Oxford squid server down; moved to the old server.

    • QMUL will use its old perfSONAR for its new squids.

  • Scotgrid

    • Durham had a drop in capacity; starting to come back now.

Other new issues

  • Site availability and reliability for Glasgow follow-up; see associated tickets.
    • Mixture of CRIC and AGIS information being used. From the ATLAS SAM team's side the migration has been done, so the ETF tests use CRIC as the source; the MONIT side will eventually use the filtered vofeed provided by ATLAS, but as a temporary solution MONIT consumes the vofeed (from either AGIS or CRIC) internally.
    • Glasgow - to ensure all relevant info is included in GOCDB.
    • ATLAS - to see how much can be exposed before the final push to CRIC.

Ongoing issues

  • CentOS7 - Sussex

    • Sussex - Peter reports dev server was fixed, so all pilots now working.
      • In a good state for provisioning of nodes in new year
  • TPC with http

    • Expecting a deadline of May 2021 for deployment at most sites (see the sketch after this list).
  • Storageless Site tests (Oxford)

    • No particular progress
  • ECDF volatile storage

    • Awaiting JW to make SE changes from Jira
  • Glasgow DPM Decommissioning

    • Sam preparing Ceph localgroupdisk
    • Hope to transfer across before/during Christmas.
    • Gareth will put the DPM in “AT RISK” for the period.
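
For the TPC-with-http item above, a minimal sketch of what an HTTP third-party copy looks like from the client side, using the gfal2 Python bindings. The davs:// endpoints and paths are hypothetical placeholders rather than real UK site URLs, and this is an illustration only, not the agreed validation procedure.

    # Sketch: trigger a copy between two WebDAV/HTTP endpoints with gfal2.
    # The endpoints and paths below are hypothetical placeholders.
    import gfal2

    src = "davs://source-se.example.ac.uk:443/atlas/datadisk/testfile"
    dst = "davs://dest-se.example.ac.uk:443/atlas/datadisk/testfile"

    ctx = gfal2.creat_context()      # note: the bindings really spell it 'creat_context'
    params = ctx.transfer_parameters()
    params.overwrite = True          # replace any existing destination copy
    params.timeout = 300             # seconds allowed for the whole transfer
    ctx.filecopy(params, src, dst)   # gfal2 attempts a third-party copy where the endpoints support it
    print("copy completed")

In production these transfers are of course driven by Rucio/FTS rather than by hand; the point is only to show the protocol and endpoints being exercised.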

News round-table

  • General

    • Most sites will have limited response from next week.
  • Vip

    • Asked about the mu3e VO port number.

  • Dan

    • NTR
  • Matt

    • NTR
  • Peter

    • NTR
  • Sam

    • NTR
  • Gareth

    • Will set the DPM to AT RISK from the weekend.
  • JW

    • NTR

AOB

  • Next UK Cloud meeting 7th January 2021
    • Happy Holidays!

 
