ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as (new) OPs Mtg)

● Status

  • 153405 TEAM atlas UKI-LT2-QMUL less urgent NGI_UK assigned 2021-08-04 21:38:00 UKI-LT2-QMUL squid degraded WLCG

      • Rebooted; may need new disks if the issue reappears
      • Ticket now solved
  • 153367 TEAM atlas RAL-LCG2 urgent NGI_UK in progress 2021-08-04 11:55:00 HTTPS on RAL CTA WLCG

      • Tracking ticket for the Tape tests
  • 153295 USER atlas RAL-LCG2 less urgent NGI_UK in progress 2021-08-02 09:37:00 stuck staging requests from RAL MCTAPE WLCG

      • 14 files declared as lost; unclear whether they were never correctly transferred/staged, or were otherwise lost.
  • 153277 TEAM atlas UKI-SCOTGRID-GLASGOW less urgent NGI_UK in progress 2021-07-28 13:14:00 UKI-SCOTGRID-GLASGOW_CEPH job stage-in failures WLCG

      • All should be back to normal now; the CRIC WAN/LAN settings are quite complex.

● CPU

  New link for the site-oriented dashboard

  • RAL

    • 2021 pledge values now applied; awaiting some steady state to assess fairshares
  • Northgrid

    • LANCS in downtime for updates
  • London

    • QMUL - more nodes online; user namespaces for Singularity needed some reboots
  • SouthGrid

    • OX - problems with XCache over the weekend; went back to a previous configuration
    • BHAM - no jobs running for the last couple of days
  • Scotgrid

    • Aiming for internal switch monitoring improvements
    • Recovering after the (above) GGUS issues

● Other new issues / tasks

    • Major Downtime for RAL T1 on 14/15th August

      • All services offline for site core infrastructure upgrades; all networking (etc.) down. It may be possible to shorten the downtime if all goes well.
      • The following weekend (21/22) expect minor disruptions, depending on the success of the 14th interventions.

● Ongoing Items

  • CentOS7 - Sussex

    • NTR
  • TPC with http

    • Both RAL and Glasgow look OK when acting as the source site; more problems when acting as the destination
  • Storageless Site test (Oxford)

    • Following on from the Storage Meeting discussions; it would be interesting to get accurate / latest numbers for ATLAS throughputs at sites, and for particular activities
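As background to the TPC-with-http item above, the sketch below shows how an HTTP third-party copy is formed under the WLCG WebDAV-TPC convention: the client sends an HTTP COPY to the "active" endpoint, naming the passive one in a Source header (pull mode: the destination fetches the file) or a Destination header (push mode: the source uploads it). This is an illustration only, not the RAL or Glasgow configuration; the URLs and token are placeholders.

```python
def tpc_copy_request(active_url, passive_url, mode="pull", passive_token=None):
    """Return (method, url, headers) for a WebDAV third-party-copy COPY.

    mode="pull": COPY is sent to the destination, which pulls from Source.
    mode="push": COPY is sent to the source, which pushes to Destination.
    """
    if mode == "pull":
        headers = {"Source": passive_url}
    elif mode == "push":
        headers = {"Destination": passive_url}
    else:
        raise ValueError("mode must be 'pull' or 'push'")
    if passive_token is not None:
        # The TransferHeader prefix asks the active party to strip the prefix
        # and forward the remainder (here an Authorization header) on the
        # inner transfer to the passive endpoint.
        headers["TransferHeaderAuthorization"] = "Bearer " + passive_token
    return ("COPY", active_url, headers)

# Pull-mode copy, i.e. the destination site does the work (placeholder URLs):
method, url, headers = tpc_copy_request(
    "https://dest.example.org/atlas/datafile",
    "https://src.example.org/atlas/datafile",
    mode="pull",
)
```

In pull mode the destination's WebDAV gateway performs the transfer, which is one reason source-side and destination-side behaviour can differ for the same pair of sites.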
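The per-site / per-activity throughput numbers mentioned for the storageless-site test could be aggregated from transfer records along these lines. The record fields, site names, and values below are invented for illustration; real figures would come from the ATLAS transfer monitoring.

```python
from collections import defaultdict

def throughput_by_site_activity(transfers):
    """Sum bytes and seconds per (site, activity); return rates in MB/s."""
    totals = defaultdict(lambda: [0, 0.0])  # (site, activity) -> [bytes, seconds]
    for t in transfers:
        key = (t["site"], t["activity"])
        totals[key][0] += t["bytes"]
        totals[key][1] += t["seconds"]
    return {k: b / s / 1e6 for k, (b, s) in totals.items() if s > 0}

transfers = [  # invented example records
    {"site": "UKI-SOUTHGRID-OX-HEP", "activity": "Analysis Input",
     "bytes": 5_000_000_000, "seconds": 500.0},
    {"site": "UKI-SOUTHGRID-OX-HEP", "activity": "Analysis Input",
     "bytes": 1_000_000_000, "seconds": 100.0},
    {"site": "RAL-LCG2", "activity": "Data Consolidation",
     "bytes": 2_000_000_000, "seconds": 50.0},
]
rates = throughput_by_site_activity(transfers)
# rates[("UKI-SOUTHGRID-OX-HEP", "Analysis Input")] -> 10.0 MB/s
```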

● News round-table

  • Dan

    • NTR
  • Gerard

    • NTR
  • Matt

    • NTR
  • Sam

    • NTR

● AOB

  • Expect to keep this meeting weekly over the summer

