ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147082 UKI-NORTHGRID-MAN-HEP in progress 2020-05-21 File not accessible at … WLCG
    – Determine whether there is a problem with the server before declaring the file lost.

  • 146947 UKI-NORTHGRID-LANCS-HEP in progress 2020-05-20 UKI-NORTHGRID-LANCS-HEP: stage-in/out … WLCG
    – New disk servers to come online; the site will then run with twice as many endpoints.
    – Two new servers are going online, which should double the number of endpoints. Care is needed, as DPM can favour emptier servers; set round-robin mode.

  • 146918 UKI-SCOTGRID-ECDF in progress 2020-05-19 Failovers from jobs running at … EGI
    – Checking whether this is an OpenStack network configuration issue.

  • 146771 UKI-SCOTGRID-ECDF on hold 2020-05-18 UKI-SCOTGRID-ECDF deletion failures … WLCG
    – On hold pending the SL6 to CentOS7 migration.

  • 146651 RAL-LCG2 urgent waiting for reply 2020-05-19 singularity and user NS setup at RAL WLCG
    – user NS setup in release

  • 146525 UKI-NORTHGRID-SHEF-HEP on hold 2020-05-15 UKI-NORTHGRID-SHEF-HEP: evicted jobs WLCG
    – ARC misconfiguration problems; work ongoing

  • 146374 UKI-NORTHGRID-SHEF-HEP on hold 2020-05-15 ATLAS pilot jobs idle on … WLCG
    – ARC misconfiguration problems; work ongoing

  • 146159 UKI-SCOTGRID-GLASGOW in progress 2020-05-19 Unaccessible files at … WLCG
    – DPM draining from ATLAS
    – Better communication requested to the site

  • 145688 UKI-NORTHGRID-MAN-HEP on hold 2020-04-02 Very old version of squids at … EGI
    – on hold

  • 145510 RAL-LCG2 urgent on hold 2020-05-13 RAL-LCG2: timeouts on stage-in/outs WLCG
    – No specific updates; work to understand FAH interactions with scheduled jobs is underway

  • 144759 UKI-SCOTGRID-GLASGOW on hold 2020-02-17 High traffic from UKI-SCOTGRID-GLASGOW … EGI
    – GR to post update

  • 142329 UKI-SOUTHGRID-SUSX reopened 2020-05-14 CentOS7 migration UKI-SOUTHGRID-SUSX WLCG
    – No news this week

CPU

  • RAL
    – Last Wednesday: reactivated MCORE queue

  • Northgrid

  • London
    – QMUL: tripped cooling unit, but largely OK

  • SouthGrid

  • Scotgrid
    – Durham: downtime for data centre maintenance

Other new issues

Ongoing issues

  • CentOS7 - Sussex
    – As above

  • Glasgow Ceph storage
    – FTS working; some Echo-specific improvements to be implemented.
    – Changes to the AGIS endpoints allowed some small issues to be spotted; these are to be worked through.
    – The rucio mover had trouble previously; mostly fixed at RAL. It should work for GLA and should be tested.
    – Want to stress-test initially with non-primary data; to follow up via Jira.

  • Grand Unified queues
    – Awaiting Sheffield

News round-table

  • Vip
    – NTR

  • Dan
    – NTR; cooling unit tripped during the week

  • Matt
    – Continue discussions on DPM OS upgrades

  • Sam
    – NTR

  • Gareth
    – Will update GGUS ticket (GLA squid)

  • Tim
    – XRootD TPC at RAL: rebooting one of the gateways fixed the issue. To try again once the Nautilus upgrades are done.

  • JW
    – NTR

AOB

Monitoring and accounting: new WLCG and ATLAS monitoring and site availability pages made available

FTS: Plan to switch the UK cloud to the upgraded RAL test FTS instance (production quality), then move back to the main FTS once it has been upgraded.
– Exact dates to be announced
– To check whether it will include the new TPC possibilities
