ATLAS UK Cloud Support

Europe/London
Vidyo


Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))


Outstanding tickets


GGUS# 146280 Lancaster
  Deletion errors: trying to keep the disk server online while draining the data off (updated post-meeting).

GGUS# 146159 Glasgow
James: discussed in ATLAS Ops meetings. Could declare the files temporarily unavailable, to avoid blacklisting the whole site. Sam: happy to try it.
Elena created JIRA ticket: https://its.cern.ch/jira/browse/ATLDDMOPS-5518
Gareth will feed back. Sam doesn't have access to the JIRA.
Other Glasgow tickets are the same problem, except for the dormant Squid issue (won't happen while locked out of the University).

GGUS# 145688 Manchester
 Alessandra is not convinced they need to upgrade the Squids. Some debate.

GGUS# 145510 RAL
Still the same problems. New WNs with SSDs are expected; will see if they do better.
ATLAS removed the walltime limit on queues, so more jobs that were hitting the limit now succeed.
Still a tail of slow transfers.


CPU

Sheffield has been down for all VOs, not just ATLAS. Asked in TB-SUPPORT. Had a reply from Raoul today. Does anyone else have an idea?

Sussex: there seems to be a problem with jobs requesting 2 GB of RAM when most jobs use more than this. The request is set with the JDL maxrss. Peter: AGIS settings look OK, so he'll chase up with Harvester experts. The AGIS value is also used to broker jobs, so the site will get jobs close to this limit.


Glasgow Ceph

Sam: rados-cp running in the background, but not as quickly as he'd like. Should be done in a day or two.
In the meantime, writing ansible roles. Hope to be up by the end of the week.

Grand Unified queues

James: proceeding slowly. Only RAL GU queue created.

GridPP Folding@Home

Alessandra: submitting jobs to all sites with Dirac. https://stats.foldingathome.org/team/246309
Some problems getting it running, e.g. at Glasgow. But we are doing quite well, and can steadily continue.
RAL contributing to GridPP, on T1 Cloud, Scarf, PPD GPU machines.


Other

Elena: Request to change HTCondor-CE. Alessandra suggested adding the CEs manually.
Managed to add ce03, but couldn't add the other three. They are active in AGIS.
Will discuss with Alessandra on Skype.

News round-table

Alessandra: working on Folding@Home. Will start to look at HTCondor-CE on testbed. Manchester ARC-CE is quite old.
Dan: Aiming to maintain steady state. 
Elena: last meeting as active ATLAS Cloud Support. Will join meetings if help is needed. Elena will do the ATLAS report on Tuesday next week; Alessandra will then take over.
Gareth: keeping things going.
James: NTR
Peter: thanks to Elena.
Sam: will get a file list of ATLAS data on the problematic server.
Stewart: last meeting chairing.
Tim: thanked Stewart and Elena.
Vip: NTR

    • 10:00 10:20
      Status 20m


      • Outstanding tickets 10m
      • CPU 5m

        GU queues (RAL) now seen in the accounting.

         

      • Other new issues 5m


    • 10:20 10:40
      Ongoing issues 20m
      • CentOS7 - Sussex 5m


      • Glasgow Ceph storage 5m


      • Grand Unified queues 5m


    • 10:40 10:50
      News round-table 10m


    • 10:50 11:00
      AOB 10m
