RAL Tier1 Experiments Liaison Meeting

Europe/London
Access Grid (RAL R89)


Zoom Meeting ID: 66811541532
Host: Alastair Dewhurst
    • 13:30 13:31
      Experiment Operational Issues 1m
    • 13:35 13:45
      VO Liaison ATLAS 10m
      Speakers: Brij Kishor Jashal (TIFR, RAL, IFIC), Jyoti Prakash Biswal (Rutherford Appleton Laboratory)
    • 13:45 13:55
      VO Liaison CMS 10m
      Speaker: Katy Ellis (Science and Technology Facilities Council STFC (GB))

      AAA was clearly struggling with data accesses from remote sites. SAM tests were intermittently red and complaints were made by other sites. The files being accessed were 'premix library' files that are usually accessed only from CERN or FNAL. Katy found that these files had been stored on Antares, which is normal, but of course the only way they get to Antares is via Echo (multihop). By design, Rucio does not necessarily delete the Echo copies immediately... and then jobs around Europe start using them. This usage (popularity) may even reduce the chance of the files being removed (displaced by other files arriving on Echo).

      Jyothish raised the throttling limit on the AAA gateways (previously there was a high level of throttling on these machines, allowing only 100 MB/s in total across the three gateways) and everything looks much better now.
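      As a rough illustration (not the actual gateway mechanism) of why the previous cap hurt remote reads, the per-connection share under an aggregate bandwidth limit can be estimated as below; the 100 MB/s cap and the three gateways are taken from the note above, while the connection counts are hypothetical.

      # Per-connection bandwidth under an aggregate throttle (illustrative only).
      # The 100 MB/s total cap and the 3 gateways come from the minutes above;
      # the connection counts are hypothetical.
      AGGREGATE_CAP_MB_S = 100.0   # old limit across all AAA gateways
      GATEWAYS = 3

      for connections in (30, 300, 3000):           # hypothetical concurrent remote reads
          per_gateway = connections / GATEWAYS
          share = AGGREGATE_CAP_MB_S / connections  # MB/s available to each client
          print(f"{connections:5d} connections (~{per_gateway:.0f} per gateway): ~{share:.2f} MB/s each")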

      Job performance is commensurate with other T1s; however, there has been a drop in performance. Yet again CMS are running jobs that use only 1 core but request 8 - I complained again.
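      For context on the 1-core-used / 8-cores-requested pattern: with the usual definition of CPU efficiency (CPU time divided by walltime times requested cores), a single-threaded job can never exceed 1/8 = 12.5%. A minimal sketch of that calculation is below; the walltime and CPU-time figures are made up for illustration.

      # CPU efficiency = cpu_time / (walltime * requested_cores).
      # A single-threaded payload uses at most one core, so with 8 cores
      # requested the efficiency is capped at 1/8 = 12.5%.
      # The walltime and CPU-time values below are hypothetical.
      requested_cores = 8
      walltime_h = 10.0
      cpu_time_h = 9.5   # close to the most a single thread can accumulate

      efficiency = cpu_time_h / (walltime_h * requested_cores)
      print(f"CPU efficiency: {efficiency:.1%} (ceiling for a single-threaded job: {1 / requested_cores:.1%})")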

      CE tokens seem to be passing SAM tests quite well now, but I'm told there are still problems at the test end.

      Mini data challenge (mini-DC) for CMS UK sites next week: the tests are aimed at Tier 2s, but the Tier 1 will be used as a source and sink, taking care not to put unnecessary pressure on the Tier 1 at this stage. (ATLAS already started their tests this week and will continue next week.)

      TO DO: get RAL-FTS configured for CMS token access.

    • 13:55 14:05
      VO Liaison LHCb 10m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
      • Heavy-Ion data distribution starts today
        • RAL may get up to 1 PB
      • VO-box access ticket (GGUS 168905) is resolved
        • The machine was reinstalled last week; it now has the new FQAN-based authentication mechanism.
        • Somehow, the service certificate had not been copied to the machine.
      • Last week the number of transfers on the RAL(WN)->CERN channel dropped, as did the number of failures.
        • The proper solution is to connect the 2018/2019 worker node generations to the LHCOPN.
        • Tom B agreed to try to mitigate the issue by reducing the priority of these transfers for LHCb jobs.
    • 14:10 14:20
      VO Liaison ALICE 10m
      Speaker: Alexander Rogovskiy (Rutherford Appleton Laboratory)
    • 14:20 14:30
      VO Liaison LSST 10m
      Speaker: Timothy John Noble (Science and Technology Facilities Council STFC (GB))

      RAL message receiver and consumer are set up on our end; no data has been sent to RAL since August (as far as I can tell).
      Ensuring we are included in the tests.

       

      Still awaiting VO access; work is ongoing - thanks Tom.

    • 14:30 14:40
      VO Liaison APEL 10m
      Speaker: Thomas Dack
    • 14:45 14:55
      WP-D - GPU, Data Management, Other 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore
    • 15:00 15:01
      Major Incidents Changes 1m
    • 15:05 15:15
      Summary of Operational Status and Issues 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore

      INCIDENTS

      Last week's call-outs in Opsgenie

      Call-outs over the weekend:

      No call-outs over the weekend.

      There was a “Tier 1 External connectivity issue between Opsgenie and Icinga” this morning (02/12/24); however, this was the result of DI re-doing their network intervention from Wednesday (27/11/24).

      Call-outs:

      There were a number of call-outs on the morning of 27/11/24 as a result of the failed DI network intervention (see report below).

      No other call-outs over the week.

      Antares

      Despite the network problem on 27/11/24, the Antares reboot intervention was completed without issue.

      Batch Farm

       

      Echo

      There appear to have been some issues with the CMS AAA service; @Katy Ellis was reporting “intermittent but significant numbers”.
      [plot: intermittent AAA failures]

      In the absence of @Thomas, Jyothish (STFC,RAL,SC), who was on leave, @Brian Davies performed a reboot of gw10 and gw11 to catch any updates. This did initially appear to help; however, the issues soon reappeared.
      [plot: number of connections to the AAA gateways]

      The plot above shows the number of connections, which indicates dramatically higher usage in the last few weeks. The number of connections seems high for the throughput, which remains around 100 MB/s across the 3 gateways.

       
      [plot: gateway connection counts and memory usage]

      More recently the number of connections has reduced, although it is still significantly above the historical average. Memory usage has increased, which is not yet understood.

      Network

      On Wednesday 27th November, between 08:05 and 09:15, there was a network outage as a result of a firewall problem caused when a scheduled upgrade went wrong. This had a mixed impact on the Tier-1: many routes bypass the firewall (e.g. LHCOPN and LHCONE), but a significant amount of control traffic appears to still go via the firewall, and spikes of failures were seen by the VOs.

      “During a routine Firewall upgrade this morning (27 November) we experienced significant networking issues. These started during the regular at risk period, but continued until approximately 09:15.

      The team worked quickly to resolve the issues as soon as they became apparent and we believe them to now be resolved (reports from across campus are that service is now restored) but we will continue to monitor the situation. The network connection to the Internet is currently running at risk and will continue to be so until we are able to restore both firewalls in a stable manner. Additional work will be needed and this will be done out of hours and with notice.

      These issues will have affected all network traffic to and from the RAL site, including VPN access from offsite.”

      “Following the disruption to the RAL network on Wednesday 27th November, I wanted to provide you with an update on the causes and next steps.

      The interruption happened following an upgrade of the firewalls connecting the RAL campus to the internet. During the upgrade, one of the two firewalls failed to complete successfully and as a result caused instability across the pair. This was resolved when the firewall that had failed was disconnected. At that stage, the network was stable, but was running at risk due to having no fallback option if the remaining firewall failed. 

      Since Wednesday we have successfully upgraded the disconnected firewall and on Monday 2nd December at 0730 we will be reconnecting the firewalls to bring back resilience. We anticipate that this should not have any impact but will be carrying it out early morning to minimise any potential disruption.”

      The intervention on the morning of 02/12/24 appears to have gone without issue.

          Martin Bly  9:56 AM
          From DI, apropos the firewalls:  "The change earlier this morning was successful and the firewalls are now working properly again. They've remained stable since, so the change freeze is now lifted."

    • 15:20 15:21
      AOB 1m
    • 15:22 15:32
      Any other Business 10m
      Speakers: Brian Davies (Lancaster University (GB)), Darren Moore