WLCG Operations Coordination Minutes, Dec 12, 2019

Highlights

Agenda

https://indico.cern.ch/event/869667/

Attendance

  • local: Andrea (CMS + WLCG), Julia (WLCG), Kuba (storage), Luca (storage), Maarten (ALICE + WLCG), Marcelo (LHCb + CNAF), Marian (monitoring + networks), Mayank (WLCG), Renato (LHCb + CBPF)
  • remote: Christoph (CMS), Daniel (FNAL), Dave D (FNAL), Dave M (FNAL), David S (ATLAS), Di (TRIUMF), Giuseppe (CMS), Johannes (ATLAS), Matt (Lancaster), Panos (WLCG), Paolo (CMS), Stephan (CMS)
  • apologies:

Operations News

  • the next meeting is planned for Jan 30
    • please let us know if that date would be very inconvenient

Special topics

SAM ETF update

see the presentation

  • Andrea: could HammerCloud make use of ETF functionality?
  • Marian:
    • in principle there are synergies, but some effort would be required
    • see our report on ETF and HammerCloud

  • Dave M:
    • could we use ETF to probe for vulnerabilities?
    • USCMS would be interested
  • Marian:
    • in principle
    • EGI are using Nagios for some of their security tests
  • Maarten:
    • on a WN one can check the rpm list for vulnerable SW
      (a minimal sketch follows below), but how would vulnerable services be identified?
    • before we start trying to implement anything in this area,
      we should first discuss such ideas in relevant security forums
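
As a rough illustration of the WN check mentioned above, the sketch below compares the installed rpm list against a list of known-vulnerable package versions. The package name and version in it are placeholders, not real advisory data, and, as noted, any real deployment would first need discussion in the relevant security forums.

  #!/usr/bin/env python3
  """Sketch of a WN check: compare installed RPMs against a
  (hypothetical) list of known-vulnerable package versions."""

  import subprocess

  # Placeholder advisory data -- NOT real vulnerability information.
  VULNERABLE = {
      "examplepkg": {"1.2.3-1.el7"},
  }

  def installed_rpms():
      """Return {name: version-release} for all installed RPMs."""
      out = subprocess.run(
          ["rpm", "-qa", "--qf", "%{NAME} %{VERSION}-%{RELEASE}\n"],
          capture_output=True, text=True, check=True,
      ).stdout
      return dict(line.split(None, 1) for line in out.splitlines() if line)

  def find_vulnerable(rpms):
      """Return the subset of installed packages that match the advisory data."""
      return {name: vr for name, vr in rpms.items()
              if vr in VULNERABLE.get(name, set())}

  if __name__ == "__main__":
      for name, vr in sorted(find_vulnerable(installed_rpms()).items()):
          print(f"WARNING: vulnerable package installed: {name}-{vr}")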

  • Julia: HammerCloud has better coverage for any scale tests

  • Stephan: in Jess, can the set of WN metrics be dynamic across sites?
  • Marian: yes
  • Stephan: can we process additional metrics in MONIT?
  • Marian: yes, ETF can send the data to a message queue
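
As an illustration of that publication path, the sketch below sends an ETF-style metric document to a STOMP message broker. The broker endpoint, destination name, credentials and document fields are assumptions made for illustration, not the actual ETF or MONIT configuration.

  #!/usr/bin/env python3
  """Sketch: publish an ETF-style metric document to a STOMP message broker.
  Broker host, destination and document layout are illustrative assumptions,
  not the actual ETF/MONIT setup."""

  import json
  import time

  import stomp  # pip install stomp.py

  BROKER = ("msg-broker.example.org", 61613)   # hypothetical endpoint
  DESTINATION = "/topic/etf.metrics.example"   # hypothetical destination

  def publish_metric(conn, host, metric, status, detail=""):
      """Serialize one metric result as JSON and send it to the broker."""
      doc = {
          "timestamp": int(time.time()),
          "host": host,
          "metric": metric,
          "status": status,     # e.g. OK / WARNING / CRITICAL
          "detail": detail,
      }
      conn.send(destination=DESTINATION, body=json.dumps(doc),
                headers={"content-type": "application/json"})

  if __name__ == "__main__":
      conn = stomp.Connection([BROKER])
      conn.connect("user", "password", wait=True)   # placeholder credentials
      publish_metric(conn, "wn001.example.org", "org.example.WN-Env", "OK")
      conn.disconnect()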

  • Stephan: we probably can clean up the set of CMS tests

  • Andrea: is it necessary or just recommended to rewrite the WN tests?
  • Marian: it depends, let's follow up offline

  • Marian: the migration schedule is not fixed yet, but the migration should happen next year

EOS update

see the presentation

  • Johannes:
    • we confirm the ATLAS instance has been more stable since summer
    • do you have similar stats for EOS-USER?
    • we have suffered a number of problems with web services hosted there
  • Luca:
    • there were EOS incidents affecting users whose name starts with 'a'
      • these are always reported on the CERN IT Status Board if you need to correlate
    • there were also incidents with the web service layer on top,
      which is the responsibility of another group
  • Maarten:
    • you can open a ticket when a web service misbehaves, whatever the cause
    • the regular ATLAS-IT meetings can be used to escalate matters if needed

  • Julia:
    • from the WLCG Operations perspective we also confirm the big improvement in service stability and would like to thank the EOS team for their work
    • we will take EOS off the WLCG Operations radar

  • Renato: readiness for Run 3?
  • Luca:
    • we are working in particular with ALICE and LHCb,
      because of their greatly increased data rates
    • the HW is expected early 2020, followed by stress tests etc.

SIMPLE framework update

see the presentation

  • Julia: we are looking for volunteers from T2/T3 sites!

  • Marian: could it be used on commercial cloud resources?
  • Mayank: yes, deployment testing was done on DigitalOcean, AWS, OpenStack

  • Marian: collaboration with SLATE?
  • Mayank: we are already in touch

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal activity levels
    • High analysis fractions, even 50% or higher for many days
  • No major issues

ATLAS

  • Smooth Grid production over the past weeks, with ~330k concurrently running grid job slots and the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis. In addition, ~76k job slots from the HLT/Sim@CERN-P1 farm. First tests of running evgen on CERN-P1 were successful.
  • No major issues apart from the usual storage- or transfer-related problems at sites.
  • Detailed discussions about the data carousel with Tier-1, FTS, CTA and dCache experts last week during the ATLAS S&C week.
  • Will do a data15-18 RAW processing with DRAW filtering output. The data carousel tape staging mode will be used with the usual Tier-1 shares; CERN CTA will be added for some fraction. The start is planned right after the Christmas break.
  • Will run the usual lightweight production workload during the end-of-year break, including full use of P1 for simulation.
  • Thanks to all for a very productive year 2019! Merry Christmas and Happy New Year!

CMS

  • running with 260k to 310k cores during the last month
    • a database corruption put production into lower gear for about a week; it has fully recovered (and the temporarily freed resources were appreciated by people analysing data, who made good use of them)
    • ultra-legacy re-reconstruction of 2017 data almost complete
    • ultra-legacy re-reconstruction of 2018 data progressing well
    • nanoAOD campaign two-thirds complete (small-file handling)
  • plan to phase out SSB dashboard at the end of January
  • following up, together with the WLCG task force, on the dCache upgrade at sites
  • preparing for holiday break (organizing existing work, limiting changes/new processing requests)

LHCb

  • Running around 100K jobs on the grid
    • HLT Farm out for maintenance
  • No major issues
  • Thanks for the great year; we wish everyone great holidays. Let's aim for an even better 2020!

Task Forces and Working Groups

Upgrade of the T1 storage instances for TPC

GDPR and WLCG services

Accounting TF

Archival Storage WG

Containers WG

CREAM migration TF

dCache upgrade TF

  • 17 sites are already running version 5.2.*, 25 to be upgraded
  • SRR still to be enabled at almost all sites; so far only JINR has enabled it (a minimal SRR check is sketched after this list)
  • This week all sites will be ticketed, either for upgrade to 5.2.* or for SRR
  • Link to CRIC
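
To illustrate what checking a site's SRR publication could look like, here is a minimal sketch that fetches an SRR JSON document and prints the reported implementation version and storage shares. The URL is a placeholder and the field names are assumptions based on the WLCG SRR schema, so treat both as illustrative rather than definitive.

  #!/usr/bin/env python3
  """Sketch: fetch a site's SRR JSON document and report the storage
  implementation version and the published shares.  The URL is a placeholder
  and the field names are assumptions based on the WLCG SRR schema."""

  import json
  import urllib.request

  SRR_URL = "https://dcache.example.org:3880/srr"   # hypothetical endpoint

  def main():
      with urllib.request.urlopen(SRR_URL, timeout=30) as resp:
          srr = json.load(resp)

      service = srr["storageservice"]
      print("implementation :", service.get("implementation"))
      print("version        :", service.get("implementationversion"))

      # Report each published storage share with its used/total size in TB.
      for share in service.get("storageshares", []):
          used = share.get("usedsize", 0) / 1e12
          total = share.get("totalsize", 0) / 1e12
          print(f"share {share.get('name')}: {used:.1f} / {total:.1f} TB used")

  if __name__ == "__main__":
      main()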

DPM upgrade TF

  • Out of 54 DPM sites used by the LHC experiments, 6 are moving to other solutions, 1 site is suspended from WLCG operations, 12 still have to migrate, 7 have migrated but still need to be reconfigured, and 28 are done
  • Link to CRIC

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • After 6-1/2 years, all of the tasks listed on the SquidMonitoringTaskForce and HttpProxyDiscoveryTaskForce twiki pages have finally been completed!
  • There are still a number of small things to do, but we can probably now officially shut down the task forces and the reporting to this meeting.

  • Julia:
    • thanks a lot for all the work that was done!
    • we will check when Dave can give a summary presentation in the GDB

Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB
