WLCG Operations Coordination Minutes - January 22nd, 2015

Agenda

Attendance

  • local: Nicolò Magini (secretary), Andrea Sciabà, Alberto Aimar, Prasanth Kothuri (IT-DB), Andrea Manzi (MW Officer), Maarten Litmaath (ALICE), Maria Dimou

  • remote: Alessandra Forti (chair), Alessandra Doria, Alastair Dewhurst, Alessandro Cavalli (CNAF), Andrej Filipcic (ATLAS), Antonio Perez Calero Yzquierdo (PIC), Christoph Wissing (CMS), Dave Mason (FNAL), Di Qing (Triumf), Frédérique Chollet (IN2P3), Gareth Smith, Isidro Gonzales Caballero, Jeremy Coles (GridPP), Yury Lazin (NRC-KI), Maite Barroso (Tier-0), Ron Trompert (NL-T1), Thomas Hartmann (KIT)

Operations News

  • The survey is now really complete; thank you to all 101 sites that responded.
  • WLCG workshop in Okinawa agenda draft https://indico.cern.ch/event/345619
  • Alessandra Forti thanks Nicolò, who is moving on, for his work as secretary.

Middleware News

  • Baselines:
    • New version of FTS 3.2.31 released, fixing some issues reported by the experiments. Already deployed at CERN
    • New version of GridSite (2.2.5) released in UMD 3, fixing various issues
    • StoRM 1.11.5/1.11.6 released by the PT. Under verification by the MW Readiness WG
    • dCache 2.6.x end of support is June 2015. Sites running 2.6.x versions are encouraged to move to 2.10.x/2.11.x soon

  • MW Issues:
    • The memory leak affecting the integration between StoRM and Argus has been fixed in the StoRM 1.11.5 release

  • T0 and T1 services
    • CERN
      • FTS upgraded to 3.2.31
    • RAL
      • FTS upgrade to 3.2.31 planned for tomorrow morning
    • IN2P3
      • dCache upgrade to 2.10.14+ on 24/02/2015 (to confirm)

Tier 0 News

  • VOMRS decommissioning and replacement by VOMS-admin: Andrea Ceccanti promised a new VOMS-admin release this week fixing the problems discussed (the possibility of changing your own data, etc.). You can see the ticket for more details: https://ggus.eu/index.php?mode=ticket_info&ticket_id=110227
    • If the release comes this week, we propose to deploy it asap in the testing instance, and give till Monday 16th Feb (3 weeks) for regression testing and experiment testing. If no showstopper, we will deploy on the 16th and decommission VOMRS.
    • If the new release does not come this week, we will deploy the present version on Feb 2 as planned, decommission VOMRS on the same date, and deploy the new one on the testing instance once it is released.
  • GGUS-Ticket-ID: #111083 ALARM CERN-PROD EOS SRM returning error codes in French, 2015-01-08, https://ggus.eu/index.php?mode=ticket_info&ticket_id=111083: we would like to understand why this ticket is eligible for an ALARM for LHCb, thanks
  • Update of AFS UI: see the attached AFS usage statistics (AFS-stats-Jan-19.pdf).

  • On VOMS-admin agreed to proceed with option 1 - delay deployment to Feb 16th
  • On the AFS UI agreed to proceed with the tentative decommissioning on Feb 2nd, and to identify any remaining use cases from tickets opened after the closure.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity over the past several weeks
  • huge drop in activity Thu Jan 15
    • AliEn central services needed to be restarted with new certificates
    • this exposed a bug in the operation of an internal cache
    • debugging took until Fri Jan 16 mid afternoon
  • big data loss at SARA (NLT1) due to RAID controller failure
    • 108k ALICE files (~8 TB) lost
  • ALICE offline code repository has been split
  • ARC CE SAM tests
    • direct job submission probe needs to be debugged further

  • Maite Barroso comments that the Tier-0 acknowledges the git issue (which also affects the config management system) and is doing an internal review.

ATLAS

  • Prodsys-2 has been fully validated; it took several weeks to fully understand the comparison of the physics distributions between Prodsys-1 and Prodsys-2 datasets
  • Rucio is fairly stable, although monitoring is still lacking some information, such as the data-loss and data-recovery information
  • For the last two weeks, production and analysis have been fully using the grid resources, although production has occasional hiccups (lack of tasks, APF failure). Most of the production runs multicore. Analysis is using 50% of the resources.
  • data loss at SARA: 0.5M files were lost due to the RAID failure. The recovery procedure in Rucio is working well and fast, but the relevant information on the files/datasets removed from the catalogs needs to be obtained from the Rucio log files for now. The report on the physics projects affected is being prepared.
  • Data recovery in Prodsys-2/JEDI will be tested on the affected tasks in the following few days, and the plan for automatic recovery will be defined after.
  • multicore queues deployment on sites is being followed in jira ADCSUPPORT-4117
  • the data lifetime policy has been applied on both T1 and T2 sites; on the order of 3 PB of data has been secondarized
  • FTS issues: staging on CASTOR did not work for all the files, callbacks to Rucio were missing, and cancellation of requests was not working properly. All fixed in the latest release being deployed this week.
  • MC15 simulation is still not ready and the schedule is not clear yet. The MC14 tasks are not enough to fill the grid, so we need to wait for the big campaign before production will use all the resources.

CMS

  • Production/Processing overview
    • Moderate load
    • One bigger MC production campaign over the last ~two weeks
  • Disks full at some Tier-1 sites
    • Cleanup campaigns going on
    • Further Tier-1 centers are being integrated into the dynamic data management system right now: T1_DE_KIT and T1_ES_PIC
    • The integration will be coordinated with the CMS site contacts
  • Tier-1 tape staging exercises
    • First site (CNAF) tested successfully
    • Will continue with other sites
    • Will be coordinated with CMS site contacts
  • 50% of Tier-1 capacity multi-core enabled
    • If a site has dedicated multi-core resources, it should provide this fraction
    • Will be partly used in "partitionable slot mode" (running n single-core jobs in an n-core multi-core pilot)
    • Long lifetime of pilots preferred -- what is still feasible for the sites?
  • Moving CRAB and central production into a single global Condor pool is in progress
    • Tier-2 sites will stop receiving pilot jobs with the VOMS role production
      • Will request changes in fairshare configuration in the next few weeks - will be reported also here

  • Christoph Wissing clarifies that the CMS pilots will no longer have VOMS production role; the production payloads will still have production role.

  • Pushing for some site configurations
    • Adapt site-local-config.xml to include <phedex-node value="Tx_CO_Site{_type}"/> in the <local-stage-out> section and the same format (but with the PhEDEx name of the fallback endpoint) in <fallback-stage-out> (see the illustrative sketch at the end of this list)
    • PhEDEx space monitoring: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SpaceMonSiteAdmin
    • Will open (low priority) tickets in a few weeks to track progress
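
  • As an illustration of the requested stage-out configuration, the minimal sketch below checks that a site-local-config.xml contains the <phedex-node> elements described above. It is not an official CMS tool; the file path is passed on the command line, and the element names are taken from the request above.

      # check_stageout_config.py -- illustrative sketch, not an official CMS tool
      import sys
      import xml.etree.ElementTree as ET

      # Assumed usage: python check_stageout_config.py <path-to-site-local-config.xml>
      path = sys.argv[1] if len(sys.argv) > 1 else "site-local-config.xml"
      tree = ET.parse(path)

      for section in ("local-stage-out", "fallback-stage-out"):
          nodes = tree.findall(".//%s/phedex-node" % section)
          if not nodes:
              print("MISSING: <%s> has no <phedex-node> element" % section)
          else:
              for node in nodes:
                  print("%s: phedex-node value=%s" % (section, node.get("value")))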

LHCb

  • Operations
    • "Run1 Legacy Stripping"
      • the majority of files have been processed; the merging of the last remaining per cent of files is still to be done (except at SARA, see below)
      • operations followed the plan to process the data in 6 weeks very closely (see the number of processed files)
      • many thanks to all T1 sites for their support, including the Xmas break !!!!
      • Staging at most sites faster than minimum required, also many thanks in this respect !!! Pre-staging with FTS3 worked very well - used for the first time in a large campaign.
    • SARA-MATRIX file loss
      • Note: the points below are not to blame the site but to illustrate the work caused by such a failure
      • 25k out of the 95k files are unmerged DST files of the above stripping campaign which need to be considered lost. If this needs to be re-done, a lot of man-power will have to be invested and the stripping campaign will be extended by several weeks.
      • another 60k were user files, which are partially lost because no second replica is available
    • RAL SRM extended by one server to address performance issues, many thanks to the site !!!
  • HTTP/WEBDAV access
    • 3 more access points are still missing before the campaign is complete
    • Looking into the possibility of adopting/deploying a WebDAV SAM probe to test the access points
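
  • As an illustration of what such a probe could do, the minimal sketch below issues a WebDAV PROPFIND against an endpoint and checks for the expected 207 Multi-Status reply. It is not the SAM probe itself; the endpoint URL, certificate paths and CA directory are placeholders.

      # webdav_probe_sketch.py -- illustrative only, not the SAM WebDAV probe
      import requests  # third-party library, assumed available

      URL = "https://storage.example.org/lhcb/data"               # placeholder endpoint
      CERT = ("/path/to/hostcert.pem", "/path/to/hostkey.pem")    # placeholder credentials

      # Depth: 0 asks only about the resource itself, keeping the check cheap.
      resp = requests.request("PROPFIND", URL,
                              headers={"Depth": "0"},
                              cert=CERT,
                              verify="/etc/grid-security/certificates",
                              timeout=30)

      # 207 Multi-Status is the normal WebDAV reply to PROPFIND.
      if resp.status_code == 207:
          print("OK: endpoint answered PROPFIND")
      else:
          print("WARNING: unexpected HTTP status %d" % resp.status_code)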

WLCG critical services

  • Andrea Sciabà presents the review of the critical services; see the slides for details.
  • Nicolò and Andrea give examples of services that are now distributed across Tier-1s: FTS3, CVMFS Stratum-1s. Maarten suggests checking whether sites can be rewarded for running such services.
  • Discussion on the impact on the MoU of extending the critical service table to the Tier-1/2s: any potential MoU change is outside of the scope of WLCG Operations and must go to the MB.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA:
    • testing campaign ongoing (43 sites)
    • issues at a few sites being investigated (e.g. job output upload)

SHA-2

  • retirement plans for the old VOMS servers
    • the old services were planned to be "alive" until Tue Feb 3, 2015
      • on that day the special router configurations would be removed
      • further references to the old services could hang from then on
      • UI and grid-mapfile configurations should no longer refer to them
    • but this plan is closely tied to the VOMRS retirement, which may have to be delayed somewhat
      • a new VOMS-Admin version is expected this week and will need to be validated
    • we may then want to run with the special arrangements a bit longer

  • Agreed to delay the old VOMS server shutdown until VOMRS is retired.

Machine/Job Features

  • Asking for volunteer sites to deploy machine/job features on their batch / cloud infrastructure
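
  • For reference, the minimal sketch below shows what a payload could read once machine/job features are deployed, assuming the HEPiX convention of one-value-per-file keys in the directories pointed to by the MACHINEFEATURES and JOBFEATURES environment variables; the key name mentioned in the comment (hs06) is only an example.

      # mjf_dump_sketch.py -- illustrative reader, not an official MJF client
      import os

      for var in ("MACHINEFEATURES", "JOBFEATURES"):
          directory = os.environ.get(var)
          if not directory or not os.path.isdir(directory):
              print("%s not provided on this node" % var)
              continue
          # Each key (e.g. hs06) is a small file containing a single value.
          for key in sorted(os.listdir(directory)):
              path = os.path.join(directory, key)
              if not os.path.isfile(path):
                  continue
              with open(path) as handle:
                  print("%s/%s = %s" % (var, key, handle.read().strip()))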

Middleware Readiness WG

  • The MW Readiness WG met yesterday Jan 21st. Agenda http://indico.cern.ch/e/MW-Readiness_8
  • Excellent participation and follow-up by the Volunteer Sites (Edinburgh, Napoli, Legnaro, QMUL, CNAF, Triumf, NDGF) and the MW Officer Andrea Manzi. Please follow the slides for details.
  • The new version of the Package Reporter is ready, within the deadlines. The new design principles are in line with EGI security requirements, and as much code as possible is shared with Pakiti. Sites are offered configuration options for the reporting. Please see the presentation by the developer Lionel Cons for details; very simple installation instructions are also documented.
  • Next meeting Wed 18 March at 4pm CET. Please note!

Multicore Deployment

  • CMS multicore at T1s, see notes above. Deployment to T2s to restart once the submission infrastructure (pilot factory) testbed is deployed.
  • ATLAS: 26 T2 sites still to enable multicore, followed in JIRA (see ATLAS report)

IPv6 Validation and Deployment TF

  • F2F meeting at CERN yesterday and today: https://indico.cern.ch/event/352638/
  • All Tier-1 sites are reminded of the deadline of April 2015 to enable dual stack on their perfSONAR instances, as requested by ATLAS and agreed by WLCG.
  • A perfSONAR dashboard showing the network measurements via IPv6 across the WLCG sites that have enabled IPv6 on perfSONAR has been proposed.
  • A test specific to IPv6 should be added to the set of Nagios tests which are run on perfSONAR instances, to immediately identify which sites have enabled IPv6 (an illustrative sketch of such a check follows at the end of this section). As for the previous point, this is to be discussed with the Network and Transfer Metrics WG.
  • The test VOMS server at CERN is now dual stack.
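
  • As an illustration of the kind of IPv6-specific check discussed above, the minimal sketch below verifies that a host publishes an AAAA record and accepts a TCP connection over IPv6. It is not the actual Nagios test; the hostname and port are placeholders.

      # ipv6_check_sketch.py -- illustrative only, not the production Nagios probe
      import socket

      HOST, PORT = "perfsonar.example.org", 443   # placeholder instance

      try:
          # Restrict name resolution to AAAA records only.
          infos = socket.getaddrinfo(HOST, PORT, socket.AF_INET6, socket.SOCK_STREAM)
      except socket.gaierror:
          print("CRITICAL: no AAAA record for %s" % HOST)
      else:
          family, socktype, proto, _, sockaddr = infos[0]
          sock = socket.socket(family, socktype, proto)
          sock.settimeout(10)
          try:
              sock.connect(sockaddr)
              print("OK: %s reachable over IPv6 at %s" % (HOST, sockaddr[0]))
          except OSError as exc:
              print("CRITICAL: IPv6 connection to %s failed: %s" % (HOST, exc))
          finally:
              sock.close()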

Squid Monitoring and HTTP Proxy Discovery TFs

Network and Transfer Metrics WG

Action list

  • CLOSED on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
    • Agreed to retire the service on February 2nd.
  • ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor-CE. Status: HTCondor-CE tests enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor-CE (all others via GRAM). The number of CMS sites publishing HTCondor-CE is increasing.
    • Ongoing discussions on publication in AGIS for ATLAS.
  • ONGOING on experiment representatives - report on voms-admin test feedback
    • Experiment feedback and feature requests collected in GGUS:110227
  • CLOSED on Andrea Sciabà - review the critical services table

AOB

GGUS news (MariaD):

  • The next meeting will be on February 5th.

-- NicoloMagini - 2014-12-18

Topic attachments

  • AFS-stats-Jan-19.pdf (58.9 K, 2015-01-22, MaiteBarroso)