WLCG Operations Coordination Minutes - January 22nd, 2015

Agenda

https://indico.cern.ch/event/359845/

Attendance

local: Nicolò Magini (secretary), Andrea Sciabà, Alberto Aimar, Prasanth Kothuri (IT-DB), Andrea Manzi (MW Officer), Maarten Litmaath (ALICE), Maria Dimou

remote: Alessandra Forti (chair), Alessandra Doria, Alastair Dewhurst, Alessandro Cavalli (CNAF), Andrej Filipcic (ATLAS), Antonio Perez Calero Yzquierdo (PIC), Christoph Wissing (CMS), Dave Mason (FNAL), Di Qing (Triumf), Frédérique Chollet (IN2P3), Gareth Smith, Isidro Gonzales Caballero, Jeremy Coles (GridPP), Yury Lazin (NRC-KI), Maite Barroso (Tier-0), Ron Trompert (NL-T1), Thomas Hartmann (KIT)

Operations News

Survey now really completed, thank you to all the 101 sites that responded.
Draft agenda for the WLCG workshop in Okinawa: https://indico.cern.ch/event/345619
Alessandra Forti thanks Nicolò, who is moving on, for his work as secretary.

Middleware News

Useful Links:

Baselines:
- FTS 3.2.31 released, fixing some issues reported by the experiments. Already deployed at CERN
- New version of GridSite (2.2.5) released in UMD 3, fixing various issues
- StoRM 1.11.5/1.11.6 released by the PT. Under verification by MW Readiness
- dCache 2.6.x end of support is June 2015. Sites running 2.6.x versions are encouraged to move to 2.10.x/2.11.x soon

MW Issues:
- The memory leak affecting the integration between StoRM and Argus has been fixed in the released 1.11.5 version

T0 and T1 services
- CERN
  - FTS upgraded to 3.2.31
- RAL
  - FTS upgrade to 3.2.31 planned for tomorrow morning
- IN2P3
  - dCache upgrade to 2.10.14+ on 24/02/2015 (to confirm)

Tier 0 News

VOMRS decommissioning and replacement by VOMS-admin: Andrea Ceccanti promised a new VOMS-admin release this week fixing the problems discussed (the possibility of changing your own data, etc.). You can see the ticket for more details: https://ggus.eu/index.php?mode=ticket_info&ticket_id=110227
- If the release comes this week, we propose to deploy it asap in the testing instance, and give till Monday 16th Feb (3 weeks) for regression testing and experiment testing. If no showstopper, we will deploy on the 16th and decommission VOMRS.
- If the new release does not come this week, we will deploy the present version on Feb 2 as planned, decommission VOMRS on the same date, and deploy the new one on the testing instance once it is released.
GGUS-Ticket-ID: #111083 ALARM CERN-PROD EOS SRM returning error codes in French, 2015-01-08, https://ggus.eu/index.php?mode=ticket_info&ticket_id=111083: we would like to understand why this ticket is eligible for an ALARM for LHCb, thanks
Update of AFS UI:
- we propose to keep the proposed date for decommission, 2nd February
- stats, on Jan 19 16:36:09 CET 2015, 3 hours: this time we see an unusual number of anonymous accesses; after some investigation this has been traced to high activity from user sdifalco, experiment RE1, https://phonebook.cern.ch/phonebook/#personDetails/?id=553064 (see full stats)

On VOMS-admin: agreed to proceed with option 1 (delay deployment to Feb 16th).
On AFS UI: agreed to proceed with the tentative decommissioning on Feb 2nd, and to find out about any remaining use cases if tickets come in after the closure.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

high activity over the past many weeks
huge drop in activity Thu Jan 15
- AliEn central services needed to be restarted with new certificates
- this exposed a bug in the operation of an internal cache
- debugging took until Fri Jan 16 mid afternoon
big data loss at SARA (NL-T1) due to RAID controller failure
- 108k ALICE files (~8 TB) lost
ALICE offline code repository has been split
- AliRoot Core (dependable) vs. AliPhysics (agile)
- started over the weekend, finished on Tue evening
- delayed by 1 full day due to instabilities of the CERN-IT Git service
  - "New Puppet code not visible by the Puppet masters"
  - "Git service - internal server error"
- organized analysis (trains) running with AliPhysics since Wed evening
ARC CE SAM tests
- direct job submission probe needs to be debugged further

Maite Barroso comments that the Tier-0 acknowledges the git issue (which also affects the config management system) and is doing an internal review.

ATLAS

Prodsys-2 has been fully validated; it took several weeks to fully understand the comparison of the physics distributions from Prodsys-1 and Prodsys-2 datasets
Rucio is fairly stable, although monitoring is still lacking some information, such as the data-loss and data-recovery information
For the last two weeks, the production and analysis fully use the grid resources, although production has some hiccups occasionally (lack of tasks, APF failure). Most of the production runs multicore. Analysis is using 50% of the resources.
data loss at SARA: 0.5M files were lost due to a RAID failure. The recovery procedure in Rucio is working well and fast, but the relevant information on the files/datasets removed from the catalogs needs to be obtained from the Rucio log files for now. The report on the physics projects affected is being prepared.
Data recovery in Prodsys-2/JEDI will be tested on the affected tasks in the following few days, and the plan for automatic recovery will be defined after.
multicore queue deployment at sites is being followed in JIRA ADCSUPPORT-4117
the data lifetime policy has been applied at both T1 and T2 sites; of the order of 3 PB of data has been secondarized
FTS issues: staging on CASTOR did not work for all the files, callbacks to Rucio were missing, cancellation of requests was not working properly. All fixed in the latest release being deployed this week.
MC15 simulation still not ready, schedule not clear yet. The MC14 tasks are not enough to fill the grid, so we need to wait for the big campaign before the production will use all the resources.

CMS

Production/Processing overview
- Moderate load
- One bigger MC production campaign over the last ~two weeks
Disks full at some Tier-1 sites
- Cleanup campaigns going on
- Further Tier-1 centers are being integrated in dynamic data management system right now: T1_DE_KIT and T1_ES_PIC
- The integration will be coordinated with the CMS site contacts
Tier-1 tape staging exercises
- First site (CNAF) tested successfully
- Will continue with other sites
- Will be coordinated with CMS site contacts
50% of Tier-1 capacity multi-core enabled
- If site has dedicated multi-core resources, it should provide this fraction
- Will be partly used in "partitionable slot mode" (running n single-core jobs in an n-core multi-core pilot)
- Long lifetime of pilots preferred -- what is still feasible for the sites?
Moving CRAB and central production into a single global Condor pool is in progress
- Tier-2 will stop receiving pilot jobs with VOMS role production
  - Will request changes in fairshare configuration in the next few weeks - will be reported also here

Christoph Wissing clarifies that the CMS pilots will no longer have VOMS production role; the production payloads will still have production role.

Pushing for some site configurations
- Adapt site-local-config.xml to include <phedex-node value="Tx_CO_Site{_type}"/> in the <local-stage-out> section and the same format (but the PhEDEx name for the fallback endpoint) in <fallback-stage-out>
- PhEDEx Space monitoring: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SpaceMonSiteAdmin
- Will open (low priority) tickets in a few weeks to track progress
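
As a rough illustration of the site-local-config.xml request above, the following Python sketch checks whether a configuration carries a <phedex-node> entry in both the <local-stage-out> and <fallback-stage-out> sections. Only the placement of <phedex-node> comes from these minutes; the sample document, the enclosing elements, the node names (T2_XX_Example, T2_XX_Fallback) and the helper names are hypothetical and are not taken from any CMS tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the <phedex-node> placement comes from the
# minutes; the enclosing elements and the node names are assumptions.
SAMPLE = """
<site-local-config>
  <site name="T2_XX_Example">
    <local-stage-out>
      <phedex-node value="T2_XX_Example"/>
    </local-stage-out>
    <fallback-stage-out>
      <phedex-node value="T2_XX_Fallback"/>
    </fallback-stage-out>
  </site>
</site-local-config>
"""

SECTIONS = ("local-stage-out", "fallback-stage-out")


def phedex_nodes(xml_text):
    """Map each stage-out section to its phedex-node value, or None."""
    root = ET.fromstring(xml_text)
    result = {}
    for section in SECTIONS:
        value = None
        # Search the whole tree, since the exact nesting is an assumption.
        for elem in root.iter(section):
            node = elem.find("phedex-node")
            if node is not None:
                value = node.get("value")
                break
        result[section] = value
    return result


def missing_sections(xml_text):
    """List the sections that have no usable phedex-node entry."""
    return sorted(s for s, v in phedex_nodes(xml_text).items() if not v)


if __name__ == "__main__":
    print(phedex_nodes(SAMPLE))
    print(missing_sections(SAMPLE))
```

A site admin could run missing_sections() on the local file contents before the tracking tickets are opened; an empty list means both sections carry the entry.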

LHCb

Operations
- "Run1 Legacy Stripping"
  - the majority of files have been processed, some merging of the remaining last % of files to be done (except SARA, see below)
  - operations followed the plan to process the data in 6 weeks very closely (see No of processed files)
  - many thanks to all T1 sites for their support, including the Xmas break !!!!
  - Staging at most sites faster than minimum required, also many thanks in this respect !!! Pre-staging with FTS3 worked very well - used for the first time in a large campaign.
- SARA-MATRIX file loss
  - Note: the points below are not to blame the site but to illustrate the work caused by such a failure
  - 25k out of 95k files are unmerged DST files of the above stripping campaign which need to be considered lost. In case this needs to be re-done a lot of man-power will need to be invested and will extend the stripping campaign by several weeks.
  - another 60k were user files, which are partially lost because no second replica was available
- RAL SRM extended by one server to overcome performance issues, many thanks to the site !!!
HTTP/WEBDAV access
- 3 more access points missing before completion of the campaign
- Looking into the possibility to adopt/deploy webdav SAM probe to test access points

WLCG critical services

Andrea Sciabà presents the review of the critical services; see the slides for details.
Nicolò and Andrea give examples of services that are now distributed across Tier-1s: FTS3, CVMFS Stratum-1s. Maarten suggests checking whether sites can be rewarded for running such services.
Discussion on the impact on the MoU of extending the critical service table to the Tier-1/2s: any potential MoU change is outside of the scope of WLCG Operations and must go to the MB.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

gLExec in PanDA:
- testing campaign ongoing (43 sites)
- issues at a few sites being investigated (e.g. job output upload)

SHA-2

retirement plans for the old VOMS servers
- the old services were planned to be "alive" until Tue Feb 3, 2015
  - on that day the special router configurations would be removed
  - further references to the old services could hang from then on
  - UI and grid-mapfile configurations should no longer refer to them
- but this plan is closely tied to the VOMRS retirement, which may have to be delayed somewhat
  - a new VOMS-Admin version is expected this week and will need to be validated
- we may then want to run with the special arrangements a bit longer

Agreed to delay the old VOMS server shutdown until the VOMRS is retired.

Machine/Job Features

Asking for volunteer sites to deploy machine/job features on their batch / cloud infrastructure

Middleware Readiness WG

The MW Readiness WG met yesterday Jan 21st. Agenda http://indico.cern.ch/e/MW-Readiness_8
Excellent participation and follow-up by the Volunteer Sites (Edinburgh, Napoli, Legnaro, QMUL, CNAF, Triumf, NDGF) and the MW Officer Andrea Manzi. Please follow the slides for details.
The new version of the Package Reporter is ready, within the deadlines. The new design principles are in line with EGI security requirements. As much code as possible is shared with Pakiti. Sites are offered configuration options for the reporting. Please follow the presentation here by the developer Lionel Cons for details. Very simple installation instructions are documented here.
Next meeting Wed 18 March at 4pm CET. Please note!

Multicore Deployment

CMS multicore at T1s, see notes above. Deployment to T2s to restart once the submission infrastructure (pilot factory) testbed is deployed.
ATLAS: 26 T2s still to enable multicore, followed in JIRA (see ATLAS report)

IPv6 Validation and Deployment TF

F2F meeting at CERN yesterday and today: https://indico.cern.ch/event/352638/
All Tier-1 sites are reminded of the deadline of April 2015 to enable dual-stack on their perfSonar instances, as requested by ATLAS and agreed by WLCG.
A perfSonar dashboard showing network measurements over IPv6 between the WLCG sites that have enabled IPv6 on perfSonar has been proposed.
A test specific to IPv6 should be added to the set of Nagios tests which are run on perfSonar instances, to immediately identify which sites have enabled IPv6. As for the previous point, this is to be discussed with the network and transfer metrics WG.
The test VOMS server at CERN is now dual-stack.
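
The IPv6-specific test proposed above for perfSonar instances could start from something like the following minimal Python sketch, which reports whether a host name resolves to both IPv4 and IPv6 addresses. It is not the Nagios probe discussed in the TF: the function names, the default port 443 and the injectable resolver argument are assumptions made for illustration, and the check covers DNS publication only, not actual reachability over IPv6.

```python
import socket


def address_families(host, port=443, resolver=socket.getaddrinfo):
    """Return the set of address families the resolver reports for host."""
    try:
        infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Unresolvable host: report no address families at all.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {info[0] for info in infos}


def is_dual_stack(host, port=443, resolver=socket.getaddrinfo):
    """True if the host publishes both IPv4 and IPv6 addresses."""
    families = address_families(host, port, resolver)
    return socket.AF_INET in families and socket.AF_INET6 in families


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        state = "dual-stack" if is_dual_stack(name) else "not dual-stack"
        print(name, state)
```

The resolver parameter exists only so the logic can be exercised without network access; a real probe would also have to attempt a connection over IPv6.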

Squid Monitoring and HTTP Proxy Discovery TFs

Network and Transfer Metrics WG

Action list

CLOSED on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
- Agreed to retire the service on February 2nd.
ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Status: HTCondor CE tests enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM). The number of CMS sites publishing HTCondor CE is increasing.
- Ongoing discussions on publication in AGIS for ATLAS.
ONGOING on experiment representatives - report on voms-admin test feedback
- Experiment feedback and feature requests collected in GGUS:110227
CLOSED on Andrea Sciabà - review the critical services table
- Dedicated meeting on Dec 12th 2014: https://indico.cern.ch/event/357668/
- Andrea reported at this meeting.

AOB

GGUS news (MariaD):

KIT network maintenance on Mon 26th Jan 5-7am UTC: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=16584
Next update scheduled for the 28th of January. The service will be unavailable during the intervention. This update includes, among other things:
- Change of the GGUS certificate. The subject of the certificate will change
- Change of the label 'Problem' to 'Issue', in order to be ITIL compliant.
- DB maintenance

The next meeting will be on February 5th.

-- NicoloMagini - 2014-12-18

Attachments

AFS-stats-Jan-19.pdf (r1, 58.9 K, 2015-01-22 14:39, MaiteBarroso)

Topic revision: r27 - 2018-02-28 - MaartenLitmaath
