local: Nicolò Magini (secretary), Andrea Sciabà, Alberto Aimar, Prasanth Kothuri (IT-DB), Andrea Manzi (MW Officer), Maarten Litmaath (ALICE), Maria Dimou
remote: Alessandra Forti (chair), Alessandra Doria, Alastair Dewhurst, Alessandro Cavalli (CNAF), Andrej Filipcic (ATLAS), Antonio Perez Calero Yzquierdo (PIC), Christoph Wissing (CMS), Dave Mason (FNAL), Di Qing (Triumf), Frédérique Chollet (IN2P3), Gareth Smith, Isidro Gonzales Caballero, Jeremy Coles (GridPP), Yury Lazin (NRC-KI), Maite Barroso (Tier-0), Ron Trompert (NL-T1), Thomas Hartmann (KIT)
Operations News
Survey now really completed, thank you to all the 101 sites that responded.
dCache upgrade to 2.10.14+ on 24/02/2014 (to confirm)
Tier 0 News
VOMRS decommissioning and replacement by VOMS-admin: Andrea Ceccanti promised a new VOMS-admin release this week fixing the problems discussed (the possibility of changing your own data, etc.). You can see the ticket for more details: https://ggus.eu/index.php?mode=ticket_info&ticket_id=110227
If the release comes this week, we propose to deploy it on the testing instance as soon as possible and to allow until Monday 16th Feb (3 weeks) for regression testing and experiment testing. If no showstopper is found, we will deploy on the 16th and decommission VOMRS.
If the new release does not come this week, we will deploy the present version on Feb 2 as planned, decommission VOMRS on the same date, and deploy the new one on the testing instance once it is released.
GGUS-Ticket-ID: #111083 ALARM CERN-PROD EOS SRM returning error codes in French, 2015-01-08, https://ggus.eu/index.php?mode=ticket_info&ticket_id=111083: we would like to understand why this ticket was eligible for an ALARM for LHCb, thanks
Update of AFS UI:
we propose to keep the planned decommissioning date, 2nd February
On VOMS-admin agreed to proceed with option 1 - delay deployment to Feb 16th
On AFS-UI agreed to proceed with tentative decommissioning on Feb 2nd, and to identify any remaining use cases from tickets opened after the closure.
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
high activity over the past several weeks
huge drop in activity Thu Jan 15
AliEn central services needed to be restarted with new certificates
this exposed a bug in the operation of an internal cache
debugging took until Fri Jan 16 mid afternoon
big data loss at SARA (NL-T1) due to RAID controller failure
108k ALICE files (~8 TB) lost
ALICE offline code repository has been split
AliRoot Core (dependable) vs. AliPhysics (agile)
started over the weekend, finished on Tue evening
delayed by 1 full day due to instabilities of the CERN-IT Git service
organized analysis (trains) running with AliPhysics since Wed evening
ARC CE SAM tests
direct job submission probe needs to be debugged further
Maite Barroso comments that the Tier-0 acknowledges the git issue (which also affects the config management system) and is doing an internal review.
ATLAS
Prodsys-2 has been fully validated; it took several weeks to fully understand the comparison of the physics distributions from Prodsys-1 and Prodsys-2 datasets
Rucio is fairly stable, although monitoring is still lacking some information, such as the data-loss and data-recovery information
For the last two weeks, production and analysis have fully used the grid resources, although production occasionally has hiccups (lack of tasks, APF failure). Most of the production runs multicore. Analysis is using 50% of the resources.
data loss at SARA: 0.5M files were lost due to RAID failure. The recovery procedure in Rucio is working well and fast, but the relevant information on the files/datasets removed from the catalogs needs to be obtained from the Rucio log files for now. The report on the physics projects affected is being prepared.
Data recovery in Prodsys-2/JEDI will be tested on the affected tasks in the following few days, and the plan for automatic recovery will be defined after.
multicore queues deployment on sites is being followed in jira ADCSUPPORT-4117
the data lifetime policy has been applied on both T1 and T2 sites; of the order of 3 PB of data has been secondarized
FTS issues: staging on CASTOR did not work for all the files, callbacks to Rucio were missing, cancellation of requests was not working properly. All fixed in the latest release being deployed this week.
MC15 simulation still not ready, schedule not clear yet. The MC14 tasks are not enough to fill the grid, so we need to wait for the big campaign before production uses all the resources.
CMS
Production/Processing overview
Moderate load
One bigger MC production campaign over the last ~two weeks
Disks full at some Tier-1 sites
Cleanup campaigns going on
Further Tier-1 centers are being integrated in dynamic data management system right now: T1_DE_KIT and T1_ES_PIC
The integration will be coordinated with the CMS site contacts
Tier-1 tape staging exercises
First site (CNAF) tested successfully
Will continue with other sites
Will be coordinated with CMS site contacts
50% of Tier-1 capacity multi-core enabled
If site has dedicated multi-core resources, it should provide this fraction
Will be partly used in "partitionable slot mode" (running n single-core jobs in an n-core multi-core pilot)
Long lifetime of pilots preferred -- what is still feasible for the sites?
In the process of moving CRAB and central production into a single global Condor pool
Tier-2 will stop receiving pilot jobs with VOMS role production
Will request changes in fairshare configuration in the next few weeks - will be reported also here
Christoph Wissing clarifies that the CMS pilots will no longer have VOMS production role; the production payloads will still have production role.
Pushing for some site configurations
Adapt site-local-config.xml to include <phedex-node value="Tx_CO_Site{_type}"/> in the <local-stage-out> section and the same format (but the PhEDEx name for the fallback endpoint) in <fallback-stage-out>
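As an illustration only, the requested change could be scripted roughly as follows. This is a minimal sketch, not a CMS tool: the sample XML is a hypothetical, heavily trimmed site-local-config.xml, the helper name set_phedex_node and the node names T2_XX_Example and T2_XX_Fallback are placeholders, and real files contain further elements that are left untouched here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed site-local-config.xml (assumption);
# a real file carries more elements than shown here.
SAMPLE = """<site-local-config>
  <site name="T2_XX_Example">
    <local-stage-out>
      <command value="srmv2"/>
    </local-stage-out>
    <fallback-stage-out>
      <command value="srmv2"/>
    </fallback-stage-out>
  </site>
</site-local-config>"""


def set_phedex_node(root, section, node_name):
    """Add or update <phedex-node value="..."/> in every `section` element.

    Returns the number of sections that were processed.
    """
    count = 0
    for sec in root.iter(section):
        node = sec.find("phedex-node")
        if node is None:
            # No entry yet: append one as the last child of the section.
            node = ET.SubElement(sec, "phedex-node")
        node.set("value", node_name)
        count += 1
    return count


root = ET.fromstring(SAMPLE)
# Placeholder PhEDEx node names; a site would use its own names.
set_phedex_node(root, "local-stage-out", "T2_XX_Example")
set_phedex_node(root, "fallback-stage-out", "T2_XX_Fallback")
print(ET.tostring(root, encoding="unicode"))
```

The helper updates an existing phedex-node entry instead of appending a second one, so it can be run repeatedly on the same file.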
many thanks to all T1 sites for their support, including the Xmas break !!!!
Staging at most sites faster than minimum required, also many thanks in this respect !!! Pre-staging with FTS3 worked very well - used for the first time in a large campaign.
SARA-MATRIX file loss
Note: the points below are not to blame the site but to illustrate the work caused by such a failure
25k out of 95k files are unmerged DST files of the above stripping campaign which need to be considered lost. In case this needs to be redone, a lot of manpower will need to be invested and the stripping campaign will be extended by several weeks.
another 60k were user files which are partially lost because no second replica was available
RAL SRM extended by one server to overcome performance issues, many thanks to the site !!!
HTTP/WEBDAV access
3 more access points missing before completion of the campaign
Looking into the possibility of adopting/deploying a WebDAV SAM probe to test access points
WLCG critical services
Andrea Sciabà presents the review of the critical services; see the slides for details.
Nicolò and Andrea give examples of services that are now distributed across Tier-1s: FTS3, CVMFS Stratum-1s. Maarten suggests checking whether sites can be rewarded for running such services.
Discussion on the impact on the MoU of extending the critical service table to the Tier-1/2s: any potential MoU change is outside of the scope of WLCG Operations and must go to the MB.
Ongoing Task Forces and Working Groups
gLExec Deployment TF
gLExec in PanDA:
testing campaign ongoing (43 sites)
issues at a few sites being investigated (e.g. job output upload)
SHA-2
retirement plans for the old VOMS servers
the old services were planned to be "alive" until Tue Feb 3, 2015
on that day the special router configurations would be removed
further references to the old services could hang from then on
UI and grid-mapfile configurations should no longer refer to them
but this plan is closely tied to the VOMRS retirement, which may have to be delayed somewhat
a new VOMS-Admin version is expected this week and will need to be validated
we may then want to run with the special arrangements a bit longer
Agreed to delay the old VOMS server shutdown until the VOMRS is retired.
Machine/Job Features
Asking for volunteer sites to deploy machine/job features on their batch / cloud infrastructure
Excellent participation and follow-up by the Volunteer Sites (Edinburgh, Napoli, Legnaro, QMUL, CNAF, Triumf, NDGF) and the MW Officer Andrea Manzi. Please follow the slides for details.
The new version of the Package Reporter is ready, within the deadlines. The new design principles are in line with EGI security requirements. As much code as possible is shared with Pakiti. The site is offered configuration options for the reporting. Please follow the presentation here by the developer Lionel Cons for details. Very simple installation instructions are documented here.
Next meeting Wed 18 March at 4pm CET. Please note!
Multicore Deployment
CMS multicore at T1s, see notes above. Deployment to T2s to restart once the submission infrastructure (pilot factory) testbed is deployed.
ATLAS: 26 T2s still to enable multicore, followed in JIRA (see ATLAS report)
All Tier-1 sites are reminded of the deadline of April 2015 to enable dual-stack on their perfSonar instances, as requested by ATLAS and agreed by WLCG.
A perfSonar dashboard showing network measurements over IPv6 between the WLCG sites that have enabled IPv6 on perfSonar has been proposed.
A test specific to IPv6 should be added to the set of Nagios tests which are run on perfSonar instances, to immediately identify which sites have enabled IPv6. As for the previous point, this is to be discussed with the network and transfer metrics WG.
The test VOMS server at CERN is now dual-stack.
Squid Monitoring and HTTP Proxy Discovery TFs
Network and Transfer Metrics WG
Action list
CLOSED on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
Agreed to retire the service on February 2nd.
ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Status: HTCondor CE tests enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM). The number of CMS sites publishing an HTCondor CE is increasing.
Ongoing discussions on publication in AGIS for ATLAS.
ONGOING on experiment representatives - report on voms-admin test feedback
Experiment feedback and feature requests collected in GGUS:110227
CLOSED on Andrea Sciabà - review the critical services table