Local: Andrea Sciabà (chair), Nicolò Magini (secretary), Maarten Litmaath (ALICE), Stefan Roiser (LHCb), Marian Babik, Felix Lee (ASGC), Vincent Brillault, Pablo Saiz, Marcin Blaszczyk, Maria Alandes, Maria Dimou, Alessandro Di Girolamo (ATLAS), Hassen Riahi, Alberto Aimar, Domenico Giordano, Markus Schulz
Registration is now open (see the Indico page)! Registration fee: EUR 115 (+ EUR 40 for the social dinner); deadline June 9
The agenda is still to be defined: a rough draft was discussed at the last GDB; any input and suggestions are very welcome
Task forces
Two task forces to be evaluated for closure: xrootd and perfSONAR. A new task force on network monitoring?
Experiments' Plans
ALICE
KIT jobs behavior:
The cause of the overload of the KIT firewall and the OPN link to CERN has been found:
for analysis jobs the location of the WN was not propagated to the central services
the client then ended up using not only the local replica, but close replicas as well
Over the weekend a patch was developed for TAlienFile in ROOT
it now sends the location information explicitly
The patch first became available in Tuesday's analysis tag and started being used by a few trains and users
the results were very good
The expectation is for the vast majority of analysis jobs to be using the patched code after a few days
this will be monitored
The job cap should then be gradually increased again over the coming days
Our thanks to the teams at KIT for their efforts, and to the other experiments for their patience!
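The fix amounts to the client declaring its own location instead of the central services inferring it. A minimal sketch of the idea (function and field names are illustrative, not the actual TAlienFile/AliEn API):

```python
def build_access_request(lfn, client_site):
    """Ask the central catalogue for replicas of `lfn`, explicitly stating
    where the client runs, so the server can rank replicas by locality
    instead of guessing from the connection. Illustrative only."""
    return {
        "lfn": lfn,
        "site": client_site,  # previously omitted, so close (remote) replicas were used too
    }

def pick_replica(replicas, client_site):
    """Prefer a replica at the client's own site; otherwise fall back to
    the first remote one (real logic would rank remote replicas by closeness)."""
    local = [r for r in replicas if r["site"] == client_site]
    return (local or replicas)[0]
```

With the site sent explicitly, a KIT worker node reading a file replicated at both KIT and CERN would be steered to the local replica instead of pulling data over the OPN link.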
Plans for the next 3 months:
First, continuous activities in preparation for Quark Matter 2014 (May 19-24, GSI Darmstadt).
Then reprocessing RAW data with highest priority.
Starting from 2011 (p+p) and some selected periods (Pass2) from 2012.
Accompanied by the associated MC for these periods.
Anchored to the new calibration resulting from the preceding RAW data pass.
Pb-Pb reprocessing is not in the plans at this moment.
User analysis should be less intense after QM'14.
CERN:
Conclusion of SLC6 job efficiency investigations.
Increased use of the Agile Infrastructure.
Alessandro comments that the KIT incident is an opportunity to review whether the monitoring covers all data access (transfers, remote and local jobs) by all VOs, in order to understand the load on a storage system: collect links to internal VO monitoring pages, check what is in the common Dashboard monitoring, and see what can be integrated. Maarten mentions that for ALICE remote access is monitored in MonALISA. Pablo comments that the Dashboard team has no resources right now for local access monitoring, but it can be added to the wish list. The discussion will continue offline on the WLCG Ops Coord monitoring mailing list.
Alessandro asks about the metrics used by AliEn to define two sites as 'close' for remote processing; according to Maarten and Pablo, these include RTT measurements, the domain, and network tests between the VOBOXes.
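A hypothetical closeness score along the lines described, combining an RTT measurement with a domain-match bonus (the weights are invented for illustration and are not AliEn's actual metric):

```python
def closeness_score(rtt_ms, same_domain):
    """Lower score = closer. RTT dominates; a domain match gives a
    fixed bonus. Weights are purely illustrative."""
    score = float(rtt_ms)
    if same_domain:
        score -= 50.0  # hypothetical bonus for sites in the same domain
    return score

def closest_site(candidates):
    """candidates: list of (site_name, rtt_ms, same_domain) tuples.
    Return the name of the site with the best (lowest) score."""
    return min(candidates, key=lambda c: closeness_score(c[1], c[2]))[0]
```

For example, a same-domain site at 60 ms could outrank a foreign-domain site at 20 ms under these weights, while a sufficiently low RTT still wins outright.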
ATLAS
Alessandro presents the status and plans of ATLAS for the next months: the main points are the new Tier-0, the new production system, the Rucio migration, database access, the request to deploy FAX xrootd and HTTP/WebDAV at all sites, and multicore. See slides for details.
Alessandro clarifies that the SAM tests are not affected by the Rucio migration, because they don't interact with the catalogs, only the storage.
Alessandro clarifies that the timeline for sites to enable multicore for ATLAS is NOW if the site can perform dynamic provisioning. Sites that cannot should contact ATLAS central operations (Alessandra Forti) to discuss on a case-by-case basis.
Alessandro explains that ~90% of the sites have joined FAX; 2 T1s (NDGF and IN2P3) and a few T2s are missing. 85% of the sites have enabled HTTP/WebDAV for the Rucio renaming, but the service is not yet production-quality for remote access at all sites (e.g. EOS), since the requirements are different.
Stefan asks about the reasoning behind the request for a dedicated LSF cluster for the ATLAS Tier-0, since this will introduce a partitioning of the resources. Alessandro answers that during 2012 data taking there were 10 ATLAS alarms for LSF, and the problems were either not identified or were caused by other users; according to IT-PES, a smaller instance can cope better with the load. CMS is instead moving its Tier-0 to the Agile Infrastructure, possibly introducing partitioning in a different way. The discussion is adjourned to the next meeting, since no Tier-0 representative is attending the current one.
CMS
DBS2 has been switched off and will most likely remain off
Decommissioning of CE tags for individual CMSSW releases
Still to come in April
glexec SAM test
Remaining scheduling issues solved
Want to go ahead and make it critical at the beginning of May
CSA14
Tier-1 Tape Staging test in May
Analysis Challenge in Summer (July & August)
Preparing samples
Multi Core
First production through multicore pilots at PIC
Technical tests ongoing at other sites
xrootd (AAA)
Related SAM tests not yet critical
Scale testing of AAA (xrootd federation) ongoing, concentrating on European sites
Migration from Savannah to GGUS ongoing
Transfer Team and Workflow started to use it
Minor issues being addressed with GGUS team
Usage to be enlarged over the coming weeks
FTS3
Finish moving sites to FTS3 for Debug transfers
Condor_g mode for SAM
SAM gLiteWMS decommissioning for end June confirmed?
Marian confirms that the June deadline for the new Condor_g SAM probes is correct, though the current schedule is 2-3 weeks late. A prototype is available; the schedule for deployment to preproduction is being established.
LHCb
Data Processing
VAC
Used in production with several hundred VMs at Manchester, Oxford and Lancaster
Infrastructure moved to CERNVM3 / SLC6
Unification / rewrite of the pilot framework in LHCbDIRAC
same infrastructure to be used at WNs, VAC, BOINC, CLOUD
including machine / job features
WMS decommissioning
WMS server decommissioning at CERN went without problems
Currently LHCb is submitting < 4 % of its pilots through WMS to small sites
Decommissioning of the remaining sites will continue at low priority
Data Management
Tier2Ds (D==Disk)
Many Tier2Ds are using DPM as storage technology
A bug in the SRM/xrootd interface was fixed earlier this year and the fix verified for version 1.8.8 - many thanks to CBPF.br and NCBJ.pl
The service has been used 100% in production by LHCb for several months
The client was used in "FTS2 mode" within LHCbDIRAC.
Planning to use the REST interface (Python only, avoiding Boost and other C++ dependencies)
File Access
The upcoming release of LHCbDIRAC will contain the ability to use natively built xrootd tURLs without going through SRM.
The next step will be to integrate HTTP/WebDAV access; this should be less work given the work already done, but so far fewer endpoints are available.
LCG file catalog to DIRAC file catalog migration
The migration procedure is currently being prepared; there is no estimate yet on the final schedule for the migration. The objective is still before the end of the year.
CASTOR -> EOS migration
LHCb has already been using EOS for production data for several months
The last missing bit was the user data migration, which is scheduled for April 22nd - many thanks to DSS for their support!
Infrastructure
IPv6
A test infrastructure is currently being set up in LHCb
New LHCb representative to the WLCG IPv6 TF and HEPiX: Raja Nandakumar
perfSONAR
Waiting for the dashboard interface to consume the data
Stefan clarifies that the timeline for sites to deploy xrootd access is when they are ready.
Stefan confirms that the timeline for the switchover to DIRAC file catalog is before the end of the year, and LHCb intends to use it in Run2.
Shawn comments that the dashboard in the next perfSONAR release in May will have a REST API, through which the data gathered from perfSONAR will be exposed to WLCG.
Report from WLCG Monitoring Consolidation
Pablo presents the status of the monitoring consolidation project: recent updates, with focus on SAM3 validation and the site nagios plugin; next steps. See slides for detail.
Feedback to be submitted to the monitoring consolidation e-group for discussion. Requests/issues on JIRA tracker.
Ongoing Task Forces and Working Groups Review
WMS decommissioning
CERN WMS instances for experiments have been drained completely without incident
SAM instances to be decommissioned by the end of June
depending on successful validation of the new job submission methods developed for SAM
Maria Dimou explains that the "baseline version" number doesn't necessarily reflect all individual updates of the packages in the dependencies; it needs to be understood how to publish this.
FTS3
FTS3 is successfully managing the majority of experiment transfers; FTS2 decommissioning by August has been agreed with all experiments.
ATLAS and LHCb are already 100% on FTS3; CMS to complete its migration to FTS3 by June.
Understand transfer performance with FTS3: mostly validating FTS3 and Dashboard monitoring plots to see if experiment operations have all they need to understand FTS3 transfer behavior.
Expected result: validate optimizer performance for bulk of transfers; spot corner cases of problematic transfers which require dedicated debugging (e.g. at network level)
Validate submission with new clients/REST API and related new functionality. Timeline for integration of new features varies by experiment.
perfSONAR
Shawn presents the final report of the perfSONAR task force: deployed at 205 sites by the April 1st deadline, with only 8 sites missing, but 64 still running old versions. Lessons learned and important remaining issues. See slides for details.
xrootd deployment
Domenico presents the final report of the xrootd task force: deployment status, monitoring. Items to be followed up include keeping alive the communication between the two federations, the deployment of the dCache-XRootD monitoring plugin and the registration of the endpoints in the information systems. See slides for details.
Andrea comments that both the perfSONAR and the xrootd task forces have reached their goals. He proposes to close them and create a new task force or working group to follow up on the remaining issues, increasing the scope of network monitoring beyond perfSONAR and xrootd in the long term (e.g. HTTP/WebDAV, FTS3 monitoring). The working group would be the forum to establish the proper procedure for following up network issues. Shawn and Marian will lead this activity; they are asked to propose the mandate and goals and present them at the next WLCG Ops Coord meeting on May 8th. Markus suggests calling the group something like 'data access monitoring' instead of 'network monitoring' to clarify the goals.
Tracking tools evolution
The move from Savannah to JIRA was done for the GGUS shopping list tracker on April 9th
Migration of experiment trackers is outside the scope of this TF
IPv6
Andrea presents the status of the IPv6 task force: experiments status and plans, highlights from the HEPIX IPv6 F2F meeting. See slides for details.
Machine/Job Features
Batch infrastructure
Support for ALL batch system types available, including SLURM (many thanks to NDGF)
Deployment plan is to test on two sites initially and then roll out to remaining sites
LSF: CERN (done) & second site contacted
Condor: USC & second site contacted
SGE: Gridka (done) & Imperial (done)
Torque/PBS: NIKHEF (done) & second site contacted
SLURM: script being developed
Cloud infrastructure
Setting up a prototype infrastructure at CERN/Openstack (similar to what was done for CERN/LSF)
based on CouchDB + administration tools which are currently being written
Later move to more / other IaaS infrastructures
Client (mjf.py)
First version available in the WLCG repository and the LCG/AA AFS area for use by sites/experiments
Bi-directional communication
Currently under discussion and finalizing the structure
Maarten explains that the plans for the task force are to gather experience after CMS turns the glExec test critical; then discuss with LHCb how to ramp up; follow up with other experiments.
CMS is not running multicore yet, or at least not extensively enough to assess its impact on sites
ATLAS still has a wave-like submission pattern, which is most disruptive
So far, in the absence of wall-time information and/or a steady stream of multicore jobs, the most successful scheduling model is (dynamic) partitioning, especially at sites that can, one way or another, limit the number of cores being drained at a time.
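The draining cap just described can be sketched as a toy scheduler that either places a multicore pilot on a node that already has enough free cores, or starts draining one more node, but only while the total number of cores idling on draining nodes stays under the cap (the node model and parameters are illustrative, not any site's actual batch configuration):

```python
def schedule_multicore(nodes, slot_size, drain_cap):
    """Try to place a `slot_size`-core pilot. `nodes` is a list of dicts
    with 'name', 'free_cores' and 'draining'. If no node has enough free
    cores, mark one more node as draining, but only if the cores already
    idling on draining nodes stay below `drain_cap`. Returns the chosen
    node's name, or None if the pilot must wait. Illustrative only."""
    # A node with enough contiguous free cores can run the pilot now.
    for n in nodes:
        if n["free_cores"] >= slot_size:
            return n["name"]
    # Otherwise consider draining one more node, bounded by the cap.
    idle = sum(n["free_cores"] for n in nodes if n["draining"])
    if idle < drain_cap:
        candidate = max(nodes, key=lambda n: n["free_cores"])
        candidate["draining"] = True  # stop dispatching single-core jobs here
    return None  # pilot waits until a node finishes draining
```

The cap is what keeps the wasted capacity bounded: without it, a wave of multicore requests would drain many nodes at once and leave their cores idle simultaneously.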