WLCG Accounting Task Force Meeting
Attended:
Adrian, Greg, Costin, Dimitrios, Panos, Gorka, Florentia, Alessandro, Pepe, Alessandra, Jordi, John, Julia
Comments/discussion on APEL plans and WLCG requirements
APEL plans
Julia asked what the change to the Argo Messaging system would imply for the sites. Adrian replied that some configuration changes on the APEL client would be required.
Julia asked what the timeline is for the short-term and long-term plans. Adrian explained that short term means within a year, and the long-term plans run to the end of the project.
Gorka asked whether access to the EUDAT accounting will be restricted. Adrian replied that everyone would have access.
There was a question about how many sites have already enabled APEL storage space accounting. Adrian answered 140.
After the meeting Julia checked the numbers presented in the portal for some of those sites. They agree well with the ATLAS storage space accounting portal.
WLCG requirements
Split of raw wallclock and scaled wallclock for validation. Adrian thinks it is better to avoid additional processing in the repository to calculate raw wallclock from the start and end timestamps. John said that sites which generate summary reports themselves, as CERN does, would need to enable this split in their summaries.
No problem with using an alternative information source for the topology.
Alessandro suggested enabling the injection of data from the experiment-specific accounting systems, so that accounting data obtained through APEL can be compared in the portal with accounting data from the experiments.
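For illustration only, the relation between the two quantities could be sketched as below. This is a minimal sketch and not the APEL implementation: the function names, the ISO 8601 timestamp format, the example timestamps and the scaling factor of 1.5 are all assumptions made for the example.

```python
from datetime import datetime


def raw_wallclock_seconds(start: str, end: str) -> float:
    """Raw wallclock: elapsed time between job start and end.

    Timestamps are assumed to be ISO 8601 strings (an assumption of this
    sketch, not something stated in the minutes).
    """
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    if ended < started:
        raise ValueError("end time precedes start time")
    return (ended - started).total_seconds()


def scaled_wallclock_seconds(raw_seconds: float, scaling_factor: float) -> float:
    """Scaled wallclock: raw wallclock multiplied by a per-resource factor.

    The factor is hypothetical here; in practice it would come from the
    benchmark normalisation published by the site.
    """
    return raw_seconds * scaling_factor


# Hypothetical job: 2.5 hours of elapsed time, scaling factor 1.5
raw = raw_wallclock_seconds("2018-03-01T10:00:00+00:00", "2018-03-01T12:30:00+00:00")
scaled = scaled_wallclock_seconds(raw, 1.5)
```

Reporting both values side by side would let a validator recover the scaling factor that was applied (scaled divided by raw) and compare it with the one the site publishes.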
Discussion regarding HTCondor accounting
The solution developed at CERN is not currently used by CERN (probably with the exception of the condor-history parser). The current CERN implementation does not look like one that other sites could re-use in a straightforward way.
The PIC solution looks generic enough and could be re-used by other sites, though it is not clear how many sites have adopted it. Pepe mentioned Ciemat. Taiwan also expressed interest, but Pepe is not sure whether they finally applied it. The PIC solution does not cover accounting for jobs submitted locally.
Julia asked how many HTCondor sites (not counting OSG ones) already report to APEL. John checked GocDB after the meeting: apart from Ciemat, only PIC and CERN have labelled services as HTCondorCE. A number of other sites mention HTCondor, but they all seem to use it with ARC CE. Possibly the sites that enquired about HTCondorCE ended up with ARC. John will check the APEL DB.
Adrian mentioned that a CE parser for HTCondor is already in the APEL repository, though he was not sure whether it has been used by any site. Alessandra mentioned that the important point is deployment, which should be as straightforward as it is for CREAM accounting.
Alessandro suggested getting in touch with Brian in order to understand how HTCondor accounting works for OSG sites, in particular the part which generates the data and reports to Gratia. Julia will do so.
To conclude, it looks like we do not have a clear idea of what EGI sites do to enable HTCondor accounting: whether they know about existing recipes (from PIC, for example), implement something on their own, or just wait. Clearly, someone should own the problem, otherwise there won't be any progress.
We need to get more information (OSG, etc.) and come back to this discussion in order to have a concrete plan.
WLCG Storage Space Accounting
Pepe asked how many sites have already deployed SRR. Julia explained that storage providers are currently working on the implementation. DPM and EOS have progressed the most, but we have not yet reached the point where sites start deployment. This will require a coordination effort. The current implementation of the WLCG storage space accounting tool is based on the methods which are available today (such as SRM queries). However, it foresees a transparent switch to SRR as soon as SRR is deployed at a particular site.
Julia asked Pepe whether PIC had started to work on the implementation of metric reporting for tape storage. Pepe confirmed that they were progressing.
Dimitrios said that the WLCG monitoring team is working to enable an API for ElasticSearch, which is currently used as the storage backend.
Wrong CPU efficiency for CMS T2s in the EGI portal
Pepe went through the various problems which caused wrong efficiency (zero CPU reports, CPU > wallclock, etc.) and the reasons discovered for them. Alessandra suggested keeping track of all those issues and their solutions on the wiki which she started, so that the experience is shared.
The main discussion was then about possible improvements in detecting and debugging such issues, and about who should be in charge of this.
Alessandra expressed her concern that there is no manpower to follow up on all those problems on a regular basis.
After a long discussion, there were several suggestions which could improve the situation:
1). Expand the current validation view in SSB by adding CPU consumption and CPU efficiency. Add the possibility for sites to subscribe to SSB alarms, raised when an alarm condition is met. Pepe and John will come up with a proposal for what the alarm condition should be.
2). Julia suggested that experiments insist that sites regularly check the accounting reports and the SSB validation view. The role of the task force team is to help with debugging and to provide tools which make it possible to detect problems (like the SSB validation view).
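As an illustration of the symptoms listed above, a check on a single accounting record might look like the sketch below. It is not the SSB or APEL logic: the record layout, the field names and the wording of the messages are hypothetical, and the wallclock value is assumed to be already multiplied by the number of cores, so that CPU time should never exceed it.

```python
from typing import List, NamedTuple


class SiteRecord(NamedTuple):
    # Hypothetical record layout, for illustration only
    site: str
    cpu_seconds: float
    wallclock_seconds: float  # assumed to be already multiplied by core count


def check_record(rec: SiteRecord) -> List[str]:
    """Return the list of suspicious conditions found in one record."""
    problems = []
    if rec.wallclock_seconds <= 0:
        # Efficiency is undefined without wallclock, so stop here
        return ["no wallclock reported"]
    if rec.cpu_seconds == 0:
        problems.append("zero CPU reported")
    if rec.cpu_seconds > rec.wallclock_seconds:
        problems.append("CPU exceeds wallclock")
    return problems


def cpu_efficiency(rec: SiteRecord) -> float:
    """CPU efficiency as the ratio of CPU time to wallclock time."""
    return rec.cpu_seconds / rec.wallclock_seconds


# Hypothetical records covering the symptoms mentioned in the discussion
zero_cpu = SiteRecord("SITE-A", 0.0, 100.0)
over_wall = SiteRecord("SITE-B", 150.0, 100.0)
healthy = SiteRecord("SITE-C", 80.0, 100.0)
```

A record with no problems returns an empty list, so a caller could raise a notification only for records where the list is non-empty.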