What should be taken into account when deciding the status of sites:
results of SAM tests
availability of the site calculated from SAM test results
number of running/pending jobs
published SW versions
Piotr objected to the idea that everyone uses the SAM programmatic interface (in particular for historical queries, which are heavy) to calculate site availability on their own.
Instead we need to think about how to provide the necessary flexibility in SAM itself for calculating availability.
Julia and Roberto:
LHCb would like the possibility to run a job which publishes to SAM the aggregated results
of sanity checks done via normal Dirac jobs over a certain time range. From SAM these results should then be published back to the local site fabric monitoring.
Why publish them to SAM at all? We should be careful not to overload SAM by publishing more and more tests.
It was suggested to publish directly from Dirac, but using the same publishing mechanism as SAM.
How would site admins like to be informed about how their site is behaving from the point of view of the VOs they serve?
So far 4 CMS sites have been asked, and all answers are different:
San Diego - mail every hour
Brunel - GridView
Taiwan - published back to Nagios
FNAL - runs tests locally (not always sufficient).
Local tests should improve the time granularity, complementing the tests running remotely
- local versions of the CMS tests may be run by sites, but they should not replace the official tests, only complement them
Stefano said that CMS might be interested in trying the prototype developed by the Grid service monitoring WG for publishing results back to local fabric monitoring. Julia sent him a link to the twiki page with the prototype description.
For the workshop it would be nice if people came up with suggestions for how the UI showing site and service status from the VO perspective should look.
- the CMS SRM tests will be fed for integration to the SAM team
- the FNAL problem with availability should be fixed ASAP in one way or another, probably with a hack in GridView; otherwise it must be made clear to the MB that the FNAL availability is wrong
- the FCR functionality to exclude SEs is broken, because it makes sense only for classic SEs
- the effect of critical tests on CE exclusion from the BDII via FCR should be decoupled from the effect on the availability calculation
We can have different kinds of aggregated metrics with logical 'and', which can trigger different actions, not just exclusion from the BDII.
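The idea of aggregated metrics driving different actions can be sketched as follows. This is a minimal illustration only; the function and test names are hypothetical and do not reflect the actual SAM schema.

```python
# Hypothetical sketch: aggregate per-test results into a site verdict
# with a logical AND over a chosen set of critical tests. Different
# critical sets form different metrics, each triggering its own action.

def site_ok(test_results, critical_tests):
    """Site passes only if every critical test passed (logical AND)."""
    return all(test_results.get(t) == "OK" for t in critical_tests)

# Illustrative test names, not real SAM test identifiers.
results = {"CE-job-submit": "OK", "SRM-put": "ERROR"}

# One metric decides BDII exclusion, a stricter one only raises an alarm.
if not site_ok(results, critical_tests={"CE-job-submit"}):
    print("exclude site from BDII")
elif not site_ok(results, critical_tests={"CE-job-submit", "SRM-put"}):
    print("raise alarm, keep site in BDII")
```

The point is that exclusion from the BDII becomes just one of several possible consequences of an aggregated metric, rather than the only one.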
Should the algorithm for calculating SAM availability be the same for all VOs and different use cases?
The algorithm should be the same for all VOs, but there should be flexibility in defining the critical tests.
It might be useful to make the criticality of a test depend on the tier or other attributes of the site.
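A rough sketch of what this could mean in practice: a common availability algorithm for all VOs, parameterised by a VO- and tier-dependent set of critical tests. All names here are illustrative assumptions, not the actual SAM implementation.

```python
# Hypothetical sketch: one common availability algorithm, with the
# set of critical tests defined per (VO, tier). Test names invented.

CRITICAL_TESTS = {
    ("cms", "Tier-1"): {"CE-job-submit", "SRM-put", "SRM-get"},
    ("cms", "Tier-2"): {"CE-job-submit"},
}

def availability(test_results, vo, tier):
    """Common algorithm: fraction of critical tests that passed."""
    critical = CRITICAL_TESTS.get((vo, tier), set())
    if not critical:
        return 1.0  # no critical tests defined -> treat as available
    passed = sum(1 for t in critical if test_results.get(t) == "OK")
    return passed / len(critical)
```

With this structure a test such as SRM-put can be critical for a Tier-1 but irrelevant for a Tier-2, without changing the algorithm itself.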
Granularity of info in SAM: not just per host, but also, for example, per pool for storage-related tests.
Max showed the GridMap prototype.
It needs to be shown to people in the experiments, so that they can think about use cases.
One of the issues in adopting the tool for the different experiments' needs is how to get the experiment topology.
Some effort is required to define a common way to publish experiment topology so that it can be used by any application, not only by the experiment. On the other hand, this info should be kept up to date by the experiment people.