- UKI ROC: There has been a recent change to Alice policy on job submission: they have moved from using multiple RBs at a site simultaneously to using one until it fails and then switching to the next. This appears to have been part of the reason one of our RBs was inaccessible last week. We'd like to know why this policy change was made.
- SWE ROC: At PIC we upgraded the SRM-disk service on 21 and 22 June. We declared *only* that service in Scheduled Downtime in the GOCDB. However, we have received some complaints from users (CMS) because, since the CE service was not closed, their jobs were still arriving at PIC, trying to contact the SRM-disk service, and of course failing. They tell us we should close all the services (including the CE) if we have an intervention on one of them. Is this so? We believe it makes more sense for users to check the availability of all the services they consider critical at a center, rather than assume that if a CE is up, all of the other services are up.
ANSWER from S. Traylen:
We need more information about what CMS is doing and how they
locate the SE in question, and also whether it is for reading or
writing a file.
The failures could have been avoided if PIC had done the following:
* Stop publishing the GlueCESEBind from the CE. This removes the
association between their CE and SE.
and CMS had done the following:
* Matchmake on their files with RB matchmaking, so that the
RB/WMS would not have matched against this CE in the first place.
I doubt that either of these things was done. The first because it
is currently hard: no tools are in place to do so. The second because hardly any users do this, though CMS would have to confirm.
Things that can be practically done:
0) Find out how CMS are locating the SE in question.
1) Give sites a tool that allows them to mark services as offline (or
critical, in fact). See
https://savannah.cern.ch/bugs/?func=detailitem&item_id=17777
for what the tool might do.
2) Have as many clients as possible respect the ServiceStatus flag.
lcg-utils would be a good place to start.
3) A fudge could be done in FCR (it is a complete fudge after all)
to have it delete GlueCESEBind values as well
for SEs marked down. This could be done, but it then still
relies on CMS locating the SE in an information-system-aware way. In
particular the DEFAULT_SE_<VO>, or whatever it is defined as at a
site, is not information system aware.
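
As a rough illustration of item 3), the sketch below filters a list of CE-SE
bindings against a set of SEs marked down. It is not FCR code: the function
name, the host names and the representation of GlueCESEBind entries as plain
(CE, SE) pairs are all assumptions made for the example.

```python
# Hypothetical stand-in for published GlueCESEBind entries: (CE id, SE id) pairs.
# None of these names come from FCR or from a real site.

def drop_bindings_for_down_ses(bindings, down_ses):
    """Return only the CE-SE bindings whose SE is not marked down."""
    down = set(down_ses)
    return [(ce, se) for ce, se in bindings if se not in down]


bindings = [
    ("ce.site-a.example", "srm-disk.site-a.example"),
    ("ce.site-a.example", "srm-tape.site-a.example"),
    ("ce.site-b.example", "srm.site-b.example"),
]

# The site has declared only its disk SE in downtime.
down_ses = ["srm-disk.site-a.example"]

visible = drop_bindings_for_down_ses(bindings, down_ses)
for ce, se in visible:
    print(ce, "->", se)
```

The CE itself stays published; only its association with the down SE
disappears. A job that picks its SE from a fixed setting such as
DEFAULT_SE_<VO> never consults these bindings and so is not protected.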
REPLY FROM CMS (Stefano Belforte): CMS analysis and production jobs are submitted with the JDL requirement
requirements = anyMatch(other.storage.CloseSEs,( target.GlueSEUniqueID == "srm.cern.ch") ) ;
[replace srm.cern.ch with the SE of various sites]
so what you indicate as 3) would work.
I am not sure that stopping publishing the GlueCESEBind from the CE
would be proper, since the association exists and, e.g.,
may be needed for FCR to know what to remove/add as SEs go on/off.
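
To make the effect on the CMS requirement concrete, here is a minimal sketch
of how an anyMatch test over a CE's close SEs behaves once a binding is
removed. It only imitates the logic of the JDL expression; it is not the
RB/WMS matchmaker, and the function names, host names and the dictionary
layout are assumptions made for the example.

```python
# Hypothetical model: each CE maps to the list of SE ids it publishes as close.

def any_match_close_se(close_ses, wanted_se):
    """True if any close SE has the wanted unique ID (mimics anyMatch)."""
    return any(se == wanted_se for se in close_ses)


def matching_ces(ce_close_ses, wanted_se):
    """Return the CEs whose close SEs include the wanted SE, sorted by name."""
    return sorted(
        ce for ce, close_ses in ce_close_ses.items()
        if any_match_close_se(close_ses, wanted_se)
    )


published = {
    "ce01.site-a.example": ["srm.site-a.example"],
    "ce01.site-b.example": ["srm.site-b.example"],
}

# With the binding published, the job can be matched to site A's CE.
before = matching_ces(published, "srm.site-a.example")

# Once the binding for the down SE is removed, no CE satisfies the requirement.
after_removal = dict(published)
after_removal["ce01.site-a.example"] = []
after = matching_ces(after_removal, "srm.site-a.example")

print(before)
print(after)
```

Under these assumptions a job requiring the down SE simply finds no matching
CE while the SE is out, instead of landing at the site and failing there.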