THE IDEAL WLCG INFORMATION SYSTEM
=================================

Author: Andrea Sciaba'
Date: January 8, 2013

NOTE: this is not the CMS point of view but my personal view, as it
discusses some implementation details that a VO should not bother
about. It should not be quoted as coming from CMS and can be safely
ignored.

The purpose of the information system (IS) is to allow users to know:

1) what resources exist in the Grid (resource discovery)
2) the properties of the resources (resource selection and usage)
3) the status of the resources (resource monitoring)

1) and 2) are equally important, and more important than 3).

Resource discovery
------------------

The first step for the user is to know what resources are in the Grid.
Typical queries are:

- List all sites
- List all instances of services of type X at site Y

At this level, there are only two categories of objects: sites and
services. Each of them has a unique identifier, and services also have
a type and a parent site. This information is static and should never
change during the lifetime of the site or service. It should always be
available, even if the site or service is temporarily unavailable.

The IS should provide a single, well-defined endpoint to retrieve this
information for the whole WLCG (including the Tier-3 sites that are
officially part of EGI, OSG or NorduGrid). As such, this part of the IS
may be a service specific to WLCG, like REBUS. Any change to the
published sites and services made by site administrators must be
quickly propagated to this service. How this happens is an
implementation detail. The service interface should be easy to use,
possibly a REST interface using popular formats, like JSON or XML, for
the output.

Resource properties
-------------------

Knowing the identifier of a resource is almost never enough to use it.
Therefore the IS should publish, directly or indirectly, all the
information that is needed to use a service.
Here "indirectly" refers to the case where the IS publishes endpoints
that can be queried in turn to retrieve the resource properties.

To avoid reinventing the wheel, the GLUE 2 schema should be used to
define the resource properties. GLUE 2 is mature enough that no new
requirements need to be imposed on it, so I will not impose any.
GLUE 1.3 is no longer maintained and cannot be part of a long-term
perspective.

The resource properties, by nature, should be generated by the service
itself, as they usually must be correct for the service to be usable as
intended. Their publication should be resilient to glitches and "short"
downtimes (e.g. whenever the unavailability of the service lasts less
than the validity of the result of a resource selection query).
Therefore the availability of the properties information should NOT be
used as an indication that the service is available.

Querying this information requires access to all attributes, both to
retrieve their values and to express requirements on them. It must be
possible to exploit relationships between different types of services
and their sites. Typical queries are:

- List all CEs in country X accepting CMS jobs and return their
  complete Globus identifiers
- List all CEs "close" to a given SE
- List all CEs with SL6 and with more than 100 job slots

It is highly desirable to be able to query OSG, EGI and NorduGrid
resources in exactly the same way. A fully flexible query language is
not absolutely required, but the query results should be easily
machine-parseable.

We can distinguish two types of properties:

a) those that MUST be correctly known for the service to be usable
   (e.g. ports or endpoints)
b) those that, if wrong, do not directly affect the usability but
   prevent some queries from returning correct results (e.g. number of
   cores or total disk space).

It must be guaranteed that the published values are at least compliant
with the schema and meaningful.
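
As an illustration only, a check for "compliant and meaningful" values
could look like the sketch below. It assumes that a service record is
available as a plain dictionary; the attribute names (endpoint_url,
total_cores, total_disk_gb) and the validate() helper are hypothetical
and are not taken from the GLUE 2 schema. It only rejects obviously
wrong values and cannot detect "realistic" but wrong ones.

```python
from urllib.parse import urlparse

# Hypothetical attribute names, chosen for illustration only;
# they are NOT GLUE 2 attribute names.
COUNT_ATTRIBUTES = ("total_cores", "total_disk_gb")  # type b) properties


def check_endpoint(url):
    """Return an error string if the endpoint is unusable, else None."""
    try:
        parsed = urlparse(url)
        if not parsed.scheme or not parsed.hostname:
            return "endpoint_url: not an absolute URL"
        parsed.port  # raises ValueError if the port is not a valid number
    except ValueError:
        return "endpoint_url: invalid port"
    return None


def validate(record):
    """Return the list of problems found in one published record."""
    problems = []

    # Type a): the endpoint must be present and well formed.
    url = record.get("endpoint_url")
    if not url:
        problems.append("endpoint_url: missing")
    else:
        error = check_endpoint(url)
        if error:
            problems.append(error)

    # Type b): size-like values must at least be non-negative integers.
    for name in COUNT_ATTRIBUTES:
        if name not in record:
            continue  # treated as optional in this sketch
        value = record[name]
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            problems.append(f"{name}: not a non-negative integer")

    return problems
```

A publisher could run such a check before exposing a record and refuse
to publish anything for which validate() returns a non-empty list.
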
Ideally, it should never be possible to see an obviously wrong value
being published. This still leaves room for "realistic" but wrong
values. Service developers should ensure that properties of type a) are
never wrong and properties of type b) are not "too" wrong (e.g.,
size-like values should be within a few percent of their true value).
It should never happen that some properties are so broken that people
give up on them and even stop asking for them to be correct.

Changes to these properties must be propagated within a few minutes, in
particular for type a). Typically, changes are due to configuration
changes or upgrades of the service, and are therefore the result of
human intervention.

Resource monitoring
-------------------

Service monitoring information need not come from the IS: it is
acceptable to get it from the service itself, according to its own
methods. Here, the IS may provide a more convenient way to access the
information. If it does, though, it must ensure that the information is
correct and coherent with any other way to retrieve the same
information from the service.

In WLCG, there is currently no use case for resource matching based on
highly dynamic information. In any case, there must not be "bogus"
information: whatever cannot be reliably calculated must not be
published. Hopefully this will not be the case for any mandatory
attribute.

VO-custom information
---------------------

Experience shows that VOs need to generate their own site and service
information. Examples are:

- VO-specific site names (possibly aggregating several sites)
- custom services (i.e. services not known to GOCDB/OIM or completely
  managed by the VO)
- experiment contacts at sites
- services relevant for the VO at the site
- tier number
- parent/child relationships between sites
- pledges
- etc.

There is no need for the IS to accommodate this information, which is
best managed by the VO using their own databases.
Traditionally, the ability to publish experiment "tags" to CEs was
successfully exploited, but alternatives could be devised. I do not
consider it a requirement. The VOs should do their best, though, not to
unnecessarily duplicate information that is easily discovered from the
IS, or should at least put in place automatic validation tools that
ensure that the information in their own databases does not become
inconsistent with the IS.

Protocols and interfaces
------------------------

The only real requirements are a REST interface or a convenient API
supported by popular languages, standard output formats and a simple
query language. The actual protocol (e.g. LDAP) should be hidden, if
possible. All the GLUE attributes should be exposed by definition, as
this does not add to the complexity of the API and makes it easier to
implement new use cases.

Role of GOCDB and OIM
---------------------

The purpose of GOCDB and OIM is to contain "custom" information useful
for their Grid projects, for example to publish contact information,
production status and downtimes. As GOCDB and OIM are two very
different systems, it would be desirable to implement a uniform
interface allowing users to transparently access the relevant
information.

Concerning the idea of expanding them to support more functionality
proper to the IS, it does not seem an ideal solution, as it requires
changes to both systems to be made in coordination. However, it might
be the best option for new services which do not have an information
provider and which are used by more than one VO.
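
The uniform interface over GOCDB and OIM mentioned above could be a
thin adapter layer, as in the following sketch. It assumes that each
system returns site records as dictionaries; the two record layouts,
the SiteInfo type and the function names are invented for illustration
and do NOT reflect the actual GOCDB or OIM data formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteInfo:
    """Common view of a site, independent of where the data came from."""
    name: str
    contact: str
    in_production: bool


# Both input layouts are hypothetical stand-ins, not real formats.
def from_gocdb_like(record):
    """Convert a flat record with 'site', 'email' and 'status' keys."""
    return SiteInfo(
        name=record["site"],
        contact=record["email"],
        in_production=record["status"] == "production",
    )


def from_oim_like(record):
    """Convert a nested record with 'group', 'contacts' and 'active'."""
    return SiteInfo(
        name=record["group"]["name"],
        contact=record["contacts"][0],
        in_production=bool(record["active"]),
    )


# One converter per source; users only ever see SiteInfo objects.
ADAPTERS = {"gocdb": from_gocdb_like, "oim": from_oim_like}


def site_info(source, record):
    """Return the common SiteInfo view of a record from the given source."""
    return ADAPTERS[source](record)
```

With this approach neither system has to change: only the adapter
needs to follow the evolution of each source, and client code depends
on the common SiteInfo view alone.
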