Topic
Today we shall discuss the new data model architecture, notably with respect
to inter-record relations and use cases.
Summary
Here is a summary of the meeting, written by Tibor. Please correct any thoughts and/or add your comments in the section below.
We considered the following use cases:
- A bibliographic record is updated (e.g. a new blob arrives from outside in which one affiliation of one author is changed)
- Refextract/Author disambiguation needs to link fields to records
- Searching for papers authored by X
- Searching for the institution records of the authors of paper X
- Associating a given author record with an account
- Exposing via REST the IDs of authors of a given paper
- Cataloguer/user wishing to correct an association from a paper to a person
- Merging 2 records
- Record editor
- Record indexing in elasticsearch
- Record validation
- Machine-related queries
- Links to non-record objects: KBs, controlled vocabularies, workflow objects
- Links to objects stored on remote sites: plots.inspirehep.net, data.inspirehep.net ?
We considered the following wanted features:
- Relations
- Schema
- Query language
- Scalability
- User / Developer community
- Ecosystem (existing libraries built around)
- Dependencies (DB required)
- Easy adoption by community (e.g. MARC import/export)
We considered the following approaches:
- Everything ORM
- JSONSchema in PostgreSQL
- JSONSchema in PostgreSQL and Relation table
- JSONSchema in Multimodel OrientDB/ArangoDB
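To make the JSONSchema-based approaches more concrete, here is a minimal sketch of what a record with a relation pointer might look like. All field names and the `$ref` pointer convention are assumptions for illustration, not an agreed format:

```python
# Hypothetical record shape for the JSONSchema-based approaches:
# relations are stored inside the record itself as JSON pointers
# to other records (the "$ref" convention is assumed here).
record = {
    "recid": 1,
    "title": "Some paper",
    "authors": [
        {"full_name": "Doe, Jane", "record": {"$ref": "/records/42"}},
        {"full_name": "Smith, John"},  # no authority record linked yet
    ],
}

def linked_record_refs(record):
    """Collect the '$ref' pointers found in the authors field."""
    return [a["record"]["$ref"] for a in record.get("authors", [])
            if "record" in a]

refs = linked_record_refs(record)
```

Such a shape keeps the record a plain JSON document, so it can be stored as-is in a PostgreSQL JSONB column or in a multimodel store alike.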
Each approach has its set of pros and cons that I will not specify in detail here. Opinions on the pros and cons differed, but at the end of the day we seemed to have consensus around the following.
- The Everything ORM solution resembles an "improved bibxxx"-like technique, with one nice feature: native linking to non-record objects such as the user table. A JSON solution can also link to non-record objects, but without taking advantage of relational constraints.
- The JSONSchema way was the most favoured approach to model the application domain. Content-wise, it should permit easy bibliographic record management for standards such as MARC. Technology-wise, it should preferably be used with PostgreSQL in order to take advantage of the speedy JSONB bulk update operations in 9.5. Opinions differed as to how many backends should be supported. (More below.)
- Invenio's Records module should offer simple CRUD facilities (create, replace, update, delete). All facilities on top, such as searching in more than one record, would rely on Elasticsearch.
- Relations are modeled as pointers to other records, similar to authority handling and/or record pointing techniques. The pointers are stored in the record itself. After a record is updated, and before populating Elasticsearch, some of the relations may be dereferenced so that the ES-searchable JSON is enriched with related information. (Beware of dereferencing too many levels deep.)
- The records' backend storage is currently MySQL. Pros: it works now. Cons: it cannot search in, or update, more than one record efficiently. Since machine-related queries will rely on Elasticsearch, this does not really matter.
- The records' future wanted storage is PostgreSQL. Currently, only ~8 tests out of ~700 are failing, so Invenio-on-PostgreSQL could be released soon. However, the demo site regression tests still need work, plus all the non-tested code.
- Records uses CRUD, so we could use MySQL already now. If a backend supports more efficient operations, we could take advantage of them. However, a tiny API wrapper is wanted in order not to expose the underlying backend too much (avoiding vendor lock-in).
- Creating a hard dependence on PostgreSQL, and taking full advantage of it, was the most favoured option. However, we are not there yet. A common sprint towards this goal was considered.
- Since machine queries will rely on Elasticsearch, we do not need to speed up the PostgreSQL work as a prerequisite. We can rather concentrate on finishing the Elasticsearch enablement.
- OrientDB/ArangoDB as the main storage was less favoured. However, if more advanced mining tools are needed, they could be fed JSON stream information from Records, just as Records feeds Elasticsearch. The mining tools could then enrich the record with more information, which would be stored back in the metadata and made searchable.
- JSONAlchemy may not even need to be released as an independent package. It is advantageous to stick to the naked JSONSchema standard, without local extensions and customisations such as calculated fields. This would enable us to publish our schema and allow people to easily build on it.
- Flask-Registry will collect schema parts from the various modules. The modules would update record information via the fast CRUD API.
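The "dereference relations before feeding Elasticsearch" idea discussed above could be sketched roughly as follows. This is a toy model: the `$ref` pointer convention, the resolver callable, and the depth limit are all assumptions, not the actual Invenio API:

```python
import copy

def dereference(obj, resolve, depth=1):
    """Replace {"$ref": ...} pointers with the referenced record's JSON.

    `resolve` maps a pointer string to a record dict.  `depth` bounds the
    recursion so we do not embed too many levels of related records:
    pointers beyond the depth budget are left as raw pointers.
    """
    if isinstance(obj, dict):
        if "$ref" in obj:
            if depth <= 0:
                return obj  # depth budget exhausted: keep the raw pointer
            return dereference(copy.deepcopy(resolve(obj["$ref"])),
                               resolve, depth - 1)
        return {k: dereference(v, resolve, depth) for k, v in obj.items()}
    if isinstance(obj, list):
        return [dereference(v, resolve, depth) for v in obj]
    return obj

# Toy store standing in for the records database:
db = {"/records/42": {"name": "Doe, Jane",
                      "affiliation": {"$ref": "/records/99"}},
      "/records/99": {"name": "CERN"}}

paper = {"title": "A paper", "authors": [{"$ref": "/records/42"}]}
enriched = dereference(paper, db.__getitem__, depth=1)
```

With depth=1 the author record is embedded but its affiliation stays a pointer, which is the kind of limit the "not too many deep levels" remark calls for.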
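The tiny API wrapper wanted to avoid exposing the underlying backend could look roughly like this. The class and method names are illustrative only, and an in-memory dict stands in for MySQL/PostgreSQL:

```python
class RecordStore:
    """Minimal CRUD facade over an arbitrary backend.

    The backend here is an in-memory dict; swapping it for MySQL or
    PostgreSQL should not change this public API, which is the point
    of wrapping the backend (avoiding vendor lock-in).
    """

    def __init__(self):
        self._backend = {}

    def create(self, recid, record):
        self._backend[recid] = dict(record)

    def read(self, recid):
        return self._backend[recid]

    def replace(self, recid, record):
        self._backend[recid] = dict(record)

    def delete(self, recid):
        del self._backend[recid]

store = RecordStore()
store.create(1, {"title": "Old title"})
store.replace(1, {"title": "New title"})
```

If a given backend offers more efficient bulk operations, they can be added behind the same facade without changing callers.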
Action plan:
- Exploratory mini-programming is wanted in order to confirm or refute the discussed expectations, notably to test the performance/scalability of the proposed solutions. E.g. we can form mini-teams and take an exploratory approach like http://tiborsimko.org/postgresql-mongodb-json-select-speed.html, adding relations to demo records via record pointers, and making some speed estimates for searching and updating many records at the same time, using the PostgreSQL/JSON, PostgreSQL/ORM, or ArangoDB approaches, or any other approach that may come up. (This is quick exploratory programming to confirm or refute expectations, not a big final solution yet.)
- People favouring the JSONSchema technique sketched the following action plan: (1) take the JSONSchema approach as is, without calculated fields; (2) implement Record CRUD storage on top of it, using MySQL; (3) implement before/after record update actions to update relations; (4) model people records via authority record pointing (no blocker here; usable since Invenio v1.2); (5) implement feeding Elasticsearch with JSON information (in later stages possibly enriched by dereferencing relations); (6) finish the Elasticsearch branch with respect to missing configuration, with basic analysers, so that we can plug it in as the main live IR system (= parallel common sprint A); (7) speed up any possible multi-record operation issues by using PostgreSQL (= parallel common sprint B).
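One possible shape for such a speed-estimate harness is sketched below. In-memory SQLite (standard library) stands in for the real backends, so the numbers say nothing about PostgreSQL or ArangoDB; only the measuring approach is illustrated:

```python
import json
import sqlite3
import time

# In-memory SQLite stands in for the backend under test.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, json TEXT)")
conn.executemany(
    "INSERT INTO records (id, json) VALUES (?, ?)",
    [(i, json.dumps({"recid": i, "title": "rec %d" % i}))
     for i in range(10000)],
)
conn.commit()

def update_all_titles(new_title):
    """Bulk-update one JSON field in every record (read-modify-write)."""
    rows = conn.execute("SELECT id, json FROM records").fetchall()
    updated = []
    for recid, blob in rows:
        rec = json.loads(blob)
        rec["title"] = new_title
        updated.append((json.dumps(rec), recid))
    conn.executemany("UPDATE records SET json = ? WHERE id = ?", updated)
    conn.commit()

start = time.time()
update_all_titles("updated")
print("updated 10000 records in %.3f s" % (time.time() - start))
```

Swapping the connection and the two SQL statements for PostgreSQL JSONB (or an ORM, or ArangoDB) operations would give comparable numbers for the same multi-record update workload.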
Comments
Please add any larger comments or technical pieces here.