GROBID [1] is a machine learning library for extracting, parsing, and restructuring raw documents such as PDFs into structured TEI-encoded documents, with a particular focus on technical and scientific publications.
INSPIRE-HEP will use GROBID in its general ingestion workflow to (a) let catalogers work faster with less typing and (b) potentially serve as an automatic tool for bibliographic reference extraction. For the first iteration on INSPIRE Labs, we aim to provide an interface where catalogers can upload any PDF, get back the extracted metadata, and then push the results to the system.
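As a rough illustration of the "get back extracted metadata" step, the sketch below pulls a title and author names out of a TEI header of the kind GROBID produces. It is not the code of the Invenio module: the function name `extract_header_metadata`, the `SAMPLE_TEI` document, and the choice of fields are assumptions made for this example, and in a real deployment the TEI would come from a running GROBID service rather than from a hard-coded string.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_header_metadata(tei_xml):
    """Return the title and author names found in a TEI header (hypothetical helper)."""
    root = ET.fromstring(tei_xml)

    # The document title sits under teiHeader/fileDesc/titleStmt.
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = (title_el.text or "").strip() if title_el is not None else ""

    # Each author is a persName whose children hold forename(s) and surname.
    authors = []
    for pers in root.findall(".//tei:sourceDesc//tei:author/tei:persName", TEI_NS):
        parts = [el.text.strip() for el in pers if el.text and el.text.strip()]
        if parts:
            authors.append(" ".join(parts))

    return {"title": title, "authors": authors}


# Hand-written sample standing in for GROBID output (an assumption, not real data).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Sample Title</title></titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename>Ada</forename><surname>Lovelace</surname></persName></author>
            <author><persName><forename>Alan</forename><surname>Turing</surname></persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

if __name__ == "__main__":
    print(extract_header_metadata(SAMPLE_TEI))
```

A cataloger-facing interface could present such a dictionary as pre-filled form fields for review before the record is pushed to the system.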
This presentation will briefly show how GROBID is set up in our infrastructure and integrated into INSPIRE Labs via a specialized Invenio module [2]. We will also touch upon possible extensions of this tool and its future use cases.
[1] http://grobid.readthedocs.org/en/latest/
[2] https://github.com/inspirehep/invenio-grobid