(Self) Certification of WLCG / HNSciCloud Sites (Repositories)
This can be considered an "audit" of the implementation of the WLCG Data Preservation Strategy (described in somewhat terse terms in the WLCG MoU).
Taken together with the corresponding Data Management Plan (DMP), this should give us a comprehensive understanding of where we are with "Data Preservation for sharing, re-use etc.", covering the areas where we are weak as well as those where we are strong.
(Active) Data Management Plans for WLCG & HNSciCloud
Tentative update schedule for HNSciCloud:
o Tender publication (May 2016) [Were all the use cases retained?]
o Start of prototype phase (April 2017) [Could all the use cases be satisfied?]
o Start of pilot phase (January 2018) [Could all the use cases be satisfied?]
o End of project (June 2018) [Are the plans for preservation beyond the end of the project still valid?]
Data set description
Description of the data that will be generated or collected: its origin (in case it is collected), nature and scale, to whom it could be useful, and whether it underpins a scientific publication. Information on the existence (or not) of similar data and the possibilities for integration and reuse.
Standards and metadata
Reference to existing suitable standards of the discipline. If these do not exist, an outline on how and what metadata will be created.
Data sharing
Description of how data will be shared, including access procedures, embargo periods (if any), outlines of technical mechanisms for dissemination and necessary software and other tools for enabling re-use, and definition of whether access will be widely open or restricted to specific groups. Identification of the repository where data will be stored, if already existing and identified, indicating in particular the type of repository (institutional, standard repository for the discipline, etc.).
In case the dataset cannot be shared, the reasons for this should be mentioned (e.g. ethical, rules on personal data, intellectual property, commercial, privacy-related, security-related).
Archiving and preservation (including storage and backup)
Description of the procedures that will be put in place for long-term preservation of the data. Indication of how long the data should be preserved, what its approximated end volume is, what the associated costs are and how these are planned to be covered.
Annex 2 of H2020 DMP Guidelines
Annex 2: Additional guidance for Data Management Plans
This can be applied to any project that produces, collects or processes research data, and is included here as a reference for elaborating DMPs in Horizon 2020 projects. This guide is structured as a series of questions that should ideally be clarified for all datasets produced in the project.
Scientific research data should be easily:
Discoverable
a. DMP question: are the data and associated software produced and/or used in the project discoverable (and readily located), identifiable by means of a standard identification mechanism (e.g. Digital Object Identifier)?
Accessible
a. DMP question: are the data and associated software produced and/or used in the project accessible and in what modalities, scope and licences (e.g. licencing framework for research and education, embargo periods, commercial exploitation, etc.)?
Assessable and intelligible
a. DMP question: are the data and associated software produced and/or used in the project assessable for and intelligible to third parties in contexts such as scientific scrutiny and peer review (e.g. are the minimal datasets handled together with scientific papers for the purpose of peer review, are data provided in a way that judgements can be made about their reliability and the competence of those who created them)?
Useable beyond the original purpose for which it was collected
a. DMP question: are the data and associated software produced and/or used in the project useable by third parties even a long time after the collection of the data (e.g. is the data safely stored in certified repositories for long-term preservation and curation; is it stored together with the minimum software, metadata and documentation to make it useful; is the data useful for the wider public needs and usable for the likely purposes of non-specialists)?
Interoperable to specific quality standards
a. DMP question: are the data and associated software produced and/or used in the project interoperable allowing data exchange between researchers, institutions, organisations, countries, etc. (e.g. adhering to standards for data annotation, data exchange, compliant with available software applications, and allowing re-combinations with different datasets from different origins)?
This "virtual annex" consists of guidelines and questions from other funding agencies that complement those from H2020, as well as from the LHC (and other) experiments themselves.
What is the relationship between the data you are collecting and any existing data? (NSF)
Requirement #2: DMPs should provide a plan for making all research data displayed in publications resulting from the proposed research open, machine-readable, and digitally accessible to the public at the time of publication. (DoE)
This includes data that are displayed in charts, figures, images, etc.
In addition, the underlying digital research data used to generate the displayed data should be made as accessible as possible to the public in accordance with the principles stated above.
This requirement could be met by including the data as supplementary information to the published article, or through other means.
The published article should indicate how these data can be accessed.
Additional requirements from LHC experiments (aka "CAP Use Cases")
1. The person having done (part of) an analysis is leaving the collaboration and has to hand over the know-how to other collaboration members.
2. A newcomer would like to join a group working on some physics subject.
3. In a large collaboration, it may occur that two (groups of) people work independently on the same subject.
4. There is a conflict between the results of two collaborations on the same subject.
5. A previous analysis has to be repeated.
6. Data from several experiments, on the same physics subject, have to be statistically combined.
7. A working group or management member within a collaboration wishes to know who else has worked on a particular dataset, piece of software or Monte Carlo (MC) sample.
8. A presentation or publication is submitted for internal collaboration review and approval, but comprehensive metadata are lacking.
9. Preparing for Open Data Sharing.
10. Validating archived data.