
∞ Data Library - Concept

Author: Alexey Zaitsev


Concept:

The primary concept of the ∞ Data Library (∞DL) model is that each data object generated in a research laboratory represents a unique event of data collection, traceable to a human data producer, the equipment used, the research subject(s) (e.g., animals, plants, soil samples), the date and time, the place, and the context of data production in the form of a global research process.

Architecture:

Data objects are divided into Raw Data, Derived Data, and Report objects. Raw Data objects are treated with special deference as representations of unique events in a research process, and are first-class citizens in the ∞DL universe. Data files in formats “foreign” to ∞DL first need to be converted to the ∞DL data format using an API provided by ∞DL. Converted data objects are accepted into ∞DL following validation, which includes a check for the presence of mandatory attributes specifying the context of data object production and a best-effort check on data meaningfulness. Once accepted by ∞DL, data objects are “sealed” with the output of a hash function (e.g., SHA256 or SHA512) computed on essential features of the data object; this value serves as a world-wide unique data ID. Data object integrity can be checked at any time by having ∞DL recompute the hash function value and comparing it with the ID. Permanent deletion of ∞DL objects is out of the control of an end user and is performed by the system according to established policies. The trace of a deleted data object remains in the system indefinitely.

Geospatially, or in the context of the global research process, the ∞DL system is a two-tiered federation. Any individual or research group can create an ∞DL Data Producing Node (DPN) using ∞DL Client software. Any number of DPNs can be created; together they constitute the first tier. The ∞DL Authority constitutes the second tier and provides mapping services that dispatch globally generated data requests to individual DPNs. To make its data globally visible and accessible, a DPN has to register with an ∞DL Authority, but registration is not necessary for day-to-day data archiving and access operations within a DPN.
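The validation, sealing, and integrity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ∞DL API: the attribute names in MANDATORY_FIELDS, the "payload" key, the JSON canonical form, and the function names seal and verify_seal are all hypothetical, since the text does not specify which features are "essential" or how they are serialized.

```python
import hashlib
import hmac
import json

# Assumed context attributes; the text names the kinds of context
# (producer, equipment, subjects, date/time, place, process) but no schema.
MANDATORY_FIELDS = ("producer", "equipment", "subjects",
                    "timestamp", "place", "process")


def seal(data_object: dict) -> str:
    """Validate mandatory attributes and return a SHA-256 hex digest as the ID."""
    required = MANDATORY_FIELDS + ("payload",)
    missing = [name for name in required if name not in data_object]
    if missing:
        raise ValueError(f"missing mandatory attributes: {missing}")

    essential = {name: data_object[name] for name in MANDATORY_FIELDS}
    # Hash the raw bytes separately so the canonical form stays small.
    essential["payload_sha256"] = hashlib.sha256(data_object["payload"]).hexdigest()

    # Assumed canonical form: sorted keys, no whitespace, UTF-8.
    canonical = json.dumps(essential, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_seal(data_object: dict, object_id: str) -> bool:
    """Recompute the hash and compare it with the stored ID."""
    return hmac.compare_digest(seal(data_object), object_id)
```

Because the ID is derived from the content and its context, any change to the payload or to a context attribute yields a different digest, which is what makes the ID usable as an integrity check.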

Implementation:

∞DL Client software uses established Cloud storage services, e.g., Google Drive or Amazon Web Services, mostly as lightweight Cloud file servers. Business logic for data submission and access is implemented by the ∞DL Client software. The recommended configuration maintains a fully redundant copy of all data locally (storage is cheap), which enables an ∞DL Client to work autonomously for any amount of time. Data access requests are managed by the ∞DL Client software and produce Views of the data objects. The Views enable users to extract subranges of viewed data and to export them to “foreign” formats. ∞DL also provides file reader and writer APIs, which can be used to process and analyze data directly from native ∞DL data files, without exporting to “foreign” formats.

Each DPN maintains a DPN Ledger, which registers all events of data object submission, replacement, moving, or “trashing” (which is decoupled from actual permanent deletion) as a sequence of blockchained records. A DPN Ledger can only grow with time, and corrupting it becomes more difficult with the addition of every record. The DPN Ledger, together with data object sealing and other DPN features, supports the integrity of the entire history of DPN evolution. Modification of an object is easily discoverable. Deliberate corruption of an entire DPN is virtually impossible, or at least would require a level of user sophistication out of proportion to the perceived rewards of such fraud.
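The idea of a ledger of blockchained records can be illustrated with a minimal hash chain, in which every record stores the hash of its predecessor. The sketch below is an assumption about how such a ledger might look, not the actual DPN Ledger format: the class name Ledger, the record fields, the all-zero genesis value, and the use of SHA-256 over JSON are hypothetical. Only the four event kinds are taken from the text.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record
EVENTS = {"submit", "replace", "move", "trash"}  # event kinds named in the text
BODY_KEYS = ("prev", "event", "object_id", "timestamp")


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON form of the record body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Ledger:
    """Append-only list of records, each chained to the previous one."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def append(self, event: str, object_id: str, timestamp: str) -> str:
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        prev = self.records[-1]["hash"] if self.records else GENESIS
        body = {"prev": prev, "event": event,
                "object_id": object_id, "timestamp": timestamp}
        record = dict(body, hash=_digest(body))
        self.records.append(record)
        return record["hash"]

    def is_intact(self) -> bool:
        """Recompute every hash and check each link to its predecessor."""
        prev = GENESIS
        for record in self.records:
            body = {key: record[key] for key in BODY_KEYS}
            if record["prev"] != prev or record["hash"] != _digest(body):
                return False
            prev = record["hash"]
        return True
```

Altering any earlier record changes its hash and breaks the link stored in every later record, so a forger would have to recompute the whole tail of the chain. This is one way to read the claim that corruption becomes harder with every added record.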