Research Big Data Streaming and Reproducible Data Processing

Embracing the Next Generation of Data Processing Technologies

Evidence through code containerization

In R&D, reproducing a complex data analysis process is often cumbersome and might even become impossible.

In some cases, just a couple of weeks after an analysis has been completed, a software library update can break an entire workflow, making the analysis irreproducible.

It usually gets worse as time passes.

We fix this problem by containerizing workers, which we then distribute across clusters of computers.
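A minimal sketch of what such a frozen worker environment can look like (the base image, library versions, and script name are illustrative assumptions, not a description of our actual stack):

```dockerfile
# Pin the base image down to the patch version so the runtime never drifts.
FROM python:3.11-slim

# Pin every analysis dependency to an exact version: rebuilding this image
# months later yields the same libraries, so the workflow stays reproducible.
RUN pip install --no-cache-dir pandas==2.2.2 numpy==1.26.4

# The analysis code itself ships inside the image alongside its dependencies.
COPY analysis.py /app/analysis.py
WORKDIR /app
CMD ["python", "analysis.py"]
```

Because the image bundles code and dependencies together, identical copies of the worker can be scheduled on any node of a cluster.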

Ultimately, information about the data processing and evidence of the results can be recorded in a distributed ledger.
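To illustrate the idea, here is a minimal sketch of such evidence records, assuming a hash-chained append-only log as a stand-in for a real distributed ledger (all names and metadata fields are hypothetical):

```python
import hashlib
import json


def evidence_record(result_bytes: bytes, metadata: dict) -> dict:
    """Bind a result's fingerprint to metadata about how it was produced."""
    return {
        "result_sha256": hashlib.sha256(result_bytes).hexdigest(),
        "metadata": metadata,
    }


class AppendOnlyLedger:
    """Minimal hash-chained log: each entry commits to the previous one,
    so tampering with any past record breaks verification."""

    def __init__(self):
        self.entries = []

    def _hash(self, prev: str, record: dict) -> str:
        payload = json.dumps({"prev": prev, "record": record}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["entry_hash"] if self.entries else "0" * 64
        self.entries.append(
            {"prev": prev, "record": record, "entry_hash": self._hash(prev, record)}
        )

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            if e["prev"] != prev or self._hash(prev, e["record"]) != e["entry_hash"]:
                return False
            prev = e["entry_hash"]
        return True


ledger = AppendOnlyLedger()
ledger.append(evidence_record(b"normalized table", {"worker_image": "worker:1.0", "step": "normalize"}))
ledger.append(evidence_record(b"final report", {"worker_image": "worker:1.0", "step": "aggregate"}))
print(ledger.verify())  # True while the chain is intact
```

A real distributed ledger additionally replicates these entries across independent parties; the hash chain above only captures the tamper-evidence aspect.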

Scientific data as streams

In most industries, R&D data come in constant streams from inside and outside of an organization.

Most current software tries to stop these data streams and lock them into silos or old-fashioned databases from which results must then be extracted.

We turn the problem upside down: we build systems that connect directly to data streams.

They don't intervene in the streams; they simply route them to processing engines.

When reproducibility is important, streamed data can be routed to our own data processing systems.
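The routing idea can be sketched as follows, assuming records are simple dictionaries and "engines" are any callables; the predicates and engine names are hypothetical:

```python
from typing import Callable, Iterable, Iterator

Record = dict
Route = tuple[Callable[[Record], bool], Callable[[Record], None]]


def route(stream: Iterable[Record], routes: list[Route]) -> Iterator[Record]:
    """Yield every record unchanged; copies of matching records go to engines.

    The router never modifies or drops a record: engines receive copies,
    so the original stream flows through intact.
    """
    for record in stream:
        for matches, engine in routes:
            if matches(record):
                engine(dict(record))  # engine gets a copy, not the original
        yield record


# Hypothetical engines: a quality-control engine and a reproducible
# processing engine that only receives records flagged for it.
qc_inbox, repro_inbox = [], []
routes = [
    (lambda r: r.get("kind") == "assay", qc_inbox.append),
    (lambda r: r.get("reproducible", False), repro_inbox.append),
]

stream = [
    {"kind": "assay", "value": 1.2, "reproducible": True},
    {"kind": "log", "msg": "ok"},
]
passed_through = list(route(stream, routes))
print(len(passed_through), len(qc_inbox), len(repro_inbox))  # 2 1 1
```

The design choice here is that routing is side-effect-only: downstream consumers of the stream are unaffected by how many engines tap into it.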