GSFRS: Giant Signal File Random Sampler

Prasanta Pal; Judson Brewer

doi:10.36227/techrxiv.15000945.v2

Abstract

In the modern world, it is hard to imagine a day without some form of interaction with digital data. Real-world data originating from signal-generating transducers or communication channels are often recorded as streams of data samples separated by time stamps, sample counters, or simply a record delimiter such as a newline (\n) or a comma (,). Sampling is the basis of statistical estimation from any data source containing signal records, and random sampling has been practiced since time immemorial. However, with rapidly scaling data generation processes working in tandem with computing infrastructures, the volume of data is becoming unmanageably large in nearly every discipline of science. On the other hand, mere volume of data is of no consequence if we cannot extract effective intelligence from it on an “on demand” basis. Of particular interest is the case where data is stored in a file as records separated by a newline (or any other delimiter) character. When the number of records in the file exceeds a certain threshold, random sampling becomes a formidable task. Loading the entire file into computer memory is often impractical; even where it is theoretically possible, the time it takes to load the data in its entirety from natural data sources can be overwhelmingly long, and is often unnecessary. We can strategically bypass these problems by carefully designing a data interface tool such that any part of a given file can be accessed immediately for random sampling or other processing tasks by loading only the necessary parts of the data. With this goal, we created a novel, portable, and highly efficient rapid data access tool named GSFRS: Giant Signal File Random Sampler, written in modern C++, to enable near real-time access to any part of an arbitrarily large data file, with an access time that is almost independent of the file size in all practical scenarios.
Also, big-data processing would become relatively commonplace and cost-effective even on commodity hardware once the indices are made available through the GSFRS indexing protocol. This capability could potentially revolutionize the way we gather intelligence from files containing large numbers of samples. With adoption of GSFRS at the source level of various data generators, the processing times and energy footprints of various computations could be dramatically reduced.
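
To make the idea of index-based access concrete, the following is a minimal C++ sketch of one way such a scheme could work. It is not the GSFRS implementation or its API. It assumes newline-delimited records, and the function names (build_line_index, read_record, sample_records) are hypothetical, introduced only for illustration. The file is scanned once to record the byte offset at which each record starts. After that, any record can be fetched with a single seek, so only the requested records are loaded into memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper (not the GSFRS API): scan the file once in fixed-size
// chunks and record the byte offset at which each newline-delimited record starts.
std::vector<std::uint64_t> build_line_index(const std::string& path) {
    std::vector<std::uint64_t> offsets;
    std::ifstream in(path, std::ios::binary);
    if (!in) return offsets;  // unreadable file: return an empty index

    std::vector<char> buf(1 << 16);
    std::uint64_t pos = 0;
    bool at_record_start = true;
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize n = in.gcount();
        for (std::streamsize k = 0; k < n; ++k) {
            if (at_record_start) {
                offsets.push_back(pos);
                at_record_start = false;
            }
            if (buf[k] == '\n') at_record_start = true;
            ++pos;
        }
    }
    return offsets;
}

// Fetch record i by seeking directly to its stored offset; the rest of the
// file is never read.
std::string read_record(std::ifstream& in,
                        const std::vector<std::uint64_t>& offsets,
                        std::size_t i) {
    assert(i < offsets.size());
    in.clear();  // reset EOF/fail state left by a previous read
    in.seekg(static_cast<std::streamoff>(offsets[i]));
    std::string record;
    std::getline(in, record);  // strips the trailing '\n' (a '\r' would remain)
    return record;
}

// Draw `count` records uniformly at random, with replacement.
std::vector<std::string> sample_records(const std::string& path,
                                        const std::vector<std::uint64_t>& offsets,
                                        std::size_t count,
                                        std::uint32_t seed) {
    std::vector<std::string> out;
    if (offsets.empty()) return out;
    std::ifstream in(path, std::ios::binary);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, offsets.size() - 1);
    for (std::size_t s = 0; s < count; ++s) {
        out.push_back(read_record(in, offsets, pick(rng)));
    }
    return out;
}
```

In this sketch the index costs eight bytes per record and is built in a single sequential pass. Each subsequent lookup costs one seek plus one record read, which is why access time depends mainly on record length and storage latency, not on total file size. Persisting the offsets to disk, so that the scan is not repeated, would be a natural extension but is not shown here.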