Prasanta Pal et al.

Knowingly or unknowingly, digital data is an integral part of our day-to-day lives; realistically, there is probably not a single day when we do not encounter some form of it. Data originates from diverse sources in various formats, and time series are a special kind of data that captures information about the time evolution of a system under observation. However, capturing this temporal information in the context of data analysis is a highly non-trivial challenge. The Discrete Fourier Transform (DFT) is one of the most widely used methods for capturing the essence of time-series data. While this nearly 200-year-old mathematical transform has survived the test of time, real-world data sources violate some of the intrinsic properties the DFT presumes of its input. Ad hoc noise and outliers fundamentally alter the true signature of the signal of interest, and as a result its frequency-domain representation gets corrupted as well. We demonstrate that applying traditional digital filters as-is often fails to reveal an accurate description of the pristine time-series characteristics of the system under study. In this work, we analyze the issues the DFT faces with real-world data and propose a method to address them by taking advantage of insights from modern data-science techniques, in particular our previous work SOCKS. Our results reveal that a dramatic improvement is possible by re-imagining the DFT in the context of real-world data with appropriate curation protocols. We argue that our proposed transformation, DFT21, would revolutionize the digital world in terms of accuracy, reliability, and information retrievability from raw data.
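The corruption mechanism described above is easy to reproduce. The sketch below is a hypothetical, self-contained C++ example, not the DFT21 or SOCKS protocol itself: it injects a single outlier into a pure tone, applies a crude local-median replacement as a stand-in for a curation step, and compares the DFT magnitudes before and after, showing how one bad sample leaks energy across the entire spectrum.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdio>
#include <vector>

const double PI = std::acos(-1.0);

// Naive O(N^2) DFT, adequate for a short illustrative signal.
std::vector<std::complex<double>> dft(const std::vector<double>& x) {
    const std::size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2.0 * PI * static_cast<double>(k * n) / N);
    return X;
}

// Illustrative curation step: replace any sample that deviates strongly from
// its local median. This is NOT the DFT21/SOCKS protocol, only a stand-in to
// show why curating the series before the transform matters.
std::vector<double> curate(const std::vector<double>& x, int w = 2, double thresh = 3.0) {
    std::vector<double> y = x;
    for (std::size_t i = 0; i < x.size(); ++i) {
        std::vector<double> win;
        for (int j = -w; j <= w; ++j) {
            long idx = static_cast<long>(i) + j;
            if (idx >= 0 && idx < static_cast<long>(x.size())) win.push_back(x[idx]);
        }
        std::nth_element(win.begin(), win.begin() + win.size() / 2, win.end());
        double med = win[win.size() / 2];
        if (std::fabs(x[i] - med) > thresh) y[i] = med;  // outlier: replace with local median
    }
    return y;
}

int main() {
    const std::size_t N = 64;
    std::vector<double> x(N);
    for (std::size_t n = 0; n < N; ++n)
        x[n] = std::sin(2.0 * PI * 5.0 * n / N);  // clean 5-cycle tone
    x[20] += 25.0;                                // one ad hoc outlier

    auto Xraw = dft(x);          // spectrum of the corrupted series
    auto Xcur = dft(curate(x));  // spectrum after the crude curation step

    // The single outlier leaks energy into every bin; after curation the
    // spectrum again shows one dominant peak at k = 5.
    for (std::size_t k = 0; k <= N / 2; ++k)
        std::printf("k=%2zu  |X_raw|=%8.3f  |X_curated|=%8.3f\n",
                    k, std::abs(Xraw[k]), std::abs(Xcur[k]));
    return 0;
}
```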

Prasanta Pal et al.

In the modern world, it is hard to imagine a day without some form of interaction with digital data. Real-world data originating from signal-generating transducers or communication channels is often recorded as streams of data samples separated by time stamps, sample counters, or simply a record delimiter, e.g., newline (\n) or comma (,). Sampling is the basis of statistical estimation from any data source containing signal records, and random sampling has been in practice since time immemorial. However, with data-generation processes working in tandem with rapidly scaling computing infrastructures, the volume of data is becoming unmanageably large in nearly every discipline of science. On the other hand, the mere volume of data is of no consequence if we cannot extract effective intelligence from it on demand. Of particular interest is the case where data is stored in a file as records separated by a newline (or any other delimiter) character. When the number of records in the file exceeds a certain threshold, random sampling becomes a formidable task: it is often impractical to load the entire file into memory, and even when theoretically possible, the time it takes to load the data in its entirety from natural data sources can be overwhelmingly long and often unnecessary. We can strategically bypass these problems by carefully designing a data-interface tool such that any part of a given file can be accessed instantly for random sampling or other processing tasks, loading only the necessary parts of the data. With this goal, we created a novel, portable, and highly efficient rapid data-access tool named GSFRS (Giant Signal File Random Sampler), written in modern C++, which enables near real-time access to any part of an arbitrarily large data file, almost independently of the file size for all practical scenarios. Once the indices are made available through its indexing protocol, big-data processing becomes relatively commonplace and cost-effective even on commodity hardware. This capability could revolutionize the way we gather intelligence from files containing large numbers of records. By adopting GSFRS at the source level of various data generators, the processing times and energy footprints of many computations can be dramatically reduced.
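To make the indexing idea concrete, the following sketch is a minimal C++ illustration under assumed names (the input file "giant_signal.csv" is hypothetical), not the actual GSFRS implementation: it records the byte offset of every record boundary in one pass, then serves uniformly random records by seeking directly to their offsets, so the per-record access cost does not grow with the file size once the index exists.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// One pass over the file records the byte offset at which every record starts.
// In a tool like GSFRS this index would be persisted so it is built only once
// per file; here it is kept in memory for brevity.
std::vector<std::uint64_t> build_index(const std::string& path, char delim = '\n') {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::uint64_t> offsets{0};       // first record starts at byte 0
    char c;
    std::uint64_t pos = 0;
    while (in.get(c)) {
        ++pos;
        if (c == delim) offsets.push_back(pos);  // next record starts after the delimiter
    }
    if (offsets.size() > 1 && offsets.back() == pos)
        offsets.pop_back();                      // no record after a trailing delimiter
    return offsets;
}

// Fetch a single record by seeking directly to its stored offset, so the cost
// is independent of where the record sits in the file.
std::string read_record(const std::string& path,
                        const std::vector<std::uint64_t>& offsets,
                        std::size_t i, char delim = '\n') {
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(offsets[i]));
    std::string rec;
    std::getline(in, rec, delim);
    return rec;
}

int main() {
    const std::string path = "giant_signal.csv";  // hypothetical input file
    auto offsets = build_index(path);
    if (offsets.empty()) return 1;

    // Draw 10 records uniformly at random without touching the rest of the file.
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, offsets.size() - 1);
    for (int k = 0; k < 10; ++k) {
        std::string rec = read_record(path, offsets, pick(rng));
        // ... feed `rec` to the downstream estimator ...
    }
    return 0;
}
```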