Statistical Outlier Curation Kernel Software (SOCKS): A Modern, Efficient Outlier Detection and Curation Suite

Real-world signal acquisition through sensors is at the heart of the modern digital revolution. However, almost every signal acquisition system is contaminated with noise and outliers. Precise detection and curation of data is an essential step toward revealing the true nature of the uncorrupted observations. With the exploding volume of digital data sources, there is a critical need for an outlier detection and curation tool that is robust yet easy to operate, low-latency, generic yet highly customizable, easily accessible, and adaptable to diverse types of data sources. Existing methods often boil down to data smoothing, which inherently causes valuable information loss. We have developed a C++-based software tool to decontaminate time-series and matrix-like data sources, with the goal of recovering the ground truth. The SOCKS tool will be made available as open-source software for broader adoption in the scientific community. Our work calls for a philosophical shift in the design of real-world data processing pipelines: we propose that raw data should be decontaminated first, through conditional flagging of outliers and curation of the flagged points, followed by iterative, parametrically tuned, asymptotic convergence to the ground truth as accurately as possible, before traditional data processing tasks are performed.


Introduction
In his book "21 Lessons for the 21st Century," Yuval Noah Harari stated, "In a world deluged by irrelevant information, clarity is power" [1]. The data-driven world of science, technology and medicine is so deluged by countless engines of digital data collection that clarity is often profoundly blurred by the cloud of over-information. The pursuit of information clarity is now of far more importance than its mere volume. Clarity often comes through curation of the source-level raw input data, whether read from a file or from a real-time stream. Data collected from natural systems are often contaminated by various artifacts, background noise and extremely deviant outliers [2]-[4] that obfuscate the clarity of the underlying, presumably pristine, nature of the observed system. Noise generally refers to a systematic perturbation on observed measurements, whereas outliers or artifacts are ad-hoc interruptions occurring at discrete time points, or cases that deviate significantly from a signal's central trends. Typically, outliers are defined as data points deviating by more than a set number of standard deviations (e.g., often 3.0 for a normal distribution) from a signal's central tendency measure (e.g., mean, median, mode). As a real-world example, the sound generated by ongoing traffic on an otherwise quiet street can be considered background noise, whereas honks at discrete time points would be considered outliers with respect to the otherwise natural state of quietness. By contrast, the constant beeping of a cardiac pulse monitor, although punctuated in time, is part of the expected steady-state signal rather than an outlier. As a simple working definition, we consider steady-state signals over a well-defined period of time as the baseline, and any significant deviation from the central tendency measure as an outlier. Outlier detection and curation is a centuries-old classic problem that pervades almost all disciplines of science, art and technology.
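The textbook rule above — flag any sample more than k standard deviations from the mean — can be written down in a few lines. This is a minimal illustration of the classical criterion, not part of SOCKS; the function name and interface are ours.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Flag samples deviating more than k standard deviations from the mean.
// Note the weakness SOCKS avoids: the mean and standard deviation are
// themselves inflated by the very outliers being sought.
std::vector<bool> flag_by_stddev(const std::vector<double>& x, double k) {
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= static_cast<double>(x.size());
    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    double sd = std::sqrt(var / static_cast<double>(x.size()));
    std::vector<bool> flags(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        flags[i] = std::fabs(x[i] - mean) > k * sd;
    return flags;
}
```

The sensitivity of the mean and standard deviation to extreme values is precisely why SOCKS favors the median and MAD described later.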
In art, outliers most often consist of outstanding features in sharp contrast to the backdrop of an artistic creation. In science and technology, outliers appear in various contexts of data analysis, machine learning, fault-tolerance testing and the design of robust dynamic systems, such as aerodynamic pressure fluctuations on airplanes. Depending on the exact scenario, outlier detection and curation has minimal to profound effects on the accuracy of real-world systems' analysis, modeling and predictability. For example, the sound of an airplane passing over a quiet wooded area, however annoying to the ears, is not of any critical significance. On the other hand, faulty pressure-sensor measurements feeding incorrect data to an automated airplane navigation system (MCAS) [5] led to multiple crashes of Boeing 737 MAX airplanes [6]-[8].
With the large-scale adoption of machine-learning (ML) driven data modeling tools, it is crucial to note that even the most advanced ML tools may completely fail at a fairly simple prediction if the training or test data sets are not perturbation free [9]. It also costs considerably more computational power to train on outlier-infused data sets than on their relatively pristine counterparts, due to the divergence of the convergence time to a stable model [10]-[13]. Ironically, a highly advanced ML model would also train on the noise and outliers as if they were valid data points. To harness the true strength of core ML algorithms, decontamination of the available data sets is of paramount importance [14]. In common real-world situations, outlier removal is often erroneously equated with data smoothing [15]-[18] using various filters (e.g., Gaussian, Wiener) or with fitting the data using generalized polynomial functions [19]-[22]. There are also often simplistic assumptions (e.g., Gaussian, Poisson) [23]-[25] about the nature of the noise in the data that are inaccurate in most realistic scenarios. The actual time scales and length scales of the noise underlying the data often do not conform to standardized models (e.g., Gaussian), which calls for additional data discovery on the fly during data processing operations. Existing techniques often make the data visually more appealing through noise reduction, but at the cost of information loss and blurred clarity. Fitting and smoothing techniques are meaningful only when the data is optimally free of unwanted extreme values. As an example, the modern healthcare system heavily relies on various medical-imaging techniques [26], [27]. In most realistic scenarios, however, medical images are plagued by various noise and artifact signals that require rigorous noise mitigation as part of the pre-processing steps [28]-[30] to make accurate inferences.
On the other hand, in real-time stream-based biomedical applications such as EKG [31], [32] pulse or EEG data-train monitors [33]-[35], through which clinical decisions are made in real-world settings while a patient is hooked up, it is critically important that the data stream be outlier free in real time to prevent inaccurate diagnostic outcomes on a moment-by-moment basis. Occasionally, data records get contaminated at discrete time points by fluctuations arising from various sources, e.g., extraneous background noise, sensor-sensor coupling, movement artifacts, thermal noise, line noise, as well as electrical interference from equipment operating on the same electrical network. Another example involves multi-dimensional time-series data (simultaneous recordings from multiple sensors) for which the sensors are physically constrained by the underlying geometry of the sensor coordinates (e.g., EEG electrode placement positions on the scalp's surface, Fig. 2). In this case, data contamination can result from coupling to neighboring channels' signals, and the extent of contamination can vary depending on the sensors' relative positions in physical space. One situation that arises in real-time neuro-biofeedback systems is when a participant undergoes mental training based on one's brain signal, measured through sensors strategically placed on the exposed surface of the subject's body. Often, the data stream passes through source-localization or similar diagnostic modules and haptic [36], [37] or audio-visual feedback [38], [39] modules. In such situations, an outlier signal may pass for the actual feedback signal unbeknownst to the subject: subjects would receive feedback derived from contaminated data and attempt to self-regulate using erroneous feedback measures. To make matters worse, the contaminated data is often orders of magnitude larger than the pristine data.
In training paradigms involving meditation or relaxation, a natural jaw clench, eye blink or swallow may mislead the subject into believing this to be the actual neural signal, or a state of higher awakening. Even if the source-localization algorithms are highly reliable, the presence of data contamination would induce the participant to train on contaminated information. Human subjective experience under normal conditions being harmoniously continuous in nature, such disruption would only worsen any perceived outcomes; it may also raise the ethical question of making someone believe in something that is incorrectly grounded in its perceived origin. On the other hand, even in the presence of an accurate outlier curation module, the temporal perception of the processed data may be entirely inaccurate with respect to the neural signal of interest if the curation process is not efficient enough to run in real time without processing delays. In high-sampling-rate (Ω), real-time, high-density sensor-driven applications, minimization or complete elimination of processing delay is therefore an important requirement to ensure that participants receive signals temporally synchronized to their real-time, physically and mentally grounded experience. In general practice, multi-dimensional time-series outlier curation is often reduced to single-channel problems in which each dimension is treated independently of the others. Although easy to implement from a computational point of view, such simplistic modeling is prone to inaccurate curation, because certain groups of sensors are often geometrically, neurologically, dynamically or otherwise correlated. Taking into account the geometrical constraints arising from the natural physical space makes the curation algorithm more reliable, thanks to the additional physically grounded constraints.
For example, in complex systems like high-density EEG, multiple sensors placed on the scalp are used to collect real-time high-density data. The relative geometrical constraints manifested by the natural shape of the human scalp should be taken into consideration to make the outlier curation process more robust and customized.
Since the nature of many bio-physical data sources, e.g., EEG and EKG, is physically rooted in both the spatial and temporal domains, we designed the software to address both of these orthogonal aspects of natural data sources. Existing outlier removal methods such as PCA [40]-[42] and ICA [43], [44], although well suited for offline tasks and targeted toward certain specialized kinds of data sets, have available implementations that are either not scalable or not efficient enough for adoption in real-time, multi-dimensional, multi-scale noise reduction tasks. For example, EEGLAB [45] is an excellent toolbox for certain niche EEG data processing tasks. However, its dependency on proprietary MATLAB software and its implementation in a scripted language make it best suited for offline tasks; it cannot be integrated into embedded devices like Fitbit [46] or Oura ring [47]. Moreover, there are often overly simplistic assumptions about the nature of the underlying data, e.g., that it varies linearly or sits under equilibrium conditions, which may have nothing to do with reality. Existing methods also often remove outlier values at the cost of modifying non-outlier values, which in turn leads to loss of information even for perfect data points. Such tools should also be portable enough to work on a wide range of devices, including mobile phones, high-performance computing clusters and micro-controllers, with minimal memory footprint. Finally, the world of open-source software is a major force behind the exponential growth of data science, ML and scientific computing. Our open-source code initiative aims to encourage a thriving developer and research community, in the spirit of PCL [48], Armadillo [49], CGAL [50], TensorFlow [51], etc. This initiative would open the gateway to better standardizing outlier detection and curation techniques in the context of bio-physical signal processing.
As a result, it would help answer universal questions like, "What is the standard threshold for reliable EEG (or EKG) signal fluctuations?" or "What is the correct data buffer size to reliably identify cardiac arrhythmia in real-time monitoring environments?" Although there are many outlier-detection standards available in generic signal processing [52]-[54], source localization [55], [56] and audio-visual processing [57]-[60] contexts, no such established standard practice exists for outlier detection in large-scale, data-driven, real-time and post-processing software systems under complex, non-trivial operating conditions. This work helps address those issues by creating a globally accessible platform dedicated to that mission.

I. Method
The primary focus of this work is to build a generic open-source software tool such that outlier detection and raw-data curation for a wide variety of data types are facilitated both offline and in real-time scenarios. [Figure: The SOCKS kernel curates the data in either the time domain or the spatial domain (or image), producing filtered curated data while separately storing the outlier data points.] There are two basic steps to achieve this goal: a) flagging (or detection) of the outliers, and b) curation of the flagged data points. During regular data acquisition, the recording is typically done at a specified sampling rate Ω; for EEG-like sensor arrays (e.g., a montage of sensors), representative values of Ω are 256, 512, 1024 or 2048 Hz. In our case studies involving time-series data (e.g., EEG), it is set to Ω = 2048 Hz.
The input data is read in a block-wise fashion and temporarily stored in a data container built on the deque [61], [62] data structure. For simplicity, a data block is represented as B, with N_B denoting the number of instances of B. The size of B is represented by S_B and is taken as an input parameter through a configuration file (footnote 4). In practical situations, the parameter values would typically be set from domain knowledge (e.g., the spatial or temporal scales) of the system or from the nature of the specific curation task.
S_B is effectively the interrogation time scale for B, over which the median and MAD parameters are calculated. Typically, an outlier curation routine is performed by choosing a value of S_B, and the curated output is obtained for that time scale. For simplicity and brevity we discuss only the temporal aspects of the operation; the spatial aspects follow a similar logic. A more sophisticated method is to perform outlier curation at various scales in sequential order; as a general recommendation, shorter-time-scale outliers should be removed first, followed by relatively longer ones. Often the underlying system is continuous in nature, and ad-hoc division of a continuous system into discrete blocks may introduce cosmetic discontinuities [63], [64] (similar to Gibbs ringing) in the trends of outcome measures across consecutive data blocks. To mitigate this kind of processing artifact, we introduce forward (W_f) and backward (W_b) overlap windows around each data block ("forward" means forward in time, or southward in the case of an image), the sizes of which are usually a percentage (~25%-50%) of S_B. The effective data-block length, including the overlap windows in the forward and backward directions, is therefore S_B + W_f + W_b. All blocks are uniform by design choice, except near the start and end of the data stream due to natural nuances at the data boundaries (e.g., the number of samples may not be an integer multiple of S_B).
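The effective block length described above — a core block extended by forward and backward overlap windows, clamped at the stream boundaries — can be sketched as follows. This is a minimal illustration with our own function and parameter names, not the SOCKS API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Effective block length = core block + forward and backward overlap
// windows, each a fixed fraction (typically 25%-50%) of the core size.
// Near the start and end of the stream the windows are clamped so they
// never reach past the available samples.
std::size_t effective_block_len(std::size_t core, double overlap_frac,
                                std::size_t block_start, std::size_t n_total) {
    std::size_t w = static_cast<std::size_t>(core * overlap_frac);
    std::size_t begin = (block_start >= w) ? block_start - w : 0;  // backward window
    std::size_t end   = std::min(block_start + core + w, n_total); // forward window
    return end - begin;
}
```

An interior block of 100 samples with 25% overlap spans 150 samples; the very first block only gets the forward window, spanning 125.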
Robustness is intrinsically built into the software by incorporating relatively robust central measures: the median and the median absolute deviation (MAD) [65]-[68]. These measures are minimally influenced by the presence of extreme outliers. We define the modified z-score of a value x, relative to a block's median m and MAD, as z(x) = |x - m| / MAD.
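The median, MAD and modified z-score above can be computed as in the following plain C++ sketch (illustrative helper names; SOCKS itself delegates such computations to optimized library routines):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a copy of the data (nth_element leaves the caller's data intact).
// Even-length convention: the upper middle element, kept simple for the sketch.
double median_of(std::vector<double> v) {
    std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    return v[mid];
}

// Median absolute deviation: median of |x_i - median(x)|.
double mad_of(const std::vector<double>& v) {
    double m = median_of(v);
    std::vector<double> dev(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) dev[i] = std::fabs(v[i] - m);
    return median_of(dev);
}

// Modified z-score of a single value relative to a block's median and MAD.
double modified_z(double x, double med, double mad) {
    return std::fabs(x - med) / mad;
}
```

Note how a single extreme value (e.g., 100 among small values) barely shifts the median and MAD, whereas it would drag both the mean and the standard deviation.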
In the flagging step, if for a given data point x the condition z(x) ≥ τ is satisfied, x (or its location) is flagged (labeled) as an outlier candidate and staged to be curated in the subsequent steps, provided any additional constraints specific to the curation are satisfied. Curation is not mandatory (often it is enough to obtain the outlier points), and the program may halt at the flagging stage if the user so chooses. Flagged points are set to NaN (not-a-number); then, using the non-NaN values, we create a linear-interpolation (or other-order) model to replace the NaN values, constructed from the data points in the neighborhood of the flagged points (the actual details depend on the dimensionality, the interpolation scheme, or even on keeping the NaN values). As a vanilla case, we take advantage of the Armadillo library's built-in interp1 function, which is optimized to fill in the flagged values from the un-flagged neighborhood data points; interp1 extrapolates values that fall outside the domain on which the model is built. Optionally, the default software settings, built in with static initializers, can be extended with additional outlier detection procedures based on other statistical measures, such as the arithmetic mean, harmonic mean or mode, or with custom criteria specific to the underlying data-curation task and its domain constraints. The kernel of the software consists of a multi-threaded data-I/O mechanism coupled with appropriate data type conversions, matrix transformations, flagging, curation and iteration processes. Efficient availability of data from the source is facilitated by the multi-threaded environment through lock-based, mutex-enabled synchronization mechanisms. The above steps can be performed recursively for a certain number of iterations (R). Once the targeted iterations are finished (or any other runtime condition is met), the flagging and curation steps may continue by piping the last curated output back in as input, potentially with a different parameter tuple, e.g., (τ, S_B, R).
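The replacement of flagged points by linear interpolation between the nearest clean neighbors can be written out explicitly. This is a stand-in for what Armadillo's interp1 does for SOCKS, shown in plain C++ for illustration; the function name is ours, and edge points are filled by nearest-neighbor hold rather than true extrapolation.

```cpp
#include <cassert>
#include <vector>

// Replace flagged samples by linearly interpolating between the nearest
// un-flagged neighbors on either side. Flagged points at the very edges
// hold the nearest clean value (a simplification of interp1's extrapolation).
void cure_flagged(std::vector<double>& x, const std::vector<bool>& flagged) {
    const long n = static_cast<long>(x.size());
    for (long i = 0; i < n; ++i) {
        if (!flagged[i]) continue;
        long l = i - 1;                         // nearest clean neighbor, left
        while (l >= 0 && flagged[l]) --l;
        long r = i + 1;                         // nearest clean neighbor, right
        while (r < n && flagged[r]) ++r;
        if (l >= 0 && r < n) {
            double t = double(i - l) / double(r - l);
            x[i] = x[l] + t * (x[r] - x[l]);    // linear interpolation
        } else if (l >= 0) {
            x[i] = x[l];                        // hold last clean value
        } else if (r < n) {
            x[i] = x[r];                        // hold next clean value
        }
    }
}
```

Note that only flagged samples are rewritten — un-flagged data points pass through untouched, which is the information-preservation property argued for in the introduction.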
We call this cascading with a parameter tuple. The software also allows multiple processing cascades (C) to be sequentially staged by cascading the flagging, curation and iteration steps in a serial fashion. Fig. 4 describes the steps for a given cascade level; the number of cascades c ∈ {1, 2, …, C} needed for a specific curation objective can be defined through the configuration file.
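The iterate-then-cascade control flow can be sketched as below. The struct and function names are hypothetical (the real parameters come from the SOCKS configuration file), and the per-step curation routine is passed in as a callback so the skeleton stays generic.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One cascade level = (threshold tau, block size, iteration count).
struct CascadeParams {
    double tau;
    std::size_t block_size;
    int iterations;
};

// Run the flag-and-cure step `iterations` times at each cascade level,
// then pipe the result into the next level with its own parameter tuple.
std::vector<double> run_cascades(
        std::vector<double> data,
        const std::vector<CascadeParams>& cascades,
        const std::function<std::vector<double>(std::vector<double>,
                                                const CascadeParams&)>& cure_once) {
    for (const auto& p : cascades)
        for (int it = 0; it < p.iterations; ++it)
            data = cure_once(std::move(data), p);   // output feeds next iteration
    return data;
}
```

With a trivial cure_once (e.g., clipping at tau), one can verify that the levels are applied strictly in order, each consuming the previous level's output.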
A. Pre/post-processing and data source structure
The ASCII input data is read in floating-point format.
The default data arrangement is presumed to be column-wise in the sensor space and row-wise in the temporal space: each row is an individual record of the uniformly sampled data. Traditionally, the data may contain a few meta-data header lines at the beginning; depending on the circumstances, the header information may be utilized for calibration or may be discarded. In principle, there may also be extraneous columns beyond those deemed relevant for processing, and these can be discarded by setting appropriate data boundaries through the configuration file. The curation is performed on the effective samples that are candidates for actual processing, a sub-matrix of the original input. The file or stream often comes with auxiliary channels (time stamps, serial numbers, etc.), and the indices of those channels are discarded. The zero time stamp starts at the first sample of the actual data after the header; if the input file contains header lines, those are skipped from processing (see footnote 4).
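The column-selection step described above can be sketched for a single record. This is an illustrative helper, not the SOCKS parser: the function name and column-bound parameters are ours, mimicking the data-boundary settings of the configuration file.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parse one whitespace- or comma-separated record into doubles, keeping
// only the columns inside [first_col, last_col] (0-based, inclusive).
// Header lines are assumed to be skipped by the caller.
std::vector<double> parse_record(const std::string& line,
                                 std::size_t first_col, std::size_t last_col) {
    std::string cleaned = line;
    for (char& c : cleaned) if (c == ',') c = ' ';  // treat commas as separators
    std::istringstream iss(cleaned);
    std::vector<double> out;
    double v;
    std::size_t col = 0;
    while (iss >> v) {
        if (col >= first_col && col <= last_col) out.push_back(v);
        ++col;
    }
    return out;
}
```

For a record whose first column is a time stamp and last column a trigger code, bounds of (1, n-2) would keep only the sensor columns.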

B. Design principles, data-io, transformations and dependencies
The underlying design principles are easy availability (open source), flexibility, scalability (can handle large data from diverse sources), adaptability (various input data formats, new filter implementations), low latency, and robustness (works even in the presence of extreme data values). The tool can be used either as a stand-alone tool for offline processing or as a plugin to real-time systems through file or streaming interfaces [69]. A highly efficient matrix-data processing mechanism, facilitated by the Armadillo linear algebra library, lies at the heart of the software; option and configuration handling rely on the cxx-opt and jsoncpp libraries. From the input raw data matrix (ℳ) it is trivial to create a reference sub-matrix containing only the absolutely necessary chunk of data to be curated by the SOCKS kernel; this sub-matrix is constructed from ℳ by taking into account the appropriate row and column index ranges. The input data is presumed to be arranged in a columnar manner, separated by standard field separators (e.g., space, comma), where each column represents time-series data from an individual sensor and each row is a single snapshot record of the measurement. As illustrated in Fig. 3, the operations on the sub-matrix are divided into two primary categories based on the operational domain of the data: a) time domain and b) spatial domain. Fig. 2 illustrates the spatial arrangement of EEG sensors, where the data points are generated from sensors placed on the head model of standard EEG data-acquisition systems. Due to the natural hemispherical geometry of the human head model, any individual sensor or group of sensors may produce outlier data, but it can be recovered in the spatial context by constructing a sensor proximity map from the inter-geodesic distances between sensors. In principle, inter-geodesic distances can be replaced by any other suitable distance measure, e.g., inter-sensor dynamic correlation or functional proximity (how functionally close two sensor locations are).
When spatial curation is under consideration, this map is optionally taken as an input from the configuration file (see footnote 4). For brevity, we discuss only the time-domain operations; the spatial-domain operations are done in an equivalent fashion, except for a few additional elements like the sensor proximity map, and are described in detail in the software documentation (footnote 12). Scalability of the processing load to the available resources is ensured by a multi-threaded processing framework, paired with block-by-block data reading, processing and asynchronous storage protocols. At the initialization stage of the software, a thread pool (T) of size N_T is allocated so that computation can be delegated to various hardware threads when necessary, while avoiding expensive thread creation and destruction operations; generally, N_T scales with the number of CPU cores of the host device. A run-time data-block container D, internally a deque data structure, is maintained as a container of data blocks throughout the lifetime of the data-processing task, acting as the interface between the data source and the data-processing module (flagging, curation and I/O operations) while reducing the data-loading time by keeping data blocks readily available in memory (reading from data storage is very expensive compared to processing time). Before the data-processing operation starts, D is initialized to a filled state of size ≥ 3, whereas the maximum size of D is determined by the memory footprint of the software as pre-defined in the configuration file (see footnote 4). The front element of D is popped for processing while an asynchronous request is made to read the next data block from the source and push (en-queue) it into the partially empty D. Before each push or pop operation, D is locked with a mutex. This way, the wait time to read data from the source is avoided, as processing happens simultaneously on a separate thread of the pool.
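A minimal sketch of such a mutex-guarded block container is shown below: a reader pushes blocks at the back while the processing side pops from the front. The struct and member names are ours for illustration; the real SOCKS container also couples this with a thread pool and asynchronous file reads.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Mutex-guarded deque of data blocks shared between a reader thread and a
// processing thread. finish() signals that no more blocks will arrive.
struct BlockQueue {
    std::deque<std::vector<double>> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::vector<double> block) {
        { std::lock_guard<std::mutex> lk(m); q.push_back(std::move(block)); }
        cv.notify_one();
    }
    void finish() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
    // Blocks until a data block is available; returns false once the
    // producer is done and the queue has been drained.
    bool pop(std::vector<double>& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty() || done; });
        if (q.empty()) return false;
        out = std::move(q.front());
        q.pop_front();
        return true;
    }
};
```

The condition variable lets the consumer sleep while the queue is empty instead of spinning, which matters for the low-latency goal.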
A schematic sequence describes the different states of the data-block container, from the initially empty D, through complete filling, to consecutive popping and pushing of blocks until the data source is exhausted. The output as well as the outlier instances are asynchronously stored through an output data sink interface that can optionally be piped to a data file or a lab-streaming-layer (LSL) streamer (footnote 17). Adaptability to variations in data format is achieved by first reading each line of the input data as a fresh record in the form of a string; the string being one of the most universal data types, any kind of ASCII data file can be read line by line. The parsing is then done by utilizing the data-domain boundaries that delimit the actual experimental content. As a concrete example, we used EEG data samples collected from Biosemi [70] systems, where the first column usually holds time-stamp information, the middle columns hold sensor data, and the last few columns may hold auxiliary or external trigger information. In most practical scenarios, only a subset of these data columns is relevant, and SOCKS parses it as per the data boundaries set through the config file. The low-latency and efficiency goals were achieved by using highly optimized, low-level, multi-threaded, compiled C++ with modern standards (-std ≥ 11), which intrinsically supports the minimization of expensive data copies through syntax introduced in standard 11. Through move semantics [71], [72], reference passing and perfect forwarding [62], the need for expensive data copying was partially if not completely eliminated, and every unique piece of data is guaranteed to reside in a single memory location.
It is to be noted that the introduction of move semantics and perfect forwarding, together with native language-level multi-threading, enables a very powerful interface for building efficient, close-to-the-metal, real-time modern software. Parallelized OpenMP loops [73], [74] were used in appropriate code blocks for additional efficiency. Additionally, the multi-purpose linear-algebra library Armadillo [49] has been heavily utilized to facilitate a diverse variety of complex yet efficient matrix and data transformation operations. One big advantage of the Armadillo library is that it provides a natural interface for forming sub-matrix views using references, such that operations on a sub-matrix do not require expensive data copies. This has a significant impact on the efficiency of the software, because the input is often imported from a file on the hard drive or from a live stream that, in its raw form, belongs to a higher-dimensional sensor space than the subspace of the matrix of interest. Without reference-based sub-matrix views, we would have to perform expensive and memory-intensive copy operations; instead, functional operations and transformations are performed on a reference-based sub-matrix structure of the imported matrix. For streaming data (see footnote 17 for details), we added an interface to stream the output to outgoing data-stream clients. In the current version of the software, we implemented only the outgoing data stream, identified by the name and type of the stream; these parameters can be trivially modified in the configuration file as needed. The incoming data stream can be easily implemented, but we skip it in the current version for the sake of simplicity. It is to be noted that output from the SOCKS kernel can be further processed internally using traditional auxiliary filters (e.g., Gaussian).
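The copy-elimination argument above can be made concrete in a few lines: moving a large buffer transfers ownership of its heap storage in O(1) rather than copying every element, so the data keeps residing at a single memory location. The function name below is ours, purely for illustration.

```cpp
#include <utility>
#include <vector>

// Moving a vector into a sink steals its storage: the new vector adopts the
// same heap allocation, so no element-by-element copy takes place.
std::vector<double> sink(std::vector<double>&& buf) {
    return std::move(buf);  // move-constructs the return value
}
```

The data pointer of the moved-to vector is the same pointer the source held, which is exactly the "single memory location" guarantee exploited throughout the SOCKS data path.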
For illustration purposes, we included an implementation of a half-Gaussian filter that can optionally be enabled or disabled through the appropriate flags (for details see footnote 12).
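A half-Gaussian (causal, one-sided) smoother of the kind mentioned above can be sketched as follows. The function name, kernel width and tap count are illustrative defaults of ours, not the SOCKS configuration values.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Causal (one-sided) Gaussian smoothing: each output sample is a weighted
// average of the current and past samples only, so no future data is
// required — which keeps the filter usable in streaming mode.
std::vector<double> half_gaussian(const std::vector<double>& x,
                                  double sigma, std::size_t taps) {
    std::vector<double> w(taps);
    for (std::size_t k = 0; k < taps; ++k)
        w[k] = std::exp(-0.5 * (k / sigma) * (k / sigma));  // half-kernel weights

    std::vector<double> y(x.size(), 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        double acc = 0.0, used = 0.0;
        for (std::size_t k = 0; k < taps && k <= i; ++k) {
            acc += w[k] * x[i - k];
            used += w[k];
        }
        y[i] = acc / used;  // renormalize near the left boundary
    }
    return y;
}
```

The one-sided support is the design choice that avoids the processing delay a symmetric kernel would introduce in real-time feedback settings.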

D. Software validation
1) Measured data: Validation is performed on data collected from high-density (number of sensors = 128), high-Ω EEG experiments using a Biosemi device, including controlled outlier events such as jaw clenches. For simplicity we show examples with jaw clenches, but the software can handle any such event. Fig. 5a shows the onset of a jaw-clench event followed by relaxation, and Fig. 5b highlights one such small segment for a particular sensor (sensor 4 of the Biosemi cap). Fig. 7 demonstrates the curated data as well as the raw data containing a jaw-clench event approximately in the middle of the graph. 2) Simulated data: We created a software validation pipeline to emulate contaminated time-series data. The synthetic data has been created such that the use cases are intuitive, reproducible and closely resemble the primary outlier features of real EEG data. Using this guideline, we performed validation on the simulated data as well as the measured data. Fig. 8 demonstrates one such segment of a synthetically generated data instance, created to validate the software in a controlled fashion.

II. Results
The results are divided into two major parts: a) synthetic outlier-infused data constructed in a controlled manner from pre-defined distribution functions (e.g., normal distribution), followed by the flagging and curation steps. Fig. 8 demonstrates the corrupted as well as the recovered data.
Since the data is simulated and the outlier locations are known in advance, we verified that the flagging step works with high accuracy as per user expectation, and that the curated data makes complete intuitive sense on careful observation (e.g., it converges toward the neighboring data points); b) experimental EEG data from human subjects obtained with a Biosemi EEG device (the exact details of the device are not essential here). Fig. 10 demonstrates that the variability of the curated data can be kept under parametric control by appropriate selection of τ. The variability of the output data contains important information about the system, and in practical scenarios the variability for a given objective should be set as a system requirement. Fig. 9 demonstrates results with an aggressive value of the threshold parameter, whereas Fig. 10 shows the same with a liberal value of τ. Fig. 11 is similar to Fig. 10 in parameter space, except that the curated data is additionally fitted with a 3rd-degree polynomial; it thus demonstrates the effect of fitting and smoothing on top of the curation steps. Fitting curated data with an appropriate-order polynomial preserves the local curvature of the data trends, which is often a clearer representation of the underlying system than the raw data itself, even after outlier curation.

Fig. 8: Recovery plot of a simulation-based corrupt time series. Filled squares (purple) represent simulated corrupt data and circles (green) represent recovered data. The horizontal axis represents time points as sample index.

Fig. 9: Comparison of raw time-series data (black) and curated data (red) for a jaw-clench sequence for a few representative values of S_B applied in sequence by piping the data from the output immediately preceding a given label (e.g., the input for b) is the output from a)); the raw data is used at the start of the processing. a) S_B = 4, b) S_B = 8, c) S_B = 16, d) S_B = 32, e) S_B = 64, f) S_B = 128, g) S_B = 256, h) S_B = 512. The capital-letter plots A-H each highlight a single instance of a jaw-clench event from the corresponding small-letter plots a-h. The threshold τ = 0.42 is kept at the same value for all plots a-h.

Fig. 10: Comparison of raw time-series data (black) and curated data (red) for a jaw-clench sequence for various values of S_B applied in sequence by piping the data from the output immediately preceding a given label (e.g., the input for b) is the output from a)); the raw data is used as the data source for a). a) S_B = 4, b) S_B = 8, c) S_B = 16, d) S_B = 32, e) S_B = 64, f) S_B = 128, g) S_B = 256, h) S_B = 512. The capital-letter plots each highlight a single instance of a jaw-clench event from the corresponding small-letter plot. The threshold τ = 3.0 is kept at the same value for all plots a-h.

III. Discussion, Limitations, Conclusion & Future Direction
We developed a low-latency, robust, open-source tool to curate large volumes of time-series data without any practical limitation on the size of the input data source 19 . After operating on the raw data, followed by the curated data, in an iterative and cascaded fashion, the output is stored separately in curated and outlier signal files for each iteration cycle within a given cascade. The primary results are demonstrated in sec. II in Figs. 8, 9, 10 & 11, highlighting various curation capabilities of the software. The details of the parameters are described in the figure captions. The key observation is that the curated output can be parametrically controlled, so that the tool can be adapted to a wide variety of real-world situations. The simulated data provides a basic pipeline for controlled validation of the software.

Fig. 11: Comparison of raw time-series data (black) and curated data (red) for a jaw-clench sequence for various values of # applied in sequence, piping the data from the output immediately preceding a given label (e.g., the input for b) is the output from a); raw input data is used as the data source for a)). a) # = 4, b) # = 8, c) # = 16, d) # = 32, e) # = 64, f) # = 128, g) # = 256, h) # = 512. The capital-letter panels highlight a single instance of a jaw-clench event. The threshold = 3.0 is kept at the same value for all plots a)-h). The data is fitted with a polynomial of degree 3 after the default curation step.

SOCKS is designed to be highly efficient, close to the metal,
flexible and robust enough to be utilized in various data-acquisition settings, even in the presence of unprecedented levels of extreme outlier values. The primary C++ base class SOCKS is structured in such a way that static variables and functions are designated for elements that preferentially need to be initialized along with class-object instantiation. To keep things simple, we put special emphasis on a single-depth class hierarchy. Also, class instances are not necessary for static functions, so those elements can be called through the class scope without an actual instantiated class object. Although the main context for building SOCKS was long, high-dimensional time-series data from EEG, its scope is far more expansive and can easily be extended to matrix- or image-processing problems as well. It can be safely adapted to any generic time-series scenario (real-time or post-processing) by adjusting the appropriate configuration parameters. New families of filters can be constructed using the flag, curate, iterate and cascade protocols together with the same built-in data-io facility. Appropriate usage of the multithreading facility ensures that all available hardware threads of the host device are optimally utilized: any parallelizable process is distributed over the independent cores of the microprocessor. Additionally, by utilizing labstreaminglayer (see footnote 17) in the streaming protocol, the process can be cloned to multiple hosts in parallel through the TCP-IP [77] network protocol (e.g., WIFI, ethernet etc.). For simplicity, we implemented streaming only for the curated output-stream, but the input-stream can be implemented in a similar fashion. One known limitation is the potential behavior of the software near sharp data boundaries (e.g., Heaviside-function-like edges in the data). Edge preservation between data feature boundaries is often an important criterion for data-processing outcomes.
We'll address this aspect in future versions of the software; in the currently published version, we assume the data sources to be free of sharp edges. Another known limitation is that, for image inputs, the software operates only on grayscale color planes: either the source itself is grayscale, or the color image is converted so that SOCKS can operate on each color plane independently. There are potentially many ways of addressing color image processing, which is beyond the scope of the current article. To conclude, in the currently published version of the software we have concretely demonstrated the median-based MAD-filter; additional variants (e.g., mean, mode etc.) would be developed as extensions in future versions. The closest match between the MAD-filter and popular image-processing filters is the median filter [68]. However, the median filter has some serious limitations, such as modifying even the un-corrupted data points. SOCKS is designed to be very generic in nature and can be used in diverse signal-processing scenarios. In a time-series segment, valuable information is often encoded in various orthogonal frequency bands, and for any kind of precise time-series analysis, bounds for the specific frequency band must be specified. The current version does not address frequency-band-specific implementations; although we are aware of their importance, they are beyond the scope of the current implementation goals. Future versions of the software would address custom bandpass-filter variants using the open-source DSPFilters 20 repository. The software can be further extended for cloud support as well as support for non-ASCII file types (e.g., binary files like .edf, .bdf formats). In practice, a segment containing many outliers in a time-series does not necessarily mean that the segment is not information-rich.
It may as well contain valuable intrinsic information (the sound of a honk) or effects of external influences on the steady-state behavior of the system (a sudden stop of traffic at a red signal). Part of the design goal of the software is to perform the flagging and curation steps sequentially: once the outliers are flagged, curation is optional and we can simply store the flagged data records without any further curation. Although there is some apparent similarity between outlier removal and FFT-based high-frequency removal procedures [78], [79], this similarity is only superficial. Any FFT-based algorithm implicitly assumes periodicity of the underlying data and is generally very sensitive to extreme outliers (e.g., the Gibbs ringing effect), whereas the SOCKS algorithm is grounded in on-the-fly feature discoverability and is almost agnostic to any traditional noise model. In future we'll add a separate repository of SOCKS-compatible filters to create a filter-bank with the most generic types of filters, addressing a wide variety of outlier scenarios (e.g., distributions of various categories). Since the source-code is written in C++ using a modern standard (-std ≥ 11), it is extremely flexible to extend and customize under various operating-system environments (portability is a guarantee provided by the language standard itself), including embedded systems, HPC, mobile OS, data servers as well as desktop applications. Optionally, the matrix-processing part of the computation can be delegated to GPUs [80]-[83] using frameworks such as OpenCL and CUDA [84]-[86] to further augment processing efficiency. Additional auxiliary filters (e.g., a half-Gaussian filter or any of the traditional filters) may optionally be applied to the curated data, either on-the-fly or after storage on hard drives.

20 https://github.com/vinniefalco/DSPFilters
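The flag-then-optionally-curate separation described above can be sketched as two independent functions. This is an illustrative sketch, not the SOCKS API: the function names, the global median/MAD flagging rule (a simplified upper median is used) and the left-neighbor replacement are assumptions made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Flagging step: return the indices of samples deviating from the global
// median by more than threshold * MAD, without modifying the data. The
// flagged records can simply be stored as-is; curation is a separate step.
std::vector<std::size_t> flagOutliers(const std::vector<double>& x,
                                      double threshold) {
    std::vector<double> tmp(x);
    std::sort(tmp.begin(), tmp.end());
    double med = tmp[tmp.size() / 2];          // simplified (upper) median
    for (double& v : tmp) v = std::fabs(v - med);
    std::sort(tmp.begin(), tmp.end());
    double mad = tmp[tmp.size() / 2];          // median absolute deviation
    std::vector<std::size_t> flags;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (mad > 0.0 && std::fabs(x[i] - med) > threshold * mad)
            flags.push_back(i);
    return flags;
}

// Optional curation step: replace only the flagged samples, here by the
// nearest neighbor to the left (a simple illustrative choice).
std::vector<double> curateFlagged(std::vector<double> x,
                                  const std::vector<std::size_t>& flags) {
    for (std::size_t i : flags)
        if (i > 0) x[i] = x[i - 1];
    return x;
}
```

Keeping the two steps separate means a caller can persist the flagged records and skip curation entirely, matching the optional-curation design described above.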
In future, a user-friendly GUI would be added, as well as connectivity to databases, a diverse variety of data streams [87], [88], direct interfaces with various devices, and integration with reactive libraries [89] for a more intuitive io-design. Although the development of the software was inspired by large-scale EEG recordings, the software is not limited to EEG data. The principles of time-series processing are applicable to any other commonly available time-series data (e.g., air-pressure fluctuations, audio signal processing, video processing, location tracking etc.) as long as the data is suitably transformed into ASCII time-series or matrices. Documentation of the software was generated using the Doxygen [90] documentation tool and is hosted at the project repository (see footnote 12). Finally, the open-source initiative would encourage a thriving developer and research community, and drive new innovations in the space of standardizing outlier detection and curation techniques, addressing questions such as: what is the standard threshold for EEG outliers to detect a certain category of neurological signals (e.g., alpha waves), or what is the definitive outlier value beyond which the pressure-sensor measurement of a moving aeroplane is no longer reliable? Although there are many standards available for signal-processing, source-localization, communications etc., there is no such standardization for outlier detection in large-scale data generation and storage systems. Building a forum towards the goal of standardizing outliers for various physical systems, and normalizing their scales under diverse operating conditions, would potentially be a revolutionary milestone towards enhancing the clarity of real-world data. Philosophically, outlier curation is deeply linked to our perception of the physical world.
Our visual and auditory senses are heavily dependent on visual and audio signal intensities. Our audio-visual perception of common objects (e.g., images, sounds) is often strongly influenced by the high-intensity spectrum of the signal. By curating those data points containing high-intensity signals, our perception would be relatively free from such cognitive biases. The potential applications of the SOCKS tool would span nearly all disciplines of data-driven systems with or without outliers (the definition of an outlier is often arbitrary). We strongly believe that SOCKS would revolutionize the way we look at real-world data.

Acknowledgement
The authors acknowledge logistical support from the Brown Mindfulness Center, Department of Epidemiology and School of Public Health, Brown University. We are thankful for financial support from Fetzer Memorial Trust Foundation. Part of this research was conducted using computational resources and services at the Center for Computation and Visualization, Brown University. PP personally acknowledges Megan Ranney, MD and Eric Loucks, PhD for helpful discussions and institutional support.

Prasanta Pal Prasanta Pal, PhD is an investigator of Epidemiology at the Brown Mindfulness Center, Brown School of Public Health. An applied physicist by training from the Yale University School of Engineering, USA, and IIT Kharagpur, India, Dr. Pal has been the chief director, designer and developer of several neuro-feedback, data-science and mobile-application technologies directed towards human health and wellbeing. He is closely involved with bio-medical engineering research through first-principle, minimal-assumption methods. Dr. Pal is deeply involved in building a suite of fundamental data-science and medical-imaging technologies to accelerate the field of data-driven diagnostics and healthcare interventions. Dr. Pal's work in medical technology is driven by the philosophy that mind and body work as one integrated system in a complementary fashion, and that holistic health outcomes are possible only when they are combined through modern technology interfaces. He developed several technologies, such as MindView and MindScope, to help enhance various dimensions of the human mind in a data-driven, evidence-based fashion. He serves on the editorial board of several biomedical engineering journals, has been awarded funding from NIH and NARSAD, and serves as a Review Editor for the journal Frontiers in Psychiatry.

Remko Van Lutterveld
Veronique Taylor Veronique Taylor completed her MSc. and obtained her PhD at University of Montreal in 2017 in cognitive neuroscience. The specific focus of her graduate work was investigating the neural and physiological bases of mindfulness and emotion regulation. She then held several academic positions, such as at Bishop's University (Psychology Department) where she was involved in teaching and conducted independent research. She currently works as a postdoctoral research associate at the Mindfulness Center at Brown University, investigating the neural bases of mindfulness as well as its relationship with reward learning to treat addiction-related behaviors. She has been awarded research funding from several sources, including the Mind and Life Institute and from several Canadian research funding agencies.
Nancy Quirós Nancy Quiros obtained her PhD in Atomic Physics at the University of Nevada Reno (UNR). At Weinstein Lab in UNR, she studied the equilibrium thermodynamic properties of the van der Waals molecule Titanium-Helium (TiHe) at temperatures of 1K. Then she obtained a MSc in Digital Currency at the University of Nicosia (UNIC) and investigated the exchanges of Fiat money and Cryptocurrency. Currently, she teaches Cryptocurrency and Blockchain Technology at Lead University, Costa Rica. Her interests are analyzing the impact of emergent technologies in society, philosophy of science, mind and consciousness.
Judson A. Brewer Jud Brewer MD PhD is the Director of Research and Innovation at the Mindfulness Center and associate professor in Behavioral and Social Sciences at the School of Public Health and Psychiatry at the Medical School at Brown University. He also is a research affiliate at MIT. A psychiatrist and internationally known expert in mindfulness training for addictions, Brewer has developed and tested novel mindfulness programs for behavior change, including both in-person and app-based treatments for smoking, emotional eating, and anxiety. He has also studied the underlying neural mechanisms of mindfulness using standard and real-time fMRI, and source-estimated EEG, and is currently translating these findings into clinical use. He is the author of The Craving Mind: from cigarettes to smartphones to love, why we get hooked and how we can break bad habits (New Haven: Yale University Press, 2017) and the New York Times best-seller, Unwinding Anxiety: New Science Shows How to Break the Cycles of Worry and Fear to Heal Your Mind (Avery/Penguin Random House, 2021). Follow him on twitter @judbrewer.

Appendix A Case studies
As stated in sec. III, SOCKS can be used in multiple data contexts beyond EEG time-series data, including image processing, where a raw image is transformed into a data matrix of grayscale pixel intensities. In the context of image processing, it is a very effective tool for performing both the flagging and curation steps through iteration and cascading, by choosing an appropriate kernel radius (the two-dimensional equivalent of # in the one-dimensional case) 22 and threshold values. The details of how it works are available in the software documentation and are beyond the scope of this article. We demonstrate the image-processing capabilities of SOCKS through some case studies; more diverse case studies are available on the SOCKS community website (see footnote 3) and future updates would be made available there.

A. Image processing
The basic algorithm described for time-series data is easily adapted to image-processing problems for filtering [67], [68], [91], [92] and segmentation tasks. All that is needed is to transform a raw image into a data matrix using the OpenCV-based imread function, followed by conversion to an armadillo matrix; the armadillo library has a very advanced data-io facility. 1) Recovery of the Vikram Lander crash site: The first case study concerns imagery of the Vikram Lander crash site [93] from NASA images. We also give an estimate of the underlying terrain, as demonstrated in Fig. 13. The impact of various threshold levels is tested: at relatively higher threshold values, a smoother underlying surface is expected, whereas at lower threshold values more feature details are preserved, although at the cost of a higher level of variation. We used the curation algorithm in a recursive fashion, such that the output from a previous iteration is fed into the next iteration. After a few iterations, the result is observed to converge to a stable one, with the number of outliers in each iteration step greatly reduced from those at the start of the iteration.

Fig. 13: Recovery of the Vikram Lander crash landing site and identification of potential fragments. a) Original image obtained by NASA at the Vikram crash site, b) using a moderate threshold, c) using a high threshold, d) using two separate threshold values (high followed by low) in sequence.

2) Segmentation of Multiple Sclerosis lesions: Lesions observed in a multiple-sclerosis patient's MRI images are one of the neural bio-markers of the progress and recovery of the disease. Finding the exact coordinates of the lesions is a very important [95] but challenging task. We applied the SOCKS kernel to recover the coordinates of the lesions, along with an estimate of the underlying anatomical features in the hypothetical absence of the lesions 23 . This way we get access to various layers of the segmented regions.
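As a rough sketch of the two-dimensional case, the same flag-and-curate step can be applied over a square neighborhood of kernel radius r around each pixel. To stay self-contained, the example below operates on a plain vector-of-vectors grayscale matrix rather than OpenCV/armadillo types; the names and the median/MAD rule are illustrative assumptions, not the SOCKS implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<double>>;  // grayscale pixel intensities

static double medianOf(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    std::size_t n = v.size();
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// One 2D flag-and-curate pass with kernel radius r: a pixel deviating from
// the median of its (2r+1) x (2r+1) neighborhood by more than
// threshold * MAD is replaced by that neighborhood median.
Image curateImage(const Image& img, std::size_t r, double threshold) {
    Image out(img);
    const std::size_t rows = img.size(), cols = img[0].size();
    for (std::size_t y = 0; y < rows; ++y) {
        for (std::size_t x = 0; x < cols; ++x) {
            std::vector<double> win;
            for (std::size_t v = (y > r ? y - r : 0);
                 v < std::min(rows, y + r + 1); ++v)
                for (std::size_t u = (x > r ? x - r : 0);
                     u < std::min(cols, x + r + 1); ++u)
                    win.push_back(img[v][u]);
            double med = medianOf(win);
            for (double& w : win) w = std::fabs(w - med);
            double mad = medianOf(win);
            if (mad > 0.0 && std::fabs(img[y][x] - med) > threshold * mad)
                out[y][x] = med;  // flagged pixel: curate toward local median
        }
    }
    return out;
}
```

Feeding the output back in as input reproduces the recursive iteration described above; the number of flagged pixels shrinks with each pass until the image stabilizes.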