Byte-Pair Encoding for Classifying Routine Clinical Electroencephalograms in Adults Over the Lifespan

Routine clinical EEG is a standard test used for the neurological evaluation of patients. A trained specialist interprets EEG recordings and classifies them into clinical categories. Given time demands and high inter-reader variability, there is an opportunity to facilitate the evaluation process by providing decision support tools that can classify EEG recordings automatically. Classifying clinical EEG is associated with several challenges: classification models are expected to be interpretable; EEGs vary in duration; and EEGs are recorded by multiple technicians operating various devices. Our study aimed to test and validate a framework for EEG classification that satisfies these requirements by transforming EEG into unstructured text. We considered a highly heterogeneous and extensive sample of routine clinical EEGs (n = 5785) from participants aged between 15 and 99 years. EEG scans were recorded at a public hospital, according to 10/20 electrode positioning with 20 electrodes. The proposed framework was based on symbolizing EEG signals and adapting a method previously proposed in natural language processing (NLP) to break symbols into words. Specifically, we symbolized the multichannel EEG time series and applied a byte-pair encoding (BPE) algorithm to extract a dictionary of the most frequent patterns (tokens) reflecting the variability of EEG waveforms. To demonstrate the performance of our framework, we used the newly reconstructed EEG features to predict patients' biological age with a Random Forest regression model. This age prediction model achieved a mean absolute error of 15.7 years. We also correlated tokens' occurrence frequencies with age. The highest correlations between token frequencies and age were observed at frontal and occipital EEG channels. Our findings demonstrate the feasibility of applying an NLP-based approach to classifying routine clinical EEG. Notably, the proposed algorithm could be instrumental in classifying clinical EEG with minimal preprocessing and in identifying clinically relevant short events, such as epileptic spikes.


I. INTRODUCTION
EEG is a neurophysiological test that records the brain's electrical activity by measuring time-varying electrical potential differences between pairs of electrodes. EEG is capable of capturing brain activity in different mental states: wakefulness and sleep [1], eyes-closed and eyes-open, alertness or resting state [2], and emotional arousal [3], to name a few. In clinical practice, EEG is used to help diagnose several clinical conditions and symptoms. EEG is affordable and captured in a standardized fashion, making it widely available in hospitals throughout the world. The conventional approach to clinical EEG evaluation consists of a visual analysis, by a highly trained expert, of data presented on a computer screen. The expert has to describe clinically relevant waveforms and differentiate EEG records into broad categories, i.e., normal or abnormal, and if abnormal, epileptiform or non-epileptiform, according to the American Clinical Neurophysiology Society guidelines [4]. Such an approach to EEG evaluation has several challenges. Numerous features in the data are distributed over several channels, and these features vary over time. Evaluating EEG patterns may be affected by personal and institutional biases, which results in high inter-interpreter variability [5]. For example, certain features of sleep EEG [6], common normal variants [7], or EEG artifacts [8] can be misinterpreted as pathological discharges.
The problem of automatic classification of EEG signals using machine learning techniques has been increasingly studied in cognitive and clinical neuroscience. Potential applications include emotion recognition [9], motor imagery tasks [10], seizure detection [11], brain injury [12], Alzheimer's classification [13],
depression [14], gender classification [15], and detection of abnormal EEG [16], to name a few. From a methodological perspective, several approaches exist for feeding EEG data into machine learning models. Studies on classifying EEG signals can be loosely divided into three categories based on how EEG is organized as an input for prediction models: time series, images, and a vector of extracted features [17]. First, the time series approach preserves the original dynamics recorded with EEG. In this case, the input is highly dimensional, which can be suitable for deep learning, but not for classical (non-neural) approaches. Second, feature extraction is often used to reduce the dimensionality of EEG signals. Extracted features can be of various natures. Often these features are defined in the frequency domain, e.g., spectral power; in the time-frequency domain, e.g., spectrograms; or in the temporal domain, wherein various linear and nonlinear features are computed (extracted), e.g., signal complexity measures [18]. Feature extraction often requires advanced EEG preprocessing, as artifacts may bias the estimation of EEG features. Feature extraction can be naturally combined with deep neural networks or classical machine learning models. The features are often organized as a vector without a clearly defined structure. Third, raw EEG or EEG features can be organized as images. This common approach allows one to adopt methods that have demonstrated high performance in classifying images with deep learning, e.g., convolutional neural networks (CNN) [19].
Classification of clinical EEG is expected to have extra challenges compared to EEG recordings in a controlled laboratory setting. Clinical imaging environments are highly heterogeneous, often involving multiple EEG systems operated by multiple EEG technicians who may follow different guidelines for EEG recordings. In particular, those guidelines do not necessarily impose strict standards for selecting and locating the reference and ground electrodes. The abundance of physiological artifacts and their variability are expected to be higher in patients compared to participants tested in a laboratory. Notably, the duration of clinical EEG recordings varies significantly.
The heterogeneous nature of clinical EEG can potentially be addressed with tools developed to analyze and interpret unstructured data such as text. Natural Language Processing (NLP) is a field defined at the intersection of linguistics and computer science. NLP works with human language and analyzes large amounts of symbolic data. A large number of NLP studies have delivered models of high performance, e.g., in modern machine translation and in generating human-like text, such as GPT-3 [20]. The common ground of NLP methods is based on breaking unstructured text into repeating parts, such as characters, words, parts of words, or groups of words, so-called tokens. Subsequently, these tokens can be used as new features. In particular, tokens' occurrence has been used as features for various text classification tasks, e.g., predicting whether a social media post expresses positive or negative feelings [21].
Recently, there have been attempts to adapt NLP tools for time series classification. One study exploited the analogy between NLP text patterns and signal patterns for anomaly detection in multivariate time series arising from telemetry streams [22]. Another study classified the dynamics of heart rate and daily step count data from wearable devices to predict participants' personality traits [23]. In particular, the authors performed several steps in their analysis: (1) converting the original time series into a symbolic (character) series, (2) defining repeating parts (tokens) in the newly reconstructed symbolic series, and finally, (3) using tokens' occurrence frequencies to train their machine learning models. Some elements of such an approach can be traced to EEG studies, which characterized EEG signals in terms of symbolic complexity measures [24], [25], [26].
We hypothesized that NLP algorithms designed for analyzing semi-structured or unstructured data (texts) can be incorporated into an EEG preprocessing workflow with subsequent EEG classification. In our study, we aimed to apply, test, and validate an NLP-based pipeline previously developed for time series classification [23]. We analyzed a large sample of routine clinical EEG recorded in a highly heterogeneous cohort of patients between 15 and 99 years old. Having applied a minimalistic preprocessing pipeline, we converted EEG signals into strings of symbols. We then applied an algorithm known as byte-pair encoding to split the newly reconstructed text into the most frequent combinations of symbols by iteratively counting the appearances of unique pairs of symbols and merging the most frequent pairs into new complex symbols (tokens). To demonstrate the performance of our approach, we tested to what degree the most frequent tokens, associated with specific patterns of changes in EEG amplitude, could predict patients' biological age.

II. METHODS

A. Dataset Description
We analyzed routine clinical EEG recorded and evaluated in the process of neurological assessment of patients in a hospital in the greater Vancouver area within Fraser Health Authority. The ethics protocol was approved by the Research Ethics Boards at Simon Fraser University and Fraser Health Authority, April 1, 2022, protocol number H18-02728. The original sample included virtually all EEG studies (n = 7048) recorded between 2012 and 2018. The duration of recordings varied from 10 minutes to several hours (mean duration ∼35 minutes). The pool of patients included both in-patients and out-patients. The patients' age range was between 15 and 99 years. The hardware and firmware were identical across all the EEG stations, each equipped with a Natus Xltek EEG32U EEG amplifier and gold-cup electrodes. The EEG montage was kept uniform: 10/20 system positioning, 20 standard EEG electrodes (FP1, FPZ, FP2, F3, F4, F7, F8, FZ, T3, T4, T5, T6, C3, C4, CZ, P3, P4, PZ, O1, O2), two electrooculography (EOG), and two electrocardiographic (ECG) electrodes. The location of the reference and ground electrodes was unknown. The sampling frequency was either 500 Hz or 512 Hz.

B. EEG Preprocessing
To test and validate a workflow that could be universal across the highly variable environments of public hospitals, we applied a minimal set of EEG preprocessing procedures. First, EEG data were converted from the Natus proprietary format into the EDF format with Natus's Neuroworks. If EEG recordings in an original EEG study were turned off and on, potentially several times, the recorded EEG segments were linked with digital zeros in the resulting continuous EDF file. EEGs were then de-identified with the PyEDFlib Python toolbox [27]. Each EEG recording was filtered between 0.5 and 55 Hz to include canonical delta, theta, alpha, beta, and lower gamma oscillations. We applied an overlap-add finite impulse response (FIR) filter, as implemented in the MNE Python toolbox [28]. Note that with such a filter, we avoided the powerline frequency of 60 Hz and its higher harmonics. EEGs were then resampled at 500 Hz when the original sampling frequency was not 500 Hz. Separately for each EEG channel, we removed a possible linear trend. In each EEG scan, we identified the time intervals corresponding to the flat signal (digital zeros), hyperventilation, and photic stimulation procedures, if any. Avoiding these time intervals, we aimed to randomly select one 10-minute resting-state EEG segment from each of the original EEG recordings, which failed in some cases. These cases were discarded from further analysis. The final sample included n = 5785 EEG segments, each associated with one original EEG scan.
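As an illustration, this minimal preprocessing chain can be sketched with MNE as follows; the file name is hypothetical, and any settings not stated in the text (e.g., the default FIR design) are assumptions.

```python
import mne
from scipy.signal import detrend

# Hypothetical de-identified EDF file exported from Neuroworks
raw = mne.io.read_raw_edf("eeg_study.edf", preload=True)

# Band-pass 0.5-55 Hz with an overlap-add FIR filter; the 55 Hz cut-off
# stays below the 60 Hz powerline frequency and its harmonics
raw.filter(l_freq=0.5, h_freq=55.0, method="fir")

# Resample to 500 Hz when the original rate was 512 Hz
if raw.info["sfreq"] != 500:
    raw.resample(500)

# Remove a possible linear trend, separately for each channel
data = detrend(raw.get_data(), axis=1, type="linear")  # (n_channels, n_times)
```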

C. Symbolization of EEG Time Series
To apply NLP methods, we had to convert EEG signals into text. This procedure was divided into two stages: Piecewise Aggregate Approximation (PAA) and discretization. PAA was applied across time, whereas discretization was applied across EEG amplitude [29]. At the PAA stage, we divided the entire EEG segment into non-overlapping windows of a fixed length of 10 data points. The window length was chosen arbitrarily, being approximately equal to the ratio of the sampling frequency (500 Hz) over the high-frequency cut-off of the applied bandpass filter (55 Hz). We then averaged EEG amplitude across the time points within each window, thus reducing the total number of time points by a factor of 10, as illustrated in Fig. 1 for one example EEG time series.
At the discretization stage, which was performed separately for each EEG channel, we divided the entire range of this EEG channel's amplitude into several bins, each assigned a letter from the Latin alphabet. First, we defined two quantiles, Q1 and Q3, representing the 25th and 75th percentiles of the EEG amplitude, respectively. Second, we calculated the interquartile range (IQR) as the difference between Q3 and Q1. Finally, we defined the lower and upper boundaries as Q1 − 1.5 IQR and Q3 + 1.5 IQR, respectively. All values above the upper boundary or below the lower boundary were deemed outliers, and we assigned two bins for the upper and lower outliers. The rest of the amplitude range was divided into 20 equally spaced bins. The entire range of EEG amplitude, separately for each channel, was thus discretized into 22 bins, denoted by the symbols from "a" to "v". Using this mapping, we assigned a symbol to each signal value obtained from PAA, as illustrated in Fig. 2. Specifically, each window of 10 data points was assigned the symbol corresponding to the window's mean amplitude; in other words, each set of 10 consecutive data points from an original recording corresponded to one symbol. For example, the symbolization of a 60-second EEG recording returns a sequence of 3000 symbols (60 seconds × 500 Hz sampling frequency / 10-point window size), having a form like "ababdc …". As a result of this procedure, all EEG signals were normalized and symbolized.

Fig. 2. An example demonstrating the symbolization of the EEG signal from Fig. 1 after the discretization stage: (a) the upper and lower boundaries (dashed lines) are defined as Q3 + 1.5 IQR and Q1 − 1.5 IQR, which in turn defines the two bins for the upper and lower outliers; (b) a 'zoomed-in' version (y-axis rescaled) of the same signal as in (a), wherein the amplitude range between the boundaries is divided into 20 bins (dotted lines). Each bin, including those for outliers, is assigned a symbol from the Latin alphabet (from "a" to "v", from bottom to top).
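A minimal sketch of the two-stage symbolization for one channel is given below, assuming the parameters above (10-point windows, 20 inner bins plus two outlier bins); whether the quartiles are computed on the raw or the PAA-averaged amplitudes is our assumption.

```python
import numpy as np

def symbolize_channel(signal, window=10, n_bins=20):
    """Symbolize one EEG channel: PAA followed by IQR-based discretization."""
    # PAA: average amplitude within non-overlapping windows of 10 data points
    n = len(signal) // window
    paa = signal[:n * window].reshape(n, window).mean(axis=1)

    # Discretization: outlier fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR,
    # with 20 equally spaced bins in between
    q1, q3 = np.percentile(signal, [25, 75])
    iqr = q3 - q1
    edges = np.linspace(q1 - 1.5 * iqr, q3 + 1.5 * iqr, n_bins + 1)

    # np.digitize maps values below the lower fence to 0 and above the
    # upper fence to n_bins + 1, yielding 22 bins in total
    idx = np.digitize(paa, edges)
    alphabet = "abcdefghijklmnopqrstuv"  # 22 letters, "a" to "v"
    return "".join(alphabet[i] for i in idx)

# A 60-second recording at 500 Hz yields 3000 symbols, e.g., "ababdc..."
```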

D. Tokenization of EEG Symbolic Series
Once EEGs were symbolized, we applied a byte-pair encoding (BPE) algorithm to split the strings of letters into tokens representing groups of letters standing together. BPE has been around for a long time [30]; however, it has received much more recognition since it was applied in a study demonstrating its ability to handle rare and unknown word translation tasks [31]. For example, a compound word like "authorship" could be understood by NLP models through the subwords "author-" and "-ship" because the subwords "author" and "ship" often occur alone in human language. Extending the analogy, we aimed to find comparable patterns in symbolized EEG data. BPE takes all pairs of consecutive symbols, counts their occurrences in the text, and merges the most frequent pairs into new symbols. A newly merged symbol can then be merged again with another symbol if the new pair occurs most frequently within a given iteration. In this way, the algorithm starts with a base vocabulary (in our case, the symbols "a" to "v" that occur in the training dataset), merges the basic symbols to form new symbols, and generates a new vocabulary of basic and merged symbols. We refer to these newly merged symbols as tokens. The algorithm (tokenizer) iterates until the vocabulary attains an a priori defined vocabulary size. We illustrate the BPE workflow with the following example:

Input string: a a a b d a c a a a b a c
Step 1: merging aa: aa a b d a c aa a b a c
Step 2: merging ab: aa ab d a c aa ab a c
Step 3: merging aa and ab: aaab d a c aaab a c
Step 4: merging ac: aaab d ac aaab ac

As a result, instead of having a sequence of 13 symbols, we have a sequence of five tokens: two aaab, two ac, and one d. In terms of the tokens' length, some tokens may represent more sophisticated patterns than a single symbol (note that the tokens ultimately reflect patterns in EEG signal amplitude).
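The merging loop can be sketched in a few lines of Python; this toy reimplementation breaks frequency ties by first occurrence, so its intermediate merges may differ from the steps above, although the final sequence for this input is the same.

```python
from collections import Counter

def bpe_merge(symbols, max_merges=10):
    """Iteratively merge the most frequent pair of adjacent symbols."""
    seq = list(symbols)
    for _ in range(max_merges):
        pairs = Counter(zip(seq, seq[1:]))           # count adjacent pairs
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        if freq < 2:                                 # no pair repeats: stop
            break
        merged, out, i = left + right, [], 0
        while i < len(seq):                          # greedy left-to-right merge
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (left, right):
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

print(bpe_merge("aaabdacaaabac"))
# -> ['aaab', 'd', 'ac', 'aaab', 'ac']
```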
To merge symbols and find repeated tokens in the dataset, we trained the tokenizer on the symbolic series of all EEG recordings in the entire dataset. Specifically, the input data comprised 5785 multivariate symbolized series, amounting to ∼3.5 billion symbols in total (5785 EEGs, each having 20 channels, with 30000 symbols representing each channel). We applied the open-source implementation of the BPE algorithm developed by the Hugging Face community (Hugging Face, BPE).
Fig. 3. A schematic illustration of the feature space used to describe tokenized EEG signals. Each row is associated with one EEG segment. Each column represents a feature associated with one token; its value is the token's occurrence frequency for a given EEG channel. The tokens are channel-specific, so columns are named in the format 'channel_token', for example, 'C4_BABC' or 'O2_BABC'. Note that the tokens' occurrence frequencies sum up to 100% for each EEG channel.

The algorithm is controlled by several parameters: (1) a minimal frequency of a pair of symbols to be merged into a token, and (2) a maximum vocabulary size, which defines how many tokens the algorithm will form before it stops. We applied the following parameter values: minimal frequency = 2, which is the default value, and maximum vocabulary size = 1500, in order to keep the total number of tokens (across all EEG channels) approximately equal to the tokenizer's vocabulary size multiplied by the number of channels (see the next subsection).
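In practice, training the tokenizer with the Hugging Face tokenizers library and the parameter values above can be sketched as follows; symbolic_series stands for a hypothetical iterable of per-channel symbol strings produced by the symbolization stage.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(
    vocab_size=1500,                           # maximum vocabulary size
    min_frequency=2,                           # minimal pair frequency (default)
    initial_alphabet=list("abcdefghijklmnopqrstuv"),  # 22 base symbols
)

# symbolic_series: iterable of symbol strings, one per EEG channel segment
tokenizer.train_from_iterator(symbolic_series, trainer)

encoding = tokenizer.encode("ababdcaaab")
print(encoding.tokens)  # learned tokens, e.g. ['abab', 'dc', 'aaab']
```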

E. EEG Features Based on EEG Tokens
Once the tokenizer had learned a vocabulary of tokens, we described each symbolic EEG series by a sequence of newly learned tokens, each representing a unique pattern of changes in EEG amplitude in the original EEG signal. We considered each token as a word composed of one or more letters and counted the number of tokens in each channel separately for each EEG segment, similar to the bag-of-words approach. The bag-of-words model is commonly used in NLP for text document classification, wherein each word's occurrence frequency is used as a feature for training a classifier [32]. We weighted each token's occurrence frequency by its length and normalized it with respect to the total length of the series:

Token's occurrence frequency (%) = (Number of token appearances × Token length) / (Total length of the series) × 100

Note that for a given channel, these values sum up to exactly 100%. As we applied the tokenizer separately to each EEG channel, the tokens' occurrence frequency was calculated across the pool of tokens generated for a given channel, as illustrated in Fig. 3. Thus, the total number of tokens across all EEG channels was approximately equal to the vocabulary size of the tokenizer multiplied by the number of channels. We say approximately because there was no guarantee that the same tokens would appear in all channels; some channels may lack tokens present in the vocabulary. As a result, the EEG dataset had a size of 5785 samples with about 30000 features, representing the tokens' occurrence frequencies across 20 channels.
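With toy numbers, the length-weighted normalization above can be computed per channel as in this sketch (the function name is ours):

```python
from collections import Counter

def token_frequencies(tokens):
    """Length-weighted occurrence frequencies (%) for one channel's tokens."""
    total_length = sum(len(t) for t in tokens)   # length of the symbolic series
    counts = Counter(tokens)
    return {t: 100.0 * n * len(t) / total_length for t, n in counts.items()}

freqs = token_frequencies(['aaab', 'd', 'ac', 'aaab', 'ac'])
print(freqs)                # {'aaab': 61.5..., 'd': 7.7..., 'ac': 30.8...}
print(sum(freqs.values()))  # 100.0, i.e., frequencies sum to 100% per channel
```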
We also considered situations wherein tokens had the same shape but represented different EEG amplitudes, as exemplified in Fig. 4. Specifically, the two tokens 'bdc' and 'dfe' in Fig. 4(a) have the same shape, but they are shifted along the amplitude axis. Formally, these two patterns of changes in EEG amplitude are represented by different combinations of letters. We converted all the tokens into their "relative" form (Fig. 4(b)) by calculating the distance between adjacent symbols in terms of the number of bins between them. For example, in the token 'bdc', the distance between the letters 'b' and 'd' is two bins up, and that between 'd' and 'c' is one bin down. Thus, both tokens in Fig. 4(a) had the same form [+2, −1] (Fig. 4(b)). In our study, we used the terms "symbolic tokens" and "relative tokens" to designate the original tokens and the tokens representing relative changes in EEG amplitude, respectively.
The transition from the absolute symbolic form to the relative form decreased the number of features in our dataset from approximately 30000 to 8000. To compute the occurrence frequencies of tokens representing relative changes in EEG amplitude, we summed up the occurrence frequencies of those tokens that had the same relative shape. This was done separately for each EEG channel. For example, if the token 'bdc' in channel O2 occurred with a frequency of 0.015, and the token 'dfe' in the same channel occurred with a frequency of 0.01, the corresponding relative token with the shape O2 [+2, −1] would have an occurrence frequency of 0.025. Note that, as before, all frequencies within a given channel summed up to 100%.
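A sketch of the conversion to relative form and the per-channel aggregation described above (the function names and dict-based bookkeeping are ours):

```python
from collections import defaultdict

def to_relative(token):
    """Relative form of a token: bin differences between adjacent symbols."""
    return tuple(ord(b) - ord(a) for a, b in zip(token, token[1:]))

def relative_frequencies(symbolic_freqs):
    """Sum occurrence frequencies of symbolic tokens sharing a relative shape."""
    rel = defaultdict(float)
    for token, freq in symbolic_freqs.items():
        rel[to_relative(token)] += freq
    return dict(rel)

print(to_relative('bdc'), to_relative('dfe'))   # (2, -1) (2, -1)
print(relative_frequencies({'bdc': 0.015, 'dfe': 0.01}))
# -> {(2, -1): 0.025} (up to floating-point rounding)
```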

F. Tokens-Based EEG Features and Their Relationships With Age
We correlated the tokens' occurrence frequencies with patients' biological age to validate our pipeline based on the symbolization of EEG signals and to demonstrate that the tokens provide a physiologically relevant representation. We applied two approaches. First, we trained a machine learning model, namely a Random Forest regression, to predict patients' biological age using tokens' occurrence frequencies as new EEG features, to assess these features' predictive power as a whole. Second, we correlated each EEG feature of interest with patients' age univariately (separately for each token) to find the features most sensitive to age-related changes.
Correlations between tokens' occurrence frequencies and age: We explored correlations between each token's occurrence frequency and age across subjects. Distance correlation [33] was designed to reflect functional, nonlinear associations between two variables. Separately for each token and EEG channel, we calculated the distance correlation coefficient between this token's occurrence frequency and patients' age. We also analyzed how this measure varied across EEG channels. For the tokens with the strongest correlations (above the 99.99th percentile), we explored how their occurrence frequency changed with age and what patterns in EEG signals they represented.
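As a sketch, the dcor Python package is one available implementation of distance correlation; the synthetic U-shaped data below merely illustrate why it is preferred over Pearson's r for such relationships.

```python
import numpy as np
import dcor  # one available implementation of distance correlation

rng = np.random.default_rng(42)
age = rng.uniform(15, 99, size=2000)
# Synthetic U-shaped dependence of a token's frequency on age
freq = (age - 57) ** 2 / 1000 + rng.normal(0, 0.3, size=2000)

print(dcor.distance_correlation(age, freq))  # clearly positive
print(np.corrcoef(age, freq)[0, 1])          # near zero for a symmetric U-shape
```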
Age prediction with a machine learning model: We tested the capacity of the new EEG features to predict the patients' age. Specifically, we tested a random forest regression model, using the tokens' occurrence frequencies as the predictor variables and the patients' age as the target variable.
The entire dataset was randomly split into train and test subsets (50% of EEG segments were used for training and 50% for testing) 10 times (cross-validation rounds). For each round of cross-validation, we applied the cuML RAPIDS implementation of the Random Forest regression model [34] to utilize high-performance training on a GPU (Tesla T4, 16 GB). We applied the default values of the hyperparameters: 100 decision trees in the forest; the maximum tree depth (the number of nodes from the root of a tree to the final leaf) was set to 16; the number of leaves in a tree was not restricted; all features were considered per node split; all samples were used for fitting each tree; the number of bins to split by feature was 128; the minimum number of samples per leaf node was 1; bootstrapping was enabled; MSE (mean squared error) was used as the criterion to split nodes; and MAE (mean absolute error) was used as the accuracy metric. Further, for each round of cross-validation, we evaluated the performance of the age prediction model with three metrics: (a) mean absolute error (MAE) between the actual and predicted age, (b) Pearson correlation coefficient between the predicted and actual age, and (c) explained variance score, which measures the proportion to which a model accounts for the variation of a dataset. We reported the mean and standard deviation of the performance metrics across cross-validation rounds. The Random Forest models were trained and evaluated separately for two sets of features: symbolic and relative tokens.
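One cross-validation round can be sketched as follows; X and y are hypothetical arrays holding the tokens' occurrence frequencies and patients' ages, only the hyperparameters named above are set explicitly, and we assume cuML's scikit-learn-like estimator API.

```python
import numpy as np
from cuml.ensemble import RandomForestRegressor  # GPU-accelerated RAPIDS model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, explained_variance_score

# X: (n_segments, n_token_features) occurrence frequencies; y: patients' age;
# cuML expects float32 inputs
X_train, X_test, y_train, y_test = train_test_split(
    X.astype(np.float32), y.astype(np.float32), test_size=0.5)

model = RandomForestRegressor(n_estimators=100, max_depth=16, n_bins=128)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)        # mean absolute error
r = np.corrcoef(y_test, y_pred)[0, 1]            # Pearson correlation
ev = explained_variance_score(y_test, y_pred)    # explained variance score
```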

III. RESULTS
The Random Forest model based on the tokens' occurrence frequencies predicted patients' age with an MAE = 15.7 ± 0.14 years (p < 0.0001) (Fig. 5). The Pearson correlation coefficient between the predicted and actual age in the test sample was 0.53 ± 0.01 (p < 0.0001), whereas the explained variance score was 0.26 ± 0.01.
Compared to the symbolic tokens, the random forest regression model based on the relative tokens demonstrated similar performance in terms of MAE (15.7 ± 0.14 years), correlation (0.54 ± 0.01), and explained variance (0.26 ± 0.01). We note, however, that the smaller number of relative tokens compared to the original symbolic tokens (approximately 8000 vs. 32000) substantially reduced the computational time required for model training: about 40 minutes versus 2 hours.

Fig. 7. Relationships between relative token occurrence frequency and age. Two relative tokens were identified as having the highest distance correlation values: [+5, −2] (left) and [+2, +1, 0] (right), both representing changes in EEG amplitude at channel O2. The entire age range was divided into non-overlapping one-year age groups. Each dot on these scatterplots represents the mean occurrence frequency averaged across subjects within each age group.
We also explored the distribution of tokens according to their length, defined as the number of symbols (letters) composing a token (Fig. 6). 95% of unique tokens had lengths between 1 and 10 symbols, with a mean of 7 symbols and a median of 4.
We employed the distance correlation metric to identify the most influential tokens and to explore how their occurrence frequency changes with age. We calculated the distance correlation between tokens' occurrence frequency and patients' age separately for each token. The tokens with the highest correlations (thresholded at the 99.99th percentile) had distance correlation values ranging between 0.25 and 0.29 for the symbolic tokens and between 0.23 and 0.24 for the relative tokens (p < 0.1). We identified the two most influential relative tokens and explored how their occurrence changed with age. Specifically, we grouped all the patients from the training dataset into non-overlapping one-year-long age categories. Fig. 7 shows the functional relationships between the tokens' occurrence frequency and age, wherein each dot represents the tokens' mean occurrence frequency averaged across patients within each age category. As can be seen from Fig. 7, these relationships are nonlinear in general, with positive or negative trends depending on the age range.
We also checked how the distance correlation between tokens' occurrence frequency and age varied across the spatial organization of EEG channels. On average, some EEG channels expressed stronger correlations. We computed the median and mean values of the correlation coefficients across all tokens within each EEG channel. We found that tokens extracted from EEG recorded in the frontal and occipital areas had a higher median distance correlation. Visually, the distributions of the median or mean correlation values across channels tended to preserve a spatial symmetry between the left and right hemispheres (Fig. 8). Note that this symmetry was not imposed by the proposed workflow of analyses.

Fig. 8. Distribution of the median distance correlations between tokens' occurrence frequency and patients' age. The median value was computed separately for each EEG channel across all tokens. Each cell shows an EEG channel name and its median correlation value. The darker areas stand for higher correlations; asterisks denote statistical significance of the correlations.

From the EEG symbolization stage, we know the exact position of tokens in the symbolic series and can trace them all back to the original EEG signal. Our method thus allows us to find the EEG fragments corresponding to the tokens that correlate the most with age (Fig. 9). Notably, these EEG segments are relatively short. For example, with the original sampling frequency of 500 Hz and the window size of 10 data points used for defining a symbol, a token of three symbols has a duration of 3 × 20 ms = 60 milliseconds.
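Because each symbol covers a fixed number of samples, mapping a token occurrence back to the original recording is simple arithmetic, as in this sketch with the study's parameters (the function name is ours):

```python
def token_span(symbol_offset, token_length, window=10, fs=500):
    """Start time and duration (in seconds) of a token occurrence, given its
    offset (in symbols) within the symbolic series of one channel."""
    start = symbol_offset * window / fs
    duration = token_length * window / fs
    return start, duration

print(token_span(symbol_offset=1500, token_length=3))  # (30.0, 0.06) -> 60 ms
```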

IV. DISCUSSION
In our study, we applied an NLP-based pipeline for extracting features from clinical EEG recordings for their subsequent classification. We tested this approach using an extensive sample of routine clinical EEG scans recorded in a large pool of patients within a very wide age range. First, with minimal preprocessing, we converted EEG signals into symbolic series. The byte-pair encoding (BPE) algorithm then extracted the most frequent patterns (tokens) from the newly reconstructed symbolic series. To validate our approach, we used the extracted tokens as new EEG features and applied a random forest regression model to predict patients' age. We also examined functional relationships between tokens' occurrence frequency and patients' age and found that these relationships may follow a U-shaped pattern, depending on the age range. Spatially, the strongest correlations were expressed by EEG channels from the prefrontal and occipital areas.
We tested a minimal preprocessing of EEG signals recorded in a complex clinical environment, using a classical (non-neural) model without fine-tuning the model parameters. Still, we demonstrated the presence of relatively high correlations between tokens and age and obtained results that were comparable with other studies. We note that the mean absolute error (MAE) between the actual and predicted age in our data is relatively high compared to other studies on age prediction from EEG of comparable sample size (Table I).

Limitations of the study: The highly heterogeneous nature and sample size of our EEG data may explain the lower performance of our approach compared to other studies. Our workflow of analyses processed EEG as it was recorded in real-world clinical practice scenarios. EEGs were recorded by different EEG technicians operating several EEG stations. The location of the reference electrode was expected to vary from one EEG study to another and, in general, was unknown. Importantly, we analyzed virtually all EEG scans recorded and evaluated in the process of the diagnostic workup in one hospital, without any selection bias. Our population included both in-patients (who are required to stay in the hospital overnight) and out-patients (no such requirement). This is a highly heterogeneous population, with various diagnoses and a full spectrum of comorbidity levels.
These patients were expected to take various medications, which are known to impact EEG dynamics [36]. Furthermore, we did not separate EEG scans into normal and abnormal categories. The status of the EEGs (normal or abnormal; if abnormal, epileptiform or non-epileptiform) was not readily available, as this information was ultimately stored in neurological reports (neurologists' interpretation of EEG waveforms), which represent unstructured medical text. Also, we did not use hyperparameter tuning techniques for our models, as our primary goal was to demonstrate the feasibility of an NLP-based approach for classifying clinical EEG.
The proposed workflow of analyses included two main stages: symbolization of EEG records and their subsequent tokenization. Symbolization, i.e., the transformation of EEG time series into symbolic (letter) sequences, is not new [26]. Several studies applied this approach to characterize the signal complexity of EEG dynamics with metrics such as symbolic entropy [26], [37], in various applications, e.g., anesthetic effects on cortical information flows [25]. Typically, the symbolization stage in those studies was used to define a single value characterizing the entire time series, such as the entropy of a given signal. At a later stage, EEG classification or group analysis was performed based on these macro-characteristics of EEG signals, which by design prevents tracing group differences back to the original EEG dynamics.
Advantages of the proposed approach: In contrast to symbolization, tokenizing the symbols, i.e., grouping them into "words", is novel in the EEG literature, and it contributes significantly to solving one of the key issues in machine learning applications in healthcare, namely interpretability [38]. The proposed method of feature extraction with BPE ensures that the extracted features, namely tokens or, ultimately, short patterns in EEG amplitude, are interpretable. Here we define interpretability as the pipeline's ability to directly point to the most sensitive EEG waveforms and their exact locations in the original signal. A combination of EEG symbolization and subsequent tokenization of symbolic series allows one to focus on relatively short patterns of changes in EEG amplitude, which can be of high relevance to a target variable of interest. Ultimately, this functionality would allow physicians to focus on the EEG patterns in the original recording that correspond to a clinically relevant parameter. This can be particularly important in the context of the inter-rater variability observed when evaluating routine clinical EEG. Sharp transients in EEG, including interictal epileptiform discharges, vary significantly in their morphological properties, such as voltages, durations, slopes, areas, and across-channel correlations, and may not be interpreted reliably by different physicians [39]. One of the advantages provided by BPE-based tokenization is that EEG transients can potentially be identified automatically in an unsupervised manner. Importantly, our method can directly identify very short clinically relevant EEG waveforms and characterize their morphological features, such as duration, voltage amplitude, and changes in the slope, without significant modifications. Also, the proposed approach can handle relatively long recordings. Specifically, our study was based on 10-minute-long EEG segments, whereas comparable approaches utilized 30-60-second epochs [1], [35].
We have acknowledged several limitations, which are essentially related to the variability and uncertainty of clinical data recorded in real-world scenarios of modern healthcare. On the one hand, this may be considered a flaw; on the other hand, it could be regarded as a strength, especially in the context of model drift in machine learning. Clinical data collection is unlikely to change quickly, so expecting clinical neurophysiological data to meet laboratory-setting standards would limit the ability to translate this type of work. Importantly, data of this kind are necessary to avoid algorithmic bias when developing socially responsible artificial intelligence tools for more effective health care.
In general, the proposed approach offers several advantages for EEG classification. Potentially, it can handle EEG recordings of arbitrary duration, as it was originally designed to classify unstructured texts of varying length. The method does not require advanced EEG preprocessing and finds local patterns (in the temporal domain) in raw EEG in an unsupervised manner. The feature extraction method can be applied to other classification tasks besides age prediction. Importantly, the proposed approach can identify short clinically-relevant events, such as interictal epileptiform discharges. The local traceable features can be potentially utilized in decision support systems to highlight clinically-relevant EEG segments for further examination by neurologists with conventional methods.