Tractable Maximum Likelihood Estimation for Latent Structure Influence Models With Applications to EEG & ECoG Processing

Brain signals are nonlinear and nonstationary time series that provide information about spatiotemporal patterns of electrical activity in the brain. Coupled hidden Markov models (CHMMs) are suitable tools for modeling multi-channel time series that depend on both time and space, but their state-space parameters grow exponentially with the number of channels. To cope with this limitation, we adopt the influence model as the interaction mechanism between hidden Markov chains, yielding Latent Structure Influence Models (LSIMs). LSIMs are capable of capturing nonlinearity and nonstationarity, making them well suited to multi-channel brain signals, and we apply them to capture the spatial and temporal dynamics of multi-channel EEG/ECoG signals. The current manuscript extends the scope of the re-estimation algorithm from HMMs to LSIMs. We prove that the re-estimation algorithm of LSIMs converges to stationary points of the likelihood by developing a new auxiliary function, based on the Kullback-Leibler divergence, for the influence model combined with mixtures of strictly log-concave or elliptically symmetric densities. The theory supporting this proof derives from previous studies by Baum, Liporace, Dempster, and Juang. We then develop closed-form re-estimation formulas using the tractable marginal forward-backward parameters defined in our previous study. Simulated datasets and EEG/ECoG recordings confirm the practical convergence of the derived re-estimation formulas. We also study the use of LSIMs for modeling and classification on simulated and real EEG/ECoG datasets. Based on AIC and BIC, LSIMs perform better than HMMs and CHMMs in modeling embedded Lorenz systems and ECoG recordings. LSIMs are also more reliable and better classifiers than HMMs, SVMs, and CHMMs on 2-class simulated CHMM data.
EEG biometric verification results indicate that the LSIM-based method improves area under the curve (AUC) values by about 6.8% and decreases the standard deviation of AUC values from 5.4% to 3.3% compared to the existing HMM-based method across all conditions on the BED dataset.


I. INTRODUCTION
Coupled hidden Markov models (CHMMs) are probabilistic models that use interacting Markov chains to model complex dynamical systems. CHMMs are used in many applications involving multi-channel time series, such as sign language recognition [1], Audio-Visual Speech Recognition (AVSR) [2], [3], EEG and ECG classification [4], [5], [6], dynamic Functional Connectivity (dFC) in fMRI [7], disease interactions [8], freeway traffic modeling [9], [10], and financial crisis detection [11]. CHMMs effectively handle the non-stationarity and non-linearity that commonly exist in real-world time series, especially multi-channel brain signals. Brain functional connectivity research shows that multiple brain regions exhibit temporal dependencies; therefore, a CHMM is a good candidate for analyzing multi-channel brain signals. Study [12] demonstrates a CHMM-based methodology for modeling the trajectory of EEG topography over time; it classifies single trials from visual detection tasks as target and non-target. In a recent study, simulations and real EEG data from epileptic patients were used to test classification performance [13]. In addition to providing classification results, the model also mapped brain activity back onto the scalp, allowing the EEG signals to be interpreted. Study [14] presents a novel, customized method to detect and localize epileptic seizures in multi-channel scalp EEG recordings; this CHMM framework captures spatiotemporal propagation for robust seizure detection. Standard HMMs can model multi-channel interacting time series, but CHMMs are the better alternative [15]: they complement the capabilities of standard HMMs by capturing interactions in both space and time for multi-channel time series [15]. In a C-channel CHMM with N hidden states per channel, each channel has its own hidden Markov chain and observations (univariate or multivariate).
Transition probabilities of a channel's hidden states depend on the previous hidden states of all channels, so each channel has a large transition matrix (N^C × N). In general, the number of state-space parameters grows exponentially with the number of channels, and correspondingly more sample observations are needed to estimate them. For example, a 20-channel CHMM with 10 hidden states per channel has 2 × 10^22 parameters, and learning them would require a number of observations that is unattainable in practical problems. Therefore, the learning problem of CHMMs is much more challenging than that of HMMs.
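As a rough sanity check on these counts, the two growth rates quoted in the text can be compared directly. This is a sketch: the helper names are ours, and the counting convention (C tables of size N^C × N for the CHMM, and the (N^2 + 1)C^2 figure for the influence model) is taken from the formulas above rather than any exact bookkeeping.

```python
def chmm_transition_params(N, C):
    # Each of the C channels conditions on all C previous hidden states,
    # giving an (N**C x N) transition table per channel.
    return C * (N ** C) * N

def lsim_transition_params(N, C):
    # Influence model: quadratic growth, (N**2 + 1) * C**2 as quoted above.
    return (N ** 2 + 1) * C ** 2

# The 20-channel, 10-state example from the text:
print(chmm_transition_params(10, 20))  # 2 x 10^22, as stated
print(lsim_transition_params(10, 20))  # 40400
```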
There are two common approaches in the literature to overcoming the exponential growth of state-space parameters. A simplified factorization of the transition matrix was proposed by Brand, in which the probability of a hidden state conditioned on the previous states is assumed equal to the product of the marginal conditional probabilities [15], [16]. While Brand's assumption reduces the number of parameters to (NC)^2, it needs N^C normalizing values, as emphasized by [17]. The second approach is the influence model, which prevents the exponential growth of state-space parameters [18], [19]; transition matrices are factorized as a convex combination of pairwise transitions,

P(q_{t+1}^ξ = j | q_t^1, . . ., q_t^C) = Σ_{c=1}^{C} θ_{c,ξ} a^{c,ξ}(q_t^c, j),

where the coupling weight θ_{c,ξ} indicates the influence from channel c on channel ξ. The coupling weights can be viewed as the adjacency matrix of a graph or network, hence the name influence model [19]. The influence model reduces the exponential growth to quadratic growth ((N^2 + 1)C^2). Several studies used the influence model in higher-order Markov chain modeling, stochastic language modeling, and mixed memory Markov models [18], [20], [21]. Latent Structure Influence Models (LSIMs) are CHMMs with the influence model as the interaction model of the Markov chains; they have been employed in social computing in several studies [22], [23], [24], [25]. In contrast to CHMMs, LSIMs can easily be applied to datasets with many channels. Two important problems, known as the inference and learning problems, must be solved for LSIMs to be useful in real-world applications. Exact inference is achieved by transforming an LSIM into an equivalent HMM with large cardinality using the Cartesian product of the hidden states of all channels. Unfortunately, the exact solution's computational complexity is O(TN^{2C}) [15], [17], which grows exponentially with the number of channels. Thus, it can be very demanding and time-consuming, and several approximate inference algorithms have been proposed in the literature to overcome this computational burden.
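A minimal numerical sketch of this factorization (all sizes, variable names, and random parameters here are hypothetical): the next-state distribution of a channel is a θ-weighted convex combination of rows of pairwise row-stochastic matrices, so it is automatically a valid distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 3, 4  # toy sizes: states per channel, number of channels

# theta[c, xi]: coupling weight (influence of channel c on channel xi);
# the weights into each channel xi form a convex combination.
theta = rng.random((C, C))
theta /= theta.sum(axis=0, keepdims=True)

# A[c][xi]: an N x N row-stochastic matrix a^{c,xi} mapping channel c's
# current state to a distribution over channel xi's next state.
A = [[rng.dirichlet(np.ones(N), size=N) for xi in range(C)] for c in range(C)]

def next_state_dist(xi, current_states):
    """P(q_{t+1}^xi = . | q_t^1, ..., q_t^C) under the influence model:
    a theta-weighted mixture of the relevant pairwise-matrix rows."""
    return sum(theta[c, xi] * A[c][xi][current_states[c]] for c in range(C))

p = next_state_dist(0, [0, 1, 2, 0])
assert np.isclose(p.sum(), 1.0) and (p >= 0).all()  # still a distribution
```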
Various approximate inference algorithms have been proposed to cope with this exponential complexity. The first algorithm uses a nonlinear mapping based on Structured Variational Inference (SVI); its marginal forward and backward parameters are obtained with polynomial complexity O(T(NC)^2) [26]. The next algorithm was developed based on mean-field approximation and variational inference [27]; it calculates the one-slice parameter using Completely Factorized Variational Inference (CFVI) with computational complexity O(T(NC)^3). We also proposed a new approximate algorithm to compute the marginal forward, backward, and one-slice parameters with computational complexity O(T(NC)^2) [28]. Results on simulated and real datasets confirmed that the proposed inference has lower error and superior performance compared to the existing SVI and CFVI.
Learning, or estimating LSIM parameters, is a more critical and challenging problem than inference. For HMMs, both the learning and inference problems are solved efficiently, which makes them practical in real-world applications. The well-known forward-backward algorithm solves the inference problem in HMMs. The learning problem involves choosing an optimal set of parameters for some observed multi-channel time series so as to maximize an appropriate criterion. A well-known training method for HMMs is the Baum-Welch or Expectation-Maximization (EM) algorithm, which estimates parameters in the maximum-likelihood framework [29]. The EM algorithm finds local maximum-likelihood parameters of HMMs based on an auxiliary function defined via the Kullback-Leibler divergence [30]. According to Baum's optimization procedure, the optimal parameters are the critical points of the auxiliary function [31]. Juang extended the EM algorithm of HMMs to accommodate a broad class of mixtures of strictly log-concave or elliptically symmetric multivariate distributions [32], [33]. The proof of convergence and the closed-form learning relations for HMMs were presented based on the auxiliary function [32]. A recent study developed a novel Markov chain Monte Carlo (MCMC) algorithm that simultaneously performs inference and parameter estimation in nonhomogeneous Markov chains and applies CHMMs to modeling the spread of infectious diseases [34].
There is no existing learning framework for LSIMs that guarantees the model likelihood increases monotonically through a re-estimation algorithm and provides a proof of convergence. Previous studies developed learning algorithms based on partial derivatives of the likelihood function [17], [22]. The first learning algorithm maximizes a simplified likelihood function with standard constrained optimization [17]. This algorithm uses the chain rule and takes partial derivatives based on the approximate forward parameter. Its re-estimation equations are not explicit like those of the Baum-Welch algorithm, and channel observations are assumed independent of the other channels to simplify the partial derivatives. Thus, there are two sources of error in this framework, and a convergence proof was not given. The second learning algorithm is a re-estimation (EM) algorithm based on the partial derivatives of a simplified lower bound of the log-likelihood function [22], [27]. While this algorithm has a closed-form solution, it does not guarantee a monotonic likelihood increase through its re-estimation procedure. Another study proposed a dynamical influence model for LSIMs with several simplifying assumptions on the structure of the transition probabilities and the pattern of coupling weights [25]. That study uses a variational EM algorithm to avoid the exponential cost of exact inference, but the theoretical convergence and the biases of the approximate variational inference are not examined [25].
Thus, the theoretical convergence of the re-estimation algorithm has not been proven rigorously for LSIMs. Our previous study developed fast and accurate recursive equations for approximate inference in LSIMs for a given λ with computational complexity O(T(NC)^2) [28]. The current study extends the standard EM framework from HMMs to LSIMs and proves monotonic convergence. It exploits our previous approximate inference to avoid the exponential computational demands of the maximization (re-estimation) transformations and proposes a fast, tractable closed-form algorithm for learning LSIM parameters. This work focuses on LSIM learning, and the contributions can be summarized as follows.
• The re-estimation algorithm of HMMs is extended to LSIMs, and convergence is proven theoretically for the influence model with mixtures of strictly log-concave distributions. Convergence is also confirmed practically by applying the proposed algorithm to simulated and real time series.
• We develop an auxiliary function to prove convergence for LSIMs theoretically. The technical challenge involves adequately simplifying the observation likelihood based on the influence model. We show that this auxiliary function has a unique global maximum, expressed in a closed-form expression called the re-estimation transformation.
• The re-estimation algorithm is presented as a closed-form expression based on marginal forward-backward parameters, similar to the Baum-Welch algorithm. Approximate inference keeps the complexity at O(T(NC)^2) instead of O(TN^{2C}), and the proposed re-estimation algorithm works well even for datasets with more than 100 channels.
The rest of this manuscript is structured as follows. In Section II, we describe the notation and construct an auxiliary function; then, the convergence of the re-estimation algorithm is proved, and the algorithm itself is obtained by maximizing the auxiliary function. In Section III, the data simulation procedures are explained, several real multi-channel time series are introduced, and the validation criteria are described. The proposed re-estimation algorithm is applied to simulated and real multi-channel time series, and Section IV presents the results of LSIMs compared to other models. Finally, conclusions are drawn in Section V, with some proofs given in Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2023.3244130, and Appendix C, available in the online supplemental material.

II. PROPOSED LEARNING FRAMEWORK
In this section, we define symbols and variables with the same notation as [6], [28]. Some new definitions are also introduced for later probabilistic manipulation. Then, a well-established auxiliary function is constructed based on HMM concepts inspired by [32], an authoritative source on the EM algorithm in HMMs. This function is defined precisely by simplifying f(o_{1:T}|λ) in terms of the LSIM parameters. We prove that, under the influence model, the constructed auxiliary function preserves the structure of the auxiliary function of HMMs. Finally, the uniqueness of the global maximum of the constructed auxiliary function is shown, and a closed-form solution for finding it is presented. Furthermore, for an LSIM with one channel, the proposed framework is equivalent to the HMM learning framework.

A. Notations
We assume an LSIM with C channels, whose observation density in channel c at state m is a Gaussian mixture model (GMM),

b^c_m(o^c_t) = Σ_{k=1}^{D(c)} ω^c_{m,k} N(o^c_t; μ^c_{m,k}, Σ^c_{m,k}),

where D(c) is the number of Gaussian components in channel c and {ω^c_{m,k}, μ^c_{m,k}, Σ^c_{m,k}} are the weights, means, and covariance matrices of the GMM in channel c at state m, respectively. Similar to the state space, we define the mixture space as Ω^c_K = {1, 2, . . ., D(c)}. The sets of all mixing weights, means, and covariance matrices are denoted by ω, μ, and Σ. Thus, the LSIM is characterized by λ = {π, A, Θ, ω, μ, Σ}, and Λ is defined as the total parameter space (λ ∈ Λ).

The set of observations at time t is denoted by o_t = {o_t^1, . . ., o_t^C}.
The variable φ_t^ξ, defined previously in [21], indicates the independent partial influence of all channels on channel ξ. The influence model is the marginal distribution of (3). We further define some sequence spaces in the following. Let Ψ^c_Q be the T-th Cartesian product of the hidden state space Ω^c_Q; the mixture branch sequence space and the channel branch sequence space of LSIMs are defined analogously as the C-th Cartesian products over channels. A list of acronyms and notations is included in Appendix A, available in the online supplemental material, to assist the reader in tracking the notation in the following sections.

B. Joint Density and Auxiliary Function
The first step in constructing an auxiliary function is to simplify f(o_{1:T}|λ) in terms of the LSIM parameters as much as possible. We decompose and simplify the global density function f(o_{1:T}|λ) over the hidden state, mixture branch, and channel branch sequences (Q, K, Φ), as in (5). The joint density function f(o_{1:T}, Q, K, Φ|λ) is differentiable in λ since the parameters of the influence model appear in product form, similar to f(o_{1:T}, Q, K|λ) in HMMs. Following the Kullback-Leibler divergence concept used for HMMs, we define an auxiliary function Q(λ, λ̄) of two parameter sets λ and λ̄ in Λ by the following equation:

Q(λ, λ̄) = Σ_{Q,K,Φ} f(o_{1:T}, Q, K, Φ|λ) log f(o_{1:T}, Q, K, Φ|λ̄).     (6)

The next theorem about Q(λ, λ̄) generalizes readily to LSIMs.
Since (5) and (6) express the joint density and the auxiliary function, Theorem II.1 is valid. This theorem is essential in the generalized EM algorithm [30]. For a given observed multi-channel time series o_{1:T}, the re-estimation algorithm starts with an initial model λ. Then, any transformation to a λ̄ that increases Q(λ, λ̄) determines the next model in the re-estimation algorithm; a better transformation, however, is the maximizer of Q(λ, λ̄) as a function of λ̄. So, the proposed algorithm re-estimates λ̄ from the current model λ as λ̄ = T(λ) ∈ {λ̄ ∈ Λ | Q(λ, λ̄) = max_{λ' ∈ Λ} Q(λ, λ')}. The transformation T(λ) : Λ → Λ is called the re-estimation transformation. Consequently, λ̄ plays the same role as λ, so that the new re-estimate determines the next model. According to the following theorem, this sequence of models consistently increases the likelihood.
Proof. The proof follows Juang and Baum [31], [32].
Theorems II.1 and II.2 guarantee that the re-estimated model λ̄ always increases the likelihood, i.e., f(o_{1:T}|λ̄) > f(o_{1:T}|λ), unless λ̄ is a fixed point of the transformation. Thus, this transformation will converge to a fixed point or, in other words, a critical point of the likelihood.
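For orientation, the one-line Jensen argument behind this monotonicity result can be written out (a standard Baum-style step, stated here with Z = (Q, K, Φ) collecting the hidden sequences; this is the classical derivation, not a new result of the paper):

```latex
\begin{aligned}
\log \frac{f(o_{1:T}\mid\bar\lambda)}{f(o_{1:T}\mid\lambda)}
  &= \log \sum_{Z} f(Z \mid o_{1:T}, \lambda)\,
       \frac{f(o_{1:T}, Z \mid \bar\lambda)}{f(o_{1:T}, Z \mid \lambda)} \\
  &\ge \sum_{Z} f(Z \mid o_{1:T}, \lambda)\,
       \log \frac{f(o_{1:T}, Z \mid \bar\lambda)}{f(o_{1:T}, Z \mid \lambda)}
   \;=\; \frac{Q(\lambda,\bar\lambda) - Q(\lambda,\lambda)}{f(o_{1:T}\mid\lambda)} .
\end{aligned}
```

Hence Q(λ, λ̄) ≥ Q(λ, λ) implies f(o_{1:T}|λ̄) ≥ f(o_{1:T}|λ), which is exactly the guarantee used above.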
An adequate construction of Q(λ, λ̄) may not seem significant at first; for example, constructing Q(λ, λ̄) based on f(o_{1:T}, Q|λ) instead of f(o_{1:T}, Q, K, Φ|λ) would also keep Theorems II.1 and II.2 valid. However, two essential considerations in constructing a new Q(λ, λ̄) should be noted: the uniqueness of its global maximum and a fast closed-form solution for this global maximum. Next, we show that the proposed Q(λ, λ̄) satisfies both.

C. Maximization and Re-Estimation Algorithm
The logarithm of the global joint density has a separability property, as expressed in (7): it decomposes into a sum of terms, each of which depends on only one of I separable parameter sets. Theorem II.3 then states that if λ is fixed and Q(λ, λ̄_i), as a function of λ̄_i, has a unique global maximum λ̂_i that is a critical point of Q(λ, λ̄_i), then the corresponding partial transformation is well defined.
Proof. The proof follows Juang and Baum [31], [32].
Now, we decompose Q(λ, λ̄) into a sum of separated parts by substituting (7) into (6). It is then proved that each separated part has a unique global maximum (including the influence model parameters).
Let Υ^c_{n,κ} be the parameter set that defines the density b^c_{n,κ}(o^c_t). This study assumes multivariate Gaussian densities, so Υ^c_{n,κ} = (μ^c_{n,κ}, Σ^c_{n,κ}). Inserting (7) into the auxiliary function in (6) gives a decomposition into individual auxiliary functions Q(λ, λ̄_i). By Theorem II.3, if each individual auxiliary function has a unique global maximum, then the parameter sets can be re-estimated independently by maximizing the individual auxiliary functions. Fortunately, the maximization of Q_π̄c(λ, π̄^c), Q_θ̄c(λ, θ̄^c), Q_āc',c_nc'(λ, ā^{c',c}_{n_{c'}}), and Q_ω̄c_nc(λ, ω̄^c_{n_c}) subject to the usual stochasticity constraints is well-known (for all appropriate c, c', and n_c). Each of these auxiliary functions has the well-known form Σ_{j=1}^N w_j log y_j with the constraints Σ_{j=1}^N y_j = 1, y_j ≥ 0, and w_j ≥ 0, which has the unique global maximum y_j = w_j / Σ_{i=1}^N w_i [32]. The maximization of Q_Ῡ(λ, Ῡ^c_{n,κ}) is likewise guaranteed when the density is strictly log-concave or elliptically symmetric [33]. Therefore, all individual auxiliary functions have a unique global maximum, and the parameters of LSIMs can be estimated by iterative maximization of the individual auxiliary functions according to the theorems above.
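The closed form for the Σ w_j log y_j subproblem can be checked numerically. This is only a sanity-check sketch: the weights are arbitrary and `objective` is our own name; the claim being tested is the textbook result that y_j = w_j / Σ_i w_i maximizes the objective on the simplex.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.random(5)  # nonnegative weights (hypothetical values)

def objective(y):
    # The common form of each individual auxiliary function.
    return np.sum(w * np.log(y))

y_star = w / w.sum()  # claimed unique global maximizer on the simplex

# Sanity check: no random point on the simplex does better (Gibbs' inequality).
for _ in range(2000):
    y = rng.dirichlet(np.ones(5))
    assert objective(y) <= objective(y_star) + 1e-12
```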
The re-estimation transformation is derived by applying (12); the maximization of Q_Ῡ(λ, Ῡ^c_{n_c,k_c}) is also well-known and straightforward [33].
We first briefly review inference in LSIMs, including the forward, backward, and one-slice parameters. Then, we use these parameters to carry out the re-estimation transformation by maximizing the individual auxiliary functions.

1) Inference in LSIMs and Auxiliary Parameters:
Inference in an LSIM consists of computing the conditional probabilities of the hidden states at a time t given some duration of observations. The forward, backward, and one-slice parameters are the most important quantities in inference and play an essential role in the re-estimation transformation; the marginal forward, one-slice, and backward parameters are defined in [28]. To complete the maximization process, two new auxiliary parameters, Γ^{c',c}_t(n_{c'}, n_c) and γ^c_t(n_c, k_c), must also be defined; they are expressed in terms of the forward, backward, and one-slice parameters. The two-slice parameter Γ^{c',c}_t(n_{c'}, n_c) is the joint distribution of two consecutive states (from different or the same channels) plus φ^c_t given all observations; it is defined and simplified in Appendix B, available in the online supplemental material. The next parameter, γ^c_t(n_c, k_c), is the joint distribution of a hidden state and its mixture branch conditioned on all observations; it is likewise derived in Appendix B, available in the online supplemental material.
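For orientation, the single-channel (HMM) versions of these recursions can be sketched as follows. The marginal LSIM parameters of [28] generalize these per channel, so this is an illustration of the quantities involved, not the paper's algorithm (function and variable names are ours).

```python
import numpy as np

def forward_backward(pi, A, B):
    """Standard HMM forward/backward recursions.
    pi: (N,) initial probabilities; A: (N, N) row-stochastic transitions;
    B: (T, N) with B[t, n] = b_n(o_t), the emission likelihoods."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[0]
    for t in range(1, T):                       # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):              # backward pass
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    # One-slice posterior gamma_t(n) = P(q_t = n | o_{1:T}).
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma
```

A useful invariant for checking such implementations is that Σ_n α_t(n)β_t(n) equals the observation likelihood for every t.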

2) Maximization of Individual Auxiliary Functions:
Here, the introduced parameters are used to obtain the partial re-estimation transformations by maximizing the individual auxiliary functions; the maximization procedures are described in detail in Appendix C, available in the online supplemental material. Initial probabilities are re-estimated by maximizing Q_π̄c(λ, π̄^c). Coupling weights and transition matrices of the influence model are re-estimated by maximizing Q_θ̄c(λ, θ̄^c) and Q_āc',c_nc'(λ, ā^{c',c}_{n_{c'}}). Maximization of Q_ω̄c_nc(λ, ω̄^c_{n_c}) gives the mixing weights of the GMMs. Lastly, the mean vectors and covariance matrices of the GMMs are re-estimated by maximizing Q_Ῡc_nc,kc(λ, Ῡ^c_{n_c,k_c}). The above closed-form solutions (partial re-estimation transformations) are impractical for a dataset with many channels due to the exponential computational complexity of the exact marginal parameters. Our previous study proposed a fast approximate inference that computes the marginal parameters recursively with computational complexity O(T(NC)^2) instead of O(TN^{2C}) for an LSIM with C channels of N states each, observing T data points [28]. This approximate algorithm is fast and accurate enough for many practical applications, while exact inference can be demanding and time-consuming. The Hellinger distances reported in [28] are small, indicating that the approximate inference is sufficiently close to the exact inference across various numbers of channels, hidden states, and other parameters. Further, the proposed inference algorithm outperforms the existing approximate inference algorithms. Therefore, our proposed forward, backward, and one-slice parameters are used in the partial re-estimation transformations in the remainder of this manuscript.
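The flavor of the Gaussian part of these updates can be seen in a compact sketch of the familiar posterior-weighted mean and covariance re-estimation (variable names are ours; the paper's per-channel updates additionally involve the marginal two-slice and mixture-branch parameters, which are omitted here):

```python
import numpy as np

def reestimate_gaussians(gamma, obs):
    """Posterior-weighted mean/covariance updates of the Baum-Welch form.
    gamma: (T, N) state posteriors; obs: (T, d) observations."""
    w = gamma / gamma.sum(axis=0)           # per-state normalized weights
    mu = w.T @ obs                          # (N, d) re-estimated means
    cov = np.stack([(w[:, n, None] * (obs - mu[n])).T @ (obs - mu[n])
                    for n in range(gamma.shape[1])])  # (N, d, d) covariances
    return mu, cov
```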

III. SIMULATED AND REAL DATASETS
This section describes the simulated and EEG/ECoG datasets used to evaluate the proposed learning algorithm. We then consider the application of multi-channel time-series modeling using LSIMs, CHMMs, and HMMs. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are used to compare these models on both simulated and real datasets. Since one of the most important applications of LSIMs as a generative model is time-series classification [35], [36], we also evaluate the performance of LSIMs against CHMMs, HMMs, and Support Vector Machines (SVMs).
A. Simulated Datasets
1) Simulated CHMMs: The transition matrix A^c_g in CHMMs refers to transition probabilities different from those of the influence model in LSIMs. Consider a CHMM with C channels, where each channel takes a random number of states between 2 and 6. The initial state probabilities π^c = {π^c_1, . . ., π^c_{M(c)}} are initialized from the uniform distribution U(0, 1) and then normalized by dividing by their sum. Each row of A^c_g is initialized and normalized in the same way. The observation dimension of each channel is also initialized randomly between 1 and 5. Emission probabilities are GMMs in general, but it is simpler to assume a single Gaussian component. Each element of the mean vector μ^c_{m,k} is initialized from N(m, 1). For simplicity, the covariance matrices of the emission probabilities are assumed to be diagonal, and their diagonal elements are initialized from U(1, 3).
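The initialization just described can be sketched as follows (a hedged reading of the procedure: all function names are ours, and the per-channel transition tables are drawn as plain row-stochastic matrices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_stochastic(rows, cols, rng):
    """Draw entries from U(0, 1) and normalize each row to sum to one."""
    m = rng.uniform(0.0, 1.0, size=(rows, cols))
    return m / m.sum(axis=1, keepdims=True)

C = 3
states = rng.integers(2, 7, size=C)     # 2..6 states per channel
dims = rng.integers(1, 6, size=C)       # observation dimension 1..5
pi = [random_stochastic(1, n, rng)[0] for n in states]
A = [random_stochastic(n, n, rng) for n in states]
# Mean elements for state m drawn from N(m, 1); diagonal covariances U(1, 3).
means = [rng.normal(np.arange(1, n + 1)[:, None], 1.0, size=(n, d))
         for n, d in zip(states, dims)]
diag_cov = [rng.uniform(1.0, 3.0, size=(n, d)) for n, d in zip(states, dims)]
```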
2) Embedded Lorenz Systems: The Lorenz system is an interesting nonlinear dynamical system described by a set of ordinary differential equations known as the Lorenz equations:

dx/dt = σ(y − x),  dy/dt = x(β − z) − y,  dz/dt = xy − ρz.

Hundreds of research articles and at least one book-length treatment have studied the Lorenz equations [37]. Two different parameter sets ({σ = 10, ρ = 8/3, β = 28} and {σ = 10, ρ = 8/3, β = 56}) were selected to generate time series of the Lorenz system; the system exhibits chaotic behavior for these sets. Each parameter set generated a three-channel time series, and these two 3-channel time series are embedded to create a new six-channel time series. The proposed algorithm is applied to this six-channel time series, which is more complex than the three-channel case.
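A minimal sketch of generating such a six-channel series (our own RK4 integrator; note that, to stay consistent with the parameter sets quoted above, β plays the role of the driving parameter, and the step size, initial condition, and series length are assumptions):

```python
import numpy as np

def lorenz_series(sigma, rho, beta, T=4000, dt=0.01, s0=(1.0, 1.0, 1.0)):
    """Integrate dx=sigma(y-x), dy=x(beta-z)-y, dz=xy-rho*z with RK4."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (beta - z) - y, x * y - rho * z])
    out = np.empty((T, 3))
    s = np.array(s0, dtype=float)
    for t in range(T):
        k1 = f(s); k2 = f(s + dt / 2 * k1)
        k3 = f(s + dt / 2 * k2); k4 = f(s + dt * k3)
        s = s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        out[t] = s
    return out

# Two independent Lorenz systems embedded into one six-channel series.
six_channel = np.hstack([lorenz_series(10, 8 / 3, 28),
                         lorenz_series(10, 8 / 3, 56)])
assert six_channel.shape == (4000, 6)
```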

B. Brain Signal Datasets
Besides simulated datasets, we consider EEG and ECoG datasets to assess performances better, as described in the following.
1) EPFL EEG Dataset: This dataset comprises EEG recordings of five disabled and four healthy subjects [38]. EEG was recorded from 32 electrodes placed at the standard positions of the international 10-20 system. The initial sampling rate was 2048 Hz, which was resampled to 128 Hz. Subjects faced a laptop screen on which six images were displayed: a television, a telephone, a lamp, a door, a window, and a radio. The dataset and the preprocessing used in the present work are available for download from the EPFL BCI group. The recording of subject five was excluded from the dataset due to the presence of artifacts. Study [38] describes the protocols and subjects in detail.

2) Macaques ECoG Dataset:
This dataset contains ECoG recordings of monkeys performing a food-tracking task [39]. ECoG (64-channel) and hand motion data were recorded simultaneously during the task, with sampling rates of 1 kHz and 120 Hz, respectively. Due to space limitations, this study focuses on the recordings of monkey B.

3) Biometric EEG Dataset (BED):
The purpose of this dataset is to study the performance of methods for biometric person verification and identification [40]. BED is a dataset designed for EEG-based biometrics, using a low-cost consumer-grade EPOC+ device measuring 14 channels of EEG with contact sensors located at positions that closely align with the AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4 locations of the international 10-20 system. BED contains EEG recordings acquired from 21 healthy individuals with 12 different types of stimuli. EEG signals were collected over three sessions spaced one week apart to study template aging.

IV. RESULTS AND DISCUSSION
This section presents the results of applying LSIMs to simulated and real datasets. We first check convergence, then examine the modeling and classification applications, and finally analyze the biases of approximate inference. For each part, we select a subset of the datasets to keep the results concise and comprehensive.

A. Convergence Testing
The proposed learning algorithm is applied to various multi-channel time series to examine its convergence and the monotonic increase of the likelihood.
1) Simulated Data: Convergence is tested first using simulated multi-channel time series. Fig. 1 displays log-likelihood curves for 3- and 5-channel simulated observations of CHMMs. All channels have the same number of states, which is increased from 2 to 7 in the re-estimation algorithm. In all cases, the log-likelihood curves increase monotonically, as shown in the figure. The results show that the final log-likelihood increases as the number of states in the LSIMs increases. Furthermore, simpler models tend to converge faster for the same number of iterations.
2) EPFL EEG Dataset: We further apply the proposed algorithm to the EPFL EEG dataset. Two different configurations are analyzed, with 16 and 32 channels and the same channel selection as [38]. For all subjects, the number of states is four with two Gaussian components. Fig. 2 shows log-likelihood curves that are consistent with the previous simulation results.
3) Macaque ECoG Dataset: Finally, the Macaque ECoG dataset with 64 channels confirms the convergence of the proposed algorithm in a high-channel-count scenario. Fig. 3 shows the log-likelihood curves for different numbers of states on this dataset. The log-likelihood curves increase monotonically in all cases, indicating that the proposed algorithm has stable monotonic convergence.

B. Statistical Modeling of Multi-Channel Signals
In addition to checking convergence, simulated embedded Lorenz systems and the Macaque ECoG dataset are used to compare the modeling capability of LSIMs with HMMs and CHMMs.
1) Embedded Lorenz Systems: The first scenario involves applying HMMs, CHMMs, and LSIMs to a six-channel time series from the embedded Lorenz systems with 4000 observation samples. A simple grid search finds the optimal number of states (2-35 for LSIMs, 2-80 for HMMs, and 2-4 for CHMMs, with one Gaussian component) based on AIC and BIC. LSIMs and CHMMs can have different numbers of states per channel (M(c)), and an exact grid search grows exponentially with the number of channels; to avoid this, we assume that all channels have the same number of states (M). The AIC and BIC curves for HMMs, CHMMs, and LSIMs are shown in Fig. 4. Based on AIC and BIC, LSIMs are better than HMMs and CHMMs. CHMMs perform worse than HMMs because the number of state-space transition parameters increases exponentially with the number of channels (M^{C+1}) in CHMMs, and both AIC and BIC penalize the number of parameters. In addition, Fig. 5 shows the coupling weights of the LSIM selected by BIC; as expected, the coupling weights between the two independent Lorenz systems are zero.
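For reference, the two criteria used throughout this section take the standard form (k parameters, log-likelihood ll, n observations), with the smallest value selecting the model. The numbers in the usage line are hypothetical, illustrating only that BIC's stronger penalty can reverse a likelihood gain.

```python
import math

def aic(ll, k):
    """Akaike Information Criterion: 2k - 2*ln L."""
    return 2 * k - 2 * ll

def bic(ll, k, n):
    """Bayesian Information Criterion: k*ln(n) - 2*ln L."""
    return k * math.log(n) - 2 * ll

# Hypothetical comparison: a model with more parameters must gain enough
# likelihood to win once the criterion penalizes parameter count.
print(bic(-1200.0, 50, 4000) < bic(-1180.0, 120, 4000))  # → True
```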
2) Macaque ECoG Dataset: The Macaque ECoG dataset contains 64 recording electrodes with 10000 samples, which can be modeled by a 64-channel LSIM (one electrode per channel), a very complex model with many parameters. This complexity can be reduced by considering a four-channel LSIM with 16 electrodes per channel, such that each channel contains 16 neighboring electrodes on the multi-electrode array (see [39] for more details). The cross-correlation matrix also supports this reconfiguration (see Fig. 7(a)). The covariance matrices in the GMMs are assumed to be diagonal to avoid singularities [41].
Similar to the previous dataset, a simple grid search is used to obtain the optimal number of states (2-100 for LSIMs, 2-250 for HMMs, and 2-7 for CHMMs, with one Gaussian component). Fig. 6 indicates that LSIMs perform better in terms of AIC and BIC than HMMs and CHMMs. Since C = 4 in this dataset, the state-space parameter growth for CHMMs is smaller than in the embedded Lorenz system, and the number of states can be increased to 7 (M^{C+1} = 16807). The AIC values of CHMMs and LSIMs are similar for 2-7 states; beyond 7 states, however, CHMMs become intractable and inefficient. Moreover, Fig. 7(b) shows the coupling weights of the model selected by BIC, which are consistent with the cross-correlation matrix.
3) Computational Efficiency: One of the advantages of the proposed LSIM framework is its computational efficiency in systems with many channels, where CHMMs cannot be used due to the exponential growth of their state-space parameters.

C. Classification & Biometric Verification
LSIMs can also classify multi-channel time-series data, as HMMs and CHMMs can. If different generative models underlie the different time-series classes, HMMs, CHMMs, and LSIMs may be more appropriate than other classifiers [35]. During the training phase of a two-class classification problem, two Markov models are learned using training observations from each class. Hence, there are two models with parameters λ_1 and λ_2, and a given test time series must be assigned to one of them. For this assignment, the conditional observation likelihood of the test time series is computed for the model parameters λ_1 and λ_2, with ll_{λ_1} and ll_{λ_2} denoting the corresponding log-likelihoods. If ll_{λ_1} − ll_{λ_2} > 0, the test time series is assigned to model λ_1, and vice versa. This part compares the accuracy of LSIMs against HMMs, CHMMs, and SVMs using simulated CHMMs and real EEG datasets.
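The decision rule just described is a one-liner; a minimal sketch with an accuracy helper (function names and the example log-likelihood values are ours):

```python
def assign(ll1, ll2):
    """Assign a test sequence to class 1 iff its log-likelihood under
    lambda_1 exceeds that under lambda_2 (the rule described above)."""
    return 1 if ll1 - ll2 > 0 else 2

def accuracy(pairs, labels):
    """pairs: list of (ll1, ll2) per test sequence; labels: true classes."""
    hits = sum(assign(a, b) == y for (a, b), y in zip(pairs, labels))
    return hits / len(labels)

print(accuracy([(-10.0, -12.0), (-9.0, -3.0)], [1, 2]))  # → 1.0
```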
1) Simulated CHMMs: Under the classification scenario, the parameters of two CHMMs with the same structure (same number of channels, channel dimensions, and states per channel) are initialized randomly to generate training and test observations for a two-class problem.
HMMs, CHMMs, and LSIMs have two hyper-parameters: the number of hidden states and the number of Gaussian components. A proper model selection criterion (AIC, BIC, ...) must be used to determine the optimal hyper-parameters from the training observations alone, without needing a validation set. The optimal hyper-parameters are determined by minimizing the AIC using a grid search over hidden states and Gaussian components, and the selected models are then applied to classify the test time series.
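A minimal sketch of that grid search, assuming a hypothetical `train_fit(n_states, n_components)` routine that fits a model to the training observations and returns its log-likelihood and free-parameter count:

```python
import itertools

def grid_search_aic(train_fit, state_grid, component_grid):
    """Return the (states, components) pair minimizing AIC on the training data."""
    best_pair, best_aic = None, float("inf")
    for n_states, n_comp in itertools.product(state_grid, component_grid):
        ll, k = train_fit(n_states, n_comp)
        score = 2 * k - 2 * ll          # AIC = 2k - 2 ln L
        if score < best_aic:
            best_pair, best_aic = (n_states, n_comp), score
    return best_pair, best_aic

# Toy stand-in: likelihood peaks at 3 states; parameter count is states * comps.
toy_fit = lambda s, c: (-50.0 - 10.0 * abs(s - 3), s * c)
pair, score = grid_search_aic(toy_fit, range(2, 6), range(1, 3))
print(pair, score)
```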
We analyze the effect of the channel number (C), the number of training sequences (I), and the sequence duration (T) on classification accuracy. There are 10 test time series with the same T as the training set for each CHMM, and the CHMM parameters are reinitialized 1000 times for each particular condition.
[Table II: Classification accuracy for SVM, HMM, CHMM, and LSIM in 3-channel simulated CHMMs. Table III: Classification accuracy.]
Tables II and III present classification accuracy for various C, I, and T. The results show that increasing C, I, or T improves classification accuracy for LSIMs, CHMMs, HMMs, and SVMs alike, which is equivalent to increasing the available information.
The tables show that LSIMs are superior to HMMs, CHMMs, and SVMs, and that CHMMs perform better than HMMs and SVMs. Table II indicates that LSIMs outperform CHMMs by about 0.4% on average in the 3-channel simulation scenario, and Table III shows that LSIMs beat CHMMs by about 3% in the 5-channel scenario. This improvement suggests that the advantage of LSIMs over CHMMs grows with the number of channels, even for data simulated from generic CHMMs.
2) BED Biometric Verification: This part aims to improve verification performance on the BED dataset by using LSIMs instead of HMMs for EEG-based person verification. The verification task is to determine whether a user is who they claim to be: the query is compared with the template of the claimed identity, and the user is accepted or rejected depending on whether the comparison score exceeds a given threshold. In contrast, identification decides who the user is from a pool of possible profiles, comparing the query against all available profiles and assigning it to the profile that provides the best match.
In [40], an HMM-based verification method is developed, and its performance is evaluated on the BED dataset for different types of features and stimuli. To avoid any ambiguity about the processing procedures, we conduct a similar EEG-based verification using LSIMs. The current study follows the same data preprocessing and epoching, feature extraction, and epoch accept/reject decision rule proposed in [40].
In summary, the recording of channel c (c = 1, 2, ..., C) is segmented into P consecutive 5-second epochs with 50% overlap (e^(c,p), p = 1, 2, ..., P). Next, epoch p is split into H overlapping frames of 1 s with 50% overlap and represented as a sequence of observations o^(c,p) = [o^(c,p)_1, ..., o^(c,p)_H]; feature extraction on these frames yields the feature sequence ô^(c,p). The preprocessed features are distributed as part of the BED dataset, and we downloaded them directly for the verification evaluation. The existing method builds an HMM λ^c_HMM with 4 hidden states using ô^(c,p), then computes the log-likelihood l^(c,p) = P(ô^(c,p) | λ^c_HMM) for EEG channel c with respect to the maximum-probability path obtained by the Viterbi algorithm [40], [42]. The decision rule for accepting or rejecting epoch p based on the C models of a given subject follows [40], [42], where τ_C is the minimum number of channels required to accept a test epoch, and τ_s is a threshold for accepting or rejecting the epoch on an individual channel. Our contribution is to train an F-channel LSIM (λ^c_LSIM) on ô^(c,p) instead of a standard HMM (λ^c_HMM) for each EEG channel c. We treat every single feature in ô^(c,p) as a channel of the LSIM, so that the LSIM captures the dynamics of, and interactions between, individual features. For example, we train 12-channel LSIMs for MFCC features and 14-channel LSIMs for SPEC features. We then calculate the log-likelihood l^(c,p) = P(ô^(c,p) | λ^c_LSIM) for each epoch, and verification is performed using the same decision rule as in (22).
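The epoching above can be sketched with a generic overlapping-window slicer; the sampling rate `fs` and the plain-list signal are assumptions for illustration (the BED features themselves are distributed precomputed).

```python
def segment(signal, win_len, hop):
    """Slice `signal` into overlapping windows of `win_len` samples, stepping by `hop`."""
    return [signal[i:i + win_len]
            for i in range(0, len(signal) - win_len + 1, hop)]

fs = 256                                  # assumed sampling rate (Hz)
x = list(range(15 * fs))                  # 15 s of dummy single-channel samples
epochs = segment(x, 5 * fs, 5 * fs // 2)  # 5-s epochs, 50% overlap
frames = segment(epochs[0], fs, fs // 2)  # 1-s frames within an epoch, 50% overlap
print(len(epochs), len(frames))
```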
In order to simulate a realistic usage scenario, data acquired during one session is used for training, while data acquired in later sessions is used to test verification performance [40]. Hence, we train the subject models on data from the first session and test them independently on data from the second and third sessions. Verification performance is measured by the area under the curve (AUC). The two thresholds τ_C and τ_s are determined in the same way as in [40]. In addition, we use four states and three Gaussian components per channel, as in [40], [42].
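The accept/reject rule with thresholds τ_s and τ_C amounts to channel voting; this sketch assumes the per-channel log-likelihoods l^(c,p) are already computed, and the threshold values are illustrative.

```python
def accept_epoch(channel_loglikes, tau_s, tau_c):
    """Accept the epoch if at least tau_c channels score log-likelihood >= tau_s."""
    votes = sum(1 for ll in channel_loglikes if ll >= tau_s)
    return votes >= tau_c

# Illustrative: 3 of 4 channels clear the per-channel threshold, so accept.
print(accept_epoch([-3.1, -5.8, -2.4, -9.2], tau_s=-6.0, tau_c=3))
```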
For the existing HMM-based and the proposed LSIM-based methods, Tables IV and V show AUC values for each stimulus (columns) and EEG feature (rows). The AUC values of HMMs are taken directly from [40]. Additionally, the average performance across all types of stimuli is computed per EEG feature and reported in the last column. LSIMs achieve higher AUC values than HMMs in both test sessions for almost all stimuli and EEG feature types (bold values in the columns). In the second and third sessions, the AUC values show that the proposed LSIM-based method yields significant improvements of 4.5% and 9.1% over the HMM-based method (paired t-test with α = 0.01). The AUC improvement across all conditions (stimulus, feature type, and session) is about 6.8% and statistically significant. The proposed LSIM-based method also reduces the standard deviation of AUC values across stimuli, EEG features, and sessions: over all conditions, the standard deviation of AUC values is 5.4% for the HMM-based method and decreases significantly to 3.3% for the LSIM-based method. Thus, LSIMs not only improve verification performance but also reduce its variability across conditions.

D. Analysis of Approximate Inference Bias
Finally, we empirically examine how the bias of approximate inference propagates to the estimated parameters. To this end, the proposed EM algorithm is re-implemented using exact inference. Because the cost of exact inference grows exponentially with the number of channels, we can only consider LSIMs with a few channels. We evaluate the effect of replacing exact inference with approximate inference using simulated data from a 4-channel CHMM and a 4-channel configuration of the EPFL EEG dataset [38]. The EM algorithms with exact and approximate parameters are compared in terms of convergence speed and log-likelihood. Fig. 8 shows the exact log-likelihood curves of the EM algorithms with approximate and exact inference for the simulated data and for one subject of the EPFL EEG dataset; these results are representative of the overall patterns observed across various settings. As can be seen, exact inference converges faster than approximate inference, but both plots confirm that approximate inference reaches parameters with nearly the same likelihood as exact inference at the last iteration. Exact inference may give a slightly better likelihood in some cases, but its computational complexity makes it infeasible for larger channel numbers.

V. CONCLUSION
We extend the scope of the re-estimation algorithm of HMMs to LSIMs in this study. Using the influence model and multivariate mixtures of strictly log-concave or elliptically symmetric densities, we demonstrate that LSIM re-estimation converges to a local maximum of the likelihood function. The proposed auxiliary function has a unique global maximum, and closed-form re-estimation formulas are derived from the marginal forward and backward parameters.
We test the theoretical convergence of the algorithm on simulated datasets and real EEG/ECoG datasets (up to 64 channels). The log-likelihoods increase monotonically with the iterations in all cases. As model complexity increases, the log-likelihood increases as well, and more iterations are required to reach convergence. Modeling and classification tasks compare the performance of LSIMs with HMMs and CHMMs. Modeling embedded Lorenz systems and ECoG recordings shows that LSIMs outperform HMMs and CHMMs according to AIC and BIC. A primary application of LSIMs as generative models is multi-channel time-series classification. Classification of simulated CHMM data and EEG-based biometric verification are used to compare LSIMs and HMMs. The proposed LSIM-based method significantly improves the verification results over the existing HMM-based method and reduces the standard deviation of AUC values in all conditions. Therefore, LSIMs are suitable for modeling and classifying multi-channel time series that exhibit spatial and temporal structure in many multi-channel signal processing applications. It should be noted that LSIMs use a linear influence model in the state space and may be less effective when the channel interactions involve many multiplicative relationships. While exact inference gives a faster convergence rate than approximate inference, both reach parameters with nearly the same likelihood at the last iteration.
This study and our previous study [28] solve both the inference and learning problems of LSIMs accurately and efficiently. However, an intrinsic difficulty in learning LSIMs and CHMMs is that different channels may have different numbers of states: to select the optimal state number per channel, the grid-search size grows exponentially with the number of channels. Future research can focus on search algorithms that cope with this exponential growth.