Non-Parametric Bayesian Subspace Models for Acoustic Unit Discovery

—This work investigates subspace non-parametric models for the task of learning a set of acoustic units from unlabeled speech recordings. We constrain the base-measure of a Dirichlet-Process mixture with a phonetic subspace—estimated from other source languages—to build an educated prior , thereby forcing the learned acoustic units to resemble phones of known source languages. Two types of models are proposed: (i) the Subspace HMM (SHMM) which assumes that the phonetic subspace is the same for every language, (ii) the Hierarchical-Subspace HMM (H-SHMM) which relaxes this assumption and allows to have a language-speciﬁc subspace estimated on the unlabeled target data. These models are applied on 3 languages: English, Yoruba and Mboshi and they are compared with various competitive acoustic units discovery baselines. Experimental re-sults show that both subspace models outperform other systems in terms of clustering quality and segmentation accuracy. Moreover, we observe that the H-SHMM provides results superior to the SHMM supporting the idea that language-speciﬁc priors are preferable to language-agnostic priors for acoustic unit discovery.


I. INTRODUCTION
Building a speech recognition system requires a large collection of transcribed data. For instance, recent publications [1], [2], [3] report using tens of thousands hours of recordings paired with their corresponding textual transcription. Such amounts of transcribed data are available for only a handful of languages and stunt the development of speech technologies for many languages. While collecting audio data is relatively easy in our digital word, human-based transcriptions are expensive and too slow to keep pace with the daily production of multimedia content. A tremendous step would be made if one could automatically transcribe this data as it would drastically increase the amount of available resources to build speech technologies in many languages.
In parallel, there is a keen interest to understand how infants learn to recognize speech. Indeed, whereas speech recognition systems are built upon human-transcribed data, toddlers learn seamlessly to structure speech with very distant and noisy supervision. As a remarkable example of this learning capability, children born blind are perfectly able to learn to recognize speech even deprived of any supervision coming from the visual signal. In an attempt to explain this capability Lucas Ondel and Bolaji Yusuf contributed equally to this work. Lucas Ondel is with LISN, CNRS, Universite Paris-Saclay, Orsay, France (e-mail: lucas.ondel@lisn.upsaclay.fr) Bolaji Yusuf and Lukáš Burget are with the Brno University of Technology, Faculty of Information Technology, Speech@FIT, Czechia (e-mail: bolaji.yusuf@boun.edu.tr) Bolaji Yusuf and Murat Saraçlar are with Bogaziçi University, Istanbul, Turkey using the "reverse-engineering approach" [4], many models for automatic labeling of the data have been proposed by the machine learning community [5], [6], [7].
These two research ideas, while having very distinct objectives, share a common interest: to build a machine learning algorithm that automatically learns a discrete representation of the speech signal in an unsupervised fashion. For the former, this would allow automatic labeling of large collection of data, for the latter, it would serve as a simulation of how infants learn to process speech.
Note that these approaches are not mutually exclusive and can be combined as was shown in [15], [16]. This work focuses on subspace model techniques applied to non-parametric Bayesian models on the task of discovering a set of pseudo-phones (called acoustic units) from unlabeled audio recordings 1 .
Preliminary results on subspace models for AUD have been published in [17], [18] giving rise to two models: the Subspace HMM (SHMM) and Hierarchical-Subspace HMM (H-SHMM). In this paper, we provide a more comprehensive theoretical coverage of these models, their relationship with the Dirichlet process and a complete inference scheme. In addition, we conduct an in-depth performance analysis of the subspace models as well as a comparison with state-of-the-art baselines. Note that we assume readers' familiarity with the Dirichlet process and variational inference [19].
The rest of the paper is organized as follows: in Section II, we introduce a formal Bayesian formulation of the AUD problem, as well as the Dirichlet process HMM model upon which the subspace models are built; in Section III, we introduce the subspace models as HMM-based AUD models with specific prior forcing the model's parameters to dwell in a "phonetic" subspace; in Section IV, we detail the inference and experimental results are presented in Section V.

A. Probabilistic interpretation of AUD
We first introduce a probabilistic formulation of the AUD task which will motivate the Bayesian approach of this work. Given a speech utterance of N observations X = (x 1 , . . . , x N ), the AUD task amounts to find: • A collection of U vectors H = {η 1 , . . . , η U } best describing the observations, where each η represents the parameters of a distribution of observations for a specific sound. These sounds are called acoustic units as they represent the basic elements of speech. • The sequence of indices u = u 1 , . . . , u L , L < N where u i ∈ {1, . . . , U } is the index of an acoustic unit.
Rather than the maximum a posteriori estimate H * , u * = arg max H,u p(H, u|X), we seek to obtain the posterior distribution over the embeddings H and label sequence u: This allows us to have an estimate of uncertainty. The Bayesian statement of AUD in (1) is analogous to the statistical formulation of ASR [20] with, however, one major difference: in the case of AUD, the inventory of units is unknown and needs to be inferred from the data along with the acoustic unit embeddings η 1 , η 2 , . . . . It is important to understand the different roles played by the three factors in the numerator of (1): • p(X|H, u): the likelihood of the observations given the collection of acoustic unit parameters and the sequence of labels. This term, referred to as the acoustic model, tells how plausible the sequence of observations is given the sequence of acoustic units. • p(u|H): the prior over the label sequence is the language model over the acoustic units' labels. It models the phonotactic constraints of the discovered units. • p(H): the prior over the collection of embeddings, this term is essential as it defines, before observing data, what are the potential acoustic unit candidates. This term will be the focus of this work. Under the Bayesian AUD interpretation, the collection of vectors H bears a particular meaning: they are the parameters of the acoustic model.

B. Non-parametric AUD
In practice, the size of the acoustic units inventory U is not known and we have to pick an appropriate value. This is not an easy task since every language has a unique set of phones and we would like to infer the value of U in light of the data. We achieve this behavior by letting U → ∞ and adding a distribution P over the parameters of p(u, H). This approach, referred to as non-parametric Bayesian [21], lets the model learn its own complexity from the data. Following, [8], [11] we set P to be a Dirichlet Process DP(γ, G 0 ) [22] with concentration γ and base measure G 0 (η) over the acoustic unit embeddings.
To allow efficient variational inference with the Dirichlet Process [19], we use the stick-breaking process view of the Dirichlet process, expressed as a generative process: where B is a Beta distribution, G is a Gamma distribution and δ η i is the Dirac delta function centered at η i . Note that we have added a Gamma prior over the concentration of the Dirichlet Process so that we learn the value of the concentration parameter directly from the data. Finally, we use the base measure G 0 (η) and the constructed distribution G(η) to build the prior p(u|H)p(H) in the following way: One more time, we highlight the different roles played by the two terms in (7). G 0 (η) is a continuous density over the embedding space: it defines which embeddings are likely to be selected as acoustic units. G(η un ), on the other hand, is a discrete distribution over an infinite set of atoms (i.e. the samples from the base measure) and it defines how frequently a unit occurs in speech. In other words, G(η un ) is a (unigram) language model of the units.

C. Acoustic Model
We now turn to the definition of the acoustic model p(X|H, u). We denote by X u l the sub-sequence of the observed data that belongs to the acoustic unit with index u l such that X = X u1 , . . . , X u L . We assume the following factorization of the likelihood: Following [8], we model the likelihood p(X u l |η u l ) by a leftto-right HMM with S hidden states where each state has a GMM emission density with C components: p(x u l n l , c u l n l |s u l n l , η u l )p(s u l n l |s u l n l −1 ), where: • s u l = s u l 1 , . . . , s u l N l is the sequence of indices of the HMM states for acoustic unit u l , • c u l = c u l 1 , . . . , c u l N l is the sequence of indices of the mixture components for the acoustic unit u l , • p(s u l 1 |s u l 0 ) = p(s u l 1 ) is the probability of the initial state, • N l is the length of the sequence of observations X u l . Finally, the acoustic unit embedding η u l encodes the parameters of the HMM model {π s u l }, {µ s,c u l }, {Σ s,c u l }, ∀ s ∈ {1, . . . , S}, c ∈ {1, . . . , C}, where: • π s u l are the mixing weights of the GMM associated with the sth state of the HMM of the acoustic unit u l , • µ s,c u l , Σ s,c u l are the mean and the covariance matrix of the cth Normal component of the GMM associated with the sth state of the HMM of the acoustic unit u l . We have not included any parameters for the within-unit transition probabilities p(s u l n l |s u l n l −1 ) as it has been empirically observed that they play no significant role when modeling speech [23]. Therefore, we assume the transition probabilities are fixed so that there is a 0.75 probability of remaining in the same state and a 0.25 probability of exiting to the next state of the HMM. Note that this only concerns the transition probabilities within the unit; transition probabilities between units are governed by the distribution sampled from the Dirichlet process.

D. Acoustic Unit embeddings
We detail now the relation between the embedding η u l and the HMM parameters. To keep the notation uncluttered, we drop the subscripts and superscripts u l and n l , therefore, we write x, η, . . . instead of x u l n l , η u l , . . . . Observe that the distribution of p(x, c|s, η) is a product of a Normal and a Categorical distribution and, moreover, each of them is a member of the exponential family of distributions [24]. Consequently, we have: where ω s , T (c) and A(ω s ) are the natural parameters, the sufficient statistics and the log-normalizer of the Categorical distribution over the GMM components of the sth HMM state. Similarly, θ s,c , T (x) and A(θ s,c ) are the natural parameters, the sufficient statistics and the log-normalizer of the Normal distribution associated with state s and the mixture component c. For both distributions, the natural parameters and the sufficient statistics can be derived from their respective definitions [24]: . . .
where "vec" is the vectorization operation and 1 is the indicator function. Note that because π is constrained such that C c=1 π c = 1, the natural parameter ω is a (C − 1)dimensional vector whereas π is a C-dimensional vector. Finally, the log-normalizers A(ω s ) and A(θ s,c ) are given by: where ω s k is the kth element of ω k , mat is the inverse of the vec operation, θ s,c 1 = (Σ s,c ) −1 µ s,c , and θ s,c 2 = − 1 2 vec((Σ s,c ) −1 ). Finally, we define the embedding η to be a "super-vector" obtained by concatenating the natural parameters of the Normal and Categorical distributions of all S states of the HMM modeling of an acoustic unit. It has the following layout: where η s is the concatenation of the natural parameters of the Normal and Categorical distributions for the sth state of the HMM of an acoustic unit.

E. Joint Distribution
To conclude the description of the model, we present the complete joint distribution. For simplicity, we introduce the variable z n = (u l , s n l ) which encodes an acoustic unit index u l and a particular HMM state s n l . Notice that the time index n in z n -which combines both the relative time indices l and n l -is absolute with respect to the sequence of observations, i.e. n ∈ {1, . . . , N }. Similarly, c n represents the index of a GMM component at time n. We write η zn = η sn l u l which corresponds to the natural parameters of the s n l th HMM state of the acoustic unit with index u l . With this notation, the joint distribution is given by: = N n=1 p(z n |z n−1 , v)p(x n , c n |η zn ), (20) where v = {v 1 , v 2 , . . . } is the set of stick breaking weights. This is the likelihood of a typical "phone-loop" HMM [25] where z is the sequence of hidden states. As explained in [26], this interpretation of the model allows an efficient dynamic programming algorithm to evaluate all possible sequences of where θ cn zn is the vector of natural parameters of the c n th mixture components of the s n state, and the two factors on the right hand side are defined in (11) and (12). The transition probability combines the within and across units' transition in the following way: where the transition probability within a unit's HMM is fixed: p(s n |s n−1 ) = const and the probability of the unit index p(u l |v) is given by the stick-breaking process as defined in Section II-B: The priors over the stick-breaking process parameters v and the prior over the concentration parameter γ are given by: the prior over embeddings H is defined from the base measure: and its exact construction will be addressed in the next section. Finally, Figure 1 gives a graphical perspective of the complete model. It is composed of 3 layers of hidden variables: (i) the GMM components layer which acts as a quantization layer, i.e. it transduces a sequence of continuous features into a sequence of discrete symbols, (ii) the HMM states layer which captures the temporal dynamic of the data, and (iii) the acoustic units' layer which encodes phonetic information.

III. PRIOR OVER THE EMBEDDINGS
We have formulated a probabilistic interpretation of the AUD problem. From this, we have seen that 3 terms emerge: (i) the likelihood defining the acoustic model, (ii) the language model and (iii) the prior over the embeddings. We have detailed the two first terms in the previous section and, now, we draw our attention to the last term: the prior over the embeddings G 0 (η).

A. Conjugate prior
Early Bayesian AUD models [8], [11], [16] set G 0 (η) to be the conjugate prior of the conditional HMM likelihood: where ξ 0 and ϑ 0 are the natural parameters of the conjugate prior of the states' emission density. The conjugacy implies that the prior p(ω i ) over the natural form of the mixing weights is a Dirichlet distribution and the prior p(θ i,j ) over the natural form of the mean and the precision matrix (inverse of the covariance matrix) is a Normal-Wishart distribution. This choice is convenient since it greatly simplifies the inference; it is, however, difficult to control precisely which type of sounds the prior will emphasize. In previous works, p(ω i ) and p(θ i,j ) were set to be vague priors (i.e. priors that play a mininal role in the estimation of the posterior distribution) leading the AUD model to consider, say, the phone /aw/ and the sound of a slamming door as equally good candidate acoustic units.

B. Phonetic Subspace
Vague priors are easy to define but they fail to provide a reasonable selection of "good" candidates. Recent works [17], [18] have proposed to remedy this shortcoming by introducing phonetic subspace speaker subspace Fig. 2: Illustration of the concept of phonetic subspace: each phone is represented as a vector η encoding the parameters of a probabilistic model (an HMM in this example). Ideally, moving away from the subspace only changes the characteristics of the phone (speaker, channel, loudness, ...) but not the phone itself. For illustration purposes, the red line represents one of such factor of variability: the speaker subspace. Finally, not that in this example, the parameter space has only 2 dimensions (η 1 and η 2 ) but in practice it will have several thousands of dimensions.
subspace-based priors which act as informative (or educated) priors over the space of acoustic unit embeddings. These works rely upon the concept of phonetic subspace which we'll illustrate with an example. Let's consider that we fit an HMM to a set of recordings of the English phone /aw/ which gives the embedding vector η aw . Moving η aw in any direction in the embedding space will affect the parameters of the HMM and, consequently, the phone it represents. As an example, a particular displacement may lead to changing the phone /aw/ to /ow/. Moving the η aw even further will change the original /aw/ phone more profoundly and yield, say, the consonant /z/. In this thought experiment, we have assumed that there is a continuum between all phones or, expressed in another way, that we can smoothly transition from one phone to another. Generally, we can envision all the phones of a language as points on a low-dimensional manifold which represents this continuum. This manifold is depicted by the blue line in Figure 2 and it is what we call the phonetic subspace. Importantly, this concept of phonetic subspace is independent of the choice of the phone model: GMM, HMM, Linear Dynamical Model, etc. However, the type of model used will influence how well the continuity between phones is represented.

C. SHMM
The Subspace HMM (SHMM) [17] defines the base measure G 0 (η) as the probability distribution induced by the following sampling process: where e u is a Q-dimensional embedding of the acoustic unit u on the subspace, the weights matrix W and the bias vector b are the parameters of the phonetic subspace, and f (·) is a function that takes a real vector and projects it to the HMM parameter space. In this work, it is set such that: where diag(·) returns a diagonal matrix from an input vector, exp is the element-wise exponential function and exp{...} j is the jth element of the resulting vector. W s π is the subset of rows of matrix W assigned to the mixing weights π s of the sth HMM state. Matrices W s,c µ and W s,c Σ are similarly defined for the mean and covariance matrix of the cth Gaussian component and sth HMM state. Note that (36) holds for j ∈ {1, ..., K −1} and π K = 1 − K−1 k=1 π k . Importantly, the construction of an acoustic unit embedding η u relies upon a phonetic subspace parameterized by W and b. Since these parameters are unknown in practice, they need to be inferred prior to utilizing them for AUD. This issue will be addressed in Section IV-B.

D. H-SHMM
The SHMM introduced in the above section has made the implicit assumption that the phonetic subspace is universal, i.e. it is the same for all the languages. [18] argues that this assumption is unrealistic and proposes to have a languagespecific base measure G λ 0 (η) defined by the following generative process: This generative process incorporates a G-dimensional language embedding α λ which is used to built a language specific phonetic subspace by a linear combination of bases {M i } and {m i }. These bases define a hyper-subspace of languages, as depicted in Figure 3. Because of the hierarchical nature of the generative process, the resulting model is termed the Hierarchical Subspace HMM (H-SHMM).
The bases {M i } and {m i } are shared across languages and act as "template" phonetic subspaces. In this view, each language specific phonetic subspace is a weighted combination of these generic subspaces. Similar to the SHMM subspace parameters, these parameters are unknown in practice and need to be estimated prior the AUD task.

IV. INFERENCE
We now turn to the problem of inference for the SHMM and the H-SHMM. Since these models include many parameters, the derivation of the update equations is long and tedious. Therefore, we have opted to only give a general overview of the training in the main text with more technical details left to Appendix A.

A. Variational Bayes Inference
As discussed in Section II-A, from a Bayesian perspective, the AUD task amounts to finding the a posteriori distribution: Note that H refers to the acoustic unit parameters, incorporating the subspace parameters as well as the low-dimensional embeddings. Since the denominator p(X) = · p(X|·)p(·)d· is intractable, we resort to the Variational Bayes framework [27] to find an approximate posterior q(c, z, H, v, γ). This entails maximizing the following lower-bound: where we write: To be able to maximize (46), we use the following structured mean-field factorization: Fig. 3: Illustration of a hierarchical subspace model. For each language λ, acoustic unit embeddings (encoding the parameters of a probabilistic model) are assumed to live in a language-specific subspace of the total parameter space spanned by W λ . This subspace is given by a weighted sum of matrix bases M 1 , M 2 , . . . (shared across languages) and language-specific weights α λ :

Algorithm 1
Training of phone-loop model for acoustic unit discovery. Detailed coverage of the update equations can be found in Appendix A .

B. Learning from other languages
The idea behind the SHMM is to supply prior information via the subspace parameters {W, b} to the AUD system before observing the data. Hence, training the subspace parameters on the target language defeats the purpose of the model. In practice, we infer a set of variational posteriors q 0 ({z}), q 0 ({e u }) and q 0 (W, b) on phonetically transcribed source languages. This is the supervised phase of the training, where the system learns the notion of phone from transcribed data. At this stage, the phone-loop of the AUD model is replaced with a forced alignment graph since the variable u is observed in this case. Then, on the target language, we infer new variational posteriors q 1 ({z}), q 1 ({e u }) using q 0 (W, b). Note that q 0 (W, b) is not updated during this stage, but transferred from the source languages as is. The H-SHMM is trained with a similar procedure: first we estimate q 0 (M) on several languages and then we use this posterior to learn q 1 ({e u }, α λ ) (and the other variational posteriors) on the target language λ while keeping q 0 (M) fixed.

V. RESULTS
In this section, we experimentally validate the benefits of subspace models on the AUD task. In Sections V-A and V-B, we describe the experimental setup and the evaluation metrics respectively. Then, we analyze the improvement brought by the SHMM and the H-SHMM compared to a Bayesian HMM AUD baseline in Sections V-D and V-E. Finally, we compare the SHMM and the H-SHMM models with two alternative approaches: cross-lingual phonetic decoders, i.e. phone recognizers each trained on a different language than the target one, and neural-networks with discretization layers.
To account for model stochasticity, we run our AUD systems 5 times and we report the mean and standard deviation of the results.

A. Data and features
We use the following languages to evaluate the performance of our models: 1) Mboshi [29]: 4.4 hours with 5130 utterances by 3 speakers. 2) Yoruba [30]: 4 hours with 3583 utterances by 36 speakers. In keeping with the AUD problem definition, we assume that we lack any transcribed data at training time, and our test data constitute our training data. Specifically, we do not assume the existence of a separate, transcribed development set for the target language for hyper-parameter selection. Therefore, we train and test on the entirety of each corpus as we would have to do for a real target language.
In place of a language-specific development set, we use English (from TIMIT [31] excluding the sa utterances) as a development language. Any hyper-parameter selection is done by picking the model which maximizes the task metrics on this set, and we transfer the model directly to any new target languages. The use of English as a development language has the added advantage that it facilitates comparison with baselines that can only be constructed for English, e.g. because they require training data that is only available for English.
In addition to the target languages, we also need a set of source languages for training the subspace of the SHMM and the hyper-subspace of the H-SHMM. We use seven transcribed source languages: German, Spanish, French and Polish from Globalphone [32]; and Amharic [33], Swahili [34] and Wolof [35] from the ALFFA project [36]. For each of these, we use only a subset of 1500 utterances; the resulting durations are shown in Table I. The difference in durations is due to the varying length of utterances for each corpus. Finally, each system is trained on 13-dimensional MFCC features (12 coefficients and the per-frame energy) along with their first and second order derivatives.

B. Metrics
We use F-score and normalize mutual information (NMI) as the metrics for evaluating AUD performance. 1) F-score measures phone segmentation performance. We get precision and recall rates by comparing phone boundaries detected by the system of interest to reference phone boundaries with a tolerance ±20 milliseconds. We report the harmonic mean of precision and recall as the F-score. 2) Normalized mutual information measures phone clustering quality. We compute the NMI from a frame level alignment of discovered units U and actual reference phones P , resulting in a matrix containing the joint probabilities p(U, P ). From this, we compute NMI as: where H(·) is the Shannon entropy functional [37] and I(P ; U ) is the mutual information [37]. Since 0 ≤ I(P ; U ) ≤ min H(P ), H(U ) , the NMI takes on values between 0 and 100. An NMI of 0 is obtained when I(P ; U ) = 0 and the discovered acoustic units are completely unrelated to the actual phones. An NMI of 100 is obtained when I(P ; U ) = H(P ) = H(U ) which only occurs when discovered units have a one-toone correspondence with the actual phones. Note that the H(U ) term in the denominator penalizes representations with too many units. Without it, we could artificially inflate the NMI by increasing the number of units.

C. Hyper-parameters
Unless stated otherwise, the hyper-parameters of the SHMM and H-SHMM are set as follows: • each acoustic unit HMM has 3 states left-to-right topology with 4 Gaussians per state with diagonal covariance matrix.

D. SHMM
Our first experiment compares the effect of using an educated prior-as implemented by the SHMM-against the noninformative conjugate prior as described in Section III-A. We refer to the later model as the HMM-based AUD system or simply HMM. Our implementation of the HMM follows [11].
From Table II, we observe that the SHMM outperforms the HMM baseline in terms of clustering quality and segmentation accuracy. We also report the results of two oracle systems: • Topline-U, the unsupervised topline, is an SHMM AUD system whose phonetic subspace is pre-trained on the target language using the reference transcription. The concomitant phone embeddings are discarded and new embeddings are inferred with the AUD procedure without any transcription. This "cheating" experiment shows us the best performance achievable if we could estimate the perfect phonetic subspace. • Topline-S, the supervised topline, is an HMM phonerecognizer with a uniform phonotactic language model trained on the target data. This model reveals the best clustering results we could obtain by using an HMM to model each acoustic unit. In terms of clustering quality (NMI metric), we observe that Topline-U is quite close to Topline-S. This highlights the soundness of using a phonetic subspace. However, it also shows that estimating the phonetic subspace from other languages is not optimal and leaves room for improvement.
The goal of an educated prior is to provide the system with some information before observing the data. In the context of the SHMM, this prior information is the phonetic subspace which encodes the notion of phone for the AUD system. To verify that the phonetic subspace brings relevant information a priori, we report in Table III the performance of the HMM  and the SHMM AUD systems at initialization. For the SHMM, this means after the subspace has been pre-trained on the source languages; for the HMM, this means after random initialization of the variational posteriors). As expected, the SHMM has much better performance at initialization compared to the HMM system which has a vague prior.

E. H-SHMM
We have seen that building an AUD system with an educated prior such as the SHMM brings a significant improvement. This performance boost can be explained partly by the added information brought by the phonetic subspace. However, this information may not always be accurate: for instance, the set of languages used for learning the phonetic subspace may not be "relevant" (phonetically speaking) for the target languages. This phonetic mismatch between the source languages and the target language results in the observed performance gap between the SHMM model and Topline-U (see Table II).
As explained in previous sections, the H-SHMM attempts to reduce the mismatch between the source and target language by adapting the phonetic subspace to the target data. In Table IV, we compare the H-SHMM to the SHMM. We observe that if we update the SHMM phonetic subspace posterior q 0 (W, b) on the target language rather than freezing it as learned from the source languages, the clustering and segmentation performance degrades ("SHMM-finetune" in Table IV). The H-SHMM, on the other hand, by constraining the adaptation of the phonetic subspace by its hyper-subspace, successfully adapts the phonetic subspace on the target data. However, despite the improvement brought by the H-SHMM, our best system remains far from Topline-U suggesting that there is still potential for adapting the subspace to the target language.

F. Comparison with other methods
We have shown that subspace models, as implemented by the SHMM and the H-SHMM, offer a significant improvement over the Bayesian HMM baseline. We now broaden the comparison with non-Bayesian approaches.
1) Cross-lingual decoders: We compare our subspace AUD models against cross-lingual decoders. To make the comparison fair, the cross-lingual decoders are structurally equivalent to the AUD models: each of them is an HMM phone-loop (with the same number of Gaussians per state) trained on phonetically transcribed data. We use the same languages and data (Table I) as for estimating the phonetic subspace for the SHMM and H-SHMM models. Results are shown in Table V: we observe that these cross-lingual decoders are much less accurate both in terms of clustering and segmentation. Note that the SHMM and H-SHMM use the data of all source languages whereas the cross-lingual decoders are trained on a single language. To assess that the benefits of the subspace methods are not due to having more data, we report in Table VI the performance of the SHMM 2 using only one language to estimate the phonetic subspace, we observed that for any given language, the SHMM AUD system outperforms the crosslingual decoder trained on the same source languages.
Additionally, an interesting insight is highlighted by the results in Table VI: for Mboshi, we observe that training the phonetic subspace of the SHMM on a single source language is better than training on all source languages. This indicates that some combination of source languages can be detrimental for the AUD SHMM. However, the H-SHMM, which adapts the phonetic subspace (in an unsupervised fashion) to the target language achieves better results than any SHMM.
2) Neural-network based AUD systems: In recent years, several neural-network-based systems have been proposed for discovering acoustic units from speech. While architecture and objective function differ across models, all of them follow the same principle: an encoding-decoding architecture with one or more discretization hidden layers. We compare our subspacebased models against the following neural-network baselines: • VQ-VAE [38]: a variational auto-encoder with a quantization layer; variations of this model were successfully used for AUD by several teams in recent iterations of the Zero Resource Challenge [12], [7], [39], [40]. Keeping with our theme of using English as a development language, we tuned the VQ-VAE hyper-parameters to maximize the NMI on English and transferred them to the other languages • constrained VQ-VAE [41]: a recently proposed postprocessing method for VQ-VAE which encourages temporally consecutive frames to be quantized to the same class; this was shown to provide a significant improvement over the the plain VQ-VAE [41] • ResDAVEnet-VQ [14]: neural network with quantization layers trained to correlate images with their associated audio captions; we choose this baseline to compare our method against an AUD system with a weak supervision signal • VQ-WAV2VEC [13]: a convolutional neural network with a quantization layer trained with a contrastive prediction objective on the 960 hour Librispeech corpus [42]. The VQ-VAE 3 and the constrained VQ-VAE are trained on the same target data as the SHMM and the H-SHMM. For ResDAVENet and VQ-WAV2VEC, we used the pre-trained model directly and use their quantization output for evaluation purposes. Note that ResDAVENet and VQ-WAV2VEC were trained only on English data which explains some of the degradation in performance when they are used for AUD for other languages.
The results are presented in Table VII. We observe that the best performing neural network baseline is the constrained VQ-VAE, showing that a temporal constraint is an important feature in any AUD models. Nevertheless, the Bayesian subspace models perform significantly better.
The Bayesian AUD models may seem simple compared to large convolutional neural networks. However, they benefit from a well-structured prior which guides them during the clustering. Conversely, the neural network-based AUD models are very potent but lack structured priors and are easily trapped in sub-optimal solutions, limiting how well they can utilize their potential.

G. Effect of the subspaces dimensionality
In this last part, we provide an analysis of the effect of phonetic and language subspace dimensionality.
1) SHMM: In Figure 4a, we illustrate the effect of the subspace dimensionality for the SHMM. We observe that, in terms of clustering, the behavior varies by target language. For English and Yoruba, the optimal subspace dimension is around 250 while for Mboshi, it is between 50 and 100.
This performance variance highlights the major drawback of the SHMM: the assumption of a universal phonetic subspace. Indeed, we see that there is no unique setting that fits well for all possible languages.
Segmentation wise, a low (50-100) dimensional subspace leads to more accurate segmentation. This suggests that a coarser phonetic representation is preferable when the segmentation accuracy is concerned.
2) H-SHMM: The H-SHMM has two subspaces: the language subspace and the phonetic subspace. Figure 4b shows the performance of the H-SHMM as the language subspace dimension is varied (the phonetic subspace dimension is fixed to 100). We observe that having larger dimension is globally preferable though the effect is only significant for the Mboshi data. Notice that the curves in Figure 4b are somewhat noisy, indicating that the H-SHMM is affected by random initialization.
The performance of the H-SHMM as the phonetic subspace dimension is varied is shown in Figure 4c. In this experiment, the language dimension is fixed to 5. We observe that the behavior is now homogeneous across languages: both for the   clustering and segmentation the optimal phonetic subspace dimension is between 50 and 100 dimensions. This shows the benefit of adapting the phonetic subspace on the target language: we can have better clustering and segmentation while using lower dimension to represent each languagespecific phonetic subspace.

VI. CONCLUSION
This work provides a theoretical treatment of subspace models for the task of Acoustic Units Discovery (AUD). It shows how the paradigm of subspace models naturally fits within the non-parametric Bayesian framework: an educated prior is formed by constraining the base measure to a subspace that is estimated on phonetically transcribed data from a set of source languages. Thus, the acoustic unit parameters are constrained to live in a phonetic subspace forcing the model to learn units that resemble the phones of the source languages.
This work focuses on two specific models: the Subspace HMM (S-HMM) and the Hierarchical Subspace HMM (H-SHMM). The SHMM assumes that the phonetic subspace is language agnostic: it is the same for every language whereas the H-SHMM assumes that the phonetic subspace is language dependent and has to be adapted on the target language.
Experimental results show that, both the SHMM and the H-SHMM outperform state-of-the-art AUD baselines in terms of clustering quality and segmentation accuracy in three different languages: English, Yoruba and Mboshi. Furthermore, the H-SHMM proves to be superior to the SHMM which supports the idea that each language has a unique phonology that needs to be learn specifically.
Finally, the concept of subspace models for AUD can be expanded in several ways; we list here potential future research on subspace modeling for AUD.
• The quality of the phonetic subspace-how well the subspace models the continuum of phone in a languageis highly dependent on the choice of the acoustic model, an HMM in the present work. Building more refined generative models of phones would allow a qualitatively better acoustic unit embeddings. • This works uses a single subspace of all the phones assuming implicitly a continuum between any pair of phones. This continuity may not be relevant between phones of distinct category, e.g. vowels and fricatives.
To bypass this issue, one can have a specific subspace for different phonetic categories. This would have the added advantage of easing the interpretation of the final acoustic units (for instance an acoustic unit embedding on a vowel-specific subspace is a vowel) • Adding a speaker subspace to model explicitly the speaker variability would help the AUD model adapt to the speaker and avoid having speaker-specific clusters.

APPENDIX A INFERENCE
In this appendix, we detailed the optimization of the variational posterior defined in (47). Note that q(H) and q(v) are distributions over infinite set of variables and, therefore, cannot be used directly in any practical implementation. We derive the optimal factors ignoring this issue and we address it specifically in appendix A-D.
From (10), the expected likelihood has the following form: ln p(x n , c n |η zn ) q(η zn ) = ω zn q − A(ω zn ) q T (c n ) 1 In practice, the expectations are estimated empirically using the variational posterior q(η zn ) derived in appendix A-B.
Using (51), we derive the optimal posterior of the HMM state sequence: p(x n , c n |η zn ) q(c n |z n ) q(cn|zn)q(η zn ) For conciseness, we introduce the following placeholders: A zn−1,zn = ln p(z n |z n−1 , v) q(v) .
Rewriting (53) with φ n (z n ) and A zn−1,zn we get: The normalization constant ζ in (57) requires to sum over all possible state sequences z. Despite the astronomical number of possible sequences, this sum can be computed exactly and efficiently using dynamic programming [43], [44], [26].

B. Acoustic units' embeddings
We focus now on deriving the acoustic units' posterior q(H) using first the SHMM and then H-SHMM. In both cases, we assume the other variational factors q(c|z), q(z), q(v) and q(γ) to be fixed. 1) SHMM: Recall from section III-C that each acoustic unit vector is constructed from a low-dimensional embedding in a subspace. Because the prior and the likelihood are not-conjugate, we cannot obtain closed-form solution and, consequently, we add the following parametric constraints to the variational posterior: where ν is a vectorized form of W and b, and we optimize the empirical expectation of (46) with respect to the parameters ν and ξ: ln p(x n , c n |η zn,j ) q(cn|zn)q(zn) − D KL q(S j , E j )||p(S j , E j ) (59) (S j , E j ) = ν + exp{ξ} j , j ∼ N (0, I), where represents element-wise multiplication. In (59), the expectation of the log-likelihood can be computed exactly by using (51) and q(z n ): ln p(·|η zn,j ) q = q(z n ) ω zn,j −A(ω zn,j )      q(c 1 |z n ) . . . T (x n ) 1 .
2) H-SHMM: Optimization of the acoustic units' posterior in the H-SHMM is very similar to the SHMM case. However, we need to take into account that each subspace is language specific. Let's consider that we have a set of L languages and we would like to learn an inventory of acoustic units H λ for each language λ ∈ {1, . . . , L} . From the definition of the H-SHMM ((40)-(44)), we have: Similarly as before, we introduce the following parametric constraint: and we optimize the empirical expectation of (46): ln p(x λ n , c λ n |η λ zn,j ) q(c λ n |z λ n )q(z λ n ) using stochastic gradient ascent.

C. Update of the stick-breaking process
We address now the last part of the inference: the update of the variational posteriors of the stick-breaking process. For this stage, we consider that the variational posteriors q(c|z), q(z), q(H) are fixed. The following updates equations are based on the the variational treatment of the stick-breaking process presented in [19].
1) Stick-breaking parameters: We first start to estimate the optimal q * (v) assuming q(γ) is fixed: From (22) and (23) we have p(z|v) = p(s|u)p(u|v) which leads to: β k = γ q(γ) + ui∈u 1[u i > k] q(u) ), where 1[...] is the indicator function. The expectations in (69) adn (70) requires summing over all the units of all possible sequences u from q(z). Once again, this large summation can be calculated exactly using dynamic programming [26].
2) Concentration parameters: Finally, the optimal variational posterior q * (γ) while assuming q(v) is fixed is given by:

D. Truncation
In our derivation of the optimal variational posteriors, we have ignored issues raised by the sum or product of infinitely many terms. Following [19], we address this by introducing a truncation parameter τ such that q(v τ = 1) = 1. This approximation, motivated by the almost sure truncation of the Dirichlet Process [45], ensures that q(u i > τ ) = 0, ∀i and, therefore, truncates all infinite sum and product to τ terms in the solution of the optimal variational posteriors.

E. Initialization
Because of the constraints imposed on on the variational posterior (47), the optimization is prone to converge to a local optimum. To avoid this, we initialize the model for the supervised phase of the training by the following procedure: 1) we train a standard HMM with C-components GMM emissions for each phone using the Baum-Welch training and the provided phonetic transcription. 2) for each state of each phone's HMM a) we set the mixing weights π such that π k = 1 C b) compute the per-state global meanμ = 1 C C c=1 µ c and global diagonal covariance matrix Σ = 1 C C c=1 Σ c c) we set each Gaussian component to have meanμ and covariance matrixΣ. 3) using the HMM estimated in step 1, we initialize q * (z n ) using the Baum-Welch algorithm and we set q * (c n |z n ) = const 4) we set ν = 0 and exp{ξ} = 1 D 1 then, we burn-in the model by optimizing ν and ξ until convergence while keeping other factors fixed. The initialization for the unsupervised phase-the actual AUD task-is easier: • we initialize the posterior of the stick-breaking process by setting q * (v) := p(v) and q * (γ) • we use the variational posteriors estimated during the supervised phase to initialize the new variational posteriors as explained in section IV-B. • finally, we set ν, ξ such that q 1 (E) = N (0, I) (respectively q 1 (E, α) = N (0, I) for the H-SHMM).