Sample-Efficient Unsupervised Domain Adaptation of Speech Recognition Systems: A Case Study for Modern Greek

Modern speech recognition systems exhibit rapid performance degradation under domain shift. This issue is especially prevalent in data-scarce settings, such as low-resource languages, where the diversity of training data is limited. In this work, we propose M2DS2, a simple and sample-efficient fine-tuning strategy for large pre-trained speech models, based on mixed source and target domain self-supervision. We find that including source domain self-supervision stabilizes training and avoids mode collapse of the latent representations. For evaluation, we collect HParl, a 120-hour speech corpus for Greek, consisting of plenary sessions in the Greek Parliament. We merge HParl with two popular Greek corpora to create GREC-MD, a test-bed for multi-domain evaluation of Greek ASR systems. In our experiments, we find that, while other Unsupervised Domain Adaptation baselines fail in this resource-constrained environment, M2DS2 yields significant improvements for cross-domain adaptation, even when only a few hours of in-domain audio are available. When we relax the problem in a weakly supervised setting, we find that independent adaptation for audio using M2DS2 and language using simple LM augmentation techniques is particularly effective, yielding word error rates comparable to the fully supervised baselines.


I. INTRODUCTION
Automatic Speech Recognition (ASR) models have matured to the point where they enable commercial, real-world applications, e.g., voice assistants and dictation systems, making them one of machine learning's success stories. However, the performance of ASR systems rapidly deteriorates when the test data domain differs significantly from the training data. Domain mismatches can be caused by differences in the recording conditions, such as environmental noise, room reverberation, speaker and accent variability, or shifts in the target vocabulary. These issues are exacerbated in the case of low-resource languages, where diversity in the training data is limited due to poor availability of high-quality transcribed audio. Therefore, specialized domain adaptation approaches need to be employed when operating under domain shift.
Unsupervised Domain Adaptation (UDA) methods are of special interest, as they do not rely on expensive annotation of domain-specific data for supervised in-domain training. In contrast to supervised approaches, where the existence of labeled data would allow training domain-specific models, UDA methods aim to leverage data in the absence of labels to improve system performance in the domain of interest [1], [2]. In the context of speech recognition the importance of UDA is amplified, as the transcription and alignment process is especially expensive and time-consuming. Adaptation methods have been explored since the early days of ASR, at different levels of the system and different deployment settings [3]. UDA has been used to improve the robustness of ASR on a variety of recording conditions including far-field speech, environmental noise and reverberation [4], [5], [6]. Furthermore, UDA has been used for speaker adaptation, and to improve performance under speaker, gender and accent variability [7], [8]. UDA has also been employed for multilingual and cross-lingual ASR, in order to improve ASR models for low-resource languages [9], adapt to different dialects [10], and even train speech recognition systems for endangered languages [11].
Modern ASR pipelines increasingly rely on end-to-end neural networks, e.g., [18], [19], or large pretrained models with self-supervised objectives [20], [21]. The key approaches employed for UDA of end-to-end ASR models can be grouped in three categories, namely, teacher-student learning [10], domain adversarial training [22], and target domain self-supervision [23]. The benefit of these techniques is that they do not require any special knowledge about the source or the target domain. This makes end-to-end UDA approaches versatile and applicable to a larger array of adaptation scenarios. In particular, adaptation through self-supervision has been shown to be a robust, simple and efficient technique for adaptation of state-of-the-art speech models [24].
Here, we leverage in-domain self-supervision to propose the Mixed Multi-Domain Self-Supervision (M2DS2) finetuning strategy, enabling sample-efficient domain adaptation of wav2vec2 [20] based speech recognition models, even when available in-domain data are scarce. Our key contributions are:
1) Inspired by recent advances on UDA for Natural Language Processing systems [45], we propose in Section III a finetuning strategy for speech models, where the self-supervised objective is based on a contrastive loss. Contrary to prior works, which leverage only in-domain self-supervision, we find that in this contrastive setting target-only self-supervision leads to mode collapse of the latent representations, and mixed source and target domain self-supervision is essential. We demonstrate this empirically in Section VII-B.
2) We collect and curate HParl, the largest publicly available 1 speech corpus for Greek, collected from plenary sessions in the Greek Parliament between 2018 and 2022. We establish a data collection, pre-processing and alignment pipeline that can be used for continuous data integration, as the parliamentary proceedings get regularly uploaded. We provide a detailed description of our data collection process and the dataset statistics in Section IV-A. HParl is merged in Section IV with two popular Greek corpora (Logotypografia and Common Voice) to create GREC-MD, a testbed for multi-domain evaluation of ASR systems in Greek.
3) We demonstrate in Section VII that, while other baselines fail at UDA in our resource-constrained setting, M2DS2 can improve model performance in the target domain in multiple adaptation scenarios. Special emphasis is given to the sample efficiency of our approach in Section VII-A, where we demonstrate successful adaptation even when we reduce the available in-domain data.
4) When we relax the problem to a weakly supervised adaptation setting, where some in-domain text is available but the pairing between audio and text is unknown, we find in Section VIII that M2DS2 can be effectively combined with simple N-gram adaptation techniques to get performance comparable to the fully supervised baseline. Furthermore, we find that a simple text augmentation approach, based on perplexity filtering of a large corpus, can produce strong adaptation results, even for small amounts of in-domain text.
Additionally, we provide a formulation of the UDA problem for ASR in Section II-A and link prior works to this formulation in Sections II-B, II-C and II-D. We provide detailed experimental settings for reproducibility in Section V, and an upper-bound estimation for UDA performance with fully supervised finetuning in Section VI.

II. BACKGROUND
We start by formally defining the Unsupervised Domain Adaptation (UDA) problem. Initially, we formulate the problem in a classification setting and then we extend it for speech recognition. We then provide an overview of different adaptation approaches in the literature, and link each approach to the UDA problem formulation. Table I presents a summary of the key adaptation settings and applications that are explored in the literature. We see that a relatively small number of methods, and their variants, is used to address multiple real-world ASR problems, for example, cross-lingual, accent, speaker and noise adaptation. Furthermore, while the majority of the works focus on the English language, there is an effort to explore other popular languages, e.g., Mandarin, and under-resourced languages, e.g., Ainu, Somali, etc.

A. Problem Definition
Formally, the problem of UDA can be defined as follows. Let $X \subseteq \mathbb{R}^n$ be a real-valued space that consists of $n$-dimensional feature vectors $x \in X$, and $Y$ a finite set of labels $y \in Y$, i.e., $Y = \{1, 2, \ldots, L\}$. Furthermore, assume two different distributions, i.e., the source domain distribution $S(x, y)$ and the target domain distribution $T(x, y)$, defined on the Cartesian product $X \times Y$.
The goal is to train a model that learns a mapping from feature vectors $x_T$ to their respective labels $y_T$ for samples drawn from the target distribution, $(x_T, y_T) \sim T$.
At training time we have access to samples from the source distribution $S(x, y)$ and the marginalized target distribution $T(x)$, i.e., no target labels are provided. We define the training dataset $D$ as the concatenation of the source and target training sets, $D = (D_S, D_T)$. $D_S$ and $D_T$ are defined as sequences of tuples, i.e.,
$$D_S = \{(x_i, y_i)\}_{i=1}^{N}, \; (x_i, y_i) \sim S(x, y), \qquad D_T = \{(x_j, \emptyset)\}_{j=1}^{M}, \; x_j \sim T(x),$$
where we draw $N$ samples from $S(x, y)$ and $M$ samples from $T(x)$. Finally, we augment each tuple in $D$ with a binary domain indicator $\mathbb{1}_i$, which marks whether sample $i$ was drawn from the source or from the target domain.
1) Unsupervised (Acoustic) Adaptation for ASR: The above definition can be directly extended to the case of speech recognition, with some modifications. In detail, we modify the feature space $X$ to be the set of (finite) sequences of real-valued feature vectors $(x_k)_{k \in \mathbb{N} \setminus \{\infty\}}$. Furthermore, the label space $Y$ is modified to be the set of sequences $(y_m)_{m \in \mathbb{N} \setminus \{\infty\}}$, where $Y = (\{1, 2, \ldots, L\})^*$ contains finite-length sequences over a finite lexicon. For CTC training we make the assumption that $k > m$ for any sample $(x_k, y_m)$, i.e., feature sequences are longer than their respective label sequences [46]. The rest of the definitions need no modifications.
2) Unsupervised (Language) Adaptation for ASR: Adaptation for ASR systems can also be performed at the language level, i.e., the label space. In this setting, we assume that the target domain samples are drawn from the marginalized target distribution $T(y)$. The target dataset $D_T$ now consists of tuples in the form $(\emptyset, y_i)$, where $y_i$ is the label word sequence $(y_m)_{m \in \mathbb{N} \setminus \{\infty\}}$ for the $i$-th sample.
3) Weakly supervised Adaptation for ASR: The last setting we explore is the case where both audio and language in-domain samples are available, but the mapping between them is unknown. This situation can be encountered in real-world settings, e.g., when in-domain audio and text are collected independently. For example, consider the case where audio clips from newscasts are collected, along with contemporary newspaper articles. Another example is the case where long audio clips are available together with their transcriptions, but without fine-grained time alignments 2. In this case the target domain samples are drawn independently from the marginalized distributions $T(x)$ and $T(y)$, and the target dataset $D_T$ consists of tuples in the form $(x_i, \emptyset)$ and $(\emptyset, y_i)$.
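The three adaptation settings above differ only in which half of a target tuple is missing. As a minimal sketch, the data layout could look as follows. The names `Sample`, `build_dataset` and `ctc_compatible`, the 0/1 encoding of the domain indicator, and the use of `None` for the empty element are illustrative assumptions, not part of the formulation itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

SOURCE, TARGET = 0, 1  # assumed encoding of the domain indicator

FeatureSeq = Sequence[Sequence[float]]  # x: k feature vectors of dimension n
LabelSeq = Sequence[int]                # y: m label ids from {1, ..., L}


@dataclass(frozen=True)
class Sample:
    x: Optional[FeatureSeq]  # None plays the role of the empty element
    y: Optional[LabelSeq]
    domain: int


def build_dataset(
    source_pairs: Sequence[Tuple[FeatureSeq, LabelSeq]],
    target_audio: Sequence[FeatureSeq] = (),
    target_text: Sequence[LabelSeq] = (),
) -> List[Sample]:
    """Concatenate D_S and D_T, tagging every tuple with its domain."""
    data = [Sample(x, y, SOURCE) for x, y in source_pairs]
    data += [Sample(x, None, TARGET) for x in target_audio]  # acoustic: (x_i, None)
    data += [Sample(None, y, TARGET) for y in target_text]   # language: (None, y_i)
    return data


def ctc_compatible(sample: Sample) -> bool:
    """CTC assumption: the feature sequence is longer than the label sequence."""
    return (
        sample.x is not None
        and sample.y is not None
        and len(sample.x) > len(sample.y)
    )


# Weakly supervised setting: unpaired target audio and target text.
source = [([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [7, 3])]
dataset = build_dataset(source, target_audio=[[[0.9, 0.8]]], target_text=[[4, 4, 2]])
```

Passing only `target_audio` gives the acoustic setting, only `target_text` the language setting, and both (unpaired) the weakly supervised one.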

B. Teacher-Student Models
Teacher-Student learning, or self-training, is one of the earliest methods in semi-supervised learning [47]-[49]. The key idea is to reduce the problem of unsupervised learning of the task at hand in the target domain to a supervised one. The general methodology is to train a teacher model $g_S$ using the labeled data in the source domain $D_S$, and then use this for inference on the target domain to produce pseudolabels $\hat{y}_i = g_S(x_i)$, $x_i \sim T(x)$. The target domain dataset $D_T$ is augmented with these silver labels, to contain tuples $(x_i, \hat{y}_i)$. Finally, a student model $g_T$ is trained in a supervised fashion, using the augmented $D_T$ or a combination of $D_S$ and $D_T$. This process is usually repeated, with the student model serving as the teacher model for the next iteration, until no further improvement is observed. More recently, soft target Teacher-Student learning has been explored for ASR [26], [31], [50], where the KL divergence between the teacher and student output label distributions is used as the loss function.
Being trained only on the source domain data, the teacher model is susceptible to error propagation. Filtering is a commonly used technique to achieve the right balance between the size of the target domain used for training the student model and the noise in the pseudolabels. Confidence scoring based on the likelihood is usually applied, discarding those utterances for which the hypothesized labels are untrustworthy [51]. In [25] dropout is used to measure the model uncertainty. The agreement between model predictions with and without dropout is used for confidence scoring. In [23] a multi-task training objective with a confidence loss is applied to minimize the binary cross entropy between the estimated confidence and the binary target sequence. In order to learn more robust and generalizable features from the teacher model, Noisy Student Training (NST) has been proposed in [52]. The teacher model generates pseudolabels for $D_T$, while the student models are trained on a heavily augmented version of $D_T$ [52]. In [52], [53] the augmentation of the input target data is performed with SpecAugment [54], while in [29] a spectrum frequency augmentation is performed.
In [4] Teacher-Student learning with soft labels is introduced for ASR to tackle noisy, far-field, and children's speech. In [5], this approach is extended for LF-MMI based models and used for noisy, far-field and bandwidth adaptation. In [29] a weighted sum of hard and soft target cross entropy losses is used for adaptation to Japanese dialects and children's speech. Ramabhadran et al. [31] propose self-adaptive distillation, and a method for distilling from multiple teachers that is applied across several multilingual ASR systems for different language groups. A comparison between soft and hard targets for RNN-T models [19] showed that soft targets perform better when both the teacher and student models have the same architecture. Otherwise, hard targets are superior [50].
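The hard-label variant of the teacher-student loop with confidence filtering, as described in this subsection, can be sketched in a few lines. This is an illustrative outline only: `train` and the teacher callable are hypothetical stand-ins for real ASR training and decoding, and the fixed number of rounds and the confidence threshold are assumed values rather than settings from any of the cited works.

```python
from typing import Any, Callable, List, Sequence, Tuple

# A model maps an input to (hypothesis, confidence in [0, 1]).
Model = Callable[[Any], Tuple[Any, float]]
Pair = Tuple[Any, Any]


def pseudolabel(
    teacher: Model, target_inputs: Sequence[Any], min_confidence: float
) -> List[Pair]:
    """Label target inputs with the teacher, dropping untrustworthy hypotheses."""
    kept = []
    for x in target_inputs:
        hypothesis, confidence = teacher(x)
        if confidence >= min_confidence:
            kept.append((x, hypothesis))
    return kept


def self_training(
    train: Callable[[List[Pair]], Model],
    source_data: Sequence[Pair],
    target_inputs: Sequence[Any],
    rounds: int = 2,
    min_confidence: float = 0.5,
) -> Model:
    """Train a teacher on D_S, then iterate: pseudolabel D_T, retrain a student."""
    model = train(list(source_data))
    for _ in range(rounds):
        silver = pseudolabel(model, target_inputs, min_confidence)
        # The student is trained on D_S plus the filtered silver labels and
        # becomes the teacher of the next round.
        model = train(list(source_data) + silver)
    return model
```

Raising `min_confidence` trades target-domain coverage for cleaner pseudolabels, which is the balance that filtering is meant to control.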

C. Domain Adversarial Training
Domain Adversarial Training (DAT) was initially introduced for image classification [55]. The key idea is to train a model that learns deep features that solve the task at hand in the source domain, while being invariant with respect to the domain shift. Concretely, the model is trained end-to-end using a combination of the supervised task loss $L_t$, learned on $D_S$, and the domain discrimination loss $L_a$, i.e., $L = L_t - \alpha L_a$. The loss $L_a$ is binary cross-entropy, trained for domain discrimination using the tuples $(x_i, \mathbb{1}_i)$. Notice that the minus sign in the loss indicates adversarial learning, i.e., the model should learn features that cannot discriminate between domains, while solving the task.
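To make the sign convention concrete, the snippet below evaluates $L = L_t - \alpha L_a$ for given discriminator outputs. It is only a numerical illustration: the function names are hypothetical, the label 1 is assumed to denote the target domain, and an actual DAT system would typically realize the adversarial term through a gradient reversal layer inside the network rather than by evaluating the scalar objective directly.

```python
import math
from typing import Sequence


def binary_cross_entropy(prob_target: float, is_target: int) -> float:
    """BCE for one sample; prob_target is the discriminator's P(domain = target)."""
    eps = 1e-12
    p = min(max(prob_target, eps), 1.0 - eps)
    return -(is_target * math.log(p) + (1 - is_target) * math.log(1.0 - p))


def dat_objective(
    task_loss: float,
    domain_probs: Sequence[float],
    domain_labels: Sequence[int],
    alpha: float,
) -> float:
    """L = L_t - alpha * L_a, with L_a averaged over the batch."""
    l_a = sum(
        binary_cross_entropy(p, d) for p, d in zip(domain_probs, domain_labels)
    ) / len(domain_labels)
    return task_loss - alpha * l_a


# Same task loss, two feature extractors: one whose features leave the
# discriminator at chance level, one whose features are easily separable.
confused = dat_objective(1.0, [0.5, 0.5], [0, 1], alpha=0.1)
separable = dat_objective(1.0, [0.01, 0.99], [0, 1], alpha=0.1)
```

Here `confused` is lower than `separable`: for equal task loss, the objective rewards features on which the domain discriminator performs poorly.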
In [6] DAT is employed for noise adaptation on a noise corrupted version of WSJ [56] as the target dataset. Using the Aurora-4 [57] dataset, which has labels associated with the noise type, Serdyuk et al. [33] train an adversarial noise classifier. In [8] and [39] DAT is utilized for accent adaptation for Mandarin and English respectively. Anoop C.S. et al. [9] propose DAT to address the scarcity of data in low-resource languages which share a common acoustic space with a high-resource language, namely Sanskrit and Hindi. They empirically demonstrate the effectiveness of adversarial training, presenting experiments with and without the reversal of the domain classification loss.

D. Leveraging In-domain Self-supervision
These lines of work have roots in Natural Language Processing tasks [45], [58], and explore domain adaptation by leveraging the in-domain data $D_T$ for self-supervised learning. The core focus is domain adaptation of large pre-trained models, e.g., [59], and self-supervision is achieved by use of the pre-training self-supervised loss $L_s$. This process can either take place in stages, via continual pre-training [58], or by constructing a multitask objective $L = L_t + \alpha L_s$, as in [45].
Continual Pre-Training (CPT) has been explored for adaptation of ASR models. Robust wav2vec2 [24] explores the effectiveness of CPT for domain adaptation, indicating the importance of utilizing unlabeled in-domain data. In CASTLE [42], CPT is combined with an online pseudolabeling strategy for domain adaptation of wav2vec2. Cross-dataset evaluation for popular English speech corpora indicates that CPT helps to reduce the error rate in the target domain. In [43] and [11] CPT is utilized for cross-lingual adaptation of wav2vec2 for Korean and Ainu respectively. Notably for Ainu, which is an endangered language, CPT has resulted in significant system improvement. DeHaven and Jayadev [44] compare CPT and pseudolabeling for adapting XLSR-53 to four under-resourced languages, i.e., Georgian, Somali, Tagalog and Farsi. They find that both approaches yield similar improvements, with CPT being the more computationally efficient approach.
Fig. 1 (caption, partial): On the right, we see the proposed domain-adaptive finetuning stage, where the speech recognition task is learned using transcribed source domain data, while adaptation to the target domain is performed by including the self-supervised loss over (audio-only) source and target domain data.
While CPT yields significant improvements in a variety of tasks, one common theme in these works is the assumption of hundreds or thousands of hours of available in-domain data, mostly from online resources, e.g., YouTube. This can be infeasible when we consider more niche adaptation settings, or possible privacy concerns, e.g., how would one collect 1000 hours of psychotherapy sessions in Greek? In this work, we explore domain adaptation methods in a more resource-constrained environment.

III. DOMAIN ADAPTATION THROUGH MULTI-DOMAIN SELF-SUPERVISION
The proposed approach is based on end-to-end adaptation of a large pre-trained speech model during the finetuning phase, by including in-domain self-supervision. We extend UDALM [45], which has shown promise for NLP tasks, to the adaptation of wav2vec2 based acoustic models, and specifically XLSR. We focus on the problem of UDA in the context of a low-resource language, i.e., Greek. The key finding of our exploration is that a straightforward extension of UDALM, i.e., using only target domain self-supervision, underperforms in this setting, and use of both source and target domain data is essential for successful adaptation. In this section, we first present a quick overview of the XLSR-53 training procedure, and then outline the proposed domain adaptation approach, which is shown in Fig. 1.

A. XLSR-53
XLSR-53 [21] is a massively pre-trained speech model, trained on 56,000 hours of multilingual speech, covering 53 languages. The model is based on wav2vec2 [20], which is composed of a multi-layer convolutional feature encoder that extracts audio features $z_t$ from the raw audio, and a transformer context encoder that maps the latent audio features to the output hidden states $c_t$. Each latent feature $z_t$ corresponds to 25 ms of audio with stride 20 ms. A contrastive objective $L_c$ is used for pre-training. For this, product quantization [60] is applied to the features $z_t$, and then a discrete approximation of $z_t$ is obtained by sampling from a Gumbel-softmax distribution [61], to obtain discrete code vectors $q_t$, organized into $G = 2$ codebooks with $V = 320$ vocabulary entries each. The contrastive loss aims to identify the correct code vector for a given time step, among a set of distractors $Q_t$, obtained through negative sampling from other timesteps. To avoid mode collapse, a diversity loss $L_d$ is included, which maximizes the entropy of the averaged softmax distribution $\bar{p}_g$ over the code vector entries. The total loss is:
$$L_s = L_c + 0.1 \, L_d \qquad (3)$$
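The two pre-training terms described above can be illustrated with a small, dependency-free sketch. It follows the general wav2vec2 recipe (cosine similarity between the context vector and candidate code vectors, negative entropy of the averaged codebook distribution), but the function names, the temperature value and the diversity weight are assumptions made for the example, and real implementations operate on batched tensors.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(
    context: Vector,
    target_code: Vector,
    distractors: Sequence[Vector],
    temperature: float = 0.1,  # assumed value
) -> float:
    """-log softmax score of the true code vector among true code + distractors."""
    logits = [cosine(context, q) / temperature for q in [target_code, *distractors]]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]


def diversity_loss(avg_probs: Sequence[Sequence[float]]) -> float:
    """Negative entropy of the averaged softmax over V entries, averaged over G
    codebooks; minimizing it pushes codebook usage towards uniform."""
    n_groups, n_entries = len(avg_probs), len(avg_probs[0])
    neg_entropy = sum(p * math.log(p) for group in avg_probs for p in group if p > 0)
    return neg_entropy / (n_groups * n_entries)


def self_supervised_loss(l_c: float, l_d: float, diversity_weight: float = 0.1) -> float:
    return l_c + diversity_weight * l_d  # diversity_weight is an assumed setting


uniform = [[0.25] * 4, [0.25] * 4]          # every code vector used equally
collapsed = [[1.0, 0.0, 0.0, 0.0]] * 2      # a single code vector per codebook
```

With these toy inputs, `diversity_loss(uniform)` is negative while `diversity_loss(collapsed)` is zero, so the diversity term penalizes exactly the mode collapse that the text refers to.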

B. Domain Adaptive finetuning for Contrastive Learning of Speech Representations
Fig. 1 shows the proposed finetuning process. The key intuition is that we want the model to synergistically learn the task at hand (in our case ASR), while being adapted to the target domain by in-domain self-supervision. On the left we see the general pre-training stage of XLSR-53, which is pre-trained on 56K hours of multilingual audio corpora using the contrastive pre-training objective. On the right we see the proposed finetuning stage, which is inspired by [45].
During finetuning we form a mixed objective function:
$$L = L_{CTC}(x_s, y_s) + \alpha L_s(x_s) + \beta L_s(x_t), \qquad (4)$$
where $(x_s, y_s) \sim S(x, y)$, $x_t \sim T(x)$, $L_{CTC}$ is the CTC objective function, optimized using transcribed source domain data, and $L_s$ is the contrastive loss from Eq. (3). We scale the contribution of each term using hyper-parameters $\alpha$ and $\beta$.
Note that contrary to [45], who use only in-domain self-supervision, we leverage both source and target domain samples for the mixed self-supervision. We find that this is essential in our case to avoid mode collapse, i.e., the model using only a few of the available discrete code vectors. Simultaneous self-supervision on both the source and target data alleviates mode collapse by anchoring the target code vector space to have a similar structure as the source code vectors.
Hence we refer to this approach as Mixed Multi-Domain Self-Supervision (M2DS2).
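As a minimal sketch of how the mixed objective combines its terms, the function below adds the CTC loss on transcribed source data to the weighted self-supervised losses on source and target audio. The function name is hypothetical, the three loss values are assumed to be computed elsewhere, and the defaults reuse the weights reported in Section V under the assumption that $\alpha$ scales the source term and $\beta$ the target term.

```python
def m2ds2_objective(
    ctc_loss_source: float,
    ssl_loss_source: float,
    ssl_loss_target: float,
    alpha: float = 0.01,  # assumed to weight the source-domain self-supervised loss
    beta: float = 0.02,   # assumed to weight the target-domain self-supervised loss
) -> float:
    """Mixed objective: supervised CTC on source + self-supervision on both domains."""
    return ctc_loss_source + alpha * ssl_loss_source + beta * ssl_loss_target


# Mixed self-supervision versus the target-only variant (alpha = 0), which the
# text reports to be prone to mode collapse.
mixed = m2ds2_objective(1.0, 100.0, 50.0)
target_only = m2ds2_objective(1.0, 100.0, 50.0, alpha=0.0)
```

Setting `alpha=0.0` recovers target-only self-supervision, which makes the two variants easy to compare within the same training loop.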

IV. THE GREC-MD CORPUS
For our experiments we compose a speech corpus for the Greek language that is suitable for multi- and cross-domain evaluation. The GREC-MD corpus contains 206 hours of Greek speech. Audio is segmented into individual utterances and each utterance is paired with its corresponding transcription. Table II summarizes the included sub-corpora, as well as the train, development and test splits. The dataset is constructed with three core principles in mind:
1) Data Volume: We collect the largest publicly available speech recognition corpus for the Greek language, able to scale to hundreds of hours of transcribed audio.
2) Temporal Relevance: Language changes over time. We aim at an up-to-date corpus that encompasses the latest terms and topics that appear in daily speech.
3) Multi-Domain Evaluation: Single domain evaluation can lead to misleading estimations of the expected performance for ASR models. For example, state-of-the-art ASR models [27] achieve under 5% Word Error Rate (WER) on Librispeech [62] test sets, but this is an over-estimation of system performance in the field. This is exacerbated when considering different acoustic conditions or terminology. We consider multi-domain evaluation essential when developing and deploying real-world ASR models.
To satisfy the first two points, we collect data from a public, continuously updated resource, i.e., the Hellenic Parliament Proceedings, where recordings of the parliamentary sessions are regularly uploaded. The benefit of using this resource is the straightforward collection of a continuously growing, multi-speaker corpus of transcribed audio that is always up-to-date, as the parliamentary discussions revolve around current affairs. We refer to this corpus as HParl. For the multi-domain evaluation, we merge HParl with two publicly available corpora that have different acoustic and language characteristics. We refer to the merged, multi-domain corpus as GREC-MD. In this Section, we describe the collection and curation process of HParl, and present the relevant statistics for the experiments.

A. Collection and Curation of HParl
Modern technological advances allow for more direct government transparency, through the commodification of storage and internet speeds. In this spirit, the records of plenary sessions of the Hellenic Parliament are made publicly available, for direct access through a webpage 3. The available video recordings date back to 2015. For each plenary session, a video recording is uploaded, along with a full transcription that is recorded verbatim, and in real time, by the parliament secretaries. For the creation of HParl, we build a web crawler that can traverse and download the video recordings, along with the transcriptions, from the official website. The collection process is parallelized over multiple threads, and parameterized by a range of dates and, optionally, a target corpus size in GB or in hours. For this version of HParl, we collect the plenary sessions in four date ranges, as described in Table III. The majority of the collected sessions are from 2019, but we also include sessions from 2018 and 2022 to include coverage of different topics. The individual components of the HParl curation pipeline are: Audio Pre-processing, Text Pre-processing, Alignment, Post-processing, and Dataset Splitting.
1) Audio Pre-processing: Fig. 2 shows the layout of the Hellenic Parliament Chamber. Plenary sessions mainly take place in this room, or in the secondary House Chamber, which has a similar setup but is smaller in size. Because of the room and microphone characteristics, the captured audio in the video streams contains reverberation, due to sound reflections. We employ a light preprocessing pipeline, by passing the input video streams through FFmpeg, and converting them to a monophonic, lossless audio format at 16000 Hz sampling rate. The resulting audio is not passed through any de-reverberation or speech enhancement software. The resulting audio files have a minimum, average and maximum duration of 6 minutes, 6 hours and 16 hours respectively.
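The conversion step can be sketched as a thin wrapper around the FFmpeg command line. The options used here (`-vn` to drop the video stream, `-ac 1` for mono, `-ar 16000` for the sampling rate, `-c:a pcm_s16le` for 16-bit PCM) are standard FFmpeg flags, but the choice of WAV/PCM as the lossless container, the function names and the file names are assumptions for illustration; the exact invocation used for HParl is not specified in the text.

```python
import subprocess
from typing import List


def ffmpeg_command(video_path: str, audio_path: str, sample_rate: int = 16000) -> List[str]:
    """Build an FFmpeg call that extracts mono, 16-bit PCM audio from a video."""
    return [
        "ffmpeg",
        "-i", str(video_path),   # input video stream
        "-vn",                   # discard the video track
        "-ac", "1",              # downmix to a single channel
        "-ar", str(sample_rate), # resample to the target sampling rate
        "-c:a", "pcm_s16le",     # lossless 16-bit PCM (assumed output codec)
        str(audio_path),
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run the conversion; requires the ffmpeg binary to be installed."""
    subprocess.run(ffmpeg_command(video_path, audio_path), check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
command = ffmpeg_command("plenary_session.mp4", "plenary_session.wav")
```

Building the argument list separately from running it keeps the command easy to log and to test without invoking FFmpeg.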
2) Text Pre-processing: The text files contain full, word-by-word transcription of the speeches and questions asked by members of the audience, as well as extra annotations made by the parliament secretaries. Some annotations are relevant, i.e., the speaker name, while others are plain text descriptions of events happening during the session and need to be filtered out (e.g., "The session is interrupted for a 15 minute break"). We use a rule-based system, based on regular expressions, that filters the unnecessary information, keeping only the transcriptions and the speaker names. The speaker labels are created by transliterating their names and roles from Greek to Greeklish using the "All Greek to Me!" tool [63]. Text is lower-cased and normalized to remove multiple whitespaces. The result is a text file containing the raw transcriptions, and a mapping from speaker labels to their respective text parts.
3 https://www.hellenicparliament.gr/en/
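A rule-based filter of this kind could look roughly like the sketch below. The actual layout of the parliamentary transcripts is not described in the text, so the conventions assumed here (event annotations enclosed in parentheses, speaker turns introduced as `NAME: text`) and all names in the code are hypothetical; a line of speech that itself contains a colon would need extra handling.

```python
import re
from typing import Iterable, List, Optional, Tuple

ANNOTATION = re.compile(r"\([^)]*\)")  # assumed: events are written in parentheses
SPEAKER_TURN = re.compile(r"^(?P<speaker>[^:\n]{1,80}):\s*(?P<text>.*)$")
WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Lower-case and collapse repeated whitespace."""
    return WHITESPACE.sub(" ", text).strip().lower()


def clean_transcript(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (speaker, normalized text) pairs, dropping annotation-only lines."""
    turns: List[Tuple[str, str]] = []
    current: Optional[str] = None
    for raw in lines:
        line = ANNOTATION.sub(" ", raw).strip()
        if not line:
            continue  # the line held nothing but an event description
        match = SPEAKER_TURN.match(line)
        if match:
            current = WHITESPACE.sub(" ", match["speaker"]).strip()
            line = match["text"]
        if current is None:
            continue  # text before the first speaker label is ignored
        text = normalize(line)
        if text:
            turns.append((current, text))
    return turns


example = [
    "CHAIR:  The  session is OPEN.",
    "(The session is interrupted for a 15 minute break)",
    "MEMBER A: Thank you   (applause) very much.",
]
cleaned = clean_transcript(example)
```

Lines without a speaker prefix are attributed to the most recent speaker, which keeps the mapping from speaker labels to text parts intact across multi-line turns.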
3) Alignment and Segmentation: The primary challenge of exploiting the plenary sessions for ASR purposes is the length of the plenary recordings, as their durations vary from 6 minutes to 16 hours. However, data samples used to train ASR are generally less than 30 seconds long. Computational challenges have limited the length of training utterances for HMM-GMM models [64], and continue to do so in contemporary neural network models. Therefore, we need to segment the sessions into smaller pieces more suitable for ASR training. A second challenge is posed by mismatches between audio and transcripts. Parliamentary proceedings do not fully capture everything that is said during the parliamentary sessions, and do not account for speech disfluencies.
In order to obtain smaller, clean segments that are suitable for ASR training, we follow the segmentation procedure proposed by [65]. Initially the raw recordings are segmented into 30 second segments and the transcriptions are split into smaller segments of approximately 1000 words called documents. Each segment is decoded using a seed acoustic model trained on the Logotypografia corpus [66] and a 4-gram biased LM trained on the corresponding transcription of each recording. The best path transcript of each segment is obtained and paired with the best matching document via TF-IDF similarity. Finally, each hypothesis is aligned with the transcription using Smith-Waterman alignment [67] to select the best matching sub-sequence of words. The above method yields a list of text utterances, with their corresponding start and end times in the source audio files. The procedure yields 120 hours of usable segmented utterances out of the original 303 hours of raw audio, or a ratio of 39.6%.
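The last step of this procedure, selecting the best matching sub-sequence of reference words for a decoded hypothesis, can be illustrated with a compact word-level Smith-Waterman implementation. This covers only the local alignment itself, not the decoding or TF-IDF retrieval stages, and the scoring values (match 2, mismatch -1, gap -1) and the function name are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def smith_waterman(
    hyp: Sequence[str],
    ref: Sequence[str],
    match: int = 2,
    mismatch: int = -1,
    gap: int = -1,
) -> Tuple[int, int, int]:
    """Local alignment of hyp against ref.

    Returns (best score, start, end) so that ref[start:end] is the
    best matching sub-sequence of reference words.
    """
    rows, cols = len(hyp) + 1, len(ref) + 1
    score: List[List[int]] = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i: int, j: int) -> int:
        return match if hyp[i - 1] == ref[j - 1] else mismatch

    # Fill the dynamic programming table; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,   # hypothesis word left unaligned
                score[i][j - 1] + gap,   # reference word skipped
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the local alignment starts.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


hypothesis = "the budget was approved".split()
document = "mister speaker the budget was approved by the house".split()
best_score, start, end = smith_waterman(hypothesis, document)
matched_words = document[start:end]
```

In this toy case the hypothesis is recovered verbatim from the middle of the document; with real ASR output, mismatches and gaps let the alignment tolerate recognition errors and words missing from the proceedings.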
4) Post-processing: After the segments are extracted, we filter out extremely short segments (less than 2 words). Moreover, the iterative alignment algorithm may replace some intermediate words with a <spoken-noise> tag. When this tag is inserted, we match the surrounding text with the raw transcriptions and re-insert the missing words. Furthermore, we match each segment to its corresponding speaker label. Segments without a speaker label are discarded. Lastly, speakers are associated with their gender based on name suffixes, using a simple, Greek language-specific rule: speaker names which end in a(α), h(η), w(ω) or is(ις) are classified as female, and the rest as male. We format the segments, speaker and gender mappings in the standard folder structure used by the Kaldi speech recognition toolkit [36].
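The suffix rule for gender assignment is simple enough to state directly in code. The sketch below applies it to a single transliterated (Greeklish) name; the function name and the example names are illustrative, and which part of a full speaker label the rule is applied to is left as an assumption.

```python
FEMALE_SUFFIXES = ("a", "h", "w", "is")  # Greeklish for α, η, ω, ις


def guess_gender(transliterated_name: str) -> str:
    """Classify a Greeklish name as 'female' or 'male' by its suffix."""
    name = transliterated_name.strip().lower()
    return "female" if name.endswith(FEMALE_SUFFIXES) else "male"


labels = {name: guess_gender(name) for name in ["maria", "elenh", "giorgos", "artemis"]}
```

As a heuristic it will misclassify some names, which is acceptable for coarse corpus statistics but worth keeping in mind if the gender labels are used for analysis.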
5) Data Splitting: We provide an official train-development-test split. The development set contains 3 plenary sessions, one from 2018, one from 2019 and one from 2022, resulting in 9 hours of segmented speech. Similarly, the test set contains one session from each year, resulting in 11 hours of segmented speech. The remaining 99 hours of segmented speech are assigned to the training set.

B. Including corpora from different domains
We merge HParl with two publicly available corpora to create GREC-MD for multi-domain evaluation.
1) Common Voice: Common Voice (CV) [68] is a crowd-sourced, multi-lingual corpus of dictated speech, created by Mozilla. The data collection is performed by use of a web app or an iPhone app. Contributors are presented with a prompt and are asked to read it. The prompts are taken from public domain sources, i.e., books, Wikipedia, user submitted prompts and other public corpora. The maximum prompt length is 15 words. A rating system is built into the platform, where contributors can upvote or downvote submitted <audio,transcript> pairs. A pair is considered valid if it receives two upvotes. Speaker independent train, development and test splits are provided. The dataset is open to the research community, released under a permissive Creative Commons license (CC0). In this work, we use version 9.0 of CV, accessed on April 27, 2022. We keep only the valid utterances, i.e., 16 hours of speech from 325 contributors (19-49 years old, 67% male / 23% female).
2) Logotypografia: Logotypografia [66] is one of the first corpora for Large Vocabulary Continuous Speech Recognition in Greek. The dataset contains 33,136 newscast utterances, or 72 hours of speech. The utterances were collected from 125 speakers (55 male, 70 female), who were staff of the popular "Eleftherotypia" newspaper in Greece, under varied acoustic conditions. Approximately one third of the utterances were collected in a sound proof room, one third in a quiet room and the last third in an office room. The average utterance duration is 7.8 seconds. The transcriptions contain several speech and non-speech events (e.g., <cough>), lower-cased Greek words and stress marks. Numbers are expanded to full words. We use the whole dataset, and perform light preprocessing in the transcriptions, by discarding the annotated events and punctuation.

V. EXPERIMENTAL SETTINGS
For our experiments we use the following hyper-parameter settings, unless explicitly stated otherwise. For model training, we use the AdamW optimizer [69] with learning rate 0.0003. We apply warmup for the first 10% of the maximum training steps, and a linear learning rate decay after that. Models are finetuned for a maximum of 10,000 steps. For speech recognition training, we make use of the Connectionist Temporal Classification (CTC) loss [70], optimized using the available transcribed data in each scenario. Validation runs every 500 steps on the development set, and early stopping is employed on the development CTC loss with patience 5. Batch size is set to 8 during finetuning for all scenarios, except for M2DS2. In the case of M2DS2 we create mixed batches of size 12, containing 4 transcribed source domain samples and 8 unlabeled target domain samples, and train for 10,000 CTC updates. For memory reasons we split the mixed batches into mini-batches of 4 and interleave them during model training. Gradients are accumulated over 3 interleaved batches. For the self-supervised objective, we create masks of maximum timestep length 10, with masking probability 0.4. We weigh the contributions of the source and target domain contrastive objectives, and bring them to the same order of magnitude as the CTC loss, by setting α = 0.01 and β = 0.02. The convolutional feature encoder is kept frozen for all experiments. Our code is based on the huggingface 4 implementation of XLSR. For all experiments we resample the audio files to 16 kHz and downmix them to single channel audio. We exclude utterances in the training set that are longer than 12 seconds. All experiments are run on a single NVIDIA RTX 3090 GPU, with mixed precision training.
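The learning rate schedule and the batch bookkeeping described above can be written down explicitly. The sketch below assumes that warmup rises linearly from zero and that the linear decay reaches zero at the final step; neither endpoint is stated in the text, and the function names are illustrative.

```python
def learning_rate(
    step: int,
    max_steps: int = 10_000,
    peak_lr: float = 3e-4,
    warmup_fraction: float = 0.1,
) -> float:
    """Linear warmup for the first 10% of steps, then linear decay (assumed to 0)."""
    warmup_steps = round(max_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / (max_steps - warmup_steps)


def accumulation_steps(n_source: int = 4, n_target: int = 8, mini_batch_size: int = 4) -> int:
    """Number of mini-batches whose gradients make up one mixed M2DS2 batch."""
    return (n_source + n_target) // mini_batch_size


schedule = {step: learning_rate(step) for step in (0, 500, 1_000, 5_500, 10_000)}
```

With the reported settings, a mixed batch of 4 source and 8 target samples split into mini-batches of 4 gives `accumulation_steps() == 3`, matching the three interleaved batches over which gradients are accumulated.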
For language model training, we create a large corpus for the Greek language using a subset of the Greek part of CC-Net [71] (approximately 11 billion tokens) and combine it with 1.5 billion tokens from the Greek version of Wikipedia and the Hellenic National Corpus (HNC) [72]. During preprocessing, we remove all punctuation and accents, deduplicate lines and convert all letters to lowercase. We will refer to this corpus as the Generic Greek Corpus (GGC). We train a 4-gram language model on GGC using KenLM [73] and prune bigrams, trigrams and four-grams with counts less than 3, 5 and 7 respectively. We incorporate the n-gram LMs at inference time using the pyctcdecode framework. We use language model rescoring over a beam search decoder with 13 beams.
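The text preprocessing steps for GGC can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function names are hypothetical, accents are removed by dropping all combining marks after Unicode decomposition (which also removes the diaeresis), and deduplication is applied after normalization. The actual pipeline may differ in these details.

```python
import unicodedata


def normalize_line(line):
    """Strip accents and punctuation, lowercase, and collapse whitespace."""
    # Decompose accented letters, e.g. "έ" becomes "ε" + combining acute.
    decomposed = unicodedata.normalize("NFD", line)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category == "Mn":
            # Combining marks: accents (and, as a side effect, diaeresis).
            continue
        if category.startswith("P"):
            # Any Unicode punctuation character.
            continue
        kept.append(ch)
    text = unicodedata.normalize("NFC", "".join(kept)).lower()
    return " ".join(text.split())


def build_corpus(lines):
    """Normalize every line and drop empty lines and exact duplicates."""
    seen = set()
    corpus = []
    for line in lines:
        normalized = normalize_line(line)
        if normalized and normalized not in seen:
            seen.add(normalized)
            corpus.append(normalized)
    return corpus


if __name__ == "__main__":
    raw = ["Καλημέρα, κόσμε!", "καλημερα κοσμε", ""]
    print(build_corpus(raw))
```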
The evaluation metric is the Word Error Rate (WER) over the target test set. For assessing the adaptation effectiveness we also report the relative WER improvement over the unadapted baseline in appropriate scenarios, which is defined in Eq. (5). We refer to this metric as Relative Adaptation Improvement (RAI) for the rest of this paper:

$$\mathrm{RAI} = -\frac{\mathrm{WER}_{adapted} - \mathrm{WER}_{unadapted}}{\mathrm{WER}_{unadapted}} \times 100\% \quad (5)$$

The minus sign is included, so that RAI takes negative values when the adaptation fails, i.e., when $\mathrm{WER}_{unadapted} < \mathrm{WER}_{adapted}$.

VI. SUPERVISED IN-DOMAIN TRAINING
In the first set of experiments, we explore the performance of supervised finetuning of XLSR-53 for each domain. This will give an upper bound estimation for UDA performance. We finetune XLSR-53 on CV, HP and LG (separately) and perform in-domain evaluation on the respective test sets.
Results are summarized in Table IV. The first row indicates the performance of greedy decoding, while in the second row we report the performance of the beam search decoder, rescored using the scores of the 4-gram GGC language model. We observe that the greedy decoding performance is under 30 WER for both HP and CV, while for LG we achieve ∼32 WER. This makes sense, as LG is the most diverse dataset with respect to the included acoustic conditions. Furthermore, we observe that the incorporation of a language model results in an impressive WER reduction on CV, followed by HP and then LG. While CV includes relatively simple phrases with common vocabulary, HP and LG contain more specialized terminology.

VII. UNSUPERVISED DOMAIN ADAPTATION USING IN-DOMAIN AUDIO
Here, we evaluate the effectiveness of M2DS2 for UDA. We compare with three baselines: Source Only training (SO), Continual Pre-Training (CPT) and Pseudolabeling (PSL). Table V reports the results for six adaptation scenarios. First, we consider the SO models, i.e., the finetuned models from Table IV, evaluated in out-of-domain settings. We see that out-of-domain evaluation results in a large performance hit, e.g., while in the CV9 → CV9 in-domain setting we achieve 29.33 WER, in the CV9 → HP out-of-domain setting we get 69.55 WER. This confirms that for real-world ASR tasks, multi-domain evaluation is essential. Second, we observe that in most adaptation scenarios both CPT and PSL fail to surpass the SO (unadapted) baseline. In the case of CPT, we hypothesize that this is due to the relatively data-constrained version of our setting. In the best-case scenario, we have 99 hours of available target domain audio, which is not enough to perform a discrete CPT stage. Note that most works in the literature use ∼1000 hours of target audio for CPT. In the case of PSL, the poor performance is due to the quality of the silver labels created by the seed model. While the performance would improve with more elaborate approaches (e.g., confidence filtering), in challenging adaptation scenarios PSL approaches are limited by the SO model's performance. Lastly, we observe that M2DS2 is the only approach among those compared that achieves a positive RAI in most adaptation scenarios, consistently outperforming the SO baseline by significant margins. This effect is more pronounced when we include an LM during inference. One exception to this pattern is the HP → LG scenario, where the SO baseline achieves the best performance. We attribute this to the fact that we performed minimal hyper-parameter tuning during model development.
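For reference, the RAI values discussed above can be computed with a few lines of code. The sketch below assumes that RAI is the relative WER reduction over the unadapted baseline, expressed in percent, with the sign convention described in Section V. The function name and the WER values in the example are hypothetical and are not results from Table V.

```python
def relative_adaptation_improvement(wer_unadapted, wer_adapted):
    """Relative WER reduction over the unadapted baseline, in percent.

    Positive values mean that adaptation reduced the WER; negative values
    mean that the adapted model is worse than the unadapted one.
    """
    if wer_unadapted <= 0:
        raise ValueError("wer_unadapted must be positive")
    return -100.0 * (wer_adapted - wer_unadapted) / wer_unadapted


if __name__ == "__main__":
    # Hypothetical WER values, for illustration only.
    print(relative_adaptation_improvement(60.0, 45.0))  # successful adaptation
    print(relative_adaptation_improvement(60.0, 66.0))  # failed adaptation
```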

A. The sample efficiency of M2DS2
One key observation in the literature, and in our experiments, is that CPT requires a large amount of untranscribed target domain audio. This raises the question: can we leverage self-supervision for domain adaptation in data-constrained settings?
In Fig. 3 we evaluate the performance of M2DS2 when we reduce the amount of target domain audio. Specifically, we focus on the scenario of LG → CV. The full training corpus of CV contains 12 hours of audio. We train M2DS2 with 50%, 25% and 10% of the available samples, or 6, 3 and 1.2 hours of audio respectively, and plot the resulting WER on the target (CV) test set. In all cases, the full source (LG) training corpus is used. We observe that M2DS2 achieves lower WER than the SO baseline, even with only 3 hours of target domain audio. While CPT can suffer from catastrophic forgetting, as most multi-stage training approaches, M2DS2 avoids this issue, being a single-stage approach with a mixed task-specific and self-supervised objective. This provides a promising avenue for adaptation, when collection of in-domain recordings is expensive, or infeasible.

B. The importance of Multi-Domain Self-Supervision
In Section III-B we argue that it is essential to include both source and target domain data for the self-supervised objective of M2DS2. To illustrate the effect of this approach, we train two versions of M2DS2 for the LG → CV scenario. For the first version we set α = 0.01, while for the second we set α = 0, removing the second term of Eq. (4). We extract the code vectors for the first 100 samples of both LG and CV, and flatten them across the time steps, resulting in 60,000 code vectors of dimension 768, each corresponding to an individual timestep. We plot these code vectors using t-SNE [74] in Fig. 4 for both models.
We see that when we do not include the source domain self-supervision, the code vector space collapses into a few tight clusters, and most audio segments correspond to just a few code vectors. This is a visual clue that indicates the mode collapse problem. When we include the source domain term, we see that the code vector space has more structure, and coverage of the space is more complete, both for CV (target domain) and LG (source domain). Experimentally, we train M2DS2 with α = 0 for all source / target domain pairs and we find that the mode collapse is destructive for target domain performance. During our experiments we got WER in the range 80-99, indicating failure to converge to acceptable solutions across all scenarios. The simple inclusion of both source and target domain self-supervision stabilizes training, avoids mode collapse and leads to successful unsupervised adaptation between domains.
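The ablation above can be read in terms of how the three loss terms are combined. The following sketch assumes that the objective of Eq. (4) is a weighted sum of the CTC loss, the source domain contrastive loss (weighted by α) and the target domain contrastive loss (weighted by β). The function name and the loss values are hypothetical and serve only to show the effect of setting α = 0.

```python
def m2ds2_loss(ctc_loss, contrastive_source, contrastive_target,
               alpha=0.01, beta=0.02):
    """Assumed form of the mixed objective: task loss plus weighted
    self-supervised losses over source and target domain audio."""
    return ctc_loss + alpha * contrastive_source + beta * contrastive_target


if __name__ == "__main__":
    # Hypothetical per-batch loss values, for illustration only.
    ctc, l_src, l_tgt = 2.0, 100.0, 50.0

    # Full objective: both domains contribute to the self-supervised signal.
    print(m2ds2_loss(ctc, l_src, l_tgt))

    # Ablation with alpha = 0: the source domain contrastive term is removed,
    # so only target domain audio drives the self-supervised objective.
    print(m2ds2_loss(ctc, l_src, l_tgt, alpha=0.0))
```

With these illustrative values, the weights α = 0.01 and β = 0.02 bring both contrastive terms to the same order of magnitude as the CTC loss, which is the stated purpose of the weighting in Section V.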

VIII. UNSUPERVISED AND WEAKLY SUPERVISED LANGUAGE ADAPTATION
When small amounts of in-domain textual data are available, simple N-gram LM adaptation techniques can be very effective. In this brief set of experiments, we first explore the unsupervised language adaptation setting, where no in-domain audio is used, and then we relax the problem to the weakly supervised setting, where M2DS2 is combined with the adapted N-gram LMs. These settings are described in Sections II-A2 and II-A3 respectively. We explore two approaches for LM adaptation: biased LMs, and in-domain data augmentation. To create biased LMs, we train a 4-gram LM on the available in-domain data and use it in place of the generic LM trained on GGC. For LM data augmentation we follow a perplexity filtering approach similar to [71]. We first train a biased LM using available target domain text, and then use it to calculate the perplexity of each line in the GGC corpus. We keep the 10% of the lines with the lowest perplexity. Then we train a 4-gram LM on the augmented "in-domain" corpus and use it for inference. Fig. 5 shows the performance of the SO LG → HP model with biased and augmented LMs, as we reduce the amount of available in-domain text data from 100% to 1% of the in-domain transcriptions (11M tokens to 110K tokens respectively). As a baseline we include the LG → HP SO model in combination with the generic LM trained on GGC. We observe that the use of biased LMs can lead to successful adaptation, when an adequate amount of in-domain text data is available. On the other hand, the LM augmentation approach results in successful adaptation, even with very small amounts of in-domain text.
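The perplexity filtering step can be sketched as follows. To keep the example self-contained, an add-one smoothed unigram model stands in for the biased 4-gram KenLM model, and a tiny English toy corpus stands in for the in-domain text and GGC. All function names and data are hypothetical; only the selection rule (keep the 10% of generic lines with the lowest perplexity under the in-domain LM) follows the description above.

```python
import math
from collections import Counter


def train_unigram(lines):
    """Add-one smoothed unigram LM (a stand-in for the biased 4-gram LM)."""
    counts = Counter(token for line in lines for token in line.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab_size))

    return logprob


def perplexity(line, logprob):
    """Per-token perplexity of a line under the given LM."""
    tokens = line.split()
    if not tokens:
        return float("inf")
    average = sum(logprob(token) for token in tokens) / len(tokens)
    return math.exp(-average)


def select_in_domain_like(generic_lines, in_domain_lines, keep_fraction=0.10):
    """Keep the fraction of generic lines with the lowest perplexity."""
    logprob = train_unigram(in_domain_lines)
    ranked = sorted(generic_lines, key=lambda line: perplexity(line, logprob))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy data, for illustration only.
in_domain = [
    "the parliament votes on the bill",
    "the minister answers the question",
]
generic = [
    "cats sleep all day",
    "the weather is sunny today",
    "my bicycle needs new tires",
    "the minister votes on the bill",
    "she bought fresh bread",
    "football fans filled stadium seats",
    "rain fell over mountains",
    "he paints landscapes in oil",
    "students study for exams",
    "music festival starts tomorrow",
]
selected = select_in_domain_like(generic, in_domain)
print(selected)
```

In the setting described above, the selected lines would then be used, together with the available in-domain text, to train the 4-gram LM used for inference.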
In Table VI we see the results of LM adaptation, combined with the M2DS2 LG → CV model. To demonstrate the sample efficiency of the approach, we use the variant that was trained using only 25% of the target domain audio (3 hours). We compare with M2DS2 combined with the 4-gram GGC LM for inference. We draw similar conclusions, i.e., use of biased LMs performs well for sufficient text data. When we use augmented LMs we can leverage very small amounts of in-domain text.

IX. DISCUSSION & CONCLUSIONS
In this work, we have explored Unsupervised and Weakly Supervised Domain Adaptation of ASR systems in the context of an under-resourced language, i.e., Greek. We focus on domain adaptation through in-domain self-supervision for XLSR-53, a state-of-the-art multilingual ASR model. Specifically, we adopt a mixed task and self-supervised objective, inspired by NLP, and show that using only in-domain self-supervision can lead to mode collapse of the representations created by the contrastive loss of XLSR-53. Therefore, we propose the use of mixed task and multi-domain self-supervision, M2DS2, where the contrastive loss leverages both the source and target domain audio data. For evaluation we create and release HParl, the largest to-date public corpus of transcribed Greek speech (120 hours), collected from the Greek Parliamentary Proceedings. HParl is combined with two other popular Greek speech corpora, i.e., Logotypografia and CommonVoice, for multi-domain evaluation.
In our experiments, we find that while most UDA baselines fail in our low-resource setting, the proposed mixed task and multi-domain self-supervised finetuning strategy yields significant improvements for the majority of adaptation scenarios. Furthermore, we focus our ablations on showcasing the sample efficiency of the proposed finetuning strategy, and demonstrating the necessity of including both source and target domain data for self-supervision. Finally, we show that M2DS2 can be combined with simple language model adaptation techniques in a relaxed weakly supervised setting, where we achieve significant performance improvements with a few hours of in-domain audio and a small, unpaired in-domain text corpus.
More concretely, in Table VII we present a summary of the discussed unsupervised and weakly supervised adaptation combinations, for different amounts of available in-domain audio and text. Note that for the weakly supervised scenarios, the in-domain audio and text are unpaired. We see that when no in-domain data are available, including an n-gram LM trained on large corpora is recommended. Furthermore, when in-domain audio is available, following a mixed multi-domain finetuning strategy using M2DS2 can yield significant WER reductions, even for a few hours of audio. When small amounts of in-domain text are available, using a corpus augmentation strategy, e.g., perplexity filtering, can produce adapted LMs and yield small improvements to the final WER. In the case of sufficient amounts of unpaired in-domain text and audio, independent adaptation of XLSR-53 using the audio data and the n-gram LM using the text data can yield comparable performance with a fully supervised finetuning pipeline.

X. FUTURE WORK
In the future we plan to explore the effectiveness of the proposed adaptation strategy for other languages, and different adaptation settings, e.g., accent or cross-lingual adaptation. Of special interest is the investigation of the effectiveness of our approach for endangered languages, e.g., Pomak. Furthermore, we plan to explore the combination of in-domain self-supervision with other popular UDA techniques, e.g., teacher-student models, adversarial learning, and data augmentation approaches. On the language adaptation side, we plan to explore multi-resolution learning, which has shown promise for ASR [75], and investigate more elaborate end-to-end weakly supervised adaptation methods. Finally, we plan to expand our study to a multimodal setting, where both audio and video are available, e.g., lip reading.

Fig. 1. Target-domain adaptation through self-supervision. On the left we see the general pre-training stage of XLSR-53 using the self-supervised loss Ls. General pre-training is performed on 56,000 hours of audio in 53 languages. On the right, we see the proposed domain-adaptive finetuning stage, where the speech recognition task is learned using transcribed source domain data, while adaptation to the target domain is performed by including the self-supervised loss over (audio-only) source and target domain data.

Fig. 2. Overview of the Hellenic Parliament Chamber. The chamber has an amphitheatrical shape and can accommodate approximately 400-450 people. The positions of the key speakers, i.e., current speaker and the parliament president, are annotated in the image.

1) Source Only Training (SO): We perform supervised finetuning of XLSR-53 (CTC) using only the source domain data, and run decoding on the target domain test set. No in-domain data are used for adaptation.
2) Continual Pre-Training (CPT): We perform a pre-training phase using the loss in Eq. (3) on the target domain train set, to create adapted versions of XLSR. Pre-training is run for 20,000 steps with batch size 4. Only the audio is used, without transcriptions. The adapted checkpoints are then finetuned by use of CTC loss on the source domain transcribed data. Evaluation is performed on the target test set.
3) Pseudolabeling (PSL): We finetune XLSR-53 using the source domain data with CTC loss. Then we run inference on the source model, to extract silver transcriptions for the target domain training set. We use the silver transcriptions for supervised finetuning on the target domain.
In Table V we compare M2DS2 with the SO, CPT and PSL baselines for six adaptation scenarios, i.e., cross-dataset evaluation between the three datasets in GREC-MD. The left half corresponds to greedy decoding, while for the right half we use the 4-gram LM trained on GGC. First, we observe the SO model performance. The SO models are the finetuned models from Table IV, evaluated in out-of-domain settings.

Fig. 3. Performance of M2DS2 (blue line) for the LG → CV setting, when reducing the amount of available target samples to 50%, 25%, and 10% of the original dataset (horizontal axis). SO performance is indicated with the orange line. Vertical axis: WER. Horizontal axis: target audio percentage (100% → 0%).

Fig. 4. T-SNE scatter plots of code vectors extracted from M2DS2 without source domain self-supervision (top) and with source domain self-supervision (bottom) for LG (red) and CV (teal).

Fig. 5. Language-only adaptation for LG → HP using the SO model finetuned on LG. In-domain text data range from 11M tokens (left) to 110K tokens (right). Blue/dashed: Baseline with generic LM. Purple/circles: Biased LM. Orange/diamonds: Augmented LM.

TABLE II
THE GREC-MD CORPUS. WE CAN SEE THE DURATION OF EACH SPLIT IN HOURS:MINUTES:SECONDS FORMAT, AS WELL AS THE NUMBER OF SPEAKERS FOR EACH OF THE SUB-CORPORA.

TABLE III
PLENARY SESSIONS INCLUDED IN HPARL. THE HOURS COLUMN REFERS TO THE RAW (UNSEGMENTED) HOURS OF COLLECTED AUDIO.

VI. SUPERVISED IN-DOMAIN TRAINING
In the first set of experiments, we explore the performance of supervised finetuning of XLSR-53 for each domain. This will give an upper bound estimation for UDA performance.

TABLE V
M2DS2 PERFORMANCE USING GREEDY DECODING FOR UDA BETWEEN HP, CV, AND LG. A → B INDICATES THAT A IS THE SOURCE DOMAIN AND B IS THE TARGET DOMAIN. (G) INDICATES GREEDY DECODING. (LM) INDICATES BEAM SEARCH WITH LM RESCORING. WE REPORT THE WER ON THE TARGET TEST SET, AS WELL AS THE RAI (%) OVER THE SO (UNADAPTED) BASELINE. WER: LOWER IS BETTER. RAI: HIGHER IS BETTER.

TABLE VI
LANGUAGE ADAPTATION OF THE M2DS2 LG → CV MODEL, USING BIASED AND AUGMENTED LMS. WE USE THE VARIANT OF THE MODEL TRAINED WITH 3 HOURS OF IN-DOMAIN AUDIO. WE VARY THE AMOUNT OF IN-DOMAIN TEXT DATA FROM 752K TOKENS TO 38K TOKENS.

TABLE VII
CLOSING THE GAP BETWEEN SO TRAINING AND FULLY SUPERVISED TRAINING FOR THE LG → CV ADAPTATION SCENARIO USING M2DS2, WITH VARYING AMOUNTS OF AVAILABLE UNPAIRED IN-DOMAIN AUDIO AND TEXT. (U): UNSUPERVISED ACOUSTIC OR LANGUAGE ADAPTATION. (W): WEAKLY SUPERVISED ADAPTATION.