Language Model Integration in Encoder-Decoder Speech Recognition

Shahad Mahmud

4 years ago

Attention-based recurrent encoder-decoder models offer an elegant way to build end-to-end models for tasks such as automatic speech recognition (ASR) and machine translation (MT). An end-to-end ASR model folds the traditional acoustic model, pronunciation model, and language model (LM) into a single network. The encoder maps the input speech to a sequence of higher-level learned features, and the decoder maps these features to the output text or labels. An attention mechanism provides the alignment between the speech and the text. Such a model is trained end-to-end (E2E) and requires only paired speech and text data.

Because E2E models are trained on paired speech and text data, their linguistic knowledge is restricted to the transcripts in that data. Conventional ASR systems, in contrast, use a separate LM trained on all available text, which can be orders of magnitude larger than the paired transcripts. Several approaches have been proposed to bring the power of such an external language model into E2E models.

Language Model integration approaches

We can divide the LM integration approaches into three broad categories. Let’s discuss these briefly.

Shallow fusion: In this approach, we integrate the language model via log-linear interpolation at inference time only. Suppose we decode with beam search. Without an LM, the search approximates \( y^* = \arg\max_y \log p(y|x) \). With shallow fusion, the objective becomes \( y^* = \arg\max_y \left[ \log p(y|x) + \lambda \log p_{LM}(y) \right] \), where \( \lambda \) is the weight given to the LM score. The LM is therefore used at inference time only, and the ASR model itself is unchanged.
Deep fusion: Like shallow fusion, deep fusion is a late integration procedure in terms of training: it assumes that both the ASR model and the LM are pre-trained. The difference is that the two models are then combined into one by fusing the hidden states of the language model and the decoder.
Cold fusion: Cold fusion builds on the idea of deep fusion, but it is an early fusion method: the language model is pre-trained, and the ASR model is trained together with it. The LM is integrated into the ASR model through a gating mechanism. This approach uses the LM logits rather than its hidden states, which allows flexible LM swapping.
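
To make the shallow fusion formula concrete, here is a minimal sketch in Python. It rescores complete hypotheses for simplicity, whereas in practice the same interpolation can be applied to partial hypotheses at each beam search step. The function names, the hypotheses, and all probabilities are invented for illustration and do not come from any particular toolkit.

```python
import math


def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight):
    """Log-linear interpolation: log p(y|x) + lambda * log p_LM(y)."""
    return asr_log_prob + lm_weight * lm_log_prob


def pick_best(hypotheses, lm_weight):
    """Return the text of the hypothesis with the highest fused score.

    `hypotheses` is a list of (text, asr_log_prob, lm_log_prob) tuples.
    """
    best = max(
        hypotheses,
        key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight),
    )
    return best[0]


# Toy values (assumed): the ASR model slightly prefers the second
# hypothesis, while the LM strongly prefers the first one.
hypotheses = [
    ("recognize speech", math.log(0.40), math.log(0.30)),
    ("wreck a nice beach", math.log(0.45), math.log(0.01)),
]

print(pick_best(hypotheses, lm_weight=0.0))  # ASR score alone
print(pick_best(hypotheses, lm_weight=0.5))  # ASR score plus weighted LM score
```

With a weight of 0 the LM is ignored and the ASR score alone decides. With a positive weight the LM can overturn an acoustically plausible but linguistically unlikely hypothesis. The weight is usually tuned on held-out data.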

Integration scenarios

Now the question arises: when should we integrate the LM? We can consider the following criteria or scenarios:

Early/late model integration: We need to consider at which point the LM computation is integrated into the ASR model. With deep and cold fusion, we fuse the LM directly into the ASR model. This creates a tight integration by combining the LM’s hidden states (or its logits, in cold fusion) with the decoder’s hidden states, and the result is a single model. With shallow fusion, on the other hand, the LM and the ASR model remain separate. As in an ensemble, only the scores are combined.
Early/late training integration: We also need to think about when the language model enters training. Shallow and deep fusion use late integration: the ASR model and the LM are trained separately and then combined. Cold fusion uses early integration: an external pre-trained language model is used from the very start of the ASR model training. If the ASR model or the LM changes frequently, early integration can be computationally costlier.
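
The tight, single-model integration described above can be pictured as a gate inside the decoder that decides how much LM information to let through. The following is a rough sketch of one such gated fusion step, loosely in the style of cold fusion. The exact layer layout, the names `gated_fusion`, `gate_weights`, and `gate_bias`, and all numbers are assumptions made for illustration, not the formulation from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gated_fusion(decoder_state, lm_features, gate_weights, gate_bias):
    """Fuse LM features into the decoder state through a gate.

    gate  = sigmoid(W [decoder_state; lm_features] + b), one value per LM feature
    fused = [decoder_state; gate * lm_features]
    """
    joint = decoder_state + lm_features  # list concatenation
    pre_activation = matvec(gate_weights, joint)
    gate = [sigmoid(p + b) for p, b in zip(pre_activation, gate_bias)]
    gated_lm = [g * f for g, f in zip(gate, lm_features)]
    return decoder_state + gated_lm


# Toy sizes (assumed): 2 decoder units, 3 LM features.
decoder_state = [0.2, -0.1]
lm_features = [1.0, -2.0, 0.5]  # e.g. a projection of the LM logits
gate_weights = [[0.0] * 5 for _ in range(3)]  # untrained: all zeros

half_open = gated_fusion(decoder_state, lm_features, gate_weights, [0.0] * 3)
closed = gated_fusion(decoder_state, lm_features, gate_weights, [-50.0] * 3)
print(half_open)
print(closed)
```

In a trained model the gate parameters are learned together with the rest of the decoder, and the fused vector feeds the output layer. With shallow fusion no such layer exists, and only the final scores of the two models meet.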

Other LM integration approaches

We can also integrate a language model in other ways. Let’s have a look at the following approaches:

LM as lower decoder layer: We can use a pre-trained LM to initialize the lower layer of the decoder. This approach can provide more contextual word embeddings.
LM integration via multitask learning: We can consider the decoder as a conditional LM, conditioned on the encoder features that represent the speech input. Unpaired text has no corresponding speech signal, so we represent the missing input by a zero context vector. A zero context vector reduces the decoder from a conditional LM to a plain LM. This way, we can train the decoder on language modeling tasks as well. This approach therefore has no external language model. Instead, we use the decoder for both decoding and language modeling.
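
The zero context vector idea can be illustrated with a heavily simplified decoder step. A real decoder is recurrent and uses attention. Here a single linear layer stands in for it, and the names `decoder_logits`, `w_embed`, and `w_context` as well as all numbers are assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def decoder_logits(prev_token_embedding, context, w_embed, w_context):
    """One simplified decoder step.

    logits = W_embed * previous-token embedding + W_context * context vector
    """
    from_text = matvec(w_embed, prev_token_embedding)
    from_speech = matvec(w_context, context)
    return [t + s for t, s in zip(from_text, from_speech)]


# Toy parameters (assumed): 2 output tokens, 2-dim embeddings, 3-dim context.
w_embed = [[0.5, -1.0], [1.5, 0.25]]
w_context = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
prev_token_embedding = [1.0, 2.0]

# Paired data: the context vector comes from the encoder through attention.
paired = decoder_logits(prev_token_embedding, [0.2, 0.4, 0.6], w_embed, w_context)

# Unpaired text: no speech, so the context is a zero vector and the
# logits depend only on the previous token, as in a plain LM.
text_only = decoder_logits(prev_token_embedding, [0.0, 0.0, 0.0], w_embed, w_context)

print(paired)
print(text_only)
print(text_only == matvec(w_embed, prev_token_embedding))
```

With the zero context, the speech term vanishes and the same decoder parameters are trained on a pure language modeling objective. This is why no external LM is needed in this setup.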

We can use these approaches to integrate a language model with an attention-based encoder-decoder model. This article is based on the research paper “A comparison of techniques for language model integration in encoder-decoder speech recognition”. You can have a look at it for more mathematical and implementation details.

That’s all for this post. I hope this post will help you to enrich your understanding of language model integration in an encoder-decoder model. I am a learner who is learning new things and trying to share with others. Let me know your thoughts on this post. Any suggestions or opinions will be highly appreciated. You can reach me through LinkedIn, Facebook, Email, or find me on GitHub. Get more machine learning-related posts here.