Extending Gentle Aligner
Week 5
Groundwork for Generating a Language Model
-
Now that we know how to generate forced-alignements for utterances using Kaldi, we shift our focus to generating a customized langauge model that would probably increase the accuracy of the forced alignment.
-
What is a Language Model made of?
The kind of language model that we are aiming to build will be specific to an utterance of an audio file. In other words, each utterance will have a language model of its own. To build a language model, we will require followng inputs:
1. lexicon or dictionary [lexicon.txt, words.txt] 2. phonemes [nonsilence_phones.txt, optional_silence.txt, silence_phones.txt] 3. grammar [G.fst]
-
Building a lexicon or a dictionary
Given that creating a lexicon could be a tedious task for a new language, we have a new challenge where we would need a lexicon for each utterance, and that would need to be generated on the go as an utterance is provided.
There are two ways in which we can generate a lexicon from an utterance:
1. Using a simple python script [located here] createLexicon.py It takes an utterance, breaks it up into words and looks up the words in an even bigger dictionary/lexicon of the language. 2. Using g2p-seq2seq library Therefore, grapheme2phoneme library seems like an useful tool that can be pre-trained on any langauge's lexicon (all possible words in a langauge) and then it can produce a lexicon specific to a text. [Link to g2p library](https://github.com/cmusphinx/g2p-seq2seq) install virtualenv (will make your life easier) install g2p dependencies instal g2p-seq2seq train your lexicon using g2p-seq2seq
-
Generating Phonemes from Lexicon
./script/phones.sh path-to/lexicon.txt generates three files. eg: ./script/phones.sh data/local/dict [Edit phones.sh line 9 & 12 if you have more than two silent phonemes in your lexicon.txt] eg: head -n 3 lexicon.txt > temp.txt and tail -n +4 $lex/lexicon.txt > temp.txt if there are 3 silent phonemes.
Path: data/local/dict contains lexicon.txt Those are nonsilence_phones.txt, silence_phones.txt, optional_phones.txt
PS: This script gets called from createLexicon.py itself. So, no worries!
-
Finishing with putting all input files together and generating L.fst, symbol-integer mapped dictionary and phones.txt etc.
This script provides L.fst, words.txt, phones.txt, topo, L_disambig.fst, oov.txt, oov.int, phones.
./utils/prepare_lang.sh data/local/dict/ <UNK> data/local/lang/ data/langdir/ data/local/dict/ lexicon.txt nonsilence_phones.txt silence_phones.txt optional_phones.txt data/local/lang/ align_lexicon.txt lex_ndisambig lexiconp.txt lexiconp_disambig.txt phone_map.txt data/langdir/ L.fst L_disambig.fst words.txt phones.txt topo oov.txt oov.int phones/
Above given directory structure is what you should follow for easy integration with Gentle code for generating the langauge model. Make sure to copy(just copy, don’t move) langdir/words.txt to tdnn_7b_chain_online/graph_pp/words.txt. Gentle expects words.txt inside
tdnn_7b_chain_online/graph_pp
directory. - Next: Generating the language model Week 6
- Prev: Automate Decoding Process Week 4
- main page
Tools: Kaldi, Python, C, Bash Scripting
References: For generating phoneme files using this blog
Link to GSoC Project Repository