Speech:Models LM Build


 * [Model Building]
 * Step 1: Run a Train
 * Step 2: Create the Language Model
 * Step 3: Run the Decode

Model Building
A general discussion of how to build and verify models, covering the initial setup and preparation of data, building a statistical language model, and finally generating a robust set of acoustic models (training) and verifying them by testing (decoding) on the trained corpus. For detailed steps on how to train and decode, see the sub-steps under Model Building above.


Parsing the Transcript
The first step in building a language model is to clean up the raw transcript file by removing all unwanted characters. To do this, run the ParseTranscript.perl script from the /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi directory. The command is "perl ParseTranscript.perl test.txt tmp.text", where test.txt is an example name for a raw transcript file and tmp.text is the filtered transcript. Both the input and the output must be plain text files because the text2wfreq command called in lm_create.pl requires text input. The result of running this script is a copy of the transcript containing only what was actually said in the audio files.
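The exact filtering rules live in ParseTranscript.perl itself; purely as an illustration of the kind of cleanup it performs, a sed pipeline of the following shape does something similar. The bracketed-annotation and speaker-label patterns below are assumptions for the sketch, not the script's actual rules:

```shell
# Illustrative only -- not ParseTranscript.perl's actual rules.
# Assume ICSI-style bracketed annotations (e.g. [laughter]) and speaker
# labels like "A.1:"; strip them plus punctuation, then squeeze spaces.
printf 'A.1: hello [laughter] world.\n' > test.txt   # tiny sample "raw transcript"
sed -e 's/\[[^]]*\]//g' \
    -e 's/^[A-Z]\.[0-9]*: *//' \
    -e 's/[[:punct:]]//g' test.txt | tr -s ' ' > tmp.text
cat tmp.text    # -> hello world
```

The real script handles many more annotation types, but the shape is the same: whatever survives the filters is treated as spoken words.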

Creating the Language Model
The next step in creating a language model is to run the lm_create.pl script. This Perl script calls four different executables from the CMU-Cambridge Statistical Language Modeling toolkit. The first of these commands is text2wfreq, run as "text2wfreq < tmp.text > tmp.wfreq". It reads the filtered transcript that ParseTranscript.perl created and writes a file listing the frequency of every word in the transcript. After that, wfreq2vocab is executed as "wfreq2vocab < tmp.wfreq > tmp.vocab"; it takes tmp.wfreq as input and creates an alphabetical list of every word found in the transcript. The next command is text2idngram, used as "text2idngram -vocab tmp.vocab -n 3 -write_ascii < $infilename > tmp.idngram". It replaces each word with a numeric id so that more n-grams can be stored efficiently in memory. The last command, idngram2lm, actually creates the language model and has two forms: one that produces a model users can physically read, "idngram2lm -idngram tmp.idngram -vocab tmp.vocab -arpa tmp.arpa -ascii_input", and one that produces a binary model used by the computer, "idngram2lm -idngram tmp.idngram -vocab tmp.vocab -binary tmp.binlm -ascii_input".
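Taken together, the four steps that lm_create.pl drives can be sketched as a single shell sequence. This is a sketch only: it assumes the toolkit binaries are on the PATH, and it substitutes tmp.text for lm_create.pl's $infilename variable for concreteness.

```shell
#!/bin/sh
# Sketch of the four CMU-Cambridge SLM toolkit steps run by lm_create.pl.
# Assumes the toolkit binaries are installed and on PATH; tmp.text stands
# in for the $infilename variable used inside lm_create.pl.
set -e
text2wfreq  < tmp.text  > tmp.wfreq     # word -> occurrence count
wfreq2vocab < tmp.wfreq > tmp.vocab     # alphabetical vocabulary list
text2idngram -vocab tmp.vocab -n 3 -write_ascii < tmp.text > tmp.idngram   # 3-grams as word ids
idngram2lm -idngram tmp.idngram -vocab tmp.vocab -arpa   tmp.arpa  -ascii_input   # human-readable ARPA model
idngram2lm -idngram tmp.idngram -vocab tmp.vocab -binary tmp.binlm -ascii_input   # binary model for the decoder
```

Each stage consumes the previous stage's output, so the intermediate files (tmp.wfreq, tmp.vocab, tmp.idngram) can be inspected individually when debugging a bad model.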