Speech:Spring 2014 David Meehan Log



Week Ending February 4th, 2014
 * Task:
 * My primary task for this week's iteration is to begin exploring Caesar. To begin training and decoding, the modeling group must first become oriented with the files, configuration, and inner workings of Sphinx. Last week I researched how speech recognition works in relation to Sphinx, so I have a decent understanding of what the different file extensions are and what they are used for. With this in mind, my first task is to locate and analyze the following files:
 * Phonetic Dictionary: .dic
 * Phoneset: .phone
 * Language Model: .lm.DMP
 * Filler Dictionary: .filler
 * File mapping: _train.fileids and _test.fileids
 * Training transcript: _train.trans and _test.trans
 * If the group feels comfortable with the system, I would like to attempt to initialize a train using the Tiny dataset.


 * Results:

2/1
 * I was able to successfully SSH into Caesar and change my password.
 * I reviewed the Experiment Setup and Training guides. I also checked the revision history on both to make sure the information is current with the previous class's findings.
 * Analyzed the Exp directory to see where the various dictionaries and files are located. There are two primary places to find these files. The first is the directory /root/speechtools/SphinxTrain-1.0/, which acts as the base from which the train data is taken. In preparation for a train we will copy the files of one of the baseline trains in this directory to /mnt/main/Exp/. Inside the Exp directory, most of the core files can be found in /mnt/main/Exp//etc:
 * The .dic, .phone, .filler, .fileids and .trans files
 * Analyzed the Switchboard corpus data sets. The audio files are in the .sph file format. This is relevant because some Sphinx settings vary slightly depending on whether we are using .sph or .wav. From Eric's logs it would also seem we are using 8 kHz sound files, which will also affect the settings we use (some audio settings, such as the low and high filter cutoffs, could be affected).
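 * For reference, these are the audio-related lines I expect to find in etc/sphinx_train.cfg. The variable names below come from a stock SphinxTrain configuration and the values are the ones the CMU documentation suggests for 8 kHz telephone .sph audio; both still need to be verified against our actual config before anything is changed.

$CFG_WAVFILE_EXTENSION = 'sph';   # Switchboard audio is NIST Sphere, not .wav
$CFG_WAVFILE_TYPE      = 'nist';  # one of: nist, mswav, raw
$CFG_WAVFILE_SRATE     = 8000.0;  # sample rate of the recordings
$CFG_NUM_FILT          = 31;      # filterbank size suggested for 8 kHz audio
$CFG_LO_FILT           = 200;     # low filter cutoff (Hz)
$CFG_HI_FILT           = 3500;    # high filter cutoff (Hz)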

2/2
 * Read logs

2/3
 * I spent a little more time analyzing the training and experiment documents.
 * It appears some changes were made to the configuration of Caesar without being modified in the wiki:
 * The biggest change being that there are no script files or Sphinx training directories in /root anymore
 * I found the scripts, which were located at /mnt/main/scripts/user.old/ and /mnt/main/scripts
 * The training data is located at /mnt/main/root/tools/SphinxTrain-1.0/train1
 * Began a separate training process, 0145 (Colby is working on 0144), using the Tiny corpus. The trainer produced an error at step 6, much like Colby's experiment. I am working on adding those items to the dictionary. I will also check whether we have a script available to parse the HTML output and grab all the words that failed; if not, I will begin writing the script. It appears many of the missing words have a quotation mark either before or after the word. I'll need to look into this more to determine how those words are represented in the dictionary (i.e. do we need to enumerate every word preceded or followed by a ").

2/4
 * Wrote process_missing_words.pl: reads in an HTML error file produced by the training software and generates a text file with the missing words:

#!/usr/local/bin/perl

# Expect exactly one argument: the HTML error file produced by the trainer.
if ($#ARGV != 0) {
    print "Usage: process_missing_words.pl <error_file.html>\n";
    exit -1;
}

$HTML   = $ARGV[0];
$search = "WARNING: This word: ";

open my $MYFILE, $HTML or die "Could not open $HTML: $!";

# Print each missing word, stripping the warning prefix and everything after the word.
while (my $line = <$MYFILE>) {
    if ($line =~ /$search/) {
        $line =~ s/$search//g;
        $line =~ s/\s.*//g;
        print "$line\n";
    }
}

 * To run the script, type the following on the command line:

/mnt/main/scripts/train/scripts_pl/process_missing_words.pl /mnt/main/Exp/ / .html > missing_words.txt

 * I haven't tried to add any words to the dictionary yet. The quotation marks present in the error log lead me to believe something else went wrong. Tomorrow (or later today) I will retry the train and see if I get the same results.
 * I am going to try stripping the " characters from the transcript and training again. I'm not convinced these characters serve any purpose in the actual pronunciation of words. To remove the character I ran the following:

sed -i 's/"//g' 0145_train.trans

 * Reran the train (it still failed, but the list of missing words is much smaller and easier to add back in now).
 * Some words did not exist in the pronunciation database; for those words I combined several known words that collectively make up the sound.
 * If we plan to automate training, adding missing words will be the hardest part. I am looking at the pronunciation site, and thankfully it uses query strings to process words. With that in mind, if Perl can retrieve and parse a web page it could be entirely possible to automate even this. I have done similar tasks in Java and PHP, where I retrieve and parse HTML data, so I suspect a similar function exists in Perl as well. The other hard part will be automating words that cannot be found there; this can be remedied using my technique of combining known sub-words. The script could prompt the user to enter several sub-words which collectively make up the missing word. (A rough Perl sketch of the lookup idea follows at the end of today's notes.)
 * Train failed:

MODULE: 45 Prune Trees
Phase 1: Tree Pruning
FATAL: "main.c", line 167: Unable to open /mnt/main/Exp/0145//trees/0145.unpruned/AW2-0.dtree for reading; No such file or directory

MODULE: 50 Training Context dependent models
Phase 1: Cleaning up directories: accumulator...logs...qmanager...
Phase 2: Copy CI to CD initialize
Phase 3: Forward-Backward
Baum welch starting for 1 Gaussian(s), iteration: 1 (1 of 1)
0% FATAL_ERROR: "main.c", line 1054: initialization failed

 * It is likely that the failure was caused by the quotation marks.
 * I began running a train using the last 5 hour data model. I couldn't find a missing words document so I began building one using the CMU website tool provided in the wiki.
 * I wrote the modeling and introduction sections of the proposal.
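 * As a rough sketch of the lookup automation mentioned above, the following Perl reads missing words on standard input, fetches a lookup page for each, and prints dictionary-style lines for the ones it finds. The URL, query-string parameters, and the parsing regex are placeholders; the real ones would have to be taken from the pronunciation site we have been using.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Placeholder lookup page; the real site and parameter names still need to be confirmed.
my $lookup_url = "http://example.edu/cgi-bin/pron-lookup";

while (my $word = <STDIN>) {
    chomp $word;
    next unless length $word;

    my $html = get("$lookup_url?in=$word&stress=no");
    unless (defined $html) {
        print STDERR "LOOKUP FAILED: $word\n";
        next;
    }

    # Placeholder parse: assume the page prints "WORD  PHONE PHONE ..." on one line.
    if ($html =~ /^\s*\Q$word\E\s+([A-Z012 ]+)$/mi) {
        print uc($word), "\t", $1, "\n";        # emit a dictionary-style line
    } else {
        print STDERR "NOT FOUND: $word\n";      # would need manual sub-word entry
    }
}

 * Fed the output of process_missing_words.pl, something like this would print found entries to stdout and leave the unresolved words (numbers, misspellings, names) for manual entry.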


 * Plan:
 * (Estimated Deadline: 2/2): Analyze the settings we are currently using for the train data. Make sure the settings match the specifications provided by Sphinx for our file types and audio details. Even if they differ from the recommended values, I will not change anything until we have a successful train completed. I am also curious about the density and senone settings currently set; as I understand it, 5 hours of audio calls for 200 senones and a density of 8. A senone count that is too large will result in overly sensitive speech recognition that fails to account for diversity in speech patterns, while a count that is too small will not be discriminating enough to discern differences between words. Manipulating these settings could help improve our baseline (the relevant config lines are sketched just after this list).
 * (Estimated Deadline: 2/4): Run a Tiny train. Estimates suggest this will take about 30 minutes, not accounting for failures. We should allocate at least two hours to this task. If we are successful it would also be good to try and decode our data.
 * (Estimated Deadline: 2/3): Automate parsing the HTML error file.
 * (Estimated Deadline: TBD): Automate adding dictionary words by retrieving the correct pronunciations.
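 * For reference, the senone and density settings live in etc/sphinx_train.cfg. The variable names below are from a stock SphinxTrain config (to be double-checked against ours); the values shown are the 200/8 combination discussed above, not a recommendation.

$CFG_N_TIED_STATES         = 200;  # number of senones (tied states)
$CFG_INITIAL_NUM_DENSITIES = 1;    # training starts from one Gaussian per state
$CFG_FINAL_NUM_DENSITIES   = 8;    # Gaussians per senone in the final models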


 * Concerns:

Week Ending February 11, 2014

 * Task:
 * Resolve issues with Tiny and Mini data trains.
 * Improve dictionary2.pl for better performance.

2/6
 * Results:
 * Completed a 5 hour train using first_5hr.
 * Spoke with Pauline and Ray, told them about the problems we were encountering with Tiny trains, told them to use first_5hr.
 * Continued working on fixing the Tiny data corpus.
 * Colby and I worked with Pauline to show her how to run a train.
 * I created a language model and began decoding Exp 0150 (using the acoustic model from 0148).
 * To do this efficiently without wasting server resources, I created a new experiment (0150) and created symlinks pointing to the files located in experiment 0148 (the acoustic model).
 * This worked, except when I needed to run the decode. The decode asks for the experiment number of the acoustic model. When I gave it 0150 as the model, it attempted to access the files located at /mnt/main/Exp/0150/, which linked back to 0148. This worked, but because the files in the original directory contained the exp id 0148 in their file names rather than the 0150 I provided, the decoder failed to run. To fix this, I created a symlink in 0148 called LM (the directory created for the Language Model) which pointed to the LM directory in 0150. When I decoded, I told it the experiment number for the acoustic model was 0148, which was able to access the needed files as well as the Language Model via the symlink. The decode results were located in /mnt/main/Exp/0150/DECODE. I may clean out the LM symlink in 0148 to ensure that the directory is standalone, but before I do I want to make sure no files in 0145 were changed during the construction of the Language Model or decoding.
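 * For the record, a minimal sketch of the LM symlink described above, using Perl's built-in symlink() (only the LM link is shown; the per-file links from 0150 back into 0148 are omitted):

#!/usr/bin/perl
use strict;
use warnings;

# Expose 0150's language-model directory inside 0148, so a decode told to use
# acoustic-model experiment 0148 can also find the LM built for 0150.
symlink("/mnt/main/Exp/0150/LM", "/mnt/main/Exp/0148/LM")
    or warn "Could not create LM symlink: $!";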

2/7
 * The decode finished last night. I scored the decode.log file and got the WER; the results are archived in the experiment log for experiment 0150. The WER was pretty bad (47%). I need to take a closer look and determine why it was so high; perhaps it has something to do with the settings in the Sphinx configuration file.

2/10
 * I noticed a number of people had created experiments over the weekend but nobody created an experiment wiki page (Experiments 0151-0155).
 * I created the log entry for those experiments, specifying the username and date information from checking each experiment with ls -all.
 * Created a new experiment, 0156. This is the same as 0148 except I changed the density to 8 and the senones to 200.

2/11
 * Spoke with Colby J. about the Tiny and Mini trains. If we could get them to work, we could do a brute-force approach with some of the known parameters to see what produced the best results. If we could do that and get Torque running this may actually be feasible. Colby found that the transcript appeared to be in a different order. It is unclear as to whether or not this makes a difference.
 * I created a new version of run_decode.pl (run_decode2.pl) which accounts for the number of senones. The previous script assumed that there were 1000 senones, but if you change that number some of the files in model_params have a different name. By adding an extra parameter to the script I was able to automate this (a rough sketch of the idea follows at the end of today's notes).
 * I created the LM and decoded experiment 0158. For some reason the decode took over 8 hours and still had not completed. I ended up having to stop it. Despite this I was still able to score it, and everything seemed to be in order somehow, although the final scoring was quite abysmal.
 * I am not sure whether the decreased performance was a result of changing the senones or because I stopped the script. The fact that I could score suggests perhaps the former. Eric seemed to think that increasing the senones would improve the rate. The Sphinx guide says to use a smaller number for 5 hours, but based on the performance perhaps I will try increasing them next time. It does make some sense that using a higher senone count would improve the decode process as it is using the same data that was used to build the models. Perhaps the lower number is only viable for general use where different people will be using it.
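 * The run_decode2.pl change mentioned above is essentially just building the model path from a third command-line argument instead of a hard-coded 1000. A rough sketch (the model_parameters naming below follows the usual SphinxTrain pattern and is an assumption about our layout, not a copy of the actual script):

#!/usr/bin/perl
use strict;
use warnings;

my ($am_exp, $lm_exp, $senones) = @ARGV;
die "Usage: run_decode2.pl <am_exp> <lm_exp> <senones>\n" unless defined $senones;

# SphinxTrain names the continuous CD model directory after the senone count,
# e.g. <exp>.cd_cont_1000, so the count has to appear in the path.
my $model_dir = "/mnt/main/Exp/$am_exp/model_parameters/$am_exp.cd_cont_$senones";
die "No such model directory: $model_dir\n" unless -d $model_dir;

print "Using acoustic model: $model_dir\n";
# ...the rest would point sphinx3_decode at this model and at the LM from $lm_exp...

 * Called as run_decode2.pl 0172 0172 3000, a script along these lines would pick up the 3000-senone models instead of assuming 1000.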


 * Plan:
 * Continue running trains with different parameters. My next goal is to use a higher senone count to see if that helps.
 * Work with data group to get the tiny data fixed.
 * Concerns:
 * It would be much more efficient to test if we could use the two shorter data sets. Training and decoding on 5 hours takes a long time, and limits the amount of experimentation we can do.

Week Ending February 18, 2014

 * Task:
 * Improve the baseline for experiments by working with the senone and density variables.
 * Build a master dictionary containing all words present in the full corpus
 * One problem we have to account for when training is missing words in the dictionary. Thus far we have been using the first_5hr data set because we already have a txt file with all the missing words for that train. In the future, it will be important to merge all transcript words into one dictionary that works for any data set. I created experiment 0179, where I will be working on finding and defining the pronunciations for all missing words. We have the 100 or so missing words from the first_5hr, and I am currently filling in the 300 missing words for the 10hr (fewer, since I merged in the first_5hr words we already have). A sketch of the bookkeeping involved follows this list.
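 * As a sketch of the bookkeeping behind the master dictionary, the following lists every transcript word with no dictionary entry. The file names are examples and the transcript handling is simplified (a real version would skip utterance ids, timing fields, and filler markers); pruneDictionary already does much of this, so this is only to make the idea concrete.

#!/usr/bin/perl
use strict;
use warnings;

my ($dict_file, $trans_file) = @ARGV;
die "Usage: find_missing.pl <dictionary.dic> <transcript.trans>\n" unless defined $trans_file;

# Load the dictionary head-words (first field of each line), dropping
# alternate-pronunciation markers such as WORD(2).
my %known;
open my $DIC, '<', $dict_file or die "Could not open $dict_file: $!";
while (<$DIC>) {
    my ($w) = split ' ';
    next unless defined $w;
    $w =~ s/\(\d+\)$//;
    $known{uc $w} = 1;
}
close $DIC;

# Walk the transcript and collect every word that has no dictionary entry.
my %missing;
open my $TRANS, '<', $trans_file or die "Could not open $trans_file: $!";
while (<$TRANS>) {
    for my $w (split ' ') {
        $missing{uc $w} = 1 unless $known{uc $w};
    }
}
close $TRANS;

print "$_\n" for sort keys %missing;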


 * Results:

2/12 - 2/16
 * I created a new experiment, 0160, which also uses the first_5hr data set. This time I am configuring the experiment with 2000 senones instead of 200. I read over more of Eric's experiment logs, and on one he posted a link to the Sphinx FAQ page. On that page there is a table expressing the relationship between data length and senone count; according to it, 4-6 hours of data calls for 2000 senones. I am a bit perplexed by this, as the Sphinx training guide shows a similar table but with different mappings. I suspect the difference is because the version Eric used was for Sphinx 3, while the version I looked at was for the most recent version of Sphinx, which could explain the discrepancy. I will try an experiment with 2000 senones and see if I can match Eric's run, which was 30% WER. My own result may be somewhat higher, as first_5hr trains tend to have a higher WER than last_5hr trains.
 * Read logs, in particular, I looked into the results for the brute force trains Colby had started last Thursday. According to the results the ideal value is a senone value of and a density of 64, with a WER of 15%. I am unsure how well this would perform in an actual test, or if this configuration works well only with this data. When the data group builds the test data sets it will be important to test these results further.
 * I took a closer look at the Tiny data set in comparison to the first_5hr. The lines present in both are almost identical except for the quotation marks. I started checking to make sure all the sounds were present, but it is still unclear how the mapping works. It appears as though multiple wav files have been compressed into one sph, which would explain why the dic file references sound files that don't exist.
 * I worked on the proposal, formatting it to follow Josh's template. We still need to determine the deadlines and task delegation.
 * I built a new experiment, 0178, with plans to run the optimal configurations from Colby's test on the 10hr train. So far the largest train we have run has been the first_5hr. I am not sure if an add.txt file exists for this data set, but if not I will proceed to generate the pronunciations for the missing words.

2/17
 * Began working on building the missing words dictionary for the 10hr train. Train 0127 was done using the 10hr data set, and appears to be the most recent 10hr train. There is a word list already done there, but the words do not seem to match the list I got. My guess is that the data set may have been changed since then, or perhaps the dictionary was updated with the new words. I have copied over 0127.dic. If this file contains all the missing words I need and matches the transcript for the 10hr train, I will use it as the baseline for the 300 words I am missing.
 * I created experiment 0179. The purpose of this experiment is to build a master dictionary that contains all the missing words from the full transcript. This is one process that consistently slows us down when using any data other than the first_5hr train. First I will finish the 10hr data. When that is done I will find all the missing words, merge the 10hr words in, and fill in the rest. Depending on how many are missing I may attempt to automate this, but it may be too variable to actually do. Unfortunately many of the words are misspelled or incomplete, and must be entered by hand. If we can create a master dictionary we will no longer need to worry about which data set we are using, as all words will be present.
 * I discovered a Perl module called Net::SSH::Perl. This module allows Perl to open SSH sessions and run remote processes. If we cannot get Torque running, it may be worthwhile to investigate this module. Most of our intensive processes are Perl scripts, such as genTrans and pruneDictionary. Example code for running this is as follows:

use Net::SSH::Perl;

my $host = "miraculix";
# Retrieve the current user from the environment (falling back to a uid lookup).
my $user   = $ENV{LOGNAME} || $ENV{USER} || getpwuid($<);
my $passwd = "";
my $cmd    = "hostname";    # example remote command

my $ssh = Net::SSH::Perl->new($host);
$ssh->login($user, $passwd);

# Run the remote command; cmd($cmd, $stdin) can also be used to supply standard input.
my ($stdout, $stderr, $exit) = $ssh->cmd($cmd);
# Iterate over $stdout / $stderr to process the remote output.
 * I began training experiment 0178.
 * I did more work and finally got the Tiny train running (experiment 0145). Strangely, by changing the senone and density values the train was successfully able to run. Normally it stops and errors out at process 45, due to missing dtree files in /mnt/main/Exp/ /trees/ _unpruned/
 * Created LM for 0145 (experiment 0184). After that was created, began to decode.

2/18
 * Colby created a new dictionary for the 10hr train, containing all the words. We will use this as the source dictionary for 10hr trains from now on.
 * The decode for experiment 0184 failed due to duplicate entries in either the base transcript or the hyp.trans file.
 * I ran the uniq command on both, but the problem was not resolved.
 * I began reviewing the experiments for trains that were not "test on train". It seems most were, although I found two experiments labeled as "test on dev": experiments 0111 and 0024. To help us determine the success of configuration modifications, it will be important to test not only on our training data but also on external data, to ensure our models are not too highly tuned to the training data. I created a new experiment (0183) with the plan of exploring the use of different data when decoding. My first step will be modifying the run_decode2.pl script I wrote last week to allow decodes on different data.
 * I added a new section to the proposal for our test on development experimentation. I also added in an introduction and added it to the working final version Josh has setup.
 * My decode keeps failing because of inactivity in the terminal (despite having set the wakeup signal). I found the following command which will cause the process to run outside of the context of the session:

nohup run_decode2.pl 0172 0172 3000 > outfile &

 * I logged out and ran the following command:

ps r | grep decode

 * This displayed a list of running processes with the word decode. Sphinx3_decode was running in the process list.


 * Plan:
 * Shift research focus to decoding using test data instead of test on train. There are currently a number of test data sets already built. I will review these to make sure they are in working condition. The data group is also working on building these tests sets so I will keep an eye on what progress they are making.
 * Concerns:

Week Ending February 25, 2014

 * Task:
 * Begin decoding Colby's 5 base experiments (0171, 0173, 0174, 0175, 0176) using the last_5hr test data.
 * Learn more about our corpus.
 * Start looking at factors other than sphinx_config parameters for why our WER scores are so poor. I have the strong suspicion there are problems with the transcript files, such as missing words.


 * Results:

2/19
 * Added conclusion to proposal
 * Wrote a Perl script to calculate the total length of the corpus. The total time it calculated was 308 hours, which explains where the other group got that number. I'm looking at train.trans now to find out why it comes to that number:

#!/usr/bin/perl

if ($#ARGV != 0) {
    print "Usage: corpusSize.pl <corpus>\n";
    exit -1;
}

$length = $ARGV[0];
$main   = "/mnt/main/corpus/switchboard/308hr/train/trans/train.trans";
open(MYINPUTFILE, "<$main") || die("Error");

$time = 0;
# The second and third fields of each transcript line are the utterance start
# and end times; sum the durations over the whole file.
while (my $line = <MYINPUTFILE>) {
    @temp = split(' ', $line);
    $time += $temp[2] - $temp[1];
}
print("Total Time: " . $time . " Seconds!\n");
 * The 308 hours includes a large amount of overlap data, caused by using two channels. With this in mind I modified my script, counting the max number of seconds for each file, mixing the channels (noted by the highest second count value), and then adding them to a total. When I did this I got a total of 250 hours of data, which was the number Sam Workman found before. We should be seeing something along the lines of 97 hours, which means there is still likely something I am overlooking when doing these calculations. I did notice one potential problem though. I ran my script on other data such as the 10hr train and the first_5hr train, and found that the script produced expected results, i.e. they were 10 hours and 5 hours respectively. This means that either the data is in fact 250 hours long or the data subsets are timed incorrectly.
 * I did additional research on the switchboard corpus. It was released in a number of iterations. I found the following information in relation to the size of the corpus:
 * Switchboard-1 Release 2 was released in 1993 (and re-released in 1997). It contains 2,400 sound files from 543 speakers. These sound files include 1,155 labeled conversations of about 5 minutes each, which would leave us with 96 hours of data (Linguistic Data Consortium).
 * The initial report for Switchboard ("SWITCHBOARD: Telephone Speech Corpus for Research and Development"), released in 1992 by Texas Instruments, Inc. (Godfrey, Holliman, and McDaniel), states that the corpus consists of 500 speakers, for a total of 250 hours of speech. This seems to be the source most often cited when describing the size of Switchboard. Other sources I have looked at, such as "The Semi-Supervised Switchboard Transcription Project" (Subramanya, Bilmes) and "Investigation of Deep Neural Networks (DNN) for Large Vocabulary Continuous Speech Recognition: Why DNN Surpasses GMMs in Acoustic Modeling" (Pan, Liu, Wang, Hu, Jiang), suggest that this source is indeed referring to Switchboard-1 (noted above as 96 hours), which would make some sense since the Switchboard corpus had not yet been completed at the time of the 1992 report (making the 1993 release a probable actualization of the research still underway in the 1992 report). According to these other sources, the Switchboard audio files contain 320 hours total, 250 of which are recordings of actual speech (the rest is silence and other non-speech data). Why there is a discrepancy between most sources and the Linguistic Data Consortium estimate is still unclear. One answer may be that the LDC specified that 96 hours of data had been "labeled"; I do not know exactly what that means, or whether it differs from the actual total size of the data.
 * Also, according to the LDC overview page, 150 conversations were missing from the original release of the switchboard corpus. Conversations run for an average of 5 minutes each, which leaves us with 750 minutes of data, or 12.5 hours missing. Assuming what I previously found was true, this would make quite a bit of sense. If the total number of hours was 320, and 12.5 hours were missing from the initial release, assuming we were using the initial switchboard corpus we would have 308 hours total. This statement makes a number of assumptions, especially since we don't know if the 320 hour baseline was before or after the new conversations were added, or if the dual channels were accounted for somehow in this number. It is quite probable that this is just a coincidence that the new total would be 308, the same number we were getting for the full transcript.
 * Sources:
 * Well-known and influential corpora: A survey
 * The Semi-Supervised Switchboard Transcription Project
 * Investigation of Deep Neural Networks (DNN) for Large Vocabulary Continuous Speech Recognition: Why DNN Surpasses GMMs in Acoustic Modeling - Jump to the Experiment section.
 * Switchboard-1 Release 2
 * SWITCHBOARD: Telephone Speech Corpus for Research and Development - You will need to access IEEExplore via the UNH Library to read this article
 * Corpora Available from The Linguistic Data Consortium
 * Started adding words to the dictionary for the 308hr train we are doing.

2/20
 * Continued building the missing words dictionary. The pruneDictionary run Colby started last night finished; we are missing 3,500 words! The ten hour dictionary accounted for about 300 missing words, which leaves us with 3,200 words still.
 * To try and ease this process along, I wrote a utility script to fill in the missing words using the CMU pronouncing dictionary. The nice thing is that the website uses query strings for the word to search and a flag determining whether to use stress marks. After a few hours of development I had a working script. The final script was a simple HTML web page using JavaScript; I chose HTML/JavaScript because of the benefits Ajax provides for asynchronous processing of a large number of HTTP GET requests. The script worked but was not very effective: overall, it only managed to find about 1/20 of the total words. The problem is that an overwhelming majority of the missing words are not actually words at all. They are either 1) numbers, 2) misspelled words, 3) last names, or 4) abbreviations of words. Because of this, the CMU dictionary was very ineffective at finding most of the words. I could fine-tune the JavaScript to process numbers, but even then it wouldn't be worth the time, as there are only about 100 numbers.
 * With the above in mind, it does raise a red flag about our data. Based on the numerous misspellings and partial words, I have to imagine this has a fairly strong negative impact on the effectiveness of our models. At a bare minimum it would likely result in a large number of decoding errors, as the decoder must decide which word was spoken when there are two to four versions of the exact same word with the same (or very similar) phonemes but slightly different spellings. The Language Model might be able to offset this, but to what extent I am not sure. The other problem is that it is unclear whether the misspellings actually reflect the pronunciation in the audio or are transcription errors.

2/21
 * WER by density: density 8: 69%; density 16: 68.7%; density 32: 71.0%
 * Created experiment 0194 (all are using 3000 senones, last_5hr test data, genTrans5.pl)
 * Began decoding 0171 using last_5hr test in /mnt/main/Exp/0194/0171/
 * Began decoding 0173 using last_5hr test in /mnt/main/Exp/0194/0173/
 * Began decoding 0174 using last_5hr test in /mnt/main/Exp/0194/0174/
 * Cleaned the Windows newline characters from the transcript file in experiment 0192 using the following command:

tr -d '\15\32' < 0192_train.trans > 0192_train.trans.new

 * Finalized corpusSize.pl, which calculates the total size of the corpus provided as arg 0 to the script (only the basename, i.e. /mnt/main/scripts/user/corpusSize.pl 10hr). The script now resides in /mnt/main/scripts/user. Based on these calculations, our corpus sizes are as follows:

full: 256.4 Hours
308hr: 256.4 Hours
100hr: 76.7 Hours
10hr: 10.0 Hours
first_5hr: 5.0 Hours
last_5hr: 4.9 Hours
mini: 11.6 Hours
tiny: 1.2 Hours

2/25/2014
 * I further analyzed the last_5hr test data to check for compatibility with the first_5hr. It appears that the first_5hr dictionary contains roughly 700 of 1000 words in the last_5hr test dictionary, or around 70%. This would account for some of the quality loss I was experiencing, but it still seems to be a stretch that decoding using that data would result in 70% error.
 * Compared results with Colby. His decodes performed better (although still not great), using the 10hr acoustic model and LM. The larger data size seemed to make a fairly dramatic improvement; future tests with even larger data should be done.
 * I wrote two scripts, cleanTrans.pl and pullFromTrans.pl, which respectively clean all special characters out of the trans (much like genTrans without any sph processing or fileids generation) and extract the filler words. These will be used when Colby and I attempt to add the filler words to the filler dictionary. A sketch of the filler extraction idea follows below.
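 * A minimal sketch of the filler extraction idea, assuming fillers appear in the transcript as bracketed tokens such as [laughter] (the real pullFromTrans.pl may key on different markers):

#!/usr/bin/perl
use strict;
use warnings;

# Collect the unique bracketed tokens from the transcript read on stdin.
my %fillers;
while (my $line = <>) {
    while ($line =~ /(\[[^\]]+\])/g) {
        $fillers{$1} = 1;
    }
}

# One filler per line; these would then be merged into the .filler dictionary by hand.
print "$_\n" for sort keys %fillers;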


 * Plan:
 * Add missing words to train.trans in the corpus directories.
 * Fix misspellings in train.trans.
 * Run a larger acoustic/LM train and decode on those.
 * Concerns:

Week Ending March 4, 2014

 * Task:
 * Work with Colby to run tests dealing with the out of vocabulary words in the transcript. Our first test will be to build test data that does not contain any OOV lines (we are removing the line not the words). By eliminating these lines we will be getting a good perspective on what our results should be like when we find effective ways to clean them up.
 * Develop tools to provide us flexibility for training and decoding. We want to develop a script to strip all lines containing OOVs from the transcript (cleanTrans.sh).
 * Using our cleanTrans script we want to make clean transcripts for all the primary corpora.

3/1/2014
 * Results:
 * Worked with Colby J. and C. to begin preparation for our test trains (0199). We will be running a series of tests, replacing various out of vocabulary words.
 * Wrote a bash script to strip all lines containing [] or _1 from the transcript, replacing i- with i and removing {}. The script, cleanTrans.sh, is now in scripts/user (a rough Perl rendering of the filtering follows at the end of today's notes).
 * Built clean data transcript for the 10hr data.
 * Started preparing 9 trains 0200/[d8|d16|d32]/[s3000|s5000|s7000]/ using the last_5hr clean data transcript
 * We crashed Caesar due to running too many trains at once...
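 * For reference, a minimal Perl rendering of the filtering cleanTrans.sh performs, as described above (the real script is bash and lives in scripts/user; the patterns here only paraphrase it):

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    # Drop any line that still contains a bracketed marker or an _1 token.
    next if $line =~ /\[[^\]]*\]/ or $line =~ /_1/;

    # Normalize the remaining text: i- becomes i, curly braces are removed.
    $line =~ s/\bi-/i/g;
    $line =~ s/[{}]//g;

    print $line;
}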

3/2/2014
 * Started working on incorporating the shell script I wrote into a more complex script which will generate a transcript file containing x hours of clean data.

3/3/2014
 * Caesar is up again. I will resume the trains that were not finished on 3/1.
 * Continued working on the script to build sub-corpus transcripts. I added an extra parameter, called offset. The script takes in a transcript file (param1), a time in hours (param2), and an offset time in hours (param3), and produces a new transcript derived from the base transcript that is param2 hours long, starting at param3. The script works great for general transcript files but does not work for cleaned transcript files, because the time-calculating algorithm uses a shortcut present in the base transcript to avoid complex calculations. My goal is to make a general-purpose script that can read any transcript file and generate a new transcript from it. Since we already have a full cleaned transcript file, it would be great to be able to build new 5 hour and 10 hour transcripts from it. The current clean transcripts do not match the intended times; for instance, the first_5hr cleaned transcript was generated from the first_5hr train with all [] lines removed, making the total time less than 5 hours (I believe it is 3.8 hours). This script will give us accurate shortened transcripts and will make the generation of test transcripts very easy (a rough sketch of the approach follows at the end of today's notes).
 * I wrote a new version of corpusSize.pl, which uses a more accurate algorithm that works on any transcript. The original script could only calculate unclean transcripts and would error out on clean ones. The new script works for both, although it is a bit slower since we have to calculate the gaps and used time. According to the script we have 192 hours of clean audio.
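 * A rough sketch of the general-purpose sub-corpus approach described above: walk any transcript, skip utterances until the offset has passed, then emit utterances until the requested length has been collected. Field positions follow the same layout my corpusSize script assumes (start and end times in the second and third fields); this is an illustration, not the final createSubCorpus.pl.

#!/usr/bin/perl
use strict;
use warnings;

my ($trans, $hours, $offset_hours) = @ARGV;
die "Usage: createSubCorpus.pl <transcript> <hours> <offset_hours>\n" unless defined $offset_hours;

my $want = $hours        * 3600;   # seconds of speech to collect
my $skip = $offset_hours * 3600;   # seconds of speech to skip first
my $seen = 0;                      # speech seconds encountered so far
my $kept = 0;                      # speech seconds emitted so far

open my $IN, '<', $trans or die "Could not open $trans: $!";
while (my $line = <$IN>) {
    my @f   = split ' ', $line;
    my $dur = $f[2] - $f[1];       # utterance length from its start/end times

    if ($seen >= $skip && $kept < $want) {
        print $line;
        $kept += $dur;
    }
    $seen += $dur;
    last if $kept >= $want;
}
close $IN;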

3/4/2014
 * I finished the script createSubCorpus.pl which now resides in /mnt/main/scripts/user.
 * I built a new data set called 50hr, containing 50 hours of data after the first 10 hours, with both a clean and uncleaned version. To test this, I am running a train on this data using 16 density mixtures and 5000 senones (exp0204).
 * There will likely be missing words for this. I may not proceed to fill them out, as it is not our immediate goal to run large trains, so much as it is to fix OOVs. This experiment is merely a test to demonstrate our new data and the script which produced it.
 * Ran trains for 0200
 * d8 - s3000, s7000 (s5000 encountered a problem)
 * d16 - s3000, s5000, s7000
 * d32 - s5000, s7000 (s3000 encountered a problem)
 * Built the LM and symbolically linked it.


 * Plan:
 * Finish experiment 0200 with decodes to compare with Colby's results for the first_5hr. Last_5hr contains minimal cross talk so that may be the key to improving performance.


 * Concerns:
 * The data we are currently using for training is too small for successful decodes on test data. To decode successfully on held-out test data, we need larger trains, which necessitates filling in missing dictionary words. If we can build a dictionary for the clean data corpus there should be far fewer missing words, as we no longer have the incomplete words left behind by eliminating OOVs.

Week Ending March 18, 2014

 * Task:
 * For this week I continued working with Colby on refining the train process. First, I finished running the decodes for the 9 trains I did using the last_5hr data.
 * I also resumed working with decoding on foreign data. To start, I built a data set consisting of four hours of data outside of last_5hr and first_5hr.
 * Began decoding on the AM and LM for experiments 0200, 0199.

3/5/2014
 * Results:
 * The results for the decodes for the nine trains I did in experiment 0200 are in:
 * The biggest difference was in d8 and d16, which both improved when using last_5hr over first_5hr. Using a density of 32 actually decreased the performance on last_5hr. My guess is that this model is over-trained, which would explain the drop. For five hours we should be using around 3000 senones and a density of 8 or 16.

3/6/2014
 * I finished scoring the last two experiments and added the results above.
 * Extracted duplicate words and pronunciations in the dictionary file. According to a command I ran, there are over 11,000 pronunciations which appear in the dictionary at least twice, and often more. Additionally, using the following command I calculated that there are 15,968 duplicate pronunciations with the original pronunciation factored out:

sed "s/^[^ ]*\s //g" cmudict.0.7d | sort | uniq -c -d | sed "s/ /\n/g" | grep "^[0-9]" | tr '\n' "\+" | sed "s/\+/\-1\+/g" | sed 's/\+$/\n/g' | bc

3/17/2014
 * I created a new test dataset in last_5hr called test2. Test2 contains 4 hours of test data outside of the last_5hr data. The reason I chose 4 hours is that Colby mentioned that he saw that test data should be no more than 4 hours. Therefore, unless otherwise specified all test data will be 4 hours for uniformity. I will be using this data to test the AM built for experiments 0199 and 0200, which both produced results of less than 20%, the best of which was 15%. I still think this is over-trained, but I would like to see how we do in comparison to the decode I did on experiment 0175 (also got a WER of 15%) which decoded at 70% when decoding on unknown data. I'm hoping these experiments will provide better decodes.
 * The last_5hr test2 data was generated from the full clean corpus, containing 4 hours of data starting at the 30 hour mark (this means that we can use this same test data to test on first_5hr without problems). I used the script I wrote, createSubTranscript.pl to build this.

3/18/2014
 * The decodes I started yesterday finished. The results were as follows:
 * Decode on 0200/d32/s3000: 76.4%
 * Decode on 0190/d32/s3000: 79.0%
 * Decode on 0200/d16/s5000: 81.0%
 * Decode on 0199/d16/s5000: error
 * I am not sure why our decodes are still doing so poorly. It's possible the models with 32 densities are over-trained, but a density of 16 should do better than this. For consistency I am beginning decodes for the d8 models we built (s3000, s5000, s7000). When these are done I will try the remaining d16s and d32s. Depending on the results, I may need to analyze the decoding process I am using and see if there is an error there.


 * Concerns:


 * The decodes are still doing terribly. The first thing I will try is decoding on our data with the smallest density (8). Next, I may try to build a new Language Model for the full data and use that instead of the LM for the last_5hr. I am still determining which files we need to use from the base experiment and which we need to generate for our test data.
 * The problem is likely being caused by the dictionary, although it could be the language model. I have been using the dictionary for the new test data since it presumably will only need to find those words, but maybe if I try using a bigger dictionary the decode will do better.

Week Ending March 25, 2014

 * Task:
 * Continue analyzing the AM and LMs we built in previous experiments by running decodes against them using a 4 hour test data set I built from the middle of the full transcript.
 * Research decoder settings that may be useful for improving our scores.


 * Results:

3/19/2014
 * Started several decodes for the 0200 and 0199 d8 trains. Because the clean data is only 3.8 hours, it would make sense to use a lower density.

3/23/2014
 * Read logs

3/24/2014
 * I scored the decodes for s3000 and s5000; the results are as follows:
 * 0200-d8-s3000 = 83.4%
 * 0200-d8-s5000 = 86.3%
 * I had assumed that the lower density trains would do better, but in fact they did progressively worse than the larger density trains.
 * I also ran a test decode using the language model from experiment 0218 (the 100 hour train). Surprisingly, the results were decently better (a 6 percent improvement):
 * 0200-d8-s3000(dict) = 77.7%
 * With this in mind, I have begun running new decodes for the d32 trains using the new language model.
 * While those are going, I am also running another test decode of 0200-d8-s3000 using the dictionary from 0218, as well as hard-coding the sample rate on the decoder to 8000. The decoder may be defaulting to 16000, which would drive up the error on our decodes. If this is the case, the results for that experiment should be better than 77.7%.
 * I just looked at the decode.log file of one of my previous decodes: the decoder is indeed defaulting to 16000 for the sample rate. This should be 8000, as the Switchboard corpus uses 8 kHz audio. Hopefully the 8 kHz decode I am running now will improve the WER.
 * I started a new test-on-train decode for experiment 0200/d8/s3000 using a sample rate of 8000. Because these decodes use a density of 8, they should go quicker. I'm curious to see what our new test-on-train score will be with the correct sample rate; we got a 26% WER running it with a samprate of 16000.

3/25/2014
 * The decodes I ran yesterday have finished. The results were as follows:
 * 0200-d8-s3000-full_dict-samprate(test on eval) = 77.7%
 * 0200-d8-s3000-full_dict-samprate(test on train) = 29.9%
 * Strangely, changing the sample rate for the 0200-d8-s3000-full_dict(eval) had no impact on the WER at all, yet for the test on train variant it increased the WER quite a bit. I'm rerunning both of these decodes using the standard language models to try and get a better picture for what's happening. In retrospect I should have done this for the test on train experiment anyway since doing otherwise would add a new variable to the experiment (since the base used the standard LM).
 * I reran the decodes using the default language model, but once again the sample rate had no impact on the score.
 * With this in mind, I am now preparing to run a test on eval decode using the AM and LM for the 100 hour train. Maybe a larger model will produce better results.


 * Plan:
 * Run decodes to determine whether or not settings will improve the overall WER.
 * Concerns:
 * The decodes for all three density levels of our clean-data models performed poorly against the test data.
 * Changing the sample rate decoder setting surprisingly had no effect on the final WER; 16,000 Hz and 8,000 Hz performed identically.

Week Ending April 1, 2014

 * Task:

3/30/2014
 * Results:
 * I created an experiment we will be using to test the construction of acoustic models with the 100 hour data set (Exp/0245). To simplify the process, I created symbolic links to the artifacts of 0192 and overwrote the sphinx configuration file. Before I proceed, I want to analyze the files more closely to ensure there are no other files that need to be unique per experiment.
 * I created another experiment, 0244, in which I will prepare all the eval data sets so we can run decodes without having to reconstruct the feats, dictionary, and transcript files. Currently this process is held back because I first need to generate the eval and dev test data.


 * Plan:


 * Concerns:

Week Ending April 8, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 15, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 22, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 29, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending May 6, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns: