Speech:Spring 2014 John Kelley Log



Week Ending February 4, 2014

 * Task:

Read past logs and research the experiments and data gathered in previous semesters


 * Results:

I now have a basic understanding of the requirements and past experiments, along with the data gathered


 * Plan:

Continue to work on the proposal and understand the requirements
 * Concerns:

Communicating effectively with the group

Week Ending February 11, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending February 18, 2014

 * 2/15/2014: Read logs and brainstormed proposal ideas
 * 2/16/2014: Read logs
 * 2/17/2014: Read logs
 * 2/18/2014: Completed the proposal and submitted it to the Proposal group page
 * Task:

My first task for this week is to work on our proposal. After reading Josh's proposal group page, I believe his idea would work best for creating a streamlined narrative, something Professor Jonas is looking for. Uniform entries would drastically help the flow of the document. Reading past proposals, you can clearly see there was no real continuity in layout or even discussion; it seemed like every semester approached the proposal in its own way. I believe our proposal will present much of what the professor is looking for. I fully plan to take advantage of Josh's idea and submit our proposal to the proposal group's page in the same format he has listed.

My second task was to continue working on training. I'm still reading the how-to guide on running trains and tests on trains.

My third task was to familiarize myself more with Unix. I still find myself struggling with basic commands that others seem to have no problem remembering. This is my first time actually working with Unix, so remembering basic commands is proving to be my only setback. I have been reading basic Unix tutorials around the internet.
 * Results:

I have successfully put our assignments and thoughts into text with our proposal. I feel the Data Group will accomplish a lot this semester and make things much easier for future Data Groups. Coming into this blind certainly wasn't fun; I constantly found myself stumped when people asked questions I felt I should know, like what format the audio is in, where it is located, whether the transcripts are complete, and whether they contain odd characters. After doing some research, I finally feel more confident in my abilities and in my understanding of our group's tasks and responsibilities.
 * Plan:

I plan to complete a couple of basic Perl tutorials to get an understanding of the language. I have little programming experience, and what I do have isn't very strong. Programming has never been a strong suit of mine, but I'm willing to learn Perl, not only for this project but because it will be helpful in future endeavors. I also plan to run an instance of the genTrans6 script myself. I found this website while googling basic Perl tutorials (http://www.perl.com/pub/2000/10/begperl1.html). After following it for a while, I feel I have a basic understanding of the language. Thankfully, it doesn't seem nearly as difficult as some other languages. For some reason, Perl reminds me of PHP. Maybe this is because they are similar, but I wouldn't know for sure because my knowledge of PHP is essentially non-existent.
 * Concerns:

Learning Perl will be difficult. I don't expect to understand much of it by the time the week is over; it will take me a couple of weeks to feel comfortable using it. As I explained earlier, I have little to no programming experience.

Week Ending February 25, 2014
 * 2/23/2014: Read logs
 * 2/24/2014: In the process of updating the wiki
 * Task:

My task was to create a page on MediaWiki that would be useful to us as the Data Group, and to future Data Groups. My plan is for it to contain the most up-to-date information on, and locations of, the things the Data Group is responsible for, such as transcripts and audio. The URL for this page is http://foss.unh.edu/projects/index.php/Speech:DataInfo. I will continue to update it periodically over the week.
 * Results:

I have added the locations of the various transcript files on Cesar to the wiki page. I also added a link to Matt Henniger's transcript Excel spreadsheet, which is used to calculate the time of the transcripts.
 * Plan:

 * Concerns:

Week Ending March 4, 2014
 * 3/1/2014: Read logs
 * 3/2/2014: Read logs
 * 3/3/2014: Read logs and worked on unix commands for cleaning transcripts
 * 3/4/2014: Added additional unix commands containing more advanced regular expressions
 * Task:

Learn Unix better to help the Modeling Group clean up the transcripts.
 * Results:

After using basic commands and asking my peers for assistance, I performed the following commands to try to replicate what Professor Jonas did in class. My Unix knowledge is still very basic, and my regular expressions were very weak.

I first changed to the transcript file location:

cd /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/

Then I performed an ls to see all the files listed. I couldn't remember the name of the main transcript file, but saw it was called ms98_icsi_word.text. I then used cat to page through the text in the PuTTY shell window:

cat ms98_icsi_word.text | less

I wanted to use the vim built-in text editor to do a quick find on the brackets, so I typed:

vim ms98_icsi_word.text

At this point I realized I was definitely behind with my Unix knowledge, because I was stuck. With help from a coworker, I piped cat through uniq to see unique text:

cat ms98_icsi_word.text | uniq

Then I grepped for the left bracket:

cat ms98_icsi_word.text | grep '\['

I realized this wasn't nearly enough, especially considering the complexity of Professor Jonas' regular expression, so I looked at grep's help:

grep --help
grep --help | more

I saw the -w option and thought it would be useful for showing only lines where the bracket appears as a whole word, and it was:

cat ms98_icsi_word.text | grep -w '\['

My coworker also told me about another built-in text editor that doesn't require as much command knowledge, called nano. I opened the transcript in nano to try to make the process easier, but this is where I became stuck:

nano ms98_icsi_word.text

This is where I'm at right now, and I want to continue working on regular expressions and Unix knowledge to get these transcripts cleaned up a bit.
 * Plan:

I plan to continue collaborating with my coworker who is knowledgeable in Unix, and also with the Modeling Group, especially Colby and David, who seem to have a good amount of Unix and regular expression knowledge. After doing more research on regular expressions, I came across a useful website: http://www.grymoire.com/unix/Regular.html contains a large list of regular expression topics, including grep, and explains the syntax and the different characters that can be used. For example, the ^ and $ characters anchor a pattern to the beginning and end of a line or string. So if I want to find brackets in our transcript, I could do the following command:

grep '\[.*\]'

This matches a left bracket anywhere on the line, followed by any run of characters, and then a right bracket. An even more advanced grep that looks only for lines containing [text-text] would be:

grep '\[.*-.*\]'
 * Concerns:

My biggest concern was messing something up because of my lack of Unix knowledge. I didn't want to ruin the transcript file and have everyone hate me. Even more so, I don't know what our backup situation is like, so I thought that if I really wanted to make changes to the transcript file, I could send the changes to a new text file; that way, if something catastrophic happens, a backup has been created. The Unix command I learned to copy a file is as follows:

cp ms98_icsi_word.text ms98_icsi_word_duplicate.text

This command copies the contents of the first transcript file to a second text file.
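To make the bracket hunting concrete, here is a small runnable sketch. The sample lines and file names are made up for illustration, not taken from the real Switchboard transcripts, and it assumes a standard grep and sed.

```shell
# Create a tiny sample transcript with Switchboard-style noise annotations.
cat > sample.text <<'EOF'
sw2001a-0001 hello there
sw2001a-0002 [silence]
sw2001a-0003 i was [noise] saying
sw2001a-0004 um [laughter-yeah] right
EOF

# Lines containing any bracketed annotation:
grep '\[.*\]' sample.text

# Only annotations of the [text-text] form, such as [laughter-yeah]:
grep '\[[a-z]*-[a-z]*\]' sample.text

# Always work on a copy, never the master transcript; strip the
# bracketed annotations while writing to the new file:
sed 's/\[[^]]*\]//g' sample.text > sample_clean.text
```

The copy-then-edit step at the end is the same backup idea as the cp command above: the original file is never modified.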

Week Ending March 18, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending March 25, 2014

 * 3/23/2014: Read logs
 * 3/25/2014: Began running an experiment. The directory to my newly created experiment can be found here: http://foss.unh.edu/projects/index.php/Speech:Exps_0228

 * Task:

I will be following the wiki page on creating an experiment and running a train. I'm hoping the issues I run into will be easily resolved. I would like to complete my first experiment this week, and not have to wait until class to sort out my issues.
 * Results:

I noticed the wiki Experiments page had its last experiment listed as 0227, so I created a 0228 page. This is odd, however, because when I ran the master_run_train.pl script, the experiment directory 0237 was created. This leads me to believe there were most likely experiments run but not entered on the wiki. Output from the script:

Successfully created 0237 experiment directory. Please type '1' to continue:

I then continued to follow the steps, recreating the 0161 experiment. I used a density of 8 and a senone value of 3000, and ran it on the first_5hr/train corpus. Once genTrans8.pl was run, I received the following output from the script:

can't open file: No such file or directory at /mnt/main/scripts/user/genTrans8.pl line 29.

I don't know what this error means, but the script asked me to enter a 1 to continue, so I did. The rest of the script then ran, with a few errors here and there, most of them relating back to the missing file / directory described above. According to the output of the script, the dictionary and phone files were copied successfully. This is the output I received:

Executing pruneDictionary2.pl: /mnt/main/scripts/train/scripts_pl/pruneDictionary2.pl 0237_train.trans /mnt/main/corpus/dist/custom/switchboard.dic 0237.dic
text2wfreq : Reading text from standard input...
cat: 0237_train.trans : No such file or directory
text2wfreq : Done.
Now inside directory: /mnt/main/Exp/0237/etc
Copying over the filler dictionary ... cp -i /mnt/main/root/tools/SphinxTrain-1.0/train1/etc/train1.filler 0237.filler Success!
Copying over the genPhones.csh script ... cp -i /mnt/main/scripts/user/genPhones.csh . Success!
Executing genPhones.csh: ./genPhones.csh 0237 Success!
Successfully created the Phones file located in the /etc directory.

Then came step 5, where I needed to navigate to my experiment directory to call the RunAll.pl script. I ran into the same issue Jared experienced and described in his log. The error I received was as follows:

Configuration (e.g. etc/sphinx_train.cfg) not defined
Compilation failed in require at RunAll.pl line 48.
BEGIN failed--compilation aborted at RunAll.pl line 48.

I realized that I stupidly wasn't in my experiment's directory when I ran the script, which is why it was missing the cfg file. I navigated to my directory and ran the command again. This is what I received as output from the script:

MODULE: 00 verify training files
O.S. is case sensitive ("A" != "a"). Phones will be treated as case sensitive.
Phase 1: DICT - Checking to see if the dict and filler dict agrees with the phonelist file.
WARNING: The phonelist (/mnt/main/Exp/0237/etc/0237.phone) does not define the phone SIL (required!)
Found 3 words using 1 phones
WARNING: This phone (SIL) occurs in the dictionary (/mnt/main/Exp/0237/etc/0237.dic), but not in the phonelist (/mnt/main/Exp/0237/etc/0237.phone)
Phase 2: DICT - Checking to make sure there are not duplicate entries in the dictionary
Can not open listoffiles (/mnt/main/Exp/0237/etc/0237_train.fileids) at /mnt/main/Exp/0237/scripts_pl/00.verify/verify_all.pl line 203.
Something failed: (/mnt/main/Exp/0237/scripts_pl/00.verify/verify_all.pl)

I have no idea what the error "Something failed:" is supposed to mean, so at this point I'm at a standstill until I get to class tomorrow (Wednesday) and can discuss my issues with the experiment experts. I wish I had a better understanding of the issue, as I would really like to successfully run a train on my created experiment. On the original tutorial page I noticed it mentioned that the first train will usually fail, and to run it again. It said an html file would be output to the exp directory containing readable information. When I ran lynx 0236.html, however, all I got was what was already in the terminal window, and it wasn't very useful.
 * Plan:

I want to go through the wiki tutorial to run my first experiment (a 0161 clone). I chose 0161 because it was a first_5hr run with a density of 8 and a senone value of 3000, generally basic values. Also, Jared seemed to think this would be a good experiment to start with, so I took after him and chose it as well. I would like to finish all of this Tuesday night when I get home.
 * Concerns:

I realize I'm bound to run into errors.
 * 3/25/2014: I of course ran into an issue when I ran RunAll.pl. I received the output "Something failed: (/mnt/main/Exp/0236/scripts_pl/00.verify/verify_all.pl)"
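Looking back, a small shell sanity check would have surfaced these problems before running RunAll.pl: it verifies that etc/sphinx_train.cfg is visible from the current directory (the "Configuration ... not defined" error) and flags any empty or missing generated files under etc/, which would explain errors like "Can not open listoffiles". This is only a sketch; the 0237 layout below is simulated, not the real /mnt/main/Exp tree.

```shell
# Simulate an experiment directory like /mnt/main/Exp/0237 (layout assumed).
mkdir -p 0237/etc
: > 0237/etc/sphinx_train.cfg    # placeholder config
: > 0237/etc/0237.phone          # empty phone file, like my failed run

# RunAll.pl must be launched from inside the experiment directory,
# where etc/sphinx_train.cfg is visible; check for it first:
if [ ! -f 0237/etc/sphinx_train.cfg ]; then
    echo "sphinx_train.cfg not found - cd into the experiment directory first" >&2
fi

# List any generated files that came out empty; training cannot succeed
# until these are populated:
find 0237/etc -type f -empty
```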

Week Ending April 1, 2014

 * 3/26/2014: Worked on experiments in and out of class
 * 3/27/2014: Attempted to run the decode on experiment 0241 with no success
 * 3/28/2014: Successfully got my decode to run on experiment 0241
 * 3/30/2014: Read other's logs. Added more to my results section

 * Task:

Due to last week's unsuccessful experiment, I'm going to re-create my original failed experiment with Colby Johnson, and then create another on my own.
 * Results:

In class today (3/26/2014) I went over my issues with Colby. We first discovered that my RunAll.pl script failed because SIL was missing from my phonelist. After further investigation, we realized the master script had failed to run genTrans8.pl. I saw this in the terminal when I ran it yesterday (the 25th), but in my naivety assumed it was OK to continue, since the script kept going and asked me to enter '1' to continue. Once Colby and I took a look at my experiment directory, we realized the phonelist, and essentially every other generated file, was empty because genTrans8.pl had failed. None of my files were populated with information, which is why I received that error when I ran RunAll.pl. Colby helped me recreate the experiment; we used a density of 8 and a senone value of 5,000, and ran it against the mini corpus subset. I followed along as he successfully trained the data and then ran the decode. Unfortunately, even though this was originally my experiment, the modifications were made under Colby's username, so I was unable to score the data without receiving a "permission denied" error. I even tried sudo:

sudo sclite -r _train.trans -h hyp.trans -i swb >> scoring.log

However, even after giving the root password, I was still presented with the permission-denied error. I'm going to have to leave scoring this experiment up to Colby.
 * Update: Colby ran the score and the following results were presented:

                 SYSTEM SUMMARY PERCENTAGES by SPEAKER

    |                            hyp.trans                            |
    |=================================================================|
    | SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
    |=================================================================|
    | Sum/Avg |  549  10774 | 79.7    8.7   11.6    4.8   25.0   85.4 |
    |=================================================================|
    |  Mean   |  2.9   57.0 | 81.5    8.2   10.3    7.8   26.3   86.2 |
    |  S.D.   |  1.9   44.5 | 16.3    8.1   13.5   14.8   19.7   25.2 |
    | Median  |  3.0   47.0 | 86.1    6.9    5.4    3.9   21.0  100.0 |
    `-----------------------------------------------------------------'

                          Successful Completion
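For my own reference, the Err column above is the word error rate: (substitutions + deletions + insertions) divided by the number of reference words. A quick awk check, using counts back-computed from the Sum/Avg percentages (so they are approximate, not sclite's real internal counts):

```shell
# WER = (Sub + Del + Ins) / reference word count, as a percentage.
# 10774 is the reference word count from the table; the raw error counts
# are reconstructed from the percentages, so rounding differs slightly.
awk 'BEGIN {
    ref = 10774; sub_ = 937; del = 1250; ins = 517
    wer = 100 * (sub_ + del + ins) / ref
    printf "WER = %.1f%%\n", wer
}'
```

This lands at about 25.1%, matching the 25.0 in the Err column up to rounding.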

I then ran my own experiment in class today. The experiment is 0241 (http://foss.unh.edu/projects/index.php/Speech:Exps_0241). I successfully trained the data in class, and all that's left is for me to create the language model, decode and then score the data. I am in the process of completing these steps as I write this log.

 * Update: I have successfully created the language model. One of my last lines of output, however, concerned me, as I don't know whether it's a real issue or not. The following was presented to me after running % ./lm_create.pl trans_parsed:

ERROR: "ngram_model_arpa.c", line 76: No \data\ mark in LM file

So now I'm moving on to the decode process. I remembered that when we ran the decode in class, it was running but nothing was output to the terminal. I forgot this initially, however, assumed the decode had failed because I didn't see anything, and interrupted my terminal. I don't know whether that will cause the decode to fail or not, but I ran it again with a trick I learned from Colby for running it in the background:

nohup ./run_decode.pl 0241 0241 &

Now I have the process running in the background. I remembered that in class the decode didn't take too long to complete, so I'm hoping I'll see some results soon, and then I can score my test on train. I opened my decode.log, noticed the following error on page 11, and believe my decode has stopped running:

FATAL_ERROR: "mdef.c", line 680: No mdef-file
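The nohup-in-the-background pattern mentioned above can be sketched generically; here sleep stands in for the real decode script, and the log file name is just for the demo:

```shell
# Run a long job in the background with nohup so it survives the terminal
# closing, and capture its output in a log file instead of the screen.
nohup sh -c 'echo decode started; sleep 1; echo decode finished' > demo_decode.log 2>&1 &
pid=$!

# For this demo we wait; in real use you would log out and check the
# log later with lynx or tail.
wait "$pid"
tail -n 1 demo_decode.log
```

Redirecting stdout and stderr into one file is what makes the log readable afterwards; without the redirect, nohup would write to nohup.out instead.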

 * 3/27/2014

I spoke with Colby and we determined the issue was that I was using the wrong decode script, and that I had not specified the senone value as a parameter when running the decode command. I copied the decode2.pl script to my DECODE directory and tried again, this time with the following command:

nohup run_decode2.pl 0241 0241 3000 &

This specifies that I'm using the decode2 script on my experiment, with my experiment's acoustic model and a senone value of 3000, while running it in the background. It ran a little longer than last time, but still failed. I opened my decode.log in lynx to see what the issue was. Instead of only 11 pages, I got 17 pages this time, so it ran about a millisecond longer than last time, which is still progress. The error I received was the following:

FATAL_ERROR: "lm_3g_dmp.c", line 462: fread(/mnt/main/Exp/0241/LM/tmp.arpa) failed

This was after it started reading the tmp.arpa file. I'm still at a standstill, waiting for feedback from Colby. He's been incredibly helpful.
 * 3/28/2014

Today I decided to try to run the decode again. I went to see Colby yesterday and he was stumped as to why my decode wasn't running. We looked at my language model, and although it was built, all of the files were empty. We thought this was strange, so we tried to build the language model again. Unfortunately, it failed a second time. Colby wasn't sure why the language model wouldn't build, and thought something must have changed. I suggested trying to build the language model with a different corpus subset than the mini/train we had been using. I left soon after, and don't know what happened between that time (~2:30pm) and right now (10:15am), but the language model was built successfully. I'm now running the decode, and so far it looks like it's going to finish without error.
 * Update: My decode ran successfully, and I ran SClite to score my experiment. I came out with the following results:

                 SYSTEM SUMMARY PERCENTAGES by SPEAKER

    |                         hyp.trans.uniq                          |
    |=================================================================|
    | SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
    |=================================================================|
    | Sum/Avg |  549  10774 | 86.0    6.6    7.4    4.8   18.8   84.2 |
    |=================================================================|
    |  Mean   |  2.9   57.0 | 87.3    6.5    6.2   10.1   22.8   86.1 |
    |  S.D.   |  1.9   44.5 | 10.8    7.7    6.8   22.8   24.2   23.8 |
    | Median  |  3.0   47.0 | 89.6    5.5    4.9    3.8   16.7  100.0 |
    `-----------------------------------------------------------------'

I don't know much about interpreting the results of the train, but I assume an 18% WER is good? I'll have to talk to Colby more about understanding what I'm actually doing, and not just spitting out scoring charts. I did, however, run into a few issues when scoring the experiment. The two common issues that occur when running SClite, as described on the wiki page, both happened to me. (Note: I didn't capture the actual errors I got, so the IDs below are different.) The two errors were as follows:

Error: double reference text for id '(sw2479a-ms98-a-0071)'

Error: Not enough Reference files loaded
Missing:
(sw2259a-ms98-a-0021) (sw2295b-ms98-a-0011) (sw2331a-ms98-a-0049) (sw2389b-ms98-a-0096)
(sw2428a-ms98-a-0017) (sw2442b-ms98-a-0059) (sw2451b-ms98-a-0044)

I was able to fix this easily with the help of my coworker. I think the wiki page should be updated with simple vi usage commands, because without my coworker's help I wouldn't have known how to enter search mode, etc. Anyway, I ran vi 0241_train.trans to open the file, pressed Shift + : to enter command mode, and typed :set ignorecase. Then I searched by entering a forward slash (/) followed by my ID, and found the two instances the wiki said I would. They were basically the same line with a slight difference. So I followed the wiki and removed the duplicate with the :d command, then saved and exited with :wq.
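The same fix can be done non-interactively with awk, keeping only the first line for each utterance ID. The transcript lines below are made up; only the duplicated ID comes from the wiki example:

```shell
# Fabricated reference transcript with a duplicated utterance id.
cat > 0241_train.trans <<'EOF'
yeah okay (sw2479a-ms98-a-0071)
yeah okay then (sw2479a-ms98-a-0071)
so anyway (sw2480b-ms98-a-0003)
EOF

# The utterance id is the last field; keep only the first line per id.
# This mirrors deleting the duplicate reference line with :d in vi.
awk '!seen[$NF]++' 0241_train.trans > 0241_train.trans.fixed
```

One caveat: unlike the manual vi edit, this keeps whichever duplicate comes first, so it's worth eyeballing the two lines before deciding which to drop.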

After this, I had to fix the second issue, where not enough reference files were loaded. To fix this I followed the wiki tutorial and it was much easier than the first part. I copied the hyp.trans with a uniq parameter, and then called the hyp.trans.uniq when I ran the SClite script the second time, and it actually worked!
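A miniature version of that second fix, with fabricated hypothesis lines (real hyp.trans files come out of the decode):

```shell
# Hypothesis transcript with a duplicated utterance, as sclite complained about.
cat > hyp.trans <<'EOF'
hello there (sw2259a-ms98-a-0021)
hello there (sw2259a-ms98-a-0021)
um right (sw2295b-ms98-a-0011)
EOF

# uniq collapses adjacent duplicate lines; sclite then sees each hypothesis once.
uniq hyp.trans > hyp.trans.uniq
```

Note that uniq only removes adjacent duplicates; if the duplicates were scattered, the file would need to be sorted first, which would reorder the utterances.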

The last issue I ran into was getting the table into my log / experiment page. I opened the scoring.log file in lynx, and then remembered that when I highlight something and press Ctrl+C, it closes the lynx viewer. I found an option called print and did a print-to-screen, which printed the whole table into my terminal window. Then I highlighted it all and copied it into a text document, where I could remove the extra junk and leave just the parts I wanted (the averages). Then I pasted it into my logs above, and onto the experiment page.

Colby said something about the permissions in a directory changing, so I guess building the language model was temporarily broken for everyone, and not just me. I noticed the permissions must have changed, because performing commands that never used to give me issues now ask for confirmation. For example, trying an rm to remove my decode.log file in my decode directory asked for confirmation. Also, when I tried to recursively remove all the files out of my LM directory to rebuild my language model, I used an rm -r, and it said permission denied, so I had to throw a sudo in front of it and enter the root password. After this I didn't run into any more issues.
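The confirmation prompt can be reproduced in miniature: rm asks before deleting a write-protected file, and -f (or sudo, for files owned by someone else) skips the prompt. The file name here is made up.

```shell
# Simulate a file left read-only, like ones created under another user's run.
touch old_decode.log
chmod 444 old_decode.log

# Without -f, rm would prompt "remove write-protected regular file?";
# -f removes it without asking (we own it, so no sudo is needed here).
rm -f old_decode.log
[ ! -e old_decode.log ] && echo "removed"
```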


 * 3/30/2014: I spoke with Colby about the installation of SciPy on Fedora 19. Initially, I assumed he was asking me whether it could be installed on Fedora 19. I replied that it could be, because I believe Fedora 19 comes with Python version 2, and SciPy is compatible with versions 2.6 and 2.7, as well as versions 3.2 and newer. Once I spoke more with Colby, he told me he was running into an issue in the terminal with mirrors missing for downloading the SciPy package. It sounds like he's connected to the repositories, but they fail when it comes to the download. I saw his logs, and he created a page that shows the error messages he's getting; it's essentially a 404. The mirrors are missing, which is preventing the download of the package. I suggested he might try adding more repositories, and maybe that way he would be successful. I don't know what his progress is on that, but I figure if he doesn't have it working by class on Wednesday, I can show him how to add more repositories in Fedora.

 * Plan:

I plan to review the scored experiment created with Colby Johnson (0237). I also plan to run my own experiment without the assistance of others. After watching Colby run an experiment, train it, create the LM, and then decode and score it, I'm confident in my ability to do it without assistance; if I do run into issues, I will consult the wiki page and then my bootcamp group members (team one).
 * 3/27/2014: Today I want to decode the experiment I ran yesterday. After reading Colby's email, it seems I need to change the command to use the decode2.pl script and add an additional parameter for my senone value, because it wasn't the default.
 * 3/28/2014: I'm going to try one last time to decode and score my experiment. I'm hoping whatever was wrong with the language model yesterday has been fixed today.
 * 3/30/2014: The plan for today is to read logs and update my results section.
 * Concerns:

I'm concerned about the fatal errors I've been getting consistently when trying to decode my experiment. The first was resolved; however, the second time I ran the decode with a different script and different parameters, it ran for about a millisecond longer but still ended up failing with another fatal error. See the logs above for the specifics of the error.
 * 3/28/2014: This is my third day trying to successfully decode my experiment, and I'm really hoping I don't run into any more fatal errors.
 * 3/30/2014: I don't know how the progress with installing SciPy on Fedora 19 is going. Colby wanted the data group to help out with this issue, so I've been looking into possible causes, and my thought is that maybe he just needs to add additional repositories.

Week Ending April 8, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 15, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 22, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 29, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending May 6, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns: