Speech:Spring 2014 Pauline Wilk Log



Week Ending February 4th, 2014

 * Tasks:
 * To ssh into Caesar and explore the file system.
 * To read through everyone's logs.
 * Read through the SpEAK media wiki pages
 * Read through experiments.
 * Figure out how to access SpEAK from Caesar. I know SpEAK is located on ROME which has its own separate disk.
 * See if any fields need to be added.
 * Explore and document how the experiments are saved on disks.
 * Attempt to write, or learn how to write, Perl scripts for creating new experiments.


 * Results:

Thursday January 30, 2014

Worked on some administrative work with Mike Jonas to get everyone's usernames and passwords set up and deleted the users from previous semesters.

Along the way I learned a few simple commands in Unix to get started.

Friday January 31, 2014

Found that the PuTTY equivalent for the Mac is a built-in application called Terminal. I connected to Caesar using my account:

Successfully changed my password using  command. I was prompted to enter my old password and then to enter my new password.

Read through the documentation on media wiki about the SpEAK project. Tomorrow I will try and get the web application running on my computer.

Sunday February 02, 2014

Read through logs and got SpEAK running on my computer. I was able to do this from instructions that my team members logged.

Monday February 03, 2014

Read logs.


 * Plan:

To access and familiarize myself with SpEAK through reading mediawiki. Get the Web application running on my computer.
 * Concerns:

Week Ending February 11, 2014

 * Task:

 Wednesday February 5, 2014 

Met with part of the Modeling group to see what kind of scripts we should be writing. Determined that to gain a better understanding of what needs to be done, we need to first go through the process of running a train ourselves.

Updated the experimental group's log for this week's tasks ("To Do's").

Continuing to effectively communicate with my group members, and other group members.

 Thursday February 6, 2014 

Read logs and other media wiki pages to start running a train. My experiment number is 0149. I got the experiment to run successfully using the 5 hour data set rather than the mini data set, and everything ran smoothly at first. I started the experiment at UNH Manchester on my laptop, but I did not have the charger cable. When the laptop's battery died, I hoped that the experiment would continue to run on the server as it was, but it turns out that when the ssh connection is terminated, the experiment stops. So the experiment did not complete. Getting it to run was still the important part, because I was trying to better understand the process of running a train so that I could see what could be shortened with scripts and how the process could be eased. I learned what I needed to.
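One common way to keep a long job alive after the terminal goes away is to start it in its own session, with its output sent to a log file. The sketch below only illustrates that idea in Python; it is not how the trainer is actually launched on Caesar. The function name `start_detached` and the log path are hypothetical, and a POSIX system is assumed.

```python
import subprocess
import sys


def start_detached(command, log_path):
    """Start `command` in a new session so a terminal hangup should not reach it.

    Hypothetical helper: output goes to `log_path` instead of the terminal.
    """
    log = open(log_path, "ab")
    process = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,   # do not read from the terminal
        stdout=log,                 # keep output in a file
        stderr=subprocess.STDOUT,
        start_new_session=True,     # new session, no controlling terminal
    )
    log.close()  # the child keeps its own copy of the file handle
    return process


if __name__ == "__main__":
    # Stand-in command; a real run would name the training script instead.
    job = start_detached([sys.executable, "-c", "print('done')"], "train.log")
    print("started process", job.pid)
```

Because the output lands in a file, the log can be checked after reconnecting instead of relying on the original terminal.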

 Monday February 10, 2014 

Read Logs. Read up on and practiced Perl scripting.

 Tuesday February 11, 2014 

Read logs. Continued to teach myself Perl scripting.


 * Results:


 * Plan:


 * Concerns:

Week Ending February 18, 2014
02/12/2014- logged in and updated group log

02/17/2014- logged in and did research

02/18/2014- logged in and continued research and read logs. Went through all scripts and determined what each did just from what I could see in the files

02/18/2014- logged in and continued to work on describing files but this time based on what I could find on the media wiki pages from logs.

02/19/2014- Logged in and read group members' logs to get familiarized with what each had done so that I could be ready for the meeting.


 * Task:

This week's main task is to understand and create a guide to the structure of the experiment directory.

Veered off task a bit because I wanted to have one place where people could go to see what scripts are available and what they do, if they still work, who wrote them, if there is an updated version, what the differences are, etc.


 * Results:

02/12/2014 Updated the group log with the new direction that our group is taking and our new tasks. We will be focusing on learning the structure of the experiment directory and then creating a guide of it and adding it to the information page. To see details and our other goals, please view the Experiment Group Log here.

02/17/2014


 * logged in and read logs.
 * Logged in to Caesar and went through the experiment folder to create a mental picture of the directory structure and to be able to view the files.
 * The file path for the experiment folder is /mnt/main/Exp.
 * I used Experiment 127 to learn the structure. This was running a 10hr baseline train. Then, to see whether there were any differences between what's found in a train experiment folder and a decode experiment folder, I also inspected the Exp 128 folder (which is a decode experiment of the train that was run in 127). There was a difference: the folders present depend on the type of experiment being run.


 * The contents of a train experiment folder are:
 * bin
 * bwaccumdir
 * etc
 * feat
 * logdir
 * model_architecture
 * model_parameters
 * python
 * qmanager
 * scripts.pl
 * trees
 * wav
 * the contents of a decode experiment folder are:
 * bin
 * bwaccumdir
 * DECODE
 * etc
 * feat
 * LM
 * logdir
 * model_architecture
 * model_parameters
 * python
 * scripts.pl
 * wav


 * So I went back to the media wiki experiments page to find other types of experiments that I could compare. (SIDE NOTE: experiment 0146 does not have a wiki page.) My conclusion is that a lot of these folders' contents have been altered or deleted, because some experiments, like 145, are empty besides a wav folder. I went through and created a combined list of all possible folders within the experiment directory to look into the purpose of each: why each is created and what it is used for. Not all of these folders are found in each type of experiment.


 * bin:
 * bwaccumdir: contains some kind of counting files
 * DECODE: contains a decode log file and a script for running a decode. The decoder combines output and status/error text into that single decode.log file.
 * etc: contains important files that are used throughout the experiment process. This includes but is not limited to the sphinx configuration file (sphinx_train.cfg), the transcript, the experiment dictionary (.dic) which contains a list of words along with its corresponding pronunciation in Arpabet format, and the experiment file-IDs.
 * feat:Feat is short for Features. Features data is used in training and is derived from the recordings. It is also used in the decode.
 * LM:
 * logdir: contains a bunch of log files for
 * model_architecture:
 * model_parameters:
 * python: in this folder, there is a folder called sphinx/ and it contains a bunch of (.py) python files and a feat/ folder with more python files in it.
 * qmanager: This folder is empty. Does it get used and then emptied or does it never get used?
 * scripts.pl: It looks like this folder contains all of the scripts that were executed throughout the particular experiment process.
 * trees:
 * wav: Contains all of the sphinx audio files?
 * wavTemp:
 * An html file: The trainer will create an HTML logfile in the base experiment directory with the name .html. This document contains everything that the trainer output to the screen. To take a look at it, use the terminal-based web browser lynx.

02/18/2014

Flags for clone_exp.pl (the script itself is described in the list below):
 * -t: Type of clone. A REQUIRED argument. 3 possible values:
 * dict: Copy over an experiment dictionary, phone list, and filler dictionary from an experiment and adapt it for a new experiment.
 * trans: Copy over an experiment transcript and fileID, and the contents of the experiment's 'wav' directory.
 * all: Both of the above.
 * -e: Experiment Number. This is a REQUIRED argument. States which experiment's config file the actions will be performed on. Not supplying this argument, or supplying an invalid experiment number, will result in the script erroring out.
 * -c: Clone. Also REQUIRED. The experiment number to clone from.
 * -h: Help, This text. When this flag is given, the script will not perform any other actions.

 * Logged in and read logs.
 * Joshua Anderson's log for this week was very helpful in completely capturing and summarizing, in a very understandable manner, the processes of each type of experiment and how they relate to each other.
 * Updated my timeline in the proposal for the Experiment Group.
 * Continued on yesterday's work of capturing information about each of the folders in the experiment directory.
 * Read the forward of an email between Joshua Anderson and the Modeling Group. This was unbelievably helpful.
 * Add run_decode2.pl to the list of scripts to describe for next week.
 * It might be useful to create an entire wiki page on the different scripts, what they do, where they can be used, and any further details like parameters, etc. Another question I have that I feel silly about is, were all these scripts created by students or did some come along with the program Sphinx? Should we try and describe ALL the scripts or would that be a waste of time? It would be useful to archive any script we no longer use too if there are any. The list of scripts and their locations that I will be working on describing are:
 * clone_exp.pl - mnt/main/scripts/user/clone_exp.pl
 * Author: Eric Beikman
 * This script is designed to clone one experiment based on the contents of another.
 * Allows for complete cloning of transcripts, dictionaries, phone lists, fileid lists, and wave files.
 * This script is designed to clone an existing experiment. It will either clone the dictionaries and phone list; the transcripts, file list, and wav files; or it will do both. It will not touch the sphinx_train.cfg file or create feats from the copied wav files; use train_02.pl and make_feats.pl respectively to do those tasks.
 * Flags:
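The flag rules recorded for clone_exp.pl (a required -t with three allowed values, a required -e, a required -c, and -h for help) can be restated as a small argument parser. This is a Python sketch for illustration only; the real script is Perl, and the names `build_parser`, `clone_type`, `experiment` and `clone_from` are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented clone_exp.pl flags."""
    parser = argparse.ArgumentParser(
        prog="clone_exp_sketch",
        description="Illustration of the clone_exp.pl flag rules.",
    )
    # -t is required and limited to the three documented clone types.
    parser.add_argument("-t", dest="clone_type", required=True,
                        choices=["dict", "trans", "all"],
                        help="type of clone")
    # -e is the experiment being set up; -c is the experiment to clone from.
    parser.add_argument("-e", dest="experiment", required=True,
                        help="experiment number to act on")
    parser.add_argument("-c", dest="clone_from", required=True,
                        help="experiment number to clone from")
    # argparse supplies -h itself and exits without doing anything else.
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-t", "all", "-e", "0149", "-c", "0127"])
    print(args.clone_type, args.experiment, args.clone_from)
```

Leaving out any of the three required flags makes the parser stop with a usage message, which matches the "erroring out" behavior described for a missing -e.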
 * convert.pl - mnt/main/scripts/user/convert.pl
 * Author: unknown
 * This script will make symbolic links to all the required sph files that are noted in a transcript file located in a particular corpus directory.
 * It first sets the corpus directory, then appends the path to the trans file based on the corpus directory provided, and then opens the transcript file for processing.
 * copySph.pl - mnt/main/scripts/user/copySph.pl
 * Author: unknown
 * This script will make symbolic links to all the required sph files that are noted in a transcript file located in a particular corpus directory.
 * Sets the corpus directory, appends the path to the trans file based on the corpus directory provided, and creates an array that will contain the list of sph files noted in the transcript. It then opens the transcript file for processing and reads it line by line: it copies each line to a new variable and removes all characters after the speaker and utterance ID, which pulls out the utterance ID. It then removes duplicate sph file names to create a unique list, copies each sph file in the list to the wav dir in the corpus dir, and creates a new symbolic link that points to the original file and not to the link to the original file.
 * This script is very similar to the convert.pl script. The difference that I noticed was that this script (copySph.pl) takes out all duplicate sph file names so that it will contain a unique list and copies each sph file in the list to the wav directory in the corpus directory.
 * createTranscript.pl - mnt/main/scripts/user/createTranscript.pl
 * Author: unknown
 * This script will create a smaller transcript that is of a length of time specified by the user. The length_of_time is in seconds.
 * This script will create a transcript where the spoken dialog lasts for the amount of time specified by length_of_time
 * Start time indicates how far into the transcript the script should go before it starts to copy dialog to the new transcript. Time is also in seconds.
 * createdict.pl - mnt/main/scripts/user/createDict.pl and mnt/main/scripts/train/scripts_pl/createdict.pl
 * Author: unknown
 * Compares the list of words in a file to the words in a dictionary and outputs the words available with pronunciations.
 * Usage: perl GenerateDictionary WordFile DictionaryFile OutputFile
 * dictionary.pl - mnt/main/scripts/user/dictionary.pl
 * Author: unknown
 * Contents are identical to the createdict.pl script. The only difference is the spacing between the lines of code.
 * dictionary2.pl - mnt/main/scripts/user/dictionary2.pl
 * Author: unknown
 * Compares the list of words in a file to the words in a dictionary and outputs the words available with pronunciations.
 * perl GenerateDictionary WordFile DictionaryFile OutputFile
 * This file is an upgraded version of createdict.pl. It ends the foreach loop once the word is found because there is no reason to keep going.
 * Creates an add.txt file for the missing words that it cannot find pronunciations for.
 * This script is called in pruneDictionary.pl.
 * dictionary3.pl - mnt/main/scripts/user/dictionary3.pl
 * Author: Eric Beikman
 * Takes a list of words, looks up a pronunciation in a given dictionary, then outputs it to a specified dictionary.
 * Words which don't have a pronunciation are put into an addX.txt file, where 'X' is either null or a number starting from 1 assigned to make the file unique within a directory.
 * Compares the list of words in a file to the words in a dictionary and outputs the words available with pronunciations.
 * This script is intended to be used in conjunction with pruneDictionary2.pl.
 * This script also has a counter that keeps track of how deep in the dictionary you are. It starts at zero so that, assuming the word list and dict are sorted, you don't need to start at the beginning of the dict to find the next word in the word list.
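The counter idea described for dictionary3.pl can be sketched as a single forward pass over two sorted lists. This is a minimal Python illustration, not the Perl code itself; the function name `lookup_sorted` is hypothetical, and it assumes both lists are sorted with the same ordering and that each dictionary entry is a (word, pronunciation) pair.

```python
def lookup_sorted(words, dict_entries):
    """Look up sorted `words` in sorted `dict_entries` without rescanning.

    Returns (found, missing): found is a list of (word, pronunciation)
    pairs, missing is the list of words with no dictionary entry.
    """
    found, missing = [], []
    position = 0  # how deep into the dictionary we already are
    for word in words:
        # Skip dictionary entries that sort before the current word.
        while position < len(dict_entries) and dict_entries[position][0] < word:
            position += 1
        if position < len(dict_entries) and dict_entries[position][0] == word:
            found.append(dict_entries[position])
        else:
            missing.append(word)  # would go to the add file
    return found, missing


if __name__ == "__main__":
    entries = [("ant", "AE N T"), ("cat", "K AE T"), ("dog", "D AO G")]
    print(lookup_sorted(["cat", "dog", "emu"], entries))
```

Because `position` never moves backwards, every dictionary entry is looked at no more than once or twice regardless of how long the word list is, which is the saving the counter is meant to provide.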
 * find.pl - mnt/main/scripts/user/find.pl
 * Author: unknown
 * Looks like this just searches for a term in the cmudict.0.6d dictionary specifically.
 * NOTE: Should we write one of these that searches the combined dictionary that Colby Johnson made? If I'm even correct. Though, it's only a few lines of code. It might not even be worth having a separate script for.
 * gen_errors.pl - mnt/main/scripts/user/gen_errors.pl
 * Author: unknown
 * This script is used when training the acoustic model for an experiment.
 * It gives the output of how many errors were encountered and fills the html file with how many errors were found in each step.
 * genFileIDs.csh - mnt/main/scripts/user/genFileIDs.csh
 * Author: unknown
 * This csh script uses the sed command, which parses and transforms text; here it is used as a substitution command.
 * Reading the code, it is a foreach loop over every .sph file in ./wav that strips the leading ./wav/ and the .sph ending from each name, leaving one file ID per line.
 * The script is very short so here it is:

 #!/bin/csh
 foreach i (./wav/*.sph)
   echo $i | sed "s|^./wav/||" | sed "s/.sph//"
 end
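For comparison, the same idea (list the .sph files in a wav folder and print each name without the folder prefix or the extension) can be written in Python. This is a sketch of the intent rather than a line-for-line port; `gen_file_ids` is a hypothetical name.

```python
from pathlib import Path


def gen_file_ids(wav_dir):
    """Return the .sph file names in `wav_dir` with the extension removed."""
    return sorted(path.stem for path in Path(wav_dir).glob("*.sph"))


if __name__ == "__main__":
    # Prints one file ID per line for the wav folder under the current directory.
    for file_id in gen_file_ids("./wav"):
        print(file_id)
```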

 * genPhones.csh - mnt/main/scripts/user/genPhones.csh
 * Author: unknown
 * This script generates the phone list.
 * Phones are the smallest components of a phonetic transcription code (such as Arpabet); they represent how each part of a word sounds.
 * Usage: copy /mnt/main/scripts/user/genPhones.csh to your etc folder with cp -i, then execute it from there with ./genPhones.csh, passing the name of the experiment's .dic file without the extension (the script reads it as $1).
 * The script is very short so here it is:

 #! /bin/csh
 set train = $1
 echo $train
 awk '{for(i=2;i<=NF;i++){print $i}}' $train.dic | sort | uniq > $train.phone

 * Then you need to insert a new phone into the .phone list created in the last step: use your favorite text editor to edit .phone and insert SIL in the appropriate alphabetically ordered spot. Not doing this will cause the trainer to error out.
 * NOTE: We could probably create a script that does this piece automatically.

Two notes that apply to scripts described further down:
 * genTrans6.pl is run as: /mnt/main/scripts/user/genTrans6.pl 
 * rm: cannot remove `../etc/hyp.trans': No such file or directory. This is normal and is expected. The script should still have run successfully. The script is trying to remove an existing hypothesis file before making a new one. But of course, since we never ran the script before, the hypothesis file doesn't exist, so it returns an error. This is a bug that probably should be fixed.
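The genPhones.csh pipeline and the manual SIL step could be combined along these lines. This is a Python sketch, not an existing script; the name `gen_phones` is hypothetical. It assumes each dictionary line is a word followed by whitespace-separated phones, and that plain character ordering is the ordering the trainer expects.

```python
def gen_phones(dic_lines, extra_phones=("SIL",)):
    """Collect the unique phones from dictionary lines, plus extras like SIL.

    Field 1 of each line is the word; every later field is a phone,
    which mirrors the awk loop that starts at field 2.
    """
    phones = set(extra_phones)
    for line in dic_lines:
        fields = line.split()
        phones.update(fields[1:])
    return sorted(phones)


if __name__ == "__main__":
    sample = ["CAT K AE T", "AT AE T"]
    print("\n".join(gen_phones(sample)))
```

Adding SIL to the set before sorting puts it in its alphabetical spot automatically, so the hand edit of the .phone file would no longer be needed.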
 * genTrans.pl - mnt/main/scripts/user/genTrans.pl also in mnt/main/corpus/scripts/genTRans.pl
 * Author: unknown
 * Should be executed from the top level experiment directory, ex: /mnt/main/Exp/0011
 * Sets corpus directory based on argument input and sets prefix to the exp_id argument that was input.
 * Appends the path to the trans file based on the corpus directory provided.
 * Sets the output file names
 * Processes and produces sph files.
 * Sends the transcript to a new file.
 * genTrans2.pl - mnt/main/scripts/user/genTrans2.pl
 * genTrans3.pl - mnt/main/scripts/user/genTrans3.pl
 * genTrans4.pl - mnt/main/scripts/user/genTrans4.pl
 * genTrans5.pl - mnt/main/scripts/user/genTrans5.pl
 * genTrans6.pl - mnt/main/scripts/user/genTrans6.pl
 * Author: David Meehan
 * Looks like this is the genTrans script that we are currently using.
 * It is executed when generating the transcript, once you pick the corpus subset to use (these are found in /mnt/main/corpus/switchboard/).
 * usage:
 * lm_create.pl - mnt/main/scripts/user/lm_create.pl
 * Author: unknown
 * Creates a Sphinx Language Model from a text file.
 * Usage: lm_create.pl inFile
 * make_feats.pl - mnt/main/scripts/train/scripts_pl/make_feats.pl
 * Author: Copyright (c) 2000 Carnegie Mellon University.
 * Creates feature files (cepstra) from wave files
 * make_phoneset.pl - mnt/main/scripts/train/scripts_pl/make_phoneset.pl
 * Author: unknown
 * usage: make_phoneset.pl etc/EXPT_NAME.dic etc/EXPT_NAME.filler > etc/EXPT_NAME.phone
 * parseDecode.pl - mnt/main/scripts/user/parseDecode.pl
 * Author: unknown
 * usage: parseDecode.pl
 * Takes all the lines that begin with FWDVIT and dumps them into a temp file. These are the predicted lines that we need to look at to compare to the original statement.
 * Removes the parsed file if it exists.
 * Opens the temp file.
 * Reads and manipulates each line in the temp file so that it matches the format that the transaction log is in and outputs the reformatted string.
 * PLEASE NOTE: When running parseDecode.pl, it isn't uncommon for it to return: rm: cannot remove `../etc/hyp.trans': No such file or directory
 * parseTranscript2.pl - mnt/main/scripts/user/ParseTranscript2.pl
 * Author: unknown
 * Parses non-word characters from a transcript file.
 * This is cool. I would imagine a student didn't write this but who knows.
 * process_missing_words.pl - mnt/main/scripts/train/scripts_pl/process_missing_words.pl
 * Author: unknown
 * The code makes me think that this script searches an HTML file for certain words but the title claims that it processes missing words. Not sure on this one yet.
 * pruneDictionary2.pl - mnt/main/scripts/user/PruneDictionary2.pl
 * Author: unknown
 * usage: pruneDictionary2.pl 
 * This script prunes the master dictionary, creating a new dictionary with only the words we are interested in.
 * Sets variables from command-line arguments.
 * Runs text2wfreq, which gives a unique list of all the words that appear in the transcript, including how many times each word appears. Unfortunately that includes the (swxxx) statements.
 * Those results are sorted and fed to grep, which yanks out the sw statement lines and outputs the results to a temp file.
 * The temp file is then opened for processing.
 * For each word in the temp word list, this loop strips any numbers after a whitespace (meaning that a word consisting of a numeric character will be allowed). It also strips out any words which begin with a '<'; such a character always precedes a non-word attribute which is not defined in the dictionary.
 * It then saves the line in a temporary pruned file.
 * Then it calls the dictionary2 script, which creates a new dictionary that only contains the words in the pruned list.
 * Lastly it removes temporary files.
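The word-list part of the pruning steps above can be sketched in Python as follows. This only illustrates the filtering logic and does not call text2wfreq or the dictionary scripts. The function names are hypothetical, and the assumed transcript format (words followed by an utterance ID in parentheses starting with "sw") is a guess based on the description.

```python
from collections import Counter


def transcript_word_counts(lines):
    """Count the words in transcript lines, skipping non-word tokens.

    Tokens starting with '(sw' are treated as utterance IDs and tokens
    starting with '<' as non-word attributes; both are ignored.
    """
    counts = Counter()
    for line in lines:
        for token in line.split():
            if token.startswith("(sw") or token.startswith("<"):
                continue
            counts[token] += 1
    return counts


def pruned_word_list(lines):
    """Return the sorted unique words, ready for a dictionary lookup."""
    return sorted(transcript_word_counts(lines))


if __name__ == "__main__":
    sample = [
        "<s> HELLO WORLD </s> (sw02001-A_0001)",
        "<s> HELLO 2 </s> (sw02001-A_0002)",
    ]
    print(pruned_word_list(sample))
```

A purely numeric token such as "2" stays in the list, matching the note that a word consisting of a numeric character is allowed.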
 * pruneDictionary3.pl - mnt/main/scripts/train/scripts_pl/pruneDictionary3.pl
 * Author: David Meehan
 * Not sure what the difference between this script and pruneDictionary2.pl is yet.
 * Found out on 02/19/2014 from David Meehan that there currently is no difference but that he will be working on making changes to this script to improve it.
 * RunAll.pl - mnt/main/scripts/train/scripts_pl/RunAll.pl
 * Author: Copyright (c) 1996-2000 Carnegie Mellon University. (Ricky Houghton)
 * Runs a series of scripts to make a model:

 "$ST::CFG_SCRIPT_DIR/00.verify/verify_all.pl",
 "$ST::CFG_SCRIPT_DIR/01.vector_quantize/slave.VQ.pl",
 "$ST::CFG_SCRIPT_DIR/02.falign_ci_hmm/slave_convg.pl",
 "$ST::CFG_SCRIPT_DIR/03.force_align/slave_align.pl",
 "$ST::CFG_SCRIPT_DIR/04.vtln_align/slave_align.pl",
 "$ST::CFG_SCRIPT_DIR/05.lda_train/slave_lda.pl",
 "$ST::CFG_SCRIPT_DIR/06.mllt_train/slave_mllt.pl",
 "$ST::CFG_SCRIPT_DIR/20.ci_hmm/slave_convg.pl",
 "$ST::CFG_SCRIPT_DIR/30.cd_hmm_untied/slave_convg.pl",
 "$ST::CFG_SCRIPT_DIR/40.buildtrees/slave.treebuilder.pl",
 "$ST::CFG_SCRIPT_DIR/45.prunetree/slave.state-tying.pl",
 "$ST::CFG_SCRIPT_DIR/50.cd_hmm_tied/slave_convg.pl",
 "$ST::CFG_SCRIPT_DIR/90.deleted_interpolation/deleted_interpolation.pl",
 "$ST::CFG_SCRIPT_DIR/99.make_s2_models/make_s2_models.pl",

Verifies which directory from a list has the most recent file, and assumes this is the last time the user compiled something. Therefore, this must be the directory the user cares about. Adds bin/Release and bin/Debug to the list (that's where MS Visual C compiles files to), as well as any existing bin.platform.

Tasks accomplished by train_01.pl (see its entry below):
1) Creates a new Experiment directory in the root experiment directory.
2) Runs the setup_SphinxTrain.pl script with the appropriate arguments.
3) Gives that base experiment directory group read/write permissions.

Tasks accomplished by train_02.pl (see its entry below):
1) Edit the configs on lines 6-8 & lines 79-80.
2) (Optionally with -r) Replace a config file with a supplied one.
3) (Optionally with the -b flag, implied with -r) Back up the existing config file by appending a ".old" to the end of the filename. If a prospective backup filename is already used, the script will append successively higher numbers until it finds a unique one.
4) (Optionally with -c) Copy over a pre-existing config file at the specified location OR experiment and adapt it for use for the given experiment.

Flags for train_02.pl:
 * -e: Experiment Number. This is a REQUIRED argument. States which experiment's config file the actions will be performed on. Not supplying this argument, or supplying an invalid experiment number, will result in the script erroring out.
 * -h: Help, This text. When this flag is given, the script will not perform any other actions.
 * -b: Backup. When used, the script will make a backup of the sphinx_train.cfg file within the experiment's 'etc' directory before performing any operations on it. The filename of this backup will end in '.old'; if a file exists using this name, the script will append the lowest possible number at the end of '.old' to create a unique filename. This flag is implied when using the -r or -c flags.
 * -r: Replace. Instead of editing an existing config file, the script will replace it with the one specified. Give an ABSOLUTE filepath with this argument or the script probably will error out. Using this flag implies -b too; in other words, it will make a backup of the old config file before replacing it. This is to ensure that you have something to go back to in the event something goes wrong.
 * -c: Copy. Similar to -r, except that a config file is adapted for use for this experiment. The script will only edit the lines which deal with the experiment directories. You can either put in an absolute filepath to a config file OR you can put in the number of the experiment you wish to copy and adapt the config from.

Flags for updateDict.pl (see its entry below):
 * -m: Merge. This parameter is REQUIRED, meaning that trying to use this tool without it will bring you to the help screen. This is intentional: the order in which you pass the filenames is important!
 * -v: Verbose. Print out everything that is being done. Useful for debugging and to generally see what the script is thinking.
 * -f: Force merge. This option specifies that all pronunciations in the additions file take precedence over the dictionary's pronunciation, and thus the script won't ask you for confirmation to do so. CAUTION: This option may have unintended side effects if you aren't careful!
 * -h: Help. Print the help page. Will ignore any other options.
 * run_decode.pl - mnt/main/scripts/user/run_decode.pl
 * Author: unknown
 * Runs Sphinx 3 decoding job.
 * usage: run_decode.pl  
 * Takes 2 parameters (the experiment number of the train to be decoded and the experiment number of the Acoustic Model to be used)
 * run_decode2.pl - mnt/main/scripts/user/run_decode2.pl
 * Author: unknown
 * Runs Sphinx 3 decoding job.
 * usage: run_decode.pl   
 * Takes 3 parameters (the experiment number of the train to be decoded and the experiment number of the Acoustic Model to be used, AND the senone value used to run your train).
 * run_decode3.pl - mnt/main/scripts/user/run_decode3.pl
 * Author: unknown
 * Runs Sphinx 3 decoding job.
 * usage: run_decode.pl   
 * Not sure what the difference between this script and run_decode2.pl is.
 * setup_SphinxTrain.pl - mnt/main/scripts/user/setup_SphinxTrain.pl
 * Author: Copyright (c) 2000 Carnegie Mellon University.
 * This script is for setting up the SphinxTrain environment for a new task.
 * Checks to see if the current directory ( where the setup will be installed ) is empty.
 * Copy in a template if it exists.
 * Starts building the directory structure.
 * Figures out the platform string definition.
 * Copies all executables to the local bin directory.
 * Copies the scripts from the scripts_pl directory, additional files, and python modules.
 * Sets the permissions to executable.
 * Finally, generates the config file for this specific task.
 * Look for a config template in the target directory.
 * setup_tutorial.pl - mnt/main/scripts/train/scripts_pl/setup_tutorial.pl
 * Author: Copyright (c) 2000 Carnegie Mellon University.
 * Tutorial for setting up Sphinx I believe.
 * train_01.pl - mnt/main/scripts/user/train_01.pl
 * Author: Eric Beikman
 * This script sets up the experiment directory.
 * It accomplishes the following tasks:
 * usage: $scriptName [-hw] [-n ] [-s ]
 * Normal operation is indicated with no given arguments: the script will automatically determine the lowest available experiment number to use based on the previous experiments within the /mnt/main/Exp directory. Normally, it will choose the next number in the experiment number sequence (e.g. it will choose experiment #0071 if the last experiment number used was #0070); however, in the event there is a gap between two experiment numbers, it will choose the lowest available experiment number within that gap. For example, if there is an experiment #0095 and an experiment #0098 with no experiments 0096-0097, the script will choose experiment #0096.
 * Creates a logfile of what happened.
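The numbering rule described for train_01.pl (take the lowest unused experiment number, filling gaps first) can be sketched like this. It is a Python illustration, not the Perl code; `next_experiment_number` is a hypothetical name, and it assumes numbering starts at 0001 and that experiment folders are named with four zero-padded digits.

```python
def next_experiment_number(existing_names):
    """Return the lowest unused experiment number as a 4-digit string.

    `existing_names` is any iterable of folder names; names that are
    not purely digits are ignored.
    """
    used = {int(name) for name in existing_names if name.isdigit()}
    candidate = 1  # assumption: experiment numbering starts at 0001
    while candidate in used:
        candidate += 1
    return f"{candidate:04d}"


if __name__ == "__main__":
    # Experiments 0001-0095 and 0098 exist, so the first gap is 0096.
    names = [f"{n:04d}" for n in range(1, 96)] + ["0098"]
    print(next_experiment_number(names))
```

In practice the list of names would come from a directory listing of /mnt/main/Exp; it is passed in here so the rule can be tried without touching the server.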
 * train_02.pl - mnt/main/scripts/user/train_02.pl
 * Author: Eric Beikman
 * This script sets up the experiment configuration file.
 * Prepares a given Experiment config file (sphinx_train.cfg) for an experiment.
 * It accomplishes the following tasks:
 * Usage:
 * Creates a log file.
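The backup naming rule described for the -b flag (a '.old' suffix, then the lowest number that makes the name unique) could look like the sketch below. This is Python for illustration only; `backup_name` is a hypothetical name, and starting the numbering at 1 is an assumption, since the log does not say which number comes first.

```python
import os


def backup_name(path, exists=os.path.exists):
    """Return a backup filename for `path` that is not already in use.

    Tries 'path.old' first, then 'path.old1', 'path.old2', and so on.
    `exists` can be swapped out to try the rule without real files.
    """
    candidate = path + ".old"
    number = 1  # assumption: numbering starts at 1
    while exists(candidate):
        candidate = f"{path}.old{number}"
        number += 1
    return candidate


if __name__ == "__main__":
    taken = {"sphinx_train.cfg.old", "sphinx_train.cfg.old1"}
    print(backup_name("sphinx_train.cfg", exists=taken.__contains__))
```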
 * tune_senones.pl - mnt/main/scripts/train/scripts_pl/tune_senones.pl
 * Author: David Huggins-Daines <dhuggins@cs.cmu.edu>
 * This script generates acoustic models with different numbers of tied states, stepping through a given range in fixed increments, and runs decoding tests on each of them.
 * updateDict.pl - mnt/main/scripts/user/updateDict.pl
 * Author: Eric Beikman
 * usage: updateDict.pl -m -
 * This utility will take an experiment dictionary file and a text file containing a list of additions, and automatically merge the two by inserting the entries from the additions file into the dictionary at the proper alphabetic location. The additions text file must be written in the same <word-pronunciation> format as the dictionary, with a single word on each line. Alphabetical order does NOT matter. If there is a word with the same spelling but different pronunciations in both the additions file and the dictionary, the application will ask whether you wish to overwrite the experiment dictionary entry; duplicate words with the same pronunciation will be ignored.
 * NOTE: I noticed that Eric's other scripts create log files. This one does not. Should it be modified to generate a log file?
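The merge rules described for updateDict.pl can be summarized in a short sketch. This is a Python illustration, not the Perl script: `merge_additions` is a hypothetical name, the dictionaries are plain word-to-pronunciation mappings, and instead of prompting the user it returns the conflicting words so the caller can decide.

```python
def merge_additions(dictionary, additions, force=False):
    """Merge `additions` into `dictionary` and keep the result sorted.

    New words are added. Words with an identical pronunciation are
    ignored. Words with a different pronunciation are overwritten only
    when `force` is True; otherwise they are reported as conflicts.
    """
    merged = dict(dictionary)
    conflicts = []
    for word, pronunciation in additions.items():
        if word not in merged:
            merged[word] = pronunciation
        elif merged[word] != pronunciation:
            if force:
                merged[word] = pronunciation
            else:
                conflicts.append(word)
    return dict(sorted(merged.items())), conflicts


if __name__ == "__main__":
    current = {"CAT": "K AE T", "DOG": "D AO G"}
    extra = {"ANT": "AE N T", "DOG": "D AA G", "CAT": "K AE T"}
    print(merge_additions(current, extra))
```

Sorting after the merge is what lets the additions file be in any order, as the description notes.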

NOTE: Many of these scripts are located in a couple different places. There is a folder called old under mnt/main/old/scripts where a few scripts are archived. Though they are in the old folder, they still exist in other places too.


 * Plan:

My plan is to read through any wiki pages where I might find information about the experiment file directory and do some outside research to try and understand exactly what each file's purpose is. I want to find out in more detail, what each and every file inside each folder is for if I can.

Next week I would like to briefly summarize what a senone value is. I wish the media wiki page had a search function; it would make finding information so much easier!


 * Concerns:

Not so much a concern but a note that the Experiment Setup Page under the information section of the Speech wiki is outdated and needs to be updated.
 * The files to find within an experiment folder are not all listed.
 * The file path of the create_expdir.pl script is no longer as currently stated. I found it in mnt/main/old/scripts/expdir_scripts/create_expdir.pl.
 * It also appears that we no longer use the create_expdir.pl script. We use setup_SphinxTrain.pl. This preps the experiment directory by creating all sub-folders, copies over some essential scripts (though not all), and imports a generic train configuration file (sphinx_train.cfg). It is located at this file path /mnt/main/root/tools/SphinxTrain-1.0/scripts_pl/setup_SphinxTrain.pl
 * This page should contain information about what each folder in the directory is for.

Week Ending February 25, 2014
02/22/2014 - logged in, read logs, and planned

02/23/2014 - logged in and read logs

02/24/2014 - logged in

02/25/2014 - logged in and worked scripts page


 * Task:


 * Keep going with researching through media wiki for more information on all the scripts.
 * Create a wiki page where all existing scripts are documented. This page will be a great reference. I want it to look a lot like the experiment wiki page, where it lists all scripts in alphabetical order as links that lead to a whole separate page about the script. Each time a script is created the author should create a page for it. No one can describe a script better than the person who wrote it.


 * Results:

02/22/2014
 * Just logged in to catch up and read group members' logs. I was out really sick with the flu for a couple days. Updated the authors of some of the scripts. I found out that David Meehan wrote some of the latest ones and is currently working on some of them.
 * Posted my tasks and plan for the week.

02/25/2014
 * The Scripts Page has been generated. You can click on the link to the left or you can get there by going to the information page. There is now a link there that is called Scripts Page.
 * This is the template that should be used when creating a page for a script:

==Summary==
Title:
Author:
Date:
Location: file path on Caesar
Usage:
==Description==
==Code==


 * The way it will work is that rather than creating a page for each script, if there is a script that is a newer version of a script that already exists, then the author can create a page for that script under the first version. For an example of what I mean, see GenTrans script. You'll see that it has a general description of what the script does and then links to several versions of the script.
 * While exploring for more information about scripts, I found an interesting page about something called Git. There is a guide here on Using Git. This page was created by the students that were doing research in the summer of 2013, so my guess is that Eric Beikman wrote this. He wrote it to head off problems caused by the increasing number and complexity of scripts developed by us for use in Sphinx. He found that Git provides a safe way to manage versions, facilitates development by multiple individuals during a semester, and eases testing of new and improved scripts, while ensuring that there will always be a "stable" source of scripts for use during experiments. This could be something of value to us because I know a few of us are already writing scripts.
 * I mentioned this to Joshua Anderson and he has already been using it for other classes but the only problem with it was that it charges for private accounts. Anything we put up there would have to be public. He did however find a website called BitBucket and they DO offer free private accounts. He created an account for us. I imagine he will log the credentials. If not I will do so.
 * Working on filling the Scripts Page with the information I have in my logs and then going deeper and seeing if I can find any more information on the wiki about them and possibly find their creators.
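The script page template above can also be filled in programmatically. Below is a minimal sketch in Python, assuming the field order shown in the template; the function name `render_script_page` and the example values (including the usage string) are hypothetical and only for illustration.

```python
def render_script_page(title, author, date, location, usage,
                       description="", code=""):
    """Return wiki text for one script page, following the template's field order."""
    lines = [
        "==Summary==",
        f"Title: {title}",
        f"Author: {author}",
        f"Date: {date}",
        f"Location: {location}",
        f"Usage: {usage}",
        "==Description==",
        description,
        "==Code==",
        code,
    ]
    return "\n".join(lines)


# Example values are placeholders, not confirmed metadata for any real script.
page = render_script_page(
    title="exampleScript.pl",
    author="(author name)",
    date="02/25/2014",
    location="/mnt/main/scripts/user/exampleScript.pl",
    usage="exampleScript.pl (arguments)",
    description="One-paragraph summary of what the script does.",
)
print(page)
```

Generating the text this way keeps every script page in the same shape, which makes the alphabetical index easier to scan.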


 * Plan:

My ultimate plan is to create a wiki page for scripts and transfer all the information I've gathered about scripts to the page. Work on updating the information from there.


 * Concerns:

02/22/2014- no concerns as of now.

Week Ending March 4, 2014
02/26/2014- logged in and added a couple tasks that we were assigned during the Capstone meeting today. Went off task a little bit with work but it's okay it's only day one of this week.

03/02/2014- logged in and tried to log on to Caesar, because my work this week is mostly to be done on Caesar, but could not.

03/03/2014-logged in and started running an experiment. I lost a whole day of work on Sunday and I figured going through running an experiment would be more important than SpEAK right now and I know I can accomplish that where I am not so sure with SpEAK!

03/04/2014-logged in and read others' logs to prepare for tomorrow's meeting. Abandoned experiment 0202 and started a new one 0203.


 * Task:


 * Spend half of my time (10 hours) on SpEAK this week. At minimum, get it up and running on Rome.
 * Continue to work on filling in the scripts page with information on the scripts.
 * The rest of my time will probably be taken up by achieving a successful run of an experiment from start to finish!


 * Results:

02/26/2014
 * NOTE: I complained to Brian Gailis that the search function on the media wiki was missing or not working. He told me that it does work: once you search for something and it comes up with no results, you have to click the button below that says to include "everything". Then the search does work! Just in case anyone else didn't figure that out on their own. This has been very helpful in the past week!
 * Filled in the links for each version of GenTrans perl script with information about the differences and improvements between the different versions. I found this information in the Spring 2013 Report : http://foss.unh.edu/projects/index.php/Speech:Spring_2013_Report

03/02/2014
 * Tried logging in to Caesar at 9:00AM, 11:00AM, and 1:00PM. No luck yet. Will keep trying throughout the day because most of my tasks for this week depend on getting in. Though I did spend a lot of time searching through the media wiki, trying to catch up on exactly what state SpEAK was left in last semester, to help me figure out how to get it up and running.
 * I found that the last work done on SpEAK appears to be by Eric Beikman in the summer of 2013. He got SpEAK running off of XAMPP, like all members of the experiment group did at the beginning of the semester, and wanted to get it installed on Caesar. He left these suggestions for improvement for future classes (us):

Recommendations for future classes:
 * Clean up the Trunk branch
 * There are quite a few scripts, documents, and other files which are not applicable to the current code base and should be removed.
 * For example, there is a PDF in the SQL directory containing diagrams for the database, however, it's out of date!
 * There are a few scripts to populate the database with sample data which don't work! Presumably due to changes in the database schema.
 * There are little to no instructions on how to set up the system!
 * There is little to no documentation in general!
 * That being said, the code itself is pretty good in regards to commenting.
 * add.php has a note which states that two people think they are making an experiment with a given experiment number, but only one will take.
 * This really isn't acceptable, think of how frustrating it would be if you create and type up an experiment, only to have it not save because somebody took your experiment number while you were typing.
 * PHP is whining that date has not been assigned a time zone.
 * The User edit screen needs to have a second 'confirmation' password field.
 * Getting locked out of your account because of a typo when you tried to change your password is not cool.
 * Include a radio checkbox determining the type of experiment.
 * There are a few different types of experiments due to how the caesar is set up:
 * Training experiments, where acoustic models are generated but not tested.
 * Decode experiments, where Language models are generated and an acoustic model is tested.
 * The reason to keep these separate is that normally you wish to test models using a corpus that is smaller than the corpus that was used to make the models. This is mainly for time reasons, there isn't a good reason to run a 5 hour decode on a 5 hour train! A 30 minute decode to test models using 5 hours worth of audio is more efficient in terms of time!
 * Train and Decode experiments, where models are created, then a decode is run using the same audio data used to create the models.
 * Other: A catch all for all experiments which don't meet the above types.
 * A database table for corpuses.
 * Depending on the type of experiment performed, each experiment will have an association with the corpus used to create the models and/or decode.
 * Info about the corpus could include the filename/path, length, source (I.E. Switchboard), and offset if the corpus is a subset of a larger corpus (for example, the last_5hr corpus has an offset of 303 hours as it starts 303 hours from the beginning of the 308-hour long Switchboard corpus).
 * The SpEAK google site page needs more detail as to what exactly the code represents, how it works, and licensing. If this is open software, then we need to detail this knowledge to distribute it to those who might find it useful.
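Two of the recommendations above (a corpus table, and not letting two people claim the same experiment number) can be sketched with a small schema. This is only an illustration using Python's built-in `sqlite3` module, not SpEAK's actual PHP code or database; all table and column names are hypothetical, and the `last_5hr` path is assumed by analogy with other corpus paths in this log.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not SpEAK's real tables.
SCHEMA = """
CREATE TABLE corpus (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    path         TEXT NOT NULL,
    length_hours REAL NOT NULL,
    source       TEXT NOT NULL,
    offset_hours REAL NOT NULL DEFAULT 0
);
CREATE TABLE experiment (
    number           TEXT PRIMARY KEY,  -- a duplicate number is rejected
    type             TEXT NOT NULL
        CHECK (type IN ('train', 'decode', 'train_decode', 'other')),
    train_corpus_id  INTEGER REFERENCES corpus(id),
    decode_corpus_id INTEGER REFERENCES corpus(id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# The last_5hr subset starts 303 hours into the Switchboard corpus.
conn.execute(
    "INSERT INTO corpus (name, path, length_hours, source, offset_hours) "
    "VALUES (?, ?, ?, ?, ?)",
    ("last_5hr", "/mnt/main/corpus/switchboard/last_5hr", 5, "Switchboard", 303),
)
conn.execute(
    "INSERT INTO experiment (number, type, train_corpus_id) VALUES (?, ?, ?)",
    ("0202", "train", 1),
)


def reserve_experiment(connection, number, exp_type):
    """Try to claim an experiment number; return False if it is already taken."""
    try:
        connection.execute(
            "INSERT INTO experiment (number, type) VALUES (?, ?)",
            (number, exp_type),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(reserve_experiment(conn, "0202", "decode"))  # number already claimed
print(reserve_experiment(conn, "0203", "train"))   # free number
```

Making the experiment number the primary key means the database itself reports the conflict at the moment the number is claimed, so the form could reserve the number first and let the user fill in the details afterwards.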

Eric provided very detailed instructions for all the installations he did to get SpEAK onto Caesar in his log here: http://foss.unh.edu/projects/index.php/Speech:Summer_2013_Eric_Beikman

From Eric's log: "Unfortunately, I ran into a bit of trouble with SpEAK. Namely, it was erroring out with some constructs. Error logs are found in the /var/logs/apache2/ directory, the one I was interested in was in speak-error_log. After some reading, I found it was throwing errors at the portions of code utilizing a syntax introduced in later versions of the PHP interpreter, turns out the PHP interpreter included was out of date. Since this version cannot be found in the mainstream repositories for the version of OpenSUSE installed, I had to do it manually. Updating Caesar to a more modern version (of any distro) should be a priority for this reason."
 * IMPORTANT NOTE: Eric created an SSL cert and noted that it is only good for one year. It needs to be renewed before June 28, 2014. I may contact Eric about this and ask him how to do this. We can add it to the list of things to do before the semester ends. But maybe, because SpEAK will not be running on Caesar, this is not a concern.
 * Looks like there is a reason to update to a newer version of OpenSUSE. Maybe it's already been done. This could be something to bring up to the Systems group.

After a series of problems caused by SpEAK not being able to run with OpenSUSE, Eric stopped trying to run SpEAK on Caesar. Rome was formerly known as Marathon, until Eric configured it into Rome with Fedora. And he then tried to get SpEAK running on Rome.

Yikes... it doesn't look like Eric ever did get SpEAK running on Rome... his logs for the summer end before he got to that. I think Rome was left in a bit of a mess.

03/03/2014
 * Running an experiment using the scripts we have available. I figure it will be most effective, for now, to learn how to run an experiment in the quickest and most efficient way (not including Josh Anderson's master script because it is still under construction).
 * The experiment number is 0202.
 * I first ran the train_01 script without any arguments. The default experiment number was set to be one higher than the last experiment in the directory, which is perfect, and the default Sphinx train setup script is setup_SphinxTrain.pl.
 * Then I simply ran the train_02 script. It is not dependent on where it is executed. It just requires an experiment number to be given. When it executed, it gave no indication or confirmation that it had properly completed, so I checked by opening up the cfg file located in the etc folder of the experiment and made sure that lines 6-8, 79, and 80 had been changed.
 * Next it was time to generate the transcript files. These consist of two portions: the text transcript file, 0202_train.trans, and the audio file ID list, 0202_train.fileids, which contains the list of audio files that make up the transcript. I first made sure that I was in the base experiment folder, and then executed the genTrans6.pl script, because I think it is the most recent version of the genTrans.pl script created by David. I used the first_5hr corpus subset. This was how I executed it:

/mnt/main/scripts/user/genTrans6.pl /mnt/main/corpus/switchboard/first_5hr/train 0202

 * This took about 3 minutes to complete. I remember it taking a lot longer than that, about 15 minutes, when I ran my first experiment at the beginning of the semester. But it did prompt me:

genTrans.pl 100% completed

 * So it should be complete. Maybe someone cleaned up the transcript files like we talked about in class! I checked the Data group's log for this week and saw that it was a task for John Kelly to "Work with Modeling group on OOV". So I read John Kelly's log and saw that he did work on cleaning up the transcript files! Very nice :)... Maybe our group should be posting our tasks in our group log too, so that other people can quickly figure out who worked on what, like I was just able to do by looking at the Data group's log.
 * Also, the two files are present in the etc folder so I should be good.
 * Next step was to create a custom dictionary for the experiment. I was already in my experiment's directory but I had to make sure I was in the etc folder.
 * I know there is a PruneDictionary3.pl script (I've been referencing my scripts information page a lot to see what the most up-to-date version of things is. Hopefully we can keep this up.). But I just used PruneDictionary2.pl instead because I think PruneDictionary3.pl is currently under construction by David Meehan. I executed it like this:

/mnt/main/scripts/train/scripts_pl/pruneDictionary2.pl 0202_train.trans /mnt/main/corpus/dist/cmudict.0.6d 0202.dic

 * Directions said to use /mnt/main/corpus/dist/cmudict.0.6d as the master dictionary, but I looked through the dictionaries because I know Colby Johnson was saying that he put some work into creating and combining dictionaries. I found some more recently created dictionaries, but I'll play it safe for now so I can get this thing running, and then play around the second time.
 * This step took about twenty minutes. It couldn't find pronunciations for 160 words that were listed in add.txt.
 * Next, I copied the filler dictionary into the same directory, 0202/etc, which I was already in, using this command:

cp -i /mnt/main/root/tools/SphinxTrain-1.0/train1/etc/train1.filler 0202.filler

 * Checked to make sure it was there, and it is, as 0202.filler.
 * Then, to generate the phones list, I copied the genPhones.csh script to the etc folder that I was still in:

cp -i /mnt/main/scripts/user/genPhones.csh .

 * And then executed it:

./genPhones.csh 0202

 * I used the vi method to add SIL to the 0202.phone list in its correct alphabetical order.
 * Then I generated the feats data. Before I did anything I brought myself to the base experiment folder.
 * I was interrupted by this message...

Broadcast message from root@caesar (pts/0) (Tue Mar  4 01:23:08 2014):

The system is going down for reboot NOW!

Read from remote host caesar.unh.edu: Connection reset by peer

Connection to caesar.unh.edu closed.

 * Not sure if this is regular. I have had bad luck with Caesar working this week!
 * I was able to get right back on :) and continued.
 * Got myself to the base experiment folder for experiment 0202 and made the feats:

/mnt/main/scripts/train/scripts_pl/make_feats.pl -ctl /mnt/main/Exp/0202/etc/0202_train.fileids

 * This generated TONS of .sph files. It automatically placed them in the /mnt/main/Exp/0202/wav folder. It only took about a minute. I didn't experience any apparent errors.
 * Then I ran the train by executing the RunAll.pl script:

/mnt/main/scripts/train/scripts_pl/RunAll.pl

 * It scrolled warnings down the window for about 30 seconds and ended with this message:

Something failed: (/mnt/main/Exp/0202/scripts_pl/00.verify/verify_all.pl)

 * I guess it's common for it to fail the first time.
 * I opened up the HTML file to check what went wrong. Looks like I ran into issue 1: the Trainer could not find words referenced in the transcript within the dictionary.
 * Given the choice between two methods of solving this issue... I chose the easiest one of course, which is available thanks to Eric Beikman! Method two. This involved the following commands, run in the etc folder of my experiment:

mkdir temp

mv 0202.dic temp

mv add.txt temp

cd temp

cp -i /mnt/main/scripts/user/updateDict.pl .

./updateDict.pl -m 0202.dic add.txt

mv 0202.dic ..


 * Then I went back to the base folder and tried executing the RunAll.pl script again... and failed :(. I'll ask Josh Anderson or someone in the Modeling group for help tomorrow.
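Issue 1 above (transcript words missing from the dictionary) can be checked before running the train. The following is a minimal sketch in Python, assuming a CMU-style dictionary (`WORD  PH1 PH2 ...` per line) and transcript lines of the form `<s> WORDS </s> (utterance_id)`; the function names and the sample utterance IDs are hypothetical, and this is not what verify_all.pl or updateDict.pl do internally.

```python
import re


def dictionary_words(dic_text):
    """Collect headwords from a CMU-style dictionary ('WORD  PH1 PH2 ...')."""
    words = set()
    for line in dic_text.splitlines():
        parts = line.split()
        if parts:
            # Alternate pronunciations look like WORD(2); map them back to WORD.
            words.add(re.sub(r"\(\d+\)$", "", parts[0]).upper())
    return words


def transcript_words(trans_text):
    """Collect words from lines shaped like '<s> WORDS </s> (utterance_id)'."""
    words = set()
    for line in trans_text.splitlines():
        line = re.sub(r"\([^()]*\)\s*$", "", line)  # drop trailing (utterance id)
        for token in line.split():
            if token not in ("<s>", "</s>"):
                words.add(token.upper())
    return words


def missing_words(trans_text, dic_text):
    """Words used in the transcript that have no dictionary entry."""
    return sorted(transcript_words(trans_text) - dictionary_words(dic_text))


# Tiny made-up sample, not real Switchboard data.
sample_trans = (
    "<s> HELLO WORLD </s> (utt_0001)\n"
    "<s> UH-HUH HELLO </s> (utt_0002)\n"
)
sample_dic = (
    "HELLO  HH AH L OW\n"
    "HELLO(2)  HH EH L OW\n"
    "WORLD  W ER L D\n"
)
print(missing_words(sample_trans, sample_dic))
```

On a real experiment, filler words would also need to be looked up in the filler dictionary (0202.filler here), which this sketch does not handle.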

03/04/2014
 * Having trouble logging into Rome.
 * Asked David Meehan to help me out to find out what went wrong with my experiment yesterday. Looks like I needed to copy the add.txt file from 0116. This was something that Colby discovered!
 * Tried to run it again. Failed again due to a different issue. The warning that was given was that there were three phones that occur in the dictionary (/mnt/main/Exp/0202/etc/0202.dic), but not in the phonelist (/mnt/main/Exp/0202/etc/0202.phone). (AH), (EH), and (OW).
 * I looked to see how to solve this issue in the information page for running a train. Found that some of the phones are missing stress indicators. I thought to just add them, but David curiously looked further into the problem for me.
 * Once we solved that, it kept erroring out with an unknown error, so I am just going to abandon this experiment and start a new one. The change: I'll be using the 10hr dictionary in the custom folder and using genTrans5.pl instead of genTrans6.pl. This shouldn't affect anything; it should just give me a better word error rate and more words in the dictionary to make it run faster.
 * Best way to learn is by trial and error I suppose. Second try: experiment number 0203, and I'm running it on Methusalix instead of Caesar this time, like I should be.
 * Using genTrans5.pl does in fact take a bit longer than using genTrans6.pl. It took about 15 minutes versus the 3 minutes.
 * Using the custom 10hr.dic was SO much faster than the cmudict.0.6d! It took one or two minutes versus twenty minutes.
 * Got the experiment to run successfully this way without any errors that caused the train run to interrupt.
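The phone mismatch reported above (phones that occur in the dictionary but not in the phone list) is a set difference, which can be checked directly. This is a minimal sketch in Python, assuming a CMU-style dictionary and a phone list with one phone per line; the function names and sample data are hypothetical.

```python
def dictionary_phones(dic_text):
    """Every phone used in any pronunciation of a CMU-style dictionary."""
    phones = set()
    for line in dic_text.splitlines():
        parts = line.split()
        phones.update(parts[1:])  # parts[0] is the word itself
    return phones


def phones_missing_from_list(dic_text, phone_list_text):
    """Phones the dictionary uses that the phone list does not declare."""
    listed = set(phone_list_text.split())
    return sorted(dictionary_phones(dic_text) - listed)


# Made-up sample: the dictionary has unstressed AH, EH, OW, while the
# phone list only declares the stress-marked variants.
sample_dic = (
    "ABOUT  AH B AW1 T\n"
    "BED  B EH D\n"
    "GO  G OW\n"
)
sample_phone_list = "AH0\nAW1\nB\nD\nEH1\nG\nOW1\nSIL\nT\n"

print(phones_missing_from_list(sample_dic, sample_phone_list))
```

With this sample the dictionary uses AH, EH, and OW without stress digits while the list only has stress-marked forms, which mirrors the kind of mismatch described in the warning.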

SpEAK: getting SpEAK up and running on Rome is going to be a lot more work and research than I expected, seeing as I don't think anyone has done this yet based on past logs. It is something I'm willing to spend some time on, but I want to make sure this is something that Mike Jonas wants me to be spending time on.

I realized that getting comfortable with the experimentation process and directory structure is the most important task right now. I feel pretty good about the first part after running through it successfully, and having my logs to reference when I need to do it again and again makes me feel comfortable. Onward with the next steps in the experimentation process.


 * Plan:


 * To do whatever it takes to get SpEAK up and running on Rome.
 * Run an experiment from start to finish... successfully!
 * Keep adding information to the scripts page. Hopefully others will start adding what they know to it too.
 * I would like to do some more research on the Markov model for my personal understanding.
 * Check up on what the Data group has gathered on where the Transcripts are from.
 * 03/04/2014-Plan today is to try and get myself logged into Rome and finish my experiment.


 * Concerns:

02/26/2014 - Figuring out what's currently wrong with SpEAK and why it's not working

03/04/2014-Getting SpEAK up and running on Rome is going to be a lot more work and research than I expected, seeing as I don't think anyone has done this yet based on past logs. It is something I'm willing to spend some time on, but I want to make sure this is something that Mike Jonas wants me to be spending time on.

Week Ending March 18, 2014
03/05/2014- logged in to update tasks for the week after the meeting.

03/17/2014- logged in

03/18/2014- logged in and read logs. This week's tasks are going to be spilling over into next week's tasks. I had a really busy week and I am using this week as my one week off from major Capstone work. I'll work double next week to make up some work.


 * Task:


 * Create a user table for which scripts to use when you need something done.
 * Create a flow chart for the training process.
 * Maybe I could create a glossary of speech experiment terms.
 * Continue to do research on SpEAK.
 * Continue to run experiments from the exp 0203 train that I ran last week.


 * Results:


 * Plan:

Next week I'll be working on drafting up pieces for the Experiment page reconstruction.


 * Concerns:

Sorry. Had a busy week preparing for Girl Technology Day. This is my freebie week off.

Week Ending March 25, 2014
03/22/2014 - logged in and read logs. Started to build a language model using experiment 0203 on Methusalix.

03/23/2014 - logged in and checked the status of Josh's script

03/25/2014 - logged in and did read over how to do decodes.

03/26/2014 - Running the decode on Exp 0203 on Methusalix

03/26/2014 - logged in and did a google hang out into the meeting because I could not attend.


 * Task:


 * Finish a full experiment decode.
 * Draft an update for the experiment page.


 * Results:

Creating the Language Model

 * First step is to set up the language model directory inside your experiment's base directory. This folder is called LM.

mkdir LM

 * Then go into this directory and copy the same transcript that you used in running the train into this LM folder, using the corpus directory file path.

cp -i /mnt/main/corpus/switchboard/10hr/train/trans/train.trans trans_unedited

 * You then prepare the transcript. The Perl script used to do this is located at the following file path.

/mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/ParseTranscript.perl trans_unedited trans_parsed

 * This step creates a trans_parsed file in the LM folder of the experiment and also prints the file out in the command window. It looks like a series of jumbled sentences.
 * Then you copy the script that creates the language model.

cp -i /mnt/main/scripts/user/lm_create.pl .

 * And execute this script. This will create the language model.

./lm_create.pl trans_parsed

 * This includes:
 * Generating a word frequency file
 * Generating a vocab file containing the most frequent 20000 words.
 * Generating an ID 3-gram file
 * Then finally generating the language model
 * Creating a language model should take no longer than 5 minutes. There is no wait time.
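To get a feel for the first three steps listed above (word frequencies, a capped vocabulary, and 3-gram counts), here is a minimal sketch in Python. It is an illustration of the ideas only, not what lm_create.pl does internally; the function names and sample sentences are hypothetical.

```python
from collections import Counter


def word_frequencies(sentences):
    """Count how often each word occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def top_vocab(counts, size=20000):
    """The `size` most frequent words, returned in alphabetical order."""
    return sorted(word for word, _ in counts.most_common(size))


def trigram_counts(sentences):
    """Count every run of three consecutive words within a sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return counts


# Made-up sentences standing in for lines of a parsed transcript.
sample = ["i think so", "i think so too", "so it goes"]

freqs = word_frequencies(sample)
vocab = top_vocab(freqs, size=3)
trigrams = trigram_counts(sample)

print(freqs.most_common(3))
print(vocab)
print(trigrams[("i", "think", "so")])
```

The final step, turning the counts into smoothed probabilities for the language model itself, is left to the real toolkit and is not sketched here.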

Running the Decode
 * First step in running a decode is to create a new directory called DECODE in your experiment's base directory.
 * Then, cd into this directory and copy the run_decode script into the directory. It can be found at the file path below.

cp -i /mnt/main/scripts/user/run_decode.pl .

 * To run this script, it requires two parameters. (They both usually end up being the same.)
 * The experiment number of the train to be decoded
 * The experiment number of the Acoustic Model to be used
 * Run the script using the following format.

./run_decode.pl <experiment#> <AM#>

 * I used:  ./run_decode.pl    and got the error "Missing name for redirect."
 * I then looked through the user folder in the scripts directory and saw that there were updated versions of the run_decode Perl script in there. So I copied over run_decode4.pl:

cp -i /mnt/main/scripts/user/run_decode4.pl .

 * When I tried running that script, the same error occurred.
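"Missing name for redirect." is a message the csh/tcsh shell prints when a `<` or `>` redirection has no file name after it, so the error comes from the shell rather than from Perl. One possible cause, and this is only an assumption since the exact arguments used are not recorded above, is that the angle brackets from the usage format were typed literally; another is that the script itself runs a shell command with an empty file name. The sketch below, in Python with a hypothetical helper name, builds the command from plain values and refuses arguments that still contain angle brackets.

```python
def build_decode_command(experiment, acoustic_model, script="./run_decode.pl"):
    """Build the argument list for a decode run from plain experiment numbers.

    Angle brackets in a usage string mark placeholders; typed literally,
    a shell treats them as input/output redirection.
    """
    args = [script, str(experiment), str(acoustic_model)]
    for arg in args[1:]:
        if "<" in arg or ">" in arg:
            raise ValueError(
                f"argument {arg!r} contains '<' or '>': "
                "replace the placeholder with the bare number"
            )
    return args


print(build_decode_command("0203", "0203"))
```

The returned list could be handed to `subprocess.run` from the DECODE directory; passing a list means no shell parses the arguments, so redirection characters cannot be misread.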


 * Plan:


 * Understand and outline the Language Model building process.
 * Understand and outline the Decode process.


 * Concerns:

I ran into a problem when initially trying to build the language model over the weekend. For whatever reason, I was not granted access to Methusalix. So I just read up on it and tried again during the week.

Week Ending April 1, 2014
03/26/2014 - logged in, had a group meeting, and worked on the Experiment Setup page.

03/27/2014 - logged in and read logs. Started drafting for the Experiment Directory Page.


 * Task:


 * Move the old Experiment Setup Page to a "new location". Replace its old location with a new and updated Experiment Setup Page. Provide a link at the bottom of this new and updated setup page to the old "archived" setup page. The old page really is no longer useful because the information is so outdated and hard to follow. The page needed to be completely reconstructed.
 * Write the Experiment Directory Page from the information that Josh and I have gathered in our personal logs.
 * Write the Filezilla page, explaining where you download it from, how to connect to Caesar, and what it can be useful for.
 * Add more scripts to the scripts page. I have dozens and dozens of scripts defined in my personal logs but I only put a select few on the scripts page. Colby Chenard has requested that I add more, so that will be another task for this week.


 * Results:

03/26/2014


 * Moved the old Experiment Setup page to a new location. The new location link is: Speech:Exp_Archive. It was not deleted.
 * The new Experiment Setup page now lives at the previously existing link.
 * The Experiment Group (Team Scriptosaurus) got together today to discuss the structure of the page. Today I will structure it based on what we discussed so that we can begin to fill it with the information that we have gathered in our personal logs so far this semester.
 * We had a pretty specific idea about how we wanted the page to look so that it would be easy to follow and navigate through when first learning about experiments. So at the top of the page I used a borderless table to format the two columns of links: Quick Links and Build Your Own Experiment.
 * I added headers for each step of running an experiment for a written explanation to help future semesters get a feel for what is going on when running an experiment.
 * At the bottom of the page is a place for archived Experiment Setup pages. Here is where the links to the archived pages should go along with the date that it was replaced. Maybe some day our page will be archived in a future semester's Experiment Setup page.

03/27/2014

Drafting for the Experiment Directory Page - idea is to first go over some terms that I will be using throughout the page.


 * A Directory is a file system cataloging structure which contains references to other computer files, and possibly other directories. The terms directory and folder can be used interchangeably.
 * A directory contained inside another directory is called a subdirectory
 * The terms parent and child are often used to describe the relationship between a subdirectory and the directory in which it is cataloged, the latter being the parent.
 * The top-most directory in such a filesystem, which does not have a parent of its own, is called the root directory.
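The parent, child, and root terms above can be seen on a real path from these logs. This is a small sketch using Python's `pathlib`; it only manipulates the path text and does not touch the file system, so it runs anywhere.

```python
from pathlib import PurePosixPath

# The etc folder of experiment 0202, as used elsewhere in this log.
etc = PurePosixPath("/mnt/main/Exp/0202/etc")

base = etc.parent                  # the experiment's base directory
root = PurePosixPath(etc.anchor)   # the root directory, "/"
ancestors = list(etc.parents)      # every directory above etc, nearest first

print(base)                  # parent of the etc subdirectory
print(etc.name)              # name of the child inside its parent
print(ancestors[-1])         # the last ancestor is the root
print(root.parent == root)   # the root has no parent of its own
```

Here `etc` is a subdirectory (child) of the experiment's base directory, and walking up through the parents always ends at the root.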

1. First, make sure you have an ssh connection to Caesar. It's good practice to then ssh into a drone and work from there.
 * If you want to view the files and directories inside a particular directory through a terminal window:

2. The Unix command for listing files is ls

3. You'll see that each experiment can have different folders within its base directory. Here is an example displaying the files inside two different experiments.




 * So we will outline all possible files and folders that you might find in an experiment's base directory.
 * If you want a better visualization of the structure of the directories on Caesar, Filezilla is a great tool to use. Filezilla allows for a more interactive, point-and-click view of what is inside a directory, as you see below.
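The comparison of two experiments' base directories can also be scripted. Below is a minimal sketch in Python that builds two throwaway experiment folders in a temporary directory and reports which subfolders only one of them has. The folder names (etc, wav, scripts_pl, LM, DECODE) are ones mentioned in this log, but the layout given to each experiment here is an assumption for illustration, not a listing of the real 0202 and 0203.

```python
import os
import tempfile


def subfolders(path):
    """Names of the directories directly inside `path`, sorted."""
    return sorted(entry.name for entry in os.scandir(path) if entry.is_dir())


# Assumed layouts, for illustration only.
layouts = {
    "0202": ["etc", "wav", "scripts_pl"],
    "0203": ["etc", "wav", "scripts_pl", "LM", "DECODE"],
}

with tempfile.TemporaryDirectory() as exp_root:
    for exp, folders in layouts.items():
        for folder in folders:
            os.makedirs(os.path.join(exp_root, exp, folder))

    in_0202 = set(subfolders(os.path.join(exp_root, "0202")))
    in_0203 = set(subfolders(os.path.join(exp_root, "0203")))
    shared = sorted(in_0202 & in_0203)
    only_in_0203 = sorted(in_0203 - in_0202)

print(shared)
print(only_in_0203)
```

On Caesar, the same `subfolders` function could be pointed at two real experiment directories under /mnt/main/Exp instead of the temporary ones.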




 * I'll insert this table into the page somewhere:


 * Filled the Experiment Directory Page. Feel free to add to it if you have more information that I do not have.
 * I just realized it would be a good idea to add how an experiment directory is set up.


 * Plan:


 * Create the new Experiment Setup Page.
 * Update the new Experiment Setup Page.
 * Create the Experiment Directory Page. The link to this page is on the  Experiment Setup page.
 * Create the Filezilla Page. The link to this page can also be found on the  Experiment Setup page.


 * Concerns:

03/26/2014
 * A concern with moving the old experiment set up page is that any log that was previously referencing the information on it specifically in the form of a link, will now take them to the new experiment page. But it should be fine because the information on the old page was no longer of any use.
 * I find myself confused about where to be documenting certain things. Should some things be documented in our group log and not our personal log? Or in both places? What should be where?

Week Ending April 8, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 15, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 22, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 29, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending May 6, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns: