Speech:Spring 2014 Modeling Group


Group Member Logs

 * Colby Chenard
 * Colby Johnson
 * David Meehan
 * Forrest Surprenant

Assigned machines are: miraculix & majestix

Week of 2/3/2014

 * Research the current system configurations of Sphinx on Caesar.
 * Run successful trains.
 * Begin analyzing areas of the system that could be improved.
 * Become familiar with the main scripts used in training.

Week of 2/10/2014

 * We have a full train/decode complete: experiment 0150
 * Trains and decodes 0156 and 0158 were completed using a smaller number of senones. The end results were poor; the script may have been stopped before it could complete, though the fact that we could score suggests it actually finished.
 * The transcript for the tiny data set is out of order, as is the one for the mini data set. The other data sets are at least partially ordered, so this may be the source of the problem.
 * run_decode2.pl was written; it accounts for different numbers of senones, which the previous script did not.
 * We created general experiment logs for everyone.
 * Everyone worked on running trains/decodes.

Discoveries

 * Discovered that the Switchboard corpus transcription files contain a train set and a test set; the test set is a subset of the full train set.
 * These subsets will be useful for completing many different trains and decodes without overloading the system.
 * Using subsets will dramatically reduce the time needed to complete tasks and provide data much faster.

Master Script: I had discussed making a master script that automates running a train. It seems that train_01.pl does the first part, train_02.pl the second part, genTrans#.pl (e.g., genTrans6.pl) the third part, pruneDictionary2.pl some of the fourth part, and make_feats.pl the sixth part.

The fourth part needs a file copied in addition to running the script, but that could be a simple automation. The fifth part also needs a file copied, plus an entry added to that file.

This would be the order of operations for the master script. Each of these steps has required parameters, such as which dictionaries to use; these could be made optional with default values, or the script could simply take a long list of required parameters.

This would be useful to those who already understand the process and just need to expedite it for data collection purposes. A sketch of what such a script might look like follows.
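
Below is a minimal sketch of such a master script. The step order follows the notes above, but the argument forms are assumptions copied from the Running a Train commands at the bottom of this page, and the corpus/dictionary arguments in particular would need to be verified before use.

#!/usr/bin/perl
# Hypothetical master-script sketch. Step order is from the notes above;
# the exact arguments are assumptions and must be verified.
use strict;
use warnings;

my $exp = shift or die "usage: $0 <exp#>\n";
my $user  = "/mnt/main/scripts/user";
my $train = "/mnt/main/scripts/train/scripts_pl";

sub run_step {
    my @cmd = @_;
    system(@cmd) == 0 or die "step failed: @cmd\n";
}

run_step("$user/train_01.pl", $exp);                             # part 1
chdir("/mnt/main/Exp/$exp") or die "cannot cd to experiment dir: $!";
run_step("$user/train_02.pl", "-e", $exp);                       # part 2
run_step("$user/genTrans6.pl",
         "/mnt/main/corpus/switchboard/first_5hr/train", $exp);  # part 3 (assumed args)
# Parts 4 and 5 (pruneDictionary2.pl plus the .filler and .phone files)
# need files copied and hand-edited, as noted above, so they stay manual here.
run_step("$train/make_feats.pl", "-ctl",
         "/mnt/main/Exp/$exp/etc/${exp}_train.fileids");         # part 6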

Notes while running trains!
 * The dictionary must not contain the blue ^W control characters; they will cause an error. (A one-liner for stripping them follows this list.)
 * They will also affect the contents of your phone file, causing further errors.
 * 13 Feb 2014: the first_5hr/test set will error out at module 50.
 * Make sure you follow the online instructions to a T; they are very precise.
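
To strip the ^W (Ctrl-W) characters from a dictionary, a Perl one-liner along these lines should work; the filename is a placeholder, and \cW is Perl's notation for the Ctrl-W control character:

perl -pi.bak -e 's/\cW//g' yourdict.dic

The -i.bak keeps a backup of the original file in case the edit removes more than intended.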

genTrans Versions

 * genTrans5: does not separate channels; writes to the file every iteration.
 * genTrans6: separates channels; writes to the file every iteration.
 * genTrans7: created to test the efficiency of writing once at the end instead.
 * Gained 3% efficiency; not much, but it adds up. (A sketch of the change follows this list.)
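
For illustration, here is a minimal sketch of the difference; this is not the actual genTrans code, and the per-line work is a stand-in:

#!/usr/bin/perl
# Sketch of the genTrans7 change: buffer all output lines and write them
# once at the end, instead of printing inside the loop as genTrans5/6 do.
use strict;
use warnings;

my @buffer;
while (my $line = <STDIN>) {
    chomp $line;
    push @buffer, uc($line) . "\n";   # stand-in for the real transcript processing
}
print @buffer;   # single write at the end; the notes measured this ~3% faster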

States Per HMM Parameter

Let's look into the statesperhmm parameter in the sphinx_train.cfg file. This is a parameter CMU recommends changing depending on the training data: if you have "difficult" speech (noisy/spontaneous/damaged), use 3-state HMMs with a noskip topology. For clean speech you may choose any odd number of states, depending on the amount of data you have and the type of acoustic units you are training. If you are training word models, for example, you might be better off using 5 states or higher; 3-5 states are good for shorter acoustic units like phones. You cannot currently train 1-state HMMs with Sphinx. A sketch of the relevant config lines follows the questions below.
 * The default is 3
 * This is what has been used for all experiments to my knowledge
 * How do we classify our speech?
 * Is it clean enough to change the statesPerHmm value?
 * Is it not worth changing the value? (take too long to train/decode)
 * Will a higher value over-train on our data?
 * What value is recommended for different amounts of data?
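
The relevant lines in sphinx_train.cfg (a Perl config file) look roughly like this; the variable names are taken from the stock SphinxTrain config and should be verified against the copy on Caesar:

$CFG_STATESPERHMM = 3;    # the default; CMU suggests 5+ for word models, 3-5 for phones
$CFG_SKIPSTATE = 'no';    # 'no' gives the noskip topology recommended for difficult speech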

Parallelization

There is an -npart option to help you partition your training data; it is set in the .cfg file. This will allow you to use multiple CPUs on a local machine, or multiple machines across a network using PBS or Torque. npart splits the run into multiple parts, one per CPU (not sure whether it supports Hyper-Threading; supposedly it does, but I cannot get it working). If npart is greater than 1 and you are using one machine, $CFG_QUEUE_TYPE must be set to POSIX. To use multiple machines, PBS or Torque must be entered instead, and $CFG_QUEUE_NAME must be set to the correct queue.
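
For reference, the corresponding sphinx_train.cfg lines might look like the following sketch; the npart value and queue name are hypothetical, and the exact $CFG_QUEUE_TYPE strings should be checked against the installed SphinxTrain:

$CFG_NPART = 4;                      # split training into 4 parallel parts
$CFG_QUEUE_TYPE = "Queue::POSIX";    # required when npart > 1 on a single machine
# For multiple machines via PBS/Torque instead:
# $CFG_QUEUE_TYPE = "Queue::PBS";
# $CFG_QUEUE_NAME = "workq";         # hypothetical queue name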

Running a Train
Copy and paste these commands in the groups provided, and only if using the first_5hr/train corpus subset. A trick is to copy and paste all of it and do a find-and-replace of all <exp#> placeholders with your experiment number.
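
If the commands are saved to a file first, the replacement can also be done with a Perl one-liner; the filename here is hypothetical, and 0150 stands in for your experiment number:

perl -pe 's/<exp#>/0150/g' train_commands.txt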

/mnt/main/scripts/user/train_01.pl <exp#>
cd <exp#>
/mnt/main/scripts/user/train_02.pl -e <exp#>

(Change the senone and density values if needed. If you would like to use a different corpus subset and know its dictionary, modify the text before copying.)

You must be in /mnt/main/Exp/<exp#>

/mnt/main/scripts/user/genTrans5.pl /mnt/main/corpus/switchboard/first_5hr/train <exp#>

Let the genTrans script run...

cd etc
/mnt/main/scripts/train/scripts_pl/pruneDictionary2.pl <exp#>_train.trans /mnt/main/corpus/dist/<dictionary>.dic
cp -i /mnt/main/root/tools/SphinxTrain-1.0/train1/etc/train1.filler <exp#>.filler
cp -i /mnt/main/scripts/user/genPhones.csh .
./genPhones.csh <exp#>
vi <exp#>.phone

(Insert SIL into your phone file.)
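
The phone file lists one phone per line, and SIL gets a line of its own, along these lines (list abbreviated):

AA
AE
...
SIL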

cd ..
/mnt/main/scripts/train/scripts_pl/make_feats.pl -ctl /mnt/main/Exp/<exp#>/etc/<exp#>_train.fileids
nohup /mnt/main/scripts/train/scripts_pl/RunAll.pl &