Speech:Spring 2014 Forrest Surprenant Log



Week Ending February 4, 2014
Explore wiki
 * Task:
 * Feb 1

Create a log page.
 * Feb 2

I signed in.
 * Feb 3

Attempt a train on Caesar
 * Feb 4

Explored wiki
 * Results:
 * Feb 2

Created a log page and read some logs from former Model members.
 * Feb 1

Sign in was successful!
 * Feb 3

Logged into Caesar and created my experiment folder 0146. Ran a couple of scripts and configured Sphinx for my directory.
 * Feb 4

Just thought I'd share something I found out with David's help: you can transfer files to and from Caesar over SFTP by connecting FileZilla to sftp://caesar.unh.edu with your credentials. This is handy if you don't want to modify text files from the command line.


 * Plan:

Click a link, then go bake cookies for an hour.

Solve speech for humanity
 * Feb 3
 * Concerns:

With the number of links that need to be clicked, the overabundance of cookies will likely put me into a diabetic coma. I fear this log could prove detrimental to my health.

Week Ending February 11, 2014

 * Plan:

Feb 7

log in

Feb 9

Figure out how to log into all of the different drones on the LAN so that I can view the processes running on each system.

Find out the system info for each of the drones.

Learn more about Torque.

Verify that it has been installed on multiple servers. If found on a server, record the IP address and server name for future use.

Feb 10

Logged in

Feb 11 Based on my reading of Tommy McCarthy's logs earlier this week, I am going to verify that all of the nodes have torque installed.


 * Results:

Feb 7

Log in went well. The server said "Have a lot of fun..." So I did.

Feb 9

Read through Tommy McCarthy's logs from Summer 2013. He said he had torque installed on 8 of the drones, aka servers.

Figured out how to view system info: cat /proc/cpuinfo

Here's what I got for results from caesar.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 5
cpu MHz         : 3056.569
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips        : 6113.13
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 32 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 5
cpu MHz         : 3056.569
cache size      : 1024 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 6
initial apicid  : 6
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips        : 6113.37
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 32 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 5
cpu MHz         : 3056.569
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips        : 6113.29
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 32 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 3.06GHz
stepping        : 5
cpu MHz         : 3056.569
cache size      : 1024 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 7
initial apicid  : 7
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cid xtpr
bogomips        : 6113.36
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 32 bits virtual
power management:

From this I'm concluding that each of the nodes has quad-core processing power. After a preliminary review of the train.config files, it appears that only one core is being utilized. Just by using all 4 cores, we could cut down our train time considerably. I'm estimating that instead of cutting the time down to 1/10th, we may be looking at something more like 1/40th of the time if we can configure torque for all 10 nodes, aka servers, aka drones, aka whatever you want to call them next... really.
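As a cross-check on the core count, the cpuinfo text can be summarized instead of read by eye. This is a minimal sketch and not part of the original log: summarize_cpuinfo is a hypothetical helper name, and the sample file holds only a few fields copied from the caesar output. It counts processor entries and distinct physical id values. On the caesar data that comes to 4 logical processors on 2 physical packages, which, together with cpu cores : 1 and siblings : 2, suggests the four entries may be hyperthreaded logical processors rather than four physical cores.

```shell
# Summarize /proc/cpuinfo-style text: logical processors and physical packages.
# summarize_cpuinfo is a hypothetical helper name used only for this sketch.
summarize_cpuinfo() {
    LC_ALL=C awk -F: '
        # Trim whitespace around the field name so padded keys still match.
        { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }
        key == "processor" { logical++ }
        key == "physical id" {
            val = $2; gsub(/[ \t]/, "", val)
            if (!(val in seen)) { seen[val] = 1; packages++ }
        }
        END { printf "logical=%d packages=%d\n", logical, packages }
    ' "$1"
}

# Sample data: selected fields copied from the caesar output in this log.
sample=$(mktemp)
cat > "$sample" <<'EOF'
processor      : 0
physical id    : 0
siblings       : 2
cpu cores      : 1
processor      : 1
physical id    : 3
siblings       : 2
cpu cores      : 1
processor      : 2
physical id    : 0
siblings       : 2
cpu cores      : 1
processor      : 3
physical id    : 3
siblings       : 2
cpu cores      : 1
EOF
summarize_cpuinfo "$sample"
```

On a live node the same function could be pointed at the real file: summarize_cpuinfo /proc/cpuinfo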

After a review of all 10 systems, I found that with the exception of Caesar all of the servers have quad cores with a clock of 2783.289 MHz. Caesar has a quad core as well, but its clock runs at 3056.569 MHz. Essentially this means we have 40 processors with an average clock of 2810.617 MHz, for a combined clock rate of 112424.68 MHz or 112.4 GHz... yeah, that's a lot of power!
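The totals above can be rechecked with a few lines of awk. This is only a sketch of the arithmetic, using the numbers from this log (9 nodes at 2783.289 MHz, caesar at 3056.569 MHz, 4 processors per machine); aggregate_clock is a hypothetical name. Adding clock rates together is a rough capacity indicator at best, not a prediction of train time.

```shell
# Recompute the aggregate clock figures with awk.
# aggregate_clock is a hypothetical helper name used only for this sketch.
aggregate_clock() {
    LC_ALL=C awk 'BEGIN {
        per_machine = 4                                  # processors reported per machine
        total = 9 * per_machine * 2783.289 + 1 * per_machine * 3056.569
        count = 10 * per_machine
        printf "processors=%d total_mhz=%.2f average_mhz=%.3f\n", count, total, total / count
    }'
}
aggregate_clock
```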

As far as logging into the other local servers goes, it's not so bad. ssh 192.168.10.2 or ssh asterix will log you into the desired server. Just type yes when prompted to add the server to the known hosts list and YOU'RE IN!

Feb 10

Log in went well

Feb 11

Tried running torque, excuse me... tried starting torque using pbs_server. I kept receiving the error "command not found". Not sure what this is about; I'm thinking something is missing from the install or I need to run this as the root user. I tried logging in as root, but I think the password has been changed....

Week Ending February 18, 2014

 * Task:
 * Feb 15

Logged in - Run torque


 * Feb 16

Logged in - Read through Colby Johnson's logs and saw his findings on the correlation between the bit density and the word error rate for a decode.


 * Feb 17

Logged in - Today I tasked myself with running a complete 5hr train.


 * Feb 18

Logged in - Today I am picking up where I left off yesterday with a 5hr train. I ended yesterday after generating the feats data

---


 * Results:


 * Feb 16

I was able to start torque on Caesar today. The pbs_server command has to be run as an admin.

su

then

pbs_server


 * Feb 16

After reading through the experiment results, I'm seeing that the best results (lowest word error rate) are being produced by 32 - 64 density trains. I see a correlation between the word error rate at a given density and the train time, but more data is needed to draw up an accurate model.


 * Feb 17

I was able to get further than previous attempts today on a 5hr train. I created exp 0143 for this purpose. I'm currently waiting on my dictionary substitution. I ended today after attempting a run that stopped with:

Something failed: (/mnt/main/Exp/0143/scripts_pl/00.verify/verify_all.pl)

The output leading up to it was:

Phase 7: TRANSCRIPT - Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
Something failed: (/mnt/main/Exp/0143/scripts_pl/00.verify/verify_all.pl)

I'm going to read through the 0143.html to look for more information about the error.


 * Feb 18

results pending...

One thing I found today when creating a dictionary for the train: you can remove duplicate lines from your dictionary by using this command:

awk '!x[$0]++' <#experiment>.dic > <#experiment>.new.dic

The <#experiment>.new.dic file (0143.new.dic in my case) will now contain a unique set of entries! I would rename your old <#experiment>.dic to something like <#experiment>.old.dic for safekeeping. Don't forget to rename <#experiment>.new.dic to <#experiment>.dic when you are done.
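The dedupe and rename steps can be wrapped into one function. This is a minimal sketch: dedupe_dic is a hypothetical helper name, and the demo dictionary entries are made up for illustration. The awk expression prints a line only the first time it is seen, so order is preserved. Only exact duplicate lines are removed; entries for the same word with different pronunciations are kept.

```shell
# Remove duplicate lines from <base>.dic, keeping the first occurrence of each
# line, and keep the original as <base>.old.dic.
# dedupe_dic is a hypothetical helper name; pass the path without ".dic".
dedupe_dic() {
    base="$1"
    awk '!x[$0]++' "$base.dic" > "$base.new.dic" || return 1
    mv "$base.dic" "$base.old.dic" &&
    mv "$base.new.dic" "$base.dic"
}

# Demonstration on a throwaway dictionary (entries invented for the example).
workdir=$(mktemp -d)
printf 'HELLO HH AH L OW\nWORLD W ER L D\nHELLO HH AH L OW\n' > "$workdir/demo.dic"
dedupe_dic "$workdir/demo"
cat "$workdir/demo.dic"
```

For a real experiment the call would look like dedupe_dic 0143, run from the directory that holds 0143.dic.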

---


 * Plan:


 * Feb 17

My plan for today was to complete a 5hr train. I'm doing this so that I can better understand the training process. I think this will help me get a better grasp of the jobs that I will be splitting up using torque.


 * Feb 18

My plan for today is to read through the 0143.html experiment file to find out why my train failed.

---


 * Concerns:

I'm not finding much information on configuring torque successfully with sphinx. My biggest concern is finding documentation for this project.

Week Ending February 25, 2014

 * Task:


 * Feb 22

Logged in - Read logs


 * Feb 23

Logged in - Attempt train with torque. Exp directory folder 0143


 * Feb 24

Logged in - Analyze result from my 5hr train.

Calculate elapsed time for train.


 * Feb 25


 * Logged in - Attempt to run more jobs using torque.
 * Figure out error messages when running jobs.


 * Results:

Was finally able to get the 5hr train running today. We seriously need to update the master dictionary.

I am currently timing how long it takes for 5hrs of training at 8 bit density. Start time: 2:13 AM


 * Feb 23

I was able to run a 5hr train today without any more annoying errors, which gives me an approximate time for a single 5hr train. I started the train at 2:13 AM, when the server was least likely to be affected by other jobs. The last update to my train html files happened at exactly 3:06 AM. My total runtime was 53 minutes.
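The runtime figure is simple clock arithmetic, sketched below for anyone repeating the timing. elapsed_minutes is a hypothetical helper name; it assumes both times fall on the same day and are given in 24-hour HH:MM form.

```shell
# Elapsed minutes between two same-day 24-hour clock times given as HH:MM.
# elapsed_minutes is a hypothetical helper name used only for this sketch.
elapsed_minutes() {
    LC_ALL=C awk -v start="$1" -v end="$2" 'BEGIN {
        split(start, s, ":"); split(end, e, ":")
        print (e[1] * 60 + e[2]) - (s[1] * 60 + s[2])
    }'
}

elapsed_minutes 02:13 03:06   # start and last-update times from this entry
```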


 * Feb 24

So here is how to start up torque thus far. I will continue to update this and move it to another page... eventually.

The Torque server is installed on Caesar. The other 9 nodes are installed with a client edition. However, according to Tommy's logs, Eric was using asterix for his own experiments, so there are actually only 8 available nodes.

 * To start the server on caesar, enter the following commands. Note: as of this time, you must be a system user with root access to run the pbs/torque commands.

% su
% pbs_server

 * Next, verify that all of the nodes are up and running. This can be done by entering the following:

% pbsnodes

This will give an output something like this.

asterix
     state = offline
     np = 1
     ntype = cluster
     status = rectime=1393302602,varattr=,jobs=,state=free,netload=8591101,gres=,loadave=0.61,ncpus=4,physmem=4087048kb,availmem=161427016kb,totmem=159995220kb,idletime=906858,nusers=1,nsessions=6,sessions=1810 1814 1815 1978 1981 2105,uname=Linux asterix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

obelix
     state = free
     np = 1
     ntype = cluster
     status = rectime=1393302607,varattr=,jobs=,state=free,netload=3217851972,gres=,loadave=0.00,ncpus=4,physmem=4087048kb,availmem=2718801608kb,totmem=2539901956kb,idletime=603383,nusers=1,nsessions=6,sessions=3204 3210 3211 3218 3221 3238,uname=Linux obelix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

miraculix
     state = free
     np = 1
     ntype = cluster
     status = rectime=1393302585,varattr=,jobs=,state=free,netload=502505649,gres=,loadave=2.45,ncpus=4,physmem=4087048kb,availmem=44093936kb,totmem=38094100kb,idletime=603308,nusers=2,nsessions=4,sessions=16317 16407 16579 16640,uname=Linux miraculix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

traubadix
     state = free
     np = 1
     ntype = cluster
     status = rectime=1393302605,varattr=,jobs=,state=free,netload=3206444646,gres=,loadave=0.03,ncpus=4,physmem=4087048kb,availmem=60055512kb,totmem=53295900kb,idletime=460889,nusers=1,nsessions=6,sessions=4141 4147 4148 4211 4214 4232,uname=Linux traubadix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

majestix
     state = free
     np = 1
     ntype = cluster
     status = rectime=1393302605,varattr=,jobs=,state=free,netload=1903406012,gres=,loadave=0.01,ncpus=4,physmem=4087048kb,availmem=14593552kb,totmem=7618820kb,idletime=27560,nusers=1,nsessions=6,sessions=6066 6072 6073 6080 6082 6098,uname=Linux majestix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

idefix
     state = free
     np = 1
     ntype = cluster
     status = rectime=1393302572,varattr=,jobs=,state=free,netload=2714286452,gres=,loadave=0.00,ncpus=4,physmem=4087048kb,availmem=73694632kb,totmem=68523300kb,idletime=21885,nusers=1,nsessions=2,sessions=6051 6058,uname=Linux idefix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

automatix
     state = free
     np = 1
     ntype = cluster
     status = rectime=1393302596,varattr=,jobs=,state=free,netload=1931792,gres=,loadave=0.00,ncpus=4,physmem=4087048kb,availmem=105995680kb,totmem=98978100kb,idletime=21791,nusers=0,nsessions=0,uname=Linux automatix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

methusalix
     state = free
     np = 1
     ntype = cluster
     status = rectime=1393302604,varattr=,jobs=,state=free,netload=3461957861,gres=,loadave=0.00,ncpus=4,physmem=4087048kb,availmem=3001291636kb,totmem=4262434132kb,idletime=449790,nusers=1,nsessions=6,sessions=6090 6096 6097 6104 6106 6121,uname=Linux methusalix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

verleihnix
     state = free
     np = 1
     ntype = cluster
     status = rectime=1393302592,varattr=,jobs=,state=free,netload=1893575,gres=,loadave=0.00,ncpus=4,physmem=4087048kb,availmem=145919848kb,totmem=139129156kb,idletime=37015,nusers=1,nsessions=6,sessions=1631 1637 1639 1798 1817 1942,uname=Linux verleihnix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

Notice that asterix is offline. This means that it is not available for queuing or submitting jobs. If a node is offline, proceed with the following.

 * ssh into the node that is offline and start the pbs service.

% ssh asterix
% pbs_mom

 * Next, exit back to caesar.

% exit

 * Submit your job using

% sudo qsub /mnt/main/scripts/train/scripts_pl/RunAll.pl

If successful, you should receive output like the following.

184.caesar.unh.edu

The number in this example is 184. This is your job ID.

 * Run your job using your job ID. Example:

% qrun 184
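Spotting an offline node in a long pbsnodes listing is easier with a small filter. The sketch below is an illustration only: nodes_not_free is a hypothetical helper name, and it assumes pbsnodes prints each node name on its own unindented line followed by indented attribute lines. The sample input is a trimmed version of the listing in this log.

```shell
# Print "<node> <state>" for every node whose state is not "free",
# reading pbsnodes-style text on standard input.
# nodes_not_free is a hypothetical helper name used only for this sketch.
nodes_not_free() {
    awk '
        # An unindented, non-empty line starts a new node record.
        /^[^ \t]/ { node = $1 }
        # Match only the "state = ..." attribute, not the "status = ..." line.
        $1 == "state" && $2 == "=" && $3 != "free" { print node, $3 }
    '
}

# Trimmed sample of the listing shown in this log.
nodes_not_free <<'EOF'
asterix
     state = offline
     np = 1
     status = rectime=1393302602,state=free,ncpus=4

obelix
     state = free
     np = 1
     status = rectime=1393302607,state=free,ncpus=4
EOF
```

On caesar the filter would be fed from the real command: pbsnodes | nodes_not_free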

As of right now, the jobs run, and then fail.

Can't locate SphinxTrain/Config.pm in @INC (@INC contains: /var/spool/torque/mom_priv/jobs/lib /usr/lib/perl5/site_perl/5.12.1/i586-linux-thread-multi /usr/lib/perl5/site_perl/5.12.1 /usr/lib/perl5/vendor_perl/5.12.1/i586-linux-thread-multi /usr/lib/perl5/vendor_perl/5.12.1 /usr/lib/perl5/5.12.1/i586-linux-thread-multi /usr/lib/perl5/5.12.1 .) at /var/spool/torque/mom_priv/jobs/183.caesar.unh.edu.SC line 47. BEGIN failed--compilation aborted at /var/spool/torque/mom_priv/jobs/183.caesar.unh.edu.SC line 47.

I'll post more when I can figure this out.
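One possible explanation, offered as a guess rather than a confirmed diagnosis: the error shows the script running as /var/spool/torque/mom_priv/jobs/183.caesar.unh.edu.SC, so Torque is executing a spooled copy, and @INC contains /var/spool/torque/mom_priv/jobs/lib, which looks like a library path built relative to the script's own location. If that is the cause, submitting a small shell wrapper that changes into the experiment directory and runs the Perl script from there might avoid it. The sketch below only generates such a wrapper. write_train_job is a hypothetical name, and the presence of scripts_pl/RunAll.pl inside the experiment directory is an assumption.

```shell
# Write a small job script that runs RunAll.pl from inside the experiment
# directory instead of submitting the Perl file itself.
# write_train_job is a hypothetical helper; scripts_pl/RunAll.pl existing in
# the experiment directory is an assumption, not something this log confirms.
write_train_job() {
    exp_dir="$1"    # experiment directory, e.g. /mnt/main/Exp/0143
    out="$2"        # file to write the job script to
    cat > "$out" <<EOF
#!/bin/sh
cd "$exp_dir" || exit 1
perl scripts_pl/RunAll.pl
EOF
}

job=$(mktemp)
write_train_job /mnt/main/Exp/0143 "$job"
cat "$job"
```

The generated file would then be submitted with qsub in place of RunAll.pl. Whether this clears the error on this cluster is untested.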


 * Concerns:

Time...

Week Ending March 4, 2014

 * Plan:
 * March 1
 * To log in and read logs


 * March 2
 * To log in
 * To make more progress with torque
 * To get past the point of initializing my nodes
 * Read through the 198.html file to narrow down the possible causes
 * March 3
 * To log in
 * To log into caesar...it was down yesterday
 * To pick up where I left off yesterday and make more progress with torque
 * March 4
 * I can't seem to get torque to run a train using RunAll.pl as shown in Tommy's tutorial for torque, so I am going to reduce the chances of error. First I would like to verify that torque is actually working as it should without trying to train. For this reason, I am creating a test script to verify that the nodes are actually processing jobs from the pbs_server.
 * Develop a test script
 * Test script should run on each node via qsub
 * Test script should save to a common file
 * Test script should save information about the node to verify that it worked on that node.


 * Task:
 * March 1
 * Log in
 * March 2
 * Log in
 * Figure out why torque jobs are failing
 * Jobs are failing when run with qrun
 * March 3
 * Figure out why torque jobs are failing
 * March 4
 * Develop a test script to submit a test job

 * Results:
 * March 1
 * Log in was successful!
 * March 2
 * I attempted to log into caesar today, but was unsuccessful.
 * Caesar appears to be down :(
 * I opened a local copy of the 0198.html file I saved on my desktop

Phase 1: Cleaning up directories: logs... qmanager... models... completed
Phase 2: Flat initialize
mk_mdef_gen Log File completed
mk_flat Log File completed
init_gau Log File completed
norm Log File completed
init_gau Log File completed
norm Log File completed
cp_parm Log File completed
cp_parm Log File completed
Phase 3: Forward-Backward

The console usually just freezes. Sometimes it will say "Something failed" ... like that's real helpful.

 * March 3
 * Successfully logged into caesar
 * Found that the torque server was down today and every node was also down. Had to restart each service with pbs_server for caesar and pbs_mom for the nodes.
 * Figured out why Asterix wasn't ever showing up as free.

/var/spool/torque/server_priv/node_status
asterix 1

This notation apparently disables the node from the cluster. I removed the line and asterix is now ONLINE!

 * March 4

Results pending...


 * Concerns:
 * March 1
 * none at this time
 * March 2
 * Caesar appears to be down today.
 * I was unable to access anything on the UNH intranet
 * March 3
 * none
 * March 4

Week Ending March 25, 2014

 * Task:
 * March 22
 * logged in
 * to run my test script on torque.

# This is a sample PBS script. It requests 2 processors on each of 8 nodes
# for 2 minutes and 5 seconds.
#   Request 2 processors on each of 8 nodes
#PBS -l nodes=8:ppn=2
#   Request 2 minutes and 5 seconds of walltime
#PBS -l walltime=0:02:05
#   Request 1 gigabyte of memory per process
#PBS -l pmem=1gb
#   Request that regular output and terminal output go to the same file
#PBS -j oe
#   The following is the body of the script. By default,
#   PBS scripts execute in your home directory, not the
#   directory from which they were submitted. The following
#   line places you in the directory from which the job
#   was submitted.
cd $PBS_O_WORKDIR
#   Now we want to run the program "hello".  "hello" is in
#   the directory that this script is being submitted from,
#   $PBS_O_WORKDIR.
echo " "
echo " "
echo "Job started on `hostname` at `date`"
./hello
echo " "
echo "sleep 30"
echo "Job Ended at `date`"
echo " "


 * March 23
 * log in
 * Find a solution for my results concerning torque's permissions


 * March 24
 * Log in
 * Publish the helpful resources I've been collecting along the way, the ones that have helped me better understand torque.
 * Publish the commands that I use every day with torque.


 * March 25
 * Log in
 * Got sick. Not feeling it.


 * Results:

 * March 22
 * Found a good resource for installing and testing torque.
 * So my result for today is that torque doesn't seem to have access to torque files... or that torque doesn't have write permissions to the user files. I'm hypothesizing based on the error below, found in /var/spool/mail/frs2000

Unable to copy file /var/spool/torque/spool/5.caesar.unh.edu.OU to frs2000@caesar.unh.edu:/mnt/main/home/sp14/frs2000/test.sh.o5
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/5.caesar.unh.edu.OU

 * March 23

OK, here's some helpful info for the next person who can't find their mail when torque says "You have mail".

 * Go to /var/spool/mail and you will find a file named with your user name. The latest errors and info are appended to the end of this file, so you'll have to scroll to the bottom.
 * You can also run a job in interactive mode by adding -I to your qsub request:

qsub -I test.sh

where test.sh is the name of your script.

So today marks the day when a job has been queued and run. The jobs are still failing, but I think I know why. Torque seems to be working when it initializes the connection going forward to the node. But the node doesn't seem to have access back to the server. After more investigation, I found that when I tried to ssh back from a node to the server, I got this.

asterix sp14/frs2000> ssh caesar
ssh-keysign not enabled in /etc/ssh/ssh_config
ssh_keysign: no reply
key_sign failed
Last login: Mon Mar 24 00:39:35 2014 from 75.69.159.2
Have a lot of fun...

Even though it lets me log in using ssh, I don't think torque is making it past this step, judging by the message I found in my mail.

Unable to copy file /var/spool/torque/spool/8.caesar.unh.edu.ER to frs2000@caesar.unh.edu:/mnt/main/torqueTestingGround/test.sh.e8
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in: /var/spool/torque/undelivered/8.caesar.unh.edu.ER
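One avenue that may be worth trying, stated as an assumption and not a verified fix: the failing step is Torque copying job output back to caesar.unh.edu over the network. If /mnt/main is the same shared filesystem on every node, pbs_mom can be told to use a plain local copy for that path through a $usecp line in its config file (mom_priv/config on each compute node), which sidesteps the host key check for those files. The sketch below only shows adding such a line idempotently to a file. add_usecp_rule is a hypothetical helper name and the demo writes to a temporary file, not to the real config.

```shell
# Append a rule to a pbs_mom config file unless that exact line is already there.
# add_usecp_rule is a hypothetical helper name used only for this sketch.
add_usecp_rule() {
    config="$1"
    rule="$2"
    grep -qxF -- "$rule" "$config" 2>/dev/null || printf '%s\n' "$rule" >> "$config"
}

# Demo against a temporary file. The rule assumes /mnt/main is mounted at the
# same path on the server and on every node.
demo=$(mktemp)
add_usecp_rule "$demo" '$usecp *:/mnt/main /mnt/main'
add_usecp_rule "$demo" '$usecp *:/mnt/main /mnt/main'   # second call changes nothing
cat "$demo"
```

On a real node the target would be the pbs_mom config under /var/spool/torque/mom_priv, and pbs_mom would need a restart afterwards to pick up the change.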


 * March 24

More of a resource dump today.


 * Resources
 * http://docs.adaptivecomputing.com/torque/4-1-3/Content/topics/1-installConfig/serverConfig.htm

Commands:


 * Restart SSH

rcsshd restart

 * Create Server

The following commands are used to create a pbs server process to execute jobs.

pbs_server -t create
qmgr -c "set server acl_hosts = caesar"
qmgr -c "set server scheduling=true"
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch started=true"
qmgr -c "set queue batch resources_default.nodes=1"
qmgr -c "set queue batch resources_default.walltime=3600" # Specified in seconds; the default wall-clock time limit for a job.
qmgr -c "set server default_queue=batch"
qmgr -c "set server auto_node_np = True" # Auto-detects the resources of a given node, e.g. the number of processors.
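Walltime shows up in two notations on this page: seconds in the queue default (3600) and HH:MM:SS in the test script (0:02:05). A small converter makes the two easy to compare. This is a sketch only, and walltime_seconds is a hypothetical helper name.

```shell
# Convert a walltime written as SS, MM:SS, or HH:MM:SS into seconds.
# walltime_seconds is a hypothetical helper name used only for this sketch.
walltime_seconds() {
    LC_ALL=C awk -v t="$1" 'BEGIN {
        n = split(t, part, ":")
        secs = 0
        # Each further field is worth 60 times less than the one before it.
        for (i = 1; i <= n; i++) secs = secs * 60 + part[i]
        print secs
    }'
}

walltime_seconds 0:02:05   # the request in the test script
walltime_seconds 1:00:00   # same length as the 3600 second queue default
```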

Still haven't made any progress with the host key verification error. I double-checked hosts.equiv. I thought this could have been the problem since there were nodes entered in it. I read that all of the compute nodes had to be entered in this file as well, but I'm still receiving that same error. I don't understand why I would receive this error, as a user can ssh from one node to another without entering a password. Maybe torque needs its own validation to ssh from one node to another without entering a password.
 * March 25


 * Plan:


 * March 22
 * Run my test script to verify if torque is even working as it should.
 * https://wiki.archlinux.org/index.php/TORQUE


 * March 23
 * Continue to troubleshoot the host key verification error


 * March 24
 * Continue to troubleshoot the host key verification error
 * Progress made with the hosts.equiv file

Research Linux host validation
 * March 25

Week Ending April 1, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 8, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 15, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 22, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 29, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending May 6, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns: