Speech:Spring 2014 Jared Rohrdanz Log



Week Ending February 4th, 2014
 * Task:

Gain a further understanding of the speech corpus and make sure the transcripts match the audio files within the Speech system.


 * Results:

1/30 Reviewed the corpus info. It's a pretty terse and general rundown. It also contains information on how the directories in the system are set up and a few commands that I assume will be helpful later.

Reviewed the Speech Data. One thing I'm noticing is that there is possibly an issue with the length of some of the audio files. The Spring 2013 team mentioned it in their proposal, but I can't find whether they were able to fix it in the report. I'll look through their logs again tomorrow to see if they mentioned this. They also mentioned missing transcripts in their proposal but gave no indication in the report of whether that was resolved.

Successfully logged into Caesar and explored the file system a bit. In order to SSH into Caesar:
 * 1) download a client such as Putty
 * 2) enter caesar.unh.edu as the server you want to connect to
 * 3) enter your user name (your wildcats account)
 * 4) enter your password (same as your user name)
 * 5) once you're in change your password with the "passwd" command

2/3

Read Matt Henninger's log. I viewed the spreadsheet he created that compares the audio data with the written transcripts. I struggled with viewing or downloading the spreadsheet on my Windows machine but eventually discovered WinSCP, which worked nicely.

2 out of the 4940 matches are off in the spreadsheet. We will need to look into this and see whether it was corrected. Following this entry in his log, there isn't much more. The other two members of the '13 Data Team don't have any valuable information.

Tried to gain familiarity with Perl. Made a simple script that would print a message on the command line if executed properly. Unsuccessfully attempted to run some of Matt's earlier scripts.

2/4 I am trying to determine whether the 3 discrepancies in the spreadsheet were resolved. I am still working largely off of Matt Henninger's log.

I checked the text file that he made against the actual transcripts and they are identical. I'm not sure what the purpose of his text file is/was.

I had a lot of trouble getting his scripts to work in order to check the lengths and see if the issues were resolved and also struggled in writing my own. I am unfamiliar with Perl. Hopefully I can get something working tomorrow before the meeting.
 * Plan:


 * Concerns:

Week Ending February 11, 2014
 * Task:

Verify the audio, dictionary, and transcripts are complete and up to date, learn how to run trains/experiments, and gain familiarity with genTrans.pl.

 * Results:

2/6
 * 1) Reached out to team members about how to split the tasks.
 * 2) Reviewed logs to learn more about genTrans.pl and running trains and experiments.

2/9 Today I wanted to check that the transcripts currently in the system are the most up-to-date version. I started by researching how to view a file's creation time on a Unix system. http://www.unixtutorial.org/2008/04/atime-ctime-mtime-in-unix-filesystems/ is a good tutorial on checking file times in Unix. You can check access time (atime), change time (ctime), or modify time (mtime). Note that ctime is the time of the last inode change, not a creation time. The full transcripts are in the directory /mnt/main/corpus/switchboard/full/train/trans/

 idefix train/trans> ls -lc
 total 25596
 -rw-r--r-- 1 root root 26203468 2013-02-16 06:21 full_transcript.text
 lrwxrwxrwx 1 root root       66 2013-08-05 18:28 train.trans -> /mnt/main/corpus/switchboard/full/train/trans/full_transcript.text

There are two files in the directory, train.trans and full_transcript.text. train.trans is a symbolic link to full_transcript.text, so diff -s reports that the files are identical. According to the listing, train.trans was last changed in August 2013.
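The same three timestamps can also be read from Perl, which may be handy inside a checking script. This is only a minimal sketch and was not run on Caesar; the helper name file_times and the throwaway demo file are made up for illustration. It relies on the core File::stat, File::Temp, and POSIX modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::stat;               # core module: object interface to stat()
use File::Temp qw(tempfile);  # core module: used only for the demo file
use POSIX qw(strftime);

# Return the three Unix timestamps of a path as formatted strings.
# atime = last access, mtime = last content modification,
# ctime = last inode change (owner, permissions, links), NOT creation time.
# stat() follows symbolic links; lstat() would inspect the link itself.
sub file_times {
    my ($path) = @_;
    my $st = stat($path) or return;   # undef if the path cannot be read
    my %t;
    for my $kind (qw(atime mtime ctime)) {
        $t{$kind} = strftime("%Y-%m-%d %H:%M", localtime($st->$kind));
    }
    return \%t;
}

# Demo on a throwaway file (a hypothetical stand-in for train.trans).
my ($fh, $demo) = tempfile(UNLINK => 1);
print {$fh} "sw2005A-ms98-a-0001 0.0 1.5 hello\n";
close($fh);

my $times = file_times($demo);
print "$_: $times->{$_}\n" for qw(atime mtime ctime);
```

Because stat() follows links, calling file_times on train.trans would report the times of full_transcript.text rather than of the link.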

The Data Notes section of the Wiki states that the transcription files were released on 10/19/02. There is a newer version, released 1/29/03, on the Switchboard page. The changes that were made include:

Several lexicon items were fixed in the 10/19/02 release, and about 45 start/stop times that had negative durations (stop time preceded the start time) were repaired. We are no longer actively developing this resource, but continue to include bug fixes. Included in this release are the final transcriptions for the entire database, the complete lexicon, and automatic word alignments.

2/7

I was thinking about what I did yesterday and the directory I was getting the transcripts from didn't seem right. I looked back to some of the logs from last semester and found documentation on a transcript creation script, here http://foss.unh.edu/projects/index.php/Speech:Summer_2012_Create_Experiment. The page also reveals that there is a master transcript kept here:

/mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/ms98_icsi_word.text

 idefix trans/icsi> ls -l ms98_icsi_word.text
 -rwxrwxrw- 1 root cis790 750812 2001-03-21 12:12 ms98_icsi_word.text

The master transcript file was last modified in March 2001. This is interesting because the wiki states that the 10/19/02 release is used.

 * Plan:

Meet with the team and assign tasks outlined in the Data Group page.

 * Concerns:

Dealing with genTrans.pl. I am not very familiar with Perl or running experiments yet.

Week Ending February 18, 2014
 * Task:

Resolve transcript version issue. Find total transcript and audio times. Ensure that eval and dev are separate corpus samples. Research .sph files and conversion to .wav.

 * Results:

2/12

Uploaded the newest version of the transcripts to a test directory in order to experiment with it. Read logs of past semesters to find out how transcripts were installed.

2/15 Logged in and read peers' logs for updates.

After looking more into which version of the transcripts is being used, I discovered that I was mistaken in my analysis. The transcript corpus is using the latest released version.

My confusion came from the fact that the wiki gave the release date of the latest dictionary as the transcripts' date.

http://www.isip.piconepress.com/projects/switchboard/

Using ls -l on the transcripts and word alignments yields:

The latest version of the transcripts is 3/21/01 and is in use in the Speech System.

The latest version of the dictionary is 1/29/03 and is in use in the Speech System.

The dictionary and transcripts are all up to date. I edited the information page to reflect the actual version of the transcripts. http://foss.unh.edu/projects/index.php/Speech:Data#Transcription_Corpus_Data

2/17

Today I'm going to try and verify the actual length of the transcripts and audio.

I downloaded Matt Henniger's spreadsheet of the transcripts located here: /mnt/main/corpus/Transcrpt_Spreadsheet/

He reports a total transcript time of 259:08:48 and a total audio time of 255:37:03.

He also reports that three of the matches are off for a total of .05 seconds but I think that may be due to rounding in his scripts.

Looking back at the scripts he definitely seemed to be rounding the numbers off. In totaltime.pl he has this:

 # print rounded numbers
 printf "Total time in Seconds: %.2f\n",$seconds;
 printf "Total time in Minutes: %.2f\n",$minutes;
 printf "Total time in Hours: %.2f\n",$hours;
 printf "Total time in Days: %.2f\n",$days;

%.2f\n is a printf format string rather than a Unix command: %.2f prints the number rounded to two decimal places and \n ends the line (I couldn't find any official documentation but it's demonstrated here). I think this could be the cause of the discrepancy, and considering it's only .05 seconds it's not worth worrying about.
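To see how rounding alone can shift a total by a few hundredths of a second, here is a small sketch. The durations are invented for illustration and are not taken from Matt's spreadsheet. It compares rounding once at the end with rounding every value before adding.

```perl
use strict;
use warnings;

# Hypothetical utterance durations in seconds (invented for illustration).
my @durations = (1.234, 2.344, 0.114, 3.004, 4.444);

# Sum the exact values and round only once, when printing.
my $exact = 0;
$exact += $_ for @durations;

# Round every value to two decimals first, then sum the rounded values.
my $rounded = 0;
$rounded += sprintf("%.2f", $_) for @durations;

printf "Rounded once at the end: %.2f\n", $exact;
printf "Rounded per value:       %.2f\n", $rounded;
printf "Difference:              %.2f\n", abs($exact - $rounded);
```

With only five values the two totals already differ by a couple of hundredths, so a .05 second difference across thousands of entries is plausible from rounding alone.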

I wrote a simple Perl script called count.pl. For now it just takes a file as an argument and prints a message confirming it. It's located in /mnt/main/home/sp14/rohrdanz/bin/scripts/ if you're really that interested. It is just a first step for me, never having worked with Perl before. I want it to eventually parse any transcript, grab the start and stop time for each utterance, and add them for a total time. It should be much simpler and cleaner than the older scripts and able to work on incomplete trains.

2/18 Logged in to review logs.

Communicated with team members to assign tasks for the proposal.

Updated the Data Group Page to reflect new objectives.


 * Plan:


 * Concerns:

Week Ending February 25, 2014
 * Task:

Create a script to determine total time of transcript sets

 * Results:

2/21 Reviewed logs. Continued to learn Perl. Created a simple script to count the lines in a file. Next I will change it to grab the start and end time, get the difference, and add the results for all lines for a total time.

2/23 Continued working on a script that will count the total time in a set of transcripts. I ended up using a bit of code that Matt Henniger wrote last year to get the start and stop times using a function that splits the rows by where the spaces are.

 #!/usr/bin/perl
 use strict;
 use warnings;

 my $file = $ARGV[0];   # Perl stores all command-line arguments in the array @ARGV
 my $seconds = 0;
 my $lines = 0;

 open(my $fh, '<', $file) or die $!;   # open the file with read-only access
 while (my $row = <$fh>) {             # the loop that Matt Henniger wrote last year
     chomp $row;
     my @splitrow = split(/ /, $row);
     $seconds = $seconds + ($splitrow[2] - $splitrow[1]);
     $lines++;
 }
 print " $seconds ";          # print total seconds
 print "$lines total lines";  # print number of lines
 close($fh);

I tried my new script on the master transcripts (/mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/ms98_icsi_word.text) and it returned 36116.7 seconds of audio over 5397 lines. I think there is an error in my script because that is only about ten hours of audio, or under seven seconds for each line.

In order to find the error I tried the script on the tiny train transcript set (/mnt/main/corpus/switchboard/tiny/train/trans/train.trans).

This time it returned 598 seconds of audio over 83 total lines.

I used the command 'wc -l train.trans' to count the lines of the text file from command line and 83 was returned.

Side note - In manually checking the times for the tiny train I noticed duplicate entries.

I need to change my line counter to count transcription entries. Most of them span several lines so it's not very accurate. There are really only 10 entries in the tiny train.

Manually counting the times I get about 94 seconds of spoken transcripts. While my script is functional, it does not report any accurate information yet. This needs to be addressed next.

2/24 Logged in to read peer's logs.

2/25

I was incorrect in stating that there were only ten entries in tiny train. I don't know where I got that from. There are 83, so my script was reporting the number of entries correctly after all. The Unix command 'wc -l' counts newline characters, so even though a block of text sometimes wraps and appears to take up more lines, the file only has as many lines as it has newlines.

Originally I thought that there was an error with my script as it returned a really short time for the master transcripts. In testing it on the other sets of data I think the master may be incomplete.

The results of testing count.pl on a few transcript sets:

The master transcripts appear to be missing quite a bit from my results.

The script I made is located in /mnt/main/home/sp14/rohrdanz/bin/scripts/count.pl.


 * Plan:


 * Concerns:

Week Ending March 4, 2014
 * Task:

Create a narrative of the history and standing of the data corpus.

 * Results:

2/26 I checked some of the transcript sets for repeating data. There are no duplicate lines in Full Train, Tiny Train, or dist. The lines I noticed in Tiny Train transcripts are the same entry, but differ in punctuation (one has "" and one doesn't).

 sw2005A-ms98-a-0052 390.784625 401.230250 "we we don't we we we choose not to deal with the" extended family because we feel it's kind of cumbersome when [noise] in reality it makes things much much easier
 sw2005A-ms98-a-0052 390.784625 401.230250 we we don't we we we choose not to deal with the extended "family because we feel it's kind of cumbersome" when [noise] in reality it makes things much much easier

I talked with Colby and the transcript set in the dist/ directory is not the master. He told me that when the subsets are created the transcripts from the Full Train directory are used. I'm not even sure that there is a master set on Caesar at this point.

In reading David Meehan's log, he solved the issue of the transcription sets getting weird times. It is due to overlap of the two channels. His script corpusSize.pl fixes this issue. These are the results that he got:

 full: 256.4 Hours
 308hr: 256.4 Hours
 100hr: 76.7 Hours
 10hr: 10.0 Hours
 first_5hr: 5.0 Hours
 last_5hr: 4.9 Hours
 mini: 11.6 Hours
 tiny: 1.2 Hours
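I have not looked at how corpusSize.pl handles the overlap, so the following is only one possible approach, not David's method. It assumes the entry format shown earlier in this log (an ID such as sw2005A-ms98-a-0052, then start and stop times) and assumes that "overlap" means the A and B channels of one conversation talking at the same time. The sub names and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Total seconds covered by a set of [start, stop] intervals,
# counting any overlapping stretch only once.
sub covered_seconds {
    my @iv = sort { $a->[0] <=> $b->[0] } @_;
    my ($total, $cur_start, $cur_stop) = (0);
    for my $i (@iv) {
        my ($start, $stop) = @$i;
        if (!defined $cur_start) {
            ($cur_start, $cur_stop) = ($start, $stop);
        } elsif ($start <= $cur_stop) {
            $cur_stop = $stop if $stop > $cur_stop;   # extend the current run
        } else {
            $total += $cur_stop - $cur_start;         # close the run, start a new one
            ($cur_start, $cur_stop) = ($start, $stop);
        }
    }
    $total += $cur_stop - $cur_start if defined $cur_start;
    return $total;
}

# Group transcript lines by conversation number (sw2005A and sw2005B -> 2005)
# and sum the covered time of each conversation.
sub transcript_seconds {
    my %by_conv;
    for my $line (@_) {
        my ($id, $start, $stop) = split ' ', $line;
        next unless defined $stop && $id =~ /^sw(\d+)[AB]/;
        push @{ $by_conv{$1} }, [$start, $stop];
    }
    my $total = 0;
    $total += covered_seconds(@$_) for values %by_conv;
    return $total;
}

# Invented sample: channel B overlaps channel A between 3.0 and 5.0 seconds.
my @lines = (
    "sw2005A-ms98-a-0001 0.0 5.0 hello there",
    "sw2005B-ms98-a-0001 3.0 8.0 hi",
    "sw2005A-ms98-a-0002 10.0 12.0 bye",
);
printf "%.1f seconds\n", transcript_seconds(@lines);
```

A plain sum of stop minus start over these three lines gives 12 seconds, while the merged intervals cover 10, which is the kind of inflation the overlap causes.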

3/2 Logged in to read logs. Communicated with group to discuss tasks for the week. Updated Data Group page to reflect tasks.

3/3

I'm going to document the corpus from its purpose and history to its standing in Caesar today. I am going to collect my data and thoughts here in my log before I update any Information pages.

ftp://jaguar.ncsl.nist.gov/swbd/revswbd_manual.txt

SWITCHBOARD is a corpus of spontaneous conversations which addresses the growing need for large multispeaker databases of telephone bandwidth speech. Collected at Texas Instruments with funding by DARPA, the complete set of CD-ROMs includes about 2430 conversations averaging 6 minutes in length; in other terms, over 240 hours of recorded speech, and about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English.

LDC released a newer version of the corpus in 1997. The new version contained error corrections to the data and updated the header of the sphere files to reflect the new release. http://www.elsnet.org/list/sep97/4.01Sep97.html

In checking our files on Caesar, our sphere files have the release 2.0 in the header.

The original issue of SWITCHBOARD in early 1993 lacked about 150 conversations which were intended for publication but omitted by error. They were published in May 1994 and distributed to all previous recipients of SWITCHBOARD. The Switchboard Corpus was collected at Texas Instruments and produced on CD-ROM at the National Institute of Standards and Technology. It is distributed in a notebook-style binder with 28 CD-ROMs, (27 containing speech data, and one containing all transcription data). Preparation of the data for CD-ROM production was done by NIST. The waveform files use the NIST SPHERE format.

http://ccl.pku.edu.cn/douBTfire/CorpusLinguistics/LDC_Corpus/available_corpus_from_ldc.html#switchb

3/4 There seems to be a discrepancy on how many discs the corpus contains from the different sources online. The one I referenced above says 28 CD-ROMs, the National Institute of Standards and Technology claims it contains 25 ftp://jaguar.ncsl.nist.gov/swbd/revswbd_manual.txt, and the current version being offered by LDC is on 4 DVDs http://catalog.ldc.upenn.edu/LDC97S62.

On Caesar in the /mnt/main/corpus/dist/Switchboard directory we have directories disk1-disk23.

Each disk directory contains a readme that is full of valuable information.


 * the version of our corpus is release 2 released August, 1997
 * we should (and do) have 23 discs total
 * all known errors predating August 1997 are resolved in this release
 * each directory contains a subdirectory containing
 * list of speech files on the particular disc
 * list of all speech files in the corpus

Now I am trying to determine a way to measure the total length of the audio we have on Caesar.


 * I copied (cp) file, sw02001.sph from the disk1 directory to my own
 * I converted the file to .wav http://sox.sourceforge.net/sox.html
 * In looking for a way to measure wave files, I discovered that soxi can measure sphere files directly. There is no need to convert. http://sox.sourceforge.net/soxi.html

 idefix sp14/rohrdanz> soxi -D sw02001.wav
 252.298375
 idefix sp14/rohrdanz> soxi -D sw02001.sph
 252.298375

As a side note that I found while looking back at the history of the corpus setup...

According to the Corpus setup section of the wiki http://foss.unh.edu/projects/index.php/Speech:Summer_2012_Create_Experiment, /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/ms98_icsi_word.text is the location of the master transcripts. This is what I initially thought. The master transcripts are incomplete.


 * Plan:


 * Concerns:

Week Ending March 18, 2014

 * Task:

 * Results:

3/15 Logged in to read logs and catch up after Spring Break.

3/17 Read up on how to conduct experiments. I am going to duplicate exp 0161 since it ran and details are filled out.

 * created exp 0215 in /mnt/main/Exp
 * created the exp page on the wiki to document
 * followed instructions on the wiki
 * ran this to set up my directory: /mnt/main/root/tools/SphinxTrain-1.0/scripts_pl/setup_SphinxTrain.pl -task 
 * edited the config file according to the tutorial page
 * ran gentrans6.pl on the first_5hr corpus
 * loaded the dictionary with pruneDictionary2.pl
 * generated the phone list with genPhones.pl
 * had trouble with make_feats.pl... worked around it by copying the feats directory from exp 0161
 * errored out on RunAll.pl:

 /mnt/main/scripts/train/scripts_pl/RunAll.pl
 Configuration (e.g. etc/sphinx_train.cfg) not defined
 Compilation failed in require at /mnt/main/scripts/train/scripts_pl/RunAll.pl line 48.
 BEGIN failed--compilation aborted at /mnt/main/scripts/train/scripts_pl/RunAll.pl line 48.

3/18 I'm going to get the full and complete master transcripts. I downloaded the 3/21/01 version from http://www.isip.piconepress.com/projects/switchboard/.

I ran my count.pl script on the new data set and it matches the current master.

 perl /mnt/main/home/sp14/rohrdanz/bin/scripts/count.pl ms98_icsi_phone.text
 5397 transcript entries.
 Seconds: 13561.0030266
 Hours: 3.76694528516666

I'm stumped. 75% of the links relating to the transcripts are not good any more and the project seems to be abandoned.

I wrote a script called audio.pl. It gathers the total time of an audio disk, taking the disk as an argument (ex. "perl audio.pl disk17"). The script is a little heavy handed and I think it could be done a lot better but I'm still a Perl novice. Specifically I think the read directory structure I used is a bit unnecessary and it could have been handled better by reading the files directly, but it's functional the way it is. The disks will likely never move so it's not a big deal anyway.

 #!/usr/bin/perl
 use strict;
 use warnings;

 my $disk = $ARGV[0];   # take the disk as an argument
 my $dir = "/mnt/main/corpus/dist/Switchboard/$disk/swb1/";
 opendir(DIR, $dir) or die "can't opendir $dir: $!";

 my $total = 0;
 my $seconds = 0;

 my $sox = "soxi -D ";   # the unix command that returns the length of audio files
 while (defined(my $filename = readdir(DIR))) {
     next if $filename =~ /^\./;      # skips the two hidden directories (., ..)
     my $cmd = "$sox$dir$filename";   # the unix command to run
     # ex soxi -D /mnt/main/corpus/dist/Switchboard/$disk/swb1/swb021021.sph
     $seconds = `$cmd`;               # grabs the time using backticks
     chomp $seconds;                  # drop the trailing newline from the output
     $total = ($total + $seconds);    # adds to total time
 }
 print "$total seconds of audio on $disk\n";

 closedir(DIR);

My next step will be to configure the script to run on all disks and give a grand total. I'm thinking this may be easier to accomplish on the command line rather than within the Perl script though.


 * Plan:


 * Concerns:

Week Ending March 25, 2014

 * Task:


 * Results:

The results of my script on each disk:

Note that disk4 and disk8 didn't work initially because the directory the audio files are in is capitalized in those two cases. I just altered my script for those two.

Total: 931728 seconds = 15528.8 minutes = 258.81 hours
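For a future rerun, the per-disk loop and the capitalized directory on disk4 and disk8 could be handled in one pass. This is a hedged sketch rather than the script I actually ran: the sub names are made up, it assumes the disks sit side by side as disk1, disk2, and so on under one root, and it assumes the audio directory is named swb1 in some capitalization. The duration lookup is passed in as a code ref so the logic can be tried without soxi.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# List the non-hidden entries of a directory, sorted by name.
sub entries {
    my ($dir) = @_;
    opendir(my $dh, $dir) or return ();
    my @names = sort grep { !/^\./ } readdir($dh);
    closedir($dh);
    return @names;
}

# Locate the audio subdirectory of a disk whatever its capitalization.
sub audio_dir {
    my ($disk_dir) = @_;
    my ($match) = grep { lc($_) eq 'swb1' } entries($disk_dir);
    return defined $match ? "$disk_dir/$match" : undef;
}

# Add up the length of every .sph file on every disk under $root.
# $duration_of is a code ref, so the soxi call can be replaced in a dry run.
sub grand_total {
    my ($root, $duration_of) = @_;
    my $total = 0;
    for my $disk (grep { /^disk\d+$/ } entries($root)) {
        my $dir = audio_dir("$root/$disk") or next;
        for my $file (grep { /\.sph$/i } entries($dir)) {
            $total += $duration_of->("$dir/$file");
        }
    }
    return $total;
}

# Real measurement: shell out to soxi -D (assumes soxi is on the PATH).
my $soxi = sub {
    my ($path) = @_;
    my $seconds = `soxi -D "$path"` // '';
    chomp $seconds;
    return $seconds || 0;
};

# Dry run on a throwaway tree with a fake 10-second duration per file.
my $root = tempdir(CLEANUP => 1);
make_path("$root/disk1/swb1", "$root/disk4/SWB1");
for my $f ("$root/disk1/swb1/sw02001.sph", "$root/disk4/SWB1/sw02100.sph") {
    open(my $fh, '>', $f) or die "can't create $f: $!";
    close($fh);
}
my $seconds = grand_total($root, sub { 10 });
printf "%d seconds = %.2f hours\n", $seconds, $seconds / 3600;
```

On Caesar the real call would look like grand_total("/mnt/main/corpus/dist/Switchboard", $soxi). The same division by 3600 turns the 931728 seconds above into 258.81 hours.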

3/21 Logged in to read logs.

3/23

The final task I was assigned in the proposal was to find a way to verify that the audio and transcripts match in each train. I think the best way to accomplish this will be to write a script that:


 * takes the desired train as an argument
 * parses the /wav directory to create a list of each segment (sw02387.sph, etc)
 * parses the transcript for the corresponding header
 * check that each audio file is represented in the transcripts

This should function as a quick and dirty way to check that each audio file in the train has a transcript file for it.

Depending on how quickly I can get this working I may also implement a feature to check the time of each transcript against its audio to make sure that they match.

3/25

I got the initial version of my new script running. Here is the code:

 #!/usr/bin/perl
 use strict;
 use warnings;

 my $train = $ARGV[0];
 my $dir = "/mnt/main/corpus/switchboard/$train/train/wav";
 my @wav;
 my @trans;

 opendir(WAV, $dir) or die "can't open $dir: $!";

 while (defined(my $filename = readdir(WAV))) {
     next if $filename =~ /^\./;
     $filename = substr($filename,3,4);   # four digit ID from the audio file name
     push @wav, $filename;
 }
 closedir(WAV);

 open(my $trans_fh, '<', "/mnt/main/corpus/switchboard/$train/train/trans/train.trans") or die $!;

 while (my $row = <$trans_fh>) {
     my $trans_num = substr($row,2,4);    # four digit ID from the transcript entry
     push @trans, $trans_num;
 }
 close($trans_fh);

 # prune the transcript IDs down to unique values
 my %seen = ();
 my @uniq_trans;
 my $item;
 foreach $item (@trans) {
     push(@uniq_trans, $item) unless $seen{$item}++;
 }

 @uniq_trans = sort { $a <=> $b } @uniq_trans;
 @wav = sort { $a <=> $b } @wav;

 # both lists are sorted, so joining each into a string and comparing is enough
 if ("@uniq_trans" eq "@wav") {
     print "The audio files and transcripts match.\n";
 } else {
     print "The audio files don't match the transcripts.\n";
 }
 #print "transcripts from a .sph file... \n @uniq_trans \n";
 #print ".sph files... \n @wav \n";

The script works by creating an array of all four digit identification numbers from the file names (2012 from sw2012A-etc-etc). Then another array is created from the transcripts grabbing the ID. This array is then pruned for unique values as multiple transcripts match each audio file. Next, the two arrays are compared against each other giving a true or false as to whether the files match each other.
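A true/false answer does not say which files are the problem. One possible extension, sketched here with made-up IDs and a hypothetical sub name compare_ids, is to keep both ID lists in hashes and report what each side is missing.

```perl
use strict;
use warnings;

# Given two lists of conversation IDs, report what each side is missing.
# Returns (\@audio_without_transcript, \@transcript_without_audio).
sub compare_ids {
    my ($wav_ids, $trans_ids) = @_;
    my %in_wav   = map { $_ => 1 } @$wav_ids;
    my %in_trans = map { $_ => 1 } @$trans_ids;   # also removes duplicates
    my @no_trans = sort grep { !$in_trans{$_} } keys %in_wav;
    my @no_wav   = sort grep { !$in_wav{$_} }   keys %in_trans;
    return (\@no_trans, \@no_wav);
}

# Invented IDs for illustration only.
my @wav   = qw(2001 2005 2387);
my @trans = qw(2001 2001 2005 2005 2005 3000);

my ($no_trans, $no_wav) = compare_ids(\@wav, \@trans);
print "Audio without a transcript: @$no_trans\n";
print "Transcript without audio:   @$no_wav\n";
```

The hashes remove the duplicate transcript IDs as a side effect, so the separate pruning step would no longer be needed.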

I plan on changing the script to take the audio directory and transcript location as arguments since I think this will make it much more flexible.

I will also create a table showing the results for each train.
 * Plan:


 * Concerns:

Week Ending April 1, 2014

 * Task:

3/26

Full audio time for the 6 disks: 258.81 hours

Full transcript time (- overlap) for 308hr subset: 256.4 hours

2.4 hours are missing from the transcripts.

I downloaded the latest version from http://www.isip.piconepress.com/projects/switchboard/ again and unpacked it in my own directory just to test. I must have been unpacking it incorrectly last time because the transcripts all seem to be here this time.

The setup of the new file is a little strange so I need to look back at how previous semesters organized the data.

3/28

Logged in to read logs.

After reading the logs from previous semesters I can't find out what happened to the master transcripts. I moved the updated version I downloaded to /mnt/main/corpus/dist/Switchboard/master_trans/.

Made some changes to the Data Notes page http://foss.unh.edu/projects/index.php/Speech:Data#Transcription_Corpus_Data.

Added my scripts to the script page.


 * Results:


 * Plan:


 * Concerns:

Week Ending April 8, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 15, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 22, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 29, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending May 6, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns: