Speech:Data


Audio Corpus Data
The audio comes from a set of corpus DVDs containing 2438 audio files, which amount to 259 hours of audio. The files are two-channel, 8 kHz SPHERE (.sph) files. They vary in size and duration because each file represents one phone conversation between two people over a telephone line.

The audio data is stored in 23 directories on Caesar (the release used 23 CDs). The audio data can be found here: /mnt/main/corpus/dist/Switchboard/

LDC released a newer version of the corpus in 1997. The new version contained error corrections to the data and updated the header of the SPHERE files to reflect the new release. Speech is currently using the second release.

Source: http://www.elsnet.org/list/sep97/4.01Sep97.html

Transcription Corpus Data
The transcription text files that we have represent the latest version of the Telephone Speech Corpus. This is the latest manually corrected release (3/21/01) and can be downloaded here: transcript. Once this file is extracted, the data is organized similarly to the audio, with folders containing subfolders. Each subfolder contains 4 files: a transcript file for each of channels A and B, and a word file for each of channels A and B. Transcript files are organized by utterance, and word files are organized by word. For the purposes of capstone, the word files are not used.

It is unclear how the data was organized on Caesar prior to Spring 2014. As of Spring 2014, a copy of the 3/21/01 release is located here: /mnt/main/corpus/dist/Switchboard/master_trans.

The transcription covers around 518 hours, which should be twice the amount found in the audio files, because the transcript is split into A and B channels. A single channel's transcript is 259 hours long, but the audio data is 255 hours long. This is because the audio files and the transcript files come from different sources, so for the purposes of capstone the transcript files listed below have to be removed.

Transcript File Numbers to Remove for Capstone
2289 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 4361 4379
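The removal list above could be applied with a small filter like the following. This is only a sketch: it assumes that each file number in the list is the four-digit conversation number that follows "sw" in names such as sw4927B, and the file names passed in are hypothetical examples rather than confirmed names from the release.

```python
import re

# Conversation numbers from the removal list above.
EXCLUDED = {2289, *range(2716, 2736), 4361, 4379}

# Assumption: names embed the conversation number as "sw" + 4 digits,
# as in the utterance ID sw4927B-ms98-a-0007.
_CONV_RE = re.compile(r"sw(\d{4})")


def conversation_number(name):
    """Return the conversation number found in a name, or None if absent."""
    match = _CONV_RE.search(name)
    return int(match.group(1)) if match else None


def keep_for_capstone(name):
    """True unless the name carries a conversation number on the removal list.

    Names without a recognizable number are kept, not dropped.
    """
    return conversation_number(name) not in EXCLUDED


# Hypothetical file names, used only to illustrate the filter.
names = ["sw4927B-example.text", "sw2716A-example.text"]
kept = [n for n in names if keep_for_capstone(n)]
print(kept)
```

Keeping the list as a set of integers makes the membership check independent of channel letter, so both the A and B transcripts of an excluded conversation are dropped together.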

For a more in-depth breakdown of speech data compared to audio data, an Excel spreadsheet can be found on Caesar in /mnt/main/corpus/Transcript_Spreadsheet/

Transcript Files
One transcript file represents half of a spoken conversation. Each line in a transcript file can represent one utterance from a speaker, or a specified amount of time the speaker is silent. An utterance can be a line of dialog or a noise that the speaker is making or both.

Example of dialog utterance:
    sw4927B-ms98-a-0007 50.531500 53.172375 the idea itself of service is good

Example of noise utterance:
    sw4927A-ms98-a-0070 297.363875 297.858000 [vocalized-noise]

Example of silence:
    sw4927B-ms98-a-0026 190.269875 192.835625 [silence]

Each line is marked with a header containing the file name, corpus, and line number. The next two items in the line are the start and stop times of the utterance in seconds. The rest of the line is the transcript for the utterance.

Ex  sw4927B-ms98-a-0007 50.531500 53.172375 the idea itself of service is good
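The line layout described above can be read with a few lines of Python. This is a minimal sketch, not the project's own tooling: the names Utterance and parse_transcript_line are made up for illustration, and the third hyphen-separated field of the header ("a" in the example) is not described on this page, so it is left uninterpreted.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    file_name: str   # e.g. "sw4927B" (conversation plus channel)
    corpus: str      # e.g. "ms98"
    line_number: int # e.g. 7
    start: float     # start time in seconds
    stop: float      # stop time in seconds
    text: str        # transcript for the utterance


def parse_transcript_line(line):
    """Split one transcript line into header fields, times, and text."""
    header, start, stop, *words = line.split()
    parts = header.split("-")
    # parts[2] is present in the header but not explained here, so it is skipped.
    return Utterance(
        file_name=parts[0],
        corpus=parts[1],
        line_number=int(parts[-1]),
        start=float(start),
        stop=float(stop),
        text=" ".join(words),
    )


utt = parse_transcript_line(
    "sw4927B-ms98-a-0007 50.531500 53.172375 the idea itself of service is good"
)
print(utt.file_name, utt.line_number, utt.stop - utt.start, utt.text)
```

Because silence and noise lines use the same layout, the same function also handles lines whose text is [silence] or [vocalized-noise].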

The transcript contains notation for unintelligible or unspoken dialog. These notations are usually contained between brackets, but they can also mark partial words.

Transcript Notation      Spoken Audio
-[ha]ppy                 ppy
-[p]oppin[g]-            oppin
[laughter-bongs]         bongs said while laughing
[compooter/computer]     compooter
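The four notation patterns in the table can be reduced to their spoken form with a short function. This is a sketch that covers only the cases shown above; the function name spoken_form is hypothetical, and leaving other fully bracketed tokens such as [silence] or [vocalized-noise] unchanged is an assumption made here, not a rule taken from the corpus documentation.

```python
import re


def spoken_form(token):
    """Return what was actually spoken for one transcript token."""
    # [laughter-word]: the word was said while laughing.
    match = re.fullmatch(r"\[laughter-(.+)\]", token)
    if match:
        return match.group(1)
    # [spoken/intended]: keep the form that was actually spoken.
    match = re.fullmatch(r"\[([^/\]]+)/([^/\]]+)\]", token)
    if match:
        return match.group(1)
    # Partial word: bracketed letters were not spoken, hyphens mark the cut.
    is_whole_bracket = re.fullmatch(r"\[[^\]]*\]", token) is not None
    if "[" in token and not is_whole_bracket:
        return re.sub(r"\[[^\]]*\]", "", token).strip("-")
    # Plain words and other bracketed markers are returned unchanged.
    return token


for tok in ["-[ha]ppy", "-[p]oppin[g]-", "[laughter-bongs]", "[compooter/computer]"]:
    print(tok, "->", spoken_form(tok))
```

The order of the checks matters: the two fully bracketed patterns are tested first so that the partial-word rule only fires on tokens that have letters outside the brackets.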

The transcript has notation for words made up by the speaker. These are represented by a word surrounded by {}. Ex  {chowser}

Other notations found in the transcripts include: Ex  _1 i-