SUBJECT: Validation German SpeechDat(M) corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 2.0
DATE   : 21 June 1996

The speech databases made within the SpeechDat(M) project were validated by SPEX, Leidschendam, the Netherlands, to assess their compliance with the SpeechDat(M) format and content specifications, as documented in Deliverable 1.4.1 of the project. The validation results of the German SpeechDat(M) database are contained in this document.

In the validation procedure we systematically check a list of validation criteria for a range of subjects. In the following sections we will evaluate these criteria one by one for the German data base offered by IPSK at Munich University. Validation results that call for attention are marked by =>.

The following subjects were validated:
 1 DOCUMENTATION
 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
 3 ITEMS
 4 SAMPLED DATA FILES
 5 LABEL FILES
 6 LEXICON
 7 SPEAKERS
 8 RECORDING PLATFORM
 9 TRANSCRIPTION
The document is concluded by
10 SUMMARY

The validation center modified a few label items that deviated from the SpeechDat format and could easily be changed:
- The mnemonics LBO and LBR were mixed up in the original label files. We swapped them.
- The format of the RED value was wrong. We changed it.
- The version number of the GZIP software was put in the mnemonic CMP.
Due to these modifications the actual files are therefore slightly different from the information in the documentation files.

======================================================================
1. DOCUMENTATION

The following documentation files were supplied on the CD-ROM
    FINREPRT.PS INSTRUCT.PS SCRIPT.PS HANDBOOK.PS INSTRUCT.TXT SUMMARY.TXT
The documentation validated in this section was in file FINREPRT.PS.
- Language of doc file: preferably English
    OK
- Contact person: name, address, affiliation
    OK
- Number of CDs / Tapes
    OK, this information was provided in the README.TXT file
- Contents of each CD / tape
    OK, this information was provided in the README.TXT file
- The directory structure of the CDs / tapes
    OK, directory structure is specified.
- Speaker demographics
  . which regions, how many of each
  . motivation for selection of regions
  . which age groups, how many of each
  . sexes: males, females, also children?; how many of each.
    This information was provided. There is no explicit information on how many children (below 15 yrs) have called.
- Reference to a file where speaker characteristics are stored (SPEAKER.TBL)
    OK
- The number of items on the CD and per speaker
    Not in the documentation file FINREPRT.PS as such, but can be derived from SUMMARY.TXT
- Naming conventions for directories and files
    OK, information is provided in the README.TXT file.
- Prompting
  . linguistic specification (and motivation) for the prompting material
  . connection of sheet items to item numbers on CD / tape
  . sheet example
  . items must be spread over the sheet to prevent list effects (e.g. three yes/no questions right after another are not allowed)
    This information was provided in the documentation.
=> From the sheet scheme in \FIXED0DE\PROMPT\SHEET.PS it can be derived that the items were not well spread over the sheet. They were grouped by item type.
- analysis of frequency of occurrence of the sub-word units represented in the phonetically rich sentences (either of phones, biphones, triphones)
  . recommended: at least 2 samples of each phone per caller (should appear from documentation)
    OK, the documentation states that each phoneme appears at least twice in the sentences produced by each speaker.
- Transcription manual
  . is it there?
    It is there as HANDBOOK.PS in directory \FIXED0DE\DOC
  . does it contain the relevant information?
  . What is done with non-speech events
=> The types are mentioned but there is no account of when these markers were used.
  . What is done with capitals
=> It is not stated when these were used. This is especially relevant for German and therefore needs clarification. (What was done with pronouns such as "Sie, Ihr, Dein"; what with so-called "verbalsubstantive" (like "das Laufen", "die Sonstigen")?)
  . How are non-speaker sounds dealt with.
=> There is a reference to the EAGLES documentation, which has been included in HANDBOOK.PS as an appendix, but not all categories were used.
  . Only one spelling of each word is allowed
    Therefore a list of normalised spellings for words with alternative spellings should be included
=> This list was not supplied, nor was a motivation given why such a file is not necessary for German.
- Recording platform should be specified
  . digital telephone net link
    There is no specific information about the recording platform. It is stated, however, that data were recorded using ISDN lines, resulting in A-law speech files.
- Signal characteristics (number of bits per sample; bandwidth; coding type; compression procedures)
=> This information could be given more explicitly. It can now only be implicitly derived from section 1.1.
- The format and the file header structure of speech files
    OK, this information is provided.
- The format and the file header structure of annotation files
    OK, this information is provided.
- Annotation
  . procedure
    OK, mentioned in section 1.3
  . quality assurance
    OK, in section 1.5. The last number should read 5%.
  . character set used for annotation (transliteration)
  . annotation symbols for non-speech acoustic events must be mentioned at least for
    [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
  . list of symbols used to denote word interruptions and break-offs
    OK, in section 1.3.
- Lexicon information
  .
Which graphemic characters and conventions are used in annotations and lexicon Follows SpeechDat guidelines. . Procedures to obtain phonemic forms from orthographic input (lexicon generation and lay out) OK, in section 1.4 . Overview of SAMPA symbols used (only in this manner it can be checked if the lexicon contains only legal symbols) Was provided in \FIXED0DE\TABLE\PHONEMES.TBL, as is mentioned in the README.TXT file. . horizontal ordering of information OK - Indication of how many of the files were double checked by the producer together with percentage of detected errors => There is no information about this topic. =========================================================================== 2. DATABASE STRUCTURE CONTENTS AND FILE NAMES - Directory / subdirectory conventions Format of directory tree should be \\\\ . data base: defined as <#> can be FIXED, MOBIL, VERIF <#> is 0 for Speechdat(M) and 1 for SpeechDat is the ISO two-letter code for the language . volume : is a progressive number specifying the CD containing the material. Defined as CD where is the number. . block : defined as BLOCK where is a progressive number from 00 to 99. Block numbers are unique over all CDs. They could typically be the first two digits of below. . session: defined as SES where is the session code also appearing in file name OK - A README.TXT file should be in the root describing all (documentation) files on the CD-ROM. OK - A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. This file supplies the volume label to UNIX systems that cannot read the physical volume label. Example of contents: FIXED0EN_00. OK - Documentation should be in \\DOC OK - The contents list is in \\INDEX and is obligatory. 
OK - Tables should be in \\TABLE OK - Index files (optional) should be in \\LST Not present - Prompt sheet files (optional) should be in \\PROMPT OK - Any source code supplied should be in \\SOURCE (SAMLIB, V4 and GNU gunzip + licence) OK - A copyright statement should be given in COPYRIGH.TXT OK - The index files (if presented) obey the nomenclature .LST where e.g. A0ENN3.LST (see below for item_code) Not present - File naming conventions All file names should obey the following pattern: DDNNNNCC.LLF DD : database identification code For SpeechDat(M): A0 = fixed net, B0 = mobile For SpeechDat : A1 = fixed net, B1 = mobile, C1 = speaker verification NNNN : session code 0000 to 9999 CC : item code; first character is item type identifier, second character is item number LL : ISO-639 language code (with extensions) F : speech file type Z is for A-law, compressed O is for Orthographic label (label file) OK - Correct item codes should be used: I1 : isolated digit C1 : 4 digit id of prompt sheet C2 : ~10 digit telephone number C3 : ~12 digit credit card number N1-3 : 3 natural numbers M1-2 : 2 money amounts L1-3 : 3 spelled words T1 : 1 time of day T2 : 1 time phrase D1-3 : 3 dates Q1-3 : 3 yes/no questions P1 : city of call/birth A1-6 : 6 common application words E1-3 : 3 application word phrases S1-9 : 9 phonetically rich sentences OK - NNNN in filenames is not in conflict with BLOCK and SES numbers in pathname OK - Contents lowest level subdirectories should be of one call only OK - Empty (i.e. zero-length) files are not permitted OK - Counts should match information in documentation . count of files in each subdirectory . count grand total => There are no counts of calls in the documentation, nor is there a list of missing items. This information can only be derived from the SUMMARY.TXT file. - File match: For each label file there must be one speech file and vice versa. 
    OK
- Part of the corpus could be designed for training and a (typically smaller) part for testing. This is optional.
    No partition has been indicated.
- The contents of the database as given in CONTENTS.LST should comprise
  . CD-ROM volume name (VOL:)
  . full pathname (DIR:)
  . speech file name (SRC:)
  . speaker code (SCD:)
  . speaker sex (SEX:)
  . speaker age (AGE:)
  . region of call (REG:)
  . orthographic transcription of uttered item (LBO:)
  This file must be supplied as an ASCII delimited file (either using TAB, or commas and (double) quoted strings).
    OK,
=> The transcription field of the file also contains the information for start sample, center sample and end sample. This information should not be there.

=============================================================================
3. ITEMS

- 1 isolated digit (code I1)
  . read
    OK
- 3 connected digits (code C1-3)
  - 4 digit number to identify the prompt sheet
    . read
=> OK, however, the hyphen was not intended by the specifications
  - ~10 digit telephone number
    . read or spontaneous(?)
    OK, however, there are 14 digits instead of 10.
  - ~12 digit credit card number
    . read
    . if there is a checksum then formula must be provided
    OK, however, there are 16 digits instead of 12 (but 16 is better).
  . 26 digits per call are required
    There are four digits more.
  . at least one example per digit per caller
=> This is not explicitly stated in the documentation. According to our inquiries, no care was taken to ensure that every digit was realised by each speaker. We found 237 calls with one or more digits missing, with an average of 343/237=1.45 missing digits in these calls.
  . digits must appear numerically on the sheet, not as words
    OK
- 3 natural numbers (code N1-3)
  . read
  . provided as numbers
  . numbers must be < 1,000,000
  . one may be a decimal number
  . one may be a quantity (including a unit of measurement)
  . sufficient examples of each word to permit training
    OK
- 2 money amounts (code M1-2)
  . read
  . currency words should be included
  .
one small amount including decimals and one large amount not including decimals OK - 3 spelled words (code L1-3) . read . equal balance of all vocabulary letters . average length at least 7 letters . may include names, cities and other frequently spelled items . should include equivalents of : A-Z, accent words, CAPITAL, SMALL, UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE, HYPHEN => OK, however, CAPITAL, SMALL, UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE are not included in the set. L1 seems to be the spontaneous one (it has an empty string field in the prompt mnemonic LBR). In the documentation it is suggested (in 3.5) that the third one (L3) is the spontaneous one. Please modify documentation text. - 1 time of day (code T1) . spontaneous OK - 1 time phrase (code T2) . read . analogue form . equal balance of all words . should include equivalents of : AM/PM, HALF/QUARTER PAST/TO, NOON, MIDNIGHT, MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY, TOMORROW => OK, however, equivalents for AFTERNOON, TODAY, YESTERDAY, TOMORROW do not seem to be included. The following frequency count was made for the non-digit words Uhr : 237 abends : 11 dreiviertel : 81 früher : 42 halb : 31 jetzt : 7 morgens : 34 nach : 26 nachher : 48 nachts : 40 nun : 58 später : 45 tagsüber : 14 viertel : 198 vor : 40 vorher : 17 über : 54 It can be concluded that these words are not very carefully balanced. - 1 date (code D1) . spontaneous OK - 2 dates (code D2-3) . read, wordstyle . analogue form . covering all weekdays and months OK. 
We also performed a frequency count on the words in the date strings April : 101 August : 36 Dezember : 20 Dienstag : 60 Donnerstag : 17 Februar : 48 Freitag : 30 Januar : 42 Juli : 15 Juni : 69 Mai : 33 Mittwoch : 46 Monat : 84 Montag : 49 März : 22 November : 28 Oktober : 64 Samstag : 51 September : 17 Sonnabend : 96 Sonntag : 55 Woche : 194 einer : 101 gestern : 52 heute : 104 in : 43 letzte : 40 morgen : 48 nächste : 53 nächsten : 84 vor : 58 vorgestern : 48 übermorgen : 50 The words are not very carefully balanced in their frequencies but occur in sufficient amounts. The lowest frequency was found for "September". The ordinals were less well balanced : 1. : 89 10. : 84 11. : 23 12. : 72 13. : 13 14. : 11 15. : 42 16. : 4 17. : 23 18. : 87 19. : 25 2. : 80 20. : 49 21. : 27 22. : 42 23. : 27 24. : 4 25. : 12 26. : 25 27. : 47 28. : 21 29. : 53 3. : 151 30. : 7 31. : 11 4. : 67 5. : 88 6. : 53 7. : 70 8. : 81 9. : 85 => A few ordinals occur very seldom (16. 24. 25. 30. 31.). - 3 yes/no questions (code Q1-3) . spontaneous, not prompted OK - city of call/birth (code P1) . preferably spontaneous; read is permitted OK - 6 common application words (code A1-6) . set of 50 should be defined . 39 are fixed for all partners, see Appendix A Del 1.4-1 . read OK - 3 application word phrases (code E1-3) . application word is embedded in phrase . read or spontaneous OK - 9 phonetically rich sentences (code S1-9) . read . recommended: at least 2 samples of each phone per caller (should appear from documentation) OK 2. Application words In appendix A of SpeechDat deliverable 1.4-1 a list of 39 obligatory application words is provided. All application words were found in the corpus and in a sufficient quantity (in the prompt text). Each word occurs at least 40 times. Most words occur more than 100 times. A full overview is displayed below. 
: 96 Anruf : 137 Ansage : 107 Aufnahme : 92 Auskunft : 86 Beantworter : 126 Dienst : 121 Eingabe : 109 Ende : 154 Halt : 157 Hilfe : 112 Konferenz : 91 Menü : 139 Nachricht : 139 Nummer : 75 Quadrat : 145 Raute : 147 Rücklauf : 187 Rückrufwunsch : 133 Schutz : 115 Stern : 96 Stop : 103 Telefon : 82 Umleitung : 112 Vermittlung : 156 Vorlauf : 118 Wahl : 86 Wiedergabe : 147 Wiederholung : 127 abhören : 121 aktivieren : 129 anhören : 134 aufheben : 127 auslösen : 80 bestätigen : 123 blättern : 73 extern : 83 hinterlassen : 95 intern : 96 löschen : 121 nächste : 114 programmieren : 180 speichern : 97 verbinden : 168 weiter : 160 weiterleiten : 124 wiedergeben : 69 zurück : 149 zuschalten : 109 Übergabe : 153 3. Incidentally missing items We found that 50 obligatory files were missing in the corpus. These missing files are: A00029L3 A00040L3 A00051L2 A00051L3 A00052M1 A00058E1 A00071T1 A00079E3 A00083L3 A00100M1 A00121T1 A00125L3 A00156T1 A00170D3 A00170T1 A00175L3 A00175M1 A00176E3 A00178E3 A00180C3 A00182L3 A00197L3 A00225D1 A00227D3 A00227M1 A00264M1 A00290L3 A00311T1 A00314C3 A00316L3 A00360T1 A00363L3 A00449L3 A00461C3 A00467T1 A00474L3 A00484L3 A00515Q3 A00594E1 A00644C3 A00659C3 A00668Q3 A00684E3 A00757L2 A00814T1 A00880E2 A00930L3 A00943L3 A00378S2 A00381S6 Sorting the missing files to individual calls we found the following. Calls with 1 obligatory item missing: 42 Calls with 2 obligatory items missing: 4 According to the specifications 10% (=100) of the calls may miss up to 3 obligatory items. It is clear that German fulfills this without any problem. => There are also files that are effectively missing (empty or corrupted speech files). We examined the label files that had an empty transcription field or only noise symbols in it. In total 417 items with empty transcriptions were found, 286 of which were associated with the obligatory items. Sorting these obligatory files to individual calls we observed the following: Freq. 
Nr of items missing in a call
  163      1
   40      2
   10      3
    2      4
    1      5
=> Thus, there are 213 calls with at most 3 items missing. This is far beyond the 10% threshold. Furthermore, there are 3 calls that miss more than 3 items, which is below the 5% threshold.
There are also (a few) files that do have an associated transcription, but that are nevertheless empty. We come to these in the section on the contents of the sampled data files.

=> 4. Overall conclusion

SpeechDat has the following criteria for missing items:
- 85% (850) out of 1000 calls must be complete
  . A maximum of 10% (100) of the calls may miss up to 3 mandatory items
  . A maximum of 5% (50) of the calls may miss more items
  (A complete call is one with all speech files recorded for all prompt items)

In the German corpus none of the mandatory items is structurally missing and 46 calls (42 with one item and 4 with two items) miss fewer than 3 items incidentally. Therefore German fulfils the requirements well. However, if we regard the files with empty transcriptions as missing as well, then the thresholds are exceeded. In that case we find 21.3% calls with up to 3 items missing and 0.3% calls with more than 3 items missing. The specifications mention that only files that are missing due to system errors have to be considered as missing. Since the empty files can be considered as flaws of the speakers, only the really missing files are relevant for the count of incomplete calls. In that case the German corpus fulfils the specifications well.

==========================================================================
4. CONTENTS SAMPLED DATA FILES

1 File structure
  . NIST (header : contains file info -> ant.txt)
  . SAM
    SAM label files were supplied. OK

2 Coding
  . A-law, 8 bit, 8 kHz
  . Compression by Gzip
    OK

3 Sample distribution

The phonetically rich sentences are stored in separate directories if we follow the path given by the CDs (the sentences are in directory CD02\etc in that case).
In order to get a clearer picture of generally deviating calls, we put all the data of one call in one single subdirectory (after having checked that all data were originally stored in the correct directories).

We make a distinction between statistics computed over all items and statistics computed over only the obligatory items. In the case of German the only non-obligatory item is S0, which contains the spontaneous speech portion about breakfast habits. This means that the differences that we observe between the two statistics can only emerge from this item.

Several sample distributions are checked:

3.1 File length

We calculated the length of the files in seconds in order to trace spurious recordings, i.e. files of extraordinary length.

Distribution of file durations in all items (in seconds):
#Seconds    Occurrences
 0 -  1 : 11
 1 -  2 : 21
 2 -  3 : 15763
 3 -  4 : 153
 4 -  5 : 9636
 5 -  6 : 1225
 6 -  7 : 2786
 7 -  8 : 2483
 8 -  9 : 1696
 9 - 10 : 1016
10 - 11 : 2921
11 - 12 : 313
12 - 13 : 404
13 - 14 : 374
14 - 15 : 749
15 - 16 : 24
16 - 17 : 42
17 - 18 : 22
18 - 19 : 16
19 - 20 : 19
20 - 21 : 195 (these are all exactly 20s)

Distribution of file durations over all obligatory items:
#Seconds    Occurrences
 2 -  3 : 15733
 3 -  4 : 126
 4 -  5 : 9612
 5 -  6 : 1178
 6 -  7 : 2737
 7 -  8 : 2425
 8 -  9 : 1637
 9 - 10 : 967
10 - 11 : 2863
11 - 12 : 270
12 - 13 : 360
13 - 14 : 332
14 - 15 : 712

From these data it can be observed that the extreme durations (smaller than 2s and larger than 15s) all come from S0. The remaining distribution does not show extremes which necessitate further investigations. Also the distribution over the calls showed no alarming tendencies.

Duration distribution per call:
 3 -  4 : 2
 4 -  5 : 263
 5 -  6 : 676
 6 -  7 : 59

3.2 min-max samples

We provide a histogram with clipping ratios. The clipping ratio is defined as the number of samples in a file that are equal to the maximum/minimum value, divided by the number of all samples in the file.
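The clipping ratio defined here can be illustrated with a minimal Python sketch. It is not the tool used for this validation. The function name clipping_ratio is hypothetical, and we assume that the A-law samples have already been decoded to linear integers whose extreme values are +/-32256, the full-scale value quoted in this report.

```python
def clipping_ratio(samples, full_scale=32256):
    """Return the percentage of samples that sit at +/- full_scale.

    `samples` is assumed to be a sequence of linear integer sample
    values (A-law already decoded); `full_scale` is an assumption
    taken from the extreme value quoted in the report.
    """
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) == full_scale)
    return 100.0 * clipped / len(samples)


if __name__ == "__main__":
    # Two of the four samples are at full scale, so the ratio is 50%.
    print(clipping_ratio([32256, -32256, 0, 100]))
```

The ratio is expressed in percent so that it can be compared directly with the histogram intervals.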
The histogram, then, is an overview of how many files were found in a set of clipping rate intervals. Clip distribution for all files: Clipping Occurences rate (in %) 0.0 - 0.1 : 5767 0.1 - 0.2 : 1383 0.2 - 0.3 : 774 0.3 - 0.4 : 334 0.4 - 0.5 : 277 0.5 - 0.6 : 186 0.6 - 0.7 : 115 0.7 - 0.8 : 65 0.8 - 0.9 : 72 0.9 - 1.0 : 44 1.0 - 1.1 : 26 1.1 - 1.2 : 31 1.2 - 1.3 : 27 1.3 - 1.4 : 23 1.4 - 1.5 : 17 1.5 - 1.6 : 12 1.6 - 1.7 : 15 1.7 - 1.8 : 10 1.8 - 1.9 : 16 1.9 - 2.0 : 16 2.0 - 2.1 : 11 2.1 - 2.2 : 15 2.2 - 2.3 : 16 2.3 - 2.4 : 12 2.4 - 2.5 : 12 2.5 - 2.6 : 12 2.6 - 2.7 : 16 2.7 - 2.8 : 9 2.8 - 2.9 : 10 2.9 - 3.0 : 7 3.0 - 3.1 : 4 3.1 - 3.2 : 4 3.2 - 3.3 : 4 3.3 - 3.4 : 5 3.4 - 3.5 : 2 3.5 - 3.6 : 1 3.6 - 3.7 : 2 3.7 - 3.8 : 1 3.8 - 3.9 : 3 3.9 - 4.0 : 1 4.1 - 4.2 : 1 4.2 - 4.3 : 1 4.3 - 4.4 : 1 4.4 - 4.5 : 1 4.5 - 4.6 : 2 4.7 - 4.8 : 2 4.8 - 4.9 : 1 5.2 - 5.3 : 1 5.4 - 5.5 : 1 5.9 - 6.0 : 1 7.8 - 7.9 : 1 Number of files with absolute maximum < 32256: 30499 For the obligatory items we found the following distribution: Clipping Occurences rate (in %) 0.0 - 0.1 : 5537 0.1 - 0.2 : 1309 0.2 - 0.3 : 718 0.3 - 0.4 : 313 0.4 - 0.5 : 258 0.5 - 0.6 : 174 0.6 - 0.7 : 110 0.7 - 0.8 : 60 0.8 - 0.9 : 64 0.9 - 1.0 : 34 1.0 - 1.1 : 20 1.1 - 1.2 : 29 1.2 - 1.3 : 26 1.3 - 1.4 : 21 1.4 - 1.5 : 17 1.5 - 1.6 : 11 1.6 - 1.7 : 13 1.7 - 1.8 : 10 1.8 - 1.9 : 16 1.9 - 2.0 : 16 2.0 - 2.1 : 9 2.1 - 2.2 : 15 2.2 - 2.3 : 16 2.3 - 2.4 : 11 2.4 - 2.5 : 12 2.5 - 2.6 : 12 2.6 - 2.7 : 15 2.7 - 2.8 : 9 2.8 - 2.9 : 10 2.9 - 3.0 : 7 3.0 - 3.1 : 4 3.1 - 3.2 : 4 3.2 - 3.3 : 4 3.3 - 3.4 : 5 3.4 - 3.5 : 2 3.5 - 3.6 : 1 3.6 - 3.7 : 2 3.7 - 3.8 : 1 3.8 - 3.9 : 3 3.9 - 4.0 : 1 4.1 - 4.2 : 1 4.2 - 4.3 : 1 4.3 - 4.4 : 1 4.4 - 4.5 : 1 4.5 - 4.6 : 2 4.7 - 4.8 : 2 4.8 - 4.9 : 1 5.2 - 5.3 : 1 5.4 - 5.5 : 1 5.9 - 6.0 : 1 7.8 - 7.9 : 1 Number of files with absolute maximum < 32256: 30040 The two distributions do not differ very much, so there is no obvious difference between the clipratios of the obligatory items and 
the optional item.

By listening to sets of files we concluded that files with a clip ratio over 1.5% can be viewed as corrupted (clipped in every syllable), and that the quality of files with a clip ratio between 1.0% and 1.5% is suspicious.
=> This implies that 221 files are severely distorted due to clipping, whereas another 113 are at least suspicious in this respect.

The following distribution over the calls was found:

Clip distribution per call:
Clipping      Occurrences
rate (in %)
0.0 - 0.1 : 643
0.1 - 0.2 : 54
0.2 - 0.3 : 18
0.3 - 0.4 : 11
0.4 - 0.5 : 7
0.5 - 0.6 : 3
0.9 - 1.0 : 2
1.1 - 1.2 : 1
1.6 - 1.7 : 2
1.7 - 1.8 : 1
2.1 - 2.2 : 2
2.3 - 2.4 : 2
Number of directories with absolute maximum < 32256: 254

=> By listening we concluded that virtually all files in the directories with a mean clip ratio higher than 2.0% were corrupted due to clipping. The directories involved are: SES0105, SES0222, SES0848, SES0950. Further, in directories with a mean clip ratio higher than 1.0% quite a lot of files were corrupted due to clipping. The directories involved are: SES0029, SES0262, SES0555, SES0654, SES0849.

3.3 Mean values

We computed the mean sample value of each item in each call. We provide a histogram with mean values below. The histogram is an overview of how many files were found in a set of mean sample value intervals. This overview can be used to trace files with large DC-offsets. The extreme values in the files are (-)32256.
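As an illustration of the mean-value check, the following Python sketch computes the mean sample value of a file and the 10-unit interval it falls into. It is a sketch under assumptions, not the validation software: the function names mean_sample_value and mean_bin are hypothetical, and the samples are assumed to be linear integers.

```python
import math


def mean_sample_value(samples):
    """Mean of all sample values in one file (a measure of DC offset)."""
    if not samples:
        return 0.0
    return sum(samples) / len(samples)


def mean_bin(mean, width=10):
    """Return the (low, high) interval of the given width containing `mean`.

    The interval width of 10 is an assumption chosen for illustration.
    """
    low = math.floor(mean / width) * width
    return (low, low + width)


if __name__ == "__main__":
    offset = mean_sample_value([-480, -490, -485, -485])
    print(offset, mean_bin(offset))
```

Files whose interval lies far from zero are the candidates for a closer look.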
Mean distribution over all items: -2850- -1030: 14 -990 - -980 : 1 -970 - -960 : 1 -960 - -950 : 2 -900 - -890 : 1 -890 - -880 : 1 -880 - -870 : 1 -870 - -860 : 1 -840 - -830 : 1 -810 - -800 : 1 -770 - -760 : 1 -730 - -720 : 1 -720 - -710 : 1 -690 - -680 : 1 -670 - -660 : 1 -660 - -650 : 1 -650 - -640 : 3 -640 - -630 : 3 -630 - -620 : 3 -620 - -610 : 1 -610 - -600 : 2 -590 - -580 : 1 -580 - -570 : 2 -570 - -560 : 1 -530 - -520 : 1 -520 - -510 : 1 -490 - -480 : 42 -480 - -470 : 5 -470 - -460 : 10 -460 - -450 : 29 -450 - -440 : 7 -440 - -430 : 34 -430 - -420 : 4 -420 - -410 : 19 -410 - -400 : 110 -400 - -390 : 24 -390 - -380 : 80 -380 - -370 : 53 -370 - -360 : 198 -360 - -350 : 14 -350 - -340 : 121 -340 - -330 : 91 -330 - -320 : 100 -320 - -310 : 98 -310 - -300 : 46 -300 - -290 : 52 -290 - -280 : 182 -280 - -270 : 58 -270 - -260 : 20 -260 - -250 : 95 -250 - -240 : 11 -240 - -230 : 1 -230 - -220 : 2 -220 - -210 : 2 -210 - -200 : 8 -200 - -190 : 96 -190 - -180 : 84 -180 - -170 : 20 -170 - -160 : 93 -160 - -150 : 66 -150 - -140 : 97 -140 - -130 : 71 -130 - -120 : 82 -120 - -110 : 61 -110 - -100 : 58 -100 - -90 : 56 -90 - -80 : 58 -80 - -70 : 102 -70 - -60 : 177 -60 - -50 : 148 -50 - -40 : 410 -40 - -30 : 484 -30 - -20 : 915 -20 - -10 : 1947 -10 - 0 : 5022 0 - 10 : 18278 10 - 20 : 7608 20 - 30 : 1237 30 - 40 : 439 40 - 50 : 252 50 - 60 : 129 60 - 70 : 126 70 - 80 : 20 80 - 90 : 13 90 - 100 : 45 100 - 110 : 48 110 - 120 : 4 120 - 130 : 4 130 - 140 : 3 140 - 150 : 18 150 - 160 : 24 170 - 180 : 2 190 - 200 : 3 200 - 210 : 1 240 - 250 : 1 300 - 310 : 1 430 - 440 : 1 620 - 630 : 1 Mean distribution over the obligatory items: -490 - -480 : 40 -480 - -470 : 4 -470 - -460 : 8 -460 - -450 : 28 -450 - -440 : 5 -440 - -430 : 33 -430 - -420 : 2 -420 - -410 : 17 -410 - -400 : 106 -400 - -390 : 24 -390 - -380 : 76 -380 - -370 : 50 -370 - -360 : 196 -360 - -350 : 13 -350 - -340 : 118 -340 - -330 : 89 -330 - -320 : 97 -320 - -310 : 96 -310 - -300 : 45 -300 - -290 : 50 -290 - -280 : 178 
-280 - -270 : 51
-270 - -260 : 19
-260 - -250 : 91
-250 - -240 : 7
-210 - -200 : 5
-200 - -190 : 95
-190 - -180 : 79
-180 - -170 : 18
-170 - -160 : 88
-160 - -150 : 63
-150 - -140 : 89
-140 - -130 : 67
-130 - -120 : 76
-120 - -110 : 54
-110 - -100 : 52
-100 - -90 : 49
-90 - -80 : 51
-80 - -70 : 92
-70 - -60 : 172
-60 - -50 : 136
-50 - -40 : 391
-40 - -30 : 468
-30 - -20 : 896
-20 - -10 : 1906
-10 - 0 : 4908
0 - 10 : 17960
10 - 20 : 7482
20 - 30 : 1216
30 - 40 : 428
40 - 50 : 244
50 - 60 : 125
60 - 70 : 122
70 - 80 : 20
80 - 90 : 12
90 - 100 : 44
100 - 110 : 44
110 - 120 : 4
120 - 130 : 3
130 - 140 : 2
140 - 150 : 18
150 - 160 : 24
170 - 180 : 1
190 - 200 : 2
200 - 210 : 1
240 - 250 : 1
300 - 310 : 1

By comparing the two distributions it can be seen that signals with means smaller than -500 and larger than 400 are only found in the optional S0 items. A look at files with means smaller than -1000 (so S0-items only) revealed that these files only contained noise or a telephone tone. The remaining distribution for obligatory items does not contain extremes that require further investigation.
The distribution over the calls is, as a consequence, severely biased by the S0-item, and is therefore not interesting.

3.4 Signal to Noise Ratio

We split each signal file into contiguous windows of 10 ms and computed the Mean Square (energy) in each window. The mean sample value over the complete file was subtracted from each individual sample value before MS was computed. The 5% of the windows that contained the lowest energy were assumed to contain line noise. In this way the signal to noise ratio could be calculated for each file by dividing the mean energy over all windows by the mean energy of the 5% sample mentioned above. The result was converted to dB by taking 10*log10 of this ratio.
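The SNR procedure described here can be sketched in Python as follows. This is an illustration under stated assumptions, not the program used for the validation. The function name snr_db and its parameters are hypothetical. We assume linear integer samples at 8 kHz, drop any trailing partial window, always take at least one window as noise, and return infinity when the noise energy is zero.

```python
import math


def snr_db(samples, rate=8000, window_ms=10, noise_fraction=0.05):
    """Estimate the SNR in dB from windowed mean-square energies."""
    win = rate * window_ms // 1000          # 80 samples at 8 kHz
    n_windows = len(samples) // win         # trailing partial window dropped
    if n_windows == 0:
        raise ValueError("signal shorter than one window")

    # Remove the DC offset computed over the complete file.
    mean = sum(samples) / len(samples)

    # Mean square (energy) of each contiguous window.
    energies = []
    for i in range(n_windows):
        frame = samples[i * win:(i + 1) * win]
        energies.append(sum((s - mean) ** 2 for s in frame) / win)

    # The lowest-energy windows are taken to contain line noise.
    energies.sort()
    n_noise = max(1, int(n_windows * noise_fraction))
    noise = sum(energies[:n_noise]) / n_noise
    total = sum(energies) / n_windows

    if noise == 0:
        return float("inf")
    return 10.0 * math.log10(total / noise)


if __name__ == "__main__":
    loud = [100, -100] * 40     # one 10 ms window, energy 10000
    quiet = [10, -10] * 40      # one 10 ms window, energy 100
    print(round(snr_db(loud * 19 + quiet), 2))
```

Because the noise estimate is taken from the same file, a file that contains only noise yields an SNR near 0 dB, which is consistent with treating very low SNR files as empty.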
SNR distribution over all items: SNR occurrences 0 - 5 : 71 5 - 10 : 114 10 - 15 : 213 15 - 20 : 496 20 - 25 : 1453 25 - 30 : 4154 30 - 35 : 8570 35 - 40 : 9671 40 - 45 : 8337 45 - 50 : 4681 50 - 55 : 1597 55 - 60 : 355 60 - 65 : 90 65 - 70 : 37 70 - 75 : 16 75 - 80 : 5 80 - 85 : 3 85 - 90 : 2 90 - 95 : 4 For the obligatory items only the following distribution was computed: 0 - 5 : 68 5 - 10 : 111 10 - 15 : 188 15 - 20 : 450 20 - 25 : 1401 25 - 30 : 4054 30 - 35 : 8385 35 - 40 : 9481 40 - 45 : 8177 45 - 50 : 4585 50 - 55 : 1557 55 - 60 : 348 60 - 65 : 83 65 - 70 : 35 70 - 75 : 15 75 - 80 : 5 80 - 85 : 3 85 - 90 : 2 90 - 95 : 4 It can be seen that there is no reason to assume that the optional item S0 brings a specific bias into the distributions. => By looking at and listening to files with SNR < 5 dB, we concluded that these files can be regarded as empty. This means that 68 files can be regarded as empty and therefore as practically missing. The following files had SNR < 5 dB: Low snr 4.5 in file A00103T2.DEZ Low snr 4.1 in file A00107Q1.DEZ Low snr 4.2 in file A00109P1.DEZ Low snr 4.4 in file A00115Q1.DEZ Low snr 3.9 in file A00121N2.DEZ Low snr 2.4 in file A00129N3.DEZ Low snr 3.1 in file A00182N3.DEZ Low snr 3.2 in file A00226P1.DEZ Low snr 4.2 in file A00261N3.DEZ Low snr 3.1 in file A00284N2.DEZ Low snr 3.0 in file A00288A6.DEZ Low snr 1.2 in file A00289Q2.DEZ Low snr 1.2 in file A00289T2.DEZ Low snr 4.1 in file A00298Q1.DEZ Low snr 3.7 in file A00302N2.DEZ Low snr 3.8 in file A00330N2.DEZ Low snr 3.1 in file A00337Q3.DEZ Low snr 4.4 in file A00346P1.DEZ Low snr 4.9 in file A00381T2.DEZ Low snr 3.4 in file A00385Q3.DEZ Low snr 3.4 in file A00386Q1.DEZ Low snr 4.0 in file A00389I1.DEZ Low snr 4.2 in file A00389P1.DEZ Low snr 3.3 in file A00300T2.DEZ Low snr 5.0 in file A00416P1.DEZ Low snr 4.7 in file A00450A4.DEZ Low snr 2.3 in file A00450T2.DEZ Low snr 3.7 in file A00464P1.DEZ Low snr 2.4 in file A00475I1.DEZ Low snr 3.5 in file A00515N2.DEZ Low snr 3.2 in 
file A00523N3.DEZ Low snr 3.6 in file A00530P1.DEZ Low snr 4.8 in file A00557Q2.DEZ Low snr 3.3 in file A00573Q2.DEZ Low snr 4.7 in file A00616Q1.DEZ Low snr 3.7 in file A00616Q2.DEZ Low snr 4.9 in file A00641P1.DEZ Low snr 3.0 in file A00666N3.DEZ Low snr 4.3 in file A00668D3.DEZ Low snr 3.2 in file A00675Q1.DEZ Low snr 2.9 in file A00676Q1.DEZ Low snr 4.3 in file A00681T2.DEZ Low snr 4.7 in file A00684Q1.DEZ Low snr 3.7 in file A00697P1.DEZ Low snr 3.6 in file A00706T2.DEZ Low snr 1.1 in file A00709N3.DEZ Low snr 2.2 in file A00710A4.DEZ Low snr 1.7 in file A00738T2.DEZ Low snr 2.5 in file A00746Q1.DEZ Low snr 4.8 in file A00804Q3.DEZ Low snr 3.8 in file A00811P1.DEZ Low snr 3.3 in file A00811Q1.DEZ Low snr 3.6 in file A00834A2.DEZ Low snr 4.7 in file A00863N3.DEZ Low snr 0.7 in file A00865N2.DEZ Low snr 4.0 in file A00915T2.DEZ Low snr 4.1 in file A00919Q1.DEZ Low snr 2.4 in file A00942Q1.DEZ Low snr 4.1 in file A00942Q3.DEZ Low snr 2.5 in file A00972N3.DEZ Low snr 2.1 in file A00984Q1.DEZ Low snr 4.6 in file A00989Q1.DEZ Low snr 3.0 in file A00017P1.DEZ Low snr 4.3 in file A00067N2.DEZ Low snr 3.0 in file A00071N3.DEZ Low snr 3.2 in file A00079A5.DEZ Low snr 3.1 in file A00079A6.DEZ Low snr 3.8 in file A00079N3.DEZ We checked for correspondances with empty transcription fields in the label files. In general, if a speech file had a very low SNR (i.e below 5 dB), then the transcription field was empty (or contained only noise symbols). The following SNR distribution over calls was found: SNR occurrences 5 - 10 : 1 10 - 15 : 4 15 - 20 : 6 20 - 25 : 27 25 - 30 : 100 30 - 35 : 235 35 - 40 : 252 40 - 45 : 232 45 - 50 : 106 50 - 55 : 33 55 - 60 : 3 60 - 65 : 1 => Calls with SNRs below 15 dB were found to be of poor quality due to heavy buzzes on the signal, sometimes accompanied by a very low recording level. 
These directories were:

SESSION   Mean SNR   Characteristic
ses0376   12.0       heavy buzz
ses0450    9.0       buzz and low recording level
ses0427   13.5       heavy buzz
ses0709   13.5       heavy buzz
ses0902   13.0       heavy buzz

CONCLUSION.
=> Clipping has distorted some 221 files and has made another 113 suspicious. These files should be marked in some way to denote that they are of poor quality. This could be necessary information for future users who want to use the corpus to train speech recognisers, for example.
We found, as a rule, that files with a very low SNR (i.e. below 5 dB) also have empty transcription fields. This means that these files do not add to the items already mentioned as effectively missing in section 4.4.

========================================================================
5. LABEL FILES

- File empty?
    OK, never the case
- No illegal mnemonics used
    OK, no illegal mnemonics were used
- There are no mnemonics missing
    OK
- all mnemonics should be SAM mnemonics or explicitly defined in documentation
    OK
- Mandatory (SAM) mnemonics:
    LHD: SAM, 5.00
    DBN: SpeechDat(M)_
    VOL: FIXED1_LL ... etc
    SES: session number
    DIR:
    SRC:
    CCD:
    RED:
    RET:
    SAM: 8000
    BEG:
    END:
    SNB: 1
    SBF: 1
    SSB: 8
    QNT: A-law
    CMP: GZIP, 1.2.4
    REG:
    SEX: male/female/unknown   ! SEX and AGE may also only appear in
         (one letter)          ! in speaker table if SCD is provided
                               ! in label file
    AGE:                       ! mnemo is not SAM
    LBD:
    LBR: [start,] [end,] [gain,] [minimum val,] [maximum val,] orthogr. prompt
    LBO: , , transliteration
    EXT: [if needed for LBR and LBO, > 80 char]
    ELF:
  . LHD and TYP are first
  . LBR and LBO come after LBD
  . ELF is end of file keyword
=> RET has an erroneous value in SES0906 and SES0932. The hour specified is 115 or 116 in these files.
=> For 16 sessions the orthographic prompt of all items is not present. These sessions are: 0055, 0214, 0216, 0238, 0335, 0353, 0531, 0570, 0576, 0592, 0680, 0852, 0911, 0930, 0948, 0956.
- For spontaneous speech LBR should be left blank or contain a mnemonic word (like ).
OK - Optional (SAM) mnemonics (may be ommitted or left empty) TYP: orthographic TXF: CMT: NCH: 1 ARC: ! mnemo is not SAM SHT: ! mnemo is not SAM EXP: SYS: DAT: SPA: PHM: ! mnemo is not SAM ACC: ! mnemo is not SAM NET: fixed/gsm ... ! mnemo is not SAM EDU: ! mnemo is not SAM SOC: ! mnemo is not SAM REP: PCF: RCC: ENV: - Transliterations only in lower case letters, also at sentence beginning Only exception: proper names and spelled words, ZIP codes, acronyms and abbreviations In the latter case blanks should be used in between the letters. German is the only exception to this convention. OK - Punctuation marks should not be used in the transliterations OK - Digits must appear in full orthographic form OK - In principle only the following symbols are allowed to indicate non-speech acoustic events: [Filled_Pause] [Speaker_Other] [Nonspeaker_Other] Other symbols (and language equivalents) must be mentioned in the documentation These symbols are explicitly mentioned in the main documentation. - Asterisks should be used to indicate incomplete realisations OK - According to a spelling check on annotated text (including bracket check) up to 1% errors may be found => There is no information about this. - The label files are associated with the correct speech files. (This cannot be done automatically at this moment. We can only point at files that were incidentally found as mismatched during the transcription and/or speech file validation) OK - Assessment of speech items in terms of SNR, presence of additional noise, adherence to prompting text is provided (optional) Not provided. ======================================================================== 6. 
LEXICON - Check lexicon existence OK, it is there as \FIXED0DE\TABLE\PRONDICT.TBL - Lexicon contents should be taken from actual utterances (from LBO) OK - The entries should be alphabetically ordered The entries are not ordered according to ASCII-conventions but according to the German lexicon: capitals and small letters mixed; letters with `Umlaute' considered as letters without. - In transcription only legal SAMPA symbols are allowed The SAMPA version used is VERBMOBIL-SAMPA. The symbols are in \FIXED0DE\TABLE\PHONEMES.TBL => /9:/ was found in the transcription of `earl'; this phoneme is not in the list. - Capitals only in proper names,spelled words, and in single letters derived from abbreviations (exception: German) The German capitalisation conventions were followed. However, it appears that capitals and small letters were not used consistently in the grapheme strings of lexicon entries and label transcription fields. (see overcompleteness and underconpleteness of the lexicon following below. - Phoneme symbols must be separated by blanks OK - Grapheme form and phonemic transcription must be separated by [TAB] OK - Alternative transcriptions are optional. They may follow the first transcription, separated by [TAB] or have a separate entry (only in case also frequency information is supplied) OK, not provided. - A line in the lexicon should have the following format [ ] [] => Frequency count and transcription are in the wrong order. - The lexicon should be complete . Check for undercompleteness (are all words in lexicon) => If we check if all words in the transcription are in the lexicon and do this in a case-sensitive way, then we find 94 words that are absent in the lexicon. 
     These words are (the frequency is given after the colon):

     Apfel: 6  Arrangement: 1  Bock: 1  Brei: 1  Café: 1  Dank: 17
     Depfa: 23  Deutsche: 221  Deutschen: 40  Deutscher: 1  E: 1870
     Erlangen: 99  Feines: 1  Früchten: 11  Früh: 1  Frühe: 1
     Frühstücke: 1  Frühstücken: 6  Halt: 209  Herzhaftes: 1  IT: 1
     Königs: 1  Kündigen: 1  LU: 1  Larosière: 25  Magens: 1  NE: 1
     Neukirchen-Vluyn: 1  PKW-Neuzulassungen: 2  Pille: 1  Saft: 16
     Schräger: 1  Sohle: 1  Speck: 21  Stände: 1  Trinken: 1
     Verstand: 2  Wacht: 1  Wusterhausen: 1  Zigarette: 5
     arbeitslose: 15  bayerisch: 1  bestehen: 1  betrug: 24  butter: 1
     cornflakes: 35  doppel: 22  drucken: 1  eintausend: 5  elfte: 10
     f: 2  füfnzig: 1  gefahren: 1  h: 1  habee: 2  habeee: 2
     hunhdert: 1  koche: 1  kollegialen: 1  lachen: 1  lese: 2
     loste: 1  mahle: 1  n: 1  neunundneuzig: 1  nö: 2  pancakes: 3
     peanut: 1  platt: 1  ruf: 1  s: 3  sage: 2  schweizer: 19
     siebenundneu: 1  sonntags: 2  special: 1  speichern: 126
     strukturgesetzes: 1  süßes: 2  toast: 1  toll: 1  tschüß: 61
     tu: 1  u: 2  uno: 1  v: 1  wiederhören: 5  wirklich: 1
     wohlhabenden: 1  ~neunundneunzig~: 1  Ä: 285  Ö: 256
     Österreichische: 15  Übergabe: 223

  . Check for overcompleteness
    (invalid words have a * and should not be in lexicon)
    (the same goes for words truncated due to a recording error; this is
    indicated by ~)
    Words with * or ~ are not in the lexicon, which is OK.
  => If we check whether all words in the lexicon are in the
     transcriptions and do this in a case-sensitive manner, then we find
     105 lexicon entries that are never used.
  => The full stop is in the lexicon, but it should not be.
     The unused lexicon entries are:

     .  A.  Amerika  Arbeitslose  Augusten  Banat  Bayerisch  Bayern
     Betrug  Bit  Brandenburg  Bundesgebietes  Bundesland  Cafe  DePfa
     Doppel  Eintausend  Elfte  Frankreich  Freistaat  Gefahren
     Grundschule  Heidi  Hessen  It  Julei  Kanada  Köche
     Königswusterhausen  Lachen  Larosiere  Mahle
     Mecklenburg-Vorpommern  NRW  Neukirchen  Niederbayern
     Niedersachsen  Niederösterreich  Norddeutschland
     Nordrhein-Westfalen  Oberschlesien  Pancakes  Peanut
     Rheinland-Pfalz  Rumänien  S]  Saarland  Sachsen  Sachsen-Anhalt
     Salzburg  Schlesien  Schleswig-Holstein  Schwabenland  Sie
     Singapur  Sonntags  Special  Speichern  Stadtstaat  Steiermark
     Strukturgesetzes  Säge  Temickuk  Tirol  Tschüß  Türkei
     Unterfranken  Vluyn  Volksschule  Westpreußen  Wien  Wohlhabenden
     arrangement  außerhalb  besucht  blöd  bock  brei  dank  dans  e
     erlangen  et  früchten  frühe  geboren  korrekt  kündigen  magens
     männlich  neuzehnter  pille  saft  schräger  seinerzeit
     siebenfünfzig  siebenten  sohle  speck  stände  verstand  wacht
     weiblich  zigarette  übergabe

- Stress information is optional
  OK, not provided.

=======================================================================
7. SPEAKERS

- Speaker database file
  . check existence
  OK
- Allowed formats:
  a. SAM mnemonics
  b. record file with commas as field separators and strings between
     double quotes
  The first variant was chosen.
- Obligatory information:               SAM:
  1. unique number (speaker/caller)     SCD or SES (DIR?)
  2. sex                                SEX
  3. age                                AGE
  4. region of call                     REG
  OK
  => the session number was chosen as the speaker key.
- Optional information:
  . height                  HET
  . weight                  WET
  . native language         NLN
  . accent                  ACC
  . ethnic group            ETH
  . education level         EDL
  . smoking habits          SMK
  . pathologies             PTH
  . socio-economic status   SOC
  Not provided
- Balance of sexes
  . How many males, how many females, should match specification in
    documentation file
  . Imbalance may not exceed 5%
  OK, 50% males and 50% females provided.
- Balance of regions
  . which regions and how many of each should match specification in
    documentation file
  OK, a count of the regions in the SPEAKER.TBL file yielded the same
  distribution of the speakers as given in the documentation file.
- Balance of ages
  . which age groups and how many of each should match specification in
    documentation file
  . A minimum of 20% of speakers must be in each of the following age
    groups: 17-30, 31-45, 46-60.
    A maximum of 40% of speakers may be younger than 17 or older than 60.
  OK, the following distribution was computed from the label files (age
  field):

    under 17:  23 =  2.30 %
    17 - 30 : 265 = 26.50 % , OK
    31 - 45 : 419 = 41.90 % , OK
    46 - 60 : 246 = 24.60 % , OK
    over 60 :  29 =  2.90 %

  This is well in accordance with the SpeechDat specifications.

===========================================================================
8. RECORDING CONDITIONS

- Digital telephone line
  OK
- A-law coding
  OK
- Specification of wireless telephone or not (optional)
  Not provided
- Time stamps on file
  => Not provided
- Recording information may be stored in a separate file (optional)
  - this file may have two formats:
    a. SAM mnemonics
    b. record table with commas as field separators and strings between
       double quotes
  - The primary key in the label file is the RCC mnemonic
  - Information:                   SAM:
    . recording conditions code    RCC
    . region of call               REG
    . telephone area code          ARC
    . environment                  ENV
    . telephone model              PHM
    . telephone network            NET
    . recording city               CTY
    . recording car                CAR
    . speed                        SPD
    . fan noise                    FAN
    . ground type                  GRD
    . wipes                        WIP
  This optional file was not provided.

======================================================================
9. TRANSCRIPTION

This validation is carried out by taking 5% of the short items and 5% of
the long items in the corpus. The transcriptions in the label files for
these samples are checked by listening to the corresponding speech files.
This check is performed by a native speaker of the language involved.
Short items are:
- isolated digit
- time phrases
- date phrases
- yes/no questions
- place name
- application words

Long items are:
- connected digits
- natural numbers
- money amounts
- spelled words
- application phrases
- phonetically rich sentences

Given that 23 long items were present in the database, and that there
were 1000 speakers, a selection of 5% of the long items would comprise
1150 samples. A selection of 1187 items was made.
For the short items all 16 items were included in the database. A
selection of 5% yields a sample of 800 items. A random selection of 814
short items was used for the evaluation.

- The evaluation comprises the following criteria
  . did the speaker actually speak the transliterated words
  . did the speaker speak the prompted text
  . is transliteration of non-speech acoustic events correct
  . speech quality, line quality
  . up to 5% transcription errors are allowed
- Abbreviations may only be used if spoken as such

RESULTS

1. Long items
=> In the sample of 1187 long items, transcription errors were found in
   94 items. This amounts to 7.9%, which is a bit too high. It should be
   noted that most errors were not of a serious nature. We found a fair
   number of errors with respect to where exactly an interrupted
   utterance had been broken off, and errors in the notation of
   abbreviations (missing blank between letters).

2. Short items
   In the sample of 814 short items, transcription errors were found in
   33 items. This amounts to 4.1%, which is OK.

A list of errors found in the samples can be supplied upon request.

=========================================================================
10. SUMMARY

Below we give a brief overview of our findings with respect to the German
SpeechDat database. The subsections follow the order of the various
topics in the previous sections of the report.

1. Documentation
In general, the information in the doc-file was sufficiently complete
with respect to the contents of the CDs, the items that were recorded,
speaker demographics, naming conventions for directories and files,
prompting, type and structure of file headers.
However, information about missing items, spelling alternatives, signal
characteristics, quality assurance and double checking procedures was
poor.

2. Database structure and file names
The database structure and the file names are fine.

3. Items
None of the obligatory items was structurally missing.
In total 50 files belonging to obligatory items were incidentally missing
in the corpus. This fulfils the specifications well. However, if we
regard the files with empty transcriptions as missing as well, then we
find 21.3% of the calls with up to 3 items missing, which is too much.
All obligatory application words were present and in sufficient
quantities.
In the spelled words, there are no equivalents of HYPHEN and APOSTROPHE.
In the time phrases, there are no equivalents for AFTERNOON, TODAY,
YESTERDAY, and TOMORROW.
A few ordinals in the date phrases occur very seldom.

4. Sampled data files
The speech files were delivered in the correct coding.
Clipping has distorted some 221 files and has made another 113
suspicious. (These numbers are for the obligatory items.) These files
should be marked in some way to denote that they are of poor quality.
This could be necessary information for future users who want to use the
corpus, for example to train speech recognisers.
By listening we concluded that virtually all files in the directories
with a mean clip ratio higher than 2.0 were corrupted due to clipping.
The directories involved are: SES0105, SES0222, SES0848, SES0950.
Further, in directories with a mean clip ratio higher than 1.0 quite a
lot of files were corrupted due to clipping. The directories involved
are: SES0029, SES0262, SES0555, SES0654, SES0849.

5. Label files
SpeechDat specifications were nicely followed in the use of the
mnemonics. A few deviations were found for which we recommend
modification:
RET has an erroneous value in SES0906 and SES0932. The hour specified is
115 or 116 in these files.
For 16 sessions the orthographic prompts of all items are not present.
These sessions are: 0055, 0214, 0216, 0238, 0335, 0353, 0531, 0570, 0576,
0592, 0680, 0852, 0911, 0930, 0948, 0956.

6. Lexicon
The lexicon is present and rather complete.
The transcription and frequency of occurrence are in the wrong order.
/9:/ was found in the transcription of `earl'; this phoneme is not in the
list.
94 words from the transcription could not be found in the lexicon, and
another 105 entries in the lexicon never occurred in the transcriptions.

7. Speakers
Balance of sexes, ages and regions has been well taken care of.
A speaker table was provided, which was well structured. The session
number was chosen as the speaker key.

8. Recording platform
We have no comments on this topic. The specifications were followed
nicely.

9. Transcription
A. Long items
In the sample of 1187 long items, transcription errors were found in 94
items. This amounts to 7.9%, which is a bit too high. It should be noted
that most errors were not of a serious nature. We found a fair number of
errors with respect to where exactly an interrupted utterance had been
broken off, and errors in the notation of abbreviations (missing blank
between letters).
B. Short items
In the sample of 814 short items, transcription errors were found in 33
items. This amounts to 4.1%, which is OK.

=========================================================================
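
NOTE ON THE TRANSCRIPTION FIGURES

The sample sizes and error percentages quoted for the transcription
check can be recomputed from the counts given in this report. The Python
sketch below is only an illustration of that arithmetic; the function and
variable names are hypothetical and are not part of any validation
software used for this report. It assumes that the minimum sample is 5%
of (items per speaker x number of speakers), and that a sample passes if
its error rate does not exceed the 5% limit stated in section 9.

```python
# Illustrative recomputation of the section 9 figures.
# All names here are hypothetical; only the numbers come from the report.

SPEAKERS = 1000          # number of speakers in the corpus
SAMPLE_PERCENT = 5       # 5% of the items are checked by listening
MAX_ERROR_PERCENT = 5.0  # up to 5% transcription errors are allowed


def minimum_sample(items_per_speaker, speakers=SPEAKERS, percent=SAMPLE_PERCENT):
    """Smallest number of items covering `percent` of all recorded items."""
    total_items = items_per_speaker * speakers
    # Integer ceiling division avoids floating-point rounding.
    return (total_items * percent + 99) // 100


def error_percent(errors, sample_size):
    """Transcription error rate of a sample, in percent."""
    return 100.0 * errors / sample_size


def verdict(errors, sample_size, limit=MAX_ERROR_PERCENT):
    """'OK' if the error rate is within the limit, otherwise 'too high'."""
    return "OK" if error_percent(errors, sample_size) <= limit else "too high"


if __name__ == "__main__":
    # (item class, items per speaker, sample actually used, errors found)
    samples = [("long", 23, 1187, 94), ("short", 16, 814, 33)]
    for name, per_speaker, used, errors in samples:
        print(
            f"{name} items: minimum sample {minimum_sample(per_speaker)}, "
            f"used {used}, errors {errors} "
            f"({error_percent(errors, used):.1f}%) -> {verdict(errors, used)}"
        )
```

Run as a script, this prints one line per item class with the minimum
sample size, the sample actually used, and the resulting error rate and
verdict.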