SUBJECT: Validation English SpeechDat(M) corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 1.0
DATE   : 14 August 1996

The speech databases made within the SpeechDat(M) project were validated by SPEX, Leidschendam, the Netherlands, to assess their compliance with the SpeechDat(M) format and content specifications, as documented in Deliverable 1.4.1 of the project. The validation results of the British English SpeechDat(M) database are contained in this document.

In the validation procedure we systematically check a list of validation criteria for a range of subjects. In the following sections we will evaluate these criteria one by one. Validation results that call for attention are marked by =>.

The following subjects were validated:
 1 DOCUMENTATION
 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
 3 ITEMS
 4 SAMPLED DATA FILES
 5 ANNOTATION FILES
 6 LEXICON
 7 SPEAKERS
 8 RECORDING PLATFORM
 9 TRANSCRIPTION
The document is concluded by
10 SUMMARY

====================================================================
1. DOCUMENTATION

=> There is no textual documentation with the English database, apart from the general design and format specifications for the SpeechDat(M) project. This is deliverable 1.4.1 of the project, which is included as FIXED0EN\DOC\DESIGN.DOC on the CDs. Specific information for English is contained in section 2 of part II of the document (pages 72-77). In addition, there is a README.TXT file in the root listing all speech and label files on the CDs.

- Language of doc file: preferably English
  OK
- Contact person: name, address, affiliation
  In COPYRIGH.TXT
  => Not further documented
- Number of CDs
  => Not documented, but can be derived from README.TXT.
- Contents of each CD
  => Not properly documented
- The directory structure of the CDs
  In section 5.2 of DESIGN.DOC
- List of missing items
  => Not documented
- Speaker demographics
  . which regions, how many of each
  . motivation for selection of regions
  . which age groups, how many of each
  . sexes: males, females, also children?; how many of each.
  => Not documented, but can be derived from FIXED0EN\TABLE\SPEAKER.TBL
  There is some provisional speaker information in section 2.3 of Part II in DESIGN.DOC.
- Reference to a file where speaker characteristics are stored
  => Not provided
- The number of items on the CD and per speaker
  => Not documented, but can be derived from README.TXT
- Naming conventions for directories and files
  In sections 5.2 and 5.3 of DESIGN.DOC
- Prompting
  . linguistic specification (and motivation) for the prompting material
  . connection of sheet items to item numbers on CD
  . sheet example
  . items must be spread over the sheet to prevent list effects (e.g. three yes/no questions immediately after one another are not allowed)
  => Not documented
- Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich sentences (either of phones, biphones, triphones)
  . recommended: at least 2 samples of each phone per caller (should appear from documentation)
  => Not documented
- Recording platform should be specified
  . digital telephone net link
  => Not documented
- Signal characteristics (number of bits per sample; bandwidth; coding type; compression procedures)
  In sections 3 and 3.1 of DESIGN.DOC some general information is given.
- The format and the file header structure of speech files
  In section 5.4 of DESIGN.DOC
- The format and the file header structure of annotation files
  In section 5.5 of DESIGN.DOC
- Annotation
  . procedure
  . quality assurance
  . character set used for annotation (transcription)
  . annotation symbols for non-speech acoustic events must be mentioned at least for [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
  . list of symbols used to denote word interruptions and break-offs
  => Not documented
- Lexicon information
  . Procedures to obtain phonemic forms from orthographic input (lexicon generation and layout)
  .
Overview of SAMPA symbols used (only in this manner it can be checked if the lexicon contains only legal symbols). (Alternatively, D141 refers to the standard SAMPA definitions on a WWW server and this may be sufficient to check against) => Not documented - Transcription manual: TRANSCRIP.DOC (optional) . is it there? . does it contain the relevant information? . What is done with non speech events . What is done with capitals . Only one spelling of each word is allowed Not provided - Only one spelling of each word is allowed. Therefore a list of normalised spellings for words with alternative spellings should be included. Otherwise a statement why such a list is not necessary. => Not provided - Indication of how many of the files were double checked by the producer together with percentage of detected errors => Not provided ========================================================================== 2. DATABASE STRUCTURE CONTENTS AND FILE NAMES - Directory / subdirectory conventions Format of directory tree should be \\\\ . data base: defined as <#> can be FIXED, MOBIL, VERIF <#> is 0 for SpeechDat(M) and 1 for SpeechDat is the ISO two-letter code for the language . volume : is a progressive number specifying the CD containing the material. Defined as CD where is the number. . block : defined as BLOCK where is a progressive number from 00 to 99. Block numbers are unique over all CDs. They could typically be the first two digits of below. . session: defined as SES where is the session code also appearing in file name OK - A README.TXT file should be in the root describing all (documentation) files on the CD-ROM. There is a README.TXT file. => It contains a listing of all speech and label files in the database, but not a description of the files. - A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. 
This file supplies the volume label to UNIX systems that cannot read the physical volume label. Example of contents: FIXED0EN_00. OK - A copyright statement should be present in the file COPYRIGH.TXT (root) OK - Documentation should be in \\DOC => There is no documentation in this directory; only SUMMARY files. - The summary file (SUMMARY.TXT) should be in \\DOC OK - The contents list (CONTENTS.LST) is in \\INDEX OK - Tables should be in \\TABLE OK - Index files (optional) should be in \\INDEX Not provided. - Prompt sheet files (optional) should be in \\PROMPT Not provided. - Any source code supplied should be in \\SOURCE (SAMLIB, V4, and GNU gunzip, version 1.2.4 + licence) OK - The index files (if presented) obey the nomenclature .LST where e.g. A0ENN3.LST (see below for item_code) Not provided - All sessions indicated in the documentation are present on the CDs OK - File naming conventions All file names should obey the following pattern: DDNNNNCC.LLF DD : database identification code For SpeechDat(M): A0 = fixed net, B0 = mobile For SpeechDat : A1 = fixed net, B1 = mobile, C1 = speaker verification NNNN : session code 0000 to 9999 CC : item code; first character is item type identifier, second character is item number LL : ISO-639 language code (with extensions) F : speech file type Z is for A-law, compressed O is for Orthographic label (label file) OK - Correct item codes should be used: I1 : isolated digit C1 : 4 digit id of prompt sheet C2 : ~10 digit telephone number C3 : ~12 digit credit card number N1-3 : 3 natural numbers M1-2 : 2 money amounts L1-3 : 3 spelled words T1 : 1 time of day T2 : 1 time phrase D1-3 : 3 dates Q1-3 : 3 yes/no questions P1 : city of call/birth A1-6 : 6 common application words E1-3 : 3 application word phrases S1-9 : 9 phonetically rich sentences OK - NNNN in filenames is not in conflict with BLOCK and SES numbers in pathname OK - Contents lowest level subdirectories should be of one call only OK - Empty (i.e. 
zero-length) files are not permitted Zero length files were not found among the label files. => However 9 speech files were empty: A01072E2.ENZ A01072E3.ENZ A01072E4.ENZ A01072I1.ENZ A01072L1.ENZ A01075A1.ENZ A01075A2.ENZ A01075A3.ENZ A01075A4.ENZ Since their label files did not contain an empty transcription, something must have gone wrong during copying. - Counts should match information in documentation . count of files in each subdirectory . count grand total => There is no documentation to count against - Missing items per speaker Check with documentation => There is no documentation - File match: For each label file there must be one speech file and vice versa. => There are 10 speech files that have no matching label files: A01003E4.ENZ A01017A5.ENZ A01017A6.ENZ A01017D3.ENZ A01017E3.ENZ A01017E4.ENZ A01017M1.ENZ A01017N3.ENZ A01017Q3.ENZ A01018E4.ENZ It can be seen that 8 of them stem from session 1017. - Part of the corpus should be designed for training and a (typically smaller) part for testing. This is optional. Partitioning is not provided. - The contents of the database as given in CONTENTS.LST should comprise . CD-ROM volume name (VOL:) . full pathname (DIR:) . speech file name (SRC:) . speaker code (SCD:) . speaker sex (SEX:) . speaker age (AGE:) . region of call (REG:) . orthographic transcription of uttered item (LBO:) This file must be supplied as an ASCII delimited file (either using TAB, or commas and (double) quoted strings). OK Session number is used as speaker code. - The contents of the SUMMARY.TXT files should comprise: . The full directory name where speech and label files are to be found . the session number . a string of typically 39 codes. Each item present is represented by its code. If the item is missing, a '--' should appear. . recording date . recording time of first item . optional comment text . all fields are separated by spaces OK, all information is presented Session number is used as speaker code. 
=> Individual items are separated by blanks whereas they should have been connected. => Double quotes are used for the directory name and the date. => The dates should have been in the format DD/Mmm/YYYY, instead of Mmm DD => The times should have been in the format HH:MM:SS, instead of HH:MM => The contents of the SUMMARY.TXT file are not CD-dependent. They are all identical. => There are three sessions with wrong date and time information: \FIXED0EN\CD01\BLOCK12\SES1275 \FIXED0EN\CD01\BLOCK13\SES1310 \FIXED0EN\CD01\BLOCK13\SES1350 ====================================================================== 3. ITEMS - 1 isolated digit (code I1) . read OK - 3 connected digits (code C1-3) - 4 digit number to identify the prompt sheet . read - ~10 digit telephone number . read or spontaneous(?) - ~12 digit credit card number (16 digits would be better) . read . if there is a checksum then formula must be provided . 26 digits per call are required . at least one example per digit per caller . digits must appear numerically on the sheet, not as words OK, all items are present. => We observed 369 sessions for which at least one digit was not printed in the prompts of C1-3. The exact result of our search was: #Calls Nr of digits missing in a call 309 1 58 2 2 3 - 3 natural numbers (code N1-3) . read . provided as numbers . numbers must be < 1,000,000 . one may be a decimal number . one may be a quantity (including a unit of measurement) . sufficient examples of each word to permit training OK - 2 money amounts (code M1-2) . read . currency words should be included . one small amount including decimals and one large amount not including decimals OK - 3 spelled words (code L1-3) . read . equal balance of all vocabulary letters . average length at least 7 letters . may include names, cities and other frequently spelled items . should include equivalents of : A-Z, accent words, CAPITAL, SMALL, UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE, HYPHEN OK - 1 time of day (code T1) . 
spontaneous OK - 1 time phrase (code T2) . read . analogue form . equal balance of all words . should include equivalents of : AM/PM, HALF/QUARTER PAST/TO, NOON, MIDNIGHT, MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY, TOMORROW OK, most of the words occur more than 20 times, but some of them less: afternoon : 30 am : 85 at : 11 eight : 137 eighteen : 19 eleven : 101 evening : 25 fifteen : 59 fifty : 64 five : 295 forty : 58 four : 112 fourteen : 22 half : 15 in : 134 midday : 3 midnight : 1 minute : 14 minutes : 533 morning : 79 night : 11 nine : 107 nineteen : 18 noon : 3 o'clock : 22 one : 119 past : 350 pm : 74 quarter : 28 seven : 122 seventeen : 19 six : 114 sixteen : 12 ten : 149 the : 134 thirteen : 11 thirty : 65 three : 109 to : 351 today : 215 tomorrow : 226 twelve : 103 twenty : 303 two : 116 yesterday : 214 - 1 date (code D1) . spontaneous - 2 dates (code D2-3) . read, wordstyle . analogue form . covering all weekdays and months OK and well distributed. - 3 yes/no questions (code Q1-3) . spontaneous, not prompted . balance between yes/no OK - city of call/birth (code P1) . preferably spontaneous; read is permitted OK - 6 common application words (code A1-6) . set of 50 should be defined . 39 are fixed for all partners, see Appendix A Del 1.4-1 . read OK, see 2. - 3 application word phrases (code E1-3) . application word is embedded in phrase . read or spontaneous OK - 9 phonetically rich sentences (code S1-9) . read . recommended: at least 2 samples of each phone per caller (should appear from documentation) OK In conclusion, all obligatory items were recorded. There is one additional item: an extra application word phrase coded as E4. 2. Application words In appendix A of SpeechDat deliverable 1.4-1 a list of 39 obligatory application words is provided. This list is repeated in section 2.2.10 of Part II of DESIGN.DOC. It was found that most words were present and in sufficient quantities (over 130 tokens). 
Some words, however, that should have been there according to the deliverable were absent. These words were: number, announcement, information, operator. A full listing of the words and their observed frequencies follows below.

activate         : 140
again            : 136
answering machine: 148
back             : 130
call             : 138
cancel           : 143
conference       : 147
continue         : 136
delete           : 144
dial             : 136
end              : 135
external         : 140
forward          : 140
gate             : 143
hash             : 145
help             : 145
internal         : 142
last             : 139
menu             : 144
message          : 137
messages         : 135
next             : 145
off              : 142
on               : 132
pause            : 133
phone            : 142
play             : 141
previous         : 142
program          : 137
record           : 129
redial           : 134
repeat           : 140
replay           : 138
rewind           : 143
save             : 148
skip             : 138
star             : 138
stop             : 137
store            : 145
switch off       : 135
switch on        : 144
telephone        : 134
transfer         : 136

3. Incidentally missing items

a. files that are not there

We found 7 missing files, one of which belonged to the optional item E4. Concentrating on the mandatory items only, we concluded that 1 call missed 1 item, 1 call missed 2 items, and yet another call missed 3 items, yielding a total of 6 missing files.
SES1018 missed 1 item (S9);
SES0010 missed 2 items (N3, S9);
SES1017 missed 3 items (S7, S8, S9).
Note that SES1017 was also the session that accounted for 8 of the 10 speech files without a matching label file (see section 2).

b. files with empty transcriptions in the LBO label field

Apart from files physically missing in the database there are also files that do not contain the targeted speech. We have looked at the transcriptions. If a transcription contains no speech, but only noise events, then the file can be considered effectively missing. In this way we found 1180 speechless files, 42 of which were of the optional item E4. Concentrating on the obligatory items only, we computed the following statistic.

Freq.   Nr of items missing in a call
 236     1
 141     2
  69     3
  30     4
  17     5
  12     6
  10     7
   1     8
   4     9
   1    10
   1    12

which adds up to 1138 mandatory items missing in total.
The calls missing 10 or more items are SES0076 (10 items) and SES0063 (12 items).
The calls missing 9 items are SES0015 SES0793 SES0875 SES1236

4. Overall conclusion

SpeechDat has the following criteria for missing items:
- 85% (850) out of 1000 calls must be complete
  . A maximum of 10% (100) of the calls may miss up to 3 mandatory items
  . A maximum of 5% (50) of the calls may miss more items
  (A complete call is one with all speech files recorded for all prompt items)

There are no structurally missing items in the English database. Looking at the incidentally missing items we find 3 calls missing up to three items, and no calls that miss more items. This is well within the SpeechDat limits.
If we also take into account the files that we consider effectively missing, i.e. the files with no target speech in their transcriptions (see 3b), and add these to the above-mentioned, then we find a total of 449 calls missing up to three obligatory items, and 76 calls missing more. Both figures exceed the SpeechDat limits of 100 and 50 calls, respectively.
There may also be other files that are effectively missing (corrupted speech files). These are dealt with in the next section.

===========================================================================
4. SAMPLED DATA FILES

1 File structure
  . SAM
  OK

2 Coding
  . A-law, 8 bit, 8 kHz
  . Compression by GZIP
  OK

3 Sample distribution
Several sample distributions are checked:

3.1 File length
We calculated the length of each file in seconds in order to trace spurious recordings of extraordinary length.
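The length calculation can be sketched as follows. This is only an illustration, assuming that a decompressed speech file holds raw 8-bit A-law samples without a header, so that one byte corresponds to one sample at 8 kHz. The function names and the example sample counts are hypothetical and are not taken from the validation software.

```python
from collections import Counter

SAMPLE_RATE_HZ = 8000  # 8 kHz, one byte per A-law sample (assumption: no file header)


def duration_seconds(n_samples, sample_rate=SAMPLE_RATE_HZ):
    """Length of a recording in seconds, given its number of samples."""
    return n_samples / sample_rate


def duration_histogram(sample_counts, sample_rate=SAMPLE_RATE_HZ):
    """Count files per one-second interval [k, k+1), keyed by k."""
    bins = Counter(int(duration_seconds(n, sample_rate)) for n in sample_counts)
    return dict(sorted(bins.items()))


# Hypothetical sample counts for four files (not taken from the corpus).
example_counts = [0, 20000, 28000, 240000]
print(duration_histogram(example_counts))
```

Zero-length files end up in the first interval, which makes them easy to spot alongside the very long recordings.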
Duration distribution over all items (including E4):

Length (s)   #Occurrences
 0 -  1 :       9
 2 -  3 :     387
 3 -  4 :    2956
 4 -  5 :    4055
 5 -  6 :    5468
 6 -  7 :    6009
 7 -  8 :    4804
 8 -  9 :    3461
 9 - 10 :    2287
10 - 11 :    1662
11 - 12 :    1210
12 - 13 :     857
13 - 14 :     605
14 - 15 :     502
15 - 16 :     374
16 - 17 :     291
17 - 18 :     243
18 - 19 :     213
19 - 20 :     172
20 - 21 :     147
21 - 22 :     136
22 - 23 :     103
23 - 24 :      93
24 - 25 :      68
25 - 26 :      85
26 - 27 :      72
27 - 28 :      54
28 - 29 :      53
29 - 30 :    2655
30 - 31 :     952

Duration distribution per call:

Length (s)   #Occurrences
 4 -  5 :       1
 5 -  6 :     106
 6 -  7 :     303
 7 -  8 :     196
 8 -  9 :     106
 9 - 10 :      60
10 - 11 :      41
11 - 12 :      24
12 - 13 :      19
13 - 14 :      18
14 - 15 :      12
15 - 16 :       6
16 - 17 :       5
17 - 18 :       9
18 - 19 :       3
19 - 20 :       7
20 - 21 :       1
21 - 22 :       1
22 - 23 :       3
23 - 24 :       1
24 - 25 :       2
25 - 26 :      11
26 - 27 :       5
27 - 28 :       3
28 - 29 :       5
29 - 30 :      52

There is a very large number of very long files. A total of 4418 files have a duration of more than 20 s, and there are 84 calls with an average duration of 20 s or more. It appears that all these long calls suffer from background noise, because of which the silence detection of the platform does not succeed in stopping the recording.
=> This means that at least 84 calls have a very high level of background noise. It also follows that at least 4,500 files contain mainly silence/noise with a bit of speech at the beginning.

3.2 min-max samples
We provide a histogram of clipping ratios. The clipping ratio of a file is defined as the number of samples in the file that are equal to the maximum or minimum sample value, divided by the total number of samples in the file. The histogram shows how many files fall within each clipping rate interval.
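The clipping ratio defined above can be sketched as follows. This is an illustration under assumptions: the samples are taken to be A-law values expanded to 16-bit linear PCM, whose largest magnitude is 32256, and the function name and example values are hypothetical.

```python
A_LAW_PEAK = 32256  # largest magnitude after A-law expansion to 16-bit linear PCM


def clipping_rate_percent(samples, peak=A_LAW_PEAK):
    """Percentage of samples that sit at the positive or negative extreme."""
    if not samples:
        return 0.0  # an empty file has no samples to clip
    clipped = sum(1 for s in samples if abs(s) >= peak)
    return 100.0 * clipped / len(samples)


# Hypothetical five-sample signal in which two samples reach the extremes.
print(clipping_rate_percent([0, 100, 32256, -32256, 5]))
```

The per-call figure would then be the mean of this value over all files of one session.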
Clip distribution for all items (including E4):

Clipping rate (in %)   #Occurrences
0.0 - 0.1 :    3590
0.1 - 0.2 :     397
0.2 - 0.3 :     175
0.3 - 0.4 :      67
0.4 - 0.5 :      67
0.5 - 0.6 :      41
0.6 - 0.7 :      21
0.7 - 0.8 :      19
0.8 - 0.9 :      10
0.9 - 1.0 :       4
1.0 - 1.1 :       3
1.1 - 1.2 :       5
1.3 - 1.4 :       2
1.4 - 1.5 :       2
1.8 - 1.9 :       1
2.6 - 2.7 :       1
3.3 - 3.4 :       1
Number of files with absolute maximum < 32256: 35577

Clip distribution per call:

Clipping rate (in %)   #Occurrences
0.0 - 0.1 :     423
0.1 - 0.2 :      13
0.2 - 0.3 :       5
0.3 - 0.4 :       1
0.5 - 0.6 :       1
0.6 - 0.7 :       1
Number of directories with absolute maximum < 32256: 556

We observed 8 calls with a mean clipping rate of 0.2% or higher. The speech parts in all of these calls are characterised by severe clipping. Most files contain large silence/noise portions. Most files in calls with a mean clipping rate between 0.1% and 0.2% are also severely clipped in their speech portions. The calls concerned are:

Clipping rate (in %)   Session
0.20                   SES0048
0.17                   SES0126
0.19                   SES0158
0.18                   SES0183
0.63                   SES0281
0.18                   SES0291
0.24                   SES0398
0.17                   SES0508
0.14                   SES0558
0.59                   SES0660
0.11                   SES0697
0.26                   SES0707
0.37                   SES0765
0.11                   SES0924
0.16                   SES0952
0.12                   SES0953
0.14                   SES1026
0.16                   SES1069
0.23                   SES1080
0.10                   SES1162
0.27                   SES1259

In general, files with a clipping rate higher than 0.1% must be regarded as spurious.

3.3 Mean values
We computed the mean sample value of each item in each call. We provide a histogram of the mean values below. The histogram shows how many files fall within each interval of mean sample values. This overview can be used to trace files with large DC-offsets.
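The mean sample value and its histogram interval can be sketched as follows. This is a minimal illustration, assuming linear sample values and intervals of width 10; the function names and example values are hypothetical.

```python
import math


def mean_sample_value(samples):
    """Arithmetic mean of the samples; a value far from 0 suggests a DC offset."""
    if not samples:
        return 0.0
    return sum(samples) / len(samples)


def histogram_interval(value, width=10):
    """Lower and upper edge of the interval [lo, lo + width) that contains value."""
    lo = math.floor(value / width) * width
    return lo, lo + width


# Hypothetical three-sample signal.
mean = mean_sample_value([10, 20, 30])
print(mean, histogram_interval(mean))
```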
Mean distribution over all items (including E4):

Mean            #Occurrences
-110 - -100 :       1
-100 -  -90 :       2
 -90 -  -80 :       1
 -70 -  -60 :      42
 -60 -  -50 :      12
 -50 -  -40 :      46
 -40 -  -30 :      75
 -30 -  -20 :     317
 -20 -  -10 :     643
 -10 -    0 :    1236
   0 -   10 :   27402
  10 -   20 :    5846
  20 -   30 :    1424
  30 -   40 :    1647
  40 -   50 :     837
  50 -   60 :     285
  60 -   70 :     101
  70 -   80 :      26
  80 -   90 :      40

Mean distribution per call:

Mean            #Occurrences
 -70 -  -60 :       1
 -40 -  -30 :       4
 -30 -  -20 :       6
 -20 -  -10 :      16
 -10 -    0 :      33
   0 -   10 :     687
  10 -   20 :     146
  20 -   30 :      34
  30 -   40 :      40
  40 -   50 :      22
  50 -   60 :       7
  60 -   70 :       2
  70 -   80 :       1

There are no extreme mean sample values in the files, so we have no further diagnostics here.

3.4 Signal to Noise Ratio
We split each signal file into contiguous windows of 10 ms and computed the Mean Square (energy) in each window. The mean sample value over the complete file was subtracted from each individual sample value before the MS was computed. The 5% of the windows with the lowest energy were assumed to contain line noise. The signal to noise ratio of each file was then calculated by dividing the mean energy over all windows by the mean energy of these 5% lowest-energy windows. The result was converted to decibels by taking 10*log10 of this ratio.

SNR distribution over all items (including E4):

SNR (dB)     #Occurrences
  0 -   5 :     403
  5 -  10 :     628
 10 -  15 :     808
 15 -  20 :    1206
 20 -  25 :    1781
 25 -  30 :    3171
 30 -  35 :    5524
 35 -  40 :    8179
 40 -  45 :    9422
 45 -  50 :    6232
 50 -  55 :    2133
 55 -  60 :     388
 60 -  65 :      60
 65 -  70 :      24
 70 -  75 :      10
 75 -  80 :       3
 80 -  85 :       6
 85 -  90 :       3
 95 - 100 :       1
105 - 110 :       1

There were 9 files with SNR = 0.0. These were the 9 empty zero-length files mentioned in section 2 above.
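The SNR procedure described in section 3.4 can be sketched as follows. This is an illustration and not the validation software itself: the function name and parameters are hypothetical, incomplete trailing windows are dropped, and the return values for empty files (0.0) and for a noise floor of exactly zero (infinity) are assumptions.

```python
import math


def snr_db(samples, sample_rate=8000, window_ms=10, noise_fraction=0.05):
    """Estimate the SNR in dB from the energies of contiguous 10 ms windows."""
    win = int(sample_rate * window_ms / 1000)  # 80 samples at 8 kHz
    if len(samples) < win:
        return 0.0  # assumption: empty or very short files get SNR 0.0
    mean = sum(samples) / len(samples)
    centred = [s - mean for s in samples]  # remove the DC offset first
    energies = [
        sum(x * x for x in centred[i:i + win]) / win
        for i in range(0, len(centred) - win + 1, win)
    ]
    # The lowest-energy 5% of the windows are taken to be line noise.
    n_noise = max(1, int(len(energies) * noise_fraction))
    noise = sum(sorted(energies)[:n_noise]) / n_noise
    signal = sum(energies) / len(energies)
    if signal == 0.0:
        return 0.0
    if noise == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


# Hypothetical signal: 19 loud windows (+/-100) and one quiet window (+/-10).
loud = [100, -100] * 40
quiet = [10, -10] * 40
print(round(snr_db(loud * 19 + quiet), 2))
```

Note that the numerator is the mean energy over all windows, noise windows included, as in the description above.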
A01072E2.ENZ A01072E3.ENZ A01072E4.ENZ A01072I1.ENZ A01072L1.ENZ A01075A1.ENZ A01075A2.ENZ A01075A3.ENZ A01075A4.ENZ SNR distribution over calls: SNR #occurrences 0 - 5 : 1 5 - 10 : 8 10 - 15 : 13 15 - 20 : 29 20 - 25 : 40 25 - 30 : 80 30 - 35 : 149 35 - 40 : 270 40 - 45 : 257 45 - 50 : 120 50 - 55 : 28 55 - 60 : 3 60 - 65 : 1 70 - 75 : 1 The sessions with a mean SNR below 10 dB were: Session SNR SES0011 5.5 SES0016 8.5 SES0175 6.0 SES0241 9.0 SES0755 5.5 SES0944 7.5 SES0964 9.0 SES1091 4.0 SES1134 9.5 All these sessions were inspected by ear and found to contain weak speech recordings, accompanied by a buzz. Of the 13 calls having a mean SNR between 10 and 15 dB 5 calls were inspected by listening. Also these calls were characterised by very low speech intensity level. =========================================================================== 5. ANNOTATION FILE - File empty? OK, there are no empty files. - Mandatory (SAM) mnemonics: LHD: V5.0 DBN: SPEECHDAT(M)_ VOL: FIXED0_ SES: DIR: SRC: CCD: REP: RED: RET: SAM: 8000 BEG: END: SNB: 1 SBF: SSB: 8 QNT: A-LAW CMP: GZIP, 1.2.4 SCD: SEX: male/female/unknown ! SEX and AGE may also only appear in (one letter) ! in speaker table if SCD is provided ! in label file AGE: ! mnemo is not SAM REG: LBD: LBR: , , [gain], [minimum value], [maximum value], LBO: , , , EXT: [if needed for LBR and LBO, > 80 char] ELF: . LHD and TYP are first . LBR and LBO come after LBD . ELF is end of file keyword . no line may exceed 80 chars => The obligatory mnemonics for recording date (RED) and recording time (RET) are systematically missing. => We found that in total 1465 lines in the label files exceeded the maximum length. Most of them were 81 characters (259 occurrences) or 82 characters long (1197 occurrences). All 1465 occurrences were found in LBR (prompt text) fields, with 4 exceptions found for the LBO fields. => 172 files have an empty line between LBO and LBR. This only occurs for sentence items. 
=> In the LBR and LBO fields lines are terminated within words and then continued on the next line with the remaining part of the chopped word (preceded by the mnemonic EXT). - Optional (SAM) mnemonics (may be omitted or left empty) TYP: orthographic TXF: CMT: NCH: 1 ARC: ! mnemo is not SAM SHT: ! mnemo is not SAM EXP: SYS: DAT: SPA: PHM: ! mnemo is not SAM ACC: ! mnemo is not SAM NET: fixed/gsm ... ! mnemo is not SAM EDU: ! mnemo is not SAM SOC: ! mnemo is not SAM PCF: RCC: ENV: ASS: ! mnemo is not SAM None of the optional mnemonics is used. - All mnemonics should be SAM mnemonics or explicitly defined in documentation OK - No illegal mnemonics used OK - There are no mnemonics missing => The obligatory mnemonics RED and RET are missing. - All files must contain the same mnemonics. This holds as well for the optional mnemonics. OK - Each lowest subdirectory does not refer to multiple sheet ids. OK - For spontaneous speech LBR should be left blank or contain a mnemonic word (like ). OK, LBR is empty in these cases. - Obligatory and optional label mnemonics not provided in the label files should be provided in the file `CONTENTS.LST' from which this information can be derived (and added to the label file by the validating institute, if necessary). OK, not applicable. - Transliterations only in lower case letters, also at sentence beginning Only exception: proper names and spelled words, ZIP codes, acronyms and abbreviations In the latter case blanks should be used in between the letters. German is the only exception to this convention. => Capitals are used for spelled letters only. All other words (names!) are in full lower case. 
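The mnemonic and line-length checks of this section can be sketched as follows. This is a minimal illustration, assuming that every record of a label file is one line of the form "MNE: value". The tuple of required mnemonics is a subset of the mandatory list above (SEX and AGE are left out because they may be stored in the speaker table instead), and the function name and example text are hypothetical.

```python
MAX_LINE_LENGTH = 80

# Subset of the mandatory mnemonics listed in this section.
REQUIRED = (
    "LHD", "DBN", "VOL", "SES", "DIR", "SRC", "CCD", "REP", "RED", "RET",
    "SAM", "BEG", "END", "SNB", "SBF", "SSB", "QNT", "CMP", "SCD",
    "LBD", "LBR", "LBO", "ELF",
)


def check_label_file(text, required=REQUIRED, max_len=MAX_LINE_LENGTH):
    """Return (missing mnemonics, [(line number, length)] of overlong lines)."""
    lines = text.splitlines()
    present = {line.split(":", 1)[0].strip() for line in lines if ":" in line}
    missing = [m for m in required if m not in present]
    too_long = [
        (number, len(line))
        for number, line in enumerate(lines, start=1)
        if len(line) > max_len
    ]
    return missing, too_long


# Hypothetical label file fragment with an 85-character LBO line.
example = "LHD: V5.0\nSAM: 8000\nLBO: " + "x" * 80
print(check_label_file(example, required=("LHD", "RED", "RET", "LBO")))
```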
- Punctuation marks should not be used in the transliterations OK - Digits must appear in full orthographic form OK - There may be no digits in the transcription OK - In principle only the following symbols are allowed to indicate non-speech acoustic events: [Filled_Pause] [Speaker_Other] [Nonspeaker_Other] Other symbols (and language equivalents) must be mentioned in the documentation The following symbols were used for non-speech acoustic events: ah art ay back ground talk background talk bark beep blame bmm breasthing breath breath_noise breathing buzz chair_squeak child noise child shout click clock clock chimes clock_chimes cough cross_talk cry crying dd dog dog-bark dog_bark door_bell door_slam eating crisps eb eh em er erm errr ff grunt how hum jj laughter lip_smack loud breath loud_breath luxur med message mm muffled mumbling music nn noise oh oo paper rustle paper_rustle phone_hang_up phone_ring plea shh sigh singing sniff ss swallow tap th throat clear throat_clear tongue_click tt tw uh um unintelligible washing_machine woooh yawn The markers for non-speech acoustic events are written between square brackets. => The nature of each marker is unknown. Some are probably misspellings (breasthing, ss, th, tt). Also the use of underscores is not consistent. It seems that stretches of affected speech are put between [/ /] (we are not sure because of lacking documentation). However, the SpeechDat(M) specifications prescribe that stretches of such events should be indicated by placing the marker before each individual word. => The meaning of < in [ in this context. - Asterisks should be used to indicate mispronunciations OK - Tildes should be used to indicate truncations OK - According to a spelling check on annotated text (including bracket check) up to 1% errors may be found => There is no documentation on this. We had no time to validate this item ourselves. However, we did carry out a bracket check. 
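A bracket check of this kind can be sketched as follows. This is an illustration only: it assumes that markers are written between square brackets and are never nested, and it does not model the [/ /] stretch notation, whose exact form is not documented. The function name and example strings are hypothetical.

```python
def brackets_balanced(transcription):
    """True if every '[' is closed by a ']' and brackets are not nested."""
    depth = 0
    for ch in transcription:
        if ch == "[":
            depth += 1
            if depth > 1:  # a second '[' before the first one was closed
                return False
        elif ch == "]":
            depth -= 1
            if depth < 0:  # a ']' without a preceding '['
                return False
    return depth == 0


# Hypothetical transcriptions.
for text in ("[cough] yes", "[cough yes", "no cough]"):
    print(repr(text), brackets_balanced(text))
```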
We found 41 instances of unbalanced square brackets (taking into account that stretches of background noise should be enclosed between [/ /]). These unbalanced cases were found in the transcription fields of the following files:

A00011D1.ENO
A00020A2.ENO
A00076A2.ENO
A00135E3.ENO
A00192A1.ENO
A00239L1.ENO
A00264D3.ENO
A00379P1.ENO
A00501E3.ENO
A00501L1.ENO
A00533M1.ENO
A00543C3.ENO
A00543P1.ENO
A00565Q3.ENO
A00612E2.ENO
A00687A5.ENO
A00719N3.ENO
A00774D1.ENO
A00777P1.ENO
A00847Q1.ENO
A00914E2.ENO
A00921E2.ENO
A00921Q2.ENO
A00940M1.ENO
A00997Q3.ENO
A01000D1.ENO
A01046L1.ENO
A01092M2.ENO
A01162T1.ENO
A01188N2.ENO
A01215A5.ENO
A01231A3.ENO
A01231L3.ENO
A00239S6.ENO
A00247S9.ENO
A00501S5.ENO
A00658S2.ENO
A00764S2.ENO
A00904S6.ENO
A01262S4.ENO
A01352S6.ENO

- A comparison (of some sort) of prompted with spoken text will be carried out to check if they match.
  Not done
- The label files are associated with the correct speech files. (This cannot be done automatically at this moment. We can only point at files that are incidentally found as mismatched during the transcription and/or speech file validation)
  No mismatches were observed during validation.
- Assessment of speech items in terms of SNR, presence of additional noise, adherence to prompting text is provided (optional)
  Not provided

========================================================================
6. LEXICON

- Check lexicon existence
  OK
- Lexicon contents should be taken from actual utterances (from LBO)
  OK
- The entries should be alphabetically ordered
  OK
- In transcriptions only SAMPA symbols are allowed
  OK
- Capitals only in proper names, spelled words, and in single letters derived from abbreviations (exception: German)
  => Capitals were used for spelled letters only. Names are in full lower case.
- Phoneme symbols must be separated by blanks
  OK
- A line in the lexicon should have the following format [ ] []
  OK
- Alternative transcriptions are optional.
They may follow the first transcription, separated by [TAB] or have a separate entry (only in case also frequency information is supplied) OK - Orthographic entries are as a rule split by apostrophes, but not by dashes. There are neither apostrophes nor dashes in the lexicon entries. => According to the SpeechDat specifications for English (section 5.9.5 of DESIGN.DOC) all words containing hyphens and apostrophes should be retained. Since this rule was not kept, it is impossible to retrieve words like "didn't" from the lexicon. => But also the entries 'didnt', 'dont', 'hadnt', 'isnt', 'oclock', as found in the transcriptions, yet cannot be found in the lexicon. See the list of missing words below. - The lexicon should be complete . Check for undercompleteness (are all words in lexicon) => We checked whether all words in the transcriptions could also be found in the lexicon. We found 434 missing entries. However, lots of these were obvious misspellings, and were found only once in the transcriptions. It is clear that such entries should have been corrected, but that indeed they do not belong in the lexicon. => If we exclude words that were found only once, then we still end up with 154 missing entries (which may still include misspellings, of course). 
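The completeness check described above can be sketched as follows. This is a minimal illustration under assumptions: markers between square brackets are removed before counting, words flagged with an asterisk or tilde are skipped, and words are separated by blanks. The function name and the example data are hypothetical.

```python
import re
from collections import Counter


def lexicon_coverage(transcriptions, lexicon_words, min_count=2):
    """Return (missing, unused): transcribed words absent from the lexicon
    that occur at least min_count times, and lexicon words never transcribed."""
    counts = Counter()
    for text in transcriptions:
        text = re.sub(r"\[[^\]]*\]", " ", text)  # drop [non-speech] markers
        for token in text.split():
            if "*" in token or "~" in token:  # mispronounced or truncated word
                continue
            counts[token] += 1
    lexicon = set(lexicon_words)
    missing = {
        word: n
        for word, n in sorted(counts.items())
        if word not in lexicon and n >= min_count
    }
    unused = sorted(lexicon - set(counts))
    return missing, unused


# Hypothetical transcriptions and lexicon entries.
transcriptions = ["[cough] dont stop", "dont *stpo", "play it"]
print(lexicon_coverage(transcriptions, {"stop", "play", "rewindx"}))
```

Setting `min_count=1` reports every missing word, including those that occur only once.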
These are listed below, with their frequency of occurrence: [eating: 2 a: 2006 alfs: 11 alices: 12 amelia: 13 apostrophy: 4 bachs: 13 backpack: 2 beretta: 14 bla: 3 blandford: 5 blueish: 3 blythe: 11 bookend: 2 bournmouth: 7 brie: 11 bub: 11 c: 2 cashbox: 16 chablis: 10 clamshell: 11 cleopatra: 10 clueless: 2 cordless: 2 couldnt: 29 creaminess: 3 criss: 14 cuthbert: 8 dartboard: 2 deluxe: 2 didnt: 28 dont: 62 dutyfree: 2 dwarves: 3 e: 4 eggcup: 3 eigth: 17 engorge: 2 spadrilles: 3 ethanol: 2 etiquate: 2 eyedrops: 14 flightdeck: 2 fourty: 42 fructose: 10 fundraising: 16 h: 3 hadnt: 13 halloween: 30 hallway: 23 hed: 26 hoyle: 18 hypothermia: 15 i: 438 i'm: 8 i`m: 2 irvings: 11 isnt: 15 itemise: 13 ive: 31 jeffs: 30 jennifers: 12 jims: 11 johns: 11 joshuas: 12 liecester: 3 litchfield: 17 m: 12 mayan: 13 megawatt: 14 meow: 12 middlesborough: 3 mouthe: 11 mr: 75 mu: 2 mugicians: 2 multimedia: 15 n: 2 nancys: 16 needlepoint: 10 neoclassical: 14 neurological: 13 nieteen: 2 nineten: 2 nineth: 25 ninteen: 5 ninteenth: 58 ninty: 4 noras: 14 o: 8 o'clock: 18 oclock: 11 olympia: 15 p: 4 parenthood: 14 pauls: 10 perrier: 12 phils: 13 pizzerias: 3 pong: 14 poundsd: 2 pushy: 3 rationalise: 10 redial: 131 rewind: 139 rons: 14 ruc: 4 s: 5 saabs: 13 sandras: 10 screenplay: 3 seaserpents: 2 secretarys: 15 sh: 2 sh: 2 shes: 2 slipstitch: 3 snakeskin: 13 soysauce: 15 sportiness: 2 sprog: 2 ss: 2 sssay: 2 ssseven: 3 st: 2 standardised: 12 superstore: 14 t: 7 tantalise: 3 teardrop: 3 ternal: 2 thats: 15 therell: 13 theyre: 20 throughfare: 2 tofu: 14 toiletries: 2 topspin: 2 tranquillising: 13 trapp: 10 traumatize: 2 trish: 26 tunafish: 12 twelth: 89 ty: 2 vegan: 3 von: 10 wacky: 3 wasnt: 12 westchester: 14 youll: 44 youre: 2 youve: 10 zagged: 11 zig: 11 These accounts confirm the idea that the transcriptions contain quite some spelling errors (see also section 9). . 
Check for overcompleteness (invalid words have a * and should not be in the lexicon) (the same goes for words truncated due to a recording error; this is indicated by ~) => We also checked for words that are in the lexicon but that are never used in the transcriptions. We found 255 entries. Of course, undercompleteness of the lexicon is worse than overcompleteness. - Optional information: stress, word/morphological/syllabic boundaries. But, if provided, then it should follow the SpeechDat conventions. Not provided ========================================================================== 7. SPEAKERS - Speaker database file . check existence OK - Allowed formats: a. SAM mnemonics b. record file with commas as field separators and strings between double quotes OK, option b. was chosen. - Obligatory information: SAM: 1. unique number (speaker/caller) SCD (or less preferably SES) 2. sex SEX 3. age AGE 4. region of call REG OK, => but region of call is absent. Session number is used as speaker code. - Optional information: . height HET . weight WET . native language NLN . accent ACC . ethnic group ETH . education level EDL . smoking habits SMK . pathologies PTH . socio-economic status SOC Education level is listed, and to the right of it another code which we do not understand. - Balance of sexes . How many males, how many females, should match specification in documentation file . Imbalance may not exceed 5% The following sex distribution was computed: Females: 543 = 54.30% Males : 457 = 45.70% Each sex deviates 4.30 percentage points from an even split, so the imbalance does not exceed 5%. - Balance of regions . 
which regions and how many of each should match specification in documentation file
The following statistic on the regions of call was computed:
  0: 5 = 0.50 %
  Bath: 11 = 1.10 %
  Birmingham: 43 = 4.30 %
  Blackburn: 10 = 1.00 %
  Blackpool: 19 = 1.90 %
  Bournemouth: 31 = 3.10 %
  Brighton: 5 = 0.50 %
  Bristol: 37 = 3.70 %
  Canterbury: 19 = 1.90 %
  Carlisle: 7 = 0.70 %
  Chelmsford: 7 = 0.70 %
  Coventry: 44 = 4.40 %
  Dartford: 5 = 0.50 %
  Dorchester: 27 = 2.70 %
  Dudley: 18 = 1.80 %
  Durham: 37 = 3.70 %
  Enfield: 18 = 1.80 %
  Gloucester: 10 = 1.00 %
  Harrow: 7 = 0.70 %
  Hull: 11 = 1.10 %
  Ilford: 10 = 1.00 %
  Leeds: 11 = 1.10 %
  Leicester: 21 = 2.10 %
  Liverpool: 18 = 1.80 %
  London: 33 = 3.30 %
  Manchester: 45 = 4.50 %
  Medway: 17 = 1.70 %
  Newcastle: 116 = 11.60 %
  Nottingham: 70 = 7.00 %
  Oldham: 13 = 1.30 %
  Portsmouth: 65 = 6.50 %
  Preston: 20 = 2.00 %
  Sheffield: 33 = 3.30 %
  Southampton/Isle of Wight: 17 = 1.70 %
  Southend: 10 = 1.00 %
  Stockport: 8 = 0.80 %
  Stoke on Trent: 5 = 0.50 %
  Sutton: 8 = 0.80 %
  Swindon: 15 = 1.50 %
  Taunton: 5 = 0.50 %
  Tonbridge: 18 = 1.80 %
  Walsall: 35 = 3.50 %
  Watford: 7 = 0.70 %
  Wolverhampton: 29 = 2.90 %
This cannot be further validated since there is no documentation to check against. The listing of places given in section 2.3.1 of Part II of DESIGN.DOC differs considerably from the list above.
- Balance of ages
  . which age groups and how many of each should match specification in documentation file
  . A minimum of 20% of speakers must be in the following age groups: 17-30, 31-45, 46-60. A maximum of 40% of speakers may be younger than 17 or older than 60.
The following distribution of ages was computed:
  under 17: 1 = 0.10%
  17 - 30 : 255 = 25.50%
  31 - 45 : 309 = 30.90%
  46 - 60 : 262 = 26.20%
  over 60 : 173 = 17.30%
Thus the age distribution is well in agreement with the SpeechDat specifications.
=======================================================================
8.
RECORDING CONDITIONS - Digital telephone line => Not documented - A-law coding OK - Specification of wireless telephone or not (optional) Not provided - Time stamps on file => Not known - Recording information may be stored in a separate file (optional) - this file may have two formats: a. SAM mnemonics b. record table with commas as field separators and strings between double quotes - The primary key in the label file is the RCC mnemonic - name of file: TABLE\REC_COND.SAM or TABLE\REC_COND.TBL - Information: SAM: . recording conditions code RCC . region of call REG . telephone area code ARC . environment ENV . telephone model PHM . telephone network NET . recording city CTY . recording car CAR . speed SPD . fan noise FAN . ground type GRD . wipes WIP Not provided ============================================================================= 9. TRANSCRIPTION This validation is carried out by taking 5% of the short items and 5% of the long items in the corpus. The transcriptions in the label files for these samples are checked by listening to the corresponding speech files. This check is performed by native speakers of the language involved. Short items are: - isolated digit - time phrases - date phrases - yes/no questions - place name - application words Long items are: - connected digits - natural numbers - money amounts - spelled words - application phrases - phonetically rich sentences A generally observed phenomenon is the omission of apostrophes in contractions like isn't, don't, hasn't. The variants with apostrophes do occur, but only in a minority of cases. During transcription validation, the forms without apostrophe were taken as the correct forms, and were therefore not corrected. If we had counted these as errors as well, we would have found many more transcription errors than reported below. We will now turn to the transcription validation itself. - The evaluation comprises the following criteria . did the speaker actually speak the transliterated words . 
did the speaker speak the prompted text . is transliteration of non-speech acoustic events correct . speech quality, line quality . up to 5% transcription errors are allowed - Abbreviations may only be used if spoken as such A random selection of 1143 long items and 831 short items was used for the transcription validation of the English database. A. Long items In 545 of the 1143 checked items a correction was considered necessary. By far the majority of corrections (484) were related to the transcription of non-speech acoustic events. There were 35 corrections in the transcription itself and 24 combined corrections (transcription AND non-speech acoustic events). In addition, 2 typing errors were found. A total of 545 errors on a total of 1143 checked items yields an error rate of 47.68%. Serious errors concerning the transcription itself were observed in 59 cases, yielding an error rate of 5.16%, which is slightly above the 5% criterion. B. Short items In 154 of the 831 checked items a correction was considered necessary. Most corrections (111) were related to the transcription of non-speech acoustic events. There were 36 corrections in the transcription itself and 2 combined corrections (transcription AND non-speech acoustic events). In addition, 5 typing errors were found. A total of 154 errors on a total of 831 checked items yields an error rate of 18.53%. Serious errors concerning the transcription itself were observed in 38 cases, yielding an error rate of 4.57%, which is just below the 5% criterion. In general, we conclude that the non-speech acoustic events are not captured well by the transcriptions, whereas the transcription of the target speech is of appreciably good quality. Further, the short items appear to be better transcribed than the long ones, especially with respect to the background acoustics. ========================================================================== 10. 
SUMMARY Below we give a brief overview of our findings with respect to the English SpeechDat(M) database. In general, the documentation of the database is very sparse. The main documentation file is not present. The directory structure and file names are OK. All obligatory items were recorded. There is one additional item: E4. There are hardly any missing files. The speech files are in the proper format (A-law, 8 bit, gzipped). There is a very large number of very long files. A total of 4418 files has a duration of more than 20 s, and there are 84 calls with an average duration of 20 s or more. The lexicon is in the correct format. It contains transcriptions in the required SAMPA phoneme symbols. 154 words from the transcriptions could not be retrieved in the lexicon. The speakers are well balanced with respect to age and sex. The targeted speech is transcribed reasonably well, but the non-speech acoustic events are less well transcribed. A somewhat more detailed account follows below. The subsections follow the order of the various topics in the previous sections of the report. 1. Documentation Due to the absence of the main documentation file there is no clear textual overview of the speaker demographics of the database, the lexicon contents and generation, the (generation of the) prompt sheets, the recording platform used, and the annotation and transcription procedure. 2. Database structure and file names The README.TXT file does not contain the information it should contain. It is a listing of speech and label files but not an overview of the database contents. The directory structure and file names are OK. There are 9 zero-length speech files and 10 speech files without a corresponding label file. 3. Items All obligatory items were recorded. There is one additional item: an extra application word phrase coded as E4. All 50 application words are present apart from: number, announcement, information, operator. 
There are hardly any missing files: SES1018 missed 1 item (S9); SES0010 missed 2 items (N3, S9); SES1017 missed 3 items (S7, S8, S9). This is well within the SpeechDat criteria for missing files. However, there are a lot of files that contain only silence or background noises. If we count these files as effectively missing as well, then we find a total of 449 calls missing up to three obligatory items, and 76 calls missing more items. 4. Sampled data files The speech files are in the proper format (A-law, 8 bit, gzipped). There is a very large number of very long files. A total of 4418 files has a duration of more than 20 s, and there are 84 calls with an average duration of 20 s or more. It appears that all these long calls suffer from background noise, which prevents the silence detection of the platform from stopping the recording. This means therefore that at least 84 calls have a very high level of background noise. It also follows that at least 4,500 files contain mainly silence/noise with a bit of speech in the beginning. We observed 8 calls with a mean clipping rate of 0.2% and higher. The speech parts in all of these calls are characterised by severe clipping. Most files contain large silence/noise portions. Most files in calls with a mean clipping rate between 0.1% and 0.2% are also severely clipped in their speech portions. A set of 9 calls had a mean SNR of less than 10 dB. These sessions were inspected by ear and found to contain very weak speech recordings. Of the 13 calls having a mean SNR between 10 and 15 dB, 5 calls were inspected by listening. These calls were also characterised by a very low speech intensity level. 5. Label files The annotation files follow the SAM format well. The obligatory mnemonics for recording date (RED) and recording time (RET) are systematically missing. We found that in total 1465 lines in the label files exceeded the maximum length. 
Most of them were 81 characters (259 occurrences) or 82 characters long (1197 occurrences). All 1465 occurrences were found in LBR (prompt text) fields, with 4 exceptions found in LBO fields. 172 files have an empty line between LBO and LBR. This only occurs for sentence items. Capitals are used for spelled letters only. All other words (names!) are in full lower case. 6. Lexicon The lexicon is in the correct format. It contains transcriptions in the required SAMPA phoneme symbols. Capitals were used for spelled letters only. Names are in full lower case. Words containing apostrophes and hyphens were not retained in the lexicon. Words like "don't" and "isn't" are written without apostrophe in the transcriptions, but do not appear as such in the lexicon. 434 words from the transcriptions could not be retrieved in the lexicon. Many of these were obvious misspellings that were found only once in the transcriptions. It is clear that such entries should have been corrected in the transcriptions, and that they indeed do not belong in the lexicon. If we exclude words that were found only once, then we still end up with 154 missing entries. We also checked for words that are in the lexicon but that are never used in the transcriptions. We found 255 entries. Of course, undercompleteness of the lexicon is worse than overcompleteness. 7. Speakers A speaker file is present. It contains most of the requested information, but the region of call is absent. The balance of sexes is OK. The balance of ages is OK as well. 8. Recording platform There is no documentation to check against. 9. Transcription In general, we conclude that the non-speech acoustic events are not captured well by the transcriptions, whereas the transcription of the target speech is of appreciably good quality. Further, the short items appear to be better transcribed than the long ones, especially with respect to the background acoustics. 
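As a side note to the transcription findings, the lexicon coverage check summarised in subsection 6 can be illustrated with a short Python sketch. This is not the procedure that was actually used for the validation, which is not described in this report. The function name lexicon_coverage, the min_count parameter and the toy word lists are assumptions made for illustration only, and the input is assumed to be tokenised already, with non-speech markers removed.

```python
from collections import Counter


def lexicon_coverage(transcription_words, lexicon_entries, min_count=2):
    """Compare transcription tokens with lexicon entries (hypothetical helper).

    Returns three values:
      missing          - words in the transcriptions but not in the lexicon,
                         with their frequency (undercompleteness)
      frequent_missing - the subset of `missing` seen at least `min_count` times
      unused           - lexicon entries never used in the transcriptions
                         (overcompleteness), sorted alphabetically
    """
    counts = Counter(transcription_words)
    lexicon = set(lexicon_entries)

    missing = {w: n for w, n in counts.items() if w not in lexicon}
    frequent_missing = {w: n for w, n in missing.items() if n >= min_count}
    unused = sorted(lexicon - set(counts))
    return missing, frequent_missing, unused


# Toy data, invented for this example; not taken from the database.
words = ["dont", "dont", "redial", "fourty", "hello"]
lexicon = ["hello", "redial", "goodbye"]

missing, frequent_missing, unused = lexicon_coverage(words, lexicon)
print("missing:", missing)
print("missing, seen more than once:", frequent_missing)
print("unused lexicon entries:", unused)
```

Setting min_count to 2 mirrors the step in which words found only once are excluded as probable misspellings.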
A generally observed phenomenon is the omission of apostrophes in contractions like isn't, don't, hasn't. The variants with apostrophe do occur, but only in a minority of cases. During transcription validation, the forms without apostrophe were taken as the correct forms, and were therefore not corrected. If we had counted these as errors as well, we would have found many more transcription errors than reported below. A. Long items In 545 of the 1143 checked items a correction was considered necessary. By far the majority of corrections (484) were related to the transcription of non-speech acoustic events. There were 35 corrections in the transcription itself and 24 combined corrections (transcription AND non-speech acoustic events). In addition, 2 typing errors were found. A total of 545 errors on a total of 1143 checked items yields an error rate of 47.68%. Serious errors concerning the transcription itself were observed in 59 cases, yielding an error rate of 5.16%, which is slightly above the 5% criterion. B. Short items In 154 of the 831 checked items a correction was considered necessary. Most corrections (111) were related to the transcription of non-speech acoustic events. There were 36 corrections in the transcription itself and 2 combined corrections (transcription AND non-speech acoustic events). In addition, 5 typing errors were found. A total of 154 errors on a total of 831 checked items yields an error rate of 18.53%. Serious errors concerning the transcription itself were observed in 38 cases, yielding an error rate of 4.57%, which is just below the 5% criterion. =========================================================================
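The error rates reported for the long and short items can be re-derived from the correction counts. The Python sketch below is an illustration only; the class name ItemCheck and its field names are hypothetical. It assumes that a "serious" error is a correction to the transcription itself, either alone or combined with a non-speech correction, which matches the counts 35 + 24 = 59 and 36 + 2 = 38 given above. For the short items it uses the 831 checked items as denominator, which is the figure that reproduces 18.53% and 4.57%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemCheck:
    """Correction counts for one item class (hypothetical helper)."""
    checked: int        # number of items that were listened to
    nonspeech: int      # corrections to non-speech acoustic events only
    transcription: int  # corrections to the transcription itself only
    combined: int       # corrections to both transcription and non-speech events
    typing: int         # typing errors

    @property
    def total_errors(self) -> int:
        return self.nonspeech + self.transcription + self.combined + self.typing

    @property
    def serious_errors(self) -> int:
        # Assumption: "serious" means the transcription itself was wrong.
        return self.transcription + self.combined

    def rate(self, errors: int) -> float:
        """Error rate as a percentage of checked items, rounded to 2 decimals."""
        return round(100.0 * errors / self.checked, 2)


long_items = ItemCheck(checked=1143, nonspeech=484, transcription=35, combined=24, typing=2)
short_items = ItemCheck(checked=831, nonspeech=111, transcription=36, combined=2, typing=5)

for name, check in (("long", long_items), ("short", short_items)):
    print(
        name,
        check.total_errors, check.rate(check.total_errors),
        check.serious_errors, check.rate(check.serious_errors),
    )
```

Keeping the four correction types as separate fields makes it easy to see that the totals (545 and 154) are the sums of the individual counts.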