SUBJECT: Validation Danish SpeechDat(M) corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 1.0
DATE   : February 1997

The speech databases made within the SpeechDat(M) project were validated by SPEX, Leidschendam, the Netherlands, to assess their compliance with the SpeechDat(M) format and content specifications, as documented in Deliverable 1.4.1 of the project. The validation results of the Danish SpeechDat(M) database are contained in this document.

As a general remark it must be stated that the Danish database deviates from the other SpeechDat(M) databases in various respects. There is a large amount of optional material, which is an advantage in itself. As a consequence, speakers had to call twice in order to collect the full set of items. Many speakers called only once, so that in the end there are 459 callers for whom a complete data collection was obtained, 519 callers who called only for the first part, and another 545 callers who called only for the second part. A further consequence of the large amount of optional material is that the data of the 459 speakers for whom all items were collected are stored in two distinct directories (and not in one), since the data was obtained during two calling sessions. All this is clearly described in the documentation. Finally, the speech files are not gzipped, whereas the other SpeechDat(M) databases have gzipped speech files.

The validation was limited to the obligatory items only. The following were considered obligatory: E1-3, N1-3, D1-3, A1-6, L1-3, M1-2, P1, Q1-3, S1-9, T1-2. As optional items we considered: E4-5, G0-9, H1-2, N4-9, O0-9, V1-4, A7-9, A0, B0-9, F1-7, S0, R0-9, U1-5 (see also section 3).

In the validation procedure we systematically check a list of validation criteria for a range of subjects. In the following sections we evaluate these criteria one by one. Validation results that call for attention are marked by =>.
In detail, the following subjects were validated: 1 DOCUMENTATION 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES 3 ITEMS 4 SAMPLED DATA FILES 5 ANNOTATION FILES 6 LEXICON 7 SPEAKERS 8 RECORDING PLATFORM 9 TRANSCRIPTION The document is concluded by 10 SUMMARY ==================================================================== 1. DOCUMENTATION The file FIXED0DA\DOC\DESIGN.DOC was used to check for the following information. - Language of doc file: preferably English OK - Contact person: name, address, affiliation OK, in 1.1 - Number of CDs OK, in 1.3 - Contents of each CD OK, in 1.3 - The directory structure of the CDs OK, in 1.3 - List of missing items => Not supplied - Speaker demographics . which regions, how many of each OK, in 2.3.1 . motivation for selection of regions OK, in 2.3.1 . which age groups, how many of each OK, in 2.3.2 . sexes: males, females, also children?; how many of each. OK, in 2.3.2 - Reference to a file where speaker characteristics are stored OK, in 1.2 - The number of items on the CD and per speaker OK - Naming conventions for directories and files OK, in 1.2 - Prompting . linguistic specification (and motivation) for the prompting material OK, in 2. . connection of sheet items to item numbers on CD OK, in 2.1 . sheet example OK, all sheets are in the PROMPT directory . items must be spread over the sheet to prevent list effects (e.g. three yes/no questions immediately after another are not allowed) => It can be derived from the prompt sheets in the PROMPT directory that the items were not spread over the call. All items were grouped together. => A very limited number of prompt sheets was used: for each of the two parts there were only 10 different prompt sheets. This is only partly compensated by the large amount of items on each prompt sheet. - Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich sentences (either of phones, biphones, triphones) . 
recommended: at least 2 samples of each phone per caller (should appear from documentation) => Not supplied - Recording platform should be specified . digital telephone net link OK, in section 3 - Statement that all signal transmission between CO and recording site is digital OK, in section 3 - Signal characteristics (number of bits per sample; bandwidth; coding type; compression procedures) Not extensive, but A-law is mentioned in section 3 - The format and the file header structure of speech files OK, in section 3 - The format and the file header structure of annotation files OK, section 1.2 - Annotation . procedure OK, section 4 . quality assurance OK, in section 4 . character set used for annotation (transcription) OK, in sections 4 and 6.1 . annotations symbols for non-speech acoustic events must be mentioned at least for [Filled_Pause] [Speaker_Other] [Nonspeaker_Other] OK, 6.1.6 . list of symbols used to denote word interruptions and break-offs OK, in 6.1 - Lexicon information . Procedures to obtain phonemic forms from orthographic input (lexicon generation and lay out) OK, in 4.1.1, but very brief. => It is not clear if any form of automatic grapheme-to-phoneme converter was used prior to the work of the phoneticians. . Overview of SAMPA symbols used (only in this manner it can be checked if the lexicon contains only legal symbols). OK, in 4.1.2 - Transcription manual: TRANSCRIP.DOC (optional) . is it there? No . does it contain the relevant information? . What is done with non speech events . What is done with capitals . Only one spelling of each word is allowed The basics of the transcription manual are summarised in 6.1. - Only one spelling of each word is allowed. Therefore a list of normalised spellings for words with alternative spellings should be included. Otherwise a statement why such a list is not necessary. A very brief remark is found in 6.1.2. The lexicon entries are the normalised spellings. 
- Indication of how many of the files were double checked by the producer together with percentage of detected errors
  OK, in section 4
==========================================================================
2. DATABASE STRUCTURE CONTENTS AND FILE NAMES
- Directory / subdirectory conventions
  Format of directory tree should be \\\\
  . data base: defined as <#> can be FIXED, MOBIL, VERIF <#> is 0 for SpeechDat(M) and 1 for SpeechDat is the ISO two-letter code for the language
  . volume : is a progressive number specifying the CD containing the material. Defined as CD where is the number.
  . block : defined as BLOCK where is a progressive number from 00 to 99. Block numbers are unique over all CDs. They could typically be the first two digits of below.
  . session: defined as SES where is the session code also appearing in file name
  OK
- A README.TXT file should be in the root describing all (documentation) files on the CD-ROM.
  The README.TXT file is OK.
- A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. This file supplies the volume label to UNIX systems that cannot read the physical volume label. Example of contents: FIXED0EN_00.
  OK
- A copyright statement should be present in the file COPYRIGH.TXT (root)
  OK
- Documentation should be in \\DOC
  OK
- The summary file (SUMMARY.TXT) should be in \\DOC
  OK
- The contents list (CONTENTS.LST) is in \\INDEX
  OK
- Tables should be in \\TABLE
  OK, LEXICON.TBL and SPEAKER.TBL
- Index files (optional) should be in \\INDEX
  Not supplied
- Prompt sheet files (optional) should be in \\PROMPT
  OK; however, the files are in PostScript format, which is not stated in the documentation.
- Any source code supplied should be in \\SOURCE (SAMLIB, V4, and GNU gunzip, version 1.2.4 + licence)
  Compression software is not supplied; the data is not compressed. The SAMLIB software is included in SOURCE.
- The index files (if presented) obey the nomenclature .LST where e.g. A0ENN3.LST (see below for item_code) Not supplied - All sessions indicated in the documentation are present on the CDs OK - File naming conventions All file names should obey the following pattern: DDNNNNCC.LLF DD : database identification code For SpeechDat(M): A0 = fixed net, B0 = mobile For SpeechDat : A1 = fixed net, B1 = mobile, C1 = speaker verification NNNN : session code 0000 to 9999 CC : item code; first character is item type identifier, second character is item number LL : ISO-639 language code (with extensions) F : speech file type Z is for A-law, compressed O is for Orthographic label (label file) OK - Correct item codes should be used: I1 : isolated digit C1 : 4 digit id of prompt sheet C2 : ~10 digit telephone number C3 : ~12 digit credit card number N1-3 : 3 natural numbers M1-2 : 2 money amounts L1-3 : 3 spelled words T1 : 1 time of day T2 : 1 time phrase D1-3 : 3 dates Q1-3 : 3 yes/no questions P1 : city of call/birth A1-6 : 6 common application words E1-3 : 3 application word phrases S1-9 : 9 phonetically rich sentences OK, but many more additional and deviant items are provided (see section 3) - NNNN in filenames is not in conflict with BLOCK and SES numbers in path name OK - Contents lowest level subdirectories should be of one call only OK - Empty (i.e. zero-length) files are not permitted OK => There are a few extremely short files (see also section 4, SNR measurements): A00046M1.DAO A00914Q3.DAO A08367N1.DAO => One directory is completely empty! SES6952 - Counts should match information in documentation . count of files in each subdirectory . count grand total We only checked the obligatory items. - File match: For each label file there must be one speech file and vice versa. OK - Part of the corpus should be designed for training and a (typically smaller) part for testing. This is optional. Not used. 
- The contents of the database as given in CONTENTS.LST should comprise . CD-ROM volume name (VOL:) . full path name (DIR:) . speech file name (SRC:) . speaker code (SCD:) . speaker sex (SEX:) . speaker age (AGE:) . region of call (REG:) . orthographic transcription of uttered item (LBO:) This file must be supplied as an ASCII delimited file (either using TAB, or commas and (double) quoted strings). OK, TAB-delimited format used. => The files are not identical for all CDs; its contents are tuned to the CD in question. - The contents of the SUMMARY.TXT files should comprise: . The full directory name where speech and label files are to be found . the session number . a string of typically N codes. Each item present is represented by its code. If the item is missing, a '--' should appear. . recording date . recording time of first item . optional comment text . all fields are separated by spaces OK ====================================================================== 3. ITEMS In principle we looked only at the obligatory items. Obligatory items are: E1-3, N1-3, D1-3, A1-6, L1-3, M1-2, P1, Q1-3, S1-9, T1-2. Optional items are : E4-5, G0-9, H1-2, N4-9, O0-9, V1-4, A7-9, A0, B0-9, F1-7, S0, R0-9, U1-5. - 1 isolated digit (code I1) . read => Not recorded - 3 connected digits (code C1-3) - 4 digit number to identify the prompt sheet . read - ~10 digit telephone number . read or spontaneous(?) - ~12 digit credit card number (16 digits would be better) . read . if there is a checksum then formula must be provided => C1-3 were not recorded. Instead 12 other items were recorded (G0-9, H1-2) consisting of 8 connected digits each. . 26 digits per call are required OK . at least one example per digit per caller This cannot be checked, because the optional items are included in this validation. . digits must appear numerically on the sheet, not as words OK - 3 natural numbers (code N1-3) . read . provided as numbers . numbers must be < 1,000,000 . one may be a decimal number . 
one may be a quantity (including a unit of measurement) . sufficient examples of each word to permit training OK, decimals numbers and quantity words were not used. - 2 money amounts (code M1-2) . read . currency words should be included . one small amount including decimals and one large amount not including decimals OK - 3 spelled words (code L1-3) . read . equal balance of all vocabulary letters . average length at least 7 letters . may include names, cities and other frequently spelled items . should include equivalents of : A-Z, accent words, CAPITAL, SMALL, UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE, HYPHEN OK, they are mainly taken from the application word list. - 1 time of day (code T1) . spontaneous OK, but it is T2! - 1 time phrase (code T2) . read . analogue form . equal balance of all words . should include equivalents of : AM/PM, HALF/QUARTER PAST/TO, NOON, MIDNIGHT, MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY, TOMORROW OK, but it is T1! => The Danish equivalents of MORNING, AFTERNOON, EVENING, NIGHT, TODAY YESTERDAY and TOMORROW are missing in the prompts. Also the Danish words for 2, 6, 14, 15, 19 are missing here. - 1 date (code D1) . spontaneous OK, but it is D3! - 2 dates (code D2-3) . read, word style . analogue form . covering all weekdays and months => Weekdays are not recorded; the month December is missing! Instead of weekdays ordinal day numbers are used, => but many of them are missing; this pertains to the equivalents of: 1st, 4th, 7th, 8th, 10th, 11th, 13th, 17th, 18th, 19th, 20th, 21st, 24th, 25th, 29th, 20th, 31st. Therefore date words are very incomplete. No doubt this is due to the small set of prompt sheets used. - 3 yes/no questions (code Q1-3) . spontaneous, not prompted . balance between yes/no OK - city of call/birth (code P1) . preferably spontaneous; read is permitted OK, spontaneous - 6 common application words (code A1-6) . set of 50 should be defined . 39 are fixed for all partners, see Appendix A Del 1.4-1 . 
read => Too few were recorded (see next section). The 27 that were recorded appear in sufficient quantities. - 3 application word phrases (code E1-3) . application word is embedded in phrase . read or spontaneous => Only 5 application words were used to this end. Speakers had to make their own sentence. - 9 phonetically rich sentences (code S1-9) . read . recommended: at least 2 samples of each phone per caller (should appear from documentation) OK; => there are no phoneme counts. 2. Application words In appendix A of SpeechDat deliverable 1.4-1 a list of 39 obligatory application words is provided. We checked the application words used from the prompt sheets as given in directory PROMPT. We counted 27 words used. These are also listed in DESIGN.DOC, section 2.2.4. => From the list of 39 obligatory SpeechDat(M) application words, Danish equivalents of the following 12 words are missing: help, next, repeat, stop, number, star, greeting, end, information, square, again, rewind. 3. Incidentally missing items The Danish database is organised in a very different manner than the other databases. The complete data of one speaker were collected in two calls and were accordingly stored into two directories. Since quite some speakers called just once, there is a great deal of half complete sets. For this reason it was not possible to follow the normal procedure to examine missing files. The normal procedure looks at the files that are missing in a call and has some criteria for this. Now we have looked at how many instances of each obligatory item are present. Obligatory items are: E1-3, N1-3, D1-3, A1-6, L1-3, M1-2, P1, Q1-3, S1-9, T1-2. Optional items are : E4-5, G0-9, H1-2, N4-9, O0-9, V1-4, A7-9, A0, B0-9, F1-7, S0, R0-9, U1-5. a. files that are not there We counted how many files of the obligatory items were missing. 
  #Missing  Item
     27     E2
     25     D3
     24     P1
     23     S1
     22     E1
     21     T2
     18     N2
     17     N1
     16     N3
     16     E3
     10     S2
      9     M1
      8     Q3
      6     L1
      6     A1
      5     S3
      5     Q1
      5     L2
      4     T1
      4     Q2
      4     L3
      4     A6
      4     A5
      4     A4
      4     A3
      4     A2
      3     D2
      2     S9
      2     S8
      2     S7
      2     S6
      2     S5
      2     S4
      2     D1
      1     M2

Considered from the original aim of the SpeechDat(M) corpora, which was to collect complete data sets for 1000 speakers, we may conclude that all item sets are at least 97% complete.

b. files with empty transcriptions in the LBO label field
If we also take into account the files that are present but that contain defective speech or non-speech according to their transcriptions, then 17 more files are effectively missing. These are spread over the items as follows:

  #Files  Item
     1    D3
     1    M1
     2    P1
     6    Q1
     1    Q2
     5    Q3
     1    T2

This does not change our conclusion that all obligatory item sets are more than 97% complete.
The files with defective speech or non-speech according to their transcriptions were:
  A00030Q3.DAO
  A00046M1.DAO
  A00474D3.DAO
  A00474P1.DAO
  A00474Q1.DAO
  A00474T2.DAO
  A00618P1.DAO
  A06004Q3.DAO
  A06072Q3.DAO
  A06130Q3.DAO
  A06178Q1.DAO
  A06178Q2.DAO
  A06178Q3.DAO
  A06664Q1.DAO
  A06674Q1.DAO
  A06742Q1.DAO
  A06998Q1.DAO
There may also be other files that are effectively missing (corrupted speech files). These are dealt with in the next section.
===========================================================================
4. SAMPLED DATA FILES
1 File structure
  . SAM
  OK
2 Coding
  . A-law, 8 bit, 8 kHz
  . Compression by GZIP
  OK, but not zipped
3 Sample distribution
The sample distributions were only computed for the obligatory items (E1-3, N1-3, D1-3, A1-6, L1-3, M1-2, P1, Q1-3, S1-9, T1-2). The directories with odd numbers contain only 6 obligatory items, whereas the directories with even numbers contain 29 obligatory items. Several sample distributions are checked:

3.1 File length
We calculated the length of each file in seconds in order to trace spurious recordings by their extraordinary length.
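The file length computation can be sketched as follows. This is a minimal illustration, not the validation software actually used: it assumes that the speech files are headerless, so that with 8-bit A-law at 8 kHz one byte corresponds to one sample, and the function names are hypothetical.

```python
from collections import Counter

# Assumption: 8 kHz sampling, one byte per A-law sample, no file header.
SAMPLE_RATE_HZ = 8000


def duration_seconds(num_bytes, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return the duration in seconds of a headerless 8-bit speech file."""
    return num_bytes / sample_rate_hz


def duration_histogram(file_sizes):
    """Count files per 1-second interval; key k stands for the interval k to k+1 s."""
    bins = Counter(int(duration_seconds(size)) for size in file_sizes)
    return dict(sorted(bins.items()))
```

Under these assumptions a file of 352 bytes lasts 44 ms. File sizes could be obtained with os.path.getsize for every speech file in a session directory.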
Duration distribution over all obligatory items:

  Length (s)   #Occurrences
   0 -  1 :      5809
   1 -  2 :     10167
   2 -  3 :      7905
   3 -  4 :      5335
   4 -  5 :      2736
   5 -  6 :      1303
   6 -  7 :       664
   7 -  8 :       324
   8 -  9 :       157
   9 - 10 :        97
  10 - 11 :        41
  11 - 12 :        27
  12 - 13 :        20
  13 - 14 :        12
  14 - 15 :         9
  15 - 16 :         1
  16 - 17 :         4
  17 - 18 :         2

Duration distribution per directory:

  Length (s)   #Occurrences
   1 -  2 :       440
   2 -  3 :      1347
   3 -  4 :       171
   4 -  5 :        15
   5 -  6 :         2
   9 - 10 :         1

The directory with the extreme mean duration of 9.4 s was SES8443. It was found that only E1 and E3 in this directory were obligatory. Because these are two fairly long items, their mean duration is high as well. Nothing was wrong with them.

3.2 min-max samples
We provide a histogram with clipping ratios. The clipping ratio is defined as the number of samples in a file that are equal to the maximum or minimum value, divided by the total number of samples in the file. The histogram is an overview of how many files were found in each of a set of clipping rate intervals. Files with a clipping rate higher than 1.0% must be regarded as spurious. There are 67 such files.
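The clipping ratio defined above can be sketched as follows. This is an illustration under stated assumptions, not the validation software itself: the A-law bytes are expanded with the standard G.711 scheme, whose largest absolute output is 32256 (the value quoted with the histograms), the files are taken to be headerless, and the function names are hypothetical.

```python
A_LAW_FULL_SCALE = 32256  # largest absolute value produced by G.711 A-law expansion


def alaw_to_linear(byte):
    """Expand one 8-bit A-law code to a linear sample (standard G.711 scheme)."""
    byte ^= 0x55                      # undo the even-bit inversion
    magnitude = (byte & 0x0F) << 4    # 4-bit mantissa
    segment = (byte & 0x70) >> 4      # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    return magnitude if byte & 0x80 else -magnitude


def clipping_ratio_percent(alaw_bytes):
    """Percentage of samples that sit at the maximum or minimum value."""
    if not alaw_bytes:
        raise ValueError("file contains no samples")
    clipped = sum(1 for b in alaw_bytes
                  if abs(alaw_to_linear(b)) == A_LAW_FULL_SCALE)
    return 100.0 * clipped / len(alaw_bytes)
```

A file would be flagged as spurious when clipping_ratio_percent returns more than 1.0.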
Clip distribution for all obligatory items:

  Clipping       Occurrences
  rate (in %)
  0.0 - 0.1 :      771
  0.1 - 0.2 :      196
  0.2 - 0.3 :      118
  0.3 - 0.4 :       49
  0.4 - 0.5 :       48
  0.5 - 0.6 :       38
  0.6 - 0.7 :       31
  0.7 - 0.8 :       27
  0.8 - 0.9 :       16
  0.9 - 1.0 :       17
  1.0 - 1.1 :       16
  1.1 - 1.2 :        9
  1.2 - 1.3 :        7
  1.3 - 1.4 :        6
  1.4 - 1.5 :        2
  1.5 - 1.6 :        5
  1.6 - 1.7 :        2
  1.7 - 1.8 :        2
  1.8 - 1.9 :        4
  1.9 - 2.0 :        1
  2.0 - 2.1 :        3
  2.1 - 2.2 :        1
  2.4 - 2.5 :        1
  2.5 - 2.6 :        1
  2.6 - 2.7 :        2
  2.7 - 2.8 :        1
  2.8 - 2.9 :        1
  3.0 - 3.1 :        1
  3.1 - 3.2 :        2

Number of files with absolute maximum < 32256: 33235

Files with clip ratios of 2.0 and higher are:
  Clip ratio 2.69 in file A00010A1.DAA
  Clip ratio 2.07 in file A00095E1.DAA
  Clip ratio 2.40 in file A00470P1.DAA
  Clip ratio 3.13 in file A00726P1.DAA
  Clip ratio 2.12 in file A06262Q2.DAA
  Clip ratio 2.05 in file A06856Q1.DAA
  Clip ratio 2.84 in file A06856Q2.DAA
  Clip ratio 3.19 in file A06896D3.DAA
  Clip ratio 2.53 in file A06896M1.DAA
  Clip ratio 2.60 in file A06896P1.DAA
  Clip ratio 2.76 in file A06896T2.DAA
  Clip ratio 3.03 in file A08725E1.DAA
  Clip ratio 2.03 in file A00768S6.DAA

Clip distribution per directory:

  Clipping       Occurrences
  rate (in %)
  0.0 - 0.1 :      253
  0.1 - 0.2 :       17
  0.2 - 0.3 :        4
  0.3 - 0.4 :        8
  0.4 - 0.5 :        3
  0.5 - 0.6 :        5
  0.6 - 0.7 :        2
  0.7 - 0.8 :        1
  0.9 - 1.0 :        2
  1.5 - 1.6 :        1

Number of directories with absolute maximum < 32256: 1680

We found three directories with a mean clip ratio of 0.9 or higher:
  Clip ratio 0.90 in dir SES6896
  Clip ratio 0.93 in dir SES8669
  Clip ratio 1.53 in dir SES8725
These directories contained some bad files but were not distorted as a whole.

3.3 Mean values
We computed the mean sample value of each item in each call. We provide a histogram with mean values below. The histogram is an overview of how many files were found in each of a set of mean sample value intervals. This overview can be used to trace files with large DC-offsets.
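The mean value check can be sketched as follows. This is a minimal illustration, assuming that the mean is taken over the linear samples obtained by standard G.711 A-law expansion and that the histogram uses intervals of width 25; the function names are hypothetical and this is not the validation software itself.

```python
import math


def alaw_to_linear(byte):
    """Expand one 8-bit A-law code to a linear sample (standard G.711 scheme)."""
    byte ^= 0x55                      # undo the even-bit inversion
    magnitude = (byte & 0x0F) << 4    # 4-bit mantissa
    segment = (byte & 0x70) >> 4      # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    return magnitude if byte & 0x80 else -magnitude


def mean_sample_value(alaw_bytes):
    """Mean of the decoded samples; a large absolute value hints at a DC-offset."""
    if not alaw_bytes:
        raise ValueError("file contains no samples")
    return sum(alaw_to_linear(b) for b in alaw_bytes) / len(alaw_bytes)


def mean_bin(mean, width=25):
    """Return the (lower, upper) histogram interval that contains the mean."""
    lower = math.floor(mean / width) * width
    return lower, lower + width
```

With this binning a mean of 299.8 falls in the interval 275 - 300 and a mean of -378.0 in the interval -400 - -375.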
Mean distribution for all obligatory items:

  Mean             Occurrences
  -400 - -375 :         1
  -150 - -125 :         1
  -125 - -100 :        17
  -100 -  -75 :       135
   -75 -  -50 :       837
   -50 -  -25 :      1668
   -25 -    0 :      8842
     0 -   25 :     17386
    25 -   50 :      4101
    50 -   75 :      1480
    75 -  100 :        76
   100 -  125 :        60
   125 -  150 :         7
   150 -  175 :         1
   275 -  300 :         1

The two files with the most extreme mean sample values were:
  A08853E1.DAA (Mean: 299.8)
  A00370S4.DAA (Mean: -378.0)
Both files contain normal speech. They also contain a portion with a fixed value, which should not be there. In the first file this portion is in the middle of a silence interval.
=> In the second file this portion is in the middle of the speech. This file is therefore damaged.
A number of files with less extreme mean values were checked and found OK.

Mean distribution per directory:

  Mean             Occurrences
  -125 - -100 :         1
  -100 -  -75 :         6
   -75 -  -50 :        70
   -50 -  -25 :       131
   -25 -    0 :       513
     0 -   25 :       949
    25 -   50 :       230
    50 -   75 :        66
    75 -  100 :         5
   100 -  125 :         5

General tendencies were not found here.

3.4 Signal to Noise Ratio
We split each signal file into contiguous windows of 10 ms and computed the Mean Square (energy) in each window. The mean sample value over the complete file was subtracted from each individual sample value before the MS was computed. The 5% of the windows with the lowest energy were assumed to contain line noise. The signal to noise ratio of each file was then calculated by dividing the mean energy over all windows by the mean energy of these 5% lowest-energy windows. The ratio was converted to dB by taking 10*log10.

SNR distribution for all obligatory items:

  SNR          Occurrences
    0 -   5 :        3
   10 -  15 :        3
   15 -  20 :       38
   20 -  25 :      404
   25 -  30 :     2096
   30 -  35 :     6705
   35 -  40 :    10711
   40 -  45 :     8999
   45 -  50 :     4192
   50 -  55 :     1182
   55 -  60 :      195
   60 -  65 :       44
   65 -  70 :       19
   70 -  75 :        8
   75 -  80 :        7
   80 -  85 :        5
   85 -  90 :        1
  100 - 105 :        1

We found 3 files for which SNR could not be computed. These files were: A00046M1.DAA, A00914Q3.DAA, A08367N1.DAA.
These files proved to be extremely short (44 ms, 190 ms and 2.5 ms, respectively).
=> A00046M1.DAA does not contain speech according to the transcription field in the corresponding label file; A00914Q3.DAA and A08367N1.DAA should contain speech according to their label files, but this is not the case.

SNR distribution over directories:

  SNR          Occurrences
   15 -  20 :        2
   20 -  25 :       18
   25 -  30 :       94
   30 -  35 :      423
   35 -  40 :      657
   40 -  45 :      565
   45 -  50 :      184
   50 -  55 :       29
   55 -  60 :        4

The two directories SES0305 and SES8215 with a mean SNR below 20 dB contain background noise and low intensity speech. Yet, they are usable.
===========================================================================
5. ANNOTATION FILE
=> The label files are not in DOS format; they do not contain a carriage return at their line ends.
- Mandatory (SAM) mnemonics:
  LHD: V5.0
  DBN: SPEECHDAT(M)_
  VOL: FIXED0_
  SES:
  DIR:
  SRC:
  CCD:
  REP:
  RED:
  RET:
  SAM: 8000
  BEG:
  END:
  SNB: 1
  SBF:
  SSB: 8
  QNT: A-LAW
  CMP: GZIP, 1.2.4
  SCD:
  SEX: male/female/unknown   ! SEX and AGE may also only appear in
       (one letter)          ! in speaker table if SCD is provided
                             ! in label file
  AGE:                       ! mnemo is not SAM
  REG:
  LBD:
  LBR: , , [gain], [minimum value], [maximum value],
  LBO: , , ,
  EXT: [if needed for LBR and LBO, > 80 char]
  ELF:
  . LHD and TYP are first
  . LBR and LBO come after LBD
  . ELF is end of file keyword
  . no line may exceed 80 chars
- Optional (SAM) mnemonics (may be omitted or left empty)
  TYP: orthographic
  TXF:
  CMT:
  NCH: 1
  ARC:                       ! mnemo is not SAM
  SHT:                       ! mnemo is not SAM
  EXP:
  SYS:
  DAT:
  SPA:
  PHM:                       ! mnemo is not SAM
  ACC:                       ! mnemo is not SAM
  NET: fixed/gsm ...         ! mnemo is not SAM
  EDU:                       ! mnemo is not SAM
  SOC:                       ! mnemo is not SAM
  PCF:
  RCC:
  ENV:
  ASS:                       ! mnemo is not SAM
- No illegal mnemonics used
  Illegal mnemonics were not found.
- There are no mnemonics missing
  There were no obligatory mnemonics found missing.
=> In a few files illegal values were found in the transcription field of LBO: A06328M1.DAO: BEGIN33 toogtres fyrre A07032M1.DAO: BEGIN33 [Nonspeaker_other] toogtres kroner og fyrre ører A00146S1.DAO: BEGIN1 ~forfærdelig mørkeræd A06060S1.DAO: BEGIN1 ~min søster er forfærdelig mørkeræd A06506S1.DAO: BEGIN1 jeg har et problem med min vandvarmer => In a few other files the transcription field of LBO was empty: A00030Q3.DAO, LBR: 0, 4531, A00474Q3.DAO, LBR: 0, 13478, A06004Q3.DAO, LBR: 0, 6639, A06072Q3.DAO, LBR: 0, 3040, A06130Q3.DAO, LBR: 0, 3116, A06178Q3.DAO, LBR: 0, 84902, => In three cases the SRC mnemonic refers to item N1 whereas N2 is correct: A08019N2.DAO, SRC: A08019N1.DAA A08947N2.DAO, SRC: A08947N1.DAA A08957N2.DAO, SRC: A08957N1.DAA - All files must contain the same mnemonics. This holds as well for the optional mnemonics. OK - Each lowest subdirectory does not refer to multiple sheet ids. OK - A line should not contain more than 80 characters => We found 53 lines that were longer than 80 characters. 16 lines contained 81 characters; 4 lines contained 82 characters; 13 lines contained 83 characters; 20 lines contained 84 characters; - For spontaneous speech LBR should be left blank or contain a mnemonic word (like ). Mnemonics were used. - Transliterations only in lower case letters, also at sentence beginning Only exception: proper names and spelled words, ZIP codes, acronyms and abbreviations In the latter case blanks should be used in between the letters. German is the only exception to this convention. 
OK, capitals were used only for proper names and spelled items
- Punctuation marks should not be used in the transliterations
  OK
- Digits must appear in full orthographic form in the transcription
  OK
- In principle only the following symbols are allowed to indicate non-speech acoustic events: [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
  Other symbols (and language equivalents) must be mentioned in the documentation
  OK
- Asterisks should be used to indicate mispronunciations
  OK
- Tildes should be used to indicate truncations
  OK
- The label files are associated with the correct speech files. (This cannot be done automatically at this moment. We can only point at files that are incidentally found as mismatched during the transcription and/or speech file validation)
  OK
- Assessment of speech items in terms of SNR, presence of additional noise, adherence to prompting text is provided (optional)
  Not provided.
========================================================================
6. LEXICON
- Check lexicon existence
  OK
- Lexicon contents should be taken from actual utterances (from LBO)
  OK, in DESIGN.DOC section 4.1.1.
- The entries should be alphabetically ordered
  OK, but in a case-insensitive manner.
- In transcriptions only SAMPA symbols are allowed
  All phoneme symbols used are valid SAMPA symbols.
  => However, the diphthongs are missing (e.g. /au/).
- Capitals only in proper names, spelled words, and in single letters derived from abbreviations (exception: German)
  OK
- Phoneme symbols must be separated by blanks
  OK
  => but phoneme transcriptions are often preceded by two blanks
  We suspect that blanks were inserted in the phonemic transcriptions later. This also explains why diphthongs are missing: they are separated by blanks.
- A line in the lexicon should have the following format [ ] []
  OK
- Alternative transcriptions are optional.
They may follow the first transcription, separated by [TAB] or have a separate entry (only in case also frequency information is supplied) OK, [TAB] delimited alternative is chosen. - Orthographic entries are as a rule split by apostrophes, but not by hyphens. There are no apostrophes in the lexicon, and there are no hyphens (only in Oster-Marie). - The lexicon should be complete . Check for undercompleteness (are all words in lexicon) The following words were found in the transcriptions but not in the lexicon: Arenborg: 1 BEGIN1: 3 BEGIN33: 2 Egholdt: 1 Gerning: 1 april;: 1 klok: 1 ldig: 1 repræsentationssarbejdet: 1 All of them are typing errors in the transcriptions, it seems. This is confirmed by the fact that most of them occur only once. Therefore, the lexicon is not undercomplete. . Check for overcompleteness Since we only checked for obligatory items, and not for all items, there is no sense in checking in lexicon overcompleteness. The words that we might consider superfluous may well be present in the optional items. - Optional information: stress, word/morphological/syllabic boundaries. But, if provided, then it should follow the SpeechDat conventions. OK, this optional information is not included in the transcriptions. ========================================================================== 7. SPEAKERS - Speaker database file . check existence OK - Allowed formats: a. SAM mnemonics b. record file with commas as field separators and strings between double quotes OK, option b. was chosen. - Obligatory information: SAM: 1. unique number (speaker/caller) SCD (or less preferably SES) 2. sex SEX 3. age AGE 4. region of call REG OK - Optional information: . height HET . weight WET . native language NLN . accent ACC . ethnic group ETH . education level EDL . smoking habits SMK . pathologies PTH . socio-economic status SOC PROMPT_NR (nr of prompt sheet) was added as extra information. - Balance of sexes . 
How many males, how many females, should match specification in documentation file
  . Imbalance may not exceed 5%
  There are 730 male speakers and 793 female speakers according to the documentation and to the speaker table. Counting from the label files we find:
    F: 791 = 52.01 % , OK
    M: 730 = 47.99 % , OK
  This is well within the critical interval.
- Balance of regions
  . which regions and how many of each should match specification in documentation file
  The following distribution was observed by scanning the REG field in the label files:
    Bornholm:                    85 =  5.59 %
    East Jutland:               270 = 17.75 %
    Funen:                      159 = 10.45 %
    Middle Jutland:             150 =  9.86 %
    North Jutland:              154 = 10.12 %
    Northern Sealand:           259 = 17.03 %
    South Funen with islands:    92 =  6.05 %
    South Sealand:               83 =  5.46 %
    Southern Jutland east part:  74 =  4.87 %
    Southern Jutland west part:  85 =  5.59 %
    West Jutland:               110 =  7.23 %
  This deviates slightly from the documentation. The reason for this minor deviation is unclear but also irrelevant.
- Balance of ages
  . which age groups and how many of each should match specification in documentation file
  . A minimum of 20% of speakers must be in each of the following age groups: 17-30, 31-45, 46-60. A maximum of 40% of the speakers may be younger than 17 or older than 60.
    under 17: 247 = 16.24 %
    17 - 30 : 445 = 29.26 % , OK
    31 - 45 : 439 = 28.86 % , OK
    46 - 60 : 259 = 17.03 % , Too few
    over 60 : 131 =  8.61 %
  => There are too few speakers between 46 and 60 years of age.
  Since many speakers called twice, their age in the two calls may differ. It appears that the age in the second call is used in the documentation and in the speaker table, but this is not mentioned in the documentation.
=======================================================================
8. RECORDING CONDITIONS
- Digital telephone line
  OK
- A-law coding
  OK
- Specification of wireless telephone or not (optional)
  Not provided
- Recording information may be stored in a separate file (optional)
  - this file may have two formats: a.
SAM mnemonics
    b. record table with commas as field separators and strings between double quotes
  - The primary key in the label file is the RCC mnemonic
  - name of file: TABLE\REC_COND.SAM or TABLE\REC_COND.TBL
  - Information:                  SAM:
    . recording conditions code   RCC
    . region of call              REG
    . telephone area code         ARC
    . environment                 ENV
    . telephone model             PHM
    . telephone network           NET
    . recording city              CTY
    . recording car               CAR
    . speed                       SPD
    . fan noise                   FAN
    . ground type                 GRD
    . wipes                       WIP
  This optional table is not provided
=============================================================================
9. TRANSCRIPTION
This validation is carried out by taking 5% of the short items and 5% of the long items in the corpus. The transcriptions in the label files for these samples are checked by listening to the corresponding speech files. This check is performed by native speakers of the language involved.

Short items are:
- isolated digit
- time phrases
- date phrases
- yes/no questions
- place name
- application words

Long items are:
- connected digits
- natural numbers
- money amounts
- spelled words
- application phrases
- phonetically rich sentences

Only the transcriptions of the obligatory items (as mentioned in the introduction) were selected for validation. It must be mentioned explicitly that at the time the Danish database was compiled there was no prior knowledge as to which items were to be validated. Therefore the results of the validation can be considered as valid and representative for the other items as well.

A random selection of 740 short items and 1017 long items went into the transcription validation.

A. Long items
In 305 of the 1017 checked items a correction was considered necessary. By far the most corrections (203) were related to the transcription of non-speech acoustic events. There were 102 corrections in the transcription itself. We did not observe errors of another type.
With respect to the transcription corrections made for the non-speech
acoustic events it must be noted that the most obvious noises were
transcribed, but the noises at a lower level were not.

A total of 305 errors on a total of 1017 checked items yields an error
rate of 30%. Serious errors concerning the transcription itself were
observed in 102 cases, yielding an error rate of 10%, which is above the
criterion of 5%. Most of the transcription errors could be attributed to
the insertion of 'og' in the transcription of digits (73 cases). When we
leave out these errors, 102-73 = 29 transcription errors remain. This
amounts to 3% errors in the transcription, which is well below the 5%
criterion.

B. Short items

In 109 of the 740 checked items a correction was considered necessary.
Most corrections (93) were related to the transcription of non-speech
acoustic events. There were 16 corrections in the transcription itself.
We did not observe errors of another type.

With respect to the transcription corrections made for the non-speech
acoustic events it must be noted that the most obvious noises were
transcribed, but the noises at a lower level were not.

A total of 109 errors on a total of 740 checked items yields an error
rate of 15%. Serious errors concerning the transcription itself were
observed in 16 cases, yielding an error rate of 2%, which is well below
the criterion of 5%. A number of the transcription errors could be
attributed to the insertion of 'og' in the transcription of digits
(9 cases). When we leave out these errors, 16-9 = 7 transcription errors
remain. This amounts to 1% errors in the transcription, which is well
below the 5% criterion.

In general, the short items were found to need fewer corrections than
the long items. Most of the real transcription errors for the long items
were caused by insertion of the monophonemic word 'og' in digit strings.

==========================================================================

10. SUMMARY

Below we give a brief overview of our findings with respect to the
Danish database. The subsections follow the order of the topics in the
previous sections of the report.

As a general remark it must be stated that the Danish database deviates
from the other SpeechDat(M) databases in various respects. There is a
large amount of optional material. Speakers had to call twice, but quite
a few did not. For speakers with a complete data set the items are
stored in two directories, because they had to make two calls. The
speech files are not gzipped, whereas the other SpeechDat(M) databases
have gzipped speech files.

The validation was limited to the obligatory items only.
As such were considered:
  E1-3, N1-3, D1-3, A1-6, L1-3, M1-2, P1, Q1-3, S1-9, T1-2.
As optional items we considered:
  E4-5, G0-9, H1-2, N4-9, O0-9, V1-4, A7-9, A0, B0-9, F1-7, S0, R0-9,
  U1-5.

1. Documentation

The DESIGN.DOC gives a clear and detailed overview of the database. It
contains most of the requested information. Counts of phoneme
occurrences in the phonetically rich sentences are missing. A very
limited number of prompt sheets was used. It can be derived from the
prompt sheets in the PROMPT directory that the items were not spread
over the call. All items were grouped together.

2. Data base structure and file names

The directory tree and the file formats and names are in general
correct. One directory is completely empty: SES6952.
The CONTENTS.LST files are not identical for all CDs; their contents are
tuned to the CD in question.

3. Items

Of the mandatory items within SpeechDat(M) a few are not recorded:
  I1: isolated digit
  C1: sheet number
  C2: telephone number
  C3: credit card number
Instead 12 other items were recorded (G0-9, H1-2), consisting of 8
connected digits each.

Not all time words are present in the prompts: the Danish equivalents of
MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY and TOMORROW are
missing. Also the Danish words for 2, 6, 14, 15, 19 are missing here.
Not all date words are present in the prompts: weekdays are not
recorded; the month December is missing. Instead of weekdays, ordinal
day numbers are used, but many of them are missing; this pertains to the
equivalents of: 1st, 4th, 7th, 8th, 10th, 11th, 13th, 17th, 18th, 19th,
20th, 21st, 24th, 25th, 29th, 20th, 31st. No doubt this is due to the
small set of prompt sheets used.

From the list of 39 obligatory SpeechDat(M) application words, Danish
equivalents of the following 12 words are missing: help, next, repeat,
stop, number, star, greeting, end, information, square, again, rewind.
Only 5 application words were used to make sentences.

We have looked at how many instances of each obligatory item are
present. Measured against the original aim of the SpeechDat(M) corpora,
which was to collect complete data sets for 1000 speakers, we conclude
that all data sets of obligatory items are complete for 97% or more.

4. Sampled data files

The sample distributions were only computed for the obligatory items.
In general the quality of the recordings is good. A few damaged files
were detected.
- There were 13 files with a clipping ratio higher than 2.0%.
- One file contains a fixed value portion in the middle of its speech.
- There are three extremely short files.

5. Label files

The SpeechDat(M) conventions for label files were well maintained. Only
a few deviations were observed:
- 11 label files have illegal values in their transcription fields.
- In three cases the SRC mnemonic refers to item N1 whereas N2 is
  correct.
- We found 53 lines that were (slightly) longer than 80 characters.
- The label files are not in DOS format; they do not contain carriage
  returns (CR) at their line ends.

6. Lexicon

As checked for the obligatory items only, the lexicon was found
complete. It is delivered in the correct format. Diphthongs are missing
in the phonemic transcriptions (or they are torn apart by blanks).

7. Speakers

The speaker table is in the correct format and complete.
The balance of sexes is OK.
The balance of ages is OK, except for the 46-60 age group, which has too
few representatives according to SpeechDat(M) criteria.

8. Recording platform

Recording conditions were fine. A separate recording conditions file is
optional in SpeechDat(M) and is not present for this database.

9. Transcription

Only the transcriptions of the obligatory items were selected for
validation. A random selection of 740 short items and 1017 long items
was chosen.

In general, the short items were found to need fewer corrections than
the long items. Most of the real transcription errors for the long items
were caused by insertion of the monophonemic word 'og' in digit strings.

A. Long items

In 305 of the 1017 checked items a correction was considered necessary.
By far the most corrections (203) were related to the transcription of
non-speech acoustic events. There were 102 corrections in the
transcription itself. We did not observe errors of another type.

Most of the transcription errors could be attributed to the insertion of
'og' in the transcription of digits (73 cases). When we leave out these
errors, 102-73 = 29 transcription errors remain. This amounts to 3%
errors in the transcription, which is well below the 5% criterion.

B. Short items

In 109 of the 740 checked items a correction was considered necessary.
Most corrections (93) were related to the transcription of non-speech
acoustic events. There were 16 corrections in the transcription itself.
We did not observe errors of another type.

A number of the transcription errors could be attributed to the
insertion of 'og' in the transcription of digits (9 cases). When we
leave out these errors, 16-9 = 7 transcription errors remain. This
amounts to 1% errors in the transcription, which is well below the 5%
criterion.

=========================================================================
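
The error rates quoted for the transcription check are plain ratios of
corrections to checked items. The following is a minimal Python sketch
that recomputes them from the counts given in this report. The names
ItemCheck, summarize and CRITERION are illustrative assumptions and are
not part of the validation tooling; only the counts and the 5% criterion
come from the report.

```python
from dataclasses import dataclass

# 5% criterion for errors in the transcription itself, as quoted in section 9.
CRITERION = 0.05


@dataclass(frozen=True)
class ItemCheck:
    """Counts reported for one class of items (hypothetical container)."""
    checked: int              # items listened to by the validators
    noise_fixes: int          # corrections to non-speech acoustic event markers
    transcription_fixes: int  # corrections to the transcription itself
    og_insertions: int        # transcription corrections caused by an inserted 'og'


def summarize(check: ItemCheck) -> dict[str, float]:
    """Return the three error rates discussed in the report, as fractions."""
    all_fixes = check.noise_fixes + check.transcription_fixes
    without_og = check.transcription_fixes - check.og_insertions
    return {
        "all corrections": all_fixes / check.checked,
        "transcription": check.transcription_fixes / check.checked,
        "transcription without 'og'": without_og / check.checked,
    }


# Counts copied from the report.
LONG = ItemCheck(checked=1017, noise_fixes=203, transcription_fixes=102, og_insertions=73)
SHORT = ItemCheck(checked=740, noise_fixes=93, transcription_fixes=16, og_insertions=9)

if __name__ == "__main__":
    for name, check in (("long", LONG), ("short", SHORT)):
        for label, value in summarize(check).items():
            # The 5% criterion applies to transcription errors only.
            note = ""
            if label != "all corrections":
                note = " (below 5%)" if value < CRITERION else " (above 5%)"
            print(f"{name:5s} {label:28s} {value:6.1%}{note}")
```

Rounded to whole percentages, these ratios reproduce the figures in the
report: 30%, 10% and 3% for the long items, and 15%, 2% and 1% for the
short items.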