SUBJECT: Validation French SpeechDat(M) corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 2.0
DATE   : 25 June 1996

The speech databases made within the SpeechDat(M) project were validated by SPEX, Leidschendam, the Netherlands, to assess their compliance with the SpeechDat(M) format and content specifications, as documented in Deliverable 1.4.1 of the project. The validation results of the French SpeechDat(M) database are contained in this document.

In the validation procedure we systematically checked a list of validation criteria for a range of subjects. In the following sections we evaluate these criteria one by one. Validation results that call for attention are marked by =>.

The following subjects were validated:
 1 DOCUMENTATION
 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
 3 ITEMS
 4 SAMPLED DATA FILES
 5 ANNOTATION FILES
 6 LEXICON
 7 SPEAKERS
 8 RECORDING PLATFORM
 9 TRANSCRIPTION
The document is concluded by
10 SUMMARY

=====================================================================
1. DOCUMENTATION

The documentation is provided in a set of files deviating from the SpeechDat instructions. The files are in the proper directory (\FIXED0FR\DOC). The contents are described in an additional README.TXT file located in the \FIXED0FR\DOC directory. The disadvantage of the approach chosen is that much of the information is duplicated. E.g. the contents of the database are described in MATERIAL.TXT, in FINALREP.TXT and in DATABASE.TXT.

- Language of doc file: preferably English
    OK
- Contact person: name, address, affiliation
    OK, in the README.TXT in the root.
- Number of CDs / Tapes
    OK
- Contents of each CD / tape
    OK
- The directory structure of the CDs / tapes
    OK
- List of missing items
    OK, separately listed in \FIXED0FR\DOC\MISSING.TXT.
- Speaker demographics
  . which regions, how many of each
    OK, in FINALREP.TXT (section 2.5)
  . motivation for selection of regions
    => Not provided
  . which age groups, how many of each
    OK, in FINALREP.TXT (section 2.5)
  .
sexes: males, females, also children?; how many of each. OK, in FINALREP.TXT (section 2.5) - Reference to a file where speaker characteristics are stored (speaker.tbl) OK, in DATABASE.TXT. - Naming conventions for directories and files OK, in DATABASE.TXT. - Prompting . linguistic specification (and motivation) for the prompting material OK, in MATERIAL.TXT . connection of sheet items to item numbers on CD OK, in DATABASE.TXT . sheet example OK, in LETTER.TXT . items must be spread over the sheet to prevent list effects (e.g. three yes/no questions immediately after another are not allowed) OK - Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich sentences (either of phones, biphones, triphones) . recommended: at least 2 samples of each phone per caller (should appear from documentation) OK, a list of phoneme counts is presented in WORDLIST.TXT. => There is no information if every phoneme is present twice for each caller. - Recording platform should be specified . digital telephone net link OK, in PLATFORM.TXT - Statement that all signal transmission between CO and recording site is digital OK, in PLATFORM.TXT (Appendix A) - Signal characteristics (number of bits per sample; bandwidth; coding type; compression procedures) OK, in DATABASE.TXT (section 3). - The format and the file header structure of speech files OK, in DATABASE.TXT (section 3). - The format and the file header structure of annotation files OK, in DATABASE.TXT (section 3). - Annotation . procedure OK, in DATABASE.TXT and TRANSEN.TXT . quality assurance OK, in TRANSEN.TXT . character set used for annotation (transcription) OK, in TRANSEN.TXT . annotations symbols for non-speech acoustic events must be mentioned at least for [Filled_Pause] [Speaker_Other] [Nonspeaker_Other] OK, in TRANSEN.TXT . list of symbols used to denote word interruptions and break-offs OK, in TRANSEN.TXT - Lexicon information . 
Procedures to obtain phonemic forms from orthographic input (lexicon generation and lay out) => There is only scarce information as to by which means the lexicon was generated (which software, manual checking etc.) . Overview of SAMPA symbols used (only in this manner it can be checked if the lexicon contains only legal symbols). (Alternatively, D141 refers to the standard SAMPA definitions on a WWW server and this may be sufficient to check against) OK, in MATERIAL.TXT and in WORDLIST.TXT - Transcription manual: TRANSCRIP.DOC (optional) . is it there? OK, its name is TRANSEN.TXT (and the French version is called TRANSFR.TXT) . does it contain the relevant information? . What is done with non speech events . What is done with capitals . Only one spelling of each word is allowed OK - Only one spelling of each word is allowed. Therefore a list of normalised spellings for words with alternative spellings should be included. Otherwise a statement why such a list is not necessary. => Not present - Indication of how many of the files were double checked by the producer together with percentage of detected errors => Not provided - Other documentation files included: A file containing the handset specifications in terms of CORDED and CORDLESS as a function of session number is added as file. - Other remarks: => In the file DATABASE.TXT a few example lines are printed which are taken from the CONTENTS.LST and the SUMMARY.TXT files. These lines are longer than can be coped with by most printers. If DATABASE.TXT is printed, one should be aware that such interrupted lines may be present. ========================================================================= 2. DATABASE STRUCTURE CONTENTS AND FILE NAMES - Directory / subdirectory conventions Format of directory tree should be \\\\ . data base: defined as <#> can be FIXED, MOBIL, VERIF <#> is 0 for SpeechDat(M) and 1 for SpeechDat is the ISO two-letter code for the language . 
volume : is a progressive number specifying the CD containing the material. Defined as CD where is the number. . block : defined as BLOCK where is a progressive number from 00 to 99. Block numbers are unique over all CDs. They could typically be the first two digits of below. . session: defined as SES where is the session code also appearing in file name OK - A README.TXT file should be in the root describing all (documentation) files on the CD-ROM. OK There is an additional README.TXT file in the \FIXED0FR\DOC directory with a description of the contents of the DOC directory. - A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. This file supplies the volume label to UNIX systems that cannot read the physical volume label. Example of contents: FIXED0EN_00. OK - A copyright statement should be present in the file COPYRIGH.TXT (root) OK - Documentation should be in \\DOC OK - The summary files (SUMMARY.TXT) should be in \\DOC OK - The contents list (CONTENTS.LST) is in \\INDEX OK - Tables should be in \\TABLE OK - Index files (optional) should be in \\LST Not provided - Prompt sheet files (optional) should be in \\PROMPT Not provided - Any source code supplied should be in \\SOURCE (SAMLIB, V4, and GNU gunzip, version 1.2.4 + licence) OK - The index files (if presented) obey the nomenclature .LST where e.g. 
A0ENN3.LST (see below for item_code) Not applicable - File naming conventions All file names should obey the following pattern: DDNNNNCC.LLF DD : database identification code For SpeechDat(M): A0 = fixed net, B0 = mobile For SpeechDat : A1 = fixed net, B1 = mobile, C1 = speaker verification NNNN : session code 0000 to 9999 CC : item code; first character is item type identifier, second character is item number LL : ISO-639 language code (with extensions) F : speech file type Z is for A-law, compressed O is for Orthographic label (label file) OK - Correct item codes should be used: I1: isolated digit C1: 4 digit id of prompt sheet C2: ~10 digit telephone number C3: ~12 digit credit card number N1-3: 3 natural numbers M1-2: 2 money amounts L1-3: 3 spelled words T1: 1 time of day T2: 1 time phrase D1-3: 3 dates Q1-3: 3 yes/no questions P1: city of call/birth A1-6: 6 common application words E1-3: 3 application word phrases S1-9: 9 phonetically rich sentences OK - NNNN in filenames is not in conflict with BLOCK and SES numbers in pathname OK - Contents lowest level subdirectories should be of one call only OK - Empty (i.e. zero-length) files are not permitted OK - Counts should match information in documentation . count of files in each subdirectory . count grand total OK - Missing items per speaker Check with documentation OK - File match: For each label file there must be one speech file and vice versa. OK - Part of the corpus should be designed for training and a (typically smaller) part for testing. This is optional. Partition not made. - The contents of the database as given in CONTENTS.LST should comprise . CD-ROM volume name (VOL:) . full pathname (DIR:) . speech file name (SRC:) . speaker code (SCD:) . speaker sex (SEX:) . speaker age (AGE:) . region of call (REG:) . orthographic transcription of uttered item (LBO:) This file must be supplied as an ASCII delimited file (either using TAB, or commas and (double) quoted strings). 
OK, session number is used as the speaker code. [TAB] is used as field delimiter. => Region of call (REG) is replaced by assessment (ASS), as is indicated in the documentation. - The contents of the SUMMARY.TXT files should comprise: . The full directory name where speech and label files are to be found . the session number . a string of typically 39 codes. Each item present is represented by its code. If the item is missing, a '--' should appear. . recording date . recording time of first item . optional comment text . all fields are separated by spaces OK ===================================================================== 3. ITEMS - 1 isolated digit (code I1) . read OK - 3 connected digits (code C1-3) - 4 digit number to identify the prompt sheet . read - ~10 digit telephone number . read or spontaneous(?) - ~12 digit credit card number (16 digits would be better) . read . if there is a checksum then formula must be provided . 26 digits per call are required . at least one example per digit per caller . digits must appear numerically on the sheet, not as words OK => However, not every speaker was prompted each digit. There were 317 calls missing one of the digits, 58 calls missed two digits, and 5 calls three digits. Thus a total of 380 calls misses one or more digits in the C1-3 items. - 3 natural numbers (code N1-3) . read . provided as numbers . numbers must be < 1,000,000 . one may be a decimal number . one may be a quantity (including a unit of measurement) . sufficient examples of each word to permit training => One natural number is provided, the other two are systematically absent. - 2 money amounts (code M1-2) . read . currency words should be included . one small amount including decimals and one large amount not including decimals OK - 3 spelled words (code L1-3) . read . equal balance of all vocabulary letters . average length at least 7 letters . may include names, cities and other frequently spelled items . 
should include equivalents of : A-Z, accent words, CAPITAL, SMALL, UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE, HYPHEN OK - 1 time of day (code T1) . spontaneous Two time items are provided, a digital and an analog one. => Both are read, instead of spontaneous - 1 time phrase (code T2) . read . analogue form . equal balance of all words . should include equivalents of : AM/PM, HALF/QUARTER PAST/TO, NOON, MIDNIGHT, MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY, TOMORROW OK, the item is present. The distribution of the numbers is a bit skewed. Below 20 most numbers are represented 50 times or more; over 20 there are quite some numbers that occur between 10 and 15 times; the number 29 occurs only 8 times. => Some of the time words are very scarcely represented: après-midi : 1 demi : 2 le : 4 midi : 5 minuit : 2 minute : 3 quarts : 3 =>The words 'matin', 'soir' and 'nuit' were not found at all. - 1 date (code D1) . spontaneous OK - 2 dates (code D2-3) . read, wordstyle . analogue form . covering all weekdays and months OK, the item is present, with good coverage of all weekdays and months. - 3 yes/no questions (code Q1-3) . spontaneous, not prompted . balance between yes/no OK, four items are provided. - city of call/birth (code P1) . preferably spontaneous; read is permitted OK => not a city but a 'departement' is asked for. - 6 common application words (code A1-6) . set of 50 should be defined . 39 are fixed for all partners, see Appendix A Del 1.4-1 . read OK, 9 are provided instead of 6. - 3 application word phrases (code E1-3) . application word is embedded in phrase . read or spontaneous OK - 9 phonetically rich sentences (code S1-9) . read . recommended: at least 2 samples of each phone per caller (should appear from documentation) OK All obligatory items are present except for two natural numbers. 
The following additional (optional) items are provided:
- 1 yes/no question (Q4)
- 1 time of day (T3)
- 3 application words (A7-9)
- 1 comment (R1)
- 3 words (spoken versions of spelled items) (W1-3)

2. Application words
In appendix A of SpeechDat deliverable 1.4-1 a list of 39 obligatory application words is provided. All words are present in the database,
=> apart from the word 'relire'.
Each application word occurs about 150 times in the database, ranging from 142 times for the word 'annuaire' to 204 times for the word 'fin'.

3. Incidentally missing items
a. files that are not there
Looking at the obligatory items only, we found the following files missing:
    A00157C1
    A00182M1
    A00894M1
    A00894N1
    A00894T1
    A00928M2
    A00940C3
    A00973A3
These were exactly the files mentioned in the documentation (\FIXED0FR\DOC\MISSING.TXT). This means that 1 call misses 3 items and 5 calls miss 1 item, which is 8 files in total.

b. files with empty transcriptions in the LBO label field
Looking at the obligatory items only, we found that 45 files did not contain speech in their LBO fields. 39 calls miss 1 item in this respect, and 3 calls miss 2 items.

If we combine the counts of a. and b. we end up with 44 calls missing 1 item, 3 calls missing 2 items, and 1 call missing 3 items, which is 53 files in total.

4. Overall conclusion
SpeechDat has the following criteria for missing items:
- 85% (850) out of 1000 calls must be complete
  . A maximum of 10% (100) of the calls may miss up to 3 mandatory items
  . A maximum of 5% (50) of the calls may miss more items
  (A complete call is one with all speech files recorded for all prompt items)

If we look at the missing files only, then the criterion is met very easily: only 6 calls miss up to 3 mandatory items and 0 calls miss more items. If we add the files that, according to their LBO field, do not contain the envisaged target speech, then we end up with 48 calls missing up to 3 items and 0 calls missing more items.
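
The missing-item criterion quoted above can be checked mechanically. The following is a minimal Python sketch, not the procedure used for this validation; the function name `check_completeness`, the session labels, and the reading that all three limits must hold at once are assumptions made for illustration.

```python
def check_completeness(missing_per_call, total_calls=1000):
    """Apply the SpeechDat missing-item limits to a per-call tally.

    missing_per_call maps a (hypothetical) session label to the number of
    mandatory items missing in that call; calls not listed count as complete.
    """
    counts = list(missing_per_call.values())
    up_to_3 = sum(1 for n in counts if 1 <= n <= 3)   # calls missing 1-3 items
    more = sum(1 for n in counts if n > 3)            # calls missing 4 or more
    complete = total_calls - up_to_3 - more
    # Integer arithmetic avoids rounding issues with the percentages.
    ok = (100 * complete >= 85 * total_calls
          and 100 * up_to_3 <= 10 * total_calls
          and 100 * more <= 5 * total_calls)
    return {"complete": complete, "up_to_3": up_to_3, "more": more, "ok": ok}


# Combined figures reported above: 44 calls missing 1 item,
# 3 calls missing 2 items, 1 call missing 3 items.
tally = {f"one{i}": 1 for i in range(44)}
tally.update({f"two{i}": 2 for i in range(3)})
tally["three0"] = 3
result = check_completeness(tally)
print(result)  # complete=952, up_to_3=48, more=0, ok=True
```

Calls that are structurally incomplete (for instance when an item type is absent from every call) would have to be added to the tally before the check says anything about them.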
This count is still well within the limits imposed by the above criterion. On the other hand, if we also take into account the structurally missing files (the two natural numbers in each call), then the criterion is obviously not met. There may also be other files that are effectively missing (corrupted speech files). These are dealt with in the next section.

==========================================================================
4. SAMPLED DATA FILES

1 File structure
  . SAM
    OK
2 Coding
  . A-law, 8 bit, 8 kHz
  . Compression by GZIP
    OK
3 Sample distribution
Several sample distributions are checked:

3.1 File length
We calculated the length of the files in seconds in order to trace spurious recordings, i.e. files of extraordinary length.

Duration distribution over all items:
Length (s)   #Occurrences
 1 -  2 :  233
 2 -  3 : 9900
 3 -  4 : 7076
 4 -  5 : 7419
 5 -  6 : 6663
 6 -  7 : 4189
 7 -  8 : 2492
 8 -  9 : 2988
 9 - 10 : 1483
10 - 11 : 1020
11 - 12 :  916
12 - 13 :  330
13 - 14 :  466
14 - 15 :  192
15 - 16 :   64
16 - 17 :   49
17 - 18 :   36
18 - 19 :   42
19 - 20 :   40
20 - 21 :   26
21 - 22 :   17
22 - 23 :   22
23 - 24 :   20
24 - 25 :   17
25 - 26 :   17
26 - 27 :   13
27 - 28 :   17
28 - 29 :   13
29 - 30 :   20
30 - 31 :   24
31 - 32 :    3
32 - 33 :    3
33 - 34 :    1
34 - 35 :    2
35 - 36 :    4
36 - 37 :    3
37 - 38 :    6
38 - 39 :    2
39 - 40 :    3
42 - 43 :    2
43 - 44 :    3
44 - 45 :    3
46 - 47 :    2
47 - 48 :    1
48 - 49 :    1
50 - 51 :    2
51 - 52 :    2
52 - 53 :    2
53 - 54 :    1
54 - 55 :    1
55 - 56 :    1
56 - 57 :    2
57 - 58 :    2
58 - 59 :    1
59 - 60 :    1
61 - 62 :   60

Duration distribution over obligatory items:
Length (s)   #Occurrences
 1 -  2 :  160
 2 -  3 : 6216
 3 -  4 : 4739
 4 -  5 : 6757
 5 -  6 : 5896
 6 -  7 : 3791
 7 -  8 : 2373
 8 -  9 : 2705
 9 - 10 : 1432
10 - 11 :  949
11 - 12 :  868
12 - 13 :  287
13 - 14 :  429
14 - 15 :  161
15 - 16 :   42
16 - 17 :   25
17 - 18 :   16
18 - 19 :   20
19 - 20 :   21
20 - 21 :   10
21 - 22 :   10
22 - 23 :    8
23 - 24 :    8
24 - 25 :    5
25 - 26 :   10
26 - 27 :    7
27 - 28 :    8
28 - 29 :    5
29 - 30 :   13
30 - 31 :   21

It can be derived
from these statistics that the items with a duration over 31 s are all optional items, in particular the comment item. As for the obligatory items with a duration over 20 s, we found no indications that these utterances are distorted. In most cases the target item itself was recorded correctly, but the recording platform failed to stop, due to a high offset, line noise, or background sounds. All items with durations over 20 s stem from the first 460 sessions. After that the recording platform was modified such that a maximum duration was imposed on each item. This has been documented in the file PLATFORM.TXT. The long items observed are fairly randomly distributed among session numbers and item types.

Duration distribution per call:
Length (s)   #Occurrences
 3 -  4 :  21
 4 -  5 : 474
 5 -  6 : 286
 6 -  7 :  80
 7 -  8 :  36
 8 -  9 :  27
 9 - 10 :  71
10 - 11 :   3
11 - 12 :   1
12 - 13 :   1

3.2 min-max samples
We provide a histogram of clipping ratios. The clipping ratio is defined as the number of samples in a file that are equal to the maximum or minimum value, divided by the total number of samples in the file. The histogram, then, is an overview of how many files were found in a set of clipping rate intervals. Files with a clipping rate higher than 0.4% must be regarded as potentially spurious.
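
The clipping ratio defined above can be sketched in a few lines of Python. This is an illustration only, not the tool used for the validation. It assumes that the A-law samples have already been expanded to 16-bit linear integers and that the extreme values are +32256 and -32256 (the absolute maximum quoted in the tables of this section); the function names are hypothetical.

```python
def clipping_ratio(samples, full_scale=32256):
    """Percentage of samples that sit at the maximum or minimum value.

    samples: sequence of linear PCM integers (assumed already decoded
    from A-law); full_scale: assumed magnitude of the extreme value.
    """
    if not samples:
        raise ValueError("empty sample sequence")
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    return 100.0 * clipped / len(samples)


def is_potentially_spurious(samples, threshold_pct=0.4):
    """Apply the 0.4% limit mentioned in the text."""
    return clipping_ratio(samples) > threshold_pct


# Tiny example: 2 of 4 samples are at the extremes.
print(clipping_ratio([0, 32256, -32256, 100]))  # 50.0
```

A file in which no sample reaches the extreme value has a ratio of 0 and does not enter the histogram intervals at all; such files are counted separately below the tables.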
Clip distribution for all files:
Clipping rate (in %)   Occurrences
0.0 - 0.1 : 2225
0.1 - 0.2 :  360
0.2 - 0.3 :  162
0.3 - 0.4 :   78
0.4 - 0.5 :   37
0.5 - 0.6 :   42
0.6 - 0.7 :   25
0.7 - 0.8 :   16
0.8 - 0.9 :   16
0.9 - 1.0 :    3
1.0 - 1.1 :    6
1.1 - 1.2 :    5
1.2 - 1.3 :    3
1.3 - 1.4 :    1
1.4 - 1.5 :    1
1.6 - 1.7 :    2
1.7 - 1.8 :    1
1.8 - 1.9 :    2
1.9 - 2.0 :    2
2.0 - 2.1 :    3
2.1 - 2.2 :    2
2.2 - 2.3 :    3
2.3 - 2.4 :    1
2.4 - 2.5 :    1
2.5 - 2.6 :    2
2.8 - 2.9 :    1
3.1 - 3.2 :    1
3.2 - 3.3 :    2
3.5 - 3.6 :    2
4.1 - 4.2 :    1
4.3 - 4.4 :    1
4.4 - 4.5 :    1
4.6 - 4.7 :    1
4.7 - 4.8 :    1
4.9 - 5.0 :    1
5.2 - 5.3 :    1
6.0 - 6.1 :    2
6.4 - 6.5 :    1
7.3 - 7.4 :    1
8.1 - 8.2 :    1
8.2 - 8.3 :    1
8.6 - 8.7 :    1
Number of files with absolute maximum < 32256: 42899

Clip distribution for obligatory items only:
Clipping rate (in %)   Occurrences
0.0 - 0.1 : 1806
0.1 - 0.2 :  295
0.2 - 0.3 :  124
0.3 - 0.4 :   66
0.4 - 0.5 :   30
0.5 - 0.6 :   39
0.6 - 0.7 :   22
0.7 - 0.8 :   12
0.8 - 0.9 :   11
0.9 - 1.0 :    2
1.0 - 1.1 :    5
1.1 - 1.2 :    5
1.2 - 1.3 :    2
1.3 - 1.4 :    1
1.4 - 1.5 :    1
1.6 - 1.7 :    2
1.7 - 1.8 :    1
1.8 - 1.9 :    1
1.9 - 2.0 :    2
2.0 - 2.1 :    2
2.1 - 2.2 :    2
2.2 - 2.3 :    3
2.3 - 2.4 :    1
2.5 - 2.6 :    2
2.8 - 2.9 :    1
3.1 - 3.2 :    1
3.2 - 3.3 :    2
3.5 - 3.6 :    1
4.3 - 4.4 :    1
4.4 - 4.5 :    1
4.6 - 4.7 :    1
4.7 - 4.8 :    1
4.9 - 5.0 :    1
5.2 - 5.3 :    1
6.0 - 6.1 :    2
6.4 - 6.5 :    1
7.3 - 7.4 :    1
8.1 - 8.2 :    1
8.2 - 8.3 :    1
8.6 - 8.7 :    1
Number of files with absolute maximum < 32256: 34537

There are no obvious differences between the distributions for the obligatory items and all items together. As for the obligatory items, it was found that files with a clipping ratio over 1.0% are, in general, severely distorted. These are 48 files in total.
A listing of clipping ratios in the files concerned follows:
    1.08 in file A00254L2.FRZ
    1.05 in file A00254L3.FRZ
    1.13 in file A00343A1.FRZ
    2.18 in file A00343A2.FRZ
    1.73 in file A00343A3.FRZ
    1.23 in file A00343A4.FRZ
    1.64 in file A00343A5.FRZ
    2.33 in file A00343A6.FRZ
    4.64 in file A00343C1.FRZ
    2.52 in file A00343C2.FRZ
    7.34 in file A00343C3.FRZ
    1.94 in file A00343D1.FRZ
    5.23 in file A00343D2.FRZ
    3.19 in file A00343D3.FRZ
    2.54 in file A00343E1.FRZ
    8.68 in file A00343E2.FRZ
    8.17 in file A00343E3.FRZ
    1.12 in file A00343I1.FRZ
    8.22 in file A00343L1.FRZ
    4.99 in file A00343L2.FRZ
    6.08 in file A00343L3.FRZ
    2.20 in file A00343M1.FRZ
    3.21 in file A00343M2.FRZ
    2.09 in file A00343N1.FRZ
    1.83 in file A00343Q2.FRZ
    1.34 in file A00343Q3.FRZ
    2.30 in file A00343S1.FRZ
    3.59 in file A00343S2.FRZ
    6.03 in file A00343S3.FRZ
    4.39 in file A00343S4.FRZ
    6.44 in file A00343S5.FRZ
    1.94 in file A00343S6.FRZ
    3.20 in file A00343S7.FRZ
    4.73 in file A00343S8.FRZ
    2.87 in file A00343S9.FRZ
    2.19 in file A00343T1.FRZ
    2.22 in file A00343T2.FRZ
    1.41 in file A00367E2.FRZ
    1.05 in file A00367S7.FRZ
    4.45 in file A00419C3.FRZ
    1.60 in file A00419E3.FRZ
    1.03 in file A00656L1.FRZ
    1.26 in file A00656S6.FRZ
    1.07 in file A00700S6.FRZ
    2.09 in file A00706A3.FRZ
    1.15 in file A00782L1.FRZ
    1.14 in file A00782T2.FRZ
    1.20 in file A00894S8.FRZ
By far most of these files stem from session 0343; other sessions appearing more than once in the list are: 0254, 0367, 0419, 0656, and 0782. Bad calls are examined in more detail below.

Clip distribution per call:
Clipping rate (in %)   Occurrences
0.0 - 0.1 : 262
0.1 - 0.2 :   9
0.2 - 0.3 :   4
0.3 - 0.4 :   3
0.4 - 0.5 :   1
3.1 - 3.2 :   1
Number of directories with absolute maximum < 32256: 720

SES0343 had a mean clipping rate of 3.14%. This call was found to be unacceptable due to clipping distortion.
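
Grouping flagged files by call relies only on the file naming pattern DDNNNNCC.LLF described in section 2, in which NNNN is the session code. A minimal Python sketch, with hypothetical helper names and not taken from the validation software, could look as follows.

```python
from collections import Counter


def session_code(file_name):
    """Return the NNNN part of a name that follows DDNNNNCC.LLF."""
    stem = file_name.split(".")[0]
    if len(stem) != 8:
        raise ValueError(f"unexpected file name: {file_name!r}")
    return stem[2:6]


def files_per_session(file_names):
    """Count how many of the given files belong to each session."""
    return Counter(session_code(name) for name in file_names)


# A few names taken from the listing above.
flagged = ["A00254L2.FRZ", "A00254L3.FRZ", "A00343A1.FRZ", "A00700S6.FRZ"]
print(files_per_session(flagged).most_common())
```

Sessions that contribute many flagged files are the natural candidates for inspection of the complete call.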
SES0656 (mean clipping rate 0.43), SES0254 (mean clipping rate 0.34), SES0700 (mean clipping rate 0.31), and SES0782 (mean clipping rate 0.31) must be considered spurious for some of their files, but not unacceptable. 3.3 Mean values We computed the mean sample value of each item in each call. We provide a histogram with mean values below. The histogram, then, is an overview of how many files were found in a set of mean sample value intervals. This overview can be used to trace files with large DC-offsets. Mean distribution over all items: Mean Occurrences -9900 - -9800 : 1 -9800 - -9700 : 2 -9600 - -9500 : 1 -9500 - -9400 : 2 -9100 - -9000 : 1 -9000 - -8900 : 2 -8900 - -8800 : 3 -8700 - -8600 : 4 -8500 - -8400 : 4 -8400 - -8300 : 1 -8300 - -8200 : 2 -8200 - -8100 : 1 -8100 - -8000 : 1 -8000 - -7900 : 2 -7900 - -7800 : 2 -7800 - -7700 : 2 -7700 - -7600 : 1 -7600 - -7500 : 2 -7500 - -7400 : 1 -7400 - -7300 : 2 -7300 - -7200 : 1 -7100 - -7000 : 2 -6900 - -6800 : 1 -6800 - -6700 : 2 -6700 - -6600 : 1 -6500 - -6400 : 1 -1200 - -1100 : 1 -1000 - -900: 25 -900 - -800 : 25 -800 - -700 : 45 -700 - -600 : 39 -600 - -500 : 41 -500 - -400 : 188 -400 - -300 : 685 -300 - -200 : 1688 -200 - -100 : 3092 -100 - 0 : 18214 0 - 100 : 15225 100 - 200 : 2025 200 - 300 : 1781 300 - 400 : 1014 400 - 500 : 613 500 - 600 : 378 600 - 700 : 214 700 - 800 : 213 800 - 900 : 86 900 - 1000 : 54 1000 - 1100 : 57 1100 - 1200 : 42 1200 - 1300 : 30 1300 - 1400 : 38 1400 - 1500 : 25 1500 - 1600 : 9 1600 - 1700 : 10 1700 - 1800 : 6 1800 - 1900 : 3 1900 - 2000 : 3 2000 - 2100 : 2 2100 - 2200 : 1 2200 - 2300 : 1 Mean distribution over obligatory items only: Mean Occurrences -9900 - -9800 : 1 -9800 - -9700 : 1 -9500 - -9400 : 2 -9100 - -9000 : 1 -9000 - -8900 : 2 -8900 - -8800 : 3 -8700 - -8600 : 3 -8500 - -8400 : 3 -8400 - -8300 : 1 -8300 - -8200 : 2 -8200 - -8100 : 1 -8100 - -8000 : 1 -8000 - -7900 : 2 -7900 - -7800 : 1 -7800 - -7700 : 2 -7700 - -7600 : 1 -7600 - -7500 : 1 -7500 - -7400 : 1 -7400 - 
-7300 : 2 -7300 - -7200 : 1 -7100 - -7000 : 2 -6800 - -6700 : 2 -6700 - -6600 : 1 -1200 - -1100 : 1 -1000 - -900 : 20 -900 - -800 : 19 -800 - -700 : 32 -700 - -600 : 25 -600 - -500 : 34 -500 - -400 : 150 -400 - -300 : 537 -300 - -200 : 1307 -200 - -100 : 2536 -100 - 0 : 14652 0 - 100 : 12347 100 - 200 : 1683 200 - 300 : 1418 300 - 400 : 798 400 - 500 : 493 500 - 600 : 294 600 - 700 : 168 700 - 800 : 162 800 - 900 : 60 900 - 1000 : 48 1000 - 1100 : 47 1100 - 1200 : 33 1200 - 1300 : 23 1300 - 1400 : 24 1400 - 1500 : 17 1500 - 1600 : 7 1600 - 1700 : 8 1700 - 1800 : 4 1800 - 1900 : 2 1900 - 2000 : 3 2000 - 2100 : 2 2100 - 2200 : 1 There are no marked differences between the distributions of all items and the obligatory ones. A subset of the files with an absolute mean value of more than 1000 was inspected (visually and auditorily). There were no indications of spurious files depending on high mean values. There was, however one exception: all files in SES0971. The sample values in this call had an average mean value of -8001, which was exceptionally low. All files in the call were severely distorted. The exceptional status of this call can be seen in the distribution of mean values over complete calls as listed below. Mean distribution over all calls: Mean Occurrences -8100 - -8000 : 1 -900 - -800 : 1 -600 - -500 : 3 -500 - -400 : 2 -400 - -300 : 13 -300 - -200 : 45 -200 - -100 : 62 -100 - 0 : 407 0 - 100 : 321 100 - 200 : 40 200 - 300 : 43 300 - 400 : 21 400 - 500 : 15 500 - 600 : 6 600 - 700 : 7 700 - 800 : 7 800 - 900 : 1 900 - 1000 : 2 1000 - 1100 : 2 1300 - 1400 : 1 3.4 Signal to Noise Ratio We split each signal file into contiguous windows of 10 ms and computed the Mean Square (energy) in each window. The mean sample value over the complete file was subtracted from each individual sample value before MS was computed. 5% of the windows that contained the lowest energy were assumed to contain line noise. 
In this way the signal to noise ratio could be calculated for each file by dividing the mean energy over all windows by the mean energy of the 5% lowest-energy windows mentioned above. The ratio was converted to decibels by taking 10*log10.

SNR distribution over all items:
SNR (dB)   Occurrences
 0 -  5 :    61
 5 - 10 :   104
10 - 15 :   287
15 - 20 :  1247
20 - 25 :  4244
25 - 30 : 10619
30 - 35 : 13327
35 - 40 :  9837
40 - 45 :  4696
45 - 50 :  1260
50 - 55 :   195
55 - 60 :    27
60 - 65 :     7
65 - 70 :     2
70 - 75 :     2
80 - 85 :     1
85 - 90 :     1
90 - 95 :     1

SNR distribution over obligatory items:
SNR (dB)   Occurrences
 0 -  5 :    42
 5 - 10 :    64
10 - 15 :   199
15 - 20 :   913
20 - 25 :  3137
25 - 30 :  8124
30 - 35 : 10592
35 - 40 :  8322
40 - 45 :  4237
45 - 50 :  1155
50 - 55 :   172
55 - 60 :    23
60 - 65 :     6
65 - 70 :     1
70 - 75 :     2
80 - 85 :     1
85 - 90 :     1
90 - 95 :     1

Among the obligatory items the following 42 have an SNR below 5 dB:
    4.8 in file A00027N1.FRZ
    2.9 in file A00286Q2.FRZ
    4.7 in file A00473Q3.FRZ
    3.1 in file A00623Q1.FRZ
    4.9 in file A00623Q3.FRZ
    4.0 in file A00866Q1.FRZ
    3.6 in file A00866Q3.FRZ
    1.6 in file A00880A1.FRZ
    1.8 in file A00880A2.FRZ
    2.8 in file A00880A3.FRZ
    2.0 in file A00880A4.FRZ
    1.6 in file A00880A5.FRZ
    1.1 in file A00880A6.FRZ
    3.8 in file A00880C1.FRZ
    4.7 in file A00880C2.FRZ
    2.6 in file A00880C3.FRZ
    2.0 in file A00880D2.FRZ
    4.7 in file A00880D3.FRZ
    4.0 in file A00880E1.FRZ
    1.8 in file A00880E2.FRZ
    3.2 in file A00880E3.FRZ
    2.5 in file A00880I1.FRZ
    2.3 in file A00880L1.FRZ
    4.5 in file A00880L2.FRZ
    2.3 in file A00880L3.FRZ
    1.7 in file A00880M1.FRZ
    1.6 in file A00880M2.FRZ
    0.9 in file A00880N1.FRZ
    0.5 in file A00880Q1.FRZ
    2.3 in file A00880Q2.FRZ
    0.8 in file A00880Q3.FRZ
    3.2 in file A00880S1.FRZ
    3.7 in file A00880S2.FRZ
    4.0 in file A00880S3.FRZ
    3.7 in file A00880S4.FRZ
    2.9 in file A00880S5.FRZ
    3.3 in file A00880S6.FRZ
    3.1 in file A00880S7.FRZ
    2.4 in file A00880S8.FRZ
    2.4 in file A00880S9.FRZ
    1.9 in file A00880T1.FRZ
    1.7 in file A00880T2.FRZ
These files must be considered corrupted (or empty).
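
The SNR estimate described in section 3.4 (10 ms windows, mean removal, lowest 5% of windows taken as noise) can be sketched as follows. This is an illustration under stated assumptions, not the validation software itself: samples are assumed to be linear integers at 8 kHz, a trailing partial window is dropped, and at least one window is always used for the noise estimate. The function name `snr_db` is hypothetical.

```python
import math


def snr_db(samples, sample_rate=8000, window_ms=10, noise_fraction=0.05):
    """Estimate the SNR of one file in dB from windowed mean-square energy."""
    win = sample_rate * window_ms // 1000          # 80 samples at 8 kHz
    if len(samples) < win:
        raise ValueError("signal shorter than one window")
    mean = sum(samples) / len(samples)             # remove the DC offset
    energies = []
    for start in range(0, len(samples) - win + 1, win):
        frame = samples[start:start + win]
        energies.append(sum((s - mean) ** 2 for s in frame) / win)
    energies.sort()
    n_noise = max(1, int(len(energies) * noise_fraction))
    noise = sum(energies[:n_noise]) / n_noise      # quietest 5% of windows
    signal = sum(energies) / len(energies)         # mean over all windows
    if noise == 0:
        return math.inf                            # digitally silent noise floor
    return 10.0 * math.log10(signal / noise)


# Synthetic check: 19 loud windows and 1 quiet window of 80 samples each.
loud = [100, -100] * 40
quiet = [10, -10] * 40
print(round(snr_db(loud * 19 + quiet), 2))
```

Because the numerator is the mean energy over all windows, including the quiet ones, a file that contains only line noise or a constant hum yields a value close to 0 dB, which is consistent with treating very low values as a sign of corrupted or empty files.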
It can easily be seen that most of these files originate from call 0880. This call contains a severe 50 Hz buzz, which makes the call unacceptable.

SNR distribution over calls:
SNR (dB)   Occurrences
 0 -  5 :   1
10 - 15 :   4
15 - 20 :  14
20 - 25 :  66
25 - 30 : 244
30 - 35 : 350
35 - 40 : 238
40 - 45 :  76
45 - 50 :   6
50 - 55 :   1

SES0880 had a mean SNR of 2.5 dB and is unacceptable.
Calls with a mean SNR between 10 and 15 dB were
    SES0623: 11.5 dB  This call contains a high background buzz
    SES0830: 12.5 dB  This call contains a 50 Hz + 2500 Hz background buzz
    SES0866: 13.0 dB  This call contains a 50 Hz background buzz
    SES0757: 14.5 dB  This call contains a 50 Hz + 2500 Hz background buzz
These four calls are acceptable, except for the files with an SNR below 5 dB, listed above.

==========================================================================
5. ANNOTATION FILES

- No illegal mnemonics used
    OK
- There are no mnemonics missing
    OK
- all mnemonics should be SAM mnemonics or explicitly defined in documentation
    OK
- Mandatory (SAM) mnemonics:
    LHD: V5.0
    DBN: SPEECHDAT(M)_
    VOL: FIXED0_
    SES:
    DIR:
    SRC:
    CCD:
    RED:
    RET:
    SAM: 8000
    BEG:
    END:
    SNB: 1
    SBF:
    SSB: 8
    QNT: A-LAW
    CMP: GZIP, 1.2.4
    SCD:
    SEX: male/female/unknown (one letter)
         ! SEX and AGE may also only appear in speaker table if SCD is provided in label file
    AGE: ! mnemo is not SAM
    REG:
    LBD:
    LBR: , , [gain], [minimum value], [maximum value],
    LBO: , , ,
    EXT: [if needed for LBR and LBO, > 80 char]
    ELF:
  . LHD and TYP are first
  . LBR and LBO come after LBD
  . ELF is end of file keyword
  . no line may exceed 80 chars
    => The mnemonic SCD is missing.
    => The item REG is left empty.
- Optional (SAM) mnemonics (may be omitted or left empty)
    TYP: orthographic
    TXF:
    CMT:
    NCH: 1
    ARC: ! mnemo is not SAM
    SHT: ! mnemo is not SAM
    EXP:
    SYS:
    DAT:
    SPA:
    PHM: ! mnemo is not SAM
    ACC: ! mnemo is not SAM
    NET: fixed/gsm ... ! mnemo is not SAM
    EDU: ! mnemo is not SAM
    SOC: ! mnemo is not SAM
    REP:
    PCF:
    RCC:
    ENV:
    ASS: !
mnemo is not SAM OK, optional items used are: SHT, REP, ASS - All files must contain the same mnemonics. This holds as well for the optional mnemonics. OK - Each lowest subdirectory does not refer to multiple sheet ids. OK - For spontaneous speech LBR should be left blank or contain a mnemonic word (like ). => None of the permitted options was used. LBR just contains the question asked. - Obligatory and optional label mnemonics not provided in the label files should be provided in the file `CONTENTS.LST' from which this information can be derived (and added to the label file by the validating institute, if necessary). Not relevant - Transliterations only in lower case letters, also at sentence beginning Only exception: proper names and spelled words, ZIP codes, acronyms and abbreviations In the latter case blanks should be used in between the letters. German is the only exception to this convention. OK - Punctuation marks should not be used in the transliterations OK - Digits must appear in full orthographic form OK - In principle only the following symbols are allowed to indicate non-speech acoustic events: [Filled_Pause] [Speaker_Other] [Nonspeaker_Other] Other symbols (and language equivalents) must be mentioned in the documentation For the French database 10 non-speech acoustic event types are mentioned in the documentation (TRANSEN.TXT). These correspond exactly to the words we found in brackets in the LBO fields. Also spans of noise are indicated. This is done by using slashes within the brackets. => In this respect the database deviates from the SpeechDat conventions, which determine that each individual word should be marked in case a span of speech is affected. - Asterisks should be used to indicate incomplete realisations OK - According to a spelling check on annotated text (including bracket check) up to 1% errors may be found A spelling checker was used (WORD 6.0), => but there is no information about error rates. 
- The label files are associated with the correct speech files.
    (This cannot be done automatically at this moment. We can only point at files that are incidentally found to be mismatched during the transcription and/or speech file validation)
- Assessment of speech items in terms of SNR, presence of additional noise, adherence to prompting text is provided (optional)
    OK, provided under mnemonic ASS.

=======================================================================
6. LEXICON

- Check lexicon existence
    OK
- Lexicon contents should be taken from actual utterances (from LBO)
    OK
- The entries should be alphabetically ordered (ISO)
    OK
- In transcriptions only SAMPA symbols are allowed
    OK
- Capitals only in proper names, spelled words, and in single letters derived from abbreviations (exception: German)
    According to documentation, this is done. A visual inspection of the lexicon confirmed this.
- Phoneme symbols must be separated by blanks
    OK
- A line in the lexicon should have the following format
    [ ] []
    OK
- Alternative transcriptions are optional. They may follow the first transcription, separated by [TAB], or have a separate entry (only in case frequency information is also supplied)
    Not provided.
- Orthographic entries are as a rule split by apostrophes, but not by dashes.
    Entries are not split. Entries containing a dash and/or apostrophe are provided as a single entry.
- The lexicon should be complete
  . Check for undercompleteness (are all words in lexicon)
    => There were 106 words that could not be found in the lexicon. These are:
    Word : Freq.
of Occurrence Arias: 1 D'abord: 1 Hautes-Alpes: 1 Industrielle: 1 Jessie: 2 Lot: 1 Nos: 1 Prophète: 1 Saint-Antoine: 1 Tacite: 1 anniversaires: 1 anonyme: 1 attendez: 1 aveugle: 1 bienvenu: 1 bourgeoises: 1 boutefeu: 1 commenter: 1 concentrer: 1 consonances: 1 couche: 1 crimes: 1 cruels: 1 cuivres: 1 discrète: 1 dise: 1 disparates: 1 dissoudre: 1 distraite: 1 dépecée: 1 etc: 1 expressions-types: 1 fastes: 1 fur: 1 gigantesques: 1 glace: 1 grammes: 1 graphiques: 1 hexagone: 1 irremplaçable: 1 l'Aine: 1 l'Aveyron: 1 l'innovation: 1 l'intrigue: 2 l'état-major: 1 lointaines: 1 légende: 1 législatifs: 1 m'incline: 1 m'induis: 1 machin: 2 maquettes: 2 meure: 1 microscope: 1 modiques: 1 méprise-t-il: 1 métropoles: 1 neuvième: 1 nuls: 1 optimistes: 1 organique: 1 paumée: 1 pente: 1 phare: 1 pins: 1 pipe: 1 podium: 1 presbytes: 1 privilège: 1 pronostic: 1 protagonistes: 1 préférée: 1 prénom: 1 péter: 1 quatre-vingt-quartorze: 1 racie: 1 regroupe: 1 rétracte: 1 s'entre-déchirent: 4 sabre: 1 sel: 1 sinistre: 1 sols: 1 sous-fifre: 1 spécif: 1 surveille: 1 syndicale: 1 tardif: 1 tutelle: 1 verbal: 1 voit-il: 1 Â: 26 Ç: 18 È: 72 É: 787 Ê: 21 Ë: 14 Î: 30 Ï: 33 Û: 10 á: 4 ç'a: 2 ç'aurait: 1 é: 3 échéances: 1 équipements: 1 . Check for overcompleteness (invalid words have a * and should not be in lexicon) (the same goes for words truncated due to a recording error; this is indicated by ~) => We found 259 entries in the lexicon that did not appear in the transcriptions. - Optional information: stress, word/morphological/syllabic boundaries. But, if provided, then it should follow the SpeechDat conventions. Optional information is not provided. ========================================================================= 7. SPEAKERS - Speaker database file . check existence OK - Allowed formats: a. SAM mnemonics b. record file with commas as field separators and strings between double quotes OK, SAM format provided. - Obligatory information: SAM: 1. 
unique number (speaker/caller) SCD (or less preferably SES) 2. sex SEX 3. age AGE 4. region of call REG => SES is used instead of SCD. => REG is not provided. - Optional information: . height HET . weight WET . native language NLN . accent ACC . ethnic group ETH . education level EDL . smoking habits SMK . pathologies PTH . socio-economic status SOC Provided are ACC and NLN (native language). - Balance of sexes . How many males, how many females, should match specification in documentation file . Disbalance may not exceed 5% OK, 521 males and 479 females were recorded. Thus, the sex distribution is well balanced according to our criterion. - Balance of regions . which regions and how many of each should match specification in documentation file OK The following regional distribution was observed: Unknown: 19 = 1.90 % Alsacien: 13 = 1.30 % Bourbonnais, Berry: 12 = 1.20 % Bourgogne: 48 = 4.80 % Breton: 101 = 10.10 % Champagne, Brie: 30 = 3.00 % Corse: 1 = 0.10 % Dauphiné, Savoie: 26 = 2.60 % Franche-Comté: 2 = 0.20 % Gascon: 25 = 2.50 % Ile-de-France ('Parisien'): 353 = 35.30 % Languedoc-Méditerranéen: 12 = 1.20 % Languedoc-Occidental: 9 = 0.90 % Limousin: 12 = 1.20 % Lorraine germanophone: 11 = 1.10 % Lorraine-Romane: 24 = 2.40 % Lyonnais, (Forez): 34 = 3.40 % Massif Central: 3 = 0.30 % Normandie: 111 = 11.10 % Picardie: 63 = 6.30 % Poiton, Aunis, Angounnois, Saintonge: 19 = 1.90 % Provençal: 27 = 2.70 % rest of the world: 45 = 4.50 % This matches the figures in the documentation. There is a clear overbalance of speakers from Paris. The nationality distribution was the following: FRENCH: 956 = 95.60 % NOT_FRENCH: 44 = 4.40 % - Balance of ages . which age groups and how many of each should match specification in documentation file . A minimum of 20% of speakers must be in following age groups: 17-30, 31-45, 46-60. A maximum of 40% speakers may be younger than 17 or older than 60. 
  OK, the following age distribution was observed:
    under 17: 24 = 2.40 %
    17 - 30 : 320 = 32.00 %
    31 - 45 : 353 = 35.30 %
    46 - 60 : 289 = 28.90 %
    over 60 : 14 = 1.40 %
  This is well in agreement with the criteria.

======================================================================

8. RECORDING CONDITIONS

- Digital telephone line
  OK
- A-law coding
  OK
- Specification of wireless telephone or not (optional)
  OK, in HANDSET.LST
- Time stamps on file
  OK
- Recording information may be stored in a separate file (optional)
  - this file may have two formats:
    a. SAM mnemonics
    b. record table with commas as field separators and strings
       between double quotes
  - The primary key in the label file is the RCC mnemonic
  - name of file: \FIXED0FR\TABLE\REC_COND.SAM
    or \FIXED0FR\TABLE\REC_COND.TBL
  - Information:                  SAM:
    . recording conditions code   RCC
    . region of call              REG
    . telephone area code         ARC
    . environment                 ENV
    . telephone model             PHM
    . telephone network           NET
    . recording city              CTY
    . recording car               CAR
    . speed                       SPD
    . fan noise                   FAN
    . ground type                 GRD
    . wipes                       WIP
  Not provided
  A file HANDSET.LST is provided in the INDEX directory. It lists the
  handset type (CORDED, CORDLESS) used in each call, keyed by the
  session number of the call.

============================================================================

9. TRANSCRIPTION

This validation is carried out by taking 5% of the short items and 5% of
the long items in the corpus. The transcriptions in the label files for
these samples are checked by listening to the corresponding speech files.
This check is performed by native speakers of the language involved.

Short items are:
- isolated digit
- time phrases
- date phrases
- yes/no questions
- place name
- application words

Long items are:
- connected digits
- natural numbers
- money amounts
- spelled words
- application phrases
- phonetically rich sentences

A random selection of 1074 long items and 960 short items was used for
transcription validation.

A. Long items

In 491 of the 1074 checked items a correction was made. By far the most
corrections (388) were related to the transcription of non-speech
acoustic events. There were 61 corrections in the transcription itself,
of which 43 were not serious (a difference in just one phoneme of the
whole item). We found 1 typing error, and 42 combined errors. This yields
a total of 60 serious errors on 1074 items. This is an error rate of
5.59%, which exceeds the criterion value of 5%, but only marginally.

B. Short items

In 316 of the 960 checked items a correction was made. By far the most
corrections (290) were related to the transcription of non-speech
acoustic events. There were 15 corrections in the transcription itself,
of which 7 were not serious (a difference in just one phoneme of the
whole item). We found 3 typing errors, and 8 combined errors. This yields
a total of 19 serious errors on 960 items. This is an error rate of
1.98%, which is well below the criterion value of 5%.

==========================================================================

10. SUMMARY

Below we give a brief overview of our findings with respect to the French
database. The subsections follow the order of the various topics in the
previous sections of the report.

In general, the database closely follows the SpeechDat format
specifications. This is in particular true for directory structures,
filenames, information tables and lexicon. A more detailed summary
follows below.

1. Documentation

The documentation was partitioned over a set of files not conforming to
the SpeechDat specifications. In general the information in the
documentation files is very complete. A few omissions were observed:
- There is no motivation for the selection of the speaker regions.
- There is only scant information on how the lexicon was generated
  (which software, manual checking etc.)
- There is no information about double checking of files.

2. Data base structure and file names

The data base structure and file names were well in agreement with the
SpeechDat specifications. Only the structure of the documentation and the
filenames in \FIXED0FR\DOC deviate from SpeechDat specifications.
A minor point:
In CONTENTS.LST:
- Region of call (REG) is replaced by assessment (ASS), as is indicated
  in the documentation.

3. Items

All obligatory items are present except for two natural numbers.
The following additional (optional) items are provided:
- 1 yes/no question (Q4)
- 1 time of day (T3)
- 3 application words (A7-9)
- 1 comment (R1)
- 3 words (spoken versions of spelled items) (W1-3)
All mandatory application words are present in the database, apart from
the word 'relire'.

SpeechDat has the following criteria for missing items:
- 85% (850) out of 1000 calls must be complete
  . A maximum of 10% (100) of the calls may miss up to 3 mandatory items
  . A maximum of 5% (50) of the calls may miss more items
(A complete call is one with all speech files recorded for all prompt
items)

If we look at the missing files only, then the criterion is met very
easily: only 4 calls miss up to 3 mandatory items and 0 calls miss more
items. If we add the files that, according to their LBO field, do not
contain the envisaged target speech, then we end up with 48 calls missing
up to 3 items and 0 calls missing more items. This is still well within
the limits imposed by the above criterion. On the other hand, if we also
take into account the structurally missing files (which are the two
natural numbers in each call), then the criterion is obviously not met.

4. Sampled data files

As for the obligatory items with a duration over 20 s, we found no
indications that these utterances are distorted.
It was found that files with a clipping ratio over 1.0% are, in general,
severely distorted. Among the obligatory items there are 48 such files in
total. The call recorded as session 0343 had a mean clipping rate of 3.14%.
It was found to be unacceptable due to clipping distortion.
There were no indications of spurious files based on high mean values.
There was, however, one exception: all files in session 0971 had an
average mean value of -8001, which was exceptionally low. All files in
the call were severely distorted.
Among the obligatory items 42 files had an SNR below 5 dB. These files
must be considered corrupted (or empty). It can be easily derived that
most originate from session 0880. This call contains a severe 50 Hz buzz,
which makes the call unacceptable.

5. Label files

In the label files the mnemonic SCD is missing and the mnemonic REG is
left empty. Both are obligatory mnemonics. SES is used as key field.
Optional items used are: SHT, REP, ASS.
For spontaneous speech LBR should be left blank or contain a mnemonic
word (like ), but none of the permitted options was used. LBR just
contains the question asked.

6. Lexicon

The lexicon is in the proper SpeechDat format. Entries containing an
apostrophe or dash are not split.
There were 106 words that could not be found in the lexicon.
We found 259 entries in the lexicon that did not appear in the
transcriptions.

7. Speakers

The speaker table was delivered in the proper SAM format. However, in
this table SES was used instead of SCD and REG is not provided. Optional,
extra mnemonics in the table are ACC (accent of speaker) and NLN (native
language).
521 males and 479 females were recorded. The imbalance between the sexes
therefore stays within the 5% limit.
The balance of the regions is as described in the documentation. There is
a clear overrepresentation of speakers from Paris.
The balance of speakers' ages was well in agreement with the SpeechDat
specifications: more than 20% of the speakers fell in each of the
following age categories: 17-30, 31-45, 46-60 years of age.

8. Recording platform

The recording conditions of the database complied with the SpeechDat
requirements.

9. Transcription

For the long items we counted a total of 60 serious errors on 1074 items.
This is an error rate of 5.59%, which exceeds the criterion value of 5%,
but only marginally.
For the short items a total of 19 serious errors on 960 items was
counted. This is an error rate of 1.98%, which is well below the
criterion value of 5%.

=========================================================================
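
The percentages quoted in this summary can be recomputed from the counts
reported in sections 7 and 9. The Python sketch below is only an
illustration of that arithmetic: the function and variable names are
assumptions made here and are not part of any SpeechDat or SPEX tooling.
The counts and thresholds are those stated in this report.

```python
# Recompute percentages quoted in the report from the reported counts.
# All names are illustrative assumptions, not part of any SpeechDat tool.

def percentage(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# Section 9: serious transcription errors among the checked items.
TRANSCRIPTION_LIMIT = 5.0  # criterion value in percent
long_rate = percentage(60, 1074)   # long items
short_rate = percentage(19, 960)   # short items
long_ok = long_rate <= TRANSCRIPTION_LIMIT
short_ok = short_rate <= TRANSCRIPTION_LIMIT

# Section 7: observed age distribution (number of speakers per group).
age_counts = {
    "under 17": 24,
    "17-30": 320,
    "31-45": 353,
    "46-60": 289,
    "over 60": 14,
}
n_speakers = sum(age_counts.values())

# At least 20% of speakers in each of the three core groups.
core_shares = {
    group: percentage(age_counts[group], n_speakers)
    for group in ("17-30", "31-45", "46-60")
}
# At most 40% of speakers younger than 17 or older than 60.
outside_share = percentage(
    age_counts["under 17"] + age_counts["over 60"], n_speakers
)
ages_ok = all(s >= 20.0 for s in core_shares.values()) and outside_share <= 40.0

print(f"long items:  {long_rate:.2f}% (within limit: {long_ok})")
print(f"short items: {short_rate:.2f}% (within limit: {short_ok})")
print(f"age criteria met: {ages_ok}")
```

The same percentage helper could be applied to the regional and
nationality counts in section 7, since all of them are reported against
the same total of 1000 speakers.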