SUBJECT: Validation German FDB1000 SpeechDat corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 1.0
DATE : August 1997

The speech databases made within the SpeechDat project were validated
by SPEX, Leidschendam, the Netherlands, to assess their compliance with
the SpeechDat format and content specifications, as documented in
Deliverables 1.3.1, 1.3.2 and 1.3.3 of the project.

The validation results of the German Fixed Network SpeechDat database
(first 1000 calls) are contained in this document. The validation of the
full corpus of 4000 speakers will follow later.

In the validation procedure we systematically check a list of validation
criteria for a range of subjects. In the following sections we will
evaluate these criteria one by one. Validation results that call for
attention are marked by => throughout the document and can be extracted
by using a grep command.

The following subjects were validated:
 1 DOCUMENTATION
 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
 3 ITEMS
 4 SAMPLED DATA FILES
 5 ANNOTATION FILES
 6 LEXICON
 7 SPEAKERS
 8 RECORDING PLATFORM
 9 TRANSCRIPTION
The document is concluded by
10 SUMMARY

====================================================================
1. DOCUMENTATION

- File DESIGN.DOC; & deliverables SD131 and SD132 can be handy
  OK, but
=> File is delivered as DESIGNDE.PS in postscript format. It appears to
=> be unprintable on some locations. It is safer to make it an MS-WORD file,
=> according to SpeechDat specs.
- Language of doc file: English
  OK
- Contact person: name, address, affiliation
  OK
- Number of CDs
  OK, section 1
- Contents of each CD
=> Contents of each CD not specified.
- The directory structure of the CDs
  OK, section 1.3
- Description of all the items in the corpus
  OK, sections 1.2 and 3. However,
=> Files of item type X are not described anywhere but some 400 files are
=> in the database. Presumably these files were erroneously included,
=> since only speech files and no label files were delivered.
=> Section 3.4 contains an error. The database contains 1 spontaneous => date (D1), and two prompted ones (D2 and D3). => Section 3.12 misses a description of the spontaneous time item. - Prompting . linguistic specification (and motivation) for the prompting material (in case of additional optional items) OK, section 3 (header) for Y type items. . connection of sheet items to item numbers on CD OK, section 1.2 . sheet example OK, reference in section 8.2 . items must be spread over the sheet to prevent list effects (e.g. three yes/no questions immediately after another are not allowed) OK, section 2.3 - Naming conventions for directories and files OK, section 1 - Speaker recruitment OK, section 2.2 - Speaker demographics . which regions, how many of each . motivation for selection of regions . which age groups, how many of each . sexes: males, females, also children?; how many of each. . each call is made by a unique speaker OK, section 4 - Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich sentences (either of phones, biphones, triphones) . minimum number of phone examples = #speakers/10 OK, section 3.11. 5 of the listed phonemes occur less than 100 times. - Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich words (either of phones, biphones, triphones) . minimum number of phone examples = #speakers/5 OK, section 3.13. 12 of the listed phonemes occur less than 200 times - Recording platform and telephone link decription (which part is digital) OK, section 2.1. No information about telephone network. - Signal characteristics (number of bits per sample; bandwidth; coding type; compression procedures) OK, section 1.1 - The format of the speech files (A-law, 8 bit, 8 kHz, uncompressed) OK, section 1.1 - The format of the annotation files (SAM label files) OK, sections 1.1 and 1.4 - Annotation . procedure . quality assurance . 
character set used for annotation (transcription) (ISO-8859) . annotations symbols for non-speech acoustic events must be mentioned at least for [Filled_Pause] [Speaker_Noise] [Stationary_noise] [Intermittent_noise] . list of symbols used to denote word truncations, mispronunciations and not understandable speech . case sensitivity of transcriptions OK, section 2.4 - Lexicon information . Procedures to obtain phonemic forms from orthographic input (lexicon generation and lay out) . (Reference to) SAMPA symbols used . case sensitivity of entries (matching the transcriptions) OK, section 5 - Only one spelling of each word is allowed. Therefore a list of normalised spellings for words with alternative spellings should be included (SPELLALT.DOC). Otherwise a statement why such a list is not necessary. => A comment to normalisation of spellings was not found - Indication of how many of the files were double checked by the producer together with percentage of detected errors OK, section 2.4, but this is only for transcriptions. - Other remarks: => Table of contents and table numbers are missing in DESIGNDE.PS => It is not necessary to include lists of all prompting material. => Section 7 is empty => In your final doc you cannot say that the list of credit card nrs => and PIN codes are "of K. Kordi". => In section 3.6.2 the list is announced but not given. => The table in section 4.2 with speaker ages is completely wrong. ========================================================================== 2. DATABASE STRUCTURE CONTENTS AND FILE NAMES => Only 988 calls are included instead of 1000. - Directory / subdirectory conventions Format of directory tree should be \\\ . data base: defined as <#> can be FIXED, MOBIL, VERIF <#> is 0 for SpeechDat(M) and 1 for SpeechDat is the ISO two-letter code for the language . block : defined as BLOCK where is a progressive number from 00 to 99. Block numbers are unique over all CDs. They correspond to the first two digits of below. . 
session: defined as SES where is the session code also appearing in file name OK - All text files should be in MS-DOS format ( at line ends => The following files do not obey this convention: => DISK.ID, README.TXT, COPYRIGH.TXT, CONTENTS.LST, SPEAKER.TBL, LEXICON.TBL - A README.TXT file should be in the root describing all (documentation) files on the CD-ROM. OK => The README.TXT is not in MS-DOS format - A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. This file supplies the volume label to UNIX systems that cannot read the physical volume label. Example of contents: FIXED1EN_01. OK - A copyright statement should be present in the file COPYRIGH.TXT (root) OK => The COPYRIGH.TXT is not in MS-DOS format - Documentation should be in \\DOC . DESIGN.DOC . TRANSCRIP.DOC (optional) . SPELLALT.DOC (optional) . SAMPALEX.PS . ISO8859<1,2,7>.PS . SUMMARY.TXT . SAMPSTAT.TXT => Missing in this directory are SAMPSTAT.TXT. => DESIGN.DOC is delivered as a postscript file DESIGNDE.PS which => is not recommended, since it cannot be printed everywhere. => SUMMARY.TXT is erroneously in FIXED1DE\TABLE => SAMPALEX.PS is erroneously in FIXED1DE\SOURCE as SAMPA_DE.PS => ISO88591.PS is erroneously in FIXED1DE\SOURCE - The contents list (CONTENTS.LST) is in \\INDEX OK - Tables should be in \\TABLE . SPEAKER.TBL . LEXICON.TBL . REC_COND.TBL (optional) . SESSION.TBL (optional) OK, SPEAKER.TBL and LEXICON.TBL are there - Index files (optional) should be in \\INDEX Only CONTENTS.LST is mandatory. 
There are no further index files - Prompt sheet files (optional) should be in \\PROMPT OK - File naming conventions All file names should obey the following pattern: DDNNNNCC.LLF DD : database identification code For SpeechDat : A1 = fixed net, B1 = mobile, C1 = speaker verification NNNN : session code 0000 to 9999 CC : item code; first character is item type identifier, second character is item number LL : ISO-639 language code (with extensions) F : speech file type A is for A-law O is for Orthographic label file OK - Correct item codes should be used: A1-3/6: common application words B1 : sequence of isolated digits C1 : prompt sheet number C2 : telephone number C3 : credit card number C4 : PIN code D1-3 : dates E1 : application word phrase I1 : isolated digit L1-3 : spelled words M1 : money amount N1 : natural number O1 : spontaneous name O2 : city of call/birth O3 : most frequent city name O5 : most frequent company/agency name O7 : forename & surname Q1-2 : yes/no questions S1-9 : phonetically rich sentences T1 : time of day T2 : time phrase W1-4 : phonetically rich words OK - NNNN in filenames is not in conflict with BLOCK and SES numbers in pathname OK - Contents lowest level subdirectories should be of one call only OK - Empty (i.e. zero-length) files are not permitted => 23 empty files were found: A11774A3.DEO A11774B1.DEO A11774C1.DEO A11774C2.DEO A11774S6.DEO A11774S7.DEO A11774S8.DEO A11774S9.DEO A11774T1.DEO A11774T2.DEO A11774W1.DEO A11774W2.DEO A11774W3.DEO A11774W4.DEO A11774Y1.DEO A11774Y3.DEO A11774Y4.DEO A11774Y5.DEO A11775A1.DEO A11775A2.DEO A11775A3.DEO A11775B1.DEO A11775C1.DEO - Counts should match information in documentation . count of files in each subdirectory . count grand total OK, 988 calls are present - Missing items per speaker Check with documentation => There is no information about missing files in the documentation - File match: For each label file there must be one speech file and vice versa. 
=> For A11433S9.DEO there was no matching speech file. => For A11164W3.DEA, A11487W2.DEA, A11711W3.DEA, A11877W3.DEA there => were no matching label files => For 255 X1 speech files there were no matching label files => For 228 X2 speech files there were no matching label files => But probably the X files should not be there anyhow. - Part of the corpus is designed for training and a (typically smaller) part for testing. To be arranged for full corpus - The contents of the database as given in CONTENTS.LST should comprise . CD-ROM volume name (VOL:) . full pathname (DIR:) . speech file name (SRC:) . corpus code (CCD:) . corpus repetition (CRP:) . speaker code (SCD:) . speaker sex (SEX:) . speaker age (AGE:) . speaker accent (ACC:) . orthographic transcription of uttered item (LBO:) The first line should be a header specifying the information in each record. This file must be supplied as an ASCII TAB delimited file. OK, but => the transcription information erroneously contains sample numbers => CONTENTS.LST misses the line feeds () => The contents of CONTENTS.LST is CD-dependent - The contents of the SUMMARY.TXT files should comprise: . The full directory name where speech and label files are to be found . the session number . a string of typically N codes. Each item present is represented by its code. If the item is missing, a '--' should appear. . recording date . recording time of first item . optional comment text . all these fields are separated by spaces . Note: The contents of the SUMMARY.TXT file are not CD-dependent OK, but => SUMMARY.TXT has an illegal ^M after each date field => The contents of SUMMARY.TXT are CD-dependent which should not. => The SUMMARY.TXT file of CD4 has no line feeds ====================================================================== 3. ITEMS - 1 isolated digit (code I1) . read or prompted OK - 1 sequence of 10 isolated digit (code B1) . each sequence must include all digits . 
optional are hash and star OK, star and hash were used - 4 connected digits (code C1-4) - 4-6 digit number to identify the prompt sheet . read OK - ~10 digit telephone number . read . local numbers . inclusion of GSM numbers recommended OK - 14-16 digit credit card number . read . set of 150 . if there is a checksum then formula must be provided OK, 150 different credit card numbers were found. - 6 digit PIN code . read . set of 150 OK => 149 different PIN codes were detected. . ~30 digits per call are required . digits must appear numerically on the sheet, not as words OK, the producers did not warrant that each digit occurred at least once in the connected digits of a call. - 1 natural number (code N1) . read . provided as numbers (numerically) . numbers must be < 1,000,000 . decimal numbers only allowed for additional natural numbers OK - 1 money amount (code M1) . read . currency words should be included . mixture of small amount including decimals and large amounts not including decimals OK - 3 spelled words (code L1-3) . L1 is spontaneous name spelling linked to O1 . others are read . equal balance of all vocabulary letters artificial words can be used to enforce this balance . average length at least 7 letters . may include names, cities and other frequently spelled items . should include equivalents of : A-Z, accent words, CAPITAL, SMALL, UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE, HYPHEN OK - 1 time of day (code T1) . spontaneous OK - 1 time phrase (code T2) . read . analogue form . equal balance of all words . should include equivalents of : AM/PM, HALF/QUARTER PAST/TO, NOON, MIDNIGHT, MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY, TOMORROW OK - 1 date (code D1) . spontaneous OK - 1 date (code D2) . read, wordstyle . analogue form . covering all weekdays and months, ordinals and year expressions (also exceeding 2000) OK - 1 relative date (code D3) . read . analogue . 
should include forms such as TODAY, TOMORROW, THE DAY AFTER TOMORROW,
    THE NEXT DAY, THE DAY AFTER THAT, NEXT WEEK, GOOD FRIDAY,
    EASTER MONDAY, etc.
  OK
- 2 yes/no questions (code Q1-2)
  . spontaneous, not prompted
  . one question should elicit (predominantly) 'no' answers;
    the other (predominantly) 'yes' answers
  . also fuzzy answers should be envisaged
  OK
- 3/6 common application words (code A1-3/6)
  . read
  . set of 30 should be used, 25 of which are fixed for all
  . minimum number of examples of each word = #speakers/10
  . 6 are needed, but only 3 for 4000+ FDBs
=> In the list of application words an equivalent of is missing
=> The application word 'German' is listed in section 3.1 of the documentation
=> file, but is not recorded
=> The following application words are recorded but are not in
=> the documentation file: Beantworter, Bestätigung, Eingabe,
=> Nachricht, Raute, Stern
  All application words are recorded more than 80 times.
- 1 application word phrase (code E1)
  . application word is embedded in phrase
  . read or spontaneous
  OK
- 9 phonetically rich sentences (code S1-9)
  . read
  . minimum number of phone examples = #speakers/10
  OK. For the following phones fewer than 100 examples were reported in
  the doc file (section 3.11): O~, dZ, dS, Z, a~.
  These are all rare phones in German.
- 4 phonetically rich words (code W1-4)
  . read
  . minimum number of phone examples = #speakers/5
  OK. For the following phones fewer than 200 examples were reported in
  the doc file (section 3.13):
  O~, a~, dZ, Z, E:, tS, pf, 9, Y, OY, y:, 2:
- 5 directory assistance names (code O1-7)
  . 1 spontaneous name (e.g. forename)
    OK
  . 1 spontaneous city name
    OK
  . 1 read city name (from list of 500 most frequent)
    OK, but
=> only 333 different city names were found
  . 1 read company/agency name (from list of 500 most frequent)
    OK, but
=> only 323 different company names were found
  . 1 read proper name, fore- and surname (from list of 150 SDB names)
    OK, but
=> 151 names were detected.

1.
Structurally missing items
   All obligatory items were recorded.

2. Incidentally missing items
   a. files that are not there
=> First of all, only 988 calls are provided instead of 1000.
=> So 12 occurrences of each item are missing anyhow
      Furthermore also 23 empty files were spotted (see section 2).
      By scanning the database recordings we found 156 missing files
      in addition. If we add these to the empty files then we get the
      following distribution of missing files per item :
        10 A1   5 A2   8 A3   6 B1   3 C1  22 C2   1 C3   6 C4
         1 D1   5 D2   1 D3   2 E1   2 I1   2 L1   1 L2   2 L3
         3 M1   1 N1   1 O1  12 O2   3 O3   1 O5   1 O7   1 Q1
         7 S1   5 S2   7 S3   5 S4  11 S5   9 S6  11 S7   4 S8
         3 S9   3 T1   1 T2   2 W1   4 W2   1 W3   2 W4
   b. files with empty transcriptions in the LBO label field
      (effectively missing files)
      There are 144 files that have an empty transcription
      (only noise symbols and/or **).
      If we add these files to the missing files given above
      then we get the following distributions :
        13 A1   5 A2   9 A3   9 B1   3 C1  26 C2   3 C3   6 C4
         1 D1   5 D2   2 D3   2 E1  65 I1   5 L1   1 L2   3 L3
         7 M1  17 N1   1 O1  19 O2   7 O3   2 O5   3 O7   1 Q1
         2 Q2  12 S1   7 S2  11 S3   6 S4  11 S5   9 S6  12 S7
         5 S8   3 S9   7 T1   3 T2   3 W1   6 W2   3 W3   4 W4
   c. corrupted speech files
      If we regard utterances which have only truncated or mispronounced
      words as corrupted files, and merge these with the effectively
      missing files then the following distribution emerges :
        14 A1   6 A2  11 A3   9 B1   3 C1  26 C2   3 C3   6 C4
         1 D1   5 D2   2 D3   3 E1  66 I1   5 L1   1 L2   3 L3
         7 M1  17 N1  12 O1  29 O2  16 O3   8 O5   6 O7  10 Q1
        10 Q2  12 S1   7 S2  15 S3   7 S4  12 S5   9 S6  13 S7
         5 S8   3 S9   7 T1   3 T2   8 W1  21 W2  14 W3  14 W4
      (This will not be used to reject or approve a database but it will
      be supplied as supplementary information.)
   d. files containing truncation and mispronunciation marks
      (*,**,~ are counted in the transcriptions of the individual items
      to get an idea of distorted speech data. This will not be used to
      reject or approve a database but it will be supplied as
      supplementary information.)
      We found 2463 transcriptions with at least one *, or **, or ~,
      according to the following distribution:
        A1:   9  A2:   4  A3:   4  B1:  34  C1:  84  C2:  21  C3:  39  C4:  19
        D1:  21  D2:  48  D3:  35  E1:  15  I1:  13  L1: 114  L2:  85  L3:  88
        M1:  72  N1:  65  O1:  23  O2:  71  O3:  20  O5:  19  O7:  55  Q1:  24
        Q2:  38  S1: 103  S2: 117  S3: 167  S4: 129  S5: 145  S6: 131  S7: 156
        S8: 150  S9: 137  T1:  40  T2:  86  W1:  22  W2:  21  W3:  20  W4:  19

3. Overall conclusion
   SpeechDat has the following criteria for missing items:
   . At least 95% of the files of each mandatory item (corpus code)
     must be present.
   . As missing files are counted: absent files, and files containing
     non-speech events only.
   . There will be no further comparison of prompt and transcription text
     in order to decide if a file is effectively missing.
     As a consequence: If there is some speech in the transcription,
     then the file will NOT be considered missing, even if it is in fact
     useless.
   For the decision of completeness of an item the distribution given in
   2b above can be used, if 12 is added to the given numbers because of
   the 12 absent calls. By applying the 95% criterion to 1000 calls,
   50 occurrences of each item may be effectively missing at the most.
=> Thus it is found that (only) item set I1 is not represented well enough
=> (viz. 92.3%: 65 + 12 = 77 of 1000 occurrences are effectively missing)

===========================================================================
4. SAMPLED DATA FILES

1 Coding
  . A-law, 8 bit, 8 kHz, no compression
  OK

2 Sample distribution
  Several sample statistics are generated: File length, clipping rate,
  mean sample value, Signal-to-Noise Ratio (SNR).
  Statistics were generated on file level by the producer of the database,
  using SPEX software. The results were delivered to SPEX. SPEX compiled
  histograms on the basis of these results. These histograms are presented
  below, both on file level and on directory (call) level.
  The histograms are presented as they are and not further interpreted
  by SPEX.
On the basis of these data the user of the database should be able to decide which acoustic quality is still acceptable for the application at hand. Statistics on the acoustics of individual speech files can be retrieved from file \DOC\SAMPSTAT.TXT. => Since SAMPSTAT.TXT was not delivered it was not possible to make => histograms of acoustical characteristics of the speech files. =========================================================================== 5. ANNOTATION FILE - Each line must be delimited by OK - Mandatory (SAM) mnemonics: LHD: SAM, 5.10 DBN: SPEECHDAT__Fixed_Network VOL: FIXED1_ SES: DIR: SRC: CCD: CRP: < = corpus repetition, empty> REP: RED: RET: SAM: 8000 < = sampling freq.> BEG: END: SNB: 1 < = number of bytes per sample> SBF: < = sample byte order, meaningless with single bytes> SSB: 8 < = number of significant bits per sample> QNT: A-LAW < = quantisation> SCD: SEX: M/F/UNKNOWN AGE: ! mnemo is not SAM ACC: ! mnemo is not SAM REG: ENV: LBD: LBR: , , [gain], [minimum value], [maximum value], LBO: , [centre sample], , EXT: 80 chars on one line> ELF: - Optional (SAM) mnemonics (may be omitted or left empty) TYP: orthographic TXF: CMT: NCH: 1 < = number of channels recorded> ARC: ! mnemo is not SAM SHT: ! mnemo is not SAM CMP: EXP: SYS: DAT: SPA: PHM: ! mnemo is not SAM NET: PSTN < = network> ! mnemo is not SAM DSC: < = discontinuity marker> EDU: ! mnemo is not SAM SOC: ! mnemo is not SAM HLT: TRD: RCC: ASS: ! mnemo is not SAM - Order restrictions: . LHD and TYP are first . LBR and LBO come after LBD . ELF is end of file keyword - All mnemonics should be SAM mnemonics or explicitly defined in documentation OK - No illegal mnemonics used OK - There are no mnemonics missing => The following obligatory mnemonics are structurally missing: SBF, CRP. => Since they have a null value there is no loss of information. The speaker code is empty. - All files must contain the same mnemonics. This holds as well for the optional mnemonics. 
  OK, optional mnemonics used are: SHT, NET, PHM.
- No illegal field values should appear
=> A large number (14145) of RET values missed seconds
=> RED has only one digit for day values below 10, should be two digits
=> 17 calls are made from MOBILE phones. Sessions concerned are :
   1022 1044 1128 1130 1137 1186 1274 1280 1326
   1408 1506 1519 1661 1782 1835 2026 2179
=> A few sessions have two different values for one mnemonic in different
=> files :
   File 1        File 2        Mnem  Value 1      Value 2
   ---------------------------------------------------------
   A11017B1.DEO  A11017A3.DEO  AGE   25           24
   A11494C1.DEO  A11494B1.DEO  RED   21/Mar/1997  20/Mar/1997
   A11494C2.DEO  A11494C1.DEO  RED   20/Mar/1997  21/Mar/1997
   A11895L3.DEO  A11895L2.DEO  RED   1/May/1997   01/May/1997
   A11944Y4.DEO  A11944Y3.DEO  ACC   BE           BY
   A12181S8.DEO  A12181S7.DEO  RED   6/Jun/1997   2/Jun/1997
   A12181S9.DEO  A12181S8.DEO  RED   2/Jun/1997   6/Jun/1997
- No line may exceed 80 chars
  OK
- Each lowest subdirectory does not refer to multiple sheet ids.
  OK
- For spontaneous speech LBR should contain a mnemonic word.
  D1 :
  L1 :
  O1 :
  O2 :
  Q1 : or
  Q2 : or
  OK
- Transliteration is case-sensitive unless specified otherwise.
  ( In general lower case is used, also at the beginning of a sentence.
    Only exception: proper names and spelled words, ZIP codes,
    acronyms and abbreviations.
    In the latter case blanks should be used in between the letters.
    German is the only exception to this convention. )
  OK, initial capitals are used for nouns and spelled letters,
  see section 2.4 of the design documentation file
- Punctuation marks should not be used in the transliterations
  OK, but the use of apostrophe in:
  A12038Q1.DEO: [spk] was 'ne bescheuerte Frage [spk]
  is questionable.
- Digits must appear in full orthographic form
  OK
- In principle only the following symbols are allowed to indicate
  non-speech acoustic events: [fil] [spk] [sta] [int]
  Other symbols (and language equivalents) must be mentioned
  in the documentation
  OK, only [fil] [spk] [sta] [int] were used.
- Asterisks should be used to indicate mispronunciations OK - Double asterisks should be used for not understandable parts OK - Tildes should be used to indicate truncations OK - Assessment of speech items in terms of SNR, presence of additional noise, adherence to prompting text is provided (optional) Not provided ======================================================================== 6. LEXICON - Check lexicon existence (\TABLE\LEXICON.TBL) OK - The entries should be alphabetically ordered OK - Used SAMPA symbols are provided in \DOC\SAMPALEX.PS => The location of the SAMPA symbols is \SOURCE - In transcriptions only SAMPA symbols are allowed OK, SAMPA symbols were used. Two extra symbols were used: a~ and O~ to indicate nasalisation See also section 5 of the DESIGN document. => Two symbols were used which are not SAMPA: => q : occurs once; error in entry: /Q/ q => dS : not official SAMPA, should be dZ; only used in foreign words - All SAMPA phoneme symbols should be covered. OK, but => the phoneme r was used instead of R. R is the general phoneme; r is => reserved for the apico-alveolar variant. - Phoneme symbols must be separated by blanks OK - A line in the lexicon should have the following format [ ] [] [TAB] is ASCII 9. OK, frequency information is not included - Each line is delimited by => No, lines in LEXICON.TBL are not delimited by => There is no header line in the lexicon, describing the structure => of each record. - Alternative transcriptions are optional. They may follow the first transcription, separated by [TAB] or have a separate entry (only in case also frequency information is supplied) OK, alternative transcriptions are not provided - Orthographic entries are as a rule split by spaces only, not by apostrophes, and not by hyphens. OK - Words with * or ~ should not appear in the lexicon OK - The lexicon should be complete . Check for undercompleteness (are all words in lexicon) OK, the lexicon is fairly complete. 
=> Only one word was not found in the lexicon: Stop . Check for overcompleteness (Undercompleteness is worse than overcompleteness. Overcompleteness cannot be a reason for rejection) 37 words were found that appear in the lexicon, but not in the transcriptions. - Lexicon contents should be taken from actual utterances (from LBO), so the entries should exactly match the transcriptions. OK - Optional information: stress, word/morphological/syllabic boundaries. But, if provided, then it should follow the SpeechDat conventions. OK, not provided ========================================================================== 7. SPEAKERS - Check existence speaker database file (SPEAKER.TBL) or, alternatively, the session table file (SESSION.TBL) OK, SPEAKER.TBL is present. However, => SPEAKER.TBL should be SESSION.TBL because of missing speaker codes. => The speaker table gives data for 990 speakers whereas only 988 are present. - Obligatory information in SPEAKER.TBL and SESSION.TBL: 1. unique number (speaker/caller) SCD (SPEAKER.TBL only) session number SES (SESSION.TBL only) 2. sex SEX 3. age AGE 4. accent ACC OK, apart from the missing speaker code. - Optional information: . height HET . weight WET . native language NLN . ethnic group ETH . education level EDL . smoking habits SMK . pathologies PTH . socio-economic status SOC . health HLT . tiredness TRD OK, not used - Each line is delimited by => No, lines in SPEAKER.TBL are not delimited by - Each field is separated by [TAB] (ASCII 9) OK - Balance of sexes . How many males, how many females, should match specification in documentation file . Misbalance may not exceed 5% (Each sex must be represented between 45-55%) OK, the following balance was found (based on the label files in the database): F: 486 M: 501 UNKNOWN: 1 - Balance of dialect regions . which dialect regions and how many of each should match specification in documentation file . 
each region should be represented by at least 0.5% of the speakers The following distribution of accents was found (based on the label files in the database): A: 8 BB: 4 BE: 34 BW: 91 BY: 404 CH: 1 HB: 2 HE: 74 HH: 12 MV: 1 NI: 85 NW: 120 OTHER: 21 RP: 39 SH: 15 SL: 3 SN: 35 ST: 12 TH: 17 UNKNOWN: 10 => A few accents appear less than 5 times: BB, HB, MV, SL. => But the criteria apply for the full database only, not for this subset. Compared to section 4.1 of the documentation file, there are a few differences: Berlin has one speaker more than 33; Bayern has one speaker less than 405; (Also the table in section 4.1 adds up to 988 speakers ...) - Balance of ages . which age groups and how many of each should match specification in documentation file . Criteria < 16 : >= 1% strongly recommended 16-30 : >= 20% mandatory 31-45 : >= 20% mandatory 46-60 : >= 15% mandatory (The age criteria are meant for the whole database; they are not to be applied for male and female speakers separately) The following age distribution was found (based on the label files in the database): 00-15 : 63 16-30 : 353 31-45 : 328 46-60 : 207 61-99 : 37 This matches the criteria very well. => The table in section 4.2 with speaker ages is completely wrong. ======================================================================= 8. RECORDING CONDITIONS - Check existence (optional) recording conditions table (\TABLE\REC_COND.TBL) or session table (\TABLE\SESSION.TBL) Not provided - Information in REC_COND.TBL and SESSION.TBL (if supplied): Minimum set . recording conditions code RCC (REC_COND.TBL only) . region of call REG . environment ENV - At least 2% of the calls must be from a public place (check ENV) OK, BOOTH: 236 HOME: 505 OFFICE: 231 UNKNOWN: 16 => 17 calls are made from MOBILE phones. ============================================================================= 9. 
TRANSCRIPTION This validation was carried out by taking 5% of the mandatory short items and 5% of the mandatory long items in a corpus of 1000 speakers. This amounts to 1150 short items and 1000 long items. The transcriptions in the label files for these samples were checked by listening to the corresponding speech files and correcting the transcription if necessary. In case of doubt nothing was corrected. This check was performed by native speakers of the language involved. Short items are: - isolated digit - time phrases - date phrases - yes/no questions - names - application words - phonetically rich words Long items are: - isolated digit string - connected digits - natural numbers - money amounts - spelled words - application phrases - phonetically rich sentences - The evaluation comprised the following guidelines: . Two types of errors were distinguished: speech and non-speech transcription errors . Non-speech refers to [fil] [spk] [sta] [int] only . For non-speech all symbols were mapped to one during validation. i.e. If a non-speech symbol was at the proper location then it was validated as correct (regardless if it was the correct non-speech symbol or not). . Only noise deletions in the transcription were counted as wrong, not noise insertions . the given transcription is given the benefit of the doubt; only obvious errors are corrected. . Errors were only determined on item level, not on word level . For speech a maximum of 5% of the validated items (=files) may contain a transcription error . For non-speech a maximum of 20% of the validated items (=files) may contain a transcription error. A selection of 1005 long items and 1149 short items was used for the transcription validation. RESULTS 1. Long items Transcription errors with respect to speech were found in 63 items. This amounts to 6.3%, which is somewhat above the criterion of 5%. Errors in the transcription of non-speech were found in 59 items. 
This amounts to 5.9% of the items, which is below the criterion of 20%. 2. Short items Errors with respect to the transcription of speech were found in 24 items. This amounts to 2.1%, which is well below the criterion of 5%. Errors in the transcription of non-speech were found in 65 items. This amounts to 5.7% of the items, which is below the criterion of 20%. 3. Overall result When we take the long and short item sets together then we find errors with respect to the transcription of speech in 87 items. This amounts to 4.0%, which is below the 5% criterion. Errors in the transcription of non-speech were found in 124 items. This amounts to 5.8% which is below the 20% criterion. => The transcriptions of the long items as such appear to contain somewhat => more errors in the transcription of speech than allowed. ========================================================================== 10. SUMMARY Below we give a brief overview of our findings with respect to the German FDB database (first 1000 speakers). The subsections follow the order of the various topics in the previous sections of the report. A main shortcoming of this database is that it consists of only 988 calls instead of 1,000. 1. Documentation The main documentation file DESIGNDE.PS is in postscript and not in MS-WORD. Therefore it cannot be printed everywhere. The document contains all required information. A few errors were found. Section 3.4 contains an error. The database contains 1 spontaneous date (D1), and two prompted ones (D2 and D3). Section 3.12 misses a description of the spontaneous time item. 2. Data base structure and file names The database had the correct structure for speech files and label files. 
The following files were not in MS-DOS format: DISK.ID, README.TXT, COPYRIGH.TXT, CONTENTS.LST, SPEAKER.TBL, LEXICON.TBL SAMPSTAT.TXT was missing (see also section 4 below); SUMMARY.TXT is erroneously in FIXED1DE\TABLE; SAMPALEX.PS is erroneously in FIXED1DE\SOURCE as SAMPA_DE.PS; ISO88591.PS is erroneously in FIXED1DE\SOURCE. Files of the X-item are probably erroneously included. 23 empty (label) files were detected For A11433S9.DEO there was no matching speech file. For A11164W3.DEA, A11487W2.DEA, A11711W3.DEA, A11877W3.DEA there were no matching label files. Some formatting errors were found in CONTENTS.LST and SUMMARY.TXT. Furthermore, these files were found to be CD-dependent. 3. Items The database contains all obligatory items according to the Speechdat specifications. All item sets are complete up to 95% as specified by our criteria. Only item set I1 is less complete (92.3%). 4. Sampled data files Since SAMPSTAT.TXT was not delivered it was not possible to make histograms of acoustical characteristics of the speech files. 5. Label files In general the label files are OK. They contain the required information. The following obligatory mnemonics are structurally missing: SBF, CRP. The formats for recording date and recording time are not correct. A few mnemonics have different values in one call. 6. Lexicon The lexicon table is fine. Just one word was found missing: Stop. Two symbols were used which are not SAMPA: q : occurs once; error in entry: /Q/ q dS : not official SAMPA, should be dZ; only used in foreign words The phoneme r was used instead of R. R is the general phoneme; r is reserved for the apico-alveolar variant. 7. Speakers A speaker table was delivered whereas a session table would be appropriate, since there are no speaker codes. The speaker table gives data for 990 speakers whereas only 988 are present. 8. Recording conditions The recording conditions are OK. We note that 17 calls are made from MOBILE phones. 9. 
Transcription A selection of 1005 long items and 1149 short items was used for the transcription validation. When we take the long and short item sets together then we find errors with respect to the transcription of speech in 87 items. This amounts to 4.0%, which is below the 5% criterion. Errors in the transcription of non-speech were found in 124 items. This amounts to 5.8% which is below the 20% criterion. The transcriptions of the long items as such appear to contain somewhat more errors in the transcription of speech than allowed. =========================================================================
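
The percentages reported for the transcription validation can be
recomputed from the item counts given in section 9. The Python sketch
below is only an illustration and is not part of the SPEX validation
software; the names SAMPLES, SPEECH_ERRORS, NONSPEECH_ERRORS, error_rate
and summarise are hypothetical. It assumes that an error rate is simply
the number of erroneous items divided by the number of validated items,
and that a set fails when its rate exceeds the 5% (speech) or 20%
(non-speech) criterion.

```python
# Item and error counts copied from section 9 of this report.
SAMPLES = {"long": 1005, "short": 1149}
SPEECH_ERRORS = {"long": 63, "short": 24}
NONSPEECH_ERRORS = {"long": 59, "short": 65}

# Maximum percentage of validated items that may contain an error.
SPEECH_LIMIT = 5.0
NONSPEECH_LIMIT = 20.0


def error_rate(errors: int, items: int) -> float:
    """Return the percentage of validated items that contain an error."""
    return 100.0 * errors / items


def summarise() -> dict:
    """Return {set name: (speech %, non-speech %)}, including the pooled set."""
    rows = {}
    for name, items in SAMPLES.items():
        rows[name] = (
            error_rate(SPEECH_ERRORS[name], items),
            error_rate(NONSPEECH_ERRORS[name], items),
        )
    total = sum(SAMPLES.values())
    rows["overall"] = (
        error_rate(sum(SPEECH_ERRORS.values()), total),
        error_rate(sum(NONSPEECH_ERRORS.values()), total),
    )
    return rows


if __name__ == "__main__":
    for name, (speech, nonspeech) in summarise().items():
        # Flag a set with "=>" when either criterion is exceeded.
        failed = speech > SPEECH_LIMIT or nonspeech > NONSPEECH_LIMIT
        flag = "=>" if failed else "OK"
        print(f"{flag} {name:8s} speech {speech:.1f}%  non-speech {nonspeech:.1f}%")
```

Pooling the long and short sets before dividing, rather than averaging
the two percentages, matches the way the overall figures in this report
are stated (87 and 124 erroneous items out of 2154).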