SUBJECT: Validation Spanish SpeechDat(M) corpus AUTHORS: Henk van den Heuvel, Eric Sanders VERSION: 2.0 DATE : 15 Aug. 1996 The speech databases made within the SpeechDat(M) project were validated by SPEX, Leidschendam, the Netherlands, to assess their compliance with the SpeechDat(M) format and content specifications, as documented in Deliverable 1.4.1 of the project. The validation results of the Spanish SpeechDat(M) database are contained in this document. In the validation procedure we systematically check a list of validation criteria for a range of subjects. In the following sections we will evaluate these criteria one by one for the Spanish database. Validation results that call for attention are marked by =>. The following subjects were validated: 1 DOCUMENTATION 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES 3 ITEMS 4 SAMPLED DATA FILES 5 ANNOTATION FILES 6 LEXICON 7 SPEAKERS 8 RECORDING PLATFORM 9 TRANSCRIPTION The document is concluded by 10 SUMMARY ==================================================================== 1. DOCUMENTATION Two documentation files were provided: one with the general SpeechDat specifications (DESIGN.DOC) and one with information specific for Spanish (SPANISH.DOC). In the following we refer to the latter document unless stated otherwise. - Language of doc file: preferably English OK - Contact person: name, address, affiliation OK, first page. - Number of CDs / Tapes OK, in README.TXT - Contents of each CD / tape OK, in README.TXT - The directory structure of the CDs / tapes OK, in section 1.1.3. - List of missing items OK, in 1.1.3. - Speaker demographics . which regions, how many of each OK, in 1.5.1 . motivation for selection of regions OK, in 1.5.1. . which age groups, how many of each OK, in 1.5.2 . sexes: males, females, also children?; how many of each. OK, in 1.5.2 The reason for including 1002 speakers in the database instead of the required 1000 is explained in 1.4. 
- Reference to a file where speaker characteristics are stored
  (SPEAKER.TBL)
  OK, in 1.1.3.
- The number of items on the CD and per speaker
  OK, implicitly in 1.1.3
- Naming conventions for directories and files
  OK, in 1.1.2 and 1.1.3.
- Prompting
  . linguistic specification (and motivation) for the prompting material
    OK, in 1.3.
  . connection of sheet items to item numbers on CD / tape
    OK, in 1.1.2
  . sheet example
    OK, provided as 1.6.1 and 1.6.2
  . items must be spread over the sheet to prevent list effects
    (e.g. three yes/no questions immediately after one another are
    not allowed)
    OK, pointed out in 1.6.4.
- Analysis of frequency of occurrence of the sub-word units represented
  in the phonetically rich sentences (either of phones, biphones,
  triphones)
  OK, these are provided in separate files:
  \FIXED0ES\DOC\PHONEMES.TXT, \FIXED0ES\DOC\BIPHONES.TXT,
  \FIXED0ES\DOC\TRIPHONES.TXT.
  These files are referred to in section 1.2.3.
  . recommended: at least 2 samples of each phone per caller
    (should appear from documentation)
    The aim was to include at least one token of every phoneme for
    each caller, see 1.3.12.
- Recording platform should be specified
  . digital telephone net link
    OK, in 1.1 and 1.2.1.
- Statement that all signal transmission between CO and recording site
  is digital
  OK, in 1.2.1.
- Signal characteristics (number of bits per sample; bandwidth; coding
  type; compression procedures)
  OK, in 1.1.1.
- The format and the file header structure of speech files
  OK, in 1.1.1
- The format and the file header structure of annotation files
  OK, in 5.5 of DESIGN.DOC
- Annotation
  . procedure
    OK, in 1.2.4
  . quality assurance
    OK, in 1.2.4
  . character set used for annotation (transcription)
    OK, ISO-Latin-I, stated in 1.4.
  . annotation symbols for non-speech acoustic events must be
    mentioned at least for
    [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
    OK, in last paragraph of 1.2.4.
  . list of symbols used to denote word interruptions and break-offs
    OK, in section 6 of DESIGN.DOC.
- Lexicon information . Procedures to obtain phonemic forms from orthographic input (lexicon generation and lay out) => There is no information on this topic in the documentation. . Overview of SAMPA symbols used (only in this manner it can be checked if the lexicon contains only legal symbols). OK, a list of deviating symbols is provided in 1.4. - Transcription manual: TRANSCRIP.DOC (optional) . is it there? . does it contain the relevant information? . What is done with non speech events . What is done with capitals . Only one spelling of each word is allowed There is no transcription manual. - Only one spelling of each word is allowed. Therefore a list of normalised spellings for words with alternative spellings should be included. Otherwise a statement why such a list is not necessary. The lexicon contains the list of normalised spellings, as stated in 1.4. - Indication of how many of the files were double checked by the producer together with percentage of detected errors OK, the information is provided in 1.4: There has not been any cross-checking of transcriptions, since this requirement was proposed in SpeechDat after the work was substantially complete. ========================================================================== 2. DATABASE STRUCTURE CONTENTS AND FILE NAMES - Directory / subdirectory conventions Format of directory tree should be \\\\ . data base: defined as <#> can be FIXED, MOBIL, VERIF <#> is 0 for Speechdat(M) and 1 for SpeechDat is the ISO two-letter code for the language . volume : is a progressive number specifying the CD containing the material. Defined as CD where is the number. . block : defined as BLOCK where is a progressive number from 00 to 99. Block numbers are unique over all CDs. They could typically be the first two digits of below. . session: defined as SES where is the session code also appearing in file name OK - A README.TXT file should be in the root describing all (documentation) files on the CD-ROM. 
OK - A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. This file supplies the volume label to UNIX systems that cannot read the physical volume label. Example of contents: FIXED0EN_00. OK - Documentation should be in \\DOC OK - The summary file (SUMMARY.TXT) should be in \\DOC OK - The contents list (CONTENTS.LST) is in \\INDEX and is obligatory. OK - Tables should be in \\TABLE OK - Index files (optional) should be in \\LST Not provided - Prompt sheet files (optional) should be in \\PROMPT OK - Any source code supplied should be in \\SOURCE (SAMLIB, V4, and GNU gunzip, version 1.2.4 + licence) OK - The index files (if presented) obey the nomenclature .LST where e.g. A0ENN3.LST (see below for item_code) Not provided - File naming conventions All file names should obey the following pattern: DDNNNNCC.LLF DD : database identification code For SpeechDat(M): A0 = fixed net, B0 = mobile For SpeechDat : A1 = fixed net, B1 = mobile, C1 = speaker verification NNNN : session code 0000 to 9999 CC : item code; first character is item type identifier, second character is item number LL : ISO-639 language code (with extensions) F : speech file type Z is for A-law, compressed O is for Orthographic label (label file) OK - Correct item codes should be used: I1 : isolated digit C1 : 4 digit id of prompt sheet C2 : ~10 digit telephone number C3 : ~12 digit credit card number N1-3 : 3 natural numbers M1-2 : 2 money amounts L1-3 : 3 spelled words T1 : 1 time of day T2 : 1 time phrase D1-3 : 3 dates Q1-3 : 3 yes/no questions P1 : city of call/birth A1-6 : 6 common application words E1-3 : 3 application word phrases S1-9 : 9 phonetically rich sentences OK - NNNN in filenames is not in conflict with BLOCK and SES numbers in pathname OK - Contents lowest level subdirectories should be of one call only OK - Empty (i.e. 
zero-length) files are not permitted OK - Counts should match information in documentation . count of files in each subdirectory . count grand total OK - Missing items per speaker Check with documentation In 1.1.3 of DESIGN.DOC three missing files are listed. This was confirmed by our analyses. There were no other files missing. The only obligatory file missing therefore is SES0066 which has no Q3. - File match: For each label file there must be one speech file and vice versa. OK - Part of the corpus should be designed for training and a (typically smaller) part for testing. This is optional. Not provided - The contents of the database as given in CONTENTS.LST should comprise . CD-ROM volume name (VOL:) . full pathname (DIR:) . speech file name (SRC:) . speaker code (SCD:) . speaker sex (SEX:) . speaker age (AGE:) . region of call (REG:) . orthographic transcription of uttered item (LBO:) This file must be supplied as an ASCII delimited file (either using TAB, or commas and (double) quoted strings). OK, all information is provided. There is one additional file containing the information in ASCII format; fields are separated by [TAB]; session number is used as speaker code. - The contents of the SUMMARY.TXT files should comprise: . The full directory name where speech and label files are to be found . the session number . a string of typically 39 codes. Each item present is represented by its code. If the item is missing, a '--' should appear. . recording date . recording time of first item . optional comment text . all fields are separated by spaces OK ====================================================================== 3. ITEMS - 1 isolated digit (code I1) . read OK - 3 connected digits (code C1-3) - 4 digit number to identify the prompt sheet . read OK - ~10 digit telephone number . read or spontaneous(?) OK - ~12 digit credit card number (16 digits would be better) . read OK . if there is a checksum then formula must be provided OK . 
26 digits per call are required
    OK
  . at least one example per digit per caller
    => Not every digit is spoken by each speaker. We found 675 digits
       missing in 559 calls. An explanation for this is given in
       section 1.3.2 of SPANISH.DOC.
  . digits must appear numerically on the sheet, not as words
    OK
- 3 natural numbers (code N1-3)
  . read
  . provided as numbers
  . numbers must be < 1,000,000
  . one may be a decimal number
  . one may be a quantity (including a unit of measurement)
  . sufficient examples of each word to permit training
    OK
- 2 money amounts (code M1-2)
  . read
  . currency words should be included
  . one small amount including decimals and one large amount not
    including decimals
    OK (decimals are of no use for pesetas).
- 3 spelled words (code L1-3)
  . read
  . equal balance of all vocabulary letters
  . average length at least 7 letters
  . may include names, cities and other frequently spelled items
  . should include equivalents of: A-Z, accent words, CAPITAL, SMALL,
    UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE, HYPHEN
    OK
- 1 time of day (code T1)
  . spontaneous
    OK
- 1 time phrase (code T2)
  . read
  . analogue form
  . equal balance of all words
  . should include equivalents of: AM/PM, HALF/QUARTER PAST/TO, NOON,
    MIDNIGHT, MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY,
    TOMORROW
    OK, each word occurs at least 20 times.
- 1 date (code D1)
  . spontaneous
    OK
- 2 dates (code D2-3)
  . read, wordstyle
  . analogue form
  . covering all weekdays and months
    OK
    Each year between 1920 and 2020 occurs at least 13 times, mostly
    about 20 times. The other key words appear 50 times or more, a few
    somewhat less ('dos': 33 times; 'tres': 34 times).
- 3 yes/no questions (code Q1-3)
  . spontaneous, not prompted
  . balance between yes/no (no bias due to question, or sex - cf
    Portuguese)
    OK
- city of call/birth (code P1)
  . preferably spontaneous; read is permitted
    OK, the province is asked, not a city. The reason for this is
    explained in section 1.3.11 of SPANISH.DOC.
- 6 common application words (code A1-6) . set of 50 should be defined . 39 are fixed for all partners, see Appendix A Del 1.4-1 . read OK - 3 application word phrases (code E1-3) . application word is embedded in phrase . read or spontaneous OK - 9 phonetically rich sentences (code S1-9) . read . recommended: at least 2 samples of each phone per caller (should appear from documentation) OK, each phoneme was pronounced once by each caller - Additional optional items There is an optional telephone number, supplied with corpus code C4. There are two extra application words, supplied with corpus codes A7 and A8. All items are present. There are no items structurally missing. 2. Application words In appendix A of SpeechDat deliverable 1.4-1 a list of 39 obligatory application words is provided. According to the documentation (DESIGN.DOC) all mandatory application words were included in the Spanish data base. A count on the occurrences of these words in the database showed that there are at least 100 tokens of each application word. 3. Incidentally missing items a. files that are not there In 1.1.3 of DESIGN.DOC three missing files are listed. This was confirmed by our analyses. There were no other files missing. The only obligatory file missing therefore is SES0066 which has no Q3. b. files with empty transcriptions in the LBO label field We found 92 files which did not contain speech according to the LBO: field in the label file. 5 files belonged to the additional items C4,A7,A8, which leaves 87 mandatory items with an empty LBO field. Below we list how often how many files of a call were not containing speech according to the LBO-field. Freq. Nr of items missing in a call 59 1 6 2 4 3 1 4 which adds up to 87 mandatory items missing in total. As a result, 69 calls miss up to three mandatory items according to this count, and 1 call misses more items. On the other hand, we found some files that had empty LBO-fields but contained the prompted speech! 
(This could not be checked systematically.) This was the case for
A00075C3, A00063C3 and A00094C3. (This list is not exhaustive.) This
means that the actual number of missing items is lower than the 87
mentioned above.

4. Overall conclusion

SpeechDat has the following criteria for missing items:
- 85% (850) out of 1000 calls must be complete
  . A maximum of 10% (100) of the calls may miss up to 3 mandatory items
  . A maximum of 5% (50) of the calls may miss more items
  (A complete call is one with all speech files recorded for all
  prompt items)

Since there is only one mandatory item actually missing, the criterion
is met without problems. If we take into account files with empty
LBO-fields, then we find that 6.9% of the calls miss up to 3 mandatory
items, and 0.1% of the calls miss more than three items. This means
that the criterion is fulfilled easily.

There may also be other files that are effectively missing (corrupted
speech files). These are dealt with in the next section.

===========================================================================
4. SAMPLED DATA FILES

1 File structure
  . NIST (header contains file info)
  . SAM
    OK, with SAM label files.

2 Coding
  . A-law, 8 bit, 8 kHz
  . Compression by GZIP
    OK

3 Sample distribution
  Several sample distributions are checked:

3.1 File length

We calculated the length of each file in seconds in order to trace
spurious recordings, which would show up as extraordinarily long or
short files. We regard file lengths shorter than 1 s or longer than
15 s as extreme.

Duration distribution over all items:

  Length (s)   #Occurrences
   2 -  3 :    5094
   3 -  4 :    6182
   4 -  5 :    3512
   5 -  6 :    6386
   6 -  7 :    7806
   7 -  8 :    8457
   8 -  9 :    1533
   9 - 10 :     671
  10 - 11 :    1721
  11 - 12 :     195
  12 - 13 :     157
  13 - 14 :     367

Duration distribution per call:

   3 -  4 :       2
   4 -  5 :      39
   5 -  6 :     659
   6 -  7 :     230
   7 -  8 :      51
   8 -  9 :      21

There were no files with extreme lengths indicating spurious events
during recording.
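
The file length check of 3.1 can be illustrated with a minimal sketch.
It assumes that the number of decoded samples per file is already known
(for 8-bit A-law this equals the number of data bytes after
decompression and header removal). The function names and the sample
counts in the example are hypothetical and are not taken from the
validation software.

```python
from collections import Counter

SAMPLE_RATE_HZ = 8000  # 8 kHz A-law, one sample per byte (section 4, item 2)
MIN_SECONDS = 1.0      # thresholds for "extreme" lengths quoted in 3.1
MAX_SECONDS = 15.0


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Length of a recording in seconds, given its decoded sample count."""
    return num_samples / sample_rate


def duration_histogram(durations):
    """Count recordings per 1-second bin, keyed by the lower bin edge."""
    return Counter(int(d) for d in durations)


def extreme_files(durations_by_name):
    """Names of recordings shorter than 1 s or longer than 15 s."""
    return sorted(
        name
        for name, seconds in durations_by_name.items()
        if seconds < MIN_SECONDS or seconds > MAX_SECONDS
    )


# Hypothetical sample counts, for illustration only.
sample_counts = {"X0000A1": 20000, "X0000S1": 58000, "X0000Q1": 4000}
durations = {name: duration_seconds(n) for name, n in sample_counts.items()}
histogram = duration_histogram(durations.values())
suspect = extreme_files(durations)
```

In this example the third file lasts 0.5 s and would be flagged for
inspection; the other two fall in the 2 - 3 s and 7 - 8 s bins.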
3.2 min-max samples

We provide a histogram with clipping ratios. The clipping ratio is
defined as the number of samples in a file that are equal to the
maximum or minimum value, divided by the total number of samples in
the file. The histogram, then, is an overview of how many files were
found in a set of clipping rate intervals. Files with a clipping rate
higher than 0.4% must be regarded as spurious.

Number of files with absolute maximum < 32256: 37459

Clip distribution for all files:

  Clipping      Occurrences
  rate (in %)
  0.0 - 0.1 :   3665
  0.1 - 0.2 :    426
  0.2 - 0.3 :    199
  0.3 - 0.4 :     97
  0.4 - 0.5 :     70
  0.5 - 0.6 :     33
  0.6 - 0.7 :     23
  0.7 - 0.8 :     18
  0.8 - 0.9 :     17
  0.9 - 1.0 :      6
  1.0 - 1.1 :     11
  1.1 - 1.2 :      7
  1.2 - 1.3 :     13
  1.3 - 1.4 :      7
  1.4 - 1.5 :      5
  1.6 - 1.7 :      6
  1.7 - 1.8 :      3
  1.8 - 1.9 :      3
  1.9 - 2.0 :      1
  2.0 - 2.1 :      2
  2.1 - 2.2 :      1
  2.2 - 2.3 :      2
  2.3 - 2.4 :      1
  2.6 - 2.7 :      1
  2.7 - 2.8 :      1
  2.9 - 3.0 :      1
  3.6 - 3.7 :      1
  5.2 - 5.3 :      1
  5.4 - 5.5 :      1

Clip distribution for obligatory files only:

Number of files with absolute maximum < 32256: 34717

  Clipping      Occurrences
  rate (in %)
  0.0 - 0.1 :   3463
  0.1 - 0.2 :    394
  0.2 - 0.3 :    189
  0.3 - 0.4 :     92
  0.4 - 0.5 :     69
  0.5 - 0.6 :     32
  0.6 - 0.7 :     20
  0.7 - 0.8 :     14
  0.8 - 0.9 :     15
  0.9 - 1.0 :      5
  1.0 - 1.1 :     11
  1.1 - 1.2 :      7
  1.2 - 1.3 :     13
  1.3 - 1.4 :      7
  1.4 - 1.5 :      5
  1.6 - 1.7 :      6
  1.7 - 1.8 :      3
  1.8 - 1.9 :      3
  1.9 - 2.0 :      1
  2.0 - 2.1 :      2
  2.1 - 2.2 :      1
  2.2 - 2.3 :      2
  2.6 - 2.7 :      1
  2.7 - 2.8 :      1
  2.9 - 3.0 :      1
  3.6 - 3.7 :      1
  5.2 - 5.3 :      1
  5.4 - 5.5 :      1

=> By listening to the files we concluded that files with a clip ratio
   over 0.2% are potentially bad, and that files with a clip ratio
   over 1.4% are, as a rule, highly distorted.
The files with a clip ratio over 1.4% are listed below:

  1.45 in file A00113Q1.ESZ
  1.44 in file A00555S2.ESZ
  1.43 in file A01332S8.ESZ
  1.50 in file A00555E3.ESZ
  1.50 in file A01332S2.ESZ
  1.65 in file A00094L3.ESZ
  1.60 in file A00286L2.ESZ
  1.66 in file A00286S8.ESZ
  1.69 in file A01332S6.ESZ
  1.77 in file A00042A5.ESZ
  1.74 in file A00555D3.ESZ
  1.70 in file A00555S1.ESZ
  1.70 in file A01332C3.ESZ
  1.74 in file A01332L1.ESZ
  1.86 in file A00286C2.ESZ
  1.89 in file A00286L3.ESZ
  1.94 in file A00555C3.ESZ
  1.90 in file A00555T1.ESZ
  2.60 in file A00113C1.ESZ
  2.20 in file A00113C3.ESZ
  2.06 in file A00113D1.ESZ
  2.97 in file A00113E3.ESZ
  2.11 in file A00113L3.ESZ
  2.30 in file A00113N1.ESZ
  2.02 in file A00555L2.ESZ
  2.74 in file A00555S4.ESZ
  3.65 in file A00113S2.ESZ
  5.40 in file A00113L1.ESZ
  5.25 in file A00113S1.ESZ

Number of directories with absolute maximum < 32256: 611

Clip distribution per call:

  Clipping      Occurrences
  rate (in %)
  0.0 - 0.1 :    366
  0.1 - 0.2 :     16
  0.2 - 0.3 :      3
  0.3 - 0.4 :      2
  0.5 - 0.6 :      1
  0.8 - 0.9 :      1
  0.9 - 1.0 :      1
  1.2 - 1.3 :      1

=> By listening to the files in the directories we concluded that the
   following directories are severely distorted due to clipping:
   SES0113 having a mean clipping ratio of 1.25%
   SES0286 having a mean clipping ratio of 0.53%
   SES1332 having a mean clipping ratio of 0.98%

3.3 Mean values

We computed the mean sample value of each item in each call. We
provide a histogram with mean values below. The histogram is an
overview of how many files were found in a set of mean sample value
intervals. This overview can be used to trace files with large
DC-offsets.
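
The clipping ratio of 3.2 and the mean sample value of 3.3 can be
sketched as follows. This is a minimal illustration under two
assumptions: the samples have already been decoded from A-law to
linear integer values, and a sample counts as clipped when its
magnitude reaches 32256, the level quoted in 3.2. The function names
are hypothetical.

```python
import math

CLIP_LEVEL = 32256  # assumed full-scale magnitude, as quoted in section 3.2


def clipping_ratio_percent(samples, clip_level=CLIP_LEVEL):
    """Percentage of samples that sit at the positive or negative limit."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    return 100.0 * clipped / len(samples)


def mean_sample_value(samples):
    """Mean sample value of a file; a large magnitude suggests a DC offset."""
    return sum(samples) / len(samples)


def bin_lower_edge(value, width):
    """Lower edge of the histogram bin of the given width that holds value."""
    return math.floor(value / width) * width
```

With a bin width of 10, a mean value of -325 falls in the bin
-330 - -320, matching the layout of the mean value histograms. A bin
width of 0.1 gives the intervals of the clipping histograms, although
floating point rounding at bin edges would need care in practice.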
Mean distribution over all items: Mean Occurrences -330 - -320 : 42 -320 - -310 : 1 -290 - -280 : 1 -240 - -230 : 1 -230 - -220 : 1 -220 - -210 : 1 -210 - -200 : 1 -200 - -190 : 1 -190 - -180 : 2 -160 - -150 : 6 -150 - -140 : 33 -140 - -130 : 34 -130 - -120 : 43 -120 - -110 : 84 -110 - -100 : 168 -100 - -90 : 445 -90 - -80 : 831 -80 - -70 : 1055 -70 - -60 : 822 -60 - -50 : 626 -50 - -40 : 345 -40 - -30 : 810 -30 - -20 : 1199 -20 - -10 : 3056 -10 - 0 : 10239 0 - 10 : 14680 10 - 20 : 2481 20 - 30 : 1412 30 - 40 : 970 40 - 50 : 609 50 - 60 : 835 60 - 70 : 326 70 - 80 : 217 80 - 90 : 95 90 - 100 : 85 100 - 110 : 170 110 - 120 : 7 120 - 130 : 169 130 - 140 : 43 140 - 150 : 41 150 - 160 : 1 170 - 180 : 43 180 - 190 : 1 210 - 220 : 1 240 - 250 : 1 260 - 270 : 4 270 - 280 : 38 360 - 370 : 1 410 - 420 : 1 590 - 600 : 1 600 - 610 : 1 840 - 850 : 1 Mean distribution over all obligatory items: Mean Occurrences -330 - -320 : 39 -320 - -310 : 1 -290 - -280 : 1 -240 - -230 : 1 -230 - -220 : 1 -220 - -210 : 1 -210 - -200 : 1 -200 - -190 : 1 -190 - -180 : 2 -160 - -150 : 5 -150 - -140 : 29 -140 - -130 : 29 -130 - -120 : 41 -120 - -110 : 81 -110 - -100 : 155 -100 - -90 : 392 -90 - -80 : 753 -80 - -70 : 988 -70 - -60 : 774 -60 - -50 : 600 -50 - -40 : 334 -40 - -30 : 766 -30 - -20 : 1143 -20 - -10 : 2856 -10 - 0 : 9424 0 - 10 : 13560 10 - 20 : 2332 20 - 30 : 1335 30 - 40 : 921 40 - 50 : 572 50 - 60 : 780 60 - 70 : 302 70 - 80 : 202 80 - 90 : 89 90 - 100 : 79 100 - 110 : 157 110 - 120 : 7 120 - 130 : 157 130 - 140 : 40 140 - 150 : 38 150 - 160 : 1 170 - 180 : 40 180 - 190 : 1 210 - 220 : 1 240 - 250 : 1 260 - 270 : 4 270 - 280 : 35 360 - 370 : 1 410 - 420 : 1 590 - 600 : 1 600 - 610 : 1 840 - 850 : 1 Mean distribution over all calls: Mean Occurrences -330 - -320 : 1 -140 - -130 : 1 -130 - -120 : 2 -120 - -110 : 1 -100 - -90 : 9 -90 - -80 : 22 -80 - -70 : 28 -70 - -60 : 26 -60 - -50 : 7 -50 - -40 : 3 -40 - -30 : 18 -30 - -20 : 27 -20 - -10 : 74 -10 - 0 : 242 0 - 10 : 365 10 - 20 : 73 20 
- 30 : 32
30 - 40 : 14
40 - 50 : 11
50 - 60 : 18
60 - 70 : 7
70 - 80 : 5
80 - 90 : 2
90 - 100 : 2
100 - 110 : 4
120 - 130 : 4
130 - 140 : 1
140 - 150 : 1
170 - 180 : 1
270 - 280 : 1

Files in directories with a mean value of more than 100 or less than
-90 were visually and aurally inspected. There were no marked
deviations in these files.

3.4 Signal to Noise Ratio

We split each signal file into contiguous windows of 10 ms and
computed the Mean Square (energy) in each window. The mean sample
value over the complete file was subtracted from each individual
sample value before MS was computed. The 5% of the windows that
contained the lowest energy were assumed to contain line noise. In
this way the signal to noise ratio could be calculated for each file
by dividing the mean energy over all windows by the mean energy of the
5% sample mentioned above. The base-10 logarithm of this ratio was
multiplied by 10 to express the SNR in dB.

SNR distribution over all items:

  SNR          Occurrences
   0 -   5 :      17
   5 -  10 :      90
  10 -  15 :     240
  15 -  20 :     668
  20 -  25 :    2192
  25 -  30 :    5524
  30 -  35 :   10228
  35 -  40 :   10473
  40 -  45 :    7845
  45 -  50 :    3470
  50 -  55 :     906
  55 -  60 :     235
  60 -  65 :     100
  65 -  70 :      40
  70 -  75 :      21
  75 -  80 :      17
  80 -  85 :       5
  85 -  90 :       6
  90 -  95 :       2
  95 - 100 :       2

SNR distribution over obligatory items:

  SNR          Occurrences
   0 -   5 :      17
   5 -  10 :      85
  10 -  15 :     218
  15 -  20 :     614
  20 -  25 :    2010
  25 -  30 :    5117
  30 -  35 :    9500
  35 -  40 :    9733
  40 -  45 :    7282
  45 -  50 :    3256
  50 -  55 :     840
  55 -  60 :     223
  60 -  65 :      93
  65 -  70 :      39
  70 -  75 :      19
  75 -  80 :      17
  80 -  85 :       5
  85 -  90 :       6
  90 -  95 :       1
  95 - 100 :       2

SNR distribution over calls:

  SNR          Occurrences
   5 -  10 :       1
  10 -  15 :       3
  15 -  20 :       6
  20 -  25 :      41
  25 -  30 :     120
  30 -  35 :     282
  35 -  40 :     268
  40 -  45 :     202
  45 -  50 :      57
  50 -  55 :      16
  55 -  60 :       3
  60 -  65 :       2
  65 -  70 :       1

We found one call with an overall SNR of 4 dB, being SES0646. The
speech in this call is very weak and contains a severe buzz.
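
The SNR estimate described in 3.4 can be sketched as follows. This is
an illustration under stated assumptions, not the software used for
the validation: the samples are taken to be linear values at 8 kHz,
only complete 10 ms windows are used, and at least one window is
always counted as noise. The function name and parameter names are
hypothetical.

```python
import math


def snr_db(samples, sample_rate=8000, window_ms=10, noise_fraction=0.05):
    """SNR in dB: mean window energy over the mean energy of the quietest 5%."""
    window = int(sample_rate * window_ms / 1000)  # 80 samples at 8 kHz
    mean = sum(samples) / len(samples)
    centred = [s - mean for s in samples]         # remove the file mean first

    # Mean square (energy) of each complete, contiguous window.
    energies = []
    for start in range(0, len(centred) - window + 1, window):
        chunk = centred[start:start + window]
        energies.append(sum(x * x for x in chunk) / window)
    if not energies:
        raise ValueError("signal is shorter than one window")

    # The lowest-energy windows are assumed to hold line noise only.
    energies.sort()
    n_noise = max(1, int(len(energies) * noise_fraction))
    noise_energy = sum(energies[:n_noise]) / n_noise
    signal_energy = sum(energies) / len(energies)

    if noise_energy == 0:
        return math.inf
    return 10 * math.log10(signal_energy / noise_energy)
```

Note that the numerator is the mean energy over all windows, including
the noise windows, as in the description above. A file that contains
no speech at all therefore still yields a value, which is why the
check on untranscribed files with a high SNR is informative.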
There are three calls with an overall SNR between 10 and 15 dB:

  Call:    Mean SNR
  SES0752  13.5      Weak recording with heavy 'spiky' noise
  SES0812  14.5      Weak recording with weak buzz; acceptable
  SES1045  14.5      Weak recording with weak buzz; acceptable

There are six calls with an overall SNR between 15 and 20 dB:

  Call:    Mean SNR
  SES1216  15.5      Weak recording; considerable noise but acceptable
  SES0039  19.0      Weak recording; considerable noise but acceptable
  SES1020  19.0      Weak recording; weak noise; acceptable
  SES0064  19.5      Weak recording; considerable noise but acceptable
  SES0831  19.5      Weak recording; considerable noise but acceptable
  SES1024  20.0      Weak recording; weak noise; acceptable

The ~/DOC/SUMMARY.TXT files mention quite a few other directories that
contain line noise. By inspection of a subset of these, we concluded
that the noise in these directories is present, but acceptable, except
for the ones indicated as corrupted above.

=> We further looked at files with high SNR values that contain no
   speech according to the transcription (LBO-field). We observed that
   files with an SNR over 30 dB and no speech according to the
   LBO-field are likely to have invalid transcriptions. There are 24
   such files. A listing follows below:

   File           SNR in dB
   A00007Q3.ESO:  37.96
   A00050C3.ESO:  37.43
   A00075C3.ESO:  30.21
   A00094C3.ESO:  40.06
   A00162C3.ESO:  36.77
   A00170C3.ESO:  34.56
   A00192C1.ESO:  40.64
   A00192C4.ESO:  39.14
   A00166C3.ESO:  30.49
   A00191C1.ESO:  31.74
   A00191C4.ESO:  33.76
   A00214C3.ESO:  37.56
   A00265A6.ESO:  30.07
   A00326C3.ESO:  32.33
   A00346C3.ESO:  33.15
   A00334C3.ESO:  39.94
   A00350T1.ESO:  33.98
   A00439C4.ESO:  40.13
   A00929Q3.ESO:  34.24
   A01091P1.ESO:  45.38
   A01096A6.ESO:  35.49
   A01115Q2.ESO:  30.13
   A01182Q3.ESO:  32.79
   A01201C1.ESO:  34.56

===========================================================================
5. ANNOTATION FILES

- File empty?
  OK
- No illegal mnemonics used
  OK
- There are no mnemonics missing
  OK
- All mnemonics should be SAM mnemonics or explicitly defined in
  documentation
  OK
- Mandatory (SAM) mnemonics:
  LHD: V5.0
  DBN: SPEECHDAT(M)_
  VOL: FIXED0_
  SES:
  DIR:
  SRC:
  CCD:
  RED:
  RET:
  SAM: 8000
  BEG:
  END:
  SNB: 1
  SBF:
  SSB: 8
  QNT: A-LAW
  CMP: GZIP, 1.2.4
  SCD:
  SEX: male/female/unknown   ! SEX and AGE may also only appear
       (one letter)          ! in speaker table if SCD is provided
                             ! in label file
  AGE:                       ! mnemo is not SAM
  REG:
  LBD:
  LBR: , , [gain], [minimum val], [maximum val], orthogr. prompt
  LBO: , , , transliteration
  EXT: [if needed for LBR and LBO, > 80 char]
  ELF:
  . LHD and TYP are first
  . LBR and LBO come after LBD
  . ELF is end of file keyword
  . no line may exceed 80 chars
  OK. We did not find any illegal mnemonics. SCD was made identical to
  SES. This is explained in section 1.4 of SPANISH.DOC.
- Optional (SAM) mnemonics (may be omitted or left empty)
  TYP: orthographic
  TXF:
  CMT:
  NCH: 1
  ARC:                       ! mnemo is not SAM
  SHT:                       ! mnemo is not SAM
  EXP:
  SYS:
  DAT:
  SPA:
  PHM:                       ! mnemo is not SAM
  ACC:                       ! mnemo is not SAM
  NET: fixed/gsm ...         ! mnemo is not SAM
  EDU:                       ! mnemo is not SAM
  SOC:                       ! mnemo is not SAM
  REP:
  PCF:
  RCC:
  ENV:
- All files must contain the same mnemonics. This holds as well for
  the optional mnemonics.
  OK
- Each lowest subdirectory does not refer to multiple sheet ids.
  OK
- For spontaneous speech LBR should be left blank or contain a
  mnemonic word (like ).
  => For spontaneous items the literal prompt was put in angle
     brackets.
  => Furthermore, we discovered some files that had empty LBO-fields
     but contained the prompted speech! This was the case for
     A00075C3, A00063C3 and A00094C3. (This list is not exhaustive.)
- Obligatory and optional label mnemonics not provided in the label
  files should be provided in a file `CONTENTS.LST' from which this
  information can be derived and added to the label file by the
  validating institute.
OK - Transliterations only in lower case letters, also at sentence beginning Only exception: proper names and spelled words, ZIP codes, acronyms and abbreviations In the latter case blanks should be used in between the letters. OK - Punctuation marks should not be used in the transliterations OK - Digits must appear in full orthographic form OK - In principle only the following symbols are allowed to indicate non-speech acoustic events: [Filled_Pause] [Speaker_Other] [Nonspeaker_Other] Other symbols (and language equivalents) must be mentioned in the documentation OK - Asterisks should be used to indicate incomplete realisations OK - According to a spelling check on annotated text (including bracket check) up to 1% errors may be found A spelling check could not be performed by us. A check on bracket balancing did not yield any incomplete pairs. - The label files are associated with the correct speech files. (This cannot be done automatically at this moment. We can only point at files that are incidentally found as mismatched during the transcription and/or speech file validation) OK - Assessment of speech items in terms of SNR, presence of additional noise adherence to prompting text is provided (optional) OK. This information is in the SUMMARY.TXT files. ====================================================================== 6. LEXICON - Check lexicon existence OK, it is provided as \FIXED0ES\TABLE\LEXICON.TBL and as \FIXED0ES\TABLE\LEXICON.ASC. - Lexicon contents should be taken from actual utterances (from LBO) OK - The entries should be alphabetically ordered (ISO) OK - In transcriptions only SAMPA symbols are allowed OK, a few additional symbols (/N,z/) are listed in the documentation. - Capitals only in proper names,spelled words, and in single letters derived from abbreviations (exception: German) OK - Phoneme symbols must be separated by blanks OK - A line in the lexicon should have the following format [ ] [] OK - Alternative transcriptions are optional. 
They may follow the first transcription, separated by [TAB], or have a separate entry (only if frequency information is also supplied)
    Not used

- Orthographic entries are as a rule split by apostrophes, but not by dashes.
    There are neither apostrophes nor dashes in the orthographic entries.

- The lexicon should be complete
  . Check for undercompleteness (are all words in the lexicon)
    OK
  . Check for overcompleteness (invalid words have a * and should not be in the lexicon; the same goes for words truncated due to a recording error, which is indicated by ~)
    OK

- Optional information: stress, word/morphological/syllabic boundaries. If provided, it should follow the SpeechDat conventions.
    OK, stress information is supplied correctly by single quotes, and syllable boundaries are correctly indicated by periods.

==========================================================================

7. SPEAKERS

- Speaker database file
  . check existence
    OK

- Allowed formats:
  a. SAM mnemonics
  b. record file with commas as field separators and strings between double quotes
    Option b. was chosen.

- Obligatory information:                SAM:
  1. unique number (speaker/caller)      SCD (or less preferably SES)
  2. sex                                 SEX
  3. age                                 AGE
  4. region of call                      REG
    OK
    It seems that SES is used as speaker code. This is explained in section 1.4 of SPANISH.DOC.

- Optional information:
  . height                               HET
  . weight                               WET
  . native language                      NLN
  . accent                               ACC
  . ethnic group                         ETH
  . education level                      EDL
  . smoking habits                       SMK
  . pathologies                          PTH
  . socio-economic status                SOC
    Not used.

- Balance of sexes
  . How many males, how many females, should match specification in documentation file
  . Imbalance may not exceed 5%
    OK, by our own inspection we found 494 female speakers and 508 male speakers. This exactly matches the documentation. The reason for including 1002 speakers in the database instead of the required 1000 is explained in section 1.4 of SPANISH.DOC.

- Balance of regions
  . which regions and how many of each should match specification in documentation file
    We found some 50 different 'regions', but we are unable to match these against the five main regions mentioned in the documentation (section 1.5.1 of SPANISH.DOC). We have verified the number of unknown speakers, which, as documented, added up to 19.

- Balance of ages
  . which age groups and how many of each should match specification in documentation file
  . A minimum of 20% of speakers must be in each of the following age groups: 17-30, 31-45, 46-60. A maximum of 40% of speakers may be younger than 17 or older than 60.
    We found the following age distribution:

      under 17 :   13 =  1.30 %
      17 - 30  :  527 = 52.59 %
      31 - 45  :  283 = 28.24 %
      46 - 60  :  156 = 15.57 %
      over 60  :   23 =  2.30 %

 => It can be concluded that there are too few speakers in the 46-60 age group: 15.57% against the required minimum of 20%.
    The 9 speakers whose age is unknown according to the speaker table have 0 as field value for the mnemonic AGE in the label files.

==========================================================================

8. RECORDING CONDITIONS

- Digital telephone line
    OK
- A-law coding
    OK
- Specification of wireless telephone or not (optional)
    Not provided
- Time stamps on file
    OK
- Recording information may be stored in a separate file (optional)
  - this file may have two formats:
    a. SAM mnemonics
    b. record table with commas as field separators and strings between double quotes
  - The primary key in the label file is the RCC mnemonic
  - name of file: ~\TABLE\REC_COND.SAM or ~\TABLE\REC_COND.TBL
  - Information:                         SAM:
    . recording conditions code          RCC
    . region of call                     REG
    . telephone area code                ARC
    . environment                        ENV
    . telephone model                    PHM
    . telephone network                  NET
    . recording city                     CTY
    . recording car                      CAR
    . speed                              SPD
    . fan noise                          FAN
    . ground type                        GRD
    . wipes                              WIP
    This optional file is not provided.

==========================================================================

9. TRANSCRIPTION

This validation is carried out by taking 5% of the short items and 5% of the long items in the corpus. The transcriptions in the label files for these samples are checked by listening to the corresponding speech files. This check is performed by native speakers of the language involved.

Short items are:
- isolated digit
- time phrases
- date phrases
- yes/no questions
- place name
- application words

Long items are:
- connected digits
- natural numbers
- money amounts
- spelled words
- application phrases
- phonetically rich sentences

Given that 23 long items were present in the database and that there were 1002 speakers, a selection of 5% of the long items would comprise 1152 samples. A selection of 1132 items was actually used for the evaluation.
For the short items, all 16 items were included in the database. A selection of 5% yields a sample of 802 items. A random selection of 791 short items was actually used for the evaluation.

- The evaluation comprises the following criteria
  . did the speaker actually speak the transliterated words
  . did the speaker speak the prompted text
  . is the transliteration of non-speech acoustic events correct
  . speech quality, line quality
  . up to 5% transcription errors are allowed
- Abbreviations may only be used if spoken as such

RESULTS

1. Long items
 => In the sample of 1132 long items, transcription errors were found in 165 items. This amounts to 14.5%, which is over 5%. However, most of the errors were found in the transcription of non-speech acoustic events. If we disregard these errors, only 33 errors remain, yielding an acceptable figure of 2.9% transcription errors.

2. Short items
In the sample of 791 short items, transcription errors were found in 77 items. This amounts to 9.7%, which is over 5%. However, most of the errors were found in the transcription of non-speech acoustic events. If we disregard these errors, only 5 errors remain, yielding a very satisfactory figure of 0.6% transcription errors.
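
The sample sizes and error percentages reported in this section can be re-derived from the counts given above. The following Python fragment is a minimal sketch of that arithmetic. All function and variable names are hypothetical and are not part of any SpeechDat(M) tooling; only the counts and the 5% limit are taken from this report. The recomputed percentages agree with the reported ones to within about 0.1 percentage point.

```python
# Illustrative re-computation of the figures in section 9.
# All names are hypothetical; only the counts come from the report.

SPEAKERS = 1002
SAMPLE_FRACTION = 0.05
ERROR_LIMIT_PCT = 5.0


def target_sample_size(items_per_speaker: int) -> int:
    """Size of a 5% sample of all recordings, rounded to the nearest item."""
    return round(items_per_speaker * SPEAKERS * SAMPLE_FRACTION)


def error_rate_pct(errors: int, sample_size: int) -> float:
    """Percentage of sampled items containing a transcription error."""
    return 100.0 * errors / sample_size


long_target = target_sample_size(23)    # 23 long items per speaker
short_target = target_sample_size(16)   # 16 short items per speaker
print("target sample sizes:", long_target, short_target)

# (sample size, all errors, errors excluding non-speech acoustic events)
results = {
    "long": (1132, 165, 33),
    "short": (791, 77, 5),
}

for name, (n, all_err, speech_err) in results.items():
    all_pct = error_rate_pct(all_err, n)
    speech_pct = error_rate_pct(speech_err, n)
    print(f"{name}: {all_pct:.1f}% overall, {speech_pct:.1f}% speech only, "
          f"speech only within limit: {speech_pct <= ERROR_LIMIT_PCT}")
```

With these counts, the error rates that exclude non-speech acoustic events stay below the 5% limit for both item classes, whereas the overall rates do not.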
A list of errors found in the samples can be supplied upon request.

OTHER REMARKS

=> Dates above (the year) 2000 are prompted incorrectly according to our Spanish informant. For years above 2000 speakers expect 'del', but the prompt text reads 'de'. Quite a few speakers are confused by this erroneous prompt and start hesitating. A remark about this is made in SPANISH.DOC, section 1.3.9.

=> Furthermore, we discovered some files that had empty LBO-fields but did contain the prompted speech! This was the case for A00075C3, A00063C3 and A00094C3. (This list may not be exhaustive.)

==========================================================================

10. SUMMARY

Below we give a brief overview of our findings with respect to the Spanish database. The subsections follow the order of the topics in the previous sections of the report. As a general comment, the Spanish database follows the SpeechDat specifications very closely.

1. Documentation
The documentation is extensive and in general correct. A description of the procedure used to generate the phonemic transcriptions in the lexicon from the orthographic entries is absent.

2. Database structure and file names
The database structure and file names comply with the SpeechDat specifications.

3. Items
None of the obligatory items is structurally missing. There are three extra optional items: C4, A7, and A8. All mandatory application words are included in the database. Only one obligatory file is missing. If we take into account files with empty LBO-fields, we find that 6.9% of the calls miss up to 3 mandatory items, and 0.1% of the calls miss more than three items. This means that the SpeechDat criterion for missing items is easily fulfilled.

4. Sampled data files
There were no files with extreme lengths.
By listening to the files we concluded that files with a clip ratio over 0.2% are potentially bad, and that files with a clip ratio over 1.4% are, as a rule, highly distorted. There are 30 files with a clip ratio over 1.4%. Three directories are heavily distorted due to clipping, i.e. SES0113, SES0286, SES1332. Files in directories with a mean value of more than 100 or less than -90 were inspected visually and aurally. There were no marked deviations in these files. We found one call, SES0646, with an overall SNR of 4 dB. The speech in this call is very weak and contains a severe buzz. Furthermore, the call in SES0752 has a weak recording level with severe noise.

5. Label files
We did not find any illegal mnemonics. SCD was made identical to SES. Furthermore, we discovered some files that had empty LBO-fields but contained the prompted speech! This was the case for A00075C3, A00063C3 and A00094C3.

6. Lexicon
The lexicon is perfectly formatted and complete.

7. Speakers
The information on the speakers is documented in the correct format in the speaker table. The balancing of speaker sexes is OK. However, there is an imbalance of speaker ages: only 15.57% of the speakers are in age group 46-60, instead of the required minimum of 20%.

8. Recording platform
The recording platform complied with the SpeechDat criteria.

9. Transcription
Long items
In the sample of 1132 long items, transcription errors were found in 165 items. This amounts to 14.5%, which is over 5%. However, most of the errors were found in the transcription of non-speech acoustic events. If we disregard these errors, only 33 errors remain, yielding an acceptable figure of 2.9% transcription errors.

Short items
In the sample of 791 short items, transcription errors were found in 77 items. This amounts to 9.7%, which is over 5%. However, most of the errors were found in the transcription of non-speech acoustic events.
If we disregard these errors, only 5 errors remain, yielding a very satisfactory figure of 0.6% transcription errors.

Other remarks
Dates above (the year) 2000 are prompted incorrectly according to our informant.

==========================================================================
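
As an illustration of the age-balance criterion applied under 7 (Speakers), the following Python sketch recomputes the age-group shares from the counts found in the speaker table and flags any group that falls below the 20% minimum. The names are hypothetical; only the counts and the thresholds are taken from this report. Reading the 40% maximum as applying to the two outer groups combined is an assumption.

```python
# Illustrative check of the age-balance criterion from section 7.
# Names are hypothetical; counts and thresholds are from the report.

AGE_COUNTS = {
    "under 17": 13,
    "17-30": 527,
    "31-45": 283,
    "46-60": 156,
    "over 60": 23,
}
CORE_GROUPS = ("17-30", "31-45", "46-60")  # each needs at least 20%
MIN_CORE_PCT = 20.0
MAX_OUTER_PCT = 40.0  # assumption: under 17 and over 60 taken together

total = sum(AGE_COUNTS.values())
shares = {group: 100.0 * n / total for group, n in AGE_COUNTS.items()}

# Core groups that do not reach the required minimum share
too_small = [g for g in CORE_GROUPS if shares[g] < MIN_CORE_PCT]
outer_pct = shares["under 17"] + shares["over 60"]

for group, pct in shares.items():
    print(f"{group:>8}: {AGE_COUNTS[group]:4d} = {pct:5.2f} %")
print("below 20% minimum:", too_small)
print("outer groups within 40% maximum:", outer_pct <= MAX_OUTER_PCT)
```

On these counts, only the 46-60 group is flagged, which matches the finding reported in the summary.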