SUBJECT: Validation Italian SpeechDat(M) corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 1.1
DATE   : 31 July 1996

The speech databases made within the SpeechDat(M) project were validated
by SPEX, Leidschendam, the Netherlands, to assess their compliance with
the SpeechDat(M) format and content specifications, as documented in
Deliverable 1.4.1 of the project. The validation results of the Italian
SpeechDat(M) database are contained in this document.

In the validation procedure we systematically check a list of validation
criteria for a range of subjects. In the following sections we will
evaluate these criteria one by one. Validation results that call for
attention are marked by =>.

The following subjects were validated:

 1 DOCUMENTATION
 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
 3 ITEMS
 4 SAMPLED DATA FILES
 5 ANNOTATION FILES
 6 LEXICON
 7 SPEAKERS
 8 RECORDING PLATFORM
 9 TRANSCRIPTION

The document is concluded by

10 SUMMARY

====================================================================

1. DOCUMENTATION

The documentation of the database is in file \FIXED0IT\DOC\ITALIAN.DOC.

- Language of doc file: preferably English
    OK
- Contact person: name, address, affiliation
    OK
- Number of CDs
    OK, in section 1
- Contents of each CD
    OK, in section 1
- The directory structure of the CDs
    OK, section 1.3
- Naming conventions for directories and files
    OK
- List of missing items
    Not provided, but a way to retrieve it is suggested in section 1.3
    of the document.
- Speaker demographics
  . which regions, how many of each
      OK, in section 5.1
  . motivation for selection of regions
      OK, in section 5.1
  . which age groups, how many of each
      OK, in section 5.2
  . sexes: males, females, also children?; how many of each.
      OK, in section 5.2
=>  The total number of speakers emerging from section 5.1 and section
    5.2 respectively is different (see also section 7 of our report).
    This is because there are some speakers that called more than once.
    This fact has been acknowledged in section 5.2 but not in section 5.1.
- Reference to a file where speaker characteristics are stored
  (SPEAKER.TBL)
    OK, in section 1.3
- The number of items on the CD and per speaker
    OK in README.TXT and in the SUMMARY.TXT files
- Prompting
  . linguistic specification (and motivation) for the prompting material
      OK, in section 2.3
  . connection of sheet items to item numbers on CD
      OK, in section 3, and in the README.TXT file
  . sheet example
      OK, section 6
  . items must be spread over the sheet to prevent list effects
    (e.g. three yes/no questions immediately after another are not
    allowed)
      OK, appears from sample sheet; information is also provided in
      section 2.3.
- Analysis of frequency of occurrence of the sub-word units represented
  in the phonetically rich sentences (either of phones, biphones,
  triphones)
  . recommended: at least 2 samples of each phone per caller
    (should appear from documentation)
=>    There are no such statistics
- Recording platform should be specified
  . digital telephone net link
      OK, in section 1 and section 2.1
- Signal characteristics (number of bits per sample; bandwidth; coding
  type; compression procedures)
    OK, in section 1.1
- The format and the file header structure of speech files
    OK, in section 1.1
- The format and the file header structure of annotation files
    OK, in section 1.1
- Annotation
  . procedure
=>    Very little information is given about the procedure.
      Some information on transcriptions is in sections 2.4, 3.7, 4.3,
      4.4, 4.5.
  . quality assurance
      Not done, section 4.4.
  . character set used for annotation (transcription)
      OK, in section 4.3
  . annotation symbols for non-speech acoustic events must be mentioned
    at least for [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
      OK, in section 4.5
  . list of symbols used to denote word interruptions and break-offs
      Italian follows the SpeechDat conventions listed in DESIGN.DOC.
- Lexicon information
  . Procedures to obtain phonemic forms from orthographic input
    (lexicon generation and layout)
  . Overview of SAMPA symbols used (only in this way can it be checked
    whether the lexicon contains only legal symbols).
    (Alternatively, D141 refers to the standard SAMPA definitions on a
    WWW server and this may be sufficient to check against)
=>    There is no special information about the lexicon whatsoever.
      We assume that everything in DESIGN.DOC is valid.
- Transcription manual: TRANSCRIP.DOC (optional)
  . is it there?
  . does it contain the relevant information?
    . What is done with non speech events
    . What is done with capitals
    . Only one spelling of each word is allowed
=>    There is no transcription manual.
      Some information on transcriptions is in sections 2.4, 3.7, 4.3,
      4.4, 4.5.
- Only one spelling of each word is allowed. Therefore a list of
  normalised spellings for words with alternative spellings should be
  included. Otherwise a statement why such a list is not necessary.
=>  Not provided
- Indication of how many of the files were double checked by the
  producer together with percentage of detected errors
=>  Not provided

Other remarks: None

==========================================================================

2. DATABASE STRUCTURE CONTENTS AND FILE NAMES

- Directory / subdirectory conventions
  Format of directory tree should be \\\\
  . data base: defined as <#>
               can be FIXED, MOBIL, VERIF
               <#> is 0 for SpeechDat(M) and 1 for SpeechDat
               is the ISO two-letter code for the language
  . volume   : is a progressive number specifying the CD containing the
               material. Defined as CD where is the number.
  . block    : defined as BLOCK where is a progressive number from 00
               to 99. Block numbers are unique over all CDs. They could
               typically be the first two digits of below.
  . session  : defined as SES where is the session code also appearing
               in file name
    OK
- A README.TXT file should be in the root describing all (documentation)
  files on the CD-ROM.
    OK
- A file containing a shortened version of the volume name (11 chars
  max.) should be in the root directory. The name of this file is
  DISK.ID. This file supplies the volume label to UNIX systems that
  cannot read the physical volume label.
  Example of contents: FIXED0EN_00.
    OK
- A copyright statement should be present in the file COPYRIGH.TXT (root)
    OK
- Documentation should be in \\DOC
    OK
- The summary file (SUMMARY.TXT) should be in \\DOC
    OK
- The contents list (CONTENTS.LST) is in \\INDEX
    OK
- Tables should be in \\TABLE
    OK
- Index files (optional) should be in \\INDEX
    OK
    The index files were checked against the missing items that we found,
    the idea being that both should be complementary. For all item codes
    we obtained an exact computational match,
=>  except for the phonetically rich sentences. It appears that for 11
    calls the sentences are not listed in FIXED0IT\INDEX\A0SIT.LST.
    These calls are:
      SES1430 SES1700 SES2043 SES2108 SES2145 SES2181
      SES2585 SES2701 SES2984 SES3165 SES3728
- Prompt sheet files (optional) should be in \\PROMPT
    Not provided.
- Any source code supplied should be in \\SOURCE
  (SAMLIB, V4, and GNU gunzip, version 1.2.4 + licence)
    OK
- The index files (if presented) obey the nomenclature .LST where e.g.
  A0ENN3.LST (see below for item_code)
    OK
- All sessions indicated in the documentation are present on the CDs
    OK
- File naming conventions
  All file names should obey the following pattern: DDNNNNCC.LLF
    DD   : database identification code
           For SpeechDat(M): A0 = fixed net, B0 = mobile
           For SpeechDat   : A1 = fixed net, B1 = mobile,
                             C1 = speaker verification
    NNNN : session code 0000 to 9999
    CC   : item code; first character is item type identifier,
           second character is item number
    LL   : ISO-639 language code (with extensions)
    F    : speech file type
           Z is for A-law, compressed
           O is for Orthographic label (label file)
    OK
- Correct item codes should be used:
    I1   : isolated digit
    C1   : 4 digit id of prompt sheet
    C2   : ~10 digit telephone number
    C3   : ~12 digit credit card number
    N1-3 : 3 natural numbers
    M1-2 : 2 money amounts
    L1-3 : 3 spelled words
    T1   : 1 time of day
    T2   : 1 time phrase
    D1-3 : 3 dates
    Q1-3 : 3 yes/no questions
    P1   : city of call/birth
    A1-6 : 6 common application words
    E1-3 : 3 application word phrases
    S1-9 : 9 phonetically rich sentences
    OK
- NNNN in filenames is not in conflict with BLOCK and SES numbers in
  pathname
    OK
- Contents lowest level subdirectories should be of one call only
    OK
- Empty (i.e. zero-length) files are not permitted
    OK
- Counts should match information in documentation
  . count of files in each subdirectory
  . count grand total
=>  There is no information about this in ITALIAN.DOC or in the
    README.TXT file.
- File match: For each label file there must be one speech file and
  vice versa.
    OK
- Part of the corpus should be designed for training and a (typically
  smaller) part for testing. This is optional.
    Partitioning is not provided.
- The contents of the database as given in CONTENTS.LST should comprise
  . CD-ROM volume name (VOL:)
  . full pathname (DIR:)
  . speech file name (SRC:)
  . speaker code (SCD:)
  . speaker sex (SEX:)
  . speaker age (AGE:)
  . region of call (REG:)
  . orthographic transcription of uttered item (LBO:)
  This file must be supplied as an ASCII delimited file (either using
  TAB, or commas and (double) quoted strings).
    OK
- The contents of the SUMMARY.TXT files should comprise:
  . The full directory name where speech and label files are to be found
  . the session number
  . a string of typically 39 codes. Each item present is represented by
    its code. If the item is missing, a '--' should appear.
  . recording date
  . recording time of first item
  . optional comment text
  . all fields are separated by spaces
    OK, however some minor deviations from the SpeechDat specifications
    were observed:
=>  All summary files are identical: they do not contain the data of the
    particular CD only, but the data of all CDs.
=>  The 39 items are separated by spaces, which should not be the case.

======================================================================

3. ITEMS

- 1 isolated digit (code I1)
  . read
    OK
- 3 connected digits (code C1-3)
  - 4 digit number to identify the prompt sheet
    . read
      OK
  - ~10 digit telephone number
    . read or spontaneous(?)
      OK
  - ~12 digit credit card number (16 digits would be better)
    . read
    . if there is a checksum then formula must be provided
      OK
  . at least 26 digits per call are required
      OK
  . digits must appear numerically on the sheet, not as words
      OK
=>    However, in the label files the digits are spelled as words and
      not represented in numbers. This feature of VOX (see sections 2.4,
      3.5) is fine for the transcribed text but not for the prompted
      text.
  . at least one example per digit per caller
=>    This cannot be validated since digits are not given numerically.
- 3 natural numbers (code N1-3)
  . read
  . provided as numbers
  . numbers must be < 1,000,000
  . one may be a decimal number
  . one may be a quantity (including a unit of measurement)
  . sufficient examples of each word to permit training
    OK
=>  However, in the label files the digits are spelled as words and not
    represented in numbers.
    This feature of VOX (see sections 2.4, 3.5) is fine for the
    transcribed text but not for the prompted text.
- 2 money amounts (code M1-2)
  . read
  . currency words should be included
  . one small amount including decimals and one large amount not
    including decimals
    OK, there is no distinction between large and small numbers because
    the Italian currency does not use decimals.
- 3 spelled words (code L1-3)
  . read
  . equal balance of all vocabulary letters
  . average length at least 7 letters
  . may include names, cities and other frequently spelled items
  . should include equivalents of : A-Z, accent words, CAPITAL, SMALL,
    UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE, HYPHEN
    OK
=>  equivalents of CAPITAL, SMALL, UPPER-CASE, LOWER-CASE, DOUBLE,
    APOSTROPHE, HYPHEN were not employed.
- 1 time of day (code T1)
  . spontaneous
    OK
- 1 time phrase (code T2)
  . read
  . analogue form
  . equal balance of all words
  . should include equivalents of : AM/PM, HALF/QUARTER PAST/TO, NOON,
    MIDNIGHT, MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY,
    TOMORROW
    OK, the items are present. We calculated a distribution over T2 and
    T3.
=>  Some of the time words do not occur very often:
      a           : 20
      del         : 19
      dell'una    : 11
      della       : 10
      la          : 16
      meta`       : 15
      mezzanotte  : 12
      mezzogiorno :  7
      nella       : 11
      punto       : 11
      serata      : 12
      ventitre`   :  7
- 1 date (code D1)
  . spontaneous
    OK
- 2 dates (code D2-3)
  . read, wordstyle
  . analogue form
  . covering all weekdays and months
    OK, well covered
- 3 yes/no questions (code Q1-3)
  . spontaneous, not prompted
  . balance between yes/no
    OK
- city of call/birth (code P1)
  . preferably spontaneous; read is permitted
    OK
- 6 common application words (code A1-6)
  . set of 50 should be defined
  . 39 are fixed for all partners, see Appendix A Del 1.4-1
  . read
    OK
    See further under 2 below.
- 3 application word phrases (code E1-3)
  . application word is embedded in phrase
  . read or spontaneous
    OK
- 9 phonetically rich sentences (code S1-9)
  . read
  . recommended: at least 2 samples of each phone per caller
    (should appear from documentation)
    OK
=>  There is no information about phoneme coverage per speaker, neither
    in the ITALIAN.DOC file nor in other files.

All obligatory items are present; none is structurally missing.
There is an additional, optional, time phrase in the database: T3.

2. Application words

In appendix A of SpeechDat deliverable 1.4-1 a list of 39 obligatory
application words is provided (DESIGN.DOC in the Italian database).
It was checked whether all application words that should be present
according to this list were there. It was found that all words were
recorded, with the exception of 'imposta' and 'ancora'. These two words
are mentioned in DESIGN.DOC, but no longer in the Italian documentation
itself (ITALIAN.DOC, section 3.12).

3. Incidentally missing items

a. files that are not there

We found that 426 files were missing. Distributed over calls this yields
the following statistic:

  Freq.   Nr of items missing in a call
   26      1
    3      2
    1      3
    1      5
    1      9
    1     10
    1     11
    1     13
    1     18
    1     19
    2     20
    2     21
    1     23
    1     24
    5     25
    2     26

The calls that miss more than 3 items are:

  Session   Nr of items missing
  SES2095    5
  SES1409    9
  SES1410   10
  SES2290   11
  SES1417   13
  SES2986   18
  SES2181   19
  SES1700   20
  SES2984   20
  SES2145   21
  SES2701   21
  SES1309   23
  SES3728   24
  SES1380   25
  SES1430   25
  SES2043   25
  SES2108   25
  SES2585   25
  SES1608   26
  SES3165   26

b. files with empty transcriptions in the LBO label field

Apart from files physically missing in the database there are also
files that do not contain the targeted speech. We have looked at the
transcriptions. If we do not find speech, but only noise events, then
the file is considered effectively empty. In this way we found 12 calls
with one effectively missing item.

4. Overall conclusion

SpeechDat has the following criteria for missing items:

- 85% (850) out of 1000 calls must be complete
  . A maximum of 10% (100) of the calls may miss up to 3 mandatory items
  . A maximum of 5% (50) of the calls may miss more items
  (A complete call is one with all speech files recorded for all prompt
  items)

There are no structurally missing items in the Italian database.
Looking at the incidentally missing items we find 30 calls missing up
to three items, and 20 calls that miss more items. This is well within
the SpeechDat limits. In total 18 calls miss 10 or more items, which is
rather a lot.

If we also take into account the files that we consider effectively
missing, i.e. the files with no target speech in their transcriptions
(see 3b), then we find a total of 41 calls that miss up to three
obligatory items and 20 calls missing more. This is still well within
the SpeechDat limits.

There may also be other files that are effectively missing (corrupted
speech files). These are dealt with in the next section.

===========================================================================

4. SAMPLED DATA FILES

1 File structure
  . SAM
    OK

2 Coding
  . A-law, 8 bit, 8 kHz
  . Compression by GZIP
    OK

3 Sample distribution

Several sample distributions are checked:

3.1 File length

We calculated the length of the files in seconds in order to trace
spurious recordings if files were of extraordinary length.

Duration distribution over all items (including the optional item T3):

  Length (s)   #Occurrences
   0 -  1 :        4
   1 -  2 :       22
   2 -  3 :     2290
   3 -  4 :     7474
   4 -  5 :     6808
   5 -  6 :    10448
   6 -  7 :     5995
   7 -  8 :     2250
   8 -  9 :     1379
   9 - 10 :     1041
  10 - 11 :      428
  11 - 12 :      262
  12 - 13 :      143
  13 - 14 :      156
  14 - 15 :      141
  15 - 16 :      141
  16 - 17 :      124
  17 - 18 :      104
  18 - 19 :       75
  19 - 20 :       47
  20 - 21 :       46
  21 - 22 :       31
  22 - 23 :       14
  23 - 24 :        9
  24 - 25 :        4
  25 - 26 :        4
  26 - 27 :        9

Duration distribution per call:

  Length (s)   #Occurrences
   3 -  4 :        4
   4 -  5 :      171
   5 -  6 :      637
   6 -  7 :      171
   7 -  8 :       14
   8 -  9 :        3

A long file duration did not appear to be indicative of problems with
the file concerned.
The directories with a very long duration will be addressed when we
consider directories with a very low SNR.

3.2 min-max samples

We provide a histogram with clipping ratios. The clipping ratio is
defined as the number of samples in a file that are equal to the maximum
or minimum value, divided by the number of all samples in the file. The
histogram, then, is an overview of how many files were found in a set of
clipping rate intervals.

Clip distribution for all items (including T3):

  Clipping       #Occurrences
  rate (in %)
  0.0 - 0.1 :     7231
  0.1 - 0.2 :     1770
  0.2 - 0.3 :      926
  0.3 - 0.4 :      436
  0.4 - 0.5 :      304
  0.5 - 0.6 :      214
  0.6 - 0.7 :      129
  0.7 - 0.8 :      113
  0.8 - 0.9 :      105
  0.9 - 1.0 :       55
  1.0 - 1.1 :       36
  1.1 - 1.2 :       46
  1.2 - 1.3 :       37
  1.3 - 1.4 :       14
  1.4 - 1.5 :       21
  1.5 - 1.6 :       16
  1.6 - 1.7 :       13
  1.7 - 1.8 :        9
  1.8 - 1.9 :       13
  1.9 - 2.0 :        6
  2.0 - 2.1 :       10
  2.1 - 2.2 :        6
  2.2 - 2.3 :        9
  2.3 - 2.4 :        7
  2.4 - 2.5 :        2
  2.5 - 2.6 :        3
  2.6 - 2.7 :        4
  2.8 - 2.9 :        4
  2.9 - 3.0 :        1
  3.0 - 3.1 :        2
  3.1 - 3.2 :        2
  3.2 - 3.3 :        2
  3.3 - 3.4 :        2
  3.4 - 3.5 :        1
  3.5 - 3.6 :        3
  3.6 - 3.7 :        1
  3.7 - 3.8 :        2
  3.9 - 4.0 :        2
  4.2 - 4.3 :        1
  4.3 - 4.4 :        2
  4.4 - 4.5 :        1
  4.5 - 4.6 :        2
  4.6 - 4.7 :        1
  4.7 - 4.8 :        1
  4.8 - 4.9 :        1
  4.9 - 5.0 :        1
  5.5 - 5.6 :        1
  6.0 - 6.1 :        1
  6.3 - 6.4 :        1

  Number of files with absolute maximum < 32256: 27879

Clip distribution per call:

  Clipping       #Occurrences
  rate (in %)
  0.0 - 0.1 :      513
  0.1 - 0.2 :       64
  0.2 - 0.3 :       18
  0.3 - 0.4 :       12
  0.4 - 0.5 :       12
  0.5 - 0.6 :        7
  0.6 - 0.7 :        2
  0.7 - 0.8 :        1
  0.8 - 0.9 :        2
  1.0 - 1.1 :        1
  1.2 - 1.3 :        1
  1.4 - 1.5 :        1
  1.5 - 1.6 :        1
  3.3 - 3.4 :        1

  Number of directories with absolute maximum < 32256: 364

=>  By auditory and visual inspection of subsets of the files we
    concluded that all files with a clip ratio over 0.7% are severely
    clipped; this amounts to a total of 560 files. Files with a clipping
    rate between 0.4% and 0.7% should also be considered as potentially
    severely clipped.
=>  A directory appears to have been clipped severely if the mean
    clipping exceeds 0.6%; this is a total of 10 calls.
    These calls are of very bad quality. The directories concerned are:

      Call:      Mean clipping rate (%):
      SES1860    0.65
      SES2656    0.68
      SES1239    0.77
      SES1889    0.84
      SES1891    0.88
      SES2164    1.04
      SES1855    1.24
      SES3401    1.41
      SES1949    1.56
      SES2447    3.39

    However, directories with a mean clipping rate below 0.6% may also
    contain many severely clipped files. This is especially true for
    directories with mean clipping rates over 0.3%.

3.3 Mean values

We computed the mean sample value of each item in each call. We provide
a histogram with mean values below. The histogram, then, is an overview
of how many files were found in a set of mean sample value intervals.
This overview can be used to trace files with large DC-offsets.

Mean distribution over all items (including T3):

  Mean               #Occurrences
  -1325 - -1300 :        1
   -850 -  -825 :        1
   -750 -  -725 :        1
   -700 -  -675 :        1
   -575 -  -550 :        3
   -550 -  -525 :        1
   -525 -  -500 :        3
   -500 -  -475 :        1
   -400 -  -375 :        3
   -375 -  -350 :        1
   -350 -  -325 :        1
   -325 -  -300 :        2
   -300 -  -275 :        1
   -225 -  -200 :        2
   -175 -  -150 :        5
   -150 -  -125 :       25
   -125 -  -100 :      136
   -100 -   -75 :      683
    -75 -   -50 :     1712
    -50 -   -25 :     2630
    -25 -     0 :    14391
      0 -    25 :    17927
     25 -    50 :     1375
     50 -    75 :      327
     75 -   100 :      125
    100 -   125 :       55
    125 -   150 :       13
    150 -   175 :       11
    175 -   200 :        7
    200 -   225 :        4
    225 -   250 :        1

Mean distribution per call:

  Mean               #Occurrences
   -125 -  -100 :        4
   -100 -   -75 :       12
    -75 -   -50 :       49
    -50 -   -25 :       62
    -25 -     0 :      393
      0 -    25 :      434
     25 -    50 :       40
     50 -    75 :        3
     75 -   100 :        3

After auditory and visual inspection of a subset of the files we
conclude that the extreme mean sample values are not indicative of bad
calls. The one file with an average sample value of -1305
(A02612T1.ITZ) was found to be interrupted after the speech recording.
After the break the sample values are all about -5500 for the tail of
1.7s. The file should be cut to be usable.

3.4 Signal to Noise Ratio

We split each signal file into contiguous windows of 10 ms and computed
the Mean Square (energy) in each window.
The mean sample value over the complete file was subtracted from each
individual sample value before MS was computed. The 5% of the windows
with the lowest energy were assumed to contain line noise. In this way
the signal to noise ratio could be calculated for each file by dividing
the mean energy over all windows by the mean energy of the 5% sample
mentioned above. The base-10 logarithm of this ratio was multiplied by
10 to express the result in dB.

SNR distribution over all items (including T3):

  SNR (dB)     #Occurrences
   0 -  5 :        8
   5 - 10 :       12
  10 - 15 :       70
  15 - 20 :      334
  20 - 25 :     1208
  25 - 30 :     3517
  30 - 35 :     7851
  35 - 40 :    10637
  40 - 45 :     9808
  45 - 50 :     4746
  50 - 55 :     1080
  55 - 60 :      140
  60 - 65 :       28
  65 - 70 :        8
  75 - 80 :        1
  90 - 95 :        1

SNR distribution per call:

  SNR (dB)     #Occurrences
  15 - 20 :        5
  20 - 25 :       23
  25 - 30 :       74
  30 - 35 :      207
  35 - 40 :      303
  40 - 45 :      266
  45 - 50 :      109
  50 - 55 :       11
  55 - 60 :        2

The 5 directories with an average SNR below 20 dB were inspected. They
were all bad calls:

  Call:      SNR:
  SES1357    16.5   Very weak and noisy recording
  SES1764    17.5   Very noisy
  SES1863    19.5   Severe buzz
  SES2051    17.5   Very weak and noisy recording
  SES2650    19.5   Very weak and noisy recording

These directories do not overlap with the bad calls found for highly
clipped directories reported in 3.2.

Most files with an SNR below 15 dB were very bad. The recording is very
weak or there is no speech. This amounts to a total of about 90 files.

An SNR of 0.0 dB was found in files A01409A5.ITZ, A01700I1.ITZ,
A02145E1.ITZ, A02181A1.ITZ. Three of these files were cut very short
(about 50 ms) and are therefore unusable. A01700I1.ITZ was longer but
contained only silence.

The other files with an SNR below 10 dB were:

  A02258A1.ITZ  A02257T2.ITZ  A01968Q3.ITZ  A03552S5.ITZ
  A02509E3.ITZ  A02234C2.ITZ  A02271L2.ITZ  A01357Q2.ITZ
  A02051Q3.ITZ  A02908A5.ITZ  A01357I1.ITZ  A01887L3.ITZ
  A02612T1.ITZ  A01344Q3.ITZ  A01357A1.ITZ  A02051A6.ITZ

These files contained only silence or noise. A01344Q3.ITZ contains a
motor-like sound.
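
The per-file measures of sections 3.2 and 3.4 can be illustrated with
the sketch below. It is not the software used for this validation. It
assumes that the samples have already been decompressed and expanded
from A-law to linear integers, that the extreme value is 32256 (the
threshold quoted in 3.2), and that a file without any measurable energy
gets an SNR of 0 dB. The function names clipping_rate and snr_db and all
parameter names are hypothetical.

```python
import math


def clipping_rate(samples, full_scale=32256):
    """Percentage of samples at the extreme value (assumed +/-32256)."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    return 100.0 * clipped / len(samples)


def snr_db(samples, rate=8000, window_ms=10, noise_fraction=0.05):
    """SNR estimate in dB following the description in section 3.4."""
    win = rate * window_ms // 1000          # samples per 10 ms window
    n_windows = len(samples) // win         # trailing partial window is dropped
    if n_windows == 0:
        return 0.0
    # Remove the DC offset measured over the complete file.
    mean = sum(samples) / len(samples)
    # Mean square (energy) of every window.
    energies = []
    for i in range(n_windows):
        window = samples[i * win:(i + 1) * win]
        energies.append(sum((s - mean) ** 2 for s in window) / win)
    energies.sort()
    # The lowest-energy 5% of the windows are taken to be line noise.
    n_noise = max(1, int(n_windows * noise_fraction))
    noise = sum(energies[:n_noise]) / n_noise
    signal = sum(energies) / n_windows
    if noise == 0.0:
        # Assumption: pure silence maps to 0 dB, noiseless speech to infinity.
        return 0.0 if signal == 0.0 else float("inf")
    return 10.0 * math.log10(signal / noise)
```

Under these assumptions a file would fall in the severely clipped class
of 3.2 when clipping_rate(samples) exceeds 0.7, and in the very bad
class of 3.4 when snr_db(samples) is below 15.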
===========================================================================

5. ANNOTATION FILE

- File empty?
    OK, there are no empty annotation files
- Mandatory (SAM) mnemonics:
    LHD: V5.0
    DBN: SPEECHDAT(M)_
    VOL: FIXED0_
    SES:
    DIR:
    SRC:
    CCD:
    REP:
    RED:
    RET:
    SAM: 8000
    BEG:
    END:
    SNB: 1
    SBF:
    SSB: 8
    QNT: A-LAW
    CMP: GZIP, 1.2.4
    SCD:
    SEX: male/female/unknown    ! SEX and AGE may also only appear in
         (one letter)           ! in speaker table if SCD is provided
                                ! in label file
    AGE:                        ! mnemo is not SAM
    REG:
    LBD:
    LBR: , , [gain], [minimum value], [maximum value],
    LBO: , , ,
    EXT: [if needed for LBR and LBO, > 80 char]
    ELF:
  . LHD and TYP are first
  . LBR and LBO come after LBD
  . ELF is end of file keyword
  . no line may exceed 80 chars
    All mandatory mnemonics are used.
=>  The digits in the prompt field (LBR) of the C1-3 items are converted
    from numerical into the orthographical forms. This should not have
    occurred. The same holds for the natural numbers N1-3.
- Optional (SAM) mnemonics (may be omitted or left empty)
    TYP: orthographic
    TXF:
    CMT:
    NCH: 1
    ARC:                        ! mnemo is not SAM
    SHT:                        ! mnemo is not SAM
    EXP:
    SYS:
    DAT:
    SPA:
    PHM:                        ! mnemo is not SAM
    ACC:                        ! mnemo is not SAM
    NET: fixed/gsm ...          ! mnemo is not SAM
    EDU:                        ! mnemo is not SAM
    SOC:                        ! mnemo is not SAM
    PCF:
    RCC:
    ENV:
    ASS:                        ! mnemo is not SAM
    The only optional mnemonic used is ARC.
- All mnemonics should be SAM mnemonics or explicitly defined in
  documentation
    OK
- No illegal mnemonics used
    OK
- There are no mnemonics missing
    There are no obligatory mnemonics missing structurally; however,
=>  a few incidentally missing mnemonics were found:

      File:            Missing mnemonic:
      A01291C3.ITO:    LBO
      A01291C3.ITO:    ELF
      A01811C3.ITO:    LBO
      A01811C3.ITO:    ELF
      A02066N1.ITO:    LBO
      A02066N1.ITO:    ELF
      A02399D3.ITO:    LBO
      A02399D3.ITO:    ELF
      A02501T2.ITO:    LBO
      A02501T2.ITO:    ELF
      A02519M1.ITO:    LBO
      A02519M1.ITO:    ELF
      A02919C2.ITO:    LBO
      A02919C2.ITO:    ELF
      A02464S1.ITO:    LBO
      A02464S1.ITO:    ELF
      A02494S5.ITO:    LBO
      A02494S5.ITO:    ELF
      A02495S5.ITO:    LBO
      A02495S5.ITO:    ELF

    It can be seen that missing mnemonics always occur in pairs: LBO and
    ELF missing. Thus, there are 10 files missing both mnemonics.
- All files must contain the same mnemonics. This holds as well for the
  optional mnemonics.
    OK
- Each lowest subdirectory does not refer to multiple sheet ids.
    OK
- For spontaneous speech LBR should be left blank or contain a mnemonic
  word (like ).
=>  Instead of this, the explicit question is put in the LBR field of
    the spontaneous items (Q1-3, D1, T1, P1).
- Obligatory and optional label mnemonics not provided in the label
  files should be provided in the file `CONTENTS.LST' from which this
  information can be derived (and added to the label file by the
  validating institute, if necessary).
    Not the case.
- Transliterations only in lower case letters, also at sentence
  beginning
  Only exception: proper names and spelled words, ZIP codes, acronyms
  and abbreviations
  In the latter case blanks should be used in between the letters.
  German is the only exception to this convention.
    OK
    Lower case letters are used throughout the database. Capital letters
    are not used. Spelled letters are written in full.
    All text is in plain ASCII, including the prompts and the
    transcriptions, for which ISO-Latin-1 should have been used.
- Punctuation marks should not be used in the transliterations
    OK
- Digits must appear in full orthographic form
    OK
- In principle only the following symbols are allowed to indicate
  non-speech acoustic events:
    [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
  Other symbols (and language equivalents) must be mentioned in the
  documentation
    The following markers were found in the transcribed texts:
      [INT] [RMR] [SCH] [clc] [clp] [grd] [imp] [int] [lin] [rmr]
      [rsp] [scc] [sch] [scr] [sff] [slz] [speaker_other] [tss] [vot]
=>  Not all of these are mentioned in section 4.5 of the ITALIAN.DOC
    file. Not mentioned are:
      [INT] [RMR] [SCH] [clc] [lin] [vot]
    The first four may be typing errors: the first three should be
    decapitalised, and in [clc] the final letter may be misspelled.
    The meaning of [lin] and [vot] is unclear.
- Asterisks should be used to indicate mispronunciations
    OK
- Tildes should be used to indicate truncations
    OK
- According to a spelling check on annotated text (including bracket
  check) up to 1% errors may be found
    Bracket check was OK. Spelling check was not possible.
- A comparison (of some sort) of prompted with spoken text will be
  carried out to check if they match.
    Not done.
- Assessment of speech items in terms of SNR, presence of additional
  noise adherence to prompting text is provided (optional)
    Not provided.

========================================================================

6. LEXICON

- Check lexicon existence
    OK
- Lexicon contents should be taken from actual utterances (from LBO)
    OK
- The entries should be alphabetically ordered
    OK
- In transcriptions only SAMPA symbols are allowed
    OK
- Capitals only in proper names, spelled words, and in single letters
  derived from abbreviations (exception: German)
    Only small letters were used. Spelled letters were written in full,
    not as capitals.
- Phoneme symbols must be separated by blanks
    OK
- A line in the lexicon should have the following format
    [ ] []
    OK
- Alternative transcriptions are optional.
  They may follow the first transcription, separated by [TAB] or have a
  separate entry (only in case also frequency information is supplied)
    Alternative transcriptions are not provided.
- Orthographic entries are as a rule split at apostrophes, but not at
  dashes.
    OK, also underscores were used to divide orthographic digit strings
    into breath groups (see section 4.4 in FIXED0IT\DOC\ITALIAN.DOC).
=>  The orthographic entries are not in ISO-LATIN-1 but in plain ASCII.
- The lexicon should be complete
  . Check for undercompleteness (are all words in lexicon)
=>  If words are split at apostrophes and underscores then we find 53
    entries that are missing in the lexicon. These entries are listed
    below together with their frequencies of occurrence.

      abeba: 1                    angelis: 1                  armerina: 2
      arsizio: 9                  basilicata: 1               brenta: 1
      calabria: 71                centodiciotto: 1            centotrentanove: 8
      cinquantam: 2               cinquantami: 1              cinquantamil: 2
      dalmazzo: 1                 doppio: 2                   elicone: 1
      emilia: 2                   friuli: 1                   ionica: 1
      janeiro: 1                  jonico: 1                   landi: 1
      ligure: 1                   lucania: 1                  michele: 1
      monferrato: 2               normanni: 1                 novantaquattr: 1
      novecentomila: 1            novecentosessantasette: 1   novntacinque: 1
      olona: 2                    ottocen: 1                  quarantami: 3
      quarantamil: 2              rangone: 1                  reale: 1
      salvo: 1                    seicentoventimila: 1        sesessantuno: 1
      sessantamil: 2              settantam: 1                settantami: 1
      settantunomila: 1           severo: 1                   sicula: 1
      telefonica: 232             telesino: 1                 terme: 1
      trentamil: 1                tronto: 1                   valsugana: 1
      ventimi: 1                  vetere: 1

    Quite a few of these missing entries are probably spelling errors
    (they have frequency 1). It is correct that such entries are not in
    the lexicon, but they should have been corrected in the
    transcriptions.
  . Check for overcompleteness (invalid words have a * and should not be
    in lexicon) (the same goes for words truncated due to a recording
    error; this is indicated by ~)
=>  There are 6 entries in the lexicon that are not used in the
    transcriptions:
      aminoacido artiglio ingresso onore oro ostruzionismo
    Overcompleteness is not much of a problem from a practical point of
    view.
- Optional information: stress, word/morphological/syllabic boundaries.
  But, if provided, then it should follow the SpeechDat conventions.
    Not provided.

==========================================================================

7. SPEAKERS

- Speaker database file
  . check existence
    OK
- Allowed formats:
  a. SAM mnemonics
  b. record file with commas as field separators and strings between
     double quotes
    OK, the second option was chosen.
- Obligatory information:                 SAM:
  1. unique number (speaker/caller)       SCD (or less preferably SES)
  2. sex                                  SEX
  3. age                                  AGE
  4. region of call                       REG
    OK, but
=>  the region of call is missing.
- Optional information:
  . height                                HET
  . weight                                WET
  . native language                       NLN
  . accent                                ACC
  . ethnic group                          ETH
  . education level                       EDL
  . smoking habits                        SMK
  . pathologies                           PTH
  . socio-economic status                 SOC
    Not provided.
- Balance of sexes
  . How many males, how many females, should match specification in
    documentation file
  . Imbalance may not exceed 5%
    The following sex distribution was observed:
      Females: 588 = 58.80 %
      Males  : 412 = 41.20 %
=>  It can be seen that the imbalance exceeds the 5% boundary.
=>  In section 5.2 of ITALIAN.DOC the following distribution is printed.
      Females: 583 = 58.95 %
      Males  : 406 = 41.05 %
    The difference is attributable to the fact that some speakers called
    more than once. In the documentation the computation is based on the
    number of speakers pooled over sessions, whereas our computation is
    based on the assumption that each call is made by a different
    speaker.
=>  There is a contradiction in section 5.1 and 5.2 of ITALIAN.DOC.
    According to section 5.1 there is a total of 1000 speakers, and
    according to section 5.2 a total of 989 speakers. In section 5.1 the
    calculation is based on the sessions; in section 5.2 it is based on
    the speakers themselves. This is confusing.
- Balance of regions
  . which regions and how many of each should match specification in
    documentation file
    The following region distribution was found:
      CENTRO  :  53 =  5.30 %
      NORD    : 627 = 62.70 %
      SARDEGNA:  31 =  3.10 %
      SUD     : 289 = 28.90 %
    The figures in section 5.1 of ITALIAN.DOC are slightly different:
      CENTRO  :  53 =  5.30 %
      NORD    : 626 = 62.60 %
      SARDEGNA:  31 =  3.10 %
      SUD     : 288 = 28.80 %
    The difference is in 2 speakers. These are the speakers of unknown
    origin according to section 5.1 of ITALIAN.DOC. The marker unknown
    is not found in the label files
=>  (mnemonic REG:). So, the unknown speakers appear to have been
    assigned to the regions NORD and SUD.
- Balance of ages
  . which age groups and how many of each should match specification in
    documentation file
  . A minimum of 20% of speakers must be in following age groups:
    17-30, 31-45, 46-60. A maximum of 40% speakers may be younger than
    17 or older than 60.
    The following age distribution was found:
      under 17:  31 =  3.10 %
      17 - 30 : 464 = 46.40 %
      31 - 45 : 301 = 30.10 %
      46 - 60 : 179 = 17.90 %
      over 60 :  25 =  2.50 %
    It appears that there are too few speakers in the class 46-60 years
    of age.
    In section 5.2 of ITALIAN.DOC we find slightly different figures due
    to a different grouping, and again pooling calls with the same
    speaker:
      under 16:  27
      17 - 30 : 466
      31 - 45 : 297
      46 - 60 : 174
      over 60 :  25

=======================================================================

8. RECORDING CONDITIONS

- Digital telephone line
    OK
- A-law coding
    OK
- Specification of wireless telephone or not (optional)
    Not provided
- Time stamps on file
    OK
- Recording information may be stored in a separate file (optional)
  - this file may have two formats:
    a. SAM mnemonics
    b. record table with commas as field separators and strings between
       double quotes
  - The primary key in the label file is the RCC mnemonic
  - name of file: TABLE\REC_COND.SAM or TABLE\REC_COND.TBL
  - Information:                          SAM:
    . recording conditions code           RCC
    . region of call                      REG
    . telephone area code                 ARC
    . environment                         ENV
    . telephone model                     PHM
    . telephone network                   NET
    . recording city                      CTY
    . recording car                       CAR
    . speed                               SPD
    . fan noise                           FAN
    . ground type                         GRD
    . wipes                               WIP
    Not provided

=============================================================================

9. TRANSCRIPTION

This validation is carried out by taking 5% of the short items and 5% of
the long items in the corpus. The transcriptions in the label files for
these samples are checked by listening to the corresponding speech
files. This check is performed by native speakers of the language
involved.

Short items are:
  - isolated digit
  - time phrases
  - date phrases
  - yes/no questions
  - place name
  - application words

Long items are:
  - connected digits
  - natural numbers
  - money amounts
  - spelled words
  - application phrases
  - phonetically rich sentences

- The evaluation comprises the following criteria
  . did the speaker actually speak the transliterated words
  . did the speaker speak the prompted text
  . is transliteration of non-speech acoustics events correct
  . speech quality, line quality
  . up to 5% transcription errors are allowed
- Abbreviations may only be used if spoken as such

A random selection of 1135 long items and 808 short items was used for
the transcription validation of the Italian database.

A. Long items

In 313 of the 1135 checked items a correction was considered necessary.
By far the most corrections (288) were related to the transcription of
non-speech acoustic events. There were 25 corrections in the
transcription itself. We did not observe errors of another type.
We found two errors that originated from an error in the prompt.
The original text was:
     registra i messaggi del aiuto
whereas it should read:
     registra i messaggi dell' aiuto

A total of 313 errors on a total of 1135 checked items yields an error
rate of 27.58%. Serious errors concerning the transcription itself were
observed in 25 cases yielding an error rate of 2.20%, which is well below
the 5% criterion.

B. Short items

In 203 of the 808 checked items a correction was considered necessary.
Most corrections (191) were related to the transcription of non-speech
acoustic events. There were 12 corrections in the transcription itself.
We did not observe errors of another type.

A total of 203 errors on a total of 808 checked items yields an error
rate of 25.12%. Serious errors concerning the transcription itself were
observed in 12 cases yielding an error rate of 1.49%, which is well below
the 5% criterion.

In general, we conclude that the non-speech acoustic events are not
captured well by the transcriptions, whereas the transcription of the
target speech is of good quality.

==========================================================================

10. SUMMARY

Below we give a brief overview of our findings with respect to the
Italian SpeechDat(M) database. The subsections follow the order of the
various topics in the previous sections of the report.

In general, the SpeechDat(M) format specifications (as documented in
deliverable 1.4.1) are followed closely. This goes for the documentation
file FIXED0IT\DOC\ITALIAN.DOC, for the structure of the database and the
filenames, for the annotation files, the lexicon, the speaker table and
the contents file. The number of missing files remains within the
SpeechDat(M) criteria. A total of 10 calls is severely clipped; 5 others
have a very low mean SNR. In general, the targeted speech is transcribed
well, but the non-speech acoustic events are less well transcribed.

A more detailed account follows below.

1. Documentation

The main documentation is in file \FIXED0IT\DOC\ITALIAN.DOC.
The documentation is generally in good agreement with the SpeechDat
guidelines. However, the information about the transcription procedure is
rather scarce and scattered. Further, there is no documentation about the
lexicon.

2. Database structure and file names

The SpeechDat(M) specifications are closely followed.
On inspection of the index files it appeared that for 11 calls the
sentences are not represented in FIXED0IT\INDEX\A0SIT.LST.
In the FIXED0IT\DOC\SUMMARY.TXT some minor deviations from the SpeechDat
specifications were observed:
- All summary files are identical; they do not contain the data of the
  particular CD, but all data.
- The 39 items are separated by spaces, which should not be the case.

3. Items

All obligatory items are present. None was structurally missing. There is
one additional optional item: T3.
Looking at the incidentally missing items we find 30 calls missing up to
three items, and 20 calls that miss more items. This is well within the
SpeechDat limits.
If we also take into account the files that we consider effectively
missing, i.e. the files with no target speech in their transcriptions
(see 3b.), then we find a total of 41 calls that miss up to three
obligatory items and 20 calls missing more. This is still well within the
SpeechDat limits.

4. Sampled data files

The speech files are correctly coded and compressed.
By auditory and visual inspection of subsets of the files we concluded
that all files with a clip ratio over 0.7% are severely clipped; this
amounts to a total of 560 files. Files with a clipping rate between 0.4%
and 0.7% should also be considered as potentially severely clipped.
A directory appears to have been clipped severely if the mean clipping
exceeds 0.6%; this is a total of 10 calls.
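The clipping thresholds above can be written down as a small classifier.
The sketch below is only illustrative and is not the tool used for this
validation: it assumes that a clip ratio (in percent) has already been
measured for each file, which this report does not describe; the function
names and the sample values are hypothetical, and treating exactly 0.4%
and 0.7% as "potentially clipped" is our assumption.

```python
from statistics import mean

# Thresholds taken from the text above (all in percent).
FILE_SEVERE_PCT = 0.7       # files above this are severely clipped
FILE_SUSPECT_PCT = 0.4      # files from here up to 0.7 are suspect
CALL_SEVERE_MEAN_PCT = 0.6  # mean over a call directory


def classify_file(clip_pct: float) -> str:
    """Label one speech file by its clip ratio in percent."""
    if clip_pct > FILE_SEVERE_PCT:
        return "severe"
    if clip_pct >= FILE_SUSPECT_PCT:  # inclusive bounds are an assumption
        return "suspect"
    return "ok"


def call_is_severely_clipped(clip_pcts: list[float]) -> bool:
    """A call directory is flagged when its mean clip ratio exceeds 0.6%."""
    return mean(clip_pcts) > CALL_SEVERE_MEAN_PCT


if __name__ == "__main__":
    # Hypothetical clip ratios for the files of one call directory.
    sample_call = [0.9, 0.8, 0.2]
    for pct in sample_call:
        print(f"{pct:.1f}% -> {classify_file(pct)}")
    print("call severely clipped:", call_is_severely_clipped(sample_call))
```
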
With respect to SNR we observed 5 directories with an average SNR below
20 dB. They were all found to be bad calls, most of them containing very
weak and noisy recordings.
Most files with an SNR below 15 dB were very bad: the recording is very
weak or there is no speech. This amounts to a total of about 90 files.

5. Label files

The format of the label files is well in agreement with the SpeechDat(M)
specifications. The following deviation was found.
- In the LBR field of the C1-3 and N1-3 items the digits are printed
  orthographically instead of numerically.
There were 10 label files that lacked the obligatory mnemonics LBO and
ELF.
In the transcriptions only small letters were used. Spelled letters were
written in full, not as capitals.
Not all markers for non-speech acoustic events are mentioned in the
documentation.

6. Lexicon

The lexicon was in the proper format. Only small letters were used.
Spelled letters were written in full, not as capitals.
53 entries found in the transcriptions are not present in the lexicon.
6 entries are present in the lexicon but not used in the transcriptions.

7. Speakers

The speaker table is in the proper format. However, the region of the
call is missing.
There is an imbalance of sexes exceeding SpeechDat's 5% criterion.
There are too few speakers between 46 and 60 years of age.

8. Recording platform

The recording conditions are well in agreement with SpeechDat(M)
conventions.

9. Transcription

In general, we conclude that the non-speech acoustic events are not
captured well by the transcriptions, whereas the transcription of the
target speech is of good quality.

A. Long items

In 313 of the 1135 checked items a correction was considered necessary.
By far the most corrections (288) were related to the transcription of
non-speech acoustic events. There were 25 corrections in the
transcription itself. We did not observe errors of another type.
A total of 313 errors on a total of 1135 checked items yields an error
rate of 27.58%.
Serious errors concerning the transcription itself were observed in 25
cases yielding an error rate of 2.20%, which is well below the 5%
criterion.

B. Short items

In 203 of the 808 checked items a correction was considered necessary.
Most corrections (191) were related to the transcription of non-speech
acoustic events. There were 12 corrections in the transcription itself.
We did not observe errors of another type.
A total of 203 errors on a total of 808 checked items yields an error
rate of 25.12%. Serious errors concerning the transcription itself were
observed in 12 cases yielding an error rate of 1.49%, which is well below
the 5% criterion.

=========================================================================
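
The error rates quoted for the transcription check can be recomputed from
the counts given in the report. The following is a minimal sketch, not
part of the validation software; the function and variable names are our
own, and the only inputs are the counts and the 5% criterion stated above.

```python
CRITERION_PCT = 5.0  # maximum allowed rate of transcription errors


def error_rate(errors: int, checked: int) -> float:
    """Error rate in percent, rounded to two decimals as in the report."""
    return round(100.0 * errors / checked, 2)


# Counts copied from sections 9 and 10 of this report.
COUNTS = {
    "long": {"checked": 1135, "non_speech": 288, "transcription": 25},
    "short": {"checked": 808, "non_speech": 191, "transcription": 12},
}

if __name__ == "__main__":
    for name, c in COUNTS.items():
        total = c["non_speech"] + c["transcription"]
        overall = error_rate(total, c["checked"])
        serious = error_rate(c["transcription"], c["checked"])
        verdict = "within" if serious <= CRITERION_PCT else "exceeds"
        print(f"{name} items: {total} corrections, overall {overall:.2f}%, "
              f"transcription {serious:.2f}% ({verdict} the 5% criterion)")
```

With the counts above this reproduces 27.58% and 2.20% for the long
items and 25.12% and 1.49% for the short items.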