SUBJECT: Validation Portuguese SpeechDat(M) corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 1.3
DATE   : 25 July 1996

The speech databases made within the SpeechDat(M) project were validated by SPEX, Leidschendam, the Netherlands, to assess their compliance with the SpeechDat(M) format and content specifications, as documented in Deliverable 1.4.1 of the project. The validation results of the Portuguese SpeechDat(M) database are contained in this document.

In the validation procedure we systematically check a list of validation criteria for a range of subjects. In the following sections we will evaluate these criteria one by one. Validation results that call for attention are marked by =>.

The following subjects were validated:
 1 DOCUMENTATION
 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
 3 ITEMS
 4 SAMPLED DATA FILES
 5 ANNOTATION FILES
 6 LEXICON
 7 SPEAKERS
 8 RECORDING PLATFORM
 9 TRANSCRIPTION
The document is concluded by
10 SUMMARY

====================================================================
1. DOCUMENTATION

The documentation file used is DESIGN.DOC in the FIXED0PT\DOC directory.

- Language of doc file: preferably English
    OK
- Contact person: name, address, affiliation
    OK
- Number of CDs
    OK
- Contents of each CD
    OK
    => There is no information as to the partition of the information on the three CDs. In particular, it is not mentioned that the third CD contains the phonetically rich sentences.
- The directory structure of the CDs
    OK
- List of missing items
    OK, in section 2, it is said that 20 calls are missing 1 or 2 items,
    => but there is no (separate) full list of missing items (see section 3 of this report).
- Speaker demographics
  . which regions, how many of each
    OK, speaker information is in section 4.
  . motivation for selection of regions
    OK
  . which age groups, how many of each
    OK
  . sexes: males, females, also children?; how many of each.
    OK
    1001 speakers were recorded. A motivation for the extra one is to be found in section 4.
    => However, not all 550 speakers are present on the first CD (see section 2 of this report).
- Reference to a file where speaker characteristics are stored (speaker.tbl)
    OK
- Naming conventions for directories and files
    OK
- Prompting
  . linguistic specification (and motivation) for the prompting material
    OK, section 3
  . connection of sheet items to item numbers on CD
    OK
  . sheet example
    OK, section 3.12
  . items must be spread over the sheet to prevent list effects (e.g. three yes/no questions immediately after another are not allowed)
    OK
- Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich sentences (either of phones, biphones, triphones)
  . recommended: at least 2 samples of each phone per caller (should appear from documentation)
    There are files PHONEMES.TXT and DIPHONES.TXT which contain this information. They are mentioned in the README.TXT file in the root directory.
- Recording platform should be specified
  . digital telephone net link
    OK, section 5.
    => We miss some of the information: Is the local loop a digital one? Are the telephone lines in Portugal digital?
- Statement that all signal transmission between CO and recording site is digital
    => It is not clear which part of the signal transmission over the telephone network is digital and which part is not.
- Signal characteristics (number of bits per sample; bandwidth; coding type; compression procedures)
    OK, section 2
- The format and the file header structure of speech files
    OK, section 2
- The format and the file header structure of annotation files
    OK, section 2
- Annotation
  . procedure
    OK, section 6
  . quality assurance
    OK, section 6
  . character set used for annotation (transcription)
    => Not mentioned
  . annotation symbols for non-speech acoustic events must be mentioned at least for [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
    OK, section 6
  . list of symbols used to denote word interruptions and break-offs
    OK, section 6
- Lexicon information
  . Procedures to obtain phonemic forms from orthographic input (lexicon generation and lay out)
    OK, section 8
  . Overview of SAMPA symbols used (in this manner it can be checked if the lexicon contains only legal symbols).
    OK, section 8
- Transcription manual: TRANSCRIP.DOC (optional)
  . is it there?
    OK, it is included in DESIGN.DOC (section 6)
  . does it contain the relevant information?
    . What is done with non speech events
    . What is done with capitals
    . Only one spelling of each word is allowed
    OK
- Only one spelling of each word is allowed. Therefore a list of normalised spellings for words with alternative spellings should be included. Otherwise a statement why such a list is not necessary.
    => We could not find information on this.
- Indication of how many of the files were double checked by the producer together with percentage of detected errors
    => We could not find information on this.
- Other remarks
    => There is no table of contents in DESIGN.DOC
    => We could not find section 7 of DESIGN.DOC
    => The note at the end of section 3.2 is unclear. Is sentence 1 the shortest sentence and sentence 9 the longest?

==========================================================================
2. DATABASE STRUCTURE CONTENTS AND FILE NAMES

- Directory / subdirectory conventions
  Format of directory tree should be \\\\
  . data base: defined as <#> can be FIXED, MOBIL, VERIF <#> is 0 for SpeechDat(M) and 1 for SpeechDat is the ISO two-letter code for the language
  . volume : is a progressive number specifying the CD containing the material. Defined as CD where is the number.
  . block : defined as BLOCK where is a progressive number from 00 to 99. Block numbers are unique over all CDs. They could typically be the first two digits of below.
  . session: defined as SES where is the session code also appearing in file name
    OK
- A README.TXT file should be in the root describing all (documentation) files on the CD-ROM.
    OK
    => The third paragraph in this file lists the CD numbers.
    The second should be CD01 instead of CD00.
- A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. This file supplies the volume label to UNIX systems that cannot read the physical volume label. Example of contents: FIXED0EN_00.
    OK
- A copyright statement should be present in the file COPYRIGH.TXT (root)
    OK
- Documentation should be in \\DOC
    OK
- The summary file (SUMMARY.TXT) should be in \\DOC
    OK
    => Summary files were not made CD-dependent. Every CD contains the same SUMMARY.TXT file covering the complete database.
- The contents list (CONTENTS.LST) is in \\INDEX
    OK
- Tables should be in \\TABLE
    OK
- Index files (optional) should be in \\LST
    Not provided
- Prompt sheet files (optional) should be in \\PROMPT
    Not provided
- Any source code supplied should be in \\SOURCE
    (SAMLIB, V4, and GNU gunzip, version 1.2.4 + licence) OK
- The index files (if presented) obey the nomenclature .LST where e.g. A0ENN3.LST (see below for item_code)
    Not provided
- All sessions indicated in the documentation are present on the CDs
    => The README.TXT states that 550 calls are present on CD00. However, we found only 541, which is a serious error. Furthermore, the missing sessions are nevertheless listed in the CONTENTS.LST table and the SUMMARY.TXT tables.
    These missing sessions are:
       0387 0388 0389 0390 0391 0392 0393 0394 0399
- File naming conventions
  All file names should obey the following pattern: DDNNNNCC.LLF
    DD   : database identification code
           For SpeechDat(M): A0 = fixed net, B0 = mobile
           For SpeechDat   : A1 = fixed net, B1 = mobile, C1 = speaker verification
    NNNN : session code 0000 to 9999
    CC   : item code; first character is item type identifier, second character is item number
    LL   : ISO-639 language code (with extensions)
    F    : speech file type
           Z is for A-law, compressed
           O is for Orthographic label (label file)
    OK
- Correct item codes should be used:
    I1   : isolated digit
    C1   : 4 digit id of prompt sheet
    C2   : ~10 digit telephone number
    C3   : ~12 digit credit card number
    N1-3 : 3 natural numbers
    M1-2 : 2 money amounts
    L1-3 : 3 spelled words
    T1   : 1 time of day
    T2   : 1 time phrase
    D1-3 : 3 dates
    Q1-3 : 3 yes/no questions
    P1   : city of call/birth
    A1-6 : 6 common application words
    E1-3 : 3 application word phrases
    S1-9 : 9 phonetically rich sentences
    OK
- NNNN in filenames is not in conflict with BLOCK and SES numbers in pathname
    OK
- Contents lowest level subdirectories should be of one call only
    OK
- Empty (i.e. zero-length) files are not permitted
    OK, no such files present
- Missing items per speaker
  Check with documentation
    OK, there are 16 calls with missing items, which is about what the documentation states.
- File match: For each label file there must be one speech file and vice versa.
    => For the following 14 label files there was no corresponding speech file:
       A00386A2.PTO A00386A3.PTO A00386A4.PTO A00386A5.PTO A00386A6.PTO
       A00386C1.PTO A00386C2.PTO A00386C3.PTO
       A00386D1.PTO A00386D2.PTO A00386D3.PTO
       A00386E1.PTO A00386E2.PTO A00386A1.PTO
       It can be seen that all missing speech files are in SES0386.
    => For the following speech files there was no corresponding label file:
       A00328D1.PTZ A00008A1.PTZ
    It appears from this account that session 0386 is very incomplete.
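
The file match check above can be illustrated with a minimal sketch. It is not the tool used for this validation; it only assumes that a speech file (extension .PTZ) and its label file (extension .PTO) share the same eight-character base name, as the naming pattern DDNNNNCC.LLF implies. The function name unmatched_files and the short example list are hypothetical.

```python
from pathlib import PurePath


def unmatched_files(filenames, speech_ext=".PTZ", label_ext=".PTO"):
    """Return (labels_without_speech, speech_without_label), both sorted.

    Files are paired on their base name (e.g. A00386A1); the extensions
    are assumptions taken from the Portuguese file names in this report.
    """
    speech, labels = set(), set()
    for name in filenames:
        path = PurePath(name.upper())
        if path.suffix == speech_ext:
            speech.add(path.stem)
        elif path.suffix == label_ext:
            labels.add(path.stem)
    labels_only = sorted(stem + label_ext for stem in labels - speech)
    speech_only = sorted(stem + speech_ext for stem in speech - labels)
    return labels_only, speech_only


if __name__ == "__main__":
    # Illustrative subset only, not the full directory listing.
    example = ["A00386A1.PTO", "A00386S1.PTO", "A00386S1.PTZ", "A00328D1.PTZ"]
    labels_only, speech_only = unmatched_files(example)
    print("label files without speech file:", labels_only)
    print("speech files without label file:", speech_only)
```

Pairing on the base name keeps the check independent of the directory layout, so the same function could be fed a listing of a single session or of a whole CD.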
- Part of the corpus should be designed for training and a (typically smaller) part for testing. This is optional.
    Not provided
- The contents of the database as given in CONTENTS.LST should comprise
  . CD-ROM volume name (VOL:)
  . full pathname (DIR:)
  . speech file name (SRC:)
  . speaker code (SCD:)
  . speaker sex (SEX:)
  . speaker age (AGE:)
  . region of call (REG:)
  . orthographic transcription of uttered item (LBO:)
  This file must be supplied as an ASCII delimited file (either using TAB, or commas and (double) quoted strings).
    OK, speaker code is session number
- The contents of the SUMMARY.TXT files should comprise:
  . The full directory name where speech and label files are to be found
  . the session number
  . a string of typically 39 codes. Each item present is represented by its code. If the item is missing, a '--' should appear.
  . recording date
  . recording time of first item
  . optional comment text
  . all fields are separated by spaces
    OK. As a standard one space is used as separator; sometimes, however, more spaces are used.
    => The date and time fields are missing for SES0366.

======================================================================
3. ITEMS

- 1 isolated digit (code I1)
  . read
    OK
- 3 connected digits (code C1-3)
  - 4 digit number to identify the prompt sheet
    . read
      OK
  - ~10 digit telephone number
    . read or spontaneous(?)
      OK, but they contain 6-7 digits
  - ~12 digit credit card number (16 digits would be better)
    . read
    . if there is a checksum then formula must be provided
      OK, 16 digit numbers were used
  . 26 digits per call are required
    OK
  . at least one example per digit per caller
    => We are not able to calculate this because C1 was the sheet number to be read. This number does not appear as a digit string in the prompt.
  . digits must appear numerically on the sheet, not as words
    OK
- 3 natural numbers (code N1-3)
  . read
  . provided as numbers
  . numbers must be < 1,000,000
  . one may be a decimal number
  . one may be a quantity (including a unit of measurement)
  . sufficient examples of each word to permit training
    OK
    There are no decimals, but that is permitted.
- 2 money amounts (code M1-2)
  . read
  . currency words should be included
  . one small amount including decimals and one large amount not including decimals
    OK
- 3 spelled words (code L1-3)
  . read
  . equal balance of all vocabulary letters
  . average length at least 7 letters
  . may include names, cities and other frequently spelled items
  . should include equivalents of : A-Z, accent words, CAPITAL, SMALL, UPPER-CASE, LOWER-CASE, DOUBLE, APOSTROPHE, HYPHEN
    OK
- 1 time of day (code T1)
  . spontaneous
    OK
- 1 time phrase (code T2)
  . read
  . analogue form
  . equal balance of all words
  . should include equivalents of : AM/PM, HALF/QUARTER PAST/TO, NOON, MIDNIGHT, MORNING, AFTERNOON, EVENING, NIGHT, TODAY, YESTERDAY, TOMORROW
    OK
    But quite a few words appear rather infrequently. The word counts are:
       a          :   13
       amanhã     :    3
       as         :   57
       catorze    :   15
       cinco      :  206
       cinquenta  :  135
       da         :   23
       dez        :  103
       dezanove   :   15
       dezasseis  :   14
       dezassete  :   15
       dezoito    :   13
       dois       :   60
       doze       :   10
       duas       :   57
       e          : 1200
       hoje       :    3
       hora       :    2
       horas      :    9
       manhã      :   12
       meia       :    8
       meia-noite :  126
       meio-dia   :   62
       menos      :  103
       minuto     :    2
       minutos    :   63
       noite      :    4
       nove       :  105
       o          :    3
       oito       :  153
       ontem      :    3
       onze       :   62
       para       :   73
       quarenta   :  122
       quarto     :   57
       quatorze   :    1
       quatro     :  121
       quinze     :   16
       seis       :  124
       sete       :  143
       tarde      :    7
       treze      :   13
       trinta     :  123
       três       :  115
       um         :  119
       uma        :  105
       vinte      :  196
- 1 date (code D1)
  . spontaneous
    OK
- 2 dates (code D2-3)
  . read, wordstyle
  . analogue form
  . covering all weekdays and months
    OK, the weekdays and months are all well covered.
- 3 yes/no questions (code Q1-3)
  . spontaneous, not prompted
  . balance between yes/no
    OK
    Deviant answers were collected in an additional item (Z1)
- city of call/birth (code P1)
  . preferably spontaneous; read is permitted
    OK
    => It is a region name instead of a city
- 6 common application words (code A1-6)
  . set of 50 should be defined
  . 39 are fixed for all partners, see Appendix A Del 1.4-1
  . read
    OK
    These are well distributed. Each word occurs at least 80 times.
- 3 application word phrases (code E1-3)
  . application word is embedded in phrase
  . read or spontaneous
    OK
- 9 phonetically rich sentences (code S1-9)
  . read
  . recommended: at least 2 samples of each phone per caller (should appear from documentation)
    OK
    => There is no information about phone coverage per speaker in the documentation.

All obligatory items were recorded. None is structurally missing. The one additional, optional item is:
    Z1: 'Fuzzy' answer to yes/no question

2. Incidentally missing items

a. files that are not there

Concentrating only on the obligatory items, we found that 7 calls missed 1 item, 6 calls missed 2 items, and 1 call missed 16 (!) items. This very poor call is SES0386. This call also missed 14 speech files (for which only the label files are there, see section 2). As a matter of fact, this call only contains the phonetically rich sentences.

b. files with empty transcriptions in the LBO label field

We found 138 files that did not contain the target speech in their transcription (the LBO field of the label file). Certainly we did not find 200 calls with only-noise items, as is stated in section 2 of DESIGN.DOC. After having excluded the optional item Z1, still 117 files remained. Distributed over calls we found:

    Freq.   Nr of items missing in a call
      81    1
      13    2
       2    3
       1    4

which adds up to 117 mandatory items missing in total.

3. Overall conclusion

SpeechDat has the following criteria for missing items:
- 85% (850) out of 1000 calls must be complete
  . A maximum of 10% (100) of the calls may miss up to 3 mandatory items
  . A maximum of 5% (50) of the calls may miss more items
  (A complete call is one with all speech files recorded for all prompt items)

If we look at the missing files only, then we find 13 calls missing up to three items, and one call missing more. This is well within the SpeechDat limits.
However, not all calls are there. On CD00 only 541 calls are present instead of 550. For this reason we have another 8 calls (or 9, if we consider the 1001 calls claimed) that miss more than three items. But the database still remains within the completeness criteria imposed by SpeechDat if we take these missing calls into account.

If we take into account the files that do not have the intended speech in their transcription as effectively missing, then we end up with a total of 8+106=124 calls that miss up to three items.

There may also be other files that are effectively missing (corrupted speech files). These are dealt with in the next section.

===========================================================================
4. SAMPLED DATA FILES

1 File structure
  . SAM
    OK

2 Coding
  . A-law, 8 bit, 8 kHz
  . Compression by GZIP
    OK

3 Sample distribution

Several sample distributions are checked:

3.1 File length

We calculated the length of the files in seconds in order to trace spurious recordings if files were of extraordinary length.

Duration distribution over all items (including Z1):

    Length (s)    #Occurrences
     0 -  1 :     1804
     1 -  2 :     8173
     2 -  3 :     6339
     3 -  4 :     5952
     4 -  5 :     4127
     5 -  6 :     2827
     6 -  7 :     2242
     7 -  8 :     1815
     8 -  9 :     1346
     9 - 10 :      978
    10 - 11 :      695
    11 - 12 :      512
    12 - 13 :      438
    13 - 14 :      503
    14 - 15 :     1188
    15 - 16 :       65
    16 - 17 :       57
    17 - 18 :       87
    18 - 19 :      122
    19 - 20 :      356

Duration distribution per call:

    Length (s)    #Occurrences
     2 -  3 :       65
     3 -  4 :      446
     4 -  5 :      273
     5 -  6 :       84
     6 -  7 :       29
     7 -  8 :       19
     8 -  9 :        9
     9 - 10 :       15
    10 - 11 :        5
    11 - 12 :        6
    12 - 13 :        2
    13 - 14 :        5
    14 - 15 :        9
    15 - 16 :       25

It was observed that a file with a long duration mostly contained a high level of background noise of some sort (due to which the recording platform did not terminate recording after the speech utterance). Sessions with mean durations over 15 s appeared to be severely distorted by background noises and buzzes.
These sessions were 25 in total:

    Mean dur.(s)   Session
    15.7           SES0048
    15.4           SES0063
    15.6           SES0074
    15.7           SES0136
    15.7           SES0225
    15.7           SES0322
    15.7           SES0462
    15.7           SES0464
    15.7           SES0482
    15.7           SES0485
    15.7           SES0534
    15.7           SES0559
    15.7           SES0659
    15.7           SES0932
    15.7           SES1202
    15.7           SES1238
    15.7           SES1253
    15.1           SES1259
    15.7           SES1290
    15.2           SES1379
    15.4           SES1933
    15.7           SES1954
    15.2           SES1998
    15.7           SES2034
    15.7           SES2097

3.2 min-max samples

We provide a histogram with clipping ratios. The clipping ratio is defined as the number of samples in a file that are equal to the maximum/minimum value, divided by the number of all samples in the file. The histogram, then, is an overview of how many files were found in a set of clipping rate intervals.

Clip distribution for all items (including Z1):

    Clipping       Occurrences
    rate (in %)
    0.0 - 0.1 :    13860
    0.1 - 0.2 :     7595
    0.2 - 0.3 :     4753
    0.3 - 0.4 :     2267
    0.4 - 0.5 :     1555
    0.5 - 0.6 :      909
    0.6 - 0.7 :      428
    0.7 - 0.8 :      210
    0.8 - 0.9 :      119
    0.9 - 1.0 :       57
    1.0 - 1.1 :       19
    1.1 - 1.2 :       10
    1.2 - 1.3 :        3
    1.3 - 1.4 :        3
    1.4 - 1.5 :        1

Number of files with absolute maximum < 32256: 7837

Clip distribution per call:

    Clipping       Occurrences
    rate (in %)
    0.0 - 0.1 :      440
    0.1 - 0.2 :      305
    0.2 - 0.3 :      156
    0.3 - 0.4 :       61
    0.4 - 0.5 :       18
    0.5 - 0.6 :        5
    0.6 - 0.7 :        1

Number of directories with absolute maximum < 32256: 6

=> It can be seen that only a few files and directories were not clipped at some point. As a matter of fact there are only six calls that are not clipped in some way. In general, individual files with a clipping ratio of more than 0.5% are severely clipped. This amounts to a total of 1759 files, which is quite a lot. Moreover, files with lower clipping rates that have a long pause tail may also be heavily clipped in their speech part.
=> As a rule, the calls with a mean clipping rate over 0.4% are severely clipped as a whole. These are 24 in total.
The following directories have a mean clipping rate exceeding 0.4%:

    Clipping       Call
    rate (in %)
    0.43           SES0065
    0.43           SES0110
    0.41           SES0172
    0.59           SES0203
    0.42           SES0315
    0.55           SES0316
    0.42           SES0347
    0.54           SES0353
    0.44           SES0355
    0.44           SES0530
    0.44           SES0548
    0.41           SES0670
    0.42           SES0793
    0.40           SES0896
    0.67           SES1131
    0.48           SES1757
    0.47           SES1788
    0.40           SES1794
    0.40           SES1810
    0.47           SES1811
    0.58           SES1883
    0.48           SES1899
    0.44           SES1928
    0.51           SES2088

3.3 Mean values

We computed the mean sample value of each item in each call. We provide a histogram with mean values below. The histogram, then, is an overview of how many files were found in a set of mean sample value intervals. This overview can be used to trace files with large DC-offsets.

Mean distribution over all items (including Z1):

    Mean               Occurrences
    -2150 - -2125 :        1
     -700 -  -675 :       32
     -675 -  -650 :       85
     -650 -  -625 :        3
     -600 -  -575 :      105
     -575 -  -550 :      143
     -550 -  -525 :       63
     -525 -  -500 :       84
     -500 -  -475 :      143
     -475 -  -450 :      143
     -450 -  -425 :      128
     -425 -  -400 :      116
     -400 -  -375 :      143
     -375 -  -350 :      206
     -350 -  -325 :      157
     -325 -  -300 :      339
     -300 -  -275 :      460
     -275 -  -250 :      388
     -250 -  -225 :      352
     -225 -  -200 :      253
     -200 -  -175 :      221
     -175 -  -150 :      285
     -150 -  -125 :      291
     -125 -  -100 :      350
     -100 -   -75 :      469
      -75 -   -50 :      674
      -50 -   -25 :     1138
      -25 -     0 :     2009
        0 -    25 :     7003
       25 -    50 :    21368
       50 -    75 :     1880
       75 -   100 :      281
      100 -   125 :       73
      125 -   150 :       57
      150 -   175 :       35
      175 -   200 :       50
      200 -   225 :       69
      225 -   250 :        3
      250 -   275 :        3
      275 -   300 :        1
      300 -   325 :        2
      325 -   350 :        4
      350 -   375 :        2
      375 -   400 :        4
      400 -   425 :        1
      425 -   450 :        2
      450 -   475 :        1
      500 -   525 :        4
      600 -   625 :        1
      625 -   650 :        1

Mean distribution per call:

    Mean               Occurrences
     -700 -  -675 :        1
     -675 -  -650 :        2
     -600 -  -575 :        2
     -575 -  -550 :        4
     -550 -  -525 :        2
     -525 -  -500 :        1
     -500 -  -475 :        4
     -475 -  -450 :        4
     -450 -  -425 :        3
     -425 -  -400 :        3
     -400 -  -375 :        3
     -375 -  -350 :        5
     -350 -  -325 :        2
     -325 -  -300 :       10
     -300 -  -275 :        9
     -275 -  -250 :       12
     -250 -  -225 :        6
     -225 -  -200 :        8
     -200 -  -175 :        4
     -175 -  -150 :       12
     -150 -  -125 :        7
     -125 -  -100 :       12
     -100 -   -75 :        8
      -75 -   -50 :       19
      -50 -   -25 :       24
      -25 -     0 :       47
        0 -    25 :      162
       25 -    50 :      584
       50 -    75 :       23
       75 -   100 :        4
      100 -   125 :        2
      175 -   200 :        1
      200 -   225 :        2

The files with a mean value below -500 appeared to contain severe noises or buzzes, and are also found in our statistical analyses of file lengths and SNR. Therefore, they are not further specified here.

3.4 Signal to Noise Ratio (SNR)

We split each signal file into contiguous windows of 10 ms and computed the Mean Square (energy) in each window. The mean sample value over the complete file was subtracted from each individual sample value before MS was computed. The 5% of the windows that contained the lowest energy were assumed to contain line noise. In this way the signal to noise ratio could be calculated for each file by dividing the mean energy over all windows by the mean energy of the 5% sample mentioned above. The ratio was converted to dB by taking 10*log10 of it.

SNR distribution over all items (including Z1):

    SNR (dB)      Occurrences
     0 -  5 :        94
     5 - 10 :       160
    10 - 15 :       294
    15 - 20 :       980
    20 - 25 :      2654
    25 - 30 :      5954
    30 - 35 :      9561
    35 - 40 :     12792
    40 - 45 :      6477
    45 - 50 :       549
    50 - 55 :        84
    55 - 60 :        14
    60 - 65 :         9
    65 - 70 :         1
    70 - 75 :         1
    80 - 85 :         2

SNR distribution over calls:

    SNR (dB)      Occurrences
     0 -  5 :         2
     5 - 10 :         3
    10 - 15 :         4
    15 - 20 :        19
    20 - 25 :        64
    25 - 30 :       156
    30 - 35 :       241
    35 - 40 :       363
    40 - 45 :       131
    45 - 50 :         7
    50 - 55 :         2

=> By looking at and listening to a subset of the files in sessions with a mean SNR below 10 dB, we concluded that these are all severely distorted due to background noise / buzzes. Directories with a mean SNR between 10 and 20 dB are potentially distorted by severe noise buzzes. The calls with mean SNR values below 10 dB were:

    Session    SNR(dB)
    SES0322    4.5
    SES1238    5.0
    SES1253    6.0
    SES1290    6.5
    SES2034    7.0

All these calls were also observed in our analyses on extreme file lengths (see subsection 3.1).

===========================================================================
5. ANNOTATION FILE

- File empty?
    OK, no empty files
- all mnemonics should be SAM mnemonics or explicitly defined in documentation
    OK
- Mandatory (SAM) mnemonics:
    LHD: V5.0
    DBN: SPEECHDAT(M)_
    VOL: FIXED0_
    SES:
    DIR:
    SRC:
    CCD:
    RED:
    RET:
    SAM: 8000
    BEG:
    END:
    SNB: 1
    SBF:
    SSB: 8
    QNT: A-LAW
    CMP: GZIP, 1.2.4
    SCD:
    SEX: male/female/unknown (one letter)
         ! SEX and AGE may also only appear in the speaker table
         ! if SCD is provided in the label file
    AGE:                    ! mnemo is not SAM
    REG:
    LBD:
    LBR: , , [gain], [minimum value], [maximum value],
    LBO: , , ,
    EXT: [if needed for LBR and LBO, > 80 char]
    ELF:
  . LHD and TYP are first
  . LBR and LBO come after LBD
  . ELF is end of file keyword
  . no line may exceed 80 chars
    => We found 6 files that missed one or more of the obligatory mnemonics:
       . The files A01110N1.PTO, A00197S3.PTO and A00197S7.PTO have the string EXT: LBO: Here a carriage return was omitted.
       . A01013P1.PTO and A00500E1.PTO exist but are empty.
       . A00386E2.PTO misses LBD, LBR, LBO and ELF. (The corresponding speech file does not exist.)
    => There are lots of label files with line lengths of more than 80 characters:

       Number of files   Line length (chars)
       4217               81
       2998               82
       2535               83
       2248               84
       2003               85
       1418               86
       1158               87
        651               88
        502               89
        311               90
        220               91
        130               92
         73               93
         26               94
          6               95
         14               96
         11               97
          2               98
          2               99
          1              100

- Optional (SAM) mnemonics (may be omitted or left empty)
    TYP: orthographic
    TXF:
    CMT:
    NCH: 1
    ARC:                    ! mnemo is not SAM
    SHT:                    ! mnemo is not SAM
    EXP:
    SYS:
    DAT:
    SPA:
    PHM:                    ! mnemo is not SAM
    ACC:                    ! mnemo is not SAM
    NET: fixed/gsm ...      ! mnemo is not SAM
    EDU:                    ! mnemo is not SAM
    SOC:                    ! mnemo is not SAM
    REP:
    PCF:
    RCC:
    ENV:
    ASS:                    ! mnemo is not SAM
    The optional mnemonics used are ACC, NET and PHM.
- No illegal mnemonics used
    => There is one file with an illegal mnemonic:

       Freq.   mnemonic
       1       m acesso às emissões de televisão.

       This 'mnemonic' is found in file A00197S6.PTO. The cause is a carriage return at an improper place.
- All files must contain the same mnemonics. This holds as well for the optional mnemonics.
    => Not all label files have the same mnemonics. The mnemonics ACC, NET and PHM are only in a subset of the label files.

       Freq.   mnemonic
       39638   ACC
         325   NET
        3147   PHM

       It appears that PHM was used in the label file whenever the caller stated that he or she was speaking through a cordless phone. It appears that NET was used in the label file whenever the caller stated that he or she was using a GSM (mobile) phone. Accordingly, the following sessions were recorded over the mobile net:
          SES0490 SES0546 SES0736 SES0810 SES0861 SES1877 SES2015 SES2133
       and the following sessions probably as well:
          SES0197 SES1110
    => Thus we find that 8 and possibly 10 calls were collected over the mobile net, which is in conflict with the SpeechDat(M) intention to collect fixed network calls only.
- Field values should be in the correct format
    => In all RET mnemonics the time is given in hours and minutes. The seconds were left out.
    => For most annotation files in SES0366 the fields after RED and RET are empty.
    => Further, the prompt was forgotten after the LBR mnemonic in file A00031C1.PTO.
- Each lowest subdirectory does not refer to multiple sheet ids.
    => In seven sessions there was no sheet id for the mnemonic SHT, i.e. the field was empty. This was the case for:
          SES1192 SES1234 SES0228 SES0230 SES0232 SES0260 SES0865
       For all of these sessions there was only a sheet number for item C1 (and in SES0232 also for E2); the sheet id fields for the other items are empty. There are no indications that more calls were merged in these sessions.
- For spontaneous speech LBR should be left blank or contain a mnemonic word (like ).
    => Instead of this, the explicit question is put in the LBR field of the spontaneous items (Z1, Q1-3, D1, T1, P1).
- Obligatory and optional label mnemonics not provided in the label files should be provided in the file `CONTENTS.LST' from which this information can be derived (and added to the label file by the validating institute, if necessary).
    OK
- Transliterations only in lower case letters, also at sentence beginning
  Only exception: proper names and spelled words, ZIP codes, acronyms and abbreviations
  In the latter case blanks should be used in between the letters.
  German is the only exception to this convention.
    All words (except letter spellings) are in lower case.
    => Names are not capitalised either.
- Punctuation marks should not be used in the transliterations
    OK
- Digits must appear in full orthographic form
    OK
- In principle only the following symbols are allowed to indicate non-speech acoustic events:
    [Filled_Pause] [Speaker_Other] [Nonspeaker_Other]
  Other symbols (and language equivalents) must be mentioned in the documentation
    The following symbols were observed to denote non-speech acoustic events:
       [ah]
       [cliques]
       [eh]
       [hesitação]
       [hum]
       [nada]
       [no_speech]
       [oh]
       [pausa]
       [pigarreio]
       [respiração]
       [riso]
       [ruído bocal]
       [ruído de linha]
       [ruídos de fundo]
       [ruídos de linha]
       [ruídos do orador]
       [rádio]
       [sopro]
       [telefone]
       [tosse]
       [vozes]
       [ãh]
    => All markers except [no_speech] are described in section 6 of the documentation. It appeared that markers consisting of several words ([ruído etc]) contain a rather arbitrary number of blanks between the words.
    => In the LBO field of file A00197A5.PTO the opening square bracket [ is missing in [ruído bocal].
- Asterisks should be used to indicate mispronunciations
    OK
- Tildes should be used to indicate truncations
    OK
- According to a spelling check on annotated text (including bracket check) up to 1% errors may be found
    => Not done
- The label files are associated with the correct speech files.
  (This cannot be done automatically at this moment.
We can only point at files that are incidentally found as mismatched during the transcription and/or speech file validation) Not checked - Assessment of speech items in terms of SNR, presence of additional noise, adherence to prompting text is provided (optional) Not present ======================================================================== 6. LEXICON - Check lexicon existence OK - Lexicon contents should be taken from actual utterances (from LBO) OK - The entries should be alphabetically ordered OK - In transcriptions only SAMPA symbols are allowed OK The symbols j, j~, l~, w and w~ are not standard SAMPA, but they are explained in section 8 of the documentation. - Capitals only in proper names,spelled words, and in single letters derived from abbreviations (exception: German) Capitals were only used for spelled letters. => Names are not capitalised. - Phoneme symbols must be separated by blanks OK - A line in the lexicon should have the following format [ ] [] OK, there is no frequency information. - Alternative transcriptions are optional. They may follow the first transcription, separated by [TAB] or have a separate entry (only in case also frequency information is supplied) Not used - Orthographic entries are as a rule splitted by apostrophes, but not by dashes. OK - The lexicon should be complete . Check for undercompleteness (are all words in lexicon) Nearly every word present in the transcription field of the label files can also be found in the lexicon. We counted only 6 missing entries: autárquia bocal] dezassesis femenino países- ço 'bocal]' origins from the one transcription with unbalanced square brackets. All these words occurred only once. We assume that they have been misspelled in the transcription and were therefore excluded from the lexicon. => Obviously, the right way to go would have been to correct the transcriptions. . 
Check for overcompleteness (invalid words have a * and should not be in lexicon) (the same goes for words truncated due to a recording error; this is indicated by ~) The following 13 words were found in the lexicon, but in none of the transcriptions: anúncio cais cas emissões formações intrução março obvio osteiro rechalar* relutados suismo virificou-se Overcompleteness is not much of a problem from a practical point of view. The entry rechalar* should not have been included anyhow, since words with an asterisk (due to mispronunciation) should not be included in the lexicon according to the SpeechDat conventions. - Optional information: stress, word/morphological/syllabic boundaries. But, if provided, then it should follow the SpeechDat conventions. Not provided. ========================================================================== 7. SPEAKERS - Speaker database file . check existence OK - Allowed formats: a. SAM mnemonics b. record file with commas as field separators and strings between double quotes OK, option b. was chosen. - Obligatory information: SAM: 1. unique number (speaker/caller) SCD (or less preferably SES) 2. sex SEX 3. age AGE 4. region of call REG OK, SES was used as key field. - Optional information: . height HET . weight WET . native language NLN . accent ACC . ethnic group ETH . education level EDL . smoking habits SMK . pathologies PTH . socio-economic status SOC Not provided - Balance of sexes . How many males, how many females, should match specification in documentation file . Disbalance may not exceed 5% The balance of sexes was OK. The following distribution was observed: F: 541 = 54.54 % M: 451 = 45.46 % - Balance of regions . 
which regions and how many of each should match specification in documentation file Acores: 28 = 2.82 % Africa: 32 = 3.23 % Alentejo: 100 = 10.08 % Algarve: 13 = 1.31 % Australia: 1 = 0.10 % Beira-Alta: 38 = 3.83 % Beira-Baixa: 56 = 5.65 % Beira-Litoral: 91 = 9.17 % Brasil: 1 = 0.10 % Entre-Douro-e-Minho: 272 = 27.42 % Estremadura: 298 = 30.04 % Franca: 5 = 0.50 % India: 1 = 0.10 % Luxemburgo: 1 = 0.10 % Macau: 1 = 0.10 % Madeira: 8 = 0.81 % Ribatejo: 19 = 1.92 % Transmontano: 25 = 2.52 % There some small deviances from the documentation. These are caused by the 9 missing calls on the first CD. - Balance of ages . which age groups and how many of each should match specification in documentation file . A minimum of 20% of speakers must be in following age groups: 17-30, 31-45, 46-60. A maximum of 40% speakers may be younger than 17 or older than 60. OK, the following distribution was observed: under 17: 16 = 1.61 % 17 - 30 : 339 = 34.17 % 31 - 45 : 433 = 43.65 % 46 - 60 : 196 = 19.76 % over 60 : 8 = 0.81 % This is also in line with the documentation (if we round the percentages). => There is one inconsistency between the speaker table and the label files. For the speaker in SES1983 the age is 24 according to the speaker table, and 23 according to the label file. ======================================================================= 8. RECORDING CONDITIONS - Digital telephone line => There is no information on this in the documentation. - A-law coding OK - Specification of wireless telephone or not (optional) Was provided by using, in the label files, the mnemonic PHM for calls with cordless handsets. - Time stamps on file OK - Recording information may be stored in a separate file (optional) - this file may have two formats: a. SAM mnemonics b. record table with commas as field separators and strings between double quotes - The primary key in the label file is the RCC mnemonic - name of file: ~\TABLE\REC_COND.SAM or ~\TABLE\REC_COND.TBL - Information: SAM: . 
recording conditions code                RCC
    . region of call                     REG
    . telephone area code                ARC
    . environment                        ENV
    . telephone model                    PHM
    . telephone network                  NET
    . recording city                     CTY
    . recording car                      CAR
    . speed                              SPD
    . fan noise                          FAN
    . ground type                        GRD
    . wipes                              WIP
  There is no such table.

=============================================================================

9. TRANSCRIPTION

This validation is carried out by taking 5% of the short items and 5% of
the long items in the corpus. The transcriptions in the label files for
these samples are checked by listening to the corresponding speech files.
This check is performed by native speakers of the language involved.

Short items are:
- isolated digit
- time phrases
- date phrases
- yes/no questions
- place name
- application words

Long items are:
- connected digits
- natural numbers
- money amounts
- spelled words
- application phrases
- phonetically rich sentences

- The evaluation comprises the following criteria
  . did the speaker actually speak the transliterated words
  . did the speaker speak the prompted text
  . is the transliteration of non-speech acoustic events correct
  . speech quality, line quality
  . up to 5% transcription errors are allowed

A random selection of 1160 long items and 795 short items was used for
transcription validation.

A. Long items
In 86 of the 1160 checked items a correction was considered necessary.
By far most corrections (58) were related to the transcription of
non-speech acoustic events. There were 28 corrections in the
transcription itself. We did not observe errors of another type.
A total of 86 errors on a total of 1160 checked items yields an error
rate of 7.41%. Serious errors concerning the transcription itself were
observed in 28 cases, yielding an error rate of 2.41%, which is well
below the 5% criterion.

B. Short items
In 54 of the 795 checked items a correction was considered necessary.
Most corrections (31) were related to the transcription of non-speech
acoustic events.
There were 23 corrections in the transcription itself. We did not observe
errors of another type. This yields a total of 23 serious errors on 795
items. This is an error rate of 2.89%, which is well below the criterion
value of 5%.

============================================================================

10. SUMMARY

Below we give a brief overview of our findings with respect to the
Portuguese database. The subsections follow the order of the various
topics in the previous sections of the report.

In general the Portuguese database complies well with the SpeechDat(M)
format specifications as formulated in Deliverable 1.4.1 of the
SpeechDat(M) project. The documentation is sufficiently complete and
correct. The database contains 9 calls too few on the first CD. Still,
the database is sufficiently complete with respect to the SpeechDat
criteria. At least 8 of the calls were not collected over the fixed
network but over the mobile telephone network (GSM). The lexicon and the
speaker table are well formatted and fairly complete. The transcription
of the utterances is of good quality.

A more detailed summary of our findings follows below.

1. Documentation
The documentation in FIXED0PT\DOC\DESIGN.DOC is OK. It contains most of
the essential information. We miss information about which items are on
the third CD. More information about the Portuguese telephone network in
section 5 of the documentation would be helpful. There is no information
on spelling alternatives in Portuguese. There is no table of contents.

2. Database structure and file names
The SpeechDat conventions were followed nicely in the directory names and
the file names of the database. For 14 label files there was no
corresponding speech file. All these files are in SES0386. For two speech
files there was no corresponding label file.

3. Items
There are 9 calls missing on the first CD.
All obligatory items were recorded. None is structurally missing.
The one additional, optional item is:
  Z1: 'Fuzzy' answer to yes/no question
Concentrating only on the obligatory items, we found that 7 calls missed
1 item, 6 calls missed 2 items, and 1 call missed 16 (!) items. This very
poor call is SES0386. It misses all speech files except the phonetically
rich sentences. Thus, we find 13 calls missing up to three items, and one
call missing more. This is well within the SpeechDat limits.
However, as mentioned, not all calls are there. On the first CD only 541
calls are present instead of 550. For this reason we have another 9 calls
that miss more than three items. But the database still remains within
the completeness criteria imposed by SpeechDat if we take these missing
calls into account.
If we count the files that do not have the intended speech in their
transcription as effectively missing, then we end up with a total of
8+106=124 calls that miss up to three items.

4. Sampled data files
We observed 25 calls with a mean duration over 15 s (per file). These
were all very noisy.
Clipping is a considerable problem in this database. The clipping rates
that we observed are relatively high. A lot of files are clipped at some
locations. At least 1750 files are severely clipped, but possibly a lot
more. At least 24 calls as a whole suffer from severe clipping.
There are 5 calls with a very low mean SNR (below 10 dB). All these calls
appeared to be extremely noisy and were also observed among the 25 calls
with very large mean durations.

5. Label files
All mnemonics used are legal, with one exception. The single illegal
'mnemonic' is: M ACESSO ÀS EMISSÕES DE TELEVISÃO. It is found in file
A00197S6.PTO. The cause is a carriage return at an improper place.
There are very many label files with line lengths of more than 80
characters.
We found 6 files that missed one or more of the obligatory mnemonics.
The mnemonics ACC, NET and PHM are only in a subset of the label files.
From the use of the mnemonic NET it appears that 8 and possibly 10 calls
were collected over the mobile net, which is in conflict with the
SpeechDat(M) intention to collect fixed network calls only.
In all RET mnemonics the time is given in hours and minutes. The seconds
were omitted.
In seven sessions there was no sheet id for the mnemonic SHT, i.e. the
field was empty in most of the files.
Names are not capitalised in the transcriptions.

6. Lexicon
The lexicon was well in agreement with the SpeechDat specifications.
There were only 6 entries missing in the lexicon. Probably all these
words are misspelled, since they were all observed only once in the label
files.
As in the transcriptions, the names are not capitalised.

7. Speakers
There is a speaker table which is well formatted.
The balance of sexes is OK.
The distribution of the speakers over the regions is reasonably well in
accordance with the documentation. There are some small deviations from
the documentation. These are caused by the 9 missing calls on the first
CD.
The distribution of the speakers' ages is OK.

8. Recording platform
The recording platform is OK. However, it is not clear how much of the
transmission of the calls is done over digital lines.

9. Transcription
A random selection of 1160 long items and 795 short items was used for
transcription validation.

A. Long items
In 86 of the 1160 checked items a correction was considered necessary.
By far most corrections (58) were related to the transcription of
non-speech acoustic events. There were 28 corrections in the
transcription itself. We did not observe errors of another type.
A total of 86 errors on a total of 1160 checked items yields an error
rate of 7.41%. Serious errors concerning the transcription itself were
observed in 28 cases, yielding an error rate of 2.41%, which is below the
5% criterion.

B. Short items
In 54 of the 795 checked items a correction was considered necessary.
Most corrections (31) were related to the transcription of non-speech
acoustic events. There were 23 corrections in the transcription itself.
We did not observe errors of another type. This yields a total of 23
serious errors on 795 items. This is an error rate of 2.89%, which is
below the criterion value of 5%.

=========================================================================
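
The transcription error rates quoted in section 9 follow from dividing the
number of corrections by the number of checked items. The lines below are a
minimal Python sketch of that arithmetic, not part of the validation
tooling: the names error_rate, passes_criterion, COUNTS and CRITERION_PCT
are hypothetical, and the sketch assumes, as in the text above, that only
corrections to the transcription itself count against the 5% criterion.

```python
# Illustrative only: names and structure are assumptions, the counts are
# the ones reported in section 9 of this report.

CRITERION_PCT = 5.0  # maximum percentage of transcription errors allowed


def error_rate(errors: int, checked: int) -> float:
    """Return the error rate as a percentage of the checked items."""
    if checked <= 0:
        raise ValueError("number of checked items must be positive")
    return 100.0 * errors / checked


def passes_criterion(errors: int, checked: int,
                     limit: float = CRITERION_PCT) -> bool:
    """True if the error rate does not exceed the allowed percentage."""
    return error_rate(errors, checked) <= limit


# Corrections per item class: in the transcription itself, and in the
# transcription of non-speech acoustic events.
COUNTS = {
    "long": {"checked": 1160, "transcription": 28, "non_speech": 58},
    "short": {"checked": 795, "transcription": 23, "non_speech": 31},
}

if __name__ == "__main__":
    for name, c in COUNTS.items():
        all_corrections = c["transcription"] + c["non_speech"]
        print(f"{name} items: "
              f"all corrections {error_rate(all_corrections, c['checked']):.2f}%, "
              f"transcription only {error_rate(c['transcription'], c['checked']):.2f}%, "
              f"within criterion: {passes_criterion(c['transcription'], c['checked'])}")
```

Keeping the two kinds of correction apart makes it explicit which figure is
compared with the criterion value.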