THE SPEECHDAT (M) FRENCH TEXT MATERIAL

1. INTRODUCTION

This paper reports the construction of the word and sentence material for the response sheets used in collecting the SpeechDat (M) corpus for the French language.

2. THE MATERIAL

The material consists of both prompted and free speech. The advantage of prompted speech is the wide vocabulary coverage. Free speech is more natural, but it takes more items to obtain a good vocabulary coverage. Since speech recognition in voice response applications is one of the major driving forces behind the project, a relatively large part of the items are numbers, times or dates.

The speech material to be collected should be appropriate to train isolated and/or connected digit recognizers, as well as sub-word based recognizers.

Three types of responses are prompted:

- 36 read items
- 8 spontaneous answers to printed questions
- 2 spontaneous answers to non-printed questions

The following items were chosen:

Items to be read:

- 1 isolated digit
- 3 connected digits
- 1 money amount
- 2 times (analog and digital)
- 2 dates (analog and digital)
- 3 words
- 3 spelled words (the same words as the 3 spoken as normal words)
- 9 application words
- 3 application words in a carrier sentence
- 9 phonetically rich sentences

The following list of printed questions is asked:

- 4 yes/no questions
"Are you ready to start?"
"Do you use a cordless phone?"
"Have you lived in another country for a long period of time?"
"Is French your native language?"

- 4 open questions
"Please, give a number between a thousand and a million."
"What is your date of birth?"
"In which department did you go to school for the first time?"
"Please, give your comments on this recording."

The following unprinted questions were prompted (but the response form advised the subject that a question would be asked at this point in the recording session):

- "Please, give a time between noon and midnight."
- "Please, give an amount of francs between a thousand and a million."

The list of items does not follow the SpeechDat(M) specifications exactly: there is only one natural number instead of three, and the following items are extra:

- 3 extra application words,
- 1 extra yes/no question,
- 1 comments item,
- 1 extra time item,
- 3 words that are spelled are also read.

3. PHONETICALLY RICH SENTENCES

The SpeechDat (M) corpus for the French language contains nine phonetically rich sentences per session. These sentences were taken from a large text corpus that contained many issues of the newspaper Le Monde. The IDIAP grapheme-phoneme converter was used to create the most probable transcription of each sentence.

The last stage was to create 1,000 sets of nine sentences, each sentence appearing in only one set. The sets were formed in such a way that each one contained at least two tokens of all phonemes in the French language (except for some very infrequent phonemes). The list of phonemes (in SAMPA notation) finally used for coverage purposes follows:

Phoneme Word  Transcription

p      pont       p o~
b      bon        b o~
t      temps      t a~
d      dans       d a~
k      quand      k a~
g      gant       g a~
f      femme      f a m
v      vent       v a~
s      sans       s a~ 
z      zone       z o n
S      champ      S a~
Z      gens       Z a~ 
j      ion        j o~ (can be realized as a fricative or as an approximant)
m      mont       m o~
n      nom        n o~
J      oignon     o J o~
N      camping    k a~ p i N (only found in loan words)
l      long       l o~
R      rond       R o~  
w      coin       k w e~
H      juin       Z H e~
i      si         s i
e      ses        s e
E      seize      s E z
a      patte      p a t
A      pâte       p A t 
O      comme      k O m 
o      gros       g R o
u      doux       d u
y      du         d y
2      deux       d 2
9      neuf       n 9 f
@      justement  Z y s t @ m a~ 
e~     vin        v e~
a~     vent       v a~ 
o~     bon        b o~  
9~     brun       b R 9~ 

For more detailed information, see the website http://www.phon.ucl.ac.uk/home/sampa/french.html
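
The coverage criterion for a set of nine sentences can be checked mechanically. The following Python sketch counts phoneme tokens in space-separated SAMPA transcriptions and reports the phonemes that occur fewer than two times. The function name, the `exempt` parameter and the example input are illustrative assumptions; the tool actually used to build the sets is not described in this document.

```python
from collections import Counter

# Phoneme inventory copied from the table above (SAMPA notation).
PHONEMES = [
    "p", "b", "t", "d", "k", "g", "f", "v", "s", "z", "S", "Z", "j",
    "m", "n", "J", "N", "l", "R", "w", "H",
    "i", "e", "E", "a", "A", "O", "o", "u", "y", "2", "9", "@",
    "e~", "a~", "o~", "9~",
]


def missing_phonemes(transcriptions, min_tokens=2, exempt=()):
    """Return the phonemes that occur fewer than min_tokens times.

    transcriptions: iterable of strings of space-separated SAMPA symbols.
    exempt: phonemes left out of the check (hypothetical parameter standing
            in for the "very infrequent phonemes" mentioned in the text).
    """
    counts = Counter()
    for transcription in transcriptions:
        counts.update(transcription.split())
    return [p for p in PHONEMES if p not in exempt and counts[p] < min_tokens]


# Toy input: three short transcriptions, far from a full sentence set.
example_set = ["p o~", "b o~", "p a t"]
print(missing_phonemes(example_set))
```

A set satisfies the criterion when the returned list is empty.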

4. DETAILED ANALYSIS OF THE TEXT MATERIAL

4.1. One Isolated Digit

This item is READ.

Digit vocabulary: "un, deux, trois, quatre, cinq, six, sept, huit, neuf, zéro" (one ... nine, zero)

The ten digits appeared with equal frequency on the response sheets.

4.2. Three Connected Digits

Applications of digit string recognition are, among others, telephone number and credit card number recognition. Thus, the prompts aimed to elicit formulations typical for these types of applications. All connected digits are READ.

- One 6-digit string: a number expressed as 6 single digits (READ)
- One 8-digit string, introduced as a telephone number. French telephone numbers are formulated as 4 2-digit strings; printed in 4 blocks of 2 numbers each (READ)
- One 16-digit string, introduced as a credit card number; printed in 4 blocks of 4 numbers each (READ)

Vocabulary:
- digits: "un, deux, ..., zéro"
- 'teens': "onze, douze, treize, quatorze, quinze, seize, dix-sept, dix-huit, dix-neuf" (eleven ... nineteen)
- 'tys': "dix, vingt, trente, quarante, cinquante, soixante, soixante-dix, quatre-vingt, quatre-vingt-dix" (ten, twenty ... ninety)


For telephone and credit card numbers the syntax of the strings obeyed the relevant rules (e.g., subscriber number cannot start with a 0). For the rest care was taken to obtain equal frequency for all vocabulary items.
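
To illustrate how the 'teens' and 'tys' vocabulary arises from a telephone number printed in blocks of two, the following Python sketch spells out each block in French. It assumes traditional spelling (e.g. "soixante et onze", "quatre-vingts") and assumes that a block with a leading zero is read digit by digit; actual callers may well have chosen other formulations. The function names are illustrative.

```python
UNITS = ["zéro", "un", "deux", "trois", "quatre", "cinq", "six", "sept",
         "huit", "neuf", "dix", "onze", "douze", "treize", "quatorze",
         "quinze", "seize", "dix-sept", "dix-huit", "dix-neuf"]
TENS = {2: "vingt", 3: "trente", 4: "quarante", 5: "cinquante", 6: "soixante"}


def two_digit_fr(n):
    """Spell a number from 0 to 99 in French (traditional spelling)."""
    if not 0 <= n <= 99:
        raise ValueError("expected a number between 0 and 99")
    if n < 20:
        return UNITS[n]
    if n < 70:
        tens, unit = divmod(n, 10)
        if unit == 0:
            return TENS[tens]
        if unit == 1:
            return TENS[tens] + " et un"
        return TENS[tens] + "-" + UNITS[unit]
    if n < 80:
        # 70-79 are built on "soixante" plus 10-19.
        return "soixante et onze" if n == 71 else "soixante-" + UNITS[n - 60]
    if n == 80:
        return "quatre-vingts"
    # 81-99 are built on "quatre-vingt" plus 1-19.
    return "quatre-vingt-" + UNITS[n - 80]


def read_phone_number(printed):
    """Spell a number printed as blocks of two digits, e.g. '78 02 36 88'."""
    words = []
    for block in printed.split():
        if len(block) != 2 or not block.isdigit():
            raise ValueError("expected blocks of two digits")
        if block[0] == "0":
            # Assumption: a leading zero is read as a separate digit.
            words.append("zéro " + UNITS[int(block[1])])
        else:
            words.append(two_digit_fr(int(block)))
    return " ".join(words)


print(read_phone_number("78 02 36 88"))
```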

4.3. One Natural Number

These data make it possible to construct speech references for any non-negative number below one million. This item is SPONTANEOUS, i.e., the speakers were prompted to say any number between 1,000 and 1,000,000: "S'il vous plaît, dites un nombre quelconque entre mille et un million" (Please, say any number between one thousand and one million).

Vocabulary:
- digits
- 'teens'
- 'tys'
- "cent(s), mille, virgule" (hundred(s), thousand, comma)

Since this item is spontaneous, there is no a priori attempt to obtain uniform distributions of the vocabulary items.

4.4. Two Money Amounts

The prompts aimed to elicit typical phrases used with money amounts. Each prompt sheet contained one small money amount (less than 100 FF), including centimes (however, only x0, x5 and 99 were used, with equal frequency [x = 1, ..., 9]), and one large amount. The small money amount was READ. The large money amount was SPONTANEOUS, prompted by the surprise question "Please, give an amount of francs between a thousand and a million."

Vocabulary:
- digits, 'teens', 'tys'
- "cent(s), mille"
- "franc(s), centime(s)"

4.5. Two Times of Day

There are two methods to report a time: 'digital', e.g. 23:49, or 'analog', e.g. "sept heures et demie" (half past seven).

One time was digital, with a preference for evening times. On the session sheet it was represented as e.g. 20:15. Only times of the form xx:x0 and xx:x5 were prompted. The other time was printed explicitly as an analog time, including the words "aujourd'hui" (today) and "demain" (tomorrow). Both items are READ.

Vocabulary:
- aujourd'hui, demain
- digits 0-25, 30, 35, 40, 45, 50, 55.
- à, matin, après-midi, heure(s), minute(s), moins, quart, trois (quarts), et, le, demi, midi, minuit, soir, un(e), ce(t, tte), nuit

4.6. Three Dates

Again there are two ways to specify a date: 'digital', e.g. 27/12/96, or 'analog', e.g. Vendredi, premier Mai 1996 (Friday, the first of May 1996).

- One READ digital date, presented as e.g. 27/12/1995
- One READ analog date, including weekday
- One question concerning a date, resulting in a SPONTANEOUS answer: "Quelle est votre date de naissance?" (What is your date of birth?)

All READ dates were in the interval between 1996 and 2000.

Vocabulary:
- weekdays: lundi, mardi, mercredi, jeudi, vendredi, samedi, dimanche
- months: janvier, février, mars, avril, mai, juin, juillet, août, septembre, octobre, novembre, décembre
- ordinal: premier
- cardinal numbers: digits 2-31

4.7. Four Yes/no questions

Four questions that should result in SPONTANEOUS yes/no responses were asked.

The following questions were printed on the sheet:
"Etes-vous prêt à commencer?" Confirmation expected (Are you ready to start?)
"Utilisez-vous un téléphone sans fil?" More negation than confirmation expected (Do you use a cordless phone?)
"Avez-vous déjà vécu à l'étranger pendant un long laps de temps?" Negation expected (Have you lived abroad for a long period of time?)
"Votre langue maternelle est-elle le français?" Confirmation expected (Is French your native language?)

4.8. Three Spelled words

The words to be spelled were chosen to aim at a uniform distribution of all vocabulary letters. The spellings are READ. The words came from a 500-word inventory extracted from the phonetically rich material. The number of letters per word ranges from 5 to 11.

Vocabulary:
- A-Z
- apostrophe, accent-aigu, accent-grave, accent-circonflexe, cédille, tréma, tiret, trait d'union, lettre capitale, majuscule, minuscule, deux (e.g. lettre: L E deux T R E)
- lexicon

The callers were first asked to read the words and then to spell them.

4.9. 9 Application words

Each subject read 9 application words. In addition three sentences were included to obtain READ utterances with embedded keywords. These can be used for word-spotter training. There were five different carrier phrases per keyword.

The following table lists the application words:

activer
aide
annonce
annuaire
annuler
appeler
arrêter
astérisque
autres
canal
changer
chaîne
composer
conférence
connecter
continuer
date
dièse
désactiver
effacer
en arrière
encore une fois
enregistrer
externe
fin
information
interne
jouer
lecture
menu
messages
modifier
mémoire
nom
nouveau
nouvelle
numéro
opérateur
pause
programmer
précédant
rappeler
rembobiner
retour
répondeur
répéter
station
stop
suite
suivant
transférer
téléphone
verrouiller
écouter


The 3 sentences have been READ.

4.10. Other

The following question was asked to obtain an indication of the regional/dialectal background of the caller.

"Dans quel département êtes-vous allé à l'école la première fois?" (In which department did you go to school for the first time?)

Finally the caller was asked for comments. "S'il vous plaît, faites-nous part de vos commentaires, remarques ou impressions." (Please, give your comments, remarks or impressions).



APPENDIX

APPENDIX 1. DATABASES


This section lists the databases and sections of those databases which have been used to generate the 40 items of the single session sheet templates.

1.1. Isolated digits


0 - 9

1.2. Connected digits: 6-, 8-, and 16-digit strings

6-digit strings
7 8 4 5 3 2
9 4 7 6 2 7
1 7 3 6 3 6
6 0 9 3 1 6
...

8-digit strings
64 64 73 16
76 71 38 80
78 02 36 88
82 29 51 44
15 19 32 30
...

16-digit strings
7578 9920 9568 4104
3187 2752 7569 3424
2429 9496 5895 4220
9560 2496 3191 3314
4418 7612 9150 1984
...

1.3. Money amounts

0 F
0 F 05
0 F 10
0 F 15
0 F 20
0 F 25
...
99 F 85
99 F 90
99 F 95
99 F 99
100 F
101 F
102 F
...
9996 F
9997 F
9998 F
9999 F

1.4. Times of day: digital

13 heures
13 heures 05
13 heures 10
13 heures 15
13 heures 20
...
23 heures 50
23 heures 55
0 heure
0 heure 05
1 heure 50
1 heure 55
2 heures

1.5. Times of day: analog

demain matin
demain après-midi
demain à minuit
demain à minuit et quart
demain midi
demain soir
ce midi
ce minuit
cet soir
cet après-midi
ce soir à minuit
cette nuit à minuit et un quart
minuit une
minuit deux
demain, 7 heures (et) 38 (minutes)
aujourd'hui, 23 heures moins 59

1.6. Dates: digital

1/1/1996
...
31/1/1996
...
25/12/1999
31/12/1999

1.7. Dates: analog

lundi, le premier janvier 1996
lundi, 2 janvier 1996
dimanche, le 31 décembre 1999
...

1.8. Spelled words

É V I N C É
É V O L U T I O N
É X O N É R É
Â G É E S
Ê T R E S
A É R E R
A É R I E N
A É R O P O R T
Q U A S I M E N T
Q U A T R I É M E
Q U E R E L L E
Z O N A G E
Z O O M S
B U R E A U
C É L É B R E
C É R A M I Q U E
...

1.9. Application words (complete list)

activer
aide
annonce
annuaire
annuler
appeler
arrêter
astérisque
autres
canal
changer
chaîne
composer
conférence
connecter
continuer
date
dièse
désactiver
effacer
en arrière
encore une fois
enregistrer
externe
fin
information
interne
jouer
lecture
menu
messages
modifier
mémoire
nom
nouveau
nouvelle
numéro
opérateur
pause
programmer
précédant
rappeler
rembobiner
retour
répondeur
répéter
station
stop
suite
suivant
transférer
téléphone
verrouiller
écouter

1.10. Phonetically rich sentences

Le Monde:
ce projet nécessite la construction d'une énorme écluse.
ce projet est accompagnée d'un projet immobilier correspondant à 4,500 habitants.
il est actuellement étudié par trois architectes locaux.
les sondages archéologiques sont terminés depuis fin janvier.
une société d'économie mixte sera crée fin janvier.
les premiers coups de pioche devraient être donées à la fin de l'année.
une première tranche sera commercialisée en 1992.
à Montpellier, la dimension du projet interdit toute échance précise.
deux ans après, on a annoncé en grande pompe la création de Port-Mariane.
ce port fluvial se situe au coeur d'un nouveau quartier de 20,000 habitants à l'est de la ville.
ce nouveau quartier est aéré par trois espaces verts.
seule apparait comme définitive aujourd'hui le debut de la tion de la future mairie.
...
...

1.11. Application words embedded in carrier sentences

il renonce à prendre sa retraite à cette date.
vous effacer la prochain chanson sur la cassette.
effacez les erreurs présentes dans ca texte.
cette démarche novatrice a mis fin à ces pratiques dépassés.
le Parc national de Cévennes a apportée une aide technique.
le lundi suivant, le volume des transactions doublait.
le président de la République leur a fait parvenir un message.
...
...

DESCRIPTION OF THE STRUCTURE OF THE CORPUS AS IT APPEARS ON THE CD-ROMS

[There is a file which gives the structure of the corpus as it appears on the CD-ROMs].

1. NUMBER OF SESSIONS

1,302 sessions (calls) were recorded and processed. Of these, 1,000 sessions were put on the CD-ROMs for SpeechDat(M). The remaining 302 sessions have been transferred to Philips Research Aachen separately.

2. DIVISION OF THE SESSIONS

The SpeechDat(M) corpus for the French language contains 1,000 sessions. It is delivered on three (3) CD-ROMs.

The phonetically rich sentences that occur within each session are put on a separate CD-ROM. The other two CD-ROMs contain the remaining items.

3. COMPRESSION AND STRUCTURE

The data on the CD-ROMs are stored as gzipped, A-law compressed waveforms. Each waveform file contains a single item. The files do not contain any header in accordance with the ESPRIT Project SAM standards. Instead, each signal file is accompanied by a label file in which the relevant information is stored.
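
Because the files are headerless, reading one amounts to undoing the gzip compression and expanding each 8-bit A-law code. The Python sketch below does this with the standard A-law expansion of ITU-T G.711; the function names are illustrative, and the result is a plain list of signed integers rather than any particular audio container.

```python
import gzip


def alaw_to_linear(code):
    """Expand one 8-bit A-law code (ITU-T G.711) to a signed linear sample."""
    code ^= 0x55                       # undo the even-bit inversion
    magnitude = (code & 0x0F) << 4     # 4-bit mantissa
    segment = (code & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # In A-law a set sign bit denotes a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data):
    """Decode a bytes object of A-law codes into a list of linear samples."""
    return [alaw_to_linear(byte) for byte in data]


def read_speech_file(path):
    """Read a gzipped, headerless A-law file and return linear samples."""
    with gzip.open(path, "rb") as handle:
        return decode_alaw(handle.read())


print(decode_alaw(bytes([0xD5, 0x55])))
```

At the 8000 Hz sampling rate given in the label files, the number of samples divided by 8000 gives the duration in seconds.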

The complete corpus consists of 3 CD-ROMs. The speakers are stored in arbitrary order. Male and female speakers are not stored in separate directory trees.

4. DIRECTORY STRUCTURE

Blocks of 100 sessions (directories) are created to ensure the manageability of the large number of files. Each session contains all speech files of a single speaker.

The final directory structure of the CD-ROM has four directory levels above the file:
\<database_name>\<volume>\<block>\<session>\<file>

<database_name> is defined as <name><#><language code>, where <name> can be FIXED, MOBIL or VERIF, anticipating SpeechDat main phase requirements for fixed, mobile and speaker verification databases. <#> is 0 for SpeechDat(M) and 1 for the SpeechDat main phase. <language code> is the ISO two-letter code for the recorded language. For this SpeechDat(M) French corpus the database name is FIXED0FR.

<volume> is defined as CD<vv>, where <vv> is a progressive number from 00 to 02, specifying the physical CD-ROM containing the material.

<block> is defined as BLOCK<nn> where <nn> is a progressive number from 00 to 99. One block contains 100 calls.

<session> is defined as SES<nnxx> where <nn> is the same as in block and <xx> is a number between 00 and 99. This is the numeric call sequence identification number, which is also encoded in each filename. As there are no more than 50 utterances per call, the total number of speech files and associated transcription files does not exceed the CD-ROM recommended limit of approximately 100 files in a directory.

<file> is defined as 'A0nnxxcc.FRf', where A0 stands for SpeechDat(M), fixed.
<nnxx> is the same as in session.
<cc> is the item code. It is a unique code for each item:

A1-A9: 9 application words out of a vocabulary of 54 words
C1: 6 digit ID number
C2: 8 digit telephone number
C3: 16 digit credit card number
D1-D3: 3 dates
E1-E3: 3 carrier sentences with application words
I1: isolated digit
L1-L3: 3 spelled words
M1: large money amount
M2: small money amount
N1: natural number
P1: place name
Q1-Q4: 4 yes/no questions
R1: comments
S1-S9: 9 phonetically rich sentences
T1-T3: 3 times
W1-W3: words, the same as the spelled words, but now spoken as normal words

<f> is O for Orthographic, A for A-law and Z for compressed. In the case of this corpus all speech files are compressed using 'gzip', thus having the Z at the end of each file name. When uncompressing the files the Z should be changed into an A.
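
The naming scheme above can be decoded with a short regular expression. The following Python sketch splits a file name into its parts and rebuilds the session directory; the function names and the dictionary keys are illustrative assumptions, and the pattern covers only the SpeechDat(M) fixed-network prefix A0 and the French extension described here.

```python
import re

# 'A0' + block <nn> + call <xx> + item code <cc> + '.FR' + file type <f>
FILE_PATTERN = re.compile(r"^A0(\d{2})(\d{2})([A-Z]\d)\.FR([OAZ])$")


def parse_speech_filename(name):
    """Split a name such as 'A00864A4.FRZ' into its components."""
    match = FILE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a SpeechDat(M) French file name: {name!r}")
    block, call, item, file_type = match.groups()
    return {"block": block, "session": block + call,
            "item": item, "type": file_type}


def session_directory(volume, session, database="FIXED0FR"):
    """Build the directory of a session, e.g. \\FIXED0FR\\CD01\\BLOCK08\\SES0864."""
    return f"\\{database}\\CD{volume:02d}\\BLOCK{session[:2]}\\SES{session}"


def uncompressed_name(name):
    """Return the name to use after gunzipping: the final Z becomes A."""
    if not name.endswith("Z"):
        raise ValueError("expected a compressed file name ending in Z")
    return name[:-1] + "A"


print(parse_speech_filename("A00864A4.FRZ"))
print(session_directory(1, "0864"))
```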

5. LABEL FILE

Each (compressed) speech data file is accompanied by a label file. It contains information about recording and speaker conditions of the speech data file. The label file is in SAM format. An example can be found below:
LHD: SAM, 5.00
DBN: SpeechDat(M)_French
VOL: FIXED0FR_01
SES: 0864
SHT: 0969
REG: 
SEX: F
AGE: 21
DIR: \FIXED0FR\CD01\BLOCK08\SES0864
SRC: A00864A4.FRZ
CCD: A4
REP: SPEX, LEIDSCHENDAM, THE NETHERLANDS
RED: 23/Jun/1995
RET: 21:36:28
ASS: OK
BEG: 0
END: 32511
SAM: 8000
SNB: 1
SBF:
SSB: 8
QNT: A-LAW
CMP: GZIP 1.2.4
LBD:
LBR: 0, 32511, , , , enregistrer
LBO: 0, 16256, 32511, [bip] enregistrer
ELF:

This concerns the 4th application word, read from sheet nr. 0969, of session 0864, to be found on CD01.

The meaning of the mnemonics is defined and explained in deliverable LRE 63314 D 1.4.1 to be obtained from the WWW site http://www.phonetik.uni-muenchen.de/SpeechDat.html
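
A label file of this form can be read as a sequence of "mnemonic: value" lines. The Python sketch below is a minimal reader, assuming every mnemonic is three alphanumeric characters and occurs once per file, as in the example; it is not a full SAM parser. The duration helper additionally assumes that BEG and END are inclusive sample indices.

```python
def parse_label_file(text):
    """Map each three-character mnemonic to its (possibly empty) value."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so values such as 21:36:28 survive.
        key, separator, value = line.partition(":")
        key = key.strip()
        if separator and len(key) == 3 and key.isalnum():
            fields[key] = value.strip()
    return fields


def duration_seconds(fields):
    """Duration of the signal, assuming BEG and END are inclusive indices."""
    samples = int(fields["END"]) - int(fields["BEG"]) + 1
    return samples / int(fields["SAM"])


sample = """LHD: SAM, 5.00
REG:
RET: 21:36:28
BEG: 0
END: 32511
SAM: 8000
"""
label = parse_label_file(sample)
print(label["RET"], duration_seconds(label))
```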

6. SPEAKER.SAM

In \FIXED0FR\TABLE the file SPEAKER.SAM can be found with information about each speaker. Information concerns the session ID corresponding with this speaker, age and sex of the speaker, the mother tongue of the speaker (French or not) and the region where the speaker went to school for the first time. An example:

SES: 0000
SEX: M
AGE: 30
NLN: FRENCH
ACC: Picardie

Note: as key for each speaker, we used SES instead of SCD (speaker code).

7. CONTENTS.LST

In the same directory \FIXED0FR\TABLE there is the file CONTENTS.LST, containing information about each file on the CDs. All the information is also present in the label file accompanying the speech file. The following fields are present, with their mnemonic in the label file between brackets:

CDROM volume name (VOL:)
full pathname (DIR:)
speech file name (SRC:)
speaker code (SCD:)
speaker sex (SEX:)
speaker age (AGE:)
assessment (ASS:)
orthographic transcription of the uttered item (LBO:)

The seventh field, assessment, comes instead of region (REG:), because the region of call is unknown.

An example:
FIXED0FR_00     \FIXED0FR\CD00\BLOCK00\SES0000  A00000C1.FRZ    0000    M       30      NOISE   [bruit/] [bouche] cinq deux trois deux sept cinq [/bruit]
FIXED0FR_00     \FIXED0FR\CD00\BLOCK00\SES0000  A00000Q1.FRZ    0000    M       30      OK      oui [bruit]
FIXED0FR_00     \FIXED0FR\CD00\BLOCK00\SES0000  A00000Q2.FRZ    0000    M       30      OTHER   [bouche] non non
FIXED0FR_00     \FIXED0FR\CD00\BLOCK00\SES0000  A00000Q3.FRZ    0000    M       30      OK      [bouche] non pas vraiment
FIXED0FR_00     \FIXED0FR\CD00\BLOCK00\SES0000  A00000D1.FRZ    0000    M       30      OK      [bouche] le dix-neuf juin: soixante quatre [bruit]
FIXED0FR_00     \FIXED0FR\CD00\BLOCK00\SES0000  A00000P1.FRZ    0000    M       30      OK      [bouche] le Nord [bruit]
FIXED0FR_00     \FIXED0FR\CD00\BLOCK00\SES0000  A00000Q4.FRZ    0000    M       30      OK      oui [bruit]
FIXED0FR_00     \FIXED0FR\CD00\BLOCK00\SES0000  A00000C2.FRZ    0000    M       30      OK      [bouche] trente-deux cinquante-neuf trente-deux cinquante-six
FIXED0FR_02     \FIXED0FR\CD02\BLOCK00\SES0000  A00000S1.FRZ    0000    M       30      OK      [bouche] ils sont pourtant partis en même temps de la même nébuleuse [bruit]
FIXED0FR_02     \FIXED0FR\CD02\BLOCK00\SES0000  A00000S2.FRZ    0000    M       30      OK      [bouche] l'irascible new-yorkais a en effet une nouvelle fois cédé le vingt et un janvier à son tempérament colérique [bouche]
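
The eight fields of a CONTENTS.LST record can be separated as in the following Python sketch. It assumes that each record occupies a single physical line in the file and that the first seven fields contain no whitespace, so that everything after the seventh field is the transcription. The record type and function name are illustrative.

```python
from collections import namedtuple

ContentsRecord = namedtuple(
    "ContentsRecord",
    "volume directory filename speaker sex age assessment transcription",
)


def parse_contents_line(line):
    """Split one CONTENTS.LST record into its eight fields."""
    # Split on runs of whitespace at most seven times, so that the
    # transcription keeps its internal spaces.
    parts = line.split(None, 7)
    if len(parts) < 7:
        raise ValueError("record has fewer than seven fields")
    volume, directory, filename, speaker, sex, age, assessment = parts[:7]
    transcription = parts[7].strip() if len(parts) > 7 else ""
    return ContentsRecord(volume, directory, filename, speaker, sex,
                          int(age), assessment, transcription)


record = parse_contents_line(
    r"FIXED0FR_00  \FIXED0FR\CD00\BLOCK00\SES0000  A00000Q3.FRZ"
    "  0000  M  30  OK  [bouche] non pas vraiment"
)
print(record.assessment, record.transcription)
```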

8. HANDSET.LST


Also in the directory \FIXED0FR\TABLE there is the file HANDSET.LST, which gives, for each session, the type of handset that was used. This is either 'CORDED' or 'CORDLESS'.

9. LEXICON.TBL


The file LEXICON.TBL, also to be found in \FIXED0FR\TABLE contains the phonemic representation (in SAMPA) of each word form on the CDs. An extract:

demande	57	d @ m a~ d
demandent	5	d @ m a~ d
demander	16	d @ m a~ d e
demanderaient	4	d @ m a~ d @ R &/
demanderons	1	d @ m a~ d @ R o~
demanderont	1	d @ m a~ d @ R o~
demandes	22	d @ m a~ d
demandeurs	7	d @ m a~ d 9 R
demandez	2	d @ m a~ d &/
demandiez	1	d @ m a~ d i E/
demandons	10	d @ m a~ d o~
demandé	27	d @ m a~ d &/

The layout is word, frequency, phonemic representation, separated by tabs. The symbols of the phonemic representation are separated by spaces.
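
Given this layout, a lexicon line can be split on tabs and the transcription on spaces. The Python sketch below shows one way to do this; the function names are illustrative, and keeping only the last entry for a repeated word form is an assumption.

```python
def parse_lexicon_line(line):
    """Split 'word<TAB>frequency<TAB>phonemes' into typed parts."""
    word, frequency, phonemes = line.rstrip("\n").split("\t")
    return word, int(frequency), phonemes.split()


def load_lexicon(lines):
    """Build {word: (frequency, [phonemes])} from an iterable of lines."""
    lexicon = {}
    for line in lines:
        if line.strip():
            word, frequency, phonemes = parse_lexicon_line(line)
            lexicon[word] = (frequency, phonemes)
    return lexicon


extract = ["demande\t57\td @ m a~ d\n", "demander\t16\td @ m a~ d e\n"]
print(load_lexicon(extract)["demander"])
```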

10. SUMMARY.TXT

The file SUMMARY.TXT, to be found in \FIXED0FR\DOC, lists all the items on the CD. Whereas CONTENTS.LST gives all items on all CDs, SUMMARY.TXT gives only the items that are on one specific CD. The format of SUMMARY.TXT is: full path name, session number, items, recording date and recording time, all separated by tabs.
The items that are present are printed in one string; if an item is not present, two hyphens ('--') are printed in its place.

An example:
\FIXED0FR\CD00\BLOCK00\SES0000  0000    A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3     26/Apr/1995     11:34:16
\FIXED0FR\CD00\BLOCK00\SES0001  0001    A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3     27/Apr/1995     15:09:44
\FIXED0FR\CD00\BLOCK00\SES0002  0002    A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3     27/Apr/1995     15:23:46
\FIXED0FR\CD00\BLOCK00\SES0003  0003    A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3     28/Apr/1995     14:31:06
\FIXED0FR\CD00\BLOCK00\SES0004  0004    A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3     01/May/1995     22:49:26
\FIXED0FR\CD00\BLOCK00\SES0005  0005    A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3     04/May/1995     12:12:22
...
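
The item string of a SUMMARY.TXT record can be cut into two-character codes, dropping the '--' placeholders. The Python sketch below does this; the function name is illustrative, and the second example string containing a placeholder is invented for demonstration.

```python
def split_item_string(items):
    """Return the item codes present in a SUMMARY.TXT item string."""
    if len(items) % 2:
        raise ValueError("item string must consist of two-character codes")
    codes = [items[i:i + 2] for i in range(0, len(items), 2)]
    # Two hyphens mark an item that is missing from the session.
    return [code for code in codes if code != "--"]


full = ("A1A2A3A4A5A6A7A8A9C1C2C3D1D2D3E1E2E3I1L1L2L3"
        "M1M2N1P1Q1Q2Q3Q4R1T1T2T3W1W2W3")
print(len(split_item_string(full)))

# Invented example with one missing item in second position.
print(split_item_string("A1--A3"))
```

The item string in the records above contains no S1-S9 codes, which is consistent with the phonetically rich sentences being stored on a separate CD-ROM.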

11. MISSING.TXT


Also in the directory \FIXED0FR is the file MISSING.TXT, in which a list of all missing items can be found.

RECORDING PLATFORM: FUNCTIONAL DESIGN

1. INTRODUCTION

The recordings of the French SpeechDat(M) corpus were made on the same recording platform as was used for the Dutch Polyphone corpus. For the French recordings an international toll-free ("green") number was hired. The lines enter the Netherlands digitally and they remain digital. The recordings were made on an OS/2 based PC. The application was developed by means of the "Show-'N-Tel" application generator. The underlying hardware is a combination of a Rhetorex (RDSP16000) voice board and an ACULAB (1TR6 ISDN-30) telephone interface.

Apart from this OS/2 based recording platform, a UNIX computer is used for permanent storage of sampled data. This environment provides 16 independent input lines, from which calls can be recorded and stored.

This report defines the recording process for SpeechDat (M) French. Specific information about the hardware can be found in Appendix A of this document.

2. OVERVIEW OF THE RECORDING PROCESS

Three different phases in the recording process of individual calls may be discerned:

1. Initial setup: During the initial setup, a unique session-ID is generated, the available space on the platform is checked and the directories are created, which must be available for the succeeding phases.

2. Recording: During the recording phase, the prompts are produced and the tokens are recorded.

3. Transfer to permanent storage: During the transfer to permanent storage, a set of files corresponding to exactly one speaker is transferred from PC to a UNIX machine.

These actions are preceded by a one-time preparation phase, in which the recording protocol is defined, the system prompts are recorded, the speech material is defined, subjects are recruited, etc.

One of the features of the platform is that it has an ISDN-30 telephone interface and, thanks to the voice board, it has the capacity to handle 16 recording sessions simultaneously. A recording session starts with an incoming call on one of the 30 time slots of the telephone interface. If fewer than 16 lines are active, the call is answered. If all 16 lines are already active, the call is not answered and the caller gets a busy tone. If the call is answered, the available disk space on the recording platform is checked. If the amount of disk space is under a specified minimum, an appropriate message is played to the caller and the system disconnects the call. For each call that is accepted, a unique, 6-digit session-ID is generated. This is done by a separate program, written in C and compiled, that is called from the "Show-'N-Tel" script. A directory for this particular session is then created.
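
The admission logic described above can be summarised in a few lines. The Python sketch below is only an illustration of the decision order (line capacity first, then disk space); the function names, the return values and the counter-based session-ID are assumptions, since the original was a "Show-'N-Tel" script with a separate C program whose internals are not documented here.

```python
MAX_ACTIVE_LINES = 16  # capacity of the voice board, from the text


def admit_call(active_lines, free_disk_bytes, min_free_bytes):
    """Decide what happens to an incoming call.

    Returns 'busy' (call not answered), 'disk_full' (answered, message
    played, disconnected) or 'accept'.
    """
    if active_lines >= MAX_ACTIVE_LINES:
        return "busy"
    if free_disk_bytes < min_free_bytes:
        return "disk_full"
    return "accept"


def make_session_id(counter):
    """Format a 6-digit session-ID from a running counter (hypothetical)."""
    if not 0 <= counter <= 999999:
        raise ValueError("counter does not fit in six digits")
    return f"{counter:06d}"


print(admit_call(active_lines=3, free_disk_bytes=50_000_000,
                 min_free_bytes=10_000_000))
print(make_session_id(864))
```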

After this initialization, a short introductory message is played, followed by a sequence of prompts for utterances to be recorded. To do this, the program first reads a database to determine the token type for the token to be recorded. This database is actually a text file, containing two lines for each token to be recorded. The first line corresponds to the token-ID and is a two-digit string, while the second line contains one of the terms "word", "sentence" or "longsentence". This information is used to differentiate between the recording of (short) words or (longer) sentences.
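
A token database of this shape, two lines per token, can be read as in the following Python sketch. The function name and the decision to ignore blank lines are assumptions; the three token types are those named above.

```python
VALID_TYPES = {"word", "sentence", "longsentence"}


def parse_token_database(text):
    """Return a list of (token_id, token_type) pairs in file order."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2:
        raise ValueError("expected two lines per token")
    tokens = []
    # Even-numbered lines hold the token-ID, odd-numbered lines the type.
    for token_id, token_type in zip(lines[0::2], lines[1::2]):
        if len(token_id) != 2 or not token_id.isdigit():
            raise ValueError(f"bad token-ID: {token_id!r}")
        if token_type not in VALID_TYPES:
            raise ValueError(f"bad token type: {token_type!r}")
        tokens.append((token_id, token_type))
    return tokens


print(parse_token_database("01\nword\n02\nsentence\n"))
```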

The system messages are all stored in files in a single directory. There are guidance messages, such as introductory phrases, and prompts, intended to elicit a vocal response from the speaker. The prompts are all stored in files, with names corresponding to the token-ID. The system messages are spoken by two different voices. The prompts for utterances are spoken by a female speaker. The system messages that are used as guidance messages are spoken by a male speaker.

The recorded vocal responses of the callers are stored in files with names 'token-id.smp'. After the last utterance has been recorded, a short message is played and the connection is closed. After this hangup, a file is created with the name: 'session-id.ses'. A background job periodically detects the existence of this file and figures out the session-id from the filename. The files that correspond to this session are then transferred for permanent storage. This transfer is done via ethernet using NFS (for OS/2).
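
The hand-over to permanent storage can be pictured as a polling step that looks for marker files. The Python sketch below is a loose illustration, not the original background job: it assumes that the marker file 'session-id.ses' sits next to a directory named after the session-ID, and it copies to a local destination path where the original used an NFS mount.

```python
import shutil
from pathlib import Path


def finished_sessions(root):
    """Return the session-IDs for which a '<session-id>.ses' marker exists."""
    return sorted(marker.stem for marker in Path(root).glob("*.ses"))


def transfer_session(root, session_id, destination):
    """Copy one session directory to storage, then remove its marker."""
    source = Path(root) / session_id
    target = Path(destination) / session_id
    shutil.copytree(source, target)
    # Removing the marker prevents the session from being picked up again.
    (Path(root) / f"{session_id}.ses").unlink()
    return target


def poll_once(root, destination):
    """Transfer every finished session found in one polling pass."""
    return [transfer_session(root, session_id, destination)
            for session_id in finished_sessions(root)]
```

A periodic job would call `poll_once` at a fixed interval; the interval used on the original platform is not stated.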

3. INTERACTION BETWEEN CALLERS AND THE RECORDING PLATFORM

The core of the application lies in the prompting for utterances and the actual recording of those utterances. This section describes several details regarding this aspect of the application.

* Recording is terminated either by a time-out, which occurs if the maximum duration for a vocal response is exceeded, or by silence detection. If the maximum duration is reached, the recording is skipped, so in practice all recordings are terminated by silence detection. For this purpose a silence interval is specified. If a silence of this duration is detected, recording is terminated. If the resulting speech file is not larger than the silence interval, the file is discarded and no speech is taken to have been detected. If the file is larger, an utterance is taken to have been recorded (minimum duration is 0.125 seconds).

* Utterances are divided into two categories: words and sentences. Words have a corresponding silence interval of 1.92 seconds. Sentences have an interval of 2.56 seconds. The maximum duration for a vocal response is 12 seconds for words and 30 seconds for sentences.

* Immediately after a prompt has been played, the recording starts. The silence before the utterance is retained. The maximum duration of this initial silence is determined by the recording silence interval: 1.92 and 2.56 seconds for words and sentences respectively. If the caller does not respond to a prompt within the duration of the silence interval, a message is played and the prompt is repeated. If the caller still does not respond to the prompt, a message is played followed by the next prompt. If three tokens in a row have been skipped, a message is played and the session is aborted.

* It is not possible to detect if a caller speaks too early. Therefore, no specific action is taken.

* Recording an utterance continues until a period of silence has been detected. The duration of this period of silence depends on the type of utterance to be recorded: word or sentence. If a recording exceeds the maximum duration for this particular type of utterance, the recording is ended, a message is played and the prompt is repeated. If the caller responds with a new utterance which exceeds the allotted time, another message is played, followed by the next prompt. The recording though is retained. If three consecutive prompts are answered with utterances exceeding the allowed time frame, a message is played and the session is aborted.

Because the silence detection in the recording platform was upset by the DC offset in the telephone connections and background noise when subjects called from noisy factory floors, a large number of calls were aborted too early.

Therefore, it was decided to adapt the recording protocol to these conditions. The new protocol took effect as of June 6, 1995. Using 390 complete sessions, the average duration of the individual items was computed. As long as silence detection appeared to work properly, the original protocol was followed. However, as soon as end-of-utterance detection had failed, the recording platform switched to the alternative protocol, in which recording continued for the average duration of this item (based on the first 390 calls) plus the 'silence' duration for the type of stimulus (1.92 seconds for words and 2.56 seconds for sentences).

When the new protocol took effect, a new assessment value PLATFORM was added; it was given to those items for which end-of-utterance detection did not work, but that did not contain disturbing background noise.

* If a caller hangs up before the final item has been recorded, the session is aborted.

* After the final utterance has been recorded, a message is played and the connection is terminated.
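
The retry and abort rules listed above can be expressed as a small loop. The Python sketch below is a simplified model, not the original script: the outcome labels, the `respond` callback and the rule that the second attempt alone decides the outcome of a token are assumptions made for illustration. The helper at the end computes the fixed recording time of the alternative protocol from an item's average duration.

```python
SILENCE_INTERVAL = {"word": 1.92, "sentence": 2.56}  # seconds, from the text
MAX_FAILURES_IN_A_ROW = 3


def run_session(token_ids, respond):
    """Simulate one session.

    respond(token_id, attempt) returns 'ok', 'silence', 'babble' or
    'hangup' for the given attempt (1 or 2).
    Returns (status, results) with status 'completed' or 'aborted'.
    """
    results = {}
    silent_run = 0
    babble_run = 0
    for token_id in token_ids:
        outcome = respond(token_id, 1)
        if outcome in ("silence", "babble"):
            # A message is played and the prompt is repeated once.
            outcome = respond(token_id, 2)
        if outcome == "hangup":
            return "aborted", results
        results[token_id] = outcome
        silent_run = silent_run + 1 if outcome == "silence" else 0
        babble_run = babble_run + 1 if outcome == "babble" else 0
        if MAX_FAILURES_IN_A_ROW in (silent_run, babble_run):
            return "aborted", results
    return "completed", results


def fallback_recording_time(average_duration, token_type):
    """Recording time under the alternative protocol, in seconds."""
    return average_duration + SILENCE_INTERVAL[token_type]


status, results = run_session(["01", "02", "03", "04"],
                              lambda token_id, attempt: "silence")
print(status, len(results))
```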

4. EXCEPTION HANDLING IN THE SCRIPT

The script itself is very straightforward. Few things can go wrong. No branches are present, except those corresponding to exceptions. The exceptions in the interaction with the caller mentioned above are identified in the script as follows:

- Silence: A caller does not respond to a prompt.
- Babble: A caller's response takes up more than the allotted time.
- Hangup: A caller disconnects, before all utterances have been recorded.
- Diskfull: The disk of the recording platform is full at the start of a session.

5. SAMPLED DATA FILE FORMAT AND FILE NAMES

Each utterance is stored as a sampled data file. The sampled data files contain 8-bit A-law coded PCM samples, sampled at 8000 Hz, i.e. 64 kbit/s. The sampled data files contain no header and the names are derived from the prompt-ID. Prompts 1 through 50 result in files "1.smp" through "50.smp", and the files are stored in a directory denoted by a session-ID.

APPENDIX A: INSTALLED HARDWARE AND SOFTWARE ON RECORDING PLATFORM

A.1. The ISDN-connection

The recording platform for SpeechDat (M) French is based on an ISDN-30 connection with 1TR6 signalling (i.e. a primary rate German ISDN, 2 Mbit/s connection). The Dutch PSTN infrastructure guarantees that a speech signal with an ISDN connection as its destination remains in an A-law coded digital form after the first major network switch that it encounters. All 30 lines of the connection can be reached by one telephone number.

The Aculab MVIP/PEB E1/G703 PC card provides a means of connection between the telephone network and various kinds of PC based speech and data processing cards. The E1 card may handle up to 30 separate calls at one time. Processors on the E1 card control all of the call signalling (call setup, call acceptance, call clearing etc.) in response to commands from an application program running on the host computer. One information element made available to the application is the number dialled by the caller to reach the card, the DDI number.

Audio signals transmitted over the E1 card must be encoded using CCITT A-law PCM. Connection between the E1 card and the ISDN network termination port is via the Line Interface Unit, which provides the high voltage isolation and EMC protection.

A.2. The Voice Processing Platform


The Rhetorex RDSP/16000 Voice Processing Platform is a sixteen telephone port single slot voice processing board for digital line interfaces. It can be installed in IBM PC/AT or ISA bus compatible computers and must be connected to the digital telephony interface card via the MVIP bus. Also included is a 512 channel non-blocking switch matrix that provides Enhanced Compliant MVIP switching. All of the available channels can be routed to the MVIP bus for transfer to the Rhetorex RDSP card, as well as other MVIP resource cards, such as voice recognition, FAX, or subscriber line cards. The matrix is also capable of acting as a central switch by switching data from one MVIP card to another.

A.3. Configuration

The platform is based on a PC, type Compaq Prolinea 4/33i, 33 MHz, 250 MB hard disk, 16 MB RAM. The operating system installed on the PC is OS/2 version 2.1.

The fully installed PC contains three add-on ISA boards. These three boards are:

1. 1TR6-card, ACULAB: I/O-address: 380
base address dual ported memory: 0xD000:0000 (64KBytes)
IRQ: 5

2. RDSP16000-voicecard, Rhetorex: I/O-address: 390
base address dual ported memory: 0xC800:0000 (4KBytes)
IRQ: 3

3. ETHERNET-card, RACAL-MILGO: I/O-address: 300
base address dual ported memory: 0xCC00:0000 (16KBytes)
IRQ: 4

A.4. Installed software

The following software is installed on the platform:
- OS/2, version 2.1
- Rhetorex drivers for voice board
- ACULAB drivers for 1TR6 card
- Show-'N-Tel application generating software
- Borland C for OS/2
- Perl programming language for OS/2
- TCP-IP with NFS for OS/2